Monitor a Kubernetes cluster with Prometheus Operator for comprehensive observability

Intermediate 45 min Apr 22, 2026
Ubuntu 24.04 Debian 12 AlmaLinux 9 Rocky Linux 9

Set up complete cluster monitoring using Prometheus Operator with automated metrics collection, custom dashboards, and intelligent alerting for production Kubernetes environments.

Prerequisites

  • Kubernetes cluster with kubectl access
  • Helm 3 installed
  • At least 8GB RAM available in cluster
  • Storage class configured for persistent volumes

What this solves

Prometheus Operator simplifies monitoring Kubernetes clusters by automating the deployment and management of Prometheus, Grafana, and Alertmanager. It provides custom resource definitions (CRDs) that make it easy to configure monitoring for your applications and infrastructure without manually managing configuration files.
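For contrast, this is the kind of hand-maintained scrape configuration the operator replaces. With CRDs, an equivalent scrape job is generated automatically and kept in sync as pods move (the target address below is a hypothetical pod IP, purely for illustration):

```yaml
# prometheus.yml fragment maintained by hand - brittle, because pod IPs change
scrape_configs:
  - job_name: nginx
    scrape_interval: 30s
    metrics_path: /metrics
    static_configs:
      - targets: ['10.244.1.12:9113']   # hypothetical pod IP; breaks on reschedule
```

With the operator, the ServiceMonitor and PodMonitor resources created later in this guide express the same intent declaratively, and Prometheus discovers targets through the Kubernetes API instead of static addresses.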

Step-by-step installation

Update system packages and install prerequisites

Start by ensuring your system is up to date and install required tools for Kubernetes management.

# Debian / Ubuntu
sudo apt update && sudo apt upgrade -y
sudo apt install -y curl wget git

# RHEL family (AlmaLinux / Rocky Linux)
sudo dnf update -y
sudo dnf install -y curl wget git

Install Helm package manager

Helm will help us deploy Prometheus Operator and manage its configuration easily.

curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm version

Add Prometheus Community Helm repository

Add the official repository that contains the kube-prometheus-stack chart.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

Create monitoring namespace

Create a dedicated namespace for all monitoring components to keep them organized.

kubectl create namespace monitoring
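If you manage cluster resources declaratively, the same namespace can be expressed as a manifest. A sketch; the Pod Security admission label is only needed on clusters that enforce the restricted profile by default, since node-exporter requires host access:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
  labels:
    # node-exporter uses hostNetwork and hostPath mounts; relax Pod Security
    # admission for this namespace if your cluster enforces "restricted"
    pod-security.kubernetes.io/enforce: privileged
```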

Create Prometheus Operator values file

Configure the installation with custom settings for production use. Save the following as prometheus-values.yaml; the install command in a later step references this filename.

prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: standard
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    resources:
      requests:
        memory: 2Gi
        cpu: 500m
      limits:
        memory: 4Gi
        cpu: 2

grafana:
  adminPassword: SecureGrafanaPass123!
  persistence:
    enabled: true
    storageClassName: standard
    size: 10Gi
  resources:
    requests:
      memory: 256Mi
      cpu: 100m
    limits:
      memory: 512Mi
      cpu: 200m

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: standard
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi
    resources:
      requests:
        memory: 256Mi
        cpu: 100m
      limits:
        memory: 512Mi
        cpu: 200m

kubeStateMetrics:
  enabled: true

nodeExporter:
  enabled: true

prometheus-node-exporter:
  hostRootFsMount:
    enabled: false
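A note on the plaintext adminPassword above: the Grafana subchart can instead read credentials from a pre-created Secret, which keeps the password out of version control. A hedged sketch (the grafana.admin.* keys follow the upstream Grafana chart's values; verify them against your chart version, and create the referenced Secret yourself):

```yaml
grafana:
  admin:
    existingSecret: grafana-admin-credentials   # hypothetical secret name
    userKey: admin-user
    passwordKey: admin-password
```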

Install Prometheus Operator with Helm

Deploy the complete monitoring stack using the configuration file we created.

helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values prometheus-values.yaml \
  --version 65.1.0

Verify Prometheus Operator installation

Check that all monitoring components are running properly.

kubectl get pods -n monitoring
kubectl get svc -n monitoring

Create ServiceMonitor for application monitoring

Configure automatic discovery and scraping of application metrics using ServiceMonitor CRD.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nginx-service-monitor
  namespace: monitoring
  labels:
    app: nginx
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: nginx
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
  namespaceSelector:
    matchNames:
    - default
    - production
Save the manifest as servicemonitor-example.yaml and apply it:

kubectl apply -f servicemonitor-example.yaml
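For the ServiceMonitor to discover anything, a backing Service must exist. Discovery hinges on two things: the Service labels must satisfy spec.selector.matchLabels, and the endpoint port is referenced by its name ("metrics"), not its number. A hypothetical Service for an nginx exporter (the port number is an assumption; adjust to your exporter):

```shell
# write a Service whose labels and named port line up with the ServiceMonitor
cat > nginx-metrics-service.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: nginx-metrics
  namespace: default
  labels:
    app: nginx            # matched by the ServiceMonitor's selector
spec:
  selector:
    app: nginx
  ports:
  - name: metrics         # referenced by endpoints[].port in the ServiceMonitor
    port: 9113            # common nginx-exporter port (adjust for your setup)
    targetPort: 9113
EOF
echo "wrote nginx-metrics-service.yaml"
```

Apply it with kubectl apply -f nginx-metrics-service.yaml. If the target never shows up on the Prometheus targets page, a label or port-name mismatch between this Service and the ServiceMonitor is the usual cause.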

Create PodMonitor for pod-level monitoring

Set up direct pod monitoring for applications that expose metrics on specific ports.

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: app-pod-monitor
  namespace: monitoring
  labels:
    app: myapp
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: myapp
  podMetricsEndpoints:
  - port: metrics
    interval: 30s
    path: /metrics
  namespaceSelector:
    matchNames:
    - default
    - production
Save the manifest as podmonitor-example.yaml and apply it:

kubectl apply -f podmonitor-example.yaml
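The PodMonitor selects pods directly, so the pod template needs matching labels and a named container port. A sketch of a Deployment this PodMonitor would pick up (the image and port number are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: default
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp          # matched by the PodMonitor's selector
    spec:
      containers:
      - name: myapp
        image: registry.example.com/myapp:1.0   # hypothetical image
        ports:
        - name: metrics     # referenced by podMetricsEndpoints[].port
          containerPort: 8080
```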

Configure Grafana access

Set up port forwarding to access Grafana dashboard locally.

kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80

Configure custom Grafana dashboard

Create a custom dashboard for cluster overview with key metrics.

{
  "dashboard": {
    "id": null,
    "title": "Kubernetes Cluster Overview",
    "tags": ["kubernetes", "cluster"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "CPU Usage by Node",
        "type": "graph",
        "targets": [
          {
            "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "{{instance}}"
          }
        ]
      },
      {
        "id": 2,
        "title": "Memory Usage by Node",
        "type": "graph",
        "targets": [
          {
            "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
            "legendFormat": "{{instance}}"
          }
        ]
      },
      {
        "id": 3,
        "title": "Pod Count by Namespace",
        "type": "stat",
        "targets": [
          {
            "expr": "count by (namespace) (kube_pod_info)",
            "legendFormat": "{{namespace}}"
          }
        ]
      }
    ],
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "refresh": "30s"
  }
}
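Rather than pasting JSON into the UI, the chart's Grafana deployment runs a dashboard sidecar that automatically loads any ConfigMap carrying the grafana_dashboard label (assuming the chart's default sidecar settings). A sketch, with a minimal stand-in payload; substitute the full dashboard JSON above:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-overview-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # picked up by the chart's dashboard sidecar
data:
  cluster-overview.json: |
    { "title": "Kubernetes Cluster Overview", "panels": [] }
```

Dashboards provisioned this way survive Grafana pod restarts and can be version-controlled alongside the rest of your manifests.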

Create Alertmanager configuration

Set up email notifications and routing for critical cluster events. Note that the chart manages the secret named below, so a manual change like this will be reverted on the next helm upgrade; for a durable setup, put the same configuration under alertmanager.config in your values file instead.

apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-kube-prometheus-stack-alertmanager
  namespace: monitoring
stringData:
  alertmanager.yml: |
    global:
      smtp_smarthost: 'mail.example.com:587'
      smtp_from: 'alerts@example.com'
      smtp_auth_username: 'alerts@example.com'
      smtp_auth_password: 'your-email-password'
    
    route:
      group_by: ['alertname', 'instance']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1h
      receiver: 'web.hook'
      routes:
      - match:
          severity: critical
        receiver: email-critical
      - match:
          severity: warning
        receiver: email-warning
    
    receivers:
    - name: 'web.hook'
      webhook_configs:
      - url: 'http://127.0.0.1:5001/'
    - name: 'email-critical'
      email_configs:
      - to: 'devops@example.com'
        subject: '[CRITICAL] {{ .GroupLabels.alertname }}'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}
    - name: 'email-warning'
      email_configs:
      - to: 'monitoring@example.com'
        subject: '[WARNING] {{ .GroupLabels.alertname }}'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}
Save the manifest as alertmanager-config.yaml and apply it:

kubectl apply -f alertmanager-config.yaml

Create custom PrometheusRule for alerting

Define specific alerting rules for your cluster monitoring needs.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-cluster-alerts
  namespace: monitoring
  labels:
    app: kube-prometheus-stack
    release: kube-prometheus-stack
spec:
  groups:
  - name: cluster.rules
    rules:
    - alert: HighCPUUsage
      expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High CPU usage on {{ $labels.instance }}"
        description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}"
    
    - alert: HighMemoryUsage
      expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High memory usage on {{ $labels.instance }}"
        description: "Memory usage is above 85% for more than 5 minutes on {{ $labels.instance }}"
    
    - alert: PodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ $labels.pod }} is crash looping"
        description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is restarting frequently"
    
    - alert: NodeNotReady
      expr: kube_node_status_condition{condition="Ready",status="true"} == 0
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "Node {{ $labels.node }} is not ready"
        description: "Node {{ $labels.node }} has been not ready for more than 10 minutes"
Save the manifest as custom-alerts.yaml and apply it:

kubectl apply -f custom-alerts.yaml
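The operator only loads PrometheusRule objects whose labels match the ruleSelector on the Prometheus resource; with the default chart values that effectively means the release label must be present (this is the "Alerts not firing" entry under Common issues below). A quick pre-apply sanity check, sketched here against a trimmed copy of the rule metadata; run the same grep against your real custom-alerts.yaml:

```shell
# trimmed metadata copy for illustration only
cat > /tmp/custom-alerts-check.yaml <<'EOF'
metadata:
  name: custom-cluster-alerts
  labels:
    app: kube-prometheus-stack
    release: kube-prometheus-stack
EOF

if grep -q 'release: kube-prometheus-stack' /tmp/custom-alerts-check.yaml; then
  echo "release label present - rule will be selected"
else
  echo "release label missing - rule will be silently ignored"
fi
```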

Configure persistent storage for metrics

Set up a storage class for long-term metrics retention. Note that the no-provisioner class below does not create volumes dynamically; it is intended for manually created local volumes, so use your cloud provider's dynamic provisioner instead if one is available.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: prometheus-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
Save the manifest as storage-class.yaml and apply it:

kubectl apply -f storage-class.yaml
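Because kubernetes.io/no-provisioner never creates volumes on its own, each PersistentVolume must be created by hand and pinned to a node. A sketch of a matching local volume; the path and node name are placeholders for your environment:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-pv-0
spec:
  capacity:
    storage: 50Gi                      # sized to match the Prometheus claim
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: prometheus-storage
  local:
    path: /mnt/prometheus-data         # placeholder path on the node
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values: ["worker-1"]         # placeholder node name
```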

Configure Grafana dashboards for Kubernetes cluster visualization

Access Grafana interface

Open your browser and navigate to the Grafana interface using the port forward we set up earlier.

Access Grafana at http://localhost:3000

Username: admin
Password: SecureGrafanaPass123! (the adminPassword set in the values file)

Import pre-built Kubernetes dashboards

Import community dashboards for comprehensive cluster visualization. These provide immediate insights into cluster health.

Popular dashboard IDs to import:

  • 315 - Kubernetes cluster monitoring
  • 8588 - 1 Node Exporter for Prometheus Dashboard
  • 7249 - Kubernetes Cluster
  • 6417 - Kubernetes cluster overview

Create custom dashboard for application metrics

Build a specialized dashboard for monitoring your specific applications and services running in the cluster.

{
  "dashboard": {
    "title": "Application Metrics Dashboard",
    "panels": [
      {
        "title": "HTTP Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{service}}"
          }
        ]
      },
      {
        "title": "HTTP Request Duration",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
            "legendFormat": "95th percentile - {{service}}"
          }
        ]
      },
      {
        "title": "Database Connection Pool",
        "type": "stat",
        "targets": [
          {
            "expr": "database_connections_active / database_connections_max * 100",
            "legendFormat": "Pool Usage %"
          }
        ]
      }
    ]
  }
}

Set up Alertmanager rules and notifications

Configure Slack notifications

Set up Slack webhook integration for real-time alerts to your team channels. Because this secret uses its own name rather than the chart-managed one, point the operator at it via alertmanagerSpec.configSecret in your values file; otherwise the configuration will never be loaded.

apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-slack-config
  namespace: monitoring
stringData:
  alertmanager.yml: |
    global:
      slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
    
    route:
      group_by: ['alertname']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1h
      receiver: 'slack-notifications'
      routes:
      - match:
          severity: critical
        receiver: 'slack-critical'
    
    receivers:
    - name: 'slack-notifications'
      slack_configs:
      - channel: '#monitoring'
        title: 'Kubernetes Alert'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
    
    - name: 'slack-critical'
      slack_configs:
      - channel: '#critical-alerts'
        title: 'CRITICAL: Kubernetes Alert'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}\nDescription: {{ .Annotations.description }}{{ end }}'
        color: 'danger'
Save the manifest as alertmanager-slack.yaml and apply it:

kubectl apply -f alertmanager-slack.yaml

Test alerting configuration

Verify that alerts are properly configured and firing when conditions are met.

# Check Alertmanager status
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093

Access the Alertmanager UI at http://localhost:9093 to review active alerts and verify the routing configuration.

Create runbook annotations

Add detailed runbook links and troubleshooting steps to your alerting rules.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: runbook-alerts
  namespace: monitoring
  labels:
    app: kube-prometheus-stack
    release: kube-prometheus-stack
spec:
  groups:
  - name: runbook.rules
    rules:
    - alert: KubernetesPodNotReady
      expr: kube_pod_status_ready{condition="false"} == 1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} not ready in namespace {{ $labels.namespace }}"
        description: "Pod has been in a non-ready state for more than 5 minutes"
        runbook_url: "https://runbooks.example.com/kubernetes/pod-not-ready"
        action: "Check pod logs: kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }}"
    
    - alert: KubernetesNodeDiskPressure
      expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Node {{ $labels.node }} has disk pressure"
        description: "Node is experiencing disk pressure which may affect pod scheduling"
        runbook_url: "https://runbooks.example.com/kubernetes/node-disk-pressure"
        action: "Check disk usage: kubectl describe node {{ $labels.node }}"
Save the manifest as runbook-alerts.yaml and apply it:

kubectl apply -f runbook-alerts.yaml

Verify your setup

# Check all monitoring components
kubectl get pods -n monitoring
kubectl get svc -n monitoring
kubectl get servicemonitors -n monitoring
kubectl get podmonitors -n monitoring
kubectl get prometheusrules -n monitoring

Verify Prometheus targets

kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090

Access http://localhost:9090/targets

Check Grafana dashboards

kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80

Access http://localhost:3000

Verify Alertmanager

kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093

Access http://localhost:9093

Common issues

Symptom                                 | Cause                                  | Fix
Pods stuck in Pending state             | Insufficient cluster resources         | Reduce resource requests or add more nodes
ServiceMonitor not discovering targets  | Label selector mismatch                | Verify labels match between ServiceMonitor and Service
Grafana dashboard shows no data         | Prometheus not scraping metrics        | Check Prometheus targets page for scraping errors
Alerts not firing                       | PrometheusRule labels missing          | Ensure PrometheusRule has correct release label
Persistent volumes not mounting         | StorageClass not available             | Create appropriate StorageClass or use existing one
High memory usage in Prometheus         | Too many time series or long retention | Reduce retention period or add resource limits
