Set up comprehensive Kubernetes monitoring using the Prometheus Operator and Grafana with persistent storage, RBAC, ServiceMonitors, and custom dashboards for complete cluster observability.

Prerequisites

Running Kubernetes cluster with kubectl access
Helm 3 installed
Persistent storage available in cluster
Basic understanding of Kubernetes resources

What this solves

Kubernetes clusters generate massive amounts of metrics from nodes, pods, services, and applications that are essential for maintaining cluster health and performance. Without proper monitoring, you're blind to resource usage, performance bottlenecks, and potential failures. This tutorial shows you how to deploy the Prometheus Operator with Grafana to collect, store, and visualize all Kubernetes metrics using ServiceMonitor and PodMonitor resources for automated discovery and monitoring of your workloads.

Step-by-step installation

Install kubectl and Helm

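If kubectl and Helm are not already on your workstation (see Prerequisites), install kubectl to interact with your Kubernetes cluster and Helm to deploy the monitoring stack. Use the block that matches your distribution.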

# Ubuntu / Debian
sudo apt update
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

# AlmaLinux / Rocky Linux / Fedora
sudo dnf update -y
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
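
The download commands above install the kubectl binary without checking its integrity. As a minimal sketch, the helper below compares a file against a published SHA-256 digest before you install it. The function name verify_sha256 is hypothetical, and the sketch assumes GNU sha256sum is available and that the checksum file contains only the hex digest, which is how the kubectl release publishes kubectl.sha256.

```shell
#!/usr/bin/env bash

# verify_sha256 FILE CHECKSUM_FILE
# Succeeds only if FILE's SHA-256 digest equals the digest stored in CHECKSUM_FILE.
# CHECKSUM_FILE is expected to hold just the hex digest (no file name).
verify_sha256() {
  local file="$1" checksum_file="$2"
  # sha256sum --check reads "<digest>  <file>" lines; --status suppresses output
  # and reports the result through the exit code only.
  echo "$(cat "$checksum_file")  $file" | sha256sum --check --status
}

# Typical use after downloading kubectl (requires network access):
#   KUBECTL_VERSION="$(curl -L -s https://dl.k8s.io/release/stable.txt)"
#   curl -LO "https://dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/amd64/kubectl.sha256"
#   verify_sha256 kubectl kubectl.sha256 && echo "kubectl checksum OK"
```

If the check fails, delete the downloaded binary and fetch it again rather than installing it.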

Create monitoring namespace

Create a dedicated namespace for all monitoring components to keep them organized and apply consistent policies.

kubectl create namespace monitoring

Add Prometheus community Helm repository

Add the official Prometheus community Helm repository that contains the kube-prometheus-stack chart.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

Create Grafana configuration

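Create a values file to configure Grafana with persistent storage, RBAC, and custom settings. These keys belong under the grafana: section of the kube-prometheus-stack chart, so you can either merge them into the stack values file from the next step or pass this file to Helm as an additional --values argument.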

grafana:
  adminPassword: "secure-admin-password-123"
  persistence:
    enabled: true
    size: 10Gi
    storageClassName: "standard"
  serviceAccount:
    create: true
    name: grafana
  rbac:
    create: true
  service:
    type: ClusterIP
    port: 80  # chart default; the port-forward and Ingress steps later target service port 80
  ingress:
    enabled: false
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
      - name: 'default'
        orgId: 1
        folder: ''
        type: file
        disableDeletion: false
        editable: true
        options:
          path: /var/lib/grafana/dashboards/default
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
      - name: Prometheus
        type: prometheus
        url: http://monitoring-kube-prometheus-prometheus:9090
        access: proxy
        isDefault: true
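
The values above keep the Grafana admin password in plain text. One possible alternative, sketched here under assumptions, is to generate a random password, store it in a Kubernetes Secret, and point the Grafana subchart at that Secret. The Secret name grafana-admin, the file names, and the gen_password helper are hypothetical; the admin.existingSecret, admin.userKey, and admin.passwordKey options are taken from the Grafana Helm chart, so confirm they exist in the chart version you install.

```shell
#!/usr/bin/env bash

# Generate a 32-character alphanumeric password from the kernel's random source.
gen_password() {
  head -c 48 /dev/urandom | base64 | tr -dc 'A-Za-z0-9' | cut -c1-32
}

GRAFANA_ADMIN_PASSWORD="$(gen_password)"

# Files created from here on are readable by the current user only.
umask 077

# Secret manifest holding the admin credentials (hypothetical name: grafana-admin).
cat > grafana-admin-secret.yaml <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: grafana-admin
  namespace: monitoring
type: Opaque
stringData:
  admin-user: admin
  admin-password: "${GRAFANA_ADMIN_PASSWORD}"
EOF

# Values fragment that replaces adminPassword with a reference to the Secret.
cat > grafana-admin-values.yaml <<'EOF'
grafana:
  admin:
    existingSecret: grafana-admin
    userKey: admin-user
    passwordKey: admin-password
EOF
```

Apply the Secret with kubectl apply -f grafana-admin-secret.yaml before installing the chart, pass grafana-admin-values.yaml as an extra --values file, and remove the adminPassword line from your other values. Delete grafana-admin-secret.yaml once the Secret exists in the cluster.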

Create Prometheus Operator values

Configure the complete monitoring stack with Prometheus Operator, Alertmanager, and all necessary components. Save this file as prometheus-values.yaml, the name the Helm command in the next step expects.

prometheus:
  prometheusSpec:
    retention: "15d"
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: "standard"
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    resources:
      requests:
        memory: "2Gi"
        cpu: "1000m"
      limits:
        memory: "4Gi"
        cpu: "2000m"
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false
    serviceMonitorSelector: {}
    podMonitorSelector: {}
    ruleSelector: {}

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: "standard"
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi
    resources:
      requests:
        memory: "256Mi"
        cpu: "100m"
      limits:
        memory: "512Mi"
        cpu: "200m"

grafana:
  adminPassword: "secure-admin-password-123"
  persistence:
    enabled: true
    size: 10Gi
    storageClassName: "standard"
  resources:
    requests:
      memory: "256Mi"
      cpu: "100m"
    limits:
      memory: "512Mi"
      cpu: "200m"

nodeExporter:
  enabled: true

kubeStateMetrics:
  enabled: true
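
The values above pair 15 days of retention with a 50Gi volume. As a rough sanity check, the Prometheus documentation estimates disk need as retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample on average. The sketch below applies that formula. The function name and the 10,000 samples per second figure are hypothetical; on a running server you could read the real ingestion rate from rate(prometheus_tsdb_head_samples_appended_total[5m]).

```shell
#!/usr/bin/env bash

# estimate_prometheus_disk_gib RETENTION_DAYS SAMPLES_PER_SECOND BYTES_PER_SAMPLE
# Prints the estimated TSDB disk need in GiB, rounded up to a whole number.
estimate_prometheus_disk_gib() {
  local retention_days="$1" samples_per_second="$2" bytes_per_sample="$3"
  local retention_seconds=$(( retention_days * 86400 ))
  local bytes=$(( retention_seconds * samples_per_second * bytes_per_sample ))
  local gib=$(( 1024 * 1024 * 1024 ))
  # Integer ceiling division: round any partial GiB up.
  echo $(( (bytes + gib - 1) / gib ))
}

# Hypothetical workload: 10,000 samples/s at about 2 bytes per sample, kept for 15 days.
estimate_prometheus_disk_gib 15 10000 2   # prints 25
```

In this example 50Gi leaves about half the volume as headroom for the write-ahead log and compaction. Larger clusters can ingest far more samples, so rerun the estimate with your own rate.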

Deploy the monitoring stack

Install the complete kube-prometheus-stack using Helm with your custom configuration.

helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values prometheus-values.yaml \
  --wait \
  --timeout 600s

Create ServiceMonitor for custom applications

Create a ServiceMonitor resource to automatically discover and monitor custom applications that expose metrics.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
  namespace: monitoring
  labels:
    app: custom-app
spec:
  selector:
    matchLabels:
      app: custom-app
  endpoints:
  - port: metrics
    path: /metrics
    interval: 30s
    scrapeTimeout: 10s
  namespaceSelector:
    matchNames:
    - default
    - production

kubectl apply -f app-servicemonitor.yaml
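
A ServiceMonitor only produces scrape targets when a Service matches both its label selector and its named port. As an illustration, the snippet below writes a Service manifest that the app-metrics ServiceMonitor would select. The Service name, the pod label, the file name, and port 8080 are assumptions for this sketch; adjust them to your application.

```shell
#!/usr/bin/env bash

# Write a Service manifest that the app-metrics ServiceMonitor can discover.
cat > custom-app-service.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: custom-app
  namespace: default       # one of the namespaces in the ServiceMonitor's namespaceSelector
  labels:
    app: custom-app        # matched by spec.selector.matchLabels in the ServiceMonitor
spec:
  selector:
    app: custom-app        # assumes the application's pods carry this label
  ports:
  - name: metrics          # must equal the ServiceMonitor endpoint "port"
    port: 8080             # hypothetical metrics port
    targetPort: 8080
EOF
```

Apply it with kubectl apply -f custom-app-service.yaml. Note that the ServiceMonitor matches the labels on the Service object itself, not the labels on the pods behind it.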

Create PodMonitor for pod-level monitoring

Create a PodMonitor to directly monitor pods that expose metrics without requiring a service.

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: pod-metrics
  namespace: monitoring
  labels:
    app: pod-monitor
spec:
  selector:
    matchLabels:
      metrics: "enabled"
  podMetricsEndpoints:
  - port: metrics
    path: /metrics
    interval: 30s
    scrapeTimeout: 10s
  namespaceSelector:
    matchNames:
    - default
    - production

kubectl apply -f pod-monitor.yaml
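
The PodMonitor above selects pods labeled metrics: "enabled" and scrapes the container port named metrics. The following sketch writes a Deployment whose pod template satisfies both conditions. The workload name, image, file name, and port number are placeholders, not values from this tutorial.

```shell
#!/usr/bin/env bash

# Write a Deployment whose pods the pod-metrics PodMonitor would select.
cat > metrics-enabled-deployment.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-worker            # hypothetical workload name
  namespace: default             # one of the namespaces in the PodMonitor's namespaceSelector
spec:
  replicas: 1
  selector:
    matchLabels:
      app: custom-worker
  template:
    metadata:
      labels:
        app: custom-worker
        metrics: "enabled"       # matched by the PodMonitor's spec.selector.matchLabels
    spec:
      containers:
      - name: worker
        image: registry.example.com/custom-worker:1.0   # placeholder image
        ports:
        - name: metrics          # must equal the podMetricsEndpoints "port"
          containerPort: 9102    # hypothetical metrics port
EOF
```

Apply it with kubectl apply -f metrics-enabled-deployment.yaml once the image reference points at a real application that serves /metrics.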

Set up port forwarding for Grafana

Create port forwarding to access Grafana dashboard from your local machine.

kubectl port-forward -n monitoring svc/monitoring-grafana 3000:80 &

Set up port forwarding for Prometheus

Create port forwarding to access Prometheus web UI for query testing and configuration verification.

kubectl port-forward -n monitoring svc/monitoring-kube-prometheus-prometheus 9090:9090 &

Create custom Kubernetes dashboard

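Import a small cluster-overview dashboard into Grafana with panels for cluster CPU usage, node memory usage, and pod count per namespace. The JSON is wrapped in a top-level "dashboard" key, which is the payload shape Grafana's dashboard HTTP API expects.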

{
  "dashboard": {
    "id": null,
    "title": "Kubernetes Cluster Overview",
    "tags": ["kubernetes", "cluster"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Cluster CPU Usage",
        "type": "stat",
        "targets": [
          {
            "expr": "(1 - avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m]))) * 100",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 70},
                {"color": "red", "value": 90}
              ]
            }
          }
        },
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}
      },
      {
        "id": 2,
        "title": "Memory Usage",
        "type": "stat",
        "targets": [
          {
            "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 80},
                {"color": "red", "value": 95}
              ]
            }
          }
        },
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}
      },
      {
        "id": 3,
        "title": "Pod Count by Namespace",
        "type": "table",
        "targets": [
          {
            "expr": "count(kube_pod_info) by (namespace)",
            "refId": "A",
            "format": "table"
          }
        ],
        "gridPos": {"h": 8, "w": 24, "x": 0, "y": 8}
      }
    ],
    "time": {
      "from": "now-6h",
      "to": "now"
    },
    "refresh": "30s"
  }
}
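
One way to load this JSON is Grafana's dashboard HTTP API, which accepts a POST to /api/dashboards/db with the payload shown above. The helper below is a sketch under assumptions: the function name import_dashboard and the GRAFANA_USER and GRAFANA_PASSWORD variables are hypothetical, the Grafana port-forward on localhost:3000 is assumed to be running, and basic authentication with the admin account is assumed to be enabled.

```shell
#!/usr/bin/env bash

# import_dashboard FILE [GRAFANA_URL]
# POSTs a dashboard JSON file to Grafana's dashboard API.
# Credentials are read from GRAFANA_USER (default: admin) and GRAFANA_PASSWORD.
import_dashboard() {
  local file="$1"
  local url="${2:-http://localhost:3000}"
  curl --fail --silent --show-error \
    --user "${GRAFANA_USER:-admin}:${GRAFANA_PASSWORD:?set GRAFANA_PASSWORD first}" \
    --header "Content-Type: application/json" \
    --request POST \
    --data-binary "@${file}" \
    "${url}/api/dashboards/db"
}
```

For example, after saving the JSON as cluster-overview.json (a file name chosen for this example), run GRAFANA_PASSWORD='your-password' import_dashboard cluster-overview.json. The --fail flag makes curl return a non-zero exit code when Grafana rejects the request.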

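Configure Prometheus alerting rules

Create custom alerting rules for important Kubernetes metrics and resource thresholds. Prometheus evaluates these PrometheusRule resources and forwards firing alerts to Alertmanager.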

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-alerts
  namespace: monitoring
  labels:
    app: kube-prometheus-stack
spec:
  groups:
  - name: kubernetes.rules
    rules:
    - alert: KubernetesPodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total[1h]) * 60 * 5 > 0
      for: 15m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
        description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting about {{ $value }} times per 5 minutes (averaged over the last hour)"
    
    - alert: KubernetesNodeNotReady
      expr: kube_node_status_condition{condition="Ready",status="true"} == 0
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "Kubernetes Node {{ $labels.node }} is not ready"
        description: "Node {{ $labels.node }} has been not ready for more than 10 minutes"
    
    - alert: KubernetesPodMemoryUsage
      expr: (container_memory_working_set_bytes{container!=""} / (container_spec_memory_limit_bytes{container!=""} > 0)) * 100 > 90
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} high memory usage"
        description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} memory usage is above 90%"

kubectl apply -f k8s-alerts.yaml
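
The crash-loop rule scales a per-second restart rate, averaged over one hour, by 60 and 5 to express it as restarts per 5 minutes. The short calculation below works through that arithmetic for a given number of restarts in the last hour. It is an approximation for illustration, since rate() also extrapolates at the window edges, and the function name is hypothetical.

```shell
#!/usr/bin/env bash

# restarts_per_5m RESTARTS_IN_LAST_HOUR
# rate(...[1h]) is restarts per second averaged over 3600 s;
# multiplying by 60 * 5 rescales it to restarts per 5 minutes.
restarts_per_5m() {
  awk -v restarts="$1" 'BEGIN { printf "%.2f\n", restarts / 3600 * 60 * 5 }'
}

restarts_per_5m 3    # prints 0.25
restarts_per_5m 12   # prints 1.00
```

Because the rule compares against 0, even a single restart keeps the expression true while it remains inside the 1h window, so the alert can fire after the 15m hold period. Raise the threshold if occasional restarts are acceptable in your cluster.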

Configure RBAC for monitoring

Create monitoring service account

Create a dedicated service account with proper RBAC permissions for the monitoring stack.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus-monitoring
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-monitoring
rules:
- apiGroups: [""]
  resources: ["nodes", "nodes/metrics", "services", "endpoints", "pods"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["extensions"]
  resources: ["ingresses"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["networking.k8s.io"]
  resources: ["ingresses"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-monitoring
subjects:
- kind: ServiceAccount
  name: prometheus-monitoring
  namespace: monitoring

kubectl apply -f monitoring-rbac.yaml
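
After applying the RBAC manifest, you can confirm that the binding took effect by impersonating the service account with kubectl auth can-i. The helper below is a sketch: the function name check_sa_access is hypothetical, and impersonation via --as requires that your own user is allowed to impersonate service accounts, which cluster administrators normally are.

```shell
#!/usr/bin/env bash

# check_sa_access NAMESPACE SERVICE_ACCOUNT
# Prints whether the service account may list each core resource cluster-wide.
check_sa_access() {
  local namespace="$1" account="$2" resource
  local subject="system:serviceaccount:${namespace}:${account}"
  for resource in nodes services endpoints pods; do
    printf '%s list %s: ' "$subject" "$resource"
    # Prints "yes" or "no"; "|| true" keeps the loop going on a "no" answer.
    kubectl auth can-i list "$resource" --as="$subject" --all-namespaces || true
  done
}
```

Run check_sa_access monitoring prometheus-monitoring and expect yes on every line. A no usually means the ClusterRoleBinding subject does not match the service account's name or namespace.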

Create ingress for external access

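Set up Ingress resources to reach Grafana and Prometheus from outside the cluster over TLS. The manifests assume that an NGINX ingress controller and a cert-manager ClusterIssuer named letsencrypt-prod already exist in the cluster; replace the example.com hostnames with your own.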

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana-ingress
  namespace: monitoring
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx  # replaces the deprecated kubernetes.io/ingress.class annotation
  tls:
  - hosts:
    - grafana.example.com
    secretName: grafana-tls
  rules:
  - host: grafana.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: monitoring-grafana
            port:
              number: 80
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus-ingress
  namespace: monitoring
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx  # replaces the deprecated kubernetes.io/ingress.class annotation
  tls:
  - hosts:
    - prometheus.example.com
    secretName: prometheus-tls
  rules:
  - host: prometheus.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: monitoring-kube-prometheus-prometheus
            port:
              number: 9090

kubectl apply -f monitoring-ingress.yaml

Verify your setup

Check that all monitoring components are running and collecting metrics properly.

kubectl get pods -n monitoring
kubectl get servicemonitors -n monitoring
kubectl get podmonitors -n monitoring
kubectl get prometheusrules -n monitoring

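Verify Prometheus targets are being discovered. kubectl port-forward blocks the terminal, so run it in the background, and skip that line if the forward from the earlier step is still active:

kubectl port-forward -n monitoring svc/monitoring-kube-prometheus-prometheus 9090:9090 &
sleep 2  # give the tunnel a moment to open
curl http://localhost:9090/api/v1/targets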
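
The targets endpoint returns a large JSON document that is hard to read directly. As a rough sketch, the filter below counts targets by health state using only standard text tools. It assumes the compact JSON that the Prometheus API emits, with no spaces around colons, and the function name is hypothetical; a JSON-aware tool such as jq is more robust if you have it installed.

```shell
#!/usr/bin/env bash

# summarize_target_health
# Reads the /api/v1/targets JSON on stdin and prints one "state=count" line
# for each health state found (for example up, down, unknown).
summarize_target_health() {
  grep -o '"health":"[a-z]*"' \
    | cut -d'"' -f4 \
    | sort \
    | uniq -c \
    | awk '{ print $2 "=" $1 }'
}

# Usage with the port-forward running:
#   curl -s http://localhost:9090/api/v1/targets | summarize_target_health
```

Any count under down points at a target that Prometheus discovered but could not scrape; the Targets page in the Prometheus web UI shows the error for each one.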

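Test Grafana access and verify that it reports healthy. As above, run the port-forward in the background, or skip it if the earlier forward is still active:

kubectl port-forward -n monitoring svc/monitoring-grafana 3000:80 &
sleep 2  # give the tunnel a moment to open
curl -u admin:secure-admin-password-123 http://localhost:3000/api/health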

Check that metrics are being collected:

kubectl exec -n monitoring prometheus-monitoring-kube-prometheus-prometheus-0 -c prometheus -- \
  promtool query instant http://localhost:9090 'up'
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus -c prometheus

Common issues

Symptom	Cause	Fix
Prometheus pods stuck in Pending	Insufficient storage or resources	Check PVC status with `kubectl get pvc -n monitoring` and verify that the storage class exists
Grafana shows no data source	Prometheus service URL incorrect	Verify the service name with `kubectl get svc -n monitoring` and update the datasource URL
ServiceMonitor not discovering targets	Label selectors don't match	Check that the Service labels match the ServiceMonitor selector with `kubectl describe servicemonitor`
High memory usage in Prometheus	Too many metrics or long retention	Reduce the retention period, drop metrics you do not need, or raise the memory limit in prometheus-values.yaml
Alertmanager not sending alerts	Missing or incorrect configuration	Check the Alertmanager config with `kubectl get secret -n monitoring alertmanager-monitoring-kube-prometheus-alertmanager -o yaml`
Node exporter metrics missing	DaemonSet not deployed on all nodes	Check the node selector and tolerations with `kubectl get daemonset -n monitoring`

Next steps

Automated install script

Run this to automate the entire setup

install.sh

#!/usr/bin/env bash
set -euo pipefail

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color

# Default values
NAMESPACE="monitoring"
GRAFANA_PASSWORD="${GRAFANA_PASSWORD:-secure-admin-password-123}"
STORAGE_SIZE_PROMETHEUS="${PROMETHEUS_STORAGE:-50Gi}"
STORAGE_SIZE_GRAFANA="${GRAFANA_STORAGE:-10Gi}"
STORAGE_CLASS="${STORAGE_CLASS:-standard}"

# Cleanup function
cleanup() {
    echo -e "${RED}[ERROR] Installation failed. Cleaning up...${NC}"
    helm uninstall kube-prometheus-stack -n "$NAMESPACE" 2>/dev/null || true
    kubectl delete namespace "$NAMESPACE" 2>/dev/null || true
    rm -f /tmp/prometheus-values.yaml /tmp/kubectl 2>/dev/null || true
}

trap cleanup ERR

usage() {
    echo "Usage: $0 [OPTIONS]"
    echo "Options:"
    echo "  -n NAMESPACE          Kubernetes namespace (default: monitoring)"
    echo "  -p PASSWORD           Grafana admin password (default: secure-admin-password-123)"
    echo "  -s STORAGE_CLASS      Storage class name (default: standard)"
    echo "  --prometheus-storage  Prometheus storage size (default: 50Gi)"
    echo "  --grafana-storage     Grafana storage size (default: 10Gi)"
    echo "  -h                    Show this help"
    exit 1
}

# Parse arguments
while [[ $# -gt 0 ]]; do
    case $1 in
        -n) NAMESPACE="$2"; shift 2 ;;
        -p) GRAFANA_PASSWORD="$2"; shift 2 ;;
        -s) STORAGE_CLASS="$2"; shift 2 ;;
        --prometheus-storage) STORAGE_SIZE_PROMETHEUS="$2"; shift 2 ;;
        --grafana-storage) STORAGE_SIZE_GRAFANA="$2"; shift 2 ;;
        -h) usage ;;
        *) echo -e "${RED}Unknown option: $1${NC}"; usage ;;
    esac
done

echo -e "${BLUE}Kubernetes Prometheus/Grafana Monitoring Setup${NC}"
echo "=================================================="

# Detect OS distribution
echo -e "${YELLOW}[1/8] Detecting operating system...${NC}"
if [ -f /etc/os-release ]; then
    . /etc/os-release
    case "$ID" in
        ubuntu|debian) 
            PKG_MGR="apt"
            PKG_UPDATE="apt update"
            PKG_INSTALL="apt install -y"
            ;;
        almalinux|rocky|centos|rhel|ol|fedora) 
            PKG_MGR="dnf"
            PKG_UPDATE="dnf update -y"
            PKG_INSTALL="dnf install -y"
            ;;
        amzn) 
            PKG_MGR="yum"
            PKG_UPDATE="yum update -y"
            PKG_INSTALL="yum install -y"
            ;;
        *) 
            echo -e "${RED}Unsupported distribution: $ID${NC}"
            exit 1
            ;;
    esac
else
    echo -e "${RED}/etc/os-release not found. Cannot detect distribution.${NC}"
    exit 1
fi

echo -e "${GREEN}Detected: $PRETTY_NAME${NC}"

# Check if running as root or with sudo
echo -e "${YELLOW}[2/8] Checking privileges...${NC}"
if [[ $EUID -ne 0 && -z "${SUDO_USER:-}" ]]; then
    echo -e "${RED}This script must be run as root or with sudo${NC}"
    exit 1
fi

# Update package repositories
echo -e "${YELLOW}[3/8] Updating package repositories...${NC}"
$PKG_UPDATE

# Install kubectl
echo -e "${YELLOW}[4/8] Installing kubectl...${NC}"
if ! command -v kubectl &> /dev/null; then
    $PKG_INSTALL curl
    KUBECTL_VERSION=$(curl -L -s https://dl.k8s.io/release/stable.txt)
    curl -LO "https://dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/amd64/kubectl"
    install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
    rm -f kubectl
    echo -e "${GREEN}kubectl installed successfully${NC}"
else
    echo -e "${GREEN}kubectl already installed${NC}"
fi

# Install Helm
echo -e "${YELLOW}[5/8] Installing Helm...${NC}"
if ! command -v helm &> /dev/null; then
    curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
    echo -e "${GREEN}Helm installed successfully${NC}"
else
    echo -e "${GREEN}Helm already installed${NC}"
fi

# Verify kubectl connectivity
echo -e "${YELLOW}[6/8] Verifying Kubernetes connectivity...${NC}"
if ! kubectl cluster-info &> /dev/null; then
    echo -e "${RED}Cannot connect to Kubernetes cluster. Please ensure kubectl is configured.${NC}"
    exit 1
fi
echo -e "${GREEN}Kubernetes cluster accessible${NC}"

# Create monitoring namespace
echo -e "${YELLOW}[7/8] Setting up monitoring stack...${NC}"
kubectl create namespace "$NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -

# Add Prometheus community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Create Prometheus values file
cat > /tmp/prometheus-values.yaml << EOF
prometheus:
  prometheusSpec:
    retention: "15d"
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: "$STORAGE_CLASS"
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: $STORAGE_SIZE_PROMETHEUS
    resources:
      requests:
        memory: "2Gi"
        cpu: "1000m"
      limits:
        memory: "4Gi"
        cpu: "2000m"
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false
    serviceMonitorSelector: {}
    podMonitorSelector: {}
    ruleSelector: {}

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: "$STORAGE_CLASS"
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi
    resources:
      requests:
        memory: "256Mi"
        cpu: "100m"
      limits:
        memory: "512Mi"
        cpu: "200m"

grafana:
  adminPassword: "$GRAFANA_PASSWORD"
  persistence:
    enabled: true
    size: $STORAGE_SIZE_GRAFANA
    storageClassName: "$STORAGE_CLASS"
  resources:
    requests:
      memory: "256Mi"
      cpu: "100m"
    limits:
      memory: "512Mi"
      cpu: "200m"
  serviceAccount:
    create: true
    name: grafana
  rbac:
    create: true
  service:
    type: ClusterIP
    port: 80  # chart default; matches the 3000:80 port-forward printed at the end of this script
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
      - name: Prometheus
        type: prometheus
        url: http://kube-prometheus-stack-prometheus:9090
        access: proxy
        isDefault: true
EOF

# Install kube-prometheus-stack
helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
    --namespace "$NAMESPACE" \
    --values /tmp/prometheus-values.yaml \
    --wait \
    --timeout=10m

# Cleanup temporary files
rm -f /tmp/prometheus-values.yaml

echo -e "${YELLOW}[8/8] Verifying installation...${NC}"

# Wait for pods to be ready
kubectl wait --for=condition=ready pod -l "app.kubernetes.io/name=prometheus" -n "$NAMESPACE" --timeout=300s
kubectl wait --for=condition=ready pod -l "app.kubernetes.io/name=grafana" -n "$NAMESPACE" --timeout=300s

# Verify services are running
if kubectl get pods -n "$NAMESPACE" | grep -E "(prometheus|grafana|alertmanager)" | grep -v Running &> /dev/null; then
    echo -e "${RED}Some pods are not running properly${NC}"
    kubectl get pods -n "$NAMESPACE"
    exit 1
fi

echo -e "${GREEN}✓ Installation completed successfully!${NC}"
echo ""
echo -e "${BLUE}Access Information:${NC}"
echo "==================="
echo -e "Namespace: ${GREEN}$NAMESPACE${NC}"
echo -e "Grafana URL: ${GREEN}kubectl port-forward -n $NAMESPACE svc/kube-prometheus-stack-grafana 3000:80${NC}"
echo -e "Grafana Login: ${GREEN}admin / $GRAFANA_PASSWORD${NC}"
echo -e "Prometheus URL: ${GREEN}kubectl port-forward -n $NAMESPACE svc/kube-prometheus-stack-prometheus 9090:9090${NC}"
echo -e "AlertManager URL: ${GREEN}kubectl port-forward -n $NAMESPACE svc/kube-prometheus-stack-alertmanager 9093:9093${NC}"
echo ""
echo -e "${YELLOW}To access the services, run the port-forward commands above in separate terminals.${NC}"

Review the script before running. Execute with: bash install.sh
