Set up comprehensive Kubernetes monitoring using the Prometheus Operator and Grafana with persistent storage, RBAC, ServiceMonitors, and custom dashboards for complete cluster observability.
Prerequisites
- Running Kubernetes cluster with kubectl access
- Helm 3 installed
- Persistent storage available in cluster
- Basic understanding of Kubernetes resources
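Before starting, you can confirm the prerequisites from a terminal. This is a quick sketch; it assumes kubectl and Helm are on your PATH and warns rather than failing when the cluster is unreachable:

```shell
# Report whether each required tool is on PATH.
check_cmd() {
    command -v "$1" >/dev/null 2>&1 && echo "ok: $1" || echo "missing: $1"
}
check_cmd kubectl
check_cmd helm
# These two need a reachable cluster; they warn instead of failing hard.
kubectl cluster-info >/dev/null 2>&1 || echo "warning: cluster not reachable"
kubectl get storageclass 2>/dev/null || echo "warning: no storage classes listed"
```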
What this solves
Kubernetes clusters generate a large volume of metrics from nodes, pods, services, and applications, and those metrics are essential for maintaining cluster health and performance. Without proper monitoring you are blind to resource usage, performance bottlenecks, and impending failures. This tutorial shows how to deploy the Prometheus Operator with Grafana to collect, store, and visualize Kubernetes metrics, using ServiceMonitor and PodMonitor resources for automated discovery and monitoring of your workloads.
Step-by-step installation
Install kubectl and Helm
Install kubectl to interact with your Kubernetes cluster and Helm to deploy the monitoring stack.
sudo apt update && sudo apt install -y curl
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
Create monitoring namespace
Create a dedicated namespace for all monitoring components to keep them organized and apply consistent policies.
kubectl create namespace monitoring
Add Prometheus community Helm repository
Add the official Prometheus community Helm repository that contains the kube-prometheus-stack chart.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
Create Grafana configuration
Configure Grafana with persistent storage, RBAC, and a provisioned Prometheus datasource. These settings make up the grafana section of the combined prometheus-values.yaml file used in the deploy step below.
grafana:
  adminPassword: "secure-admin-password-123"
  persistence:
    enabled: true
    size: 10Gi
    storageClassName: "standard"
  serviceAccount:
    create: true
    name: grafana
  rbac:
    create: true
  service:
    type: ClusterIP
    port: 3000
  ingress:
    enabled: false
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
      - name: 'default'
        orgId: 1
        folder: ''
        type: file
        disableDeletion: false
        editable: true
        options:
          path: /var/lib/grafana/dashboards/default
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
      - name: Prometheus
        type: prometheus
        url: http://monitoring-kube-prometheus-prometheus:9090
        access: proxy
        isDefault: true
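Storing adminPassword in a plain values file leaves it readable in Helm release history. The Grafana subchart can instead reference a pre-created Secret; a sketch of that variant, where the secret name and key names are illustrative and the Secret must already exist in the monitoring namespace (check your chart version's values reference for the exact keys):

```yaml
grafana:
  admin:
    existingSecret: grafana-admin-credentials  # illustrative name; create it first
    userKey: admin-user                        # key holding the admin username
    passwordKey: admin-password                # key holding the admin password
```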
Create Prometheus Operator values
Configure the complete monitoring stack (Prometheus Operator, AlertManager, Grafana, and supporting exporters). Save the following as prometheus-values.yaml:
prometheus:
  prometheusSpec:
    retention: "15d"
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: "standard"
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    resources:
      requests:
        memory: "2Gi"
        cpu: "1000m"
      limits:
        memory: "4Gi"
        cpu: "2000m"
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false
    serviceMonitorSelector: {}
    podMonitorSelector: {}
    ruleSelector: {}
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: "standard"
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi
    resources:
      requests:
        memory: "256Mi"
        cpu: "100m"
      limits:
        memory: "512Mi"
        cpu: "200m"
grafana:
  adminPassword: "secure-admin-password-123"
  persistence:
    enabled: true
    size: 10Gi
    storageClassName: "standard"
  resources:
    requests:
      memory: "256Mi"
      cpu: "100m"
    limits:
      memory: "512Mi"
      cpu: "200m"
  serviceAccount:
    create: true
    name: grafana
  rbac:
    create: true
  service:
    type: ClusterIP
    port: 3000
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
      - name: Prometheus
        type: prometheus
        url: http://monitoring-kube-prometheus-prometheus:9090
        access: proxy
        isDefault: true
nodeExporter:
  enabled: true
kubeStateMetrics:
  enabled: true
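Before installing, you can render the chart locally to catch indentation or values mistakes without touching the cluster. A small helper sketch, assuming the repo from the earlier step has been added:

```shell
# Render all manifests to stdout; any values error surfaces here
# instead of mid-install. Pass the values file as the first argument.
render_check() {
    helm template monitoring prometheus-community/kube-prometheus-stack \
        --namespace monitoring --values "$1" >/dev/null \
        && echo "values render cleanly" \
        || echo "render failed; fix $1 before installing"
}
# render_check prometheus-values.yaml
```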
Deploy the monitoring stack
Install the complete kube-prometheus-stack using Helm with your custom configuration.
helm install monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--values prometheus-values.yaml \
--wait \
--timeout 600s
Create ServiceMonitor for custom applications
Create a ServiceMonitor resource to automatically discover and monitor custom applications that expose metrics. Save the following as app-servicemonitor.yaml:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
  namespace: monitoring
  labels:
    app: custom-app
spec:
  selector:
    matchLabels:
      app: custom-app
  endpoints:
  - port: metrics
    path: /metrics
    interval: 30s
    scrapeTimeout: 10s
  namespaceSelector:
    matchNames:
    - default
    - production
kubectl apply -f app-servicemonitor.yaml
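A ServiceMonitor only discovers Services whose labels match its selector and whose port name matches the endpoint entry. A matching Service for the monitor above might look like this, where the application name and port numbers are illustrative:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: custom-app
  namespace: default
  labels:
    app: custom-app        # must match spec.selector.matchLabels in the ServiceMonitor
spec:
  selector:
    app: custom-app
  ports:
  - name: metrics          # the name (not the number) must match endpoints[].port
    port: 8080
    targetPort: 8080
```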
Create PodMonitor for pod-level monitoring
Create a PodMonitor to directly monitor pods that expose metrics without requiring a service. Save the following as pod-monitor.yaml:
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: pod-metrics
  namespace: monitoring
  labels:
    app: pod-monitor
spec:
  selector:
    matchLabels:
      metrics: "enabled"
  podMetricsEndpoints:
  - port: metrics
    path: /metrics
    interval: 30s
    scrapeTimeout: 10s
  namespaceSelector:
    matchNames:
    - default
    - production
kubectl apply -f pod-monitor.yaml
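For the PodMonitor to pick a pod up, the pod needs the metrics: "enabled" label and a container port named metrics. An illustrative fragment of a Deployment's pod template (image and port number are placeholders):

```yaml
# Fragment of a Deployment spec, not a complete manifest
template:
  metadata:
    labels:
      metrics: "enabled"       # matches the PodMonitor selector
  spec:
    containers:
    - name: app
      image: example/app:1.0   # placeholder image
      ports:
      - name: metrics          # matches podMetricsEndpoints[].port
        containerPort: 9102
```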
Set up port forwarding for Grafana
Create port forwarding to access Grafana dashboard from your local machine.
kubectl port-forward -n monitoring svc/monitoring-grafana 3000:80 &
Set up port forwarding for Prometheus
Create port forwarding to access Prometheus web UI for query testing and configuration verification.
kubectl port-forward -n monitoring svc/monitoring-kube-prometheus-prometheus 9090:9090 &
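With the port-forward active, the targets API shows what Prometheus is scraping. A compact health summary, assuming jq is installed; it is shown here against a canned one-target sample so the expected output shape is clear:

```shell
# Against a live server, pipe the API response instead:
#   curl -s http://localhost:9090/api/v1/targets | jq -r '.data.activeTargets[] | "\(.labels.job): \(.health)"'
sample='{"data":{"activeTargets":[{"labels":{"job":"node-exporter"},"health":"up"}]}}'
if command -v jq >/dev/null 2>&1; then
    echo "$sample" | jq -r '.data.activeTargets[] | "\(.labels.job): \(.health)"'
else
    echo "jq not found"
fi
```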
Create custom Kubernetes dashboard
Import a comprehensive Kubernetes monitoring dashboard into Grafana with cluster overview metrics. Save the JSON below to a file, or paste it into Dashboards > Import in the Grafana UI.
{
  "dashboard": {
    "id": null,
    "title": "Kubernetes Cluster Overview",
    "tags": ["kubernetes", "cluster"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Cluster CPU Usage",
        "type": "stat",
        "targets": [
          {
            "expr": "(1 - avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m]))) * 100",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 70},
                {"color": "red", "value": 90}
              ]
            }
          }
        },
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}
      },
      {
        "id": 2,
        "title": "Memory Usage",
        "type": "stat",
        "targets": [
          {
            "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 80},
                {"color": "red", "value": 95}
              ]
            }
          }
        },
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}
      },
      {
        "id": 3,
        "title": "Pod Count by Namespace",
        "type": "table",
        "targets": [
          {
            "expr": "count(kube_pod_info) by (namespace)",
            "refId": "A",
            "format": "table"
          }
        ],
        "gridPos": {"h": 8, "w": 24, "x": 0, "y": 8}
      }
    ],
    "time": {
      "from": "now-6h",
      "to": "now"
    },
    "refresh": "30s"
  }
}
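Besides the import screen, the JSON can be pushed through Grafana's HTTP API. A helper sketch; the file name is an assumption, the credentials must match your admin password, and the Grafana port-forward has to be active:

```shell
# POST a dashboard JSON file (already wrapped in a top-level "dashboard" key,
# as above) to Grafana's import endpoint.
import_dashboard() {
    curl -s -u "admin:${GRAFANA_PASSWORD:-secure-admin-password-123}" \
        -H "Content-Type: application/json" \
        -X POST http://localhost:3000/api/dashboards/db \
        -d @"$1"
}
# Usage: import_dashboard cluster-overview.json
```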
Configure AlertManager rules
Create custom alerting rules for important Kubernetes metrics and resource thresholds. Save the following as k8s-alerts.yaml:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-alerts
  namespace: monitoring
  labels:
    app: kube-prometheus-stack
spec:
  groups:
  - name: kubernetes.rules
    rules:
    - alert: KubernetesPodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total[1h]) * 60 * 5 > 0
      for: 15m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
        description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting about {{ $value }} times every 5 minutes (averaged over the last hour)"
    - alert: KubernetesNodeNotReady
      expr: kube_node_status_condition{condition="Ready",status="true"} == 0
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "Kubernetes Node {{ $labels.node }} is not ready"
        description: "Node {{ $labels.node }} has been not ready for more than 10 minutes"
    - alert: KubernetesPodMemoryUsage
      expr: (container_memory_working_set_bytes{container!=""} / container_spec_memory_limit_bytes{container!=""}) * 100 > 90
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} high memory usage"
        description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} memory usage is above 90% of its limit"
kubectl apply -f k8s-alerts.yaml
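The crash-loop expression's scaling is easy to sanity-check: rate() over one hour gives restarts per second, and multiplying by 60 * 5 rescales that to restarts per five-minute window. For a pod that restarted 12 times in the last hour:

```shell
# 12 restarts / 3600 s = per-second rate; * 60 * 5 = restarts per 5 minutes
awk 'BEGIN { rate = 12 / 3600; printf "%.1f restarts per 5m\n", rate * 60 * 5 }'
# prints: 1.0 restarts per 5m
```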
Configure RBAC for monitoring
Create monitoring service account
Create a dedicated service account with RBAC permissions for metrics collection. Note that kube-prometheus-stack already provisions RBAC for its own components; this standalone account is mainly useful if you run an additional, separately managed Prometheus instance. Save the following as monitoring-rbac.yaml:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus-monitoring
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-monitoring
rules:
- apiGroups: [""]
  resources: ["nodes", "nodes/metrics", "services", "endpoints", "pods"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["networking.k8s.io"]
  resources: ["ingresses"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-monitoring
subjects:
- kind: ServiceAccount
  name: prometheus-monitoring
  namespace: monitoring
kubectl apply -f monitoring-rbac.yaml
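You can confirm the bindings took effect with kubectl auth can-i, impersonating the new service account. A small helper sketch:

```shell
# Ask the API server whether the monitoring service account may perform a
# verb on a resource; the expected answer for the rules above is "yes".
check_sa() {
    kubectl auth can-i "$1" "$2" \
        --as="system:serviceaccount:monitoring:prometheus-monitoring"
}
# check_sa list pods
# check_sa watch nodes
```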
Create ingress for external access
Set up ingress resources to access Grafana and Prometheus from outside the cluster with proper TLS. Save the following as monitoring-ingress.yaml, replacing the example.com hostnames with your own:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana-ingress
  namespace: monitoring
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - grafana.example.com
    secretName: grafana-tls
  rules:
  - host: grafana.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: monitoring-grafana
            port:
              number: 80
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus-ingress
  namespace: monitoring
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - prometheus.example.com
    secretName: prometheus-tls
  rules:
  - host: prometheus.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: monitoring-kube-prometheus-prometheus
            port:
              number: 9090
kubectl apply -f monitoring-ingress.yaml
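Once DNS for the hostnames points at your ingress controller, a quick probe confirms TLS and routing. The hostnames below are the placeholders from the manifests above:

```shell
# Print the HTTP status line for an ingress host; -k tolerates a certificate
# that cert-manager has not finished issuing yet.
check_endpoint() {
    curl -skI "https://$1" | head -n 1
}
# check_endpoint grafana.example.com
# check_endpoint prometheus.example.com
```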
Verify your setup
Check that all monitoring components are running and collecting metrics properly.
kubectl get pods -n monitoring
kubectl get servicemonitors -n monitoring
kubectl get podmonitors -n monitoring
kubectl get prometheusrules -n monitoring
Verify Prometheus targets are being discovered:
kubectl port-forward -n monitoring svc/monitoring-kube-prometheus-prometheus 9090:9090
curl http://localhost:9090/api/v1/targets
Test Grafana access and verify dashboards are loading:
kubectl port-forward -n monitoring svc/monitoring-grafana 3000:80
curl -u admin:secure-admin-password-123 http://localhost:3000/api/health
Check that metrics are being collected by running promtool inside the Prometheus container (the operator pod does not ship promtool):
kubectl exec -n monitoring prometheus-monitoring-kube-prometheus-prometheus-0 -c prometheus -- \
promtool query instant http://localhost:9090 'up'
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus
Common issues
| Symptom | Cause | Fix |
|---|---|---|
| Prometheus pods stuck in pending | Insufficient storage or resources | Check PVC status with kubectl get pvc -n monitoring and verify storage class exists |
| Grafana shows no data source | Prometheus service URL incorrect | Verify service name with kubectl get svc -n monitoring and update datasource URL |
| ServiceMonitor not discovering targets | Label selectors don't match | Check service labels match ServiceMonitor selector with kubectl describe servicemonitor |
| High memory usage in Prometheus | Too many metrics or long retention | Reduce retention period or add resource limits in values.yaml |
| AlertManager not sending alerts | Missing or incorrect configuration | Check AlertManager config with kubectl get secret -n monitoring monitoring-kube-prometheus-alertmanager -o yaml |
| Node exporter metrics missing | DaemonSet not deployed on all nodes | Check node selector and tolerations with kubectl get daemonset -n monitoring |
Next steps
- Configure Prometheus long-term storage with Thanos for extended data retention
- Implement custom Prometheus exporters for application metrics collection and monitoring
- Set up Kubernetes alerting with AlertManager and Slack integration
- Monitor Istio service mesh with Prometheus and Grafana dashboards
- Configure Kubernetes monitoring with Jaeger distributed tracing
Automated install script
Run the following script to automate the entire setup:
#!/usr/bin/env bash
set -euo pipefail
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
# Default values
NAMESPACE="monitoring"
GRAFANA_PASSWORD="${GRAFANA_PASSWORD:-secure-admin-password-123}"
STORAGE_SIZE_PROMETHEUS="${PROMETHEUS_STORAGE:-50Gi}"
STORAGE_SIZE_GRAFANA="${GRAFANA_STORAGE:-10Gi}"
STORAGE_CLASS="${STORAGE_CLASS:-standard}"
# Cleanup function
cleanup() {
    echo -e "${RED}[ERROR] Installation failed. Cleaning up...${NC}"
    helm uninstall kube-prometheus-stack -n "$NAMESPACE" 2>/dev/null || true
    kubectl delete namespace "$NAMESPACE" 2>/dev/null || true
    rm -f /tmp/prometheus-values.yaml kubectl 2>/dev/null || true
}
trap cleanup ERR

usage() {
    echo "Usage: $0 [OPTIONS]"
    echo "Options:"
    echo "  -n NAMESPACE           Kubernetes namespace (default: monitoring)"
    echo "  -p PASSWORD            Grafana admin password (default: secure-admin-password-123)"
    echo "  -s STORAGE_CLASS       Storage class name (default: standard)"
    echo "  --prometheus-storage   Prometheus storage size (default: 50Gi)"
    echo "  --grafana-storage      Grafana storage size (default: 10Gi)"
    echo "  -h                     Show this help"
    exit 1
}

# Parse arguments
while [[ $# -gt 0 ]]; do
    case $1 in
        -n) NAMESPACE="$2"; shift 2 ;;
        -p) GRAFANA_PASSWORD="$2"; shift 2 ;;
        -s) STORAGE_CLASS="$2"; shift 2 ;;
        --prometheus-storage) STORAGE_SIZE_PROMETHEUS="$2"; shift 2 ;;
        --grafana-storage) STORAGE_SIZE_GRAFANA="$2"; shift 2 ;;
        -h) usage ;;
        *) echo -e "${RED}Unknown option: $1${NC}"; usage ;;
    esac
done
echo -e "${BLUE}Kubernetes Prometheus/Grafana Monitoring Setup${NC}"
echo "=================================================="
# Detect OS distribution
echo -e "${YELLOW}[1/8] Detecting operating system...${NC}"
if [ -f /etc/os-release ]; then
    . /etc/os-release
    case "$ID" in
        ubuntu|debian)
            PKG_MGR="apt"
            PKG_UPDATE="apt update"
            PKG_INSTALL="apt install -y"
            ;;
        almalinux|rocky|centos|rhel|ol|fedora)
            PKG_MGR="dnf"
            PKG_UPDATE="dnf update -y"
            PKG_INSTALL="dnf install -y"
            ;;
        amzn)
            PKG_MGR="yum"
            PKG_UPDATE="yum update -y"
            PKG_INSTALL="yum install -y"
            ;;
        *)
            echo -e "${RED}Unsupported distribution: $ID${NC}"
            exit 1
            ;;
    esac
else
    echo -e "${RED}/etc/os-release not found. Cannot detect distribution.${NC}"
    exit 1
fi
echo -e "${GREEN}Detected: $PRETTY_NAME${NC}"
# Check if running as root or with sudo
echo -e "${YELLOW}[2/8] Checking privileges...${NC}"
# Under sudo the effective UID is already 0, so checking EUID covers both cases
if [[ $EUID -ne 0 ]]; then
    echo -e "${RED}This script must be run as root or with sudo${NC}"
    exit 1
fi
# Update package repositories
echo -e "${YELLOW}[3/8] Updating package repositories...${NC}"
$PKG_UPDATE
# Install kubectl
echo -e "${YELLOW}[4/8] Installing kubectl...${NC}"
if ! command -v kubectl &> /dev/null; then
    $PKG_INSTALL curl
    KUBECTL_VERSION=$(curl -L -s https://dl.k8s.io/release/stable.txt)
    curl -LO "https://dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/amd64/kubectl"
    install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
    rm -f kubectl
    echo -e "${GREEN}kubectl installed successfully${NC}"
else
    echo -e "${GREEN}kubectl already installed${NC}"
fi
# Install Helm
echo -e "${YELLOW}[5/8] Installing Helm...${NC}"
if ! command -v helm &> /dev/null; then
    curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
    echo -e "${GREEN}Helm installed successfully${NC}"
else
    echo -e "${GREEN}Helm already installed${NC}"
fi
# Verify kubectl connectivity
echo -e "${YELLOW}[6/8] Verifying Kubernetes connectivity...${NC}"
if ! kubectl cluster-info &> /dev/null; then
    echo -e "${RED}Cannot connect to Kubernetes cluster. Please ensure kubectl is configured.${NC}"
    exit 1
fi
echo -e "${GREEN}Kubernetes cluster accessible${NC}"
# Create monitoring namespace
echo -e "${YELLOW}[7/8] Setting up monitoring stack...${NC}"
kubectl create namespace "$NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -
# Add Prometheus community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Create Prometheus values file
cat > /tmp/prometheus-values.yaml << EOF
prometheus:
  prometheusSpec:
    retention: "15d"
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: "$STORAGE_CLASS"
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: $STORAGE_SIZE_PROMETHEUS
    resources:
      requests:
        memory: "2Gi"
        cpu: "1000m"
      limits:
        memory: "4Gi"
        cpu: "2000m"
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false
    serviceMonitorSelector: {}
    podMonitorSelector: {}
    ruleSelector: {}
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: "$STORAGE_CLASS"
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi
    resources:
      requests:
        memory: "256Mi"
        cpu: "100m"
      limits:
        memory: "512Mi"
        cpu: "200m"
grafana:
  adminPassword: "$GRAFANA_PASSWORD"
  persistence:
    enabled: true
    size: $STORAGE_SIZE_GRAFANA
    storageClassName: "$STORAGE_CLASS"
  resources:
    requests:
      memory: "256Mi"
      cpu: "100m"
    limits:
      memory: "512Mi"
      cpu: "200m"
  serviceAccount:
    create: true
    name: grafana
  rbac:
    create: true
  service:
    type: ClusterIP
    port: 3000
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
      - name: Prometheus
        type: prometheus
        url: http://kube-prometheus-stack-prometheus:9090
        access: proxy
        isDefault: true
EOF
# Install kube-prometheus-stack
helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
--namespace "$NAMESPACE" \
--values /tmp/prometheus-values.yaml \
--wait \
--timeout=10m
# Cleanup temporary files
rm -f /tmp/prometheus-values.yaml
echo -e "${YELLOW}[8/8] Verifying installation...${NC}"
# Wait for pods to be ready
kubectl wait --for=condition=ready pod -l "app.kubernetes.io/name=prometheus" -n "$NAMESPACE" --timeout=300s
kubectl wait --for=condition=ready pod -l "app.kubernetes.io/name=grafana" -n "$NAMESPACE" --timeout=300s
# Verify services are running (Completed admission-hook pods are expected and excluded)
if kubectl get pods -n "$NAMESPACE" | grep -E "(prometheus|grafana|alertmanager)" | grep -vE "Running|Completed" &> /dev/null; then
    echo -e "${RED}Some pods are not running properly${NC}"
    kubectl get pods -n "$NAMESPACE"
    exit 1
fi
echo -e "${GREEN}✓ Installation completed successfully!${NC}"
echo ""
echo -e "${BLUE}Access Information:${NC}"
echo "==================="
echo -e "Namespace: ${GREEN}$NAMESPACE${NC}"
echo -e "Grafana URL: ${GREEN}kubectl port-forward -n $NAMESPACE svc/kube-prometheus-stack-grafana 3000:80${NC}"
echo -e "Grafana Login: ${GREEN}admin / $GRAFANA_PASSWORD${NC}"
echo -e "Prometheus URL: ${GREEN}kubectl port-forward -n $NAMESPACE svc/kube-prometheus-stack-prometheus 9090:9090${NC}"
echo -e "AlertManager URL: ${GREEN}kubectl port-forward -n $NAMESPACE svc/kube-prometheus-stack-alertmanager 9093:9093${NC}"
echo ""
echo -e "${YELLOW}To access the services, run the port-forward commands above in separate terminals.${NC}"
Review the script before running. Execute with: bash install.sh