Set up keepalived cluster monitoring with Prometheus alerts and Grafana dashboards

Advanced · 45 min · Apr 13, 2026
Ubuntu 24.04, Debian 12, AlmaLinux 9, Rocky Linux 9

Configure comprehensive monitoring for keepalived VRRP clusters using Prometheus metrics collection, alerting rules for failover events, and Grafana dashboards for high availability visualization.

Prerequisites

  • Two servers for keepalived cluster
  • One server for monitoring stack
  • Basic networking knowledge
  • Root access to all servers

What this solves

Keepalived provides high availability through VRRP (Virtual Router Redundancy Protocol) but does not export Prometheus metrics on its own. This tutorial sets up monitoring for keepalived clusters: notification scripts publish VRRP state as metrics, Prometheus scrapes them and evaluates alerting rules for failover events, and Grafana dashboards visualize the state of your high availability infrastructure in real time.

Step-by-step configuration

Install keepalived cluster

Set up keepalived on both cluster nodes to create a high availability pair with shared virtual IP addresses.

# Ubuntu / Debian
sudo apt update
sudo apt install -y keepalived

# AlmaLinux / Rocky Linux
sudo dnf update -y
sudo dnf install -y keepalived

Configure primary keepalived node

Create /etc/keepalived/keepalived.conf on the primary node with a VRRP instance and a health check.

global_defs {
    router_id KEEPALIVED_PRIMARY
    script_user keepalived_script
    enable_script_security
}

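vrrp_script chk_nginx {
    # keepalived 2.x runs this command directly rather than through a shell,
    # so shell operators such as "||" cannot be used; curl -f already exits non-zero on HTTP errors
    script "/usr/bin/curl -fsS -o /dev/null http://localhost/"
    interval 2
    # must be larger than the 10-point gap between primary (110) and backup (100) to force a failover
    weight -20
    fall 3
    rise 2
}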

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 110
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass changeme123
    }
    virtual_ipaddress {
        203.0.113.100/24 dev eth0
    }
    track_script {
        chk_nginx
    }
    notify_master "/etc/keepalived/scripts/notify_master.sh"
    notify_backup "/etc/keepalived/scripts/notify_backup.sh"
    notify_fault "/etc/keepalived/scripts/notify_fault.sh"
}

Configure backup keepalived node

Create /etc/keepalived/keepalived.conf on the backup node with a lower priority and the same virtual IP configuration.

global_defs {
    router_id KEEPALIVED_BACKUP
    script_user keepalived_script
    enable_script_security
}

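vrrp_script chk_nginx {
    # keepalived 2.x runs this command directly rather than through a shell,
    # so shell operators such as "||" cannot be used; curl -f already exits non-zero on HTTP errors
    script "/usr/bin/curl -fsS -o /dev/null http://localhost/"
    interval 2
    # same value as on the primary node
    weight -20
    fall 3
    rise 2
}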

vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass changeme123
    }
    virtual_ipaddress {
        203.0.113.100/24 dev eth0
    }
    track_script {
        chk_nginx
    }
    notify_master "/etc/keepalived/scripts/notify_master.sh"
    notify_backup "/etc/keepalived/scripts/notify_backup.sh"
    notify_fault "/etc/keepalived/scripts/notify_fault.sh"
}
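
The interaction between the node priorities and the track script weight is easy to get wrong, so here is a minimal sketch of the arithmetic. It assumes the usual keepalived behavior for a negative weight: while the check is failing, the weight is added to the configured priority, and the backup only takes over if the result drops below its own priority. The helper names effective_priority and would_fail_over are hypothetical and not part of keepalived.

```shell
#!/usr/bin/env bash
# Sketch only: effective VRRP priority with a negative-weight track script.
# effective_priority BASE WEIGHT CHECK_FAILED(0|1)
effective_priority() {
    local base=$1 weight=$2 check_failed=$3
    if [ "$check_failed" -eq 1 ] && [ "$weight" -lt 0 ]; then
        echo $(( base + weight ))
    else
        echo "$base"
    fi
}

# would_fail_over PRIMARY_BASE BACKUP_BASE WEIGHT
# Prints "yes" if a failing check pushes the primary below a healthy backup.
would_fail_over() {
    local primary
    primary=$(effective_priority "$1" "$3" 1)
    if [ "$primary" -lt "$2" ]; then
        echo yes
    else
        echo no
    fi
}

would_fail_over 110 100 -2    # 110 - 2 = 108, still above 100
would_fail_over 110 100 -20   # 110 - 20 = 90, below 100
```

With priorities of 110 and 100, the weight has to be larger than 10 in absolute value before a failed health check moves the virtual IP.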

Create keepalived notification scripts

Set up notification scripts that will update metrics files when VRRP state changes occur.

sudo mkdir -p /etc/keepalived/scripts
sudo mkdir -p /var/lib/prometheus/node-exporter

Create /etc/keepalived/scripts/notify_master.sh:

#!/bin/bash
# State 2 = MASTER; the transitions value is the Unix time of the state change
echo "keepalived_vrrp_state{instance=\"VI_1\",state=\"master\"} 2" > /var/lib/prometheus/node-exporter/keepalived.prom
echo "keepalived_vrrp_priority{instance=\"VI_1\"} $(awk '$1 == "priority" {print $2; exit}' /etc/keepalived/keepalived.conf)" >> /var/lib/prometheus/node-exporter/keepalived.prom
echo "keepalived_transitions_total{instance=\"VI_1\",type=\"master\"} $(date +%s)" >> /var/lib/prometheus/node-exporter/keepalived.prom
logger "Keepalived: Transitioned to MASTER state"

Create /etc/keepalived/scripts/notify_backup.sh:

#!/bin/bash
# State 1 = BACKUP
echo "keepalived_vrrp_state{instance=\"VI_1\",state=\"backup\"} 1" > /var/lib/prometheus/node-exporter/keepalived.prom
echo "keepalived_vrrp_priority{instance=\"VI_1\"} $(awk '$1 == "priority" {print $2; exit}' /etc/keepalived/keepalived.conf)" >> /var/lib/prometheus/node-exporter/keepalived.prom
echo "keepalived_transitions_total{instance=\"VI_1\",type=\"backup\"} $(date +%s)" >> /var/lib/prometheus/node-exporter/keepalived.prom
logger "Keepalived: Transitioned to BACKUP state"

Create /etc/keepalived/scripts/notify_fault.sh:

#!/bin/bash
# State 0 = FAULT
echo "keepalived_vrrp_state{instance=\"VI_1\",state=\"fault\"} 0" > /var/lib/prometheus/node-exporter/keepalived.prom
echo "keepalived_vrrp_priority{instance=\"VI_1\"} $(awk '$1 == "priority" {print $2; exit}' /etc/keepalived/keepalived.conf)" >> /var/lib/prometheus/node-exporter/keepalived.prom
echo "keepalived_transitions_total{instance=\"VI_1\",type=\"fault\"} $(date +%s)" >> /var/lib/prometheus/node-exporter/keepalived.prom
logger "Keepalived: Transitioned to FAULT state"
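
Because the scripts above write the metrics file line by line, Node Exporter could read it while it is half written. One common way around this is to write to a temporary file and rename it into place. The following is a sketch under that assumption; write_state_metrics is a hypothetical helper, and the metric names simply mirror the ones used in the notification scripts.

```shell
#!/usr/bin/env bash
# Sketch only: write keepalived state metrics atomically for the textfile collector.
# write_state_metrics STATE VALUE DIR
write_state_metrics() {
    local state=$1 value=$2 dir=$3
    local tmp
    # The temporary name does not end in .prom, so the collector ignores it.
    tmp=$(mktemp "$dir/.keepalived.prom.XXXXXX") || return 1
    {
        echo "keepalived_vrrp_state{instance=\"VI_1\",state=\"$state\"} $value"
        echo "keepalived_transitions_total{instance=\"VI_1\",type=\"$state\"} $(date +%s)"
    } > "$tmp"
    # mktemp creates the file with mode 600; Node Exporter needs read access.
    chmod 644 "$tmp"
    # A rename within the same directory replaces the old file in one step.
    mv "$tmp" "$dir/keepalived.prom"
}
```

A notification script could then call, for example, write_state_metrics master 2 /var/lib/prometheus/node-exporter instead of the individual echo lines.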

Set script permissions and user

Create the keepalived script user and set proper permissions for security.

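sudo useradd -r -s /bin/false keepalived_script
sudo chmod 755 /etc/keepalived/scripts/*.sh
sudo chown -R keepalived_script:keepalived_script /etc/keepalived/scripts
# The notification scripts run as keepalived_script and must be able to write the metrics file;
# Node Exporter only needs read access, which mode 755 provides
sudo chown -R keepalived_script:keepalived_script /var/lib/prometheus/node-exporter
sudo chmod 755 /var/lib/prometheus/node-exporter

Never use chmod 777. It gives every user on the system full access to your files. Instead, fix ownership with chown and use minimal permissions.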
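
To check that nothing in the script or metrics directories ended up with overly broad permissions, a quick audit along these lines can help. This is a sketch that assumes GNU find (the -perm /MODE form); report_loose_modes is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: list regular files in DIR that are group- or world-writable.
# report_loose_modes DIR
report_loose_modes() {
    find "$1" -type f -perm /022 -print
}
```

Running report_loose_modes /etc/keepalived/scripts should print nothing when the permissions are set as described above.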

Install Prometheus Node Exporter

Install Node Exporter on both keepalived nodes to collect system metrics and expose the custom keepalived metrics.

cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
tar -xzf node_exporter-1.8.2.linux-amd64.tar.gz
sudo cp node_exporter-1.8.2.linux-amd64/node_exporter /usr/local/bin/
sudo useradd -r -s /bin/false prometheus

Configure Node Exporter with text file collector

Create /etc/systemd/system/node_exporter.service and enable the text file collector so that Node Exporter reads the keepalived metrics written by the notification scripts.

[Unit]
Description=Prometheus Node Exporter
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/node_exporter \
    --collector.textfile.directory=/var/lib/prometheus/node-exporter \
    --collector.systemd \
    --collector.processes
Restart=always

[Install]
WantedBy=multi-user.target
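
Once Node Exporter is running, it is worth confirming that the keepalived samples actually appear in its output. The sketch below filters Prometheus exposition text for them; keepalived_lines is a hypothetical helper, and port 9100 is Node Exporter's default listen port.

```shell
#!/usr/bin/env bash
# Sketch only: keep keepalived_* samples from Prometheus exposition text on stdin.
# Comment lines (# HELP, # TYPE) and other metrics are dropped.
keepalived_lines() {
    grep -E '^keepalived_[a-z_]+(\{|[[:space:]])' || true
}

# Usage on a keepalived node:
#   curl -s http://localhost:9100/metrics | keepalived_lines
```

If the output is empty, check the node_textfile_scrape_error metric in the same endpoint and confirm that keepalived.prom exists in the text file directory.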

Install Prometheus server

Set up Prometheus to scrape metrics from your keepalived cluster nodes.

cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.54.1/prometheus-2.54.1.linux-amd64.tar.gz
tar -xzf prometheus-2.54.1.linux-amd64.tar.gz
sudo cp prometheus-2.54.1.linux-amd64/prometheus /usr/local/bin/
sudo cp prometheus-2.54.1.linux-amd64/promtool /usr/local/bin/
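# Create the prometheus service user on the monitoring server if it does not exist yet
id prometheus >/dev/null 2>&1 || sudo useradd -r -s /bin/false prometheus
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus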

Configure Prometheus scraping

Create /etc/prometheus/prometheus.yml to collect metrics from both keepalived cluster nodes.

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "/etc/prometheus/rules/keepalived.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - "localhost:9093"

scrape_configs:
  - job_name: 'keepalived-cluster'
    static_configs:
      - targets:
          - '203.0.113.10:9100'  # Primary node
          - '203.0.113.11:9100'  # Backup node
    scrape_interval: 5s
    metrics_path: '/metrics'
    
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

Create Prometheus alerting rules

Set up alerting rules to detect keepalived failover events and cluster issues.

sudo mkdir -p /etc/prometheus/rules

Save the rules as /etc/prometheus/rules/keepalived.yml:

groups:
  - name: keepalived.rules
    rules:
      - alert: KeepaliveFailover
        expr: increase(keepalived_transitions_total[5m]) > 0
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Keepalived failover detected on {{ $labels.instance }}"
          description: "Keepalived instance {{ $labels.instance }} has experienced a state transition in the last 5 minutes."

      - alert: KeepaliveNoMaster
        # absent() fires when no series reports state 2; a sum over an empty result returns nothing
        expr: absent(keepalived_vrrp_state == 2)
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "No keepalived master found in cluster"
          description: "No keepalived instance is currently in MASTER state, indicating a cluster failure."

      - alert: KeepaliveMultipleMasters
        # count the nodes in state 2; summing the values would give 2 for a single master
        expr: count(keepalived_vrrp_state == 2) > 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Multiple keepalived masters detected"
          description: "{{ $value }} keepalived instances are in MASTER state, indicating a split-brain condition."

      - alert: KeepaliveInstanceDown
        expr: up{job="keepalived-cluster"} == 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Keepalived node {{ $labels.instance }} is down"
          description: "Cannot scrape metrics from keepalived node {{ $labels.instance }} for more than 1 minute."

      - alert: KeepaliveHighFailoverRate
        expr: rate(keepalived_transitions_total[1h]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High keepalived failover rate on {{ $labels.instance }}"
          description: "Keepalived instance {{ $labels.instance }} is experiencing frequent state transitions (rate over the last hour: {{ $value }} per second)."

      - alert: KeepaliveFaultState
        expr: keepalived_vrrp_state == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Keepalived instance {{ $labels.instance }} in FAULT state"
          description: "Keepalived instance {{ $labels.instance }} has been in FAULT state for more than 1 minute."
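
The master-related alerts boil down to counting how many nodes report state 2. As a rough illustration of that logic outside Prometheus, the sketch below classifies a cluster from one state value per node (0 = FAULT, 1 = BACKUP, 2 = MASTER, as written by the notification scripts). classify_cluster is a hypothetical helper, and it counts nodes in state 2 rather than summing the values.

```shell
#!/usr/bin/env bash
# Sketch only: classify cluster health from per-node VRRP state values.
# classify_cluster STATE [STATE...]
classify_cluster() {
    local masters=0 s
    for s in "$@"; do
        if [ "$s" -eq 2 ]; then
            masters=$(( masters + 1 ))
        fi
    done
    if [ "$masters" -eq 0 ]; then
        echo "no-master"
    elif [ "$masters" -gt 1 ]; then
        echo "multiple-masters"
    else
        echo "ok"
    fi
}

classify_cluster 2 1   # one master, one backup
classify_cluster 2 2   # both nodes claim MASTER
classify_cluster 1 0   # nobody is MASTER
```

The same three outcomes correspond to the healthy case, the split-brain alert, and the no-master alert.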

Install Prometheus Alertmanager

Install Alertmanager to handle alert notifications from Prometheus.

cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz
tar -xzf alertmanager-0.27.0.linux-amd64.tar.gz
sudo cp alertmanager-0.27.0.linux-amd64/alertmanager /usr/local/bin/
sudo mkdir -p /etc/alertmanager

Configure Alertmanager

Create /etc/alertmanager/alertmanager.yml with a basic configuration for email notifications.

global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@example.com'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'keepalived-alerts'

receivers:
  - name: 'keepalived-alerts'
    email_configs:
      - to: 'admin@example.com'
        # email_configs has no subject/body keys: the subject is set as a header, the body via text
        headers:
          Subject: 'Keepalived Alert: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Instance: {{ .Labels.instance }}
          Severity: {{ .Labels.severity }}
          {{ end }}
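
To exercise the notification path without waiting for a real failover, you can push a hand-made alert to Alertmanager's v2 API once the service is running. The sketch below only builds the JSON payload; build_test_alert and the alert name KeepalivedTestAlert are hypothetical, and the address localhost:9093 is taken from the Prometheus configuration above.

```shell
#!/usr/bin/env bash
# Sketch only: build a test alert payload for POST /api/v2/alerts.
# build_test_alert INSTANCE
build_test_alert() {
    local instance=$1
    cat <<EOF
[{"labels":{"alertname":"KeepalivedTestAlert","severity":"warning","instance":"$instance"},
  "annotations":{"summary":"Test alert for $instance","description":"Manual test of the notification path."}}]
EOF
}

# Usage on the monitoring server:
#   build_test_alert 203.0.113.10 | \
#     curl -s -X POST -H 'Content-Type: application/json' --data @- http://localhost:9093/api/v2/alerts
```

If the email does not arrive, the Alertmanager log (journalctl -u alertmanager) usually shows the SMTP error.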

Create systemd services

Create systemd service files for Prometheus (/etc/systemd/system/prometheus.service) and Alertmanager (/etc/systemd/system/alertmanager.service).

[Unit]
Description=Prometheus Server
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus/ \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries \
    --web.listen-address=0.0.0.0:9090 \
    --web.enable-lifecycle
Restart=always

[Install]
WantedBy=multi-user.target

[Unit]
Description=Prometheus Alertmanager
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/alertmanager \
    --config.file=/etc/alertmanager/alertmanager.yml \
    --storage.path=/var/lib/alertmanager/
Restart=always

[Install]
WantedBy=multi-user.target

Install Grafana

Install Grafana for visualizing keepalived cluster metrics and creating dashboards.

# Ubuntu / Debian
curl -fsSL https://packages.grafana.com/gpg.key | sudo gpg --dearmor -o /usr/share/keyrings/grafana.gpg
echo "deb [signed-by=/usr/share/keyrings/grafana.gpg] https://packages.grafana.com/oss/deb stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update
sudo apt install -y grafana

# AlmaLinux / Rocky Linux
sudo dnf install -y https://dl.grafana.com/oss/release/grafana-11.3.0-1.x86_64.rpm

Configure Grafana data source

Add Prometheus as a data source by placing a provisioning file in /etc/grafana/provisioning/datasources/ so Grafana can query the keepalived metrics.

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
    editable: true

Create Grafana keepalived dashboard

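Create a dashboard for monitoring keepalived cluster status and metrics. The JSON below is wrapped in a "dashboard" object, which is the form accepted by Grafana's dashboard HTTP API (POST /api/dashboards/db).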

{
  "dashboard": {
    "id": null,
    "title": "Keepalived Cluster Monitoring",
    "tags": ["keepalived", "vrrp", "high-availability"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "VRRP Instance States",
        "type": "stat",
        "targets": [
          {
            "expr": "keepalived_vrrp_state",
            "legendFormat": "{{instance}} - {{state}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "mappings": [
              {"type": "value", "options": {"0": {"text": "FAULT", "color": "red"}}},
              {"type": "value", "options": {"1": {"text": "BACKUP", "color": "yellow"}}},
              {"type": "value", "options": {"2": {"text": "MASTER", "color": "green"}}}
            ]
          }
        },
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}
      },
      {
        "id": 2,
        "title": "Failover Events",
        "type": "timeseries",
        "targets": [
          {
            "expr": "increase(keepalived_transitions_total[5m])",
            "legendFormat": "{{instance}} - {{type}}"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}
      },
      {
        "id": 3,
        "title": "Instance Priorities",
        "type": "timeseries",
        "targets": [
          {
            "expr": "keepalived_vrrp_priority",
            "legendFormat": "{{instance}}"
          }
        ],
        "gridPos": {"h": 8, "w": 24, "x": 0, "y": 8}
      }
    ],
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "refresh": "5s"
  }
}

Set ownership and permissions

Configure proper ownership for all service directories and files.

sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
sudo chown -R prometheus:prometheus /etc/alertmanager
sudo mkdir -p /var/lib/alertmanager
sudo chown -R prometheus:prometheus /var/lib/alertmanager
sudo chown -R grafana:grafana /etc/grafana

Start all services

Enable and start keepalived, monitoring, and visualization services on all nodes.

# On both keepalived nodes
sudo systemctl daemon-reload
sudo systemctl enable --now keepalived node_exporter

# On monitoring server
sudo systemctl daemon-reload
sudo systemctl enable --now prometheus alertmanager grafana-server

Configure firewall rules

Open necessary ports for monitoring communication between cluster nodes.

# Ubuntu / Debian (ufw), on keepalived nodes
sudo ufw allow 9100/tcp  # Node Exporter
sudo ufw allow from any to 224.0.0.18  # VRRP advertisements are sent to this multicast group

# Ubuntu / Debian (ufw), on monitoring server
sudo ufw allow 9090/tcp  # Prometheus
sudo ufw allow 9093/tcp  # Alertmanager
sudo ufw allow 3000/tcp  # Grafana

# AlmaLinux / Rocky Linux (firewalld), on keepalived nodes
sudo firewall-cmd --permanent --add-port=9100/tcp
sudo firewall-cmd --permanent --add-rich-rule='rule protocol value="vrrp" accept'
sudo firewall-cmd --reload

# AlmaLinux / Rocky Linux (firewalld), on monitoring server
sudo firewall-cmd --permanent --add-port=9090/tcp
sudo firewall-cmd --permanent --add-port=9093/tcp
sudo firewall-cmd --permanent --add-port=3000/tcp
sudo firewall-cmd --reload

Verify your setup

Test your keepalived cluster monitoring by checking service status and triggering failover events.

# Check keepalived status on both nodes
sudo systemctl status keepalived
ip addr show  # Look for virtual IP

# Check monitoring services (on monitoring server)
sudo systemctl status prometheus alertmanager grafana-server

# Check Node Exporter (on both keepalived nodes)
sudo systemctl status node_exporter

# Test Prometheus targets
curl http://localhost:9090/api/v1/targets

# Test Prometheus metrics
curl "http://localhost:9090/api/v1/query?query=keepalived_vrrp_state"

# Check Grafana dashboard access
curl -I http://localhost:3000

# Test keepalived failover
sudo systemctl stop keepalived  # On master node
ip addr show  # On backup node: verify the VIP moved there
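
During a failover test it is handy to check for the virtual IP from a script instead of reading ip addr show by eye. The following sketch assumes the VIP 203.0.113.100 from the keepalived configuration and the one-line output format of ip -4 -o addr show; has_vip and wait_for_vip are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch only: detect whether a virtual IP is configured on this host.
# has_vip VIP   (reads `ip -4 -o addr show` output on stdin)
has_vip() {
    # Fixed-string match on "inet <vip>/" so 203.0.113.10 does not match 203.0.113.100.
    grep -qF "inet $1/"
}

# wait_for_vip VIP [TRIES]: poll once per second until the VIP appears locally.
wait_for_vip() {
    local vip=$1 tries=${2:-10}
    while [ "$tries" -gt 0 ]; do
        if ip -4 -o addr show | has_vip "$vip"; then
            return 0
        fi
        sleep 1
        tries=$(( tries - 1 ))
    done
    return 1
}

# Usage on the backup node after stopping keepalived on the master:
#   wait_for_vip 203.0.113.100 15 && echo "VIP moved to this node"
```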

Common issues

Symptom | Cause | Fix
Split-brain condition | Network partition or authentication mismatch | Check network connectivity and verify auth_pass matches on both nodes
Virtual IP not moving | Priority misconfiguration or script failures | Check priority values and test health check scripts manually
Prometheus can't scrape metrics | Node Exporter not running or firewall blocking | Verify systemctl status node_exporter and check firewall rules
No keepalived metrics in Prometheus | Text file collector not configured | Ensure --collector.textfile.directory flag is set and scripts have write permissions
Grafana dashboard shows no data | Prometheus data source not configured | Check /etc/grafana/provisioning/datasources/ configuration and restart grafana
Alert notifications not working | Alertmanager configuration or SMTP issues | Check /etc/alertmanager/alertmanager.yml and test SMTP connectivity
