Keepalived Cluster Monitoring with Prometheus

Configure comprehensive monitoring for keepalived VRRP clusters using Prometheus metrics collection, alerting rules for failover events, and Grafana dashboards for high availability visualization.

Prerequisites

Two servers for keepalived cluster
One server for monitoring stack
Basic networking knowledge
Root access to all servers

What this solves

Keepalived provides high availability through VRRP (Virtual Router Redundancy Protocol) but does not expose Prometheus metrics on its own. This tutorial uses notification scripts and the Node Exporter textfile collector to publish VRRP state metrics, adds Prometheus alerting rules for failover events, and builds a Grafana dashboard that shows the state of the cluster in real time.

Step-by-step configuration

Install keepalived cluster

Set up keepalived on both cluster nodes to create a high availability pair with shared virtual IP addresses.

# Debian/Ubuntu
sudo apt update
sudo apt install -y keepalived curl

# AlmaLinux/Rocky/RHEL/Fedora
sudo dnf update -y
sudo dnf install -y keepalived curl

Configure primary keepalived node

Create /etc/keepalived/keepalived.conf on the primary node with a VRRP instance and a health check.

global_defs {
    router_id KEEPALIVED_PRIMARY
    script_user keepalived_script
    enable_script_security
}

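vrrp_script chk_nginx {
    # keepalived 2.x runs this command directly, not through a shell, so shell
    # operators such as "||" cannot be used; curl -f already exits non-zero
    # on an HTTP error.
    script "/usr/bin/curl -fsS -o /dev/null http://localhost/"
    interval 2
    # Must be larger than the 10-point gap between the two nodes (110 vs 100),
    # otherwise a failed check never hands the VIP to the backup.
    weight -20
    fall 3
    rise 2
}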

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 110
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass changeme123  # replace with your own secret; keepalived uses only the first 8 characters
    }
    virtual_ipaddress {
        203.0.113.100/24 dev eth0
    }
    track_script {
        chk_nginx
    }
    notify_master "/etc/keepalived/scripts/notify_master.sh"
    notify_backup "/etc/keepalived/scripts/notify_backup.sh"
    notify_fault "/etc/keepalived/scripts/notify_fault.sh"
}

Configure backup keepalived node

Create /etc/keepalived/keepalived.conf on the backup node with a lower priority and the same virtual IP configuration.

global_defs {
    router_id KEEPALIVED_BACKUP
    script_user keepalived_script
    enable_script_security
}

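vrrp_script chk_nginx {
    # keepalived 2.x runs this command directly, not through a shell, so shell
    # operators such as "||" cannot be used; curl -f already exits non-zero
    # on an HTTP error.
    script "/usr/bin/curl -fsS -o /dev/null http://localhost/"
    interval 2
    # Keep the same value on both nodes; it must be larger than the 10-point
    # gap between them (110 vs 100).
    weight -20
    fall 3
    rise 2
}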

vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass changeme123  # must match the primary; keepalived uses only the first 8 characters
    }
    virtual_ipaddress {
        203.0.113.100/24 dev eth0
    }
    track_script {
        chk_nginx
    }
    notify_master "/etc/keepalived/scripts/notify_master.sh"
    notify_backup "/etc/keepalived/scripts/notify_backup.sh"
    notify_fault "/etc/keepalived/scripts/notify_fault.sh"
}
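
With the priorities used in this tutorial (110 on the primary, 100 on the backup), the `weight` of the health check decides whether a failed check actually moves the virtual IP: a negative weight is added to the node's priority while the check is failing, and the backup only takes over if the result drops below its own priority. The following is a minimal sketch of that arithmetic; the function names are hypothetical helpers, not part of keepalived.

```shell
#!/usr/bin/env bash
# Sketch only: reproduces the priority arithmetic, it does not talk to keepalived.

# effective_priority BASE WEIGHT
# Priority a node advertises while a tracked script with a negative
# weight is failing.
effective_priority() {
    local base=$1 weight=$2
    echo $(( base + weight ))
}

# will_fail_over PRIMARY BACKUP WEIGHT
# Succeeds when the failing primary drops below the healthy backup.
will_fail_over() {
    local primary=$1 backup=$2 weight=$3
    (( $(effective_priority "$primary" "$weight") < backup ))
}

for weight in -2 -20; do
    if will_fail_over 110 100 "$weight"; then
        echo "weight $weight: backup takes over"
    else
        echo "weight $weight: primary keeps the VIP"
    fi
done
```

A weight of -2 leaves the primary at 108, still above the backup, so the weight has to exceed the gap between the two configured priorities.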

Create keepalived notification scripts

Set up notification scripts that will update metrics files when VRRP state changes occur.

sudo mkdir -p /etc/keepalived/scripts
sudo mkdir -p /var/lib/prometheus/node-exporter

#!/bin/bash
# /etc/keepalived/scripts/notify_master.sh
# State encoding: 2 = MASTER, 1 = BACKUP, 0 = FAULT.
# keepalived_transitions_total holds the Unix timestamp of the latest transition.
METRICS=/var/lib/prometheus/node-exporter/keepalived.prom
PRIORITY=$(awk '$1 == "priority" {print $2; exit}' /etc/keepalived/keepalived.conf)
echo "keepalived_vrrp_state{instance=\"VI_1\",state=\"master\"} 2" > "$METRICS"
echo "keepalived_vrrp_priority{instance=\"VI_1\"} $PRIORITY" >> "$METRICS"
echo "keepalived_transitions_total{instance=\"VI_1\",type=\"master\"} $(date +%s)" >> "$METRICS"
logger "Keepalived: Transitioned to MASTER state"

#!/bin/bash
# /etc/keepalived/scripts/notify_backup.sh
METRICS=/var/lib/prometheus/node-exporter/keepalived.prom
PRIORITY=$(awk '$1 == "priority" {print $2; exit}' /etc/keepalived/keepalived.conf)
echo "keepalived_vrrp_state{instance=\"VI_1\",state=\"backup\"} 1" > "$METRICS"
echo "keepalived_vrrp_priority{instance=\"VI_1\"} $PRIORITY" >> "$METRICS"
echo "keepalived_transitions_total{instance=\"VI_1\",type=\"backup\"} $(date +%s)" >> "$METRICS"
logger "Keepalived: Transitioned to BACKUP state"

#!/bin/bash
# /etc/keepalived/scripts/notify_fault.sh
METRICS=/var/lib/prometheus/node-exporter/keepalived.prom
PRIORITY=$(awk '$1 == "priority" {print $2; exit}' /etc/keepalived/keepalived.conf)
echo "keepalived_vrrp_state{instance=\"VI_1\",state=\"fault\"} 0" > "$METRICS"
echo "keepalived_vrrp_priority{instance=\"VI_1\"} $PRIORITY" >> "$METRICS"
echo "keepalived_transitions_total{instance=\"VI_1\",type=\"fault\"} $(date +%s)" >> "$METRICS"
logger "Keepalived: Transitioned to FAULT state"
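
The three scripts differ only in the state name and its numeric value, and each one writes the metrics file in several steps, so Node Exporter could read it half-written. A possible variant, sketched below, puts the shared logic in one function and writes to a temporary file that is renamed into place. The function name `write_vrrp_metrics` and its arguments are assumptions made for illustration; the metric names and values are the ones used in this tutorial.

```shell
#!/usr/bin/env bash
# Sketch only: one helper instead of three near-identical scripts.

# write_vrrp_metrics STATE PRIORITY OUTFILE
# STATE is master, backup, or fault. The file is written to a temporary
# name in the same directory and renamed, so a reader never sees a
# partially written file.
write_vrrp_metrics() {
    local state=$1 priority=$2 outfile=$3 value tmp
    case "$state" in
        master) value=2 ;;
        backup) value=1 ;;
        fault)  value=0 ;;
        *) echo "unknown state: $state" >&2; return 1 ;;
    esac
    tmp=$(mktemp "${outfile}.XXXXXX") || return 1
    {
        echo "keepalived_vrrp_state{instance=\"VI_1\",state=\"$state\"} $value"
        echo "keepalived_vrrp_priority{instance=\"VI_1\"} $priority"
        echo "keepalived_transitions_total{instance=\"VI_1\",type=\"$state\"} $(date +%s)"
    } > "$tmp"
    chmod 644 "$tmp"      # mktemp creates the file with mode 600
    mv "$tmp" "$outfile"  # rename within one directory replaces the file in one step
}

# Example call, as a notify script might issue it:
# write_vrrp_metrics master 110 /var/lib/prometheus/node-exporter/keepalived.prom
```

Each notify script would then reduce to a single call with its own state name.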

Set script permissions and user

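Create the keepalived script user, keep the scripts root-owned, and give the script user write access to the metrics directory.

sudo useradd -r -s /bin/false keepalived_script
sudo chown -R root:root /etc/keepalived/scripts
sudo chmod 755 /etc/keepalived/scripts/*.sh
# The notify scripts run as keepalived_script and must be able to write here;
# mode 755 lets Node Exporter read the resulting file.
sudo chown -R keepalived_script: /var/lib/prometheus/node-exporter
sudo chmod 755 /var/lib/prometheus/node-exporter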

Never use chmod 777. It gives every user on the system full access to your files. Instead, fix ownership with chown and use minimal permissions.

Install Prometheus Node Exporter

Install Node Exporter on both keepalived nodes to collect system metrics and expose the custom keepalived metrics.

cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
tar -xzf node_exporter-1.8.2.linux-amd64.tar.gz
sudo cp node_exporter-1.8.2.linux-amd64/node_exporter /usr/local/bin/
sudo useradd -r -s /bin/false prometheus

Configure Node Exporter with text file collector

Create /etc/systemd/system/node_exporter.service and enable the textfile collector so Node Exporter reads the metrics written by the notification scripts.

[Unit]
Description=Prometheus Node Exporter
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/node_exporter \
    --collector.textfile.directory=/var/lib/prometheus/node-exporter \
    --collector.systemd \
    --collector.processes
Restart=always

[Install]
WantedBy=multi-user.target
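
Once Node Exporter is running with the textfile collector, the keepalived series should appear in its `/metrics` output next to the system metrics. The sketch below reads that output on standard input and prints the VRRP state as a name, using the numeric encoding from the notification scripts; `vrrp_state_name` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch only: turns the numeric keepalived_vrrp_state sample into a name.

# vrrp_state_name
# Reads Prometheus text exposition format on stdin and prints MASTER,
# BACKUP, FAULT, UNKNOWN, or MISSING (no keepalived_vrrp_state sample found).
vrrp_state_name() {
    awk '
        index($0, "keepalived_vrrp_state{") == 1 {
            if ($NF == 2) print "MASTER"
            else if ($NF == 1) print "BACKUP"
            else if ($NF == 0) print "FAULT"
            else print "UNKNOWN"
            found = 1
            exit
        }
        END { if (!found) print "MISSING" }
    '
}

# Example, on a keepalived node:
# curl -s http://localhost:9100/metrics | vrrp_state_name
```

MISSING usually means the metrics file has not been written yet or the collector directory does not match the path used by the scripts.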

Install Prometheus server

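Set up Prometheus on the monitoring server to scrape metrics from your keepalived cluster nodes.

cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.54.1/prometheus-2.54.1.linux-amd64.tar.gz
tar -xzf prometheus-2.54.1.linux-amd64.tar.gz
sudo cp prometheus-2.54.1.linux-amd64/prometheus /usr/local/bin/
sudo cp prometheus-2.54.1.linux-amd64/promtool /usr/local/bin/
# The monitoring server needs the prometheus user as well
id -u prometheus >/dev/null 2>&1 || sudo useradd -r -s /bin/false prometheus
sudo mkdir -p /etc/prometheus /var/lib/prometheus
# Console files referenced by the --web.console.* flags in the service unit
sudo cp -r prometheus-2.54.1.linux-amd64/consoles prometheus-2.54.1.linux-amd64/console_libraries /etc/prometheus/
sudo chown prometheus:prometheus /var/lib/prometheus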

Configure Prometheus scraping

Configure Prometheus in /etc/prometheus/prometheus.yml to collect metrics from both keepalived cluster nodes.

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "/etc/prometheus/rules/keepalived.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - "localhost:9093"

scrape_configs:
  - job_name: 'keepalived-cluster'
    static_configs:
      - targets:
          - '203.0.113.10:9100'  # Primary node
          - '203.0.113.11:9100'  # Backup node
    scrape_interval: 5s
    metrics_path: '/metrics'
    
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

Create Prometheus alerting rules

Set up alerting rules to detect keepalived failover events and cluster issues.

sudo mkdir -p /etc/prometheus/rules

# /etc/prometheus/rules/keepalived.yml
groups:
  - name: keepalived.rules
    rules:
      # keepalived_transitions_total stores the Unix timestamp of the most
      # recent transition, so compare it with the current time.
      - alert: KeepaliveFailover
        expr: time() - keepalived_transitions_total < 300
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Keepalived failover detected on {{ $labels.instance }}"
          description: "Keepalived instance {{ $labels.instance }} has experienced a state transition in the last 5 minutes."

      # absent() is needed here: a comparison that matches no series returns
      # an empty result, which would never equal 0.
      - alert: KeepaliveNoMaster
        expr: absent(keepalived_vrrp_state == 2)
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "No keepalived master found in cluster"
          description: "No keepalived instance is currently reporting MASTER state: the nodes are down, in BACKUP or FAULT, or not being scraped."

      # count() counts the masters; sum() would add up the state values (2 each).
      - alert: KeepaliveMultipleMasters
        expr: count(keepalived_vrrp_state == 2) > 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Multiple keepalived masters detected"
          description: "{{ $value }} keepalived instances are in MASTER state, indicating a split-brain condition."

      - alert: KeepaliveInstanceDown
        expr: up{job="keepalived-cluster"} == 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Keepalived node {{ $labels.instance }} is down"
          description: "Cannot scrape metrics from keepalived node {{ $labels.instance }} for more than 1 minute."

      # changes() counts how often a node re-entered a state it had already
      # been in during the last hour; tune the threshold to your environment.
      - alert: KeepaliveHighFailoverRate
        expr: sum by (instance) (changes(keepalived_transitions_total[1h])) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High keepalived failover rate on {{ $labels.instance }}"
          description: "Keepalived instance {{ $labels.instance }} is experiencing frequent state transitions ({{ $value }} repeated state changes in the last hour)."

      - alert: KeepaliveFaultState
        expr: keepalived_vrrp_state == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Keepalived instance {{ $labels.instance }} in FAULT state"
          description: "Keepalived instance {{ $labels.instance }} has been in FAULT state for more than 1 minute."
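
The master-related rules boil down to counting how many nodes report state 2, not adding up the state values. As a rough illustration of that logic outside Prometheus, the sketch below classifies a cluster from the numeric states of its nodes (2 = MASTER, 1 = BACKUP, 0 = FAULT); `classify_cluster` and its output labels are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: mirrors the intent of the alert rules for a list of node states.

# classify_cluster STATE...
# Prints no-master, multiple-masters, fault, or ok.
classify_cluster() {
    local masters=0 faults=0 state
    for state in "$@"; do
        case "$state" in
            2) masters=$((masters + 1)) ;;
            0) faults=$((faults + 1)) ;;
        esac
    done
    if (( masters == 0 )); then
        echo "no-master"
    elif (( masters > 1 )); then
        echo "multiple-masters"
    elif (( faults > 0 )); then
        echo "fault"
    else
        echo "ok"
    fi
}

classify_cluster 2 1   # healthy pair
classify_cluster 2 2   # split-brain
classify_cluster 1 1   # nobody holds the VIP
```

Note that a healthy pair has a state sum of 3 and a single master alone already sums to 2, which is why the count of masters is the useful quantity.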

Install Prometheus Alertmanager

Install Alertmanager to handle alert notifications from Prometheus.

cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz
tar -xzf alertmanager-0.27.0.linux-amd64.tar.gz
sudo cp alertmanager-0.27.0.linux-amd64/alertmanager /usr/local/bin/
sudo mkdir -p /etc/alertmanager

Configure Alertmanager

Set up a basic Alertmanager configuration for email notifications in /etc/alertmanager/alertmanager.yml.

global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@example.com'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'keepalived-alerts'

receivers:
  - name: 'keepalived-alerts'
    email_configs:
      - to: 'admin@example.com'
        # Alertmanager sets the subject through headers and the plain-text body through text
        headers:
          Subject: 'Keepalived Alert: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Instance: {{ .Labels.instance }}
          Severity: {{ .Labels.severity }}
          {{ end }}

Create systemd services

Create systemd service files for Prometheus (/etc/systemd/system/prometheus.service) and Alertmanager (/etc/systemd/system/alertmanager.service), in that order.

[Unit]
Description=Prometheus Server
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus/ \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries \
    --web.listen-address=0.0.0.0:9090 \
    --web.enable-lifecycle
Restart=always

[Install]
WantedBy=multi-user.target

[Unit]
Description=Prometheus Alertmanager
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/alertmanager \
    --config.file=/etc/alertmanager/alertmanager.yml \
    --storage.path=/var/lib/alertmanager/
Restart=always

[Install]
WantedBy=multi-user.target

Install Grafana

Install Grafana for visualizing keepalived cluster metrics and creating dashboards.

# Debian/Ubuntu
curl -fsSL https://packages.grafana.com/gpg.key | sudo gpg --dearmor -o /usr/share/keyrings/grafana.gpg
echo "deb [signed-by=/usr/share/keyrings/grafana.gpg] https://packages.grafana.com/oss/deb stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update
sudo apt install -y grafana

# AlmaLinux/Rocky/RHEL/Fedora
sudo dnf install -y https://dl.grafana.com/oss/release/grafana-11.3.0-1.x86_64.rpm

Configure Grafana data source

Add Prometheus as a provisioned data source by placing this file under /etc/grafana/provisioning/datasources/ so Grafana can query the keepalived metrics.

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
    editable: true

Create Grafana keepalived dashboard

Create a comprehensive dashboard for monitoring keepalived cluster status and metrics.

{
  "dashboard": {
    "id": null,
    "title": "Keepalived Cluster Monitoring",
    "tags": ["keepalived", "vrrp", "high-availability"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "VRRP Instance States",
        "type": "stat",
        "targets": [
          {
            "expr": "keepalived_vrrp_state",
            "legendFormat": "{{instance}} - {{state}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "mappings": [
              {"type": "value", "options": {"0": {"text": "FAULT", "color": "red"}}},
              {"type": "value", "options": {"1": {"text": "BACKUP", "color": "yellow"}}},
              {"type": "value", "options": {"2": {"text": "MASTER", "color": "green"}}}
            ]
          }
        },
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}
      },
      {
        "id": 2,
        "title": "Seconds Since Last Transition",
        "type": "timeseries",
        "targets": [
          {
            "expr": "time() - keepalived_transitions_total",
            "legendFormat": "{{instance}} - {{type}}"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}
      },
      {
        "id": 3,
        "title": "Instance Priorities",
        "type": "timeseries",
        "targets": [
          {
            "expr": "keepalived_vrrp_priority",
            "legendFormat": "{{instance}}"
          }
        ],
        "gridPos": {"h": 8, "w": 24, "x": 0, "y": 8}
      }
    ],
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "refresh": "5s"
  }
}

Set ownership and permissions

Configure proper ownership for all service directories and files.

sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
sudo chown -R prometheus:prometheus /etc/alertmanager
sudo mkdir -p /var/lib/alertmanager
sudo chown -R prometheus:prometheus /var/lib/alertmanager
sudo chown -R grafana:grafana /etc/grafana

Start all services

Enable and start keepalived, monitoring, and visualization services on all nodes.

# On both keepalived nodes
sudo systemctl daemon-reload
sudo systemctl enable --now keepalived node_exporter

# On monitoring server
sudo systemctl daemon-reload
sudo systemctl enable --now prometheus alertmanager grafana-server

Configure firewall rules

Open the ports Prometheus needs to reach the cluster nodes, and allow VRRP advertisements between the nodes.

# ufw (Debian/Ubuntu), on keepalived nodes
sudo ufw allow 9100/tcp  # Node Exporter
sudo ufw allow to 224.0.0.18  # VRRP advertisements are sent to this multicast address

# ufw, on monitoring server
sudo ufw allow 9090/tcp  # Prometheus
sudo ufw allow 9093/tcp  # Alertmanager
sudo ufw allow 3000/tcp  # Grafana

# firewalld (RHEL family), on keepalived nodes
sudo firewall-cmd --permanent --add-port=9100/tcp
sudo firewall-cmd --permanent --add-rich-rule='rule protocol value="vrrp" accept'
sudo firewall-cmd --reload

# firewalld, on monitoring server
sudo firewall-cmd --permanent --add-port=9090/tcp
sudo firewall-cmd --permanent --add-port=9093/tcp
sudo firewall-cmd --permanent --add-port=3000/tcp
sudo firewall-cmd --reload

Verify your setup

Test your keepalived cluster monitoring by checking service status and triggering failover events.

# Check keepalived status on both nodes
sudo systemctl status keepalived
ip addr show  # Look for virtual IP

# Check monitoring services on the monitoring server
sudo systemctl status prometheus alertmanager grafana-server

# Check Node Exporter on both keepalived nodes
sudo systemctl status node_exporter

# Test Prometheus targets
curl http://localhost:9090/api/v1/targets

# Test Prometheus metrics
curl "http://localhost:9090/api/v1/query?query=keepalived_vrrp_state"

# Check Grafana dashboard access
curl -I http://localhost:3000

# Test keepalived failover
sudo systemctl stop keepalived  # On master node
ip addr show  # On backup node: verify the VIP moved here
sudo systemctl start keepalived  # On the former master, once the test is done
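
Looking for the virtual IP by eye in the `ip addr show` output works, but the check is easy to script when the failover test is repeated. The sketch below reads the one-line-per-address output of `ip -o -4 addr show` on standard input and succeeds if the given address is present; `has_vip` is a hypothetical helper, and the address is the example VIP from this tutorial.

```shell
#!/usr/bin/env bash
# Sketch only: reports whether a node currently holds the virtual IP.

# has_vip ADDRESS[/PREFIX]
# Reads `ip -o -4 addr show` output on stdin. Succeeds when ADDRESS is
# configured on any interface; a /PREFIX suffix on the argument is ignored.
has_vip() {
    local vip=${1%%/*}
    # Fixed-string match on " inet <address>/" so that 203.0.113.1 does
    # not match 203.0.113.10 or 203.0.113.100.
    grep -qF " inet ${vip}/"
}

# Example, on either node:
# ip -o -4 addr show | has_vip 203.0.113.100/24 && echo "this node holds the VIP"
```

Running it on both nodes before and after stopping keepalived on the master shows the address moving from one node to the other.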

Common issues

Symptom	Cause	Fix
Split-brain condition	Network partition or authentication mismatch	Check network connectivity and verify `auth_pass` matches on both nodes
Virtual IP not moving	Priority misconfiguration or script failures	Check `priority` values and test health check scripts manually
Prometheus can't scrape metrics	Node Exporter not running or firewall blocking	Verify `systemctl status node_exporter` and check firewall rules
No keepalived metrics in Prometheus	Text file collector not configured	Ensure `--collector.textfile.directory` flag is set and scripts have write permissions
Grafana dashboard shows no data	Prometheus data source not configured	Check `/etc/grafana/provisioning/datasources/` configuration and restart grafana
Alert notifications not working	Alertmanager configuration or SMTP issues	Check `/etc/alertmanager/alertmanager.yml` and test SMTP connectivity

Next steps

Automated install script

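Run this on each keepalived node to automate the keepalived installation, configuration, and notification scripts. It does not install Node Exporter, Prometheus, Alertmanager, or Grafana.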

install.sh

#!/usr/bin/env bash

set -euo pipefail

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

# Default values
NODE_TYPE=""
INTERFACE="eth0"
VIRTUAL_IP=""
PRIORITY=""
AUTH_PASS="changeme123"

# Usage function
usage() {
    echo "Usage: $0 --node-type [primary|backup] --virtual-ip IP --interface INTERFACE [--priority NUM] [--auth-pass PASS]"
    echo "Example: $0 --node-type primary --virtual-ip 203.0.113.100/24 --interface eth0"
    exit 1
}

# Parse arguments
while [[ $# -gt 0 ]]; do
    case $1 in
        --node-type)
            NODE_TYPE="$2"
            shift 2
            ;;
        --virtual-ip)
            VIRTUAL_IP="$2"
            shift 2
            ;;
        --interface)
            INTERFACE="$2"
            shift 2
            ;;
        --priority)
            PRIORITY="$2"
            shift 2
            ;;
        --auth-pass)
            AUTH_PASS="$2"
            shift 2
            ;;
        -h|--help)
            usage
            ;;
        *)
            echo -e "${RED}Unknown option: $1${NC}"
            usage
            ;;
    esac
done

# Validate arguments
if [[ -z "$NODE_TYPE" || -z "$VIRTUAL_IP" ]]; then
    echo -e "${RED}Error: --node-type and --virtual-ip are required${NC}"
    usage
fi

if [[ "$NODE_TYPE" != "primary" && "$NODE_TYPE" != "backup" ]]; then
    echo -e "${RED}Error: --node-type must be 'primary' or 'backup'${NC}"
    usage
fi

# Set default priorities if not specified
if [[ -z "$PRIORITY" ]]; then
    if [[ "$NODE_TYPE" == "primary" ]]; then
        PRIORITY=110
    else
        PRIORITY=100
    fi
fi

# Cleanup function
cleanup() {
    echo -e "${RED}Installation failed. Cleaning up...${NC}"
    systemctl stop keepalived 2>/dev/null || true
    systemctl disable keepalived 2>/dev/null || true
}

trap cleanup ERR

# Check if running as root
if [[ $EUID -ne 0 ]]; then
    echo -e "${RED}This script must be run as root${NC}"
    exit 1
fi

# Detect distribution
echo -e "${YELLOW}[1/8] Detecting distribution...${NC}"
if [ -f /etc/os-release ]; then
    . /etc/os-release
    case "$ID" in
        ubuntu|debian) 
            PKG_MGR="apt"
            PKG_UPDATE="apt update"
            PKG_INSTALL="apt install -y"
            ;;
        almalinux|rocky|centos|rhel|ol|fedora) 
            PKG_MGR="dnf"
            PKG_UPDATE="dnf update -y --refresh"
            PKG_INSTALL="dnf install -y"
            ;;
        amzn) 
            PKG_MGR="yum"
            PKG_UPDATE="yum update -y"
            PKG_INSTALL="yum install -y"
            ;;
        *) 
            echo -e "${RED}Unsupported distribution: $ID${NC}"
            exit 1
            ;;
    esac
    echo -e "${GREEN}Detected: $PRETTY_NAME${NC}"
else
    echo -e "${RED}Cannot detect distribution${NC}"
    exit 1
fi

# Update package repositories
echo -e "${YELLOW}[2/8] Updating package repositories...${NC}"
$PKG_UPDATE

# Install keepalived and required packages
echo -e "${YELLOW}[3/8] Installing keepalived and dependencies...${NC}"
$PKG_INSTALL keepalived curl

# Create keepalived user for scripts
echo -e "${YELLOW}[4/8] Creating keepalived script user...${NC}"
if ! id -u keepalived_script >/dev/null 2>&1; then
    useradd -r -s /bin/false -d /var/empty keepalived_script
fi

# Create directories with proper permissions
echo -e "${YELLOW}[5/8] Creating directories and notification scripts...${NC}"
mkdir -p /etc/keepalived/scripts
mkdir -p /var/lib/prometheus/node-exporter
chown root:root /etc/keepalived/scripts
chmod 755 /etc/keepalived/scripts
# The notify scripts run as keepalived_script and write the metrics file here
chown keepalived_script: /var/lib/prometheus/node-exporter
chmod 755 /var/lib/prometheus/node-exporter

# Create notification scripts
cat > /etc/keepalived/scripts/notify_master.sh << 'EOF'
#!/bin/bash
echo "keepalived_vrrp_state{instance=\"VI_1\",state=\"master\"} 2" > /var/lib/prometheus/node-exporter/keepalived.prom
echo "keepalived_vrrp_priority{instance=\"VI_1\"} $(grep priority /etc/keepalived/keepalived.conf | awk '{print $2}')" >> /var/lib/prometheus/node-exporter/keepalived.prom
echo "keepalived_transitions_total{instance=\"VI_1\",type=\"master\"} $(date +%s)" >> /var/lib/prometheus/node-exporter/keepalived.prom
logger "Keepalived: Transitioned to MASTER state"
EOF

cat > /etc/keepalived/scripts/notify_backup.sh << 'EOF'
#!/bin/bash
echo "keepalived_vrrp_state{instance=\"VI_1\",state=\"backup\"} 1" > /var/lib/prometheus/node-exporter/keepalived.prom
echo "keepalived_vrrp_priority{instance=\"VI_1\"} $(grep priority /etc/keepalived/keepalived.conf | awk '{print $2}')" >> /var/lib/prometheus/node-exporter/keepalived.prom
echo "keepalived_transitions_total{instance=\"VI_1\",type=\"backup\"} $(date +%s)" >> /var/lib/prometheus/node-exporter/keepalived.prom
logger "Keepalived: Transitioned to BACKUP state"
EOF

cat > /etc/keepalived/scripts/notify_fault.sh << 'EOF'
#!/bin/bash
echo "keepalived_vrrp_state{instance=\"VI_1\",state=\"fault\"} 0" > /var/lib/prometheus/node-exporter/keepalived.prom
echo "keepalived_vrrp_priority{instance=\"VI_1\"} $(grep priority /etc/keepalived/keepalived.conf | awk '{print $2}')" >> /var/lib/prometheus/node-exporter/keepalived.prom
echo "keepalived_transitions_total{instance=\"VI_1\",type=\"fault\"} $(date +%s)" >> /var/lib/prometheus/node-exporter/keepalived.prom
logger "Keepalived: Transitioned to FAULT state"
EOF

# Set proper permissions for scripts
chmod 755 /etc/keepalived/scripts/*.sh
chown root:root /etc/keepalived/scripts/*.sh

# Configure keepalived
echo -e "${YELLOW}[6/8] Creating keepalived configuration...${NC}"
if [[ "$NODE_TYPE" == "primary" ]]; then
    STATE="MASTER"
    ROUTER_ID="KEEPALIVED_PRIMARY"
else
    STATE="BACKUP"
    ROUTER_ID="KEEPALIVED_BACKUP"
fi

cat > /etc/keepalived/keepalived.conf << EOF
global_defs {
    router_id $ROUTER_ID
    script_user keepalived_script
    enable_script_security
}

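vrrp_script chk_nginx {
    # Run directly by keepalived 2.x (no shell), so no "||" here; curl -f
    # already exits non-zero on an HTTP error.
    script "/usr/bin/curl -fsS -o /dev/null http://localhost/"
    interval 2
    # Must exceed the gap between the two nodes (10 with the defaults)
    weight -20
    fall 3
    rise 2
}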

vrrp_instance VI_1 {
    state $STATE
    interface $INTERFACE
    virtual_router_id 51
    priority $PRIORITY
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass $AUTH_PASS
    }
    virtual_ipaddress {
        $VIRTUAL_IP dev $INTERFACE
    }
    track_script {
        chk_nginx
    }
    notify_master "/etc/keepalived/scripts/notify_master.sh"
    notify_backup "/etc/keepalived/scripts/notify_backup.sh"
    notify_fault "/etc/keepalived/scripts/notify_fault.sh"
}
EOF

chown root:root /etc/keepalived/keepalived.conf
chmod 644 /etc/keepalived/keepalived.conf

# Configure firewall
echo -e "${YELLOW}[7/8] Configuring firewall...${NC}"
case "$ID" in
    ubuntu|debian)
        if command -v ufw >/dev/null 2>&1 && ufw status | grep -q "Status: active"; then
            # VRRP is IP protocol 112 (not a port); advertisements go to 224.0.0.18
            ufw allow to 224.0.0.18
        fi
        ;;
    *)
        if command -v firewall-cmd >/dev/null 2>&1 && systemctl is-active --quiet firewalld; then
            firewall-cmd --permanent --add-rich-rule="rule protocol value='vrrp' accept"
            # Rich rules that name an address need an explicit family
            firewall-cmd --permanent --add-rich-rule="rule family='ipv4' destination address='224.0.0.18' accept"
            firewall-cmd --reload
        fi
        ;;
esac

# Enable and start keepalived
echo -e "${YELLOW}[8/8] Starting and enabling keepalived service...${NC}"
systemctl enable keepalived
systemctl start keepalived

# Verification
echo -e "${YELLOW}Verifying installation...${NC}"
sleep 3

if systemctl is-active --quiet keepalived; then
    echo -e "${GREEN}✓ Keepalived service is running${NC}"
else
    echo -e "${RED}✗ Keepalived service is not running${NC}"
    exit 1
fi

if [[ -f /var/lib/prometheus/node-exporter/keepalived.prom ]]; then
    echo -e "${GREEN}✓ Prometheus metrics file created${NC}"
else
    echo -e "${YELLOW}⚠ Prometheus metrics file not yet created (will be created on state change)${NC}"
fi

echo -e "${GREEN}Installation completed successfully!${NC}"
echo -e "${YELLOW}Node type: $NODE_TYPE${NC}"
echo -e "${YELLOW}Virtual IP: $VIRTUAL_IP${NC}"
echo -e "${YELLOW}Interface: $INTERFACE${NC}"
echo -e "${YELLOW}Priority: $PRIORITY${NC}"
echo
echo "Next steps:"
echo "1. Configure the other node with opposite node-type"
echo "2. Set up Prometheus node-exporter to read from /var/lib/prometheus/node-exporter/"
echo "3. Configure Prometheus alerts for keepalived state changes"
echo "4. Create Grafana dashboards for visualization"

Review the script before running. Execute with: bash install.sh
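
The script checks that `--virtual-ip` is present but not that it is a well-formed address with a prefix length, so a typo only surfaces when keepalived starts. A pre-flight check along the following lines could be added after argument parsing; `valid_ipv4_cidr` is a hypothetical helper and covers IPv4 only.

```shell
#!/usr/bin/env bash
# Sketch only: validates an IPv4 address in CIDR notation, e.g. 203.0.113.100/24.

# valid_ipv4_cidr VALUE
# Succeeds when VALUE is four octets (0-255) followed by /PREFIX (0-32).
valid_ipv4_cidr() {
    local input=$1 addr prefix octet
    [[ $input == */* ]] || return 1
    addr=${input%/*}
    prefix=${input#*/}
    [[ $prefix =~ ^[0-9]{1,2}$ ]] && (( 10#$prefix <= 32 )) || return 1
    [[ $addr =~ ^([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})$ ]] || return 1
    for octet in "${BASH_REMATCH[@]:1}"; do
        # 10# forces base 10 so that a leading zero is not read as octal
        (( 10#$octet <= 255 )) || return 1
    done
}

# Possible use inside install.sh, after the arguments are parsed:
# valid_ipv4_cidr "$VIRTUAL_IP" || { echo "Invalid --virtual-ip: $VIRTUAL_IP"; exit 1; }
```

Failing early keeps the script from writing a configuration that keepalived would reject.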
