Alertmanager HA Clustering Setup Guide

Set up a production-grade Alertmanager cluster with gossip protocol for high availability, automatic failover, and load balancing. Ensure your monitoring alerts remain operational even when individual nodes fail.

Prerequisites

Three servers with at least 2 GB RAM each for the Alertmanager nodes, plus a separate host for the HAProxy load balancer
Network connectivity between cluster nodes
Basic understanding of Prometheus alerting
Root or sudo access on all servers

What this solves

Running a single Alertmanager instance creates a critical point of failure in your monitoring infrastructure. When that instance goes down, your entire alert routing system fails, leaving you blind to system issues. This tutorial implements a clustered Alertmanager setup with automatic failover, ensuring your monitoring alerts continue working even when individual nodes fail.

Step-by-step installation

Update system packages

Start by updating your package manager to ensure you get the latest versions of required dependencies.

# Debian/Ubuntu
sudo apt update && sudo apt upgrade -y
sudo apt install -y wget curl systemd

# AlmaLinux/Rocky/RHEL/Fedora
sudo dnf update -y
sudo dnf install -y wget curl systemd

Create Alertmanager user and directories

Create a dedicated system user for Alertmanager and set up the required directory structure with proper permissions.

sudo useradd --no-create-home --shell /bin/false alertmanager
sudo mkdir -p /etc/alertmanager
sudo mkdir -p /var/lib/alertmanager
sudo chown alertmanager:alertmanager /etc/alertmanager
sudo chown alertmanager:alertmanager /var/lib/alertmanager

Download and install Alertmanager

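Download the Alertmanager release this guide is pinned to (v0.26.0) and install it to the system binary directory. To use a different release, change the version number in every command below.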

cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
tar xvf alertmanager-0.26.0.linux-amd64.tar.gz
sudo cp alertmanager-0.26.0.linux-amd64/alertmanager /usr/local/bin/
sudo cp alertmanager-0.26.0.linux-amd64/amtool /usr/local/bin/
sudo chown alertmanager:alertmanager /usr/local/bin/alertmanager
sudo chown alertmanager:alertmanager /usr/local/bin/amtool
sudo chmod 755 /usr/local/bin/alertmanager
sudo chmod 755 /usr/local/bin/amtool
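
The release page also publishes a sha256sums.txt file next to the tarballs, so the download can be checked before anything is copied into /usr/local/bin. The helper below is a minimal sketch and not part of the original steps: the function name verify_sha256 is made up for illustration, and it assumes the checksum file uses the "<hash>  <filename>" layout that sha256sum produces.

```shell
# verify_sha256 FILE SUMS_FILE
# Succeeds only if FILE's SHA-256 digest matches the entry listed for its
# basename in SUMS_FILE (lines of the form "<hash>  <filename>").
verify_sha256() {
    local file="$1" sums="$2" name expected actual
    name="$(basename "$file")"
    # Pick the hash whose second column is exactly our file name
    expected="$(awk -v f="$name" '$2 == f { print $1 }' "$sums")"
    if [ -z "$expected" ]; then
        echo "no checksum listed for $name" >&2
        return 2
    fi
    actual="$(sha256sum "$file" | awk '{ print $1 }')"
    [ "$expected" = "$actual" ]
}

# Example (run in /tmp, after the wget step and before copying the binaries):
#   wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/sha256sums.txt
#   verify_sha256 alertmanager-0.26.0.linux-amd64.tar.gz sha256sums.txt && echo "checksum OK"
```

A non-zero exit status means the tarball should not be installed; download it again and re-check.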

Configure Alertmanager cluster on node 1

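Create the main Alertmanager configuration file at /etc/alertmanager/alertmanager.yml on the first cluster node. It holds the routing rules, receiver definitions, and inhibition rules; the cluster settings themselves are passed as command-line flags in the systemd unit.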

global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alertmanager@example.com'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'default'
  routes:
  - match:
      severity: critical
    receiver: 'critical-alerts'
  - match:
      severity: warning
    receiver: 'warning-alerts'

receivers:
- name: 'default'
  email_configs:
  - to: 'admin@example.com'
    subject: 'Alertmanager notification'
    body: |
      {{ range .Alerts }}
      Alert: {{ .Annotations.summary }}
      Description: {{ .Annotations.description }}
      {{ end }}

- name: 'critical-alerts'
  email_configs:
  - to: 'critical@example.com'
    subject: 'CRITICAL: {{ .GroupLabels.alertname }}'
    body: |
      {{ range .Alerts }}
      Alert: {{ .Annotations.summary }}
      Severity: {{ .Labels.severity }}
      Instance: {{ .Labels.instance }}
      Description: {{ .Annotations.description }}
      {{ end }}

- name: 'warning-alerts'
  email_configs:
  - to: 'warnings@example.com'
    subject: 'WARNING: {{ .GroupLabels.alertname }}'
    body: |
      {{ range .Alerts }}
      Alert: {{ .Annotations.summary }}
      Severity: {{ .Labels.severity }}
      Instance: {{ .Labels.instance }}
      Description: {{ .Annotations.description }}
      {{ end }}

inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'instance']

Set proper ownership and permissions on the configuration file.

sudo chown alertmanager:alertmanager /etc/alertmanager/alertmanager.yml
sudo chmod 640 /etc/alertmanager/alertmanager.yml
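
Since a YAML mistake only surfaces when the service starts, it can be worth validating the file first with amtool, which was installed alongside the alertmanager binary. The wrapper below is a sketch under that assumption; the function name check_alertmanager_config and its exit codes are illustrative, while amtool check-config is the real subcommand.

```shell
# check_alertmanager_config [FILE]
# Validates an Alertmanager configuration with amtool.
# Returns 2 if the file is unreadable, 3 if amtool is missing,
# otherwise whatever "amtool check-config" returns.
check_alertmanager_config() {
    local cfg="${1:-/etc/alertmanager/alertmanager.yml}"
    if [ ! -r "$cfg" ]; then
        echo "cannot read $cfg" >&2
        return 2
    fi
    if ! command -v amtool >/dev/null 2>&1; then
        echo "amtool not found in PATH" >&2
        return 3
    fi
    amtool check-config "$cfg"
}

# Example: the file is mode 640, so run this as root or as the
# alertmanager user:
#   check_alertmanager_config /etc/alertmanager/alertmanager.yml
```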

Create systemd service for node 1

Create a systemd service file at /etc/systemd/system/alertmanager.service for the first Alertmanager node. The cluster flags set the gossip listen address and the address this node advertises to its peers.

[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
WorkingDirectory=/var/lib/alertmanager
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --web.listen-address=0.0.0.0:9093 \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.advertise-address=203.0.113.10:9094 \
  --log.level=info

Restart=always
RestartSec=3
TimeoutStopSec=30
KillMode=mixed

[Install]
WantedBy=multi-user.target

Note: Replace 203.0.113.10 with your actual server IP address. The cluster.advertise-address must be reachable by other cluster nodes.

Setup node 2 configuration

On the second server, repeat the package update, user creation, and directory setup, install the Alertmanager binaries as on node 1, then create the configuration file.

# Run on node 2
sudo useradd --no-create-home --shell /bin/false alertmanager
sudo mkdir -p /etc/alertmanager
sudo mkdir -p /var/lib/alertmanager
sudo chown alertmanager:alertmanager /etc/alertmanager
sudo chown alertmanager:alertmanager /var/lib/alertmanager

Copy the same Alertmanager configuration file to node 2, then create its systemd service file with cluster peer information.

[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
WorkingDirectory=/var/lib/alertmanager
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --web.listen-address=0.0.0.0:9093 \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.advertise-address=203.0.113.11:9094 \
  --cluster.peer=203.0.113.10:9094 \
  --log.level=info

Restart=always
RestartSec=3
TimeoutStopSec=30
KillMode=mixed

[Install]
WantedBy=multi-user.target

Note: Replace 203.0.113.11 with node 2's IP address and 203.0.113.10 with node 1's IP address.

Setup node 3 configuration

On the third server, repeat the same preparation steps and copy the same configuration file, then create a service file that lists both existing cluster members as peers.

[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
WorkingDirectory=/var/lib/alertmanager
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --web.listen-address=0.0.0.0:9093 \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.advertise-address=203.0.113.12:9094 \
  --cluster.peer=203.0.113.10:9094 \
  --cluster.peer=203.0.113.11:9094 \
  --log.level=info

Restart=always
RestartSec=3
TimeoutStopSec=30
KillMode=mixed

[Install]
WantedBy=multi-user.target

Note: Replace 203.0.113.12 with node 3's IP address. Node 3 connects to both existing peers for faster cluster joining.
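
The three unit files differ only in the advertise address and the peer list, so the peer flags can be derived from a single list of node IPs. The function below is a rough sketch: the name peer_flags is hypothetical and port 9094 is taken from the units above. It lists every other node for each instance, including node 1, which goes slightly beyond the units shown here; the intent is that any node can rejoin after a restart regardless of start order.

```shell
# peer_flags SELF_IP NODE_IP...
# Prints one --cluster.peer flag for every listed node except SELF_IP,
# using the cluster port 9094 from the unit files.
peer_flags() {
    local self="$1" ip out=""
    shift
    for ip in "$@"; do
        if [ "$ip" = "$self" ]; then
            continue    # a node does not list itself as a peer
        fi
        out="${out:+$out }--cluster.peer=${ip}:9094"
    done
    printf '%s\n' "$out"
}

# Example for node 2:
#   peer_flags 203.0.113.11 203.0.113.10 203.0.113.11 203.0.113.12
```

For node 2 this prints the flags for 203.0.113.10 and 203.0.113.12, ready to paste into the ExecStart line.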

Configure firewall rules

Open the required ports for Alertmanager web interface and cluster communication on all nodes.

# Debian/Ubuntu (ufw)
sudo ufw allow 9093/tcp comment 'Alertmanager web interface'
sudo ufw allow 9094/tcp comment 'Alertmanager cluster communication'
sudo ufw allow 9094/udp comment 'Alertmanager cluster communication'
sudo ufw reload

# AlmaLinux/Rocky/RHEL/Fedora (firewalld)
sudo firewall-cmd --permanent --add-port=9093/tcp
sudo firewall-cmd --permanent --add-port=9094/tcp
sudo firewall-cmd --permanent --add-port=9094/udp
sudo firewall-cmd --reload

Start Alertmanager cluster

Enable and start the Alertmanager service on all nodes, beginning with node 1.

# Start on node 1 first
sudo systemctl daemon-reload
sudo systemctl enable --now alertmanager
sudo systemctl status alertmanager

# Wait 30 seconds, then start on node 2
sudo systemctl daemon-reload
sudo systemctl enable --now alertmanager
sudo systemctl status alertmanager

# Wait 30 seconds, then start on node 3
sudo systemctl daemon-reload
sudo systemctl enable --now alertmanager
sudo systemctl status alertmanager
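
Instead of waiting a fixed 30 seconds between nodes, the start sequence can poll Alertmanager's readiness endpoint (/-/ready) until the node answers. The retry helper below is only a sketch; wait_until is a made-up name, and the attempt count and delay in the example are arbitrary values to adjust.

```shell
# wait_until TRIES DELAY CMD [ARG...]
# Runs CMD up to TRIES times, sleeping DELAY seconds between attempts.
# Returns 0 as soon as CMD succeeds, 1 if every attempt fails.
wait_until() {
    local tries="$1" delay="$2" i=1
    shift 2
    while [ "$i" -le "$tries" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        if [ "$i" -le "$tries" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example: after starting node 1, wait up to about 30 seconds for it to
# report ready before moving on to node 2:
#   wait_until 15 2 curl -sf -o /dev/null http://203.0.113.10:9093/-/ready
```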

Install and configure HAProxy load balancer

Install HAProxy on a separate server to provide load balancing and health checking for the Alertmanager cluster.

# Debian/Ubuntu
sudo apt install -y haproxy

# AlmaLinux/Rocky/RHEL/Fedora
sudo dnf install -y haproxy

Configure HAProxy for Alertmanager

Create an HAProxy configuration in /etc/haproxy/haproxy.cfg that distributes requests across all Alertmanager nodes in round-robin order and removes a node from rotation after three failed health checks.

global
    daemon
    maxconn 4096
    log stdout local0 info
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin
    stats timeout 30s
    user haproxy
    group haproxy

defaults
    mode http
    log global
    option httplog
    option dontlognull
    option http-server-close
    option forwardfor
    option redispatch
    retries 3
    timeout http-request 10s
    timeout queue 1m
    timeout connect 10s
    timeout client 1m
    timeout server 1m
    timeout http-keep-alive 10s
    timeout check 10s
    maxconn 3000

frontend alertmanager_frontend
    bind *:9093
    default_backend alertmanager_backend

backend alertmanager_backend
    balance roundrobin
    option httpchk GET /-/healthy
    http-check expect status 200
    server alertmanager1 203.0.113.10:9093 check inter 10s fall 3 rise 2
    server alertmanager2 203.0.113.11:9093 check inter 10s fall 3 rise 2
    server alertmanager3 203.0.113.12:9093 check inter 10s fall 3 rise 2

frontend stats
    bind *:8404
    stats enable
    stats uri /stats
    stats refresh 30s
    stats admin if TRUE
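
Before starting HAProxy, it can help to confirm from the load balancer host that each backend answers the same /-/healthy probe the backend section relies on. The loop below is a sketch of that check; probe_health is a hypothetical helper name and the 5 second timeout is an arbitrary choice.

```shell
# probe_health HOST:PORT...
# Requests /-/healthy from every target and prints UP or DOWN per target.
# Returns 1 if at least one target failed, 0 otherwise.
probe_health() {
    local target failed=0
    for target in "$@"; do
        # -f makes curl fail on HTTP error codes, so only a 2xx counts as UP
        if curl -sf --max-time 5 -o /dev/null "http://${target}/-/healthy"; then
            echo "UP ${target}"
        else
            echo "DOWN ${target}"
            failed=1
        fi
    done
    return "$failed"
}

# Example, using the backend addresses from the configuration above:
#   probe_health 203.0.113.10:9093 203.0.113.11:9093 203.0.113.12:9093
```

If a node shows DOWN here, HAProxy will mark it down as well, so fix firewall or service issues on that node first.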

Enable and start HAProxy

Start the HAProxy service and configure it to start automatically on boot.

sudo systemctl enable --now haproxy
sudo systemctl status haproxy

Configure Prometheus to use the cluster

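Update your Prometheus configuration to send alerts to the HAProxy load balancer instead of an individual Alertmanager instance. Be aware that the Alertmanager documentation advises against load balancing traffic between Prometheus and Alertmanager and recommends listing every instance under `targets` instead; if you follow that advice, keep HAProxy for the web UI and other API clients only. This tutorial builds on the Prometheus Alertmanager integration tutorial.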

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - "203.0.113.20:9093"  # HAProxy load balancer IP
      timeout: 10s
      api_version: v2

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'alertmanager'
    static_configs:
      - targets:
        - '203.0.113.10:9093'
        - '203.0.113.11:9093'
        - '203.0.113.12:9093'

Restart Prometheus to apply the new configuration.

sudo systemctl restart prometheus
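
After the restart, Prometheus reports which Alertmanager endpoints it discovered through its /api/v1/alertmanagers HTTP API. The filter below is a small sketch for reading that response; it assumes jq is installed, that Prometheus listens on localhost:9090 as in the scrape configuration above, and the function name active_alertmanagers is illustrative.

```shell
# active_alertmanagers
# Reads a Prometheus /api/v1/alertmanagers response on stdin and prints
# the URL of each Alertmanager endpoint Prometheus currently sends to.
active_alertmanagers() {
    jq -r '.data.activeAlertmanagers[].url'
}

# Example, run on the Prometheus host:
#   curl -s http://localhost:9090/api/v1/alertmanagers | active_alertmanagers
```

With the configuration above, the output should contain a single URL pointing at the load balancer address.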

Verify your setup

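Check that all Alertmanager nodes are running and properly clustered. The commands below use jq to filter the JSON responses, so install it on the machine you run them from.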

# Check cluster status on each node (expected value: "ready")
curl -s http://203.0.113.10:9093/api/v2/status | jq '.cluster.status'
curl -s http://203.0.113.11:9093/api/v2/status | jq '.cluster.status'
curl -s http://203.0.113.12:9093/api/v2/status | jq '.cluster.status'

# Check cluster peers
curl -s http://203.0.113.10:9093/api/v2/status | jq '.cluster.peers'

# Test HAProxy health
curl -I http://203.0.113.20:9093/-/healthy

# View HAProxy stats
curl http://203.0.113.20:8404/stats

# Test alert reception through load balancer
curl -X POST http://203.0.113.20:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "warning",
      "instance": "test-server"
    },
    "annotations": {
      "summary": "Test alert for cluster verification",
      "description": "This is a test alert to verify cluster functionality"
    }
  }]'

Check the Alertmanager logs to verify cluster formation and gossip protocol operation.

sudo journalctl -u alertmanager -f --since "10 minutes ago"

You should see log entries indicating successful cluster membership and peer discovery.
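
The peer check can also be scripted so that every node is asked how many cluster members it sees. This is a sketch that assumes jq and curl are available and uses the v2 status API; the function names peer_count and check_cluster are made up here. The peer list is expected to include the node itself, so a healthy three-node cluster should report 3 on each node, but treat that count as an assumption to confirm on your version.

```shell
# peer_count
# Reads an Alertmanager /api/v2/status document on stdin and prints the
# number of entries in its cluster peer list.
peer_count() {
    jq '.cluster.peers | length'
}

# check_cluster EXPECTED HOST:PORT...
# Compares the peer count reported by each node with EXPECTED.
# Returns 1 if any node disagrees or cannot be reached.
check_cluster() {
    local expected="$1" target seen rc=0
    shift
    for target in "$@"; do
        seen="$(curl -sf --max-time 5 "http://${target}/api/v2/status" | peer_count)"
        if [ "$seen" = "$expected" ]; then
            echo "OK ${target} sees ${seen} peers"
        else
            echo "FAIL ${target} sees ${seen:-no} peers, expected ${expected}"
            rc=1
        fi
    done
    return "$rc"
}

# Example:
#   check_cluster 3 203.0.113.10:9093 203.0.113.11:9093 203.0.113.12:9093
```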

Common issues

Symptom	Cause	Fix
Nodes can't join cluster	Firewall blocking gossip ports	Ensure ports 9094/tcp and 9094/udp are open between nodes
Split-brain behavior	Network partition between nodes	Check network connectivity and advertise-address configuration
HAProxy shows all backends down	Health check path incorrect	Verify `/-/healthy` endpoint responds with 200 on all nodes
Duplicate alerts being sent	Multiple instances processing same alert	Check that cluster gossip is working by comparing the peer list from `/api/v2/status` on each node
Configuration reload fails	Syntax error in YAML	Use `amtool check-config /etc/alertmanager/alertmanager.yml`
Peer discovery fails	Wrong advertise-address	Set advertise-address to IP reachable by other cluster members

Automated install script

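Run this on each Alertmanager node to automate the per-node steps above (packages, user, binaries, configuration, systemd unit, firewall). It does not install HAProxy or change the Prometheus configuration.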

install.sh

#!/usr/bin/env bash

set -euo pipefail

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'

# Global variables
ALERTMANAGER_VERSION="0.26.0"
CLUSTER_PEERS=""
NODE_ID="node1"
LISTEN_PORT="9093"
CLUSTER_PORT="9094"

# Usage function
usage() {
    echo "Usage: $0 [OPTIONS]"
    echo "Options:"
    echo "  --node-id ID         Node identifier (default: node1)"
    echo "  --cluster-peers IP   Comma-separated list of cluster peer IPs"
    echo "  --listen-port PORT   Alertmanager listen port (default: 9093)"
    echo "  --cluster-port PORT  Cluster communication port (default: 9094)"
    echo "  -h, --help          Show this help message"
    exit 1
}

# Parse command line arguments
while [[ $# -gt 0 ]]; do
    case $1 in
        --node-id)
            NODE_ID="$2"
            shift 2
            ;;
        --cluster-peers)
            CLUSTER_PEERS="$2"
            shift 2
            ;;
        --listen-port)
            LISTEN_PORT="$2"
            shift 2
            ;;
        --cluster-port)
            CLUSTER_PORT="$2"
            shift 2
            ;;
        -h|--help)
            usage
            ;;
        *)
            echo -e "${RED}Unknown option: $1${NC}"
            usage
            ;;
    esac
done

# Cleanup function for rollback
cleanup() {
    echo -e "${YELLOW}Installation failed. Cleaning up...${NC}"
    systemctl stop alertmanager 2>/dev/null || true
    systemctl disable alertmanager 2>/dev/null || true
    rm -f /etc/systemd/system/alertmanager.service
    rm -f /usr/local/bin/alertmanager /usr/local/bin/amtool
    rm -rf /etc/alertmanager /var/lib/alertmanager
    userdel alertmanager 2>/dev/null || true
    systemctl daemon-reload
    echo -e "${YELLOW}Cleanup completed${NC}"
}

trap cleanup ERR

# Check if running as root or with sudo
if [[ $EUID -ne 0 ]]; then
    echo -e "${RED}This script must be run as root or with sudo${NC}"
    exit 1
fi

# Detect distribution
echo -e "${YELLOW}[1/8] Detecting distribution...${NC}"
if [ -f /etc/os-release ]; then
    . /etc/os-release
    case "$ID" in
        ubuntu|debian)
            PKG_MGR="apt"
            PKG_UPDATE="apt update && apt upgrade -y"
            PKG_INSTALL="apt install -y"
            ;;
        almalinux|rocky|centos|rhel|ol)
            PKG_MGR="dnf"
            PKG_UPDATE="dnf update -y"
            PKG_INSTALL="dnf install -y"
            ;;
        fedora)
            PKG_MGR="dnf"
            PKG_UPDATE="dnf update -y"
            PKG_INSTALL="dnf install -y"
            ;;
        amzn)
            PKG_MGR="yum"
            PKG_UPDATE="yum update -y"
            PKG_INSTALL="yum install -y"
            ;;
        *)
            echo -e "${RED}Unsupported distribution: $ID${NC}"
            exit 1
            ;;
    esac
    echo -e "${GREEN}Detected: $PRETTY_NAME${NC}"
else
    echo -e "${RED}Cannot detect distribution${NC}"
    exit 1
fi

# Update system packages
echo -e "${YELLOW}[2/8] Updating system packages...${NC}"
# PKG_UPDATE can contain '&&' (apt), which plain expansion would pass as a literal argument
eval "$PKG_UPDATE"
$PKG_INSTALL wget curl systemd tar

# Create alertmanager user and directories
echo -e "${YELLOW}[3/8] Creating alertmanager user and directories...${NC}"
useradd --no-create-home --shell /bin/false alertmanager 2>/dev/null || true
mkdir -p /etc/alertmanager
mkdir -p /var/lib/alertmanager
chown alertmanager:alertmanager /etc/alertmanager
chown alertmanager:alertmanager /var/lib/alertmanager
chmod 755 /etc/alertmanager
chmod 755 /var/lib/alertmanager

# Download and install Alertmanager
echo -e "${YELLOW}[4/8] Downloading and installing Alertmanager...${NC}"
cd /tmp
wget -q "https://github.com/prometheus/alertmanager/releases/download/v${ALERTMANAGER_VERSION}/alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz"
tar xf "alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz"
cp "alertmanager-${ALERTMANAGER_VERSION}.linux-amd64/alertmanager" /usr/local/bin/
cp "alertmanager-${ALERTMANAGER_VERSION}.linux-amd64/amtool" /usr/local/bin/
chown alertmanager:alertmanager /usr/local/bin/alertmanager
chown alertmanager:alertmanager /usr/local/bin/amtool
chmod 755 /usr/local/bin/alertmanager
chmod 755 /usr/local/bin/amtool
rm -rf "alertmanager-${ALERTMANAGER_VERSION}.linux-amd64"*

# Create Alertmanager configuration
echo -e "${YELLOW}[5/8] Creating Alertmanager configuration...${NC}"
cat > /etc/alertmanager/alertmanager.yml << 'EOF'
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alertmanager@example.com'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'default'
  routes:
  - match:
      severity: critical
    receiver: 'critical-alerts'
  - match:
      severity: warning
    receiver: 'warning-alerts'

receivers:
- name: 'default'
  email_configs:
  - to: 'admin@example.com'
    subject: 'Alertmanager notification'
    body: |
      {{ range .Alerts }}
      Alert: {{ .Annotations.summary }}
      Description: {{ .Annotations.description }}
      {{ end }}

- name: 'critical-alerts'
  email_configs:
  - to: 'critical@example.com'
    subject: 'CRITICAL: {{ .GroupLabels.alertname }}'
    body: |
      {{ range .Alerts }}
      Alert: {{ .Annotations.summary }}
      Severity: {{ .Labels.severity }}
      Instance: {{ .Labels.instance }}
      Description: {{ .Annotations.description }}
      {{ end }}

- name: 'warning-alerts'
  email_configs:
  - to: 'warnings@example.com'
    subject: 'WARNING: {{ .GroupLabels.alertname }}'
    body: |
      {{ range .Alerts }}
      Alert: {{ .Annotations.summary }}
      Severity: {{ .Labels.severity }}
      Instance: {{ .Labels.instance }}
      Description: {{ .Annotations.description }}
      {{ end }}

inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'instance']
EOF

chown alertmanager:alertmanager /etc/alertmanager/alertmanager.yml
chmod 640 /etc/alertmanager/alertmanager.yml

# Create systemd service
echo -e "${YELLOW}[6/8] Creating systemd service...${NC}"
CLUSTER_LISTEN_ADDRESS="0.0.0.0:${CLUSTER_PORT}"
CLUSTER_ADVERTISE_ADDRESS="$(hostname -I | awk '{print $1}'):${CLUSTER_PORT}"

# Build cluster peer arguments
PEER_ARGS=""
if [[ -n "$CLUSTER_PEERS" ]]; then
    IFS=',' read -ra PEERS <<< "$CLUSTER_PEERS"
    for peer in "${PEERS[@]}"; do
        PEER_ARGS="$PEER_ARGS --cluster.peer=${peer}:${CLUSTER_PORT}"
    done
fi

cat > /etc/systemd/system/alertmanager.service << EOF
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
WorkingDirectory=/var/lib/alertmanager
ExecStart=/usr/local/bin/alertmanager \\
  --config.file=/etc/alertmanager/alertmanager.yml \\
  --storage.path=/var/lib/alertmanager \\
  --web.listen-address=0.0.0.0:${LISTEN_PORT} \\
  --cluster.listen-address=${CLUSTER_LISTEN_ADDRESS} \\
  --cluster.advertise-address=${CLUSTER_ADVERTISE_ADDRESS}${PEER_ARGS}
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

# Configure firewall
echo -e "${YELLOW}[7/8] Configuring firewall...${NC}"
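case "$ID" in
    ubuntu|debian)
        if command -v ufw >/dev/null 2>&1; then
            ufw allow "${LISTEN_PORT}/tcp"
            # Gossip uses both TCP and UDP on the cluster port
            ufw allow "${CLUSTER_PORT}/tcp"
            ufw allow "${CLUSTER_PORT}/udp"
        fi
        ;;
    almalinux|rocky|centos|rhel|ol|fedora)
        if command -v firewall-cmd >/dev/null 2>&1; then
            firewall-cmd --permanent --add-port="${LISTEN_PORT}/tcp"
            # Gossip uses both TCP and UDP on the cluster port
            firewall-cmd --permanent --add-port="${CLUSTER_PORT}/tcp"
            firewall-cmd --permanent --add-port="${CLUSTER_PORT}/udp"
            firewall-cmd --reload
        fi
        ;;
esac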

# Start and enable service
echo -e "${YELLOW}[8/8] Starting Alertmanager service...${NC}"
systemctl daemon-reload
systemctl enable alertmanager
systemctl start alertmanager

# Verify installation
echo -e "${YELLOW}Verifying installation...${NC}"
sleep 5

if systemctl is-active --quiet alertmanager; then
    echo -e "${GREEN}✓ Alertmanager service is running${NC}"
else
    echo -e "${RED}✗ Alertmanager service failed to start${NC}"
    exit 1
fi

if curl -sf "http://localhost:${LISTEN_PORT}/-/healthy" >/dev/null 2>&1; then
    echo -e "${GREEN}✓ Alertmanager is responding on port ${LISTEN_PORT}${NC}"
else
    echo -e "${RED}✗ Alertmanager is not responding${NC}"
    exit 1
fi

echo -e "${GREEN}Alertmanager cluster installation completed successfully!${NC}"
echo -e "Node ID: ${NODE_ID}"
echo -e "Web UI: http://$(hostname -I | awk '{print $1}'):${LISTEN_PORT}"
echo -e "Configuration: /etc/alertmanager/alertmanager.yml"
echo -e "Logs: journalctl -u alertmanager -f"

trap - ERR

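Review the script before running. It must run as root, once on each node. On node 1 execute: sudo bash install.sh. On later nodes, pass the IPs of the nodes that are already running, for example: sudo bash install.sh --node-id node2 --cluster-peers 203.0.113.10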
