Implement Alertmanager high availability clustering with automatic failover and load balancing

Advanced 45 min May 30, 2026
Ubuntu 24.04 Debian 12 AlmaLinux 9 Rocky Linux 9

Set up a production-grade Alertmanager cluster with gossip protocol for high availability, automatic failover, and load balancing. Ensure your monitoring alerts remain operational even when individual nodes fail.

Prerequisites

  • Three servers with 2GB RAM minimum
  • Network connectivity between cluster nodes
  • Basic understanding of Prometheus alerting
  • Root or sudo access on all servers

What this solves

Running a single Alertmanager instance creates a single point of failure in your monitoring infrastructure. When that instance goes down, no alerts are routed and no notifications are sent, leaving you blind to system issues. This tutorial sets up a three-node Alertmanager cluster whose members share silence and notification state over a gossip protocol, so notifications keep flowing, without duplicates, even when individual nodes fail.

Step-by-step installation

Update system packages

Start by updating your package manager to ensure you get the latest versions of required dependencies.

# Ubuntu / Debian
sudo apt update && sudo apt upgrade -y
sudo apt install -y wget curl systemd

# AlmaLinux / Rocky Linux
sudo dnf update -y
sudo dnf install -y wget curl systemd

Create Alertmanager user and directories

Create a dedicated system user for Alertmanager and set up the required directory structure with proper permissions.

sudo useradd --no-create-home --shell /bin/false alertmanager
sudo mkdir -p /etc/alertmanager
sudo mkdir -p /var/lib/alertmanager
sudo chown alertmanager:alertmanager /etc/alertmanager
sudo chown alertmanager:alertmanager /var/lib/alertmanager

Download and install Alertmanager

Download an Alertmanager release (v0.26.0 in this example; check the project's releases page for the current version) and install the binaries to /usr/local/bin.

cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
tar xvf alertmanager-0.26.0.linux-amd64.tar.gz
sudo cp alertmanager-0.26.0.linux-amd64/alertmanager /usr/local/bin/
sudo cp alertmanager-0.26.0.linux-amd64/amtool /usr/local/bin/
sudo chown alertmanager:alertmanager /usr/local/bin/alertmanager
sudo chown alertmanager:alertmanager /usr/local/bin/amtool
sudo chmod 755 /usr/local/bin/alertmanager
sudo chmod 755 /usr/local/bin/amtool
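
Before copying the binaries into place, it can be worth checking the tarball against its published checksum. The helper below is a minimal sketch: the function name verify_sha256 is made up for this example, and the commented usage assumes the release page offers a sha256sums.txt file that you have downloaded next to the tarball.

```shell
#!/bin/sh
# verify_sha256 FILE EXPECTED_HEX   (hypothetical helper, not part of Alertmanager)
# Returns 0 only when EXPECTED_HEX is non-empty and equals the SHA-256 digest of FILE.
verify_sha256() {
  file="$1"
  expected="$2"
  actual="$(sha256sum "$file" | awk '{print $1}')"
  [ -n "$expected" ] && [ "$actual" = "$expected" ]
}

# Example usage (assumes sha256sums.txt sits next to the tarball):
#   tarball=alertmanager-0.26.0.linux-amd64.tar.gz
#   expected="$(awk -v f="$tarball" '$2 == f {print $1}' sha256sums.txt)"
#   verify_sha256 "$tarball" "$expected" || echo "checksum mismatch" >&2
```

An empty expected value is rejected on purpose, so a missing entry in the checksum file never passes as a match.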

Configure Alertmanager cluster on node 1

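Create the main Alertmanager configuration file at /etc/alertmanager/alertmanager.yml on the first cluster node. It contains the routing rules, receiver definitions, and inhibit rules. Cluster settings are not part of this file; they are passed as command-line flags in the systemd unit.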

global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alertmanager@example.com'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'default'
  routes:
  - match:
      severity: critical
    receiver: 'critical-alerts'
  - match:
      severity: warning
    receiver: 'warning-alerts'

receivers:
  - name: 'default'
    email_configs:
      - to: 'admin@example.com'
        headers:
          Subject: 'Alertmanager notification'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}
  - name: 'critical-alerts'
    email_configs:
      - to: 'critical@example.com'
        headers:
          Subject: 'CRITICAL: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Severity: {{ .Labels.severity }}
          Instance: {{ .Labels.instance }}
          Description: {{ .Annotations.description }}
          {{ end }}
  - name: 'warning-alerts'
    email_configs:
      - to: 'warnings@example.com'
        headers:
          Subject: 'WARNING: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Severity: {{ .Labels.severity }}
          Instance: {{ .Labels.instance }}
          Description: {{ .Annotations.description }}
          {{ end }}

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']

Set proper ownership and permissions on the configuration file.

sudo chown alertmanager:alertmanager /etc/alertmanager/alertmanager.yml
sudo chmod 640 /etc/alertmanager/alertmanager.yml

Create systemd service for node 1

Create a systemd service file at /etc/systemd/system/alertmanager.service for the first Alertmanager node. The cluster and gossip settings are passed as flags on the ExecStart line.

[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
WorkingDirectory=/var/lib/alertmanager
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --web.listen-address=0.0.0.0:9093 \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.advertise-address=203.0.113.10:9094 \
  --log.level=info

Restart=always
RestartSec=3
TimeoutStopSec=30
KillMode=mixed

[Install]
WantedBy=multi-user.target
Note: Replace 203.0.113.10 with your actual server IP address. The cluster.advertise-address must be reachable by other cluster nodes.

Setup node 2 configuration

On the second server, repeat the package update, user creation, directory setup, and binary installation steps, then create the configuration file.

# Run on node 2
sudo useradd --no-create-home --shell /bin/false alertmanager
sudo mkdir -p /etc/alertmanager
sudo mkdir -p /var/lib/alertmanager
sudo chown alertmanager:alertmanager /etc/alertmanager
sudo chown alertmanager:alertmanager /var/lib/alertmanager

Copy the same Alertmanager configuration file to node 2, then create its systemd service file with cluster peer information.

[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
WorkingDirectory=/var/lib/alertmanager
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --web.listen-address=0.0.0.0:9093 \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.advertise-address=203.0.113.11:9094 \
  --cluster.peer=203.0.113.10:9094 \
  --log.level=info

Restart=always
RestartSec=3
TimeoutStopSec=30
KillMode=mixed

[Install]
WantedBy=multi-user.target
Note: Replace 203.0.113.11 with node 2's IP address and 203.0.113.10 with node 1's IP address.

Setup node 3 configuration

On the third server, repeat the same user, directory, binary, and configuration steps, then create its systemd service file with both existing nodes listed as peers.

[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
WorkingDirectory=/var/lib/alertmanager
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --web.listen-address=0.0.0.0:9093 \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.advertise-address=203.0.113.12:9094 \
  --cluster.peer=203.0.113.10:9094 \
  --cluster.peer=203.0.113.11:9094 \
  --log.level=info

Restart=always
RestartSec=3
TimeoutStopSec=30
KillMode=mixed

[Install]
WantedBy=multi-user.target
Note: Replace 203.0.113.12 with node 3's IP address. Node 3 connects to both existing peers for faster cluster joining.
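
The three unit files differ only in the advertise address and the list of peers, so the peer flags can be generated rather than typed by hand. The sketch below uses a made-up helper, peer_flags, and assumes you want every node to list all other nodes as peers. That goes slightly beyond the units above, where node 1 lists none and node 2 lists only node 1, but it lets any node rejoin on its own after a restart.

```shell
#!/bin/sh
# peer_flags SELF_IP NODE_IP...   (hypothetical helper for this tutorial)
# Prints one --cluster.peer flag per node other than SELF_IP.
# Port 9094 matches the --cluster.listen-address used in the unit files.
peer_flags() {
  self="$1"
  shift
  for node in "$@"; do
    # Skip the node the flags are being generated for.
    [ "$node" = "$self" ] && continue
    printf '%s\n' "--cluster.peer=${node}:9094"
  done
}

# Example: flags for node 2 in a three-node cluster
#   peer_flags 203.0.113.11 203.0.113.10 203.0.113.11 203.0.113.12
```

Paste the printed flags into the ExecStart line of the matching node, each followed by a trailing backslash like the other flags.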

Configure firewall rules

Open the required ports for Alertmanager web interface and cluster communication on all nodes.

# Ubuntu / Debian (ufw)
sudo ufw allow 9093/tcp comment 'Alertmanager web interface'
sudo ufw allow 9094/tcp comment 'Alertmanager cluster communication'
sudo ufw allow 9094/udp comment 'Alertmanager cluster communication'
sudo ufw reload

# AlmaLinux / Rocky Linux (firewalld)
sudo firewall-cmd --permanent --add-port=9093/tcp
sudo firewall-cmd --permanent --add-port=9094/tcp
sudo firewall-cmd --permanent --add-port=9094/udp
sudo firewall-cmd --reload

Start Alertmanager cluster

Enable and start the Alertmanager service on all nodes, beginning with node 1.

# Start on node 1 first
sudo systemctl daemon-reload
sudo systemctl enable --now alertmanager
sudo systemctl status alertmanager

# Wait 30 seconds, then start on node 2
sudo systemctl daemon-reload
sudo systemctl enable --now alertmanager
sudo systemctl status alertmanager

# Wait 30 seconds, then start on node 3
sudo systemctl daemon-reload
sudo systemctl enable --now alertmanager
sudo systemctl status alertmanager
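
Instead of waiting a fixed 30 seconds between nodes, you could poll each node until it reports ready. The following is a minimal sketch of a generic retry helper. The name wait_for is made up for this example, and the commented usage assumes Alertmanager's /-/ready endpoint on the web port used above.

```shell
#!/bin/sh
# wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]   (hypothetical helper)
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# Returns 0 on the first success, 1 if every attempt failed.
wait_for() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep only between attempts, not after the final failure.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example: give node 1 up to 30 seconds before starting node 2
#   wait_for 15 2 curl -fsS http://203.0.113.10:9093/-/ready
```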

Install and configure HAProxy load balancer

Install HAProxy on a separate server to provide load balancing and health checking for the Alertmanager cluster.

# Ubuntu / Debian
sudo apt install -y haproxy

# AlmaLinux / Rocky Linux
sudo dnf install -y haproxy

Configure HAProxy for Alertmanager

Create an HAProxy configuration in /etc/haproxy/haproxy.cfg that distributes requests across all Alertmanager nodes and removes unhealthy nodes from rotation through HTTP health checks.

global
    daemon
    maxconn 4096
    log stdout local0 info
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin
    stats timeout 30s
    user haproxy
    group haproxy

defaults
    mode http
    log global
    option httplog
    option dontlognull
    option http-server-close
    option forwardfor
    option redispatch
    retries 3
    timeout http-request 10s
    timeout queue 1m
    timeout connect 10s
    timeout client 1m
    timeout server 1m
    timeout http-keep-alive 10s
    timeout check 10s
    maxconn 3000

frontend alertmanager_frontend
    bind *:9093
    default_backend alertmanager_backend

backend alertmanager_backend
    balance roundrobin
    option httpchk GET /-/healthy
    http-check expect status 200
    server alertmanager1 203.0.113.10:9093 check inter 10s fall 3 rise 2
    server alertmanager2 203.0.113.11:9093 check inter 10s fall 3 rise 2
    server alertmanager3 203.0.113.12:9093 check inter 10s fall 3 rise 2

frontend stats
    bind *:8404
    stats enable
    stats uri /stats
    stats refresh 30s
    stats admin if TRUE
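
The check parameters on the server lines determine how quickly HAProxy reacts to a node failure. As a rough sketch (it ignores the check timeout and where in the interval the failure happens), the reaction time is the check interval multiplied by the number of consecutive results required. The helper name check_window is made up for this example.

```shell
#!/bin/sh
# check_window INTERVAL_SECONDS COUNT   (hypothetical helper)
# Approximate seconds HAProxy needs to see COUNT consecutive check results.
check_window() {
  echo $(( $1 * $2 ))
}

check_window 10 3   # inter 10s, fall 3: about 30 seconds until a dead node is marked down
check_window 10 2   # inter 10s, rise 2: about 20 seconds until a recovered node is back in rotation
```

Lowering inter or fall shortens detection at the cost of more check traffic and a higher chance of flapping. Running haproxy -c -f /etc/haproxy/haproxy.cfg validates the file before you start the service.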

Enable and start HAProxy

Start the HAProxy service and configure it to start automatically on boot.

sudo systemctl enable --now haproxy
sudo systemctl status haproxy

Configure Prometheus to use the cluster

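Update your Prometheus configuration to send alerts to every Alertmanager instance directly rather than through the HAProxy load balancer. Each cluster member must receive every alert so that the instances can deduplicate notifications over gossip, and Prometheus takes care of this when all instances are listed as targets. Use the HAProxy address for the web interface and for API clients such as amtool. This tutorial builds on the Prometheus Alertmanager integration tutorial.

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - "203.0.113.10:9093"  # Alertmanager node 1
          - "203.0.113.11:9093"  # Alertmanager node 2
          - "203.0.113.12:9093"  # Alertmanager node 3
      timeout: 10s
      api_version: v2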

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'alertmanager'
    static_configs:
      - targets:
        - '203.0.113.10:9093'
        - '203.0.113.11:9093'
        - '203.0.113.12:9093'

Restart Prometheus to apply the new configuration.

sudo systemctl restart prometheus

Verify your setup

Check that all Alertmanager nodes are running and properly clustered.

# Check cluster status on each node
# (uses the v2 API, matching api_version: v2 in the Prometheus configuration)
curl -s http://203.0.113.10:9093/api/v2/status | jq '.cluster.status'
curl -s http://203.0.113.11:9093/api/v2/status | jq '.cluster.status'
curl -s http://203.0.113.12:9093/api/v2/status | jq '.cluster.status'

# Check cluster peers
curl -s http://203.0.113.10:9093/api/v2/status | jq '.cluster.peers'

# Test HAProxy health
curl -I http://203.0.113.20:9093/-/healthy

# View HAProxy stats
curl http://203.0.113.20:8404/stats

# Test alert reception through the load balancer
curl -X POST http://203.0.113.20:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "warning",
      "instance": "test-server"
    },
    "annotations": {
      "summary": "Test alert for cluster verification",
      "description": "This is a test alert to verify cluster functionality"
    }
  }]'

Check the Alertmanager logs to verify cluster formation and gossip protocol operation.

sudo journalctl -u alertmanager -f --since "10 minutes ago"

You should see log entries indicating successful cluster membership and peer discovery.

Common issues

| Symptom | Cause | Fix |
|---|---|---|
| Nodes can't join cluster | Firewall blocking gossip ports | Ensure ports 9094/tcp and 9094/udp are open between nodes |
| Split-brain behavior | Network partition between nodes | Check network connectivity and advertise-address configuration |
| HAProxy shows all backends down | Health check path incorrect | Verify the /-/healthy endpoint responds with 200 on all nodes |
| Duplicate alerts being sent | Cluster members not sharing notification state | Check that cluster gossip is working with /api/v2/status |
| Configuration reload fails | Syntax error in YAML | Run amtool check-config /etc/alertmanager/alertmanager.yml |
| Peer discovery fails | Wrong advertise-address | Set advertise-address to an IP reachable by the other cluster members |
