Set up a production-grade Alertmanager cluster with gossip protocol for high availability, automatic failover, and load balancing. Ensure your monitoring alerts remain operational even when individual nodes fail.
Prerequisites
- Three servers for Alertmanager with at least 2 GB RAM each, plus a separate host for HAProxy
- Network connectivity between cluster nodes
- Basic understanding of Prometheus alerting
- Root or sudo access on all servers
What this solves
Running a single Alertmanager instance creates a single point of failure in your monitoring infrastructure. When that instance goes down, no alerts are routed, and you learn about outages late or not at all. This tutorial builds a three-node Alertmanager cluster whose nodes share silences and notification state over a gossip protocol, so notifications keep flowing when an individual node fails.
Step-by-step installation
Update system packages
Start by updating your package manager to ensure you get the latest versions of required dependencies.
sudo apt update && sudo apt upgrade -y
sudo apt install -y wget curl systemd jq
Create Alertmanager user and directories
Create a dedicated system user for Alertmanager and set up the required directory structure with proper permissions.
sudo useradd --no-create-home --shell /bin/false alertmanager
sudo mkdir -p /etc/alertmanager
sudo mkdir -p /var/lib/alertmanager
sudo chown alertmanager:alertmanager /etc/alertmanager
sudo chown alertmanager:alertmanager /var/lib/alertmanager
Download and install Alertmanager
Download the Alertmanager release pinned by this tutorial (0.26.0; substitute a newer version number if you prefer) and install it to the system binary directory.
cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
tar xvf alertmanager-0.26.0.linux-amd64.tar.gz
sudo cp alertmanager-0.26.0.linux-amd64/alertmanager /usr/local/bin/
sudo cp alertmanager-0.26.0.linux-amd64/amtool /usr/local/bin/
sudo chown root:root /usr/local/bin/alertmanager
sudo chown root:root /usr/local/bin/amtool
sudo chmod 755 /usr/local/bin/alertmanager
sudo chmod 755 /usr/local/bin/amtool
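Before copying the binaries into place, it can be worth verifying the download. The following is a minimal sketch, assuming the release page publishes a sha256sums.txt file next to the tarball (check the release assets to confirm). The helper name verify_sha256 is hypothetical and not part of Alertmanager.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if FILE's SHA-256 digest equals EXPECTED.
verify_sha256() {
  local file="$1" expected="$2" actual
  actual="$(sha256sum "$file" | awk '{print $1}')"
  [ "$actual" = "$expected" ]
}

# Example usage (assumes sha256sums.txt was downloaded from the same release):
#   tarball=alertmanager-0.26.0.linux-amd64.tar.gz
#   expected="$(awk -v f="$tarball" '$2 == f {print $1}' sha256sums.txt)"
#   verify_sha256 "$tarball" "$expected" || { echo "checksum mismatch" >&2; exit 1; }
```

Comparing against a digest fetched from the same host only guards against a corrupted download, not a compromised release page.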
Configure Alertmanager cluster on node 1
Create the main Alertmanager configuration file, /etc/alertmanager/alertmanager.yml, on the first cluster node. It contains the routing rules, receiver definitions, and inhibition rules; cluster settings are passed as command-line flags in the systemd unit, not in this file. Email receivers set the subject through headers and the message body through text.
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alertmanager@example.com'
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
    - match:
        severity: warning
      receiver: 'warning-alerts'
receivers:
  - name: 'default'
    email_configs:
      - to: 'admin@example.com'
        headers:
          Subject: 'Alertmanager notification'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}
  - name: 'critical-alerts'
    email_configs:
      - to: 'critical@example.com'
        headers:
          Subject: 'CRITICAL: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Severity: {{ .Labels.severity }}
          Instance: {{ .Labels.instance }}
          Description: {{ .Annotations.description }}
          {{ end }}
  - name: 'warning-alerts'
    email_configs:
      - to: 'warnings@example.com'
        headers:
          Subject: 'WARNING: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Severity: {{ .Labels.severity }}
          Instance: {{ .Labels.instance }}
          Description: {{ .Annotations.description }}
          {{ end }}
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
Set proper ownership and permissions on the configuration file.
sudo chown alertmanager:alertmanager /etc/alertmanager/alertmanager.yml
sudo chmod 640 /etc/alertmanager/alertmanager.yml
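Since Alertmanager refuses to start on an invalid configuration, it may help to validate the file before the service is created. The sketch below wraps amtool's check-config subcommand; the function name and the AMTOOL override variable are hypothetical conveniences, not part of amtool.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around `amtool check-config`.
# AMTOOL can be overridden (for example in tests); it defaults to the installed binary.
check_alertmanager_config() {
  local cfg="${1:-/etc/alertmanager/alertmanager.yml}"
  if "${AMTOOL:-amtool}" check-config "$cfg"; then
    echo "config OK: $cfg"
  else
    echo "config INVALID: $cfg" >&2
    return 1
  fi
}
```

Because the file is readable only by the alertmanager user and group, the check likely needs to run as root or as that user.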
Create systemd service for node 1
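Create the systemd unit file /etc/systemd/system/alertmanager.service on node 1. The --cluster.listen-address flag sets the address the gossip layer binds to, and --cluster.advertise-address is the address other nodes use to reach this one. Replace the 203.0.113.x placeholder addresses with your servers' IPs.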
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target
[Service]
User=alertmanager
Group=alertmanager
Type=simple
WorkingDirectory=/var/lib/alertmanager
ExecStart=/usr/local/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path=/var/lib/alertmanager \
--web.listen-address=0.0.0.0:9093 \
--cluster.listen-address=0.0.0.0:9094 \
--cluster.advertise-address=203.0.113.10:9094 \
--log.level=info
Restart=always
RestartSec=3
TimeoutStopSec=30
KillMode=mixed
[Install]
WantedBy=multi-user.target
Setup node 2 configuration
On the second server, repeat the package update, user and directory creation, and binary installation steps from node 1. The user and directory commands are shown again here for convenience.
# Run on node 2
sudo useradd --no-create-home --shell /bin/false alertmanager
sudo mkdir -p /etc/alertmanager
sudo mkdir -p /var/lib/alertmanager
sudo chown alertmanager:alertmanager /etc/alertmanager
sudo chown alertmanager:alertmanager /var/lib/alertmanager
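Copy the same Alertmanager configuration file to /etc/alertmanager/alertmanager.yml on node 2 with the same ownership and permissions, then create its systemd service file. The --cluster.peer flag points at node 1 so that node 2 joins the existing cluster.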
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target
[Service]
User=alertmanager
Group=alertmanager
Type=simple
WorkingDirectory=/var/lib/alertmanager
ExecStart=/usr/local/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path=/var/lib/alertmanager \
--web.listen-address=0.0.0.0:9093 \
--cluster.listen-address=0.0.0.0:9094 \
--cluster.advertise-address=203.0.113.11:9094 \
--cluster.peer=203.0.113.10:9094 \
--log.level=info
Restart=always
RestartSec=3
TimeoutStopSec=30
KillMode=mixed
[Install]
WantedBy=multi-user.target
Setup node 3 configuration
Repeat the installation steps and copy the configuration file to the third node, then create its service file with both existing cluster members listed as peers.
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target
[Service]
User=alertmanager
Group=alertmanager
Type=simple
WorkingDirectory=/var/lib/alertmanager
ExecStart=/usr/local/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path=/var/lib/alertmanager \
--web.listen-address=0.0.0.0:9093 \
--cluster.listen-address=0.0.0.0:9094 \
--cluster.advertise-address=203.0.113.12:9094 \
--cluster.peer=203.0.113.10:9094 \
--cluster.peer=203.0.113.11:9094 \
--log.level=info
Restart=always
RestartSec=3
TimeoutStopSec=30
KillMode=mixed
[Install]
WantedBy=multi-user.target
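The three unit files differ only in the advertise address and the peer list. A common variation is to have every node, including node 1, list all other nodes as peers so that any restarted node can rejoin on its own. The following sketch prints the peer flags for a given node; peer_flags is a hypothetical helper and the IPs are the tutorial's placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one --cluster.peer flag for every node except SELF.
# Usage: peer_flags SELF_IP PORT NODE_IP...
peer_flags() {
  local self="$1" port="$2" ip
  shift 2
  for ip in "$@"; do
    if [ "$ip" != "$self" ]; then
      printf '%s=%s:%s\n' '--cluster.peer' "$ip" "$port"
    fi
  done
}

# Example: flags for node 2
#   peer_flags 203.0.113.11 9094 203.0.113.10 203.0.113.11 203.0.113.12
```

The printed flags can be pasted into the ExecStart line of the corresponding unit file.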
Configure firewall rules
Open the required ports for Alertmanager web interface and cluster communication on all nodes.
sudo ufw allow 9093/tcp comment 'Alertmanager web interface'
sudo ufw allow 9094/tcp comment 'Alertmanager cluster communication'
sudo ufw allow 9094/udp comment 'Alertmanager cluster communication'
sudo ufw reload
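The rules above open the gossip port to any source. If the nodes are reachable from untrusted networks, one option is to allow port 9094 only from the other cluster members. The sketch below prints such ufw rules without running them, so they can be reviewed first; gossip_rules is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (not run) ufw rules that limit the gossip port
# to the given peer IPs, for both TCP and UDP.
# Usage: gossip_rules PORT PEER_IP...
gossip_rules() {
  local port="$1" ip proto
  shift
  for ip in "$@"; do
    for proto in tcp udp; do
      echo "ufw allow from ${ip} to any port ${port} proto ${proto}"
    done
  done
}

# Example on node 1:
#   gossip_rules 9094 203.0.113.11 203.0.113.12
```

After reviewing the output, the commands can be run with sudo in place of the two broad 9094 rules.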
Start Alertmanager cluster
Enable and start the Alertmanager service on all nodes, beginning with node 1.
# Start on node 1 first
sudo systemctl daemon-reload
sudo systemctl enable --now alertmanager
sudo systemctl status alertmanager
# Wait 30 seconds, then run on node 2
sudo systemctl daemon-reload
sudo systemctl enable --now alertmanager
sudo systemctl status alertmanager
# Wait 30 seconds, then run on node 3
sudo systemctl daemon-reload
sudo systemctl enable --now alertmanager
sudo systemctl status alertmanager
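Instead of waiting a fixed 30 seconds between nodes, a small retry loop can poll the previous node's health endpoint until it answers. This is a sketch, not part of Alertmanager; wait_for is a hypothetical generic helper that retries any command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry a command once per second until it succeeds
# or TIMEOUT seconds have elapsed. Returns 0 on success, 1 on timeout.
# Usage: wait_for TIMEOUT COMMAND [ARGS...]
wait_for() {
  local timeout="$1" waited=0
  shift
  until "$@"; do
    waited=$((waited + 1))
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
  done
}

# Example: wait up to 30 seconds for node 1 before starting node 2
#   wait_for 30 curl -fsS -o /dev/null http://203.0.113.10:9093/-/healthy
```

A healthy HTTP endpoint shows the process is up; cluster membership is still best confirmed through the status checks in the verification section.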
Install and configure HAProxy load balancer
Install HAProxy on a separate server to provide load balancing and health checking for the Alertmanager cluster.
sudo apt install -y haproxy
Configure HAProxy for Alertmanager
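Replace the contents of /etc/haproxy/haproxy.cfg with a configuration that distributes requests across all Alertmanager nodes and health-checks each one through /-/healthy. Note that the stats frontend below permits admin actions without authentication, so restrict access to port 8404.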
global
daemon
maxconn 4096
log stdout local0 info
chroot /var/lib/haproxy
stats socket /run/haproxy/admin.sock mode 660 level admin
stats timeout 30s
user haproxy
group haproxy
defaults
mode http
log global
option httplog
option dontlognull
option http-server-close
option forwardfor
option redispatch
retries 3
timeout http-request 10s
timeout queue 1m
timeout connect 10s
timeout client 1m
timeout server 1m
timeout http-keep-alive 10s
timeout check 10s
maxconn 3000
frontend alertmanager_frontend
bind *:9093
default_backend alertmanager_backend
backend alertmanager_backend
balance roundrobin
option httpchk GET /-/healthy
http-check expect status 200
server alertmanager1 203.0.113.10:9093 check inter 10s fall 3 rise 2
server alertmanager2 203.0.113.11:9093 check inter 10s fall 3 rise 2
server alertmanager3 203.0.113.12:9093 check inter 10s fall 3 rise 2
frontend stats
bind *:8404
stats enable
stats uri /stats
stats refresh 30s
stats admin if TRUE
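To make the health-check parameters concrete: with inter 10s, fall 3, and rise 2, a failed node is removed from rotation after roughly 30 seconds and returns after roughly 20 seconds of passing checks. The arithmetic can be sketched as below; check_window is a hypothetical helper, and the estimate ignores timeout check and scheduling jitter.

```shell
#!/usr/bin/env bash
# Rough estimate of how long HAProxy needs to change a server's state:
# check interval (seconds) multiplied by the number of consecutive results required.
# Usage: check_window INTER_SECONDS COUNT
check_window() {
  local inter="$1" count="$2"
  echo $(( inter * count ))
}

# Example:
#   check_window 10 3   # seconds until a failing server is marked down
#   check_window 10 2   # seconds until a recovered server is marked up
```

Lowering inter shortens detection time at the cost of more probe traffic. Running haproxy -c -f /etc/haproxy/haproxy.cfg validates the file before the service is started.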
Enable and start HAProxy
Start the HAProxy service and configure it to start automatically on boot.
sudo systemctl enable --now haproxy
sudo systemctl status haproxy
Configure Prometheus to use the cluster
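Update your Prometheus configuration to send alerts to every Alertmanager node directly rather than to the HAProxy address. The Alertmanager documentation advises against load balancing traffic between Prometheus and Alertmanager: Prometheus should deliver each alert to all cluster members, and the cluster deduplicates the resulting notifications. HAProxy remains the single entry point for the web UI and other API clients. This tutorial builds on the Prometheus Alertmanager integration tutorial.
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - "rules/*.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - '203.0.113.10:9093'
            - '203.0.113.11:9093'
            - '203.0.113.12:9093'
      timeout: 10s
      api_version: v2
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'alertmanager'
    static_configs:
      - targets:
          - '203.0.113.10:9093'
          - '203.0.113.11:9093'
          - '203.0.113.12:9093'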
Restart Prometheus to apply the new configuration.
sudo systemctl restart prometheus
Verify your setup
Check that all Alertmanager nodes are running and properly clustered.
# Check cluster status on each node (expect "ready")
curl -s http://203.0.113.10:9093/api/v2/status | jq '.cluster.status'
curl -s http://203.0.113.11:9093/api/v2/status | jq '.cluster.status'
curl -s http://203.0.113.12:9093/api/v2/status | jq '.cluster.status'
# Check cluster peers
curl -s http://203.0.113.10:9093/api/v2/status | jq '.cluster.peers'
# Test HAProxy health
curl -I http://203.0.113.20:9093/-/healthy
# View HAProxy stats
curl http://203.0.113.20:8404/stats
# Test alert reception through load balancer
curl -X POST http://203.0.113.20:9093/api/v2/alerts \
-H 'Content-Type: application/json' \
-d '[{
"labels": {
"alertname": "TestAlert",
"severity": "warning",
"instance": "test-server"
},
"annotations": {
"summary": "Test alert for cluster verification",
"description": "This is a test alert to verify cluster functionality"
}
}]'
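To exercise each route (critical, warning, and the default receiver), it helps to generate test payloads with different severities. The sketch below builds a one-alert JSON body; alert_payload is a hypothetical helper, and it assumes its arguments contain no double quotes or backslashes because it performs no JSON escaping.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a one-alert JSON payload for the alerts endpoint.
# Assumes the arguments contain no double quotes or backslashes (no escaping is done).
# Usage: alert_payload ALERTNAME SEVERITY INSTANCE SUMMARY
alert_payload() {
  local name="$1" severity="$2" instance="$3" summary="$4"
  printf '[{"labels":{"alertname":"%s","severity":"%s","instance":"%s"},"annotations":{"summary":"%s"}}]\n' \
    "$name" "$severity" "$instance" "$summary"
}

# Example: send a critical test alert to the same alerts endpoint used above
#   alert_payload TestCritical critical test-server 'Route test' |
#     curl -X POST -H 'Content-Type: application/json' -d @- "$ALERTS_URL"
```

Here ALERTS_URL is a placeholder for the alerts endpoint URL from the curl command above.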
Check the Alertmanager logs to verify cluster formation and gossip protocol operation.
sudo journalctl -u alertmanager -f --since "10 minutes ago"
You should see log entries indicating successful cluster membership and peer discovery.
Common issues
| Symptom | Cause | Fix |
|---|---|---|
| Nodes can't join cluster | Firewall blocking gossip ports | Ensure ports 9094/tcp and 9094/udp are open between nodes |
| Split-brain behavior | Network partition between nodes | Check network connectivity and advertise-address configuration |
| HAProxy shows all backends down | Health check path incorrect | Verify /-/healthy endpoint responds with 200 on all nodes |
| Duplicate alerts being sent | Nodes are not clustered, so each instance notifies independently | Confirm all peers appear in /api/v2/status on every node |
| Configuration reload fails | Syntax error in YAML | Run amtool check-config /etc/alertmanager/alertmanager.yml |
| Peer discovery fails | Wrong advertise-address | Set advertise-address to IP reachable by other cluster members |
Next steps
- Configure Prometheus Alertmanager with Slack integration for team notifications
- Set up Alertmanager with email and Slack notifications for monitoring alerts
- Configure Alertmanager webhook integrations with PagerDuty and OpsGenie
- Implement Alertmanager silence management and maintenance windows
- Configure Alertmanager routing trees for multi-tenant environments
Automated install script
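The script below automates the Alertmanager installation on a single node. Run it on each of the three nodes, passing the other nodes' IPs with --cluster-peers. It does not install HAProxy or change the Prometheus configuration.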
#!/usr/bin/env bash
set -euo pipefail
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'
# Global variables
ALERTMANAGER_VERSION="0.26.0"
CLUSTER_PEERS=""
NODE_ID="node1"
LISTEN_PORT="9093"
CLUSTER_PORT="9094"
# Usage function
usage() {
echo "Usage: $0 [OPTIONS]"
echo "Options:"
echo " --node-id ID Node identifier (default: node1)"
echo " --cluster-peers IP Comma-separated list of cluster peer IPs"
echo " --listen-port PORT Alertmanager listen port (default: 9093)"
echo " --cluster-port PORT Cluster communication port (default: 9094)"
echo " -h, --help Show this help message"
exit 1
}
# Parse command line arguments
while [[ $# -gt 0 ]]; do
case $1 in
--node-id)
NODE_ID="$2"
shift 2
;;
--cluster-peers)
CLUSTER_PEERS="$2"
shift 2
;;
--listen-port)
LISTEN_PORT="$2"
shift 2
;;
--cluster-port)
CLUSTER_PORT="$2"
shift 2
;;
-h|--help)
usage
;;
*)
echo -e "${RED}Unknown option: $1${NC}"
usage
;;
esac
done
# Cleanup function for rollback
cleanup() {
echo -e "${YELLOW}Installation failed. Cleaning up...${NC}"
systemctl stop alertmanager 2>/dev/null || true
systemctl disable alertmanager 2>/dev/null || true
rm -f /etc/systemd/system/alertmanager.service
rm -f /usr/local/bin/alertmanager /usr/local/bin/amtool
rm -rf /etc/alertmanager /var/lib/alertmanager
userdel alertmanager 2>/dev/null || true
systemctl daemon-reload
echo -e "${YELLOW}Cleanup completed${NC}"
}
trap cleanup ERR
# Check if running as root or with sudo
if [[ $EUID -ne 0 ]]; then
echo -e "${RED}This script must be run as root or with sudo${NC}"
exit 1
fi
# Detect distribution
echo -e "${YELLOW}[1/8] Detecting distribution...${NC}"
if [ -f /etc/os-release ]; then
. /etc/os-release
case "$ID" in
ubuntu|debian)
PKG_MGR="apt"
PKG_UPDATE="apt update && apt upgrade -y"
PKG_INSTALL="apt install -y"
;;
almalinux|rocky|centos|rhel|ol)
PKG_MGR="dnf"
PKG_UPDATE="dnf update -y"
PKG_INSTALL="dnf install -y"
;;
fedora)
PKG_MGR="dnf"
PKG_UPDATE="dnf update -y"
PKG_INSTALL="dnf install -y"
;;
amzn)
PKG_MGR="yum"
PKG_UPDATE="yum update -y"
PKG_INSTALL="yum install -y"
;;
*)
echo -e "${RED}Unsupported distribution: $ID${NC}"
exit 1
;;
esac
echo -e "${GREEN}Detected: $PRETTY_NAME${NC}"
else
echo -e "${RED}Cannot detect distribution${NC}"
exit 1
fi
# Update system packages
echo -e "${YELLOW}[2/8] Updating system packages...${NC}"
# eval is needed because PKG_UPDATE may contain shell operators such as &&
eval "$PKG_UPDATE"
$PKG_INSTALL wget curl systemd tar
# Create alertmanager user and directories
echo -e "${YELLOW}[3/8] Creating alertmanager user and directories...${NC}"
useradd --no-create-home --shell /bin/false alertmanager 2>/dev/null || true
mkdir -p /etc/alertmanager
mkdir -p /var/lib/alertmanager
chown alertmanager:alertmanager /etc/alertmanager
chown alertmanager:alertmanager /var/lib/alertmanager
chmod 755 /etc/alertmanager
chmod 755 /var/lib/alertmanager
# Download and install Alertmanager
echo -e "${YELLOW}[4/8] Downloading and installing Alertmanager...${NC}"
cd /tmp
wget -q "https://github.com/prometheus/alertmanager/releases/download/v${ALERTMANAGER_VERSION}/alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz"
tar xf "alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz"
cp "alertmanager-${ALERTMANAGER_VERSION}.linux-amd64/alertmanager" /usr/local/bin/
cp "alertmanager-${ALERTMANAGER_VERSION}.linux-amd64/amtool" /usr/local/bin/
chown alertmanager:alertmanager /usr/local/bin/alertmanager
chown alertmanager:alertmanager /usr/local/bin/amtool
chmod 755 /usr/local/bin/alertmanager
chmod 755 /usr/local/bin/amtool
rm -rf "alertmanager-${ALERTMANAGER_VERSION}.linux-amd64"*
# Create Alertmanager configuration
echo -e "${YELLOW}[5/8] Creating Alertmanager configuration...${NC}"
cat > /etc/alertmanager/alertmanager.yml << 'EOF'
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alertmanager@example.com'
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
    - match:
        severity: warning
      receiver: 'warning-alerts'
receivers:
  - name: 'default'
    email_configs:
      - to: 'admin@example.com'
        headers:
          Subject: 'Alertmanager notification'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}
  - name: 'critical-alerts'
    email_configs:
      - to: 'critical@example.com'
        headers:
          Subject: 'CRITICAL: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Severity: {{ .Labels.severity }}
          Instance: {{ .Labels.instance }}
          Description: {{ .Annotations.description }}
          {{ end }}
  - name: 'warning-alerts'
    email_configs:
      - to: 'warnings@example.com'
        headers:
          Subject: 'WARNING: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Severity: {{ .Labels.severity }}
          Instance: {{ .Labels.instance }}
          Description: {{ .Annotations.description }}
          {{ end }}
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
EOF
chown alertmanager:alertmanager /etc/alertmanager/alertmanager.yml
chmod 640 /etc/alertmanager/alertmanager.yml
# Create systemd service
echo -e "${YELLOW}[6/8] Creating systemd service...${NC}"
CLUSTER_LISTEN_ADDRESS="0.0.0.0:${CLUSTER_PORT}"
CLUSTER_ADVERTISE_ADDRESS="$(hostname -I | awk '{print $1}'):${CLUSTER_PORT}"
# Build cluster peer arguments
PEER_ARGS=""
if [[ -n "$CLUSTER_PEERS" ]]; then
IFS=',' read -ra PEERS <<< "$CLUSTER_PEERS"
for peer in "${PEERS[@]}"; do
PEER_ARGS="$PEER_ARGS --cluster.peer=${peer}:${CLUSTER_PORT}"
done
fi
cat > /etc/systemd/system/alertmanager.service << EOF
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target
[Service]
User=alertmanager
Group=alertmanager
Type=simple
WorkingDirectory=/var/lib/alertmanager
ExecStart=/usr/local/bin/alertmanager \\
--config.file=/etc/alertmanager/alertmanager.yml \\
--storage.path=/var/lib/alertmanager \\
--web.listen-address=0.0.0.0:${LISTEN_PORT} \\
--cluster.listen-address=${CLUSTER_LISTEN_ADDRESS} \\
--cluster.advertise-address=${CLUSTER_ADVERTISE_ADDRESS}${PEER_ARGS}
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
# Configure firewall
echo -e "${YELLOW}[7/8] Configuring firewall...${NC}"
case "$ID" in
ubuntu|debian)
if command -v ufw >/dev/null 2>&1; then
ufw allow ${LISTEN_PORT}/tcp
ufw allow ${CLUSTER_PORT}/tcp
# Gossip uses UDP as well as TCP on the cluster port
ufw allow ${CLUSTER_PORT}/udp
fi
;;
almalinux|rocky|centos|rhel|ol|fedora)
if command -v firewall-cmd >/dev/null 2>&1; then
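firewall-cmd --permanent --add-port=${LISTEN_PORT}/tcp
firewall-cmd --permanent --add-port=${CLUSTER_PORT}/tcp
# Gossip uses UDP as well as TCP on the cluster port
firewall-cmd --permanent --add-port=${CLUSTER_PORT}/udp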
firewall-cmd --reload
fi
;;
esac
# Start and enable service
echo -e "${YELLOW}[8/8] Starting Alertmanager service...${NC}"
systemctl daemon-reload
systemctl enable alertmanager
systemctl start alertmanager
# Verify installation
echo -e "${YELLOW}Verifying installation...${NC}"
sleep 5
if systemctl is-active --quiet alertmanager; then
echo -e "${GREEN}✓ Alertmanager service is running${NC}"
else
echo -e "${RED}✗ Alertmanager service failed to start${NC}"
exit 1
fi
if curl -s http://localhost:${LISTEN_PORT}/-/healthy >/dev/null 2>&1; then
echo -e "${GREEN}✓ Alertmanager is responding on port ${LISTEN_PORT}${NC}"
else
echo -e "${RED}✗ Alertmanager is not responding${NC}"
exit 1
fi
echo -e "${GREEN}Alertmanager cluster installation completed successfully!${NC}"
echo -e "Node ID: ${NODE_ID}"
echo -e "Web UI: http://$(hostname -I | awk '{print $1}'):${LISTEN_PORT}"
echo -e "Configuration: /etc/alertmanager/alertmanager.yml"
echo -e "Logs: journalctl -u alertmanager -f"
trap - ERR
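Review the script before running it. It must run as root on each node, for example on node 1: sudo bash install.sh --node-id node1 --cluster-peers 203.0.113.11,203.0.113.12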
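The script writes the --cluster-peers value into the unit file without checking it, so a typo only surfaces when the node fails to join. A validation step could be added before the unit file is generated. The sketch below is one possible approach, assuming peers are given as IPv4 addresses; valid_ipv4 and valid_peer_list are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if $1 is a dotted-quad IPv4 address
# with four octets, each between 0 and 255.
valid_ipv4() {
  local IFS=. octet
  case "$1" in
    ''|*[!0-9.]*|.*|*.|*..*) return 1 ;;
  esac
  set -- $1   # split on dots; safe because only digits and dots remain
  [ "$#" -eq 4 ] || return 1
  for octet in "$@"; do
    [ "${#octet}" -le 3 ] && [ "$octet" -le 255 ] || return 1
  done
}

# Hypothetical helper: succeed only if $1 is a non-empty comma-separated
# list of valid IPv4 addresses, as --cluster-peers expects.
valid_peer_list() {
  local IFS=, ip
  case "$1" in
    ''|*[!0-9.,]*|,*|*,|*,,*) return 1 ;;
  esac
  for ip in $1; do
    valid_ipv4 "$ip" || return 1
  done
}

# Example guard inside the install script:
#   valid_peer_list "$CLUSTER_PEERS" || { echo "invalid --cluster-peers" >&2; exit 1; }
```

The guard should only run when --cluster-peers is actually supplied, since the first node may legitimately start with an empty peer list.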