Set up comprehensive monitoring for HashiCorp Consul using Prometheus metrics collection and Grafana dashboards. Configure telemetry export, alerting rules, and visualization for service discovery health and performance.
Prerequisites
- Running Consul cluster
- Root or sudo access
- Network access between monitoring components
- Basic understanding of Prometheus and Grafana
What this solves
Consul service discovery requires comprehensive monitoring to ensure healthy service registration, discovery performance, and cluster stability. This tutorial configures Prometheus to collect Consul telemetry metrics and creates Grafana dashboards for visualizing service health, cluster performance, and automated alerting on critical issues.
You need a running Consul cluster with basic configuration. If you haven't set up Consul yet, follow our Consul installation guide first. You'll also need Prometheus and Grafana installed on your monitoring server.
Step-by-step configuration
Install monitoring stack components
Install Prometheus and Grafana on your monitoring server to collect and visualize Consul metrics. Note that the prometheus package is in the default Debian/Ubuntu repositories, but the grafana package typically requires adding Grafana's own APT repository first.
sudo apt update
sudo apt install -y prometheus grafana curl wget
sudo systemctl enable prometheus grafana-server
Configure Consul telemetry
Enable Consul's built-in telemetry and metrics export to make data available for Prometheus scraping.
telemetry {
  prometheus_retention_time = "30s"
  disable_hostname          = true
  metrics_prefix            = "consul"
  # statsd, statsite, dogstatsd, and circonus sinks are optional; configure
  # them only if you actually run those collectors alongside Prometheus
}
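Before restarting the agent, it's worth validating the merged configuration. The helper below is a small sketch: the /etc/consul.d path is an assumption (adjust to your layout), and it skips gracefully when the consul binary isn't on PATH.

```shell
# Validate the Consul config directory before a restart (sketch; path assumed).
validate_consul_config() {
  local dir="${1:-/etc/consul.d}"
  if ! command -v consul >/dev/null 2>&1; then
    echo "consul binary not found; skipping validation"
    return 0
  fi
  # consul validate parses every .hcl/.json file in the directory
  consul validate "$dir"
}

validate_consul_config /etc/consul.d
```

A failed validation here is much cheaper than a failed restart of a production agent.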
Enable Consul HTTP API metrics endpoint
Configure Consul to expose metrics through its HTTP API endpoint for Prometheus collection.
ports {
  http = 8500
}

ui_config {
  enabled          = true
  metrics_provider = "prometheus"
  metrics_proxy {
    base_url = "http://localhost:9090"
  }
}

connect {
  enabled = true
}

log_level = "INFO"
log_json  = true
Restart Consul with telemetry configuration
Apply the telemetry configuration by restarting Consul and verify the metrics endpoint is accessible.
sudo systemctl restart consul
sudo systemctl status consul
Test metrics endpoint
curl -s "http://localhost:8500/v1/agent/metrics?format=prometheus" | head -20
Configure Prometheus to scrape Consul metrics
Add Consul as a scrape target in Prometheus configuration to collect telemetry data every 15 seconds.
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "/etc/prometheus/rules/consul_alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'consul'
    static_configs:
      - targets: ['localhost:8500']
    metrics_path: '/v1/agent/metrics'
    params:
      format: ['prometheus']
    scrape_interval: 15s
    scrape_timeout: 10s

  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'localhost:8500'
        services: []
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: service
      - source_labels: [__meta_consul_node]
        target_label: node
Create Consul alerting rules
Define Prometheus alerting rules for critical Consul events like node failures, service health issues, and cluster problems.
sudo mkdir -p /etc/prometheus/rules
groups:
  - name: consul_alerts
    rules:
      # Note: consul_health_* and consul_raft_leader are exposed by the
      # separate prometheus/consul_exporter; the agent's /v1/agent/metrics
      # endpoint provides the consul_runtime_* and consul_rpc_* series. Run
      # the exporter alongside Prometheus if you want the health-based alerts.
      - alert: ConsulNodeDown
        expr: consul_health_node_status{status="critical"} == 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Consul node {{ $labels.node }} is down"
          description: "Consul node {{ $labels.node }} has been down for more than 30 seconds"
      - alert: ConsulServiceUnhealthy
        expr: consul_health_service_status{status="critical"} == 1
        for: 60s
        labels:
          severity: warning
        annotations:
          summary: "Consul service {{ $labels.service_name }} is unhealthy"
          description: "Service {{ $labels.service_name }} on node {{ $labels.node }} has been unhealthy for more than 1 minute"
      - alert: ConsulNoLeader
        expr: consul_raft_leader != 1
        for: 10s
        labels:
          severity: critical
        annotations:
          summary: "Consul cluster has no leader"
          description: "Consul cluster has been without a leader for more than 10 seconds"
      - alert: ConsulHighMemoryUsage
        expr: consul_runtime_alloc_bytes / consul_runtime_sys_bytes > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Consul memory usage is high"
          description: "Consul is using more than 80% of allocated memory"
      - alert: ConsulRPCErrors
        expr: rate(consul_rpc_request_error[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High Consul RPC error rate"
          description: "Consul RPC error rate is {{ $value }} errors per second"
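Before restarting Prometheus, lint both the main config and the rules file. This sketch assumes the Debian package paths used above and skips gracefully when promtool isn't installed.

```shell
# Lint the Prometheus config and the Consul alert rules (sketch; paths assumed).
check_prometheus_files() {
  if ! command -v promtool >/dev/null 2>&1; then
    echo "promtool not installed; skipping checks"
    return 0
  fi
  promtool check config /etc/prometheus/prometheus.yml
  promtool check rules /etc/prometheus/rules/consul_alerts.yml
}

check_prometheus_files
```

promtool ships with the prometheus package, so it should already be present on the monitoring server.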
Restart Prometheus with new configuration
Apply the Prometheus configuration and verify it's successfully scraping Consul metrics.
sudo systemctl restart prometheus
sudo systemctl status prometheus
Check if Consul target is being scraped
curl -s http://localhost:9090/api/v1/targets | grep consul
Configure Grafana data source
Add Prometheus as a data source in Grafana to enable dashboard creation for Consul metrics.
sudo systemctl start grafana-server
sudo systemctl status grafana-server
Access Grafana at http://your-server-ip:3000 (default credentials admin/admin), then add a Prometheus data source pointing at http://localhost:9090.
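If you prefer configuration as code over clicking through the UI, Grafana can also provision the data source from a YAML file at startup. The helper function name below is made up for this sketch; the YAML itself follows Grafana's standard datasource provisioning schema, and the target directory assumes the stock package layout.

```shell
# Write a Grafana datasource provisioning file (helper name is hypothetical).
write_prometheus_datasource() {
  cat > "$1" <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
EOF
}

# Stock package location; Grafana reads provisioning files on startup.
[ -d /etc/grafana/provisioning/datasources ] &&
  write_prometheus_datasource /etc/grafana/provisioning/datasources/prometheus.yml || true
```

After writing the file, restart grafana-server; the data source appears without any manual setup.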
Create Consul overview dashboard
Import a comprehensive Grafana dashboard for Consul monitoring with panels for cluster health, service discovery, and performance metrics.
{
  "dashboard": {
    "id": null,
    "title": "Consul Service Discovery Monitoring",
    "tags": ["consul", "service-discovery"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Consul Cluster Status",
        "type": "stat",
        "targets": [
          {
            "expr": "consul_raft_leader",
            "legendFormat": "Leader Status"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}
      },
      {
        "id": 2,
        "title": "Active Services",
        "type": "stat",
        "targets": [
          {
            "expr": "count(consul_catalog_service_query_count)",
            "legendFormat": "Services"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}
      },
      {
        "id": 3,
        "title": "Service Health Status",
        "type": "graph",
        "targets": [
          {
            "expr": "consul_health_service_status",
            "legendFormat": "{{ service_name }}"
          }
        ],
        "gridPos": {"h": 8, "w": 24, "x": 0, "y": 8}
      },
      {
        "id": 4,
        "title": "RPC Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(consul_rpc_request[5m])",
            "legendFormat": "RPC Requests/sec"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 16}
      },
      {
        "id": 5,
        "title": "Memory Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "consul_runtime_alloc_bytes",
            "legendFormat": "Allocated Memory"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 16}
      }
    ],
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "refresh": "5s"
  }
}
Import dashboard into Grafana
Save the dashboard JSON above to /tmp/consul_dashboard.json, then use the Grafana API to import it programmatically.
curl -X POST \
http://admin:admin@localhost:3000/api/dashboards/db \
-H 'Content-Type: application/json' \
-d @/tmp/consul_dashboard.json
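To confirm the import landed, query Grafana's search API. The small parser below extracts dashboard titles from the JSON response using only grep and sed, so it works without jq.

```shell
# Pull "title" fields out of a Grafana /api/search JSON response.
extract_titles() {
  grep -o '"title":"[^"]*"' | sed 's/^"title":"//; s/"$//'
}

# Expect the imported dashboard title to appear (default admin credentials assumed):
# curl -s "http://admin:admin@localhost:3000/api/search?query=Consul" | extract_titles
```

If nothing prints, check the import response for an error message rather than assuming success, since the API returns 200 for some validation failures.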
Configure Grafana alerting
Set up Grafana alert notifications to send alerts via email or Slack when Consul issues occur. Note that the legacy [alerting] engine was removed in Grafana 9, so only unified alerting and SMTP need to be configured in grafana.ini:
[smtp]
enabled = true
host = localhost:587
user = alerts@example.com
password = your-smtp-password
skip_verify = false
from_address = consul-alerts@example.com
from_name = Consul Monitoring

[unified_alerting]
enabled = true
Create service discovery health checks
Configure additional monitoring for service registration and discovery performance metrics.
service {
  name = "consul-monitoring"
  port = 8500
  tags = ["monitoring", "metrics"]

  check {
    name     = "Consul HTTP API"
    http     = "http://localhost:8500/v1/status/leader"
    interval = "10s"
    timeout  = "3s"
  }

  check {
    name     = "Consul Metrics Endpoint"
    http     = "http://localhost:8500/v1/agent/metrics?format=prometheus"
    interval = "30s"
    timeout  = "5s"
  }
}
Restart services and verify monitoring
Apply all configuration changes and verify the complete monitoring pipeline is working correctly.
sudo systemctl restart consul prometheus grafana-server
sudo systemctl status consul prometheus grafana-server
Verify metrics collection
curl -s "http://localhost:9090/api/v1/query?query=consul_runtime_alloc_bytes" | grep -o '"value":\[.*\]'
Verify your monitoring setup
Test that all monitoring components are working correctly and collecting Consul metrics.
# Check Consul metrics endpoint
curl -s "http://localhost:8500/v1/agent/metrics?format=prometheus" | grep consul_raft | head -5
Verify Prometheus is scraping Consul
curl -s "http://localhost:9090/api/v1/targets" | grep consul | head -5
Test Grafana API
curl -s http://admin:admin@localhost:3000/api/health
Check active services in Consul
curl -s http://localhost:8500/v1/catalog/services | jq .
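Both Prometheus and Grafana can take a few seconds to come up after a restart, so one-shot checks like the ones above can be flaky. A small retry loop avoids that; this helper is a sketch, and the retry count and URLs are assumptions to adjust for your environment.

```shell
# Poll a URL until it responds with HTTP 2xx/3xx or the retries run out.
wait_for_url() {
  local url="$1" tries="${2:-30}"
  local i
  for (( i = 1; i <= tries; i++ )); do
    if curl -sf --max-time 2 "$url" >/dev/null 2>&1; then
      return 0
    fi
    sleep 1
  done
  return 1
}

# Example: block until Prometheus reports ready before querying it.
# wait_for_url http://localhost:9090/-/ready 30 && echo "Prometheus is up"
```

The /-/ready endpoint is Prometheus's built-in readiness probe; Grafana's equivalent is /api/health.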
Key metrics to monitor
| Metric | Description | Prometheus Query |
|---|---|---|
| Leader Status | Whether node is cluster leader | consul_raft_leader |
| Service Health | Health status of registered services | consul_health_service_status |
| Node Health | Health status of cluster nodes | consul_health_node_status |
| RPC Requests | Rate of RPC requests | rate(consul_rpc_request[5m]) |
| Memory Usage | Current memory allocation | consul_runtime_alloc_bytes |
| KV Store Operations | Key-value store transaction rate | rate(consul_kvs_apply_count[5m]) |
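Each query in the table can be run ad hoc against Prometheus's instant-query API. The helper below just builds the URL, percent-encoding the characters these PromQL expressions use; the function name is made up for this sketch.

```shell
# Build a Prometheus instant-query URL (minimal encoding for the queries above).
prom_query_url() {
  local base="$1" q="$2"
  q="${q// /%20}"
  q="${q//\[/%5B}"
  q="${q//\]/%5D}"
  q="${q//\{/%7B}"
  q="${q//\}/%7D}"
  printf '%s/api/v1/query?query=%s\n' "$base" "$q"
}

# Example:
# curl -s "$(prom_query_url http://localhost:9090 'rate(consul_rpc_request[5m])')"
```

For anything beyond these characters, use curl's --data-urlencode with a GET request instead of hand-rolling the encoding.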
Common monitoring issues
| Symptom | Cause | Fix |
|---|---|---|
| No metrics in Prometheus | Consul telemetry not enabled | Add telemetry block to Consul config and restart |
| Consul target shows as down | Metrics endpoint not accessible | Check the consul job targets and metrics_path in prometheus.yml |
| Grafana shows no data | Prometheus data source not configured | Add http://localhost:9090 as Prometheus data source |
| Alerts not firing | Alert rules syntax error | Validate YAML syntax in consul_alerts.yml |
| High memory usage alerts | Normal for large service catalogs | Adjust alert thresholds based on environment |
Performance optimization tips
For production environments, consider these optimizations to reduce monitoring overhead while maintaining visibility:
Adjust scrape intervals for performance
Reduce monitoring overhead by tuning scrape intervals based on your requirements.
# For high-traffic environments
scrape_configs:
  - job_name: 'consul'
    static_configs:
      - targets: ['localhost:8500']
    scrape_interval: 30s  # Increased from 15s
    scrape_timeout: 15s   # Increased timeout
This monitoring setup provides comprehensive visibility into your Consul service discovery infrastructure. You can extend it by adding custom metrics for specific services or integrating with your existing monitoring stack. For additional security, consider implementing Consul ACL security to protect metrics endpoints.
Next steps
- Configure Consul backup and disaster recovery with automated snapshots
- Set up Caddy with Consul service discovery for dynamic load balancing
- Configure Prometheus long-term storage with Thanos for unlimited data retention
- Monitor HAProxy and Consul with Prometheus and Grafana dashboards
- Configure Consul Connect service mesh monitoring with distributed tracing
Automated install script
Run this script to automate the entire setup:
#!/usr/bin/env bash
set -euo pipefail
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'
# Default values
CONSUL_HOST="${1:-localhost}"
GRAFANA_ADMIN_PASS="${2:-admin}"
# Usage message
usage() {
  echo "Usage: $0 [consul_host] [grafana_admin_password]"
  echo "  consul_host: Consul server hostname/IP (default: localhost)"
  echo "  grafana_admin_password: Grafana admin password (default: admin)"
  exit 1
}
# Error handling
cleanup() {
  echo -e "${RED}[ERROR] Installation failed. Cleaning up...${NC}"
  systemctl stop prometheus grafana-server 2>/dev/null || true
  rm -f /tmp/consul_monitoring_* 2>/dev/null || true
}
trap cleanup ERR
# Check if running as root or with sudo
if [[ $EUID -ne 0 ]]; then
  echo -e "${RED}[ERROR] This script must be run as root or with sudo${NC}"
  exit 1
fi
# Auto-detect distribution
if [ -f /etc/os-release ]; then
  . /etc/os-release
  case "$ID" in
    ubuntu|debian)
      PKG_MGR="apt"
      PKG_UPDATE="apt update"
      PKG_INSTALL="apt install -y"
      ;;
    almalinux|rocky|centos|rhel|ol|fedora)
      PKG_MGR="dnf"
      PKG_UPDATE="dnf update -y"
      PKG_INSTALL="dnf install -y"
      ;;
    amzn)
      PKG_MGR="yum"
      PKG_UPDATE="yum update -y"
      PKG_INSTALL="yum install -y"
      ;;
    *)
      echo -e "${RED}[ERROR] Unsupported distribution: $ID${NC}"
      exit 1
      ;;
  esac
else
  echo -e "${RED}[ERROR] Cannot detect distribution${NC}"
  exit 1
fi

# Config paths are the same across supported distributions
PROM_CONFIG="/etc/prometheus/prometheus.yml"
PROM_RULES_DIR="/etc/prometheus/rules"
GRAFANA_CONFIG="/etc/grafana/grafana.ini"
GRAFANA_USER="grafana"
echo -e "${BLUE}[INFO] Installing Consul monitoring with Prometheus and Grafana on $PRETTY_NAME${NC}"
# Step 1: Update packages and install monitoring stack
echo -e "${GREEN}[1/8] Installing monitoring stack components...${NC}"
$PKG_UPDATE
# Note: the grafana package may require the vendor repository (and EPEL on
# RHEL-family distros) if it is not available in the default repos
$PKG_INSTALL prometheus grafana curl wget
# Step 2: Configure Prometheus
echo -e "${GREEN}[2/8] Configuring Prometheus...${NC}"
mkdir -p "$PROM_RULES_DIR"
chown prometheus:prometheus "$PROM_RULES_DIR"
chmod 755 "$PROM_RULES_DIR"
cat > "$PROM_CONFIG" << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "/etc/prometheus/rules/consul_alerts.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'consul'
    static_configs:
      - targets: ['CONSUL_HOST:8500']
    metrics_path: '/v1/agent/metrics'
    params:
      format: ['prometheus']
    scrape_interval: 15s
    scrape_timeout: 10s

  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'CONSUL_HOST:8500'
        services: []
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: service
      - source_labels: [__meta_consul_node]
        target_label: node
EOF
sed -i "s/CONSUL_HOST/$CONSUL_HOST/g" "$PROM_CONFIG"
chown prometheus:prometheus "$PROM_CONFIG"
chmod 644 "$PROM_CONFIG"
# Step 3: Create Consul alerting rules
echo -e "${GREEN}[3/8] Creating Consul alerting rules...${NC}"
cat > "$PROM_RULES_DIR/consul_alerts.yml" << 'EOF'
groups:
  - name: consul_alerts
    rules:
      # Note: consul_health_* and consul_raft_leader are exposed by the
      # separate prometheus/consul_exporter, not by the agent's metrics
      # endpoint; deploy the exporter if you want these alerts to fire.
      - alert: ConsulNodeDown
        expr: consul_health_node_status{status="critical"} == 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Consul node {{ $labels.node }} is down"
          description: "Consul node {{ $labels.node }} has been down for more than 30 seconds"
      - alert: ConsulServiceUnhealthy
        expr: consul_health_service_status{status="critical"} == 1
        for: 60s
        labels:
          severity: warning
        annotations:
          summary: "Consul service {{ $labels.service_name }} is unhealthy"
          description: "Service {{ $labels.service_name }} on node {{ $labels.node }} has been unhealthy for more than 60 seconds"
      - alert: ConsulNoLeader
        expr: consul_raft_leader != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Consul cluster has no leader"
          description: "Consul cluster has been without a leader for more than 1 minute"
      - alert: ConsulHighMemoryUsage
        expr: consul_runtime_alloc_bytes / consul_runtime_sys_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage in Consul"
          description: "Consul memory usage is above 90%"
EOF
chown prometheus:prometheus "$PROM_RULES_DIR/consul_alerts.yml"
chmod 644 "$PROM_RULES_DIR/consul_alerts.yml"
# Step 4: Configure Grafana
echo -e "${GREEN}[4/8] Configuring Grafana...${NC}"
sed -i "s/;admin_password = admin/admin_password = $GRAFANA_ADMIN_PASS/" "$GRAFANA_CONFIG"
chown root:$GRAFANA_USER "$GRAFANA_CONFIG"
chmod 640 "$GRAFANA_CONFIG"
# Step 5: Configure firewall
echo -e "${GREEN}[5/8] Configuring firewall...${NC}"
if command -v firewall-cmd >/dev/null 2>&1; then
  systemctl start firewalld 2>/dev/null || true
  firewall-cmd --permanent --add-port=9090/tcp --quiet || true
  firewall-cmd --permanent --add-port=3000/tcp --quiet || true
  firewall-cmd --reload --quiet || true
elif command -v ufw >/dev/null 2>&1; then
  ufw allow 9090/tcp >/dev/null 2>&1 || true
  ufw allow 3000/tcp >/dev/null 2>&1 || true
fi
# Step 6: Start and enable services
echo -e "${GREEN}[6/8] Starting monitoring services...${NC}"
systemctl enable prometheus grafana-server
systemctl restart prometheus
systemctl restart grafana-server
# Wait for services to start
sleep 5
# Step 7: Verify services are running
echo -e "${GREEN}[7/8] Verifying services...${NC}"
if ! systemctl is-active --quiet prometheus; then
  echo -e "${RED}[ERROR] Prometheus failed to start${NC}"
  exit 1
fi
if ! systemctl is-active --quiet grafana-server; then
  echo -e "${RED}[ERROR] Grafana failed to start${NC}"
  exit 1
fi
# Step 8: Test connectivity and add Prometheus datasource to Grafana
echo -e "${GREEN}[8/8] Configuring Grafana datasource...${NC}"
sleep 10
# Wait for Grafana to be fully ready
for i in {1..30}; do
  if curl -s http://localhost:3000/api/health >/dev/null 2>&1; then
    break
  fi
  sleep 2
done
# Add Prometheus as datasource
curl -s -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://localhost:9090",
    "access": "proxy",
    "isDefault": true
  }' \
  "http://admin:${GRAFANA_ADMIN_PASS}@localhost:3000/api/datasources" >/dev/null || true
echo -e "${GREEN}[SUCCESS] Consul monitoring setup completed!${NC}"
echo -e "${BLUE}Services configured:${NC}"
echo " - Prometheus: http://localhost:9090"
echo " - Grafana: http://localhost:3000 (admin/$GRAFANA_ADMIN_PASS)"
echo ""
echo -e "${YELLOW}[INFO] Next steps:${NC}"
echo "1. Configure Consul telemetry in your consul.hcl:"
echo " telemetry {"
echo " prometheus_retention_time = \"30s\""
echo " disable_hostname = true"
echo " }"
echo "2. Restart Consul: systemctl restart consul"
echo "3. Test metrics: curl http://$CONSUL_HOST:8500/v1/agent/metrics?format=prometheus"
echo "4. Import Consul dashboards in Grafana"
Save the script as install.sh and review it before running. Since it must run as root, execute with: sudo bash install.sh [consul_host] [grafana_admin_password]