Implement Consul backup and disaster recovery with automated snapshots and restoration

Intermediate · 45 min · Apr 24, 2026
Ubuntu 24.04 Debian 12 AlmaLinux 9 Rocky Linux 9

Set up automated Consul snapshots with GPG encryption, systemd timers, and complete disaster recovery procedures. Includes monitoring integration with Prometheus and automated restoration workflows for production environments.

What this solves

Consul stores critical service discovery data, key-value pairs, and configuration that your applications depend on. Without proper backups, a cluster failure can bring down your entire infrastructure. This tutorial shows you how to implement automated Consul snapshots with encryption, monitoring, and disaster recovery procedures to protect against data loss and minimize downtime during failures.

Prerequisites

  • Running Consul cluster with ACL enabled
  • Root or sudo access to all Consul nodes
  • GPG installed for backup encryption
  • Basic understanding of systemd services

Step-by-step configuration

Install required packages

Install GPG for encryption, the AWS CLI for remote storage, and supporting tools. Run the command that matches your distribution (on AlmaLinux/Rocky the awscli package may need the EPEL repository enabled):

# Debian / Ubuntu
sudo apt update
sudo apt install -y gnupg2 awscli jq curl

# AlmaLinux 9 / Rocky Linux 9
sudo dnf install -y gnupg2 awscli jq curl

Create backup user and directories

Create a dedicated user for backup operations with minimal privileges.

sudo useradd -r -s /bin/bash -d /opt/consul-backup consul-backup
sudo mkdir -p /opt/consul-backup/{scripts,backups,logs,keys}
sudo chown -R consul-backup:consul-backup /opt/consul-backup
sudo chmod 755 /opt/consul-backup
sudo chmod 700 /opt/consul-backup/{backups,keys}
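As a quick sanity check, the layout above can be exercised in a throwaway directory first (a sketch using mktemp rather than the real /opt/consul-backup):

```shell
# Sketch: recreate the directory layout in a temp dir and confirm the modes.
# stat -c %a prints the octal permission bits (GNU coreutils).
base=$(mktemp -d)
mkdir -p "$base"/{scripts,backups,logs,keys}
chmod 755 "$base"
chmod 700 "$base"/{backups,keys}

mode_keys=$(stat -c %a "$base/keys")
mode_backups=$(stat -c %a "$base/backups")
echo "keys=$mode_keys backups=$mode_backups"
```

The keys and backups directories must be owner-only (700), since they will hold the GPG keyring ID, the ACL token, and unencrypted snapshots in transit.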

Generate GPG encryption key

Create a GPG key pair for encrypting backup files, and store the private key and passphrase securely offline: without them, backups cannot be restored. Note that the backup script later decrypts each snapshot to verify integrity; with a passphrase-protected key that step needs a pinentry, so for fully unattended systemd runs either supply the passphrase via --pinentry-mode loopback with --passphrase-file, or replace the Passphrase line below with %no-protection.

sudo -H -u consul-backup gpg --batch --gen-key <<EOF
Key-Type: RSA
Key-Length: 4096
Name-Real: Consul Backup
Name-Email: consul-backup@example.com
Expire-Date: 0
Passphrase: YourSecurePassphrase123!
%commit
EOF

List the generated key and record its fingerprint for the backup script (the -H flag makes sudo set HOME to consul-backup's home directory so gpg uses the correct keyring):

sudo -H -u consul-backup gpg --list-keys
export CONSUL_GPG_KEY=$(sudo -H -u consul-backup gpg --list-keys --with-colons | grep '^fpr' | head -1 | cut -d: -f10)
echo "CONSUL_GPG_KEY=$CONSUL_GPG_KEY" | sudo tee /opt/consul-backup/keys/gpg-key-id
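The extraction above leans on gpg's machine-readable colon format: the record type is field 1, and in an fpr record the fingerprint sits in field 10. A self-contained sketch with a made-up fingerprint shows the parsing in isolation:

```shell
# Illustrative fpr record (this fingerprint is fake); field 10 holds the value.
sample='fpr:::::::::ABCDEF0123456789ABCDEF0123456789ABCDEF01:'
key=$(printf '%s\n' "$sample" | grep '^fpr' | head -1 | cut -d: -f10)
echo "$key"
```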

Create Consul ACL token for backups

Generate a dedicated ACL token with minimal permissions for snapshot operations.

consul acl policy create -name "consul-backup" -rules - <<EOF
acl = "read"
key_prefix "" {
  policy = "read"
}
node_prefix "" {
  policy = "read"
}
operator = "read"
service_prefix "" {
  policy = "read"
}
session_prefix "" {
  policy = "read"
}
EOF

Create the token and store it securely. (If snapshot save later fails with a permission error, note that on many Consul versions the snapshot API requires a management-level token; in that case attach the built-in global-management policy instead of the policy above.)

BACKUP_TOKEN=$(consul acl token create -policy-name "consul-backup" -description "Consul backup token" -format json | jq -r '.SecretID')
echo "CONSUL_HTTP_TOKEN=$BACKUP_TOKEN" | sudo tee /opt/consul-backup/keys/consul-token
sudo chown consul-backup:consul-backup /opt/consul-backup/keys/consul-token
sudo chmod 600 /opt/consul-backup/keys/consul-token
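The scripts below consume this file by sourcing it as shell variables. One subtlety: a sourced variable is not automatically passed to child processes, so the consul CLI only sees the token after an explicit export. A minimal sketch with a dummy token:

```shell
# Write a key=value file formatted like the token file above (dummy value).
tokfile=$(mktemp)
echo 'CONSUL_HTTP_TOKEN=11111111-2222-3333-4444-555555555555' > "$tokfile"

source "$tokfile"          # sets the shell variable
export CONSUL_HTTP_TOKEN   # required so child processes (consul) inherit it
echo "$CONSUL_HTTP_TOKEN"
```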

Create backup script

Create the main backup script at /opt/consul-backup/scripts/consul-backup.sh; it handles snapshot creation, encryption, and remote storage.

#!/bin/bash
set -euo pipefail

# Configuration
BACKUP_DIR="/opt/consul-backup/backups"
LOG_FILE="/opt/consul-backup/logs/backup.log"
RETENTION_DAYS=30
S3_BUCKET="your-consul-backups"
CONSUL_ADDR="http://localhost:8500"
GPG_KEY_FILE="/opt/consul-backup/keys/gpg-key-id"
TOKEN_FILE="/opt/consul-backup/keys/consul-token"

# Source configuration files
[ -f "$GPG_KEY_FILE" ] && source "$GPG_KEY_FILE"
[ -f "$TOKEN_FILE" ] && source "$TOKEN_FILE"
export CONSUL_HTTP_TOKEN=${CONSUL_HTTP_TOKEN:-}   # make the token visible to the consul CLI

# Logging function
log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" | tee -a "$LOG_FILE"
}

# Create timestamped filenames
TIMESTAMP=$(date '+%Y%m%d_%H%M%S')
SNAPSHOT_FILE="consul-snapshot-$TIMESTAMP.snap"
ENCRYPTED_FILE="$SNAPSHOT_FILE.gpg"

log "Starting Consul backup process"

# Check Consul connectivity
if ! consul members &>/dev/null; then
    log "ERROR: Cannot connect to Consul cluster"
    exit 1
fi

# Create snapshot
log "Creating Consul snapshot"
if consul snapshot save -http-addr="$CONSUL_ADDR" "$BACKUP_DIR/$SNAPSHOT_FILE"; then
    log "Snapshot created successfully: $SNAPSHOT_FILE"
else
    log "ERROR: Failed to create snapshot"
    exit 1
fi

# Encrypt snapshot
log "Encrypting snapshot with GPG"
if gpg --trust-model always --encrypt -r "$CONSUL_GPG_KEY" --cipher-algo AES256 \
       --compress-algo 2 --output "$BACKUP_DIR/$ENCRYPTED_FILE" "$BACKUP_DIR/$SNAPSHOT_FILE"; then
    log "Snapshot encrypted successfully: $ENCRYPTED_FILE"
    rm "$BACKUP_DIR/$SNAPSHOT_FILE"   # remove the unencrypted copy
else
    log "ERROR: Failed to encrypt snapshot"
    exit 1
fi

# Upload to S3 (optional)
if [ -n "${S3_BUCKET:-}" ]; then
    log "Uploading encrypted snapshot to S3"
    if aws s3 cp "$BACKUP_DIR/$ENCRYPTED_FILE" "s3://$S3_BUCKET/consul-backups/$ENCRYPTED_FILE"; then
        log "Snapshot uploaded to S3 successfully"
    else
        log "WARNING: Failed to upload snapshot to S3"
    fi
fi

# Clean up old local backups
log "Cleaning up backups older than $RETENTION_DAYS days"
find "$BACKUP_DIR" -name "consul-snapshot-*.snap.gpg" -type f -mtime +"$RETENTION_DAYS" -delete

# Verify backup integrity
log "Verifying backup integrity"
if gpg --trust-model always --decrypt "$BACKUP_DIR/$ENCRYPTED_FILE" >/dev/null 2>&1; then
    log "Backup integrity verification successful"
else
    log "ERROR: Backup integrity verification failed"
    exit 1
fi

# Update metrics file for monitoring
echo "consul_backup_last_success_timestamp $(date +%s)" > /opt/consul-backup/logs/backup-metrics.prom
echo "consul_backup_file_size_bytes $(stat -c%s "$BACKUP_DIR/$ENCRYPTED_FILE")" >> /opt/consul-backup/logs/backup-metrics.prom

log "Consul backup completed successfully"
log "Backup file: $ENCRYPTED_FILE"
log "File size: $(du -h "$BACKUP_DIR/$ENCRYPTED_FILE" | cut -f1)"

Make the script executable:

sudo chmod +x /opt/consul-backup/scripts/consul-backup.sh
sudo chown consul-backup:consul-backup /opt/consul-backup/scripts/consul-backup.sh
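The retention rule in the script is easy to verify in isolation: find -mtime +N matches files last modified more than N*24 hours ago. A sketch with faked timestamps in a temp dir:

```shell
RETENTION_DAYS=30
dir=$(mktemp -d)

# One file "40 days old", one fresh (GNU touch accepts relative dates).
touch -d "40 days ago" "$dir/consul-snapshot-old.snap.gpg"
touch "$dir/consul-snapshot-new.snap.gpg"

# Same expression the backup script uses for cleanup.
find "$dir" -name "consul-snapshot-*.snap.gpg" -type f -mtime +"$RETENTION_DAYS" -delete

survivors=$(ls "$dir")
echo "$survivors"
```

Only the fresh file should survive; the 40-day-old snapshot falls past the 30-day window and is deleted.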

Create restore script

Create a disaster recovery script at /opt/consul-backup/scripts/consul-restore.sh for restoring from encrypted snapshots.

#!/bin/bash
set -euo pipefail

# Configuration
BACKUP_DIR="/opt/consul-backup/backups"
LOG_FILE="/opt/consul-backup/logs/restore.log"
CONSUL_ADDR="http://localhost:8500"
TOKEN_FILE="/opt/consul-backup/keys/consul-token"

# Source token file
[ -f "$TOKEN_FILE" ] && source "$TOKEN_FILE"
export CONSUL_HTTP_TOKEN=${CONSUL_HTTP_TOKEN:-}

# Logging function
log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" | tee -a "$LOG_FILE"
}

# Check that a backup file is provided
if [ $# -eq 0 ]; then
    echo "Usage: $0 <encrypted-snapshot-file>"
    echo "Available backups:"
    ls -la "$BACKUP_DIR"/*.snap.gpg 2>/dev/null || echo "No backups found"
    exit 1
fi

ENCRYPTED_FILE="$1"
SNAPSHOT_FILE="${ENCRYPTED_FILE%.gpg}"

# Verify the file exists
if [ ! -f "$BACKUP_DIR/$ENCRYPTED_FILE" ]; then
    log "ERROR: Encrypted snapshot file not found: $BACKUP_DIR/$ENCRYPTED_FILE"
    exit 1
fi

log "Starting Consul restore process"
log "Restoring from: $ENCRYPTED_FILE"

# Decrypt snapshot
log "Decrypting snapshot"
if gpg --trust-model always --decrypt "$BACKUP_DIR/$ENCRYPTED_FILE" > "$BACKUP_DIR/$SNAPSHOT_FILE"; then
    log "Snapshot decrypted successfully"
else
    log "ERROR: Failed to decrypt snapshot"
    exit 1
fi

# Confirm the restore operation
read -p "Are you sure you want to restore Consul data? This will overwrite existing data. (yes/no): " -r
if [[ ! $REPLY =~ ^[Yy][Ee][Ss]$ ]]; then
    log "Restore operation cancelled"
    rm "$BACKUP_DIR/$SNAPSHOT_FILE"
    exit 0
fi

# NOTE: consul snapshot restore requires a running cluster with an elected
# leader; the restored state is then replicated to the other servers.
echo "Ensure the Consul cluster is up and has elected a leader before continuing"
read -p "Press Enter to proceed with the restore..."

# Restore snapshot
log "Restoring Consul snapshot"
if consul snapshot restore -http-addr="$CONSUL_ADDR" "$BACKUP_DIR/$SNAPSHOT_FILE"; then
    log "Snapshot restored successfully"
else
    log "ERROR: Failed to restore snapshot"
    rm "$BACKUP_DIR/$SNAPSHOT_FILE"
    exit 1
fi

# Clean up the decrypted file
rm "$BACKUP_DIR/$SNAPSHOT_FILE"
log "Consul restore completed successfully"
log "Verify cluster health on all nodes once the restore finishes"

Make the restore script executable:

sudo chmod +x /opt/consul-backup/scripts/consul-restore.sh
sudo chown consul-backup:consul-backup /opt/consul-backup/scripts/consul-restore.sh
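The restore script derives the decrypted filename from its argument with bash suffix stripping: ${var%pattern} removes the shortest matching suffix. A sketch with an illustrative filename:

```shell
# Illustrative filename; the real scripts name snapshots with a timestamp.
ENCRYPTED_FILE="consul-snapshot-20260424_020000.snap.gpg"
SNAPSHOT_FILE="${ENCRYPTED_FILE%.gpg}"
echo "$SNAPSHOT_FILE"   # -> consul-snapshot-20260424_020000.snap
```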

Configure automated backups with systemd

Create a systemd service at /etc/systemd/system/consul-backup.service for running backups.

[Unit]
Description=Consul Backup Service
After=consul.service
Requires=consul.service

[Service]
Type=oneshot
User=consul-backup
Group=consul-backup
ExecStart=/opt/consul-backup/scripts/consul-backup.sh
Environment="HOME=/opt/consul-backup"
WorkingDirectory=/opt/consul-backup
StandardOutput=append:/opt/consul-backup/logs/backup.log
StandardError=append:/opt/consul-backup/logs/backup.log

[Install]
WantedBy=multi-user.target

Create a systemd timer at /etc/systemd/system/consul-backup.timer for automated execution:

[Unit]
Description=Run Consul Backup Daily
Requires=consul-backup.service

[Timer]
OnCalendar=daily
RandomizedDelaySec=3600
Persistent=true

[Install]
WantedBy=timers.target

Enable and start the timer:

sudo systemctl daemon-reload
sudo systemctl enable consul-backup.timer
sudo systemctl start consul-backup.timer
sudo systemctl status consul-backup.timer

Configure backup monitoring

Create a monitoring script at /opt/consul-backup/scripts/backup-monitor.sh that exports metrics in the Prometheus textfile format.

#!/bin/bash
set -euo pipefail

# Configuration
BACKUP_DIR="/opt/consul-backup/backups"
METRICS_FILE="/opt/consul-backup/logs/backup-metrics.prom"
MAX_AGE_HOURS=26   # Alert if the newest backup is older than 26 hours

# Initialize metrics
{
  echo "# HELP consul_backup_last_success_timestamp Unix timestamp of last successful backup"
  echo "# TYPE consul_backup_last_success_timestamp gauge"
  echo "# HELP consul_backup_file_size_bytes Size of latest backup file in bytes"
  echo "# TYPE consul_backup_file_size_bytes gauge"
  echo "# HELP consul_backup_age_hours Age of latest backup in hours"
  echo "# TYPE consul_backup_age_hours gauge"
} > "$METRICS_FILE"

# Find the newest backup
LATEST_BACKUP=$(find "$BACKUP_DIR" -name "consul-snapshot-*.snap.gpg" -type f -printf '%T@ %p\n' \
    | sort -n | tail -1 | cut -d' ' -f2-)

if [ -n "$LATEST_BACKUP" ]; then
    # Get backup timestamp and size
    BACKUP_TIMESTAMP=$(stat -c %Y "$LATEST_BACKUP")
    BACKUP_SIZE=$(stat -c %s "$LATEST_BACKUP")
    CURRENT_TIME=$(date +%s)
    AGE_HOURS=$(( (CURRENT_TIME - BACKUP_TIMESTAMP) / 3600 ))

    echo "consul_backup_last_success_timestamp $BACKUP_TIMESTAMP" >> "$METRICS_FILE"
    echo "consul_backup_file_size_bytes $BACKUP_SIZE" >> "$METRICS_FILE"
    echo "consul_backup_age_hours $AGE_HOURS" >> "$METRICS_FILE"

    # Health check
    if [ "$AGE_HOURS" -gt "$MAX_AGE_HOURS" ]; then
        echo "consul_backup_healthy 0" >> "$METRICS_FILE"
    else
        echo "consul_backup_healthy 1" >> "$METRICS_FILE"
    fi
else
    echo "consul_backup_healthy 0" >> "$METRICS_FILE"
    echo "consul_backup_age_hours 999" >> "$METRICS_FILE"
fi

# Export metrics to the node_exporter textfile directory (if available)
if [ -d "/var/lib/prometheus/node-exporter" ]; then
    cp "$METRICS_FILE" "/var/lib/prometheus/node-exporter/consul-backup.prom"
fi

Make the monitoring script executable and create a cron job:

sudo chmod +x /opt/consul-backup/scripts/backup-monitor.sh
sudo chown consul-backup:consul-backup /opt/consul-backup/scripts/backup-monitor.sh

Add to consul-backup user's crontab:

sudo -u consul-backup crontab -l 2>/dev/null | { cat; echo "*/5 * * * * /opt/consul-backup/scripts/backup-monitor.sh"; } | sudo -u consul-backup crontab -
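The health decision the monitor makes — compare the newest backup's mtime against MAX_AGE_HOURS — can be tested without a real snapshot by faking a file's mtime:

```shell
MAX_AGE_HOURS=26
f=$(mktemp)
touch -d "31 hours ago" "$f"   # pretend the last backup ran 31 hours ago

# Same arithmetic as backup-monitor.sh
BACKUP_TIMESTAMP=$(stat -c %Y "$f")
CURRENT_TIME=$(date +%s)
AGE_HOURS=$(( (CURRENT_TIME - BACKUP_TIMESTAMP) / 3600 ))

if [ "$AGE_HOURS" -gt "$MAX_AGE_HOURS" ]; then healthy=0; else healthy=1; fi
echo "age=${AGE_HOURS}h healthy=$healthy"
```

A 31-hour-old backup exceeds the 26-hour threshold, so consul_backup_healthy would be exported as 0 and the Prometheus alert below would eventually fire.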

Configure S3 storage (optional)

Set up AWS credentials for remote backup storage. Create the configuration directory first:

sudo mkdir -p /opt/consul-backup/.aws
sudo chown consul-backup:consul-backup /opt/consul-backup/.aws
sudo chmod 700 /opt/consul-backup/.aws

Then create /opt/consul-backup/.aws/credentials:

[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY

And /opt/consul-backup/.aws/config (the AWS CLI reads the region from the config file, not the credentials file):

[default]
region = us-west-2

Restrict both files:

sudo chown consul-backup:consul-backup /opt/consul-backup/.aws/credentials /opt/consul-backup/.aws/config
sudo chmod 600 /opt/consul-backup/.aws/credentials /opt/consul-backup/.aws/config

Create the S3 bucket:

aws s3 mb s3://your-consul-backups
aws s3api put-bucket-versioning --bucket your-consul-backups --versioning-configuration Status=Enabled
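Local snapshots are pruned after RETENTION_DAYS, but nothing above expires the copies in S3. A bucket lifecycle rule keeps remote retention bounded. This is a sketch: the 90-day window is an assumption to adjust to your own policy, and the consul-backups/ prefix matches the upload path used by the backup script.

```shell
# Generate a lifecycle policy that expires objects under consul-backups/
# after 90 days (assumed retention window; change Days to suit).
cat > /tmp/consul-backup-lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "expire-consul-snapshots",
      "Status": "Enabled",
      "Filter": { "Prefix": "consul-backups/" },
      "Expiration": { "Days": 90 }
    }
  ]
}
EOF
echo "wrote /tmp/consul-backup-lifecycle.json"
```

Apply it with: aws s3api put-bucket-lifecycle-configuration --bucket your-consul-backups --lifecycle-configuration file:///tmp/consul-backup-lifecycle.json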

Create disaster recovery documentation

Document the complete disaster recovery procedure.

# Consul Disaster Recovery Procedure

Emergency Restore Steps

  1. Stop all Consul agents across the cluster:
     sudo systemctl stop consul
  2. Clean the Consul data directory on all nodes:
     sudo rm -rf /opt/consul/data/*
  3. Start the Consul servers again and wait for leader election (consul snapshot restore requires a running cluster with an elected leader):
     sudo systemctl start consul
     consul operator raft list-peers
  4. Choose a restore point:
     ls -la /opt/consul-backup/backups/
  5. Restore from backup (run once, against the leader):
     sudo -u consul-backup /opt/consul-backup/scripts/consul-restore.sh consul-snapshot-YYYYMMDD_HHMMSS.snap.gpg
  6. Verify cluster health and stability:
     consul members
     consul operator raft list-peers
  7. Start Consul on any remaining client agents:
     sudo systemctl start consul

Recovery Verification

  • Check cluster health: consul members
  • Verify services: consul catalog services
  • Test KV store: consul kv get -recurse
  • Monitor logs: sudo journalctl -u consul -f

Emergency Contacts

  • Infrastructure Team: [contact info]
  • On-call Engineer: [contact info]

Verify your setup

Test the backup and restore process to ensure everything works correctly.

# Test manual backup
sudo -u consul-backup /opt/consul-backup/scripts/consul-backup.sh

Check backup files

ls -la /opt/consul-backup/backups/

Verify systemd timer status

sudo systemctl status consul-backup.timer
sudo systemctl list-timers consul-backup.timer

Check monitoring metrics

cat /opt/consul-backup/logs/backup-metrics.prom

View backup logs

tail -f /opt/consul-backup/logs/backup.log

Test decryption (without restoring)

sudo -H -u consul-backup gpg --decrypt /opt/consul-backup/backups/consul-snapshot-*.snap.gpg > /dev/null

Verify Consul connectivity

consul members
consul kv put test/backup "$(date)"   # use a token with KV write access; the backup token is read-only
consul kv get test/backup

Configure Prometheus alerting

If you have Prometheus monitoring set up, add these alerting rules to watch backup health.

groups:
  - name: consul-backup
    rules:
      - alert: ConsulBackupFailed
        expr: consul_backup_healthy == 0
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "Consul backup is failing"
          description: "Consul backup has not completed successfully in the last 26 hours"
      - alert: ConsulBackupAging
        expr: consul_backup_age_hours > 30
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Consul backup is getting old"
          description: "Last Consul backup is {{ $value }} hours old"
      - alert: ConsulBackupSizeAnomaly
        expr: |
          consul_backup_file_size_bytes < 1000
          or consul_backup_file_size_bytes > 1000000000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Consul backup size is unusual"
          description: "Consul backup file size is {{ $value }} bytes, which seems unusual"

Common issues

| Symptom | Cause | Fix |
| --- | --- | --- |
| Permission denied creating snapshot | Insufficient ACL token permissions | Inspect the token with consul acl token read -id $TOKEN_ID; the snapshot API may require a management-level token |
| GPG encryption fails | GPG key not found or expired | List keys with sudo -H -u consul-backup gpg --list-keys and regenerate if needed |
| Backup script fails silently | Missing environment variables | Check the log file: tail /opt/consul-backup/logs/backup.log |
| S3 upload fails | Invalid AWS credentials or permissions | Test the AWS CLI with aws s3 ls and verify IAM permissions |
| Restore fails with "no leader" | Cluster has no elected leader | Start the Consul servers and wait for leader election before restoring |
| Timer not running backups | systemd timer not enabled | Enable it: sudo systemctl enable --now consul-backup.timer |
| Monitoring metrics not updating | node_exporter textfile directory missing | Create it: sudo mkdir -p /var/lib/prometheus/node-exporter |

Next steps

Running this in production?

Need this managed for you? Setting up Consul backups once is straightforward. Keeping them tested, monitored, and ready for real disaster scenarios across environments is the harder part. See how we run infrastructure like this for European SaaS and fintech teams.

