Set up CockroachDB backup and disaster recovery automation with systemd timers and monitoring

Advanced 90 min May 29, 2026 98 views
Ubuntu 24.04 Debian 12 AlmaLinux 9 Rocky Linux 9

Configure automated backup strategies for CockroachDB with systemd timers, implement comprehensive disaster recovery procedures, and set up monitoring with Prometheus and Grafana for production-grade database infrastructure.

Prerequisites

  • Root or sudo access
  • S3-compatible storage account
  • 3 or more servers for cluster
  • Basic understanding of SQL and systemd

What this solves

CockroachDB clusters need automated backup strategies and disaster recovery procedures to protect against data loss and maintain business continuity. This tutorial sets up automated database backups using systemd timers, configures cross-region disaster recovery with point-in-time recovery capabilities, and implements comprehensive monitoring with alerting to ensure your distributed SQL database remains resilient and recoverable.

Step-by-step installation

Install CockroachDB cluster

First, install CockroachDB on your primary cluster nodes. We'll set up a three-node cluster for high availability.

wget -qO- https://binaries.cockroachdb.com/cockroach-v24.3.0.linux-amd64.tgz | tar xz
sudo mv cockroach-v24.3.0.linux-amd64/cockroach /usr/local/bin/
sudo chmod 755 /usr/local/bin/cockroach

Create CockroachDB user and directories

Create a dedicated user for CockroachDB and set up the required directory structure with proper permissions.

sudo useradd -m -s /bin/bash cockroach
sudo mkdir -p /var/lib/cockroach /var/log/cockroach /etc/cockroach
sudo chown cockroach:cockroach /var/lib/cockroach /var/log/cockroach
sudo chmod 750 /var/lib/cockroach /var/log/cockroach
sudo chmod 755 /etc/cockroach

Generate cluster certificates

Create TLS certificates for secure cluster communication. The commands below create a certificate authority, a node certificate valid for all three node addresses, and a client certificate for the root user.

sudo mkdir -p /etc/cockroach/certs /etc/cockroach/private
sudo cockroach cert create-ca --certs-dir=/etc/cockroach/certs --ca-key=/etc/cockroach/private/ca.key
sudo cockroach cert create-node localhost 203.0.113.10 203.0.113.11 203.0.113.12 --certs-dir=/etc/cockroach/certs --ca-key=/etc/cockroach/private/ca.key
sudo cockroach cert create-client root --certs-dir=/etc/cockroach/certs --ca-key=/etc/cockroach/private/ca.key
sudo chown -R cockroach:cockroach /etc/cockroach/certs
sudo chmod 400 /etc/cockroach/private/ca.key
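
Before starting the nodes, it can help to confirm that the private keys are readable only by their owner. The following is a minimal sketch, not part of the original procedure: `key_mode_ok` is a hypothetical helper name, it relies on GNU `stat`, and it assumes the key file names that `cockroach cert` writes by default (`ca.key`, `node.key`).

```bash
#!/bin/bash
# Hypothetical helper: succeed only if FILE is accessible by its owner alone.
# Uses GNU stat (coreutils).
key_mode_ok() {
    local file="$1"
    local mode
    mode=$(stat -c '%a' "$file") || return 2
    # Accept owner read-only (400) or owner read-write (600); nothing for group/other.
    case "$mode" in
        400|600) return 0 ;;
        *) echo "insecure mode $mode on $file" >&2; return 1 ;;
    esac
}

# Example (paths from the steps above; run as a user that can read them):
#   key_mode_ok /etc/cockroach/private/ca.key
#   key_mode_ok /etc/cockroach/certs/node.key
```

The check is deliberately strict: any group or world permission bit makes it fail, which mirrors the `chmod 400` applied to the CA key above.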

Configure CockroachDB systemd service

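Create a systemd service file at /etc/systemd/system/cockroach.service for automatic startup and process management. On each node, set --listen-addr and --http-addr to that node's own address.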

[Unit]
Description=Cockroach Database cluster node
Requires=network.target
After=network.target

[Service]
Type=notify
User=cockroach
Group=cockroach
ExecStart=/usr/local/bin/cockroach start --certs-dir=/etc/cockroach/certs --store=/var/lib/cockroach --listen-addr=203.0.113.10:26257 --http-addr=203.0.113.10:8080 --join=203.0.113.10:26257,203.0.113.11:26257,203.0.113.12:26257 --background=false
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal
SyslogIdentifier=cockroach
KillMode=mixed
KillSignal=SIGTERM
TimeoutStopSec=60

[Install]
WantedBy=multi-user.target

Initialize the cluster

Start the services on all nodes, then initialize the cluster from one node.

sudo systemctl daemon-reload
sudo systemctl enable cockroach
sudo systemctl start cockroach
sudo systemctl status cockroach

On the first node only, initialize the cluster:

sudo -u cockroach cockroach init --certs-dir=/etc/cockroach/certs --host=203.0.113.10:26257
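
After initialization, a node can take a short while to report ready. The sketch below shows one way to wait for that with a generic retry loop. The `retry` function is a hypothetical helper, and the commented example assumes the node's HTTP health endpoint at `/health?ready=1` on the `--http-addr` configured above.

```bash
#!/bin/bash
# retry ATTEMPTS DELAY_SECONDS COMMAND...
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry() {
    local attempts="$1" delay="$2"
    shift 2
    local i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $i/$attempts failed: $*" >&2
        # Do not sleep after the final attempt.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Example: wait up to about a minute for the first node to report ready.
#   retry 12 5 curl --silent --fail --cacert /etc/cockroach/certs/ca.crt \
#       "https://203.0.113.10:8080/health?ready=1"
```

Keeping the retry logic separate from the probe means the same helper can wrap a `cockroach sql --execute="SELECT 1"` check instead, if that suits your environment better.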

Install backup dependencies

Install required packages for backup automation and monitoring.

On Ubuntu and Debian:

sudo apt update
sudo apt install -y awscli postgresql-client-common curl jq

On AlmaLinux and Rocky Linux:

sudo dnf install -y awscli postgresql curl jq

Configure S3-compatible backup storage

Store the credentials for the S3-compatible storage in /etc/cockroach/backup-config. The backup, restore, and metrics scripts in the following steps source this file.

#!/bin/bash
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_DEFAULT_REGION="us-east-1"
export BACKUP_BUCKET="cockroachdb-backups"
export CLUSTER_NAME="production"
export BACKUP_RETENTION_DAYS="30"

Then restrict the file to the cockroach user, because it contains secrets:

sudo chown cockroach:cockroach /etc/cockroach/backup-config
sudo chmod 600 /etc/cockroach/backup-config
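
Because every later script depends on this file, a script can fail early when a setting is missing instead of producing a malformed S3 path. A minimal sketch, assuming bash; `require_vars` is a hypothetical helper, not something the original scripts define:

```bash
#!/bin/bash
# Hypothetical helper: fail when any named variable is unset or empty.
require_vars() {
    local name missing=0
    for name in "$@"; do
        # ${!name} is bash indirect expansion: the value of the variable named by $name.
        if [ -z "${!name:-}" ]; then
            echo "missing required setting: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example, after sourcing /etc/cockroach/backup-config:
#   require_vars AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY BACKUP_BUCKET \
#       CLUSTER_NAME BACKUP_RETENTION_DAYS || exit 1
```

The helper reports every missing name before returning, so a single run shows all the gaps in the configuration.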

Create backup automation script

Create the backup script at /usr/local/bin/cockroach-backup.sh. It takes a full backup, logs to /var/log/cockroach, records the timestamp of the last successful run, and deletes full backups older than the retention period.

#!/bin/bash
set -euo pipefail

# Source configuration
source /etc/cockroach/backup-config

# Logging setup
LOGFILE="/var/log/cockroach/backup-$(date +%Y%m%d-%H%M%S).log"
exec 1> >(tee -a "$LOGFILE")
exec 2>&1

echo "[$(date)] Starting CockroachDB backup"

# Backup timestamp
BACKUP_TIMESTAMP=$(date +%Y%m%d-%H%M%S)
BACKUP_PATH="s3://${BACKUP_BUCKET}/${CLUSTER_NAME}/full/${BACKUP_TIMESTAMP}"

# The cluster nodes do not inherit the AWS_* variables exported in this shell,
# so the credentials are passed as URI parameters. URL-encode the secret if it
# contains characters such as "/" or "+".
S3_AUTH="AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}&AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}"

# Perform backup
# Note: BACKUP ... TO has been deprecated since v22.1 in favor of BACKUP ... INTO.
echo "[$(date)] Creating full backup to: $BACKUP_PATH"
if cockroach sql --certs-dir=/etc/cockroach/certs --host=localhost:26257 \
    --execute="BACKUP TO '${BACKUP_PATH}?${S3_AUTH}' WITH revision_history;"; then
    echo "[$(date)] Backup completed successfully"

    # Update latest backup marker
    echo "$BACKUP_TIMESTAMP" > /var/lib/cockroach/last-backup

    # Cleanup old backups. Each backup is an S3 prefix, which "aws s3 ls"
    # prints as "PRE <name>/", so the age is derived from the timestamp in the name.
    echo "[$(date)] Cleaning up backups older than $BACKUP_RETENTION_DAYS days"
    CUTOFF_EPOCH=$(date -d "$BACKUP_RETENTION_DAYS days ago" +%s)
    aws s3 ls "s3://${BACKUP_BUCKET}/${CLUSTER_NAME}/full/" \
        | awk '$1 == "PRE" {print $2}' | sed 's|/$||' \
        | while read -r BACKUP_NAME; do
            BACKUP_EPOCH=$(date -d "${BACKUP_NAME:0:8}" +%s 2>/dev/null) || continue
            if [[ $BACKUP_EPOCH -lt $CUTOFF_EPOCH ]]; then
                echo "[$(date)] Deleting old backup: $BACKUP_NAME"
                aws s3 rm "s3://${BACKUP_BUCKET}/${CLUSTER_NAME}/full/${BACKUP_NAME}/" --recursive
            fi
        done

    echo "[$(date)] Backup process completed successfully"
    exit 0
else
    echo "[$(date)] Backup failed with exit code $?"
    exit 1
fi

Make the script executable:

sudo chmod 755 /usr/local/bin/cockroach-backup.sh
sudo chown cockroach:cockroach /usr/local/bin/cockroach-backup.sh
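
The retention rule in the script compares a backup's age with a cutoff. The sketch below isolates that comparison so it can be checked without touching S3. It assumes GNU `date` and, unlike the script, interprets the timestamp in the backup name as UTC to keep the result independent of the machine's time zone. The function names are hypothetical.

```bash
#!/bin/bash
# backup_epoch NAME: convert a backup name such as 20241201-120000 (the format
# produced by `date +%Y%m%d-%H%M%S`) to a Unix timestamp, interpreted as UTC.
backup_epoch() {
    local name="$1" day clock
    day=$(printf '%s' "$name" | cut -c1-8)
    clock=$(printf '%s' "$name" | cut -c10-15 | sed 's/\(..\)\(..\)\(..\)/\1:\2:\3/')
    date -u -d "$day $clock" +%s
}

# is_expired NAME RETENTION_DAYS NOW_EPOCH: succeed when the backup is older
# than the retention window that ends at NOW_EPOCH.
is_expired() {
    local name="$1" days="$2" now="$3" ts
    ts=$(backup_epoch "$name") || return 2
    [ "$ts" -lt $((now - days * 86400)) ]
}

# Example: would this backup be deleted under a 30-day retention right now?
#   is_expired 20241201-120000 30 "$(date +%s)" && echo "expired"
```

Passing the current time as an argument, rather than calling `date` inside the function, makes the rule easy to test against fixed dates.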

Create incremental backup script

Create the incremental backup script at /usr/local/bin/cockroach-incremental-backup.sh. It captures the changes since the most recent full backup, which gives more frequent protection between full backups.

#!/bin/bash
set -euo pipefail

# Source configuration
source /etc/cockroach/backup-config

# Logging setup
LOGFILE="/var/log/cockroach/incremental-backup-$(date +%Y%m%d-%H%M%S).log"
exec 1> >(tee -a "$LOGFILE")
exec 2>&1

echo "[$(date)] Starting CockroachDB incremental backup"

# Get latest full backup
if [ ! -f "/var/lib/cockroach/last-backup" ]; then
    echo "[$(date)] No full backup found. Please run full backup first."
    exit 1
fi

LAST_FULL_BACKUP=$(cat /var/lib/cockroach/last-backup)
BACKUP_TIMESTAMP=$(date +%Y%m%d-%H%M%S)
FULL_BACKUP_PATH="s3://${BACKUP_BUCKET}/${CLUSTER_NAME}/full/${LAST_FULL_BACKUP}"
INCREMENTAL_PATH="s3://${BACKUP_BUCKET}/${CLUSTER_NAME}/incremental/${BACKUP_TIMESTAMP}"

# The cluster nodes do not inherit the AWS_* variables exported in this shell,
# so the credentials are passed as URI parameters. URL-encode the secret if it
# contains characters such as "/" or "+".
S3_AUTH="AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}&AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}"

echo "[$(date)] Creating incremental backup based on: $FULL_BACKUP_PATH"
echo "[$(date)] Incremental backup destination: $INCREMENTAL_PATH"

# Note: BACKUP ... TO has been deprecated since v22.1 in favor of BACKUP ... INTO.
if cockroach sql --certs-dir=/etc/cockroach/certs --host=localhost:26257 \
    --execute="BACKUP TO '${INCREMENTAL_PATH}?${S3_AUTH}' INCREMENTAL FROM '${FULL_BACKUP_PATH}?${S3_AUTH}' WITH revision_history;"; then
    echo "[$(date)] Incremental backup completed successfully"
    exit 0
else
    echo "[$(date)] Incremental backup failed with exit code $?"
    exit 1
fi

Make the script executable:

sudo chmod 755 /usr/local/bin/cockroach-incremental-backup.sh
sudo chown cockroach:cockroach /usr/local/bin/cockroach-incremental-backup.sh
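
Full and incremental backups are scheduled independently, so their runs can overlap. One way to guard against that, sketched here as an assumption rather than as part of the original setup, is to wrap each script in an exclusive lock with `flock` from util-linux. The function name, the lock path in the example, and the exit code 75 are illustrative choices.

```bash
#!/bin/bash
# run_exclusive LOCKFILE COMMAND...
# Runs COMMAND while holding an exclusive, non-blocking lock on LOCKFILE.
# Returns 75 if another process already holds the lock.
run_exclusive() {
    local lockfile="$1"
    shift
    (
        # File descriptor 9 is opened on the lock file by the redirection below.
        flock -n 9 || exit 75
        "$@"
    ) 9>"$lockfile"
}

# Example (hypothetical lock path, writable by the cockroach user):
#   run_exclusive /var/lib/cockroach/backup.lock \
#       /usr/local/bin/cockroach-incremental-backup.sh
```

With the non-blocking form, a run that finds the lock taken exits immediately instead of queueing behind the running backup, and the next timer tick simply tries again.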

Configure systemd timers for automated backups

Create a service unit and a timer unit for each backup type under /etc/systemd/system/. Each timer activates the service of the same name, so the timer units need no explicit dependency on their service.

/etc/systemd/system/cockroach-backup.service:

[Unit]
Description=CockroachDB Full Backup
Wants=network-online.target
After=network-online.target cockroach.service
Requires=cockroach.service

[Service]
Type=oneshot
User=cockroach
Group=cockroach
ExecStart=/usr/local/bin/cockroach-backup.sh
Environment=PATH=/usr/local/bin:/usr/bin:/bin
StandardOutput=journal
StandardError=journal

/etc/systemd/system/cockroach-backup.timer:

[Unit]
Description=Run CockroachDB full backup daily

[Timer]
OnCalendar=daily
RandomizedDelaySec=1800
Persistent=true

[Install]
WantedBy=timers.target

/etc/systemd/system/cockroach-incremental-backup.service:

[Unit]
Description=CockroachDB Incremental Backup
Wants=network-online.target
After=network-online.target cockroach.service
Requires=cockroach.service

[Service]
Type=oneshot
User=cockroach
Group=cockroach
ExecStart=/usr/local/bin/cockroach-incremental-backup.sh
Environment=PATH=/usr/local/bin:/usr/bin:/bin
StandardOutput=journal
StandardError=journal

/etc/systemd/system/cockroach-incremental-backup.timer:

[Unit]
Description=Run CockroachDB incremental backup every 4 hours

[Timer]
OnCalendar=*-*-* 00,04,08,12,16,20:00:00
RandomizedDelaySec=300
Persistent=true

[Install]
WantedBy=timers.target

Enable backup timers

Start and enable the systemd timers for automated backup execution.

sudo systemctl daemon-reload
sudo systemctl enable cockroach-backup.timer cockroach-incremental-backup.timer
sudo systemctl start cockroach-backup.timer cockroach-incremental-backup.timer
sudo systemctl status cockroach-backup.timer cockroach-incremental-backup.timer
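
To reason about when the next incremental run is due, the four-hourly schedule (00, 04, 08, 12, 16, 20) can be modeled with a line of integer arithmetic. This is only an illustrative sketch with a hypothetical function name; on a real host, `systemctl list-timers` shows the actual next trigger time, including the randomized delay.

```bash
#!/bin/bash
# next_incremental_hour HOUR: print the hour (0-23) of the first four-hourly
# slot that follows any time within hour HOUR. 0 means midnight of the next day.
# Pass HOUR without a leading zero (8, not 08) so it is not parsed as octal.
next_incremental_hour() {
    local hour="$1"
    echo $(( (hour / 4 + 1) * 4 % 24 ))
}

# Example: at 13:20 the next incremental backup is scheduled for 16:00.
#   next_incremental_hour 13
```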

Create disaster recovery script

Create the restore script at /usr/local/bin/cockroach-restore.sh. It takes the timestamp of a full backup and, optionally, a target time for point-in-time restoration.

#!/bin/bash
set -euo pipefail

# Source configuration
source /etc/cockroach/backup-config

if [ "$#" -lt 1 ]; then
    echo "Usage: $0 <backup-timestamp> [target-time]"
    echo "Example: $0 20241201-120000"
    echo "Example: $0 20241201-120000 '2024-12-01 15:30:00'"
    exit 1
fi

BACKUP_TIMESTAMP="$1"
TARGET_TIME="${2:-}"
BACKUP_PATH="s3://${BACKUP_BUCKET}/${CLUSTER_NAME}/full/${BACKUP_TIMESTAMP}"

# The cluster nodes do not inherit the AWS_* variables exported in this shell,
# so the credentials are passed as URI parameters. URL-encode the secret if it
# contains characters such as "/" or "+".
S3_AUTH="AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}&AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}"

# Logging setup
LOGFILE="/var/log/cockroach/restore-$(date +%Y%m%d-%H%M%S).log"
exec 1> >(tee -a "$LOGFILE")
exec 2>&1

echo "[$(date)] Starting CockroachDB restore from backup: $BACKUP_TIMESTAMP"

# Verify backup exists
echo "[$(date)] Verifying backup exists at: $BACKUP_PATH"
if ! aws s3 ls "$BACKUP_PATH/" > /dev/null; then
    echo "[$(date)] ERROR: Backup not found at $BACKUP_PATH"
    exit 1
fi

# Build restore command
# Note: RESTORE ... FROM '<uri>' has been deprecated since v22.1, like BACKUP ... TO.
if [ -n "$TARGET_TIME" ]; then
    RESTORE_CMD="RESTORE DATABASE defaultdb FROM '${BACKUP_PATH}?${S3_AUTH}' AS OF SYSTEM TIME '$TARGET_TIME' WITH skip_missing_foreign_keys;"
    echo "[$(date)] Performing point-in-time restore to: $TARGET_TIME"
else
    RESTORE_CMD="RESTORE DATABASE defaultdb FROM '${BACKUP_PATH}?${S3_AUTH}' WITH skip_missing_foreign_keys;"
    echo "[$(date)] Performing full restore to latest backup time"
fi

echo "[$(date)] Executing restore command"
if cockroach sql --certs-dir=/etc/cockroach/certs --host=localhost:26257 --execute="$RESTORE_CMD"; then
    echo "[$(date)] Restore completed successfully"
    exit 0
else
    echo "[$(date)] Restore failed with exit code $?"
    exit 1
fi

Make the script executable:

sudo chmod 755 /usr/local/bin/cockroach-restore.sh
sudo chown cockroach:cockroach /usr/local/bin/cockroach-restore.sh
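
The restore script places its first argument directly into an S3 path and a SQL statement, so it is worth rejecting anything that is not a well-formed timestamp before using it. A minimal sketch with a hypothetical function name, assuming the `YYYYmmdd-HHMMSS` naming used by the backup script:

```bash
#!/bin/bash
# valid_backup_timestamp VALUE: succeed only if VALUE is exactly eight digits,
# a hyphen, and six digits (for example 20241201-120000).
valid_backup_timestamp() {
    case "$1" in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9])
            return 0 ;;
        *)
            return 1 ;;
    esac
}

# Example, near the top of a restore script:
#   valid_backup_timestamp "$1" || { echo "invalid backup timestamp: $1" >&2; exit 1; }
```

A `case` pattern matches the whole string, so values with extra path segments or quotes are rejected without spawning an external process.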

Install Prometheus for monitoring

Install Prometheus to collect metrics from CockroachDB and backup processes.

wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xzf prometheus-2.45.0.linux-amd64.tar.gz
sudo mv prometheus-2.45.0.linux-amd64/prometheus /usr/local/bin/
sudo mv prometheus-2.45.0.linux-amd64/promtool /usr/local/bin/
sudo useradd -M -s /bin/false prometheus
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /etc/prometheus /var/lib/prometheus

Configure Prometheus for CockroachDB monitoring

Create /etc/prometheus/prometheus.yml so that Prometheus scrapes the CockroachDB nodes and loads the alerting rules.

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "/etc/prometheus/cockroach-rules.yml"

scrape_configs:
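  - job_name: 'cockroachdb'
    # Nodes started with --certs-dir serve /_status/vars over HTTPS.
    scheme: https
    tls_config:
      # Skips certificate verification; alternatively, set ca_file to a copy of the cluster's ca.crt.
      insecure_skip_verify: true
    static_configs: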
      - targets: 
        - '203.0.113.10:8080'
        - '203.0.113.11:8080'
        - '203.0.113.12:8080'
    metrics_path: '/_status/vars'
    scrape_interval: 10s

  - job_name: 'node-exporter'
    static_configs:
      - targets:
        - '203.0.113.10:9100'
        - '203.0.113.11:9100'
        - '203.0.113.12:9100'

  - job_name: 'backup-monitoring'
    static_configs:
      - targets: ['localhost:9090']
    metrics_path: '/metrics'
    scrape_interval: 60s

Create CockroachDB alerting rules

Define Prometheus alerting rules for CockroachDB cluster health and backup monitoring.

groups:
  - name: cockroachdb
    rules:
      - alert: CockroachDBNodeDown
        expr: up{job="cockroachdb"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "CockroachDB node is down"
          description: "CockroachDB node {{ $labels.instance }} has been down for more than 5 minutes."

      - alert: CockroachDBHighCPU
        expr: rate(sys_cpu_user_percent{job="cockroachdb"}[5m]) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CockroachDB high CPU usage"
          description: "CockroachDB node {{ $labels.instance }} CPU usage is above 80% for more than 10 minutes."

      - alert: CockroachDBHighMemory
        expr: sys_rss{job="cockroachdb"} / sys_rss_limit{job="cockroachdb"} > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CockroachDB high memory usage"
          description: "CockroachDB node {{ $labels.instance }} memory usage is above 90% for more than 10 minutes."

      - alert: CockroachDBBackupFailed
        expr: time() - cockroachdb_last_backup_timestamp > 86400
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "CockroachDB backup failed"
          description: "CockroachDB backup has not completed successfully in the last 24 hours."

      - alert: CockroachDBReplicationLag
        expr: replication_lag_seconds{job="cockroachdb"} > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CockroachDB replication lag"
          description: "CockroachDB replication lag on {{ $labels.instance }} is above 5 minutes."

      - alert: CockroachDBUnderReplicated
        expr: ranges_underreplicated{job="cockroachdb"} > 0
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "CockroachDB under-replicated ranges"
          description: "CockroachDB has {{ $value }} under-replicated ranges for more than 15 minutes."

Create backup monitoring script

Create /usr/local/bin/cockroach-backup-metrics.sh, which writes backup metrics in the Prometheus text format to /var/lib/prometheus/backup-metrics.prom. Prometheus does not read this file directly; it has to be exposed to a scrape target, for example through node_exporter's textfile collector (--collector.textfile.directory).

#!/bin/bash
set -euo pipefail

# Source configuration
source /etc/cockroach/backup-config

METRICS_FILE="/var/lib/prometheus/backup-metrics.prom"

# Initialize metrics file
echo "# HELP cockroachdb_last_backup_timestamp Unix timestamp of last successful backup" > "$METRICS_FILE"
echo "# TYPE cockroachdb_last_backup_timestamp gauge" >> "$METRICS_FILE"

# Check last backup time
if [ -f "/var/lib/cockroach/last-backup" ]; then
    LAST_BACKUP_FILE=$(cat /var/lib/cockroach/last-backup)
    BACKUP_TIMESTAMP=$(date -d "${LAST_BACKUP_FILE:0:8} ${LAST_BACKUP_FILE:9:2}:${LAST_BACKUP_FILE:11:2}:${LAST_BACKUP_FILE:13:2}" +%s)
    echo "cockroachdb_last_backup_timestamp $BACKUP_TIMESTAMP" >> "$METRICS_FILE"
else
    echo "cockroachdb_last_backup_timestamp 0" >> "$METRICS_FILE"
fi

# Check backup job status from the most recent full-backup log
echo "# HELP cockroachdb_backup_success Last backup job success (1=success, 0=failure)" >> "$METRICS_FILE"
echo "# TYPE cockroachdb_backup_success gauge" >> "$METRICS_FILE"
LATEST_LOG=$(ls -t /var/log/cockroach/backup-*.log 2>/dev/null | head -n 1 || true)
if [ -n "$LATEST_LOG" ] && grep -q "Backup completed successfully" "$LATEST_LOG"; then
    echo "cockroachdb_backup_success 1" >> "$METRICS_FILE"
else
    echo "cockroachdb_backup_success 0" >> "$METRICS_FILE"
fi

# Count backup prefixes ("aws s3 ls" prints one "PRE <name>/" line per backup)
echo "# HELP cockroachdb_backup_count Total number of backups in storage" >> "$METRICS_FILE"
echo "# TYPE cockroachdb_backup_count gauge" >> "$METRICS_FILE"
BACKUP_COUNT=$(aws s3 ls "s3://${BACKUP_BUCKET}/${CLUSTER_NAME}/full/" 2>/dev/null | grep -c "PRE" || true)
echo "cockroachdb_backup_count $BACKUP_COUNT" >> "$METRICS_FILE"

Make the script executable:

sudo chmod 755 /usr/local/bin/cockroach-backup-metrics.sh
sudo chown prometheus:prometheus /usr/local/bin/cockroach-backup-metrics.sh
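
A metrics file that is rewritten in place can be read while it is half written. A common way to avoid that is to write to a temporary file in the same directory and rename it over the target, since a rename within one filesystem is atomic. The sketch below illustrates the idea; `write_metrics_atomically` is a hypothetical helper and is not part of the script above.

```bash
#!/bin/bash
# write_metrics_atomically TARGET: read metrics text on stdin, write it to a
# temporary file next to TARGET, then rename it over TARGET in one step.
write_metrics_atomically() {
    local target="$1" tmp
    tmp=$(mktemp "${target}.XXXXXX") || return 1
    # mktemp creates the file with mode 600; widen it so a collector can read it.
    if cat > "$tmp" && chmod 644 "$tmp" && mv "$tmp" "$target"; then
        return 0
    fi
    rm -f "$tmp"
    return 1
}

# Example:
#   {
#       echo "# TYPE cockroachdb_backup_success gauge"
#       echo "cockroachdb_backup_success 1"
#   } | write_metrics_atomically /var/lib/prometheus/backup-metrics.prom
```

Creating the temporary file in the target's directory, rather than in /tmp, keeps both paths on the same filesystem so that `mv` is a plain rename.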

Create systemd timer for backup metrics

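Create a service unit and a timer unit that refresh the backup metrics file every five minutes. The service user needs read access to /etc/cockroach/backup-config, /var/lib/cockroach/last-backup, and /var/log/cockroach, which the earlier steps restrict to the cockroach user, as well as write access to /var/lib/prometheus.

/etc/systemd/system/cockroach-backup-metrics.service:

[Unit]
Description=Update CockroachDB backup metrics
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/cockroach-backup-metrics.sh
StandardOutput=journal
StandardError=journal

/etc/systemd/system/cockroach-backup-metrics.timer:

[Unit]
Description=Update backup metrics every 5 minutes
Requires=cockroach-backup-metrics.service

[Timer]
OnBootSec=5min
OnUnitActiveSec=5min
Persistent=true

[Install]
WantedBy=timers.target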

Configure Prometheus systemd service

Create the Prometheus service unit at /etc/systemd/system/prometheus.service, then start it together with the metrics timer.

[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
Restart=on-failure
RestartSec=5s
ExecStart=/usr/local/bin/prometheus \
  --config.file /etc/prometheus/prometheus.yml \
  --storage.tsdb.path /var/lib/prometheus/ \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries \
  --web.listen-address=0.0.0.0:9090 \
  --web.enable-lifecycle \
  --storage.tsdb.retention.time=30d

[Install]
WantedBy=multi-user.target

Reload systemd, then enable and start Prometheus and the metrics timer:

sudo systemctl daemon-reload
sudo systemctl enable prometheus cockroach-backup-metrics.timer
sudo systemctl start prometheus cockroach-backup-metrics.timer

Install and configure Grafana

Install Grafana for visualizing CockroachDB metrics and backup status.

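On Ubuntu and Debian (apt-key is deprecated, so the repository key is stored in a dedicated keyring):

sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update
sudo apt install -y grafana

On AlmaLinux and Rocky Linux:

sudo dnf install -y https://dl.grafana.com/oss/release/grafana-10.0.0-1.x86_64.rpm

On either distribution family, enable and start the service:
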
sudo systemctl enable grafana-server
sudo systemctl start grafana-server

Configure alerting with Alertmanager

Install and configure Alertmanager for sending backup failure notifications.

wget https://github.com/prometheus/alertmanager/releases/download/v0.25.0/alertmanager-0.25.0.linux-amd64.tar.gz
tar xzf alertmanager-0.25.0.linux-amd64.tar.gz
sudo mv alertmanager-0.25.0.linux-amd64/alertmanager /usr/local/bin/
sudo useradd -M -s /bin/false alertmanager
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo chown alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager

Then create /etc/alertmanager/alertmanager.yml:

global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@example.com'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'

receivers:
  - name: 'web.hook'
    email_configs:
      - to: 'admin@example.com'
        headers:
          Subject: 'CockroachDB Alert: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Instance: {{ .Labels.instance }}
          {{ end }}

Verify your setup

Test the backup system and verify monitoring is working correctly.

# Check CockroachDB cluster status
sudo -u cockroach cockroach node status --certs-dir=/etc/cockroach/certs --host=localhost:26257

# Test manual backup
sudo -u cockroach /usr/local/bin/cockroach-backup.sh

# Check systemd timer status
sudo systemctl status cockroach-backup.timer cockroach-incremental-backup.timer

# Verify that Prometheus has ingested the backup metrics
curl -s 'http://localhost:9090/api/v1/query?query=cockroachdb_last_backup_timestamp' | jq .

# Check backup logs (the glob is expanded as root because the log directory is not world-readable)
sudo sh -c 'tail -f /var/log/cockroach/backup-*.log'

Access Grafana at http://your-server:3000 (admin/admin) and verify that the Prometheus data source is configured correctly. You can reference our advanced Grafana dashboards tutorial for detailed monitoring setup.

Test disaster recovery procedure

Practice the disaster recovery process with a test restoration.

# Create test data
sudo -u cockroach cockroach sql --certs-dir=/etc/cockroach/certs --host=localhost:26257 --execute="CREATE DATABASE testdr; USE testdr; CREATE TABLE test (id INT PRIMARY KEY, data STRING); INSERT INTO test VALUES (1, 'test-data');"

# Perform backup
sudo -u cockroach /usr/local/bin/cockroach-backup.sh

# Simulate disaster and restore (on a test cluster).
# Replace the timestamp with the one recorded in /var/lib/cockroach/last-backup.
sudo -u cockroach /usr/local/bin/cockroach-restore.sh 20241201-120000

Common issues

Symptom | Cause | Fix
Backup fails with permission denied | Incorrect file ownership or AWS credentials | sudo chown cockroach:cockroach /etc/cockroach/backup-config && sudo chmod 600 /etc/cockroach/backup-config
Systemd timer not running | Timer not enabled or service file issues | sudo systemctl enable cockroach-backup.timer && sudo systemctl start cockroach-backup.timer
Prometheus not scraping CockroachDB | Firewall blocking port 8080 or wrong target | Run netstat -ln and look for port 8080; verify the targets in prometheus.yml
Backup restoration fails | Target database already exists or wrong backup path | Drop the existing database first, or verify that the S3 backup path exists
High backup storage costs | Retention policy not working | Check the cleanup output in the backup logs and verify AWS CLI permissions
