CockroachDB Backup & Disaster Recovery Setup

Configure automated backup strategies for CockroachDB with systemd timers, implement comprehensive disaster recovery procedures, and set up monitoring with Prometheus and Grafana for production-grade database infrastructure.

Prerequisites

Root or sudo access
S3-compatible storage account
3 or more servers for cluster
Basic understanding of SQL and systemd

What this solves

CockroachDB clusters need automated backup strategies and disaster recovery procedures to protect against data loss and maintain business continuity. This tutorial sets up automated database backups using systemd timers, configures cross-region disaster recovery with point-in-time recovery capabilities, and implements comprehensive monitoring with alerting to ensure your distributed SQL database remains resilient and recoverable.

Step-by-step installation

Install CockroachDB cluster

First, install CockroachDB on your primary cluster nodes. We'll set up a three-node cluster for high availability.

wget -qO- https://binaries.cockroachdb.com/cockroach-v24.3.0.linux-amd64.tgz | tar xz
sudo mv cockroach-v24.3.0.linux-amd64/cockroach /usr/local/bin/
sudo chmod 755 /usr/local/bin/cockroach

Create CockroachDB user and directories

Create a dedicated user for CockroachDB and set up the required directory structure with proper permissions.

sudo useradd -m -s /bin/bash cockroach
sudo mkdir -p /var/lib/cockroach /var/log/cockroach /etc/cockroach
sudo chown cockroach:cockroach /var/lib/cockroach /var/log/cockroach
sudo chmod 750 /var/lib/cockroach /var/log/cockroach
sudo chmod 755 /etc/cockroach

Generate cluster certificates

Create SSL certificates for secure cluster communication. This sets up a certificate authority and node certificates.

sudo mkdir -p /etc/cockroach/certs /etc/cockroach/private
sudo cockroach cert create-ca --certs-dir=/etc/cockroach/certs --ca-key=/etc/cockroach/private/ca.key
sudo cockroach cert create-node localhost 203.0.113.10 203.0.113.11 203.0.113.12 --certs-dir=/etc/cockroach/certs --ca-key=/etc/cockroach/private/ca.key
sudo cockroach cert create-client root --certs-dir=/etc/cockroach/certs --ca-key=/etc/cockroach/private/ca.key
sudo chown -R cockroach:cockroach /etc/cockroach/certs
sudo chmod 400 /etc/cockroach/private/ca.key

Configure CockroachDB systemd service

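Create the systemd service file /etc/systemd/system/cockroach.service for automatic startup and process management. On each node, set --listen-addr and --http-addr to that node's own address.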

[Unit]
Description=Cockroach Database cluster node
Requires=network.target
After=network.target

[Service]
Type=notify
User=cockroach
Group=cockroach
ExecStart=/usr/local/bin/cockroach start --certs-dir=/etc/cockroach/certs --store=/var/lib/cockroach --listen-addr=203.0.113.10:26257 --http-addr=203.0.113.10:8080 --join=203.0.113.10:26257,203.0.113.11:26257,203.0.113.12:26257 --background=false
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal
SyslogIdentifier=cockroach
KillMode=mixed
KillSignal=SIGTERM
TimeoutStopSec=60

[Install]
WantedBy=multi-user.target
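
The --join flag in the unit file lists every node as host:port. When the node addresses are kept in a single variable, that string can be assembled instead of typed by hand. A minimal sketch, assuming a hypothetical helper named build_join_list (not part of CockroachDB) and the port 26257 used above:

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn "ip1,ip2,ip3" into "ip1:PORT,ip2:PORT,ip3:PORT"
# for use as the value of `cockroach start --join=...`.
build_join_list() {
    local ips="$1"
    local port="${2:-26257}"   # 26257 is the port used throughout this tutorial
    local out="" ip
    local IFS=','              # split the list on commas inside this function only
    for ip in $ips; do
        [ -z "$ip" ] && continue   # skip empty entries
        out="${out:+$out,}${ip}:${port}"
    done
    printf '%s\n' "$out"
}

build_join_list "203.0.113.10,203.0.113.11,203.0.113.12"
```

The same string is valid on every node, so only the listen and HTTP addresses differ between the unit files.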

Initialize the cluster

Start the services on all nodes, then initialize the cluster from one node.

sudo systemctl daemon-reload
sudo systemctl enable cockroach
sudo systemctl start cockroach
sudo systemctl status cockroach

On the first node only, initialize the cluster:

sudo -u cockroach cockroach init --certs-dir=/etc/cockroach/certs --host=203.0.113.10:26257

Install backup dependencies

Install required packages for backup automation and monitoring.

# Debian/Ubuntu
sudo apt update
sudo apt install -y awscli postgresql-client-common curl jq

# RHEL, AlmaLinux, Rocky Linux, Fedora
sudo dnf install -y awscli postgresql curl jq

Configure S3-compatible backup storage

Store the credentials for the S3-compatible storage that will hold the backups in /etc/cockroach/backup-config, then restrict the file to the cockroach user.

#!/bin/bash
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_DEFAULT_REGION="us-east-1"
export BACKUP_BUCKET="cockroachdb-backups"
export CLUSTER_NAME="production"
export BACKUP_RETENTION_DAYS="30"

sudo chown cockroach:cockroach /etc/cockroach/backup-config
sudo chmod 600 /etc/cockroach/backup-config
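
Because the configuration ships with placeholder values, a backup can fail late for a reason that is cheap to detect early. The following is a sketch of a pre-flight check, not part of the original scripts: check_backup_config is a hypothetical function name, and treating any value that starts with "your-" as a placeholder is an assumption based on the sample values above.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check for /etc/cockroach/backup-config.
# Returns non-zero if a required variable is unset, empty, or still a placeholder.
check_backup_config() {
    local name value status=0
    for name in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_DEFAULT_REGION \
                BACKUP_BUCKET CLUSTER_NAME BACKUP_RETENTION_DAYS; do
        value="${!name:-}"                     # indirect expansion: value of the variable named $name
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            status=1
        elif [[ "$value" == your-* ]]; then    # sample values above start with "your-"
            echo "placeholder not replaced: $name" >&2
            status=1
        fi
    done
    if ! [[ "${BACKUP_RETENTION_DAYS:-}" =~ ^[0-9]+$ ]]; then
        echo "BACKUP_RETENTION_DAYS must be a whole number" >&2
        status=1
    fi
    return "$status"
}
```

A script could call it right after loading the file, for example: source /etc/cockroach/backup-config && check_backup_config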

Create backup automation script

Create the backup script /usr/local/bin/cockroach-backup.sh with error handling, logging, and cleanup of expired backups.

#!/bin/bash
set -euo pipefail

# Source configuration
source /etc/cockroach/backup-config

# Logging setup
LOGFILE="/var/log/cockroach/backup-$(date +%Y%m%d-%H%M%S).log"
exec 1> >(tee -a "$LOGFILE")
exec 2>&1

echo "[$(date)] Starting CockroachDB backup"

# Backup timestamp
BACKUP_TIMESTAMP=$(date +%Y%m%d-%H%M%S)
BACKUP_PATH="s3://${BACKUP_BUCKET}/${CLUSTER_NAME}/full/${BACKUP_TIMESTAMP}"

# Perform backup
echo "[$(date)] Creating full backup to: $BACKUP_PATH"
# Test the command directly: under `set -e` a separate `$?` check would never see a failure
if cockroach sql --certs-dir=/etc/cockroach/certs --host=localhost:26257 --execute="BACKUP TO '$BACKUP_PATH' WITH revision_history;"; then
    echo "[$(date)] Backup completed successfully"
    
    # Update latest backup marker
    echo "$BACKUP_TIMESTAMP" > /var/lib/cockroach/last-backup
    
    # Cleanup old backups
    echo "[$(date)] Cleaning up backups older than $BACKUP_RETENTION_DAYS days"
    CUTOFF_EPOCH=$(date -d "$BACKUP_RETENTION_DAYS days ago" +%s)
    # `aws s3 ls` prints each backup prefix as "PRE <name>/" with no date column,
    # so the age is derived from the timestamped name itself
    aws s3 ls "s3://${BACKUP_BUCKET}/${CLUSTER_NAME}/full/" | awk '$1 == "PRE" {print $2}' | while read -r prefix; do
        BACKUP_NAME="${prefix%/}"

        if [[ "$BACKUP_NAME" =~ ^[0-9]{8}-[0-9]{6}$ ]]; then
            BACKUP_EPOCH=$(date -d "${BACKUP_NAME:0:8} ${BACKUP_NAME:9:2}:${BACKUP_NAME:11:2}:${BACKUP_NAME:13:2}" +%s)

            if [[ $BACKUP_EPOCH -lt $CUTOFF_EPOCH ]]; then
                echo "[$(date)] Deleting old backup: $BACKUP_NAME"
                aws s3 rm "s3://${BACKUP_BUCKET}/${CLUSTER_NAME}/full/${BACKUP_NAME}/" --recursive
            fi
        fi
    done
    
    echo "[$(date)] Backup process completed successfully"
    exit 0
else
    echo "[$(date)] Backup failed with exit code $?"
    exit 1
fi

sudo chmod 755 /usr/local/bin/cockroach-backup.sh
sudo chown cockroach:cockroach /usr/local/bin/cockroach-backup.sh
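
The retention step comes down to one question per backup: is its timestamped name older than the cutoff? The sketch below isolates that decision so it can be checked without touching the bucket. It assumes backup names follow the YYYYMMDD-HHMMSS pattern written by the script and that GNU date is available; backup_is_expired is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) when a backup named YYYYMMDD-HHMMSS
# is older than the given number of days. Requires GNU date for `-d`.
# Exit 1 means "not expired"; exit 2 means the name or date could not be parsed.
backup_is_expired() {
    local name="$1" retention_days="$2"
    [[ "$name" =~ ^[0-9]{8}-[0-9]{6}$ ]] || return 2

    local backup_epoch cutoff_epoch
    backup_epoch=$(date -d "${name:0:8} ${name:9:2}:${name:11:2}:${name:13:2}" +%s) || return 2
    cutoff_epoch=$(date -d "$retention_days days ago" +%s) || return 2

    [ "$backup_epoch" -lt "$cutoff_epoch" ]
}

if backup_is_expired "20000101-000000" 30; then
    echo "20000101-000000 is past retention"
fi
```

Keeping the parse failure separate from "not expired" means an unexpected directory name is never mistaken for a deletable backup.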

Create incremental backup script

Create /usr/local/bin/cockroach-incremental-backup.sh to take incremental backups between full backups, so less data is at risk between the daily runs.

#!/bin/bash
set -euo pipefail

# Source configuration
source /etc/cockroach/backup-config

# Logging setup
LOGFILE="/var/log/cockroach/incremental-backup-$(date +%Y%m%d-%H%M%S).log"
exec 1> >(tee -a "$LOGFILE")
exec 2>&1

echo "[$(date)] Starting CockroachDB incremental backup"

# Get latest full backup
if [ ! -f "/var/lib/cockroach/last-backup" ]; then
    echo "[$(date)] No full backup found. Please run full backup first."
    exit 1
fi

LAST_FULL_BACKUP=$(cat /var/lib/cockroach/last-backup)
BACKUP_TIMESTAMP=$(date +%Y%m%d-%H%M%S)
FULL_BACKUP_PATH="s3://${BACKUP_BUCKET}/${CLUSTER_NAME}/full/${LAST_FULL_BACKUP}"
INCREMENTAL_PATH="s3://${BACKUP_BUCKET}/${CLUSTER_NAME}/incremental/${BACKUP_TIMESTAMP}"

echo "[$(date)] Creating incremental backup based on: $FULL_BACKUP_PATH"
echo "[$(date)] Incremental backup destination: $INCREMENTAL_PATH"

# Test the command directly: under `set -e` a separate `$?` check would never see a failure
if cockroach sql --certs-dir=/etc/cockroach/certs --host=localhost:26257 --execute="BACKUP TO '$INCREMENTAL_PATH' INCREMENTAL FROM '$FULL_BACKUP_PATH' WITH revision_history;"; then
    echo "[$(date)] Incremental backup completed successfully"
    exit 0
else
    echo "[$(date)] Incremental backup failed with exit code $?"
    exit 1
fi

sudo chmod 755 /usr/local/bin/cockroach-incremental-backup.sh
sudo chown cockroach:cockroach /usr/local/bin/cockroach-incremental-backup.sh
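
The incremental script depends on the marker file /var/lib/cockroach/last-backup to find the latest full backup. If that file is lost, the same name can in principle be recovered from the bucket listing. A sketch under two assumptions: that aws s3 ls prints directory-like prefixes as "PRE name/", and that full backups are named YYYYMMDD-HHMMSS. latest_full_backup is a hypothetical function, and the listing below is canned sample input.

```shell
#!/usr/bin/env bash
# Hypothetical fallback: pick the newest full backup name from `aws s3 ls`
# output read on stdin. Timestamped names sort chronologically as plain text.
latest_full_backup() {
    awk '$1 == "PRE" { sub(/\/$/, "", $2); print $2 }' \
        | grep -E '^[0-9]{8}-[0-9]{6}$' \
        | sort \
        | tail -n 1
}

# Canned listing standing in for:
#   aws s3 ls "s3://${BACKUP_BUCKET}/${CLUSTER_NAME}/full/"
printf '%s\n' \
    '                           PRE 20241130-000012/' \
    '                           PRE 20241201-000507/' \
    '                           PRE 20241129-000341/' \
    | latest_full_backup
```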

Configure systemd timers for automated backups

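Create a service unit and a timer unit for each backup type in /etc/systemd/system/: cockroach-backup.service, cockroach-backup.timer, cockroach-incremental-backup.service, and cockroach-incremental-backup.timer.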

# /etc/systemd/system/cockroach-backup.service
[Unit]
Description=CockroachDB Full Backup
Wants=network-online.target
After=network-online.target cockroach.service
Requires=cockroach.service

[Service]
Type=oneshot
User=cockroach
Group=cockroach
ExecStart=/usr/local/bin/cockroach-backup.sh
Environment=PATH=/usr/local/bin:/usr/bin:/bin
StandardOutput=journal
StandardError=journal

# /etc/systemd/system/cockroach-backup.timer
[Unit]
Description=Run CockroachDB full backup daily
Requires=cockroach-backup.service

[Timer]
OnCalendar=daily
RandomizedDelaySec=1800
Persistent=true

[Install]
WantedBy=timers.target

# /etc/systemd/system/cockroach-incremental-backup.service
[Unit]
Description=CockroachDB Incremental Backup
Wants=network-online.target
After=network-online.target cockroach.service
Requires=cockroach.service

[Service]
Type=oneshot
User=cockroach
Group=cockroach
ExecStart=/usr/local/bin/cockroach-incremental-backup.sh
Environment=PATH=/usr/local/bin:/usr/bin:/bin
StandardOutput=journal
StandardError=journal

# /etc/systemd/system/cockroach-incremental-backup.timer
[Unit]
Description=Run CockroachDB incremental backup every 4 hours
Requires=cockroach-incremental-backup.service

[Timer]
OnCalendar=*-*-* 00,04,08,12,16,20:00:00
RandomizedDelaySec=300
Persistent=true

[Install]
WantedBy=timers.target
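
The timer units differ only in the unit they trigger, the calendar expression, and the randomized delay, so they can be produced from a template. This is a sketch, not the tutorial's method: render_timer is a hypothetical function, and the Description text and the explicit Unit= line are illustrative choices. The expression 00/4 is systemd's start/step form for hours 0, 4, 8, and so on.

```shell
#!/usr/bin/env bash
# Hypothetical generator for a .timer unit like the ones above.
# Arguments: unit base name, OnCalendar expression, randomized delay in seconds.
render_timer() {
    local name="$1" calendar="$2" delay="$3"
    cat <<EOF
[Unit]
Description=Run ${name} on schedule

[Timer]
Unit=${name}.service
OnCalendar=${calendar}
RandomizedDelaySec=${delay}
Persistent=true

[Install]
WantedBy=timers.target
EOF
}

# Print a timer that fires every 4 hours, starting at midnight
render_timer cockroach-incremental-backup '*-*-* 00/4:00:00' 300
```

Where systemd-analyze is installed, running systemd-analyze calendar with the expression prints the next time it would elapse, which is a quick check before installing a timer.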

Enable backup timers

Start and enable the systemd timers for automated backup execution.

sudo systemctl daemon-reload
sudo systemctl enable cockroach-backup.timer cockroach-incremental-backup.timer
sudo systemctl start cockroach-backup.timer cockroach-incremental-backup.timer
sudo systemctl status cockroach-backup.timer cockroach-incremental-backup.timer

Create disaster recovery script

Create the restore script /usr/local/bin/cockroach-restore.sh. It restores from a named full backup and optionally to a specific point in time.

#!/bin/bash
set -euo pipefail

# Source configuration
source /etc/cockroach/backup-config

if [ "$#" -lt 1 ]; then
    echo "Usage: $0 <backup-timestamp> [target-time]"
    echo "Example: $0 20241201-120000"
    echo "Example: $0 20241201-120000 '2024-12-01 15:30:00'"
    exit 1
fi

BACKUP_TIMESTAMP="$1"
TARGET_TIME="${2:-}"
BACKUP_PATH="s3://${BACKUP_BUCKET}/${CLUSTER_NAME}/full/${BACKUP_TIMESTAMP}"

# Logging setup
LOGFILE="/var/log/cockroach/restore-$(date +%Y%m%d-%H%M%S).log"
exec 1> >(tee -a "$LOGFILE")
exec 2>&1

echo "[$(date)] Starting CockroachDB restore from backup: $BACKUP_TIMESTAMP"

# Verify backup exists
echo "[$(date)] Verifying backup exists at: $BACKUP_PATH"
# Test the command directly so the error message below is reachable under `set -e`
if ! aws s3 ls "$BACKUP_PATH/" > /dev/null; then
    echo "[$(date)] ERROR: Backup not found at $BACKUP_PATH"
    exit 1
fi

# Build restore command
if [ -n "$TARGET_TIME" ]; then
    RESTORE_CMD="RESTORE DATABASE defaultdb FROM '$BACKUP_PATH' AS OF SYSTEM TIME '$TARGET_TIME' WITH skip_missing_foreign_keys;"
    echo "[$(date)] Performing point-in-time restore to: $TARGET_TIME"
else
    RESTORE_CMD="RESTORE DATABASE defaultdb FROM '$BACKUP_PATH' WITH skip_missing_foreign_keys;"
    echo "[$(date)] Performing full restore to latest backup time"
fi

echo "[$(date)] Executing restore command"
# Test the command directly: under `set -e` a separate `$?` check would never see a failure
if cockroach sql --certs-dir=/etc/cockroach/certs --host=localhost:26257 --execute="$RESTORE_CMD"; then
    echo "[$(date)] Restore completed successfully"
    exit 0
else
    echo "[$(date)] Restore failed with exit code $?"
    exit 1
fi

sudo chmod 755 /usr/local/bin/cockroach-restore.sh
sudo chown cockroach:cockroach /usr/local/bin/cockroach-restore.sh
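
The restore script passes both arguments straight into the storage path and the SQL statement, so a mistyped value only surfaces once the restore is attempted. The sketch below checks them first. Both function names are hypothetical, GNU date is assumed, and rejecting target times in the future is an assumption about what a sensible point-in-time restore looks like.

```shell
#!/usr/bin/env bash
# Hypothetical argument checks for cockroach-restore.sh. Requires GNU date.

# Backup names are written by the backup script as YYYYMMDD-HHMMSS.
valid_backup_timestamp() {
    [[ "$1" =~ ^[0-9]{8}-[0-9]{6}$ ]]
}

# The optional target time must parse as a date and must not lie in the future.
valid_target_time() {
    local epoch
    epoch=$(date -d "$1" +%s 2>/dev/null) || return 1
    [ "$epoch" -le "$(date +%s)" ]
}

valid_backup_timestamp "20241201-120000" && echo "backup timestamp ok"
valid_target_time "2024-12-01 15:30:00" && echo "target time ok"
```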

Install Prometheus for monitoring

Install Prometheus to collect metrics from CockroachDB and backup processes.

wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xzf prometheus-2.45.0.linux-amd64.tar.gz
sudo mv prometheus-2.45.0.linux-amd64/prometheus /usr/local/bin/
sudo mv prometheus-2.45.0.linux-amd64/promtool /usr/local/bin/
sudo useradd -M -s /bin/false prometheus
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /etc/prometheus /var/lib/prometheus

Configure Prometheus for CockroachDB monitoring

Create /etc/prometheus/prometheus.yml so that Prometheus scrapes CockroachDB metrics and backup job status.

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "/etc/prometheus/cockroach-rules.yml"

scrape_configs:
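  - job_name: 'cockroachdb'
    # Nodes started with --certs-dir serve the HTTP endpoint over TLS
    scheme: https
    tls_config:
      ca_file: /etc/cockroach/certs/ca.crt
    static_configs:
      - targets: 
        - '203.0.113.10:8080'
        - '203.0.113.11:8080'
        - '203.0.113.12:8080'
    metrics_path: '/_status/vars'
    scrape_interval: 10s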

  - job_name: 'node-exporter'
    static_configs:
      - targets:
        - '203.0.113.10:9100'
        - '203.0.113.11:9100'
        - '203.0.113.12:9100'

  - job_name: 'backup-monitoring'
    static_configs:
      - targets: ['localhost:9090']
    metrics_path: '/metrics'
    scrape_interval: 60s

Create CockroachDB alerting rules

Define Prometheus alerting rules for CockroachDB cluster health and backup monitoring.

groups:
- name: cockroachdb
  rules:
  - alert: CockroachDBNodeDown
    expr: up{job="cockroachdb"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "CockroachDB node is down"
      description: "CockroachDB node {{ $labels.instance }} has been down for more than 5 minutes."

  - alert: CockroachDBHighCPU
    expr: sys_cpu_combined_percent_normalized{job="cockroachdb"} > 0.8
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "CockroachDB high CPU usage"
      description: "CockroachDB node {{ $labels.instance }} CPU usage is above 80% for more than 10 minutes."

  - alert: CockroachDBHighMemory
    expr: sys_rss{job="cockroachdb"} / sys_rss_limit{job="cockroachdb"} > 0.9
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "CockroachDB high memory usage"
      description: "CockroachDB node {{ $labels.instance }} memory usage is above 90% for more than 10 minutes."

  - alert: CockroachDBBackupFailed
    expr: time() - cockroachdb_last_backup_timestamp > 86400
    for: 1h
    labels:
      severity: critical
    annotations:
      summary: "CockroachDB backup failed"
      description: "CockroachDB backup has not completed successfully in the last 24 hours."

  - alert: CockroachDBReplicationLag
    expr: replication_lag_seconds{job="cockroachdb"} > 300
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "CockroachDB replication lag"
      description: "CockroachDB replication lag on {{ $labels.instance }} is above 5 minutes."

  - alert: CockroachDBUnderReplicated
    expr: ranges_underreplicated{job="cockroachdb"} > 0
    for: 15m
    labels:
      severity: critical
    annotations:
      summary: "CockroachDB under-replicated ranges"
      description: "CockroachDB has {{ $value }} under-replicated ranges for more than 15 minutes."
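
The CockroachDBBackupFailed rule is a staleness check: the current time minus the last backup timestamp must not exceed 86400 seconds. The same arithmetic can be run by hand when debugging an alert. A sketch, with backup_is_stale as a hypothetical function name and the third argument added only so the clock can be fixed for testing:

```shell
#!/usr/bin/env bash
# Hypothetical command-line counterpart of the alert expression:
#   time() - cockroachdb_last_backup_timestamp > 86400
backup_is_stale() {
    local last_backup_epoch="$1"
    local max_age="${2:-86400}"          # 24 hours, as in the rule above
    local now="${3:-$(date +%s)}"        # current time unless a value is injected
    [ $(( now - last_backup_epoch )) -gt "$max_age" ]
}

# A backup taken 2 hours ago is fresh; one taken 30 hours ago is stale.
now=$(date +%s)
backup_is_stale $(( now - 7200 ))   || echo "fresh"
backup_is_stale $(( now - 108000 )) && echo "stale"
```

A timestamp of 0, which the metrics script writes when no backup marker exists, is always stale under this comparison, so a missing marker eventually raises the alert as well.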

Create backup monitoring script

Create /usr/local/bin/cockroach-backup-metrics.sh, which writes backup metrics to a file in the Prometheus text format.

#!/bin/bash
set -euo pipefail

# Source configuration
source /etc/cockroach/backup-config

METRICS_FILE="/var/lib/prometheus/backup-metrics.prom"

# Initialize metrics file
echo "# HELP cockroachdb_last_backup_timestamp Unix timestamp of last successful backup" > "$METRICS_FILE"
echo "# TYPE cockroachdb_last_backup_timestamp gauge" >> "$METRICS_FILE"

# Check last backup time
if [ -f "/var/lib/cockroach/last-backup" ]; then
    LAST_BACKUP_FILE=$(cat /var/lib/cockroach/last-backup)
    BACKUP_TIMESTAMP=$(date -d "${LAST_BACKUP_FILE:0:8} ${LAST_BACKUP_FILE:9:2}:${LAST_BACKUP_FILE:11:2}:${LAST_BACKUP_FILE:13:2}" +%s)
    echo "cockroachdb_last_backup_timestamp $BACKUP_TIMESTAMP" >> "$METRICS_FILE"
else
    echo "cockroachdb_last_backup_timestamp 0" >> "$METRICS_FILE"
fi

# Check backup job status from the most recent backup log
echo "# HELP cockroachdb_backup_success Last backup job success (1=success, 0=failure)" >> "$METRICS_FILE"
echo "# TYPE cockroachdb_backup_success gauge" >> "$METRICS_FILE"

LATEST_LOG=$(ls -t /var/log/cockroach/backup-*.log 2>/dev/null | head -n 1 || true)
if [ -n "$LATEST_LOG" ] && grep -q "Backup completed successfully" "$LATEST_LOG"; then
    echo "cockroachdb_backup_success 1" >> "$METRICS_FILE"
else
    echo "cockroachdb_backup_success 0" >> "$METRICS_FILE"
fi

# Count backup files
echo "# HELP cockroachdb_backup_count Total number of backups in storage" >> "$METRICS_FILE"
echo "# TYPE cockroachdb_backup_count gauge" >> "$METRICS_FILE"
BACKUP_COUNT=$(aws s3 ls "s3://${BACKUP_BUCKET}/${CLUSTER_NAME}/full/" | wc -l || echo "0")
echo "cockroachdb_backup_count $BACKUP_COUNT" >> "$METRICS_FILE"

sudo chmod 755 /usr/local/bin/cockroach-backup-metrics.sh
sudo chown prometheus:prometheus /usr/local/bin/cockroach-backup-metrics.sh
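
The metrics script truncates the file and then appends to it line by line, so anything reading the file at the wrong moment could see a partial result. One common way around that is to build the file under a temporary name and rename it into place. A sketch of the idea, with write_gauge_atomically as a hypothetical helper and a scratch directory standing in for /var/lib/prometheus:

```shell
#!/usr/bin/env bash
# Hypothetical helper: write one gauge in Prometheus text exposition format,
# building the file under a temporary name and renaming it into place.
write_gauge_atomically() {
    local file="$1" metric="$2" help="$3" value="$4"
    local tmp
    tmp=$(mktemp "${file}.XXXXXX") || return 1
    {
        echo "# HELP ${metric} ${help}"
        echo "# TYPE ${metric} gauge"
        echo "${metric} ${value}"
    } > "$tmp"
    chmod 644 "$tmp"          # mktemp creates the file as 0600; readers may run as another user
    mv -f "$tmp" "$file"      # a rename within one directory replaces the file in a single step
}

# Demo against a scratch directory rather than /var/lib/prometheus.
demo_dir=$(mktemp -d)
write_gauge_atomically "$demo_dir/backup-metrics.prom" \
    cockroachdb_last_backup_timestamp \
    "Unix timestamp of last successful backup" \
    "$(date +%s)"
cat "$demo_dir/backup-metrics.prom"
rm -rf "$demo_dir"
```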

Create systemd timer for backup metrics

Create cockroach-backup-metrics.service and cockroach-backup-metrics.timer in /etc/systemd/system/ to refresh the backup metrics every 5 minutes.

# /etc/systemd/system/cockroach-backup-metrics.service
[Unit]
Description=Update CockroachDB backup metrics
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/cockroach-backup-metrics.sh
StandardOutput=journal
StandardError=journal

# /etc/systemd/system/cockroach-backup-metrics.timer
[Unit]
Description=Update backup metrics every 5 minutes
Requires=cockroach-backup-metrics.service

[Timer]
OnBootSec=5min
OnUnitActiveSec=5min
Persistent=true

[Install]
WantedBy=timers.target

Configure Prometheus systemd service

Create the service file /etc/systemd/system/prometheus.service, then enable and start Prometheus together with the backup metrics timer.

[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
Restart=on-failure
RestartSec=5s
ExecStart=/usr/local/bin/prometheus \
  --config.file /etc/prometheus/prometheus.yml \
  --storage.tsdb.path /var/lib/prometheus/ \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries \
  --web.listen-address=0.0.0.0:9090 \
  --web.enable-lifecycle \
  --storage.tsdb.retention.time=30d

[Install]
WantedBy=multi-user.target

sudo systemctl daemon-reload
sudo systemctl enable prometheus cockroach-backup-metrics.timer
sudo systemctl start prometheus cockroach-backup-metrics.timer

Install and configure Grafana

Install Grafana for visualizing CockroachDB metrics and backup status.

# Debian/Ubuntu
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
echo "deb https://packages.grafana.com/oss/deb stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update
sudo apt install -y grafana

# RHEL, AlmaLinux, Rocky Linux, Fedora
sudo dnf install -y https://dl.grafana.com/oss/release/grafana-10.0.0-1.x86_64.rpm

sudo systemctl enable grafana-server
sudo systemctl start grafana-server

Configure alerting with Alertmanager

Install and configure Alertmanager for sending backup failure notifications.

wget https://github.com/prometheus/alertmanager/releases/download/v0.25.0/alertmanager-0.25.0.linux-amd64.tar.gz
tar xzf alertmanager-0.25.0.linux-amd64.tar.gz
sudo mv alertmanager-0.25.0.linux-amd64/alertmanager /usr/local/bin/
sudo useradd -M -s /bin/false alertmanager
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo chown alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager

global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@example.com'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'

receivers:
- name: 'web.hook'
  email_configs:
  - to: 'admin@example.com'
    headers:
      Subject: 'CockroachDB Alert: {{ .GroupLabels.alertname }}'
    text: |
      {{ range .Alerts }}
      Alert: {{ .Annotations.summary }}
      Description: {{ .Annotations.description }}
      Instance: {{ .Labels.instance }}
      {{ end }}

Verify your setup

Test the backup system and verify monitoring is working correctly.

# Check CockroachDB cluster status
sudo -u cockroach cockroach node status --certs-dir=/etc/cockroach/certs --host=localhost:26257

# Test manual backup
sudo -u cockroach /usr/local/bin/cockroach-backup.sh

# Check systemd timer status
sudo systemctl status cockroach-backup.timer cockroach-incremental-backup.timer

# Verify Prometheus metrics
curl -s http://localhost:9090/metrics | grep cockroachdb_

# Check backup logs
sudo tail -f /var/log/cockroach/backup-*.log

Access Grafana at http://your-server:3000 (admin/admin) and verify that the Prometheus data source is configured correctly. You can reference our advanced Grafana dashboards tutorial for detailed monitoring setup.

Test disaster recovery procedure

Practice the disaster recovery process with a test restoration.

# Create test data
sudo -u cockroach cockroach sql --certs-dir=/etc/cockroach/certs --host=localhost:26257 --execute="CREATE DATABASE testdr; USE testdr; CREATE TABLE test (id INT PRIMARY KEY, data STRING); INSERT INTO test VALUES (1, 'test-data');"

# Perform backup
sudo -u cockroach /usr/local/bin/cockroach-backup.sh

# Simulate disaster and restore (on a test cluster)
sudo -u cockroach /usr/local/bin/cockroach-restore.sh 20241201-120000

Common issues

Symptom	Cause	Fix
Backup fails with permission denied	Incorrect file ownership or AWS credentials	`sudo chown cockroach:cockroach /etc/cockroach/backup-config && sudo chmod 600 /etc/cockroach/backup-config`
Systemd timer not running	Timer not enabled or service file issues	`sudo systemctl enable --now cockroach-backup.timer`
Prometheus not scraping CockroachDB	Firewall blocking port 8080 or wrong target	Check `netstat -ln | grep 8080` and verify the targets in prometheus.yml
Backup restoration fails	Target database already exists or wrong backup path	Drop existing database first or verify S3 backup path exists
High backup storage costs	Retention policy not working	Check backup cleanup script logs and verify AWS CLI permissions

Next steps

Configure Prometheus Alertmanager with Slack integration for team notifications
Implement backup rotation policies for optimized storage management
Configure multi-region CockroachDB deployment for geographic redundancy
Set up cross-cluster replication for disaster recovery
Monitor CockroachDB performance with advanced Grafana dashboards

Automated install script

Run this to automate the entire setup

install.sh

#!/usr/bin/env bash

set -euo pipefail

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'

# Default configuration
COCKROACH_VERSION="24.3.0"
CLUSTER_IPS="${1:-}"
NODE_IP="${2:-}"
BACKUP_BUCKET="${3:-cockroachdb-backups}"

usage() {
    echo "Usage: $0 <cluster_ips> <node_ip> [backup_bucket]"
    echo "Example: $0 '10.0.1.10,10.0.1.11,10.0.1.12' '10.0.1.10' 'my-backup-bucket'"
    exit 1
}

if [[ -z "$CLUSTER_IPS" || -z "$NODE_IP" ]]; then
    usage
fi

# Error handling
cleanup() {
    echo -e "${RED}[ERROR] Installation failed. Cleaning up...${NC}"
    systemctl stop cockroach 2>/dev/null || true
    userdel cockroach 2>/dev/null || true
    rm -rf /var/lib/cockroach /var/log/cockroach /etc/cockroach
    exit 1
}
trap cleanup ERR

# Check prerequisites
if [[ $EUID -ne 0 ]]; then
    echo -e "${RED}[ERROR] This script must be run as root${NC}"
    exit 1
fi

# Auto-detect distribution
echo -e "${YELLOW}[1/12] Detecting distribution...${NC}"
if [ -f /etc/os-release ]; then
    . /etc/os-release
    case "$ID" in
        ubuntu|debian) 
            PKG_MGR="apt"
            PKG_UPDATE="apt update"
            PKG_INSTALL="apt install -y"
            PSQL_CLIENT="postgresql-client-common"
            ;;
        almalinux|rocky|centos|rhel|ol|fedora) 
            PKG_MGR="dnf"
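            # makecache refreshes metadata; "check-update || true" stored in a variable
            # would pass "||" and "true" to dnf as arguments instead of running a shell OR
            PKG_UPDATE="dnf makecache"
            PKG_INSTALL="dnf install -y"
            PSQL_CLIENT="postgresql"
            ;;
        amzn) 
            PKG_MGR="yum"
            PKG_UPDATE="yum makecache"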
            PKG_INSTALL="yum install -y"
            PSQL_CLIENT="postgresql"
            ;;
        *) 
            echo -e "${RED}[ERROR] Unsupported distribution: $ID${NC}"
            exit 1
            ;;
    esac
    echo -e "${GREEN}Detected: $PRETTY_NAME${NC}"
else
    echo -e "${RED}[ERROR] Cannot detect distribution${NC}"
    exit 1
fi

# Install dependencies
echo -e "${YELLOW}[2/12] Installing dependencies...${NC}"
$PKG_UPDATE
$PKG_INSTALL curl wget tar awscli $PSQL_CLIENT jq

# Download and install CockroachDB
echo -e "${YELLOW}[3/12] Downloading CockroachDB v${COCKROACH_VERSION}...${NC}"
cd /tmp
wget -q "https://binaries.cockroachdb.com/cockroach-v${COCKROACH_VERSION}.linux-amd64.tgz"
tar xzf "cockroach-v${COCKROACH_VERSION}.linux-amd64.tgz"
mv "cockroach-v${COCKROACH_VERSION}.linux-amd64/cockroach" /usr/local/bin/
chmod 755 /usr/local/bin/cockroach
rm -rf "cockroach-v${COCKROACH_VERSION}.linux-amd64"*

# Create user and directories
echo -e "${YELLOW}[4/12] Creating CockroachDB user and directories...${NC}"
useradd -m -s /bin/bash cockroach || true
mkdir -p /var/lib/cockroach /var/log/cockroach /etc/cockroach/{certs,private}
chown cockroach:cockroach /var/lib/cockroach /var/log/cockroach
chown -R cockroach:cockroach /etc/cockroach
chmod 750 /var/lib/cockroach /var/log/cockroach
chmod 755 /etc/cockroach

# Generate certificates
echo -e "${YELLOW}[5/12] Generating cluster certificates...${NC}"
cockroach cert create-ca --certs-dir=/etc/cockroach/certs --ca-key=/etc/cockroach/private/ca.key
cockroach cert create-node localhost $NODE_IP $(echo $CLUSTER_IPS | tr ',' ' ') --certs-dir=/etc/cockroach/certs --ca-key=/etc/cockroach/private/ca.key
cockroach cert create-client root --certs-dir=/etc/cockroach/certs --ca-key=/etc/cockroach/private/ca.key
chown -R cockroach:cockroach /etc/cockroach/certs
chmod 400 /etc/cockroach/private/ca.key

# Create systemd service
echo -e "${YELLOW}[6/12] Creating systemd service...${NC}"
cat > /etc/systemd/system/cockroach.service << EOF
[Unit]
Description=Cockroach Database cluster node
Requires=network.target
After=network.target

[Service]
Type=notify
User=cockroach
Group=cockroach
ExecStart=/usr/local/bin/cockroach start --certs-dir=/etc/cockroach/certs --store=/var/lib/cockroach --listen-addr=${NODE_IP}:26257 --http-addr=${NODE_IP}:8080 --join=$(echo $CLUSTER_IPS | sed 's/,/:26257,/g'):26257 --background=false
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal
SyslogIdentifier=cockroach
KillMode=mixed
KillSignal=SIGTERM
TimeoutStopSec=60

[Install]
WantedBy=multi-user.target
EOF

# Start CockroachDB service
echo -e "${YELLOW}[7/12] Starting CockroachDB service...${NC}"
systemctl daemon-reload
systemctl enable cockroach
systemctl start cockroach
sleep 10

# Create backup configuration
echo -e "${YELLOW}[8/12] Creating backup configuration...${NC}"
cat > /etc/cockroach/backup-config << EOF
#!/bin/bash
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_DEFAULT_REGION="us-east-1"
export BACKUP_BUCKET="${BACKUP_BUCKET}"
export CLUSTER_NAME="production"
export BACKUP_RETENTION_DAYS="30"
EOF
chown cockroach:cockroach /etc/cockroach/backup-config
chmod 600 /etc/cockroach/backup-config

# Create backup script
echo -e "${YELLOW}[9/12] Creating backup automation script...${NC}"
cat > /usr/local/bin/cockroach-backup << EOF
#!/bin/bash
set -euo pipefail

source /etc/cockroach/backup-config

TIMESTAMP=\$(date +%Y%m%d_%H%M%S)
BACKUP_PATH="s3://\${BACKUP_BUCKET}/\${CLUSTER_NAME}/\${TIMESTAMP}"

echo "\$(date): Starting backup to \${BACKUP_PATH}"

# Full backup
sudo -u cockroach cockroach sql --certs-dir=/etc/cockroach/certs --host=${NODE_IP}:26257 \\
    --execute="BACKUP INTO '\${BACKUP_PATH}' WITH DETACHED;"

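# WITH DETACHED returns as soon as the job is created, so this only confirms submission
echo "\$(date): Backup job submitted"

# Cleanup old backups: prefixes are listed as "PRE <timestamp>/", so compare the date part of the name
CUTOFF=\$(date -d "\${BACKUP_RETENTION_DAYS} days ago" +%Y%m%d)
aws s3 ls "s3://\${BACKUP_BUCKET}/\${CLUSTER_NAME}/" | awk -v cutoff="\$CUTOFF" '\$1 == "PRE" && substr(\$2, 1, 8) < cutoff {print \$2}' | \\
    xargs -r -I {} aws s3 rm "s3://\${BACKUP_BUCKET}/\${CLUSTER_NAME}/{}" --recursive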
EOF
chmod 755 /usr/local/bin/cockroach-backup

# Create systemd timer for backups
echo -e "${YELLOW}[10/12] Creating backup timer...${NC}"
cat > /etc/systemd/system/cockroach-backup.service << EOF
[Unit]
Description=CockroachDB Backup Service
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
User=root
ExecStart=/usr/local/bin/cockroach-backup
StandardOutput=journal
StandardError=journal
SyslogIdentifier=cockroach-backup
EOF

cat > /etc/systemd/system/cockroach-backup.timer << EOF
[Unit]
Description=CockroachDB Backup Timer
Requires=cockroach-backup.service

[Timer]
OnCalendar=daily
Persistent=true
RandomizedDelaySec=3600

[Install]
WantedBy=timers.target
EOF

systemctl daemon-reload
systemctl enable cockroach-backup.timer

# Create monitoring script
echo -e "${YELLOW}[11/12] Creating monitoring script...${NC}"
cat > /usr/local/bin/cockroach-monitor << EOF
#!/bin/bash
set -euo pipefail

HEALTH_URL="https://${NODE_IP}:8080/health"
CLUSTER_URL="https://${NODE_IP}:8080/_status/cluster"
CA_CERT="/etc/cockroach/certs/ca.crt"

# Check node health (a secure cluster serves its HTTP endpoints over TLS)
if ! curl -sf --cacert "\$CA_CERT" "\$HEALTH_URL" > /dev/null; then
    echo "CRITICAL: CockroachDB node health check failed"
    exit 2
fi

# Check cluster status
CLUSTER_STATUS=\$(curl -sf --cacert "\$CA_CERT" "\$CLUSTER_URL" | jq -r '.cluster_id // "unknown"')
if [[ "\$CLUSTER_STATUS" == "unknown" ]]; then
    echo "WARNING: Could not retrieve cluster status"
    exit 1
fi

echo "OK: CockroachDB cluster is healthy (ID: \$CLUSTER_STATUS)"
exit 0
EOF
chmod 755 /usr/local/bin/cockroach-monitor

# Configure firewall
echo -e "${YELLOW}[12/12] Configuring firewall...${NC}"
if command -v firewall-cmd >/dev/null 2>&1; then
    firewall-cmd --permanent --add-port=26257/tcp --add-port=8080/tcp
    firewall-cmd --reload
elif command -v ufw >/dev/null 2>&1; then
    ufw allow 26257/tcp
    ufw allow 8080/tcp
fi

# Verification
echo -e "${GREEN}[SUCCESS] CockroachDB installation completed!${NC}"
echo ""
echo "Next steps:"
echo "1. Configure backup credentials in /etc/cockroach/backup-config"
echo "2. Start backup timer: systemctl start cockroach-backup.timer"
echo "3. Initialize cluster from first node:"
echo "   sudo -u cockroach cockroach init --certs-dir=/etc/cockroach/certs --host=${NODE_IP}:26257"
echo "4. Access web UI: https://${NODE_IP}:8080"
echo "5. Test monitoring: /usr/local/bin/cockroach-monitor"
echo ""
echo "Service status:"
systemctl status cockroach --no-pager -l

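Review the script before running. It must run as root and needs the cluster IPs and this node's IP as arguments, for example: sudo bash install.sh '10.0.1.10,10.0.1.11,10.0.1.12' '10.0.1.10' 'my-backup-bucket'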
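
The install script only checks that its first two arguments are non-empty. A stricter pre-flight check could catch a malformed address before any certificates are generated. This is a sketch with hypothetical function names; requiring the node's own IP to appear in the cluster list is an assumption drawn from the usage example, not something the script enforces.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight checks for the two required install.sh arguments.

# Succeeds for dotted-quad IPv4 addresses with each octet in 0..255.
is_ipv4() {
    local ip="$1" octet
    [[ "$ip" =~ ^([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})$ ]] || return 1
    for octet in "${BASH_REMATCH[@]:1}"; do
        (( 10#$octet <= 255 )) || return 1   # 10# forces base 10 for octets like "010"
    done
}

# Succeeds when every entry of the comma-separated list is IPv4
# and the node's own address appears in the list.
check_install_args() {
    local cluster_ips="$1" node_ip="$2" ip found=1
    local -a ips
    IFS=',' read -r -a ips <<< "$cluster_ips"
    for ip in "${ips[@]}"; do
        is_ipv4 "$ip" || return 1
        [ "$ip" = "$node_ip" ] && found=0
    done
    return "$found"
}

if check_install_args '10.0.1.10,10.0.1.11,10.0.1.12' '10.0.1.10'; then
    echo "arguments look valid"
fi
```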
