Set up ScyllaDB multi-region cluster with automated backup strategies, cross-datacenter replication, and failover automation for enterprise-grade disaster recovery and business continuity.
Prerequisites
- Minimum 6 servers across 2 regions
- Root access to all servers
- Network connectivity between regions
- AWS S3 bucket for backups
- Basic understanding of distributed databases
What this solves
ScyllaDB disaster recovery with cross-region replication keeps your database available through datacenter failures, natural disasters, and regional outages. This tutorial builds a production-grade multi-region ScyllaDB cluster with automated backups, real-time cross-datacenter replication, and monitoring, plus scripted failover that redirects application traffic once a datacenter failure is detected.
Prerequisites and planning
Infrastructure requirements
You'll need at least 6 servers across two regions (minimum 3 nodes per region) with sufficient network bandwidth between regions for replication traffic.
| Component | Minimum specs | Recommended specs |
|---|---|---|
| CPU | 8 cores | 16+ cores |
| RAM | 32GB | 64GB+ |
| Storage | 1TB NVMe SSD | 2TB+ NVMe SSD |
| Network | 10 Gbps | 25 Gbps+ |
| Inter-region bandwidth | 1 Gbps | 5 Gbps+ |
Network and firewall configuration
Configure firewall rules to allow ScyllaDB cluster communication and monitoring access across regions.
sudo ufw allow 7000/tcp # Inter-node communication
sudo ufw allow 7001/tcp # TLS inter-node communication
sudo ufw allow 9042/tcp # CQL native transport
sudo ufw allow 9160/tcp # Thrift RPC
sudo ufw allow 10000/tcp # REST API
sudo ufw allow 9180/tcp # Prometheus metrics
sudo ufw allow 19042/tcp # CQL shard-aware port
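Before moving on, it helps to confirm the ports are actually reachable between regions. The sketch below probes a TCP port using bash's built-in /dev/tcp pseudo-device, so no netcat is required; the host and port used here are placeholders for your own nodes.

```shell
#!/usr/bin/env bash
# Probe a TCP port with bash's /dev/tcp pseudo-device (no netcat required).
check_port() {
  local host=$1 port=$2
  if timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "$host:$port open"
  else
    echo "$host:$port closed"
  fi
}

# Replace with your own node IPs and the ScyllaDB ports opened above
check_port 127.0.0.1 9042
```

Run it from each node against its peers in the other region; any `closed` result points at a firewall or routing problem.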
Step-by-step installation
Install ScyllaDB on all nodes
Install ScyllaDB packages and dependencies on every cluster node across both regions.
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 5e08fbd8b5d6ec9c
sudo curl -L --output /etc/apt/sources.list.d/scylla.list http://downloads.scylladb.com/deb/ubuntu/scylla-5.4.list
sudo apt update
sudo apt install -y scylla
Configure system optimization
Run ScyllaDB's setup script to optimize system parameters for maximum performance and reliability.
sudo scylla_setup --no-raid-setup
sudo scylla_io_setup
Configure ScyllaDB for multi-region topology
Edit /etc/scylla/scylla.yaml with topology-aware settings and cross-region replication support.
# Cluster settings
cluster_name: 'production-cluster'
num_tokens: 256
authenticator: PasswordAuthenticator
authorizer: CassandraAuthorizer
# Network settings
listen_address: 10.0.1.10 # Replace with actual node IP
broadcast_address: 10.0.1.10 # Replace with actual node IP
rpc_address: 0.0.0.0
broadcast_rpc_address: 10.0.1.10 # Replace with actual node IP
# Seeds configuration (include nodes from both regions)
seed_provider:
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
parameters:
- seeds: "10.0.1.10,10.0.1.11,10.0.1.12,10.0.2.10,10.0.2.11,10.0.2.12"
# Topology settings
endpoint_snitch: GossipingPropertyFileSnitch
data_file_directories:
- /var/lib/scylla/data
commitlog_directory: /var/lib/scylla/commitlog
hints_directory: /var/lib/scylla/hints
view_hints_directory: /var/lib/scylla/view_hints
# Cross-region optimization
inter_dc_stream_throughput_outbound_megabits_per_sec: 1000
inter_dc_tcp_nodelay: true
read_request_timeout_in_ms: 10000
write_request_timeout_in_ms: 10000
range_request_timeout_in_ms: 20000
# Enable experimental features
experimental_features:
- udf
- alternator-streams
# Backup and snapshot settings
snapshot_before_compaction: false
auto_snapshot: true
incremental_backups: true
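Before restarting a node on the new configuration, a quick sanity check that the required keys made it into the file can save a failed boot. This sketch validates an inline stand-in fragment; point it at /etc/scylla/scylla.yaml in practice (the key list is an illustrative assumption, not exhaustive).

```shell
#!/usr/bin/env bash
# Check a scylla.yaml fragment for required top-level keys.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
cluster_name: 'production-cluster'
endpoint_snitch: GossipingPropertyFileSnitch
listen_address: 10.0.1.10
EOF

missing=0
for key in cluster_name endpoint_snitch listen_address seed_provider; do
  grep -q "^${key}:" "$cfg" || { echo "$key: MISSING"; missing=$((missing + 1)); }
done
echo "$missing required key(s) missing"
rm -f "$cfg"
```

The sample fragment deliberately omits seed_provider, so the check reports one missing key; against a complete scylla.yaml it should report zero.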
Configure datacenter topology
Define datacenter and rack information in /etc/scylla/cassandra-rackdc.properties for proper replica placement and network topology awareness.
# Region 1 nodes (adjust for each node)
dc=us-east-1
rack=rack1
For region 2 nodes, use:
dc=us-west-1
rack=rack1
Start ScyllaDB services
Enable and start ScyllaDB on all nodes, beginning with seed nodes in the first region.
sudo systemctl enable scylla-server
sudo systemctl start scylla-server
sudo systemctl status scylla-server
Verify cluster formation
Check that all nodes have joined the cluster and can communicate across regions.
nodetool status
nodetool ring
nodetool describecluster
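The `nodetool status` output can also be summarized programmatically, which is handy for the scripted health checks later in this guide. This sketch counts Up/Normal (UN) nodes per datacenter from a sample that mimics typical output; in production, substitute `"$(nodetool status)"` for the sample text.

```shell
#!/usr/bin/env bash
# Count UN (Up/Normal) nodes in a given datacenter from nodetool status output.
sample='Datacenter: us-east-1
=====================
--  Address    Load    Tokens  Owns  Host ID  Rack
UN  10.0.1.10  1.2 TB  256     ?     aaaa     rack1
UN  10.0.1.11  1.1 TB  256     ?     bbbb     rack1
DN  10.0.1.12  1.3 TB  256     ?     cccc     rack1'

up_east=$(echo "$sample" | awk '
  /^Datacenter:/ { dc = $2 }
  /^UN/ && dc == "us-east-1" { n++ }
  END { print n + 0 }')
echo "us-east-1: $up_east of 3 nodes up"
```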
Configure cross-region replication
Create keyspace with NetworkTopologyStrategy
Create keyspaces configured for multi-datacenter replication with appropriate consistency levels.
cqlsh -u cassandra -p cassandra
CREATE KEYSPACE production_data
WITH replication = {
'class': 'NetworkTopologyStrategy',
'us-east-1': 3,
'us-west-1': 3
};
CREATE KEYSPACE system_auth_backup
WITH replication = {
'class': 'NetworkTopologyStrategy',
'us-east-1': 3,
'us-west-1': 3
};
USE production_data;
CREATE TABLE user_sessions (
session_id UUID PRIMARY KEY,
user_id UUID,
created_at TIMESTAMP,
last_access TIMESTAMP,
session_data TEXT
) WITH gc_grace_seconds = 86400;
Configure consistency levels
Set appropriate consistency levels for read and write operations to balance performance with data consistency.
-- For strong consistency across regions
CONSISTENCY EACH_QUORUM;
-- For local datacenter consistency (better performance)
CONSISTENCY LOCAL_QUORUM;
-- Test consistency settings
SELECT * FROM system.local;
SELECT * FROM system.peers;
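The arithmetic behind these consistency levels is worth keeping in mind: a quorum is floor(RF/2)+1 replicas, so with RF=3 per datacenter, LOCAL_QUORUM waits on 2 replicas in the local DC while EACH_QUORUM needs 2 replicas in every DC. A small sketch of the math:

```shell
#!/usr/bin/env bash
# Quorum size for a given replication factor: floor(RF/2) + 1
quorum() { echo $(( $1 / 2 + 1 )); }

rf_per_dc=3
echo "LOCAL_QUORUM: $(quorum $rf_per_dc) replicas in the local DC"
echo "EACH_QUORUM: $(quorum $rf_per_dc) replicas in each DC"
```

This is why RF=3 per region tolerates one node failure per region while still serving quorum reads and writes.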
Implement automated backup procedures
Install backup dependencies
Install required tools for automated backup creation, compression, and remote storage.
sudo apt install -y awscli s3cmd pigz parallel
pip3 install --user scylla-manager-client
Create backup automation script
Develop a comprehensive backup script that handles snapshots, compression, and remote storage with error handling.
#!/bin/bash
# ScyllaDB Automated Backup Script
set -euo pipefail
# Configuration
BACKUP_DIR="/opt/scylla/backups"
S3_BUCKET="s3://scylla-backups-production"
RETENTION_DAYS=30
LOG_FILE="/var/log/scylla/backup.log"
DATE=$(date +%Y%m%d_%H%M%S)
HOSTNAME=$(hostname -f)
# Logging function
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}
log "Starting backup process on $HOSTNAME"
# Create backup directory
mkdir -p "$BACKUP_DIR/$DATE"
# Create snapshot
log "Creating snapshot"
nodetool snapshot -t "backup_$DATE"
# Find and compress snapshot files
log "Compressing snapshot data"
find /var/lib/scylla/data -name "backup_$DATE" -type d | while read -r snapshot_dir; do
keyspace=$(echo "$snapshot_dir" | cut -d'/' -f6)
table=$(echo "$snapshot_dir" | cut -d'/' -f7)
# Create compressed archive
tar -I pigz -cf "$BACKUP_DIR/$DATE/${keyspace}_${table}_$DATE.tar.gz" -C "$snapshot_dir" .
log "Compressed $keyspace.$table"
done
# Upload to S3
log "Uploading to S3"
aws s3 sync "$BACKUP_DIR/$DATE" "$S3_BUCKET/$HOSTNAME/$DATE/" --storage-class STANDARD_IA
# Clean up local backups older than 7 days
log "Cleaning up old local backups"
find "$BACKUP_DIR" -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} +
# Remove this run's snapshot (a bare clearsnapshot would drop every snapshot on the node)
log "Cleaning up snapshot backup_$DATE"
nodetool clearsnapshot -t "backup_$DATE"
# Clean up old S3 backups
log "Cleaning up old S3 backups"
aws s3 ls "$S3_BUCKET/$HOSTNAME/" | while read -r line; do
backup_date=$(echo "$line" | awk '{print $2}' | tr -d '/')
if [[ $(date -d "$backup_date" +%s 2>/dev/null || echo 0) -lt $(date -d "$RETENTION_DAYS days ago" +%s) ]]; then
aws s3 rm "$S3_BUCKET/$HOSTNAME/$backup_date/" --recursive
log "Removed old backup: $backup_date"
fi
done
log "Backup process completed successfully"
# Health check
nodetool status | grep -q "UN" && log "Cluster health: OK" || log "Cluster health: WARNING"
Save the script as /opt/scylla/backup-automation.sh, then make it executable and create the log directory:
sudo chmod +x /opt/scylla/backup-automation.sh
sudo mkdir -p /var/log/scylla
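The retention logic in the script above is easy to get wrong, so here it is in isolation for testing. It uses GNU date, as the backup script does; the 30-day cutoff mirrors RETENTION_DAYS, and an unparsable date stamp is treated as expired, just as in the script.

```shell
#!/usr/bin/env bash
# Decide whether a backup's YYYYMMDD date stamp falls outside the retention window.
RETENTION_DAYS=30

is_expired() {
  local backup_epoch cutoff_epoch
  backup_epoch=$(date -d "$1" +%s 2>/dev/null || echo 0)
  cutoff_epoch=$(date -d "$RETENTION_DAYS days ago" +%s)
  [ "$backup_epoch" -lt "$cutoff_epoch" ]
}

old=$(date -d "40 days ago" +%Y%m%d)
recent=$(date -d "1 day ago" +%Y%m%d)
is_expired "$old" && echo "$old: expired"
is_expired "$recent" || echo "$recent: kept"
```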
Configure S3 credentials
Set up AWS credentials and S3 bucket configuration for secure backup storage.
sudo mkdir -p /root/.aws
# /root/.aws/credentials
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
# /root/.aws/config
[default]
region = us-east-1
output = json
[profile backup]
region = us-east-1
s3 =
    max_concurrent_requests = 20
    max_bandwidth = 100MB/s
sudo chmod 600 /root/.aws/credentials
Schedule automated backups
Configure systemd timers for reliable backup scheduling with proper error handling and monitoring.
# /etc/systemd/system/scylla-backup.service
[Unit]
Description=ScyllaDB Automated Backup
After=scylla-server.service
Requires=scylla-server.service
[Service]
Type=oneshot
User=root
Group=root
ExecStart=/opt/scylla/backup-automation.sh
StandardOutput=journal
StandardError=journal
# /etc/systemd/system/scylla-backup.timer
[Unit]
Description=Run ScyllaDB backup daily at 2 AM
Requires=scylla-backup.service
[Timer]
OnCalendar=*-*-* 02:00:00
RandomizedDelaySec=300
Persistent=true
[Install]
WantedBy=timers.target
sudo systemctl daemon-reload
sudo systemctl enable scylla-backup.timer
sudo systemctl start scylla-backup.timer
sudo systemctl status scylla-backup.timer
Set up monitoring and failover automation
Install monitoring components
Install Prometheus and Grafana for comprehensive ScyllaDB cluster monitoring with custom dashboards.
sudo apt install -y prometheus prometheus-node-exporter grafana
wget https://github.com/scylladb/scylla-monitoring/archive/scylla-monitoring-4.6.tar.gz
tar -xzf scylla-monitoring-4.6.tar.gz
sudo mv scylla-monitoring-4.6 /opt/scylla-monitoring
Configure Prometheus for ScyllaDB
Set up Prometheus configuration to scrape metrics from all ScyllaDB nodes across regions.
# ScyllaDB cluster nodes
- targets:
- 10.0.1.10:9180
- 10.0.1.11:9180
- 10.0.1.12:9180
labels:
cluster: production-cluster
dc: us-east-1
- targets:
- 10.0.2.10:9180
- 10.0.2.11:9180
- 10.0.2.12:9180
labels:
cluster: production-cluster
dc: us-west-1
# Node exporter targets
- targets:
- 10.0.1.10:9100
- 10.0.1.11:9100
- 10.0.1.12:9100
- 10.0.2.10:9100
- 10.0.2.11:9100
- 10.0.2.12:9100
Create failover automation script
Develop intelligent failover automation that detects datacenter failures and redirects application traffic.
#!/bin/bash
# ScyllaDB Failover Automation Script
set -euo pipefail
# Configuration
PRIMARY_DC="us-east-1"
SECONDARY_DC="us-west-1"
PRIMARY_NODES=("10.0.1.10" "10.0.1.11" "10.0.1.12")
SECONDARY_NODES=("10.0.2.10" "10.0.2.11" "10.0.2.12")
LOAD_BALANCER_CONFIG="/etc/haproxy/haproxy.cfg"
ALERT_EMAIL="ops@example.com"
LOG_FILE="/var/log/scylla/failover.log"
# Logging
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}
# Health check function
check_datacenter_health() {
local dc=$1
local nodes=()
local healthy_count=0
if [[ "$dc" == "$PRIMARY_DC" ]]; then
nodes=("${PRIMARY_NODES[@]}")
else
nodes=("${SECONDARY_NODES[@]}")
fi
for node in "${nodes[@]}"; do
if timeout 5 cqlsh "$node" -u cassandra -p cassandra -e "SELECT now() FROM system.local" &>/dev/null; then
((healthy_count++))
fi
done
# Require majority of nodes to be healthy
if [[ $healthy_count -ge 2 ]]; then
return 0 # Healthy
else
return 1 # Unhealthy
fi
}
# Update load balancer configuration
update_load_balancer() {
local active_dc=$1
log "Updating load balancer to use $active_dc"
# Generate new HAProxy config
cat > "$LOAD_BALANCER_CONFIG" << EOF
global
daemon
log stdout local0
defaults
mode tcp
timeout connect 5000ms
timeout client 50000ms
timeout server 50000ms
listen scylla-cluster
bind *:9042
balance roundrobin
    # Plain TCP connect check; a protocol-level CQL probe would need a
    # byte-exact tcp-check send-binary/expect handshake
    option tcp-check
EOF
if [[ "$active_dc" == "$PRIMARY_DC" ]]; then
for node in "${PRIMARY_NODES[@]}"; do
echo " server node-$node $node:9042 check" >> "$LOAD_BALANCER_CONFIG"
done
else
for node in "${SECONDARY_NODES[@]}"; do
echo " server node-$node $node:9042 check" >> "$LOAD_BALANCER_CONFIG"
done
fi
systemctl reload haproxy
}
# Send alert
send_alert() {
local message=$1
echo "$message" | mail -s "ScyllaDB Failover Alert" "$ALERT_EMAIL"
log "Alert sent: $message"
}
# Main failover logic
log "Starting failover check"
if check_datacenter_health "$PRIMARY_DC"; then
log "Primary datacenter ($PRIMARY_DC) is healthy"
# Ensure we're using primary DC
if ! grep -q "${PRIMARY_NODES[0]}" "$LOAD_BALANCER_CONFIG"; then
log "Failing back to primary datacenter"
update_load_balancer "$PRIMARY_DC"
send_alert "ScyllaDB: Failed back to primary datacenter $PRIMARY_DC"
fi
else
log "Primary datacenter ($PRIMARY_DC) is unhealthy"
if check_datacenter_health "$SECONDARY_DC"; then
log "Secondary datacenter ($SECONDARY_DC) is healthy, initiating failover"
update_load_balancer "$SECONDARY_DC"
send_alert "ScyllaDB: Failed over to secondary datacenter $SECONDARY_DC"
else
log "Both datacenters are unhealthy - CRITICAL ALERT"
send_alert "CRITICAL: ScyllaDB cluster completely unavailable - both datacenters down"
fi
fi
log "Failover check completed"
Save the script as /opt/scylla/failover-automation.sh, then make it executable:
sudo chmod +x /opt/scylla/failover-automation.sh
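The core of the failover script is the majority rule in check_datacenter_health. Here is that logic in isolation with the cqlsh probe stubbed out, so you can exercise it without a live cluster; probe() is a hypothetical stand-in for the real connectivity check.

```shell
#!/usr/bin/env bash
# Majority health check: a 3-node DC is healthy when at least 2 nodes answer.
probe() {
  # Stub: pretend only the first two nodes respond
  case $1 in 10.0.1.10|10.0.1.11) return 0 ;; *) return 1 ;; esac
}

dc_healthy() {
  local healthy=0 node
  for node in "$@"; do
    probe "$node" && healthy=$((healthy + 1))
  done
  [ "$healthy" -ge 2 ]
}

if dc_healthy 10.0.1.10 10.0.1.11 10.0.1.12; then
  echo "dc healthy"
else
  echo "dc unhealthy"
fi
```

Requiring a majority rather than a single responsive node prevents a flapping node from triggering repeated failovers.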
Configure automated failover monitoring
Set up continuous monitoring with automatic failover detection and execution.
# /etc/systemd/system/scylla-failover.service
[Unit]
Description=ScyllaDB Failover Monitor
After=network.target scylla-server.service
[Service]
Type=oneshot
User=root
Group=root
ExecStart=/opt/scylla/failover-automation.sh
StandardOutput=journal
StandardError=journal
# /etc/systemd/system/scylla-failover.timer
[Unit]
Description=Run ScyllaDB failover check every 30 seconds
Requires=scylla-failover.service
[Timer]
OnCalendar=*:*:00/30
Persistent=true
[Install]
WantedBy=timers.target
sudo systemctl daemon-reload
sudo systemctl enable scylla-failover.timer
sudo systemctl start scylla-failover.timer
Configure monitoring dashboards
Set up Grafana dashboards
Configure comprehensive monitoring dashboards for cluster health, performance metrics, and disaster recovery status.
sudo systemctl start grafana-server
sudo systemctl enable grafana-server
# Import ScyllaDB monitoring dashboards
cd /opt/scylla-monitoring
sudo ./start-all.sh -d /opt/scylla-monitoring/prometheus/scylla_servers.yml -s /opt/scylla-monitoring/prometheus/scylla_manager_servers.yml
Configure alerting rules
Set up Prometheus alerting rules (for example in /etc/prometheus/rules/scylla.yml) for critical ScyllaDB metrics and disaster recovery scenarios.
groups:
- name: scylla-cluster
rules:
- alert: ScyllaNodeDown
expr: up{job="scylla"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "ScyllaDB node {{ $labels.instance }} is down"
- alert: ScyllaDatacenterDown
expr: count by (dc) (up{job="scylla"} == 1) < 2
for: 2m
labels:
severity: critical
annotations:
summary: "ScyllaDB datacenter {{ $labels.dc }} has less than 2 nodes available"
  - alert: ScyllaNodeNotNormal
    expr: scylla_node_operation_mode != 3
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "ScyllaDB node {{ $labels.instance }} is not in NORMAL operation mode"
  # Scylla latency histograms are reported in microseconds
  - alert: ScyllaHighReadLatency
    expr: histogram_quantile(0.99, rate(scylla_storage_proxy_coordinator_read_latency_bucket[5m])) > 100000
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "High p99 read latency on {{ $labels.instance }}: {{ $value }}us"
  - alert: ScyllaHighWriteLatency
    expr: histogram_quantile(0.99, rate(scylla_storage_proxy_coordinator_write_latency_bucket[5m])) > 100000
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "High p99 write latency on {{ $labels.instance }}: {{ $value }}us"
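If the latency alerts look opaque, this is what histogram_quantile does under the hood: it finds the bucket whose cumulative count first reaches the target rank, then interpolates linearly within that bucket. A worked example with made-up bucket data:

```shell
#!/usr/bin/env bash
# Linear interpolation over cumulative histogram buckets, as Prometheus does.
p99=$(awk -v q=0.99 'BEGIN {
  n = split("10 50 100", le, " ")   # bucket upper bounds (le labels)
  split("100 900 990", cum, " ")    # cumulative counts per bucket
  total = 1000                      # count in the +Inf bucket
  rank = q * total
  prev_le = 0; prev_cum = 0
  for (i = 1; i <= n; i++) {
    if (cum[i] >= rank) {
      printf "%.1f", prev_le + (le[i] - prev_le) * (rank - prev_cum) / (cum[i] - prev_cum)
      exit
    }
    prev_le = le[i]; prev_cum = cum[i]
  }
  printf "inf"
}')
echo "p99 = $p99"
```

Because the result is interpolated from bucket boundaries, choose bucket resolution to match the thresholds you alert on.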
Implement restore procedures
Create restore automation script
Develop a comprehensive restore script for point-in-time recovery and disaster recovery scenarios.
#!/bin/bash
# ScyllaDB Automated Restore Script
set -euo pipefail
# Configuration
BACKUP_DIR="/opt/scylla/backups"
S3_BUCKET="s3://scylla-backups-production"
RESTORE_DATE="${1:-}" # Format: YYYYMMDD_HHMMSS
TARGET_KEYSPACE="${2:-}"
LOG_FILE="/var/log/scylla/restore.log"
HOSTNAME=$(hostname -f)
# Validation
if [[ -z "$RESTORE_DATE" ]] || [[ -z "$TARGET_KEYSPACE" ]]; then
echo "Usage: $0 <restore_date> <target_keyspace>"
echo "Example: $0 20240115_020000 production_data"
exit 1
fi
# Logging function
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}
log "Starting restore process for $TARGET_KEYSPACE from backup $RESTORE_DATE"
# Stop ScyllaDB service
log "Stopping ScyllaDB service"
sudo systemctl stop scylla-server
# Create restore directory
mkdir -p "$BACKUP_DIR/restore_$RESTORE_DATE"
# Download backup from S3
log "Downloading backup from S3"
aws s3 sync "$S3_BUCKET/$HOSTNAME/$RESTORE_DATE/" "$BACKUP_DIR/restore_$RESTORE_DATE/"
# Clear existing data for target keyspace
log "Clearing existing data for keyspace $TARGET_KEYSPACE"
find "/var/lib/scylla/data/$TARGET_KEYSPACE" -name "*.db" -delete 2>/dev/null || true
# Extract and restore backup files
log "Extracting and restoring backup files"
for backup_file in "$BACKUP_DIR/restore_$RESTORE_DATE/"${TARGET_KEYSPACE}_*.tar.gz; do
if [[ -f "$backup_file" ]]; then
table_name=$(basename "$backup_file" | sed "s/^${TARGET_KEYSPACE}_//" | sed -E 's/_[0-9]{8}_[0-9]{6}\.tar\.gz$//')
log "Restoring table $table_name"
# Locate the existing table directory (its UUID suffix must match the schema,
# so never invent a new one)
table_dir=$(find "/var/lib/scylla/data/$TARGET_KEYSPACE" -maxdepth 1 -type d -name "${table_name}*" | head -1)
if [[ -z "$table_dir" ]]; then
log "Skipping $table_name: no matching table directory found"
continue
fi
# Extract backup
tar -xzf "$backup_file" -C "$table_dir"
# Fix ownership
chown -R scylla:scylla "$table_dir"
log "Restored table $table_name"
fi
done
# Start ScyllaDB service
log "Starting ScyllaDB service"
sudo systemctl start scylla-server
# Wait for service to be ready
log "Waiting for ScyllaDB to be ready"
while ! cqlsh -u cassandra -p cassandra -e "SELECT now() FROM system.local" &>/dev/null; do
sleep 5
done
# Refresh each restored table (nodetool refresh takes keyspace and table arguments)
log "Refreshing keyspace $TARGET_KEYSPACE"
for table_path in "/var/lib/scylla/data/$TARGET_KEYSPACE"/*/; do
nodetool refresh "$TARGET_KEYSPACE" "$(basename "$table_path" | cut -d'-' -f1)"
done
# Run repair
log "Running repair on keyspace $TARGET_KEYSPACE"
nodetool repair $TARGET_KEYSPACE
# Clean up temporary files
log "Cleaning up temporary files"
rm -rf "$BACKUP_DIR/restore_$RESTORE_DATE"
log "Restore process completed successfully for keyspace $TARGET_KEYSPACE"
# Verify data
log "Verifying restored data"
cqlsh -e "SELECT COUNT(*) FROM $TARGET_KEYSPACE.user_sessions;" || log "Warning: Could not verify restored data"
Save the script as /opt/scylla/restore-automation.sh, then make it executable:
sudo chmod +x /opt/scylla/restore-automation.sh
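The trickiest line in the restore script is recovering the table name from an archive filename. Here is that parsing in isolation, using a simplified sample filename shaped like the ones the backup script produces (keyspace prefix, table, timestamp suffix):

```shell
#!/usr/bin/env bash
# Strip the keyspace prefix and the YYYYMMDD_HHMMSS timestamp suffix.
parse_table() {
  local ks=$1 archive=$2
  basename "$archive" | sed "s/^${ks}_//" | sed -E 's/_[0-9]{8}_[0-9]{6}\.tar\.gz$//'
}

table=$(parse_table production_data production_data_user_sessions_20240115_020000.tar.gz)
echo "table: $table"
```

Anchoring the suffix pattern with `$` keeps timestamps embedded elsewhere in a table name from being stripped by accident.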
Verify your setup
Test cluster connectivity and replication
Verify that cross-region replication is working correctly and data is being replicated across datacenters.
# Check cluster status
nodetool status
# Verify cross-datacenter connectivity
nodetool ring
# Test data replication (write locally, read from the other region)
cqlsh -u cassandra -p cassandra -e "INSERT INTO production_data.user_sessions (session_id, user_id, created_at) VALUES (uuid(), uuid(), toTimestamp(now()));"
cqlsh 10.0.2.10 -u cassandra -p cassandra -e "SELECT COUNT(*) FROM production_data.user_sessions;"
# Check replication lag
nodetool netstats
Test backup and restore procedures
Perform a complete backup and restore test to validate your disaster recovery procedures.
# Run manual backup
sudo /opt/scylla/backup-automation.sh
# Check backup status
ls -la /opt/scylla/backups/
aws s3 ls s3://scylla-backups-production/
# Test restore procedure (use an actual backup date)
sudo /opt/scylla/restore-automation.sh 20240115_020000 production_data
Verify monitoring and alerting
Ensure monitoring dashboards are collecting metrics and alerting is configured properly.
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up")'
# Verify ScyllaDB metrics
curl http://localhost:9180/metrics | grep scylla_reactor_utilization
# Check failover monitoring
sudo systemctl status scylla-failover.timer
journalctl -u scylla-failover.service -f
Performance optimization
Optimize cross-region settings
Fine-tune ScyllaDB parameters for optimal cross-region performance and reduce replication lag.
# Cross-region optimization
stream_throughput_outbound_megabits_per_sec: 2000
compaction_throughput_mb_per_sec: 256
concurrent_compactors: 8
# Memory settings
memtable_heap_space_in_mb: 8192
memtable_offheap_space_in_mb: 8192
# Network optimization
rpc_keepalive: true
inter_dc_tcp_nodelay: true
commit_failure_policy: stop
# Timeouts for cross-region
cross_node_timeout: true
request_timeout_in_ms: 20000
sudo systemctl restart scylla-server
Security hardening
Enable SSL/TLS encryption
Configure SSL encryption for client connections and inter-node communication across regions.
# Generate SSL certificates
sudo mkdir -p /etc/scylla/ssl
sudo openssl genrsa -out /etc/scylla/ssl/scylla.key 4096
sudo openssl req -new -x509 -key /etc/scylla/ssl/scylla.key -out /etc/scylla/ssl/scylla.crt -days 3650 -subj "/C=US/ST=State/L=City/O=Organization/CN=*.example.com"
sudo chown -R scylla:scylla /etc/scylla/ssl
sudo chmod 600 /etc/scylla/ssl/scylla.key
# SSL Configuration
client_encryption_options:
  enabled: true
  optional: false
  certificate: /etc/scylla/ssl/scylla.crt
  keyfile: /etc/scylla/ssl/scylla.key
server_encryption_options:
  internode_encryption: all
  certificate: /etc/scylla/ssl/scylla.crt
  keyfile: /etc/scylla/ssl/scylla.key
  require_client_auth: false
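Before distributing certificates to every node, verify them locally. This sketch generates a throwaway self-signed certificate (so it is runnable anywhere openssl is installed) and applies the two checks you would run against the real /etc/scylla/ssl/scylla.crt: subject and expiry.

```shell
#!/usr/bin/env bash
# Generate a demo cert, then inspect subject and expiry with openssl x509.
tmp=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -keyout "$tmp/scylla.key" -out "$tmp/scylla.crt" \
  -subj "/CN=node1.example.com" 2>/dev/null

subject=$(openssl x509 -in "$tmp/scylla.crt" -noout -subject)
echo "$subject"
# -checkend exits non-zero if the certificate expires within the given seconds
openssl x509 -in "$tmp/scylla.crt" -noout -checkend 86400 >/dev/null \
  && echo "valid for at least 24h"
```

Run the same two x509 commands against your production certificate before and after each renewal.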
Common issues
| Symptom | Cause | Fix |
|---|---|---|
| Nodes can't join cluster | Firewall blocking ports | sudo ufw allow 7000:7001/tcp and check seeds configuration |
| High replication lag | Network bandwidth limitations | Increase stream_throughput_outbound_megabits_per_sec and optimize network |
| Backup script fails | S3 credentials or permissions | Check AWS credentials and S3 bucket permissions with aws s3 ls |
| Restore process hangs | Insufficient disk space | Check disk space with df -h and clear old snapshots |
| Failover not triggered | Health check timeouts | Adjust timeout values in failover script and check network connectivity |
| SSL connection errors | Certificate issues | Verify certificate validity with openssl x509 -in /etc/scylla/ssl/scylla.crt -text |
| Monitoring metrics missing | Prometheus scraping failures | Check ScyllaDB metrics endpoint with curl http://node:9180/metrics |
Next steps
- Configure ScyllaDB backup and restore with automation
- Configure advanced Grafana dashboards and alerting with Prometheus integration
- Implement ScyllaDB security hardening with RBAC and encryption
- Optimize ScyllaDB performance for high-throughput workloads
- Set up ScyllaDB multi-tenant architecture with isolation
Automated install script
Run this script on each node to automate the entire setup:
#!/usr/bin/env bash
set -euo pipefail
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
# Script variables
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
LOG_FILE="/var/log/scylladb-install.log"
CLUSTER_NAME="production-cluster"
SCYLLA_CONFIG_DIR="/etc/scylla"
# Usage function
usage() {
echo "Usage: $0 <node_ip> <datacenter> <rack> [seed_nodes]"
echo "Example: $0 10.0.1.10 us-east-1 rack1 10.0.1.10,10.0.1.11,10.0.2.10"
exit 1
}
# Logging function
log() {
echo -e "${GREEN}[$(date '+%Y-%m-%d %H:%M:%S')] $1${NC}" | tee -a "$LOG_FILE"
}
error() {
echo -e "${RED}[ERROR] $1${NC}" | tee -a "$LOG_FILE"
exit 1
}
warning() {
echo -e "${YELLOW}[WARNING] $1${NC}" | tee -a "$LOG_FILE"
}
# Cleanup function for rollback
cleanup() {
if [ $? -ne 0 ]; then
error "Installation failed. Check $LOG_FILE for details"
fi
}
trap cleanup ERR
# Check prerequisites
check_prerequisites() {
log "[1/12] Checking prerequisites..."
if [[ $EUID -ne 0 ]]; then
error "This script must be run as root or with sudo"
fi
if [[ $# -lt 3 ]]; then
usage
fi
# Validate IP address format
if ! [[ $1 =~ ^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$ ]]; then
error "Invalid IP address format: $1"
fi
# Check system resources
local mem_gb=$(free -g | awk '/^Mem:/{print $2}')
local cpu_cores=$(nproc)
if [[ $mem_gb -lt 16 ]]; then
warning "RAM is ${mem_gb}GB, recommended minimum is 32GB"
fi
if [[ $cpu_cores -lt 4 ]]; then
warning "CPU cores: ${cpu_cores}, recommended minimum is 8 cores"
fi
}
# Detect distribution
detect_distro() {
log "[2/12] Detecting distribution..."
if [ -f /etc/os-release ]; then
. /etc/os-release
case "$ID" in
ubuntu|debian)
PKG_MGR="apt"
PKG_UPDATE="apt update"
PKG_INSTALL="apt install -y"
FIREWALL_CMD="ufw"
;;
almalinux|rocky|centos|rhel|ol)
PKG_MGR="dnf"
PKG_UPDATE="dnf update -y"
PKG_INSTALL="dnf install -y"
FIREWALL_CMD="firewall-cmd"
;;
fedora)
PKG_MGR="dnf"
PKG_UPDATE="dnf update -y"
PKG_INSTALL="dnf install -y"
FIREWALL_CMD="firewall-cmd"
;;
amzn)
PKG_MGR="yum"
PKG_UPDATE="yum update -y"
PKG_INSTALL="yum install -y"
FIREWALL_CMD="firewall-cmd"
;;
*)
error "Unsupported distribution: $ID"
;;
esac
log "Detected: $PRETTY_NAME using $PKG_MGR"
else
error "Cannot detect distribution - /etc/os-release not found"
fi
}
# Configure firewall
configure_firewall() {
log "[3/12] Configuring firewall..."
case "$FIREWALL_CMD" in
ufw)
systemctl enable ufw || true
ufw --force enable
ufw allow 7000/tcp comment "ScyllaDB inter-node"
ufw allow 7001/tcp comment "ScyllaDB TLS inter-node"
ufw allow 9042/tcp comment "ScyllaDB CQL native"
ufw allow 9160/tcp comment "ScyllaDB Thrift RPC"
ufw allow 10000/tcp comment "ScyllaDB REST API"
ufw allow 9180/tcp comment "ScyllaDB Prometheus"
ufw allow 19042/tcp comment "ScyllaDB CQL shard-aware"
;;
firewall-cmd)
systemctl enable firewalld
systemctl start firewalld
firewall-cmd --permanent --add-port=7000/tcp
firewall-cmd --permanent --add-port=7001/tcp
firewall-cmd --permanent --add-port=9042/tcp
firewall-cmd --permanent --add-port=9160/tcp
firewall-cmd --permanent --add-port=10000/tcp
firewall-cmd --permanent --add-port=9180/tcp
firewall-cmd --permanent --add-port=19042/tcp
firewall-cmd --reload
;;
esac
}
# Install dependencies
install_dependencies() {
log "[4/12] Installing dependencies..."
$PKG_UPDATE
case "$PKG_MGR" in
apt)
$PKG_INSTALL curl gnupg2 software-properties-common apt-transport-https ca-certificates
;;
dnf|yum)
$PKG_INSTALL curl gnupg2 python3-PyYAML
;;
esac
}
# Add ScyllaDB repository
add_scylla_repo() {
log "[5/12] Adding ScyllaDB repository..."
case "$PKG_MGR" in
apt)
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 5e08fbd8b5d6ec9c
curl -L -o /etc/apt/sources.list.d/scylla.list http://downloads.scylladb.com/deb/ubuntu/scylla-5.4.list
$PKG_UPDATE
;;
dnf|yum)
curl -L -o /etc/yum.repos.d/scylla.repo http://downloads.scylladb.com/rpm/centos/scylla-5.4.repo
;;
esac
}
# Install ScyllaDB
install_scylladb() {
log "[6/12] Installing ScyllaDB..."
$PKG_INSTALL scylla
# Install additional tools
case "$PKG_MGR" in
apt)
$PKG_INSTALL scylla-tools scylla-python3
;;
dnf|yum)
$PKG_INSTALL scylla-tools
;;
esac
}
# System optimization
optimize_system() {
log "[7/12] Running system optimization..."
# Run ScyllaDB setup scripts
echo -e "yes\nyes\nyes\nyes\nyes\nyes\n" | scylla_setup --no-raid-setup || warning "Setup script had warnings"
scylla_io_setup || warning "IO setup had warnings"
# Additional optimizations
echo 'vm.swappiness = 1' >> /etc/sysctl.conf
echo 'vm.max_map_count = 1048575' >> /etc/sysctl.conf
sysctl -p
}
# Configure ScyllaDB
configure_scylladb() {
log "[8/12] Configuring ScyllaDB..."
local node_ip=$1
local datacenter=$2
local rack=$3
local seeds=${4:-$node_ip}
# Backup original config
cp "${SCYLLA_CONFIG_DIR}/scylla.yaml" "${SCYLLA_CONFIG_DIR}/scylla.yaml.backup"
# Create new configuration
cat > "${SCYLLA_CONFIG_DIR}/scylla.yaml" << EOF
cluster_name: '${CLUSTER_NAME}'
num_tokens: 256
authenticator: PasswordAuthenticator
authorizer: CassandraAuthorizer
listen_address: ${node_ip}
broadcast_address: ${node_ip}
rpc_address: 0.0.0.0
broadcast_rpc_address: ${node_ip}
seed_provider:
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
parameters:
- seeds: "${seeds}"
endpoint_snitch: GossipingPropertyFileSnitch
data_file_directories:
- /var/lib/scylla/data
commitlog_directory: /var/lib/scylla/commitlog
hints_directory: /var/lib/scylla/hints
view_hints_directory: /var/lib/scylla/view_hints
inter_dc_stream_throughput_outbound_megabits_per_sec: 1000
inter_dc_tcp_nodelay: true
read_request_timeout_in_ms: 10000
write_request_timeout_in_ms: 10000
range_request_timeout_in_ms: 20000
experimental_features:
- udf
- alternator-streams
snapshot_before_compaction: false
auto_snapshot: true
incremental_backups: true
enable_user_defined_functions: true
enable_scripted_user_defined_functions: false
compaction_throughput_mb_per_sec: 256
stream_throughput_outbound_megabits_per_sec: 2000
EOF
# Set proper permissions
chown scylla:scylla "${SCYLLA_CONFIG_DIR}/scylla.yaml"
chmod 644 "${SCYLLA_CONFIG_DIR}/scylla.yaml"
# Configure datacenter properties
cat > "${SCYLLA_CONFIG_DIR}/cassandra-rackdc.properties" << EOF
dc=${datacenter}
rack=${rack}
prefer_local=true
EOF
chown scylla:scylla "${SCYLLA_CONFIG_DIR}/cassandra-rackdc.properties"
chmod 644 "${SCYLLA_CONFIG_DIR}/cassandra-rackdc.properties"
}
# Configure data directories
setup_data_directories() {
log "[9/12] Setting up data directories..."
local dirs=("/var/lib/scylla/data" "/var/lib/scylla/commitlog" "/var/lib/scylla/hints" "/var/lib/scylla/view_hints")
for dir in "${dirs[@]}"; do
mkdir -p "$dir"
chown -R scylla:scylla "$dir"
chmod 755 "$dir"
done
}
# Enable and start services
start_services() {
log "[10/12] Starting ScyllaDB services..."
systemctl daemon-reload
systemctl enable scylla-server
systemctl start scylla-server
# Wait for service to be ready
log "Waiting for ScyllaDB to start..."
sleep 30
local attempts=0
while ! systemctl is-active --quiet scylla-server && [ $attempts -lt 12 ]; do
sleep 10
attempts=$((attempts + 1))
log "Waiting for ScyllaDB... (attempt $attempts/12)"
done
if ! systemctl is-active --quiet scylla-server; then
error "ScyllaDB failed to start properly"
fi
}
# Configure monitoring
setup_monitoring() {
log "[11/12] Setting up monitoring..."
# Enable Prometheus metrics
systemctl enable scylla-jmx || warning "Could not enable scylla-jmx"
systemctl start scylla-jmx || warning "Could not start scylla-jmx"
}
# Verification
verify_installation() {
log "[12/12] Verifying installation..."
# Check service status
if ! systemctl is-active --quiet scylla-server; then
error "ScyllaDB service is not running"
fi
# Check if ScyllaDB is responding
sleep 10
if ! timeout 30 cqlsh $(hostname -I | awk '{print $1}') -e "SELECT release_version FROM system.local;" 2>/dev/null; then
warning "CQL connection test failed - this is normal for new clusters"
fi
# Check logs for errors
if journalctl -u scylla-server --since "5 minutes ago" | grep -i "error" | grep -v "Connection refused" >/dev/null; then
warning "Found errors in ScyllaDB logs - check with: journalctl -u scylla-server"
fi
log "ScyllaDB installation completed successfully!"
log "Configuration file: ${SCYLLA_CONFIG_DIR}/scylla.yaml"
log "Data directory: /var/lib/scylla/data"
log "Check status: systemctl status scylla-server"
log "View logs: journalctl -u scylla-server -f"
}
# Main execution
main() {
local node_ip=${1:-}
local datacenter=${2:-}
local rack=${3:-}
local seeds=${4:-$node_ip}
check_prerequisites "$@"
detect_distro
configure_firewall
install_dependencies
add_scylla_repo
install_scylladb
optimize_system
configure_scylladb "$node_ip" "$datacenter" "$rack" "$seeds"
setup_data_directories
start_services
setup_monitoring
verify_installation
}
# Run main function with all arguments
main "$@"
Review the script before running. Execute with: sudo bash install.sh <node_ip> <datacenter> <rack> [seed_nodes]