Set up ScyllaDB multi-region cluster with automated backup strategies, cross-datacenter replication, and failover automation for enterprise-grade disaster recovery and business continuity.
Prerequisites
- Minimum 6 servers across 2 regions
- Root access to all servers
- Network connectivity between regions
- AWS S3 bucket for backups
- Basic understanding of distributed databases
What this solves
ScyllaDB disaster recovery with cross-region replication keeps your database available through datacenter failures, natural disasters, and regional outages. This tutorial builds a production-grade multi-region ScyllaDB cluster with automated backups, real-time cross-datacenter replication, and monitoring, plus scripted failover that redirects application traffic once a datacenter failure is detected.
Prerequisites and planning
Infrastructure requirements
You'll need at least 6 servers across two regions (minimum 3 nodes per region) with sufficient network bandwidth between regions for replication traffic.
| Component | Minimum specs | Recommended specs |
|---|---|---|
| CPU | 8 cores | 16+ cores |
| RAM | 32GB | 64GB+ |
| Storage | 1TB NVMe SSD | 2TB+ NVMe SSD |
| Network | 10 Gbps | 25 Gbps+ |
| Inter-region bandwidth | 1 Gbps | 5 Gbps+ |
Network and firewall configuration
Configure firewall rules to allow ScyllaDB cluster communication and monitoring access across regions.
sudo ufw allow 7000/tcp # Inter-node communication
sudo ufw allow 7001/tcp # TLS inter-node communication
sudo ufw allow 9042/tcp # CQL native transport
sudo ufw allow 9160/tcp # Thrift RPC
sudo ufw allow 10000/tcp # REST API
sudo ufw allow 9180/tcp # Prometheus metrics
sudo ufw allow 19042/tcp # CQL shard-aware port
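Before moving on, it helps to confirm the ports are actually reachable between regions. The sketch below probes a TCP port using bash's built-in /dev/tcp pseudo-device, so no netcat is required; the host and port used here are placeholders for your own nodes.

```shell
#!/usr/bin/env bash
# Probe a TCP port with bash's /dev/tcp pseudo-device (no netcat required).
check_port() {
  local host=$1 port=$2
  if timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "$host:$port open"
  else
    echo "$host:$port closed"
  fi
}

# Replace with your own node IPs and the ScyllaDB ports opened above
check_port 127.0.0.1 9042
```

Run it from each node against its peers in the other region; any `closed` result points at a firewall or routing problem.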
Step-by-step installation
Install ScyllaDB on all nodes
Install ScyllaDB packages and dependencies on every cluster node across both regions.
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 5e08fbd8b5d6ec9c
sudo curl -L --output /etc/apt/sources.list.d/scylla.list http://downloads.scylladb.com/deb/ubuntu/scylla-5.4.list
sudo apt update
sudo apt install -y scylla
Configure system optimization
Run ScyllaDB's setup script to optimize system parameters for maximum performance and reliability.
sudo scylla_setup --no-raid-setup
sudo scylla_io_setup
Configure ScyllaDB for multi-region topology
Edit /etc/scylla/scylla.yaml with topology-aware settings and cross-region replication support.
# Cluster settings
cluster_name: 'production-cluster'
num_tokens: 256
authenticator: PasswordAuthenticator
authorizer: CassandraAuthorizer
# Network settings
listen_address: 10.0.1.10 # Replace with actual node IP
broadcast_address: 10.0.1.10 # Replace with actual node IP
rpc_address: 0.0.0.0
broadcast_rpc_address: 10.0.1.10 # Replace with actual node IP
# Seeds configuration (include nodes from both regions)
seed_provider:
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
parameters:
- seeds: "10.0.1.10,10.0.1.11,10.0.1.12,10.0.2.10,10.0.2.11,10.0.2.12"
# Topology settings
endpoint_snitch: GossipingPropertyFileSnitch
data_file_directories:
- /var/lib/scylla/data
commitlog_directory: /var/lib/scylla/commitlog
hints_directory: /var/lib/scylla/hints
view_hints_directory: /var/lib/scylla/view_hints
# Cross-region optimization
inter_dc_stream_throughput_outbound_megabits_per_sec: 1000
inter_dc_tcp_nodelay: true
read_request_timeout_in_ms: 10000
write_request_timeout_in_ms: 10000
range_request_timeout_in_ms: 20000
# Enable experimental features
experimental_features:
- udf
- alternator-streams
# Backup and snapshot settings
snapshot_before_compaction: false
auto_snapshot: true
incremental_backups: true
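Before restarting a node on the new configuration, a quick sanity check that the required keys made it into the file can save a failed boot. This sketch validates an inline stand-in fragment; point it at /etc/scylla/scylla.yaml in practice (the key list is an illustrative assumption, not exhaustive).

```shell
#!/usr/bin/env bash
# Check a scylla.yaml fragment for required top-level keys.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
cluster_name: 'production-cluster'
endpoint_snitch: GossipingPropertyFileSnitch
listen_address: 10.0.1.10
EOF

missing=0
for key in cluster_name endpoint_snitch listen_address seed_provider; do
  grep -q "^${key}:" "$cfg" || { echo "$key: MISSING"; missing=$((missing + 1)); }
done
echo "$missing required key(s) missing"
rm -f "$cfg"
```

The sample fragment deliberately omits seed_provider, so the check reports one missing key; against a complete scylla.yaml it should report zero.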
Configure datacenter topology
Define datacenter and rack information in /etc/scylla/cassandra-rackdc.properties for proper replica placement and network topology awareness.
# Region 1 nodes (adjust for each node)
dc=us-east-1
rack=rack1
For region 2 nodes, use:
dc=us-west-1
rack=rack1
Start ScyllaDB services
Enable and start ScyllaDB on all nodes, beginning with seed nodes in the first region.
sudo systemctl enable scylla-server
sudo systemctl start scylla-server
sudo systemctl status scylla-server
Verify cluster formation
Check that all nodes have joined the cluster and can communicate across regions.
nodetool status
nodetool ring
nodetool describecluster
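The `nodetool status` output can also be summarized programmatically, which is handy for the scripted health checks later in this guide. This sketch counts Up/Normal (UN) nodes per datacenter from a sample that mimics typical output; in production, substitute `"$(nodetool status)"` for the sample text.

```shell
#!/usr/bin/env bash
# Count UN (Up/Normal) nodes in a given datacenter from nodetool status output.
sample='Datacenter: us-east-1
=====================
--  Address    Load    Tokens  Owns  Host ID  Rack
UN  10.0.1.10  1.2 TB  256     ?     aaaa     rack1
UN  10.0.1.11  1.1 TB  256     ?     bbbb     rack1
DN  10.0.1.12  1.3 TB  256     ?     cccc     rack1'

up_east=$(echo "$sample" | awk '
  /^Datacenter:/ { dc = $2 }
  /^UN/ && dc == "us-east-1" { n++ }
  END { print n + 0 }')
echo "us-east-1: $up_east of 3 nodes up"
```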
Configure cross-region replication
Create keyspace with NetworkTopologyStrategy
Create keyspaces configured for multi-datacenter replication with appropriate consistency levels.
cqlsh -u cassandra -p cassandra
CREATE KEYSPACE production_data
WITH replication = {
'class': 'NetworkTopologyStrategy',
'us-east-1': 3,
'us-west-1': 3
};
CREATE KEYSPACE system_auth_backup
WITH replication = {
'class': 'NetworkTopologyStrategy',
'us-east-1': 3,
'us-west-1': 3
};
USE production_data;
CREATE TABLE user_sessions (
session_id UUID PRIMARY KEY,
user_id UUID,
created_at TIMESTAMP,
last_access TIMESTAMP,
session_data TEXT
) WITH gc_grace_seconds = 86400;
Configure consistency levels
Set appropriate consistency levels for read and write operations to balance performance with data consistency.
-- For strong consistency across regions
CONSISTENCY EACH_QUORUM;
-- For local datacenter consistency (better performance)
CONSISTENCY LOCAL_QUORUM;
-- Test consistency settings
SELECT * FROM system.local;
SELECT * FROM system.peers;
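The arithmetic behind these consistency levels is worth keeping in mind: a quorum is floor(RF/2)+1 replicas, so with RF=3 per datacenter, LOCAL_QUORUM waits on 2 replicas in the local DC while EACH_QUORUM needs 2 replicas in every DC. A small sketch of the math:

```shell
#!/usr/bin/env bash
# Quorum size for a given replication factor: floor(RF/2) + 1
quorum() { echo $(( $1 / 2 + 1 )); }

rf_per_dc=3
echo "LOCAL_QUORUM: $(quorum $rf_per_dc) replicas in the local DC"
echo "EACH_QUORUM: $(quorum $rf_per_dc) replicas in each DC"
```

This is why RF=3 per region tolerates one node failure per region while still serving quorum reads and writes.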
Implement automated backup procedures
Install backup dependencies
Install required tools for automated backup creation, compression, and remote storage.
sudo apt install -y awscli s3cmd pigz parallel
pip3 install --user scylla-manager-client
Create backup automation script
Develop a comprehensive backup script that handles snapshots, compression, and remote storage with error handling.
#!/bin/bash
# ScyllaDB Automated Backup Script
set -euo pipefail
# Configuration
BACKUP_DIR="/opt/scylla/backups"
S3_BUCKET="s3://scylla-backups-production"
RETENTION_DAYS=30
LOG_FILE="/var/log/scylla/backup.log"
DATE=$(date +%Y%m%d_%H%M%S)
HOSTNAME=$(hostname -f)
# Logging function
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}
log "Starting backup process on $HOSTNAME"
# Create backup directory
mkdir -p "$BACKUP_DIR/$DATE"
# Create snapshot
log "Creating snapshot"
nodetool snapshot -t "backup_$DATE"
# Find and compress snapshot files
log "Compressing snapshot data"
find /var/lib/scylla/data -name "backup_$DATE" -type d | while read -r snapshot_dir; do
keyspace=$(echo "$snapshot_dir" | cut -d'/' -f6)
table=$(echo "$snapshot_dir" | cut -d'/' -f7)
# Create compressed archive
tar -I pigz -cf "$BACKUP_DIR/$DATE/${keyspace}_${table}_$DATE.tar.gz" -C "$snapshot_dir" .
log "Compressed $keyspace.$table"
done
# Upload to S3
log "Uploading to S3"
aws s3 sync "$BACKUP_DIR/$DATE" "$S3_BUCKET/$HOSTNAME/$DATE/" --storage-class STANDARD_IA
# Clean up local backups older than 7 days
log "Cleaning up old local backups"
find "$BACKUP_DIR" -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} +
# Remove this run's snapshot (a bare clearsnapshot would drop every snapshot on the node)
log "Cleaning up snapshot backup_$DATE"
nodetool clearsnapshot -t "backup_$DATE"
# Clean up old S3 backups
log "Cleaning up old S3 backups"
aws s3 ls "$S3_BUCKET/$HOSTNAME/" | while read -r line; do
backup_date=$(echo "$line" | awk '{print $2}' | tr -d '/')
if [[ $(date -d "$backup_date" +%s 2>/dev/null || echo 0) -lt $(date -d "$RETENTION_DAYS days ago" +%s) ]]; then
aws s3 rm "$S3_BUCKET/$HOSTNAME/$backup_date/" --recursive
log "Removed old backup: $backup_date"
fi
done
log "Backup process completed successfully"
# Health check
nodetool status | grep -q "UN" && log "Cluster health: OK" || log "Cluster health: WARNING"
Save the script as /opt/scylla/backup-automation.sh, then make it executable and create the log directory:
sudo chmod +x /opt/scylla/backup-automation.sh
sudo mkdir -p /var/log/scylla
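The retention logic in the script above is easy to get wrong, so here it is in isolation for testing. It uses GNU date, as the backup script does; the 30-day cutoff mirrors RETENTION_DAYS, and an unparsable date stamp is treated as expired, just as in the script.

```shell
#!/usr/bin/env bash
# Decide whether a backup's YYYYMMDD date stamp falls outside the retention window.
RETENTION_DAYS=30

is_expired() {
  local backup_epoch cutoff_epoch
  backup_epoch=$(date -d "$1" +%s 2>/dev/null || echo 0)
  cutoff_epoch=$(date -d "$RETENTION_DAYS days ago" +%s)
  [ "$backup_epoch" -lt "$cutoff_epoch" ]
}

old=$(date -d "40 days ago" +%Y%m%d)
recent=$(date -d "1 day ago" +%Y%m%d)
is_expired "$old" && echo "$old: expired"
is_expired "$recent" || echo "$recent: kept"
```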
Configure S3 credentials
Set up AWS credentials and S3 bucket configuration for secure backup storage.
sudo mkdir -p /root/.aws
# /root/.aws/credentials
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
# /root/.aws/config
[default]
region = us-east-1
output = json
[profile backup]
region = us-east-1
s3 =
    max_concurrent_requests = 20
    max_bandwidth = 100MB/s
sudo chmod 600 /root/.aws/credentials
Schedule automated backups
Configure systemd timers for reliable backup scheduling with proper error handling and monitoring.
# /etc/systemd/system/scylla-backup.service
[Unit]
Description=ScyllaDB Automated Backup
After=scylla-server.service
Requires=scylla-server.service
[Service]
Type=oneshot
User=root
Group=root
ExecStart=/opt/scylla/backup-automation.sh
StandardOutput=journal
StandardError=journal
# /etc/systemd/system/scylla-backup.timer
[Unit]
Description=Run ScyllaDB backup daily at 2 AM
Requires=scylla-backup.service
[Timer]
OnCalendar=*-*-* 02:00:00
RandomizedDelaySec=300
Persistent=true
[Install]
WantedBy=timers.target
sudo systemctl daemon-reload
sudo systemctl enable scylla-backup.timer
sudo systemctl start scylla-backup.timer
sudo systemctl status scylla-backup.timer
Set up monitoring and failover automation
Install monitoring components
Install Prometheus and Grafana for comprehensive ScyllaDB cluster monitoring with custom dashboards.
sudo apt install -y prometheus prometheus-node-exporter grafana
wget https://github.com/scylladb/scylla-monitoring/archive/scylla-monitoring-4.6.tar.gz
tar -xzf scylla-monitoring-4.6.tar.gz
sudo mv scylla-monitoring-4.6 /opt/scylla-monitoring
Configure Prometheus for ScyllaDB
Set up Prometheus configuration to scrape metrics from all ScyllaDB nodes across regions.
# ScyllaDB cluster nodes
- targets:
- 10.0.1.10:9180
- 10.0.1.11:9180
- 10.0.1.12:9180
labels:
cluster: production-cluster
dc: us-east-1
- targets:
- 10.0.2.10:9180
- 10.0.2.11:9180
- 10.0.2.12:9180
labels:
cluster: production-cluster
dc: us-west-1
# Node exporter targets
- targets:
- 10.0.1.10:9100
- 10.0.1.11:9100
- 10.0.1.12:9100
- 10.0.2.10:9100
- 10.0.2.11:9100
- 10.0.2.12:9100
Create failover automation script
Develop intelligent failover automation that detects datacenter failures and redirects application traffic.
#!/bin/bash
# ScyllaDB Failover Automation Script
set -euo pipefail
# Configuration
PRIMARY_DC="us-east-1"
SECONDARY_DC="us-west-1"
PRIMARY_NODES=("10.0.1.10" "10.0.1.11" "10.0.1.12")
SECONDARY_NODES=("10.0.2.10" "10.0.2.11" "10.0.2.12")
LOAD_BALANCER_CONFIG="/etc/haproxy/haproxy.cfg"
ALERT_EMAIL="ops@example.com"
LOG_FILE="/var/log/scylla/failover.log"
# Logging
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}
# Health check function
check_datacenter_health() {
local dc=$1
local nodes=()
local healthy_count=0
if [[ "$dc" == "$PRIMARY_DC" ]]; then
nodes=("${PRIMARY_NODES[@]}")
else
nodes=("${SECONDARY_NODES[@]}")
fi
for node in "${nodes[@]}"; do
if timeout 5 cqlsh "$node" -u cassandra -p cassandra -e "SELECT now() FROM system.local" &>/dev/null; then
((healthy_count++))
fi
done
# Require majority of nodes to be healthy
if [[ $healthy_count -ge 2 ]]; then
return 0 # Healthy
else
return 1 # Unhealthy
fi
}
# Update load balancer configuration
update_load_balancer() {
local active_dc=$1
log "Updating load balancer to use $active_dc"
# Generate new HAProxy config
cat > "$LOAD_BALANCER_CONFIG" << EOF
global
daemon
log stdout local0
defaults
mode tcp
timeout connect 5000ms
timeout client 50000ms
timeout server 50000ms
listen scylla-cluster
bind *:9042
balance roundrobin
    # Plain TCP connect check; a protocol-level CQL probe would need a
    # byte-exact tcp-check send-binary/expect handshake
    option tcp-check
EOF
if [[ "$active_dc" == "$PRIMARY_DC" ]]; then
for node in "${PRIMARY_NODES[@]}"; do
echo " server node-$node $node:9042 check" >> "$LOAD_BALANCER_CONFIG"
done
else
for node in "${SECONDARY_NODES[@]}"; do
echo " server node-$node $node:9042 check" >> "$LOAD_BALANCER_CONFIG"
done
fi
systemctl reload haproxy
}
# Send alert
send_alert() {
local message=$1
echo "$message" | mail -s "ScyllaDB Failover Alert" "$ALERT_EMAIL"
log "Alert sent: $message"
}
# Main failover logic
log "Starting failover check"
if check_datacenter_health "$PRIMARY_DC"; then
log "Primary datacenter ($PRIMARY_DC) is healthy"
# Ensure we're using primary DC
if ! grep -q "${PRIMARY_NODES[0]}" "$LOAD_BALANCER_CONFIG"; then
log "Failing back to primary datacenter"
update_load_balancer "$PRIMARY_DC"
send_alert "ScyllaDB: Failed back to primary datacenter $PRIMARY_DC"
fi
else
log "Primary datacenter ($PRIMARY_DC) is unhealthy"
if check_datacenter_health "$SECONDARY_DC"; then
log "Secondary datacenter ($SECONDARY_DC) is healthy, initiating failover"
update_load_balancer "$SECONDARY_DC"
send_alert "ScyllaDB: Failed over to secondary datacenter $SECONDARY_DC"
else
log "Both datacenters are unhealthy - CRITICAL ALERT"
send_alert "CRITICAL: ScyllaDB cluster completely unavailable - both datacenters down"
fi
fi
log "Failover check completed"
Save the script as /opt/scylla/failover-automation.sh, then make it executable:
sudo chmod +x /opt/scylla/failover-automation.sh
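The core of the failover script is the majority rule in check_datacenter_health. Here is that logic in isolation with the cqlsh probe stubbed out, so you can exercise it without a live cluster; probe() is a hypothetical stand-in for the real connectivity check.

```shell
#!/usr/bin/env bash
# Majority health check: a 3-node DC is healthy when at least 2 nodes answer.
probe() {
  # Stub: pretend only the first two nodes respond
  case $1 in 10.0.1.10|10.0.1.11) return 0 ;; *) return 1 ;; esac
}

dc_healthy() {
  local healthy=0 node
  for node in "$@"; do
    probe "$node" && healthy=$((healthy + 1))
  done
  [ "$healthy" -ge 2 ]
}

if dc_healthy 10.0.1.10 10.0.1.11 10.0.1.12; then
  echo "dc healthy"
else
  echo "dc unhealthy"
fi
```

Requiring a majority rather than a single responsive node prevents a flapping node from triggering repeated failovers.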
Configure automated failover monitoring
Set up continuous monitoring with automatic failover detection and execution.
# /etc/systemd/system/scylla-failover.service
[Unit]
Description=ScyllaDB Failover Monitor
After=network.target scylla-server.service
[Service]
Type=oneshot
User=root
Group=root
ExecStart=/opt/scylla/failover-automation.sh
StandardOutput=journal
StandardError=journal
# /etc/systemd/system/scylla-failover.timer
[Unit]
Description=Run ScyllaDB failover check every 30 seconds
Requires=scylla-failover.service
[Timer]
OnCalendar=*:*:00/30
Persistent=true
[Install]
WantedBy=timers.target
sudo systemctl daemon-reload
sudo systemctl enable scylla-failover.timer
sudo systemctl start scylla-failover.timer
Configure monitoring dashboards
Set up Grafana dashboards
Configure comprehensive monitoring dashboards for cluster health, performance metrics, and disaster recovery status.
sudo systemctl start grafana-server
sudo systemctl enable grafana-server
# Import ScyllaDB monitoring dashboards
cd /opt/scylla-monitoring
sudo ./start-all.sh -d /opt/scylla-monitoring/prometheus/scylla_servers.yml -s /opt/scylla-monitoring/prometheus/scylla_manager_servers.yml
Configure alerting rules
Set up Prometheus alerting rules (for example in /etc/prometheus/rules/scylla.yml) for critical ScyllaDB metrics and disaster recovery scenarios.
groups:
- name: scylla-cluster
rules:
- alert: ScyllaNodeDown
expr: up{job="scylla"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "ScyllaDB node {{ $labels.instance }} is down"
- alert: ScyllaDatacenterDown
expr: count by (dc) (up{job="scylla"} == 1) < 2
for: 2m
labels:
severity: critical
annotations:
summary: "ScyllaDB datacenter {{ $labels.dc }} has less than 2 nodes available"
  - alert: ScyllaNodeNotNormal
    expr: scylla_node_operation_mode != 3
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "ScyllaDB node {{ $labels.instance }} is not in NORMAL operation mode"
  # Scylla latency histograms are reported in microseconds
  - alert: ScyllaHighReadLatency
    expr: histogram_quantile(0.99, rate(scylla_storage_proxy_coordinator_read_latency_bucket[5m])) > 100000
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "High p99 read latency on {{ $labels.instance }}: {{ $value }}us"
  - alert: ScyllaHighWriteLatency
    expr: histogram_quantile(0.99, rate(scylla_storage_proxy_coordinator_write_latency_bucket[5m])) > 100000
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "High p99 write latency on {{ $labels.instance }}: {{ $value }}us"
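If the latency alerts look opaque, this is what histogram_quantile does under the hood: it finds the bucket whose cumulative count first reaches the target rank, then interpolates linearly within that bucket. A worked example with made-up bucket data:

```shell
#!/usr/bin/env bash
# Linear interpolation over cumulative histogram buckets, as Prometheus does.
p99=$(awk -v q=0.99 'BEGIN {
  n = split("10 50 100", le, " ")   # bucket upper bounds (le labels)
  split("100 900 990", cum, " ")    # cumulative counts per bucket
  total = 1000                      # count in the +Inf bucket
  rank = q * total
  prev_le = 0; prev_cum = 0
  for (i = 1; i <= n; i++) {
    if (cum[i] >= rank) {
      printf "%.1f", prev_le + (le[i] - prev_le) * (rank - prev_cum) / (cum[i] - prev_cum)
      exit
    }
    prev_le = le[i]; prev_cum = cum[i]
  }
  printf "inf"
}')
echo "p99 = $p99"
```

Because the result is interpolated from bucket boundaries, choose bucket resolution to match the thresholds you alert on.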
Implement restore procedures
Create restore automation script
Develop a comprehensive restore script for point-in-time recovery and disaster recovery scenarios.
#!/bin/bash
# ScyllaDB Automated Restore Script
set -euo pipefail
# Configuration
BACKUP_DIR="/opt/scylla/backups"
S3_BUCKET="s3://scylla-backups-production"
RESTORE_DATE="${1:-}" # Format: YYYYMMDD_HHMMSS
TARGET_KEYSPACE="${2:-}"
LOG_FILE="/var/log/scylla/restore.log"
HOSTNAME=$(hostname -f)
# Validation
if [[ -z "$RESTORE_DATE" ]] || [[ -z "$TARGET_KEYSPACE" ]]; then
echo "Usage: $0 <restore_date> <target_keyspace>"
echo "Example: $0 20240115_020000 production_data"
exit 1
fi
# Logging function
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}
log "Starting restore process for $TARGET_KEYSPACE from backup $RESTORE_DATE"
# Stop ScyllaDB service
log "Stopping ScyllaDB service"
sudo systemctl stop scylla-server
# Create restore directory
mkdir -p "$BACKUP_DIR/restore_$RESTORE_DATE"
# Download backup from S3
log "Downloading backup from S3"
aws s3 sync "$S3_BUCKET/$HOSTNAME/$RESTORE_DATE/" "$BACKUP_DIR/restore_$RESTORE_DATE/"
# Clear existing data for target keyspace
log "Clearing existing data for keyspace $TARGET_KEYSPACE"
find "/var/lib/scylla/data/$TARGET_KEYSPACE" -name "*.db" -delete 2>/dev/null || true
# Extract and restore backup files
log "Extracting and restoring backup files"
for backup_file in "$BACKUP_DIR/restore_$RESTORE_DATE/"${TARGET_KEYSPACE}_*.tar.gz; do
if [[ -f "$backup_file" ]]; then
table_name=$(basename "$backup_file" | sed "s/^${TARGET_KEYSPACE}_//" | sed -E 's/_[0-9]{8}_[0-9]{6}\.tar\.gz$//')
log "Restoring table $table_name"
# Locate the existing table directory (its UUID suffix must match the schema,
# so never invent a new one)
table_dir=$(find "/var/lib/scylla/data/$TARGET_KEYSPACE" -maxdepth 1 -type d -name "${table_name}*" | head -1)
if [[ -z "$table_dir" ]]; then
log "Skipping $table_name: no matching table directory found"
continue
fi
# Extract backup
tar -xzf "$backup_file" -C "$table_dir"
# Fix ownership
chown -R scylla:scylla "$table_dir"
log "Restored table $table_name"
fi
done
# Start ScyllaDB service
log "Starting ScyllaDB service"
sudo systemctl start scylla-server
# Wait for service to be ready
log "Waiting for ScyllaDB to be ready"
while ! cqlsh -u cassandra -p cassandra -e "SELECT now() FROM system.local" &>/dev/null; do
sleep 5
done
# Refresh each restored table (nodetool refresh takes keyspace and table arguments)
log "Refreshing keyspace $TARGET_KEYSPACE"
for table_path in "/var/lib/scylla/data/$TARGET_KEYSPACE"/*/; do
nodetool refresh "$TARGET_KEYSPACE" "$(basename "$table_path" | cut -d'-' -f1)"
done
# Run repair
log "Running repair on keyspace $TARGET_KEYSPACE"
nodetool repair $TARGET_KEYSPACE
# Clean up temporary files
log "Cleaning up temporary files"
rm -rf "$BACKUP_DIR/restore_$RESTORE_DATE"
log "Restore process completed successfully for keyspace $TARGET_KEYSPACE"
# Verify data
log "Verifying restored data"
cqlsh -e "SELECT COUNT(*) FROM $TARGET_KEYSPACE.user_sessions;" || log "Warning: Could not verify restored data"
Save the script as /opt/scylla/restore-automation.sh, then make it executable:
sudo chmod +x /opt/scylla/restore-automation.sh
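The trickiest line in the restore script is recovering the table name from an archive filename. Here is that parsing in isolation, using a simplified sample filename shaped like the ones the backup script produces (keyspace prefix, table, timestamp suffix):

```shell
#!/usr/bin/env bash
# Strip the keyspace prefix and the YYYYMMDD_HHMMSS timestamp suffix.
parse_table() {
  local ks=$1 archive=$2
  basename "$archive" | sed "s/^${ks}_//" | sed -E 's/_[0-9]{8}_[0-9]{6}\.tar\.gz$//'
}

table=$(parse_table production_data production_data_user_sessions_20240115_020000.tar.gz)
echo "table: $table"
```

Anchoring the suffix pattern with `$` keeps timestamps embedded elsewhere in a table name from being stripped by accident.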
Verify your setup
Test cluster connectivity and replication
Verify that cross-region replication is working correctly and data is being replicated across datacenters.
# Check cluster status
nodetool status
# Verify cross-datacenter connectivity
nodetool ring
# Test data replication (write locally, read from the other region)
cqlsh -u cassandra -p cassandra -e "INSERT INTO production_data.user_sessions (session_id, user_id, created_at) VALUES (uuid(), uuid(), toTimestamp(now()));"
cqlsh 10.0.2.10 -u cassandra -p cassandra -e "SELECT COUNT(*) FROM production_data.user_sessions;"
# Check replication lag
nodetool netstats
Test backup and restore procedures
Perform a complete backup and restore test to validate your disaster recovery procedures.
# Run manual backup
sudo /opt/scylla/backup-automation.sh
# Check backup status
ls -la /opt/scylla/backups/
aws s3 ls s3://scylla-backups-production/
# Test restore procedure (use an actual backup date)
sudo /opt/scylla/restore-automation.sh 20240115_020000 production_data
Verify monitoring and alerting
Ensure monitoring dashboards are collecting metrics and alerting is configured properly.
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up")'
# Verify ScyllaDB metrics
curl http://localhost:9180/metrics | grep scylla_reactor_utilization
# Check failover monitoring
sudo systemctl status scylla-failover.timer
journalctl -u scylla-failover.service -f
Performance optimization
Optimize cross-region settings
Fine-tune ScyllaDB parameters for optimal cross-region performance and reduce replication lag.
# Cross-region optimization
stream_throughput_outbound_megabits_per_sec: 2000
compaction_throughput_mb_per_sec: 256
concurrent_compactors: 8
# Memory settings
memtable_heap_space_in_mb: 8192
memtable_offheap_space_in_mb: 8192
# Network optimization
rpc_keepalive: true
inter_dc_tcp_nodelay: true
commit_failure_policy: stop
# Timeouts for cross-region
cross_node_timeout: true
request_timeout_in_ms: 20000
sudo systemctl restart scylla-server
Security hardening
Enable SSL/TLS encryption
Configure SSL encryption for client connections and inter-node communication across regions.
# Generate SSL certificates
sudo mkdir -p /etc/scylla/ssl
sudo openssl genrsa -out /etc/scylla/ssl/scylla.key 4096
sudo openssl req -new -x509 -key /etc/scylla/ssl/scylla.key -out /etc/scylla/ssl/scylla.crt -days 3650 -subj "/C=US/ST=State/L=City/O=Organization/CN=*.example.com"
sudo chown -R scylla:scylla /etc/scylla/ssl
sudo chmod 600 /etc/scylla/ssl/scylla.key
# SSL Configuration
client_encryption_options:
  enabled: true
  optional: false
  certificate: /etc/scylla/ssl/scylla.crt
  keyfile: /etc/scylla/ssl/scylla.key
server_encryption_options:
  internode_encryption: all
  certificate: /etc/scylla/ssl/scylla.crt
  keyfile: /etc/scylla/ssl/scylla.key
  require_client_auth: false
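Before distributing certificates to every node, verify them locally. This sketch generates a throwaway self-signed certificate (so it is runnable anywhere openssl is installed) and applies the two checks you would run against the real /etc/scylla/ssl/scylla.crt: subject and expiry.

```shell
#!/usr/bin/env bash
# Generate a demo cert, then inspect subject and expiry with openssl x509.
tmp=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -keyout "$tmp/scylla.key" -out "$tmp/scylla.crt" \
  -subj "/CN=node1.example.com" 2>/dev/null

subject=$(openssl x509 -in "$tmp/scylla.crt" -noout -subject)
echo "$subject"
# -checkend exits non-zero if the certificate expires within the given seconds
openssl x509 -in "$tmp/scylla.crt" -noout -checkend 86400 >/dev/null \
  && echo "valid for at least 24h"
```

Run the same two x509 commands against your production certificate before and after each renewal.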
Common issues
| Symptom | Cause | Fix |
|---|---|---|
| Nodes can't join cluster | Firewall blocking ports | sudo ufw allow 7000:7001/tcp and check seeds configuration |
| High replication lag | Network bandwidth limitations | Increase stream_throughput_outbound_megabits_per_sec and optimize network |
| Backup script fails | S3 credentials or permissions | Check AWS credentials and S3 bucket permissions with aws s3 ls |
| Restore process hangs | Insufficient disk space | Check disk space with df -h and clear old snapshots |
| Failover not triggered | Health check timeouts | Adjust timeout values in failover script and check network connectivity |
| SSL connection errors | Certificate issues | Verify certificate validity with openssl x509 -in /etc/scylla/ssl/scylla.crt -text |
| Monitoring metrics missing | Prometheus scraping failures | Check ScyllaDB metrics endpoint with curl http://node:9180/metrics |
Next steps
- Configure ScyllaDB backup and restore with automation
- Configure advanced Grafana dashboards and alerting with Prometheus integration
- Implement ScyllaDB security hardening with RBAC and encryption
- Optimize ScyllaDB performance for high-throughput workloads
- Set up ScyllaDB multi-tenant architecture with isolation
Automated install script
Run this script on each node to automate the entire setup:
#!/usr/bin/env bash
set -euo pipefail
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
# Script variables
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
LOG_FILE="/var/log/scylladb-install.log"
CLUSTER_NAME="production-cluster"
SCYLLA_CONFIG_DIR="/etc/scylla"
# Usage function
usage() {
echo "Usage: $0 <node_ip> <datacenter> <rack> [seed_nodes]"
echo "Example: $0 10.0.1.10 us-east-1 rack1 10.0.1.10,10.0.1.11,10.0.2.10"
exit 1
}
# Logging function
log() {
echo -e "${GREEN}[$(date '+%Y-%m-%d %H:%M:%S')] $1${NC}" | tee -a "$LOG_FILE"
}
error() {
echo -e "${RED}[ERROR] $1${NC}" | tee -a "$LOG_FILE"
exit 1
}
warning() {
echo -e "${YELLOW}[WARNING] $1${NC}" | tee -a "$LOG_FILE"
}
# Cleanup function for rollback
cleanup() {
if [ $? -ne 0 ]; then
error "Installation failed. Check $LOG_FILE for details"
fi
}
trap cleanup ERR
# Check prerequisites
check_prerequisites() {
log "[1/12] Checking prerequisites..."
if [[ $EUID -ne 0 ]]; then
error "This script must be run as root or with sudo"
fi
if [[ $# -lt 3 ]]; then
usage
fi
# Validate IP address format
if ! [[ $1 =~ ^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$ ]]; then
error "Invalid IP address format: $1"
fi
# Check system resources
local mem_gb=$(free -g | awk '/^Mem:/{print $2}')
local cpu_cores=$(nproc)
if [[ $mem_gb -lt 16 ]]; then
warning "RAM is ${mem_gb}GB, recommended minimum is 32GB"
fi
if [[ $cpu_cores -lt 4 ]]; then
warning "CPU cores: ${cpu_cores}, recommended minimum is 8 cores"
fi
}
# Detect distribution
detect_distro() {
log "[2/12] Detecting distribution..."
if [ -f /etc/os-release ]; then
. /etc/os-release
case "$ID" in
ubuntu|debian)
PKG_MGR="apt"
PKG_UPDATE="apt update"
PKG_INSTALL="apt install -y"
FIREWALL_CMD="ufw"
;;
almalinux|rocky|centos|rhel|ol)
PKG_MGR="dnf"
PKG_UPDATE="dnf update -y"
PKG_INSTALL="dnf install -y"
FIREWALL_CMD="firewall-cmd"
;;
fedora)
PKG_MGR="dnf"
PKG_UPDATE="dnf update -y"
PKG_INSTALL="dnf install -y"
FIREWALL_CMD="firewall-cmd"
;;
amzn)
PKG_MGR="yum"
PKG_UPDATE="yum update -y"
PKG_INSTALL="yum install -y"
FIREWALL_CMD="firewall-cmd"
;;
*)
error "Unsupported distribution: $ID"
;;
esac
log "Detected: $PRETTY_NAME using $PKG_MGR"
else
error "Cannot detect distribution - /etc/os-release not found"
fi
}
# Configure firewall
configure_firewall() {
log "[3/12] Configuring firewall..."
case "$FIREWALL_CMD" in
ufw)
systemctl enable ufw || true
ufw --force enable
ufw allow 7000/tcp comment "ScyllaDB inter-node"
ufw allow 7001/tcp comment "ScyllaDB TLS inter-node"
ufw allow 9042/tcp comment "ScyllaDB CQL native"
ufw allow 9160/tcp comment "ScyllaDB Thrift RPC"
ufw allow 10000/tcp comment "ScyllaDB REST API"
ufw allow 9180/tcp comment "ScyllaDB Prometheus"
ufw allow 19042/tcp comment "ScyllaDB CQL shard-aware"
;;
firewall-cmd)
systemctl enable firewalld
systemctl start firewalld
firewall-cmd --permanent --add-port=7000/tcp
firewall-cmd --permanent --add-port=7001/tcp
firewall-cmd --permanent --add-port=9042/tcp
firewall-cmd --permanent --add-port=9160/tcp
firewall-cmd --permanent --add-port=10000/tcp
firewall-cmd --permanent --add-port=9180/tcp
firewall-cmd --permanent --add-port=19042/tcp
firewall-cmd --reload
;;
esac
}
# Install dependencies
install_dependencies() {
log "[4/12] Installing dependencies..."
$PKG_UPDATE
case "$PKG_MGR" in
apt)
$PKG_INSTALL curl gnupg2 software-properties-common apt-transport-https ca-certificates
;;
dnf|yum)
$PKG_INSTALL curl gnupg2 python3-PyYAML
;;
esac
}
# Add ScyllaDB repository
add_scylla_repo() {
log "[5/12] Adding ScyllaDB repository..."
case "$PKG_MGR" in
apt)
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 5e08fbd8b5d6ec9c
curl -L -o /etc/apt/sources.list.d/scylla.list http://downloads.scylladb.com/deb/ubuntu/scylla-5.4.list
$PKG_UPDATE
;;
dnf|yum)
curl -L -o /etc/yum.repos.d/scylla.repo http://downloads.scylladb.com/rpm/centos/scylla-5.4.repo
;;
esac
}
# Install ScyllaDB
install_scylladb() {
log "[6/12] Installing ScyllaDB..."
$PKG_INSTALL scylla
# Install additional tools
case "$PKG_MGR" in
apt)
$PKG_INSTALL scylla-tools scylla-python3
;;
dnf|yum)
$PKG_INSTALL scylla-tools
;;
esac
}
# System optimization
optimize_system() {
log "[7/12] Running system optimization..."
# Run ScyllaDB setup scripts
echo -e "yes\nyes\nyes\nyes\nyes\nyes\n" | scylla_setup --no-raid-setup || warning "Setup script had warnings"
scylla_io_setup || warning "IO setup had warnings"
# Additional optimizations
echo 'vm.swappiness = 1' >> /etc/sysctl.conf
echo 'vm.max_map_count = 1048575' >> /etc/sysctl.conf
sysctl -p
}
# Configure ScyllaDB
configure_scylladb() {
log "[8/12] Configuring ScyllaDB..."
local node_ip=$1
local datacenter=$2
local rack=$3
local seeds=${4:-$node_ip}
# Backup original config
cp "${SCYLLA_CONFIG_DIR}/scylla.yaml" "${SCYLLA_CONFIG_DIR}/scylla.yaml.backup"
# Create new configuration
cat > "${SCYLLA_CONFIG_DIR}/scylla.yaml" << EOF
cluster_name: '${CLUSTER_NAME}'
num_tokens: 256
authenticator: PasswordAuthenticator
authorizer: CassandraAuthorizer
listen_address: ${node_ip}
broadcast_address: ${node_ip}
rpc_address: 0.0.0.0
broadcast_rpc_address: ${node_ip}
seed_provider:
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
parameters:
- seeds: "${seeds}"
endpoint_snitch: GossipingPropertyFileSnitch
data_file_directories:
- /var/lib/scylla/data
commitlog_directory: /var/lib/scylla/commitlog
hints_directory: /var/lib/scylla/hints
view_hints_directory: /var/lib/scylla/view_hints
inter_dc_stream_throughput_outbound_megabits_per_sec: 1000
inter_dc_tcp_nodelay: true
read_request_timeout_in_ms: 10000
write_request_timeout_in_ms: 10000
range_request_timeout_in_ms: 20000
experimental_features:
- udf
- alternator-streams
snapshot_before_compaction: false
auto_snapshot: true
incremental_backups: true
enable_user_defined_functions: true
enable_scripted_user_defined_functions: false
compaction_throughput_mb_per_sec: 256
stream_throughput_outbound_megabits_per_sec: 2000
EOF
# Set proper permissions
chown scylla:scylla "${SCYLLA_CONFIG_DIR}/scylla.yaml"
chmod 644 "${SCYLLA_CONFIG_DIR}/scylla.yaml"
# Configure datacenter properties
cat > "${SCYLLA_CONFIG_DIR}/cassandra-rackdc.properties" << EOF
dc=${datacenter}
rack=${rack}
prefer_local=true
EOF
chown scylla:scylla "${SCYLLA_CONFIG_DIR}/cassandra-rackdc.properties"
chmod 644 "${SCYLLA_CONFIG_DIR}/cassandra-rackdc.properties"
}
# Configure data directories
setup_data_directories() {
log "[9/12] Setting up data directories..."
local dirs=("/var/lib/scylla/data" "/var/lib/scylla/commitlog" "/var/lib/scylla/hints" "/var/lib/scylla/view_hints")
for dir in "${dirs[@]}"; do
mkdir -p "$dir"
chown -R scylla:scylla "$dir"
chmod 755 "$dir"
done
}
# Enable and start services
start_services() {
log "[10/12] Starting ScyllaDB services..."
systemctl daemon-reload
systemctl enable scylla-server
systemctl start scylla-server
# Wait for service to be ready
log "Waiting for ScyllaDB to start..."
sleep 30
local attempts=0
while ! systemctl is-active --quiet scylla-server && [ $attempts -lt 12 ]; do
sleep 10
attempts=$((attempts + 1))
log "Waiting for ScyllaDB... (attempt $attempts/12)"
done
if ! systemctl is-active --quiet scylla-server; then
error "ScyllaDB failed to start properly"
fi
}
# Configure monitoring
setup_monitoring() {
log "[11/12] Setting up monitoring..."
# Enable Prometheus metrics
systemctl enable scylla-jmx || warning "Could not enable scylla-jmx"
systemctl start scylla-jmx || warning "Could not start scylla-jmx"
}
# Verification
verify_installation() {
log "[12/12] Verifying installation..."
# Check service status
if ! systemctl is-active --quiet scylla-server; then
error "ScyllaDB service is not running"
fi
# Check if ScyllaDB is responding
sleep 10
if ! timeout 30 cqlsh $(hostname -I | awk '{print $1}') -e "SELECT release_version FROM system.local;" 2>/dev/null; then
warning "CQL connection test failed - this is normal for new clusters"
fi
# Check logs for errors
if journalctl -u scylla-server --since "5 minutes ago" | grep -i "error" | grep -v "Connection refused" >/dev/null; then
warning "Found errors in ScyllaDB logs - check with: journalctl -u scylla-server"
fi
log "ScyllaDB installation completed successfully!"
log "Configuration file: ${SCYLLA_CONFIG_DIR}/scylla.yaml"
log "Data directory: /var/lib/scylla/data"
log "Check status: systemctl status scylla-server"
log "View logs: journalctl -u scylla-server -f"
}
# Main execution
main() {
local node_ip=${1:-}
local datacenter=${2:-}
local rack=${3:-}
local seeds=${4:-$node_ip}
check_prerequisites "$@"
detect_distro
configure_firewall
install_dependencies
add_scylla_repo
install_scylladb
optimize_system
configure_scylladb "$node_ip" "$datacenter" "$rack" "$seeds"
setup_data_directories
start_services
setup_monitoring
verify_installation
}
# Run main function with all arguments
main "$@"
Review the script before running. Execute with: sudo bash install.sh <node_ip> <datacenter> <rack> [seed_nodes]