Monitor ScyllaDB with Prometheus and Grafana

Set up complete ScyllaDB cluster monitoring using Prometheus for metrics collection and Grafana for visualization. Configure alerting rules for proactive performance monitoring and issue detection.

Prerequisites

Root or sudo access
Three servers for ScyllaDB cluster
One server for monitoring stack
8GB RAM per ScyllaDB node
Basic understanding of NoSQL databases

What this solves

ScyllaDB clusters require continuous monitoring to track performance, resource usage, and cluster health. This tutorial sets up comprehensive monitoring using Prometheus to collect ScyllaDB metrics and Grafana for visualization, helping you detect issues before they impact your applications.

Step-by-step installation

Install ScyllaDB cluster nodes

Install ScyllaDB on each of the three cluster nodes, using the commands for your distribution.

# Ubuntu / Debian
sudo apt update && sudo apt upgrade -y
wget -qO - https://downloads.scylladb.com/deb/ubuntu/scylla-5.4-$(lsb_release -s -c).list | sudo tee /etc/apt/sources.list.d/scylla.list
wget -qO - https://downloads.scylladb.com/downloads/scylla-drivers-repo/scylla.key | sudo apt-key add -
sudo apt update
sudo apt install -y scylla

# AlmaLinux / Rocky Linux / RHEL
sudo dnf update -y
sudo curl -L --output /etc/yum.repos.d/scylla.repo https://downloads.scylladb.com/rpm/centos/scylla-5.4.repo
sudo dnf install -y scylla

Configure ScyllaDB for monitoring

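Edit /etc/scylla/scylla.yaml on each ScyllaDB node so that the node joins the cluster and exposes its Prometheus metrics endpoint on port 9180. The seed list belongs under seed_provider, not at the top level of the file.

cluster_name: 'ScyllaCluster'
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "203.0.113.10,203.0.113.11,203.0.113.12"
listen_address: 203.0.113.10
rpc_address: 0.0.0.0
broadcast_rpc_address: 203.0.113.10
endpoint_snitch: GossipingPropertyFileSnitch
prometheus_port: 9180
prometheus_address: 0.0.0.0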

Note: Replace the example addresses with your own. The seeds list is identical on every node, while listen_address and broadcast_rpc_address must be set to the address of the node being configured.
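Because a few settings differ per node, it helps to compare them across the cluster after editing. The helper below is a minimal sketch, not a ScyllaDB tool: the function name show_monitoring_settings is hypothetical, and it assumes the keys appear as top-level, uncommented lines in the file.

```shell
# Hypothetical helper: print the cluster and monitoring keys from a
# scylla.yaml so the values on each node can be compared at a glance.
# Only top-level, uncommented keys are matched.
show_monitoring_settings() {
    config_file="${1:-/etc/scylla/scylla.yaml}"
    grep -E '^(cluster_name|listen_address|broadcast_rpc_address|prometheus_port|prometheus_address):' "$config_file"
}

# Usage on a ScyllaDB node:
#   show_monitoring_settings
#   show_monitoring_settings /etc/scylla/scylla.yaml
```

If a key is missing from the output, it is either commented out or absent, and ScyllaDB falls back to its default for that setting.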

Start ScyllaDB cluster

Enable and start ScyllaDB on all cluster nodes, then verify cluster formation.

sudo scylla_setup --no-raid-setup --no-fstrim-setup --no-coredump-setup --no-sysconfig-setup --no-bootparam-setup --no-ec2-check
sudo systemctl enable scylla-server
sudo systemctl start scylla-server
sudo systemctl status scylla-server

Verify cluster status

Check that all nodes have joined the cluster successfully and are in UN (Up Normal) status.

nodetool status
nodetool describecluster
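For scripted checks, the UN count can be extracted from the nodetool status output. This is a minimal sketch that assumes each node line begins with its two-letter state code, as in the standard nodetool status layout; count_un_nodes is a hypothetical name.

```shell
# Count nodes reported as UN (Up/Normal) in `nodetool status` output read
# from standard input. Lines whose first field is not "UN" are ignored.
count_un_nodes() {
    awk '$1 == "UN" { n++ } END { print n + 0 }'
}

# Usage on a cluster node:
#   nodetool status | count_un_nodes
```

For the three-node cluster in this tutorial, the expected result is 3 once every node has joined.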

Install Prometheus server

Install Prometheus on a dedicated monitoring server to collect metrics from ScyllaDB nodes.

# Ubuntu / Debian
sudo apt update
sudo apt install -y prometheus prometheus-node-exporter

# AlmaLinux / Rocky Linux / RHEL
sudo dnf install -y epel-release
sudo dnf install -y prometheus2 node_exporter

Configure Prometheus for ScyllaDB

Edit /etc/prometheus/prometheus.yml on the monitoring server so that Prometheus scrapes the ScyllaDB metrics endpoint (port 9180) and the node exporter (port 9100) on every cluster node.

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "/etc/prometheus/rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - localhost:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'scylla'
    static_configs:
      - targets:
        - '203.0.113.10:9180'
        - '203.0.113.11:9180'
        - '203.0.113.12:9180'
    scrape_interval: 10s
    metrics_path: /metrics

  - job_name: 'node-exporter'
    static_configs:
      - targets:
        - '203.0.113.10:9100'
        - '203.0.113.11:9100'
        - '203.0.113.12:9100'
    scrape_interval: 15s
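When the node list changes often, the target entries can be generated instead of typed by hand. The sketch below assumes a comma-separated list of addresses and the eight-space indentation used under targets: in the configuration above; prometheus_targets is a hypothetical helper name.

```shell
# Hypothetical helper: turn "ip1,ip2,ip3" plus a port into the indented
# YAML list items used under `targets:` in prometheus.yml.
prometheus_targets() {
    ips="$1"
    port="$2"
    printf '%s\n' "$ips" | tr ',' '\n' | while IFS= read -r ip; do
        if [ -n "$ip" ]; then
            printf "        - '%s:%s'\n" "$ip" "$port"
        fi
    done
}

# Example with the addresses used in this tutorial.
prometheus_targets "203.0.113.10,203.0.113.11,203.0.113.12" 9180
```

After editing the file, promtool check config /etc/prometheus/prometheus.yml reports syntax errors before Prometheus is restarted.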

Create ScyllaDB alerting rules

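Define Prometheus alerting rules for common ScyllaDB failure modes and save them as /etc/prometheus/rules/scylla.yml, which matches the rule_files pattern in the Prometheus configuration. Metric names and units vary between ScyllaDB releases, so check each expression against the output of a node's /metrics endpoint and adjust the thresholds to your workload.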

sudo mkdir -p /etc/prometheus/rules

groups:
  - name: scylla.rules
    rules:
    - alert: ScyllaDBNodeDown
      expr: up{job="scylla"} == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "ScyllaDB node {{ $labels.instance }} is down"
        description: "ScyllaDB node {{ $labels.instance }} has been down for more than 1 minute."

    - alert: ScyllaDBHighLatency
      expr: scylla_storage_proxy_coordinator_read_latency{quantile="0.99"} > 100
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High read latency on {{ $labels.instance }}"
        description: "99th percentile read latency is {{ $value }}ms on {{ $labels.instance }}"

    - alert: ScyllaDBHighCPUUsage
      # scylla_reactor_utilization is reported per shard as a percentage (0-100)
      expr: avg by (instance) (scylla_reactor_utilization) > 80
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High CPU usage on {{ $labels.instance }}"
        description: "CPU utilization is {{ $value | humanize }}% on {{ $labels.instance }}"

    - alert: ScyllaDBLowDiskSpace
      expr: (scylla_database_total_disk_space_bytes - scylla_database_used_disk_space_bytes) / scylla_database_total_disk_space_bytes < 0.1
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Low disk space on {{ $labels.instance }}"
        description: "Available disk space is less than 10% on {{ $labels.instance }}"

    - alert: ScyllaDBHighMemoryUsage
      expr: scylla_memory_allocated_memory / scylla_memory_total_memory > 0.9
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High memory usage on {{ $labels.instance }}"
        description: "Memory usage is {{ $value | humanizePercentage }} on {{ $labels.instance }}"

    - alert: ScyllaDBCompactionBacklog
      expr: scylla_compaction_manager_pending_tasks > 100
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "High compaction backlog on {{ $labels.instance }}"
        description: "{{ $value }} pending compaction tasks on {{ $labels.instance }}"

    - alert: ScyllaDBTimeoutOperations
      expr: increase(scylla_storage_proxy_coordinator_read_timeouts_total[5m]) > 10
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "High number of read timeouts on {{ $labels.instance }}"
        description: "{{ $value }} read timeouts in the last 5 minutes on {{ $labels.instance }}"

    - alert: ScyllaDBClusterNotHealthy
      # sum() still returns 0 when every node is down; count(... == 1) would return no data
      expr: sum(up{job="scylla"}) < 2
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "ScyllaDB cluster unhealthy"
        description: "Only {{ $value }} ScyllaDB nodes are available out of expected 3 nodes"
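An alert whose expression names a metric that the node does not export never fires, so it is worth checking the rule metrics against a real scrape. The following is a minimal sketch; missing_metrics is a hypothetical helper, and the file path in the usage comment is only an example.

```shell
# Hypothetical helper: report which metric names are absent from a file
# containing Prometheus text-format output.
# Returns 1 if at least one name is missing, 0 otherwise.
missing_metrics() {
    metrics_file="$1"
    shift
    status=0
    for name in "$@"; do
        # A sample line starts with the exact metric name followed by
        # "{" (labels), a space (value), or end of line.
        if ! grep -Eq "^${name}([{ ]|\$)" "$metrics_file"; then
            echo "missing: $name"
            status=1
        fi
    done
    return "$status"
}

# Usage against a live node (address from this tutorial):
#   curl -s http://203.0.113.10:9180/metrics > /tmp/scylla-metrics.txt
#   missing_metrics /tmp/scylla-metrics.txt scylla_reactor_utilization scylla_memory_total_memory
```

The match is exact, so a histogram exported only as name_bucket, name_sum and name_count is reported as missing under its bare name. Once the expressions are settled, promtool check rules /etc/prometheus/rules/scylla.yml validates the file syntax.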

Install and configure Alertmanager

Set up Alertmanager to handle alerts from Prometheus and send notifications.

# Ubuntu / Debian
sudo apt install -y prometheus-alertmanager

# AlmaLinux / Rocky Linux / RHEL
sudo dnf install -y alertmanager

global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@example.com'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'

receivers:
  - name: 'web.hook'
    email_configs:
      - to: 'admin@example.com'
        headers:
          Subject: 'ScyllaDB Alert: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Instance: {{ .Labels.instance }}
          Severity: {{ .Labels.severity }}
          {{ end }}

Start monitoring services

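Enable and start the Prometheus and Alertmanager services. The Debian and Ubuntu packages name the Alertmanager unit prometheus-alertmanager.

# Ubuntu / Debian
sudo systemctl enable --now prometheus prometheus-alertmanager
sudo systemctl status prometheus prometheus-alertmanager

# AlmaLinux / Rocky Linux / RHEL
sudo systemctl enable --now prometheus alertmanager
sudo systemctl status prometheus alertmanager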
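A running unit does not guarantee that the service is already answering requests. Prometheus and Alertmanager both expose a /-/ready endpoint, and the sketch below polls such a URL until it responds. The function name wait_for_http and its default retry values are assumptions chosen for illustration.

```shell
# Poll a URL until it answers with an HTTP success status or the attempts
# run out. Arguments: URL, attempts (default 30), delay in seconds (default 2).
wait_for_http() {
    url="$1"
    attempts="${2:-30}"
    delay="${3:-2}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -f makes curl return non-zero on HTTP errors such as 503.
        if curl -fsS -o /dev/null --max-time 5 "$url" 2>/dev/null; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "timed out waiting for $url" >&2
    return 1
}

# Usage on the monitoring server:
#   wait_for_http http://localhost:9090/-/ready
#   wait_for_http http://localhost:9093/-/ready
```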

Install Grafana

Install Grafana for creating dashboards and visualizing ScyllaDB metrics.

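# Ubuntu / Debian
sudo apt install -y apt-transport-https software-properties-common wget
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update
sudo apt install -y grafana

# AlmaLinux / Rocky Linux / RHEL
sudo dnf install -y grafana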

Configure Grafana

Edit /etc/grafana/grafana.ini to set the server address and admin credentials, and keep anonymous access disabled so that the dashboards require a login.

[server]
http_port = 3000
domain = example.com
root_url = http://example.com:3000/

[security]
admin_user = admin
admin_password = your_secure_password
secret_key = your_secret_key

[auth.anonymous]
enabled = false

[alerting]
execute_alerts = true

Start Grafana

Enable and start Grafana service, then access the web interface.

sudo systemctl enable grafana-server
sudo systemctl start grafana-server
sudo systemctl status grafana-server

Configure Prometheus data source in Grafana

Add Prometheus as a data source in Grafana to access ScyllaDB metrics.

curl -X POST http://admin:your_secure_password@localhost:3000/api/datasources \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://localhost:9090",
    "access": "proxy",
    "isDefault": true
  }'

Import ScyllaDB dashboard

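Save the following dashboard definition as /tmp/scylla-dashboard.json, then import it through the Grafana HTTP API. It covers node availability, latency, CPU, memory, disk, throughput, compaction and timeouts.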

{
  "dashboard": {
    "title": "ScyllaDB Cluster Monitoring",
    "tags": ["scylladb", "performance"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Cluster Status",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(up{job=\"scylla\"})",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "displayName": "Nodes Up",
            "min": 0,
            "max": 3
          }
        },
        "gridPos": {"h": 8, "w": 6, "x": 0, "y": 0}
      },
      {
        "title": "Read Latency (99th percentile)",
        "type": "timeseries",
        "targets": [
          {
            "expr": "scylla_storage_proxy_coordinator_read_latency{quantile=\"0.99\"}",
            "refId": "A",
            "legendFormat": "{{instance}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "ms"
          }
        },
        "gridPos": {"h": 8, "w": 18, "x": 6, "y": 0}
      },
      {
        "title": "Write Latency (99th percentile)",
        "type": "timeseries",
        "targets": [
          {
            "expr": "scylla_storage_proxy_coordinator_write_latency{quantile=\"0.99\"}",
            "refId": "A",
            "legendFormat": "{{instance}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "ms"
          }
        },
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8}
      },
      {
        "title": "CPU Utilization",
        "type": "timeseries",
        "targets": [
          {
            "expr": "avg by (instance) (scylla_reactor_utilization)",
            "refId": "A",
            "legendFormat": "{{instance}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100
          }
        },
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8}
      },
      {
        "title": "Memory Usage",
        "type": "timeseries",
        "targets": [
          {
            "expr": "scylla_memory_allocated_memory / scylla_memory_total_memory * 100",
            "refId": "A",
            "legendFormat": "{{instance}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100
          }
        },
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 16}
      },
      {
        "title": "Disk Usage",
        "type": "timeseries",
        "targets": [
          {
            "expr": "scylla_database_used_disk_space_bytes / scylla_database_total_disk_space_bytes * 100",
            "refId": "A",
            "legendFormat": "{{instance}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100
          }
        },
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 16}
      },
      {
        "title": "Operations per Second",
        "type": "timeseries",
        "targets": [
          {
            "expr": "rate(scylla_storage_proxy_coordinator_reads_total[5m])",
            "refId": "A",
            "legendFormat": "Reads - {{instance}}"
          },
          {
            "expr": "rate(scylla_storage_proxy_coordinator_writes_total[5m])",
            "refId": "B",
            "legendFormat": "Writes - {{instance}}"
          }
        ],
        "gridPos": {"h": 8, "w": 24, "x": 0, "y": 24}
      },
      {
        "title": "Compaction Tasks",
        "type": "timeseries",
        "targets": [
          {
            "expr": "scylla_compaction_manager_pending_tasks",
            "refId": "A",
            "legendFormat": "Pending - {{instance}}"
          },
          {
            "expr": "scylla_compaction_manager_active_tasks",
            "refId": "B",
            "legendFormat": "Active - {{instance}}"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 32}
      },
      {
        "title": "Error Rates",
        "type": "timeseries",
        "targets": [
          {
            "expr": "rate(scylla_storage_proxy_coordinator_read_timeouts_total[5m])",
            "refId": "A",
            "legendFormat": "Read Timeouts - {{instance}}"
          },
          {
            "expr": "rate(scylla_storage_proxy_coordinator_write_timeouts_total[5m])",
            "refId": "B",
            "legendFormat": "Write Timeouts - {{instance}}"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 32}
      }
    ],
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "refresh": "10s"
  }
}

curl -X POST http://admin:your_secure_password@localhost:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -d @/tmp/scylla-dashboard.json

Configure firewall access

Open the Prometheus, Alertmanager and Grafana ports on the monitoring server, and restrict the ScyllaDB metrics port to the monitoring subnet. Run the rule for port 9180 on each ScyllaDB node and the remaining rules on the monitoring server.

# Ubuntu / Debian (ufw)
sudo ufw allow 9090/tcp comment 'Prometheus'
sudo ufw allow 9093/tcp comment 'Alertmanager'
sudo ufw allow 3000/tcp comment 'Grafana'
sudo ufw allow from 203.0.113.0/24 to any port 9180 comment 'ScyllaDB metrics'
sudo ufw reload

# AlmaLinux / Rocky Linux / RHEL (firewalld)
sudo firewall-cmd --permanent --add-port=9090/tcp --add-port=9093/tcp --add-port=3000/tcp
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="203.0.113.0/24" port protocol="tcp" port="9180" accept'
sudo firewall-cmd --reload
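To confirm that the rules took effect, a TCP probe from the monitoring server is usually enough. This is a minimal sketch that relies on bash's /dev/tcp feature and the coreutils timeout command; port_open is a hypothetical helper and the 3-second limit is an arbitrary choice.

```shell
# Return 0 if a TCP connection to host:port succeeds within 3 seconds,
# 1 if it does not, and 2 if the port argument is not a valid port number.
port_open() {
    host="$1"
    port="$2"
    case "$port" in
        ''|*[!0-9]*) echo "invalid port: $port" >&2; return 2 ;;
    esac
    if [ "$port" -lt 1 ] || [ "$port" -gt 65535 ]; then
        echo "invalid port: $port" >&2
        return 2
    fi
    # /dev/tcp is a bash feature, so the probe runs in an explicit bash process.
    if timeout 3 bash -c ': > "/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
        return 0
    fi
    return 1
}

# Usage from the monitoring server:
#   port_open 203.0.113.10 9180 && echo "metrics port reachable"
```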

Configure performance optimization

Tune Prometheus retention

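Set how long and how much monitoring data Prometheus keeps. On Debian and Ubuntu the packaged service reads its command-line flags from the ARGS variable in /etc/default/prometheus; update it as shown and restart Prometheus. Data is removed when either the time limit or the size limit is reached.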

ARGS="--config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/var/lib/prometheus --storage.tsdb.retention.time=90d --storage.tsdb.retention.size=50GB --web.console.libraries=/etc/prometheus/console_libraries --web.console.templates=/etc/prometheus/consoles --web.enable-lifecycle"
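The Prometheus documentation gives a rough capacity rule: required disk is approximately retention time in seconds multiplied by ingested samples per second and by bytes per sample, with 1 to 2 bytes per sample being typical. The sketch below applies that rule; the ingest rate of 1000 samples per second is an assumed placeholder, not a measurement from this cluster.

```shell
# Rough TSDB sizing: retention_seconds * samples_per_second * bytes_per_sample.
# Arguments: retention in days, samples per second, bytes per sample (default 2).
estimate_tsdb_bytes() {
    retention_days="$1"
    samples_per_second="$2"
    bytes_per_sample="${3:-2}"
    echo $(( retention_days * 86400 * samples_per_second * bytes_per_sample ))
}

# Assumed example: 90 days of retention at 1000 samples/s, 2 bytes per sample.
estimate_tsdb_bytes 90 1000 2
```

With those assumed inputs the estimate is 15552000000 bytes, about 15.5 GB, which fits within the 50GB size limit above. The real ingest rate can be read from Prometheus itself with rate(prometheus_tsdb_head_samples_appended_total[5m]).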

Configure ScyllaDB monitoring user

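Create a dedicated ScyllaDB user with read-only privileges for monitoring tools that query the cluster over CQL. These statements only work when authentication and authorization are enabled in scylla.yaml (authenticator: PasswordAuthenticator and authorizer: CassandraAuthorizer). Prometheus scrapes the HTTP metrics endpoint and does not use this account.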

cqlsh -e "CREATE USER monitoring WITH PASSWORD 'monitoring_password' NOSUPERUSER;"
cqlsh -e "GRANT SELECT ON ALL KEYSPACES TO monitoring;"

Set up log monitoring

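Add a scrape job for a log exporter running on each ScyllaDB node. This job assumes that an exporter is already listening on port 9080 on every node (Promtail, for example, is commonly configured to serve its own metrics there). This tutorial does not install one, so skip this step if you have not deployed it.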

# Add this job to the existing scrape_configs section
  - job_name: 'scylla-logs'
    static_configs:
      - targets:
        - '203.0.113.10:9080'
        - '203.0.113.11:9080'
        - '203.0.113.12:9080'
    scrape_interval: 30s
    metrics_path: /metrics

Verify your setup

Check that all monitoring components are working and collecting data properly.

# Check ScyllaDB metrics endpoint
curl -s http://203.0.113.10:9180/metrics | grep scylla_storage_proxy

# Verify Prometheus targets (job and instance are nested under .labels)
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, instance: .labels.instance, health}'

# Check Grafana datasource
curl -s -u admin:your_secure_password http://localhost:3000/api/datasources

# Test alerting rules
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[].name'

# Verify cluster status
nodetool status
nodetool info

Common issues

Symptom	Cause	Fix
Prometheus can't reach ScyllaDB metrics	Firewall blocking port 9180	Configure firewall rules or disable for testing
Grafana shows "No data"	Prometheus data source not configured	Check datasource URL and connectivity
High memory usage alerts	Normal ScyllaDB behavior	Adjust thresholds in alerting rules
Missing ScyllaDB metrics	prometheus_port not configured	Add prometheus_port to scylla.yaml and restart
Alertmanager not sending emails	SMTP configuration issues	Check SMTP settings and test with `amtool`
Dashboard shows connection refused	ScyllaDB node down	Check ScyllaDB service status with `systemctl status scylla-server`

Next steps

You now have comprehensive ScyllaDB monitoring with Prometheus and Grafana. Consider these additional improvements:

Monitor Apache Cassandra cluster with Prometheus and Grafana dashboards for comparison with other NoSQL systems
Configure ScyllaDB backup and restore with automation for data protection strategies
Set up ScyllaDB multi-datacenter replication for disaster recovery
Optimize ScyllaDB performance tuning for production workloads

Automated install script

Run this to automate the entire setup

install.sh

#!/usr/bin/env bash
set -euo pipefail

# ScyllaDB Cluster Monitoring Setup with Prometheus and Grafana
# Production-ready installation script

# Colors
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'

# Configuration
CLUSTER_NAME="ScyllaCluster"
PROMETHEUS_PORT="9180"
NODE_EXPORTER_PORT="9100"
PROMETHEUS_WEB_PORT="9090"

usage() {
    echo "Usage: $0 [OPTIONS]"
    echo "Options:"
    echo "  --cluster-ips IPs    Comma-separated list of cluster IPs (required)"
    echo "  --local-ip IP        Local IP address for this node (required)"
    echo "  --monitoring-only    Install only monitoring components (Prometheus/Grafana)"
    echo "  --scylla-only        Install only ScyllaDB"
    echo "  --help               Show this help message"
    echo ""
    echo "Example:"
    echo "  $0 --cluster-ips 10.0.1.10,10.0.1.11,10.0.1.12 --local-ip 10.0.1.10"
    exit 1
}

error() {
    echo -e "${RED}ERROR: $1${NC}" >&2
    exit 1
}

success() {
    echo -e "${GREEN}✓ $1${NC}"
}

warning() {
    echo -e "${YELLOW}⚠ $1${NC}"
}

info() {
    echo -e "[$(date '+%H:%M:%S')] $1"
}

cleanup() {
    if [ $? -ne 0 ]; then
        error "Installation failed. Check logs above for details."
    fi
}

trap cleanup ERR

check_prerequisites() {
    info "[1/12] Checking prerequisites..."
    
    if [[ $EUID -ne 0 ]]; then
        error "This script must be run as root or with sudo"
    fi
    
    if ! command -v wget &> /dev/null; then
        error "wget is required but not installed"
    fi
    
    success "Prerequisites check passed"
}

detect_distro() {
    info "[2/12] Detecting distribution..."
    
    if [ ! -f /etc/os-release ]; then
        error "Cannot detect distribution - /etc/os-release not found"
    fi
    
    . /etc/os-release
    
    case "$ID" in
        ubuntu|debian)
            PKG_MGR="apt"
            PKG_UPDATE="apt update"
            PKG_INSTALL="apt install -y"
            PKG_UPGRADE="apt upgrade -y"
            PROMETHEUS_CONFIG="/etc/prometheus/prometheus.yml"
            PROMETHEUS_RULES="/etc/prometheus/rules"
            PROMETHEUS_SERVICE="prometheus"
            FIREWALL_CMD="ufw"
            ;;
        almalinux|rocky|centos|rhel|ol)
            PKG_MGR="dnf"
            PKG_UPDATE="dnf update -y"
            PKG_INSTALL="dnf install -y"
            PKG_UPGRADE="dnf upgrade -y"
            PROMETHEUS_CONFIG="/etc/prometheus/prometheus.yml"
            PROMETHEUS_RULES="/etc/prometheus/rules"
            PROMETHEUS_SERVICE="prometheus"
            FIREWALL_CMD="firewall-cmd"
            ;;
        amzn)
            PKG_MGR="yum"
            PKG_UPDATE="yum update -y"
            PKG_INSTALL="yum install -y"
            PKG_UPGRADE="yum upgrade -y"
            PROMETHEUS_CONFIG="/etc/prometheus/prometheus.yml"
            PROMETHEUS_RULES="/etc/prometheus/rules"
            PROMETHEUS_SERVICE="prometheus"
            FIREWALL_CMD="firewall-cmd"
            ;;
        fedora)
            PKG_MGR="dnf"
            PKG_UPDATE="dnf update -y"
            PKG_INSTALL="dnf install -y"
            PKG_UPGRADE="dnf upgrade -y"
            PROMETHEUS_CONFIG="/etc/prometheus/prometheus.yml"
            PROMETHEUS_RULES="/etc/prometheus/rules"
            PROMETHEUS_SERVICE="prometheus"
            FIREWALL_CMD="firewall-cmd"
            ;;
        *)
            error "Unsupported distribution: $ID"
            ;;
    esac
    
    success "Detected $PRETTY_NAME"
}

update_system() {
    info "[3/12] Updating system packages..."
    $PKG_UPDATE
    success "System updated"
}

install_scylladb() {
    info "[4/12] Installing ScyllaDB..."
    
    case "$PKG_MGR" in
        apt)
            wget -qO - https://downloads.scylladb.com/deb/ubuntu/scylla-5.4-$(lsb_release -s -c).list | tee /etc/apt/sources.list.d/scylla.list
            wget -qO - https://downloads.scylladb.com/downloads/scylla-drivers-repo/scylla.key | apt-key add -
            $PKG_UPDATE
            $PKG_INSTALL scylla
            ;;
        dnf|yum)
            curl -L --output /etc/yum.repos.d/scylla.repo http://downloads.scylladb.com/rpm/centos/scylla-5.4.repo
            $PKG_INSTALL scylla
            ;;
    esac
    
    success "ScyllaDB installed"
}

configure_scylladb() {
    info "[5/12] Configuring ScyllaDB..."
    
    # Backup original config
    cp /etc/scylla/scylla.yaml /etc/scylla/scylla.yaml.backup
    
    # Create new configuration
    cat > /etc/scylla/scylla.yaml << EOF
cluster_name: '$CLUSTER_NAME'
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "$CLUSTER_IPS"
listen_address: $LOCAL_IP
rpc_address: 0.0.0.0
broadcast_rpc_address: $LOCAL_IP
endpoint_snitch: GossipingPropertyFileSnitch
prometheus_port: $PROMETHEUS_PORT
prometheus_address: 0.0.0.0
data_file_directories:
    - /var/lib/scylla/data
commitlog_directory: /var/lib/scylla/commitlog
hints_directory: /var/lib/scylla/hints
view_hints_directory: /var/lib/scylla/view_hints
EOF
    
    chown scylla:scylla /etc/scylla/scylla.yaml
    chmod 644 /etc/scylla/scylla.yaml
    
    success "ScyllaDB configured"
}

setup_scylladb() {
    info "[6/12] Setting up ScyllaDB..."
    
    scylla_setup --no-raid-setup --no-fstrim-setup --no-coredump-setup --no-sysconfig-setup --no-bootparam-setup --no-ec2-check
    
    systemctl enable scylla-server
    systemctl start scylla-server
    
    # Wait for ScyllaDB to start
    sleep 10
    
    success "ScyllaDB setup completed"
}

install_monitoring() {
    info "[7/12] Installing monitoring components..."
    
    case "$PKG_MGR" in
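        apt)
            # Grafana is not in the default Ubuntu/Debian repositories
            mkdir -p /etc/apt/keyrings
            wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor > /etc/apt/keyrings/grafana.gpg
            echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" > /etc/apt/sources.list.d/grafana.list
            $PKG_UPDATE
            $PKG_INSTALL prometheus prometheus-node-exporter grafana
            ;;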
        dnf)
            $PKG_INSTALL epel-release
            $PKG_INSTALL prometheus2 node_exporter grafana
            # Fix service name for RHEL-based
            PROMETHEUS_SERVICE="prometheus"
            ;;
        yum)
            $PKG_INSTALL epel-release
            $PKG_INSTALL prometheus2 node_exporter grafana
            PROMETHEUS_SERVICE="prometheus"
            ;;
    esac
    
    success "Monitoring components installed"
}

configure_prometheus() {
    info "[8/12] Configuring Prometheus..."
    
    # Create rules directory
    mkdir -p $PROMETHEUS_RULES
    chmod 755 $PROMETHEUS_RULES
    
    # Configure Prometheus
    cat > $PROMETHEUS_CONFIG << EOF
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "$PROMETHEUS_RULES/*.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:$PROMETHEUS_WEB_PORT']

  - job_name: 'scylla'
    static_configs:
      - targets: ['$(echo "$CLUSTER_IPS" | sed "s/,/:$PROMETHEUS_PORT','/g"):$PROMETHEUS_PORT']
    scrape_interval: 10s
    metrics_path: /metrics

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['$(echo "$CLUSTER_IPS" | sed "s/,/:$NODE_EXPORTER_PORT','/g"):$NODE_EXPORTER_PORT']
    scrape_interval: 15s
EOF
    
    chmod 644 $PROMETHEUS_CONFIG
    
    # Create alerting rules
    cat > $PROMETHEUS_RULES/scylla.yml << EOF
groups:
  - name: scylla.rules
    rules:
    - alert: ScyllaDBNodeDown
      expr: up{job="scylla"} == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "ScyllaDB node {{ \$labels.instance }} is down"
        description: "ScyllaDB node {{ \$labels.instance }} has been down for more than 1 minute."

    - alert: ScyllaDBHighLatency
      expr: scylla_storage_proxy_coordinator_read_latency{quantile="0.99"} > 100
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High read latency on {{ \$labels.instance }}"
        description: "99th percentile read latency is {{ \$value }}ms on {{ \$labels.instance }}"
EOF
    
    chmod 644 $PROMETHEUS_RULES/scylla.yml
    
    success "Prometheus configured"
}

configure_firewall() {
    info "[9/12] Configuring firewall..."
    
    case "$FIREWALL_CMD" in
        ufw)
            # Allow SSH first so that enabling the firewall does not cut off remote access
            ufw allow 22/tcp
            ufw --force enable
            ufw allow $PROMETHEUS_PORT/tcp
            ufw allow $NODE_EXPORTER_PORT/tcp
            ufw allow $PROMETHEUS_WEB_PORT/tcp
            ufw allow 3000/tcp  # Grafana
            ufw allow 7000/tcp  # ScyllaDB inter-node
            ufw allow 9042/tcp  # ScyllaDB CQL
            ;;
        firewall-cmd)
            systemctl enable firewalld
            systemctl start firewalld
            firewall-cmd --permanent --add-port=$PROMETHEUS_PORT/tcp
            firewall-cmd --permanent --add-port=$NODE_EXPORTER_PORT/tcp
            firewall-cmd --permanent --add-port=$PROMETHEUS_WEB_PORT/tcp
            firewall-cmd --permanent --add-port=3000/tcp
            firewall-cmd --permanent --add-port=7000/tcp
            firewall-cmd --permanent --add-port=9042/tcp
            firewall-cmd --reload
            ;;
    esac
    
    success "Firewall configured"
}

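start_services() {
    info "[10/12] Starting services..."

    # The monitoring units exist only when the monitoring stack was installed
    if [ "$SCYLLA_ONLY" = "true" ]; then
        success "No monitoring services to start"
        return 0
    fi

    # Debian/Ubuntu name the node exporter unit prometheus-node-exporter
    local node_exporter_service="node_exporter"
    if [ "$PKG_MGR" = "apt" ]; then
        node_exporter_service="prometheus-node-exporter"
    fi

    systemctl enable --now "$node_exporter_service"
    systemctl enable --now $PROMETHEUS_SERVICE
    systemctl enable --now grafana-server

    success "Services started"
}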

verify_installation() {
    info "[11/12] Verifying installation..."
    
    # Check ScyllaDB if installed
    if [ "$MONITORING_ONLY" != "true" ]; then
        if ! systemctl is-active --quiet scylla-server; then
            error "ScyllaDB service is not running"
        fi
    fi
    
    # Check monitoring services if installed
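    if [ "$SCYLLA_ONLY" != "true" ]; then
        # Debian/Ubuntu name the node exporter unit prometheus-node-exporter
        node_exporter_service="node_exporter"
        if [ "$PKG_MGR" = "apt" ]; then
            node_exporter_service="prometheus-node-exporter"
        fi
        if ! systemctl is-active --quiet "$node_exporter_service"; then
            error "Node exporter service is not running"
        fi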
        
        if ! systemctl is-active --quiet $PROMETHEUS_SERVICE; then
            error "Prometheus service is not running"
        fi
        
        if ! systemctl is-active --quiet grafana-server; then
            error "Grafana service is not running"
        fi
    fi
    
    success "All services are running"
}

show_summary() {
    info "[12/12] Installation completed!"
    
    echo ""
    success "ScyllaDB Monitoring Stack installed successfully!"
    echo ""
    echo "Access URLs:"
    echo "  Prometheus: http://$LOCAL_IP:$PROMETHEUS_WEB_PORT"
    echo "  Grafana:    http://$LOCAL_IP:3000 (admin/admin)"
    echo ""
    echo "ScyllaDB Endpoints:"
    echo "  CQL:     $LOCAL_IP:9042"
    echo "  Metrics: $LOCAL_IP:$PROMETHEUS_PORT/metrics"
    echo ""
    warning "Remember to:"
    warning "1. Change Grafana admin password"
    warning "2. Import ScyllaDB dashboards in Grafana"
    warning "3. Configure alerting in Prometheus"
}

# Parse command line arguments
CLUSTER_IPS=""
LOCAL_IP=""
MONITORING_ONLY="false"
SCYLLA_ONLY="false"

while [[ $# -gt 0 ]]; do
    case $1 in
        --cluster-ips)
            CLUSTER_IPS="$2"
            shift 2
            ;;
        --local-ip)
            LOCAL_IP="$2"
            shift 2
            ;;
        --monitoring-only)
            MONITORING_ONLY="true"
            shift
            ;;
        --scylla-only)
            SCYLLA_ONLY="true"
            shift
            ;;
        --help)
            usage
            ;;
        *)
            error "Unknown option: $1"
            ;;
    esac
done

if [ -z "$CLUSTER_IPS" ] || [ -z "$LOCAL_IP" ]; then
    usage
fi

# Main execution
check_prerequisites
detect_distro
update_system

if [ "$MONITORING_ONLY" != "true" ]; then
    install_scylladb
    configure_scylladb
    setup_scylladb
fi

if [ "$SCYLLA_ONLY" != "true" ]; then
    install_monitoring
    configure_prometheus
fi

configure_firewall
start_services
verify_installation
show_summary

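Review the script before running. It must run as root and requires both --cluster-ips and --local-ip, for example: sudo bash install.sh --cluster-ips 10.0.1.10,10.0.1.11,10.0.1.12 --local-ip 10.0.1.10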
