Monitor ScyllaDB cluster with Prometheus and Grafana for comprehensive performance tracking

Intermediate | 45 min | May 02, 2026
Ubuntu 24.04, Debian 12, AlmaLinux 9, Rocky Linux 9

Set up complete ScyllaDB cluster monitoring using Prometheus for metrics collection and Grafana for visualization. Configure alerting rules for proactive performance monitoring and issue detection.

Prerequisites

  • Root or sudo access
  • Three servers for ScyllaDB cluster
  • One server for monitoring stack
  • 8GB RAM per ScyllaDB node
  • Basic understanding of NoSQL databases

What this solves

ScyllaDB clusters require continuous monitoring to track performance, resource usage, and cluster health. This tutorial sets up comprehensive monitoring using Prometheus to collect ScyllaDB metrics and Grafana for visualization, helping you detect issues before they impact your applications.

Step-by-step installation

Install ScyllaDB cluster nodes

Set up a three-node ScyllaDB cluster for high availability and performance monitoring.

# Ubuntu / Debian
sudo apt update && sudo apt upgrade -y
wget -qO - https://downloads.scylladb.com/deb/ubuntu/scylla-5.4-$(lsb_release -s -c).list | sudo tee /etc/apt/sources.list.d/scylla.list
wget -qO - https://downloads.scylladb.com/downloads/scylla-drivers-repo/scylla.key | sudo apt-key add -
sudo apt update
sudo apt install -y scylla

# AlmaLinux / Rocky Linux
sudo dnf update -y
sudo curl -L --output /etc/yum.repos.d/scylla.repo https://downloads.scylladb.com/rpm/centos/scylla-5.4.repo
sudo dnf install -y scylla
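
The apt and dnf commands above target different distribution families. As a minimal sketch, the hypothetical helper `pkg_family` below (an illustration, not part of ScyllaDB or any package) maps the `ID` field of /etc/os-release to the package manager this tutorial uses, assuming only the four distributions listed at the top.

```shell
# Hypothetical helper: map an os-release ID to the package manager used in this tutorial.
pkg_family() {
  case "$1" in
    ubuntu|debian)   echo apt ;;
    almalinux|rocky) echo dnf ;;
    *)               echo unknown; return 1 ;;
  esac
}

# Example usage on a real host:
#   . /etc/os-release && pkg_family "$ID"
```

A script can branch on the printed value to run only the matching block of commands.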

Configure ScyllaDB for monitoring

Enable Prometheus metrics endpoints on each ScyllaDB node by configuring the monitoring settings.

# /etc/scylla/scylla.yaml
cluster_name: 'ScyllaCluster'
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "203.0.113.10,203.0.113.11,203.0.113.12"
listen_address: 203.0.113.10
rpc_address: 0.0.0.0
broadcast_rpc_address: 203.0.113.10
endpoint_snitch: GossipingPropertyFileSnitch
prometheus_port: 9180
prometheus_address: 0.0.0.0

Note: Replace the IP addresses with your actual node IPs. On each node, set listen_address and broadcast_rpc_address to that node's own IP.
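
Because only the per-node addresses differ between the three configuration files, one way to avoid hand-editing is to keep a single template and rewrite those two keys. The sketch below uses a hypothetical function name, `set_node_ip`, and assumes `listen_address` and `broadcast_rpc_address` each start at the beginning of a line, as in the example above.

```shell
# Hypothetical helper: read a scylla.yaml from stdin and print it with the
# per-node addresses replaced by the IP given as the first argument.
set_node_ip() {
  sed -E \
    -e "s/^(listen_address:).*/\1 $1/" \
    -e "s/^(broadcast_rpc_address:).*/\1 $1/"
}

# Example usage (file names are placeholders):
#   set_node_ip 203.0.113.11 < scylla.yaml.template > scylla.yaml
```

Other keys, including `rpc_address` and the seed list, pass through unchanged.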

Start ScyllaDB cluster

Enable and start ScyllaDB on all cluster nodes, then verify cluster formation.

sudo scylla_setup --no-raid-setup --no-fstrim-setup --no-coredump-setup --no-sysconfig-setup --no-bootparam-setup --no-ec2-check
sudo systemctl enable scylla-server
sudo systemctl start scylla-server
sudo systemctl status scylla-server

Verify cluster status

Check that all nodes have joined the cluster successfully and are in UN (Up Normal) status.

nodetool status
nodetool describecluster
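
To check the UN state from a script rather than by eye, the status lines can be counted. This is a rough sketch: `count_un_nodes` is a hypothetical name, and it assumes the usual `nodetool status` layout in which each node line begins with its two-letter state.

```shell
# Hypothetical helper: count nodes reported as UN (Up/Normal) in
# `nodetool status` output read from stdin.
count_un_nodes() {
  awk '$1 == "UN" { n++ } END { print n + 0 }'
}

# Example usage on a cluster node:
#   nodetool status | count_un_nodes
```

For the three-node cluster in this tutorial, a healthy result is 3.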

Install Prometheus server

Install Prometheus on a dedicated monitoring server to collect metrics from ScyllaDB nodes.

# Ubuntu / Debian
sudo apt update
sudo apt install -y prometheus prometheus-node-exporter

# AlmaLinux / Rocky Linux
sudo dnf install -y epel-release
sudo dnf install -y prometheus2 node_exporter

Configure Prometheus for ScyllaDB

Set up Prometheus configuration to scrape metrics from ScyllaDB cluster nodes and node exporters.

# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "/etc/prometheus/rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - localhost:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'scylla'
    static_configs:
      - targets:
        - '203.0.113.10:9180'
        - '203.0.113.11:9180'
        - '203.0.113.12:9180'
    scrape_interval: 10s
    metrics_path: /metrics

  - job_name: 'node-exporter'
    static_configs:
      - targets:
        - '203.0.113.10:9100'
        - '203.0.113.11:9100'
        - '203.0.113.12:9100'
    scrape_interval: 15s
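
The `scylla` and `node-exporter` jobs list the same three hosts with different ports. As an illustrative sketch, the hypothetical helper `scrape_targets` prints the target lines for a given port so the two lists stay in sync; the indentation matches the configuration above.

```shell
# Hypothetical helper: print YAML target entries for one port and any number of node IPs.
# Usage: scrape_targets PORT IP [IP...]
scrape_targets() {
  port="$1"
  shift
  for ip in "$@"; do
    printf "        - '%s:%s'\n" "$ip" "$port"
  done
}

# Example usage:
#   scrape_targets 9180 203.0.113.10 203.0.113.11 203.0.113.12
```

After editing the file, `promtool check config /etc/prometheus/prometheus.yml` validates the syntax before Prometheus is restarted.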

Create ScyllaDB alerting rules

Define Prometheus alerting rules for common ScyllaDB issues and performance thresholds.

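sudo mkdir -p /etc/prometheus/rules

# Save the rules below as a .yml file in /etc/prometheus/rules/ so the rule_files glob picks them up
groups: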
  - name: scylla.rules
    rules:
    - alert: ScyllaDBNodeDown
      expr: up{job="scylla"} == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "ScyllaDB node {{ $labels.instance }} is down"
        description: "ScyllaDB node {{ $labels.instance }} has been down for more than 1 minute."

    - alert: ScyllaDBHighLatency
      expr: scylla_storage_proxy_coordinator_read_latency{quantile="0.99"} > 100
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High read latency on {{ $labels.instance }}"
        description: "99th percentile read latency is {{ $value }}ms on {{ $labels.instance }}"

    - alert: ScyllaDBHighCPUUsage
      # scylla_reactor_utilization is reported per shard as a percentage (0-100)
      expr: scylla_reactor_utilization > 80
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High CPU usage on {{ $labels.instance }}"
        description: "CPU utilization is {{ $value }}% on {{ $labels.instance }}"

    - alert: ScyllaDBLowDiskSpace
      expr: (scylla_database_total_disk_space_bytes - scylla_database_used_disk_space_bytes) / scylla_database_total_disk_space_bytes < 0.1
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Low disk space on {{ $labels.instance }}"
        description: "Available disk space is less than 10% on {{ $labels.instance }}"

    - alert: ScyllaDBHighMemoryUsage
      expr: scylla_memory_allocated_memory / scylla_memory_total_memory > 0.9
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High memory usage on {{ $labels.instance }}"
        description: "Memory usage is {{ $value | humanizePercentage }} on {{ $labels.instance }}"

    - alert: ScyllaDBCompactionBacklog
      expr: scylla_compaction_manager_pending_tasks > 100
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "High compaction backlog on {{ $labels.instance }}"
        description: "{{ $value }} pending compaction tasks on {{ $labels.instance }}"

    - alert: ScyllaDBTimeoutOperations
      expr: increase(scylla_storage_proxy_coordinator_read_timeouts_total[5m]) > 10
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "High number of read timeouts on {{ $labels.instance }}"
        description: "{{ $value }} read timeouts in the last 5 minutes on {{ $labels.instance }}"

    - alert: ScyllaDBClusterNotHealthy
      # sum() still returns 0 when every node is down; count(up == 1) would return no result
      expr: sum(up{job="scylla"}) < 2
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "ScyllaDB cluster unhealthy"
        description: "Only {{ $value }} of the expected 3 ScyllaDB nodes are available"
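
The ScyllaDBLowDiskSpace expression fires when the free fraction, (total - used) / total, drops below 0.1. The arithmetic can be checked by hand with a small sketch; `disk_free_ratio` and `low_disk_space` are hypothetical helper names, and the byte values in the usage lines are made-up examples.

```shell
# Hypothetical helpers mirroring the ScyllaDBLowDiskSpace expression.
# Arguments: total bytes, used bytes.
disk_free_ratio() {
  awk -v total="$1" -v used="$2" 'BEGIN { printf "%.2f\n", (total - used) / total }'
}

# Succeeds (exit status 0) when the free fraction is below the 0.1 threshold.
low_disk_space() {
  awk -v total="$1" -v used="$2" 'BEGIN { exit !((total - used) / total < 0.1) }'
}

# Example usage:
#   disk_free_ratio 500 460
#   low_disk_space 500 460 && echo "alert condition met"
```

With 500 units total and 460 used, the free fraction is 0.08, so the condition holds; with 400 used it is 0.20 and the condition does not hold.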

Install and configure Alertmanager

Set up Alertmanager to handle alerts from Prometheus and send notifications.

# Ubuntu / Debian
sudo apt install -y prometheus-alertmanager

# AlmaLinux / Rocky Linux
sudo dnf install -y alertmanager

# Alertmanager configuration file (alertmanager.yml)
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@example.com'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'

receivers:
  - name: 'web.hook'
    email_configs:
      - to: 'admin@example.com'
        # Alertmanager sets the subject through headers and the plain-text body through text
        headers:
          Subject: 'ScyllaDB Alert: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Instance: {{ .Labels.instance }}
          Severity: {{ .Labels.severity }}
          {{ end }}
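
One way to exercise the email route without waiting for a real failure is to post a synthetic alert to Alertmanager's v2 API. The sketch below only builds the JSON payload; `test_alert_payload` is a hypothetical helper name, and the alert name, instance, and severity passed to it are example values.

```shell
# Hypothetical helper: print a one-alert JSON payload for POST /api/v2/alerts.
# Arguments: alert name, instance, severity.
test_alert_payload() {
  printf '[{"labels":{"alertname":"%s","instance":"%s","severity":"%s"},"annotations":{"summary":"Test alert for %s"}}]\n' \
    "$1" "$2" "$3" "$2"
}

# Example usage against a local Alertmanager:
#   test_alert_payload ScyllaDBNodeDown 203.0.113.10:9180 critical |
#     curl -s -X POST -H 'Content-Type: application/json' -d @- http://localhost:9093/api/v2/alerts
```

The values are inserted verbatim, so this sketch is only suitable for arguments that contain no quotes or backslashes.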

Start monitoring services

Enable and start Prometheus and Alertmanager services.

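# On Ubuntu/Debian the packaged Alertmanager unit is named prometheus-alertmanager; substitute it below
sudo systemctl enable prometheus alertmanager
sudo systemctl start prometheus alertmanager
sudo systemctl status prometheus alertmanager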

Install Grafana

Install Grafana for creating dashboards and visualizing ScyllaDB metrics.

# Ubuntu / Debian
sudo apt install -y software-properties-common
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
echo "deb https://packages.grafana.com/oss/deb stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update
sudo apt install -y grafana

# AlmaLinux / Rocky Linux
sudo dnf install -y grafana

Configure Grafana

Edit /etc/grafana/grafana.ini to set the server address and admin credentials, and keep anonymous access disabled so dashboards require a login.

[server]
http_port = 3000
domain = example.com
root_url = http://example.com:3000/

[security]
admin_user = admin
admin_password = your_secure_password
secret_key = your_secret_key

[auth.anonymous]
enabled = false

[alerting]
execute_alerts = true

Start Grafana

Enable and start Grafana service, then access the web interface.

sudo systemctl enable grafana-server
sudo systemctl start grafana-server
sudo systemctl status grafana-server

Configure Prometheus data source in Grafana

Add Prometheus as a data source in Grafana to access ScyllaDB metrics.

curl -X POST http://admin:your_secure_password@localhost:3000/api/datasources \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://localhost:9090",
    "access": "proxy",
    "isDefault": true
  }'

Import ScyllaDB dashboard

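Create a ScyllaDB monitoring dashboard with key performance metrics. Save the JSON below as /tmp/scylla-dashboard.json, then post it to the Grafana API with the curl command that follows it.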

{
  "dashboard": {
    "title": "ScyllaDB Cluster Monitoring",
    "tags": ["scylladb", "performance"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Cluster Status",
        "type": "stat",
        "targets": [
          {
            "expr": "count(up{job=\"scylla\"} == 1)",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "displayName": "Nodes Up",
            "min": 0,
            "max": 3
          }
        },
        "gridPos": {"h": 8, "w": 6, "x": 0, "y": 0}
      },
      {
        "title": "Read Latency (99th percentile)",
        "type": "timeseries",
        "targets": [
          {
            "expr": "scylla_storage_proxy_coordinator_read_latency{quantile=\"0.99\"}",
            "refId": "A",
            "legendFormat": "{{instance}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "ms"
          }
        },
        "gridPos": {"h": 8, "w": 18, "x": 6, "y": 0}
      },
      {
        "title": "Write Latency (99th percentile)",
        "type": "timeseries",
        "targets": [
          {
            "expr": "scylla_storage_proxy_coordinator_write_latency{quantile=\"0.99\"}",
            "refId": "A",
            "legendFormat": "{{instance}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "ms"
          }
        },
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8}
      },
      {
        "title": "CPU Utilization",
        "type": "timeseries",
        "targets": [
          {
            "expr": "scylla_reactor_utilization",
            "refId": "A",
            "legendFormat": "{{instance}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100
          }
        },
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8}
      },
      {
        "title": "Memory Usage",
        "type": "timeseries",
        "targets": [
          {
            "expr": "scylla_memory_allocated_memory / scylla_memory_total_memory * 100",
            "refId": "A",
            "legendFormat": "{{instance}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100
          }
        },
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 16}
      },
      {
        "title": "Disk Usage",
        "type": "timeseries",
        "targets": [
          {
            "expr": "scylla_database_used_disk_space_bytes / scylla_database_total_disk_space_bytes * 100",
            "refId": "A",
            "legendFormat": "{{instance}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100
          }
        },
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 16}
      },
      {
        "title": "Operations per Second",
        "type": "timeseries",
        "targets": [
          {
            "expr": "rate(scylla_storage_proxy_coordinator_reads_total[5m])",
            "refId": "A",
            "legendFormat": "Reads - {{instance}}"
          },
          {
            "expr": "rate(scylla_storage_proxy_coordinator_writes_total[5m])",
            "refId": "B",
            "legendFormat": "Writes - {{instance}}"
          }
        ],
        "gridPos": {"h": 8, "w": 24, "x": 0, "y": 24}
      },
      {
        "title": "Compaction Tasks",
        "type": "timeseries",
        "targets": [
          {
            "expr": "scylla_compaction_manager_pending_tasks",
            "refId": "A",
            "legendFormat": "Pending - {{instance}}"
          },
          {
            "expr": "scylla_compaction_manager_active_tasks",
            "refId": "B",
            "legendFormat": "Active - {{instance}}"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 32}
      },
      {
        "title": "Error Rates",
        "type": "timeseries",
        "targets": [
          {
            "expr": "rate(scylla_storage_proxy_coordinator_read_timeouts_total[5m])",
            "refId": "A",
            "legendFormat": "Read Timeouts - {{instance}}"
          },
          {
            "expr": "rate(scylla_storage_proxy_coordinator_write_timeouts_total[5m])",
            "refId": "B",
            "legendFormat": "Write Timeouts - {{instance}}"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 32}
      }
    ],
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "refresh": "10s"
  }
}

curl -X POST http://admin:your_secure_password@localhost:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -d @/tmp/scylla-dashboard.json

Configure firewall access

Open necessary ports for monitoring services while maintaining security.

# Ubuntu / Debian (ufw)
# On the monitoring server
sudo ufw allow 9090/tcp comment 'Prometheus'
sudo ufw allow 9093/tcp comment 'Alertmanager'
sudo ufw allow 3000/tcp comment 'Grafana'
# On each ScyllaDB node
sudo ufw allow from 203.0.113.0/24 to any port 9180 comment 'ScyllaDB metrics'
sudo ufw reload

# AlmaLinux / Rocky Linux (firewalld)
# On the monitoring server
sudo firewall-cmd --permanent --add-port=9090/tcp --add-port=9093/tcp --add-port=3000/tcp
# On each ScyllaDB node
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="203.0.113.0/24" port protocol="tcp" port="9180" accept'
sudo firewall-cmd --reload

Configure performance optimization

Tune Prometheus retention

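Configure Prometheus retention and storage settings for long-term monitoring data. With both limits set, Prometheus removes the oldest data as soon as either the 90-day or the 50 GB limit is reached. On Ubuntu and Debian the packaged service reads these flags from the ARGS variable in /etc/default/prometheus; restart Prometheus after changing them.

ARGS="--config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/var/lib/prometheus --storage.tsdb.retention.time=90d --storage.tsdb.retention.size=50GB --web.console.libraries=/etc/prometheus/console_libraries --web.console.templates=/etc/prometheus/consoles --web.enable-lifecycle"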
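
Whether 50GB is enough for 90 days depends on how many samples Prometheus ingests. The Prometheus documentation gives a rough capacity formula: retention time in seconds, times ingested samples per second, times bytes per sample (typically 1 to 2). The sketch below applies it; `tsdb_bytes_estimate` is a hypothetical helper name, and the series count in the usage line is an assumed figure, not a measurement from this cluster.

```shell
# Hypothetical helper: rough TSDB size estimate in bytes.
# Arguments: retention days, active series, scrape interval (s), bytes per sample.
tsdb_bytes_estimate() {
  echo $(( $1 * 86400 * $2 * $4 / $3 ))
}

# Example usage: 90 days, an assumed 30000 active series, 10s interval, 2 bytes per sample
#   tsdb_bytes_estimate 90 30000 10 2
```

With those example inputs the estimate is 46656000000 bytes, just under the 50GB size limit, so a cluster exposing more series would hit the size limit before the 90-day limit.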

Configure ScyllaDB monitoring user

Create a monitoring-specific user in ScyllaDB with limited privileges for security.

cqlsh -e "CREATE USER monitoring WITH PASSWORD 'monitoring_password' NOSUPERUSER;"
cqlsh -e "GRANT SELECT ON ALL KEYSPACES TO monitoring;"

Set up log monitoring

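Configure log monitoring for ScyllaDB error detection and troubleshooting. The job below assumes a log-metrics exporter is already listening on port 9080 of each node; Prometheus scrapes the metrics that exporter exposes, not the log lines themselves.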

# Add this job to the existing scrape_configs section
  - job_name: 'scylla-logs'
    static_configs:
      - targets:
        - '203.0.113.10:9080'
        - '203.0.113.11:9080'
        - '203.0.113.12:9080'
    scrape_interval: 30s
    metrics_path: /metrics

Verify your setup

Check that all monitoring components are working and collecting data properly.

# Check ScyllaDB metrics endpoint
curl http://203.0.113.10:9180/metrics | grep scylla_storage_proxy

# Verify Prometheus targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job, instance, health}'

# Check Grafana datasource
curl -u admin:your_secure_password http://localhost:3000/api/datasources

# Test alerting rules
curl http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[].name'

# Verify cluster status
nodetool status
nodetool info
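
The first check above prints every matching line, which can run to many lines per shard. As a sketch, the hypothetical helper `count_series` reduces that to a single number by counting the sample lines whose metric name starts with a given prefix and skipping the `# HELP` and `# TYPE` comments.

```shell
# Hypothetical helper: count exposition-format sample lines on stdin
# whose metric name begins with the given prefix.
count_series() {
  grep -v '^#' | grep -c "^$1"
}

# Example usage:
#   curl -s http://203.0.113.10:9180/metrics | count_series scylla_storage_proxy
```

A count of 0 suggests the endpoint is reachable but not exposing the expected metrics.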

Common issues

Symptom | Cause | Fix
Prometheus can't reach ScyllaDB metrics | Firewall blocking port 9180 | Configure firewall rules or disable for testing
Grafana shows "No data" | Prometheus data source not configured | Check datasource URL and connectivity
High memory usage alerts | Normal ScyllaDB behavior | Adjust thresholds in alerting rules
Missing ScyllaDB metrics | prometheus_port not configured | Add prometheus_port to scylla.yaml and restart
Alertmanager not sending emails | SMTP configuration issues | Check SMTP settings and test with amtool
Dashboard shows connection refused | ScyllaDB node down | Check ScyllaDB service status with systemctl status scylla-server

Next steps

You now have comprehensive ScyllaDB monitoring with Prometheus and Grafana.
