Configure ScyllaDB cluster monitoring with Prometheus and Grafana dashboards

Intermediate | 45 min | Jun 13, 2026
Ubuntu 24.04 | Debian 12 | AlmaLinux 9 | Rocky Linux 9

Set up comprehensive monitoring for ScyllaDB clusters using Prometheus metrics collection and Grafana visualization dashboards. Configure alerting rules for performance monitoring and health checks.

Prerequisites

  • ScyllaDB cluster running
  • Root or sudo access
  • 4GB RAM minimum
  • Network access between nodes

What this solves

ScyllaDB provides extensive metrics through its built-in monitoring endpoints, but collecting and visualizing these metrics requires proper setup. This tutorial shows you how to configure Prometheus to scrape ScyllaDB metrics, set up Grafana dashboards for cluster visualization, and implement alerting rules for proactive monitoring of your NoSQL database cluster.

Step-by-step configuration

Install Prometheus

First, install Prometheus to collect metrics from your ScyllaDB cluster.

On Ubuntu/Debian, refresh the package index first:

sudo apt update

On AlmaLinux/Rocky Linux, update the installed packages first:

sudo dnf update -y

Then, on either distribution family, download Prometheus, unpack it into /opt/prometheus, and create a dedicated system user:

wget https://github.com/prometheus/prometheus/releases/download/v2.48.0/prometheus-2.48.0.linux-amd64.tar.gz
tar xzf prometheus-2.48.0.linux-amd64.tar.gz
sudo mv prometheus-2.48.0.linux-amd64 /opt/prometheus
sudo useradd --system --shell /bin/false prometheus
sudo chown -R prometheus:prometheus /opt/prometheus
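
Before unpacking a release tarball, it can be worth comparing its SHA-256 digest with the value listed in the sha256sums.txt file published alongside each Prometheus release. A minimal sketch, assuming sha256sum from coreutils is available; verify_sha256 is a hypothetical helper name, and the expected digest has to be copied from the release page:

```shell
# verify_sha256 FILE EXPECTED_HEX
# Succeeds only if FILE's SHA-256 digest equals EXPECTED_HEX.
verify_sha256() {
  file=$1
  expected=$2
  # sha256sum prints "<digest>  <filename>"; keep the digest only.
  actual=$(sha256sum "$file" | awk '{print $1}')
  if [ "$actual" = "$expected" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file (got $actual)" >&2
    return 1
  fi
}

# Usage (paste the digest from sha256sums.txt in place of the placeholder):
# verify_sha256 prometheus-2.48.0.linux-amd64.tar.gz "<digest from sha256sums.txt>"
```

The same helper works for the Alertmanager tarball later in this guide.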

Configure Prometheus for ScyllaDB

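Create the Prometheus configuration file at /opt/prometheus/prometheus.yml (the path the systemd unit later in this guide passes to --config.file) with ScyllaDB scrape targets. ScyllaDB exposes metrics on port 9180 by default. The 203.0.113.x addresses are placeholders from a documentation-only range; replace them with the addresses of your own nodes.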

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "scylla_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - localhost:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'scylla'
    static_configs:
      - targets:
        - '203.0.113.10:9180'
        - '203.0.113.11:9180'
        - '203.0.113.12:9180'
    scrape_interval: 10s
    metrics_path: /metrics
    params:
      format: [prometheus]

  - job_name: 'scylla-manager'
    static_configs:
      - targets:
        - '203.0.113.10:56090'
    scrape_interval: 30s
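
For clusters with more than a few nodes, the target list can be generated instead of typed by hand. A minimal sketch, assuming the default metrics port 9180 mentioned above; scylla_targets is a hypothetical helper, and its output is meant to be pasted under the targets key of the scylla job:

```shell
# scylla_targets HOST...
# Prints one YAML list item per host, using ScyllaDB's default metrics port.
scylla_targets() {
  port=9180
  for host in "$@"; do
    # Eight leading spaces match the indentation of the targets list above.
    printf "        - '%s:%s'\n" "$host" "$port"
  done
}

scylla_targets 203.0.113.10 203.0.113.11 203.0.113.12
```

Once the node count grows further, Prometheus's file-based service discovery may be a better fit than a static list.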

Create ScyllaDB alerting rules

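Define alerting rules for ScyllaDB performance and health monitoring. Save them as /opt/prometheus/scylla_rules.yml, next to prometheus.yml, so that the relative rule_files entry in the Prometheus configuration resolves. Metric names and units differ between ScyllaDB releases, so check each expression and threshold against the output of your nodes' /metrics endpoint before relying on an alert.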

groups:
  - name: scylla.rules
    rules:
      - alert: ScyllaNodeDown
        expr: up{job="scylla"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "ScyllaDB node is down"
          description: "ScyllaDB node {{ $labels.instance }} has been down for more than 1 minute."

      - alert: ScyllaHighCPU
        expr: scylla_reactor_utilization > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU utilization on ScyllaDB node"
          description: "CPU utilization is {{ $value }} on {{ $labels.instance }}"

      - alert: ScyllaHighLatency
        expr: scylla_storage_proxy_coordinator_read_latency{quantile="0.99"} > 100000
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High read latency detected"
          description: "99th percentile read latency is {{ $value }}us on {{ $labels.instance }}"

      - alert: ScyllaLowDiskSpace
        expr: scylla_node_filesystem_avail_bytes / scylla_node_filesystem_size_bytes < 0.1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on ScyllaDB node"
          description: "Available disk space is below 10% on {{ $labels.instance }}"

      - alert: ScyllaCompactionBacklog
        expr: scylla_compaction_manager_pending_tasks > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High compaction backlog"
          description: "Compaction backlog has {{ $value }} pending tasks on {{ $labels.instance }}"

      - alert: ScyllaHighMemoryUsage
        expr: scylla_memory_allocated_bytes / scylla_memory_total_bytes > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on ScyllaDB node"
          description: "Memory usage is {{ $value | humanizePercentage }} on {{ $labels.instance }}"

      - alert: ScyllaTimeouts
        expr: rate(scylla_storage_proxy_coordinator_read_timeouts_total[5m]) > 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High timeout rate detected"
          description: "Read timeout rate is {{ $value }}/sec on {{ $labels.instance }}"

      - alert: ScyllaErrorRate
        expr: rate(scylla_storage_proxy_coordinator_read_errors_total[5m]) > 0.1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Read error rate is {{ $value }}/sec on {{ $labels.instance }}"

      - alert: ScyllaStreamingErrors
        expr: rate(scylla_streaming_total_incoming_bytes[5m]) == 0 and scylla_node_operation_mode{mode="NORMAL"} == 1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "No streaming activity detected"
          description: "No incoming streaming detected on {{ $labels.instance }} during repair/bootstrap"

      - alert: ScyllaLargePartitions
        expr: scylla_large_partition_exceeding_threshold_total > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Large partitions detected"
          description: "{{ $value }} large partitions found on {{ $labels.instance }}"

      - alert: ScyllaGCPressure
        # deriv() measures the slope of a gauge; rate() is meant for counters
        # and never returns a negative value.
        expr: deriv(scylla_memory_free_bytes[5m]) < 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory pressure detected"
          description: "Decreasing free memory trend on {{ $labels.instance }}"

      - alert: ScyllaConnectionErrors
        expr: rate(scylla_cql_connections_rejected_total[5m]) > 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High connection rejection rate"
          description: "CQL connection rejection rate is {{ $value }}/sec on {{ $labels.instance }}"

      - alert: ScyllaRepairProgress
        expr: scylla_repair_segment_total == 0 and on(instance) scylla_node_operation_mode{mode="NORMAL"} == 1
        for: 24h
        labels:
          severity: warning
        annotations:
          summary: "No repair activity in 24 hours"
          description: "Node {{ $labels.instance }} has not run repair in over 24 hours"
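
A syntax error in either file can keep Prometheus from starting, so it helps to validate both before launching the service. The release tarball includes promtool, whose check config and check rules subcommands do this. A minimal sketch, assuming the file locations used in this guide; validate_prometheus_files is a hypothetical wrapper, and the PROMTOOL variable exists only so the binary path can be overridden:

```shell
# validate_prometheus_files CONFIG RULES
# Runs promtool's config and rule checks; fails if either check fails.
validate_prometheus_files() {
  promtool_bin=${PROMTOOL:-/opt/prometheus/promtool}
  "$promtool_bin" check config "$1" || return 1
  "$promtool_bin" check rules "$2" || return 1
  echo "configuration and rules look valid"
}

# Usage:
# validate_prometheus_files /opt/prometheus/prometheus.yml /opt/prometheus/scylla_rules.yml
```

Running the check after every edit to the rule file is cheap and catches indentation mistakes early.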

Create Prometheus systemd service

Set up Prometheus as a systemd service for automatic startup and management. Save the following unit as /etc/systemd/system/prometheus.service.

[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/prometheus/prometheus \
  --config.file=/opt/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/prometheus/data \
  --web.console.templates=/opt/prometheus/consoles \
  --web.console.libraries=/opt/prometheus/console_libraries \
  --web.listen-address=0.0.0.0:9090 \
  --web.enable-lifecycle

[Install]
WantedBy=multi-user.target

Install and configure Grafana

Install Grafana for creating dashboards and visualizations of ScyllaDB metrics.

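On Ubuntu/Debian (apt-key is deprecated, so the signing key goes into a dedicated keyring that the repository entry references):

sudo apt install -y software-properties-common
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://packages.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://packages.grafana.com/oss/deb stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update
sudo apt install -y grafana

On AlmaLinux/Rocky Linux:

sudo tee /etc/yum.repos.d/grafana.repo <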
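
The repository definition for AlmaLinux/Rocky Linux is cut off above. As a stand-in, here is a minimal sketch that follows the RPM repository layout described in Grafana's installation documentation. The rpm.grafana.com URLs and the write_grafana_repo helper are assumptions rather than the original text of this guide, so compare them with the current Grafana documentation before use:

```shell
# write_grafana_repo [PATH]
# Writes a dnf/yum repository definition for Grafana to PATH
# (default: /etc/yum.repos.d/grafana.repo, which needs root).
write_grafana_repo() {
  repo_file=${1:-/etc/yum.repos.d/grafana.repo}
  cat > "$repo_file" <<'EOF'
[grafana]
name=grafana
baseurl=https://rpm.grafana.com
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://rpm.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
EOF
}
```

Run it from a root shell, then install the package with sudo dnf install -y grafana.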

Start monitoring services

Enable and start both Prometheus and Grafana services.

sudo mkdir -p /opt/prometheus/data
sudo chown prometheus:prometheus /opt/prometheus/data
sudo systemctl daemon-reload
sudo systemctl enable --now prometheus
sudo systemctl enable --now grafana-server
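
Both services take a moment to come up, so a script that continues straight to the next step may hit connection errors. A minimal sketch of a retry loop, assuming curl is installed; wait_until is a hypothetical helper, /-/ready is the Prometheus readiness endpoint, and /api/health is the Grafana health endpoint used in the verification step:

```shell
# wait_until TRIES DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or TRIES attempts are used up.
wait_until() {
  tries=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$tries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 1
}

# Usage (up to 30 attempts, 2 seconds apart):
# wait_until 30 2 curl -fsS http://localhost:9090/-/ready
# wait_until 30 2 curl -fsS http://localhost:3000/api/health
```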

Configure Grafana data source

Add Prometheus as a data source in Grafana and import ScyllaDB dashboards.

curl -X POST http://admin:admin@localhost:3000/api/datasources \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://localhost:9090",
    "access": "proxy",
    "basicAuth": false
  }'
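
If Prometheus runs on a different host than Grafana, only the url field of the payload changes. A minimal sketch that builds the same JSON body from arguments; datasource_payload is a hypothetical helper, and the host name in the usage comment is a placeholder:

```shell
# datasource_payload NAME URL
# Prints the JSON body for Grafana's POST /api/datasources call.
# No JSON escaping is done, so NAME and URL must not contain
# double quotes or backslashes.
datasource_payload() {
  printf '{"name":"%s","type":"prometheus","url":"%s","access":"proxy","basicAuth":false}\n' "$1" "$2"
}

# Usage:
# datasource_payload Prometheus http://prometheus.example.internal:9090 |
#   curl -X POST http://admin:admin@localhost:3000/api/datasources \
#     -H "Content-Type: application/json" -d @-
```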

Import ScyllaDB dashboard

Create a comprehensive dashboard for ScyllaDB cluster monitoring with key performance indicators.

curl -X POST http://admin:admin@localhost:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -d @- <<'EOF'
{
  "dashboard": {
    "id": null,
    "title": "ScyllaDB Cluster Overview",
    "tags": ["scylla", "database"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Node Status",
        "type": "stat",
        "targets": [
          {
            "expr": "up{job=\"scylla\"}",
            "legendFormat": "{{instance}}"
          }
        ],
        "gridPos": {"h": 4, "w": 6, "x": 0, "y": 0}
      },
      {
        "id": 2,
        "title": "CPU Utilization",
        "type": "timeseries",
        "targets": [
          {
            "expr": "scylla_reactor_utilization",
            "legendFormat": "{{instance}}"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 4}
      },
      {
        "id": 3,
        "title": "Read Latency (99th percentile)",
        "type": "timeseries",
        "targets": [
          {
            "expr": "scylla_storage_proxy_coordinator_read_latency{quantile=\"0.99\"}",
            "legendFormat": "{{instance}}"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 4}
      },
      {
        "id": 4,
        "title": "Write Latency (99th percentile)",
        "type": "timeseries",
        "targets": [
          {
            "expr": "scylla_storage_proxy_coordinator_write_latency{quantile=\"0.99\"}",
            "legendFormat": "{{instance}}"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 12}
      },
      {
        "id": 5,
        "title": "Operations per Second",
        "type": "timeseries",
        "targets": [
          {
            "expr": "rate(scylla_cql_reads_total[5m])",
            "legendFormat": "Reads - {{instance}}"
          },
          {
            "expr": "rate(scylla_cql_inserts_total[5m])",
            "legendFormat": "Writes - {{instance}}"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 12}
      }
    ],
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "refresh": "5s"
  }
}
EOF

Install Alertmanager

Set up Alertmanager to handle alerts generated by Prometheus rules.

wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
tar xzf alertmanager-0.26.0.linux-amd64.tar.gz
sudo mv alertmanager-0.26.0.linux-amd64 /opt/alertmanager
sudo useradd --system --shell /bin/false alertmanager
sudo chown -R alertmanager:alertmanager /opt/alertmanager

Configure Alertmanager

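Set up email notifications for ScyllaDB alerts. Save the configuration as /opt/alertmanager/alertmanager.yml, and replace the SMTP host, addresses, and password with your own.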

global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'your-email-password'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
  routes:
  - match:
      severity: critical
    receiver: 'critical-email'
  - match:
      severity: warning
    receiver: 'warning-email'

receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'
  - name: 'critical-email'
    email_configs:
      - to: 'admin@example.com'
        # The subject is set through headers; text is the plain-text body.
        headers:
          Subject: 'CRITICAL: ScyllaDB Alert - {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Instance: {{ .Labels.instance }}
          Severity: {{ .Labels.severity }}
          {{ end }}
  - name: 'warning-email'
    email_configs:
      - to: 'monitoring@example.com'
        headers:
          Subject: 'WARNING: ScyllaDB Alert - {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Instance: {{ .Labels.instance }}
          {{ end }}

Create Alertmanager service

Set up Alertmanager as a systemd service. Save the following unit as /etc/systemd/system/alertmanager.service.

[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/opt/alertmanager/alertmanager \
  --config.file=/opt/alertmanager/alertmanager.yml \
  --storage.path=/opt/alertmanager/data

[Install]
WantedBy=multi-user.target

Start Alertmanager

Enable and start the Alertmanager service.

sudo mkdir -p /opt/alertmanager/data
sudo chown alertmanager:alertmanager /opt/alertmanager/data
sudo systemctl daemon-reload
sudo systemctl enable --now alertmanager
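
To confirm that routing and email delivery work without waiting for a real incident, a synthetic alert can be posted straight to Alertmanager. A minimal sketch, assuming Alertmanager's v2 HTTP API on the default port 9093; test_alert_payload, the ScyllaTestAlert name, and the test-node instance label are hypothetical:

```shell
# test_alert_payload SEVERITY
# Prints a one-element JSON array in the shape POST /api/v2/alerts accepts.
test_alert_payload() {
  printf '[{"labels":{"alertname":"ScyllaTestAlert","severity":"%s","instance":"test-node"},"annotations":{"summary":"Test alert","description":"Synthetic alert for routing checks"}}]\n' "$1"
}

# Usage (sends a critical test alert to the local Alertmanager):
# test_alert_payload critical |
#   curl -fsS -X POST http://localhost:9093/api/v2/alerts \
#     -H "Content-Type: application/json" -d @-
```

With the routes defined above, a critical payload should reach the critical-email receiver and a warning payload the warning-email receiver.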

Configure ScyllaDB monitoring agent

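Download the Scylla Monitoring Stack, which bundles the official Grafana dashboards and the helper scripts used in the next step. GitHub names the extracted directory after both the repository and the tag, so the archive unpacks as scylla-monitoring-scylla-monitoring-4.5.0.

wget https://github.com/scylladb/scylla-monitoring/archive/refs/tags/scylla-monitoring-4.5.0.tar.gz
tar xzf scylla-monitoring-4.5.0.tar.gz
sudo mv scylla-monitoring-scylla-monitoring-4.5.0 /opt/scylla-monitoring
sudo chown -R prometheus:prometheus /opt/scylla-monitoring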

Import advanced ScyllaDB dashboards

Import official ScyllaDB Grafana dashboards for comprehensive monitoring.

cd /opt/scylla-monitoring
sudo -u prometheus ./start-grafana.sh -s prometheus_servers.yml -n node_exporter_servers.yml -G

Verify your setup

Check that all monitoring components are running and collecting metrics properly.

sudo systemctl status prometheus
sudo systemctl status grafana-server
sudo systemctl status alertmanager
curl http://localhost:9090/api/v1/targets
curl http://localhost:3000/api/health
curl http://localhost:9093/api/v2/status

Access Grafana at http://your-server:3000 (default credentials admin/admin; change the password at first login) and verify that ScyllaDB metrics are being collected. Check that the dashboard shows current cluster status and performance metrics.
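
The targets endpoint returns a fairly long JSON document. A minimal sketch for a quick health count without extra tooling, assuming each healthy target appears as "health":"up" in the compact JSON that Prometheus returns; count_up_targets is a hypothetical helper:

```shell
# count_up_targets
# Reads a /api/v1/targets response on stdin and prints the number of
# targets whose health field is "up".
count_up_targets() {
  grep -o '"health":"up"' | wc -l | tr -d ' '
}

# Usage:
# curl -s http://localhost:9090/api/v1/targets | count_up_targets
```

With the scrape configuration from this guide (one Prometheus target, three ScyllaDB nodes, one Scylla Manager target), a fully healthy setup should report 5.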

Common issues

Symptom | Cause | Fix
No metrics from ScyllaDB | Wrong port or endpoint | Verify the ScyllaDB metrics endpoint: curl http://node:9180/metrics
Prometheus can't scrape targets | Firewall blocking access | Open port 9180: sudo ufw allow 9180
Grafana shows no data | Data source not configured | Check the Prometheus data source URL in Grafana settings
Alerts not firing | Alertmanager not connected | Verify the Alertmanager target in Prometheus: http://localhost:9090/alerts
Dashboard import fails | JSON format error | Use the Grafana UI to import dashboard ID 9614 for ScyllaDB
High memory usage | Too many metrics retained | Adjust Prometheus retention: --storage.tsdb.retention.time=30d
