Monitor Apache Cassandra cluster with Prometheus and Grafana dashboards

Intermediate 45 min May 01, 2026
Ubuntu 24.04 Debian 12 AlmaLinux 9 Rocky Linux 9

Set up comprehensive monitoring for Apache Cassandra clusters using JMX exporter, Prometheus metrics collection, and Grafana dashboards with alerting rules for cluster health.

Prerequisites

  • Apache Cassandra cluster running
  • Root or sudo access
  • Network connectivity between nodes
  • Basic knowledge of Prometheus and Grafana

What this solves

Apache Cassandra clusters generate hundreds of performance and health metrics through JMX, but without proper monitoring, you'll miss critical issues like node failures, disk space problems, or read/write latency spikes. This tutorial configures JMX exporter to expose Cassandra metrics, sets up Prometheus to collect them, and creates Grafana dashboards with alerting rules for comprehensive cluster monitoring.

Step-by-step configuration

Install JMX Prometheus exporter

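Download the JMX exporter, which exposes Cassandra metrics in Prometheus format. Repeat this and the following exporter steps on every Cassandra node, because Prometheus scrapes each node on port 7070.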

cd /opt
sudo wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.20.0/jmx_prometheus_javaagent-0.20.0.jar
sudo chown cassandra:cassandra jmx_prometheus_javaagent-0.20.0.jar
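
Before loading a downloaded JAR into the database JVM, it can be worth checking its integrity. The sketch below assumes the sha1sum utility is installed and that the expected digest comes from the .sha1 file Maven Central publishes next to each artifact; verify_sha1 is a hypothetical helper name, not part of any tool.

```shell
#!/bin/sh
# verify_sha1 FILE EXPECTED_HEX
# Returns 0 when the SHA-1 digest of FILE equals EXPECTED_HEX, 1 otherwise.
# Hypothetical helper for illustration; assumes sha1sum (GNU coreutils).
verify_sha1() {
    file=$1
    expected=$2
    # sha1sum prints "<digest>  <filename>"; keep only the digest field.
    actual=$(sha1sum "$file" | awk '{print $1}')
    [ "$actual" = "$expected" ]
}

# Example usage (commented out because it needs network access):
# jar=/opt/jmx_prometheus_javaagent-0.20.0.jar
# url=https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.20.0/jmx_prometheus_javaagent-0.20.0.jar.sha1
# expected=$(curl -fsSL "$url" | awk '{print $1}')
# verify_sha1 "$jar" "$expected" && echo "checksum OK"
```

A mismatch usually points to a truncated download, so fetching the file again is the first thing to try.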

Create JMX exporter configuration

Configure the JMX exporter to collect essential Cassandra metrics including node health, keyspace metrics, and thread pool statistics. Save the following as /opt/cassandra-jmx-config.yaml, the path the Java agent option refers to.

# Selectors assume Cassandra's standard MBean names (type=..., scope=..., name=...);
# verify them for your Cassandra version, for example with jconsole.
rules:
  # Node health metrics
  - pattern: "org.apache.cassandra.metrics<type=Storage, name=Load><>(Count|Value)"
    name: cassandra_storage_load_bytes
    help: "Total disk space used by node in bytes"

  # Read/Write latency
  - pattern: "org.apache.cassandra.metrics<type=ClientRequest, scope=(Read|Write), name=Latency><>Count"
    name: cassandra_client_request_latency_total
    labels:
      request_type: "$1"
    help: "Total client request latency count"

  - pattern: "org.apache.cassandra.metrics<type=ClientRequest, scope=(Read|Write), name=Latency><>(Mean|95thPercentile|99thPercentile)"
    name: cassandra_client_request_latency_seconds
    type: GAUGE
    # Cassandra reports latency in microseconds; convert to seconds
    valueFactor: 0.000001
    labels:
      request_type: "$1"
      quantile: "$2"
    help: "Client request latency in seconds"

  # Connection metrics
  - pattern: "org.apache.cassandra.metrics<type=(Client), name=([^>]+)><>Value"
    name: cassandra_connection_$2
    labels:
      connection_type: "$1"
    help: "Cassandra connection metrics"

  # Keyspace metrics
  - pattern: "org.apache.cassandra.metrics<type=Keyspace, keyspace=([^,]+), name=([^>]+)><>(Count|Value)"
    name: cassandra_keyspace_$2
    labels:
      keyspace: "$1"
    help: "Cassandra keyspace metrics"

  # Table metrics
  - pattern: "org.apache.cassandra.metrics<type=Table, keyspace=([^,]+), scope=([^,]+), name=([^>]+)><>(Count|Value)"
    name: cassandra_table_$3
    labels:
      keyspace: "$1"
      table: "$2"
    help: "Cassandra table metrics"

  # Thread pool metrics (gauges expose Value, blocked-task counters expose Count)
  - pattern: "org.apache.cassandra.metrics<type=ThreadPools, path=([^,]+), scope=([^,]+), name=([^>]+)><>(Count|Value)"
    name: cassandra_threadpool_$3
    labels:
      pool_type: "$1"
      pool_name: "$2"
    help: "Cassandra thread pool metrics"

  # Compaction metrics
  - pattern: "org.apache.cassandra.metrics<type=Compaction, name=([^>]+)><>(Count|Value)"
    name: cassandra_compaction_$1
    help: "Cassandra compaction metrics"

  # Cache metrics
  - pattern: "org.apache.cassandra.metrics<type=Cache, scope=([^,]+), name=([^>]+)><>(Count|Value)"
    name: cassandra_cache_$2
    labels:
      cache_name: "$1"
    help: "Cassandra cache metrics"

Configure Cassandra with JMX exporter

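Add the JMX exporter as a Java agent to Cassandra's JVM startup options by appending the following to cassandra-env.sh (usually /etc/cassandra/cassandra-env.sh on Debian and Ubuntu packages, /etc/cassandra/conf/cassandra-env.sh on RPM-based installs). The agent serves metrics on port 7070 and reads /opt/cassandra-jmx-config.yaml.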

# Add JMX Prometheus exporter
JVM_OPTS="$JVM_OPTS -javaagent:/opt/jmx_prometheus_javaagent-0.20.0.jar=7070:/opt/cassandra-jmx-config.yaml"
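
If the setup is scripted and may run more than once, the agent line should be added only when it is missing, otherwise the JVM would be asked to load the agent twice. A minimal sketch, assuming POSIX sh and grep; ensure_line is a hypothetical helper, and the cassandra-env.sh path in the usage comment is an assumption about the package layout.

```shell
#!/bin/sh
# ensure_line FILE LINE
# Appends LINE to FILE only if an identical line is not already present,
# so re-running the setup does not register the Java agent twice.
# Hypothetical helper for illustration.
ensure_line() {
    file=$1
    line=$2
    # -F: fixed string, -x: whole-line match, -q: quiet
    if ! grep -Fxq -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Example usage (run as root; the path is an assumption, adjust as needed):
# ensure_line /etc/cassandra/cassandra-env.sh \
#   'JVM_OPTS="$JVM_OPTS -javaagent:/opt/jmx_prometheus_javaagent-0.20.0.jar=7070:/opt/cassandra-jmx-config.yaml"'
```

The single quotes in the usage example keep $JVM_OPTS literal, so the variable is expanded by cassandra-env.sh at startup rather than by the calling shell.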

Restart Cassandra service

Restart Cassandra to load the Java agent. On a multi-node cluster, restart one node at a time and wait for it to rejoin before restarting the next.

sudo systemctl restart cassandra
sudo systemctl status cassandra

Verify JMX exporter is working

Check that the JMX exporter is exposing metrics on port 7070.

curl -s http://localhost:7070/metrics | grep cassandra_storage_load_bytes
ss -tlnp | grep 7070
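
A single grep hit shows the agent is alive, but not how much it exposes. The sketch below summarizes a scrape by counting distinct metric names with a given prefix. It assumes awk is available; count_metric_families is a hypothetical helper that reads the Prometheus text exposition format from stdin.

```shell
#!/bin/sh
# count_metric_families PREFIX
# Reads Prometheus text exposition format on stdin and prints the number of
# distinct metric names that start with PREFIX. Comment lines (# HELP, # TYPE)
# and blank lines are skipped. Hypothetical helper for illustration.
count_metric_families() {
    prefix=$1
    awk -v p="$prefix" '
        /^#/ || NF == 0 { next }
        {
            name = $1
            sub(/\{.*/, "", name)      # drop the label set, keep the bare name
            if (index(name, p) == 1) seen[name] = 1
        }
        END { n = 0; for (k in seen) n++; print n }
    '
}

# Example usage against a running exporter:
# curl -s http://localhost:7070/metrics | count_metric_families cassandra_
```

A result of 0 with a reachable endpoint suggests that none of the rule patterns matched any MBean.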

Install Prometheus

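Install Prometheus to collect metrics from the Cassandra JMX exporter. Run only the commands for your distribution.

# Ubuntu / Debian
sudo apt update
sudo apt install -y prometheus
# AlmaLinux / Rocky Linux (needs a repository that provides prometheus2)
sudo dnf install -y prometheus2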

Configure Prometheus to scrape Cassandra metrics

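Add the Cassandra nodes to the Prometheus configuration, usually /etc/prometheus/prometheus.yml on package installs. Replace the cassandra-node-* hostnames with the names or IP addresses of your own nodes.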

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "cassandra_alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - localhost:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  
  - job_name: 'cassandra-cluster'
    static_configs:
      - targets:
        - 'cassandra-node-1:7070'
        - 'cassandra-node-2:7070'
        - 'cassandra-node-3:7070'
    scrape_interval: 30s
    scrape_timeout: 10s
    metrics_path: /metrics
    params:
      format: ['prometheus']
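
For clusters with more than a handful of nodes, the target list can be generated rather than typed. A minimal sketch, assuming every node exposes the exporter on the same port; render_targets is a hypothetical helper, and its output is meant to be pasted under targets: in the cassandra-cluster job.

```shell
#!/bin/sh
# render_targets PORT HOST...
# Prints one YAML list item per host, indented to match the entries under
# static_configs.targets in the configuration above.
# Hypothetical helper for illustration.
render_targets() {
    port=$1
    shift
    for host in "$@"; do
        printf "        - '%s:%s'\n" "$host" "$port"
    done
}

# Example with the hostnames used in this tutorial:
render_targets 7070 cassandra-node-1 cassandra-node-2 cassandra-node-3
```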

Create Cassandra alerting rules

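Define alert rules for critical Cassandra cluster conditions. Save them as cassandra_alerts.yml in the same directory as prometheus.yml, because relative rule_files paths are resolved against the directory of the main configuration file.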

groups:
  - name: cassandra_cluster
    rules:
    
    # Node availability
    - alert: CassandraNodeDown
      expr: up{job="cassandra-cluster"} == 0
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Cassandra node {{ $labels.instance }} is down"
        description: "Cassandra node {{ $labels.instance }} has been down for more than 2 minutes."
    
    # High read latency
    - alert: CassandraHighReadLatency
      expr: cassandra_client_request_latency_seconds{request_type="Read", quantile="95thPercentile"} > 0.1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High Cassandra read latency on {{ $labels.instance }}"
        description: "95th percentile read latency is {{ $value }}s on {{ $labels.instance }}."
    
    # High write latency
    - alert: CassandraHighWriteLatency
      expr: cassandra_client_request_latency_seconds{request_type="Write", quantile="95thPercentile"} > 0.05
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High Cassandra write latency on {{ $labels.instance }}"
        description: "95th percentile write latency is {{ $value }}s on {{ $labels.instance }}."
    
    # Disk space usage
    - alert: CassandraHighDiskUsage
      expr: (cassandra_storage_load_bytes / (1024^3)) > 80
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High disk usage on Cassandra node {{ $labels.instance }}"
        description: "Disk usage is {{ $value }}GB on {{ $labels.instance }}."
    
    # Pending compactions
    - alert: CassandraHighPendingCompactions
      expr: cassandra_compaction_PendingTasks > 20
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "High pending compactions on {{ $labels.instance }}"
        description: "{{ $value }} compactions are pending on {{ $labels.instance }}."
    
    # Thread pool queue size
    - alert: CassandraHighThreadPoolQueue
      expr: cassandra_threadpool_PendingTasks{pool_name="MutationStage"} > 100
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High thread pool queue on {{ $labels.instance }}"
        description: "{{ $labels.pool_name }} has {{ $value }} pending tasks on {{ $labels.instance }}."
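
The thresholds in these rules are written in base units, so it helps to see the arithmetic once. A small sketch, assuming POSIX sh and awk; gib_to_bytes and seconds_to_micros are hypothetical helper names, and the microsecond conversion assumes the raw JMX latency values are reported in microseconds.

```shell
#!/bin/sh
# gib_to_bytes N: prints N GiB expressed in bytes (1 GiB = 1024^3 bytes).
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# seconds_to_micros S: prints S seconds as whole microseconds.
# awk is used because POSIX shell arithmetic is integer-only.
seconds_to_micros() {
    awk -v s="$1" 'BEGIN { printf "%d\n", s * 1000000 + 0.5 }'
}

gib_to_bytes 80          # disk rule threshold in bytes
seconds_to_micros 0.1    # read latency threshold in microseconds
seconds_to_micros 0.05   # write latency threshold in microseconds
```

The disk rule therefore compares the per-node load against 85899345920 bytes, and the read and write latency thresholds correspond to 100000 and 50000 microseconds.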

Start and enable Prometheus

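Enable Prometheus to start automatically, then restart it so the new configuration is loaded. The Debian and Ubuntu packages start the service at install time with the default configuration, so enabling alone is not enough.

sudo systemctl enable prometheus
sudo systemctl restart prometheus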
sudo systemctl status prometheus
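
Whenever prometheus.yml or the rules file changes, validating both before a restart avoids taking monitoring down with a syntax error. A minimal sketch using promtool, which ships with Prometheus; validate_prometheus and the PROMTOOL override variable are hypothetical names, and the paths in the usage comment assume the distribution package layout.

```shell
#!/bin/sh
# PROMTOOL can be overridden, e.g. to point at a non-standard install path.
PROMTOOL=${PROMTOOL:-promtool}

# validate_prometheus CONFIG RULES
# Runs promtool's checks on the main config and on the rules file and
# returns non-zero if either check fails. Hypothetical wrapper.
validate_prometheus() {
    config=$1
    rules=$2
    "$PROMTOOL" check config "$config" || return 1
    "$PROMTOOL" check rules "$rules" || return 1
}

# Example usage (paths are assumptions):
# validate_prometheus /etc/prometheus/prometheus.yml /etc/prometheus/cassandra_alerts.yml \
#   && sudo systemctl restart prometheus
```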

Install Grafana

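Install Grafana for visualizing Cassandra cluster metrics. Run only the commands for your distribution.

# Ubuntu / Debian (apt-key is deprecated, so the key goes into a keyring file)
sudo apt install -y apt-transport-https wget gnupg
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update
sudo apt install -y grafana
# AlmaLinux / Rocky Linux
sudo dnf install -y grafana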

Start and enable Grafana

Enable Grafana service and access the web interface.

sudo systemctl enable --now grafana-server
sudo systemctl status grafana-server

Configure Grafana data source

Add Prometheus as a data source in Grafana. Navigate to http://your-server:3000 and log in with the default credentials (admin/admin), then go to Connections > Data sources > Add data source (Configuration > Data Sources in older Grafana releases).

Name: Prometheus
Type: Prometheus
URL: http://localhost:9090
Access: Server (default)
HTTP Method: GET

Create Cassandra cluster overview dashboard

Create a comprehensive dashboard for monitoring cluster health. In Grafana, go to Dashboards > New > New Dashboard and add these panels.

Node availability

up{job="cassandra-cluster"}

Total nodes

count(up{job="cassandra-cluster"})

Healthy nodes

count(up{job="cassandra-cluster"} == 1)

Add read/write latency panels

Monitor client request latencies across the cluster.

Read latency 95th percentile

cassandra_client_request_latency_seconds{request_type="Read", quantile="95thPercentile"}

Write latency 95th percentile

cassandra_client_request_latency_seconds{request_type="Write", quantile="95thPercentile"}

Read throughput

rate(cassandra_client_request_latency_total{request_type="Read"}[5m])

Write throughput

rate(cassandra_client_request_latency_total{request_type="Write"}[5m])

Add storage and compaction panels

Monitor disk usage and compaction activity.

Disk usage per node (GB)

cassandra_storage_load_bytes / (1024^3)

Pending compactions

cassandra_compaction_PendingTasks

Completed compactions rate

rate(cassandra_compaction_CompletedTasks[5m])

Add thread pool monitoring

Monitor thread pool health and queue sizes.

Active tasks

cassandra_threadpool_ActiveTasks

Pending tasks

cassandra_threadpool_PendingTasks

Blocked tasks

cassandra_threadpool_CurrentlyBlockedTasks

Configure alerting notifications

Set up a notification target for Grafana alerts. Go to Alerting > Contact points (Alerting > Notification channels in Grafana releases that still use legacy alerting).

Name: cassandra-alerts
Type: Email
Addresses: ops-team@example.com
Subject: Cassandra Alert - {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}

Create alerting rules in Grafana

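Configure dashboard alerts for critical metrics. Edit each panel and open the Alert tab. The example below is for the node availability panel and fires when a node's up value drops below 1.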

Condition: IS BELOW 1
Evaluation: every 1m for 2m
Notifications: Send to cassandra-alerts
Message: Cassandra node is down - check cluster status immediately

Export and save dashboard

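Save your dashboard configuration for backup and version control. Grafana's HTTP API returns a dashboard by its UID, which appears in the dashboard URL after /d/. Replace DASHBOARD_UID below with that value.

curl -s -u admin:admin http://localhost:3000/api/dashboards/uid/DASHBOARD_UID > cassandra-dashboard.json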

Verify your setup

Confirm that your monitoring stack is collecting and displaying Cassandra metrics correctly.

Check Cassandra JMX exporter

curl -s http://localhost:7070/metrics | grep -c cassandra_

Verify Prometheus targets

curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[].health'
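
For use in a script or a cron check, the list of health strings can be reduced to one line and an exit code. A minimal sketch, assuming awk; summarize_health is a hypothetical helper that reads one health value per line from stdin, as produced by the jq filter in this check.

```shell
#!/bin/sh
# summarize_health
# Reads one health string per line on stdin (for example "up" or "down",
# with or without surrounding double quotes) and prints "<up>/<total> up".
# Returns non-zero when there are no targets or any target is not up.
# Hypothetical helper for illustration.
summarize_health() {
    awk '
        NF == 0 { next }
        { gsub(/"/, ""); total++; if ($1 == "up") up++ }
        END {
            printf "%d/%d up\n", up, total
            exit (total > 0 && up == total ? 0 : 1)
        }
    '
}

# Example usage against a running Prometheus:
# curl -s http://localhost:9090/api/v1/targets \
#   | jq '.data.activeTargets[].health' | summarize_health
```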

Test a Prometheus query

curl -s -G http://localhost:9090/api/v1/query --data-urlencode 'query=up{job="cassandra-cluster"}' | jq '.data.result[].value[1]'

Check Grafana is running

curl -s http://localhost:3000/api/health | jq '.database'

Verify alert rules are loaded

curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].name'

In Grafana, verify you can see:

  • All Cassandra nodes showing as "up" in the cluster status panel
  • Read and write latency metrics updating in real-time
  • Storage usage data for each node
  • Thread pool activity and queue sizes

Common issues

Symptom | Cause | Fix
JMX exporter port not accessible | Firewall blocking port 7070 | sudo ufw allow 7070 or configure iptables
Prometheus shows targets as down | Incorrect hostnames in config | Use IP addresses or verify DNS resolution
Missing Cassandra metrics in Prometheus | JMX exporter not loaded correctly | Check /var/log/cassandra/system.log for agent errors
High memory usage after enabling monitoring | Too frequent scraping or large metric cardinality | Increase scrape_interval to 60s and filter unused metrics
Grafana shows no data | Data source URL incorrect | Verify Prometheus URL is http://localhost:9090
Alerts not firing | Alert rule syntax errors | Validate rules with promtool check rules cassandra_alerts.yml

Performance impact: The JMX exporter adds minimal overhead, but frequent scraping can impact Cassandra performance. Start with 30-second intervals and adjust based on your cluster's capacity.
