Configure backup monitoring with Prometheus and Grafana for automated infrastructure oversight

Intermediate 45 min Apr 28, 2026
Ubuntu 24.04 Debian 12 AlmaLinux 9 Rocky Linux 9

Set up comprehensive backup monitoring using Prometheus metrics collection and Grafana dashboards. This tutorial covers backup exporter configuration, custom metrics creation, and automated alerting for backup failures and performance issues.

Prerequisites

  • Root or sudo access
  • At least 2GB RAM
  • Python 3 installed
  • Existing backup jobs to monitor

What this solves

Infrastructure backups often fail silently, leaving you vulnerable to data loss without warning. This tutorial sets up Prometheus to collect backup metrics and Grafana to visualize backup status, duration, and success rates. You'll get automated alerts when backups fail or take too long to complete.

Step-by-step installation

Update system packages

Start by updating your package manager to ensure you get the latest versions of monitoring tools.

# Ubuntu / Debian
sudo apt update && sudo apt upgrade -y
sudo apt install -y curl wget gpg

# AlmaLinux / Rocky Linux
sudo dnf update -y
sudo dnf install -y curl wget gpg

Install Prometheus

Download and install Prometheus server to collect backup metrics from various exporters.

cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xzf prometheus-2.45.0.linux-amd64.tar.gz
sudo mv prometheus-2.45.0.linux-amd64/prometheus /usr/local/bin/
sudo mv prometheus-2.45.0.linux-amd64/promtool /usr/local/bin/
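sudo mkdir -p /etc/prometheus /var/lib/prometheus
# Console templates referenced by the systemd unit's --web.console.* flags
sudo cp -r prometheus-2.45.0.linux-amd64/consoles prometheus-2.45.0.linux-amd64/console_libraries /etc/prometheus/
# Ownership is set in the next step, once the prometheus user exists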

Create Prometheus user and directories

Set up dedicated user and proper directory permissions for Prometheus to run securely.

sudo groupadd --system prometheus
sudo useradd -s /sbin/nologin --system -g prometheus prometheus
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus

Configure Prometheus for backup monitoring

Create the main Prometheus configuration at /etc/prometheus/prometheus.yml with backup-specific scrape targets and rules.

global:
  scrape_interval: 30s
  evaluation_interval: 30s

rule_files:
  - "backup_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - localhost:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']

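  - job_name: 'backup_exporter'
    # Keep the exporter's own job="..." label. Without this, Prometheus renames
    # it to exported_job and the alerts report job="backup_exporter" instead.
    honor_labels: true
    static_configs:
      - targets: ['localhost:9101']
    scrape_interval: 60s
    metrics_path: /metrics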

  - job_name: 'mysql_backup'
    static_configs:
      - targets: ['localhost:9104']
    scrape_interval: 300s

  - job_name: 'file_backup'
    file_sd_configs:
      - files:
        - '/etc/prometheus/backup_targets.yml'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
      - source_labels: [__address__]
        regex: '(.*)'
        target_label: __address__
        replacement: '${1}:9105'
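
The file_backup job reads its targets from /etc/prometheus/backup_targets.yml, which is not shown in this tutorial. The following is a minimal sketch of how such a file could be generated, and of what the relabel rule does to each address. The host names and the role label are placeholders, not values from this setup; targets are listed without a port because the relabel rule appends :9105.

```python
import re

# Hypothetical backup hosts; replace with your own. No port is given because
# the relabel rule in prometheus.yml appends ":9105" to every address.
BACKUP_HOSTS = ["backup01.example.com", "backup02.example.com"]


def render_file_sd(hosts, labels):
    """Render one file_sd target group (shared labels) as YAML text."""
    lines = ["- targets:"]
    lines += [f"    - '{host}'" for host in hosts]
    lines.append("  labels:")
    lines += [f"    {key}: '{value}'" for key, value in sorted(labels.items())]
    return "\n".join(lines) + "\n"


def relabel_address(address, port=9105):
    """Mimic the relabel rule: regex '(.*)' with replacement '${1}:9105'."""
    match = re.fullmatch(r"(.*)", address)  # relabel regexes are fully anchored
    return f"{match.group(1)}:{port}"


print(render_file_sd(BACKUP_HOSTS, {"role": "file_backup"}))
print(relabel_address("backup01.example.com"))
```

Write the rendered text to /etc/prometheus/backup_targets.yml; Prometheus re-reads file_sd files when they change, so no restart should be needed.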

Create backup alerting rules

Define alerting rules for backup failures, long-running backups, and missing backup metrics. Save them as /etc/prometheus/backup_rules.yml, the file referenced under rule_files above.

groups:
  - name: backup_monitoring
    rules:
      - alert: BackupFailed
        expr: backup_last_success == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Backup failed for {{ $labels.job }} on {{ $labels.instance }}"
          description: "Backup job {{ $labels.job }} has failed on instance {{ $labels.instance }}"

      - alert: BackupTooLong
        expr: backup_duration_seconds > 7200
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Backup taking too long for {{ $labels.job }}"
          description: "Backup job {{ $labels.job }} has been running for {{ $value }} seconds"

      - alert: BackupMissing
        expr: up{job=~".*backup.*"} == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Backup exporter down for {{ $labels.job }}"
          description: "Backup exporter {{ $labels.job }} on {{ $labels.instance }} has been down for more than 10 minutes"

      - alert: BackupOld
        expr: (time() - backup_last_success_timestamp) > 86400
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Backup is older than 24 hours for {{ $labels.job }}"
          description: "Last successful backup for {{ $labels.job }} was {{ $value }} seconds ago"

      - alert: BackupSizeChanged
        expr: abs(backup_size_bytes - backup_size_bytes offset 24h) / backup_size_bytes offset 24h > 0.3
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Backup size changed significantly for {{ $labels.job }}"
          description: "Backup size for {{ $labels.job }} changed by {{ $value | humanizePercentage }} compared to yesterday"
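
To make the thresholds concrete, the arithmetic behind BackupOld and BackupSizeChanged can be reproduced outside Prometheus. This is only a sketch with made-up numbers; the function names are illustrative and not part of any Prometheus API.

```python
def backup_is_old(now, last_success_timestamp, max_age_seconds=86400):
    """BackupOld: (time() - backup_last_success_timestamp) > 86400."""
    return (now - last_success_timestamp) > max_age_seconds


def size_change_ratio(current_bytes, previous_bytes):
    """BackupSizeChanged: abs(current - previous) / previous."""
    return abs(current_bytes - previous_bytes) / previous_bytes


# Illustrative numbers only: yesterday's backup was 10 GiB, today's is 6 GiB.
previous = 10 * 1024**3
current = 6 * 1024**3
ratio = size_change_ratio(current, previous)
print(f"size change: {ratio:.0%}, alert fires: {ratio > 0.3}")

# A backup that last succeeded 30 hours ago is past the 24 hour threshold.
now = 1_700_000_000
print("backup old:", backup_is_old(now, now - 30 * 3600))
```

Note that the size rule divides by yesterday's value, so a job with no sample from 24 hours ago produces no result and therefore no alert.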

Install Node Exporter for system metrics

Node Exporter provides system-level metrics that complement backup monitoring data.

cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.0/node_exporter-1.6.0.linux-amd64.tar.gz
tar xzf node_exporter-1.6.0.linux-amd64.tar.gz
sudo mv node_exporter-1.6.0.linux-amd64/node_exporter /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/node_exporter

Create custom backup exporter

Build a simple backup metrics exporter that monitors backup job status and metrics. Save it as /usr/local/bin/backup_exporter.py.

#!/usr/bin/env python3
import os
import time
import json
from http.server import HTTPServer, BaseHTTPRequestHandler
import subprocess

class BackupMetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/metrics':
            self.send_response(200)
            self.send_header('Content-type', 'text/plain')
            self.end_headers()
            
            metrics = self.generate_metrics()
            self.wfile.write(metrics.encode('utf-8'))
        else:
            self.send_response(404)
            self.end_headers()
    
    def generate_metrics(self):
        metrics = []
        
        # Check backup status files
        backup_dirs = ['/var/backups', '/backup', '/opt/backups']
        
        for backup_dir in backup_dirs:
            if os.path.exists(backup_dir):
                status_file = f"{backup_dir}/.backup_status"
                if os.path.exists(status_file):
                    try:
                        with open(status_file, 'r') as f:
                            status = json.load(f)
                        
                        job_name = status.get('job_name', 'unknown')
                        success = 1 if status.get('success', False) else 0
                        duration = status.get('duration_seconds', 0)
                        size_bytes = status.get('size_bytes', 0)
                        timestamp = status.get('timestamp', 0)
                        
                        metrics.append(f'backup_last_success{{job="{job_name}"}} {success}')
                        metrics.append(f'backup_duration_seconds{{job="{job_name}"}} {duration}')
                        metrics.append(f'backup_size_bytes{{job="{job_name}"}} {size_bytes}')
                        metrics.append(f'backup_last_success_timestamp{{job="{job_name}"}} {timestamp}')
                    except Exception as e:
                        metrics.append(f'backup_exporter_errors_total{{error="status_file_read"}} 1')
        
        # Check MySQL backups
        mysql_backup_dir = '/var/backups/mysql'
        if os.path.exists(mysql_backup_dir):
            latest_backup = self.get_latest_backup_file(mysql_backup_dir, '*.sql.gz')
            if latest_backup:
                size = os.path.getsize(latest_backup)
                mtime = os.path.getmtime(latest_backup)
                age = time.time() - mtime
                
                metrics.append(f'backup_last_success{{job="mysql"}} {1 if age < 86400 else 0}')
                metrics.append(f'backup_size_bytes{{job="mysql"}} {size}')
                metrics.append(f'backup_last_success_timestamp{{job="mysql"}} {mtime}')
        
        # Check filesystem backups
        fs_backup_dir = '/var/backups/filesystem'
        if os.path.exists(fs_backup_dir):
            latest_backup = self.get_latest_backup_file(fs_backup_dir, '*.tar.gz')
            if latest_backup:
                size = os.path.getsize(latest_backup)
                mtime = os.path.getmtime(latest_backup)
                age = time.time() - mtime
                
                metrics.append(f'backup_last_success{{job="filesystem"}} {1 if age < 86400 else 0}')
                metrics.append(f'backup_size_bytes{{job="filesystem"}} {size}')
                metrics.append(f'backup_last_success_timestamp{{job="filesystem"}} {mtime}')
        
        return '\n'.join(metrics) + '\n'
    
    def get_latest_backup_file(self, directory, pattern):
        try:
            result = subprocess.run(['find', directory, '-name', pattern, '-type', 'f', '-printf', '%T@ %p\n'], 
                                  capture_output=True, text=True)
            if result.stdout:
                files = result.stdout.strip().split('\n')
                latest = max(files, key=lambda x: float(x.split()[0]))
                return latest.split(' ', 1)[1]
        except Exception:
            pass
        return None

if __name__ == '__main__':
    server = HTTPServer(('localhost', 9101), BackupMetricsHandler)
    server.serve_forever()
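
When debugging the exporter it can help to turn its text output back into values. The sketch below parses only the simple name{job="..."} value lines that generate_metrics() emits; it is not a general Prometheus exposition parser, and the sample values are made up.

```python
import re

# Matches lines like: backup_size_bytes{job="mysql"} 1048576
SAMPLE_LINE = re.compile(r'^(\w+)\{job="([^"]*)"\}\s+(\S+)$')


def parse_backup_metrics(text):
    """Return {(metric_name, job): value} for the exporter's simple output."""
    samples = {}
    for line in text.splitlines():
        match = SAMPLE_LINE.match(line.strip())
        if match:  # skip blank lines and series without a job label
            name, job, value = match.groups()
            samples[(name, job)] = float(value)
    return samples


# Sample text in the shape generate_metrics() produces (values are made up).
example = (
    'backup_last_success{job="mysql"} 1\n'
    'backup_duration_seconds{job="mysql"} 312\n'
    'backup_size_bytes{job="mysql"} 1048576\n'
)
for key, value in parse_backup_metrics(example).items():
    print(key, value)
```

Against a running exporter, the same function could be fed the body of http://localhost:9101/metrics, fetched for example with urllib.request.urlopen.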

Make backup exporter executable

Set proper permissions and ownership for the backup exporter script.

sudo chmod +x /usr/local/bin/backup_exporter.py
sudo chown prometheus:prometheus /usr/local/bin/backup_exporter.py

Create systemd services

Set up systemd service files for Prometheus, Node Exporter, and the backup exporter. Start with the Prometheus unit, saved as /etc/systemd/system/prometheus.service.

[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/ \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries \
  --web.listen-address=0.0.0.0:9090 \
  --web.enable-lifecycle \
  --storage.tsdb.retention.time=30d

Restart=always
RestartSec=3
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

Create Node Exporter service

Configure Node Exporter to start automatically and provide system metrics. Save the unit as /etc/systemd/system/node_exporter.service.

[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --collector.systemd \
  --collector.processes \
  --web.listen-address=:9100

Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target

Create backup exporter service

Set up the custom backup exporter as a systemd service for automatic startup. Save the unit as /etc/systemd/system/backup_exporter.service.

[Unit]
Description=Backup Metrics Exporter
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/bin/python3 /usr/local/bin/backup_exporter.py
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target

Install Grafana

Add the Grafana repository and install the dashboard server for visualization.

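# Ubuntu / Debian (apt-key is deprecated, so store the key in a dedicated keyring)
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://packages.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://packages.grafana.com/oss/deb stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update
sudo apt install -y grafana

# AlmaLinux / Rocky Linux
sudo tee /etc/yum.repos.d/grafana.repo << 'EOF'
[grafana]
name=grafana
baseurl=https://packages.grafana.com/oss/rpm
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
EOF
sudo dnf install -y grafana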

Configure Grafana datasource

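Set up Prometheus as the default datasource for Grafana dashboards. Save the file in Grafana's datasource provisioning directory, /etc/grafana/provisioning/datasources/ (any .yml file name works, for example prometheus.yml).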

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
    editable: true

Create backup monitoring dashboard

Set up a pre-configured dashboard for backup monitoring with key metrics and alerts.

{
  "dashboard": {
    "id": null,
    "title": "Backup Monitoring",
    "tags": ["backup", "monitoring"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Backup Success Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "avg(backup_last_success)",
            "legendFormat": "Success Rate"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}
      },
      {
        "id": 2,
        "title": "Backup Duration",
        "type": "graph",
        "targets": [
          {
            "expr": "backup_duration_seconds",
            "legendFormat": "{{ job }}"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
        "yAxes": [
          {
            "unit": "s"
          }
        ]
      },
      {
        "id": 3,
        "title": "Backup Sizes",
        "type": "graph",
        "targets": [
          {
            "expr": "backup_size_bytes",
            "legendFormat": "{{ job }}"
          }
        ],
        "gridPos": {"h": 8, "w": 24, "x": 0, "y": 8},
        "yAxes": [
          {
            "unit": "bytes"
          }
        ]
      }
    ],
    "time": {
      "from": "now-7d",
      "to": "now"
    },
    "refresh": "1m"
  }
}
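
The JSON above is wrapped in a top-level "dashboard" key, which is the shape Grafana's HTTP API accepts at POST /api/dashboards/db. One possible way to load it is sketched below. The sketch only builds the request; the URL and the admin/admin credentials are the defaults mentioned in this tutorial and should be treated as assumptions for your installation.

```python
import base64
import json
import urllib.request


def build_import_request(grafana_url, dashboard_doc, user, password):
    """Build (but do not send) a POST to Grafana's /api/dashboards/db endpoint."""
    payload = {"dashboard": dashboard_doc["dashboard"], "overwrite": True}
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{grafana_url.rstrip('/')}/api/dashboards/db",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# Trimmed stand-in for the dashboard JSON above; load the real file instead.
doc = {"dashboard": {"id": None, "title": "Backup Monitoring", "panels": []}}
request = build_import_request("http://localhost:3000", doc, "admin", "admin")
print(request.get_method(), request.full_url)
# To actually import, once Grafana is running: urllib.request.urlopen(request)
```

Setting "overwrite" to true lets the same script be re-run after editing the dashboard; change the password handling before using anything like this outside a test machine.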

Start all services

Enable and start Prometheus, Node Exporter, backup exporter, and Grafana services.

sudo systemctl daemon-reload
sudo systemctl enable --now prometheus
sudo systemctl enable --now node_exporter
sudo systemctl enable --now backup_exporter
sudo systemctl enable --now grafana-server

Create backup status tracking script

This script should be called by your backup jobs to report status to Prometheus.

#!/bin/bash
# Usage: backup_status_reporter.sh JOB_NAME SUCCESS DURATION_SECONDS SIZE_BYTES

JOB_NAME="$1"
SUCCESS="$2"
DURATION="$3"
SIZE="$4"
TIMESTAMP=$(date +%s)

# Create status directory if it doesn't exist
mkdir -p /var/backups

# Write status to JSON file
cat > "/var/backups/.backup_status" << EOF
{
  "job_name": "$JOB_NAME",
  "success": $SUCCESS,
  "duration_seconds": $DURATION,
  "size_bytes": $SIZE,
  "timestamp": $TIMESTAMP
}
EOF

echo "Backup status reported for job: $JOB_NAME"
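
Because the exporter may read /var/backups/.backup_status while a backup job is still writing it, an atomic write can avoid occasional JSON parse errors. The following is a sketch of the same status format written from Python through a temporary file and a rename; the function name is illustrative, and the demo uses a throwaway directory rather than /var/backups.

```python
import json
import os
import tempfile
import time


def write_status(status_dir, job_name, success, duration_seconds, size_bytes):
    """Write .backup_status atomically so a scrape never sees a partial file."""
    status = {
        "job_name": job_name,
        "success": bool(success),
        "duration_seconds": duration_seconds,
        "size_bytes": size_bytes,
        "timestamp": int(time.time()),
    }
    os.makedirs(status_dir, exist_ok=True)
    # Write to a temporary file in the same directory, then rename it over the
    # target; os.replace is atomic when both paths are on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=status_dir, prefix=".backup_status.")
    with os.fdopen(fd, "w") as handle:
        json.dump(status, handle)
    os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; the exporter user must read it
    final_path = os.path.join(status_dir, ".backup_status")
    os.replace(tmp_path, final_path)
    return final_path


# Demo against a throwaway directory rather than /var/backups.
demo_dir = tempfile.mkdtemp()
path = write_status(demo_dir, "mysql", True, 312, 1048576)
with open(path) as handle:
    print(json.load(handle))
```

Like the shell reporter, this keeps a single status file per directory, so the most recent job to report overwrites the previous one.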

Make status reporter executable

Set proper permissions for the backup status reporting script.

sudo chmod +x /usr/local/bin/backup_status_reporter.sh
sudo chown root:root /usr/local/bin/backup_status_reporter.sh

Configure firewall rules

Allow access to Prometheus and Grafana web interfaces through the firewall.

# Ubuntu / Debian (ufw)
sudo ufw allow 9090/tcp comment 'Prometheus'
sudo ufw allow 3000/tcp comment 'Grafana'
sudo ufw reload

# AlmaLinux / Rocky Linux (firewalld)
sudo firewall-cmd --permanent --add-port=9090/tcp
sudo firewall-cmd --permanent --add-port=3000/tcp
sudo firewall-cmd --reload

Configure backup job integration

Modify existing backup scripts

Update your backup scripts to report status to the monitoring system. Here's an example for MySQL backups:

#!/bin/bash
# Without pipefail the pipeline's exit status is gzip's, so a failed
# mysqldump would still be reported as a successful backup.
set -o pipefail

START_TIME=$(date +%s)
BACKUP_FILE="/var/backups/mysql/mysql_backup_$(date +%Y%m%d_%H%M%S).sql.gz"

# Perform the backup
if mysqldump --all-databases --single-transaction | gzip > "$BACKUP_FILE"; then
    SUCCESS=true
    SIZE=$(stat -c%s "$BACKUP_FILE")
else
    SUCCESS=false
    SIZE=0
fi

END_TIME=$(date +%s)
DURATION=$((END_TIME - START_TIME))

# Report status to monitoring
/usr/local/bin/backup_status_reporter.sh "mysql" "$SUCCESS" "$DURATION" "$SIZE"

if [ "$SUCCESS" = true ]; then
    echo "MySQL backup completed successfully"
    exit 0
else
    echo "MySQL backup failed"
    exit 1
fi
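
The MySQL example follows a pattern that applies to any backup job: record the start time, run the command, derive success from the exit code, and report the result. A sketch of that pattern as a reusable wrapper is shown below; run_monitored is a hypothetical helper, and the commands are stand-ins so the example runs anywhere.

```python
import subprocess
import sys
import time


def run_monitored(command):
    """Run a backup command and return the fields the status reporter expects."""
    started = time.monotonic()
    result = subprocess.run(command, check=False)
    return {
        "success": result.returncode == 0,
        "duration_seconds": int(time.monotonic() - started),
    }


# Stand-in commands so the sketch runs anywhere; a real job would pass its
# own backup command (for example a tar or rsync invocation).
ok = run_monitored([sys.executable, "-c", "raise SystemExit(0)"])
failed = run_monitored([sys.executable, "-c", "raise SystemExit(3)"])
print(ok["success"], failed["success"])
```

The returned values map directly onto the SUCCESS and DURATION_SECONDS arguments of backup_status_reporter.sh.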

Set up automated backup scheduling

Create cron jobs that run your monitored backup scripts on a schedule.

sudo crontab -e
# MySQL backup every night at 2 AM
0 2 * * * /usr/local/bin/mysql_backup_monitored.sh

# Filesystem backup every night at 3 AM
0 3 * * * /usr/local/bin/filesystem_backup_monitored.sh

Set up alerting

Install Alertmanager

Download and configure Alertmanager to send notifications when backup issues occur.

cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v0.25.0/alertmanager-0.25.0.linux-amd64.tar.gz
tar xzf alertmanager-0.25.0.linux-amd64.tar.gz
sudo mv alertmanager-0.25.0.linux-amd64/alertmanager /usr/local/bin/
sudo mv alertmanager-0.25.0.linux-amd64/amtool /usr/local/bin/
sudo mkdir -p /etc/alertmanager
sudo chown prometheus:prometheus /usr/local/bin/alertmanager /usr/local/bin/amtool /etc/alertmanager

Configure Alertmanager

Set up email notifications for backup failures and other critical issues. Save the configuration as /etc/alertmanager/alertmanager.yml.

global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@example.com'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'backup-alerts'

receivers:
  - name: 'backup-alerts'
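    email_configs:
      - to: 'admin@example.com'
        # email_configs has no subject/body fields: the subject is set as a
        # header and the plain-text body goes in "text".
        headers:
          Subject: 'Backup Alert: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}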

Create Alertmanager service

Set up a systemd service for Alertmanager to handle notification delivery. Save the unit as /etc/systemd/system/alertmanager.service.

[Unit]
Description=Alert Manager
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager/

Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target

Start Alertmanager

Enable and start the Alertmanager service for notification handling.

sudo mkdir -p /var/lib/alertmanager
sudo chown prometheus:prometheus /var/lib/alertmanager
sudo systemctl daemon-reload
sudo systemctl enable --now alertmanager
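
Once Alertmanager is running, email delivery can be tested without waiting for a real backup failure by posting a synthetic alert to its v2 API. The sketch below only builds the request. The job and instance values are placeholders, and the URL assumes Alertmanager listens on its default port as configured in prometheus.yml.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(job, instance, minutes=5):
    """Build the JSON body for one synthetic BackupFailed alert."""
    now = datetime.now(timezone.utc)
    return [
        {
            "labels": {
                "alertname": "BackupFailed",
                "severity": "critical",
                "job": job,
                "instance": instance,
            },
            "annotations": {
                "summary": f"Test: backup failed for {job} on {instance}",
                "description": "Synthetic alert sent to test notification delivery",
            },
            "startsAt": now.isoformat(),
            # Let the alert resolve on its own after a few minutes.
            "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
        }
    ]


def build_request(alerts, alertmanager_url="http://localhost:9093"):
    """Prepare (but do not send) the POST to Alertmanager's v2 alerts API."""
    return urllib.request.Request(
        url=f"{alertmanager_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request(build_test_alert("mysql", "db01"))
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request)
```

With the route shown above (group_wait of 10s), a notification attempt should follow shortly after the alert is accepted; the Alertmanager log shows SMTP errors if delivery fails.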

Verify your setup

Check that all services are running and accessible:

sudo systemctl status prometheus node_exporter backup_exporter grafana-server alertmanager

Test Prometheus web interface

curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool
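
The raw targets response is verbose, so a short summary per job can make problems easier to spot. The sketch below works on a hand-written sample in the shape of the /api/v1/targets response. Expect the mysql_backup job to report as down unless something is actually listening on port 9104, since this tutorial does not install an exporter there.

```python
import json


def summarize_targets(response):
    """Map each scrape job to a list of (instance, health, lastError)."""
    summary = {}
    for target in response["data"]["activeTargets"]:
        labels = target["labels"]
        summary.setdefault(labels["job"], []).append(
            (labels["instance"], target["health"], target.get("lastError", ""))
        )
    return summary


# Hand-written sample in the shape of /api/v1/targets (values are made up).
sample = json.loads("""
{"status": "success", "data": {"activeTargets": [
  {"labels": {"job": "backup_exporter", "instance": "localhost:9101"},
   "health": "up", "lastError": ""},
  {"labels": {"job": "mysql_backup", "instance": "localhost:9104"},
   "health": "down", "lastError": "connection refused"}
]}}
""")
for job, targets in summarize_targets(sample).items():
    print(job, targets)
```

To use it on live data, pass the parsed JSON from the curl command above instead of the sample.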

Check backup metrics are being collected

curl -s http://localhost:9101/metrics | grep backup_

Verify Grafana is accessible

curl -s http://localhost:3000/api/health

Access the web interfaces to confirm everything is working:

  • Prometheus: http://your-server-ip:9090
  • Grafana: http://your-server-ip:3000 (admin/admin initially)
  • Alertmanager: http://your-server-ip:9093 (the firewall rules above open only ports 9090 and 3000, so open 9093 as well if you need remote access)

Initial Grafana Setup: Log in with admin/admin, then change the password. The backup dashboard should appear automatically under "Dashboards" if the JSON was loaded correctly.

Common issues

Symptom | Cause | Fix
Backup exporter shows no data | No backup status files exist | Run backup scripts with status reporting or create test status files
Prometheus can't scrape backup_exporter | Python script failed to start | Check sudo systemctl status backup_exporter and install python3 if missing
Grafana dashboard shows no data | Prometheus datasource not configured | Go to Grafana Settings → Data Sources and verify the Prometheus URL is correct
Alerts not firing | Alerting rules syntax error | Run promtool check rules /etc/prometheus/backup_rules.yml
Email alerts not received | SMTP configuration incorrect | Verify SMTP settings in /etc/alertmanager/alertmanager.yml and test with local mail
Services fail to start after reboot | File permissions incorrect | Run sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
