Configure backup monitoring with Prometheus and Grafana for automated infrastructure oversight

Intermediate 45 min Apr 18, 2026
Ubuntu 24.04 Debian 12 AlmaLinux 9 Rocky Linux 9

Set up comprehensive backup monitoring using Prometheus exporters and Grafana dashboards. Configure automated alerts for backup failures, track success rates, and visualize backup infrastructure health across multiple systems.

Prerequisites

  • Existing Prometheus and Grafana installation
  • Node Exporter running
  • Basic Python 3 environment
  • Access to backup log files

What this solves

Backup systems fail silently more often than they should, leaving you vulnerable to data loss without warning. This tutorial shows you how to monitor backup jobs using Prometheus metrics collection and Grafana visualization, with automated alerting when backups fail or miss their schedules.

Step-by-step configuration

Install backup monitoring dependencies

Start by installing the required packages for backup monitoring and metric collection.

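# Ubuntu 24.04 / Debian 12
# pip3 refuses system-wide installs on these releases (externally managed
# environment), so install the Python libraries from the distro packages.
sudo apt update
sudo apt install -y python3-pip curl wget jq python3-prometheus-client python3-psutil

# AlmaLinux 9 / Rocky Linux 9
# Install with sudo so the libraries are visible to the prometheus user.
sudo dnf update -y
sudo dnf install -y python3-pip curl wget jq
sudo pip3 install prometheus_client psutil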
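
Because the exporter later runs as a service user rather than as you, it can help to confirm that the libraries are importable by the interpreter that will run it. The following is a minimal sketch of such a check; the helper name `missing_modules` is an illustrative assumption and not part of the tutorial's script.

```python
import importlib.util

# Modules installed in the step above.
REQUIRED_MODULES = ("prometheus_client", "psutil")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that this interpreter cannot import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("missing:", ", ".join(missing) if missing else "none")
```

Running it under the account that will own the service (for example with `sudo -u prometheus python3`) shows whether a per-user install has hidden the packages from that account.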

Create backup status exporter

Create a Python script at /usr/local/bin/backup-exporter.py that exports backup job metrics in Prometheus text format.

#!/usr/bin/env python3
"""Export backup job metrics for the Node Exporter textfile collector."""
import glob
import json
import os

from prometheus_client import CollectorRegistry, Gauge, write_to_textfile

# Configuration
BACKUP_LOG_DIR = '/var/log/backups'
METRICS_DIR = '/var/lib/prometheus/node-exporter'
BACKUP_CONFIGS = '/etc/backup-monitor.json'

# Prometheus metrics
LABELS = ['job_name', 'backup_type']
registry = CollectorRegistry()
backup_success = Gauge('backup_job_success', 'Backup job success status (1=success, 0=failure)', LABELS, registry=registry)
backup_duration = Gauge('backup_job_duration_seconds', 'Backup job duration in seconds', LABELS, registry=registry)
backup_size_bytes = Gauge('backup_size_bytes', 'Backup size in bytes', LABELS, registry=registry)
backup_last_run = Gauge('backup_job_last_run_timestamp', 'Unix timestamp of last backup run', LABELS, registry=registry)
backup_files_count = Gauge('backup_files_count', 'Number of files in backup', LABELS, registry=registry)


def load_backup_config():
    """Load backup job configurations."""
    try:
        with open(BACKUP_CONFIGS, 'r') as f:
            return json.load(f)
    except FileNotFoundError:
        return {"jobs": []}


def text_after(line, marker):
    """Return the text following a lowercase marker, matched case-insensitively."""
    index = line.lower().index(marker)
    return line[index + len(marker):].strip()


def first_int(text):
    """Parse the first number in text, ignoring a leading colon and thousands separators."""
    return int(text.lstrip(':').split()[0].replace(',', ''))


def parse_backup_logs():
    """Parse the newest log file of each configured job and update the metrics."""
    config = load_backup_config()
    for job in config.get('jobs', []):
        job_name = job['name']
        backup_type = job.get('type', 'unknown')
        log_pattern = job.get('log_pattern', os.path.join(BACKUP_LOG_DIR, f'{job_name}*.log'))

        # Find the most recently modified log file
        log_files = glob.glob(log_pattern)
        if not log_files:
            # No logs found - mark as failed
            backup_success.labels(job_name=job_name, backup_type=backup_type).set(0)
            continue
        latest_log = max(log_files, key=os.path.getmtime)

        try:
            # Use the log's modification time as the time of the last run.
            # Reporting the current time here would hide jobs that stopped running.
            last_run = os.path.getmtime(latest_log)
            with open(latest_log, 'r') as f:
                log_content = f.read()

            # Parse log based on backup type
            if backup_type == 'mysql':
                parse_mysql_backup_log(log_content, job_name, backup_type, last_run)
            elif backup_type == 'postgres':
                parse_postgres_backup_log(log_content, job_name, backup_type, last_run)
            elif backup_type == 'filesystem':
                parse_filesystem_backup_log(log_content, job_name, backup_type, last_run)
            else:
                parse_generic_backup_log(log_content, job_name, backup_type, last_run)
        except Exception as e:
            print(f"Error parsing log for {job_name}: {e}")
            backup_success.labels(job_name=job_name, backup_type=backup_type).set(0)


def parse_mysql_backup_log(content, job_name, backup_type, last_run):
    """Parse MySQL backup logs."""
    success = 0
    duration = 0
    size_bytes = 0

    for line in content.strip().split('\n'):
        lower = line.lower()
        if 'backup completed successfully' in lower:
            success = 1
        elif 'backup failed' in lower or 'error' in lower:
            success = 0
        elif 'duration:' in lower:
            try:
                duration = float(text_after(line, 'duration:').split()[0])
            except (ValueError, IndexError):
                pass
        elif 'size:' in lower:
            try:
                size_str = text_after(line, 'size:').split()[0].lower()
                if 'mb' in size_str:
                    size_bytes = float(size_str.replace('mb', '')) * 1024 * 1024
                elif 'gb' in size_str:
                    size_bytes = float(size_str.replace('gb', '')) * 1024 * 1024 * 1024
            except (ValueError, IndexError):
                pass

    backup_success.labels(job_name=job_name, backup_type=backup_type).set(success)
    backup_duration.labels(job_name=job_name, backup_type=backup_type).set(duration)
    backup_size_bytes.labels(job_name=job_name, backup_type=backup_type).set(size_bytes)
    backup_last_run.labels(job_name=job_name, backup_type=backup_type).set(last_run)


def parse_postgres_backup_log(content, job_name, backup_type, last_run):
    """Parse PostgreSQL backup logs."""
    lines = content.strip().split('\n')
    success = 0
    duration = 0
    size_bytes = 0

    for line in lines:
        if 'pg_dump: last built-in oid is' in line or 'pg_dump: dumping contents of table' in line:
            success = 1
        elif 'pg_dump: error:' in line or 'connection to database failed' in line:
            success = 0
        elif 'total time:' in line.lower():
            try:
                duration = float(line.split(':')[-1].strip().replace('s', ''))
            except ValueError:
                pass

    # Get file size if the backup file path appears on a line of its own
    for line in lines:
        if '.sql' in line and os.path.exists(line.strip()):
            try:
                size_bytes = os.path.getsize(line.strip())
            except OSError:
                pass
            break

    backup_success.labels(job_name=job_name, backup_type=backup_type).set(success)
    backup_duration.labels(job_name=job_name, backup_type=backup_type).set(duration)
    backup_size_bytes.labels(job_name=job_name, backup_type=backup_type).set(size_bytes)
    backup_last_run.labels(job_name=job_name, backup_type=backup_type).set(last_run)


def parse_filesystem_backup_log(content, job_name, backup_type, last_run):
    """Parse filesystem backup logs (rsync, tar, etc.)."""
    success = 0
    duration = 0
    size_bytes = 0
    files_count = 0

    for line in content.strip().split('\n'):
        lower = line.lower()
        if 'backup completed' in lower or ('sent ' in line and 'bytes' in line):
            success = 1
        elif 'backup failed' in lower or 'rsync error' in lower:
            success = 0
        elif 'total size is' in lower:
            try:
                size_bytes = first_int(text_after(line, 'total size is'))
            except (ValueError, IndexError):
                pass
        elif 'total transferred file size' in lower:
            try:
                size_bytes = first_int(text_after(line, 'total transferred file size'))
            except (ValueError, IndexError):
                pass
        elif 'number of files' in lower:
            try:
                files_count = first_int(text_after(line, 'number of files'))
            except (ValueError, IndexError):
                pass

    backup_success.labels(job_name=job_name, backup_type=backup_type).set(success)
    backup_duration.labels(job_name=job_name, backup_type=backup_type).set(duration)
    backup_size_bytes.labels(job_name=job_name, backup_type=backup_type).set(size_bytes)
    backup_last_run.labels(job_name=job_name, backup_type=backup_type).set(last_run)
    backup_files_count.labels(job_name=job_name, backup_type=backup_type).set(files_count)


def parse_generic_backup_log(content, job_name, backup_type, last_run):
    """Parse generic backup logs."""
    success = 0

    # Simple success detection: the first line with a known keyword decides
    for line in content.strip().split('\n'):
        lower = line.lower()
        if any(word in lower for word in ['success', 'completed', 'finished', 'done']):
            success = 1
            break
        elif any(word in lower for word in ['error', 'failed', 'abort']):
            success = 0
            break

    backup_success.labels(job_name=job_name, backup_type=backup_type).set(success)
    backup_last_run.labels(job_name=job_name, backup_type=backup_type).set(last_run)


def main():
    """Collect backup metrics and export them for node_exporter."""
    # Create metrics directory if it doesn't exist
    os.makedirs(METRICS_DIR, exist_ok=True)

    # Parse backup logs and update metrics
    parse_backup_logs()

    # Write metrics to file for node_exporter
    metrics_file = os.path.join(METRICS_DIR, 'backup_metrics.prom')
    write_to_textfile(metrics_file, registry)
    print(f"Backup metrics exported to {metrics_file}")


if __name__ == '__main__':
    main()
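
The size handling in the MySQL parser turns a value such as 512MB into bytes. The sketch below isolates that conversion using only the standard library. It assumes log lines shaped like `Size: 512MB`; that format, the helper name `parse_size_line`, and the extra KB and TB units are illustrative assumptions, so adapt the pattern to what your backup tool actually writes.

```python
import re

# Binary multipliers, matching the exporter's 1024-based conversion.
# KB and TB are additions for illustration; the exporter handles MB and GB.
UNIT_FACTORS = {"kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3, "tb": 1024 ** 4}

# Assumed line shape: "Size: <number><unit>", with optional whitespace.
SIZE_RE = re.compile(r"size:\s*([0-9]+(?:\.[0-9]+)?)\s*([kmgt]b)\b", re.IGNORECASE)


def parse_size_line(line):
    """Return the size in bytes found in a log line, or None if there is none."""
    match = SIZE_RE.search(line)
    if match is None:
        return None
    value, unit = match.groups()
    return float(value) * UNIT_FACTORS[unit.lower()]


if __name__ == "__main__":
    for sample in ("Size: 512MB", "size: 1.5 GB", "Backup completed successfully"):
        print(sample, "->", parse_size_line(sample))
```

Returning None for lines without a size lets the caller distinguish "no size reported" from a genuine zero-byte backup, which a default of 0 cannot do.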

Create backup monitoring configuration

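Define your backup jobs in /etc/backup-monitor.json, the configuration file the exporter reads. The exporter uses only name, type, and log_pattern; schedule and retention_days are kept for documentation.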

{
  "jobs": [
    {
      "name": "mysql_daily",
      "type": "mysql",
      "log_pattern": "/var/log/backups/mysql_*.log",
      "schedule": "0 2 * * *",
      "retention_days": 30
    },
    {
      "name": "postgres_nightly",
      "type": "postgres",
      "log_pattern": "/var/log/backups/postgres_*.log",
      "schedule": "0 1 * * *",
      "retention_days": 14
    },
    {
      "name": "web_files",
      "type": "filesystem",
      "log_pattern": "/var/log/backups/web_backup_*.log",
      "schedule": "0 3 * * *",
      "retention_days": 7
    },
    {
      "name": "config_backup",
      "type": "filesystem",
      "log_pattern": "/var/log/backups/config_*.log",
      "schedule": "0 4 * * *",
      "retention_days": 90
    }
  ]
}
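
A malformed configuration file is easy to miss because the exporter falls back to defaults for most keys. As a hedged sketch, a small validator can catch the common mistakes before the timer ever runs. The function name `validate_config` and the specific checks (job name present, no duplicates, five cron fields) are assumptions chosen for illustration, not rules the exporter enforces.

```python
import json


def validate_config(text):
    """Return a list of human-readable problems found in the config text."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]

    jobs = config.get("jobs") if isinstance(config, dict) else None
    if not isinstance(jobs, list) or not jobs:
        return ["'jobs' must be a non-empty list"]

    problems = []
    seen = set()
    for index, job in enumerate(jobs):
        if not isinstance(job, dict):
            problems.append(f"job #{index}: must be an object")
            continue
        label = job.get("name", f"job #{index}")
        if "name" not in job:
            problems.append(f"{label}: missing 'name'")
        if label in seen:
            problems.append(f"{label}: duplicate job name")
        seen.add(label)
        # A standard cron expression has five whitespace-separated fields.
        schedule = job.get("schedule")
        if schedule is not None and len(str(schedule).split()) != 5:
            problems.append(f"{label}: schedule should have 5 cron fields")
    return problems


if __name__ == "__main__":
    sample = '{"jobs": [{"name": "mysql_daily", "type": "mysql", "schedule": "0 2 * * *"}]}'
    print(validate_config(sample) or "config looks fine")
```

Duplicate names matter because job_name is a metric label: two jobs with the same name and type would overwrite each other's values.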

Make exporter executable and create directories

Set proper permissions and create required directories for the monitoring system.

sudo chmod +x /usr/local/bin/backup-exporter.py
sudo mkdir -p /var/lib/prometheus/node-exporter
sudo mkdir -p /var/log/backups
sudo chown -R prometheus:prometheus /var/lib/prometheus

Configure Node Exporter for backup metrics

Configure Node Exporter to include the textfile collector for backup metrics.

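# Ubuntu 24.04 / Debian 12
sudo apt install -y prometheus-node-exporter

# AlmaLinux 9 / Rocky Linux 9
sudo dnf install -y node_exporter

# Set the collector directory in the Node Exporter arguments. On Debian/Ubuntu
# the packaged unit reads them from /etc/default/prometheus-node-exporter;
# adjust the file if your unit reads its arguments from elsewhere.
ARGS="--collector.textfile.directory=/var/lib/prometheus/node-exporter"

# The unit name below is the Debian/Ubuntu one; it may differ on other distributions.
sudo systemctl restart prometheus-node-exporter
sudo systemctl enable prometheus-node-exporter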
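
The textfile collector reads files ending in .prom from the configured directory and serves their contents as metrics. The sketch below shows, with the standard library only, what such a file contains and how to write it without the collector ever seeing a half-written file. The function names are illustrative assumptions, label values are assumed to contain no quotes, backslashes, or newlines, and the real exporter relies on prometheus_client's write_to_textfile instead.

```python
import os
import tempfile


def render_gauge(name, help_text, samples):
    """Render one gauge family; samples is a list of (labels_dict, value)."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        # Assumes label values need no escaping (no quotes or backslashes).
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


def write_atomically(path, text):
    """Write to a temp file in the same directory, then rename it into place."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        # mkstemp creates the file as 0600; make it readable by the collector.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    text = render_gauge(
        "backup_job_success",
        "Backup job success status (1=success, 0=failure)",
        [({"job_name": "mysql_daily", "backup_type": "mysql"}, 1)],
    )
    print(text, end="")
```

The rename is the important part: replacing a file within one directory is atomic on POSIX filesystems, so a scrape sees either the old file or the new one, and the .tmp suffix keeps the collector from picking up the intermediate file.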

Create systemd timer for metric collection

Set up automated metric collection that runs every 5 minutes to keep data current.

# /etc/systemd/system/backup-exporter.service
[Unit]
Description=Backup Metrics Exporter
Wants=backup-exporter.timer

[Service]
Type=oneshot
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/backup-exporter.py
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/backup-exporter.timer
[Unit]
Description=Run backup metrics exporter every 5 minutes
Requires=backup-exporter.service

[Timer]
OnCalendar=*:0/5
Persistent=true

[Install]
WantedBy=timers.target

# Load the new units and start the timer
sudo systemctl daemon-reload
sudo systemctl enable --now backup-exporter.timer
sudo systemctl status backup-exporter.timer
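
If the timer stops firing, the metrics file keeps its last contents and Prometheus goes on scraping stale values. A minimal sketch of a freshness check follows; it compares the file's modification time with the 5-minute timer interval. The helper names and the tolerance of two intervals are assumptions for illustration.

```python
import os
import time

TIMER_INTERVAL_SECONDS = 5 * 60  # matches OnCalendar=*:0/5


def metrics_file_age(path, now=None):
    """Return the age of the file in seconds, or None if it does not exist."""
    try:
        mtime = os.path.getmtime(path)
    except FileNotFoundError:
        return None
    return (time.time() if now is None else now) - mtime


def is_stale(path, now=None, tolerance=2):
    """Treat a missing file, or one older than `tolerance` intervals, as stale."""
    age = metrics_file_age(path, now)
    return age is None or age > tolerance * TIMER_INTERVAL_SECONDS


if __name__ == "__main__":
    path = "/var/lib/prometheus/node-exporter/backup_metrics.prom"
    print("stale" if is_stale(path) else "fresh")
```

Node Exporter also publishes the modification time of each textfile as the node_textfile_mtime_seconds metric, so the same check can be expressed as a Prometheus alert instead of a script.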

Configure Prometheus to scrape backup metrics

Add backup monitoring to your existing Prometheus configuration.

# Add this job to your existing prometheus.yml
scrape_configs:
  - job_name: 'backup-monitoring'
    static_configs:
      - targets: ['localhost:9100']
    scrape_interval: 30s
    metrics_path: /metrics
    params:
      collect[]:
        - textfile

# Reload Prometheus to apply the new scrape job
sudo systemctl reload prometheus

Create Prometheus alerting rules for backups

Define alert rules to notify when backups fail or miss their scheduled runs.

# Save as a rules file listed under rule_files in prometheus.yml
groups:
  - name: backup_monitoring
    rules:
      - alert: BackupJobFailed
        expr: backup_job_success == 0
        for: 5m
        labels:
          severity: critical
          service: backup
        annotations:
          summary: "Backup job {{ $labels.job_name }} failed"
          description: "Backup job {{ $labels.job_name }} of type {{ $labels.backup_type }} has failed. Check logs for details."

      - alert: BackupJobMissing
        expr: time() - backup_job_last_run_timestamp > 86400
        for: 30m
        labels:
          severity: warning
          service: backup
        annotations:
          summary: "Backup job {{ $labels.job_name }} hasn't run in 24 hours"
          description: "Backup job {{ $labels.job_name }} last ran {{ $value | humanizeDuration }} ago. Expected daily runs."

      - alert: BackupSizeAnomaly
        expr: |
          (
            backup_size_bytes / (avg_over_time(backup_size_bytes[7d]) != 0)
          ) < 0.5
          or
          (
            backup_size_bytes / (avg_over_time(backup_size_bytes[7d]) != 0)
          ) > 2
        for: 1h
        labels:
          severity: warning
          service: backup
        annotations:
          summary: "Backup size anomaly for {{ $labels.job_name }}"
          description: "Backup job {{ $labels.job_name }} size is {{ $value | humanizePercentage }} of the 7-day average, indicating potential issues."

      - alert: BackupDurationHigh
        expr: backup_job_duration_seconds > 7200  # 2 hours
        for: 15m
        labels:
          severity: warning
          service: backup
        annotations:
          summary: "Backup job {{ $labels.job_name }} taking too long"
          description: "The last run of backup job {{ $labels.job_name }} took {{ $value | humanizeDuration }}, which is unusually long."

sudo systemctl reload prometheus
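
To reason about when these rules fire, it can help to restate two of the expressions as plain arithmetic. The sketch below mirrors BackupJobMissing and BackupSizeAnomaly in Python. It is a simplification under stated assumptions: the history list stands in for the samples that avg_over_time would see over 7 days, and the `for:` durations are ignored.

```python
DAY_SECONDS = 86400


def job_missing(last_run_timestamp, now, max_age=DAY_SECONDS):
    """Mirror `time() - backup_job_last_run_timestamp > 86400`."""
    return now - last_run_timestamp > max_age


def size_anomaly(current_bytes, history_bytes, low=0.5, high=2.0):
    """Mirror BackupSizeAnomaly: compare the current size with the history average."""
    if not history_bytes:
        return False
    average = sum(history_bytes) / len(history_bytes)
    if average == 0:
        # The rule's `!= 0` filter drops such series, so nothing fires.
        return False
    ratio = current_bytes / average
    return ratio < low or ratio > high


if __name__ == "__main__":
    week = [100, 102, 98, 101, 99, 100, 100]
    print("40 bytes vs ~100 average:", size_anomaly(40, week))
    print("105 bytes vs ~100 average:", size_anomaly(105, week))
```

Note that the staleness check is only meaningful if the timestamp metric reflects when the backup actually ran; a value that is refreshed on every exporter run can never exceed the 24-hour threshold.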

Import Grafana dashboard for backup monitoring

Create a comprehensive Grafana dashboard to visualize backup metrics and trends.

{
  "dashboard": {
    "id": null,
    "title": "Backup Monitoring Dashboard",
    "tags": ["backup", "monitoring"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Backup Success Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "avg(backup_job_success) * 100",
            "legendFormat": "Success Rate %"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "color": {
              "mode": "thresholds"
            },
            "thresholds": {
              "steps": [
                {"color": "red", "value": 0},
                {"color": "yellow", "value": 80},
                {"color": "green", "value": 95}
              ]
            },
            "unit": "percent"
          }
        },
        "gridPos": {"h": 8, "w": 6, "x": 0, "y": 0}
      },
      {
        "id": 2,
        "title": "Backup Jobs Status",
        "type": "table",
        "targets": [
          {
            "expr": "backup_job_success",
            "legendFormat": "{{job_name}} ({{backup_type}})",
            "format": "table",
            "instant": true
          }
        ],
        "transformations": [
          {
            "id": "organize",
            "options": {
              "excludeByName": {"Time": true, "__name__": true, "instance": true, "job": true},
              "renameByName": {
                "job_name": "Backup Job",
                "backup_type": "Type",
                "Value": "Status"
              }
            }
          }
        ],
        "fieldConfig": {
          "overrides": [
            {
              "matcher": {"id": "byName", "options": "Status"},
              "properties": [
                {
                  "id": "custom.displayMode",
                  "value": "color-background"
                },
                {
                  "id": "thresholds",
                  "value": {
                    "steps": [
                      {"color": "red", "value": 0},
                      {"color": "green", "value": 1}
                    ]
                  }
                }
              ]
            }
          ]
        },
        "gridPos": {"h": 8, "w": 18, "x": 6, "y": 0}
      },
      {
        "id": 3,
        "title": "Backup Size Trends",
        "type": "timeseries",
        "targets": [
          {
            "expr": "backup_size_bytes",
            "legendFormat": "{{job_name}} Size"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "bytes"
          }
        },
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8}
      },
      {
        "id": 4,
        "title": "Backup Duration",
        "type": "timeseries",
        "targets": [
          {
            "expr": "backup_job_duration_seconds",
            "legendFormat": "{{job_name}} Duration"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "s"
          }
        },
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8}
      },
      {
        "id": 5,
        "title": "Last Backup Run",
        "type": "table",
        "targets": [
          {
            "expr": "backup_job_last_run_timestamp * 1000",
            "legendFormat": "{{job_name}}",
            "format": "table",
            "instant": true
          }
        ],
        "transformations": [
          {
            "id": "organize",
            "options": {
              "excludeByName": {"Time": true, "__name__": true, "instance": true, "job": true},
              "renameByName": {
                "job_name": "Backup Job",
                "backup_type": "Type",
                "Value": "Last Run"
              }
            }
          }
        ],
        "fieldConfig": {
          "overrides": [
            {
              "matcher": {"id": "byName", "options": "Last Run"},
              "properties": [
                {
                  "id": "unit",
                  "value": "dateTimeAsLocal"
                }
              ]
            }
          ]
        },
        "gridPos": {"h": 6, "w": 24, "x": 0, "y": 16}
      }
    ],
    "time": {
      "from": "now-24h",
      "to": "now"
    },
    "refresh": "30s"
  }
}

# Save the JSON above as /tmp/backup-dashboard.json, then import it via the Grafana API
curl -X POST -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_GRAFANA_API_KEY" \
  -d @/tmp/backup-dashboard.json \
  http://localhost:3000/api/dashboards/db
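
The same import can be scripted in Python when curl is not convenient. The sketch below only builds the HTTP request, so it can be inspected without contacting Grafana. The function names are illustrative assumptions; the `overwrite` flag is part of Grafana's dashboard API payload and lets a repeated import replace the existing dashboard rather than fail.

```python
import json
import urllib.request


def unwrap_dashboard(document):
    """Accept either a bare dashboard or the {"dashboard": {...}} wrapper."""
    return document.get("dashboard", document)


def build_import_request(document, base_url, api_token, overwrite=True):
    """Build (but do not send) the POST request for /api/dashboards/db."""
    payload = {"dashboard": unwrap_dashboard(document), "overwrite": overwrite}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/dashboards/db",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    document = {"dashboard": {"id": None, "title": "Backup Monitoring Dashboard"}}
    request = build_import_request(document, "http://localhost:3000", "YOUR_GRAFANA_API_KEY")
    print(request.get_method(), request.full_url)
```

To send it, pass the request to `urllib.request.urlopen`; keeping construction and sending separate makes the payload easy to check in a dry run.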

Configure Alertmanager for backup notifications

Set up email notifications when backup jobs fail or miss their schedules.

global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@example.com'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'backup-alerts'
  routes:
  - match:
      service: backup
    receiver: 'backup-team'

receivers:
  # In email_configs the subject is set under headers and the plain-text
  # body under text.
  - name: 'backup-alerts'
    email_configs:
      - to: 'admin@example.com'
        headers:
          Subject: 'Backup Alert: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Severity: {{ .Labels.severity }}
          Job: {{ .Labels.job_name }}
          {{ end }}
  - name: 'backup-team'
    email_configs:
      - to: 'backup-team@example.com'
        headers:
          Subject: 'BACKUP ISSUE: {{ .GroupLabels.alertname }}'
        text: |
          Backup monitoring has detected an issue:
          {{ range .Alerts }}
          Job: {{ .Labels.job_name }}
          Type: {{ .Labels.backup_type }}
          Issue: {{ .Annotations.summary }}
          Details: {{ .Annotations.description }}
          Severity: {{ .Labels.severity }}
          {{ end }}
          Please investigate immediately.

sudo systemctl reload alertmanager
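
The routing tree decides which receiver gets an alert: the first child route whose match labels all agree wins, and anything unmatched falls back to the top-level receiver. The sketch below models that rule for the tree in this section. It is a simplified illustration that assumes every route names its own receiver and ignores options such as `continue`.

```python
# The routing tree from the configuration, reduced to the fields used here.
ROUTE = {
    "receiver": "backup-alerts",
    "routes": [
        {"match": {"service": "backup"}, "receiver": "backup-team"},
    ],
}


def pick_receiver(route, labels):
    """Return the receiver of the deepest route whose match labels all agree."""
    for child in route.get("routes", []):
        wanted = child.get("match", {})
        if all(labels.get(key) == value for key, value in wanted.items()):
            return pick_receiver(child, labels)
    return route["receiver"]


if __name__ == "__main__":
    backup_alert = {"alertname": "BackupJobFailed", "service": "backup"}
    other_alert = {"alertname": "DiskFull"}
    print(pick_receiver(ROUTE, backup_alert))
    print(pick_receiver(ROUTE, other_alert))
```

Since every rule in this tutorial sets the label service: backup, all backup alerts reach backup-team, and backup-alerts acts only as the fallback for alerts without that label.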

Verify your setup

Test the backup monitoring system to ensure metrics are being collected and displayed correctly.

# Run the backup exporter manually to test
sudo -u prometheus /usr/local/bin/backup-exporter.py

# Check that the metrics file was created
ls -la /var/lib/prometheus/node-exporter/backup_metrics.prom

# View current metrics
cat /var/lib/prometheus/node-exporter/backup_metrics.prom

# Check Node Exporter is serving backup metrics
curl -s http://localhost:9100/metrics | grep backup_

# Verify Prometheus is scraping the metrics
curl -s "http://localhost:9090/api/v1/query?query=backup_job_success" | jq .

# Check alerting rules are loaded
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[] | select(.name=="backup_monitoring")'

# Test timer is running
sudo systemctl status backup-exporter.timer
journalctl -u backup-exporter.service -f
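
The query endpoint returns JSON in which each series carries its labels under metric and its sample as a [timestamp, "value"] pair. As a sketch of how to turn that response into a list of failing jobs, the function below takes the response text, so it works on the output of the curl command above. The function name `failing_jobs` is an illustrative assumption.

```python
import json


def failing_jobs(response_text):
    """Return the sorted job names whose backup_job_success sample equals 0."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    failed = []
    for series in body["data"]["result"]:
        _timestamp, value = series["value"]  # the value arrives as a string
        if float(value) == 0:
            failed.append(series["metric"].get("job_name", "<unknown>"))
    return sorted(failed)


if __name__ == "__main__":
    sample = json.dumps({
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [
                {"metric": {"job_name": "mysql_daily", "backup_type": "mysql"},
                 "value": [1713400000.0, "1"]},
                {"metric": {"job_name": "web_files", "backup_type": "filesystem"},
                 "value": [1713400000.0, "0"]},
            ],
        },
    })
    print(failing_jobs(sample))
```

An empty result list means Prometheus has no backup_job_success series at all, which points to a scrape or exporter problem rather than to healthy backups.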

Common issues

| Symptom | Cause | Fix |
|---|---|---|
| No backup metrics in Prometheus | Node Exporter not configured for textfile collector | Add --collector.textfile.directory to Node Exporter args and restart |
| Backup exporter fails with permission denied | Wrong file ownership or permissions | sudo chown prometheus:prometheus /var/lib/prometheus/node-exporter |
| Metrics file empty or not updating | Backup logs not found or unreadable | Check log paths in /etc/backup-monitor.json and ensure logs exist |
| False positive alerts | Log parsing not matching your backup format | Customize parsing functions in backup-exporter.py for your log format |
| Grafana dashboard shows no data | Data source not configured correctly | Verify Prometheus data source in Grafana points to correct URL |
| Email alerts not sending | SMTP configuration incorrect | Test SMTP settings: echo "test" \| mail -s "test" admin@example.com |

Note: You can extend the backup exporter to monitor additional backup types by adding custom parsing functions. The system supports MySQL, PostgreSQL, filesystem backups, and generic log formats out of the box.
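
One way to add backup types without growing the if/elif chain is a small registry that maps a type name to its parsing function. The sketch below shows the pattern on its own, returning plain dictionaries rather than setting Prometheus gauges. The `archive` type and its `ARCHIVE OK` marker are hypothetical, invented purely to illustrate registration; they do not describe any real tool's log format.

```python
PARSERS = {}


def register(backup_type):
    """Decorator that registers a parsing function for one backup type."""
    def decorator(func):
        PARSERS[backup_type] = func
        return func
    return decorator


@register("generic")
def parse_generic(content):
    """The first line containing a known keyword decides the outcome."""
    for line in content.splitlines():
        lower = line.lower()
        if any(word in lower for word in ("success", "completed", "finished", "done")):
            return {"success": 1}
        if any(word in lower for word in ("error", "failed", "abort")):
            return {"success": 0}
    return {"success": 0}


@register("archive")  # hypothetical backup type, for illustration only
def parse_archive(content):
    """Assumes a successful run logs a line that is exactly 'ARCHIVE OK'."""
    lines = [line.strip() for line in content.splitlines()]
    return {"success": 1 if "ARCHIVE OK" in lines else 0}


def parse(backup_type, content):
    """Dispatch to the registered parser, falling back to the generic one."""
    return PARSERS.get(backup_type, PARSERS["generic"])(content)


if __name__ == "__main__":
    print(parse("archive", "step 1\nARCHIVE OK\n"))
    print(parse("unknown-type", "job failed"))
```

With this layout, supporting a new tool means adding one decorated function, and unknown types keep the same generic fallback the exporter already uses.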
