Configure Backup Monitoring with Prometheus and Grafana

Set up comprehensive backup monitoring using Prometheus metrics collection and Grafana dashboards. This tutorial covers backup exporter configuration, custom metrics creation, and automated alerting for backup failures and performance issues.

Prerequisites

Root or sudo access
At least 2GB RAM
Python 3 installed
Existing backup jobs to monitor

What this solves

Infrastructure backups often fail silently, leaving you vulnerable to data loss without warning. This tutorial sets up Prometheus to collect backup metrics and Grafana to visualize backup status, duration, and success rates. You'll get automated alerts when backups fail or take too long to complete.

Step-by-step installation

Update system packages

Start by updating your package manager to ensure you get the latest versions of monitoring tools.

# Debian/Ubuntu
sudo apt update && sudo apt upgrade -y
sudo apt install -y curl wget gpg

# AlmaLinux/Rocky/RHEL/Fedora
sudo dnf update -y
sudo dnf install -y curl wget gpg

Install Prometheus

Download and install Prometheus server to collect backup metrics from various exporters.

cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xzf prometheus-2.45.0.linux-amd64.tar.gz
sudo mv prometheus-2.45.0.linux-amd64/prometheus /usr/local/bin/
sudo mv prometheus-2.45.0.linux-amd64/promtool /usr/local/bin/
sudo mkdir -p /etc/prometheus /var/lib/prometheus

Create Prometheus user and directories

Create a dedicated system user and give it ownership of the Prometheus directories so the server does not run as root.

sudo groupadd --system prometheus
sudo useradd -s /sbin/nologin --system -g prometheus prometheus
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus

Configure Prometheus for backup monitoring

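Create the main Prometheus configuration at /etc/prometheus/prometheus.yml with backup-specific scrape targets and rules. Remove the mysql_backup and file_backup jobs if you do not run exporters on those ports; otherwise their targets will show as down in Prometheus.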

global:
  scrape_interval: 30s
  evaluation_interval: 30s

rule_files:
  - "backup_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - localhost:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']

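  - job_name: 'backup_exporter'
    # Keep the exporter's own job label (mysql, filesystem, ...) instead of renaming it to exported_job
    honor_labels: true
    static_configs:
      - targets: ['localhost:9101']
    scrape_interval: 60s
    metrics_path: /metrics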

  - job_name: 'mysql_backup'
    static_configs:
      - targets: ['localhost:9104']
    scrape_interval: 300s

  - job_name: 'file_backup'
    file_sd_configs:
      - files:
        - '/etc/prometheus/backup_targets.yml'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
      - source_labels: [__address__]
        regex: '(.*)'
        target_label: __address__
        replacement: '${1}:9105'

Create backup alerting rules

Define alerting rules in /etc/prometheus/backup_rules.yml for failed backups, slow backups, unreachable exporters, stale backups, and sudden changes in backup size.

groups:
  - name: backup_monitoring
    rules:
      - alert: BackupFailed
        expr: backup_last_success == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Backup failed for {{ $labels.job }} on {{ $labels.instance }}"
          description: "Backup job {{ $labels.job }} has failed on instance {{ $labels.instance }}"

      - alert: BackupTooLong
        expr: backup_duration_seconds > 7200
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Backup took too long for {{ $labels.job }}"
          description: "The last run of backup job {{ $labels.job }} took {{ $value }} seconds"

      - alert: BackupMissing
        expr: up{job=~".*backup.*"} == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Backup exporter down for {{ $labels.job }}"
          description: "Backup exporter {{ $labels.job }} on {{ $labels.instance }} has been down for more than 10 minutes"

      - alert: BackupOld
        expr: (time() - backup_last_success_timestamp) > 86400
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Backup is older than 24 hours for {{ $labels.job }}"
          description: "Last successful backup for {{ $labels.job }} was {{ $value }} seconds ago"

      - alert: BackupSizeChanged
        expr: abs(backup_size_bytes - backup_size_bytes offset 24h) / backup_size_bytes offset 24h > 0.3
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Backup size changed significantly for {{ $labels.job }}"
          description: "Backup size for {{ $labels.job }} changed by {{ $value | humanizePercentage }} compared to yesterday"
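
To make the thresholds in BackupOld and BackupSizeChanged concrete, here is a small sketch that mirrors their arithmetic in plain Python. The function names and sample numbers are illustrative only; Prometheus evaluates the real expressions itself.

```python
def is_stale(last_success_ts, now, max_age_seconds=86400):
    """Mirror of BackupOld: (time() - backup_last_success_timestamp) > 86400."""
    return (now - last_success_ts) > max_age_seconds


def size_change_ratio(current_bytes, previous_bytes):
    """Mirror of the left-hand side of BackupSizeChanged: abs(cur - prev) / prev.

    PromQL yields NaN or Inf when the previous size is 0; this sketch
    returns 0.0 or inf instead, which compares against 0.3 the same way.
    """
    if previous_bytes == 0:
        return float("inf") if current_bytes else 0.0
    return abs(current_bytes - previous_bytes) / previous_bytes


now = 1_700_000_000
print(is_stale(now - 90_000, now))           # 25 hours old: alert condition holds
print(size_change_ratio(650, 1000) > 0.3)    # shrank by 35%: alert condition holds
print(size_change_ratio(1200, 1000) > 0.3)   # grew by 20%: below the threshold
```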

Install Node Exporter for system metrics

Node Exporter provides system-level metrics that complement backup monitoring data.

cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.0/node_exporter-1.6.0.linux-amd64.tar.gz
tar xzf node_exporter-1.6.0.linux-amd64.tar.gz
sudo mv node_exporter-1.6.0.linux-amd64/node_exporter /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/node_exporter

Create custom backup exporter

Build a simple exporter that reads backup status files and backup directories and exposes the results as Prometheus metrics. Save it as /usr/local/bin/backup_exporter.py.

#!/usr/bin/env python3
import os
import time
import json
from http.server import HTTPServer, BaseHTTPRequestHandler
import subprocess

class BackupMetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/metrics':
            self.send_response(200)
            self.send_header('Content-type', 'text/plain')
            self.end_headers()
            
            metrics = self.generate_metrics()
            self.wfile.write(metrics.encode('utf-8'))
        else:
            self.send_response(404)
            self.end_headers()
    
    def generate_metrics(self):
        metrics = []
        
        # Check backup status files
        backup_dirs = ['/var/backups', '/backup', '/opt/backups']
        
        for backup_dir in backup_dirs:
            if os.path.exists(backup_dir):
                status_file = f"{backup_dir}/.backup_status"
                if os.path.exists(status_file):
                    try:
                        with open(status_file, 'r') as f:
                            status = json.load(f)
                        
                        job_name = status.get('job_name', 'unknown')
                        success = 1 if status.get('success', False) else 0
                        duration = status.get('duration_seconds', 0)
                        size_bytes = status.get('size_bytes', 0)
                        timestamp = status.get('timestamp', 0)
                        
                        metrics.append(f'backup_last_success{{job="{job_name}"}} {success}')
                        metrics.append(f'backup_duration_seconds{{job="{job_name}"}} {duration}')
                        metrics.append(f'backup_size_bytes{{job="{job_name}"}} {size_bytes}')
                        metrics.append(f'backup_last_success_timestamp{{job="{job_name}"}} {timestamp}')
                    except Exception as e:
                        metrics.append(f'backup_exporter_errors_total{{error="status_file_read"}} 1')
        
        # Check MySQL backups
        mysql_backup_dir = '/var/backups/mysql'
        if os.path.exists(mysql_backup_dir):
            latest_backup = self.get_latest_backup_file(mysql_backup_dir, '*.sql.gz')
            if latest_backup:
                size = os.path.getsize(latest_backup)
                mtime = os.path.getmtime(latest_backup)
                age = time.time() - mtime
                
                metrics.append(f'backup_last_success{{job="mysql"}} {1 if age < 86400 else 0}')
                metrics.append(f'backup_size_bytes{{job="mysql"}} {size}')
                metrics.append(f'backup_last_success_timestamp{{job="mysql"}} {mtime}')
        
        # Check filesystem backups
        fs_backup_dir = '/var/backups/filesystem'
        if os.path.exists(fs_backup_dir):
            latest_backup = self.get_latest_backup_file(fs_backup_dir, '*.tar.gz')
            if latest_backup:
                size = os.path.getsize(latest_backup)
                mtime = os.path.getmtime(latest_backup)
                age = time.time() - mtime
                
                metrics.append(f'backup_last_success{{job="filesystem"}} {1 if age < 86400 else 0}')
                metrics.append(f'backup_size_bytes{{job="filesystem"}} {size}')
                metrics.append(f'backup_last_success_timestamp{{job="filesystem"}} {mtime}')
        
        return '\n'.join(metrics) + '\n'
    
    def get_latest_backup_file(self, directory, pattern):
        try:
            result = subprocess.run(['find', directory, '-name', pattern, '-type', 'f', '-printf', '%T@ %p\n'], 
                                  capture_output=True, text=True)
            if result.stdout:
                files = result.stdout.strip().split('\n')
                latest = max(files, key=lambda x: float(x.split()[0]))
                return latest.split(' ', 1)[1]
        except Exception:
            pass
        return None

if __name__ == '__main__':
    server = HTTPServer(('localhost', 9101), BackupMetricsHandler)
    server.serve_forever()
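
The exporter above interpolates job_name directly into the label value, so a name containing a double quote or backslash would produce a line Prometheus cannot parse. The following sketch shows one way to render the same four metrics with label escaping as required by the Prometheus text exposition format. The helper names are hypothetical and not part of the exporter above.

```python
def escape_label_value(value):
    """Escape backslash, double quote and newline for a label value."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def render_status_metrics(status):
    """Turn one parsed .backup_status dict into exposition-format lines."""
    job = escape_label_value(status.get("job_name", "unknown"))
    samples = [
        ("backup_last_success", 1 if status.get("success", False) else 0),
        ("backup_duration_seconds", status.get("duration_seconds", 0)),
        ("backup_size_bytes", status.get("size_bytes", 0)),
        ("backup_last_success_timestamp", status.get("timestamp", 0)),
    ]
    return [f'{name}{{job="{job}"}} {value}' for name, value in samples]


example = {
    "job_name": "mysql",
    "success": True,
    "duration_seconds": 42,
    "size_bytes": 2048,
    "timestamp": 1700000000,
}
print("\n".join(render_status_metrics(example)))
```

In the exporter, the four `metrics.append(...)` calls for a status file could be replaced by `metrics.extend(render_status_metrics(status))`.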

Make backup exporter executable

Set proper permissions and ownership for the backup exporter script.

sudo chmod +x /usr/local/bin/backup_exporter.py
sudo chown prometheus:prometheus /usr/local/bin/backup_exporter.py

Create systemd services

Set up systemd service files for Prometheus, Node Exporter, and the backup exporter. Start with the Prometheus unit at /etc/systemd/system/prometheus.service.

[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/ \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries \
  --web.listen-address=0.0.0.0:9090 \
  --web.enable-lifecycle \
  --storage.tsdb.retention.time=30d

Restart=always
RestartSec=3
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

Create Node Exporter service

Configure Node Exporter to start automatically and provide system metrics. Save the unit as /etc/systemd/system/node_exporter.service.

[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --collector.systemd \
  --collector.processes \
  --web.listen-address=:9100

Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target

Create backup exporter service

Set up the custom backup exporter as a systemd service so it starts automatically. Save the unit as /etc/systemd/system/backup_exporter.service.

[Unit]
Description=Backup Metrics Exporter
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/bin/python3 /usr/local/bin/backup_exporter.py
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target

Install Grafana

Add the Grafana repository and install the dashboard server for visualization.

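# Debian/Ubuntu (apt-key is deprecated, so the key goes into a dedicated keyring)
wget -q -O - https://packages.grafana.com/gpg.key | sudo gpg --dearmor -o /usr/share/keyrings/grafana-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/grafana-archive-keyring.gpg] https://packages.grafana.com/oss/deb stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update
sudo apt install -y grafana

# AlmaLinux/Rocky/RHEL/Fedora
sudo tee /etc/yum.repos.d/grafana.repo << 'EOF'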
[grafana]
name=grafana
baseurl=https://packages.grafana.com/oss/rpm
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
EOF
sudo dnf install -y grafana

Configure Grafana datasource

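Set up Prometheus as the default datasource for Grafana dashboards by placing a provisioning file in /etc/grafana/provisioning/datasources/ (for example prometheus.yml; the file name is up to you).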

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
    editable: true

Create backup monitoring dashboard

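Set up a pre-configured dashboard for backup monitoring with key metrics. The JSON below uses the wrapper expected by Grafana's dashboard HTTP API (POST /api/dashboards/db); to import it through the web UI instead, paste only the contents of the "dashboard" object.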

{
  "dashboard": {
    "id": null,
    "title": "Backup Monitoring",
    "tags": ["backup", "monitoring"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Backup Success Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "avg(backup_last_success)",
            "legendFormat": "Success Rate"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}
      },
      {
        "id": 2,
        "title": "Backup Duration",
        "type": "graph",
        "targets": [
          {
            "expr": "backup_duration_seconds",
            "legendFormat": "{{ job }}"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
        "yAxes": [
          {
            "unit": "s"
          }
        ]
      },
      {
        "id": 3,
        "title": "Backup Sizes",
        "type": "graph",
        "targets": [
          {
            "expr": "backup_size_bytes",
            "legendFormat": "{{ job }}"
          }
        ],
        "gridPos": {"h": 8, "w": 24, "x": 0, "y": 8},
        "yAxes": [
          {
            "unit": "bytes"
          }
        ]
      }
    ],
    "time": {
      "from": "now-7d",
      "to": "now"
    },
    "refresh": "1m"
  }
}

Start all services

Enable and start Prometheus, Node Exporter, backup exporter, and Grafana services.

sudo systemctl daemon-reload
sudo systemctl enable --now prometheus
sudo systemctl enable --now node_exporter
sudo systemctl enable --now backup_exporter
sudo systemctl enable --now grafana-server

Create backup status tracking script

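Your backup jobs call this script to report their status to the exporter. Save it as /usr/local/bin/backup_status_reporter.sh. Each call overwrites /var/backups/.backup_status, so the exporter only sees the most recent job reported this way.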

#!/bin/bash

# Usage: backup_status_reporter.sh JOB_NAME SUCCESS DURATION_SECONDS SIZE_BYTES

JOB_NAME="$1"
SUCCESS="$2"
DURATION="$3"
SIZE="$4"
TIMESTAMP=$(date +%s)

# Create status directory if it doesn't exist
mkdir -p /var/backups

# Write status to JSON file
cat > "/var/backups/.backup_status" << EOF
{
  "job_name": "$JOB_NAME",
  "success": $SUCCESS,
  "duration_seconds": $DURATION,
  "size_bytes": $SIZE,
  "timestamp": $TIMESTAMP
}
EOF

echo "Backup status reported for job: $JOB_NAME"
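
Because the exporter may read the status file at any moment, a reporter that writes it in place can briefly expose a half-written file. As an alternative sketch in Python, the function below writes the same JSON fields to a temporary file and renames it into place, which is atomic on POSIX filesystems. The function name is hypothetical; the field names match what the exporter reads.

```python
import json
import os
import tempfile
import time


def write_status(path, job_name, success, duration_seconds, size_bytes, timestamp=None):
    """Atomically write a backup status file and return the written dict."""
    status = {
        "job_name": job_name,
        "success": bool(success),
        "duration_seconds": int(duration_seconds),
        "size_bytes": int(size_bytes),
        "timestamp": int(time.time()) if timestamp is None else int(timestamp),
    }
    directory = os.path.dirname(path) or "."
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".backup_status.")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(status, handle)
        # mkstemp creates the file with mode 0600; the exporter runs as another user.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except Exception:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return status
```

A backup job could then call `write_status("/var/backups/.backup_status", "mysql", True, 42, 2048)` after it finishes.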

Make status reporter executable

Set proper permissions for the backup status reporting script.

sudo chmod +x /usr/local/bin/backup_status_reporter.sh
sudo chown root:root /usr/local/bin/backup_status_reporter.sh

Configure firewall rules

Allow access to Prometheus and Grafana web interfaces through the firewall.

sudo ufw allow 9090/tcp comment 'Prometheus'
sudo ufw allow 3000/tcp comment 'Grafana'
sudo ufw reload

sudo firewall-cmd --permanent --add-port=9090/tcp
sudo firewall-cmd --permanent --add-port=3000/tcp
sudo firewall-cmd --reload

Configure backup job integration

Modify existing backup scripts

Update your backup scripts to report status to the monitoring system. Here's an example for MySQL backups, saved as /usr/local/bin/mysql_backup_monitored.sh:

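#!/bin/bash
# Without pipefail the pipeline below reports gzip's exit status and hides mysqldump failures
set -o pipefail

START_TIME=$(date +%s)
BACKUP_FILE="/var/backups/mysql/mysql_backup_$(date +%Y%m%d_%H%M%S).sql.gz"

# Perform the backup
if mysqldump --all-databases --single-transaction | gzip > "$BACKUP_FILE"; then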
    SUCCESS=true
    SIZE=$(stat -c%s "$BACKUP_FILE")
else
    SUCCESS=false
    SIZE=0
fi

END_TIME=$(date +%s)
DURATION=$((END_TIME - START_TIME))

# Report status to monitoring
/usr/local/bin/backup_status_reporter.sh "mysql" "$SUCCESS" "$DURATION" "$SIZE"

if [ "$SUCCESS" = true ]; then
    echo "MySQL backup completed successfully"
    exit 0
else
    echo "MySQL backup failed"
    exit 1
fi
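
By default a shell pipeline reports only the exit status of its last command, so a dump that fails while gzip succeeds can look like a successful backup. The sketch below shows the same dump-and-compress pattern in Python with both exit codes checked explicitly. The function name is hypothetical, and the commands in the usage comment are only an example.

```python
import subprocess
import sys


def run_pipeline(producer_cmd, consumer_cmd, output_path):
    """Run `producer | consumer > output_path`; succeed only if both exit 0."""
    with open(output_path, "wb") as out:
        producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
        consumer = subprocess.Popen(consumer_cmd, stdin=producer.stdout, stdout=out)
        # Close our copy of the pipe so the producer sees a broken pipe
        # if the consumer exits early.
        producer.stdout.close()
        consumer_rc = consumer.wait()
        producer_rc = producer.wait()
    return producer_rc == 0 and consumer_rc == 0


# Example call (not executed here):
# run_pipeline(
#     ["mysqldump", "--all-databases", "--single-transaction"],
#     ["gzip"],
#     "/var/backups/mysql/dump.sql.gz",
# )
```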

Set up automated backup scheduling

Create cron jobs that run your monitored backup scripts on a schedule.

sudo crontab -e

# MySQL backup every night at 2 AM
0 2 * * * /usr/local/bin/mysql_backup_monitored.sh

# Filesystem backup every night at 3 AM
0 3 * * * /usr/local/bin/filesystem_backup_monitored.sh
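
A crontab entry needs five schedule fields (minute, hour, day of month, month, day of week) before the command. As a rough sanity check before installing entries, the sketch below tests that shape. It is a simplified check written for this tutorial: it accepts only numeric fields with `*`, `/`, `,` and `-`, and does not understand month or weekday names or shortcuts such as `@daily`.

```python
import re

# Characters allowed in a purely numeric cron schedule field.
_FIELD = re.compile(r"^[0-9*/,\-]+$")


def is_valid_cron_line(line):
    """Return True if the line has five schedule fields followed by a command."""
    parts = line.split()
    if len(parts) < 6:
        return False
    return all(_FIELD.match(field) for field in parts[:5])


print(is_valid_cron_line("0 2 * * * /usr/local/bin/mysql_backup_monitored.sh"))
```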

Set up alerting

Install Alertmanager

Download and configure Alertmanager to send notifications when backup issues occur.

cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v0.25.0/alertmanager-0.25.0.linux-amd64.tar.gz
tar xzf alertmanager-0.25.0.linux-amd64.tar.gz
sudo mv alertmanager-0.25.0.linux-amd64/alertmanager /usr/local/bin/
sudo mv alertmanager-0.25.0.linux-amd64/amtool /usr/local/bin/
sudo mkdir -p /etc/alertmanager
sudo chown prometheus:prometheus /usr/local/bin/alertmanager /usr/local/bin/amtool /etc/alertmanager

Configure Alertmanager

Set up email notifications for backup failures and other critical issues in /etc/alertmanager/alertmanager.yml.

global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@example.com'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'backup-alerts'

receivers:
  - name: 'backup-alerts'
    email_configs:
      - to: 'admin@example.com'
        headers:
          Subject: 'Backup Alert: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}

Create Alertmanager service

Set up a systemd service for Alertmanager to handle notification delivery. Save the unit as /etc/systemd/system/alertmanager.service.

[Unit]
Description=Alert Manager
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager/

Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target

Start Alertmanager

Enable and start the Alertmanager service for notification handling.

sudo mkdir -p /var/lib/alertmanager
sudo chown prometheus:prometheus /var/lib/alertmanager
sudo systemctl daemon-reload
sudo systemctl enable --now alertmanager

Verify your setup

Check that all services are running and accessible:

sudo systemctl status prometheus node_exporter backup_exporter grafana-server alertmanager

# Test Prometheus web interface
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool

# Check backup metrics are being collected
curl -s http://localhost:9101/metrics | grep backup_

# Verify Grafana is accessible
curl -s http://localhost:3000/api/health
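
The targets endpoint returns a fairly long JSON document, so it can help to reduce it to the targets that are not healthy. The sketch below does that for a payload shaped like the /api/v1/targets response (`data.activeTargets[]` with `labels`, `scrapeUrl`, `health` and `lastError`). The sample payload is made up for illustration.

```python
def unhealthy_targets(payload):
    """Return (job, scrape_url, last_error) for every target whose health is not 'up'."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (
            target.get("labels", {}).get("job", "?"),
            target.get("scrapeUrl", "?"),
            target.get("lastError", ""),
        )
        for target in targets
        if target.get("health") != "up"
    ]


# Made-up sample in the shape of the /api/v1/targets response.
SAMPLE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"job": "prometheus", "instance": "localhost:9090"},
                "scrapeUrl": "http://localhost:9090/metrics",
                "health": "up",
                "lastError": "",
            },
            {
                "labels": {"job": "backup_exporter", "instance": "localhost:9101"},
                "scrapeUrl": "http://localhost:9101/metrics",
                "health": "down",
                "lastError": "connection refused",
            },
        ]
    },
}

for job, url, error in unhealthy_targets(SAMPLE):
    print(f"{job} {url}: {error}")
```

To use it against a live server, load the curl output with `json.load` and pass the result to `unhealthy_targets`.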

Access the web interfaces to confirm everything is working:

Prometheus: http://your-server-ip:9090
Grafana: http://your-server-ip:3000 (admin/admin initially)
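Alertmanager: http://your-server-ip:9093 (the firewall rules above open only 9090 and 3000, so open 9093 as well or browse from the server itself)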

Initial Grafana Setup: Log in with admin/admin, then change the password. The backup dashboard should appear automatically under "Dashboards" if the JSON was loaded correctly.

Common issues

Symptom	Cause	Fix
Backup exporter shows no data	No backup status files exist	Run backup scripts with status reporting or create test status files
Prometheus can't scrape backup_exporter	Python script failed to start	Check `sudo systemctl status backup_exporter` and install python3 if missing
Grafana dashboard shows no data	Prometheus datasource not configured	Go to Grafana Settings → Data Sources and verify Prometheus URL is correct
Alerts not firing	Alerting rules syntax error	Check `promtool check rules /etc/prometheus/backup_rules.yml`
Email alerts not received	SMTP configuration incorrect	Verify SMTP settings in `/etc/alertmanager/alertmanager.yml` and test with local mail
Services fail to start after reboot	File permissions incorrect	Run `sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus`

Next steps

Configure advanced Prometheus alerting rules for system resource monitoring
Monitor MySQL performance with Prometheus to complement backup monitoring
Set up automated backup verification to ensure backup integrity
Implement advanced Grafana alerting with Slack and Teams integration
Configure long-term metrics storage for historical backup analysis

Automated install script

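Run this to automate the core setup: Prometheus, the alerting rules, Grafana, and the firewall rules. It does not install Node Exporter, the backup exporter, or Alertmanager, so complete those steps manually.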

install.sh

#!/usr/bin/env bash
set -euo pipefail

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

# Configuration
PROMETHEUS_VERSION="2.45.0"
GRAFANA_VERSION="10.2.0"
PROMETHEUS_USER="prometheus"
GRAFANA_USER="grafana"

# Usage message
usage() {
    echo "Usage: $0 [OPTIONS]"
    echo "Options:"
    echo "  -h, --help     Show this help message"
    echo "  -v, --version  Set Prometheus version (default: $PROMETHEUS_VERSION)"
    exit 1
}

# Parse arguments
while [[ $# -gt 0 ]]; do
    case $1 in
        -h|--help)
            usage
            ;;
        -v|--version)
            PROMETHEUS_VERSION="$2"
            shift 2
            ;;
        *)
            echo -e "${RED}Unknown option: $1${NC}" >&2
            usage
            ;;
    esac
done

# Cleanup function for rollback
cleanup() {
    echo -e "${YELLOW}[ERROR] Installation failed. Cleaning up...${NC}"
    systemctl stop prometheus 2>/dev/null || true
    systemctl stop grafana-server 2>/dev/null || true
    userdel -r $PROMETHEUS_USER 2>/dev/null || true
    userdel -r $GRAFANA_USER 2>/dev/null || true
    rm -rf /etc/prometheus /var/lib/prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
    echo -e "${RED}Cleanup completed${NC}"
}

trap cleanup ERR

# Check prerequisites
if [[ $EUID -ne 0 ]]; then
   echo -e "${RED}This script must be run as root${NC}" >&2
   exit 1
fi

# Auto-detect distribution
if [ -f /etc/os-release ]; then
    . /etc/os-release
    case "$ID" in
        ubuntu|debian)
            PKG_MGR="apt"
            PKG_INSTALL="apt install -y"
            PKG_UPDATE="apt update && apt upgrade -y"
            FIREWALL_CMD="ufw"
            ;;
        almalinux|rocky|centos|rhel|ol|fedora)
            PKG_MGR="dnf"
            PKG_INSTALL="dnf install -y"
            PKG_UPDATE="dnf update -y"
            FIREWALL_CMD="firewall-cmd"
            ;;
        amzn)
            PKG_MGR="yum"
            PKG_INSTALL="yum install -y"
            PKG_UPDATE="yum update -y"
            FIREWALL_CMD="firewall-cmd"
            ;;
        *)
            echo -e "${RED}Unsupported distribution: $ID${NC}" >&2
            exit 1
            ;;
    esac
else
    echo -e "${RED}Cannot detect distribution${NC}" >&2
    exit 1
fi

echo -e "${GREEN}Starting backup monitoring setup for $ID...${NC}"

# Step 1: Update system packages
echo -e "${YELLOW}[1/8] Updating system packages...${NC}"
# PKG_UPDATE may contain "&&", so the shell has to evaluate it rather than word-split it
eval "$PKG_UPDATE"
$PKG_INSTALL curl wget gpg tar

# Step 2: Install Prometheus
echo -e "${YELLOW}[2/8] Installing Prometheus...${NC}"
cd /tmp
wget -q "https://github.com/prometheus/prometheus/releases/download/v${PROMETHEUS_VERSION}/prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz"
tar xzf "prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz"
mv "prometheus-${PROMETHEUS_VERSION}.linux-amd64/prometheus" /usr/local/bin/
mv "prometheus-${PROMETHEUS_VERSION}.linux-amd64/promtool" /usr/local/bin/
chmod 755 /usr/local/bin/prometheus /usr/local/bin/promtool

# Step 3: Create Prometheus user and directories
echo -e "${YELLOW}[3/8] Creating Prometheus user and directories...${NC}"
groupadd --system $PROMETHEUS_USER || true
useradd -s /sbin/nologin --system -g $PROMETHEUS_USER $PROMETHEUS_USER || true
mkdir -p /etc/prometheus /var/lib/prometheus
chown -R $PROMETHEUS_USER:$PROMETHEUS_USER /etc/prometheus /var/lib/prometheus
chmod 755 /etc/prometheus /var/lib/prometheus

# Step 4: Configure Prometheus
echo -e "${YELLOW}[4/8] Configuring Prometheus...${NC}"
cat > /etc/prometheus/prometheus.yml << 'EOF'
global:
  scrape_interval: 30s
  evaluation_interval: 30s

rule_files:
  - "backup_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - localhost:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

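  - job_name: 'backup_exporter'
    # Keep the exporter's own job label (mysql, filesystem, ...) instead of renaming it to exported_job
    honor_labels: true
    static_configs:
      - targets: ['localhost:9101']
    scrape_interval: 60s
    metrics_path: /metrics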
EOF

# Step 5: Create backup alerting rules
echo -e "${YELLOW}[5/8] Creating backup alerting rules...${NC}"
cat > /etc/prometheus/backup_rules.yml << 'EOF'
groups:
  - name: backup_monitoring
    rules:
      - alert: BackupFailed
        expr: backup_last_success == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Backup failed for {{ $labels.job }} on {{ $labels.instance }}"
          description: "Backup job {{ $labels.job }} has failed on instance {{ $labels.instance }}"

      - alert: BackupTooLong
        expr: backup_duration_seconds > 7200
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Backup took too long for {{ $labels.job }}"
          description: "The last run of backup job {{ $labels.job }} took {{ $value }} seconds"

      - alert: BackupMissing
        expr: up{job=~".*backup.*"} == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Backup exporter down for {{ $labels.job }}"
          description: "Backup exporter {{ $labels.job }} on {{ $labels.instance }} has been down for more than 10 minutes"
EOF

chown -R $PROMETHEUS_USER:$PROMETHEUS_USER /etc/prometheus/
chmod 644 /etc/prometheus/prometheus.yml /etc/prometheus/backup_rules.yml

# Step 6: Create Prometheus systemd service
echo -e "${YELLOW}[6/8] Creating Prometheus systemd service...${NC}"
cat > /etc/systemd/system/prometheus.service << EOF
[Unit]
Description=Prometheus Monitoring System
After=network.target

[Service]
Type=simple
User=$PROMETHEUS_USER
Group=$PROMETHEUS_USER
ExecStart=/usr/local/bin/prometheus \\
  --config.file=/etc/prometheus/prometheus.yml \\
  --storage.tsdb.path=/var/lib/prometheus/ \\
  --web.console.templates=/etc/prometheus/consoles \\
  --web.console.libraries=/etc/prometheus/console_libraries \\
  --web.listen-address=0.0.0.0:9090 \\
  --web.enable-lifecycle
Restart=always
RestartSec=3
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
EOF

# Step 7: Install Grafana
echo -e "${YELLOW}[7/8] Installing Grafana...${NC}"
if [[ "$PKG_MGR" == "apt" ]]; then
    curl -s https://packages.grafana.com/gpg.key | gpg --dearmor -o /usr/share/keyrings/grafana-archive-keyring.gpg
    echo "deb [signed-by=/usr/share/keyrings/grafana-archive-keyring.gpg] https://packages.grafana.com/oss/deb stable main" > /etc/apt/sources.list.d/grafana.list
    apt update
    $PKG_INSTALL grafana
else
    cat > /etc/yum.repos.d/grafana.repo << EOF
[grafana]
name=grafana
baseurl=https://packages.grafana.com/oss/rpm
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
EOF
    $PKG_INSTALL grafana
fi

# Step 8: Configure firewall and start services
echo -e "${YELLOW}[8/8] Configuring firewall and starting services...${NC}"
systemctl daemon-reload
systemctl enable prometheus grafana-server
systemctl start prometheus grafana-server

# Configure firewall
if [[ "$FIREWALL_CMD" == "ufw" ]]; then
    if command -v ufw >/dev/null 2>&1; then
        ufw allow 9090/tcp comment "Prometheus"
        ufw allow 3000/tcp comment "Grafana"
    fi
elif [[ "$FIREWALL_CMD" == "firewall-cmd" ]]; then
    if systemctl is-active --quiet firewalld; then
        firewall-cmd --permanent --add-port=9090/tcp
        firewall-cmd --permanent --add-port=3000/tcp
        firewall-cmd --reload
    fi
fi

# Final verification
echo -e "${YELLOW}Verifying installation...${NC}"
sleep 5

if systemctl is-active --quiet prometheus; then
    echo -e "${GREEN}✓ Prometheus is running${NC}"
else
    echo -e "${RED}✗ Prometheus failed to start${NC}"
    exit 1
fi

if systemctl is-active --quiet grafana-server; then
    echo -e "${GREEN}✓ Grafana is running${NC}"
else
    echo -e "${RED}✗ Grafana failed to start${NC}"
    exit 1
fi

# -f makes curl fail on HTTP errors, so its exit status alone shows whether Prometheus is ready
if curl -sf http://localhost:9090/-/ready >/dev/null; then
    echo -e "${GREEN}✓ Prometheus is responding${NC}"
else
    echo -e "${RED}✗ Prometheus is not responding${NC}"
    exit 1
fi

# Disable trap
trap - ERR

echo -e "${GREEN}Installation completed successfully!${NC}"
echo -e "${GREEN}Access Grafana at: http://$(hostname -I | awk '{print $1}'):3000${NC}"
echo -e "${GREEN}Access Prometheus at: http://$(hostname -I | awk '{print $1}'):9090${NC}"
echo -e "${YELLOW}Default Grafana credentials: admin/admin${NC}"

Review the script before running. Execute with: bash install.sh
