Set up comprehensive backup monitoring using Prometheus metrics collection and Grafana dashboards. This tutorial covers backup exporter configuration, custom metrics creation, and automated alerting for backup failures and performance issues.
Prerequisites
- A Debian or Ubuntu server (the manual steps use apt and ufw)
- Root or sudo access
- At least 2GB RAM
- Python 3 installed
- Existing backup jobs to monitor
What this solves
Infrastructure backups often fail silently, leaving you vulnerable to data loss without warning. This tutorial sets up Prometheus to collect backup metrics and Grafana to visualize backup status, duration, and success rates. You'll get automated alerts when backups fail or take too long to complete.
Step-by-step installation
Update system packages
Refresh the package index, apply pending upgrades, and install the download and key-handling tools used in later steps.
sudo apt update && sudo apt upgrade -y
sudo apt install -y curl wget gpg
Install Prometheus
Download and install Prometheus server to collect backup metrics from various exporters.
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xzf prometheus-2.45.0.linux-amd64.tar.gz
sudo mv prometheus-2.45.0.linux-amd64/prometheus /usr/local/bin/
sudo mv prometheus-2.45.0.linux-amd64/promtool /usr/local/bin/
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo mv prometheus-2.45.0.linux-amd64/consoles prometheus-2.45.0.linux-amd64/console_libraries /etc/prometheus/
Create Prometheus user and directories
Create a dedicated system user without a login shell and give it ownership of the Prometheus directories, so the server does not run as root.
sudo groupadd --system prometheus
sudo useradd -s /sbin/nologin --system -g prometheus prometheus
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
Configure Prometheus for backup monitoring
Create /etc/prometheus/prometheus.yml with the backup-specific scrape targets and a reference to the alerting rules file.
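global:
  scrape_interval: 30s
  evaluation_interval: 30s
rule_files:
  - "backup_rules.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']
  - job_name: 'backup_exporter'
    # Keep the job="mysql"/"filesystem" labels set by the exporter; without this,
    # Prometheus renames them to exported_job and every alert shows job="backup_exporter"
    honor_labels: true
    static_configs:
      - targets: ['localhost:9101']
    scrape_interval: 60s
    metrics_path: /metrics
  # Optional: needs an exporter listening on port 9104 (not installed in this tutorial).
  # Remove this job otherwise, or the BackupMissing alert will fire for it.
  - job_name: 'mysql_backup'
    static_configs:
      - targets: ['localhost:9104']
    scrape_interval: 300s
  # Optional: list bare hostnames in backup_targets.yml; the relabeling below appends :9105
  - job_name: 'file_backup'
    file_sd_configs:
      - files:
          - '/etc/prometheus/backup_targets.yml'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
      - source_labels: [__address__]
        regex: '(.*)'
        target_label: __address__
        replacement: '${1}:9105'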
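The file_backup job rewrites every address it discovers. As a rough illustration, here is a Python sketch that reproduces only these two relabel rules (it is not Prometheus's relabeling engine). It shows what happens to a target listed with and without a port, which is why the targets file should hold bare hostnames. The hostnames are made up.

```python
import re

def relabel_file_backup(address: str) -> dict:
    """Mimic the two relabel_configs rules of the file_backup job (illustration only)."""
    labels = {"__address__": address}
    # Rule 1: copy __address__ into instance (default regex (.*), replacement $1)
    labels["instance"] = labels["__address__"]
    # Rule 2: Prometheus anchors relabel regexes, so '(.*)' must match the whole value,
    # then '${1}:9105' becomes the new scrape address
    match = re.fullmatch(r"(.*)", labels["__address__"])
    if match:
        labels["__address__"] = f"{match.group(1)}:9105"
    return labels

print(relabel_file_backup("backup01"))
# {'__address__': 'backup01:9105', 'instance': 'backup01'}
print(relabel_file_backup("backup01:9100"))
# {'__address__': 'backup01:9100:9105', 'instance': 'backup01:9100'}
```

A target written as backup01:9100 ends up with a doubled port and fails to scrape, while a bare hostname gets the intended :9105.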
Create backup alerting rules
Create /etc/prometheus/backup_rules.yml with alerting rules for failed, slow, missing, stale, and unexpectedly resized backups.
groups:
  - name: backup_monitoring
    rules:
      - alert: BackupFailed
        expr: backup_last_success == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Backup failed for {{ $labels.job }} on {{ $labels.instance }}"
          description: "Backup job {{ $labels.job }} has failed on instance {{ $labels.instance }}"
      - alert: BackupTooLong
        expr: backup_duration_seconds > 7200
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Backup taking too long for {{ $labels.job }}"
          description: "The last run of backup job {{ $labels.job }} took {{ $value }} seconds"
      - alert: BackupMissing
        # Regex matchers are fully anchored, so .* is needed on both sides of "backup"
        expr: up{job=~".*backup.*"} == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Backup exporter down for {{ $labels.job }}"
          description: "Backup exporter {{ $labels.job }} on {{ $labels.instance }} has been down for more than 10 minutes"
      - alert: BackupOld
        expr: (time() - backup_last_success_timestamp) > 86400
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Backup is older than 24 hours for {{ $labels.job }}"
          description: "Last successful backup for {{ $labels.job }} was {{ $value }} seconds ago"
      - alert: BackupSizeChanged
        expr: abs(backup_size_bytes - backup_size_bytes offset 24h) / backup_size_bytes offset 24h > 0.3
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Backup size changed significantly for {{ $labels.job }}"
          description: "Backup size for {{ $labels.job }} changed by {{ $value | humanizePercentage }} compared to yesterday"
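The rule expressions reduce to simple arithmetic and pattern matching. This Python sketch mirrors three of them on plain numbers so you can sanity-check thresholds before deploying; it is an illustration, not how Prometheus evaluates PromQL. It also makes one detail concrete: PromQL regex matchers are anchored at both ends, so a pattern such as .backup. only matches job names with exactly one character on each side of "backup", while .*backup.* matches backup_exporter, mysql_backup, and file_backup. (Python's re and Prometheus's RE2 behave the same for these simple patterns.)

```python
import re

def job_matches(pattern: str, job: str) -> bool:
    """PromQL =~ matchers are anchored at both ends, so compare with re.fullmatch."""
    return re.fullmatch(pattern, job) is not None

def backup_too_old(now: float, last_success_ts: float, max_age: float = 86400) -> bool:
    """BackupOld: (time() - backup_last_success_timestamp) > 86400"""
    return (now - last_success_ts) > max_age

def size_change_ratio(today_bytes: float, yesterday_bytes: float) -> float:
    """BackupSizeChanged: abs(today - yesterday) / yesterday, alerting above 0.3."""
    return abs(today_bytes - yesterday_bytes) / yesterday_bytes

print(job_matches(".backup.", "backup_exporter"))       # False
print(job_matches(".*backup.*", "backup_exporter"))     # True
print(size_change_ratio(14_000_000, 10_000_000) > 0.3)  # True: 40% growth
```

One difference from Python: where this sketch would raise ZeroDivisionError, PromQL returns +Inf. A failed run that reported a size of 0 yesterday therefore makes BackupSizeChanged fire as soon as a normal backup arrives.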
Install Node Exporter for system metrics
Node Exporter provides system-level metrics that complement backup monitoring data.
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.0/node_exporter-1.6.0.linux-amd64.tar.gz
tar xzf node_exporter-1.6.0.linux-amd64.tar.gz
sudo mv node_exporter-1.6.0.linux-amd64/node_exporter /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/node_exporter
Create custom backup exporter
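Save the following as /usr/local/bin/backup_exporter.py. It is a small HTTP exporter that reads backup status files and the newest files in known backup directories, then serves the results on localhost:9101/metrics.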
#!/usr/bin/env python3
"""Minimal backup metrics exporter for Prometheus, listening on localhost:9101."""
import json
import os
import subprocess
import time
from http.server import HTTPServer, BaseHTTPRequestHandler
# Directories that may contain a .backup_status file written by backup_status_reporter.sh
STATUS_DIRS = ['/var/backups', '/backup', '/opt/backups']
# (job label, directory, filename pattern) for checks based on the newest backup file
BACKUP_DIRS = [
    ('mysql', '/var/backups/mysql', '*.sql.gz'),
    ('filesystem', '/var/backups/filesystem', '*.tar.gz'),
]
MAX_AGE_SECONDS = 86400  # the newest file must be younger than this to count as a success
class BackupMetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/metrics':
            body = self.generate_metrics().encode('utf-8')
            self.send_response(200)
            # Content type of the Prometheus text exposition format
            self.send_header('Content-Type', 'text/plain; version=0.0.4; charset=utf-8')
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()
    def generate_metrics(self):
        metrics = []
        seen_jobs = set()  # never emit the same series twice in one scrape
        read_errors = 0
        # 1. Status files written by the backup jobs
        for backup_dir in STATUS_DIRS:
            status_file = os.path.join(backup_dir, '.backup_status')
            if not os.path.exists(status_file):
                continue
            try:
                with open(status_file, 'r') as f:
                    status = json.load(f)
            except (OSError, ValueError):
                read_errors += 1
                continue
            job_name = status.get('job_name', 'unknown')
            if job_name in seen_jobs:
                continue
            seen_jobs.add(job_name)
            success = 1 if status.get('success', False) else 0
            metrics.append(f'backup_last_success{{job="{job_name}"}} {success}')
            metrics.append(f'backup_duration_seconds{{job="{job_name}"}} {status.get("duration_seconds", 0)}')
            metrics.append(f'backup_size_bytes{{job="{job_name}"}} {status.get("size_bytes", 0)}')
            metrics.append(f'backup_last_success_timestamp{{job="{job_name}"}} {status.get("timestamp", 0)}')
        # 2. Newest file in each known backup directory, unless a status file already covered the job
        for job_name, directory, pattern in BACKUP_DIRS:
            if job_name in seen_jobs or not os.path.isdir(directory):
                continue
            latest_backup = self.get_latest_backup_file(directory, pattern)
            if latest_backup:
                size = os.path.getsize(latest_backup)
                mtime = os.path.getmtime(latest_backup)
                age = time.time() - mtime
                metrics.append(f'backup_last_success{{job="{job_name}"}} {1 if age < MAX_AGE_SECONDS else 0}')
                metrics.append(f'backup_size_bytes{{job="{job_name}"}} {size}')
                metrics.append(f'backup_last_success_timestamp{{job="{job_name}"}} {mtime}')
        if read_errors:
            metrics.append(f'backup_exporter_errors_total{{error="status_file_read"}} {read_errors}')
        return '\n'.join(metrics) + '\n'
    def get_latest_backup_file(self, directory, pattern):
        """Return the most recently modified file matching pattern, or None."""
        try:
            result = subprocess.run(
                ['find', directory, '-name', pattern, '-type', 'f', '-printf', '%T@ %p\n'],
                capture_output=True, text=True, timeout=30)
        except (OSError, subprocess.SubprocessError):
            return None
        lines = [line for line in result.stdout.splitlines() if line]
        if not lines:
            return None
        # Each line is "<mtime> <path>"; keep the one with the largest mtime
        latest = max(lines, key=lambda line: float(line.split(' ', 1)[0]))
        return latest.split(' ', 1)[1]
if __name__ == '__main__':
    server = HTTPServer(('localhost', 9101), BackupMetricsHandler)
    server.serve_forever()
Make backup exporter executable
Set proper permissions and ownership for the backup exporter script.
sudo chmod +x /usr/local/bin/backup_exporter.py
sudo chown prometheus:prometheus /usr/local/bin/backup_exporter.py
Create systemd services
Each service gets its own unit file under /etc/systemd/system/. Start with /etc/systemd/system/prometheus.service for the Prometheus server:
[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus/ \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries \
--web.listen-address=0.0.0.0:9090 \
--web.enable-lifecycle \
--storage.tsdb.retention.time=30d
Restart=always
RestartSec=3
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
Create Node Exporter service
Create /etc/systemd/system/node_exporter.service so Node Exporter starts automatically and serves system metrics on port 9100.
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/node_exporter \
--collector.systemd \
--collector.processes \
--web.listen-address=:9100
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
Create backup exporter service
Create /etc/systemd/system/backup_exporter.service to run the custom exporter as a service that starts at boot.
[Unit]
Description=Backup Metrics Exporter
After=network.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/bin/python3 /usr/local/bin/backup_exporter.py
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
Install Grafana
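Add the Grafana APT repository with a dedicated keyring (apt-key is deprecated) and install the dashboard server.
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list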
sudo apt update
sudo apt install -y grafana
Configure Grafana datasource
Create /etc/grafana/provisioning/datasources/prometheus.yml so Grafana registers Prometheus as its default datasource at startup.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
    editable: true
Create backup monitoring dashboard
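Save the following as backup-dashboard.json (any file name works). The top-level "dashboard" key is the payload format of Grafana's HTTP API (POST /api/dashboards/db), so import it through that endpoint once Grafana is running. On the UI's Import dashboard page, paste only the object inside "dashboard".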
{
  "dashboard": {
    "id": null,
    "title": "Backup Monitoring",
    "tags": ["backup", "monitoring"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Backup Success Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "avg(backup_last_success)",
            "legendFormat": "Success Rate"
          }
        ],
        "fieldConfig": {"defaults": {"unit": "percentunit"}},
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}
      },
      {
        "id": 2,
        "title": "Backup Duration",
        "type": "timeseries",
        "targets": [
          {
            "expr": "backup_duration_seconds",
            "legendFormat": "{{ job }}"
          }
        ],
        "fieldConfig": {"defaults": {"unit": "s"}},
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}
      },
      {
        "id": 3,
        "title": "Backup Sizes",
        "type": "timeseries",
        "targets": [
          {
            "expr": "backup_size_bytes",
            "legendFormat": "{{ job }}"
          }
        ],
        "fieldConfig": {"defaults": {"unit": "bytes"}},
        "gridPos": {"h": 8, "w": 24, "x": 0, "y": 8}
      }
    ],
    "time": {
      "from": "now-7d",
      "to": "now"
    },
    "refresh": "1m"
  }
}
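If you prefer scripting the import over the web UI, the sketch below builds the HTTP request for Grafana's POST /api/dashboards/db endpoint using only the Python standard library. The local URL and the default admin/admin credentials are assumptions; change them to match your install. Setting "overwrite" lets you re-run the import after editing the JSON.

```python
import base64
import json
import urllib.request

GRAFANA_URL = "http://localhost:3000"  # assumption: Grafana runs on this host

def build_import_request(payload: dict, user: str = "admin",
                         password: str = "admin") -> urllib.request.Request:
    """Build (but do not send) a POST /api/dashboards/db request.

    payload is the parsed dashboard file, i.e. {"dashboard": {...}}.
    """
    # overwrite=True replaces a dashboard with the same title or uid instead of failing
    body = dict(payload, overwrite=True)
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        f"{GRAFANA_URL}/api/dashboards/db",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": f"Basic {token}"},
        method="POST",
    )

# Usage, assuming the JSON above was saved as backup-dashboard.json (example name):
#   with open("backup-dashboard.json", encoding="utf-8") as f:
#       req = build_import_request(json.load(f))
#   with urllib.request.urlopen(req) as resp:
#       print(resp.status, resp.read().decode())
```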
Start all services
Enable and start Prometheus, Node Exporter, backup exporter, and Grafana services.
sudo systemctl daemon-reload
sudo systemctl enable --now prometheus
sudo systemctl enable --now node_exporter
sudo systemctl enable --now backup_exporter
sudo systemctl enable --now grafana-server
Create backup status tracking script
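Save this script as /usr/local/bin/backup_status_reporter.sh and call it at the end of each backup job. It writes /var/backups/.backup_status, which the exporter reads. Every job writes the same file, so only the most recent job's status is exposed from it; the MySQL and filesystem backup directories are also checked directly by the exporter.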
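#!/bin/bash
# Usage: backup_status_reporter.sh JOB_NAME SUCCESS DURATION_SECONDS SIZE_BYTES
# SUCCESS must be true or false; it is written as a JSON boolean
set -euo pipefail
if [ "$#" -ne 4 ]; then
    echo "Usage: $0 JOB_NAME SUCCESS DURATION_SECONDS SIZE_BYTES" >&2
    exit 2
fi
JOB_NAME="$1"
SUCCESS="$2"
DURATION="$3"
SIZE="$4"
TIMESTAMP=$(date +%s)
STATUS_FILE="/var/backups/.backup_status"
# Create status directory if it doesn't exist
mkdir -p /var/backups
# Write status to a temporary JSON file, then rename it so the exporter never reads a partial file
cat > "${STATUS_FILE}.tmp" << EOF
{
  "job_name": "$JOB_NAME",
  "success": $SUCCESS,
  "duration_seconds": $DURATION,
  "size_bytes": $SIZE,
  "timestamp": $TIMESTAMP
}
EOF
mv "${STATUS_FILE}.tmp" "$STATUS_FILE"
echo "Backup status reported for job: $JOB_NAME"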
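If some of your backup jobs are written in Python, they can write the same status file directly instead of shelling out to the reporter above. This is a sketch under the assumption that you keep the field names and the /var/backups/.backup_status location used by the exporter; report_backup_status is a hypothetical helper, not part of any library. It writes to a temporary file and renames it over the old one, so a scrape never sees a half-written file.

```python
import json
import os
import tempfile
import time

def report_backup_status(job_name: str, success: bool, duration_seconds: float,
                         size_bytes: int, status_dir: str = "/var/backups") -> str:
    """Write the status JSON the exporter reads; returns the status file path."""
    status = {
        "job_name": job_name,
        "success": bool(success),
        "duration_seconds": duration_seconds,
        "size_bytes": size_bytes,
        "timestamp": int(time.time()),
    }
    os.makedirs(status_dir, exist_ok=True)
    status_path = os.path.join(status_dir, ".backup_status")
    # Temporary file in the same directory, so the rename below stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=status_dir, prefix=".backup_status.tmp.")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(status, f)
        os.chmod(tmp_path, 0o644)  # mkstemp creates mode 0600; the exporter runs as prometheus
        os.replace(tmp_path, status_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return status_path
```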
Make status reporter executable
Set proper permissions for the backup status reporting script.
sudo chmod +x /usr/local/bin/backup_status_reporter.sh
sudo chown root:root /usr/local/bin/backup_status_reporter.sh
Configure firewall rules
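Allow access to the Prometheus, Grafana, and Alertmanager web interfaces through the firewall. Prometheus and Alertmanager have no built-in authentication, so on a public host consider limiting each rule to a trusted network (for example sudo ufw allow from 192.0.2.0/24 to any port 9090 proto tcp).
sudo ufw allow 9090/tcp comment 'Prometheus'
sudo ufw allow 3000/tcp comment 'Grafana'
sudo ufw allow 9093/tcp comment 'Alertmanager'
sudo ufw reload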
Configure backup job integration
Modify existing backup scripts
Update your backup scripts to report their status to the monitoring system. Here is an example for MySQL backups, saved as /usr/local/bin/mysql_backup_monitored.sh (the path the cron entries expect):
#!/bin/bash
# Fail the pipeline when mysqldump fails, not only when gzip does
set -o pipefail
START_TIME=$(date +%s)
BACKUP_DIR="/var/backups/mysql"
BACKUP_FILE="$BACKUP_DIR/mysql_backup_$(date +%Y%m%d_%H%M%S).sql.gz"
mkdir -p "$BACKUP_DIR"
# Perform the backup
if mysqldump --all-databases --single-transaction | gzip > "$BACKUP_FILE"; then
    SUCCESS=true
    SIZE=$(stat -c%s "$BACKUP_FILE")
else
    SUCCESS=false
    SIZE=0
    # Remove the partial dump so the exporter's directory check does not count it as a fresh backup
    rm -f "$BACKUP_FILE"
fi
END_TIME=$(date +%s)
DURATION=$((END_TIME - START_TIME))
# Report status to monitoring
/usr/local/bin/backup_status_reporter.sh "mysql" "$SUCCESS" "$DURATION" "$SIZE"
if [ "$SUCCESS" = true ]; then
    echo "MySQL backup completed successfully"
    exit 0
else
    echo "MySQL backup failed" >&2
    exit 1
fi
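The important detail in the script above is that a failing mysqldump must not be masked by a successful gzip, which is what set -o pipefail guarantees in Bash. If you write backup jobs in Python, the same rule applies. The sketch below (run_compressed_dump is a hypothetical helper) streams a dump command's output through gzip and decides success from the dump command's own exit status.

```python
import gzip
import os
import shutil
import subprocess
import time

def run_compressed_dump(dump_cmd: list, backup_file: str) -> tuple:
    """Stream dump_cmd's stdout through gzip into backup_file.

    Returns (success, size_bytes, duration_seconds). Like `set -o pipefail`,
    success depends on the dump command's exit status, not only on compression.
    """
    start = time.monotonic()
    with gzip.open(backup_file, "wb") as gz:
        proc = subprocess.Popen(dump_cmd, stdout=subprocess.PIPE)
        shutil.copyfileobj(proc.stdout, gz)
        proc.stdout.close()
        returncode = proc.wait()
    duration = int(time.monotonic() - start)
    if returncode != 0:
        os.remove(backup_file)  # don't leave a truncated dump that looks like a fresh backup
        return False, 0, duration
    return True, os.path.getsize(backup_file), duration

# Example with the dump command used in the Bash script:
#   ok, size, secs = run_compressed_dump(
#       ["mysqldump", "--all-databases", "--single-transaction"],
#       "/var/backups/mysql/mysql_backup_manual.sql.gz")
```

The returned tuple maps directly onto the SUCCESS, SIZE_BYTES, and DURATION_SECONDS arguments of the status reporter.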
Set up automated backup scheduling
Open root's crontab and add entries that run your monitored backup scripts on a schedule.
sudo crontab -e
# MySQL backup every night at 2 AM
0 2 * * * /usr/local/bin/mysql_backup_monitored.sh
# Filesystem backup every night at 3 AM (a script built the same way as the MySQL one, not shown here)
0 3 * * * /usr/local/bin/filesystem_backup_monitored.sh
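With a nightly schedule, the BackupOld threshold of exactly 86400 seconds leaves no slack. The reporter stamps the time when a backup finishes, so if tonight's run finishes a few minutes later than last night's, the last success is briefly more than 24 hours old while the new backup is still running. The hypothetical timeline below works through that arithmetic. The 26-hour alternative is only a suggestion; pick a margin that fits how much your backup durations vary.

```python
from datetime import datetime

def backup_old_fires(last_success: datetime, now: datetime, threshold_s: int = 86400) -> bool:
    """Mirror of BackupOld: (time() - backup_last_success_timestamp) > threshold."""
    return (now - last_success).total_seconds() > threshold_s

# Hypothetical timeline: last night's 2 AM backup finished at 02:05
last_success = datetime(2024, 1, 1, 2, 5)
# Tonight's run started at 02:00 and is still running at 02:07
now = datetime(2024, 1, 2, 2, 7)

print(backup_old_fires(last_success, now))             # True: fires although nothing is wrong
print(backup_old_fires(last_success, now, 26 * 3600))  # False with a 26-hour threshold
```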
Set up alerting
Install Alertmanager
Download and configure Alertmanager to send notifications when backup issues occur.
cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v0.25.0/alertmanager-0.25.0.linux-amd64.tar.gz
tar xzf alertmanager-0.25.0.linux-amd64.tar.gz
sudo mv alertmanager-0.25.0.linux-amd64/alertmanager /usr/local/bin/
sudo mv alertmanager-0.25.0.linux-amd64/amtool /usr/local/bin/
sudo mkdir -p /etc/alertmanager
sudo chown prometheus:prometheus /usr/local/bin/alertmanager /usr/local/bin/amtool /etc/alertmanager
Configure Alertmanager
Create /etc/alertmanager/alertmanager.yml to send email notifications for backup failures and other critical issues.
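global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@example.com'
  # smtp_require_tls defaults to true; uncomment if your local relay does not offer STARTTLS
  # smtp_require_tls: false
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'backup-alerts'
receivers:
  - name: 'backup-alerts'
    email_configs:
      - to: 'admin@example.com'
        # email_configs has no subject or body fields: set the subject via headers and the body via text
        headers:
          Subject: 'Backup Alert: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}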
Create Alertmanager service
Create /etc/systemd/system/alertmanager.service so systemd manages Alertmanager and its notification delivery.
[Unit]
Description=Alert Manager
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path=/var/lib/alertmanager/
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
Start Alertmanager
Enable and start the Alertmanager service for notification handling.
sudo mkdir -p /var/lib/alertmanager
sudo chown prometheus:prometheus /var/lib/alertmanager
sudo systemctl daemon-reload
sudo systemctl enable --now alertmanager
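Once Alertmanager is running, you can test the email path end to end without breaking a real backup. Post a synthetic alert to Alertmanager's v2 HTTP API (POST /api/v2/alerts). The sketch below only builds the request. The local URL and the label values are assumptions for a test alert; with group_wait set to 10s, the email should go out shortly after the request is sent.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

ALERTMANAGER_URL = "http://localhost:9093"  # assumption: Alertmanager on this host

def build_test_alert_request(minutes: int = 5) -> urllib.request.Request:
    """Build a POST /api/v2/alerts request carrying one synthetic BackupFailed alert."""
    now = datetime.now(timezone.utc)
    alerts = [{
        "labels": {"alertname": "BackupFailed", "job": "test", "severity": "critical"},
        "annotations": {
            "summary": "Test alert for the backup-alerts receiver",
            "description": "Sent manually to verify email delivery",
        },
        # RFC 3339 timestamps; the alert resolves on its own after `minutes`
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]
    return urllib.request.Request(
        f"{ALERTMANAGER_URL}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Usage (actually sends the alert):
#   with urllib.request.urlopen(build_test_alert_request()) as resp:
#       print(resp.status)
```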
Verify your setup
Check that all services are running and accessible:
sudo systemctl status prometheus node_exporter backup_exporter grafana-server alertmanager
# Test Prometheus web interface
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool
# Check backup metrics are being collected
curl -s http://localhost:9101/metrics | grep backup_
# Verify Grafana is accessible
curl -s http://localhost:3000/api/health
Access the web interfaces to confirm everything is working:
- Prometheus: http://your-server-ip:9090
- Grafana: http://your-server-ip:3000 (admin/admin initially)
- Alertmanager: http://your-server-ip:9093
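The curl | grep check only shows raw lines. For a scriptable check, for example from another monitoring job, here is a small Python parser for the simple name{labels} value lines this exporter emits. It is a sketch that ignores timestamps, escaped quotes, and the rest of the exposition format. failed_jobs is a hypothetical convenience helper, not part of any library.

```python
import re

# One sample per line: metric_name{label="value",...} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')

def parse_metrics(text: str) -> dict:
    """Map (metric_name, frozenset of (label, value) pairs) -> float."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = SAMPLE_RE.match(line)
        if not m:
            continue  # lines this simple parser does not understand
        labels = frozenset(LABEL_RE.findall(m.group("labels") or ""))
        samples[(m.group("name"), labels)] = float(m.group("value"))
    return samples

def failed_jobs(samples: dict) -> list:
    """Sorted job labels whose backup_last_success is 0."""
    return sorted(dict(labels).get("job", "?")
                  for (name, labels), value in samples.items()
                  if name == "backup_last_success" and value == 0)

# Usage against the running exporter:
#   import urllib.request
#   text = urllib.request.urlopen("http://localhost:9101/metrics").read().decode()
#   print(failed_jobs(parse_metrics(text)))
```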
Common issues
| Symptom | Cause | Fix |
|---|---|---|
| Backup exporter shows no data | No backup status files exist | Run backup scripts with status reporting or create test status files |
| Prometheus can't scrape backup_exporter | Python script failed to start | Check sudo systemctl status backup_exporter and install python3 if missing |
| Grafana dashboard shows no data | Prometheus datasource not configured | Go to Grafana Settings → Data Sources and verify Prometheus URL is correct |
| Alerts not firing | Alerting rules or configuration syntax error | Run promtool check rules /etc/prometheus/backup_rules.yml and promtool check config /etc/prometheus/prometheus.yml |
| Email alerts not received | SMTP configuration incorrect | Verify SMTP settings in /etc/alertmanager/alertmanager.yml and test with local mail |
| Services fail to start after reboot | File permissions incorrect | Run sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus |
Next steps
- Configure advanced Prometheus alerting rules for system resource monitoring
- Monitor MySQL performance with Prometheus to complement backup monitoring
- Set up automated backup verification to ensure backup integrity
- Implement advanced Grafana alerting with Slack and Teams integration
- Configure long-term metrics storage for historical backup analysis
Automated install script
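This script automates the core of the setup on Debian/Ubuntu and RHEL-family systems: Prometheus, the basic backup alerting rules, and Grafana. The backup exporter, Node Exporter, Alertmanager, and the Grafana datasource and dashboard still come from the manual steps above.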
#!/usr/bin/env bash
set -euo pipefail
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
# Configuration
PROMETHEUS_VERSION="2.45.0"
GRAFANA_VERSION="10.2.0"
PROMETHEUS_USER="prometheus"
GRAFANA_USER="grafana"
# Usage message
usage() {
echo "Usage: $0 [OPTIONS]"
echo "Options:"
echo " -h, --help Show this help message"
echo " -v, --version Set Prometheus version (default: $PROMETHEUS_VERSION)"
exit 1
}
# Parse arguments
while [[ $# -gt 0 ]]; do
case $1 in
-h|--help)
usage
;;
-v|--version)
PROMETHEUS_VERSION="$2"
shift 2
;;
*)
echo -e "${RED}Unknown option: $1${NC}" >&2
usage
;;
esac
done
# Cleanup function for rollback
cleanup() {
echo -e "${YELLOW}[ERROR] Installation failed. Cleaning up...${NC}"
systemctl stop prometheus 2>/dev/null || true
systemctl stop grafana-server 2>/dev/null || true
userdel -r $PROMETHEUS_USER 2>/dev/null || true
userdel -r $GRAFANA_USER 2>/dev/null || true
rm -rf /etc/prometheus /var/lib/prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
echo -e "${RED}Cleanup completed${NC}"
}
trap cleanup ERR
# Check prerequisites
if [[ $EUID -ne 0 ]]; then
echo -e "${RED}This script must be run as root${NC}" >&2
exit 1
fi
# Auto-detect distribution
if [ -f /etc/os-release ]; then
. /etc/os-release
case "$ID" in
ubuntu|debian)
PKG_MGR="apt"
PKG_INSTALL="apt install -y"
PKG_UPDATE="apt update && apt upgrade -y"
FIREWALL_CMD="ufw"
;;
almalinux|rocky|centos|rhel|ol|fedora)
PKG_MGR="dnf"
PKG_INSTALL="dnf install -y"
PKG_UPDATE="dnf update -y"
FIREWALL_CMD="firewall-cmd"
;;
amzn)
PKG_MGR="yum"
PKG_INSTALL="yum install -y"
PKG_UPDATE="yum update -y"
FIREWALL_CMD="firewall-cmd"
;;
*)
echo -e "${RED}Unsupported distribution: $ID${NC}" >&2
exit 1
;;
esac
else
echo -e "${RED}Cannot detect distribution${NC}" >&2
exit 1
fi
echo -e "${GREEN}Starting backup monitoring setup for $ID...${NC}"
# Step 1: Update system packages
echo -e "${YELLOW}[1/8] Updating system packages...${NC}"
# eval is needed because the apt variant chains two commands with &&
eval "$PKG_UPDATE"
$PKG_INSTALL curl wget gpg tar
# Step 2: Install Prometheus
echo -e "${YELLOW}[2/8] Installing Prometheus...${NC}"
cd /tmp
wget -q "https://github.com/prometheus/prometheus/releases/download/v${PROMETHEUS_VERSION}/prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz"
tar xzf "prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz"
mv "prometheus-${PROMETHEUS_VERSION}.linux-amd64/prometheus" /usr/local/bin/
mv "prometheus-${PROMETHEUS_VERSION}.linux-amd64/promtool" /usr/local/bin/
chmod 755 /usr/local/bin/prometheus /usr/local/bin/promtool
# Step 3: Create Prometheus user and directories
echo -e "${YELLOW}[3/8] Creating Prometheus user and directories...${NC}"
groupadd --system $PROMETHEUS_USER || true
useradd -s /sbin/nologin --system -g $PROMETHEUS_USER $PROMETHEUS_USER || true
mkdir -p /etc/prometheus /var/lib/prometheus
chown -R $PROMETHEUS_USER:$PROMETHEUS_USER /etc/prometheus /var/lib/prometheus
chmod 755 /etc/prometheus /var/lib/prometheus
# Step 4: Configure Prometheus
echo -e "${YELLOW}[4/8] Configuring Prometheus...${NC}"
cat > /etc/prometheus/prometheus.yml << 'EOF'
global:
  scrape_interval: 30s
  evaluation_interval: 30s
rule_files:
  - "backup_rules.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'backup_exporter'
    # Keep the job labels set by the exporter instead of renaming them to exported_job
    honor_labels: true
    static_configs:
      - targets: ['localhost:9101']
    scrape_interval: 60s
    metrics_path: /metrics
EOF
# Step 5: Create backup alerting rules
echo -e "${YELLOW}[5/8] Creating backup alerting rules...${NC}"
cat > /etc/prometheus/backup_rules.yml << 'EOF'
groups:
  - name: backup_monitoring
    rules:
      - alert: BackupFailed
        expr: backup_last_success == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Backup failed for {{ $labels.job }} on {{ $labels.instance }}"
          description: "Backup job {{ $labels.job }} has failed on instance {{ $labels.instance }}"
      - alert: BackupTooLong
        expr: backup_duration_seconds > 7200
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Backup taking too long for {{ $labels.job }}"
          description: "The last run of backup job {{ $labels.job }} took {{ $value }} seconds"
      - alert: BackupMissing
        expr: up{job=~".*backup.*"} == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Backup exporter down for {{ $labels.job }}"
          description: "Backup exporter {{ $labels.job }} on {{ $labels.instance }} has been down for more than 10 minutes"
EOF
chown -R $PROMETHEUS_USER:$PROMETHEUS_USER /etc/prometheus/
chmod 644 /etc/prometheus/prometheus.yml /etc/prometheus/backup_rules.yml
# Step 6: Create Prometheus systemd service
echo -e "${YELLOW}[6/8] Creating Prometheus systemd service...${NC}"
cat > /etc/systemd/system/prometheus.service << EOF
[Unit]
Description=Prometheus Monitoring System
After=network.target
[Service]
Type=simple
User=$PROMETHEUS_USER
Group=$PROMETHEUS_USER
ExecStart=/usr/local/bin/prometheus \\
--config.file=/etc/prometheus/prometheus.yml \\
--storage.tsdb.path=/var/lib/prometheus/ \\
--web.console.templates=/etc/prometheus/consoles \\
--web.console.libraries=/etc/prometheus/console_libraries \\
--web.listen-address=0.0.0.0:9090 \\
--web.enable-lifecycle
Restart=always
RestartSec=3
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
EOF
# Step 7: Install Grafana
echo -e "${YELLOW}[7/8] Installing Grafana...${NC}"
if [[ "$PKG_MGR" == "apt" ]]; then
curl -s https://packages.grafana.com/gpg.key | gpg --dearmor -o /usr/share/keyrings/grafana-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/grafana-archive-keyring.gpg] https://packages.grafana.com/oss/deb stable main" > /etc/apt/sources.list.d/grafana.list
apt update
$PKG_INSTALL grafana
else
cat > /etc/yum.repos.d/grafana.repo << EOF
[grafana]
name=grafana
baseurl=https://packages.grafana.com/oss/rpm
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
EOF
$PKG_INSTALL grafana
fi
# Step 8: Configure firewall and start services
echo -e "${YELLOW}[8/8] Configuring firewall and starting services...${NC}"
systemctl daemon-reload
systemctl enable prometheus grafana-server
systemctl start prometheus grafana-server
# Configure firewall
if [[ "$FIREWALL_CMD" == "ufw" ]]; then
if command -v ufw >/dev/null 2>&1; then
ufw allow 9090/tcp comment "Prometheus"
ufw allow 3000/tcp comment "Grafana"
fi
elif [[ "$FIREWALL_CMD" == "firewall-cmd" ]]; then
if systemctl is-active --quiet firewalld; then
firewall-cmd --permanent --add-port=9090/tcp
firewall-cmd --permanent --add-port=3000/tcp
firewall-cmd --reload
fi
fi
# Final verification
echo -e "${YELLOW}Verifying installation...${NC}"
sleep 5
if systemctl is-active --quiet prometheus; then
echo -e "${GREEN}✓ Prometheus is running${NC}"
else
echo -e "${RED}✗ Prometheus failed to start${NC}"
exit 1
fi
if systemctl is-active --quiet grafana-server; then
echo -e "${GREEN}✓ Grafana is running${NC}"
else
echo -e "${RED}✗ Grafana failed to start${NC}"
exit 1
fi
# Check the HTTP status (-f) rather than grepping the response text, whose wording and case may differ
if curl -fs http://localhost:9090/-/ready > /dev/null; then
echo -e "${GREEN}✓ Prometheus is responding${NC}"
else
echo -e "${RED}✗ Prometheus is not responding${NC}"
exit 1
fi
# Disable trap
trap - ERR
echo -e "${GREEN}Installation completed successfully!${NC}"
echo -e "${GREEN}Access Grafana at: http://$(hostname -I | awk '{print $1}'):3000${NC}"
echo -e "${GREEN}Access Prometheus at: http://$(hostname -I | awk '{print $1}'):9090${NC}"
echo -e "${YELLOW}Default Grafana credentials: admin/admin${NC}"
Review the script before running it, then run it as root: sudo bash install.sh (add -v VERSION to install a different Prometheus release).