Set up comprehensive monitoring for scheduled tasks using Prometheus node_exporter and custom metrics collection. Configure Grafana dashboards and alerting rules to track job success, failures, and missed executions across your infrastructure.
Prerequisites
- Root or sudo access
- Basic familiarity with cron and systemd
- Prometheus and Grafana knowledge helpful
What this solves
Scheduled tasks like cron jobs and systemd timers are critical for system maintenance, backups, and automated workflows. When they fail silently, you might not notice until data is lost or systems break. This tutorial shows you how to monitor both cron jobs and systemd timers using Prometheus metrics collection and Grafana alerting, giving you visibility into job execution status, runtime duration, and failure patterns.
Step-by-step installation
Install Prometheus node_exporter
node_exporter provides the foundation for collecting system metrics, including systemd unit status. Download and install it. The commands below pin v1.7.0, so check the project's releases page for a newer version before copying them.
sudo apt update
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
sudo cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
sudo chown root:root /usr/local/bin/node_exporter
sudo chmod 755 /usr/local/bin/node_exporter
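Before copying a downloaded binary into /usr/local/bin, it can be worth comparing the tarball's SHA-256 digest with the value published for the release. The following is a minimal sketch rather than part of the original steps: `verify_sha256` is a hypothetical helper, and the expected digest is assumed to come from the checksum list attached to the GitHub release (confirm the file name on the releases page). The demo runs on a throwaway file so it works without a download.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare a file's SHA-256 digest with an expected value.
verify_sha256() {
    local file="$1" expected="$2" actual
    actual="$(sha256sum "$file" | awk '{print $1}')"
    [ "$actual" = "$expected" ]
}

# Demo on a scratch file; for a real check, pass the tarball path and the
# digest copied from the release's checksum list.
demo_file="$(mktemp)"
printf 'hello\n' > "$demo_file"
demo_sum="$(sha256sum "$demo_file" | awk '{print $1}')"

if verify_sha256 "$demo_file" "$demo_sum"; then
    echo "checksum OK"
else
    echo "checksum MISMATCH" >&2
fi
```

If the function returns non-zero, stop and download the archive again instead of installing it.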
Create node_exporter service user
Create a dedicated system user for running node_exporter securely without shell access.
sudo useradd --no-create-home --shell /bin/false node_exporter
Configure node_exporter systemd service
Create a systemd service file at /etc/systemd/system/node_exporter.service that enables the systemd collector and the textfile collector for custom metrics.
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter --collector.systemd --collector.textfile.directory=/var/lib/node_exporter/textfile_collector
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
Create textfile collector directory
Set up the directory where custom job metrics will be written. The node_exporter user needs read access to collect these files.
sudo mkdir -p /var/lib/node_exporter/textfile_collector
sudo chown node_exporter:node_exporter /var/lib/node_exporter/textfile_collector
sudo chmod 755 /var/lib/node_exporter/textfile_collector
Start and enable node_exporter
Start the node_exporter service and enable it to run on boot.
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
sudo systemctl status node_exporter
Create job monitoring script
Create a helper script at /usr/local/bin/job_monitor that your cron jobs and systemd timers can call to report their execution status to Prometheus.
#!/bin/bash
JOB_NAME="$1"
JOB_STATUS="$2"   # success, failed, or running
JOB_DURATION="$3" # optional duration in seconds
METRIC_FILE="/var/lib/node_exporter/textfile_collector/${JOB_NAME}.prom"
TIMESTAMP=$(date +%s)
if [ -z "$JOB_NAME" ] || [ -z "$JOB_STATUS" ]; then
    echo "Usage: $0 <job_name> <status> [duration_seconds]"
    exit 1
fi
# Create temporary file
TEMP_FILE="$(mktemp)"
# Write job status metric (any status other than "success", including "running", is reported as 0)
echo "# HELP job_last_status Last execution status of scheduled job (1=success, 0=failed)" >> "$TEMP_FILE"
echo "# TYPE job_last_status gauge" >> "$TEMP_FILE"
if [ "$JOB_STATUS" = "success" ]; then
    echo "job_last_status{job=\"$JOB_NAME\"} 1" >> "$TEMP_FILE"
else
    echo "job_last_status{job=\"$JOB_NAME\"} 0" >> "$TEMP_FILE"
fi
# Write timestamp metric
echo "# HELP job_last_run_timestamp Unix timestamp of last job execution" >> "$TEMP_FILE"
echo "# TYPE job_last_run_timestamp gauge" >> "$TEMP_FILE"
echo "job_last_run_timestamp{job=\"$JOB_NAME\"} $TIMESTAMP" >> "$TEMP_FILE"
# Write duration metric if provided
if [ -n "$JOB_DURATION" ]; then
    echo "# HELP job_duration_seconds Duration of last job execution in seconds" >> "$TEMP_FILE"
    echo "# TYPE job_duration_seconds gauge" >> "$TEMP_FILE"
    echo "job_duration_seconds{job=\"$JOB_NAME\"} $JOB_DURATION" >> "$TEMP_FILE"
fi
# Move the finished file into place (atomic only if the temp file is on the same filesystem)
sudo mv "$TEMP_FILE" "$METRIC_FILE"
sudo chown node_exporter:node_exporter "$METRIC_FILE"
sudo chmod 644 "$METRIC_FILE"
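To see what the textfile collector ends up reading, the same metric layout can be written to a scratch directory without root. This is a sketch under stated assumptions, not the script above: `write_job_metrics` and `COLLECTOR_DIR` are illustrative names, and the temporary file is created next to the target so that `mv` is a rename within one filesystem.

```shell
#!/usr/bin/env bash
# Sketch: same metric layout as job_monitor, written to a scratch directory
# so it can be tried without root. All names here are illustrative.
write_job_metrics() {
    local dir="$1" job="$2" status="$3" duration="${4:-}"
    local target="$dir/$job.prom" tmp value=0
    if [ "$status" = "success" ]; then value=1; fi
    # Temp file lives next to the target, so mv is a same-filesystem rename.
    # Its name does not end in .prom, so the collector ignores it meanwhile.
    tmp="$(mktemp "$target.XXXXXX")" || return 1
    {
        echo "# HELP job_last_status Last execution status of scheduled job (1=success, 0=failed)"
        echo "# TYPE job_last_status gauge"
        echo "job_last_status{job=\"$job\"} $value"
        echo "# HELP job_last_run_timestamp Unix timestamp of last job execution"
        echo "# TYPE job_last_run_timestamp gauge"
        echo "job_last_run_timestamp{job=\"$job\"} $(date +%s)"
        if [ -n "$duration" ]; then
            echo "# HELP job_duration_seconds Duration of last job execution in seconds"
            echo "# TYPE job_duration_seconds gauge"
            echo "job_duration_seconds{job=\"$job\"} $duration"
        fi
    } > "$tmp"
    chmod 644 "$tmp"
    mv "$tmp" "$target"
}

COLLECTOR_DIR="$(mktemp -d)"   # stand-in for the textfile collector directory
write_job_metrics "$COLLECTOR_DIR" "demo_job" "success" 42
cat "$COLLECTOR_DIR/demo_job.prom"
```

Each job gets its own file, so one job's report never overwrites another's.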
Make job monitoring script executable
Set proper permissions on the monitoring script so it can be executed by cron jobs and systemd services.
sudo chmod 755 /usr/local/bin/job_monitor
Create wrapper script for cron jobs
Create a wrapper script at /usr/local/bin/cron_wrapper that measures execution time and reports job status automatically.
#!/bin/bash
JOB_NAME="$1"
shift
COMMAND="$*"
if [ -z "$JOB_NAME" ] || [ -z "$COMMAND" ]; then
    echo "Usage: $0 <job_name> <command>"
    exit 1
fi
# Record start time
START_TIME=$(date +%s)
# Mark job as running
/usr/local/bin/job_monitor "$JOB_NAME" "running"
# Execute the command and capture exit code
eval "$COMMAND"
EXIT_CODE=$?
# Calculate duration
END_TIME=$(date +%s)
DURATION=$((END_TIME - START_TIME))
# Report final status
if [ $EXIT_CODE -eq 0 ]; then
    /usr/local/bin/job_monitor "$JOB_NAME" "success" "$DURATION"
else
    /usr/local/bin/job_monitor "$JOB_NAME" "failed" "$DURATION"
fi
exit $EXIT_CODE
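Because the wrapper joins its arguments into one string and passes it to `eval`, the shell parses the command a second time, so arguments containing spaces or quotes can be split differently than intended. One possible alternative, sketched below, runs the argument vector directly. `run_timed`, `RUN_EXIT_CODE`, and `RUN_DURATION` are hypothetical names, and the sketch only prints its result instead of calling job_monitor.

```shell
#!/usr/bin/env bash
# Sketch: run a command from its argument vector (no eval) and capture the
# exit code and wall-clock duration in two global variables.
run_timed() {
    local start end
    start=$(date +%s)
    if "$@"; then
        RUN_EXIT_CODE=0
    else
        RUN_EXIT_CODE=$?
    fi
    end=$(date +%s)
    RUN_DURATION=$((end - start))
}

run_timed sh -c 'exit 3'
echo "exit=$RUN_EXIT_CODE duration=${RUN_DURATION}s"

run_timed printf '%s\n' "an argument with spaces stays one argument"
echo "exit=$RUN_EXIT_CODE duration=${RUN_DURATION}s"
```

The trade-off is that shell syntax such as pipes or redirections must then be wrapped explicitly, for example with `sh -c '...'`.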
Make cron wrapper executable
Set execute permissions on the cron wrapper script.
sudo chmod 755 /usr/local/bin/cron_wrapper
Create example monitored cron job
Add a sample cron job that demonstrates the monitoring setup. Replace with your actual backup or maintenance tasks.
crontab -e
Add this line to monitor a daily backup job:
# Daily backup with monitoring
0 2 * * * /usr/local/bin/cron_wrapper "daily_backup" "/usr/local/bin/backup_script.sh"
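A crontab entry needs five schedule fields (minute, hour, day of month, month, day of week) before the command, and a missing field shifts everything after it. The sketch below labels the fields of an entry so a mistake is easy to spot. `describe_cron_line` is a hypothetical helper, and it only accepts numeric and wildcard fields, not names such as `mon` or `@daily`.

```shell
#!/usr/bin/env bash
# Sketch: label the five schedule fields of a crontab entry.
describe_cron_line() {
    local minute hour dom month dow command field
    local re='^[0-9*,/-]+$'
    # read splits on whitespace; the rest of the line lands in "command".
    read -r minute hour dom month dow command <<< "$1"
    if [ -z "$command" ]; then
        echo "error: expected 5 schedule fields followed by a command" >&2
        return 1
    fi
    for field in "$minute" "$hour" "$dom" "$month" "$dow"; do
        if ! [[ "$field" =~ $re ]]; then
            echo "error: '$field' does not look like a schedule field" >&2
            return 1
        fi
    done
    printf 'minute=%s hour=%s day_of_month=%s month=%s day_of_week=%s\n' \
        "$minute" "$hour" "$dom" "$month" "$dow"
    printf 'command=%s\n' "$command"
}

describe_cron_line '0 2 * * * /usr/local/bin/cron_wrapper "daily_backup" "/usr/local/bin/backup_script.sh"'
```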
Create monitored systemd timer service
Create a systemd service at /etc/systemd/system/log-cleanup.service that uses the monitoring script. This example monitors log cleanup.
[Unit]
Description=Clean old log files
[Service]
Type=oneshot
ExecStartPre=/usr/local/bin/job_monitor "log_cleanup" "running"
ExecStart=/bin/bash -c 'find /var/log -name "*.log.gz" -mtime +30 -delete'
ExecStopPost=/bin/bash -c 'if [ "$SERVICE_RESULT" = "success" ]; then /usr/local/bin/job_monitor "log_cleanup" "success"; else /usr/local/bin/job_monitor "log_cleanup" "failed"; fi'
User=root
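systemd passes the outcome of a service to `ExecStop=` and `ExecStopPost=` commands through the `$SERVICE_RESULT`, `$EXIT_CODE`, and `$EXIT_STATUS` environment variables. `ExecStartPost=` commands do not receive them and only run when the main command succeeded. A small helper can translate that outcome into the status strings job_monitor expects. The sketch below assumes a hypothetical function name, and treats every result other than `success` as a failure.

```shell
#!/usr/bin/env bash
# Sketch: translate systemd's SERVICE_RESULT into the status strings used
# by job_monitor. The function name is hypothetical.
status_from_service_result() {
    case "${1:-}" in
        success) echo "success" ;;
        *)       echo "failed" ;;
    esac
}

# Under systemd the variable is set for ExecStopPost= commands. Default to
# "unknown" here so the sketch also runs from an interactive shell.
result="${SERVICE_RESULT:-unknown}"
echo "SERVICE_RESULT=$result -> $(status_from_service_result "$result")"
```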
Create systemd timer for log cleanup
Create the timer at /etc/systemd/system/log-cleanup.timer that schedules the log cleanup service to run weekly.
[Unit]
Description=Run log cleanup weekly
[Timer]
Unit=log-cleanup.service
OnCalendar=weekly
Persistent=true
[Install]
WantedBy=timers.target
Enable systemd timer
Enable and start the systemd timer so it runs according to schedule.
sudo systemctl daemon-reload
sudo systemctl enable --now log-cleanup.timer
sudo systemctl list-timers log-cleanup.timer
Configure Prometheus server
Add your monitored servers to Prometheus configuration. This assumes you have Prometheus installed and running.
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'node-jobs'
    static_configs:
      - targets: ['localhost:9100', '203.0.113.10:9100']
    scrape_interval: 30s
    metrics_path: /metrics
Create Prometheus alerting rules
Define alerts for failed jobs, missing jobs, and long-running tasks.
# job_monitor writes a "job" label, which collides with the job label Prometheus
# attaches at scrape time. With the default honor_labels: false the scraped label
# is renamed to exported_job, so the annotations below use that name.
groups:
  - name: job_monitoring
    rules:
      - alert: JobFailed
        expr: job_last_status == 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Job {{ $labels.exported_job }} failed"
          description: "Job {{ $labels.exported_job }} on {{ $labels.instance }} has failed"
      - alert: JobMissing
        expr: (time() - job_last_run_timestamp) > 86400
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Job {{ $labels.exported_job }} hasn't run in 24 hours"
          description: "Job {{ $labels.exported_job }} on {{ $labels.instance }} last ran {{ $value | humanizeDuration }} ago"
      - alert: JobRunningTooLong
        expr: job_duration_seconds > 3600
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Job {{ $labels.exported_job }} took longer than 1 hour"
          description: "Job {{ $labels.exported_job }} on {{ $labels.instance }} took {{ $value | humanizeDuration }} to complete"
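The JobMissing expression is plain arithmetic: the current time minus the recorded timestamp, compared with a threshold of 86400 seconds. When an alert behaves unexpectedly, the same check can be run locally against the .prom files. The sketch below assumes a hypothetical helper named `stale_jobs` and uses a fixed "now" value so the demo output is reproducible.

```shell
#!/usr/bin/env bash
# Sketch: list jobs whose last run is older than a threshold, reading the
# same .prom files node_exporter scrapes. This mirrors
# (time() - job_last_run_timestamp) > 86400, evaluated locally.
stale_jobs() {
    local dir="$1" max_age="$2" now="${3:-$(date +%s)}"
    cat "$dir"/*.prom 2>/dev/null | awk -v now="$now" -v max="$max_age" '
        /^job_last_run_timestamp[{]/ {
            split($0, parts, "\"")      # parts[2] is the job label value
            age = now - $NF             # $NF is the recorded timestamp
            if (age > max) print parts[2], age
        }'
}

demo_dir="$(mktemp -d)"
now=1700000000
echo "job_last_run_timestamp{job=\"fresh\"} $((now - 3600))"  > "$demo_dir/fresh.prom"
echo "job_last_run_timestamp{job=\"stale\"} $((now - 90000))" > "$demo_dir/stale.prom"

# Prints the job name and its age in seconds for every job past the threshold.
stale_jobs "$demo_dir" 86400 "$now"
```

A daily threshold suits daily jobs. A weekly timer such as the log cleanup example would need a larger value, or it would alert between runs.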
Install and configure Grafana
Install Grafana for visualizing job metrics and creating dashboards.
sudo apt install -y software-properties-common
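sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list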
sudo apt update
sudo apt install -y grafana
Start Grafana service
Enable and start Grafana to access the web interface on port 3000.
sudo systemctl enable --now grafana-server
sudo systemctl status grafana-server
Configure Grafana dashboard
Create a dashboard JSON that visualizes job execution status, success rates, and execution times. Save this as a dashboard import.
{
  "dashboard": {
    "id": null,
    "title": "Job Monitoring",
    "tags": ["jobs", "monitoring"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Job Success Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "avg(job_last_status) * 100",
            "legendFormat": "Success Rate %"
          }
        ],
        "gridPos": {"h": 4, "w": 6, "x": 0, "y": 0}
      },
      {
        "title": "Job Execution Times",
        "type": "timeseries",
        "targets": [
          {
            "expr": "job_duration_seconds",
            "legendFormat": "{{ exported_job }}"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 4}
      },
      {
        "title": "Failed Jobs",
        "type": "table",
        "targets": [
          {
            "expr": "job_last_status == 0",
            "format": "table"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 4}
      }
    ],
    "time": {
      "from": "now-24h",
      "to": "now"
    },
    "refresh": "30s"
  }
}
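The "Job Success Rate" panel evaluates `avg(job_last_status) * 100`, that is, the share of jobs whose most recent run succeeded. To sanity-check the number Grafana shows, the same average can be computed from the .prom files on one host. This is a sketch with a hypothetical helper name (`success_rate`). It only sees the local host, whereas the panel averages over every scraped instance.

```shell
#!/usr/bin/env bash
# Sketch: compute the percentage of jobs whose last status is 1, from the
# .prom files in one directory (local equivalent of avg(job_last_status) * 100).
success_rate() {
    local dir="$1"
    cat "$dir"/*.prom 2>/dev/null | LC_ALL=C awk '
        /^job_last_status[{]/ { total++; ok += $NF }
        END {
            if (total == 0) { print "no jobs"; exit }
            printf "%.1f\n", ok * 100 / total
        }'
}

demo_dir="$(mktemp -d)"
for name in backup rotate sync; do
    echo "job_last_status{job=\"$name\"} 1" > "$demo_dir/$name.prom"
done
echo 'job_last_status{job="cleanup"} 0' > "$demo_dir/cleanup.prom"

success_rate "$demo_dir"   # three of four jobs succeeded
```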
Verify your setup
Check that all components are running and collecting metrics properly.
# Verify node_exporter is running
sudo systemctl status node_exporter
# Check metrics are being collected
curl -s http://localhost:9100/metrics | grep job_
# Verify systemd timer is active
sudo systemctl list-timers log-cleanup.timer
# Test job monitoring manually
/usr/local/bin/job_monitor "test_job" "success" "30"
cat /var/lib/node_exporter/textfile_collector/test_job.prom
# Check Grafana is accessible
curl -I http://localhost:3000
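If a job's metrics never show up on the /metrics endpoint, a malformed line in its .prom file is one possible cause, since the textfile collector cannot parse such a file. The sketch below performs a rough format check. `check_prom_file` is a hypothetical helper and its pattern is deliberately simplified: it accepts `name{labels} number` lines and comments, and does not cover every value the exposition format allows, such as NaN or trailing timestamps.

```shell
#!/usr/bin/env bash
# Sketch: flag lines in a .prom file that are neither comments, blank lines,
# nor simple "metric{labels} value" samples.
check_prom_file() {
    local file="$1" bad
    bad="$(grep -Ev '^(#.*|[[:space:]]*)$' "$file" \
        | grep -Ev '^[a-zA-Z_:][a-zA-Z0-9_:]*([{][^}]*[}])? [-+]?[0-9][0-9.eE+-]*$' \
        || true)"
    if [ -n "$bad" ]; then
        printf 'unexpected line in %s: %s\n' "$file" "$bad" >&2
        return 1
    fi
}

sample="$(mktemp)"
cat > "$sample" <<'EOF'
# HELP job_last_status Last execution status of scheduled job (1=success, 0=failed)
# TYPE job_last_status gauge
job_last_status{job="test_job"} 1
job_last_run_timestamp{job="test_job"} 1700000000
EOF

if check_prom_file "$sample"; then
    echo "format OK"
fi
```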
Configure alerting
Set up Alertmanager
Install and configure Alertmanager to handle alert notifications from Prometheus rules.
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
tar xvfz alertmanager-0.26.0.linux-amd64.tar.gz
sudo cp alertmanager-0.26.0.linux-amd64/alertmanager /usr/local/bin/
sudo chmod 755 /usr/local/bin/alertmanager
Configure email notifications
Set up Alertmanager to send email notifications when jobs fail or go missing.
# email_configs has no subject or body keys: the subject is set through
# headers and the plain-text body through text. The scheduled job's name is
# in exported_job because Prometheus renames the scraped job label.
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'your-smtp-password'
route:
  group_by: ['alertname', 'exported_job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'job-alerts'
receivers:
  - name: 'job-alerts'
    email_configs:
      - to: 'sysadmin@example.com'
        headers:
          Subject: 'Job Alert: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Job: {{ .Labels.exported_job }}
          Instance: {{ .Labels.instance }}
          {{ end }}
Common issues
| Symptom | Cause | Fix |
|---|---|---|
| Metrics not appearing | Node_exporter can't read textfile directory | Check permissions: sudo chown -R node_exporter:node_exporter /var/lib/node_exporter |
| Job status always shows 0 | Monitoring script failing silently | Test manually: /usr/local/bin/job_monitor "test" "success" |
| Systemd timer not firing | Timer not enabled or service has errors | sudo systemctl enable log-cleanup.timer and check journalctl -u log-cleanup.service |
| Alerts not sending | Alertmanager configuration or SMTP issues | Check Alertmanager logs: journalctl -u alertmanager |
| Permission denied on metric files | Wrong ownership on textfile collector directory | sudo chown node_exporter:node_exporter /var/lib/node_exporter/textfile_collector/*.prom |
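The permission problems in the table can be checked in one pass. The sketch below reports metric files that are not world-readable, which is the mode (644) the monitoring script sets. `check_collector_perms` is a hypothetical helper, `stat -c` assumes GNU coreutils, and a file that fails this check is still readable if node_exporter owns it.

```shell
#!/usr/bin/env bash
# Sketch: report .prom files in a directory that lack the world-read bit.
check_collector_perms() {
    local dir="$1" f mode status=0
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir"
        return 1
    fi
    for f in "$dir"/*.prom; do
        [ -e "$f" ] || continue
        mode="$(stat -c '%a' "$f")"
        # The last octal digit holds the "other" bits; 4 is the read bit.
        if (( (8#$mode & 4) == 0 )); then
            echo "not world-readable: $f (mode $mode)"
            status=1
        fi
    done
    return "$status"
}

demo_dir="$(mktemp -d)"
touch "$demo_dir/ok.prom" "$demo_dir/locked.prom"
chmod 644 "$demo_dir/ok.prom"
chmod 600 "$demo_dir/locked.prom"

# Reports locked.prom only; the "|| echo" keeps the demo going afterwards.
check_collector_perms "$demo_dir" || echo "fix the files listed above"
```

On a real host the argument would be /var/lib/node_exporter/textfile_collector.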
Next steps
- Configure Prometheus long-term storage with Thanos for historical job data retention
- Set up Prometheus and Grafana monitoring stack with Docker compose for containerized deployments
- Configure advanced Grafana dashboards and alerting with custom visualizations
- Monitor system backup jobs with Prometheus alerts for critical data protection tasks
- Implement Prometheus multi-cluster federation for monitoring jobs across multiple servers
Automated install script
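Run this to automate the node_exporter and job monitoring script setup described above. It does not install Prometheus, Grafana, or Alertmanager.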
#!/usr/bin/env bash
set -euo pipefail
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
# Configuration
NODE_EXPORTER_VERSION="1.7.0"
NODE_EXPORTER_USER="node_exporter"
TEXTFILE_DIR="/var/lib/node_exporter/textfile_collector"
MONITOR_SCRIPT_PATH="/usr/local/bin/job_monitor"
# Cleanup function
cleanup() {
    echo -e "${RED}[ERROR] Installation failed. Cleaning up...${NC}"
    systemctl stop node_exporter 2>/dev/null || true
    systemctl disable node_exporter 2>/dev/null || true
    rm -f /etc/systemd/system/node_exporter.service
    rm -f /usr/local/bin/node_exporter
    rm -f "$MONITOR_SCRIPT_PATH"
    rm -rf /var/lib/node_exporter
    userdel "$NODE_EXPORTER_USER" 2>/dev/null || true
    systemctl daemon-reload
}
trap cleanup ERR
# Check if running as root or with sudo
check_privileges() {
    if [[ $EUID -ne 0 ]]; then
        echo -e "${RED}This script must be run as root or with sudo${NC}"
        exit 1
    fi
}
# Auto-detect distribution
detect_distro() {
    if [ -f /etc/os-release ]; then
        . /etc/os-release
        case "$ID" in
            ubuntu|debian)
                PKG_MGR="apt"
                PKG_UPDATE="apt update"
                PKG_INSTALL="apt install -y"
                ;;
            almalinux|rocky|centos|rhel|ol|fedora)
                PKG_MGR="dnf"
                PKG_UPDATE="dnf makecache"
                PKG_INSTALL="dnf install -y"
                # Try yum if dnf is not available
                if ! command -v dnf &> /dev/null; then
                    PKG_MGR="yum"
                    PKG_UPDATE="yum makecache"
                    PKG_INSTALL="yum install -y"
                fi
                ;;
            amzn)
                PKG_MGR="yum"
                PKG_UPDATE="yum makecache"
                PKG_INSTALL="yum install -y"
                ;;
            *)
                echo -e "${RED}Unsupported distribution: $ID${NC}"
                exit 1
                ;;
        esac
        echo -e "${BLUE}Detected distribution: $PRETTY_NAME${NC}"
    else
        echo -e "${RED}Cannot detect distribution. /etc/os-release not found.${NC}"
        exit 1
    fi
}
# Update package repositories
update_packages() {
    echo -e "${BLUE}[1/8] Updating package repositories...${NC}"
    $PKG_UPDATE
}
# Install required packages
install_dependencies() {
    echo -e "${BLUE}[2/8] Installing dependencies...${NC}"
    $PKG_INSTALL wget tar
}
# Download and install node_exporter
install_node_exporter() {
    echo -e "${BLUE}[3/8] Downloading and installing node_exporter...${NC}"
    cd /tmp
    wget -q "https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz"
    tar xzf "node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz"
    cp "node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64/node_exporter" /usr/local/bin/
    chown root:root /usr/local/bin/node_exporter
    chmod 755 /usr/local/bin/node_exporter
    # Clean up
    rm -rf "node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64"*
}
# Create node_exporter system user
create_user() {
    echo -e "${BLUE}[4/8] Creating node_exporter system user...${NC}"
    if ! id "$NODE_EXPORTER_USER" &>/dev/null; then
        useradd --no-create-home --shell /bin/false --system "$NODE_EXPORTER_USER"
    fi
}
# Create textfile collector directory
create_directories() {
    echo -e "${BLUE}[5/8] Creating directories...${NC}"
    mkdir -p "$TEXTFILE_DIR"
    chown "$NODE_EXPORTER_USER:$NODE_EXPORTER_USER" "$TEXTFILE_DIR"
    chmod 755 "$TEXTFILE_DIR"
    # Create parent directory with proper permissions
    chown "$NODE_EXPORTER_USER:$NODE_EXPORTER_USER" "$(dirname "$TEXTFILE_DIR")"
}
# Create systemd service
create_service() {
    echo -e "${BLUE}[6/8] Creating systemd service...${NC}"
    cat > /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter --collector.systemd --collector.textfile.directory=/var/lib/node_exporter/textfile_collector
Restart=always
RestartSec=3
NoNewPrivileges=yes
PrivateTmp=yes
ProtectSystem=strict
ProtectHome=yes
ReadWritePaths=/var/lib/node_exporter
[Install]
WantedBy=multi-user.target
EOF
    systemctl daemon-reload
    systemctl enable node_exporter
    systemctl start node_exporter
}
# Create job monitoring script
create_monitor_script() {
    echo -e "${BLUE}[7/8] Creating job monitoring script...${NC}"
    cat > "$MONITOR_SCRIPT_PATH" << 'EOF'
#!/bin/bash
JOB_NAME="$1"
JOB_STATUS="$2"   # success, failed, or running
JOB_DURATION="$3" # optional duration in seconds
METRIC_FILE="/var/lib/node_exporter/textfile_collector/${JOB_NAME}.prom"
TIMESTAMP=$(date +%s)
if [ -z "$JOB_NAME" ] || [ -z "$JOB_STATUS" ]; then
    echo "Usage: $0 <job_name> <status> [duration_seconds]"
    echo "Status: success, failed, or running"
    exit 1
fi
# Create temporary file
TEMP_FILE="$(mktemp)"
# Write job status metric
echo "# HELP job_last_status Last execution status of scheduled job (1=success, 0=failed)" >> "$TEMP_FILE"
echo "# TYPE job_last_status gauge" >> "$TEMP_FILE"
if [ "$JOB_STATUS" = "success" ]; then
    echo "job_last_status{job=\"$JOB_NAME\"} 1" >> "$TEMP_FILE"
else
    echo "job_last_status{job=\"$JOB_NAME\"} 0" >> "$TEMP_FILE"
fi
# Write timestamp metric
echo "# HELP job_last_run_timestamp Unix timestamp of last job execution" >> "$TEMP_FILE"
echo "# TYPE job_last_run_timestamp gauge" >> "$TEMP_FILE"
echo "job_last_run_timestamp{job=\"$JOB_NAME\"} $TIMESTAMP" >> "$TEMP_FILE"
# Write duration metric if provided
if [ -n "$JOB_DURATION" ]; then
    echo "# HELP job_duration_seconds Duration of last job execution in seconds" >> "$TEMP_FILE"
    echo "# TYPE job_duration_seconds gauge" >> "$TEMP_FILE"
    echo "job_duration_seconds{job=\"$JOB_NAME\"} $JOB_DURATION" >> "$TEMP_FILE"
fi
# Atomically move file to final location
mv "$TEMP_FILE" "$METRIC_FILE"
chmod 644 "$METRIC_FILE"
EOF
    chmod 755 "$MONITOR_SCRIPT_PATH"
}
# Verify installation
verify_installation() {
    echo -e "${BLUE}[8/8] Verifying installation...${NC}"
    # Check if node_exporter is running
    if systemctl is-active --quiet node_exporter; then
        echo -e "${GREEN}✓ node_exporter service is running${NC}"
    else
        echo -e "${RED}✗ node_exporter service is not running${NC}"
        exit 1
    fi
    # Check if metrics endpoint is accessible.
    # wget is used because the script installs it; grep reads the whole response
    # (no -q) so the pipeline does not fail under pipefail when grep exits early.
    sleep 2
    if wget -qO- http://localhost:9100/metrics | grep "node_exporter_build_info" > /dev/null; then
        echo -e "${GREEN}✓ node_exporter metrics endpoint is accessible${NC}"
    else
        echo -e "${RED}✗ node_exporter metrics endpoint is not accessible${NC}"
        exit 1
    fi
    # Check if monitor script exists and is executable
    if [[ -x "$MONITOR_SCRIPT_PATH" ]]; then
        echo -e "${GREEN}✓ Job monitoring script is installed${NC}"
    else
        echo -e "${RED}✗ Job monitoring script is not properly installed${NC}"
        exit 1
    fi
    # Test monitor script
    if "$MONITOR_SCRIPT_PATH" test_job success 10; then
        echo -e "${GREEN}✓ Job monitoring script is working${NC}"
    else
        echo -e "${RED}✗ Job monitoring script test failed${NC}"
        exit 1
    fi
    echo -e "${GREEN}Installation completed successfully!${NC}"
    echo -e "${YELLOW}Usage examples:${NC}"
    echo " # Report successful job:"
    echo " $MONITOR_SCRIPT_PATH backup_job success 120"
    echo " # Report failed job:"
    echo " $MONITOR_SCRIPT_PATH backup_job failed"
    echo " # Add to crontab:"
    echo " 0 2 * * * /path/to/backup.sh && $MONITOR_SCRIPT_PATH backup success || $MONITOR_SCRIPT_PATH backup failed"
    echo -e "${YELLOW}node_exporter is running on http://localhost:9100/metrics${NC}"
}
# Main execution
main() {
    echo -e "${GREEN}Prometheus Job Monitoring Setup${NC}"
    echo "=================================="
    check_privileges
    detect_distro
    update_packages
    install_dependencies
    install_node_exporter
    create_user
    create_directories
    create_service
    create_monitor_script
    verify_installation
}
main "$@"
Review the script before running. It must run as root, so execute it with: sudo bash install.sh