Monitor cron jobs and systemd timers with Prometheus and Grafana alerting

Intermediate 45 min May 13, 2026
Ubuntu 24.04 Debian 12 AlmaLinux 9 Rocky Linux 9

Set up comprehensive monitoring for scheduled tasks using Prometheus node_exporter and custom metrics collection. Configure Grafana dashboards and alerting rules to track job success, failures, and missed executions across your infrastructure.

Prerequisites

  • Root or sudo access
  • Basic familiarity with cron and systemd
  • Prometheus and Grafana knowledge helpful

What this solves

Scheduled tasks like cron jobs and systemd timers are critical for system maintenance, backups, and automated workflows. When they fail silently, you might not notice until data is lost or systems break. This tutorial shows you how to monitor both cron jobs and systemd timers using Prometheus metrics collection and Grafana alerting, giving you visibility into job execution status, runtime duration, and failure patterns.

Step-by-step installation

Install Prometheus node_exporter

node_exporter is the foundation for collecting system metrics, including systemd unit status. Download and install the binary. The commands below pin version 1.7.0, so check the releases page for a newer version before copying them.

# Ubuntu / Debian: refresh package lists
sudo apt update

# AlmaLinux / Rocky Linux: update packages
sudo dnf update -y

# All distributions: download, unpack, and install the binary
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
sudo cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
sudo chown root:root /usr/local/bin/node_exporter
sudo chmod 755 /usr/local/bin/node_exporter
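Before copying the binary into place, it may be worth checking the tarball against the checksum list published with the release. The sketch below assumes a sha256sums.txt file in the usual "hash, two spaces, filename" format has been downloaded from the same release page. The function name verify_tarball is illustrative and not part of node_exporter.

```shell
# Sketch: check a downloaded tarball against a sha256sums.txt-style list.
# verify_tarball is a hypothetical helper name.
verify_tarball() {
    tarball="$1"    # path to the downloaded archive
    sums="$2"       # path to the checksum list
    dir=$(dirname "$tarball")
    name=$(basename "$tarball")
    # Pick the checksum line whose filename column matches the archive.
    line=$(awk -v n="$name" '$2 == n { print; exit }' "$sums")
    [ -n "$line" ] || return 1
    # sha256sum -c resolves filenames relative to the current directory.
    (cd "$dir" && printf '%s\n' "$line" | sha256sum -c - >/dev/null 2>&1)
}
```

A call such as verify_tarball node_exporter-1.7.0.linux-amd64.tar.gz sha256sums.txt returns 0 only when the hash matches, so it can gate the copy step with &&.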

Create node_exporter service user

Create a dedicated system user for running node_exporter securely without shell access.

sudo useradd --system --no-create-home --shell /bin/false node_exporter

Configure node_exporter systemd service

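Create a systemd service file, for example /etc/systemd/system/node_exporter.service. It enables the systemd collector and points the textfile collector at the directory that will hold the custom job metrics.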

[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter --collector.systemd --collector.textfile.directory=/var/lib/node_exporter/textfile_collector
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target

Create textfile collector directory

Set up the directory where custom job metrics will be written. The node_exporter user needs read access to collect these files.

sudo mkdir -p /var/lib/node_exporter/textfile_collector
sudo chown node_exporter:node_exporter /var/lib/node_exporter/textfile_collector
sudo chmod 755 /var/lib/node_exporter/textfile_collector

Start and enable node_exporter

Start the node_exporter service and enable it to run on boot.

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
sudo systemctl status node_exporter

Create job monitoring script

Create a helper script that your cron jobs and systemd timers can use to report their execution status to Prometheus. Save it as /usr/local/bin/job_monitor.

#!/bin/bash

JOB_NAME="$1"
JOB_STATUS="$2"    # success, failed, or running
JOB_DURATION="$3"  # optional duration in seconds
METRIC_FILE="/var/lib/node_exporter/textfile_collector/${JOB_NAME}.prom"
TIMESTAMP=$(date +%s)

if [ -z "$JOB_NAME" ] || [ -z "$JOB_STATUS" ]; then
    echo "Usage: $0 <job_name> <status> [duration_seconds]"
    exit 1
fi

# Create temporary file
TEMP_FILE="$(mktemp)"

# Write job status metric.
# Note: any status other than "success" (including "running") is written as 0.
echo "# HELP job_last_status Last execution status of scheduled job (1=success, 0=failed)" >> "$TEMP_FILE"
echo "# TYPE job_last_status gauge" >> "$TEMP_FILE"
if [ "$JOB_STATUS" = "success" ]; then
    echo "job_last_status{job=\"$JOB_NAME\"} 1" >> "$TEMP_FILE"
else
    echo "job_last_status{job=\"$JOB_NAME\"} 0" >> "$TEMP_FILE"
fi

# Write timestamp metric
echo "# HELP job_last_run_timestamp Unix timestamp of last job execution" >> "$TEMP_FILE"
echo "# TYPE job_last_run_timestamp gauge" >> "$TEMP_FILE"
echo "job_last_run_timestamp{job=\"$JOB_NAME\"} $TIMESTAMP" >> "$TEMP_FILE"

# Write duration metric if provided
if [ -n "$JOB_DURATION" ]; then
    echo "# HELP job_duration_seconds Duration of last job execution in seconds" >> "$TEMP_FILE"
    echo "# TYPE job_duration_seconds gauge" >> "$TEMP_FILE"
    echo "job_duration_seconds{job=\"$JOB_NAME\"} $JOB_DURATION" >> "$TEMP_FILE"
fi

# Set ownership and mode first so the file is readable as soon as it appears,
# then move it into place. The move is an atomic rename only when the temp
# directory and the collector directory are on the same filesystem.
sudo chown node_exporter:node_exporter "$TEMP_FILE"
sudo chmod 644 "$TEMP_FILE"
sudo mv "$TEMP_FILE" "$METRIC_FILE"
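A rename is only atomic when source and destination sit on the same filesystem, and the script above stages its temporary file wherever mktemp defaults to (usually /tmp). One option is to stage the file inside the collector directory itself: the textfile collector only reads files ending in .prom, so a staging file with a different name is ignored until it is renamed. A minimal sketch, assuming the caller may write to the directory; write_metric_file and its arguments are illustrative names, not part of the script above.

```shell
# Sketch: publish a metric file with a same-directory rename.
# write_metric_file is a hypothetical helper name.
write_metric_file() {
    dir="$1"        # textfile collector directory
    name="$2"       # job name, used as the file name
    content="$3"    # metric lines in Prometheus text format
    # Hidden staging file without the .prom suffix, so it is never scraped.
    tmp=$(mktemp "$dir/.${name}.XXXXXX") || return 1
    printf '%s\n' "$content" > "$tmp" || { rm -f "$tmp"; return 1; }
    chmod 644 "$tmp"
    # Same directory means same filesystem, so this is a single rename.
    mv "$tmp" "$dir/${name}.prom"
}
```

Run as root or as the node_exporter user, this avoids the window in which a reader could see a half-written file.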

Make job monitoring script executable

Set proper permissions on the monitoring script so it can be executed by cron jobs and systemd services.

sudo chmod 755 /usr/local/bin/job_monitor

Create wrapper script for cron jobs

Create a wrapper script that measures execution time and reports job status automatically. Save it as /usr/local/bin/cron_wrapper.

#!/bin/bash

JOB_NAME="$1"
shift
COMMAND="$*"

if [ -z "$JOB_NAME" ] || [ -z "$COMMAND" ]; then
    echo "Usage: $0 <job_name> <command>"
    exit 1
fi

# Record start time
START_TIME=$(date +%s)

# Mark job as running
/usr/local/bin/job_monitor "$JOB_NAME" "running"

# Execute the command and capture exit code
eval "$COMMAND"
EXIT_CODE=$?

# Calculate duration
END_TIME=$(date +%s)
DURATION=$((END_TIME - START_TIME))

# Report final status
if [ $EXIT_CODE -eq 0 ]; then
    /usr/local/bin/job_monitor "$JOB_NAME" "success" "$DURATION"
else
    /usr/local/bin/job_monitor "$JOB_NAME" "failed" "$DURATION"
fi

exit $EXIT_CODE
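The core of the wrapper is small: time the command, keep its exit code, and map that code to a status. The sketch below isolates that logic in a function so it can be tried without touching any metric files. It runs the command through "$@" rather than eval, so every argument is passed through unchanged; this is an alternative to the wrapper above, not a copy of it. The names run_timed, RUN_STATUS and RUN_DURATION are illustrative.

```shell
# Sketch: time a command and map its exit code to a status string.
# run_timed, RUN_STATUS and RUN_DURATION are hypothetical names.
run_timed() {
    start=$(date +%s)
    code=0
    "$@" || code=$?     # run the command with its arguments unchanged
    end=$(date +%s)
    RUN_DURATION=$((end - start))
    if [ "$code" -eq 0 ]; then
        RUN_STATUS="success"
    else
        RUN_STATUS="failed"
    fi
    return "$code"      # propagate the original exit code to the caller
}
```

A wrapper built on this could call run_timed /usr/local/bin/backup_script.sh and then pass "$RUN_STATUS" and "$RUN_DURATION" to job_monitor.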

Make cron wrapper executable

Set execute permissions on the cron wrapper script.

sudo chmod 755 /usr/local/bin/cron_wrapper

Create example monitored cron job

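Add a sample cron job that demonstrates the monitoring setup. Replace the backup script with your actual backup or maintenance task. The job_monitor script calls sudo, so add the job to root's crontab (or to an account with passwordless sudo).

sudo crontab -e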

Add this line to monitor a daily backup job:

# Daily backup with monitoring
0 2 * * * /usr/local/bin/cron_wrapper "daily_backup" "/usr/local/bin/backup_script.sh"

Create monitored systemd timer service

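Create a systemd service that uses the monitoring script, for example /etc/systemd/system/log-cleanup.service. This example monitors log cleanup. The result is reported from ExecStopPost, which runs whether ExecStart succeeded or failed and receives the outcome in $SERVICE_RESULT. ExecStartPost is not suitable here: it only runs after a successful start and does not receive the exit status.

[Unit]
Description=Clean old log files

[Service]
Type=oneshot
ExecStartPre=/usr/local/bin/job_monitor "log_cleanup" "running"
ExecStart=/bin/bash -c 'find /var/log -name "*.log.gz" -mtime +30 -delete'
# $$ passes a literal $ to bash, so the shell rather than systemd expands the variable
ExecStopPost=/bin/bash -c 'if [ "$$SERVICE_RESULT" = "success" ]; then /usr/local/bin/job_monitor "log_cleanup" "success"; else /usr/local/bin/job_monitor "log_cleanup" "failed"; fi'
User=root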

Create systemd timer for log cleanup

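Create the timer that schedules the log cleanup service to run weekly, for example as /etc/systemd/system/log-cleanup.timer. A timer activates the service with the same name by default, so it needs no explicit dependency; a Requires= line would additionally start the cleanup every time the timer unit itself is started.

[Unit]
Description=Run log cleanup weekly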

[Timer]
OnCalendar=weekly
Persistent=true

[Install]
WantedBy=timers.target

Enable systemd timer

Enable and start the systemd timer so it runs according to schedule.

sudo systemctl daemon-reload
sudo systemctl enable --now log-cleanup.timer
sudo systemctl list-timers log-cleanup.timer

Configure Prometheus server

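Add your monitored servers to the Prometheus configuration file. This assumes you have Prometheus installed and running. The job metrics carry their own job label, which collides with the job label Prometheus attaches to every scraped series. Setting honor_labels: true keeps the value from the metric file, so the alert rules and the dashboard see the scheduled job's name; without it, Prometheus renames the label to exported_job.

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node-jobs'
    # Keep the job label written by job_monitor instead of renaming it to exported_job
    honor_labels: true
    static_configs:
      - targets: ['localhost:9100', '203.0.113.10:9100']
    scrape_interval: 30s
    metrics_path: /metrics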

Create Prometheus alerting rules

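Define alerts for failed jobs, missing jobs, and long-running tasks. Save the rules in a file and list that file under rule_files in the Prometheus configuration. The 24-hour JobMissing threshold suits daily jobs; jobs on a longer schedule, such as the weekly log cleanup, need a larger threshold.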

groups:
  - name: job_monitoring
    rules:
      - alert: JobFailed
        expr: job_last_status == 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Job {{ $labels.job }} failed"
          description: "Job {{ $labels.job }} on {{ $labels.instance }} has failed"
      
      - alert: JobMissing
        expr: (time() - job_last_run_timestamp) > 86400
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Job {{ $labels.job }} hasn't run in 24 hours"
          description: "Job {{ $labels.job }} on {{ $labels.instance }} last ran {{ $value | humanizeDuration }} ago"
      
      - alert: JobRunningTooLong
        expr: job_duration_seconds > 3600
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Job {{ $labels.job }} ran too long"
          description: "Job {{ $labels.job }} on {{ $labels.instance }} took {{ $value | humanizeDuration }} to complete"
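The JobMissing expression subtracts the timestamp stored in the metric file from the current time. The same arithmetic can be checked locally against a .prom file, which may help when tuning the threshold. A rough sketch, assuming the file layout written by the job_monitor script; job_age_seconds is an illustrative helper name, and the optional second argument overrides the current time.

```shell
# Sketch: seconds since a job last reported, read from its .prom file.
# job_age_seconds is a hypothetical helper name.
job_age_seconds() {
    file="$1"
    now="${2:-$(date +%s)}"     # optional override, defaults to the current time
    # The sample value is the last field on the metric line; comment lines start with '#'.
    ts=$(awk '/^job_last_run_timestamp[{ ]/ { print $NF; exit }' "$file")
    [ -n "$ts" ] || return 1
    echo $((now - ts))
}
```

Comparing the printed value with 86400 mirrors the alert condition, for example [ "$(job_age_seconds /var/lib/node_exporter/textfile_collector/daily_backup.prom)" -gt 86400 ].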

Install and configure Grafana

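Install Grafana for visualizing job metrics and creating dashboards. On Ubuntu and Debian the repository key is stored in a dedicated keyring, because apt-key is deprecated.

# Ubuntu / Debian
sudo apt install -y software-properties-common gnupg
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://packages.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://packages.grafana.com/oss/deb stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update
sudo apt install -y grafana

# AlmaLinux / Rocky Linux
sudo dnf install -y grafana

# All distributions
sudo systemctl daemon-reload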

Start Grafana service

Enable and start Grafana to access the web interface on port 3000.

sudo systemctl enable --now grafana-server
sudo systemctl status grafana-server

Configure Grafana dashboard

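Create a dashboard JSON that visualizes job execution status, success rates, and execution times. The outer "dashboard" key matches the payload of Grafana's HTTP API (POST /api/dashboards/db); if you use the Import dialog in the web interface instead, paste only the inner object.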

{
  "dashboard": {
    "id": null,
    "title": "Job Monitoring",
    "tags": ["jobs", "monitoring"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Job Success Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "avg(job_last_status) * 100",
            "legendFormat": "Success Rate %"
          }
        ],
        "gridPos": {"h": 4, "w": 6, "x": 0, "y": 0}
      },
      {
        "title": "Job Execution Times",
        "type": "timeseries",
        "targets": [
          {
            "expr": "job_duration_seconds",
            "legendFormat": "{{ job }}"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 4}
      },
      {
        "title": "Failed Jobs",
        "type": "table",
        "targets": [
          {
            "expr": "job_last_status == 0",
            "format": "table"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 4}
      }
    ],
    "time": {
      "from": "now-24h",
      "to": "now"
    },
    "refresh": "30s"
  }
}
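The Job Success Rate panel evaluates avg(job_last_status) * 100, which is the share of jobs whose last reported status was 1. To sanity-check the number Grafana shows, the same average can be computed directly from the metric files on one host. A small sketch, assuming the file layout written by job_monitor; success_rate is an illustrative name.

```shell
# Sketch: percentage of jobs in a directory whose last status is 1.
# success_rate is a hypothetical helper name.
success_rate() {
    dir="$1"
    # Sum the sample values (0 or 1) and divide by the number of samples.
    cat "$dir"/*.prom 2>/dev/null |
        awk '/^job_last_status[{ ]/ { sum += $NF; n++ }
             END { if (n == 0) exit 1; printf "%.0f\n", 100 * sum / n }'
}
```

The function exits non-zero when no job_last_status samples are found, which distinguishes "no data" from a 0 % success rate.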

Verify your setup

Check that all components are running and collecting metrics properly.

# Verify node_exporter is running
sudo systemctl status node_exporter

# Check metrics are being collected
curl -s http://localhost:9100/metrics | grep job_

# Verify systemd timer is active
sudo systemctl list-timers log-cleanup.timer

# Test job monitoring manually
/usr/local/bin/job_monitor "test_job" "success" "30"
cat /var/lib/node_exporter/textfile_collector/test_job.prom

# Check Grafana is accessible
curl -I http://localhost:3000
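When many jobs report metrics, grepping the raw output gets noisy. The sketch below filters exposition-format text down to the names of jobs whose last status is 0. It assumes the single job label written by job_monitor; list_failed_jobs is an illustrative name.

```shell
# Sketch: print the job label of every job_last_status sample with value 0.
# list_failed_jobs is a hypothetical helper; it reads metrics text on stdin.
list_failed_jobs() {
    sed -n 's/^job_last_status{job="\([^"]*\)"[^}]*} 0$/\1/p'
}
```

It can be fed from the exporter, for example curl -s http://localhost:9100/metrics | list_failed_jobs, or from the .prom files directly.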

Configure alerting

Set up Alertmanager

Install and configure Alertmanager to handle alert notifications from Prometheus rules.

wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
tar xvfz alertmanager-0.26.0.linux-amd64.tar.gz
sudo cp alertmanager-0.26.0.linux-amd64/alertmanager /usr/local/bin/
sudo chmod 755 /usr/local/bin/alertmanager

Configure email notifications

Set up Alertmanager to send email notifications when jobs fail or go missing.

global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'your-smtp-password'

route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'job-alerts'

receivers:
  - name: 'job-alerts'
    email_configs:
      - to: 'sysadmin@example.com'
        headers:
          Subject: 'Job Alert: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Job: {{ .Labels.job }}
          Instance: {{ .Labels.instance }}
          {{ end }}

Common issues

Symptom | Cause | Fix
Metrics not appearing | node_exporter can't read the textfile directory | Check permissions: sudo chown -R node_exporter:node_exporter /var/lib/node_exporter
Job status always shows 0 | Monitoring script failing silently | Test manually: /usr/local/bin/job_monitor "test" "success"
Systemd timer not firing | Timer not enabled or service has errors | sudo systemctl enable log-cleanup.timer and check journalctl -u log-cleanup.service
Alerts not sending | Alertmanager configuration or SMTP issues | Check Alertmanager logs: journalctl -u alertmanager
Permission denied on metric files | Wrong ownership on textfile collector directory | sudo chown node_exporter:node_exporter /var/lib/node_exporter/textfile_collector/*.prom

Never use chmod 777. It gives every user on the system full access to your files. Instead, fix ownership with chown and use minimal permissions like 644 for metric files and 755 for directories.
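The permission advice above can be turned into a quick check. The sketch below flags a world-writable collector directory and metric files without the world-readable bit. It is a heuristic only: a file with mode 600 that is owned by node_exporter is still readable by the exporter but would be flagged. check_collector_perms is an illustrative name.

```shell
# Sketch: flag risky or broken permissions in a textfile collector directory.
# check_collector_perms is a hypothetical helper name.
check_collector_perms() {
    dir="$1"
    problems=0
    # -prune keeps find on the directory itself; -perm -002 matches world-writable.
    if [ -n "$(find "$dir" -prune -perm -002 -print)" ]; then
        echo "world-writable directory: $dir"
        problems=1
    fi
    for f in "$dir"/*.prom; do
        [ -e "$f" ] || continue     # the glob did not match any file
        # -perm -004 matches files that others may read.
        if [ -z "$(find "$f" -prune -perm -004 -print)" ]; then
            echo "not world-readable: $f"
            problems=1
        fi
    done
    return "$problems"
}
```

Running check_collector_perms /var/lib/node_exporter/textfile_collector prints one line per finding and returns non-zero if anything was flagged.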
