Configure Apache Airflow monitoring with Prometheus alerts and Grafana dashboards

Intermediate · 45 min · Apr 01, 2026
Ubuntu 24.04 Debian 12 AlmaLinux 9 Rocky Linux 9

Set up comprehensive monitoring for Apache Airflow with Prometheus metrics collection, StatsD integration, and custom Grafana dashboards. Configure automated alerting for DAG failures, task timeouts, and system health issues.

Prerequisites

  • Apache Airflow 2.x installed
  • Python 3.8+ with pip
  • Root or sudo access
  • At least 4GB RAM available
  • Basic understanding of Airflow DAGs

What this solves

Apache Airflow generates critical workflow metrics that need monitoring for production environments. This tutorial configures comprehensive Airflow monitoring using Prometheus for metrics collection, StatsD for real-time statistics, and Grafana for visualization. You'll set up automated alerts for DAG failures, task execution issues, and resource bottlenecks to ensure reliable workflow orchestration.

Step-by-step configuration

Install monitoring dependencies

Install the required system packages. Use the apt commands on Ubuntu/Debian and the dnf commands on AlmaLinux/Rocky. Grafana ships from its own package repository on both families, so add that repository first if the grafana package is not found.

# Ubuntu/Debian
sudo apt update
sudo apt install -y prometheus prometheus-node-exporter statsd python3-statsd
sudo apt install -y grafana

# AlmaLinux/Rocky (Prometheus packages come from EPEL)
sudo dnf install -y epel-release
sudo dnf install -y prometheus node_exporter statsd python3-statsd
sudo dnf install -y grafana

Install Airflow monitoring dependencies

Install the Python packages needed for Airflow metrics export and StatsD integration.

pip install 'apache-airflow[statsd]'
pip install prometheus_client statsd
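Before wiring up Airflow itself, you can sanity-check that StatsD is listening by pushing a test metric with nothing but the standard library: StatsD's wire format is just `name:value|type` over UDP. A minimal sketch (the `airflow.test` metric name is an arbitrary example, not something Airflow emits):

```python
import socket


def statsd_packet(prefix: str, name: str, value: int, metric_type: str = "c") -> bytes:
    """Build a StatsD datagram, e.g. b'airflow.test:1|c' for a counter."""
    return f"{prefix}.{name}:{value}|{metric_type}".encode("ascii")


def send_test_metric(host: str = "localhost", port: int = 8125) -> None:
    """Fire-and-forget UDP send; StatsD never acknowledges datagrams."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(statsd_packet("airflow", "test", 1), (host, port))
    finally:
        sock.close()


if __name__ == "__main__":
    print(statsd_packet("airflow", "test", 1).decode())  # airflow.test:1|c
```

If the StatsD-to-Prometheus bridge is working, the test counter shows up shortly afterwards on the exporter port configured below.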

Configure Airflow for StatsD metrics

Update airflow.cfg to enable StatsD metrics export for monitoring DAG and task performance. In Airflow 2.x all StatsD settings live in the [metrics] section; the scheduler and webserver emit their metrics through this single configuration, so no per-component sections are needed.

[metrics]
statsd_on = True
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow
statsd_allow_list =
statsd_custom_client_path =
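To confirm the settings took effect, you can read airflow.cfg back with the standard-library configparser (airflow.cfg is INI format). A small sketch, demonstrated against a throwaway file; in practice point it at `$AIRFLOW_HOME/airflow.cfg`:

```python
import configparser
import os
import tempfile


def statsd_settings(cfg_path: str) -> dict:
    """Return the [metrics] StatsD settings from an airflow.cfg file."""
    parser = configparser.ConfigParser()
    parser.read(cfg_path)
    return {
        "on": parser.getboolean("metrics", "statsd_on", fallback=False),
        "host": parser.get("metrics", "statsd_host", fallback=""),
        "port": parser.getint("metrics", "statsd_port", fallback=0),
        "prefix": parser.get("metrics", "statsd_prefix", fallback=""),
    }


if __name__ == "__main__":
    # Demo against a temporary file so the snippet runs anywhere.
    sample = (
        "[metrics]\n"
        "statsd_on = True\n"
        "statsd_host = localhost\n"
        "statsd_port = 8125\n"
        "statsd_prefix = airflow\n"
    )
    with tempfile.NamedTemporaryFile("w", suffix=".cfg", delete=False) as f:
        f.write(sample)
        path = f.name
    print(statsd_settings(path))
    os.unlink(path)
```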

Create StatsD configuration

Configure StatsD (for the Debian package, /etc/statsd/localConfig.js) to listen on UDP port 8125 and re-export metrics in Prometheus format via the statsd-prometheus-backend npm package (install it with npm install statsd-prometheus-backend). The backend options below follow that package's conventions; check the README of your installed version.

{
  port: 8125
, backends: [ "statsd-prometheus-backend" ]
, prometheus: {
    prefix: "airflow_",
    port: 9102
  }
, deleteIdleStats: true
, deleteGauges: true
, deleteTimers: true
, deleteSets: true
, deleteCounters: true
}

Configure Airflow metrics endpoint

Create a custom metrics endpoint for Airflow to expose Prometheus-compatible metrics directly.

# $AIRFLOW_HOME/plugins/prometheus_metrics.py
from airflow.models import DagRun, TaskInstance
from airflow.plugins_manager import AirflowPlugin
from airflow.utils.session import provide_session
from flask import Blueprint, Response
from prometheus_client import (
    CONTENT_TYPE_LATEST,
    CollectorRegistry,
    Gauge,
    generate_latest,
)
from sqlalchemy import func
import logging

log = logging.getLogger(__name__)

prometheus_blueprint = Blueprint('prometheus', __name__)


@prometheus_blueprint.route('/metrics')
@provide_session
def metrics(session=None):
    """Prometheus metrics endpoint, rebuilt from the metadata DB on every scrape."""
    # A per-request registry keeps the gauges in sync with current DB state
    # instead of accumulating across scrapes. Row counts only grow (until DB
    # cleanup), so increase() in the alert rules still behaves like a counter.
    registry = CollectorRegistry()
    dag_run_total = Gauge('airflow_dag_runs_total', 'DAG runs by state',
                          ['dag_id', 'state'], registry=registry)
    task_instance_total = Gauge('airflow_task_instances_total', 'Task instances by state',
                                ['dag_id', 'task_id', 'state'], registry=registry)
    dag_run_duration = Gauge('airflow_dag_run_duration_seconds', 'Latest DAG run duration',
                             ['dag_id'], registry=registry)
    task_duration = Gauge('airflow_task_duration_seconds', 'Latest task duration',
                          ['dag_id', 'task_id'], registry=registry)
    try:
        # Aggregate in the database rather than iterating every row in Python.
        for dag_id, state, count in session.query(
                DagRun.dag_id, DagRun.state, func.count()).group_by(
                DagRun.dag_id, DagRun.state):
            dag_run_total.labels(dag_id=dag_id, state=str(state)).set(count)

        for dag_id, task_id, state, count in session.query(
                TaskInstance.dag_id, TaskInstance.task_id, TaskInstance.state,
                func.count()).group_by(
                TaskInstance.dag_id, TaskInstance.task_id, TaskInstance.state):
            task_instance_total.labels(
                dag_id=dag_id, task_id=task_id, state=str(state)).set(count)

        for dag_run in session.query(DagRun).filter(DagRun.end_date.isnot(None)):
            if dag_run.start_date:
                dag_run_duration.labels(dag_id=dag_run.dag_id).set(
                    (dag_run.end_date - dag_run.start_date).total_seconds())

        for ti in session.query(TaskInstance).filter(TaskInstance.end_date.isnot(None)):
            if ti.start_date:
                task_duration.labels(dag_id=ti.dag_id, task_id=ti.task_id).set(
                    (ti.end_date - ti.start_date).total_seconds())

        return Response(generate_latest(registry), mimetype=CONTENT_TYPE_LATEST)
    except Exception:
        log.exception("Error generating metrics")
        return Response("Error generating metrics", status=500)


class PrometheusMetricsPlugin(AirflowPlugin):
    name = "prometheus_metrics"
    flask_blueprints = [prometheus_blueprint]
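Once the plugin is loaded, `curl http://localhost:8080/metrics` should return Prometheus text exposition format. As a rough sanity check, the sample lines can be parsed with a few lines of standard-library Python. This is a deliberately simplified parser for the plain `name{labels} value` case, not a full implementation of the exposition format:

```python
import re

# Canned output resembling what the plugin emits (dag_id values are examples).
SAMPLE = """\
# HELP airflow_dag_runs_total DAG runs by state
# TYPE airflow_dag_runs_total gauge
airflow_dag_runs_total{dag_id="example_dag",state="success"} 42.0
airflow_dag_run_duration_seconds{dag_id="example_dag"} 73.5
"""

LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'   # metric name
    r'(?:\{(?P<labels>[^}]*)\})?'             # optional label block
    r'\s+(?P<value>\S+)$'                     # sample value
)


def parse_exposition(text: str) -> dict:
    """Map (metric name, raw label string) -> float value, skipping comments."""
    samples = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        m = LINE_RE.match(line)
        if m:
            samples[(m.group("name"), m.group("labels") or "")] = float(m.group("value"))
    return samples


if __name__ == "__main__":
    for key, value in parse_exposition(SAMPLE).items():
        print(key, value)
```

If `airflow_dag_runs_total` samples appear in the real output, the plugin is registered and the Prometheus scrape job below will pick them up.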

Configure Prometheus targets

Add Airflow metrics endpoints to Prometheus scraping configuration for data collection.

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "airflow_alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - localhost:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'airflow-webserver'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics'
    scrape_interval: 30s

  - job_name: 'airflow-statsd'
    static_configs:
      - targets: ['localhost:9102']
    scrape_interval: 15s

  # Caution: 8793 is the Airflow log-server port and /health is not a
  # Prometheus-format endpoint, so Prometheus will record this scrape as
  # failed (up == 0). Treat this target as a placeholder; for real scheduler
  # liveness checks, consider the blackbox exporter or a process check.
  - job_name: 'airflow-scheduler'
    static_configs:
      - targets: ['localhost:8793']
    metrics_path: '/health'
    scrape_interval: 30s

Create Airflow alerting rules

Define comprehensive alerting rules for DAG failures, task timeouts, and system health issues.

groups:
  - name: airflow_alerts
    rules:
      - alert: AirflowDagFailed
        expr: increase(airflow_dag_runs_total{state="failed"}[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Airflow DAG {{ $labels.dag_id }} failed"
          description: "DAG {{ $labels.dag_id }} has failed runs in the last 5 minutes"

      - alert: AirflowTaskFailed
        expr: increase(airflow_task_instances_total{state="failed"}[5m]) > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Airflow task {{ $labels.task_id }} in DAG {{ $labels.dag_id }} failed"
          description: "Task {{ $labels.task_id }} in DAG {{ $labels.dag_id }} has failed"

      - alert: AirflowSchedulerDown
        expr: up{job="airflow-scheduler"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Airflow scheduler is down"
          description: "Airflow scheduler has been down for more than 1 minute"

      - alert: AirflowWebserverDown
        expr: up{job="airflow-webserver"} == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Airflow webserver is down"
          description: "Airflow webserver has been down for more than 2 minutes"

      - alert: AirflowDagRunDuration
        expr: airflow_dag_run_duration_seconds > 3600
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Airflow DAG {{ $labels.dag_id }} running too long"
          description: "DAG {{ $labels.dag_id }} has been running for more than 1 hour"

      - alert: AirflowTaskQueueHigh
        expr: airflow_executor_queued_tasks > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High number of queued Airflow tasks"
          description: "More than 100 tasks are queued in the Airflow executor"

      - alert: AirflowDagImportErrors
        expr: airflow_dag_processing_import_errors > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Airflow DAG import errors detected"
          description: "{{ $value }} DAG import errors detected in Airflow"
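The `for:` clauses and the 3600-second threshold above use Prometheus duration notation. If you ever generate rules programmatically, a small helper to convert those strings to seconds is handy. A sketch covering only the common single-unit case like `5m` or `1h`, not compound durations such as `1h30m`:

```python
# Seconds per Prometheus duration unit (single-unit durations only).
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def duration_to_seconds(duration: str) -> int:
    """Convert a simple Prometheus duration like '5m' or '1h' to seconds."""
    value, unit = duration[:-1], duration[-1]
    if unit not in UNIT_SECONDS or not value.isdigit():
        raise ValueError(f"unsupported duration: {duration!r}")
    return int(value) * UNIT_SECONDS[unit]


if __name__ == "__main__":
    print(duration_to_seconds("5m"))  # 300
    print(duration_to_seconds("1h"))  # 3600
```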

Start monitoring services

Enable and start Prometheus, StatsD, and Grafana services for monitoring collection.

sudo systemctl enable --now prometheus
sudo systemctl enable --now node_exporter
sudo systemctl enable --now statsd
sudo systemctl enable --now grafana-server

Restart Airflow services with new configuration

sudo systemctl restart airflow-webserver
sudo systemctl restart airflow-scheduler

Configure Grafana data source

Add Prometheus as a data source in Grafana for Airflow metrics visualization.

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
    editable: false

Create Airflow Grafana dashboard

Import a comprehensive Airflow dashboard with DAG metrics, task performance, and system health panels.

{
  "dashboard": {
    "id": null,
    "title": "Apache Airflow Monitoring",
    "tags": ["airflow"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "DAG Runs by State",
        "type": "stat",
        "targets": [
          {
            "expr": "sum by (state) (airflow_dag_runs_total)",
            "legendFormat": "{{state}}"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}
      },
      {
        "id": 2,
        "title": "Task Instances by State",
        "type": "piechart",
        "targets": [
          {
            "expr": "sum by (state) (airflow_task_instances_total)",
            "legendFormat": "{{state}}"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}
      },
      {
        "id": 3,
        "title": "DAG Run Duration",
        "type": "graph",
        "targets": [
          {
            "expr": "airflow_dag_run_duration_seconds",
            "legendFormat": "{{dag_id}}"
          }
        ],
        "gridPos": {"h": 8, "w": 24, "x": 0, "y": 8}
      },
      {
        "id": 4,
        "title": "Failed DAGs (Last 24h)",
        "type": "table",
        "targets": [
          {
            "expr": "increase(airflow_dag_runs_total{state=\"failed\"}[24h]) > 0",
            "format": "table"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 16}
      },
      {
        "id": 5,
        "title": "System Resources",
        "type": "graph",
        "targets": [
          {
            "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "CPU Usage %"
          },
          {
            "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
            "legendFormat": "Memory Usage %"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 16}
      }
    ],
    "time": {"from": "now-24h", "to": "now"},
    "refresh": "30s"
  }
}
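Before provisioning, it is worth verifying the dashboard JSON: a stray comma makes Grafana silently skip the file. A quick standard-library check that the export parses and a listing of its panel titles (the provisioning path in the comment matches the provider config below):

```python
import json


def panel_titles(dashboard_json: str) -> list:
    """Parse a Grafana dashboard export and return its panel titles."""
    doc = json.loads(dashboard_json)
    return [p.get("title", "<untitled>") for p in doc["dashboard"].get("panels", [])]


if __name__ == "__main__":
    # In practice: open("/etc/grafana/provisioning/dashboards/airflow.json").read()
    sample = (
        '{"dashboard": {"title": "Apache Airflow Monitoring",'
        ' "panels": [{"id": 1, "title": "DAG Runs by State"}]}}'
    )
    print(panel_titles(sample))  # ['DAG Runs by State']
```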

Configure Grafana dashboard provisioning

Set up automatic dashboard provisioning to load the Airflow monitoring dashboard on startup.

apiVersion: 1

providers:
  - name: 'airflow'
    orgId: 1
    folder: 'Airflow'
    type: file
    disableDeletion: false
    editable: true
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/provisioning/dashboards

Set up Grafana alerts

Configure Grafana notification channels and alert rules for Airflow monitoring.

notifiers:
  - name: email-alerts
    type: email
    uid: email-alerts
    orgId: 1
    isDefault: true
    settings:
      addresses: "admin@example.com"
      subject: "Airflow Alert"
      uploadImage: true

Restart services with new configuration

Restart all services to apply the monitoring configuration and begin collecting metrics.

sudo systemctl restart prometheus
sudo systemctl restart grafana-server
sudo systemctl restart statsd

Verify services are running

sudo systemctl status prometheus grafana-server statsd

Verify your setup

Check that all monitoring components are working correctly and collecting Airflow metrics.

# Check Prometheus targets
curl http://localhost:9090/api/v1/targets

Verify Airflow metrics endpoint

curl http://localhost:8080/metrics

Check StatsD metrics

curl http://localhost:9102/metrics

Test Grafana connectivity

curl http://localhost:3000/api/health

Verify Airflow services

sudo systemctl status airflow-webserver airflow-scheduler

Check Prometheus rules

promtool query instant http://localhost:9090 'airflow_dag_runs_total'
Note: Access Grafana at http://your-server:3000 with default credentials admin/admin. Prometheus is available at http://your-server:9090 for query testing and rule verification.
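The curl checks above can also be scripted. Prometheus's instant-query API lives at /api/v1/query and returns JSON; a small helper to build the request URL and pull values out of a response, shown against a canned response so the sketch runs without a live server:

```python
import json
from urllib.parse import urlencode


def query_url(base: str, expr: str) -> str:
    """Build a Prometheus instant-query URL for the given PromQL expression."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"


def extract_values(response_json: str) -> list:
    """Return (labels, value) pairs from a Prometheus query API response."""
    doc = json.loads(response_json)
    if doc.get("status") != "success":
        raise RuntimeError("query failed")
    return [(r["metric"], r["value"][1]) for r in doc["data"]["result"]]


if __name__ == "__main__":
    print(query_url("http://localhost:9090", 'airflow_dag_runs_total{state="failed"}'))
    # Canned response in the API's documented shape (dag_id value is an example).
    canned = json.dumps({
        "status": "success",
        "data": {"resultType": "vector", "result": [
            {"metric": {"dag_id": "example_dag"}, "value": [1711929600, "3"]}
        ]},
    })
    print(extract_values(canned))  # [({'dag_id': 'example_dag'}, '3')]
```

In a real check you would fetch `query_url(...)` with urllib or requests and feed the body to `extract_values`.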

Common issues

Symptom                         | Cause                                 | Fix
Airflow metrics not appearing   | StatsD configuration incorrect        | Check airflow.cfg statsd settings and restart services
Prometheus can't scrape targets | Firewall blocking ports               | Open ports 8080, 9102, 8793 for Airflow metrics
Grafana dashboard shows no data | Prometheus data source misconfigured  | Verify the Prometheus URL in the datasource configuration
Alerts not firing               | Alert rule syntax errors              | Run promtool check rules airflow_alerts.yml
StatsD metrics missing          | StatsD backend not configured         | Install the Prometheus backend with npm install statsd-prometheus-backend
Task duration metrics empty     | Task instances not completing         | Check task logs and DAG execution history
Security: Configure authentication for Grafana and Prometheus in production environments. Use HTTPS endpoints and restrict network access to monitoring ports using firewall rules.
