Set up Apache Airflow performance monitoring with DataDog agent integration and custom dashboards

Intermediate · 45 min · Apr 25, 2026
Ubuntu 24.04 Debian 12 AlmaLinux 9 Rocky Linux 9

Configure comprehensive Apache Airflow monitoring with the DataDog agent to track DAG performance, task execution metrics, and resource utilization, with custom dashboards and automated alerting for production workflow management.

Prerequisites

  • Apache Airflow installed and running
  • DataDog account with API key
  • Python 3.8+ with pip
  • Root or sudo access

What this solves

Apache Airflow generates extensive metrics about DAG execution, task performance, and system resource usage, but these metrics aren't automatically collected or visualized. DataDog provides comprehensive monitoring for Airflow deployments, tracking everything from task success rates to scheduler performance. This integration helps you identify bottlenecks, monitor SLA compliance, and maintain healthy workflow orchestration in production environments.

Step-by-step configuration

Install DataDog agent

Download and install the DataDog agent using the official installation script. Replace YOUR_API_KEY with your actual DataDog API key from the DataDog console.

DD_API_KEY=YOUR_API_KEY bash -c "$(curl -L https://install.datadoghq.com/scripts/install_script_agent7.sh)"
sudo systemctl enable datadog-agent
sudo systemctl start datadog-agent

Configure Airflow metrics collection

Enable Airflow's StatsD metrics by configuring the airflow.cfg file. This allows Airflow to send metrics to the DataDog agent's StatsD server.

[metrics]
statsd_on = True
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow
statsd_allow_list = scheduler,executor,dagrun,taskinstance
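If you run Airflow in containers or under systemd, the same settings can be supplied as environment variables using Airflow's `AIRFLOW__SECTION__KEY` override convention instead of editing airflow.cfg. A minimal sketch, mirroring the [metrics] block above:

```shell
# Equivalent to the [metrics] block above, supplied via environment variables.
export AIRFLOW__METRICS__STATSD_ON=True
export AIRFLOW__METRICS__STATSD_HOST=localhost
export AIRFLOW__METRICS__STATSD_PORT=8125
export AIRFLOW__METRICS__STATSD_PREFIX=airflow
```

Environment variables take precedence over airflow.cfg, which makes them convenient for per-environment overrides.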

Create DataDog Airflow integration configuration

Configure the DataDog agent to collect Airflow metrics by creating the integration configuration file at /etc/datadog-agent/conf.d/airflow.d/conf.yaml. Replace the default admin/admin credentials with a real Airflow user.

init_config:

instances:
  - url: http://localhost:8080
    username: admin
    password: admin
    tags:
      - environment:production
      - airflow_cluster:main
    collect_health_metrics: true
    collect_task_metrics: true
    collect_dag_metrics: true
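The integration polls the webserver at the url configured above; Airflow exposes a /health endpoint that returns per-component status as JSON. A short sketch of interpreting that payload (the sample values below are illustrative, not live output):

```python
import json

# Illustrative payload in the shape returned by Airflow's /health endpoint,
# which the agent polls via the url configured above.
sample = json.dumps({
    "metadatabase": {"status": "healthy"},
    "scheduler": {
        "status": "healthy",
        "latest_scheduler_heartbeat": "2026-04-25T10:00:00+00:00",
    },
})

def all_components_healthy(payload: str) -> bool:
    # Every top-level component (metadatabase, scheduler, ...) must be healthy.
    return all(c.get("status") == "healthy" for c in json.loads(payload).values())

print(all_components_healthy(sample))
```

If this check fails against your webserver, the agent's health metrics will reflect the same degraded state.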

Configure DataDog agent for StatsD

Enable and configure the DataDog agent's DogStatsD server in /etc/datadog-agent/datadog.yaml so it can receive metrics from Airflow.

# DogStatsD configuration
use_dogstatsd: true
bind_host: localhost
dogstatsd_port: 8125
dogstatsd_non_local_traffic: false

# Tags applied to all metrics submitted by this agent
tags:
  - datacenter:us-east-1
  - environment:production
  - service:airflow

Create custom Airflow metrics script

Create a custom script to collect additional Airflow metrics that aren't available through the standard integration.

#!/usr/bin/env python3

from datetime import datetime, timedelta

from sqlalchemy import func

from airflow import settings
from airflow.models import DagRun, TaskInstance, DagModel
from datadog import initialize, statsd

# Initialize the DogStatsD client
options = {
    'statsd_host': 'localhost',
    'statsd_port': 8125,
}
initialize(**options)


def collect_airflow_metrics():
    session = settings.Session()
    try:
        # Count active DAGs
        active_dags = session.query(DagModel).filter(DagModel.is_active == True).count()
        statsd.gauge('airflow.dags.active', active_dags)

        # Count running DAG runs
        running_dag_runs = session.query(DagRun).filter(DagRun.state == 'running').count()
        statsd.gauge('airflow.dag_runs.running', running_dag_runs)

        # Count failed tasks in the last hour
        one_hour_ago = datetime.utcnow() - timedelta(hours=1)
        failed_tasks = session.query(TaskInstance).filter(
            TaskInstance.state == 'failed',
            TaskInstance.end_date >= one_hour_ago
        ).count()
        statsd.gauge('airflow.tasks.failed_last_hour', failed_tasks)

        # Average successful task duration per active DAG
        for dag in session.query(DagModel).filter(DagModel.is_active == True):
            avg_duration = session.query(TaskInstance).filter(
                TaskInstance.dag_id == dag.dag_id,
                TaskInstance.state == 'success',
                TaskInstance.end_date >= one_hour_ago
            ).with_entities(func.avg(TaskInstance.duration)).scalar()
            if avg_duration:
                statsd.gauge(f'airflow.dag.{dag.dag_id}.avg_task_duration', float(avg_duration))
    finally:
        session.close()


if __name__ == '__main__':
    collect_airflow_metrics()

Set up automated metrics collection

Create a systemd service and timer to run the custom metrics collection script every minute. First, the service unit, /etc/systemd/system/airflow-metrics.service:

[Unit]
Description=Airflow Custom Metrics Collection
After=network.target

[Service]
Type=oneshot
User=airflow
Group=airflow
ExecStart=/usr/bin/python3 /opt/airflow/scripts/custom_metrics.py
Environment=AIRFLOW_HOME=/opt/airflow
WorkingDirectory=/opt/airflow

Then the timer unit, /etc/systemd/system/airflow-metrics.timer:

[Unit]
Description=Run Airflow Custom Metrics Collection
Requires=airflow-metrics.service

[Timer]
OnBootSec=1min
OnUnitActiveSec=1min
Unit=airflow-metrics.service

[Install]
WantedBy=timers.target

Enable custom metrics collection

Enable and start the systemd timer for automated metrics collection.

sudo systemctl daemon-reload
sudo systemctl enable airflow-metrics.timer
sudo systemctl start airflow-metrics.timer
sudo systemctl status airflow-metrics.timer

Configure log collection

Configure DataDog to collect Airflow logs for centralized log analysis and alerting.

logs:
  - type: file
    path: "/opt/airflow/logs/scheduler/*.log"
    service: airflow-scheduler
    source: airflow
    log_processing_rules:
      - type: multi_line
        name: airflow_scheduler
        pattern: '\d{4}-\d{2}-\d{2}'
        
  - type: file
    path: "/opt/airflow/logs/dag_processor_manager/*.log"
    service: airflow-dag-processor
    source: airflow
    
  - type: file
    path: "/opt/airflow/logs/*/*/*/*.log"
    service: airflow-tasks
    source: airflow
    tags:
      - log_type:task_execution
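The multi_line rule above treats any line starting with a date as the beginning of a new log entry; lines that do not match (stack traces, continuations) are appended to the previous entry. A quick local check of that regex against sample log lines:

```python
import re

# Same pattern as the multi_line log_processing_rule above.
pattern = re.compile(r"\d{4}-\d{2}-\d{2}")

lines = [
    "2026-04-25 10:00:01,123 INFO - DAG run started",  # new entry
    "Traceback (most recent call last):",              # continuation line
    "2026-04-25 10:00:02,456 ERROR - Task failed",     # new entry
]

# True where a line begins a new log entry, False where it is a continuation.
starts_entry = [bool(pattern.match(line)) for line in lines]
print(starts_entry)  # [True, False, True]
```

If your log format uses a different timestamp prefix, adjust the pattern accordingly or multi-line tracebacks will be split into separate entries.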

Restart DataDog agent

Restart the DataDog agent to apply all configuration changes and begin collecting metrics.

sudo systemctl restart datadog-agent
sudo systemctl status datadog-agent
sudo datadog-agent status

Restart Airflow services

Restart Airflow components to enable StatsD metrics collection.

sudo systemctl restart airflow-scheduler
sudo systemctl restart airflow-webserver
sudo systemctl restart airflow-worker

Create custom DataDog dashboards

Import Airflow dashboard template

Use the DataDog API or web interface to create a comprehensive Airflow monitoring dashboard. Save this JSON configuration for dashboard creation.

{
  "title": "Apache Airflow Performance Monitor",
  "description": "Comprehensive Airflow monitoring dashboard",
  "widgets": [
    {
      "definition": {
        "type": "timeseries",
        "requests": [
          {
            "q": "avg:airflow.dag_runs.running{*}",
            "display_type": "line",
            "style": {
              "palette": "dog_classic",
              "line_type": "solid",
              "line_width": "normal"
            }
          }
        ],
        "title": "Running DAG Runs",
        "show_legend": false
      },
      "layout": {
        "x": 0,
        "y": 0,
        "width": 4,
        "height": 2
      }
    },
    {
      "definition": {
        "type": "query_value",
        "requests": [
          {
            "q": "avg:airflow.tasks.failed_last_hour{*}",
            "aggregator": "last"
          }
        ],
        "title": "Failed Tasks (Last Hour)",
        "autoscale": true,
        "precision": 0
      },
      "layout": {
        "x": 4,
        "y": 0,
        "width": 2,
        "height": 2
      }
    }
  ]
}
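Before submitting the JSON to the DataDog dashboards API, it can help to sanity-check the payload locally. A minimal sketch using a trimmed copy of the payload above (the validation rules here are illustrative, not DataDog's own schema):

```python
# Trimmed copy of the dashboard payload above; checks that every widget
# carries a definition type and that each request queries an airflow metric.
dashboard = {
    "title": "Apache Airflow Performance Monitor",
    "widgets": [
        {"definition": {"type": "timeseries",
                        "requests": [{"q": "avg:airflow.dag_runs.running{*}"}]}},
        {"definition": {"type": "query_value",
                        "requests": [{"q": "avg:airflow.tasks.failed_last_hour{*}"}]}},
    ],
}

for widget in dashboard["widgets"]:
    definition = widget["definition"]
    assert definition["type"], "every widget needs a type"
    assert all(r["q"].startswith("avg:airflow.") for r in definition["requests"])

print("dashboard payload looks valid")
```

A check like this catches typos in metric names (the most common cause of empty dashboard panels) before the payload ever reaches the API.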

Configure alerting and notifications

Create Airflow performance alerts

Set up DataDog monitors to alert on critical Airflow performance issues and failures.

curl -X POST "https://api.datadoghq.com/api/v1/monitor" \
-H "Content-Type: application/json" \
-H "DD-API-KEY: YOUR_API_KEY" \
-H "DD-APPLICATION-KEY: YOUR_APP_KEY" \
-d '{
  "type": "metric alert",
  "query": "avg(last_5m):avg:airflow.tasks.failed_last_hour{*} > 10",
  "name": "High Airflow Task Failure Rate",
  "message": "@slack-alerts Airflow is experiencing high task failure rates. Current: {{value}} failed tasks in the last hour.",
  "tags": ["service:airflow", "alert_type:performance"],
  "options": {
    "thresholds": {
      "critical": 10,
      "warning": 5
    },
    "notify_audit": false,
    "require_full_window": true,
    "new_host_delay": 300,
    "include_tags": true,
    "escalation_message": "@pagerduty-airflow Airflow task failures continue to exceed threshold."
  }
}'

Configure scheduler health monitoring

Create alerts to monitor Airflow scheduler health and responsiveness.

{
  "type": "service check",
  "query": "\"airflow.scheduler.heartbeat\".over(\"*\").last(2).count_by_status()",
  "name": "Airflow Scheduler Health Check",
  "message": "@slack-critical The Airflow scheduler appears to be down or unresponsive. Please check the scheduler service immediately.",
  "tags": ["service:airflow", "component:scheduler"],
  "options": {
    "thresholds": {
      "ok": 1,
      "critical": 1
    },
    "no_data_timeframe": 10,
    "notify_no_data": true
  }
}

Set up SLA violation alerts

Configure monitoring for DAG SLA violations to ensure workflow compliance.

{
  "type": "log alert",
  "query": "logs(\"service:airflow source:airflow\").index(\"*\").rollup(\"count\").by(\"dag_id\").last(\"15m\") > 0",
  "name": "Airflow SLA Violations",
  "message": "@team-data-engineering SLA violation detected for DAG: {{dag_id.name}}. Review task performance and resource allocation.",
  "tags": ["service:airflow", "alert_type:sla"],
  "options": {
    "enable_logs_sample": true,
    "escalation_message": "@manager-data SLA violations continue for DAG: {{dag_id.name}}"
  }
}
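On the Airflow side, a DAG-level sla_miss_callback can give the log-based monitor above a metric-based counterpart. This is a sketch assuming the datadog Python client used in the custom metrics script; the callback signature matches Airflow's sla_miss_callback hook, and sla_miss_tags is a hypothetical helper introduced here for illustration:

```python
def sla_miss_tags(slas):
    """Build DogStatsD tags for each missed SLA (hypothetical helper)."""
    return [[f"dag:{sla.dag_id}", f"task:{sla.task_id}"] for sla in slas]


def report_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    """DAG-level sla_miss_callback: forward each missed SLA to DogStatsD."""
    from datadog import statsd  # client initialized as in the custom metrics script
    for tags in sla_miss_tags(slas):
        statsd.increment("airflow.sla.missed", tags=tags)
```

Attach the callback when defining the DAG (sla_miss_callback=report_sla_miss) and give tasks an sla, e.g. sla=timedelta(minutes=30); the resulting airflow.sla.missed metric can then back a metric alert alongside the log alert above.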

Verify your setup

# Check DataDog agent status
sudo datadog-agent status

Verify Airflow metrics are being sent

sudo datadog-agent check airflow

Check custom metrics collection

sudo systemctl status airflow-metrics.timer
sudo journalctl -u airflow-metrics.service -n 20

Verify StatsD metrics

echo "airflow.test.metric:1|c" | nc -u localhost 8125

Check Airflow configuration

airflow config get-value metrics statsd_on

Visit your DataDog dashboard to confirm metrics are flowing and alerts are configured. You should see Airflow metrics under the "Metrics Explorer" and can create custom dashboards using the collected data.

Common issues

Symptom | Cause | Fix
No metrics in DataDog | StatsD not enabled in Airflow | Check statsd_on = True in airflow.cfg and restart services
Permission denied on log files | DataDog agent can't read logs | sudo chown -R dd-agent:airflow /opt/airflow/logs
Custom metrics script fails | Missing Python dependencies | pip install datadog apache-airflow
High metric ingestion costs | Too many custom metrics | Filter metrics using statsd_allow_list in airflow.cfg
Dashboard shows no data | Incorrect metric names | Use DataDog Metrics Explorer to verify metric names
