Apache Airflow

automation, devops

Python-based workflow orchestration platform for authoring, scheduling, and monitoring data pipelines. Define workflows as code using DAGs, then run them on a schedule or trigger them on demand, with dependencies between tasks enforced.

#workflows #data-pipelines #etl #scheduling #python #self-hosted

Quick Start

curl -LfO https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml && mkdir -p ./dags ./logs ./plugins && echo -e "AIRFLOW_UID=$(id -u)" > .env && docker compose up airflow-init && docker compose up -d

Overview

Apache Airflow is a workflow orchestration platform where pipelines are written as Python code rather than configured through a UI. You define a Directed Acyclic Graph (DAG) that describes which tasks to run, in what order, and with what dependencies. Airflow handles the scheduling, execution, retry logic, and monitoring. The web interface provides a visual representation of each DAG and a history of every run with task-level logs.
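To make the "which tasks, in what order" idea concrete, here is a minimal sketch that uses only Python's standard-library graphlib module. It is not Airflow's API or its scheduler; the task names and the dependencies mapping are hypothetical and exist only to show how a dependency graph determines a valid run order.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each key is a task, each value is the set of
# tasks that must finish before that task may start.
dependencies = {
    "extract_orders": set(),
    "extract_customers": set(),
    "transform": {"extract_orders", "extract_customers"},
    "load": {"transform"},
    "report": {"load"},
}


def execution_order(deps):
    """Return one valid run order for the graph.

    TopologicalSorter raises graphlib.CycleError if the graph contains
    a cycle, which is why the graph has to be acyclic.
    """
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

The two extract tasks have no dependency on each other, so either may come first; an orchestrator is free to run such tasks in parallel. Everything downstream waits for its upstream tasks.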

The Python-as-code model is both the platform’s greatest strength and the reason it is not for everyone. Any workflow logic expressible in Python (conditionals, dynamic task generation, branching based on upstream results) is straightforward to implement. Hundreds of operators, most of them shipped in separately installable provider packages, cover connections to AWS, Google Cloud, Azure, Spark, Postgres, HTTP endpoints, and most common data infrastructure. If a pre-built operator does not exist, PythonOperator runs an arbitrary Python function.
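A DAG file along these lines might look like the sketch below. It assumes Airflow 2.4 or later (where the DAG argument is named schedule) and the airflow.operators.python import path; newer major versions may prefer a different import path for PythonOperator. The dag_id, the SOURCES list, and the stubbed extract and transform functions are hypothetical. The Airflow imports are deferred into build_dag so the plain functions can be exercised without Airflow installed.

```python
from datetime import datetime, timedelta

# Hypothetical source names, used only for illustration.
SOURCES = ["orders", "customers"]


def extract(source):
    """Stand-in for pulling rows from a source system (static stub data)."""
    return [{"source": source, "value": n} for n in range(3)]


def transform(rows):
    """Drop rows whose value is zero and double the remaining values."""
    return [{**row, "value": row["value"] * 2} for row in rows if row["value"]]


def extract_and_transform(source):
    """Callable executed by each generated task."""
    return transform(extract(source))


def report_done():
    """Final task: runs only after every per-source task has succeeded."""
    print("all sources processed")


def build_dag():
    """Wire the functions above into a DAG (requires Airflow to be installed)."""
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    with DAG(
        dag_id="example_etl",  # hypothetical name
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
        # Applied to every task: retry twice, five minutes apart.
        default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
    ) as dag:
        done = PythonOperator(task_id="done", python_callable=report_done)
        # Dynamic task generation: one task per source, created in a loop.
        for source in SOURCES:
            task = PythonOperator(
                task_id=f"process_{source}",
                python_callable=extract_and_transform,
                op_kwargs={"source": source},
            )
            task >> done  # "done" depends on every per-source task
    return dag


# In a real DAG file, assign the result at module level so the scheduler
# can discover it:
# dag = build_dag()
```

Keeping the task logic in plain functions, separate from the DAG wiring, is one way to keep pipelines unit-testable; it is a design choice for this sketch, not a requirement of Airflow.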

The typical use case is data engineering: extract data from source systems, transform it, and load it to a destination on a schedule. ML teams use Airflow to chain data preparation, feature engineering, training, and evaluation steps with dependency tracking. Anywhere a sequence of tasks has explicit ordering requirements and needs reliable scheduling with retry logic and alerting, Airflow fits.

The operational cost is real. A production-grade Airflow deployment needs a metadata database (PostgreSQL is standard), a message broker (Redis or RabbitMQ) for the distributed CeleryExecutor, a scheduler process, a webserver, and one or more worker nodes. The Docker Compose file from the official docs brings up all of these components, but it runs six or more containers, the docs recommend allocating at least 4 GB of memory to Docker, and they present the file as a quick-start setup rather than a hardened production deployment.

For teams running straightforward cron jobs, Cronicle or n8n is a lower-overhead alternative. Airflow earns its complexity when workflows have non-trivial dependencies, require distributed execution, or need the full observability its UI provides.

Apache Airflow: Pros & Cons

Pros (The Wins)

Python DAGs: Full programmatic control; branching, loops, dynamic tasks.
Rich web UI: Visual DAG graph, run history, and task-level log viewer.
Hundreds of operators: AWS, GCP, Spark, Postgres, and most data tools covered.
45.6k stars: Industry-standard across data engineering and ML pipelines.

Cons (The Friction)

Heavy infrastructure: Scheduler, workers, PostgreSQL, and Redis all required.
Not for real-time: Optimised for batch workflows; sub-minute scheduling unsupported.
Steep learning curve: Executors, XComs, and connections; the model takes time to learn.
DAG maintenance cost: Complex pipelines in Python accumulate technical debt fast.

Use Cases

Specific ways to use Apache Airflow for your workflow.

01. Orchestrate a data pipeline that extracts from multiple sources, transforms, and loads to a data warehouse on a schedule
02. Coordinate ML model training workflows where each step depends on the previous one completing successfully
03. Run daily ETL jobs across a distributed set of workers with retry logic and failure alerting built in
04. Replace a collection of interdependent cron jobs with a dependency graph that handles ordering and failure recovery

Deployment Strategy

Recommended ways to host Apache Airflow in your own environment.

docker
self-hosted