As your data pipelines grow, you need a system to schedule, monitor, and manage them. Workflow orchestration tools handle dependencies, retries, alerting, and scheduling so you can focus on the logic. This lesson covers Apache Airflow, DAGs, tasks, scheduling, retries, and Prefect as an alternative.
Without orchestration, you end up with:

- Cron jobs scattered across machines, with no central view of what ran and when
- Dependencies between jobs managed by hand, or by guessing at timing
- No automatic retries when a step fails
- Failures that go unnoticed until someone asks where the data is

An orchestrator solves all of these problems.
Airflow is the most widely used orchestration tool in data engineering.
| Concept | Description |
|---|---|
| DAG | Directed Acyclic Graph — defines the workflow |
| Task | A single unit of work in the DAG |
| Operator | The type of task (Python, Bash, SQL, etc.) |
| Schedule | When the DAG runs (cron expression) |
| XCom | Cross-communication — pass data between tasks |
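The XCom row deserves a quick illustration. The sketch below mimics the push/pull pattern in plain Python so it runs without Airflow installed; in real Airflow, a task's return value is pushed as an XCom automatically, and downstream tasks pull it with `ti.xcom_pull(task_ids="...")`.

```python
# Plain-Python sketch of the XCom pattern: each task's return value is
# stored under its task_id, and downstream tasks pull it by that key.
# (In Airflow itself, this store is the metadata database.)
xcom_store = {}

def extract():
    # In Airflow, returning a value pushes it as an XCom.
    return {"row_count": 120}

def report(upstream):
    # In Airflow, this value would come from ti.xcom_pull(task_ids="extract").
    return f"extracted {upstream['row_count']} rows"

xcom_store["extract"] = extract()
xcom_store["report"] = report(xcom_store["extract"])
print(xcom_store["report"])  # extracted 120 rows
```

Keep XComs small (IDs, row counts, file paths): they live in Airflow's metadata database, so large payloads belong in storage the tasks share, not in XCom.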
```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

# Pipeline callables -- implementations are elided in this lesson.
def extract_customers(): ...
def validate_raw_data(): ...
def transform_customers(): ...
def load_to_warehouse(): ...

default_args = {
    "owner": "data-engineering",
    "depends_on_past": False,
    "email_on_failure": True,
    "email": ["data-team@company.com"],
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="daily_customer_pipeline",
    default_args=default_args,
    description="Extract, clean, and load customer data daily",
    schedule="0 6 * * *",  # every day at 06:00 UTC
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=["customers", "daily"],
) as dag:
    extract = PythonOperator(
        task_id="extract_customers",
        python_callable=extract_customers,
    )
    validate = PythonOperator(
        task_id="validate_raw_data",
        python_callable=validate_raw_data,
    )
    transform = PythonOperator(
        task_id="transform_customers",
        python_callable=transform_customers,
    )
    load = PythonOperator(
        task_id="load_to_warehouse",
        python_callable=load_to_warehouse,
    )
    notify = BashOperator(
        task_id="send_notification",
        bash_command='echo "Pipeline complete for {{ ds }}"',
    )

    # Define dependencies: each task runs only after the previous succeeds
    extract >> validate >> transform >> load >> notify
```
This produces a linear chain:

```
extract ──▶ validate ──▶ transform ──▶ load ──▶ notify
```
```python
# Linear: A >> B >> C
extract >> transform >> load

# Fan-out: A >> [B, C]
extract >> [validate_schema, validate_quality]

# Fan-in: [B, C] >> D
[validate_schema, validate_quality] >> transform

# Complex DAG combining fan-out and fan-in
extract >> [validate_schema, validate_quality]
[validate_schema, validate_quality] >> transform
transform >> [load_warehouse, load_datalake]
[load_warehouse, load_datalake] >> notify
```
```
            ┌──────────────────┐
            │     extract      │
            └────────┬─────────┘
            ┌────────┴────────┐
            ▼                 ▼
      ┌────────────┐   ┌──────────────┐
      │  validate  │   │   validate   │
      │   schema   │   │   quality    │
      └─────┬──────┘   └──────┬───────┘
            └────────┬────────┘
                     ▼
              ┌──────────────┐
              │  transform   │
              └──────┬───────┘
            ┌────────┴────────┐
            ▼                 ▼
      ┌────────────┐     ┌──────────┐
      │    load    │     │   load   │
      │ warehouse  │     │ datalake │
      └─────┬──────┘     └────┬─────┘
            └────────┬────────┘
                     ▼
              ┌──────────────┐
              │    notify    │
              └──────────────┘
```
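One way to sanity-check a fan-out/fan-in structure like the one above, without running Airflow at all, is to express the same dependencies with Python's standard-library `graphlib` and ask for a valid execution order. This is purely an illustration of the graph, not an Airflow API.

```python
from graphlib import TopologicalSorter

# The complex DAG above, written as node -> set of upstream tasks.
graph = {
    "validate_schema": {"extract"},
    "validate_quality": {"extract"},
    "transform": {"validate_schema", "validate_quality"},
    "load_warehouse": {"transform"},
    "load_datalake": {"transform"},
    "notify": {"load_warehouse", "load_datalake"},
}

# static_order() yields one valid execution order; it raises CycleError
# if the graph is not acyclic -- the "A" in DAG.
order = list(TopologicalSorter(graph).static_order())
print(order)
```

Airflow's scheduler does essentially this, except it also runs independent branches (like the two validations) in parallel.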
Airflow uses cron expressions (and preset aliases such as `@daily` and `@hourly`) for scheduling.
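A cron expression has five fields. The snippet below unpacks the schedule used in the DAG above, just to make each position explicit; it is plain Python, not anything Airflow-specific.

```python
# Split the DAG's schedule "0 6 * * *" into cron's five fields:
# minute, hour, day-of-month, month, day-of-week.
fields = dict(zip(
    ["minute", "hour", "day_of_month", "month", "day_of_week"],
    "0 6 * * *".split(),
))
print(fields)
# {'minute': '0', 'hour': '6', 'day_of_month': '*', 'month': '*', 'day_of_week': '*'}
```

Reading it left to right: minute 0, hour 6, any day of the month, any month, any day of the week, i.e. every day at 06:00.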