Apache Airflow primer

Airflow is the workhorse scheduler for data pipelines that need to run reliably, retry sensibly, and stay observable as they grow. This primer walks through the mental model, your first DAG, and the operational habits that keep Meridian workflows healthy in production.

1. The DAG mental model

A DAG (directed acyclic graph) is a set of tasks plus the edges between them. Airflow does not run your business logic itself; it schedules and supervises operators that do. Think of the scheduler as a calendar, the executor as a dispatcher, and tasks as the work being dispatched. Idempotency at the task layer is the single most important property to design for.
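
To make the idempotency point concrete, here is a minimal sketch of a load step that can be retried safely. It uses an in-memory SQLite table as a stand-in for a real warehouse; the function load_partition, the events table, and the run_hour key are all hypothetical names chosen for illustration. The idea is that each run owns one logical hour and replaces that hour's rows in a single transaction, so running it twice leaves the same state as running it once.

```python
import sqlite3


def load_partition(conn, run_hour, rows):
    """Replace all rows for one logical hour inside a single transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM events WHERE run_hour = ?", (run_hour,))
        conn.executemany(
            "INSERT INTO events (run_hour, payload) VALUES (?, ?)",
            [(run_hour, payload) for payload in rows],
        )


# In-memory database standing in for the warehouse (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (run_hour TEXT, payload TEXT)")

rows = ["a", "b", "c"]
load_partition(conn, "2026-01-01T00", rows)
load_partition(conn, "2026-01-01T00", rows)  # simulated retry of the same run

row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(row_count)  # 3, not 6: the retry did not duplicate rows
```

A plain append-only insert would have left six rows after the retry, which is the failure mode this pattern is meant to avoid.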

2. Your first hourly pipeline

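The snippet below defines a three-step ETL DAG whose tasks stand in for pulling from Meridian, normalizing the payload, and landing rows in a warehouse; the callables only print a message, so swap in real logic as needed. Note the explicit catchup=False, which stops the scheduler from backfilling every interval since start_date, and the chained >> operator, which makes ingest run before transform and transform before load.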

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    print("Pulling rows from Meridian...")

def transform():
    print("Normalizing payloads...")

def load():
    print("Writing to warehouse...")

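with DAG(
    dag_id="meridian_etl",
    schedule="@hourly",  # one run per hour
    start_date=datetime(2026, 1, 1),
    catchup=False,  # do not backfill intervals missed before deployment
    tags=["meridian", "etl"],
) as dag:
    # Each operator wraps one Python callable as an Airflow task.
    t1 = PythonOperator(task_id="ingest", python_callable=ingest)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)

    # Dependencies: ingest, then transform, then load.
    t1 >> t2 >> t3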
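
To see why catchup=False matters, the rough arithmetic below counts how many hourly intervals would already have elapsed if the DAG were first deployed some time after its start_date. The deploy time and the helper missed_hourly_runs are hypothetical, and this is a back-of-the-envelope sketch rather than a reproduction of the scheduler's own logic.

```python
from datetime import datetime, timedelta


def missed_hourly_runs(start_date, now):
    """Count fully elapsed hourly intervals between start_date and now."""
    if now <= start_date:
        return 0
    # Floor division of two timedeltas yields a whole number of hours.
    return (now - start_date) // timedelta(hours=1)


start = datetime(2026, 1, 1)
deployed_at = datetime(2026, 1, 3, 12, 30)  # hypothetical first deploy time

backlog = missed_hourly_runs(start, deployed_at)
print(backlog)  # 60
```

With catchup enabled, the scheduler would queue a run for each of those 60 elapsed intervals as soon as the DAG is unpaused; with catchup=False it starts from the most recent interval only.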

3. Operational habits

Set sane retries with exponential backoff, alert on SLA misses rather than on every raw failure, and cap task durations with timeouts so a stuck sensor never holds a slot indefinitely. Keep secrets in a backend like Vault or AWS Secrets Manager, version your DAG files alongside the rest of the repo, and run a smoke DAG on every deploy so a broken import surfaces before the next scheduled run.
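
As one way to express the retry and timeout habits, the sketch below builds a default_args dictionary that could be passed to the DAG so every task inherits the same settings. The keys are standard operator arguments in Airflow; the specific values (three retries, one-minute base delay, and so on) are illustrative assumptions, not recommendations tuned for Meridian.

```python
from datetime import timedelta

# Shared task defaults; all values below are illustrative assumptions.
default_args = {
    "retries": 3,                                  # retry a failed task up to 3 times
    "retry_delay": timedelta(minutes=1),           # base wait before the first retry
    "retry_exponential_backoff": True,             # lengthen the wait on each retry
    "max_retry_delay": timedelta(minutes=30),      # upper bound on the backoff wait
    "execution_timeout": timedelta(minutes=20),    # fail a task that runs too long
}

print(sorted(default_args))
```

Passing default_args=default_args to the DAG(...) call in section 2 would apply these settings to ingest, transform, and load alike, and any single operator can still override a key by setting it directly. For sensors, setting mode="reschedule" together with a timeout is a common way to free the worker slot between checks.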