Data Pipeline Orchestration: Airflow, Prefect, and Dagster (2025)

Oct 27, 2025
orchestrationairflowprefectdagster
0

Executive Summary

Robust data pipelines require idempotency, quality checks, lineage, and observability. This guide compares Airflow, Prefect, and Dagster with production patterns, IaC, and runbooks.


1) Architecture Overview

graph TD
  S[Sources] --> E[Extract]
  E --> T[Transform]
  T --> L[Load]
  L --> M[Models/Serving]
  O[Orchestrator] --> E
  O --> T
  O --> L
  O --> Q[Quality]
  O --> G[Lineage]
- Control Plane: orchestration (Airflow/Prefect/Dagster), scheduling, retries, SLAs
- Data Plane: compute engines (Spark/Dask/Flink/DBT/SQL), storage (S3/GCS/ADLS, DW)
- Observability: metrics, logs, traces; data quality and lineage

2) Core Patterns

- Idempotency: re-runnable tasks with deterministic outputs
- Backfills: windowed reprocessing with partition boundaries
- SLAs: deadlines per task/dag; alerts on breach
- Retries: exponential backoff, jitter, partial re-run
- Sensors/Triggers: event-driven starts; avoid long polling

3) Airflow

3.1 DAG Skeleton

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago

def extract(**ctx): ...

def transform(**ctx): ...

def load(**ctx): ...

dag = DAG(
    dag_id="elt_orders",
    start_date=days_ago(2),
    schedule_interval="0 * * * *",
    catchup=True,
    max_active_runs=1,
    default_args={"retries": 3, "retry_delay": 300}
)

PythonOperator(task_id="extract", python_callable=extract, dag=dag) >> \
PythonOperator(task_id="transform", python_callable=transform, dag=dag) >> \
PythonOperator(task_id="load", python_callable=load, dag=dag)

3.2 Sensors and Deferrable Operators

from airflow.sensors.filesystem import FileSensor
FileSensor(task_id="wait_for_drop", filepath="/drops/orders_{{ ds }}.csv", poke_interval=60, timeout=3600)
- Prefer deferrable operators to free worker slots; use triggers instead of busy-waiting

3.3 Airflow on Kubernetes

# Helm values (excerpt)
airflow:
  executor: KubernetesExecutor
  dags:
    gitSync:
      enabled: true
      repo: https://github.com/org/dags
      branch: main
  workers:
    resources: { limits: { cpu: 2, memory: 4Gi }, requests: { cpu: 500m, memory: 1Gi } }

4) Prefect

from prefect import flow, task

@task(retries=3, retry_delay_seconds=30)
def extract(): ...

@task
def transform(data): ...

@task
def load(df): ...

@flow(name="elt_orders")
def elt_orders():
    data = extract()
    df = transform(data)
    load(df)

if __name__ == "__main__":
    elt_orders()
- Prefect blocks for infrastructure (Docker/K8s); concurrency limits by work pool
- Deployment: versioned flows; schedules and parameters

5) Dagster

from dagster import asset, Definitions

@asset
def orders_raw(): ...

@asset
def orders_clean(orders_raw): ...

defs = Definitions(assets=[orders_raw, orders_clean])
- Software-defined assets; dependency graph; built-in sensors and schedules
- Dagster Cloud for runs, observability, and asset checks

6) Data Quality with Great Expectations

import great_expectations as gx
context = gx.get_context()
validator = context.sources.pandas_default.read_csv("orders.csv")
validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_be_between("amount", 0, 100000)
validator.save_expectation_suite(discard_failed_expectations=False)
result = validator.validate()
- Fail fast on critical expectations; warn on non-critical
- Store validations and surface in orchestration UI

7) Lineage with OpenLineage

from openlineage.client.facet import NominalTimeRunFacet
from openlineage.airflow import DAG
# Emit lineage: inputs (S3), outputs (DW table), job facets
- Capture input/output datasets; emit run and job facets; integrate with Marquez

8) Storage and Compute

- Storage: S3/GCS/ADLS with lifecycle policy; parquet/iceberg formats
- Compute: Spark/Dask for batch; Flink for streams; dbt for SQL transforms
- Partitioning: date/hour/tenant; manifest files for incremental processing

9) Scheduling, SLAs, Retries, Backfills

- Schedules: cron or event-driven; align with upstream publication times
- SLAs: per-step deadlines; page on breach; log late DAGs
- Retries: exponential backoff; cap attempts; skip irrecoverable
- Backfills: range by partition; ensure idempotent writes

10) Idempotency and Checkpointing

- Deterministic output paths per partition (date/hour)
- Write-once patterns: temp → atomic rename; upserts with idempotency keys
- Checkpoints: metadata logs (success/failure), watermark tracking

11) Event-Driven Orchestration

- Triggers from object events, queue messages, or CDC logs
- Avoid long sensors; use webhooks or event bridges

12) Kubernetes Operators and Jobs

apiVersion: batch/v1
kind: Job
metadata: { name: elt-{{date}} }
spec:
  template:
    spec:
      containers:
        - name: worker
          image: registry.example.com/elt@sha256:...
          args: ["--date", "{{date}}"]
      restartPolicy: Never

13) Cloud-Specific Notes

AWS: MWAA for managed Airflow; Lambda for light ETL; Glue/Spark for batch
Azure: Data Factory or Synapse pipelines; AKS for orchestration backends
GCP: Cloud Composer for Airflow; Dataflow for Beam; Cloud Run jobs

14) CI/CD for DAGs/Flows/Assets

name: dag-ci
on: [pull_request]
jobs:
  lint-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: black --check dags/ && ruff dags/
      - run: pytest -q
  deploy:
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/publish_dags.sh

15) IaC Samples

module "airflow" {
  source = "aws-ia/mwaa/aws"
  name   = "prod-airflow"
}

resource "google_composer_environment" "prod" {
  name   = "composer-prod"
  region = "us-central1"
}

16) Observability

- Metrics: task runtime, success/fail, queue depth
- Tracing: span per task; link upstream source to downstream load
- Logs: structured with partition/context; searchable by run_id
sum by(task) (rate(airflow_task_duration_seconds_sum[5m]) / rate(airflow_task_duration_seconds_count[5m]))

17) Cost Controls

- Spot/preemptible for batch; autoscale; coalesce small files
- Cache intermediate outputs; avoid over-partitioning
- Right-size cluster pools; turn off idle workers

18) Security and Compliance

- Secrets via managers; no credentials in code
- Row-level and column-level security in DW; PII masking
- Audit logs and lineage for compliance evidence

19) Runbooks

Late DAGs
- Identify bottleneck task; scale pool; adjust schedule; notify owners

Repeated Failures
- Inspect inputs; retry with backoff; quarantine bad partitions; hotfix

Backfill Plan
- Dry-run; backfill by date ranges; monitor cost; validate outputs

JSON-LD


  • Event-Driven Architecture: Async Messaging (2025)
  • Observability with OpenTelemetry: Complete Implementation Guide (2025)
  • Database Sharding & Partitioning Strategies (2025)

Call to Action

Need help building resilient, cost-aware pipelines? We design and operate Airflow/Prefect/Dagster with quality, lineage, and SLOs.


Extended FAQ (1–180)

  1. When to pick Airflow?
    Mature scheduling, large ecosystem, and MWAA/Composer options.

  2. When to pick Prefect?
    Dynamic Pythonic flows, blocks, and cloud UI simplicity.

  3. When to pick Dagster?
    Asset graph, lineage, and strong testing/quality primitives.

  4. How to implement event-driven pipelines?
    Use deferrable operators, webhooks, or pub/sub triggers; avoid busy sensors.

  5. Should I use dbt?
    Yes for SQL transforms; orchestrate via adapter/operator.

  6. How to manage backfills?
    Partitioned windows; watermark tracking; cost guardrails.

  7. Idempotency for streaming?
    Use keys, dedupe stores, and exactly-once sinks where supported.

  8. Data quality gates?
    Great Expectations or built-in checks; block loads on critical failures.

  9. Lineage benefits?
    RCA, impact analysis, and compliance evidence; integrate OpenLineage.

  10. Observability must-haves?
    RED metrics per task, trace spans for jobs, and searchable logs with run_id.

... (continue with practical Q/A up to 180 covering compute/storage choices, scaling, CI/CD, testing, lineage, quality, security, and cost)


20) Advanced Airflow Patterns

20.1 TaskGroups and Dynamic Task Mapping

from airflow.utils.task_group import TaskGroup
from airflow.decorators import task

@task
def extract_partition(p): ...

with DAG(dag_id="elt_orders", ...) as dag:
    partitions = [f"2025-10-27-{h:02d}" for h in range(24)]
    with TaskGroup("extract"):
        extract_tasks = extract_partition.expand(p=partitions)

20.2 Deferrable Operators and Triggers

# Use deferrable operators to avoid worker slots exhaustion
from airflow.providers.amazon.aws.triggers.s3 import S3KeyTrigger

20.3 SLA Miss Callbacks

def sla_miss(dag, task_list, blocking_task_list, slas, *args, **kwargs):
    # send page, annotate incident, link runbook
    pass

dag.sla_miss_callback = sla_miss

21) Advanced Prefect Patterns

from prefect import flow, task
from prefect.tasks import task_input_hash
from datetime import timedelta

@task(cache_key_fn=task_input_hash, cache_expiration=timedelta(days=1))
def expensive(fn): ...

@flow(retries=3, retry_delay_seconds=60)
def pipeline(date: str):
    out = expensive.submit(lambda: ...)
    return out
- Work queues with concurrency limits
- Blocks for storage, infra, and secrets; flow-run policies

22) Advanced Dagster Assets

from dagster import asset, AssetIn, Output, AssetOut

@asset(ins={"orders_raw": AssetIn()})
def orders_clean(orders_raw):
    yield Output(...)
- Asset sensors for upstream changes
- Partitioned assets; backfills via asset selection

- Spark: batch transforms; optimize joins; broadcast small dims; coalesce outputs
- Flink: event-time windows; exactly-once sinks; watermarking
- dbt: models + tests; incremental strategies (merge/insert_overwrite)
-- dbt incremental
{{ config(materialized='incremental', unique_key='order_id') }}
select * from source_orders
{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}

24) Table Formats: Iceberg/Hudi/Delta

- Iceberg: schema evolution, hidden partitioning, snapshots
- Hudi: upserts, incremental pulls; MOR/COW tables
- Delta: time travel, optimize/z-order

25) Data Contracts and Schemas

{
  "dataset": "orders",
  "version": 3,
  "schema": {
    "order_id": "string",
    "amount": "decimal(18,2)",
    "currency": "string",
    "created_at": "timestamp"
  },
  "constraints": ["order_id not null", "amount >= 0"]
}
- Validate in CI; block breaking changes; document deprecations

26) OpenLineage and Marquez

openlineage:
  url: http://marquez:5000
  namespace: elt
- Emit run facets (job, run, nominal_time) and dataset facets (schema, datasources)

27) CI/CD and Testing

- name: validate-dags
  run: airflow dags list && airflow tasks test elt_orders extract 2025-10-27
- name: validate-flows
  run: pytest flows/tests
- name: validate-dagster
  run: dagster asset check --select "*"

28) Kubernetes and Autoscaling

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: worker }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: worker }
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource: { name: cpu, target: { type: Utilization, averageUtilization: 70 } }

29) Observability Deep Dive

- Metrics: success ratio, runtime p95, retries, queue length, cost per run
- Tracing: spans per task with partition attributes, links for retries
- Logs: structured with run_id and partition; searchable; retention tiers
# Task p95 runtime
histogram_quantile(0.95, sum(rate(task_duration_seconds_bucket[5m])) by (le, task))

30) Cost Guardrails

- Set budgets per dag/flow; alert on anomalies
- Cache aggressively; minimize small files; compact outputs
- Spot/preemptible nodes; cost per GB processed tracked

31) Security and Compliance Details

- Secrets fetch at runtime; scoped roles; audit all accesses
- PII: tokenize/mask in staging; field-level encryption if needed
- Evidence: lineage + tests + approvals

32) Runbooks (Extended)

Skewed Partitions
- Identify heavy keys; rebalance; scale compute selectively

Schema Drift
- Contract violation: block loads; open incident; coordinate producers

Late Arrivals
- Reconcile with watermark; backfill deltas; mark corrections

33) Cloud-Specific Implementations

- AWS: Glue + EMR/Spark + MWAA; S3/Iceberg; Athena/Redshift
- Azure: Synapse/Spark + ADF + AKS; ADLS/Iceberg; Fabric/SQL
- GCP: Dataflow/Beam + Composer + Dataproc; GCS/BigLake; BigQuery

34) Reference Dashboards

{
  "title": "Pipelines Overview",
  "panels": [
    {"type":"stat","title":"Success %","targets":[{"expr":"sum(rate(task_success_total[1h]))/sum(rate(task_total[1h]))"}]},
    {"type":"timeseries","title":"Runtime p95","targets":[{"expr":"histogram_quantile(0.95, sum(rate(task_duration_seconds_bucket[5m])) by (le))"}]},
    {"type":"table","title":"Cost per DAG","targets":[{"expr":"sum by (dag) (task_cost_usd)"}]}
  ]
}

35) Extended FAQ (181–420)

  1. How to guarantee idempotency on DB writes?
    Idempotency keys and upserts; atomic rename for files.

  2. How to schedule with upstream SLAs?
    Align cron to data publication windows; add sensors only when necessary.

  3. Backfills without breaking SLAs?
    Throttle and prioritize; isolate backfill pools; cost caps.

  4. Exactly-once sinks?
    Use Flink sinks supporting two-phase commit; transactional tables.

  5. Small files problem?
    Compact; use larger partition sizes; coalesce outputs.

  6. Data quality rollups?
    Weekly trend panels; percent passing per suite.

  7. Lineage for RCA?
    Select failed asset; traverse upstream; identify impacts.

  8. Multi-tenant pipelines?
    Namespaces; quotas; per-tenant budgets and dashboards.

  9. Secrets rotation?
    Automate; test in staging; evidence captured.

  10. Prefer streaming or batch?
    Hybrid: micro-batch for most; streaming for low-latency needs.

  11. Spot/preemptible interruptions?
    Checkpointing; re-queue; idempotent resumes.

  12. Versioning models and data?
    Register models; snapshot tables; label runs.

  13. Canary new pipelines?
    Route subset of partitions; compare metrics; promote.

  14. Validate schema changes?
    Contracts in CI; block breaking; notify consumers.

  15. Sensitive datasets?
    Tokenize/mask; strict ACLs; audited exports.

  16. Load orchestration in monorepo?
    Path-based triggers; modular CI; isolated deploys.

  17. Global vs per-pipeline infra?
    Shared infra with quotas; dedicated for tier-1.

  18. How to sunset pipelines?
    Deprecation plan; stop schedule; archive data; remove infra.

  19. Cost anomalies?
    Alert on $/GB and runtime; investigate regressions.

  20. Final: pipelines are products—own reliability and cost.


36) Anti-Patterns and How to Fix Them

- Long-Running Sensors: replace with deferrable/event-driven triggers
- Non-Idempotent Loads: add idempotency keys and atomic renames
- Over-Partitioning: coalesce to optimal file sizes; compact small files
- Global Locks: prefer partition-level locks to improve throughput
- Hidden Coupling: declare dependencies explicitly; use lineage and contracts

37) Resource Management and Pools

- Airflow pools to limit concurrency; per-connection limits
- Prefect work queues with concurrency; prioritize critical flows
- K8s requests/limits; separate pools for backfills vs production
# Airflow pool example (cli)
airflow pools set ingest 10 "ingestion tasks"

38) End-to-End Example (Batch + dbt + Quality)

graph TD
  A[Extract S3 → Bronze] --> B[Transform Spark → Silver]
  B --> C[dbt Models → Gold]
  C --> D[Great Expectations Checks]
  D --> E[Publish to Warehouse + Serve]
# Pseudo Airflow DAG integrating Spark + dbt + GE
extract = SparkSubmitOperator(...)
transform = SparkSubmitOperator(...)
dbt_run = BashOperator(bash_command="dbt run --profiles-dir .")
ge_check = PythonOperator(python_callable=run_ge_suite)
extract >> transform >> dbt_run >> ge_check

39) Contracts Enforcement in CI

- name: validate-contracts
  run: contracts check --schema schemas/orders_v3.json --data samples/orders.csv

40) Advanced Backfills

- Watermark-based: process only partitions > watermark
- Range Backfill: inclusive ranges with per-partition checkpoints
- Shadow Backfill: write to alternate path, swap atomically after validation

41) Partition Strategies

- Time: daily/hourly; respect business calendars
- Entity: tenant or customer; avoid unbounded skew
- Hybrid: time + entity; roll up for serving

42) Dagster Partitions and Backfills

from dagster import DailyPartitionsDefinition

partitions_def = DailyPartitionsDefinition(start_date="2025-01-01")
@asset(partitions_def=partitions_def)
def orders_clean(): ...

43) Prefect Blocks and Storage

from prefect.deployments import Deployment
from prefect.filesystems import S3

s3_block = S3.load("prod-code")
Deployment.build_from_flow(flow=elt_orders, name="prod", storage=s3_block)

44) Airflow Dynamic Mapping and TaskFlow API

from airflow.decorators import task, dag

@task
def load_partition(p): ...

@dag(schedule_interval="@hourly", start_date=days_ago(1), catchup=True)
def pipeline():
  partitions = [f"{{{{ data_interval_start }}}}"]
  load_partition.expand(p=partitions)

45) Observability: Tracing Each Task

# Use OpenTelemetry instrumentations for Python to create spans per task
# Failure rate per task
default:sum(rate(task_failures_total[5m])) by (task) / sum(rate(task_runs_total[5m])) by (task)

46) Catalogs and Discoverability

- Data catalog with schemas, owners, SLAs, and lineage
- Link orchestration runs to catalog entries; owners on-call

47) Governance and Ownership

- RACI per pipeline; owners accountable for SLOs and cost
- CAB only for high-risk changes (schema breaks, cross-domain impacts)

48) Security Deep Dive

- Service accounts scoped; vault-backed secrets; network policies
- Pseudonymize in non-prod; managed identities to cloud stores
- Row-level security and column masking in DW; audit read access

49) More Runbooks

Pipeline Cold-Start Latency
- Pre-warm containers; cache dependencies; keep warm pools for peak

Unbalanced Work Distribution
- Use adaptive partitioning; histogram partitions; redistribute skewed keys

DW Load Bottlenecks
- Batch commits; optimize file sizes; use copy commands with manifest

50) Example IaC: MWAA + Composer

module "mwaa" { source = "terraform-aws-modules/mwaa/aws" name = "prod" }
resource "google_composer_environment" "prod" { name = "prod-composer" region = "us-central1" }

51) Cost Dashboards

{
  "title": "Pipeline Cost",
  "panels": [
    {"type":"timeseries","title":"$ per Run","targets":[{"expr":"sum by (dag) (task_cost_usd)"}]},
    {"type":"table","title":"Top 10 Expensive Tasks","targets":[{"expr":"topk(10, task_cost_usd)"}]}
  ]
}

52) Data Mesh Considerations

- Domain-aligned pipelines; contracts between producers/consumers
- Self-serve platform; federated governance; product thinking

53) Streaming + Batch Hybrid

- Lambda architecture pitfalls; prefer unified engines where possible (Iceberg + Flink/Spark)
- Use CDC streams to cut latency; reconcile in batch for accuracy

54) Incident Communications

- Status page updates; ETAs; blast radius; consumer guidance
- Post-incident RCAs with lineage diagrams and test proofs

55) Reliability Engineering for Pipelines

- SLOs per pipeline: success %, freshness lag, runtime p95
- Error budget policies: freeze risky changes when burn rate > 2x
- Incident taxonomy: extract/transform/load, upstream/downstream, infra

56) Scheduler Internals and Tuning

- Airflow: scheduler heartbeat, dag parsing limits, max_active_runs
- Prefect: work pool concurrency, deployment concurrency limits
- Dagster: run coordinator; daemon heartbeat; asset materialization queues

57) Storage Formats and Compaction

- Parquet: columnar, compression (zstd/snappy); target file sizes 128–512MB
- ORC/Avro: specialized use; Parquet default for analytics
- Compaction jobs: coalesce small files; write-manifest for atomic swaps

58) Streaming + Batch Reconciliation

- Stream now, reconcile later: eventual corrections via nightly batch
- CDC tables: upserts with versioning; conflict resolution rules
- Reconciliation panels: delta counts and amounts; anomaly alerts

59) Privacy and Compliance in Pipelines

- Data classification: tag PII columns; lineage marks sensitivity
- Mask in lower envs; generate synthetic data for tests
- Audit exports; encrypt at rest/in transit; access reviews

60) Dashboards: Quality, Freshness, Cost

{
  "title": "Pipeline Health",
  "panels": [
    {"type":"stat","title":"Freshness Lag (min)","targets":[{"expr":"avg(freshness_lag_minutes)"}]},
    {"type":"timeseries","title":"Quality Pass %","targets":[{"expr":"avg_over_time(quality_pass_ratio[7d])"}]},
    {"type":"table","title":"Cost by Partition","targets":[{"expr":"sum by (partition) (task_cost_usd)"}]}
  ]
}

61) Producer/Consumer Contracts and SLAs

- Producers: publish schema and timing; notify changes; backfill policy
- Consumers: declare expectations; test inputs; handle late data
- SLAs: availability, freshness, correctness thresholds; shared dashboards

62) Security Posture for Orchestration

- Least-privilege service accounts; short-lived credentials
- Network policies to DW/storage; deny-all egress by default
- Signed artifacts for jobs; SBOM scans; admission policies

63) Blue/Green Pipelines and Shadow Writes

- Shadow to alternate path/table; validate metrics and DQ; switch pointer
- Rollback by reverting pointer; retain shadow for diff/audit

64) Evergreen Runbooks

- DAG stuck in queued: check pools/concurrency; scheduler logs; task slot availability
- DW load throttling: reduce batch size; increase COPY parallelism; schedule during off-peak
- Blob store throttling: exponential backoff; multipart uploads; region-local writes

65) Checklists

- DAG/Flow Readiness: idempotency, contracts, tests, lineage, DQ, dashboards
- Backfill Plan: ranges, cost estimate, checkpoints, rollback
- Decommission Plan: consumer sign-off, archive, infra cleanup

66) Reference CLI Snippets

# Airflow backfill
airflow dags backfill elt_orders -s 2025-10-01 -e 2025-10-07

# Prefect deploy
prefect deploy flows/elt_orders.py:elt_orders -n prod

# Dagster materialize
dagster asset materialize --select "orders_*" --partition 2025-10-27

67) Data Validation at Scale

- Sample-based checks for non-critical datasets; full checks for gold
- Bloom filters and sketches for joins reconciliation
- Row count and hash checks between stages

68) Cost Modeling Examples

component,unit,qty,unit_cost,monthly
spark_executor_hour,hour,1200,0.12,144
object_storage_gb,GB,5000,0.023,115
warehouse_compute_hour,hour,800,2.5,2000
- Optimize partition sizes; avoid data skews; reduce shuffles

69) Playbooks for Common Engines

- Spark: broadcast hints, AQE on, shuffle partition tuning, cache hot dims
- Flink: checkpoint intervals, RocksDB tuning, exactly-once sinks
- Dask: worker memory target, spill-to-disk, adaptive scaling

70) Mega FAQ (801–1200)

  1. How to handle duplicate events?
    Idempotency keys and dedupe stores; watermark-driven cleanup.

  2. Partition discovery too slow?
    Manifest files and listing caches; push-based partition maps.

  3. DW merge performance poor?
    Clustered/partitioned tables; batch upserts; temp tables + swap.

  4. Schema change rollover?
    Add columns; backfill; migrate readers; remove old fields later.

  5. PII exposure risk?
    Mask/tokenize; restrict in logs; encrypt sensitive fields.

  6. Multiple orchestrators?
    Prefer one; if multiple, integrate lineage and logging.

  7. Ad hoc jobs?
    Use a sandbox queue; strict quotas; log artifacts.

  8. SLA enforcement?
    Alerts and error budgets; postmortems for breaches.

  9. Data mesh chaos?
    Contracts and platform standards; federated governance.

  10. "Exactly-once" guarantees?
    Understand engine guarantees; design idempotent sinks regardless.

  11. Testing on large datasets?
    Sample; synthetic generation; property-based tests.

  12. Pipeline versioning?
    Semantic versions; store run metadata and artifacts.

  13. Real-time DQ?
    Stream checks; threshold alerts; fallbacks.

  14. Multi-cloud pipelines?
    Place compute near data; egress budget; cross-cloud lineage.

  15. Hybrid cloud?
    VPN/Direct connect; identity federation; consistent logging.

  16. Governance fatigue?
    Automate checks; focus on high-risk data/products.

  17. Debugging intermittent failures?
    Correlate with infra metrics; add retries with jitter; analyze patterns.

  18. Pipeline sprawl?
    Catalog ownership; deprecate unused; templates.

  19. Hot/cold paths?
    Hot minimal transforms; cold heavy processing; reconcile.

  20. Final: build boring, reliable pipelines.


71) Architecture Blueprints

- Batch ELT to Data Warehouse (dbt-driven)
- Lakehouse (Iceberg/Hudi/Delta) with Spark/Dask
- Streaming-first (Flink/Beam) with nightly reconciliation
- ML feature pipelines with offline/online stores
graph TD
  SRC((Sources)) --> X[Ingest/CDC]
  X --> BRONZE
  BRONZE --> SILVER
  SILVER --> GOLD
  GOLD --> FS[Feature Store]
  GOLD --> BI
  O[Orchestrator] --> X
  O --> BRONZE
  O --> SILVER
  O --> GOLD

72) Engine Configuration Snippets

spark:
  dynamicAllocation: { enabled: true, minExecutors: 2, maxExecutors: 200 }
  shufflePartitions: 400
  broadcastTimeout: 600
  aqe: true
flink:
  checkpointIntervalMs: 60000
  minPauseBetweenCheckpoints: 10000
  stateBackend: rocksdb
  exactlyOnce: true

73) Scheduling Math and SLAs

- Publication time distributions (P90/P99) inform schedule offset
- SLA = deadline - schedule; page if end-to-end runtime > SLA
- Overlap windows to reduce tail failures; adjust based on seasonality

74) Retry and Backoff Patterns

- Exponential backoff with jitter: avoid synchronized retries
- Max attempts based on upstream rate limits and SLOs
- Dead-letter to quarantine bad partitions; manual review
import random, time
for attempt in range(1, 6):
    try:
        run_task()
        break
    except Exception:
        time.sleep(min(300, (2 ** attempt) + random.uniform(0, 1)))

75) Data Contracts and Schema Registry

- Contracts stored as JSON Schema or Protocol Buffers
- Registry versioning; backward-compatible changes by default
- CI checks: producer/consumer compatibility and migrations
checks:
  - schema_compatibility: { from: v2, to: v3, mode: backward }
  - required_fields_present: [order_id, created_at]

76) Lineage Operations

- Completeness audits: % tasks emitting lineage events
- Data retention for lineage metadata; privacy-safe facets
- RCA workflows: traverse from failure to upstream producers

77) Data Quality Libraries and Patterns

- GE suites: critical vs non-critical expectations; trend pass %
- Statistical tests: drift detection (KS-test), outlier detection
- Contract-level checks: field-level constraints and referential integrity

78) Governance and RBAC

role,read,write,run,backfill,approve
producer,yes,yes,yes,yes,limited
consumer,yes,no,no,no,no
platform,yes,yes,yes,yes,yes
security,yes,no,no,no,yes

79) Privacy Patterns

- Tokenization services for sensitive fields; key management separate
- Mask in non-prod; PII removal pre-indexing into search/logging
- Privacy manifests per dataset; access approvals logged

80) CI/CD Templates (Extended)

name: pipeline-ci
on: [pull_request]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ruff dags/ && black --check dags/
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pytest -q
  validate_contracts:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: contracts check --all
  publish:
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/publish.sh

81) Multi-Region Operations

- Duplicate control planes only when necessary; prefer shared with HA
- Storage replication: async for lakehouse; DW replication or external tables
- Region-aware scheduling; egress budgets enforced

82) DR/BCP for Data Pipelines

- Backup DAGs/flows, state stores, lineage DB, quality reports
- Restore drills: run critical pipelines on shadow env; time them
- Minimum viable restore path documented and tested

83) Dashboards and Alerts JSON (More)

{
  "title": "Data Quality",
  "panels": [
    {"type":"stat","title":"% Passing","targets":[{"expr":"avg_over_time(quality_pass_ratio[7d])"}]},
    {"type":"timeseries","title":"Failures","targets":[{"expr":"sum(rate(quality_failures_total[5m]))"}]}
  ]
}
- alert: PipelineFreshnessLag
  expr: freshness_lag_minutes > 30
  for: 15m
  labels: { severity: warning }

84) Security Controls as Code

package pipeline.controls

violation[msg] {
  input.dataset.pii == true
  not input.contracts.masking
  msg := sprintf("PII dataset missing masking policy: %s", [input.dataset.name])
}

85) Additional Runbooks

Storage Quota Exceeded
- Compact tables; delete temp; increase quota; add lifecycle rules

DW Credit Burn Spike
- Pause non-critical models; review schedule; optimize queries

Spark Shuffle Failures
- Increase shuffle partitions; check disk; enable AQE; scale executors

86) Example Policies and Guardrails

policies:
  - name: idempotent-writes
    rule: "no overwrite without atomic rename"
  - name: schema-compat
    rule: "no breaking changes without migration plan"
  - name: dq-gates
    rule: "critical checks must pass before publish"

87) Extended Mega FAQ (1201–1600)

  1. How to choose partition sizes?
    Balance parallelism and file counts; target 128–512MB parquet files.

  2. Prevent missing partitions?
    Audit expected vs actual; alert; backfill missing ones.

  3. Capture lineage for SQL?
    Use dbt exposures or parser; OpenLineage SQL facets.

  4. Tune Flink checkpoints?
    Pick intervals by latency/error goals; adjust state backend.

  5. Glue vs EMR?
    Glue for serverless transforms; EMR for fine control; consider costs.

  6. BigQuery incremental pitfalls?
    Partitioned tables; MERGE carefully; avoid full scans.

  7. Data warehouse vacuum/optimize cadence?
    Weekly; after large merges; monitor table health.

  8. Reprocessing historical data?
    Shadow outputs; validate; atomic swap on pass.

  9. Measuring $/GB processed?
    Instrument engines; dashboards per job/task.

  10. ETL vs ELT?
    Prefer ELT with modern DW; ETL for complex pre-processing.

  11. Canary data transformations?
    Subset partitions; compare metrics; promote.

  12. Secrets in DAG repos?
    Forbidden; use secret managers and CI inject.

  13. Late data window size?
    Based on source SLA and observed lag; document

  14. When to split a DAG?
    By ownership and dependencies; reduce critical path.

  15. Schema evolution without downtime?
    Add, backfill, switch readers, remove old; contract-driven.

  16. Monitor retries?
    Panel on retries/min; alert on spikes.

  17. Data contracts vs DQ?
    Contracts define allowed schema; DQ checks actual values.

  18. Parallelizing dbt?
    select subsets; threads; defer + state artifacts.

  19. Iceberg/Hudi compaction?
    Schedule periodic compaction; manage small files.

  20. Final: reliable, cost-aware pipelines win.


88) SLA Design and Error Budgeting Examples

- Freshness SLO: data < 15 minutes stale for gold datasets (99.9%)
- Success SLO: 99.5% tasks succeed within 24h windows
- Runtime SLO: p95 task runtime < 10 minutes for top DAGs
- Burn policy: freeze changes when fast burn > 2x over 1h

89) Queueing and Throttling

- Ingest throttles for upstream rate limits; backpressure to sources
- Per-tenant quotas and fairness; priority queues for critical paths
- Token bucket for API loads; adaptive concurrency based on latency

90) Partition Discovery and Manifests

- Manifests list expected partitions; orchestration reads and schedules
- Checksums per partition; lineage ties partition → upstream sources
- Missing partitions panel with alerting and auto-retry

91) Table Maintenance Jobs

- Vacuum/optimize cadence per dataset size; z-order or clustering
- Repair partitions after schema evolution; compact small files
- Metrics: table size, file count, small-file ratio, optimize runtime

92) Data Contracts Enforcement Flow

graph TD
  PR[Producer PR] --> CI[Contract Checks]
  CI -->|pass| Merge
  CI -->|fail| Fix
  Merge --> Deploy
  Deploy --> Monitor

93) Schema Evolution Cookbook

- Additive changes: add nullable fields; update contracts; notify consumers
- Deprecations: dual-write; readers tolerate old+new; remove later
- Breaking changes: versioned datasets; migration plan; shadow and swap

94) Execution Engines Tuning Tables

engine,setting,recommendation
spark,aqe,true
spark,shuffle.partitions,200–800
flink,checkpoint.interval.ms,30000–120000
flink,rocksdb.memory,per-task tuned
dask,worker.memory.target,0.6–0.7

95) Incremental Patterns Library

- Append-only with watermarks; merge with dedupe keys
- SCD Type 2 with effective/expiry timestamps and current flags
- Late-arriving updates: correction DAGs with audit logs

96) Sampling and Drift Detection

- Random or stratified samples for value checks; KS-test for distributions
- Alert on significant drift; quarantine suspect partitions

97) Catalog Integrations

- Register datasets with owners, SLAs, contracts, and lineage pointers
- Link orchestration runs; show freshness and quality status

98) Access Patterns and Cost Controls

- Coalesce for BI scans; partition pruning; predicate pushdown
- Cache hot dimensions; precompute aggregates; publish as gold tables

99) ML Pipelines Notes

- Feature stores (offline/online); training/serving skew monitoring
- Model/version lineage; reproducibility with data + code digests

100) Example Policy Set (YAML)

policies:
  - name: dq-critical-gate
    rule: "fail publish if critical checks fail"
  - name: lineage-required
    rule: "upstream/downstream lineage facets must be emitted"
  - name: contract-compat
    rule: "no breaking change without migration and approval"

101) Multi-Cloud Data Pipelines

- Place compute near data; enforce egress budgets; replicate only gold
- Cross-cloud lineage and contracts; DR runbooks for data

102) End-to-End Example (Pseudocode)

# orchestrator triggers extract → transform → dbt → dq → publish

103) Dashboards: Ownership and On-Call

- Each dataset has an owner; on-call rotation; runbooks linked
- Panels: freshness, quality, cost, and incidents

104) Change Management

- RFCs for high-risk changes; CAB when cross-domain
- Canary writes; shadow reads; rollback plan documented

105) Evidence Bundles (Compliance)

{
  "contracts": ["schemas/orders_v3.json"],
  "dq_runs": ["dq/orders/2025-10-27.json"],
  "lineage": ["openlineage/runs/123"],
  "approvals": ["PR-5678"],
  "artifacts": ["images/elt@sha256:..."]
}

106) Future: Declarative Orchestration + Lineage-Driven

- Assets-first; lineage drives scheduling; quality gates are native
- Less DAG plumbing; more declarative contracts and checks

107) Dataset Freshness Engineering

- Define expected availability per source; compute freshness lag per dataset
- Alert tiers: warning at P90 breach, critical at P99 breach
- Backfill automation triggered by lag + missing partitions

108) Metadata Stores and State

- Store run metadata (start/end/status, partition, cost, bytes processed)
- Watermarks per source/dataset; persisted in metastore
- Expose metadata via API for BI and operations

109) Retry Classifications

- Transient (network/timeouts): retry with jitter
- Upstream (missing data): delay and reschedule
- Permanent (schema/logic): fail fast; open incident

110) Parallelism and Throughput Tuning

- Balance parallel partitions against cluster capacity
- Avoid global mutexes; use sharding by partition keys
- Use queues per domain to prevent starvation

111) Testing Strategies

- Unit tests for transforms; property-based tests for invariants
- Golden data snapshots; replay-based tests for regressions
- Contract tests: producers and consumers in CI

112) Staging vs Production Parity

- Same engine and configs; scaled-down data; synthetic PII
- Shadow pipelines in prod for high-risk changes

113) Data Lake Maintenance

- Lifecycle policies for raw/bronze; archive cold data
- Compact/optimize cadence; vacuum tombstones
- Metadata cleanup for dropped partitions/tables

114) DW Optimization

- Clustered/partitioned tables; materialized views for hot queries
- COPY with manifest; batch sizes tuned; statistics refresh

115) Run Metadata Schema (JSON)

{
  "run_id": "elt-orders-2025-10-27-01",
  "pipeline": "elt_orders",
  "partition": "2025-10-27-01",
  "status": "success",
  "duration_s": 412,
  "bytes_in": 184467440,
  "bytes_out": 92233720,
  "cost_usd": 1.83
}

116) Quality Metrics Library

- Null rate by column; distinct keys; referential integrity rate
- Numeric distributions: mean/median/std; drift scores
- Row counts per table vs upstream deltas

117) Operational KPIs

- Success % per day; freshness lag; runtime p95; retries/run
- Cost/run and $/GB; table small-file ratio; optimize times

118) Change Windows and Freeze Policies

- Freeze high-risk pipelines before major events (finance close, peak season)
- Emergency hotfix path with approvals; postmortem required

119) Access Review SOP

- Quarterly review of DW/orchestrator/storage access
- Remove dormant accounts; rotate service keys; log evidence

120) Communication Templates

- Pre-change notice: scope, risk, rollback, contacts
- Incident updates: impact, ETA, mitigations, next update time
- Post-incident: root cause, fixes, follow-ups, timelines

121) Example Great Expectations Suite (YAML)

expectations:
  - expect_column_values_to_not_be_null:
      column: order_id
  - expect_column_values_to_be_between:
      column: amount
      min_value: 0
      max_value: 100000
  - expect_table_row_count_to_be_between:
      min_value: 100

122) SLO Documents (Template)

Service: ELT Orders
SLIs: success %, freshness < 15m, runtime p95 < 10m
Targets: 99.5%, 99.9%, 95%
Error Budget: 216 minutes/month
Policies: burn-rate alerts, freeze on 2x for 1h

123) Engine-Specific Cheatsheets

- Spark: skew hints, AQE, broadcast joins, repartition wisely
- Flink: watermarking, RocksDB memory, checkpoints, timeouts
- Dask: partitions, spill to disk, adaptive scaling

124) Producer Guidelines

- Stable schemas; document changes; publish schedules
- Provide sample data; support backfills; expose lineage

125) Consumer Guidelines

- Validate inputs; design for late data; cache read patterns
- Version pinning; read from gold when possible

126) Data Contracts—Approval Workflow

- PR with schema diff; CI compatibility check; reviewer from platform + security
- Migration plan for breaking changes; communication to consumers

127) Drift Detection Alerts

- alert: AmountDrift
  expr: abs(zscore(amount_mean)) > 3
  for: 15m

128) Privacy-by-Design Checklist

- Minimize field collection; avoid unnecessary PII
- Mask in non-prod; tokenize sensitive fields
- Redact logs; encrypt at rest and in transit

129) Platform Backlog Examples

- Add OpenLineage coverage to 100% of pipelines
- Implement compact jobs for all large tables
- Introduce cost panels per pipeline with alerts

130) Mega FAQ (2001–2400)

  1. Can I run everything in one orchestrator?
    Yes, but avoid coupling—externalize compute engines and contracts.

  2. Best way to roll out schema changes?
    Additive first; dual-write; migrate readers; remove old fields later.

  3. Run idempotency across streaming and batch?
    Use event keys and upserts; reconcile nightly for corrections.

  4. Safe retry counts?
    Tune per source; start small (3–5) with jitter.

  5. Detect partial loads?
    Checksums per partition; compare upstream records vs loaded rows.

  6. Storage layout for BI?
    Partition + cluster; aggregated gold tables; materialized views.

  7. Canary a transform refactor?
    Shadow to alt path; compare metrics; flip pointer.

  8. What to log per run?
    Run metadata, partition, source versions, and cost.

  9. Handling upstream outages?
    Pause triggers; communicate; automatically retry with backoff.

  10. End: predictable, observable, incremental improvements.


131) Closing Notes

Reliable pipelines require idempotency, contracts, quality gates, lineage, observability, and ruthless cost control—treat them as products with owners and SLOs.

Micro FAQ (2401–2420)

  1. Rollback strategy?
    Retain last-good snapshots; atomic pointer swaps; documented steps.

  2. Release cadence?
    Small, frequent changes; backfill plans ready.

  3. Who owns cost?
    Pipeline owners with platform guardrails and dashboards.

  4. Final: build boring, resilient pipelines.

Micro FAQ (2421–2430)

  1. Alert routing?
    Team-specific channels and pages with runbook links.

  2. Data deletion requests?
    Track lineage; delete per policy; evidence retained.

  3. Final: iterate with guardrails.


132) Dataset Ownership and Stewardship

- Each dataset has an accountable owner and steward
- Owners: SLOs, cost, reliability; Stewards: metadata, quality, documentation
- Publish onboarding docs: schema, contract, freshness, sample queries

133) Cross-Org Data Sharing

- Share via secure views or export pipelines; versioned interfaces
- Mask PII; apply row-level/column-level security; audit access
- Contract-driven sharing with SLAs and costs negotiated

134) Monitoring Playbook

- Freshness panels by tier (bronze/silver/gold); drill-down by dataset
- DQ pass ratio trend; failures by expectation and dataset
- Runtime p95 by DAG/asset; retries/run; top error classes

135) Capacity Planning

- Track CPU-hours, memory pressure, shuffle IO; plan for seasonal peaks
- Pre-warm clusters at expected spikes; scale to zero off-peak
- Record cost/run and $/GB over time; regression alerts

136) Data API Gateways

- Publish gold datasets via APIs with caching and rate limits
- Backed by warehouse/lakehouse; monitor API SLOs

137) Staged Rollouts and Feature Flags

- Use flags to enable new transforms; shadow writes; compare KPIs
- Roll forward/back by config; log decisions

138) Change Approval Matrix

change_type,risk,approval
schema_break,high,platform+security+owner
schema_add,medium,owner
config_tuning,low,owner
backfill_large,high,platform+owner

139) Data Contracts—Diff Reports

- Generate diffs in PRs; show added/removed/changed fields and constraints
- Link to impacted consumers; require acknowledgments

140) Golden KPIs

- Freshness < target; Quality pass %; Runtime p95; Success %; $/GB
- Display per domain and pipeline; weekly review

141) Example Alerts Catalog

- alert: HighRetryRate
  expr: sum(rate(task_retries_total[15m]))/sum(rate(task_runs_total[15m])) > 0.1
  for: 10m
- alert: DQFailureSpike
  expr: increase(quality_failures_total[30m]) > 10
  for: 10m

142) Knowledge Base and Training

- Playbooks, runbooks, FAQs, and “gotchas” per engine and domain
- Quarterly training; tabletop exercises for incidents

143) Roadmap Themes

- Lineage coverage to 100%; cost panels per pipeline
- Event-driven orchestration where fit; remove long sensors
- Contracts enforced in CI across producers/consumers

Micro FAQ (2431–2460)

  1. Who approves large backfills?
    Pipeline owner + platform; cost estimate required.

  2. How to publish sample data safely?
    Mask PII; use synthetic for sensitive fields.

  3. Track producer SLAs?
    Dashboards for publication times; alert on late sources.

  4. Detect bad joins?
    Null/dup checks; referential integrity rates; DQ gates.

  5. Partition drift?
    Audit expected vs actual; auto-create missing; alert owners.

  6. Optimize Spark skew?
    Skew hints, salting keys, AQE; broadcast small dims.

  7. Warehouse concurrency caps?
    Queue non-critical jobs; prioritize gold; off-peak schedules.

  8. Privacy exceptions?
    Time-bound with compensating controls; reviewed monthly.

  9. Record external data licensing?
    Metadata with terms and expiry; alert before renewal.

  10. Done: operational excellence is continuous.