Orchestrating ML Workflows Using Airflow or Dagster

Machine learning workflows are complex beasts. They involve data extraction, validation, preprocessing, feature engineering, model training, evaluation, deployment, and monitoring—all of which need to run reliably, often on schedules, and with proper handling of failures and dependencies. This is where workflow orchestration tools become essential. Apache Airflow and Dagster have emerged as two leading solutions, each with distinct philosophies and strengths. Understanding how to orchestrate ML workflows using these tools—and which one fits your needs—can make the difference between a fragile system held together with duct tape and a robust, maintainable ML platform.

Understanding Workflow Orchestration for ML

Workflow orchestration is about defining, scheduling, and monitoring complex pipelines where tasks have dependencies on each other. In machine learning, you might need to extract data from a database every morning, validate it, train a model only if the data quality checks pass, evaluate the model against your production model, and deploy it only if it performs better. Each step depends on the previous ones, and failures need to be handled gracefully with retries, alerts, and rollbacks.

The orchestrator acts as the conductor of this symphony, ensuring tasks run in the right order, resources are allocated appropriately, and you have visibility into what’s happening. Without orchestration, teams resort to cron jobs with fragile bash scripts, manual execution, or complex custom schedulers that become maintenance nightmares. A proper orchestration tool brings structure, reliability, and observability to your ML pipelines.
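A useful way to see what the orchestrator adds is a minimal sketch of its core loop in plain Python: topologically sort tasks by their dependencies, run them in order, and skip anything downstream of a failure. Everything here (the task names, the `run_pipeline` helper) is illustrative; real orchestrators layer scheduling, retries, distributed execution, and UIs on top of exactly this kind of graph traversal.

```python
from graphlib import TopologicalSorter

# Hypothetical ML pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract": set(),
    "validate": {"extract"},
    "train": {"validate"},
    "evaluate": {"train"},
    "deploy": {"evaluate"},
}

def run_pipeline(tasks, runners):
    """Run tasks in dependency order; skip dependents of failed tasks."""
    failed = set()
    results = {}
    for name in TopologicalSorter(tasks).static_order():
        if tasks[name] & failed:
            results[name] = "skipped"
            failed.add(name)  # propagate so transitive dependents also skip
            continue
        try:
            runners[name]()
            results[name] = "success"
        except Exception:
            results[name] = "failed"
            failed.add(name)
    return results

# With no-op runners, every task succeeds.
runners = {name: (lambda: None) for name in pipeline}
print(run_pipeline(pipeline, runners))
```

If the `validate` runner raises, `train`, `evaluate`, and `deploy` are reported as skipped, which is precisely the "train only if validation passes" behavior described above.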

Apache Airflow: The Veteran Scheduler

Apache Airflow, created at Airbnb in 2014 and open-sourced shortly after, has become the de facto standard for workflow orchestration. Its core concept is the DAG (Directed Acyclic Graph)—a collection of tasks with defined dependencies that Airflow schedules and executes.

The Airflow Approach to ML Workflows

Airflow treats everything as tasks arranged in a DAG. You define your ML pipeline by creating a Python file that specifies tasks and their dependencies. A task might be “extract data from database,” another “preprocess data,” another “train model,” and so on. Airflow’s scheduler monitors these DAGs, triggers them based on schedules or external triggers, and manages task execution across workers.

Core concepts include:

  • DAGs: The workflow definition, written as Python code
  • Operators: Reusable task templates (PythonOperator, BashOperator, KubernetesPodOperator, etc.)
  • Tasks: Individual units of work within a DAG
  • Scheduler: Monitors DAGs and triggers task execution
  • Executor: Determines how tasks run (locally, on Celery workers, on Kubernetes)

For ML workflows, you might use a KubernetesPodOperator to spin up a pod with GPUs for model training, a PythonOperator for data validation, and a BashOperator for running inference scripts. Airflow handles the orchestration—ensuring tasks run in order, retrying failures, and alerting you when things go wrong.

Airflow in Practice: A Training Pipeline Example

Consider a daily model retraining pipeline. Your DAG might look like this:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.operators.emr import EmrCreateJobFlowOperator

# Assumed to be defined elsewhere in your project:
# extract_data_from_warehouse, run_data_validation, training_config

with DAG(
    'ml_training_pipeline',
    start_date=datetime(2025, 1, 1),
    schedule='@daily',  # schedule_interval in Airflow versions before 2.4
    catchup=False,
    default_args={'retries': 2, 'retry_delay': timedelta(minutes=5)},
) as dag:

    extract_data = PythonOperator(
        task_id='extract_training_data',
        python_callable=extract_data_from_warehouse,
    )

    validate_data = PythonOperator(
        task_id='validate_data_quality',
        python_callable=run_data_validation,
    )

    train_model = EmrCreateJobFlowOperator(
        task_id='train_on_emr',
        job_flow_overrides=training_config,
    )

    extract_data >> validate_data >> train_model

This pipeline extracts data, validates it, and only trains if validation passes. The >> operator defines dependencies. Airflow handles scheduling, retries, and provides a UI to monitor progress.
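The conditional "only train if validation passes" needs no special branching here: if the validation callable raises, Airflow marks the task failed and never schedules its downstream tasks. A hedged sketch of what a callable like run_data_validation (a name assumed above, not something Airflow provides) might look like:

```python
def run_data_validation(records, min_rows=10000):
    """Hypothetical callable for the validate_data_quality task.

    Raising any exception fails the Airflow task, which in turn
    blocks downstream tasks such as train_on_emr from running.
    """
    if len(records) < min_rows:
        raise ValueError(f"too few rows: {len(records)} < {min_rows}")
    if any(r.get("target") is None for r in records):
        raise ValueError("null labels found in training data")
    return len(records)
```

In the DAG above the callable takes no arguments and would load the extracted data itself (from the warehouse, a staging table, or XCom); the `records` parameter is shown here only to keep the sketch self-contained.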

Airflow’s Strengths and Challenges

Airflow excels at scheduling and executing tasks reliably at scale. Its ecosystem is mature with hundreds of pre-built operators for various services. The web UI provides excellent visibility into pipeline runs, task logs, and historical performance. It integrates seamlessly with cloud platforms and supports multiple executors for different scaling needs.

However, Airflow has limitations for ML workflows. It’s fundamentally a task scheduler, not a data-aware orchestrator. Tasks operate independently without knowledge of the data they produce or consume. This makes it harder to handle data lineage, implement efficient caching, or partition work based on data. Airflow also requires careful setup—managing the scheduler, workers, database, and ensuring everything scales properly can be operationally complex.

Testing Airflow DAGs can be cumbersome because tasks often have external dependencies. Local development means running a full Airflow instance. The paradigm of writing DAGs as Python code that gets parsed by the scheduler also creates gotchas—you can’t use certain Python patterns, and mistakes in DAG definition can break the entire scheduler.

Dagster: The Modern Data-Aware Orchestrator

Dagster represents a newer generation of orchestration tools, built from the ground up with data and ML workflows in mind. Released in 2019, it introduces concepts that make working with data pipelines more intuitive and maintainable.

The Dagster Philosophy

Dagster’s fundamental unit is the “asset”—a piece of data or ML artifact that your pipeline produces. Instead of thinking about tasks, you think about assets and how they’re produced. A training dataset is an asset. A trained model is an asset. Evaluation metrics are assets. You define software-defined assets (SDAs) that describe how to compute these assets, and Dagster figures out the execution order.

This shift from task-centric to data-centric thinking aligns naturally with ML workflows. You care about producing a trained model with certain properties, not just running a training script. Dagster makes this explicit.

Key concepts include:

  • Assets: Data or ML artifacts produced by your pipeline
  • Software-Defined Assets (SDAs): Functions that define how to compute assets
  • Ops: Reusable units of computation (similar to Airflow tasks)
  • Jobs: Collections of ops or asset materializations
  • Resources: External services like databases or API clients
  • Sensors: Triggers that watch for conditions to start runs

Dagster in Practice: Asset-Based ML Pipeline

The same training pipeline in Dagster might look like this:

from dagster import asset, AssetExecutionContext
import pandas as pd

# Assumed to be defined elsewhere in your project:
# extract_data_from_warehouse, train_model, save_model

@asset
def training_data(context: AssetExecutionContext) -> pd.DataFrame:
    """Extract and prepare training data."""
    df = extract_data_from_warehouse()
    context.log.info(f"Extracted {len(df)} records")
    return df

@asset
def validated_training_data(training_data: pd.DataFrame) -> pd.DataFrame:
    """Validate data quality before training."""
    assert training_data['target'].notnull().all(), "null labels found"
    assert len(training_data) > 10000, "too few rows to train on"
    return training_data

@asset
def trained_model(validated_training_data: pd.DataFrame) -> dict:
    """Train the ML model and persist it."""
    model = train_model(validated_training_data)
    save_model(model, "s3://models/latest")
    # accuracy shown as a fixed value for illustration
    return {"model_path": "s3://models/latest", "accuracy": 0.92}

Here, assets depend on each other through function parameters: because validated_training_data takes a parameter named training_data, Dagster knows it depends on the training_data asset, and trained_model in turn depends on the validated data. The data flows through the pipeline naturally, with no explicit dependency operators.
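The wiring deserves a closer look. Dagster builds the dependency graph by matching an asset function's parameter names against the names of other assets. A toy illustration of that idea using only the standard library (this shows the concept, not Dagster's actual implementation):

```python
import inspect

def training_data():
    return [1, 2, 3]

def validated_training_data(training_data):
    return [x for x in training_data if x > 0]

def infer_deps(assets):
    """Map each asset name to the asset names appearing in its signature."""
    return {
        name: [p for p in inspect.signature(fn).parameters if p in assets]
        for name, fn in assets.items()
    }

assets = {
    "training_data": training_data,
    "validated_training_data": validated_training_data,
}
print(infer_deps(assets))
# {'training_data': [], 'validated_training_data': ['training_data']}
```

Because dependencies come from names rather than explicit `>>` wiring, renaming an upstream asset without updating its consumers is caught at definition time instead of at 3 AM.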

Dagster’s Advantages for ML

Dagster’s data-awareness brings several advantages. Asset definitions include metadata, dependencies, and descriptions, making lineage tracking automatic. You can see exactly how any asset was produced and what depends on it. This is invaluable for ML, where understanding data provenance and model dependencies is critical.

Partitioning is built-in and elegant. If you want to train models per customer segment or per time period, Dagster handles this natively. You can backfill partitions, test individual partitions, and monitor partition-level success rates.

Testing is more natural because assets are pure functions with inputs and outputs. You can unit test asset logic without running the full orchestrator. Local development is simpler too—Dagster’s web UI (formerly called Dagit) runs locally via `dagster dev` without complex setup.
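Concretely, a unit test can call the asset logic directly with an in-memory DataFrame—no scheduler, database, or running orchestrator required. The sketch below (assuming pandas) repeats the validation logic from the earlier example as a plain function; with Dagster installed, the @asset-decorated function can be invoked directly in the same way:

```python
import pandas as pd

def validated_training_data(training_data: pd.DataFrame) -> pd.DataFrame:
    # Same logic as the asset shown earlier in the pipeline example.
    assert training_data["target"].notnull().all(), "null labels found"
    assert len(training_data) > 10000, "too few rows to train on"
    return training_data

def test_rejects_null_labels():
    df = pd.DataFrame({"target": [1.0, None] * 6000})
    try:
        validated_training_data(df)
        raised = False
    except AssertionError:
        raised = True
    assert raised, "expected validation to reject null labels"

def test_accepts_clean_data():
    df = pd.DataFrame({"target": [1.0] * 10001})
    assert validated_training_data(df) is df

test_rejects_null_labels()
test_accepts_clean_data()
```

Tests like these run in milliseconds under pytest or plain Python, which is what makes fast local iteration on pipeline logic practical.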

Dagster also provides a rich type system, allowing you to define schemas for your assets. You can enforce that training data has specific columns, that model outputs conform to expected formats, and get validation automatically.

Where Dagster Falls Short

Dagster is younger, which means a smaller ecosystem of integrations compared to Airflow. While it covers major cloud services and databases, you might need to write custom resources for niche systems. The community is growing but smaller than Airflow’s massive user base.

For organizations already invested in Airflow with extensive DAGs, migration represents significant work. Dagster’s paradigm shift requires rethinking how pipelines are structured. Teams need to learn new concepts and patterns.

Dagster can also be more opinionated. While this helps with consistency, it means less flexibility in certain scenarios. If you need fine-grained control over task execution or have very specific scheduling requirements, Airflow’s flexibility might be preferable.

Airflow vs Dagster: Key Differences

Apache Airflow

  • Task-centric: Focus on operations
  • Mature ecosystem: Hundreds of operators
  • Schedule-driven: Time-based triggers
  • Battle-tested: Used at massive scale
  • Complex setup: Operational overhead

Dagster

  • Asset-centric: Focus on data products
  • Growing ecosystem: Modern integrations
  • Data-driven: Asset-based triggers
  • Developer-friendly: Better testing/dev experience
  • Simpler setup: Easier to get started

Making the Choice for Your ML Platform

The decision between Airflow and Dagster depends on your specific context. If you have existing Airflow infrastructure and expertise, extending it for ML workflows is often the pragmatic choice. Airflow handles ML workloads well despite not being ML-specific, and the operational knowledge your team has built is valuable.

For new ML platforms or teams prioritizing developer experience and data lineage, Dagster offers compelling advantages. Its asset model aligns naturally with how ML practitioners think about pipelines. The improved testing story and simpler local development can boost team velocity significantly.

Consider these factors:

Choose Airflow if:

  • You have existing Airflow infrastructure and expertise
  • You need maximum flexibility in task execution
  • You require extensive integrations with legacy systems
  • You’re comfortable with operational complexity
  • Your workflows are more about task scheduling than data transformation

Choose Dagster if:

  • You’re building a new ML platform from scratch
  • Data lineage and asset tracking are priorities
  • You want better local development and testing workflows
  • Your team values modern developer experience
  • Your workflows center on producing and consuming data assets

Some organizations even run both—Airflow for legacy batch jobs and operational workflows, Dagster for new ML pipelines where its asset model shines.

Real-World ML Pipeline Comparison

Scenario: Daily model retraining with data validation, training, evaluation, and conditional deployment.

Airflow Approach:

  • Define 6-8 separate task operators
  • Use XCom to pass data between tasks
  • Implement custom sensors for conditional logic
  • Schedule DAG with cron expression
  • View task-level logs and metrics

Dagster Approach:

  • Define 3-4 software-defined assets
  • Data flows naturally through function parameters
  • Use asset checks for validation logic
  • Schedule asset materialization
  • View asset lineage and data quality metrics

Conclusion

Both Apache Airflow and Dagster are excellent tools for orchestrating ML workflows, but they approach the problem differently. Airflow’s task-centric model offers flexibility and a mature ecosystem, making it ideal for organizations with existing infrastructure. Dagster’s asset-centric approach provides better data lineage, testing, and developer experience, making it compelling for new ML platforms.

The best choice depends on your team’s expertise, existing infrastructure, and priorities. Neither tool is inherently superior—they excel in different contexts. What matters most is having proper orchestration in place, whether through Airflow, Dagster, or another tool, rather than relying on fragile manual processes or cron jobs that inevitably break at 3 AM.
