DBT Incremental Strategy Examples

When you’re working with large datasets in dbt, full table refreshes quickly become impractical—rebuilding millions or billions of rows on every run wastes time and compute resources. Incremental models solve this by processing only new or changed data, dramatically reducing transformation time and cost. However, choosing the right incremental strategy and implementing it correctly requires understanding the nuances of each approach. dbt offers multiple incremental strategies—append, merge, delete+insert, and insert_overwrite—each optimized for different data patterns and warehouse capabilities. Getting this right means the difference between models that run in seconds versus hours, and between data pipelines that stay current versus ones that lag behind business needs.

Understanding Incremental Models: The Foundation

Before diving into specific strategies, you need to understand what makes incremental models different from standard dbt models and when they’re appropriate.

A standard dbt table model rebuilds completely on every run. The {{ this }} table is dropped and recreated from scratch based on your SQL query. This approach is simple and guarantees accuracy—there’s no risk of stale data or partial updates—but it’s prohibitively expensive for large datasets. If your source table has 100 million rows and you add 10,000 new rows daily, rebuilding all 100 million rows is wasteful.

Incremental models change this paradigm. On the first run, they build the full table just like a standard model. On subsequent runs, they process only new or changed records and merge them into the existing table. This requires answering two fundamental questions: How do we identify new or changed records? How do we merge them into the existing table?

The first question is answered through your model’s SQL logic, typically using a date filter or watermark:

{% if is_incremental() %}
  where event_timestamp > (select max(event_timestamp) from {{ this }})
{% endif %}

This conditional logic applies only during incremental runs, filtering your source data to records newer than what already exists. The is_incremental() macro returns true only when the model is materialized as incremental, the target table already exists, and the run is not a full refresh (no --full-refresh flag).

The second question—how to merge—is answered by your chosen incremental strategy. This is where the real complexity and power lies, and where different strategies shine in different scenarios.
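
Concretely, the strategy is declared in the model’s config block next to the incremental materialization; you switch strategies by changing the incremental_strategy value to append, delete+insert, or insert_overwrite. A minimal sketch (available strategy names and defaults vary by adapter, and unique_key is only used by some strategies):

{{
  config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='id'
  )
}}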

The Append Strategy: Simplest for Immutable Event Data

The append strategy is the simplest incremental approach: new records are simply added to the existing table with no updates or deletions. This strategy works perfectly for immutable event data where records never change after creation.

When to Use Append:

  • Event logs (clickstream, application logs, sensor readings)
  • Time-series data where historical values never change
  • Transaction records that are never modified after insertion
  • Any scenario where you only need to add new records, never update existing ones

How It Works: When dbt runs with append strategy, it executes your filtered query and inserts the results into the existing table. There’s no checking for duplicates, no updates, no deletions—just pure insertion. This makes append the fastest incremental strategy since it’s a simple INSERT operation.
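
Conceptually, each incremental run with append compiles down to a plain insert-select (a simplified sketch with illustrative relation names; the exact SQL varies by adapter and usually stages results in a temporary table first):

insert into analytics.events
select
    event_id,
    user_id,
    event_type,
    event_timestamp,
    event_properties
from raw.events
where event_timestamp > (select max(event_timestamp) from analytics.events)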

Implementation Example:

{{
  config(
    materialized='incremental',
    incremental_strategy='append',
    unique_key='event_id'
  )
}}

select
    event_id,
    user_id,
    event_type,
    event_timestamp,
    event_properties
from {{ source('raw', 'events') }}
{% if is_incremental() %}
    where event_timestamp > (select max(event_timestamp) from {{ this }})
{% endif %}

This model processes events incrementally by filtering to events newer than the maximum timestamp already in the table. Each run appends new events without touching existing records.

Important Considerations: The append strategy ignores unique_key; configuring it serves only as documentation. If your source data contains duplicates or your incremental logic lets the same record through more than once, you’ll get duplicates in your target table. This is fine for true event data but problematic if records can be duplicated through processing errors or overlapping incremental windows.

For protection against duplicates with append strategy, you need defensive SQL:

{% if is_incremental() %}
    where event_timestamp > (select max(event_timestamp) from {{ this }})
    and event_id not in (select event_id from {{ this }})
{% endif %}

This additional filter prevents reprocessing events that might have the same timestamp as the current maximum, but be cautious—NOT IN with subqueries can be slow on large tables. For true event streams where timestamps are unique and monotonically increasing, the simpler timestamp filter suffices.
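
On warehouses where NOT IN struggles, a correlated NOT EXISTS against the target can perform better (a sketch of the same model with the source aliased for correlation; whether it actually wins depends on your warehouse’s optimizer):

select
    event_id,
    user_id,
    event_type,
    event_timestamp,
    event_properties
from {{ source('raw', 'events') }} as src
{% if is_incremental() %}
    where src.event_timestamp > (select max(event_timestamp) from {{ this }})
      and not exists (
          select 1 from {{ this }} as tgt
          where tgt.event_id = src.event_id
      )
{% endif %}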

Incremental Strategy Selection Guide

Append: Immutable events, logs, time-series where records never change

Merge: Dimensions or facts that can be updated, SCD Type 1, upserts

Delete+Insert: Partitioned data where entire partitions refresh, large batch updates

Insert_Overwrite: BigQuery/Spark partitioned tables, cost-optimized refreshes

The Merge Strategy: Handling Updates and Late-Arriving Data

The merge strategy (also called upsert) is the most flexible and commonly used incremental approach. It handles both inserts and updates, making it appropriate for data that can change over time.

When to Use Merge:

  • Customer or product dimensions that update (address changes, attribute updates)
  • Fact tables where measures can be corrected (order amounts revised, quantities adjusted)
  • Slowly Changing Dimensions (SCD) Type 1 where you want current values only
  • Any table where records might arrive late or need corrections

How It Works: Merge strategy performs a SQL MERGE operation (or equivalent). It compares incoming records to existing records based on a unique key. If a matching key exists, the record is updated. If no match exists, the record is inserted. This dual behavior handles both new records and updates to existing records in a single operation.

Implementation Example:

{{
  config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='customer_id'
  )
}}

select
    customer_id,
    customer_email,
    customer_name,
    customer_address,
    account_status,
    updated_at
from {{ source('crm', 'customers') }}
{% if is_incremental() %}
    where updated_at > (select max(updated_at) from {{ this }})
{% endif %}

This customer dimension model uses merge strategy with customer_id as the unique key. When a customer’s information changes in the source system, the incremental run captures the update (through the updated_at filter) and merges it into the existing table. The customer’s old record is updated with new values—classic SCD Type 1 behavior.
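
Under the hood, the statement dbt compiles for this model is roughly the following (a simplified sketch with illustrative relation names: dbt actually stages the incoming rows in a temporary table, and exact syntax and quoting vary by adapter):

merge into analytics.customers as dest
using (
    -- the model's SELECT, filtered by the is_incremental() condition
    select * from analytics.customers__dbt_tmp
) as src
on dest.customer_id = src.customer_id
when matched then update set
    customer_email = src.customer_email,
    customer_name = src.customer_name,
    customer_address = src.customer_address,
    account_status = src.account_status,
    updated_at = src.updated_at
when not matched then insert
    (customer_id, customer_email, customer_name, customer_address, account_status, updated_at)
    values (src.customer_id, src.customer_email, src.customer_name,
            src.customer_address, src.account_status, src.updated_at)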

Unique Key Importance: The unique_key is critical for merge strategy—it determines how records match. It can be a single column or a list of columns for composite keys:

{{
  config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key=['order_id', 'line_item_id']
  )
}}

For composite keys, dbt generates appropriate merge logic considering all key columns. Choose your unique key carefully—it must genuinely identify records uniquely, or you’ll get unpredictable merge behavior.
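
The only change in the compiled merge is the match condition, which combines the key columns with AND (the DBT_INTERNAL_* aliases follow dbt’s default merge macro on several adapters, but treat the exact names as illustrative):

on DBT_INTERNAL_DEST.order_id = DBT_INTERNAL_SOURCE.order_id
   and DBT_INTERNAL_DEST.line_item_id = DBT_INTERNAL_SOURCE.line_item_id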

Performance Considerations: Merge is more expensive than append because it requires:

  1. Comparing incoming records to existing records (join operation)
  2. Determining which records to update vs. insert
  3. Performing both update and insert operations

For very large tables with many updates, merge can be slow. Strategies to improve performance include:

  • Ensuring the unique key is indexed in your data warehouse
  • Narrowing the incremental filter to minimize records compared (see the incremental_predicates sketch after this list)
  • Using delete+insert strategy for batch scenarios where partitions can be fully replaced
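
On several adapters you can also push a narrowing condition into the merge itself with the incremental_predicates config, which adds extra conditions to the match so the warehouse only scans recent slices of the target table. A sketch (the Snowflake-style predicate and the 7-day window are illustrative assumptions; DBT_INTERNAL_DEST is dbt’s default alias for the target in the compiled merge):

{{
  config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='customer_id',
    incremental_predicates=[
        "DBT_INTERNAL_DEST.updated_at > dateadd(day, -7, current_date)"
    ]
  )
}}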

Handling Deletions: Standard merge handles inserts and updates but not deletions. If a record is hard-deleted from your source, merge won’t remove it from your target. Because a model body compiles to a single SELECT, true synchronization needs an extra delete pass outside it, for example as a pre_hook (a sketch: the is_incremental() guard keeps the hook from running before the table exists, and scanning the full source for keys can be expensive):

{{
  config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='customer_id',
    pre_hook="
      {% if is_incremental() %}
        delete from {{ this }}
        where customer_id not in (select customer_id from {{ source('crm', 'customers') }})
      {% endif %}
    "
  )
}}

This pattern is advanced and requires careful consideration of performance and logic correctness.

The Delete+Insert Strategy: Partition-Based Updates

Delete+insert strategy combines deletion of existing records with insertion of new records, operating at a partition or segment level rather than record-by-record.

When to Use Delete+Insert:

  • Partitioned tables where you refresh entire partitions (daily/monthly partitions)
  • Batch updates where a known set of records needs complete refresh
  • Data quality corrections that affect multiple historical records
  • Scenarios where merge performance is poor due to large update volumes

How It Works: This strategy deletes records matching certain criteria (typically a partition key), then inserts new records for that same partition. Unlike merge which compares records individually, delete+insert operates on groups of records defined by partition keys.

Implementation Example:

{{
  config(
    materialized='incremental',
    incremental_strategy='delete+insert',
    unique_key='date_day'
  )
}}

select
    date_day,
    product_id,
    sum(quantity) as total_quantity,
    sum(revenue) as total_revenue,
    count(distinct customer_id) as unique_customers
from {{ ref('orders') }}
{% if is_incremental() %}
    where date_day >= current_date - 7  -- Refresh last 7 days
{% endif %}
group by date_day, product_id

This daily sales summary model uses delete+insert to refresh the last 7 days of data on every run. The strategy deletes all records with date_day values in the last 7 days, then inserts fresh calculations for those days. This is perfect for correcting late-arriving orders or ensuring aggregates reflect all adjustments.
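
In compiled form, each run boils down to a delete scoped to the incoming key values followed by an insert (a simplified sketch with illustrative relation names; dbt stages the new rows in a temporary table and, where the warehouse supports it, wraps the pair in a transaction):

delete from analytics.daily_sales
where date_day in (select distinct date_day from analytics.daily_sales__dbt_tmp);

insert into analytics.daily_sales
select * from analytics.daily_sales__dbt_tmp;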

Why Delete+Insert Over Merge: You might wonder why not just use merge for this scenario. Delete+insert offers advantages when:

  1. You’re updating many records in a partition: If 90% of a day’s records need updating, deleting and reinserting the entire day is faster than individual merges
  2. Aggregations or complex logic: When your model involves aggregations, recalculating entire partitions ensures correctness
  3. Warehouse-specific performance: Some data warehouses (Snowflake particularly) perform better with delete+insert for batch operations

Unique Key as Partition Key: The unique_key in delete+insert defines the deletion criteria. Records matching the unique key values from your incremental query are deleted before insertion. This is different from merge where unique key identifies individual records—here it identifies partitions or groups.

For date-partitioned data, unique_key='date_day' means “delete all records for the dates in my incremental query, then insert new records for those dates.”

Multiple Partition Keys: You can specify multiple partition keys if your data is partitioned on multiple dimensions:

{{
  config(
    materialized='incremental',
    incremental_strategy='delete+insert',
    unique_key=['date_day', 'region']
  )
}}

This configuration deletes all records matching the (date_day, region) combinations in your incremental query before inserting new records.

The Insert_Overwrite Strategy: BigQuery and Spark Optimization

Insert_overwrite is a specialized strategy primarily for BigQuery and Spark that leverages native partition overwrite capabilities for maximum efficiency and cost savings.

When to Use Insert_Overwrite:

  • Partitioned BigQuery tables where partition pruning can apply
  • Spark/Databricks Delta tables with partition overwrite support
  • Date-partitioned tables where you want atomic partition replacement
  • Cost-sensitive environments where BigQuery query costs matter

How It Works: Instead of explicitly deleting and inserting, insert_overwrite leans on the warehouse’s native partition replacement. On BigQuery, dbt compiles statements that atomically replace the contents of the affected partitions. On Spark, it compiles an INSERT OVERWRITE statement with dynamic partition overwrite mode.
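
For intuition, on Spark the compiled statement resembles the following (a conceptual sketch with illustrative names; dynamic partition overwrite mode also needs to be in effect, i.e. spark.sql.sources.partitionOverwriteMode set to DYNAMIC):

insert overwrite table analytics.daily_events
partition (event_date)
-- with dynamic partition overwrite, the partition column is listed last in the select
select user_id, event_type, event_count, revenue, event_date
from daily_events__dbt_tmp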

BigQuery Implementation Example:

{{
  config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={
      "field": "event_date",
      "data_type": "date",
      "granularity": "day"
    }
  )
}}

select
    event_date,
    user_id,
    event_type,
    event_count,
    revenue
from {{ ref('events_raw') }}
{% if is_incremental() %}
    where event_date >= date_sub(current_date(), interval 3 day)  -- Process last 3 days
{% endif %}

This model processes the last 3 days of event data on each run. With insert_overwrite strategy, BigQuery completely replaces those 3 date partitions atomically. Old data for those dates is removed, new data is written—but this happens as a single atomic operation, not separate delete and insert steps.

Key Advantages:

  1. Atomic Operations: Partition replacement is atomic—readers never see partial or inconsistent data during the update
  2. Cost Efficiency: In BigQuery, insert_overwrite only scans the partitions being overwritten, potentially reducing query costs versus full table scans
  3. Performance: Native partition operations are typically faster than application-level delete+insert
  4. Simplicity: The data warehouse handles the deletion logic—you just specify which partitions to overwrite

Partition Configuration: The partition_by configuration is crucial for insert_overwrite. It tells dbt how your table is partitioned, which dbt uses to generate appropriate partition overwrite syntax:

partition_by={
  "field": "event_date",
  "data_type": "date",
  "granularity": "day",
  "time_ingestion_partitioning": false
}

For BigQuery, granularity can be “day”, “hour”, “month”, or “year”. The partition field must be a date, timestamp, or integer column.

Limitations and Considerations: Insert_overwrite only works with partitioned tables. If your table isn’t partitioned, this strategy isn’t available. It’s also adapter-specific: Snowflake and Redshift don’t support insert_overwrite, so on those warehouses you’d use delete+insert to get a similar partition-level refresh.

Common Pitfalls and Solutions

  • Duplicate Records with Append: Add defensive filters or use merge if updates are possible
  • Missing Deletions with Merge: Merge doesn’t handle deletions—implement custom delete logic if needed
  • Slow Incremental Filter: Ensure your filter column (often a timestamp) is indexed
  • Overlapping Incremental Windows: Use exclusive rather than inclusive boundaries in filters
  • Wrong Strategy Choice: Immutable event data doesn’t need the overhead of merge, and mutable dimensions can’t stay correct with append
  • Missing is_incremental() Check: Without the conditional wrapper, the very first build fails because {{ this }} doesn’t exist yet, and a full refresh filters against stale data (see the sketch after this list)
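
The last pitfall is worth seeing concretely, because the failure only appears on the first build or a full refresh:

-- Anti-pattern: with no is_incremental() guard, the very first build fails because
-- {{ this }} does not exist yet, and a --full-refresh silently keeps only rows newer
-- than the old table's maximum timestamp
where event_timestamp > (select max(event_timestamp) from {{ this }})

-- Correct: apply the filter only on true incremental runs
{% if is_incremental() %}
    where event_timestamp > (select max(event_timestamp) from {{ this }})
{% endif %}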

Advanced Patterns: Combining Strategies and Custom Logic

Real-world scenarios often require more sophisticated approaches than pure strategy implementations. Let’s explore advanced patterns.

Hybrid Incremental with Lookback Windows: Sometimes you need to reprocess recent data to catch late-arriving updates while using efficient incremental logic for older data:

{{
  config(
    materialized='incremental',
    incremental_strategy='delete+insert',
    unique_key='date_day'
  )
}}

select
    date_day,
    metric_value
from {{ source('raw', 'metrics') }}
{% if is_incremental() %}
    -- Always reprocess last 7 days to catch late data
    where date_day >= current_date - 7
{% endif %}

This pattern uses delete+insert to refresh a rolling 7-day window, ensuring late-arriving data is incorporated. Older data remains untouched, providing efficiency while maintaining accuracy for recent data.

Conditional Strategy Selection: Different environments might require different strategies. Use Jinja to select strategy based on target:

{{
  config(
    materialized='incremental',
    incremental_strategy='insert_overwrite' if target.type == 'bigquery' else 'delete+insert',
    unique_key='date_day'
  )
}}

This configuration uses insert_overwrite on BigQuery for cost optimization but falls back to delete+insert on other warehouses.

Incremental with Data Quality Checks: Incorporate data quality validation into incremental logic:

{% if is_incremental() %}
    where event_timestamp > (select max(event_timestamp) from {{ this }})
    and event_timestamp <= current_timestamp()  -- Prevent future timestamps
    and user_id is not null  -- Data quality filter
    and event_type in ('click', 'view', 'purchase')  -- Valid events only
{% endif %}

These filters ensure only valid data enters your incremental model, maintaining data quality over time.

Backfilling Strategies: When you need to backfill historical data in an incremental model, use dbt’s --full-refresh flag to rebuild from scratch, or implement date range parameters:

{% if is_incremental() %}
    {% if var('start_date', none) and var('end_date', none) %}
        -- Custom date range for backfill
        where date_day >= '{{ var("start_date") }}'
          and date_day <= '{{ var("end_date") }}'
    {% else %}
        -- Normal incremental logic
        where date_day > (select max(date_day) from {{ this }})
    {% endif %}
{% endif %}

Run with dbt run --vars '{start_date: "2023-01-01", end_date: "2023-12-31"}' to backfill specific date ranges.

Optimizing Incremental Performance

Even with the right strategy, incremental models can be slow without proper optimization. Key techniques to improve performance:

Materialized Source Tables: If your incremental filter queries large raw tables, materialize filtered subsets first:

-- models/staging_events.sql: an incremental model that keeps only recent raw
-- events, so downstream models never scan the full raw table
select * from {{ source('raw', 'events') }}
{% if is_incremental() %}
    where event_timestamp >= current_date - 7  -- last 7 days only
{% endif %}

-- models/final_events.sql: selects from {{ ref('staging_events') }} instead of the
-- raw source, so its own incremental logic queries a pre-filtered, much smaller table

Partitioning and Clustering: Configure table partitioning and clustering to accelerate incremental filters and merges:

{{
  config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='id',
    partition_by={
      "field": "created_date",
      "data_type": "date"
    },
    cluster_by=['user_id', 'product_id']
  )
}}

Partitioning on the incremental filter field (often a date) ensures queries only scan relevant partitions. Clustering on merge key columns speeds up the merge operation.

Incremental Filter Optimization: Avoid expensive subqueries in incremental filters when possible. Instead of:

where timestamp > (select max(timestamp) from {{ this }})

Consider maintaining a metadata table with watermarks:

where timestamp > (select max_timestamp from {{ ref('watermarks') }} where table_name = 'my_table')

This pattern is faster when {{ this }} is very large, though it adds complexity.
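
How the watermark table is maintained is the tricky part: if the watermarks model selected from this incremental model via ref(), you would create a dependency cycle. One sketch is to maintain the watermark from the incremental model itself with post_hooks, writing to a pre-created table referenced by name rather than ref() (the analytics.watermarks table and its table_name/max_timestamp columns are assumptions for illustration):

{{
  config(
    materialized='incremental',
    incremental_strategy='append',
    post_hook=[
        "delete from analytics.watermarks where table_name = 'my_table'",
        "insert into analytics.watermarks select 'my_table', max(event_timestamp) from {{ this }}"
    ]
  )
}}

select event_id, event_timestamp
from {{ source('raw', 'events') }}
{% if is_incremental() %}
    -- Reads a single-row watermark instead of scanning the full target table
    where event_timestamp > (
        select max_timestamp from analytics.watermarks where table_name = 'my_table'
    )
{% endif %}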

Batch Size Tuning: For incremental models processing many days of data, tune batch size to balance frequency with efficiency:

-- Process last 1 day (frequent, small batches)
where date_day = current_date - 1

-- vs

-- Process last 7 days (less frequent, larger batches)
where date_day >= current_date - 7

Smaller batches provide fresher data but run more frequently. Larger batches reduce overhead but increase latency.

Testing and Validating Incremental Models

Incremental models are harder to test than full-refresh models because their behavior differs between first run and subsequent runs. Comprehensive testing ensures correctness.

Schema Tests: Use dbt’s built-in schema tests to validate incremental model outputs:

models:
  - name: incremental_events
    columns:
      - name: event_id
        tests:
          - unique
          - not_null
      - name: event_timestamp
        tests:
          - not_null

Run these tests after incremental runs to catch duplicates or null values that might indicate incorrect strategy choice or faulty logic.

Custom Data Tests: Create custom tests to verify incremental behavior:

-- tests/assert_no_duplicates_in_incremental_window.sql
select
    event_id,
    count(*) as count
from {{ ref('incremental_events') }}
where event_timestamp >= current_timestamp - interval '7 days'
group by event_id
having count(*) > 1

This test fails if the last 7 days of data contain duplicate event_ids, which would indicate that the append strategy is letting the same records through more than once.

Full Refresh Comparison: Periodically run full refreshes and compare results to incremental runs to verify correctness:

-- Compare row counts
select count(*) from incremental_model;  -- Incremental version
select count(*) from incremental_model_full_refresh;  -- Full refresh version

-- Compare aggregates
select sum(revenue) from incremental_model where date_day = '2024-01-15';
select sum(revenue) from incremental_model_full_refresh where date_day = '2024-01-15';

Discrepancies indicate incremental logic issues—perhaps late data isn’t being reprocessed, or updates aren’t being merged correctly.

Conclusion

Choosing and implementing the right incremental strategy is fundamental to building scalable, performant dbt pipelines. Append strategy offers maximum efficiency for immutable event data, merge provides flexibility for dimensions and facts that update, delete+insert optimizes partition-level refreshes, and insert_overwrite leverages warehouse-native capabilities for cost and performance. Understanding when each strategy applies and how to implement it correctly—including proper unique key configuration, defensive filtering, and performance optimization—separates reliable data pipelines from fragile ones that break under scale or fail to maintain data quality.

Mastering incremental strategies requires both conceptual understanding and practical experience. Start simple with append for event data and merge for dimensions, then graduate to advanced patterns as needs arise. Test thoroughly, particularly around edge cases like late-arriving data and duplicate handling. Monitor performance and optimize incrementally based on actual bottlenecks. With these skills, you’ll build dbt pipelines that scale to billions of rows while maintaining freshness and correctness—the hallmark of production-grade analytics engineering.
