Best Practices for Joining Large Fact Tables for ML Training Sets

Creating machine learning training datasets from production data warehouses is a deceptively complex challenge. While the conceptual task seems straightforward—join relevant tables to create a wide feature matrix—the reality involves navigating massive fact tables with billions of rows, managing complex join conditions that create fan-outs, balancing computational resources, and ensuring temporal consistency that prevents label leakage. A poorly designed join strategy doesn’t just waste compute resources; it can silently corrupt your training data through accidental label leakage, create memory-related failures during model training, or produce features that don’t match what you’ll have at prediction time.

This article explores battle-tested practices for joining large fact tables to create ML training sets. These aren’t theoretical optimizations but practical strategies that address the real pain points data scientists and ML engineers face when working with production-scale data warehouses. The focus is on patterns that maintain correctness while achieving acceptable performance, recognizing that in ML feature engineering, correctness—preventing label leakage, maintaining point-in-time consistency, and creating features that match production—matters more than raw query speed.

Understanding the Fact Table Join Challenge

Before diving into solutions, we need to understand what makes fact table joins for ML uniquely challenging compared to typical analytical queries.

The Scale Problem:

Fact tables in production data warehouses routinely contain billions of rows. A retail transaction table might have 100 million rows per day. An ad impression table might have billions per day. Customer behavior event tables grow continuously as users interact with products. When joining multiple such tables to create a training dataset with hundreds of millions of examples, you’re dealing with queries that process terabytes of data.

Unlike typical business intelligence queries that aggregate data (group by, sum, count), ML training set construction often requires row-level joins to create per-example features. You can’t aggregate away the problem—each training example might need its own unique set of features derived from potentially millions of related fact table rows.

The Fan-Out Problem:

Fact table joins naturally create fan-outs where one row in the base table (your training examples) joins to many rows in feature tables. Consider building a customer churn prediction model where each training example is one customer in one month. Joining to the transactions table brings in potentially dozens or hundreds of transactions per customer. Joining to the customer service interactions table adds more rows. Each join multiplies the row count.

This fan-out is intentional—you need those transaction details to create features. But naive joins create exploding intermediate results. A 10-million customer training set joining to tables that average 50 related rows each creates a 500-million row intermediate result. Multiple such joins compound the problem, quickly exceeding memory limits or creating performance disasters.

The Temporal Complexity Problem:

ML models predict future outcomes based on past features. This temporal relationship creates strict requirements: features must represent only information available at prediction time, excluding any data from the future that would cause label leakage. For time-series forecasting or churn prediction, this means point-in-time joins where features are calculated using only historical data up to a specific timestamp.

Implementing point-in-time correctness is straightforward conceptually but complex in practice. You need to filter every joined table by timestamp, handle varying event granularities (daily aggregates, hourly events, real-time updates), and ensure consistency across tables with different update cadences. Mistakes here are silent—the query runs successfully but produces a model that performs unrealistically well in training and fails in production.
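
The core mechanic is simple to state: every feature-side table is filtered against each example's own reference date, never against a global "today". A minimal sketch of that join condition, with hypothetical table and column names (ex stands for the table of training examples):

-- The essence of a point-in-time join: history only, relative to each example
LEFT JOIN events e
    ON e.customer_id = ex.customer_id
    AND e.event_time < ex.reference_date  -- strictly before the prediction point; no future data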

The Query Optimizer Challenge:

Data warehouse query optimizers are tuned for analytical workloads—filtering, grouping, and aggregating. They struggle with ML workload patterns involving complex multi-way joins, large intermediate results, and aggregations at the individual example level rather than summary statistics.

Optimizers might choose poor join orders, fail to push down filters effectively, or produce execution plans that spill to disk excessively. Unlike traditional analytics where you can manually tune a query until performance is acceptable, ML feature engineering queries are often generated programmatically, making manual optimization infeasible.

⚠️ Common Pitfalls to Avoid

Label Leakage: Accidentally including future information in features
Fan-Out Explosion: Unconstrained joins creating billions of intermediate rows
Memory Overflow: Intermediate results exceeding available memory
Cartesian Products: Missing join conditions creating exponential results
Inconsistent Granularity: Mismatched join keys causing duplicates/drops
Non-Deterministic Aggregations: Order-dependent results from implicit grouping

Strategy 1: Build Spine Table First

The foundation of successful fact table joins for ML is creating a spine table—a primary table containing one row per training example with the label and all identifying keys needed for joins.

What a Spine Table Contains:

The spine table defines your training examples explicitly. For a churn prediction model, each row represents one customer-month with:

  • Customer ID (primary key for joining)
  • Reference date (point-in-time for feature computation)
  • Label (did customer churn in next month?)
  • Any stratification fields (customer segment, acquisition channel)

This table is small—maybe millions of rows even for large training sets—because it contains only keys and labels, no features. Its size is bounded by your training example count, not by feature complexity.

Why This Matters:

Starting with the spine table allows you to control join fan-out from the beginning. Every subsequent join explicitly joins to this base table rather than joining fact tables to each other first. This prevents the common mistake of joining two large fact tables together and then filtering to relevant training examples, creating massive intermediate results that mostly get discarded.

The spine table also makes temporal correctness explicit. By including the reference date, you force every feature join to respect point-in-time constraints. Features must be calculated using only data from before the reference date.

Implementation Pattern:

-- Create spine table with training examples
CREATE TABLE ml.churn_training_spine AS
SELECT 
    customer_id,
    DATE_TRUNC('month', observation_date) as reference_month,
    observation_date,
    CASE LEAD(has_active_subscription, 1) OVER (
             PARTITION BY customer_id 
             ORDER BY observation_date
         )
        WHEN FALSE THEN 1  -- subscription gone next month = churned
        WHEN TRUE  THEN 0
        ELSE NULL          -- final snapshot has no "next month"; filter these rows out downstream
    END as label_churned_next_month
FROM customer_snapshots
WHERE observation_date BETWEEN '2023-01-01' AND '2024-01-01'
    AND observation_date = DATE_TRUNC('month', observation_date); -- monthly snapshots

-- Add indexes for join performance (engines without secondary indexes,
-- such as Snowflake or BigQuery, rely on clustering/partitioning instead)
CREATE INDEX idx_spine_customer ON ml.churn_training_spine(customer_id);
CREATE INDEX idx_spine_date ON ml.churn_training_spine(observation_date);

This spine table has ~10M rows (customers × months). All subsequent feature engineering joins to this table, keeping the data size manageable.

Strategy 2: Pre-Aggregate Feature Tables

Rather than joining raw fact tables and aggregating during feature computation, pre-aggregate feature tables at the desired granularity before joining to the spine.

The Pre-Aggregation Pattern:

Instead of joining the spine table to the raw transactions table (billions of rows) and then computing sum(amount) grouped by customer, create an intermediate table that already contains per-customer transaction aggregates. This pre-aggregated table is much smaller and joins more efficiently.

The pattern involves two queries:

  1. Aggregate the fact table to the join key granularity
  2. Join the pre-aggregated result to the spine table

Example: Transaction Features:

-- Step 1: Pre-aggregate transactions by customer and month
CREATE TABLE ml.customer_monthly_transaction_features AS
SELECT 
    customer_id,
    DATE_TRUNC('month', transaction_date) as transaction_month,
    COUNT(*) as transaction_count,
    SUM(amount) as total_amount,
    AVG(amount) as avg_amount,
    MAX(amount) as max_amount,
    COUNT(DISTINCT product_category) as unique_categories
FROM transactions
WHERE transaction_date >= '2022-01-01'  -- sufficient lookback
GROUP BY customer_id, DATE_TRUNC('month', transaction_date);

-- Add index for efficient joining
CREATE INDEX idx_txn_features_key 
ON ml.customer_monthly_transaction_features(customer_id, transaction_month);

-- Step 2: Join to spine with point-in-time logic
-- Monthly aggregates yield "previous calendar month" features; name them accordingly
-- (true "last 30 days" features require daily granularity, covered in Strategy 3)
SELECT 
    spine.*,
    COALESCE(txn.transaction_count, 0) as txn_count_prev_month,
    COALESCE(txn.total_amount, 0) as txn_amount_prev_month
FROM ml.churn_training_spine spine
LEFT JOIN ml.customer_monthly_transaction_features txn
    ON spine.customer_id = txn.customer_id
    AND txn.transaction_month = DATE_TRUNC('month', 
        spine.observation_date - INTERVAL '1 month')

This approach reduces the transaction join from potentially billions of rows to millions of pre-aggregated rows, dramatically improving performance and reducing memory requirements.

Choosing Aggregation Granularity:

The pre-aggregation granularity must be at least as fine as your spine table’s granularity. If your training examples are customer-months, aggregate to customer-months or finer (customer-days). Never aggregate to a coarser granularity (customer-quarters), because you lose the ability to do point-in-time joins correctly.

For features requiring multiple time windows (last 7 days, last 30 days, last 90 days), create separate pre-aggregated tables for each window or use window functions to calculate rolling aggregates. Don’t try to compute multiple time windows in a single join—it creates complex logic prone to errors.
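
As a sketch of the separate-table-per-window option, assume two hypothetical pre-aggregated tables, each keyed on (customer_id, feature_date) and holding aggregates for the window ending the day before feature_date; the spine then picks them up with two one-to-one joins:

-- Sketch: one pre-aggregated table per lookback window (table names are assumptions)
SELECT
    spine.customer_id,
    spine.observation_date,
    COALESCE(f7.txn_count, 0)  as txn_count_last_7d,
    COALESCE(f30.txn_count, 0) as txn_count_last_30d
FROM ml.churn_training_spine spine
LEFT JOIN ml.customer_7d_txn_features f7
    ON f7.customer_id = spine.customer_id
    AND f7.feature_date = spine.observation_date
LEFT JOIN ml.customer_30d_txn_features f30
    ON f30.customer_id = spine.customer_id
    AND f30.feature_date = spine.observation_date;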

Strategy 3: Use Windowed Aggregations for Time-Series Features

For time-series features requiring aggregations over rolling windows, window functions provide cleaner, more maintainable solutions than self-joins.

The Self-Join Problem:

A common anti-pattern for time-series features is joining a table to itself with date range conditions:

-- Anti-pattern: self-join for rolling window
SELECT 
    t1.customer_id,
    t1.date,
    COUNT(t2.transaction_id) as txn_count_last_30d
FROM customer_dates t1
LEFT JOIN transactions t2
    ON t1.customer_id = t2.customer_id
    AND t2.transaction_date BETWEEN t1.date - INTERVAL '30 days' AND t1.date - INTERVAL '1 day'
GROUP BY t1.customer_id, t1.date

This creates a massive fan-out: each base row joins to potentially hundreds of transaction rows, and the full intermediate result must be materialized before the GROUP BY collapses it. For 10M training examples with an average of 50 transactions in the last 30 days, you’re processing 500M rows.

Window Function Solution:

Window functions calculate aggregations without fan-out:

-- Better: window functions for rolling aggregations
WITH daily_transactions AS (
    SELECT 
        customer_id,
        DATE_TRUNC('day', transaction_date) as day,
        COUNT(*) as txn_count,
        SUM(amount) as txn_amount
    FROM transactions
    WHERE transaction_date >= '2022-01-01'
    GROUP BY customer_id, DATE_TRUNC('day', transaction_date)
),
rolling_features AS (
    SELECT 
        customer_id,
        day,
        -- NOTE: ROWS frames count rows, not calendar days; this is only a true 30-day
        -- window if every customer has one row per day. Densify against a calendar/date
        -- spine (or use a RANGE frame where supported) when there are gaps.
        SUM(txn_count) OVER (
            PARTITION BY customer_id 
            ORDER BY day 
            ROWS BETWEEN 29 PRECEDING AND 1 PRECEDING
        ) as txn_count_last_30d,
        AVG(txn_amount) OVER (
            PARTITION BY customer_id 
            ORDER BY day 
            ROWS BETWEEN 29 PRECEDING AND 1 PRECEDING
        ) as avg_daily_txn_amount_last_30d  -- average of daily totals, not of individual transactions
    FROM daily_transactions
)
SELECT 
    spine.*,
    COALESCE(rf.txn_count_last_30d, 0) as txn_count_last_30d,
    COALESCE(rf.avg_daily_txn_amount_last_30d, 0) as avg_daily_txn_amount_last_30d
FROM ml.churn_training_spine spine
LEFT JOIN rolling_features rf
    ON spine.customer_id = rf.customer_id
    -- NOTE: this matches only when the customer has a daily_transactions row on the
    -- observation date; otherwise features fall back to 0 even if there was activity
    -- earlier in the window. An as-of join to the latest prior day closes that gap.
    AND spine.observation_date = rf.day

This approach first aggregates to daily granularity (reducing billions to millions of rows), then uses window functions to compute rolling aggregates. The join to the spine table involves much smaller intermediate results.

Performance Considerations:

Window functions have their own performance characteristics. They require partitioning and sorting data, which can be expensive. For very wide windows (365 days of history) on high-cardinality partition keys (millions of customers), window functions might still struggle.

In such cases, consider pre-computing rolling features at regular intervals (daily) and materializing them in a table. Then join to the most recent pre-computed value before each spine table date. This trades computation time (pre-compute daily) for query time (fast joins to pre-computed values).
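
A minimal sketch of that join pattern, assuming a hypothetical ml.daily_rolling_features table (with feature_date and pre-computed rolling columns) materialized by a daily job, and a warehouse that supports QUALIFY; Snowflake-style syntax shown:

-- As-of join: for each training example, keep the most recent pre-computed row
SELECT
    spine.customer_id,
    spine.observation_date,
    rf.txn_count_last_30d
FROM ml.churn_training_spine spine
LEFT JOIN ml.daily_rolling_features rf
    ON rf.customer_id = spine.customer_id
    AND rf.feature_date < spine.observation_date                       -- strictly before the reference date
    AND rf.feature_date >= spine.observation_date - INTERVAL '7 days'  -- bound the scan to recent pre-computed rows
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY spine.customer_id, spine.observation_date
    ORDER BY rf.feature_date DESC                                      -- keep only the latest matching row per example
) = 1;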

Strategy 4: Partition and Filter Aggressively

Partitioning strategies dramatically affect query performance when working with multi-terabyte fact tables.

Table Partitioning Design:

Modern data warehouses all offer some way to physically organize tables around specific columns, typically date/time fields: BigQuery supports explicit partitioned tables, Snowflake uses automatic micro-partitioning guided by clustering keys, and Redshift relies on sort keys. A properly organized fact table lets the query engine skip reading irrelevant partitions or blocks entirely.

For fact tables used in ML feature engineering, partition by the event timestamp or transaction date. This aligns with the temporal filtering inherent in point-in-time feature computation—you’re always filtering to dates before the reference date.

Example Partitioning Strategy:

-- In Snowflake (no explicit partitioning; a clustering key guides micro-partition pruning)
CREATE TABLE transactions (
    transaction_id BIGINT,
    customer_id BIGINT,
    transaction_date DATE,
    amount DECIMAL(10,2)
    -- other columns
)
CLUSTER BY (DATE_TRUNC('month', transaction_date));

-- In BigQuery (explicit monthly partitioning on the date column)
CREATE TABLE transactions (
    transaction_id INT64,
    customer_id INT64,
    transaction_date DATE,
    amount NUMERIC
    -- other columns
)
PARTITION BY DATE_TRUNC(transaction_date, MONTH);

Monthly partitioning works well for tables where you typically query 1-12 months of history. Daily partitioning suits tables where queries need only days or weeks of history.

Filter Pushdown Optimization:

After partitioning, ensure your queries include partition key filters that the optimizer can push down. When joining fact tables to the spine table, include explicit date filters:

-- Good: explicit date filter enables partition pruning
SELECT 
    spine.customer_id,
    COUNT(t.transaction_id) as txn_count
FROM ml.churn_training_spine spine
LEFT JOIN transactions t
    ON spine.customer_id = t.customer_id
    AND t.transaction_date BETWEEN 
        spine.observation_date - INTERVAL '90 days' 
        AND spine.observation_date - INTERVAL '1 day'
WHERE spine.observation_date BETWEEN '2023-01-01' AND '2024-01-01'
GROUP BY spine.customer_id

The filter on t.transaction_date allows the optimizer to read only relevant partitions. Without this filter, even with partitioned tables, the optimizer might scan everything.

Clustering and Sorting:

Beyond partitioning, set clustering keys (Snowflake) or sort keys (Redshift) on columns used in join conditions. For customer transaction tables, cluster by customer_id after partitioning by date. This physically co-locates rows for the same customer, improving join performance.
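
The exact syntax is warehouse-specific; a sketch of the two most common forms, with the key choices as assumptions to adapt to your own schema:

-- Snowflake: clustering key guides micro-partition pruning
ALTER TABLE transactions CLUSTER BY (transaction_date, customer_id);

-- Redshift: distribution and sort keys are declared at table creation
CREATE TABLE transactions (
    transaction_id BIGINT,
    customer_id BIGINT,
    transaction_date DATE,
    amount DECIMAL(10,2)
)
DISTKEY (customer_id)
SORTKEY (transaction_date, customer_id);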

🚀 Performance Optimization Hierarchy

Impact Order (highest to lowest):

1. Spine table approach (10-100x improvement)
2. Pre-aggregation before joins (5-50x improvement)
3. Partition pruning through date filters (3-10x improvement)
4. Window functions vs self-joins (2-10x improvement)
5. Appropriate join types (LEFT vs INNER) (1.5-3x improvement)
6. Clustering/sorting on join keys (1.2-2x improvement)
7. Query hints and manual optimization (1.1-1.5x improvement)

Focus on top items first—they provide the most impact

Strategy 5: Handle Fan-Out with Appropriate Aggregations

When fan-out is unavoidable—you genuinely need to join one training example to many fact rows—explicit aggregation strategies prevent data corruption.

The Implicit Duplication Problem:

Consider joining customers to multiple feature tables where each creates fan-out:

-- Problematic: multiple fan-outs create duplicates
SELECT 
    c.customer_id,
    c.label,
    t.transaction_id,
    s.support_ticket_id
FROM customers c
LEFT JOIN transactions t ON c.customer_id = t.customer_id
LEFT JOIN support_tickets s ON c.customer_id = s.customer_id

If a customer has 10 transactions and 3 support tickets, this creates 30 rows for that customer (10 × 3). The label gets duplicated 30 times. If you later aggregate features without careful grouping, you might inadvertently sum the label 30 times, corrupting your training data.

Solution: Aggregate Before Joining:

Each feature table should be fully aggregated to the join key granularity before joining:

-- Correct: aggregate each feature table independently
WITH transaction_features AS (
    SELECT 
        customer_id,
        COUNT(*) as txn_count,
        SUM(amount) as txn_total
    FROM transactions
    WHERE transaction_date >= '2023-01-01'
    GROUP BY customer_id
),
support_features AS (
    SELECT 
        customer_id,
        COUNT(*) as ticket_count,
        AVG(resolution_hours) as avg_resolution_time
    FROM support_tickets
    WHERE created_date >= '2023-01-01'
    GROUP BY customer_id
)
SELECT 
    c.customer_id,
    c.label,
    COALESCE(tf.txn_count, 0) as txn_count,
    COALESCE(tf.txn_total, 0) as txn_total,
    COALESCE(sf.ticket_count, 0) as ticket_count,
    COALESCE(sf.avg_resolution_time, 0) as avg_resolution_time
FROM customers c
LEFT JOIN transaction_features tf ON c.customer_id = tf.customer_id
LEFT JOIN support_features sf ON c.customer_id = sf.customer_id

Now each join is one-to-one (customer to their transaction features, customer to their support features). No fan-out, no duplication, no risk of corrupting labels or other base table data.

Array Aggregations for Detail Preservation:

Sometimes you need to preserve detail beyond simple aggregates. Modern SQL dialects support array aggregations that collect multiple rows into arrays:

-- BigQuery-style syntax: ARRAY_AGG(... ORDER BY ... LIMIT n) keeps only the n most
-- recent values; other warehouses need ARRAY_SLICE or a ROW_NUMBER filter instead
WITH transaction_arrays AS (
    SELECT 
        customer_id,
        ARRAY_AGG(amount ORDER BY transaction_date DESC LIMIT 10) 
            as last_10_amounts,
        ARRAY_AGG(product_category ORDER BY transaction_date DESC LIMIT 10) 
            as last_10_categories
    FROM transactions
    WHERE transaction_date >= '2023-01-01'
    GROUP BY customer_id
)
SELECT * FROM transaction_arrays;

This creates feature columns containing arrays of the last 10 transaction amounts and categories. You can then extract features from these arrays (mean, std dev, mode) downstream in Python/R during feature engineering, while still avoiding fan-out in the SQL joins.
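
If you prefer to keep that extraction in SQL, the arrays can be expanded per customer without re-introducing join fan-out. A sketch in BigQuery-style syntax, assuming the transaction_arrays result above has been materialized as a table (UNNEST, ARRAY_LENGTH, and the aggregate names are dialect-specific):

-- Derive summary features from the array columns, one output row per customer
SELECT
    customer_id,
    (SELECT AVG(x) FROM UNNEST(last_10_amounts) AS x)    as avg_of_last_10_amounts,
    (SELECT STDDEV(x) FROM UNNEST(last_10_amounts) AS x) as stddev_of_last_10_amounts,
    ARRAY_LENGTH(last_10_amounts)                        as recent_txn_count
FROM transaction_arrays;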

Strategy 6: Incremental Processing for Large-Scale Training Sets

For training sets requiring hundreds of millions of examples, single-query approaches hit practical limits. Incremental processing strategies become necessary.

Date-Based Batching:

Process the spine table in date-based chunks, creating features for each chunk independently, then union results:

-- Process January 2023
CREATE TABLE ml.features_202301 AS
SELECT spine.*, [features]
FROM ml.churn_training_spine spine
[feature joins]
WHERE spine.observation_date BETWEEN '2023-01-01' AND '2023-01-31';

-- Process February 2023
CREATE TABLE ml.features_202302 AS
SELECT spine.*, [features] 
FROM ml.churn_training_spine spine
[feature joins]
WHERE spine.observation_date BETWEEN '2023-02-01' AND '2023-02-28'; 

-- ... continue for all months 

-- Union all chunks 
CREATE TABLE ml.features_full AS 
SELECT * FROM ml.features_202301 
UNION ALL 
SELECT * FROM ml.features_202302 
UNION ALL [...];

This approach has several advantages:

  • Each query processes smaller data volumes, staying within memory/time limits
  • Failed chunks can be retried without reprocessing everything
  • Parallel processing of chunks reduces total wall-clock time
  • Intermediate results can be validated before combining
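
A common variation, sketched below with assumed parameter names supplied by the orchestrator, writes every chunk into one pre-created table so no final UNION is needed and a failed month can simply be deleted and re-inserted:

-- Run once per month with :month_start / :month_end bound by the orchestrator
DELETE FROM ml.features_full
WHERE observation_date BETWEEN :month_start AND :month_end;  -- makes re-runs idempotent

INSERT INTO ml.features_full
SELECT spine.*, [features]
FROM ml.churn_training_spine spine
[feature joins]
WHERE spine.observation_date BETWEEN :month_start AND :month_end;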

Customer-Based Batching:

Alternatively, batch by customer ID ranges:

-- Process customers 0-999,999
CREATE TABLE ml.features_batch_01 AS
SELECT spine.*, [features]
FROM ml.churn_training_spine spine
[feature joins]
WHERE spine.customer_id BETWEEN 0 AND 999999;

-- Process customers 1,000,000-1,999,999
CREATE TABLE ml.features_batch_02 AS
-- ...

This works well when feature computation is more intensive per customer than per date (many historical periods of data per customer).
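
When customer IDs are sparse or unevenly distributed, contiguous ID ranges produce lopsided batches. Hashing the key into a fixed number of buckets is a common alternative; a sketch in Snowflake-style syntax, with the bucket count and parameter name as assumptions:

-- 16 roughly equal batches; run with :batch_number = 0 .. 15
WHERE MOD(ABS(HASH(spine.customer_id)), 16) = :batch_number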

Orchestration Considerations:

Incremental processing requires orchestration—Airflow, Prefect, or custom scripts—to manage batch creation, monitor failures, and combine results. Include retry logic, failure alerts, and progress tracking. For production ML pipelines, this orchestration infrastructure is essential anyway, so the incremental approach aligns with operational best practices.

Strategy 7: Validate Temporal Consistency

Point-in-time correctness is critical but easy to violate accidentally. Systematic validation catches errors before they corrupt models.

Validation Checks:

After creating training features, run these validations:

No future data leakage:

-- Check: no features should be from after the reference date
SELECT 
    customer_id,
    observation_date,
    last_transaction_date
FROM ml.features_full
WHERE last_transaction_date > observation_date;

-- Should return zero rows

Consistency across examples:

-- Check: same customer at different dates should show progression
SELECT 
    customer_id,
    observation_date,
    total_lifetime_value,
    LAG(total_lifetime_value) OVER (
        PARTITION BY customer_id 
        ORDER BY observation_date
    ) as previous_ltv
FROM ml.features_full
WHERE customer_id IN (
    SELECT customer_id
    FROM ml.features_full
    GROUP BY customer_id
    HAVING COUNT(*) > 1
)
ORDER BY customer_id, observation_date;

-- Verify LTV never decreases (assuming it's cumulative)

Feature distribution stability:

-- Check: feature distributions shouldn't shift dramatically by date
SELECT 
    DATE_TRUNC('month', observation_date) as month,
    AVG(transaction_count) as avg_txn_count,
    STDDEV(transaction_count) as stddev_txn_count,
    MIN(transaction_count) as min_txn_count,
    MAX(transaction_count) as max_txn_count
FROM ml.features_full
GROUP BY DATE_TRUNC('month', observation_date)
ORDER BY month;

-- Look for unexplained discontinuities

These checks catch common errors: incorrect join conditions that allow future data, calculation bugs that produce nonsensical values, and data quality issues from source tables.
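
One more check worth automating, sketched against the tables used above: the final feature table should contain exactly one row per spine key, proving that no join duplicated or dropped training examples.

-- Check: one output row per training example
SELECT customer_id, observation_date, COUNT(*) as row_count
FROM ml.features_full
GROUP BY customer_id, observation_date
HAVING COUNT(*) > 1;

-- Should return zero rows; also compare total counts against the spine
SELECT
    (SELECT COUNT(*) FROM ml.churn_training_spine) as spine_rows,
    (SELECT COUNT(*) FROM ml.features_full)        as feature_rows;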

Handling Common Edge Cases

Several edge cases recur frequently in production feature engineering pipelines.

Customers with No Activity:

Customers might have no transactions, no support tickets, or no events in your fact tables. Handle this explicitly:

-- Use LEFT JOIN and COALESCE for missing data; keep the date condition in the ON clause,
-- because filtering on tf columns in WHERE silently turns the LEFT JOIN into an INNER JOIN
LEFT JOIN transaction_features tf 
    ON spine.customer_id = tf.customer_id
    AND spine.observation_date = tf.feature_date

-- Provide sensible defaults
COALESCE(tf.transaction_count, 0) as transaction_count,
COALESCE(tf.total_spent, 0) as total_spent,
COALESCE(tf.last_transaction_date, '1900-01-01') as last_transaction_date

Don’t use INNER JOIN for feature tables unless you explicitly want to exclude customers without activity. INNER JOIN silently drops examples, potentially creating selection bias.
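
A quick way to catch that bias, sketched with the label column from the spine defined earlier: compare the label rate in the spine against the label rate in the final feature table; any meaningful gap means examples were dropped non-randomly.

-- Cast the label to a numeric type first if it is stored as a boolean
SELECT 'spine' as source, AVG(label_churned_next_month) as churn_rate
FROM ml.churn_training_spine
UNION ALL
SELECT 'features', AVG(label_churned_next_month)
FROM ml.features_full;
-- The two rates should match closely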

Late-Arriving Data:

Data might arrive late—a transaction from yesterday appears in your warehouse today. For training sets, this usually isn’t critical since you’re looking at historical data that’s already settled. But for feature consistency with production, document the latency assumptions.

If training features use data as of T with a 2-day latency buffer, production must use the same buffer to match training distribution.
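
Making the buffer explicit in the join condition keeps training and serving honest about it; a sketch with the 2-day buffer assumed above:

LEFT JOIN transactions t
    ON t.customer_id = spine.customer_id
    AND t.transaction_date < spine.observation_date - INTERVAL '2 days'   -- exclude the most recent 2 days, which may not have landed yet
    AND t.transaction_date >= spine.observation_date - INTERVAL '90 days'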

Slowly Changing Dimensions:

Customer attributes might change over time (address, plan tier, status). Use point-in-time joins to snapshots rather than current-state dimension tables:

-- Correct: join to historical customer state
LEFT JOIN customer_snapshots cs
    ON spine.customer_id = cs.customer_id
    AND cs.snapshot_date = spine.observation_date

-- Incorrect: joins to current state, causing leakage
LEFT JOIN customers c
    ON spine.customer_id = c.customer_id

The incorrect version uses today’s customer attributes for historical predictions, leaking future information.

Conclusion

Creating ML training sets from large fact tables requires a disciplined approach that prioritizes correctness over raw performance while still achieving acceptable query execution times. The spine table pattern provides a foundation by explicitly defining training examples and controlling join fan-out from the start, while pre-aggregation and window functions prevent intermediate result explosion. Aggressive filtering through partitioning and careful temporal logic ensure point-in-time consistency that prevents label leakage, and systematic validation catches the edge cases that silently corrupt model training.

These practices aren’t just performance optimizations—they’re correctness guarantees that ensure your training data actually represents the problem you’re trying to solve. A training set created in 2 hours that contains label leakage is worthless, while one created in 10 hours with verified temporal consistency is production-ready. Master these patterns and your feature engineering pipelines will scale from millions to billions of examples while maintaining the data integrity that production ML systems demand.
