One of the most persistent challenges in building modern data platforms is the artificial divide between analytics and machine learning workflows. Data teams maintain separate pipelines—one feeding data warehouses for BI dashboards and SQL analytics, another feeding data lakes or feature stores for ML training and inference. This duplication wastes resources, creates consistency problems, and slows down both analytics and ML development. The lakehouse architecture promises to solve this by providing a unified platform that supports both analytical queries and ML workloads on the same datasets. However, achieving this unification requires specific architectural patterns that handle the distinct requirements of each use case while maintaining a single source of truth. This guide explores proven lakehouse patterns that bridge analytics and ML, enabling organizations to build more efficient, consistent, and agile data platforms.
Understanding the Analytics-ML Divide
Before examining solutions, it’s important to understand why analytics and ML datasets have historically lived in separate systems, and what problems this separation creates.
Traditional data warehouses excel at structured analytical queries—aggregations, joins, filtering on clean, well-modeled data. They optimize for SQL performance through techniques like columnar storage, aggressive caching, and materialized views. The data model typically follows dimensional modeling with facts and dimensions, and transformations happen through SQL-based tools like dbt. This environment suits business analysts running dashboards, data analysts doing exploratory analysis, and BI tools generating reports.
Machine learning workflows have different requirements. ML engineers need access to raw, granular data—not just aggregated summaries. They work with diverse data types including structured tables, semi-structured JSON, unstructured text, and binary files like images. Feature engineering often requires complex transformations that go beyond SQL capabilities, using Python libraries like pandas, NumPy, or domain-specific tools. Training datasets need reproducible snapshots, not continuously updated tables. Model serving requires low-latency access to features, sometimes at millions of requests per second.
These differing requirements led to the classic architecture where analytics happens in Snowflake/BigQuery/Redshift while ML happens in S3/GCS/Azure Data Lake using Spark, Python scripts, and specialized ML platforms. The problem? Now you have two copies of your data with separate pipelines, different latencies, potential inconsistencies, and doubled infrastructure costs. When a new data source arrives, both teams need to build separate ingestion. When definitions change, both pipelines need updating. When bugs arise, they might appear in one system but not the other.
The lakehouse architecture addresses this by providing ACID transactions, schema enforcement, and query optimization on data lake storage. This combination enables both analytical SQL queries and ML data access from the same tables, finally unifying these workloads.
Core Lakehouse Pattern: Medallion Architecture for Multi-Purpose Data
The medallion architecture—organizing data into bronze, silver, and gold layers—provides the foundation for supporting both analytics and ML from unified datasets. However, implementing this pattern effectively for both use cases requires understanding how each layer serves different purposes.
Bronze Layer: Raw Data Landing Zone
The bronze layer stores raw data exactly as received from source systems. This immutability is crucial for both analytics and ML—it provides audit trails for compliance, enables reprocessing when logic changes, and serves as the ultimate source of truth.
For analytics, bronze tables support exploratory analysis and ad-hoc queries when business users need to investigate edge cases not covered by downstream models. For ML, bronze data is essential for feature engineering experimentation and training data generation when you need the full fidelity of source data without aggregations or joins that might destroy signals.
Key implementation patterns for bronze:
- Store data in open formats (Parquet, Delta, Iceberg) not proprietary warehouse formats
- Partition by ingestion timestamp for efficient incremental processing
- Include metadata columns: source system, ingestion timestamp, data lineage
- Use schema-on-read where appropriate to handle evolving source schemas
- Enable time travel for reproducing historical ML training datasets
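As a concrete sketch of these patterns, here is a minimal PySpark bronze-ingestion job. The landing path, source name, and metadata column names are illustrative assumptions rather than fixed conventions:

# Bronze ingestion sketch: land raw data with lineage metadata
from pyspark.sql import functions as F

raw = spark.read.json("s3://landing/orders/")  # hypothetical landing path

bronze = (
    raw
    .withColumn("_source_system", F.lit("orders_api"))  # lineage: originating system
    .withColumn("_ingested_at", F.current_timestamp())  # ingestion timestamp
    .withColumn("_ingest_date", F.current_date())       # partition column
)

(bronze.write
    .format("delta")
    .mode("append")
    .partitionBy("_ingest_date")  # enables efficient incremental processing
    .saveAsTable("bronze.orders_raw"))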
Silver Layer: Cleaned and Conformed Data
Silver tables contain cleaned, validated, and conformed data. Records are deduplicated, invalid data is filtered or corrected, and schemas are standardized. This layer bridges bronze’s raw fidelity with gold’s business-oriented modeling.
For analytics, silver provides cleaner data for joins and aggregations without the complexity of business logic. For ML, silver tables are the primary source for feature engineering—clean enough to work with but granular enough to extract diverse features.
The critical pattern here is maintaining row-level granularity in silver. Don’t aggregate or summarize—those transformations belong in gold. Keep silver at the grain of individual events, transactions, or observations so ML engineers can engineer features requiring granular data while analysts can still query it efficiently.
Implementation patterns:
- Apply data quality rules consistently across all downstream use cases
- Store in formats supporting efficient columnar reads (Parquet, ORC)
- Partition strategically for both batch ML training (by date) and analytical queries (by business dimensions)
- Use Z-ordering or clustering on columns used in both analytical filters and ML feature lookups
- Maintain slowly changing dimensions with full history for point-in-time correct ML features
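A minimal PySpark sketch of a silver job applying these patterns, assuming the hypothetical bronze.orders_raw table from the bronze example with an order_id business key:

# Silver cleaning sketch: dedupe on a business key and enforce quality rules
from pyspark.sql import functions as F
from pyspark.sql.window import Window

bronze = spark.table("bronze.orders_raw")

# Keep only the latest ingested record per business key (deduplication)
latest = Window.partitionBy("order_id").orderBy(F.col("_ingested_at").desc())

silver = (
    bronze
    .withColumn("_rn", F.row_number().over(latest))
    .filter("_rn = 1")
    .drop("_rn")
    # One set of quality rules for every downstream consumer
    .filter(F.col("order_amount").isNotNull() & (F.col("order_amount") >= 0))
    .withColumn("order_date", F.to_date("order_timestamp"))
)

(silver.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("order_date")  # date partitions serve batch ML training and analytics
    .saveAsTable("silver.orders"))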
Gold Layer: Business-Optimized Views
Gold tables contain business-aggregated, denormalized, and optimized data for specific use cases. For analytics, this includes dimensional models (star schemas), aggregated metrics tables, and materialized dashboards. For ML, gold includes feature tables, pre-computed features for low-latency serving, and model training datasets.
The pattern that unifies these seemingly different purposes is treating gold as the “business logic” layer that can be rematerialized from silver as needed. When analytics needs change or ML models require different features, you don’t modify silver—you create new gold tables with different transformations.
Medallion Layer Responsibilities
- Bronze (Raw): Immutable source data, full history, audit trail, reprocessing foundation
- Silver (Cleaned): Validated data kept at its original grain, shared foundation for analytics and ML
- Gold (Business): Optimized for specific use cases—dimensional models, aggregations, feature tables

Key Principle: Multiple gold tables can derive from the same silver data, supporting diverse requirements
Pattern 1: Feature Tables as Dimensional Facts
One powerful unification pattern treats ML feature tables as specialized fact tables in dimensional modeling. This recognizes that features and metrics often derive from the same underlying events—they’re just different transformations of the same data.
Consider user engagement data. Analytics needs daily active users, session duration averages, and engagement trends for dashboards. ML needs per-user features like total sessions, average session length, and engagement patterns for churn prediction models. These derive from the same event stream but with different granularity and windowing.
The pattern: Create gold tables at user-day grain containing both aggregated metrics for analytics and features for ML:
-- user_daily_engagement (gold layer)
-- Serves both analytics dashboards and ML features
CREATE TABLE gold.user_daily_engagement (
  user_id STRING,
  date DATE,
  -- Analytics metrics (aggregated for reporting)
  total_sessions INT,
  total_duration_minutes DOUBLE,
  pages_viewed INT,
  -- ML features (windowed historical aggregates)
  sessions_last_7d INT,
  sessions_last_30d INT,
  avg_session_duration_7d DOUBLE,
  days_since_last_session INT,
  engagement_trend_7d DOUBLE, -- Slope of sessions over last 7 days
  -- Shared computed columns
  is_power_user BOOLEAN,
  engagement_score DOUBLE
)
USING delta
PARTITIONED BY (date);

-- Delta does not allow CLUSTER BY together with PARTITIONED BY; after data loads,
-- co-locate rows for user-level lookups with:
-- OPTIMIZE gold.user_daily_engagement ZORDER BY (user_id);
This table serves analytics through aggregation queries:
-- Analytics: Daily active users trend
SELECT date, COUNT(DISTINCT user_id) as dau
FROM gold.user_daily_engagement
GROUP BY date;
And serves ML through row-level feature extraction:
# ML: Training data for churn prediction
features = (
    spark.table("gold.user_daily_engagement")
    .filter("date >= '2024-01-01'")
    .select(
        "user_id", "date",
        "sessions_last_7d", "avg_session_duration_7d",
        "engagement_trend_7d", "days_since_last_session",
    )
)
This pattern reduces redundancy—one transformation pipeline feeds both use cases. When the engagement definition changes, one update propagates to both analytics and ML. Training/serving skew and metric inconsistency are eliminated at the source, because both workloads read literally the same columns.
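For completeness, the windowed feature columns in the DDL above could be populated with Spark window functions along these lines. This is a sketch assuming a silver table already at user-day grain; the input table name is hypothetical:

# Rolling 7- and 30-day features over user-day rows (input table assumed)
from pyspark.sql import functions as F
from pyspark.sql.window import Window

daily = spark.table("silver.user_sessions_daily")  # hypothetical user-day grain input

def trailing_days(n):
    # n-day range window ending at the current row's date (86400 seconds per day)
    return (Window.partitionBy("user_id")
            .orderBy(F.col("date").cast("timestamp").cast("long"))
            .rangeBetween(-(n - 1) * 86400, 0))

features = (
    daily
    .withColumn("sessions_last_7d", F.sum("total_sessions").over(trailing_days(7)))
    .withColumn("sessions_last_30d", F.sum("total_sessions").over(trailing_days(30)))
    .withColumn("avg_session_duration_7d",
                F.avg("total_duration_minutes").over(trailing_days(7)))
)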
Pattern 2: Point-in-Time Correct Feature Engineering
A critical challenge in unified lakehouse architectures is ensuring ML features reflect what was actually known at prediction time, not future information. This point-in-time correctness is less critical for analytics (which often looks at current state) but essential for ML (which must not leak future information into training).
The pattern involves maintaining temporal validity in your silver and gold tables through effective-dated modeling and careful join logic.
Implementing Temporal Validity:
Store all changes to dimensional data with valid-from and valid-to timestamps:
-- Silver: Customer dimensions with temporal validity
CREATE TABLE silver.customers_history (
  customer_id STRING,
  valid_from TIMESTAMP,
  valid_to TIMESTAMP,
  -- Attributes that change over time
  customer_tier STRING,
  account_status STRING,
  primary_region STRING,
  lifetime_value DOUBLE,
  -- Delta cannot partition by an expression directly; use a generated column
  valid_from_date DATE GENERATED ALWAYS AS (CAST(valid_from AS DATE))
)
USING delta
PARTITIONED BY (valid_from_date);
When creating feature tables for ML, join using point-in-time logic:
-- Gold: Training dataset with point-in-time correct features
CREATE TABLE gold.customer_features_training AS
SELECT
  t.transaction_id,
  t.transaction_timestamp,
  t.customer_id,
  -- Point-in-time customer attributes
  c.customer_tier,
  c.account_status,
  c.lifetime_value,
  -- Target variable (future outcome)
  t.transaction_amount
FROM silver.transactions t
LEFT JOIN silver.customers_history c
  ON t.customer_id = c.customer_id
  AND t.transaction_timestamp >= c.valid_from
  -- Current records are open-ended (valid_to IS NULL), so coalesce to a far-future bound
  AND t.transaction_timestamp < COALESCE(c.valid_to, TIMESTAMP '9999-12-31');
This join ensures that features reflect customer attributes as they existed at transaction time, not current values. The same customer_tier used to train the model is what would be available at inference time.
Analytics can use current values:
-- Analytics: Current customer segments
SELECT customer_tier, COUNT(*) as customer_count
FROM silver.customers_history
WHERE valid_to IS NULL -- Current records only
GROUP BY customer_tier;
This pattern lets analytics queries default to current state while ML queries explicitly enforce point-in-time correctness. Both queries hit the same underlying history table but apply different temporal filters.
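Maintaining the history table itself is typically a two-step process: close out the open record for each changed entity, then append the new version. Here is a sketch using Delta’s Python merge API, with a hypothetical staging table of changed customer rows carrying a changed_at timestamp:

# Two-step SCD2 maintenance with the Delta Python API (staging table assumed)
from delta.tables import DeltaTable
from pyspark.sql import functions as F

history = DeltaTable.forName(spark, "silver.customers_history")
updates = spark.table("staging.customer_changes")  # hypothetical staging table

# Step 1: close out the currently open row for each changed customer
(history.alias("h")
    .merge(updates.alias("u"),
           "h.customer_id = u.customer_id AND h.valid_to IS NULL")
    .whenMatchedUpdate(set={"valid_to": "u.changed_at"})
    .execute())

# Step 2: append the new version with an open-ended validity window
(updates
    .withColumn("valid_from", F.col("changed_at"))
    .withColumn("valid_to", F.lit(None).cast("timestamp"))
    .drop("changed_at")
    .write.format("delta").mode("append")
    .saveAsTable("silver.customers_history"))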
Pattern 3: Streaming Features for Real-Time ML and Analytics
Modern data platforms increasingly need to support both batch analytics and real-time ML inference. The lakehouse can unify these through streaming feature computation patterns.
The architecture uses streaming processing (Spark Structured Streaming, Flink, Kafka Streams) to continuously compute features and write them to Delta/Iceberg tables. These tables serve both real-time ML inference APIs (low-latency lookup) and batch analytics queries (aggregation over historical data).
Implementation Pattern:
# Streaming feature computation writing to Delta
from pyspark.sql.functions import avg, col, count, from_json, window
from pyspark.sql.types import (DoubleType, StringType, StructField,
                               StructType, TimestampType)

# Schema of the JSON event payload on the Kafka topic
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("session_duration", DoubleType()),
])

streaming_events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "user_events")
    .load()
    # Kafka delivers binary key/value pairs; parse the JSON value into typed columns
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Compute rolling aggregates
windowed_features = (
    streaming_events
    .withWatermark("event_time", "1 hour")
    .groupBy(
        window("event_time", "1 hour", "10 minutes"),
        "user_id",
    )
    .agg(
        count("*").alias("event_count"),
        avg("session_duration").alias("avg_duration"),
    )
)

# Write to Delta for serving and analytics
(windowed_features.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/checkpoints/user_features")
    .toTable("gold.user_features_realtime")
)
This streaming pipeline writes to a Delta table that supports:
Real-time ML inference: Feature lookup with low latency for online predictions
# Inference: Fetch latest features for user
latest_features = spark.sql("""
SELECT * FROM gold.user_features_realtime
WHERE user_id = '12345'
ORDER BY window.end DESC
LIMIT 1
""").toPandas()
Batch analytics: Historical analysis over the same feature stream
-- Analytics: Hourly feature trend analysis
SELECT
date_trunc('day', window.end) as date,
AVG(avg_duration) as daily_avg_duration
FROM gold.user_features_realtime
GROUP BY date_trunc('day', window.end);
The key enabler is Delta Lake’s ability to support concurrent reads (from inference APIs and analytics queries) and writes (from streaming pipeline) with ACID guarantees. This wouldn’t be possible with traditional data lake formats that lack transaction support.
Key Benefits of Unified Streaming Features
- Single Pipeline: One streaming computation serves both real-time ML and batch analytics
- Consistency: ML inference uses the same features that analytics reports on
- Reduced Latency: Features pre-computed and materialized, not computed on-demand
- Historical Analysis: Complete feature history available for model debugging and analytics
- Cost Efficiency: Shared infrastructure and computation reduce operational overhead
Pattern 4: Training Dataset Versioning and Reproducibility
ML models require reproducible training datasets—the exact same data that produced a model version must be available months later for retraining, auditing, or debugging. Analytics typically works with “current” data and doesn’t need this reproducibility. The lakehouse pattern for unifying these needs uses Delta time travel and dataset versioning.
Delta Time Travel for Dataset Snapshots:
Delta Lake’s time travel capability allows querying tables as they existed at specific timestamps or versions. This provides reproducible ML datasets without physically copying data:
# Train model on data as it existed on specific date
training_data = spark.read.format("delta") \
.option("timestampAsOf", "2024-01-15") \
.table("silver.features")
# Or by version number
training_data = spark.read.format("delta") \
.option("versionAsOf", 42) \
.table("silver.features")
When you train a model, log the timestamp or version number:
import mlflow

with mlflow.start_run():
    # Train model
    model = train_model(training_data)
    # Log dataset version for reproducibility
    mlflow.log_param("dataset_table", "silver.features")
    mlflow.log_param("dataset_timestamp", "2024-01-15T00:00:00Z")
    mlflow.log_param("dataset_version", 42)
Later, you can reproduce the exact training dataset:
# Reproduce training dataset for model version X
params = mlflow.get_run(run_id).data.params
training_data_reproduction = spark.read.format("delta") \
.option("timestampAsOf", params["dataset_timestamp"]) \
.table(params["dataset_table"])
Benefits for Analytics and ML:
For ML, this provides the reproducibility required for model governance and debugging. For analytics, the same tables benefit from time travel for regulatory compliance, audit trails, and historical analysis. Both use cases benefit from a single mechanism rather than separate versioning systems.
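On the analytics side, the same mechanism answers questions like “what did this table look like last month?” without extra infrastructure. A small sketch, reusing the table from the examples above:

# Compare current state against a historical snapshot via time travel
previous = (spark.read.format("delta")
            .option("timestampAsOf", "2024-01-01")
            .table("silver.features"))
current = spark.table("silver.features")

print(f"Rows added since snapshot: {current.count() - previous.count()}")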
Retention and Optimization:
Delta time travel has retention costs—old versions consume storage. Configure retention based on requirements:
-- Retain 30 days for analytics, longer for ML training datasets
ALTER TABLE silver.features
SET TBLPROPERTIES (
  'delta.logRetentionDuration' = 'interval 30 days',
  'delta.deletedFileRetentionDuration' = 'interval 30 days'
);

-- For critical ML training tables, extend retention
ALTER TABLE gold.ml_training_datasets
SET TBLPROPERTIES (
  'delta.logRetentionDuration' = 'interval 365 days',
  'delta.deletedFileRetentionDuration' = 'interval 365 days'
);
This tiered approach balances storage costs with reproducibility needs—general tables have shorter retention, critical ML training tables have extended retention.
Pattern 5: Unified Metadata and Feature Discovery
As lakehouses grow, discovering what data exists and understanding its meaning becomes critical for both analytics and ML teams. A unified metadata layer prevents the fragmentation where analytics has a data catalog while ML has a separate feature store.
Implementing Unified Metadata:
Use a metadata layer that spans both analytics tables and ML features, treating features as specialized columns with additional semantics:
# Unified metadata definition
table: silver.user_behavior
description: "Cleaned user behavior events for analytics and ML"
domain: customer_analytics
columns:
  - name: user_id
    type: string
    description: "Unique user identifier"
    tags: [pii, entity_key]
  - name: event_timestamp
    type: timestamp
    description: "When event occurred"
    tags: [time_key]
  - name: session_duration_seconds
    type: double
    description: "Session length in seconds"
    tags: [metric, ml_feature]
    feature_type: numerical
    feature_engineering: "Raw session duration, use with log transform for ML"
  - name: pages_viewed
    type: integer
    description: "Number of pages viewed in session"
    tags: [metric, ml_feature]
    feature_type: numerical
    feature_engineering: "Count feature, consider bucketing for categorical treatment"
This unified metadata serves multiple purposes:
For Analytics Users:
- Data catalog showing available tables and columns
- Business definitions and calculation logic
- Data lineage showing source systems and transformations
- Usage metrics showing which tables are queried most
For ML Engineers:
- Feature discovery showing available features and their properties
- Feature engineering suggestions and transformations
- Feature statistics (distributions, missing rates, cardinality)
- Model feature usage tracking which models use which features
Implementation Tools:
Several platforms support this unified approach:
- DataHub or Amundsen: Open-source data catalogs that can model both tables and features
- Databricks Unity Catalog: Unified governance across analytics and ML assets
- Feast Feature Store on Lakehouse: Feature metadata layer on top of Delta tables
- Custom metadata stores: Using Delta tables themselves to store metadata
The key is ensuring both teams use the same metadata system rather than maintaining separate catalogs and feature registries.
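As a deliberately minimal version of the last option, the metadata itself can live in a Delta table that both teams query. The schema and table names below are assumptions for illustration:

# Persist column-level metadata in a Delta table (names here are assumptions)
from pyspark.sql import Row

column_metadata = spark.createDataFrame([
    Row(table_name="silver.user_behavior", column_name="session_duration_seconds",
        description="Session length in seconds",
        tags=["metric", "ml_feature"], feature_type="numerical"),
    Row(table_name="silver.user_behavior", column_name="pages_viewed",
        description="Number of pages viewed in session",
        tags=["metric", "ml_feature"], feature_type="numerical"),
])

column_metadata.write.format("delta").mode("append").saveAsTable("meta.column_catalog")

# Feature discovery becomes a plain SQL query for either team
ml_features = spark.sql("""
    SELECT table_name, column_name
    FROM meta.column_catalog
    WHERE array_contains(tags, 'ml_feature')
""")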
Pattern 6: Computational Efficiency Through Shared Materialization
Both analytics and ML often require expensive computations—complex joins, aggregations, or transformations. The lakehouse pattern for computational efficiency materializes these computations once and serves both use cases.
Shared Aggregation Tables:
When both analytics dashboards and ML models need aggregated data, compute it once:
-- Gold: Shared aggregated customer metrics
CREATE OR REPLACE TABLE gold.customer_metrics_daily
AS SELECT
customer_id,
date,
-- Aggregated metrics for analytics
SUM(order_amount) as total_revenue,
COUNT(DISTINCT order_id) as order_count,
AVG(order_amount) as avg_order_value,
-- Features for ML
SUM(order_amount) / NULLIF(COUNT(DISTINCT order_id), 0) as revenue_per_order,
COUNT(DISTINCT product_category) as category_diversity,
MAX(order_amount) as max_single_order,
-- Shared derived metrics
CASE
WHEN SUM(order_amount) > 1000 THEN 'high_value'
WHEN SUM(order_amount) > 100 THEN 'medium_value'
ELSE 'low_value'
END as value_segment
FROM silver.orders
GROUP BY customer_id, date;
This single materialized table eliminates redundant computation. Analytics queries aggregate further:
-- Analytics: Revenue by customer segment
SELECT value_segment, SUM(total_revenue) as segment_revenue
FROM gold.customer_metrics_daily
WHERE date >= CURRENT_DATE - 30
GROUP BY value_segment;
While ML queries use the precomputed features directly:
# ML: Training features for churn prediction
features = spark.table("gold.customer_metrics_daily") \
.filter("date >= '2024-01-01'") \
.select("customer_id", "date", "revenue_per_order",
"category_diversity", "value_segment")
Performance Benefits:
- Reduced compute costs: Expensive aggregations computed once, not separately for each use case
- Faster query times: Both analytics and ML read from pre-materialized tables
- Consistent results: Impossible for analytics and ML to see different aggregated values
- Simplified pipelines: One transformation DAG instead of two parallel pipelines
Governance and Access Control Across Use Cases
Unified lakehouses must enforce appropriate access controls that recognize different needs of analytics and ML users while maintaining security and compliance.
Row-Level and Column-Level Security:
Implement fine-grained access control that applies consistently:
-- Create views with row-level security
CREATE VIEW analytics.customers_filtered AS
SELECT * FROM silver.customers
WHERE
  CASE
    -- Group-membership check: is_account_group_member() on Databricks;
    -- other platforms expose equivalent functions
    WHEN is_account_group_member('analyst_team') THEN region = 'US'
    WHEN is_account_group_member('ml_team') THEN TRUE
    ELSE FALSE
  END;
This pattern gives ML teams full data access (needed for representative training data) while restricting analytics users to regional subsets (compliance requirement). The same underlying table serves both with different access policies.
Column Masking for PII:
Mask sensitive columns differently for different users:
-- Analytics: Masked email for privacy
SELECT
  customer_id,
  CASE
    WHEN is_account_group_member('ml_team') THEN email
    ELSE CONCAT('****', SUBSTRING(email, INSTR(email, '@'), 100))
  END as email
FROM silver.customers;
ML gets full email for feature engineering (domain extraction, validation features), while analytics gets masked values sufficient for counting but not identifying individuals.
Audit Trails:
The lakehouse’s transaction log records every change to a table: who wrote what, when, and through which operation. Combined with access logs from the governance layer, this provides unified audit trails, which are critical for both analytics governance (regulatory compliance) and ML governance (model audit requirements).
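On the write side, Delta exposes this log directly through DESCRIBE HISTORY. A quick sketch of pulling per-operation history for an audit review:

# Inspect the Delta transaction log for a table; version, timestamp, userName,
# and operation are standard columns of DESCRIBE HISTORY output
history = spark.sql("DESCRIBE HISTORY silver.customers")
history.select("version", "timestamp", "userName", "operation").show(truncate=False)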
Conclusion
Lakehouse patterns for unifying analytics and ML datasets represent a fundamental shift from the traditional separation of these workloads into distinct systems. By implementing medallion architecture with careful attention to granularity, temporal correctness, and computational efficiency, organizations can build platforms where a single source of truth serves both analytical dashboards and machine learning models. The patterns covered—treating features as dimensional facts, maintaining point-in-time correctness, streaming features to unified tables, versioning training datasets through time travel, unified metadata discovery, and shared materialization—provide concrete approaches to achieving this unification without sacrificing the specific requirements of either use case.
The benefits extend beyond cost savings and operational simplicity. Unified lakehouses accelerate development by eliminating redundant work, improve trust through consistency between analytics and ML, and enable new capabilities like real-time features serving both live predictions and interactive dashboards. As data platforms evolve, these patterns will become foundational to organizations building modern, AI-driven analytics platforms. The key to success lies not in forcing analytics and ML into identical workflows, but in recognizing their shared foundations while accommodating their distinct requirements through thoughtful lakehouse architecture and implementation patterns.