Machine learning models in production face a fundamental tension: they need features computed from both historical patterns and real-time events. A fraud detection model benefits from a user’s transaction history over months (batch) while also requiring instant analysis of the current transaction’s characteristics (streaming). A recommendation system needs deep collaborative filtering computed across all users (batch) combined with the user’s actions in the current session (streaming). This tension between comprehensive historical analysis and immediate responsiveness has driven the evolution of hybrid architectures that combine batch and streaming feature engineering—systems that deliver the best of both worlds while managing the inherent complexities of maintaining consistency across two fundamentally different computational paradigms.
Understanding the Batch-Streaming Divide
Before examining hybrid architectures, we must understand why batch and streaming processing represent distinct approaches with different strengths and inherent limitations.
Batch processing excels at comprehensive analysis across large datasets. When computing features like “user’s average transaction amount over the past 90 days” or “product’s popularity ranking among all users,” batch systems can efficiently process millions or billions of records. They leverage distributed computing frameworks like Spark or data warehouse systems to parallelize computation across entire datasets. Batch jobs can perform complex joins, aggregations, and statistical computations that would be prohibitively expensive in real-time.
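To make the batch side concrete, here is a minimal sketch of a scheduled batch feature job in PySpark; the table layout (user_id, amount, event_date) and the storage paths are illustrative assumptions, not a prescribed schema:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily_user_features").getOrCreate()

# Hypothetical transaction table produced by upstream ingestion
transactions = spark.read.parquet("s3://datalake/transactions/")

user_features = (
    transactions
    .filter(F.col("event_date") >= F.date_sub(F.current_date(), 90))
    .groupBy("user_id")
    .agg(
        F.avg("amount").alias("avg_transaction_amount_90d"),
        F.count("*").alias("transaction_count_90d"),
    )
)

# Overwrite the previous snapshot; the job reruns on a daily schedule
user_features.write.mode("overwrite").parquet("s3://features/user_batch_features/")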
The batch paradigm offers important advantages beyond computational efficiency. Batch jobs are easier to reason about—they operate on complete, immutable datasets with well-defined start and end points. Debugging is straightforward: rerun the job with the same input and you get the same output. Testing requires no special infrastructure: process sample data and verify results. Failure recovery is simple: restart the failed job.
However, batch processing has a fundamental limitation: latency. Batch jobs run on schedules—hourly, daily, or weekly. Features computed in batch reflect historical data, not current state. A user’s “average transaction amount” computed this morning doesn’t include transactions from this afternoon. For applications requiring immediate response to new data, batch processing alone is insufficient.
Streaming processing addresses this latency problem by computing features incrementally as events arrive. When a user clicks a product, streaming systems can immediately update session-level features like “products viewed in current session” or “time spent browsing category.” Streaming enables real-time personalization, instant fraud detection, and immediate anomaly alerting—use cases where minutes or hours of delay are unacceptable.
Streaming systems face their own challenges. They must maintain state across distributed systems while processing events that may arrive out of order, be delayed, or appear multiple times. Computing aggregations over sliding time windows requires careful state management. Debugging streaming applications is difficult—you can’t simply rerun events in the same order. Testing requires simulating event streams with proper timing and failure scenarios.
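As a framework-agnostic sketch of the per-event state maintenance involved, the following keeps per-user session state in memory and updates features on every click event; the field names and the 30-minute session timeout are assumptions for illustration:

from collections import defaultdict

SESSION_TIMEOUT_S = 30 * 60  # start a new session after 30 minutes of inactivity

session_state = defaultdict(
    lambda: {"products_viewed": set(), "session_start": None, "last_event": None}
)

def update_session_features(event):
    """Update session-level features incrementally as each click event arrives."""
    state = session_state[event["user_id"]]
    ts = event["timestamp"]

    # Reset the session if the gap since the last event exceeds the timeout
    if state["last_event"] is None or ts - state["last_event"] > SESSION_TIMEOUT_S:
        state["products_viewed"] = set()
        state["session_start"] = ts

    state["products_viewed"].add(event["product_id"])
    state["last_event"] = ts

    return {
        "user_id": event["user_id"],
        "products_viewed_in_session": len(state["products_viewed"]),
        "session_duration_s": ts - state["session_start"],
    }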
The key insight is that batch and streaming aren’t competing approaches—they’re complementary. Most production ML systems need both, which is where hybrid architectures come in.
Batch vs Streaming: Complementary Strengths
Batch Processing
- Complete historical analysis
- Complex aggregations efficient
- Deterministic & reproducible
- Simple debugging & testing
- 90-day user statistics
- Cross-user patterns
- Historical embeddings
Streaming Processing
- Real-time responsiveness
- Event-driven updates
- Low-latency features
- Continuous processing
- Session-level metrics
- Real-time counters
- Immediate anomalies
Architectural Patterns for Hybrid Feature Engineering
Several architectural patterns have emerged for integrating batch and streaming feature computation. Each pattern addresses the integration challenge differently, with tradeoffs in complexity, consistency, and operational overhead.
The Lambda Architecture Pattern
Lambda architecture, popularized by Nathan Marz, represents one of the first systematic approaches to combining batch and streaming. The architecture maintains two parallel pipelines: a batch layer that processes complete datasets to produce accurate views, and a speed layer that processes real-time streams to provide low-latency updates.
The batch layer periodically recomputes all features from raw data, providing eventual consistency and serving as the source of truth. It might run daily, computing features like “user’s 30-day transaction average” or “product popularity scores” across all historical data. These batch features are written to a serving layer—typically a key-value store or feature store—where they can be efficiently retrieved during inference.
The speed layer operates continuously, processing streaming events and computing incremental updates. When a new transaction arrives, the speed layer immediately updates real-time features like “transactions in the past hour” or “session-level activity metrics.” These streaming features supplement the batch features, filling the gap between batch job runs.
During model inference, the system retrieves features from both layers: batch features provide stable, comprehensive views while streaming features add real-time context. A prediction request for user fraud detection might fetch the user’s historical transaction patterns (batch) and their current session behavior (streaming), combining them into a complete feature vector.
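A minimal sketch of that inference-time assembly, assuming dict-like handles to the batch serving store and the speed layer's store (the store objects and feature names here are illustrative):

def get_fraud_features(user_id, batch_store, speed_store):
    """Assemble a feature vector from the batch serving layer plus the speed layer."""
    batch = batch_store.get(f"user:{user_id}") or {}     # written by the daily batch job
    realtime = speed_store.get(f"user:{user_id}") or {}  # updated continuously by the speed layer

    return {
        "avg_transaction_amount_30d": batch.get("avg_transaction_amount_30d", 0.0),
        "transaction_count_90d": batch.get("transaction_count_90d", 0),
        "transactions_last_hour": realtime.get("transactions_last_hour", 0),
        "session_transaction_count": realtime.get("session_transaction_count", 0),
    }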
Lambda architecture’s key strength is fault tolerance. If the streaming layer fails or produces incorrect results, the batch layer will eventually correct them. The batch recomputation provides a self-healing mechanism that catches bugs, handles edge cases missed by streaming logic, and maintains long-term data quality.
The significant drawback is complexity and duplication. You must implement and maintain two separate codebases computing the same logical features—once for batch and once for streaming. These implementations must produce consistent results despite different computational models. A bug in one pipeline but not the other creates subtle inconsistencies. Testing requires validating both pipelines independently and verifying their outputs align.
The Kappa Architecture Simplification
Kappa architecture, proposed by Jay Kreps, eliminates the dual codebase problem by treating everything as a stream. Instead of separate batch and streaming pipelines, Kappa uses a single streaming pipeline that can process both real-time events and historical data replayed as a stream.
The core insight is that batch processing is just streaming over historical data. If you need to recompute features from scratch, replay your event log through the streaming pipeline. The same code that processes today’s transactions can process last year’s transactions replayed from persistent storage (like Kafka with retention or a data lake).
Kappa architecture simplifies the development model significantly. Write your feature computation logic once, as a streaming application. Test it against sample event streams. Deploy it to process real-time events. When you need to recompute historical features—for backfilling, bug fixes, or logic changes—replay historical events through the same pipeline.
This approach works brilliantly when your features are purely event-based computations that don’t require complex joins with external datasets. A streaming pipeline can maintain running counts, sliding window aggregations, and sequential pattern detection with a single codebase.
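A toy sketch of the Kappa idea: one feature function consumes an event iterator, whether that iterator is a live Kafka consumer or a replay of the retained log (the event fields and sources are assumptions):

import json

def compute_features(event, running_counts):
    """Single streaming implementation, shared by live processing and historical replay."""
    user_id = event["user_id"]
    running_counts[user_id] = running_counts.get(user_id, 0) + 1
    return {"user_id": user_id, "lifetime_event_count": running_counts[user_id]}

def run_pipeline(raw_event_iterator):
    running_counts = {}
    for raw in raw_event_iterator:
        yield compute_features(json.loads(raw), running_counts)

# Live mode: iterate over messages from a Kafka consumer (hypothetical object)
# features = run_pipeline(messages_from_kafka)

# Backfill mode: replay the retained log or a data-lake export through the same code
# features = run_pipeline(open("historical_events.jsonl"))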
However, Kappa struggles with features requiring heavy batch-oriented operations. Computing “user’s percentile rank among all users by transaction volume” is straightforward in batch—sort all users and assign percentiles. In streaming, you need approximate algorithms or complex state management. Features involving complex multi-way joins, statistical computations across entire populations, or machine learning model training don’t map naturally to streaming paradigms.
The Hybrid Store Pattern with Feature Store
Modern feature stores provide a third pattern that acknowledges batch and streaming as distinct but coordinates them through a unified serving layer. Rather than forcing one computational model or duplicating logic, this pattern embraces both batch and streaming, using each for what it does best, while ensuring consistent feature access.
The feature store serves as the integration point. Batch pipelines compute comprehensive features and write them to the feature store with timestamps indicating their computation time and validity period. Streaming pipelines compute real-time features and write them to the same feature store, typically with shorter validity periods or as incremental updates.
The feature store handles the complexity of combining these sources. When a model requests features, the store retrieves the most recent batch features and merges them with any relevant streaming updates. The store manages timestamps, ensures consistency, and provides point-in-time correct features for training (historical features as they existed at training time) and online serving (latest available features).
This pattern allows you to choose batch or streaming based on feature characteristics, not architectural constraints. Use batch for:
- Features requiring complete dataset analysis (rankings, percentiles, global statistics)
- Complex multi-table joins and aggregations
- Computationally expensive operations like embedding generation
- Features that don’t change frequently (daily or weekly updates suffice)
Use streaming for:
- Real-time counters and accumulators
- Session-level features
- Immediate event pattern detection
- Features requiring sub-second freshness
The feature store abstracts these differences from the model. The model simply requests “user_30day_avg_transaction” and “user_current_session_activity” without knowing one comes from batch and the other from streaming.
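Conceptually, the lookup-time merge might look like the following sketch, which prefers a streaming value while it is still within its validity window and otherwise falls back to the batch value (the record layout with value, computed_at, and ttl_seconds is an assumption):

import time

def resolve_feature(batch_record, streaming_record, now=None):
    """Return the freshest valid value, preferring streaming while it remains valid."""
    now = now or time.time()
    if streaming_record and now - streaming_record["computed_at"] <= streaming_record["ttl_seconds"]:
        return streaming_record["value"]
    if batch_record and now - batch_record["computed_at"] <= batch_record["ttl_seconds"]:
        return batch_record["value"]
    return None  # feature expired or never computed

# Example: a recent streaming update overrides this morning's batch value
batch = {"value": 4, "computed_at": time.time() - 6 * 3600, "ttl_seconds": 24 * 3600}
stream = {"value": 7, "computed_at": time.time() - 120, "ttl_seconds": 2 * 3600}
print(resolve_feature(batch, stream))  # -> 7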
Implementing Feature Consistency Across Pipelines
The greatest challenge in hybrid architectures is maintaining consistency between batch and streaming feature computations. When both pipelines compute logically equivalent features, their results must align—not perfectly (due to timing and sampling differences) but within acceptable tolerances.
Define features with formal specifications. Don’t just implement “user transaction count”—specify exactly what this means. Does it include pending transactions? What timezone determines day boundaries? How are refunds handled? Clear specifications enable independent implementation validation.
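One lightweight way to capture such a specification is as a structured, machine-readable record that lives alongside the implementation; the fields and policy values below are illustrative:

USER_TRANSACTION_COUNT_SPEC = {
    "name": "user_transaction_count_30d",
    "description": "Count of settled transactions per user over a rolling 30-day window",
    "entity": "user_id",
    "window": "30 days, rolling, evaluated at event time",
    "timezone": "UTC day boundaries",
    "includes_pending": False,              # pending transactions are excluded
    "refund_handling": "refunds do not decrement the count",
    "owner": "risk-team",
    "version": 2,
}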
Share transformation logic where possible. Modern tools like Apache Beam provide abstraction layers that work across both batch and streaming. Write your feature transformation once using Beam’s API, then execute it in batch mode (on Spark or Dataflow batch) or streaming mode (on Flink or Dataflow streaming). This eliminates code duplication for transformations that can be expressed in both paradigms.
Here’s an example of unified transformation logic using Apache Beam:
import json
from datetime import datetime, timezone

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows, TimestampedValue
from apache_beam.transforms.trigger import AfterWatermark, AfterCount, AccumulationMode


class ComputeUserFeatures(beam.DoFn):
    """
    Feature computation that works in both batch and streaming modes.
    """

    def process(self, element):
        """
        Compute features from a user event.
        Works identically for historical replay (batch) or real-time (streaming).
        """
        user_id = element['user_id']
        event_type = element['event_type']
        timestamp = element['timestamp']
        amount = element.get('amount', 0)

        # Yield per-event feature contributions, keyed by user_id downstream
        yield {
            'user_id': user_id,
            'event_count': 1,
            'total_amount': amount,
            'event_type': event_type,
            'timestamp': timestamp,
        }


def aggregate_user_events(user_events):
    """Collapse one user's events (within a window) into feature values."""
    user_events = list(user_events)
    return {
        'total_events': sum(e['event_count'] for e in user_events),
        'total_amount': sum(e['total_amount'] for e in user_events),
        'unique_event_types': len(set(e['event_type'] for e in user_events)),
        'last_event_time': max(e['timestamp'] for e in user_events),
    }


def create_user_feature_pipeline(input_source, output_sink, window_duration=None):
    """
    Creates a pipeline that works for both batch and streaming.

    Args:
        input_source: Could be files (batch) or Pub/Sub (streaming)
        output_sink: Feature store, database, or files
        window_duration: Window size in seconds for streaming; None for batch
    """
    pipeline_options = PipelineOptions()

    with beam.Pipeline(options=pipeline_options) as pipeline:
        events = (
            pipeline
            | 'Read Events' >> beam.io.ReadFromText(input_source)  # Or ReadFromPubSub for streaming
            | 'Parse JSON' >> beam.Map(json.loads)
            | 'Add Timestamps' >> beam.Map(
                lambda x: TimestampedValue(x, x['timestamp'])
            )
        )

        # Apply windowing for streaming; for batch, events stay in the single global window
        if window_duration:
            events = events | 'Window' >> beam.WindowInto(
                FixedWindows(window_duration),
                trigger=AfterWatermark(early=AfterCount(100)),
                accumulation_mode=AccumulationMode.DISCARDING,
            )

        features = (
            events
            | 'Extract Features' >> beam.ParDo(ComputeUserFeatures())
            | 'Key by User' >> beam.Map(lambda x: (x['user_id'], x))
            | 'Group by User' >> beam.GroupByKey()
            | 'Aggregate by User' >> beam.MapTuple(
                lambda user_id, user_events: {
                    'user_id': user_id,
                    'features': aggregate_user_events(user_events),
                    'computed_at': datetime.now(timezone.utc).isoformat(),
                }
            )
        )

        # Write to the feature store (plain files here; a feature store API or database in production)
        features | 'Serialize' >> beam.Map(json.dumps) | 'Write Features' >> beam.io.WriteToText(output_sink)


# For batch processing
create_user_feature_pipeline(
    input_source='gs://bucket/historical_events/*.json',
    output_sink='gs://bucket/batch_features/',
    window_duration=None  # Process entire dataset as a single global window
)

# For streaming processing (same code!)
# Swap ReadFromText/WriteToText for ReadFromPubSub/WriteToPubSub and run with --streaming
create_user_feature_pipeline(
    input_source='projects/my-project/topics/user-events',
    output_sink='projects/my-project/topics/feature-updates',
    window_duration=3600  # 1-hour windows
)
This unified approach ensures consistency—the same transformation logic executes in both modes. Bugs get fixed once, not twice. Feature definitions remain synchronized automatically.
Implement consistency validation. Periodically compare batch and streaming feature values for the same entities and time periods. Significant divergence indicates bugs or architectural issues. Set up automated alerts when batch and streaming features differ by more than acceptable thresholds.
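A sketch of such a reconciliation check, assuming callables that fetch the batch and streaming values for an entity (the 5% tolerance and the alerting hook are placeholders):

def validate_feature_consistency(entity_ids, get_batch_value, get_streaming_value,
                                 relative_tolerance=0.05):
    """Compare the two pipelines' values for the same entities and flag divergence."""
    mismatches = []
    for entity_id in entity_ids:
        batch_val = get_batch_value(entity_id)
        stream_val = get_streaming_value(entity_id)
        if batch_val is None or stream_val is None:
            continue
        denominator = max(abs(batch_val), 1e-9)
        if abs(batch_val - stream_val) / denominator > relative_tolerance:
            mismatches.append((entity_id, batch_val, stream_val))

    if mismatches:
        # In production this would page the owning team or emit an alerting metric
        print(f"Feature divergence beyond tolerance for {len(mismatches)} entities")
    return mismatches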
Use feature stores with built-in consistency management. Modern feature stores like Feast, Tecton, or Databricks Feature Store provide capabilities specifically designed for hybrid scenarios. They handle timestamp management, ensure point-in-time correctness, and can automatically reconcile differences between batch and streaming sources.
Managing State in Streaming Feature Engineering
Streaming feature computation requires maintaining state across events—counters, accumulators, sliding windows, and session data. This state management is central to streaming architecture and deserves careful attention in hybrid systems.
Choose the appropriate state backend for your scale. Small-scale streaming applications can keep state in memory within the job process itself. This works for moderate throughput and limited state size (gigabytes, not terabytes). For larger scale, use embedded state backends like RocksDB (within Flink) or external stores like Redis.
In-memory state offers the lowest latency but limited capacity and no fault tolerance beyond checkpointing. Embedded state backends like RocksDB provide excellent performance with larger capacity and checkpoint-based recovery. External stores like Redis or Cassandra enable independent scaling of compute and storage but add network latency to every state access.
Implement proper checkpointing and recovery. Streaming applications must checkpoint state periodically to enable recovery from failures. Without checkpoints, failures require reprocessing from the beginning of your event log—acceptable for short-running jobs, catastrophic for production systems processing months of data.
Configure checkpointing based on your consistency requirements and performance constraints. More frequent checkpointing (every few minutes) provides faster recovery but adds overhead. Less frequent checkpointing (hourly) reduces overhead but means longer recovery times and more reprocessing after failures.
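As a minimal sketch of this configuration, assuming Flink (with the RocksDB backend mentioned above) via its Python API, and with purely illustrative intervals:

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.state_backend import EmbeddedRocksDBStateBackend

env = StreamExecutionEnvironment.get_execution_environment()

# RocksDB keeps large keyed state on local disk with checkpoint-based recovery
env.set_state_backend(EmbeddedRocksDBStateBackend())

# Checkpoint every 5 minutes: faster recovery at the cost of some runtime overhead
env.enable_checkpointing(5 * 60 * 1000)

checkpoint_config = env.get_checkpoint_config()
checkpoint_config.set_min_pause_between_checkpoints(60 * 1000)  # breathing room between checkpoints
checkpoint_config.set_checkpoint_timeout(10 * 60 * 1000)        # abandon checkpoints that take too long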
Handle late and out-of-order events gracefully. Events don’t always arrive in timestamp order. A mobile device might go offline, then upload a batch of events when connectivity returns. Your streaming pipeline must handle these late arrivals without corrupting state.
Use watermarks to track event time progress and decide when to finalize window computations. Watermarks represent the system’s best estimate of how late events might arrive. When the watermark passes a window’s end time, you can finalize that window’s computations. Configure allowed lateness based on your data characteristics—tight bounds (minutes) for well-behaved streams, loose bounds (hours) for streams with connectivity issues.
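In Beam terms, the same ideas look roughly like the following sketch; the hourly windows and two-hour allowed lateness are illustrative choices:

import apache_beam as beam
from apache_beam.transforms.window import FixedWindows
from apache_beam.transforms.trigger import AfterWatermark, AfterCount, AccumulationMode

def window_with_lateness(events):
    """Hourly windows that emit early results, then re-fire for events up to two hours late."""
    return events | 'WindowWithLateness' >> beam.WindowInto(
        FixedWindows(60 * 60),
        trigger=AfterWatermark(
            early=AfterCount(100),   # speculative firings before the watermark
            late=AfterCount(1),      # re-fire whenever a late event arrives
        ),
        allowed_lateness=2 * 60 * 60,  # seconds of lateness to accept after the watermark
        accumulation_mode=AccumulationMode.ACCUMULATING,
    )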
The Feature Store as Integration Hub
The feature store has evolved from a simple key-value store into a sophisticated system managing the complexity of hybrid architectures. Understanding how to leverage feature store capabilities is critical for production hybrid systems.
Separate online and offline storage. The feature store typically maintains two storage systems: an online store optimized for low-latency serving during inference, and an offline store optimized for large-scale training data generation. The online store (Redis, DynamoDB, Cassandra) provides millisecond access to the latest features. The offline store (S3, BigQuery, Snowflake) maintains complete feature history for training dataset creation.
Batch features typically write to both stores: online for serving, offline for training. Streaming features often write only to the online store with periodic snapshots to offline storage, since maintaining complete history of high-frequency updates is expensive and rarely needed.
Leverage point-in-time correctness for training. A critical feature store capability is generating training datasets with feature values as they existed at the time of each historical event, not as they exist now. Without this, training data suffers from target leakage: features incorporate information from the future that wouldn’t have been available at prediction time.
The feature store tracks feature timestamps and can reconstruct historical feature values. When generating training data for a purchase prediction model, it retrieves user features as they existed immediately before each historical purchase, ensuring the training data accurately represents what the model would see during real prediction.
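The underlying mechanics resemble an as-of join. This small pandas illustration picks, for each labeled event, the most recent feature value computed before the event (column names are illustrative):

import pandas as pd

labels = pd.DataFrame({
    "user_id": [1, 1],
    "event_timestamp": pd.to_datetime(["2024-03-01 10:00", "2024-03-15 09:00"]),
    "is_fraud": [0, 1],
})

feature_history = pd.DataFrame({
    "user_id": [1, 1, 1],
    "feature_timestamp": pd.to_datetime(["2024-02-28", "2024-03-05", "2024-03-14"]),
    "avg_transaction_amount_30d": [52.0, 61.5, 70.2],
})

# For each label row, take the most recent feature value computed *before* the event,
# never a value from the future
training_rows = pd.merge_asof(
    labels.sort_values("event_timestamp"),
    feature_history.sort_values("feature_timestamp"),
    left_on="event_timestamp",
    right_on="feature_timestamp",
    by="user_id",
    direction="backward",
)
print(training_rows[["event_timestamp", "avg_transaction_amount_30d", "is_fraud"]])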
Implement feature versioning. Feature definitions change over time. You fix bugs, add new computations, or change aggregation windows. The feature store must handle these changes without breaking existing models or losing historical data.
Version your features explicitly: user_transaction_avg_v1, user_transaction_avg_v2. This allows gradual migration—new models use v2, existing models continue with v1 until retrained. The feature store maintains multiple versions simultaneously, computing and serving each according to its definition.
Here’s a practical example of feature store integration:
from datetime import timedelta

from feast import Entity, FeatureStore, FeatureView, Field, FileSource, PushSource
from feast.types import Float32, Int64

# Define entity (what features describe)
user = Entity(
    name="user",
    join_keys=["user_id"],
    description="User entity for feature joins"
)

# Placeholder data sources -- in a real repository these point at the batch
# pipeline's output files/table and the streaming pipeline's push endpoint
user_stats_batch_source = FileSource(
    name="user_stats_batch_source",
    path="data/user_stats_batch.parquet",  # output of the daily batch pipeline
    timestamp_field="event_timestamp",
)
user_realtime_source = PushSource(
    name="user_realtime_source",
    batch_source=FileSource(
        name="user_realtime_snapshots",
        path="data/user_realtime_snapshots.parquet",  # periodic snapshots for offline retrieval
        timestamp_field="event_timestamp",
    ),
)

# Batch feature view (computed daily)
user_stats_batch = FeatureView(
    name="user_statistics_batch",
    entities=[user],
    ttl=timedelta(days=1),
    schema=[
        Field(name="avg_transaction_amount_30d", dtype=Float32),
        Field(name="transaction_count_90d", dtype=Int64),
        Field(name="account_age_days", dtype=Int64),
        Field(name="user_risk_score", dtype=Float32),
    ],
    source=user_stats_batch_source,  # Points to batch pipeline output
    tags={"team": "risk", "update_frequency": "daily"}
)

# Streaming feature view (continuously updated by pushes from the streaming pipeline)
user_realtime_features = FeatureView(
    name="user_realtime_activity",
    entities=[user],
    ttl=timedelta(hours=2),
    schema=[
        Field(name="session_transaction_count", dtype=Int64),
        Field(name="session_duration_minutes", dtype=Float32),
        Field(name="last_transaction_amount", dtype=Float32),
        Field(name="transactions_last_hour", dtype=Int64),
    ],
    source=user_realtime_source,  # Points to streaming pipeline
    tags={"team": "risk", "update_frequency": "realtime"}
)

# Initialize feature store
fs = FeatureStore(repo_path=".")


# Online serving: Get latest features for prediction
def get_features_for_prediction(user_id):
    """
    Retrieves both batch and streaming features for real-time prediction.
    """
    features = fs.get_online_features(
        features=[
            "user_statistics_batch:avg_transaction_amount_30d",
            "user_statistics_batch:transaction_count_90d",
            "user_statistics_batch:user_risk_score",
            "user_realtime_activity:session_transaction_count",
            "user_realtime_activity:transactions_last_hour",
        ],
        entity_rows=[{"user_id": user_id}]
    ).to_dict()
    return features


# Offline serving: Generate training dataset
def generate_training_data(transaction_df):
    """
    Creates training dataset with point-in-time correct features.
    Combines batch and streaming features as they existed at transaction time.
    """
    training_data = fs.get_historical_features(
        entity_df=transaction_df,  # Contains user_id and event_timestamp
        features=[
            "user_statistics_batch:avg_transaction_amount_30d",
            "user_statistics_batch:transaction_count_90d",
            "user_statistics_batch:user_risk_score",
            "user_realtime_activity:session_transaction_count",
            "user_realtime_activity:transactions_last_hour",
        ]
    ).to_df()
    return training_data


# Example: Get features for fraud detection prediction
# (to_dict() returns lists of values aligned with entity_rows)
user_features = get_features_for_prediction(user_id="user_12345")
print(f"Historical stats: {user_features['avg_transaction_amount_30d'][0]}")
print(f"Real-time activity: {user_features['session_transaction_count'][0]}")
This integration provides a clean interface: the model doesn’t need to know which features come from batch versus streaming pipelines. The feature store handles retrieval, combination, and consistency.
Hybrid Architecture Data Flow
Both pipelines write to a shared feature store, which provides consistent feature access for training (offline) and serving (online).