Machine learning systems in production face a critical challenge: managing features consistently across training and inference while maintaining low latency and high availability. A feature store solves this problem by providing a centralized repository for feature definitions, computations, and serving infrastructure. Building a feature store on AWS leverages the cloud provider’s extensive data and ML services to create a scalable, reliable system that accelerates model development and ensures production consistency. This architecture becomes essential as organizations scale from a few models to dozens or hundreds, where feature reuse and standardization deliver compounding returns.
The feature store serves as the bridge between raw data and machine learning models, transforming data pipelines into reusable feature pipelines. Rather than each data scientist independently computing features from scratch, teams define features once and consume them across multiple models. This approach eliminates the training-serving skew that plagues many ML systems, where subtle differences in feature computation between training and production cause mysterious model degradation. AWS provides the building blocks needed to construct feature stores that handle both batch and real-time workloads while integrating with existing data infrastructure.
Understanding Feature Store Architecture
Core Components and Responsibilities
A production feature store comprises several interconnected components that work together to transform raw data into ML-ready features. The feature definition layer provides the interface where data scientists and ML engineers declare features using code, typically defining transformation logic that converts raw data into engineered features. These definitions become the source of truth for how features are computed across all environments.
The feature computation engine executes transformation logic on raw data, handling both batch processing for historical data and stream processing for real-time features. AWS services like AWS Glue for batch transformations and Amazon Kinesis for streaming data provide the computational backbone. The engine must handle data at scale, processing millions or billions of records while maintaining consistency and correctness.
Storage in a feature store is dual-purpose, maintaining both offline and online stores. The offline store keeps historical feature values used for training models, typically stored in columnar formats optimized for analytical queries. Amazon S3 serves as the primary offline store, with data organized in formats like Parquet for efficient scanning. The online store provides low-latency access to current feature values for real-time inference, using databases like Amazon DynamoDB or Amazon ElastiCache that can serve features within single-digit milliseconds.
The feature registry acts as a catalog and metadata store, tracking available features, their schemas, lineage information, and statistics. AWS Lake Formation and AWS Glue Data Catalog provide foundational services for metadata management, while custom registries built on Amazon RDS or DynamoDB can store feature-specific metadata like computation logic, owners, and usage patterns.
Data Flow Patterns
Feature stores implement distinct data flows for training and inference workloads. Training flows typically operate in batch mode, retrieving historical feature values for defined time windows. A training job specifies a feature list and observation timestamps, and the feature store returns point-in-time correct features—the feature values as they existed at each observation time, avoiding data leakage from future information.
Inference flows split between batch and real-time patterns. Batch inference resembles training, retrieving features for large sets of entities simultaneously. Real-time inference requires sub-100 millisecond latency, fetching current feature values for individual predictions. This dual-serving requirement drives the offline-online store architecture, where batch workloads use S3 while real-time predictions query DynamoDB.
Feature freshness varies by use case and drives different update patterns. Some features update daily through scheduled batch jobs, while others require near-real-time updates as events occur. A feature store must support both patterns, using AWS Glue workflows for batch updates and Amazon Kinesis with AWS Lambda for streaming updates.
[Figure: Feature Store Architecture on AWS. Data sources (Kinesis streaming, RDS/DynamoDB databases) feed compute (Lambda for real-time updates, EMR for complex transforms), which populates storage (DynamoDB online store, Glue Catalog metadata) and serving (API Gateway + Lambda real-time endpoints).]
Implementing Offline Feature Store with S3 and Glue
Designing the Storage Schema
The offline feature store’s storage schema determines query performance and cost efficiency. Organizing features in S3 requires careful consideration of partitioning strategies, file formats, and directory structures. The most common pattern partitions data by feature group and date, creating paths like s3://features/user_demographics/date=2024-11-28/ that enable efficient time-based queries and incremental updates.
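As a minimal sketch of how this layout pays off at read time (assuming a Spark environment such as Glue or EMR), the snippet below reads just the 2024-11-28 partition of the hypothetical user_demographics feature group, letting Spark prune all other date prefixes.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("feature-read").getOrCreate()

# Point at the feature-group root; the date filter prunes to a single partition,
# so only s3://features/user_demographics/date=2024-11-28/ is scanned
demographics = (
    spark.read.parquet("s3://features/user_demographics/")
    .where("date = '2024-11-28'")
)
demographics.show(5)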
Choosing between row-based and columnar formats impacts both storage costs and query performance. Parquet has emerged as the standard for feature stores, offering excellent compression ratios and columnar access patterns that suit ML training workloads, which typically scan a selected set of feature columns across large numbers of rows. Parquet’s schema evolution capabilities also simplify feature additions without rewriting historical data.
Feature versioning within the offline store enables reproducibility and experimentation. Each feature computation can produce a new version, stored in separate S3 prefixes or encoded in the data itself with version columns. This allows data scientists to compare model performance across feature versions or roll back to previous definitions when new features degrade model quality.
Building Feature Pipelines with AWS Glue
AWS Glue provides the serverless ETL infrastructure for computing batch features. Glue jobs written in Python or Scala execute transformation logic on distributed datasets, reading from various sources and writing computed features to S3. The Glue job design should separate raw data reading, feature transformation, and feature writing into distinct stages for modularity and testability.
A typical feature pipeline begins with a Glue crawler that discovers raw data schemas and populates the Glue Data Catalog. Subsequent Glue jobs reference these catalog entries, benefiting from schema inference and partition pruning. The transformation logic applies feature engineering—aggregations, encodings, joins—producing feature values for each entity and timestamp.
Here’s an example Glue job that computes user features from event data:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ['JOB_NAME', 'execution_date'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Overwrite only the partition being written, not the entire feature-group path
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

execution_date = args['execution_date']

# Read raw events from S3 via the Glue Data Catalog, pruning to the execution date
events_df = glueContext.create_dynamic_frame.from_catalog(
    database="raw_data",
    table_name="user_events",
    push_down_predicate=f"date='{execution_date}'"
).toDF()

# Compute user activity features: per-user aggregates over the day's events
user_features = events_df.groupBy("user_id").agg(
    F.count("event_id").alias("event_count_24h"),
    F.countDistinct("session_id").alias("session_count_24h"),
    F.sum(F.when(F.col("event_type") == "purchase", 1).otherwise(0)).alias("purchase_count_24h"),
    F.sum("amount").alias("total_spent_24h"),
    F.max("timestamp").alias("last_activity_timestamp")
)

# Add temporal columns: the feature timestamp and the partition date
user_features = user_features.withColumn("feature_timestamp", F.lit(execution_date))
user_features = user_features.withColumn("date", F.lit(execution_date))

# Write to S3 in Parquet format, partitioned by date
user_features.write.mode("overwrite").partitionBy("date").parquet(
    "s3://feature-store/user_activity_features/"
)

job.commit()
This pipeline demonstrates key patterns: reading partitioned data efficiently with predicate pushdown, computing aggregations that become feature values, and writing results to a partitioned, date-stamped layout. The resulting Parquet files form the offline feature store, queryable by training pipelines.
Orchestrating Feature Pipelines
Feature computation pipelines require orchestration to handle dependencies, scheduling, and failure recovery. AWS Step Functions provides workflow orchestration for complex feature pipelines with multiple stages and dependencies. For simpler cases, AWS Glue workflows offer built-in orchestration tied directly to Glue jobs.
The orchestration layer manages feature freshness by scheduling regular updates. Daily batch features might run at 2 AM, ensuring fresh features are available for morning model training or batch inference. Dependencies between feature groups—for example, user features that depend on computed session features—are encoded in the orchestration workflow.
Error handling in feature pipelines must balance correctness with availability. A failed feature computation shouldn’t prevent the entire ML system from operating. Some pipelines implement graceful degradation, using slightly stale features when fresh computations fail. Others maintain multiple feature versions, falling back to the previous version on computation failures.
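One way to implement that fallback, sketched below, is to check whether the expected date partition landed in S3 and otherwise fall back to the most recent earlier one; the bucket, prefix, and date=YYYY-MM-DD convention follow the earlier examples, and the helper name is illustrative.
import boto3

s3 = boto3.client("s3")

def resolve_feature_partition(bucket, prefix, preferred_date):
    """Return the preferred date partition if it exists, else the latest earlier one."""
    # List the date= partitions directly under the feature-group prefix
    paginator = s3.get_paginator("list_objects_v2")
    dates = set()
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        for cp in page.get("CommonPrefixes", []):
            dates.add(cp["Prefix"].rstrip("/").split("date=")[-1])
    if preferred_date in dates:
        return f"s3://{bucket}/{prefix}date={preferred_date}/"
    older = sorted(d for d in dates if d < preferred_date)
    if older:
        # Graceful degradation: serve slightly stale features rather than failing
        return f"s3://{bucket}/{prefix}date={older[-1]}/"
    raise RuntimeError("No usable feature partition found")

# Example: fall back if the 2024-11-28 computation has not completed
path = resolve_feature_partition("feature-store", "user_activity_features/", "2024-11-28")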
Building Online Feature Store with DynamoDB
Schema Design for Low-Latency Access
The online feature store must serve individual feature vectors within milliseconds to support real-time inference. DynamoDB’s single-digit millisecond latency and fully managed scalability make it ideal for this role. The schema design centers on entity IDs as partition keys, with feature values stored in item attributes.
A common pattern uses the entity ID (user ID, product ID, etc.) as the partition key and optionally a feature group identifier as the sort key, allowing each entity to have features from multiple feature groups stored efficiently. Feature values are stored as top-level attributes for fast access, with additional metadata like computation timestamps and versions stored alongside values.
DynamoDB’s item size limit of 400 KB accommodates most feature vectors, but very large feature sets may require splitting across multiple items or using compression. For features with many dimensions like embeddings, storing them as binary data reduces storage costs and improves retrieval speed.
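For example, an embedding can be packed into a single binary attribute rather than hundreds of numeric attributes; the sketch below assumes float32 NumPy vectors and reuses the feature-store-online table from the example that follows.
import boto3
import numpy as np

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("feature-store-online")

def write_embedding(user_id, vector, feature_group="user_embedding"):
    """Store a float32 embedding as a compact binary attribute."""
    table.put_item(Item={
        "entity_id": user_id,
        "feature_group": feature_group,
        "embedding": np.asarray(vector, dtype=np.float32).tobytes(),
        "dim": len(vector),
    })

def read_embedding(user_id, feature_group="user_embedding"):
    item = table.get_item(Key={"entity_id": user_id, "feature_group": feature_group}).get("Item")
    if item is None:
        return None
    # boto3 returns binary attributes as a Binary wrapper; .value yields raw bytes
    return np.frombuffer(item["embedding"].value, dtype=np.float32)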
Here’s a DynamoDB schema design for user features:
import boto3
from datetime import datetime
from decimal import Decimal

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('feature-store-online')

def write_user_features(user_id, features, feature_group='user_activity'):
    """Write feature vector to online store"""
    item = {
        'entity_id': user_id,
        'feature_group': feature_group,
        'event_count_24h': Decimal(str(features['event_count_24h'])),
        'session_count_24h': Decimal(str(features['session_count_24h'])),
        'purchase_count_24h': Decimal(str(features['purchase_count_24h'])),
        'total_spent_24h': Decimal(str(features['total_spent_24h'])),
        'last_activity_timestamp': features['last_activity_timestamp'],
        'updated_at': datetime.utcnow().isoformat(),
        'ttl': int(datetime.utcnow().timestamp() + 86400 * 7)  # 7 day TTL
    }
    table.put_item(Item=item)

def get_user_features(user_id, feature_group='user_activity'):
    """Retrieve feature vector from online store"""
    response = table.get_item(
        Key={
            'entity_id': user_id,
            'feature_group': feature_group
        }
    )
    if 'Item' in response:
        item = response['Item']
        # Convert DynamoDB types back to standard Python types
        features = {
            'event_count_24h': float(item['event_count_24h']),
            'session_count_24h': float(item['session_count_24h']),
            'purchase_count_24h': float(item['purchase_count_24h']),
            'total_spent_24h': float(item['total_spent_24h']),
            'last_activity_timestamp': item['last_activity_timestamp']
        }
        return features
    return None
This implementation shows the read and write patterns for online features, handling DynamoDB’s Decimal type requirements and including TTL for automatic cleanup of stale features.
Synchronizing Offline and Online Stores
Keeping offline and online stores synchronized ensures consistency between training and inference. The most common pattern computes features in batch (writing to S3) and then materializes them to the online store (writing to DynamoDB). This unidirectional flow from offline to online keeps the offline store as the source of truth.
AWS Lambda functions triggered by S3 events provide a serverless mechanism for materialization. When Glue jobs write new feature data to S3, S3 event notifications trigger Lambda functions that read the data and write it to DynamoDB. This approach scales automatically and keeps synchronization logic separate from feature computation.
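A minimal sketch of such a materialization function is shown below. It assumes the Lambda runs with the AWS SDK for pandas (awswrangler) layer, that feature files carry the columns written by the Glue job above, and that user_id maps to the entity_id key; treat the names as illustrative.
import boto3
import awswrangler as wr
from decimal import Decimal
from urllib.parse import unquote_plus

table = boto3.resource("dynamodb").Table("feature-store-online")

def handler(event, context):
    """Triggered by s3:ObjectCreated events on new feature Parquet files."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        df = wr.s3.read_parquet(path=f"s3://{bucket}/{key}")

        # Batch-write one item per entity; batch_writer handles chunking and retries
        with table.batch_writer(overwrite_by_pkeys=["entity_id", "feature_group"]) as batch:
            for row in df.to_dict(orient="records"):
                item = {"entity_id": row.pop("user_id"), "feature_group": "user_activity"}
                for name, value in row.items():
                    # DynamoDB requires Decimal instead of float
                    item[name] = Decimal(str(value)) if isinstance(value, float) else value
                batch.put_item(Item=item)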
For high-throughput scenarios, using Amazon Kinesis Data Streams as an intermediary provides better control over materialization throughput. Feature computations write to both S3 and Kinesis, with Lambda functions consuming the stream to populate DynamoDB and Kinesis Data Firehose optionally delivering the same events to S3 for the offline copy. This pattern also enables real-time feature updates for streaming features.
Handling Real-Time Feature Updates
Some features require real-time updates based on streaming events rather than batch computations. User session features, real-time counters, and event-based features benefit from immediate updates. Amazon Kinesis Data Streams ingests events, with Lambda functions performing lightweight transformations and updating DynamoDB directly.
Real-time feature computation must be lightweight to maintain low latency. Complex aggregations or joins are typically pre-computed in batch, with real-time updates handling simple increments, replacements, or windowed counters. For example, a “page views in last hour” feature might be computed in real-time by incrementing a counter, while “page views in last 90 days” is batch-computed nightly.
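A sketch of that increment pattern is below: a Lambda consuming the Kinesis stream bumps a per-user counter with an atomic ADD expression. The event payload shape and attribute names (page_views_1h, user_realtime) are assumptions.
import base64
import json
import boto3

table = boto3.resource("dynamodb").Table("feature-store-online")

def handler(event, context):
    """Consume page-view events from Kinesis and increment a rolling counter."""
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        table.update_item(
            Key={"entity_id": payload["user_id"], "feature_group": "user_realtime"},
            # ADD is atomic and creates the counter attribute if it does not exist yet
            UpdateExpression="ADD page_views_1h :inc SET last_event_at = :ts",
            ExpressionAttributeValues={":inc": 1, ":ts": payload["timestamp"]},
        )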
Combining batch and real-time features in the same feature vector requires careful timestamp management. Features should include metadata indicating freshness, allowing inference systems to detect stale features. Some implementations maintain separate DynamoDB tables for batch and real-time features, joining them at query time based on feature requirements.
Integrating with SageMaker for Training
Feature Retrieval for Training
SageMaker training jobs need efficient access to features from the offline store. The training pipeline specifies required features and a time range, and the feature store returns point-in-time correct feature values for each training example. This point-in-time correctness prevents data leakage by ensuring features reflect only information available at prediction time.
Amazon SageMaker Feature Store provides native integration for this pattern, but custom implementations can achieve similar results using SageMaker Processing jobs that read from S3, perform temporal joins, and output training datasets. The processing job implements the point-in-time join logic, merging label data with features based on timestamps.
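A minimal PySpark sketch of that point-in-time join is shown below: for each labeled observation it keeps only feature rows computed at or before the observation timestamp, then selects the most recent one. Paths and column names (observation_ts, label) are illustrative.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("point-in-time-join").getOrCreate()

labels = spark.read.parquet("s3://feature-store/labels/")  # user_id, observation_ts, label
features = spark.read.parquet("s3://feature-store/user_activity_features/")  # user_id, feature_timestamp, ...

# Keep only feature rows that existed at or before each observation time,
# then pick the latest such row per observation to avoid leaking future data
joined = labels.join(features, on="user_id", how="left") \
    .where(F.col("feature_timestamp") <= F.col("observation_ts"))

w = Window.partitionBy("user_id", "observation_ts").orderBy(F.col("feature_timestamp").desc())
training_set = (
    joined.withColumn("rank", F.row_number().over(w))
    .where(F.col("rank") == 1)
    .drop("rank")
)
training_set.write.mode("overwrite").parquet("s3://feature-store/training_sets/churn_v1/")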
Training dataset generation becomes a repeatable process with versioned features. Data scientists specify feature lists and observation periods, and the system generates consistent training data. This repeatability is crucial for model comparison and debugging—using identical feature versions eliminates features as a variable when evaluating model architectures or hyperparameters.
Feature Transformation in Training Pipelines
Not all feature engineering happens in the feature store. Some transformations are model-specific, such as normalization, one-hot encoding, or embedding lookups. SageMaker Processing jobs provide a natural place for these transformations, sitting between feature retrieval and model training.
The separation between feature store and model-specific transforms follows a principle: the feature store computes features usable across many models, while training pipelines apply model-specific preprocessing. This division allows feature reuse while giving data scientists flexibility to experiment with different preprocessing strategies.
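As an illustration of where that boundary falls, a processing script might apply model-specific scaling and encoding on top of features retrieved from the store; the scikit-learn sketch below assumes a training set with the activity features from earlier plus a hypothetical categorical column.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Features as retrieved from the offline store (already computed and reusable)
train_df = pd.read_parquet("/opt/ml/processing/input/training_set.parquet")

numeric = ["event_count_24h", "session_count_24h", "purchase_count_24h", "total_spent_24h"]
categorical = ["region"]  # hypothetical model-specific categorical feature

# Model-specific preprocessing lives in the training pipeline, not the feature store
preprocess = ColumnTransformer([
    ("scale", StandardScaler(), numeric),
    ("encode", OneHotEncoder(handle_unknown="ignore"), categorical),
])

X = preprocess.fit_transform(train_df[numeric + categorical])
y = train_df["label"].values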
SageMaker Pipelines orchestrates the end-to-end workflow from feature retrieval through training to deployment. A typical pipeline includes steps for feature retrieval, data validation, preprocessing, training, and model registration. This automation ensures consistent, repeatable model development that leverages centralized features.
Implementing Feature Monitoring and Validation
Feature Quality Metrics
Monitoring feature quality prevents subtle bugs from degrading model performance. Basic statistics like mean, median, standard deviation, and percentiles establish baselines for expected feature distributions. Comparing incoming features against these baselines detects drift or anomalies that might indicate upstream data issues.
AWS Glue DataBrew provides a visual interface for profiling features, computing descriptive statistics, and identifying data quality issues. For automated monitoring, custom Lambda functions or Glue jobs can compute statistics on new feature batches and compare them against historical profiles stored in CloudWatch or a time-series database.
Missing value rates, uniqueness, and type consistency are critical quality metrics. A feature that suddenly has 50% missing values likely indicates a data pipeline failure. Similarly, unexpected data types or out-of-range values suggest bugs in feature computation logic that require investigation.
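A simple automated check along these lines might profile each new feature batch and compare it to a stored baseline; the statistics, thresholds, and baseline format below are illustrative.
import pandas as pd

def profile_features(df, columns):
    """Compute basic quality statistics for a feature batch."""
    stats = {}
    for col in columns:
        stats[col] = {
            "missing_rate": float(df[col].isna().mean()),
            "mean": float(df[col].mean()),
            "std": float(df[col].std()),
            "p50": float(df[col].quantile(0.5)),
            "p99": float(df[col].quantile(0.99)),
        }
    return stats

def check_against_baseline(stats, baseline, max_missing=0.05, max_mean_shift=0.25):
    """Return a list of human-readable anomalies relative to the baseline profile."""
    issues = []
    for col, s in stats.items():
        if s["missing_rate"] > max_missing:
            issues.append(f"{col}: missing rate {s['missing_rate']:.1%} exceeds {max_missing:.0%}")
        base_mean = baseline[col]["mean"]
        if base_mean and abs(s["mean"] - base_mean) / abs(base_mean) > max_mean_shift:
            issues.append(f"{col}: mean shifted more than {max_mean_shift:.0%} from baseline")
    return issues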
Data Validation and Testing
Implementing data validation prevents bad features from reaching models. Great Expectations, an open-source Python library, integrates well with AWS infrastructure to define expectations about feature data—expected ranges, missing value thresholds, categorical value sets—and validate each batch.
Validation runs at multiple points in the pipeline: after raw data ingestion, after feature computation, and before writing to online stores. Failed validations can halt pipelines or trigger alerts, depending on severity. Some validations are blocking (wrong data types), while others are warning-level (unusual distribution shift).
Testing feature computation logic requires sample inputs with known outputs. Unit tests verify transformation functions produce expected results for edge cases. Integration tests run complete feature pipelines on test data and validate outputs against fixtures. This testing discipline, borrowed from software engineering, dramatically improves feature reliability.
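For instance, a unit test for a hypothetical pandas helper that mirrors the purchase-count aggregation might look like the following pytest sketch.
import pandas as pd

def compute_purchase_count(events: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transform: purchases per user, mirroring the Glue aggregation."""
    purchases = events[events["event_type"] == "purchase"]
    counts = purchases.groupby("user_id").size().rename("purchase_count_24h").reset_index()
    users = events[["user_id"]].drop_duplicates()
    return users.merge(counts, on="user_id", how="left").fillna(0)

def test_purchase_count_handles_users_with_no_purchases():
    events = pd.DataFrame({
        "user_id": ["u1", "u1", "u2"],
        "event_type": ["purchase", "view", "view"],
    })
    result = compute_purchase_count(events).set_index("user_id")
    assert result.loc["u1", "purchase_count_24h"] == 1
    assert result.loc["u2", "purchase_count_24h"] == 0  # edge case: no purchases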
Drift Detection and Alerting
Feature drift—changes in feature distributions over time—can degrade model performance even when the model itself hasn’t changed. Monitoring drift helps teams understand when model retraining is necessary. Statistical tests like the Kolmogorov-Smirnov test compare recent feature distributions against training distributions, quantifying drift.
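A sketch of that check using SciPy, publishing the result as a CloudWatch metric for alerting, is below; the namespace, metric name, and 0.1 threshold are assumptions to calibrate against your own data.
import boto3
from scipy.stats import ks_2samp

cloudwatch = boto3.client("cloudwatch")

def report_feature_drift(feature_name, training_values, recent_values, threshold=0.1):
    """Compare recent values to the training distribution and publish a drift metric."""
    result = ks_2samp(training_values, recent_values)

    cloudwatch.put_metric_data(
        Namespace="FeatureStore/Drift",
        MetricData=[{
            "MetricName": "ks_statistic",
            "Dimensions": [{"Name": "feature", "Value": feature_name}],
            "Value": float(result.statistic),
        }],
    )
    # CloudWatch alarms on this metric (or this return value) decide when to retrain
    return result.statistic > threshold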
Amazon SageMaker Model Monitor provides built-in drift detection for models deployed on SageMaker endpoints, comparing inference inputs against training data baselines. For custom monitoring, CloudWatch metrics combined with Lambda functions enable flexible alerting on feature statistics.
Alert thresholds should balance sensitivity with actionability. Too-sensitive alerts create fatigue, while too-lax thresholds miss significant issues. Starting with broad thresholds and tightening based on observed false positive rates helps calibrate monitoring systems.
Access Control and Governance
IAM Policies for Feature Access
Proper access control ensures teams can access features they need while preventing unauthorized access to sensitive data. AWS IAM policies control access to S3 buckets storing offline features and DynamoDB tables storing online features. Policies should follow least-privilege principles, granting read access broadly but limiting write access to automated pipelines.
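A minimal sketch of such a policy, attached inline with boto3, grants one consumer role read-only access to a feature group's S3 prefix and the online DynamoDB table; the role, bucket, table, and account identifiers are placeholders.
import json
import boto3

iam = boto3.client("iam")

read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::feature-store",
                "arn:aws:s3:::feature-store/user_activity_features/*",
            ],
        },
        {
            "Effect": "Allow",
            "Action": ["dynamodb:GetItem", "dynamodb:BatchGetItem", "dynamodb:Query"],
            "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/feature-store-online",
        },
    ],
}

iam.put_role_policy(
    RoleName="ml-training-role",  # hypothetical consumer role
    PolicyName="feature-store-read-only",
    PolicyDocument=json.dumps(read_only_policy),
)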
Feature groups containing sensitive data like personally identifiable information require additional protections. S3 bucket policies can restrict access to specific IAM roles, while DynamoDB table policies can enforce similar restrictions. Encryption at rest and in transit protects sensitive feature data from unauthorized access.
Cross-account access patterns enable feature sharing across AWS accounts in large organizations. IAM roles with cross-account trust relationships allow training jobs in one account to access features stored in another, maintaining security while enabling collaboration.
Audit Logging and Compliance
Tracking feature access for compliance requires comprehensive logging. AWS CloudTrail logs S3 and DynamoDB API calls, providing audit trails of who accessed which features when. S3 access logging and DynamoDB streams offer additional visibility into data access patterns.
For regulated industries, feature stores must demonstrate data lineage, tracking features from raw data sources through transformations to model inputs. Custom solutions can build lineage graphs using metadata from the Glue Data Catalog and CloudWatch logs, while AWS Lake Formation centralizes governance of the underlying data.
Compliance requirements may mandate data retention policies or the ability to delete individual records. S3 lifecycle policies automate feature data deletion after retention periods expire. For DynamoDB, TTL fields automatically remove stale features, while targeted deletion using partition keys removes specific user data when required.
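A sketch of targeted deletion for a single user is below: it queries every feature-group item under the user's partition key and deletes them, assuming the entity_id/feature_group key schema used earlier.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("feature-store-online")

def delete_user_features(user_id):
    """Remove all online feature items for one user (e.g. for a deletion request)."""
    response = table.query(KeyConditionExpression=Key("entity_id").eq(user_id))
    with table.batch_writer() as batch:
        for item in response["Items"]:  # paginate if a user spans many feature groups
            batch.delete_item(Key={
                "entity_id": item["entity_id"],
                "feature_group": item["feature_group"],
            })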
Cost Optimization Strategies
Storage Cost Management
Feature stores can accumulate significant storage costs as feature history grows. S3 storage costs scale with data volume, making efficient storage critical. Using compression and columnar formats like Parquet reduces storage by 5-10x compared to uncompressed row formats. Transitioning historical features to S3 Glacier or S3 Intelligent-Tiering further reduces costs for infrequently accessed data.
DynamoDB costs depend on provisioned capacity or on-demand pricing. For predictable workloads, provisioned capacity with auto-scaling optimizes costs. For variable workloads, on-demand pricing eliminates capacity planning at a modest cost premium. Archiving cold features from DynamoDB to S3 after they’re no longer needed for real-time inference cuts online storage costs.
Monitoring feature usage helps identify unused features consuming resources. AWS CloudWatch metrics track S3 bucket access patterns and DynamoDB table usage. Features that haven’t been accessed in months are candidates for archival or deletion, reducing ongoing storage costs.
Compute Cost Optimization
Glue job costs scale with DPU-hours consumed during feature computation. Right-sizing Glue jobs—using appropriate numbers of workers for data volumes—prevents overprovisioning. Benchmarking jobs with different configurations identifies the cost-performance sweet spot. Using Glue’s auto-scaling capabilities adjusts worker counts based on data volume, optimizing costs dynamically.
Lambda function costs for online store materialization depend on execution time and memory allocation. Optimizing Lambda code for efficiency reduces costs—batching DynamoDB writes, using efficient serialization, and minimizing network calls. For very high throughput, evaluating Fargate or EC2 alternatives to Lambda may reduce costs.
Training integration costs depend on data movement between S3 and SageMaker. Using S3 Select to filter feature data before downloading reduces transfer costs and speeds up training data preparation. Caching frequently used feature sets in EBS volumes attached to SageMaker instances eliminates repeated S3 downloads.
Conclusion
Building a feature store on AWS transforms machine learning operations from ad-hoc feature engineering to systematic, reusable feature management. By leveraging S3 for offline storage, DynamoDB for online serving, Glue for batch computation, and Kinesis for streaming updates, teams can construct feature stores that scale from initial experiments to production systems serving millions of predictions daily. The architecture ensures consistency between training and inference while providing the flexibility to evolve features as models and business requirements change.
Success with feature stores requires balancing technical implementation with organizational adoption. The technical infrastructure must be reliable and performant, but equally important is establishing processes for feature documentation, quality standards, and governance. Teams that invest in both dimensions—building robust technical systems and cultivating a culture of feature reuse—unlock the full value of feature stores, accelerating model development and improving production reliability across their entire ML portfolio.