Data Engineering Basics for Machine Learning Projects

Data engineering forms the critical foundation of every successful machine learning project, yet it’s often underestimated by teams eager to jump into model development. The reality is that machine learning models are only as good as the data pipelines feeding them. Understanding data engineering basics can mean the difference between a model that thrives in production and one that fails despite impressive training metrics.

Understanding the Data Engineering Role in ML

Data engineering for machine learning differs significantly from traditional data engineering. While conventional data pipelines focus on business intelligence and reporting, ML data pipelines must handle unique challenges like feature consistency, training-serving skew, and data versioning.

The data engineer’s responsibility in ML projects extends beyond moving data from point A to point B. You need to ensure data quality, implement proper feature stores, maintain reproducibility, and create pipelines that can handle both batch training and real-time inference. This requires a mindset shift from building static ETL processes to constructing dynamic, versioned data systems that support experimentation and iteration.

ML projects typically fail not because of poor algorithms, but because of data problems—missing values in production that weren’t present in training, features calculated differently between training and serving, or data drift that degrades model performance over time. Data engineering addresses these challenges systematically.

Building Robust Data Collection Systems

The foundation of any ML project is reliable data collection. Raw data rarely arrives in a format ready for machine learning, and the collection system you build determines the quality of everything downstream.

Start by identifying all relevant data sources for your ML use case. These might include application databases, event streams, third-party APIs, user interaction logs, or sensor data. Each source has unique characteristics that affect how you collect and process it.

For structured data from databases, implement change data capture (CDC) rather than full table dumps. CDC tracks only changes, reducing load on source systems and providing a complete audit trail:

# Example using a CDC approach
import pandas as pd

def extract_changed_records(last_sync_time, database_connection):
    # Parameterized query avoids SQL injection and quoting issues
    # (placeholder style here assumes a DB-API driver such as psycopg2)
    query = """
    SELECT * FROM customer_data
    WHERE updated_at > %(last_sync)s
       OR created_at > %(last_sync)s
    """
    return pd.read_sql(query, database_connection,
                       params={'last_sync': last_sync_time})

For event-driven data like user clicks, application logs, or IoT sensor readings, implement streaming collection using message queues like Apache Kafka or cloud-native services like AWS Kinesis. This ensures you capture data in real-time without overwhelming your systems:

from datetime import datetime

from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

def log_user_event(user_id, event_type, event_data):
    event = {
        'user_id': user_id,
        'event_type': event_type,
        'timestamp': datetime.utcnow().isoformat(),
        'data': event_data
    }
    producer.send('user_events', event)

Implement schema validation at collection time. Catching data quality issues early prevents corrupted data from flowing through your entire pipeline:

from pydantic import BaseModel, validator
from datetime import datetime

class UserEvent(BaseModel):
    user_id: str
    event_type: str
    timestamp: datetime
    value: float
    
    @validator('value')
    def value_must_be_non_negative(cls, v):
        if v < 0:
            raise ValueError('value must be non-negative')
        return v

🔧 Essential Data Engineering Components for ML

  • 💾 Data Lake/Warehouse: centralized storage for raw and processed data with versioning
  • ⚙️ Feature Store: centralized repository for computed features shared across models
  • 🔄 Pipeline Orchestration: workflow management for scheduling and monitoring data flows
  • Data Quality Framework: automated validation and monitoring of data integrity

Data Storage Architecture for ML Workloads

Choosing the right storage architecture significantly impacts your ML project’s success. ML workloads have different requirements than traditional analytics—you need fast random access for training, efficient batch processing for feature engineering, and low-latency reads for inference.

The modern approach uses a layered architecture with distinct zones:

Raw Zone: Store data exactly as collected, immutable and append-only. This serves as your source of truth and enables reprocessing if transformations change. Use columnar formats like Parquet or ORC for efficient storage and query performance:

import pyarrow as pa
import pyarrow.parquet as pq

# Write raw data in partitioned Parquet format
# (assumes df contains year/month/day columns to partition on)
table = pa.Table.from_pandas(df)
pq.write_to_dataset(
    table,
    root_path='s3://ml-data/raw/user_events',
    partition_cols=['year', 'month', 'day'],
    compression='snappy'
)

Processed Zone: Store cleaned, validated, and transformed data. This layer applies business logic, handles missing values, and creates derived fields. Data here is still relatively raw but ready for feature engineering.
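As a sketch of what a processed-zone job does (column names here are hypothetical, not from a real schema), the cleaning step might look like:

```python
import pandas as pd

def process_raw_events(raw: pd.DataFrame) -> pd.DataFrame:
    """Clean and standardize raw events for the processed zone.
    Column names are illustrative."""
    df = raw.copy()
    # Enforce types: unparseable timestamps become NaT rather than errors
    df['event_time'] = pd.to_datetime(df['event_time'], errors='coerce')
    # Drop records that failed basic parsing or lack an identity
    df = df.dropna(subset=['event_time', 'user_id'])
    # Handle missing values with an explicit, documented default
    df['amount'] = df['amount'].fillna(0.0)
    # Derived field: calendar date, useful for partitioning downstream
    df['event_date'] = df['event_time'].dt.date
    return df
```

The key property is that every cleaning decision (coercion, drop, default) is explicit in code, so it can be reviewed and rerun against the raw zone if the rules change.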

Feature Zone: Store computed features that directly feed models. This layer optimizes for ML consumption with features pre-calculated and stored in formats that enable fast retrieval during training and low-latency access during inference.

Implement partitioning strategies that align with your access patterns. For time-series data, partition by date. For user-based models, consider partitioning by user ID ranges:

# Partitioning strategy example (PySpark DataFrame API)
df.write.partitionBy('date', 'region').parquet(
    's3://ml-data/features/customer_features/'
)

Version your data explicitly. Machine learning requires reproducibility—you must be able to recreate the exact training dataset used for any model version:

# Data versioning approach
feature_version = 'v2.1'
output_path = f's3://ml-data/features/{feature_version}/customer_features/'

Feature Engineering Pipelines

Feature engineering transforms raw data into the inputs that machine learning models actually use. This is where data engineering and data science intersect most directly, and proper engineering discipline is essential.

Build feature pipelines that separate feature computation from feature storage. Compute features in batch jobs that run on schedules, then store results in a feature store accessible to both training and serving:

from datetime import datetime, timedelta
import pandas as pd

def compute_customer_features(customer_id, as_of_date):
    """
    Compute point-in-time correct features for a customer.
    Critical: Only use data available before as_of_date.
    """
    # Get transaction history up to as_of_date
    # (get_transactions and calculate_trend are assumed helpers)
    transactions = get_transactions(
        customer_id=customer_id,
        end_date=as_of_date
    )
    
    features = {
        'customer_id': customer_id,
        'as_of_date': as_of_date,
        # Aggregate features
        'total_purchases': len(transactions),
        'total_spend': transactions['amount'].sum(),
        'avg_purchase_amount': transactions['amount'].mean(),
        # Time-based features
        'days_since_first_purchase': (
            as_of_date - transactions['date'].min()
        ).days,
        'days_since_last_purchase': (
            as_of_date - transactions['date'].max()
        ).days,
        # Behavioral features
        'purchase_frequency': len(transactions) / max(
            (as_of_date - transactions['date'].min()).days, 1
        ),
        'spending_trend': calculate_trend(transactions),
    }
    
    return features

Implement point-in-time correctness rigorously. Training data leakage—where future information accidentally influences training—is one of the most common and insidious bugs in ML pipelines. Your feature pipeline must ensure that features calculated for any historical date only use data available before that date.
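A minimal sketch of that guard (column name assumed): route every feature computation's inputs through a filter that both restricts to past data and fails loudly if a future row somehow survives.

```python
import pandas as pd

def point_in_time_filter(transactions: pd.DataFrame, as_of_date) -> pd.DataFrame:
    """Return only transactions strictly before as_of_date.

    Feature pipelines should pull their inputs through this one
    chokepoint so leakage can't sneak in via an unguarded query.
    """
    as_of = pd.Timestamp(as_of_date)
    past = transactions[transactions['date'] < as_of]
    # Defensive check: fail loudly rather than train on the future
    assert (past['date'] < as_of).all(), 'future rows leaked past filter'
    return past
```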

Create reusable feature transformations that work identically in both training and serving contexts. This prevents training-serving skew:

from sklearn.preprocessing import StandardScaler, LabelEncoder

class FeatureTransformer:
    """Ensures consistent transformations in training and serving."""
    
    def __init__(self):
        self.scalers = {}
        self.encoders = {}
    
    def fit_transform(self, df, numeric_cols, categorical_cols):
        """Fit transformers on training data and transform."""
        result = df.copy()
        
        # Numeric features
        for col in numeric_cols:
            scaler = StandardScaler()
            result[f'{col}_scaled'] = scaler.fit_transform(
                df[[col]]
            )
            self.scalers[col] = scaler
        
        # Categorical features
        for col in categorical_cols:
            encoder = LabelEncoder()
            result[f'{col}_encoded'] = encoder.fit_transform(
                df[col]
            )
            self.encoders[col] = encoder
        
        return result
    
    def transform(self, df, numeric_cols, categorical_cols):
        """Transform new data using fitted transformers."""
        result = df.copy()
        
        for col in numeric_cols:
            result[f'{col}_scaled'] = self.scalers[col].transform(
                df[[col]]
            )
        
        for col in categorical_cols:
            result[f'{col}_encoded'] = self.encoders[col].transform(
                df[col]
            )
        
        return result

Data Quality and Validation

Data quality issues silently destroy ML models. Implementing systematic data quality checks throughout your pipeline is non-negotiable for production ML systems.

Define expectations explicitly for every dataset. Use libraries like Great Expectations to codify these rules:

import great_expectations as ge

# Load data with expectations
df = ge.read_csv('customer_data.csv')

# Define expectations
df.expect_column_values_to_not_be_null('customer_id')
df.expect_column_values_to_be_unique('customer_id')
df.expect_column_values_to_be_between('age', min_value=0, max_value=120)
df.expect_column_values_to_be_in_set(
    'country', 
    ['US', 'UK', 'CA', 'AU', 'DE']
)
df.expect_column_mean_to_be_between('purchase_amount', 10, 1000)

# Validate and get results
validation_result = df.validate()

Implement data quality checks at multiple pipeline stages:

  • Collection time: Validate schema and basic constraints
  • Processing time: Check for data quality issues like missing values, outliers, or distribution shifts
  • Feature engineering time: Ensure feature values fall within expected ranges
  • Model serving time: Validate input features match training expectations
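The serving-time check in particular can be dependency-free. A sketch (the schema format here is an assumption, not a standard): record each feature's expected type and range at training time, then reject or flag inference payloads that violate them.

```python
def validate_serving_input(features: dict, expected_schema: dict) -> list:
    """Check an inference payload against training-time expectations.

    expected_schema maps feature name -> (type, (min, max) or None).
    Returns a list of violation messages; empty means valid.
    """
    errors = []
    for name, (ftype, bounds) in expected_schema.items():
        if name not in features:
            errors.append(f'missing feature: {name}')
            continue
        value = features[name]
        if not isinstance(value, ftype):
            errors.append(
                f'{name}: expected {ftype.__name__}, '
                f'got {type(value).__name__}'
            )
            continue
        if bounds is not None:
            lo, hi = bounds
            if not (lo <= value <= hi):
                errors.append(f'{name}: {value} outside [{lo}, {hi}]')
    return errors
```

Returning violations rather than raising lets the serving layer decide whether to reject the request, fall back to a default prediction, or just emit a metric.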

Monitor data drift continuously. Distribution shifts in your input data over time can degrade model performance even if the model itself hasn’t changed:

from scipy import stats

def detect_distribution_shift(training_data, production_data, column):
    """
    Detect if production data distribution differs significantly
    from training data using Kolmogorov-Smirnov test.
    """
    statistic, p_value = stats.ks_2samp(
        training_data[column],
        production_data[column]
    )
    
    if p_value < 0.05:
        return {
            'drift_detected': True,
            'statistic': statistic,
            'p_value': p_value,
            'column': column
        }
    return {'drift_detected': False}

⚠️ Common Data Engineering Pitfalls in ML Projects

Data Leakage

Using future information in training features. Always ensure point-in-time correctness by only using data available before prediction time.

Training-Serving Skew

Computing features differently in training versus production. Use identical code paths and transformations for both contexts.

Ignoring Data Versioning

Not tracking which data version trained which model. Implement explicit versioning for reproducibility and debugging.

Inadequate Monitoring

Failing to track data quality metrics and distribution shifts. Set up continuous monitoring with alerts for anomalies.

Tightly Coupled Pipelines

Creating monolithic pipelines that are difficult to debug and maintain. Design modular, testable components with clear interfaces.
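To illustrate that last point, one way (a sketch, not a prescribed interface) is to express each pipeline stage as a pure function over plain records, then compose them. Each stage can then be unit-tested in isolation, and a failed run can be debugged one stage at a time.

```python
def clean(records: list[dict]) -> list[dict]:
    """Stage 1: drop records missing required keys."""
    return [r for r in records if 'user_id' in r and 'amount' in r]

def enrich(records: list[dict]) -> list[dict]:
    """Stage 2: add a derived field; pure, so trivially testable."""
    return [{**r, 'is_large': r['amount'] > 100} for r in records]

def run_pipeline(records, stages=(clean, enrich)):
    """Compose stages in order; swapping or inserting one is local."""
    for stage in stages:
        records = stage(records)
    return records
```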

Pipeline Orchestration and Automation

Successful ML projects require sophisticated orchestration that coordinates data collection, processing, feature engineering, model training, and deployment. Manual execution doesn’t scale and introduces human error.

Modern orchestration tools like Apache Airflow, Prefect, or Dagster allow you to define workflows as directed acyclic graphs (DAGs) where tasks execute based on dependencies:

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'ml-team',
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
}

with DAG(
    'ml_feature_pipeline',
    default_args=default_args,
    schedule_interval='@daily',
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    
    extract_data = PythonOperator(
        task_id='extract_raw_data',
        python_callable=extract_raw_data_func,
    )
    
    validate_data = PythonOperator(
        task_id='validate_data_quality',
        python_callable=validate_data_func,
    )
    
    compute_features = PythonOperator(
        task_id='compute_features',
        python_callable=compute_features_func,
    )
    
    store_features = PythonOperator(
        task_id='store_to_feature_store',
        python_callable=store_features_func,
    )
    
    # Define dependencies
    extract_data >> validate_data >> compute_features >> store_features

Implement proper error handling and retry logic. Data pipelines fail—networks timeout, APIs rate limit, and data sources have temporary issues. Build resilience into your pipelines:

import requests
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10)
)
def fetch_external_data(api_endpoint):
    """Fetch data with automatic retries on failure."""
    response = requests.get(api_endpoint)
    response.raise_for_status()
    return response.json()

Create idempotent pipeline tasks. Running a task multiple times should produce the same result, making it safe to retry failed jobs without corrupting data:

def process_daily_data(date):
    """
    Idempotent processing - can safely rerun for any date.
    Overwrites previous output if it exists.
    """
    output_path = f's3://data/processed/{date}/data.parquet'
    
    # Read raw data
    raw_data = read_raw_data(date)
    
    # Process
    processed_data = transform_data(raw_data)
    
    # Write to a fixed, date-keyed path; rerunning simply overwrites it
    processed_data.to_parquet(output_path)

Managing Training and Serving Data

The distinction between training data and serving data is critical. Training happens on historical data in batch, while serving happens on current data in real-time. Your data engineering must support both contexts.

For batch training, create training datasets that sample from your feature store at specific points in time. Include metadata about the dataset version, feature versions, and filtering criteria:

from datetime import datetime

def create_training_dataset(start_date, end_date, feature_version):
    """
    Create a versioned training dataset.
    """
    features = load_features(
        feature_version=feature_version,
        start_date=start_date,
        end_date=end_date
    )
    
    labels = load_labels(
        start_date=start_date,
        end_date=end_date
    )
    
    # Merge features and labels
    training_data = features.merge(labels, on=['customer_id', 'date'])
    
    # Save with metadata
    metadata = {
        'created_at': datetime.now().isoformat(),
        'start_date': start_date.isoformat(),
        'end_date': end_date.isoformat(),
        'feature_version': feature_version,
        'num_records': len(training_data),
        'feature_columns': list(features.columns),
    }
    
    save_training_dataset(training_data, metadata)
    return training_data

For real-time serving, implement a feature serving layer that provides low-latency access to pre-computed features. This might use in-memory databases like Redis for millisecond response times:

import redis
import json
from datetime import timedelta

class FeatureServingStore:
    def __init__(self, redis_host='localhost', redis_port=6379):
        self.redis_client = redis.Redis(
            host=redis_host,
            port=redis_port,
            decode_responses=True
        )
    
    def store_features(self, entity_id, features):
        """Store features for real-time serving."""
        key = f'features:{entity_id}'
        self.redis_client.setex(
            key,
            timedelta(hours=24),  # TTL
            json.dumps(features)
        )
    
    def get_features(self, entity_id):
        """Retrieve features for inference."""
        key = f'features:{entity_id}'
        feature_json = self.redis_client.get(key)
        if feature_json:
            return json.loads(feature_json)
        return None

Ensure feature consistency between training and serving by using the same feature computation code in both contexts. Consider using a feature store framework like Feast or Tecton that handles this automatically.
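One lightweight way to get this consistency without a framework (a sketch; the transaction shape and function name are assumptions) is to keep each feature's computation in a single shared function that both the batch training job and the online serving path import:

```python
def purchase_features(transactions: list[dict]) -> dict:
    """Single source of truth for these features.

    Called by the batch training pipeline over historical data and
    by the serving path over recent data, so the definitions cannot
    drift apart.
    """
    amounts = [t['amount'] for t in transactions]
    n = len(amounts)
    return {
        'purchase_count': n,
        'total_spend': sum(amounts),
        'avg_spend': (sum(amounts) / n) if n else 0.0,
    }
```

Frameworks like Feast formalize this pattern; the essential discipline is simply that a feature's definition lives in exactly one place.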

Conclusion

Data engineering forms the backbone of successful machine learning projects, determining whether models can move from promising experiments to reliable production systems. By focusing on robust data collection, proper storage architecture, rigorous feature engineering, systematic quality validation, and thoughtful orchestration, you create the foundation for ML systems that work reliably at scale.

The basics covered here—point-in-time correctness, training-serving consistency, data versioning, quality monitoring, and pipeline automation—are not optional niceties but essential practices that separate successful ML projects from failed ones. Investing in solid data engineering upfront saves countless hours of debugging mysterious model failures and enables your team to iterate quickly and deploy confidently.
