Setting Up a Reproducible ML Dev Environment

“It works on my machine” is the death knell of collaborative machine learning projects. A model that trains perfectly on your laptop fails mysteriously on a colleague’s workstation. Results you achieved last month become impossible to replicate this week. Production deployment requires weeks of debugging environment differences. These scenarios repeat endlessly in ML teams lacking reproducible development environments, wasting countless hours on issues that have nothing to do with actual machine learning.

Reproducibility in ML environments goes far beyond traditional software development concerns. ML projects depend on specific versions of deep learning frameworks, CUDA libraries, system dependencies, random seeds, and data processing pipelines that interact in complex ways. A tiny version mismatch in NumPy can change numerical results. Different CUDA versions produce different model behaviors. Random seed handling varies across frameworks. Building truly reproducible ML environments requires understanding these dependencies and implementing systems that capture and recreate them reliably across machines, time, and team members.

Why ML Reproducibility Is Uniquely Challenging

Machine learning environments face complexity that general software development doesn’t encounter, making reproducibility harder to achieve.

The Dependency Stack Depth

ML projects have unusually deep dependency stacks spanning multiple layers that must align precisely.

  • System level: Operating system, CUDA drivers, cuDNN libraries, system Python
  • Computation level: TensorFlow/PyTorch, NumPy, SciPy, pandas
  • ML level: scikit-learn, Hugging Face, specific model libraries
  • Project level: Your custom code and project-specific packages

Each layer can break reproducibility. PyTorch 2.0 built against CUDA 11.8 behaves differently from PyTorch 2.0 built against CUDA 12.1. The same PyTorch version paired with different NumPy versions can produce different results. TensorFlow’s behavior has shifted between minor releases when its random number generation was modified.

Example failure: Your colleague installs PyTorch, which automatically pulls NumPy 1.26. Your environment has NumPy 1.24. Identical code produces different training curves because NumPy’s random number generation changed between versions, affecting data shuffling and initialization.
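
A cheap guard against this class of surprise is to log the versions that are actually loaded at runtime and store them alongside each experiment. A minimal sketch (the helper name and the exact set of packages recorded are illustrative, not prescribed):

import sys
import numpy as np
import torch

def log_library_versions():
    """Print interpreter and key library versions so runs can be compared later."""
    print(f"Python : {sys.version.split()[0]}")
    print(f"NumPy  : {np.__version__}")
    print(f"PyTorch: {torch.__version__}")

log_library_versions()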

Non-Determinism in ML Training

ML training involves sources of randomness that must be controlled for reproducibility:

  • Weight initialization
  • Data shuffling and batching
  • Dropout masks
  • Augmentation transformations
  • GPU operation ordering

Each framework handles randomness differently. Setting random.seed(42) doesn’t guarantee reproducible PyTorch training—you need torch.manual_seed(42), np.random.seed(42), and CUDA determinism flags. TensorFlow has its own seeding mechanisms. Failure to set all random seeds appropriately causes irreproducible results even in identical environments.

Hardware-Specific Behaviors

GPU hardware affects results in ways that shock developers coming from traditional software.

Different GPU architectures (A100 vs RTX 4090) can produce slightly different numerical results for identical operations due to different floating-point implementations. CUDA version changes modify operation implementations, causing different results. Multi-GPU training introduces additional non-determinism from parallel operations.

This means: Perfect numerical reproducibility across all hardware is sometimes impossible. But you can achieve practical reproducibility—same results on the same hardware, predictable behaviors, and understanding where variance comes from.
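
Because some of this variance is tied to the physical device, it helps to record which hardware every result came from. A small sketch along those lines, assuming PyTorch and a single visible GPU (the function name is illustrative):

import torch

def log_gpu_fingerprint():
    """Record the GPU and CUDA stack a run executed on."""
    if not torch.cuda.is_available():
        print("No CUDA device visible; running on CPU")
        return
    props = torch.cuda.get_device_properties(0)
    print(f"GPU                : {props.name}")
    print(f"Compute capability : {props.major}.{props.minor}")
    print(f"GPU memory         : {props.total_memory / 1024**3:.1f} GiB")
    print(f"CUDA (torch build) : {torch.version.cuda}")
    print(f"cuDNN              : {torch.backends.cudnn.version()}")

log_gpu_fingerprint()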

Dependency Management Approaches

Capturing and reproducing the exact environment requires thoughtful dependency management strategy.

Requirements Files: The Baseline

Requirements.txt is the simplest approach but has significant limitations for ML reproducibility.

Basic requirements.txt:

torch==2.1.0
transformers==4.35.0
numpy==1.24.0
scikit-learn==1.3.2

Advantages:

  • Simple and universally understood
  • Works with standard pip
  • Easy to version control

Limitations:

  • Doesn’t capture transitive dependencies precisely
  • No guarantee of compatible versions without testing
  • Doesn’t specify Python version
  • Doesn’t handle system dependencies
  • Platform-specific differences not captured

Making requirements.txt more robust:

# python_version: 3.10
# Platform: linux
--extra-index-url https://download.pytorch.org/whl/cu118

torch==2.1.0+cu118
numpy==1.24.0
transformers==4.35.0
scikit-learn==1.3.2

# For a fully locked file, generate with: pip freeze > requirements.txt
# (captures exact versions of every installed package, including transitive deps)

Best practice: Use pip freeze to capture exact versions of everything installed, including transitive dependencies. This ensures version consistency.
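
If you also want that snapshot from inside a running training script (for example, to save next to each run’s metrics), roughly the same information is available through the standard library. A minimal sketch; the function name and output path are placeholders:

from importlib import metadata

def freeze_environment(path="requirements_frozen.txt"):
    """Write a pip-freeze-style snapshot of every installed package."""
    pins = sorted(
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in metadata.distributions()
    )
    with open(path, "w") as f:
        f.write("\n".join(pins) + "\n")

freeze_environment()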

Conda Environments: Comprehensive Management

Conda manages both Python packages and system dependencies, making it well-suited for ML environments.

Environment specification (environment.yml):

name: ml-project
channels:
  - pytorch
  - conda-forge
  - defaults
dependencies:
  - python=3.10
  - pytorch=2.1.0
  - torchvision=0.16.0
  - cudatoolkit=11.8
  - numpy=1.24.0
  - pandas=2.1.0
  - scikit-learn=1.3.2
  - pip:
      - transformers==4.35.0

Key advantages:

  • Manages Python version explicitly
  • Handles CUDA and system libraries
  • Creates truly isolated environments
  • Cross-platform support (with caveats)

Creating reproducible Conda environments:

# Create environment from specification
conda env create -f environment.yml

# Export exact environment (includes all dependencies)
conda env export > environment_exact.yml

# Export only the packages you explicitly requested (note: omits pip-installed packages)
conda env export --from-history > environment.yml

The --from-history flag exports only packages you explicitly requested, improving readability but losing transitive dependency locking. The full export captures everything but becomes verbose.

Poetry and Modern Dependency Management

Poetry provides deterministic dependency resolution with lock files, similar to modern JavaScript (npm) or Rust (cargo) workflows.

pyproject.toml for ML project:

[tool.poetry]
name = "ml-project"
version = "0.1.0"
description = "Reproducible ML project"
authors = ["Your Name <you@example.com>"]

[tool.poetry.dependencies]
python = "^3.10"
torch = {version = "^2.1.0", source = "pytorch"}
transformers = "^4.35.0"
numpy = "^1.24.0"
scikit-learn = "^1.3.0"

[tool.poetry.group.dev.dependencies]
pytest = "^7.4.0"
jupyter = "^1.0.0"

[[tool.poetry.source]]
name = "pytorch"
url = "https://download.pytorch.org/whl/cu118"
priority = "supplemental"

Poetry generates poetry.lock containing exact versions of all dependencies with cryptographic hashes. This guarantees reproducibility across installations.

Advantages:

  • Deterministic dependency resolution
  • Lock file ensures exact versions
  • Separate dev dependencies
  • Excellent dependency conflict resolution

Limitations:

  • Doesn’t manage system dependencies (CUDA, etc.)
  • Slower than pip for large ML packages
  • Learning curve for teams unfamiliar with it

Dependency Management Comparison

requirements.txt
Simplicity: ⭐⭐⭐⭐⭐
Reproducibility: ⭐⭐
System deps: No
Learning curve: Low
Best for: Simple projects, quick prototypes

Conda
Simplicity: ⭐⭐⭐
Reproducibility: ⭐⭐⭐⭐
System deps: Yes (CUDA toolkit, compilers, system libraries)
Learning curve: Medium
Best for: GPU projects, team environments

Poetry
Simplicity: ⭐⭐⭐
Reproducibility: ⭐⭐⭐⭐⭐
System deps: No
Learning curve: Medium
Best for: Production code, package development

Containerization for Complete Reproducibility

Containers capture the entire environment stack, providing the highest level of reproducibility.

Docker for ML Environments

Docker containers package everything from OS to application code, ensuring identical environments across machines.

ML-optimized Dockerfile:

# Start from NVIDIA CUDA base image
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04

# Set working directory
WORKDIR /app

# Install Python and system dependencies
RUN apt-get update && apt-get install -y \
    python3.10 \
    python3-pip \
    git \
    && rm -rf /var/lib/apt/lists/*

# Copy dependency specifications
COPY requirements.txt .

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy project code
COPY . .

# Set environment variables for reproducibility
ENV PYTHONHASHSEED=42
ENV CUBLAS_WORKSPACE_CONFIG=:4096:8

# Default command
CMD ["python3", "train.py"]

Key practices for ML Docker images:

  • Use official NVIDIA CUDA images for GPU support
  • Pin all versions explicitly (Ubuntu version, Python version, base image tag)
  • Copy requirements before code (layer caching optimization)
  • Set environment variables affecting randomness
  • Document build date and versions in labels

Multi-stage builds reduce image size:

# Builder stage: has the full CUDA toolchain needed to build wheels
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04 AS builder
RUN apt-get update && apt-get install -y python3.10 python3-pip && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
# Build wheels for all dependencies
RUN pip wheel --no-cache-dir --wheel-dir /wheels -r requirements.txt

# Runtime stage: only the (much smaller) runtime CUDA libraries
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y python3.10 python3-pip && rm -rf /var/lib/apt/lists/*
# Copy the prebuilt wheels from the builder and install them
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir /wheels/*.whl

Docker Compose for Complex Setups

When projects need multiple services (ML training, database, API server), Docker Compose manages them together.

docker-compose.yml for ML project:

version: '3.8'

services:
  trainer:
    build: .
    volumes:
      - ./data:/app/data
      - ./models:/app/models
      - ./logs:/app/logs
    environment:
      - CUDA_VISIBLE_DEVICES=0
      - PYTHONHASHSEED=42
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
  
  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.8.0
    ports:
      - "5000:5000"
    volumes:
      - ./mlruns:/mlflow
    command: mlflow server --host 0.0.0.0 --backend-store-uri /mlflow
  
  jupyter:
    build: .
    ports:
      - "8888:8888"
    volumes:
      - ./notebooks:/app/notebooks
      - ./data:/app/data
    command: jupyter lab --ip=0.0.0.0 --allow-root --no-browser

Benefits:

  • All services defined in one file
  • Consistent network between services
  • Shared volume mounts for data
  • Easy to start entire stack with docker-compose up

Randomness Control and Determinism

Controlling randomness sources is essential for reproducible ML results.

Comprehensive Seed Setting

Set seeds for all randomness sources used in your project:

import random
import numpy as np
import torch
import os

def set_seed(seed=42):
    """Set seeds for reproducibility."""
    # Python built-in random
    random.seed(seed)
    
    # NumPy
    np.random.seed(seed)
    
    # PyTorch
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # Multi-GPU
    
    # Hash seed (note: this only affects child processes; for the current process,
    # PYTHONHASHSEED must be set before the interpreter starts, e.g. in the Dockerfile)
    os.environ['PYTHONHASHSEED'] = str(seed)
    
    # Enforce deterministic behavior (may reduce performance)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    
    # For CUDA >= 10.2
    os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'
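    # Note: with use_deterministic_algorithms(True) below, any op that lacks a
    # deterministic implementation raises a RuntimeError instead of silently varying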
    torch.use_deterministic_algorithms(True)

# Call at the start of your script
set_seed(42)

TensorFlow equivalent:

import tensorflow as tf
import numpy as np
import random
import os

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    
    # For reproducible operations
    os.environ['TF_DETERMINISTIC_OPS'] = '1'
    os.environ['TF_CUDNN_DETERMINISTIC'] = '1'
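
Recent TensorFlow 2.x releases also bundle most of this into two convenience calls; availability depends on your TF version, so treat this as an alternative rather than a drop-in replacement:

import tensorflow as tf

# Seeds Python, NumPy, and TensorFlow RNGs in one call
tf.keras.utils.set_random_seed(42)

# Ask TF to use deterministic op implementations where they exist
tf.config.experimental.enable_op_determinism()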

Data Loading Determinism

DataLoaders introduce randomness through shuffling and worker processes:

import random
import numpy as np
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id):
    """Seed each DataLoader worker process for reproducibility."""
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

# Create deterministic DataLoader
g = torch.Generator()
g.manual_seed(42)

dataloader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,
    worker_init_fn=seed_worker,
    generator=g
)

Without proper seeding, multi-process data loading produces different batches across runs even with the same global seed.
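
A quick way to verify the setup is to build the loader twice with the same generator seed and compare the first batch. A sketch reusing dataset, seed_worker, and the imports from above, and assuming the dataset yields (input, target) tensor pairs:

def first_batch(seed):
    """Build a freshly seeded DataLoader and return its first batch."""
    g = torch.Generator()
    g.manual_seed(seed)
    loader = DataLoader(
        dataset,
        batch_size=32,
        shuffle=True,
        num_workers=4,
        worker_init_fn=seed_worker,
        generator=g,
    )
    return next(iter(loader))

# Both batches should match element for element
batch_a = first_batch(42)
batch_b = first_batch(42)
assert all(torch.equal(a, b) for a, b in zip(batch_a, batch_b))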

Version Control Integration

Code and environment specifications must be versioned together for true reproducibility.

What to Version Control

Essential files for version control:

  • All code (obviously)
  • requirements.txt / environment.yml / pyproject.toml
  • poetry.lock (if using Poetry)
  • Dockerfile and docker-compose.yml
  • Configuration files
  • Seed setting utilities
  • Documentation on environment setup

What NOT to version control:

  • Virtual environments (venv/, conda/)
  • Data files (use DVC or similar)
  • Model checkpoints (too large, use external storage)
  • Cache directories (__pycache__, .pytest_cache)
  • Local configuration overrides

.gitignore for ML projects:

# Python
__pycache__/
*.py[cod]
venv/
.env

# ML specific
*.pth
*.h5
*.ckpt
data/
logs/
mlruns/

# Jupyter
.ipynb_checkpoints/

# IDE
.vscode/
.idea/

Documentation Standards

README should include:

  1. Environment setup instructions
  2. Exact Python version required
  3. GPU requirements (if any)
  4. Step-by-step reproduction instructions
  5. Expected results with seeds

Example README section:

## Reproducible Setup

**Requirements:**
- Python 3.10
- CUDA 11.8 (for GPU)
- 16GB RAM minimum

**Setup:**
```bash
# Create conda environment
conda env create -f environment.yml
conda activate ml-project

# Or use Docker
docker build -t ml-project .
docker run --gpus all ml-project
```

**Training:**
```bash
python train.py --seed 42 --config configs/baseline.yaml
```

**Expected Results (seed=42):**
- Epoch 10: Train Loss 0.342, Val Loss 0.389
- Final Test Accuracy: 87.3%

Configuration Management

Separate code from configuration to enable reproducible experiments with different parameters.

Configuration Files

YAML configuration for ML experiments:

# config/baseline.yaml
seed: 42
experiment_name: "baseline_run"

model:
  architecture: "resnet18"
  pretrained: true
  num_classes: 10

training:
  batch_size: 32
  epochs: 50
  learning_rate: 0.001
  optimizer: "adam"
  weight_decay: 0.0001

data:
  dataset: "cifar10"
  train_split: 0.8
  val_split: 0.1
  test_split: 0.1
  augmentation: true

paths:
  data_dir: "./data"
  output_dir: "./outputs"
  checkpoint_dir: "./checkpoints"

Loading configuration in code:

import yaml

def load_config(config_path):
    with open(config_path, 'r') as f:
        config = yaml.safe_load(f)
    return config

# Usage
config = load_config('config/baseline.yaml')
set_seed(config['seed'])
model = create_model(config['model'])

Benefits:

  • Experiments easily reproducible from config file
  • Config files version controlled
  • Compare experiments by comparing configs
  • No hardcoded values scattered through code
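
If you prefer typed attribute access over raw dictionaries, one option (purely illustrative, mirroring the training block of baseline.yaml above and reusing load_config) is a small dataclass wrapper:

from dataclasses import dataclass

@dataclass
class TrainingConfig:
    batch_size: int
    epochs: int
    learning_rate: float
    optimizer: str
    weight_decay: float

config = load_config('config/baseline.yaml')
train_cfg = TrainingConfig(**config['training'])
print(train_cfg.learning_rate)  # misspelled or missing keys now fail loudly at construction time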

Environment Variables for Sensitive Data

Use environment variables for credentials and paths that differ across machines:

import os
from dotenv import load_dotenv

load_dotenv()  # Load from .env file

# Access environment variables
data_path = os.getenv('DATA_PATH', './data')  # Default if not set
api_key = os.getenv('API_KEY')

# Validate required variables
required_vars = ['DATA_PATH', 'MODEL_OUTPUT_PATH']
missing = [var for var in required_vars if not os.getenv(var)]
if missing:
    raise ValueError(f"Missing environment variables: {missing}")

.env file (NOT version controlled):

DATA_PATH=/mnt/datasets/ml_project
MODEL_OUTPUT_PATH=/mnt/outputs
MLFLOW_TRACKING_URI=http://localhost:5000

.env.example (version controlled template):

DATA_PATH=./data
MODEL_OUTPUT_PATH=./outputs
MLFLOW_TRACKING_URI=http://localhost:5000
API_KEY=your_api_key_here

Reproducibility Checklist

📦 Dependencies
✓ All package versions pinned (including transitive deps)
✓ Python version specified
✓ CUDA version documented
✓ System dependencies listed
✓ Lock files generated (if using Poetry/Conda)
🎲 Randomness
✓ All random seeds set (Python, NumPy, PyTorch/TF)
✓ Deterministic algorithms enabled
✓ DataLoader seeding configured
✓ Seeds version controlled in configs
✓ Results documented with seed used
🐳 Environment
✓ Dockerfile tested and working
✓ Docker image tags pinned
✓ Environment variables documented
✓ Setup instructions verified
✓ Works on fresh machine/container
📝 Documentation
✓ Setup steps documented in README
✓ Expected results documented
✓ Configuration files explained
✓ Known issues/limitations listed
✓ Hardware requirements specified

Testing Reproducibility

Verify reproducibility through systematic testing rather than hoping it works.

Multi-Machine Verification

Test on different machines to catch environment-specific issues:

  1. Fresh clone on your machine
  2. Colleague’s machine
  3. CI/CD environment
  4. Cloud instance

Document any differences in results and their causes. Some GPU-specific variance is acceptable if documented.

Reproducibility Test Script

Automated reproducibility testing:

import torch
import numpy as np
from train import train_model, set_seed

def test_reproducibility():
    """Test that identical seeds produce identical results."""
    
    results1 = []
    results2 = []
    
    for seed in [42, 123, 999]:
        # First run
        set_seed(seed)
        loss1, acc1 = train_model(epochs=5)
        results1.append((loss1, acc1))
        
        # Second run with same seed
        set_seed(seed)
        loss2, acc2 = train_model(epochs=5)
        results2.append((loss2, acc2))
        
        # Check equality
        assert np.isclose(loss1, loss2, rtol=1e-5), \
            f"Loss mismatch for seed {seed}: {loss1} vs {loss2}"
        assert np.isclose(acc1, acc2, rtol=1e-5), \
            f"Accuracy mismatch for seed {seed}: {acc1} vs {acc2}"
    
    print("✓ Reproducibility test passed")

if __name__ == '__main__':
    test_reproducibility()

Include in CI/CD to catch reproducibility regressions early.

Conclusion

Reproducible ML environments require attention to dependency versioning, randomness control, containerization, and documentation that general software projects don’t demand. The investment in setting up proper dependency management (whether Conda, Poetry, or Docker), comprehensive seed setting, configuration management, and testing pays dividends every time you revisit old experiments, onboard new team members, or deploy to production. Reproducibility isn’t a luxury or academic concern—it’s essential for debugging, collaboration, and building ML systems that work reliably.

The practices outlined—pinned dependencies with lock files, seed setting across all randomness sources, containerized environments, version-controlled configurations, and automated reproducibility testing—transform “works on my machine” into “works everywhere, every time.” Start with the basics (requirements.txt and seed setting), then add sophistication (Docker, automated testing) as project complexity demands. The goal isn’t perfection on day one but incremental improvement toward environments where reproducing results is straightforward rather than a frustrating debugging exercise. Build reproducibility into your workflow from the start, and focus your energy on actual machine learning rather than environment troubleshooting.
