“It works on my machine” is the death knell of collaborative machine learning projects. A model that trains perfectly on your laptop fails mysteriously on a colleague’s workstation. Results you achieved last month become impossible to replicate this week. Production deployment requires weeks of debugging environment differences. These scenarios repeat endlessly in ML teams lacking reproducible development environments, wasting countless hours on issues that have nothing to do with actual machine learning.
Reproducibility in ML environments goes far beyond traditional software development concerns. ML projects depend on specific versions of deep learning frameworks, CUDA libraries, system dependencies, random seeds, and data processing pipelines that interact in complex ways. A tiny version mismatch in NumPy can change numerical results. Different CUDA versions produce different model behaviors. Random seed handling varies across frameworks. Building truly reproducible ML environments requires understanding these dependencies and implementing systems that capture and recreate them reliably across machines, time, and team members.
Why ML Reproducibility Is Uniquely Challenging
Machine learning environments face complexity that general software development doesn’t encounter, making reproducibility harder to achieve.
The Dependency Stack Depth
ML projects have unusually deep dependency stacks spanning multiple layers that must align precisely.
- System level: Operating system, CUDA drivers, cuDNN libraries, system Python
- Computation level: TensorFlow/PyTorch, NumPy, SciPy, pandas
- ML level: scikit-learn, Hugging Face, specific model libraries
- Project level: Your custom code and project-specific packages
Each layer can break reproducibility. PyTorch 2.0 with CUDA 11.8 behaves differently from PyTorch 2.0 with CUDA 12.1. The same PyTorch version paired with different NumPy versions can produce different results. TensorFlow's behavior has also changed between minor versions when its random number generation was modified.
Example failure: Your colleague installs PyTorch, which automatically pulls NumPy 1.26. Your environment has NumPy 1.24. Identical code produces different training curves because NumPy’s random number generation changed between versions, affecting data shuffling and initialization.
Non-Determinism in ML Training
ML training involves sources of randomness that must be controlled for reproducibility:
- Weight initialization
- Data shuffling and batching
- Dropout masks
- Augmentation transformations
- GPU operation ordering
Each framework handles randomness differently. Setting random.seed(42) doesn’t guarantee reproducible PyTorch training—you need torch.manual_seed(42), np.random.seed(42), and CUDA determinism flags. TensorFlow has its own seeding mechanisms. Failure to set all random seeds appropriately causes irreproducible results even in identical environments.
Hardware-Specific Behaviors
GPU hardware affects results in ways that surprise developers coming from traditional software development.
Different GPU architectures (A100 vs RTX 4090) can produce slightly different numerical results for identical operations due to different floating-point implementations. CUDA version changes modify operation implementations, causing different results. Multi-GPU training introduces additional non-determinism from parallel operations.
This means: Perfect numerical reproducibility across all hardware is sometimes impossible. But you can achieve practical reproducibility—same results on the same hardware, predictable behaviors, and understanding where variance comes from.
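Because some variance is hardware-dependent, it helps to record the exact environment alongside every run so differences can be traced later. A minimal sketch (the helper name and output file are illustrative, not a fixed convention):
import json
import platform
import sys
import numpy as np
import torch

def environment_fingerprint():
    """Collect version and hardware details that commonly explain result variance."""
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "numpy": np.__version__,
        "torch": torch.__version__,
        "cuda": torch.version.cuda,
        "cudnn": torch.backends.cudnn.version(),
        "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu",
    }

# Save next to your experiment outputs so results can be compared across machines
with open("environment.json", "w") as f:
    json.dump(environment_fingerprint(), f, indent=2, default=str)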
Dependency Management Approaches
Capturing and reproducing the exact environment requires thoughtful dependency management strategy.
Requirements Files: The Baseline
Requirements.txt is the simplest approach but has significant limitations for ML reproducibility.
Basic requirements.txt:
torch==2.1.0
transformers==4.35.0
numpy==1.24.0
scikit-learn==1.3.2
Advantages:
- Simple and universally understood
- Works with standard pip
- Easy to version control
Limitations:
- Doesn’t capture transitive dependencies precisely
- No guarantee of compatible versions without testing
- Doesn’t specify Python version
- Doesn’t handle system dependencies
- Platform-specific differences not captured
Making requirements.txt more robust:
# python_version: 3.10
# Platform: linux
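# Note: the +cu118 wheels come from the PyTorch index, e.g.:
#   pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu118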
torch==2.1.0+cu118
numpy==1.24.0
transformers==4.35.0
scikit-learn==1.3.2
# Generate with: pip freeze > requirements.txt
# This captures exact versions including transitive deps
Best practice: Use pip freeze to capture exact versions of everything installed, including transitive dependencies. This ensures version consistency.
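To catch drift between the pinned file and what is actually installed, you can compare them before training. A minimal sketch, assuming a fully pinned requirements.txt as above (the helper name is illustrative):
from importlib.metadata import version, PackageNotFoundError

def check_pins(requirements_path="requirements.txt"):
    """Return a list of packages whose installed version differs from the pin."""
    mismatches = []
    with open(requirements_path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "==" not in line:
                continue  # skip comments and unpinned entries
            name, pinned = line.split("==", 1)
            try:
                installed = version(name)
            except PackageNotFoundError:
                mismatches.append(f"{name}: not installed (expected {pinned})")
                continue
            if installed != pinned:
                mismatches.append(f"{name}: installed {installed}, expected {pinned}")
    return mismatches

if problems := check_pins():
    raise RuntimeError("Environment drift detected:\n" + "\n".join(problems))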
Conda Environments: Comprehensive Management
Conda manages both Python packages and system dependencies, making it well-suited for ML environments.
Environment specification (environment.yml):
name: ml-project
channels:
  - pytorch
  - conda-forge
  - defaults
dependencies:
  - python=3.10
  - pytorch=2.1.0
  - torchvision=0.16.0
  - cudatoolkit=11.8
  - numpy=1.24.0
  - pandas=2.1.0
  - scikit-learn=1.3.2
  - pip:
      - transformers==4.35.0
Key advantages:
- Manages Python version explicitly
- Handles CUDA and system libraries
- Creates truly isolated environments
- Cross-platform support (with caveats)
Creating reproducible Conda environments:
# Create environment from specification
conda env create -f environment.yml
# Export exact environment (includes all dependencies)
conda env export > environment_exact.yml
# Export only explicitly requested packages
conda env export --from-history > environment.yml
The --from-history flag exports only packages you explicitly requested, improving readability but losing transitive dependency locking. The full export captures everything but becomes verbose.
Poetry and Modern Dependency Management
Poetry provides deterministic dependency resolution with lock files, similar to modern JavaScript (npm) or Rust (cargo) workflows.
pyproject.toml for ML project:
[tool.poetry]
name = "ml-project"
version = "0.1.0"
[tool.poetry.dependencies]
python = "^3.10"
torch = {version = "^2.1.0", source = "pytorch"}
transformers = "^4.35.0"
numpy = "^1.24.0"
scikit-learn = "^1.3.0"
[tool.poetry.group.dev.dependencies]
pytest = "^7.4.0"
jupyter = "^1.0.0"
[[tool.poetry.source]]
name = "pytorch"
url = "https://download.pytorch.org/whl/cu118"
priority = "supplemental"
Poetry generates poetry.lock containing exact versions of all dependencies with cryptographic hashes. This guarantees reproducibility across installations.
Advantages:
- Deterministic dependency resolution
- Lock file ensures exact versions
- Separate dev dependencies
- Excellent dependency conflict resolution
Limitations:
- Doesn’t manage system dependencies (CUDA, etc.)
- Slower than pip for large ML packages
- Learning curve for teams unfamiliar with it
Dependency Management Comparison
| Approach | Strengths | Limitations |
| --- | --- | --- |
| requirements.txt | Simple, universal, works with plain pip | Doesn't pin Python or system/CUDA dependencies; transitive pins only via pip freeze |
| Conda (environment.yml) | Manages Python version, CUDA, and system libraries; isolated environments | Exact exports are verbose; --from-history loses transitive locking |
| Poetry (pyproject.toml + poetry.lock) | Deterministic resolution with hashed lock file; separate dev dependencies | Doesn't manage system dependencies such as CUDA; slower for large ML packages |
Containerization for Complete Reproducibility
Containers capture the entire environment stack, providing the highest level of reproducibility.
Docker for ML Environments
Docker containers package everything from OS to application code, ensuring identical environments across machines.
ML-optimized Dockerfile:
# Start from NVIDIA CUDA base image
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04
# Set working directory
WORKDIR /app
# Install Python and system dependencies
RUN apt-get update && apt-get install -y \
python3.10 \
python3-pip \
git \
&& rm -rf /var/lib/apt/lists/*
# Copy dependency specifications
COPY requirements.txt .
# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt
# Copy project code
COPY . .
# Set environment variables for reproducibility
ENV PYTHONHASHSEED=42
ENV CUBLAS_WORKSPACE_CONFIG=:4096:8
# Default command
CMD ["python3", "train.py"]
Key practices for ML Docker images:
- Use official NVIDIA CUDA images for GPU support
- Pin all versions explicitly (Ubuntu version, Python version, base image tag)
- Copy requirements before code (layer caching optimization)
- Set environment variables affecting randomness
- Document build date and versions in labels
Multi-stage builds reduce image size:
# Builder stage: has the compilers and headers needed to build wheels
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04 AS builder
RUN apt-get update && apt-get install -y python3.10 python3-pip && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip wheel --no-cache-dir --wheel-dir /wheels -r requirements.txt

# Runtime stage: copy the pre-built wheels into the smaller runtime image
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y python3.10 python3-pip && rm -rf /var/lib/apt/lists/*
COPY --from=builder /wheels /wheels
COPY requirements.txt .
RUN pip install --no-cache-dir --no-index --find-links=/wheels -r requirements.txt
Docker Compose for Complex Setups
When projects need multiple services (ML training, database, API server), Docker Compose manages them together.
docker-compose.yml for ML project:
version: '3.8'

services:
  trainer:
    build: .
    volumes:
      - ./data:/app/data
      - ./models:/app/models
      - ./logs:/app/logs
    environment:
      - CUDA_VISIBLE_DEVICES=0
      - PYTHONHASHSEED=42
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.8.0
    ports:
      - "5000:5000"
    volumes:
      - ./mlruns:/mlflow
    command: mlflow server --host 0.0.0.0 --backend-store-uri /mlflow

  jupyter:
    build: .
    ports:
      - "8888:8888"
    volumes:
      - ./notebooks:/app/notebooks
      - ./data:/app/data
    command: jupyter lab --ip=0.0.0.0 --allow-root --no-browser
Benefits:
- All services defined in one file
- Consistent network between services
- Shared volume mounts for data
- Easy to start the entire stack with docker-compose up
Randomness Control and Determinism
Controlling randomness sources is essential for reproducible ML results.
Comprehensive Seed Setting
Set seeds for all randomness sources used in your project:
import random
import numpy as np
import torch
import os

def set_seed(seed=42):
    """Set seeds for reproducibility."""
    # Python built-in random
    random.seed(seed)
    # NumPy
    np.random.seed(seed)
    # PyTorch
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # Multi-GPU
    # Environment variables
    os.environ['PYTHONHASHSEED'] = str(seed)
    # Enforce deterministic behavior (may reduce performance)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # For CUDA >= 10.2
    os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'
    torch.use_deterministic_algorithms(True)

# Call at the start of your script
set_seed(42)
TensorFlow equivalent:
import tensorflow as tf
import numpy as np
import random
import os

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    # For reproducible operations
    os.environ['TF_DETERMINISTIC_OPS'] = '1'
    os.environ['TF_CUDNN_DETERMINISTIC'] = '1'
Data Loading Determinism
DataLoaders introduce randomness through shuffling and worker processes:
import random
import numpy as np
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id):
    """Seed worker processes for reproducibility."""
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

# Create deterministic DataLoader
g = torch.Generator()
g.manual_seed(42)

dataloader = DataLoader(
    dataset,  # an existing torch Dataset
    batch_size=32,
    shuffle=True,
    num_workers=4,
    worker_init_fn=seed_worker,
    generator=g
)
Without proper seeding, multi-process data loading produces different batches across runs even with the same global seed.
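To confirm that loader seeding works, you can compare the first batch of two identically seeded loaders. A minimal sketch using a synthetic dataset as a stand-in for your real one:
import random
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def seed_worker(worker_id):
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

dataset = TensorDataset(torch.arange(1000).float().unsqueeze(1))

def make_loader(seed):
    g = torch.Generator()
    g.manual_seed(seed)
    return DataLoader(dataset, batch_size=32, shuffle=True, num_workers=2,
                      worker_init_fn=seed_worker, generator=g)

if __name__ == "__main__":
    batch_a = next(iter(make_loader(42)))[0]
    batch_b = next(iter(make_loader(42)))[0]
    assert torch.equal(batch_a, batch_b), "DataLoader shuffling is not reproducible"
    print("First batches match across identically seeded runs")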
Version Control Integration
Code and environment specifications must be versioned together for true reproducibility.
What to Version Control
Essential files for version control:
- All code (obviously)
- requirements.txt / environment.yml / pyproject.toml
- poetry.lock (if using Poetry)
- Dockerfile and docker-compose.yml
- Configuration files
- Seed setting utilities
- Documentation on environment setup
What NOT to version control:
- Virtual environments (venv/, conda/)
- Data files (use DVC or similar)
- Model checkpoints (too large, use external storage)
- Cache directories (__pycache__/, .pytest_cache/)
- Local configuration overrides
.gitignore for ML projects:
# Python
__pycache__/
*.py[cod]
venv/
.env
# ML specific
*.pth
*.h5
*.ckpt
data/
logs/
mlruns/
# Jupyter
.ipynb_checkpoints/
# IDE
.vscode/
.idea/
Documentation Standards
README should include:
- Environment setup instructions
- Exact Python version required
- GPU requirements (if any)
- Step-by-step reproduction instructions
- Expected results with seeds
Example README section:
## Reproducible Setup
**Requirements:**
- Python 3.10
- CUDA 11.8 (for GPU)
- 16GB RAM minimum
**Setup:**
```bash
# Create conda environment
conda env create -f environment.yml
conda activate ml-project

# Or use Docker
docker build -t ml-project .
docker run --gpus all ml-project
```

**Training:**
```bash
python train.py --seed 42 --config configs/baseline.yaml
```

**Expected Results (seed=42):**
- Epoch 10: Train Loss 0.342, Val Loss 0.389
- Final Test Accuracy: 87.3%
Configuration Management
Separate code from configuration to enable reproducible experiments with different parameters.
Configuration Files
YAML configuration for ML experiments:
# config/baseline.yaml
seed: 42
experiment_name: "baseline_run"
model:
  architecture: "resnet18"
  pretrained: true
  num_classes: 10
training:
  batch_size: 32
  epochs: 50
  learning_rate: 0.001
  optimizer: "adam"
  weight_decay: 0.0001
data:
  dataset: "cifar10"
  train_split: 0.8
  val_split: 0.1
  test_split: 0.1
  augmentation: true
paths:
  data_dir: "./data"
  output_dir: "./outputs"
  checkpoint_dir: "./checkpoints"
Loading configuration in code:
import yaml

def load_config(config_path):
    """Load an experiment configuration from a YAML file."""
    with open(config_path, 'r') as f:
        config = yaml.safe_load(f)
    return config

# Usage
config = load_config('config/baseline.yaml')
set_seed(config['seed'])
model = create_model(config['model'])  # create_model is your project-specific factory
Benefits:
- Experiments easily reproducible from config file
- Config files version controlled
- Compare experiments by comparing configs
- No hardcoded values scattered through code
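It also helps to snapshot the exact config (and, if the project is in git, the commit) next to each run's outputs so any experiment can be re-run later. A minimal sketch; the paths and file names are illustrative:
import shutil
import subprocess
from pathlib import Path

def snapshot_run(config_path, output_dir):
    """Copy the config that produced this run and record the code version."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    shutil.copy(config_path, out / "config_used.yaml")
    # Assumes the project is a git repository
    commit = subprocess.run(["git", "rev-parse", "HEAD"],
                            capture_output=True, text=True).stdout.strip()
    (out / "git_commit.txt").write_text(commit + "\n")

snapshot_run('config/baseline.yaml', './outputs/baseline_run')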
Environment Variables for Sensitive Data
Use environment variables for credentials and paths that differ across machines:
import os
from dotenv import load_dotenv

load_dotenv()  # Load from .env file

# Access environment variables
data_path = os.getenv('DATA_PATH', './data')  # Default if not set
api_key = os.getenv('API_KEY')

# Validate required variables
required_vars = ['DATA_PATH', 'MODEL_OUTPUT_PATH']
missing = [var for var in required_vars if not os.getenv(var)]
if missing:
    raise ValueError(f"Missing environment variables: {missing}")
.env file (NOT version controlled):
DATA_PATH=/mnt/datasets/ml_project
MODEL_OUTPUT_PATH=/mnt/outputs
MLFLOW_TRACKING_URI=http://localhost:5000
.env.example (version controlled template):
DATA_PATH=./data
MODEL_OUTPUT_PATH=./outputs
MLFLOW_TRACKING_URI=http://localhost:5000
API_KEY=your_api_key_here
Reproducibility Checklist
✓ Python version specified
✓ CUDA version documented
✓ System dependencies listed
✓ Lock files generated (if using Poetry/Conda)
✓ Deterministic algorithms enabled
✓ DataLoader seeding configured
✓ Seeds version controlled in configs
✓ Results documented with seed used
✓ Docker image tags pinned
✓ Environment variables documented
✓ Setup instructions verified
✓ Works on fresh machine/container
✓ Expected results documented
✓ Configuration files explained
✓ Known issues/limitations listed
✓ Hardware requirements specified
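Several of these items can be asserted programmatically at the start of a run, right after calling set_seed() from the earlier section. A minimal sketch; the function name and messages are illustrative:
import os
import torch

def assert_determinism_settings():
    """Fail fast if the determinism settings from the checklist are not in effect."""
    assert os.environ.get("PYTHONHASHSEED"), "PYTHONHASHSEED is not set"
    assert torch.backends.cudnn.deterministic, "cuDNN deterministic mode is off"
    assert not torch.backends.cudnn.benchmark, "cuDNN benchmark mode should be off"
    assert torch.are_deterministic_algorithms_enabled(), \
        "torch.use_deterministic_algorithms(True) has not been called"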
Testing Reproducibility
Verify reproducibility through systematic testing rather than hoping it works.
Multi-Machine Verification
Test on different machines to catch environment-specific issues:
- Fresh clone on your machine
- Colleague’s machine
- CI/CD environment
- Cloud instance
Document any differences in results and their causes. Some GPU-specific variance is acceptable if documented.
Reproducibility Test Script
Automated reproducibility testing:
import numpy as np
from train import train_model, set_seed

def test_reproducibility():
    """Test that identical seeds produce identical results."""
    for seed in [42, 123, 999]:
        # First run
        set_seed(seed)
        loss1, acc1 = train_model(epochs=5)
        # Second run with same seed
        set_seed(seed)
        loss2, acc2 = train_model(epochs=5)
        # Check equality
        assert np.isclose(loss1, loss2, rtol=1e-5), \
            f"Loss mismatch for seed {seed}: {loss1} vs {loss2}"
        assert np.isclose(acc1, acc2, rtol=1e-5), \
            f"Accuracy mismatch for seed {seed}: {acc1} vs {acc2}"
    print("✓ Reproducibility test passed")

if __name__ == '__main__':
    test_reproducibility()
Include in CI/CD to catch reproducibility regressions early.
Conclusion
Reproducible ML environments require attention to dependency versioning, randomness control, containerization, and documentation that general software projects don’t demand. The investment in setting up proper dependency management (whether Conda, Poetry, or Docker), comprehensive seed setting, configuration management, and testing pays dividends every time you revisit old experiments, onboard new team members, or deploy to production. Reproducibility isn’t a luxury or academic concern—it’s essential for debugging, collaboration, and building ML systems that work reliably.
The practices outlined—pinned dependencies with lock files, seed setting across all randomness sources, containerized environments, version-controlled configurations, and automated reproducibility testing—transform “works on my machine” into “works everywhere, every time.” Start with the basics (requirements.txt and seed setting), then add sophistication (Docker, automated testing) as project complexity demands. The goal isn’t perfection on day one but incremental improvement toward environments where reproducing results is straightforward rather than a frustrating debugging exercise. Build reproducibility into your workflow from the start, and focus your energy on actual machine learning rather than environment troubleshooting.