Machine Learning Project Structure Best Practices

A well-organized machine learning project can mean the difference between a smooth path to production and a chaotic mess that nobody wants to maintain. I’ve seen countless ML projects that started with brilliant ideas but became unmaintainable nightmares because of poor structure. The code worked—at least initially—but when it came time to add features, retrain models, or hand off to another team member, everything fell apart. Good project structure isn’t just about aesthetics; it’s about creating a sustainable, scalable foundation that serves you throughout the entire ML lifecycle.

The reality is that machine learning projects are fundamentally different from traditional software projects. They involve data pipelines, experiments, model artifacts, configurations, and notebooks—all of which need careful organization. Without proper structure, you’ll waste time hunting for files, struggle to reproduce results, and find collaboration nearly impossible. Let’s dive into the best practices that will set your ML projects up for long-term success.

The Foundation: Core Directory Structure

The backbone of any ML project is its directory structure. A good structure should be intuitive, scalable, and aligned with how ML workflows actually operate. Here’s a battle-tested structure that works across different types of projects:

project-name/
├── data/
│   ├── raw/
│   ├── processed/
│   └── external/
├── notebooks/
├── src/
│   ├── data/
│   ├── features/
│   ├── models/
│   └── utils/
├── models/
├── configs/
├── tests/
├── docs/
├── requirements.txt
└── README.md

This structure separates concerns clearly while maintaining flexibility. Let’s break down why each directory matters and how to use it effectively.

The Data Directory: Treating Data as Immutable

Your data/ directory should follow a critical principle: raw data is immutable. Once you place data in data/raw/, it should never be modified. All transformations should output to data/processed/. This approach provides several benefits:

  • Reproducibility: You can always regenerate processed data from raw data
  • Debugging: When something goes wrong, you know your source data hasn’t been corrupted
  • Versioning: You can track changes to processing logic without worrying about data state

The data/external/ subdirectory is for third-party datasets, API downloads, or reference data that comes from outside your primary data sources. Keep it separate to maintain clear data provenance.

Important consideration: Never commit large data files to Git. Instead, use a .gitignore to exclude data directories and document data sources in your README. For data versioning, use tools like DVC (Data Version Control) or store data in cloud storage under versioned paths.
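A minimal .gitignore for this layout might look like the following sketch (the patterns are illustrative; adjust them to whatever artifacts your project actually produces):

```text
# .gitignore — keep data and model artifacts out of Git
data/raw/
data/processed/
data/external/
models/
*.pkl
*.parquet
```

Committing a small placeholder file such as data/raw/.gitkeep keeps the directory structure visible in the repository even though its contents are ignored.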

The Source Code Directory: Modular and Reusable

The src/ directory is where your production-quality code lives. This is distinct from notebooks—code here should be modular, tested, and reusable. The subdirectories reflect the ML pipeline stages:

src/data/: Contains scripts for data ingestion and cleaning. For example, make_dataset.py might handle downloading raw data, while clean_data.py performs initial cleaning and validation. These scripts should be idempotent—running them multiple times with the same input should produce the same output.
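An idempotent cleaning step might look like this sketch (the column names and rules are illustrative, not a prescribed schema):

```python
import pandas as pd

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """Deterministic cleaning: the same input always yields the same output."""
    out = df.copy()
    out = out.drop_duplicates()                 # repeated rows add no information
    out["age"] = out["age"].clip(lower=0)       # remove impossible values
    out = out.dropna(subset=["customer_id"])    # require the key column
    return out.reset_index(drop=True)
```

Because every operation is deterministic and the input is never mutated in place, `clean_data(clean_data(df))` equals `clean_data(df)` — the practical test of idempotency.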

src/features/: Houses feature engineering code. This is where you transform raw data into features suitable for modeling. A typical file might be build_features.py that creates derived features, handles encoding, and performs feature scaling. Keep feature engineering logic modular so you can easily reuse transformations across training and inference.
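One way to keep transformations reusable across training and inference is the fit/transform pattern — learn parameters once on training data, then reapply them unchanged at prediction time. A hand-rolled sketch (scikit-learn's transformers offer the same contract with far more features):

```python
class Standardizer:
    """Learn scaling parameters at training time; reapply them at inference."""

    def fit(self, values):
        self.mean = sum(values) / len(values)
        var = sum((v - self.mean) ** 2 for v in values) / len(values)
        self.std = var ** 0.5 or 1.0  # guard against zero variance
        return self

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]
```

Serializing the fitted object alongside the model guarantees that inference uses exactly the statistics seen during training, instead of recomputing them on serving data.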

src/models/: Contains model training, evaluation, and prediction code. Separate these concerns into different files: train_model.py, predict_model.py, and evaluate_model.py. This separation makes it easier to run different stages independently and integrate with orchestration tools.

src/utils/: Utility functions that don’t fit elsewhere—logging helpers, custom metrics, visualization functions, or common data manipulation utilities. Keep this focused on truly reusable components.

Code Organization Principles

✓ Do This

  • Separate concerns into modules
  • Write reusable functions
  • Use configuration files
  • Include docstrings
  • Make code testable

✗ Avoid This

  • Hardcoded parameters
  • Monolithic scripts
  • Duplicate code
  • Notebook-only code
  • Magic numbers

Configuration Management: The Backbone of Reproducibility

One of the most overlooked aspects of ML project structure is configuration management. Your configs/ directory should contain all the parameters that define your experiments—model hyperparameters, data paths, feature selections, training settings, and more.

Why Configuration Files Matter

Hardcoding parameters directly in scripts is a recipe for disaster. When you need to run experiments with different settings, you end up either modifying code constantly (breaking reproducibility) or creating multiple script copies (creating maintenance nightmares). Configuration files solve this by externalizing all variable parameters.

Use YAML or JSON for configuration files as they’re human-readable and widely supported. Here’s an example structure:

# configs/train_config.yaml
data:
  raw_path: "data/raw/dataset.csv"
  processed_path: "data/processed/features.parquet"
  test_size: 0.2

features:
  numerical: ["age", "income", "tenure"]
  categorical: ["region", "product_type"]
  target: "churn"

model:
  type: "xgboost"
  hyperparameters:
    max_depth: 6
    learning_rate: 0.1
    n_estimators: 100

training:
  cross_validation: 5
  random_seed: 42
  early_stopping_rounds: 10

This approach lets you run different experiments simply by creating new config files or overriding specific parameters via command-line arguments. Your training script can load configurations like this:

import yaml
from pathlib import Path

def load_config(config_path):
    with open(config_path, 'r') as f:
        return yaml.safe_load(f)

config = load_config('configs/train_config.yaml')
model_params = config['model']['hyperparameters']

For more sophisticated needs, consider using tools like Hydra or OmegaConf that provide configuration composition, validation, and command-line overrides.
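Without reaching for those tools, command-line overrides can be sketched with argparse and dotted keys (the `--set` flag and its syntax are an assumption of this sketch, not a standard interface):

```python
import argparse

import yaml

def load_with_overrides(argv=None):
    """Load a YAML config, then apply dotted key=value overrides from the CLI."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", default="configs/train_config.yaml")
    parser.add_argument("--set", nargs="*", default=[],
                        help="overrides, e.g. model.hyperparameters.max_depth=8")
    args = parser.parse_args(argv)
    with open(args.config) as f:
        config = yaml.safe_load(f)
    for item in args.set:
        keys, value = item.split("=", 1)
        *path, last = keys.split(".")
        node = config
        for key in path:
            node = node[key]
        node[last] = yaml.safe_load(value)  # parse numbers/bools like YAML would
    return config
```

Each experiment run then records exactly which parameters it used, because the base file plus the overrides fully determine the configuration.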

Models Directory: Version Control for Artifacts

The models/ directory stores trained model artifacts, but how you organize it significantly impacts reproducibility and deployment. A flat directory with files like model_v1.pkl, model_v2.pkl quickly becomes unmanageable. Instead, use a structured approach:

models/
├── experiment_001_baseline/
│   ├── model.pkl
│   ├── metrics.json
│   ├── config.yaml
│   └── feature_importance.png
├── experiment_002_xgboost/
│   ├── model.pkl
│   ├── metrics.json
│   ├── config.yaml
│   └── feature_importance.png
└── production/
    └── model_v1.0.0.pkl

Each experiment gets its own directory containing not just the model, but also the metrics, configuration used to train it, and any relevant artifacts. This makes it trivial to trace back what produced a particular model. The production/ subdirectory contains models actually deployed, with semantic versioning.
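Writing an experiment's artifacts together might look like this sketch (the function name and the metrics shown are illustrative):

```python
import json
import pickle
from pathlib import Path

import yaml

def save_experiment(model, metrics, config, base_dir="models", name="experiment_001_baseline"):
    """Bundle model, metrics, and config so any model can be traced to its run."""
    exp_dir = Path(base_dir) / name
    exp_dir.mkdir(parents=True, exist_ok=True)
    with open(exp_dir / "model.pkl", "wb") as f:
        pickle.dump(model, f)
    with open(exp_dir / "metrics.json", "w") as f:
        json.dump(metrics, f, indent=2)
    with open(exp_dir / "config.yaml", "w") as f:
        yaml.safe_dump(config, f)
    return exp_dir
```

Calling this at the end of every training run keeps the model and the evidence for it in one place, so comparing experiments is a matter of diffing two directories.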

Consider using MLflow, Weights & Biases, or similar tools for more sophisticated model tracking. These tools provide automatic logging, comparison interfaces, and integration with deployment pipelines.

Notebooks: Exploration vs. Production

Notebooks are incredibly valuable for exploration, analysis, and communicating results, but they shouldn’t contain production code. The notebooks/ directory should be clearly organized by purpose:

notebooks/
├── 01_data_exploration.ipynb
├── 02_feature_analysis.ipynb
├── 03_baseline_model.ipynb
├── 04_model_comparison.ipynb
└── 05_results_visualization.ipynb

Use numeric prefixes to indicate execution order. This helps collaborators understand the analysis flow and makes it easy to reproduce your work sequentially.

Critical practice: Once you’ve prototyped something valuable in a notebook, refactor it into proper Python modules in src/. Notebooks are for exploration; src/ is for production. This discipline ensures that your project remains maintainable as it scales.

Keep notebooks focused on specific tasks. A 1000-cell notebook that does everything is as bad as a 5000-line Python script. Break analysis into logical chunks—data exploration, feature engineering experiments, model prototyping, and results visualization should each have their own notebooks.

Testing: Non-Negotiable for ML Projects

The tests/ directory is where you build confidence in your code. ML projects have unique testing needs beyond traditional software:

Data validation tests: Check that incoming data matches expected schemas, ranges, and distributions. Use libraries like Great Expectations or Pandera for this.

Feature engineering tests: Ensure transformations produce expected outputs. For example, if you have a function that normalizes a feature, test that it correctly handles edge cases like zero variance or missing values.

Model behavior tests: Test that your model produces sensible predictions on known inputs. If you’re building a credit risk model, ensure it assigns higher risk scores to profiles with obvious red flags.

Integration tests: Verify that your entire pipeline runs end-to-end without errors, even if you use smaller datasets for testing.

Here’s a simple example structure:

tests/
├── test_data_processing.py
├── test_feature_engineering.py
├── test_model_training.py
└── test_integration.py

Use pytest as your testing framework—it’s the de facto standard in Python and integrates well with CI/CD pipelines. Run tests automatically on every commit using GitHub Actions, GitLab CI, or similar tools.
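A feature engineering test covering the edge cases mentioned above might look like this (the `normalize` helper is a stand-in for your own feature code):

```python
def normalize(values):
    """Scale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:  # zero-variance edge case
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def test_normalize_range():
    assert normalize([2.0, 4.0, 6.0]) == [0.0, 0.5, 1.0]

def test_normalize_zero_variance():
    assert normalize([3.0, 3.0, 3.0]) == [0.0, 0.0, 0.0]
```

Placed in tests/test_feature_engineering.py, pytest discovers and runs both functions automatically; the zero-variance case is exactly the kind of edge case that silently breaks naive implementations.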

Project Setup Checklist

  • Version control: Git, .gitignore for data/models, clear commit messages
  • Environment: requirements.txt or environment.yml, Python version specified
  • Documentation: README with setup instructions, data documentation
  • Configuration: YAML/JSON configs, no hardcoded parameters
  • Testing: unit tests, data validation, CI/CD integration

Documentation: Your Future Self Will Thank You

The docs/ directory should contain project documentation beyond the README. This includes:

  • Data dictionary: Descriptions of all features, their types, and meanings
  • Model documentation: Architecture decisions, training procedures, performance benchmarks
  • API documentation: If you expose your model via an API
  • Deployment guide: Instructions for deploying to production environments

Your README.md should serve as the entry point, containing:

  • Project overview and objectives
  • Setup instructions (environment, dependencies, data)
  • How to run training, evaluation, and inference
  • Project structure explanation
  • Contact information for questions

Don’t underestimate documentation—it’s often the difference between a project that others can use and one that gets abandoned.

Dependency Management: Reproducible Environments

Always maintain a requirements.txt (for pip) or environment.yml (for conda) that pins exact versions of all dependencies. Loose version specifications like scikit-learn>=0.24 can lead to different team members having different environments, breaking reproducibility.

For pip, generate your requirements file with exact versions:

pip freeze > requirements.txt

For conda, export your environment:

conda env export > environment.yml

Consider using tools like Poetry or Pipenv for more sophisticated dependency management with automatic lock files and development/production separation.

Practical Example: E-commerce Churn Prediction

Let’s see how this structure works in practice. Imagine you’re building a customer churn prediction model for an e-commerce platform.

You start by placing raw transaction and customer data in data/raw/. Your src/data/make_dataset.py script loads this data, performs initial cleaning, and saves to data/processed/. Next, src/features/build_features.py engineers features like purchase frequency, average order value, days since last purchase, and saves feature-engineered data.

Your configs/train_config.yaml specifies which features to use, model hyperparameters, and training settings. The src/models/train_model.py script reads this config, loads processed features, trains a model, and saves it to models/experiment_001_random_forest/ along with performance metrics.

You document your exploratory analysis in notebooks/01_churn_analysis.ipynb, examining which features correlate with churn. All of this is tested via tests/test_feature_engineering.py to ensure features are computed correctly.

When it’s time to deploy, you promote the best model to models/production/ with proper versioning. The entire project is reproducible because anyone can clone your repo, install dependencies from requirements.txt, and run the pipeline using your configuration files.

Conclusion

Machine learning project structure isn’t just about organizing files—it’s about building a foundation for reproducibility, collaboration, and long-term maintenance. The practices outlined here separate concerns clearly, make experimentation systematic, and ensure that your project can scale from prototype to production. By treating raw data as immutable, externalizing configurations, separating exploration from production code, and maintaining comprehensive documentation, you create projects that serve you well throughout their entire lifecycle.

Start your next ML project with proper structure from day one. The upfront investment in organization pays massive dividends as your project grows, team members join, and requirements evolve. A well-structured project isn’t just easier to work with—it’s more likely to succeed.
