A well-organized machine learning project can mean the difference between a smooth path to production and a chaotic mess that nobody wants to maintain. I’ve seen countless ML projects that started with brilliant ideas but became unmaintainable nightmares because of poor structure. The code worked—at least initially—but when it came time to add features, retrain models, or hand off to another team member, everything fell apart. Good project structure isn’t just about aesthetics; it’s about creating a sustainable, scalable foundation that serves you throughout the entire ML lifecycle.
The reality is that machine learning projects are fundamentally different from traditional software projects. They involve data pipelines, experiments, model artifacts, configurations, and notebooks—all of which need careful organization. Without proper structure, you’ll waste time hunting for files, struggle to reproduce results, and find collaboration nearly impossible. Let’s dive into the best practices that will set your ML projects up for long-term success.
The Foundation: Core Directory Structure
The backbone of any ML project is its directory structure. A good structure should be intuitive, scalable, and aligned with how ML workflows actually operate. Here’s a battle-tested structure that works across different types of projects:
```
project-name/
├── data/
│   ├── raw/
│   ├── processed/
│   └── external/
├── notebooks/
├── src/
│   ├── data/
│   ├── features/
│   ├── models/
│   └── utils/
├── models/
├── configs/
├── tests/
├── docs/
├── requirements.txt
└── README.md
```
This structure separates concerns clearly while maintaining flexibility. Let’s break down why each directory matters and how to use it effectively.
The Data Directory: Treating Data as Immutable
Your data/ directory should follow a critical principle: raw data is immutable. Once you place data in data/raw/, it should never be modified. All transformations should output to data/processed/. This approach provides several benefits:
- Reproducibility: You can always regenerate processed data from raw data
- Debugging: When something goes wrong, you know your source data hasn’t been corrupted
- Versioning: You can track changes to processing logic without worrying about data state
The data/external/ subdirectory is for third-party datasets, API downloads, or reference data that comes from outside your primary data sources. Keep it separate to maintain clear data provenance.
Important consideration: Never commit large data files to Git. Instead, use .gitignore to exclude data directories and document data sources in your README. For data versioning, use tools like DVC (Data Version Control) or store data in cloud storage with versioned paths.
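To make this concrete, here is a minimal `.gitignore` sketch reflecting that advice. The paths follow the directory layout above; the `.gitkeep` placeholder files are a common convention (not a Git feature) for keeping otherwise-empty directories tracked:

```
# .gitignore — keep large data and model artifacts out of Git
data/raw/*
data/processed/*
data/external/*
models/*
!data/raw/.gitkeep
!models/.gitkeep
*.pkl
*.parquet
```

Note the `dir/*` form rather than `dir/`: once Git excludes a whole directory, files inside it cannot be re-included, so the wildcard form is what allows the `!` exceptions to work.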
The Source Code Directory: Modular and Reusable
The src/ directory is where your production-quality code lives. This is distinct from notebooks—code here should be modular, tested, and reusable. The subdirectories reflect the ML pipeline stages:
src/data/: Contains scripts for data ingestion and cleaning. For example, make_dataset.py might handle downloading raw data, while clean_data.py performs initial cleaning and validation. These scripts should be idempotent—running them multiple times with the same input should produce the same output.
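As a sketch of what an idempotent cleaning step might look like (the function and column names here are illustrative, not from a real project):

```python
# Hypothetical sketch of src/data/clean_data.py — names are illustrative.
import pandas as pd

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """Idempotent cleaning: running it again on its own output changes nothing."""
    out = df.copy()
    out.columns = [c.strip().lower() for c in out.columns]  # normalize headers
    out = out.drop_duplicates()                             # safe to re-run
    out = out.dropna(subset=["customer_id"])                # drop unusable rows
    return out

raw = pd.DataFrame({"Customer_ID ": [1, 1, None], " Amount": [10.0, 10.0, 5.0]})
cleaned = clean_data(raw)
```

Because every operation is deterministic and input-driven, `clean_data(clean_data(df))` equals `clean_data(df)`, which is exactly the idempotence property described above.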
src/features/: Houses feature engineering code. This is where you transform raw data into features suitable for modeling. A typical file might be build_features.py that creates derived features, handles encoding, and performs feature scaling. Keep feature engineering logic modular so you can easily reuse transformations across training and inference.
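A small sketch of such a reusable transformation (column names like `last_purchase` and `signup` are hypothetical). Keeping it a pure function of a DataFrame and a reference date is what lets training and inference share it:

```python
# Illustrative sketch of a src/features/build_features.py transformation.
import pandas as pd

def add_recency_features(df: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Derive recency features; reusable at both training and inference time."""
    out = df.copy()
    out["days_since_last_purchase"] = (as_of - out["last_purchase"]).dt.days
    out["tenure_days"] = (as_of - out["signup"]).dt.days
    return out

df = pd.DataFrame({
    "last_purchase": pd.to_datetime(["2024-01-01", "2024-03-01"]),
    "signup": pd.to_datetime(["2023-01-01", "2023-06-01"]),
})
feats = add_recency_features(df, pd.Timestamp("2024-03-31"))
```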
src/models/: Contains model training, evaluation, and prediction code. Separate these concerns into different files: train_model.py, predict_model.py, and evaluate_model.py. This separation makes it easier to run different stages independently and integrate with orchestration tools.
src/utils/: Utility functions that don’t fit elsewhere—logging helpers, custom metrics, visualization functions, or common data manipulation utilities. Keep this focused on truly reusable components.
Code Organization Principles
✓ Do This
- Separate concerns into modules
- Write reusable functions
- Use configuration files
- Include docstrings
- Make code testable
✗ Avoid This
- Hardcoded parameters
- Monolithic scripts
- Duplicate code
- Notebook-only code
- Magic numbers
Configuration Management: The Backbone of Reproducibility
One of the most overlooked aspects of ML project structure is configuration management. Your configs/ directory should contain all the parameters that define your experiments—model hyperparameters, data paths, feature selections, training settings, and more.
Why Configuration Files Matter
Hardcoding parameters directly in scripts is a recipe for disaster. When you need to run experiments with different settings, you end up either modifying code constantly (breaking reproducibility) or creating multiple script copies (creating maintenance nightmares). Configuration files solve this by externalizing all variable parameters.
Use YAML or JSON for configuration files as they’re human-readable and widely supported. Here’s an example structure:
```yaml
# configs/train_config.yaml
data:
  raw_path: "data/raw/dataset.csv"
  processed_path: "data/processed/features.parquet"
  test_size: 0.2

features:
  numerical: ["age", "income", "tenure"]
  categorical: ["region", "product_type"]
  target: "churn"

model:
  type: "xgboost"
  hyperparameters:
    max_depth: 6
    learning_rate: 0.1
    n_estimators: 100

training:
  cross_validation: 5
  random_seed: 42
  early_stopping_rounds: 10
```
This approach lets you run different experiments simply by creating new config files or overriding specific parameters via command-line arguments. Your training script can load configurations like this:
```python
import yaml

def load_config(config_path):
    with open(config_path, 'r') as f:
        return yaml.safe_load(f)

config = load_config('configs/train_config.yaml')
model_params = config['model']['hyperparameters']
```
For more sophisticated needs, consider using tools like Hydra or OmegaConf that provide configuration composition, validation, and command-line overrides.
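If you don't want a framework dependency, dotted-key overrides can be sketched in a few lines. `apply_overrides` here is a hypothetical helper, not a library API; it reuses YAML parsing so override values get the same types (int, float, bool) they would have in the config file:

```python
# Hedged sketch of command-line style overrides on a nested config dict,
# e.g. "model.hyperparameters.max_depth=8". apply_overrides is hypothetical.
import copy
import yaml

def apply_overrides(config: dict, overrides: list) -> dict:
    out = copy.deepcopy(config)  # leave the original config untouched
    for item in overrides:
        dotted, raw = item.split("=", 1)
        keys = dotted.split(".")
        node = out
        for k in keys[:-1]:
            node = node.setdefault(k, {})
        node[keys[-1]] = yaml.safe_load(raw)  # parse numbers/bools like YAML
    return out

cfg = {"model": {"hyperparameters": {"max_depth": 6}}}
cfg2 = apply_overrides(cfg, ["model.hyperparameters.max_depth=8"])
```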
Models Directory: Version Control for Artifacts
The models/ directory stores trained model artifacts, but how you organize it significantly impacts reproducibility and deployment. A flat directory with files like model_v1.pkl, model_v2.pkl quickly becomes unmanageable. Instead, use a structured approach:
```
models/
├── experiment_001_baseline/
│   ├── model.pkl
│   ├── metrics.json
│   ├── config.yaml
│   └── feature_importance.png
├── experiment_002_xgboost/
│   ├── model.pkl
│   ├── metrics.json
│   ├── config.yaml
│   └── feature_importance.png
└── production/
    └── model_v1.0.0.pkl
```
Each experiment gets its own directory containing not just the model, but also the metrics, configuration used to train it, and any relevant artifacts. This makes it trivial to trace back what produced a particular model. The production/ subdirectory contains models actually deployed, with semantic versioning.
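A minimal sketch of writing such an experiment directory — `save_experiment` is a hypothetical helper, and the file names mirror the layout above:

```python
# Sketch of saving one experiment's artifacts side by side, as in the tree above.
import json
import pickle
import tempfile
from pathlib import Path
import yaml

def save_experiment(exp_dir, model, metrics: dict, config: dict):
    """Write the model together with the metrics and config that produced it."""
    exp_dir = Path(exp_dir)
    exp_dir.mkdir(parents=True, exist_ok=True)
    with open(exp_dir / "model.pkl", "wb") as f:
        pickle.dump(model, f)
    (exp_dir / "metrics.json").write_text(json.dumps(metrics, indent=2))
    (exp_dir / "config.yaml").write_text(yaml.safe_dump(config))

tmp = Path(tempfile.mkdtemp())
save_experiment(tmp / "experiment_001_baseline",
                {"dummy": "model"},            # stand-in for a fitted estimator
                {"auc": 0.87},                 # illustrative metric value
                {"model": {"type": "xgboost"}})
```

Because the config travels with the model, reproducing any experiment is just a matter of re-running training with the stored `config.yaml`.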
Consider using MLflow, Weights & Biases, or similar tools for more sophisticated model tracking. These tools provide automatic logging, comparison interfaces, and integration with deployment pipelines.
Notebooks: Exploration vs. Production
Notebooks are incredibly valuable for exploration, analysis, and communicating results, but they shouldn’t contain production code. The notebooks/ directory should be clearly organized by purpose:
```
notebooks/
├── 01_data_exploration.ipynb
├── 02_feature_analysis.ipynb
├── 03_baseline_model.ipynb
├── 04_model_comparison.ipynb
└── 05_results_visualization.ipynb
```
Use numeric prefixes to indicate execution order. This helps collaborators understand the analysis flow and makes it easy to reproduce your work sequentially.
Critical practice: Once you’ve prototyped something valuable in a notebook, refactor it into proper Python modules in src/. Notebooks are for exploration; src/ is for production. This discipline ensures that your project remains maintainable as it scales.
Keep notebooks focused on specific tasks. A 1000-cell notebook that does everything is as bad as a 5000-line Python script. Break analysis into logical chunks—data exploration, feature engineering experiments, model prototyping, and results visualization should each have their own notebooks.
Testing: Non-Negotiable for ML Projects
The tests/ directory is where you build confidence in your code. ML projects have unique testing needs beyond traditional software:
Data validation tests: Check that incoming data matches expected schemas, ranges, and distributions. Use libraries like Great Expectations or Pandera for this.
Feature engineering tests: Ensure transformations produce expected outputs. For example, if you have a function that normalizes a feature, test that it correctly handles edge cases like zero variance or missing values.
Model behavior tests: Test that your model produces sensible predictions on known inputs. If you’re building a credit risk model, ensure it assigns higher risk scores to profiles with obvious red flags.
Integration tests: Verify that your entire pipeline runs end-to-end without errors, even if you use smaller datasets for testing.
Here’s a simple example structure:
```
tests/
├── test_data_processing.py
├── test_feature_engineering.py
├── test_model_training.py
└── test_integration.py
```
Use pytest as your testing framework—it’s the de facto standard in Python and integrates well with CI/CD pipelines. Run tests automatically on every commit using GitHub Actions, GitLab CI, or similar tools.
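As a hedged example of what `tests/test_feature_engineering.py` might contain, here are pytest-style tests for the edge cases mentioned earlier; `normalize` is a hypothetical helper, inlined so the example is self-contained:

```python
# Illustrative tests for a hypothetical normalize() feature transformation.
import math

def normalize(values):
    """Zero-mean, unit-variance scaling; zero-variance input maps to zeros,
    and None/NaN entries are dropped before scaling."""
    clean = [v for v in values if v is not None and not math.isnan(v)]
    mean = sum(clean) / len(clean)
    var = sum((v - mean) ** 2 for v in clean) / len(clean)
    if var == 0:
        return [0.0 for _ in clean]
    std = math.sqrt(var)
    return [(v - mean) / std for v in clean]

def test_normalize_zero_variance():
    # A constant feature must not divide by zero
    assert normalize([5.0, 5.0, 5.0]) == [0.0, 0.0, 0.0]

def test_normalize_skips_missing():
    # Missing values are dropped rather than propagated
    assert len(normalize([1.0, None, 3.0])) == 2
```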
Project Setup Checklist
| Component | Essential Elements |
|---|---|
| Version Control | Git, .gitignore for data/models, clear commit messages |
| Environment | requirements.txt or environment.yml, Python version specified |
| Documentation | README with setup instructions, data documentation |
| Configuration | YAML/JSON configs, no hardcoded parameters |
| Testing | Unit tests, data validation, CI/CD integration |
Documentation: Your Future Self Will Thank You
The docs/ directory should contain project documentation beyond the README. This includes:
- Data dictionary: Descriptions of all features, their types, and meanings
- Model documentation: Architecture decisions, training procedures, performance benchmarks
- API documentation: If you expose your model via an API
- Deployment guide: Instructions for deploying to production environments
Your README.md should serve as the entry point, containing:
- Project overview and objectives
- Setup instructions (environment, dependencies, data)
- How to run training, evaluation, and inference
- Project structure explanation
- Contact information for questions
Don’t underestimate documentation—it’s often the difference between a project that others can use and one that gets abandoned.
Dependency Management: Reproducible Environments
Always maintain a requirements.txt (for pip) or environment.yml (for conda) that pins exact versions of all dependencies. Loose version specifications like scikit-learn>=0.24 can lead to different team members having different environments, breaking reproducibility.
For pip, generate your requirements file with exact versions:
```bash
pip freeze > requirements.txt
```
For conda, export your environment:
```bash
conda env export > environment.yml
```
Consider using tools like Poetry or Pipenv for more sophisticated dependency management with automatic lock files and development/production separation.
Practical Example: E-commerce Churn Prediction
Let’s see how this structure works in practice. Imagine you’re building a customer churn prediction model for an e-commerce platform.
You start by placing raw transaction and customer data in data/raw/. Your src/data/make_dataset.py script loads this data, performs initial cleaning, and saves to data/processed/. Next, src/features/build_features.py engineers features like purchase frequency, average order value, days since last purchase, and saves feature-engineered data.
Your configs/train_config.yaml specifies which features to use, model hyperparameters, and training settings. The src/models/train_model.py script reads this config, loads processed features, trains a model, and saves it to models/experiment_001_random_forest/ along with performance metrics.
You document your exploratory analysis in notebooks/01_churn_analysis.ipynb, examining which features correlate with churn. All of this is tested via tests/test_feature_engineering.py to ensure features are computed correctly.
When it’s time to deploy, you promote the best model to models/production/ with proper versioning. The entire project is reproducible because anyone can clone your repo, install dependencies from requirements.txt, and run the pipeline using your configuration files.
Conclusion
Machine learning project structure isn’t just about organizing files—it’s about building a foundation for reproducibility, collaboration, and long-term maintenance. The practices outlined here separate concerns clearly, make experimentation systematic, and ensure that your project can scale from prototype to production. By treating raw data as immutable, externalizing configurations, separating exploration from production code, and maintaining comprehensive documentation, you create projects that serve you well throughout their entire lifecycle.
Start your next ML project with proper structure from day one. The upfront investment in organization pays massive dividends as your project grows, team members join, and requirements evolve. A well-structured project isn’t just easier to work with—it’s more likely to succeed.