Machine learning development is inherently experimental. You try different algorithms, tweak hyperparameters, preprocess data in various ways, and iterate through dozens or even hundreds of model variations. Without systematic experiment tracking, this process becomes chaotic—you lose track of what worked, can’t reproduce promising results, and waste time re-running experiments you’ve already tried. MLflow provides a lightweight, flexible solution for logging experiments that integrates seamlessly into your existing ML workflow, whether you’re training models locally, in notebooks, or on distributed clusters.
Understanding MLflow’s Tracking Architecture
MLflow’s tracking system revolves around a few core concepts that, once understood, make experiment logging intuitive. At the highest level, you have experiments, which group related runs together. Each run represents a single execution of your model training code—one hyperparameter configuration, one dataset version, one attempt. Within each run, you log three types of information: parameters (input configurations), metrics (output measurements), and artifacts (files like models, plots, or datasets).
The tracking server can run locally or remotely. For individual development, the local file-based backend suffices:
import mlflow
# Uses ./mlruns directory by default
mlflow.set_tracking_uri("file:./mlruns")
# Create or set experiment
mlflow.set_experiment("sentiment-classification")
For team collaboration, you’d deploy a tracking server with a database backend (PostgreSQL, MySQL) and artifact storage (S3, Azure Blob, GCS). The beauty of MLflow’s design is that your logging code remains identical regardless of backend—change the tracking URI and everything else stays the same.
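As a minimal sketch (the server command, credentials, bucket, and hostname below are illustrative placeholders), a team setup might look like this: an administrator launches the tracking server against the database and object store, and client code only swaps the URI:
# Launched once on the server host (shell command shown as a comment):
#   mlflow server --backend-store-uri postgresql://user:pass@db-host/mlflow \
#                 --default-artifact-root s3://my-mlflow-artifacts \
#                 --host 0.0.0.0 --port 5000
import mlflow
# Client side: point at the shared server instead of the local ./mlruns folder
mlflow.set_tracking_uri("http://mlflow.internal.example.com:5000")
mlflow.set_experiment("sentiment-classification")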
Understanding run context is crucial. Every logging operation happens within an active run. MLflow provides two patterns for managing run context:
# Pattern 1: Explicit context manager
with mlflow.start_run(run_name="baseline-model"):
    mlflow.log_param("learning_rate", 0.001)
    # training code...
    mlflow.log_metric("accuracy", 0.89)
# Pattern 2: Manual start/end (useful for long-running processes)
run = mlflow.start_run(run_name="experiment-1")
mlflow.log_param("batch_size", 32)
# training code...
mlflow.end_run()
The context manager pattern is cleaner and ensures runs properly close even if exceptions occur. For Jupyter notebooks or interactive sessions where you might pause between logging calls, manual management offers more control.
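One notebook pitfall worth guarding against: mlflow.start_run() raises an error if another run is still active. A small check with mlflow.active_run() before starting keeps interactive sessions clean:
# End any run left open from a previous cell before starting a new one
if mlflow.active_run() is not None:
    mlflow.end_run()
run = mlflow.start_run(run_name="interactive-session")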
Logging Parameters and Hyperparameters Effectively
Parameters represent the input configuration for your experiment—anything you set before or during training that affects the outcome. This includes obvious hyperparameters like learning rate and batch size, but also data preprocessing choices, model architecture decisions, and environment information.
The basic logging interface is straightforward:
with mlflow.start_run():
    mlflow.log_param("model_type", "random_forest")
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)
    mlflow.log_param("random_state", 42)
For multiple parameters, use log_params with a dictionary:
params = {
    "learning_rate": 0.001,
    "batch_size": 32,
    "epochs": 50,
    "optimizer": "adam",
    "dropout_rate": 0.3
}
mlflow.log_params(params)
A common mistake is logging too many parameters indiscriminately. Focus on parameters that actually vary between experiments or significantly impact results. If you always use the same random seed, don’t log it in every run—it becomes noise. However, do log parameters that might change implicitly, like library versions that could affect reproducibility:
import torch
import transformers
mlflow.log_param("pytorch_version", torch.__version__)
mlflow.log_param("transformers_version", transformers.__version__)
For nested configurations, flatten them meaningfully:
config = {
    "model": {
        "architecture": "resnet50",
        "pretrained": True
    },
    "training": {
        "optimizer": "sgd",
        "lr": 0.01,
        "momentum": 0.9
    }
}
# Flatten with descriptive prefixes
mlflow.log_param("model_architecture", config["model"]["architecture"])
mlflow.log_param("model_pretrained", config["model"]["pretrained"])
mlflow.log_param("train_optimizer", config["training"]["optimizer"])
mlflow.log_param("train_lr", config["training"]["lr"])
This flattening makes parameters searchable in the MLflow UI and enables filtering experiments by specific configuration aspects.
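If you flatten configurations regularly, a small helper keeps the naming consistent; this flatten_config function is an illustrative utility, not part of the MLflow API:
def flatten_config(config, prefix=""):
    """Flatten nested dicts into {"parent_key": value} pairs for log_params."""
    flat = {}
    for key, value in config.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_config(value, prefix=name))
        else:
            flat[name] = value
    return flat
# Yields keys like "model_architecture" and "training_lr"
mlflow.log_params(flatten_config(config))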
When working with hyperparameter search libraries like Optuna or Ray Tune, log both the search space and the specific values:
# Inside an Optuna objective(trial) function, for example:
# Log search configuration
mlflow.log_param("search_algorithm", "tpe")
mlflow.log_param("n_trials", 100)
# Log selected hyperparameters
mlflow.log_params({
    "lr_trial": trial.suggest_float("lr", 1e-5, 1e-1, log=True),
    "batch_size_trial": trial.suggest_categorical("batch_size", [16, 32, 64]),
    "layers_trial": trial.suggest_int("layers", 2, 6)
})
Tracking Metrics Throughout Training
Metrics capture the quantitative performance of your model. Unlike parameters which are logged once at the start, metrics often need logging multiple times—after each epoch, every N steps, or at specific training milestones. MLflow handles this through step-indexed metric logging:
for epoch in range(num_epochs):
    train_loss = train_epoch(model, train_loader)
    val_loss, val_accuracy = evaluate(model, val_loader)
    mlflow.log_metric("train_loss", train_loss, step=epoch)
    mlflow.log_metric("val_loss", val_loss, step=epoch)
    mlflow.log_metric("val_accuracy", val_accuracy, step=epoch)
The step parameter is crucial: it lets the MLflow UI plot metric progression over training. Without it, every value is recorded at the default step of 0, so you cannot chart the trajectory by epoch and the runs table only surfaces the latest value.
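The full history remains queryable afterwards; as a quick check, you can pull a metric's logged values back through the tracking client (the run ID below is a placeholder):
from mlflow.tracking import MlflowClient
client = MlflowClient()
run_id = "<run-id-from-the-ui>"  # placeholder: any finished run's ID
for measurement in client.get_metric_history(run_id, "train_loss"):
    print(measurement.step, measurement.value)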
For training loops that operate on batches within epochs, you might want to log at different granularities:
global_step = 0
for epoch in range(num_epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        loss = train_step(model, data, target)
        # Log frequently during training
        if global_step % 100 == 0:
            mlflow.log_metric("batch_loss", loss, step=global_step)
        global_step += 1
    # Log epoch-level metrics
    val_metrics = evaluate(model, val_loader)
    mlflow.log_metrics({
        "epoch_val_loss": val_metrics["loss"],
        "epoch_val_acc": val_metrics["accuracy"],
        "epoch_val_f1": val_metrics["f1"]
    }, step=epoch)
This creates two metric streams: high-frequency batch-level loss for monitoring training stability, and epoch-level validation metrics for overall performance tracking.
For classification tasks, log comprehensive metrics beyond just accuracy:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, roc_auc_score
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)
# Calculate multiple metrics
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average='weighted'
)
mlflow.log_metrics({
    "test_accuracy": accuracy_score(y_test, y_pred),
    "test_precision": precision,
    "test_recall": recall,
    "test_f1": f1,
    "test_auc": roc_auc_score(y_test, y_pred_proba, multi_class='ovr')
})
These detailed metrics enable comparing models across multiple dimensions, not just a single accuracy number.
For regression tasks, track multiple error metrics:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
predictions = model.predict(X_test)
mlflow.log_metrics({
    "test_mse": mean_squared_error(y_test, predictions),
    # newer scikit-learn deprecates squared=False in favor of root_mean_squared_error
    "test_rmse": mean_squared_error(y_test, predictions, squared=False),
    "test_mae": mean_absolute_error(y_test, predictions),
    "test_r2": r2_score(y_test, predictions)
})
Artifact Logging for Models and Visualizations
Artifacts are files associated with a run—trained models, plots, confusion matrices, dataset samples, or any other binary data. While parameters and metrics are stored in the database, artifacts go to artifact storage (local filesystem, S3, etc.).
The most important artifact is usually the model itself:
# For scikit-learn models
mlflow.sklearn.log_model(model, "model")
# For PyTorch models
mlflow.pytorch.log_model(model, "model")
# For TensorFlow/Keras
mlflow.tensorflow.log_model(model, "model")
# For custom or unsupported frameworks
import joblib
joblib.dump(model, "model.pkl")
mlflow.log_artifact("model.pkl")
The framework-specific logging functions (like mlflow.sklearn.log_model) are preferable because they capture the model signature and dependency requirements, and they enable model serving through MLflow's deployment features.
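For example, you can attach an explicit signature and input example when logging; a sketch assuming a scikit-learn model and an array-like X_train:
from mlflow.models import infer_signature
# Record the expected input/output schema alongside the model
signature = infer_signature(X_train, model.predict(X_train))
mlflow.sklearn.log_model(
    model,
    "model",
    signature=signature,
    input_example=X_train[:5],
)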
Visualization artifacts provide crucial insights into model behavior:
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
# Create confusion matrix
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.savefig("confusion_matrix.png")
mlflow.log_artifact("confusion_matrix.png")
plt.close()
# ROC curve for binary classification
from sklearn.metrics import roc_curve, auc
fpr, tpr, _ = roc_curve(y_test, y_pred_proba[:, 1])
roc_auc = auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.savefig("roc_curve.png")
mlflow.log_artifact("roc_curve.png")
plt.close()
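If you prefer to skip the save-then-log step, recent MLflow versions also accept in-memory objects directly; a short sketch reusing the fpr/tpr arrays from above:
# Log a matplotlib figure and the underlying curve data without temp files
fig, ax = plt.subplots()
ax.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:.2f})')
ax.legend()
mlflow.log_figure(fig, "plots/roc_curve_direct.png")
plt.close(fig)
mlflow.log_dict({"fpr": fpr.tolist(), "tpr": tpr.tolist()}, "plots/roc_points.json")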
For deep learning, log training curves:
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(train_losses, label='Train Loss')
plt.plot(val_losses, label='Val Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.title('Training History - Loss')
plt.subplot(1, 2, 2)
plt.plot(train_accs, label='Train Acc')
plt.plot(val_accs, label='Val Acc')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.title('Training History - Accuracy')
plt.tight_layout()
plt.savefig("training_curves.png")
mlflow.log_artifact("training_curves.png")
plt.close()
For logging entire directories of artifacts:
import os
# Save model checkpoints in a directory
checkpoint_dir = "checkpoints"
os.makedirs(checkpoint_dir, exist_ok=True)
for epoch in range(num_epochs):
    # Save a checkpoint per epoch
    torch.save(model.state_dict(), f"{checkpoint_dir}/model_epoch_{epoch}.pt")
# Log the entire directory
mlflow.log_artifacts(checkpoint_dir, artifact_path="checkpoints")
Be deliberate about which artifacts you log. Worth logging:
- Final trained models
- Performance visualizations
- Model architecture diagrams
- Sample predictions
- Feature importance plots
- Data preprocessing pipelines
Not worth logging:
- Raw training datasets (usually too large)
- Every single checkpoint (be selective)
- Redundant file formats
- Temporary training files
- Debug outputs unless necessary
- Uncompressed large files
Integration with Training Frameworks
MLflow provides autologging for popular ML frameworks, dramatically simplifying experiment tracking. Instead of manually logging every parameter and metric, enable autologging and MLflow captures everything automatically:
# For scikit-learn
from sklearn.ensemble import RandomForestClassifier
mlflow.sklearn.autolog()
with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100, max_depth=10)
    model.fit(X_train, y_train)
    # Automatically logs: parameters, training metrics, model
For PyTorch, note that full autologging via mlflow.pytorch.autolog() is tied to PyTorch Lightning's Trainer; a hand-written training loop still needs explicit metric calls:
mlflow.pytorch.autolog()  # captures runs driven by pytorch_lightning.Trainer.fit()
with mlflow.start_run():
    model = create_model()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    for epoch in range(num_epochs):
        train_loss = train_epoch(model, train_loader, optimizer)
        # In a plain loop, log metrics explicitly
        mlflow.log_metric("train_loss", train_loss, step=epoch)
TensorFlow/Keras integration:
mlflow.tensorflow.autolog()
with mlflow.start_run():
    model = create_keras_model()
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    history = model.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        epochs=50,
        batch_size=32
    )
    # All history metrics and model automatically logged
Autologging captures:
- Model hyperparameters
- Training/validation metrics per epoch
- Final model artifacts
- Training duration
- Framework versions
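Autologging behavior is configurable through keyword arguments; a minimal sketch, assuming a reasonably recent MLflow (check your version's documentation for the full list):
mlflow.sklearn.autolog(
    log_models=False,  # skip heavy model artifacts, e.g. during large sweeps
    disable=False,     # set True to turn autologging back off
)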
However, autologging doesn’t replace manual logging entirely. You still need to log custom metrics, domain-specific parameters, or application-level information:
from xgboost import XGBClassifier
mlflow.xgboost.autolog()  # XGBClassifier is covered by the XGBoost autologger
with mlflow.start_run():
    # Autolog handles model training
    model = XGBClassifier(n_estimators=100)
    model.fit(X_train, y_train)
    # Manually log additional context
    mlflow.log_param("feature_engineering_version", "v2")
    mlflow.log_param("data_split_date", "2024-01-15")
    # Custom business metrics
    predictions = model.predict(X_test)
    cost_savings = calculate_cost_savings(y_test, predictions)
    mlflow.log_metric("estimated_cost_savings", cost_savings)
For HuggingFace Transformers, integration is seamless:
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir="./results",
    report_to="mlflow",  # Enable MLflow logging
    logging_steps=100,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
with mlflow.start_run():
    trainer.train()
    # All training metrics automatically logged to MLflow
Organizing and Comparing Experiments
As experiments accumulate, organization becomes critical. MLflow’s experiment grouping provides the first level of organization. Create separate experiments for different projects, model types, or research directions:
# Project-specific experiments
mlflow.set_experiment("sentiment-analysis-v1")
mlflow.set_experiment("sentiment-analysis-v2")
# Model-family experiments
mlflow.set_experiment("bert-variants")
mlflow.set_experiment("gpt-experiments")
# Research direction experiments
mlflow.set_experiment("baseline-models")
mlflow.set_experiment("hyperparameter-tuning")
mlflow.set_experiment("architecture-search")
Within experiments, use meaningful run names and tags:
with mlflow.start_run(run_name="bert-base-lr001-batch32"):
    mlflow.set_tag("model_family", "bert")
    mlflow.set_tag("experiment_type", "baseline")
    mlflow.set_tag("dataset_version", "v2.1")
    mlflow.set_tag("researcher", "team_a")
Tags enable filtering in the UI and programmatic queries. You might want to find all runs from a specific researcher, all experiments with a particular model family, or all baseline runs for comparison.
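A quick sketch of such a query with the fluent mlflow.search_runs API, which returns a pandas DataFrame (the experiment_names argument assumes a fairly recent MLflow release, and the tag columns assume the tags above were set):
runs_df = mlflow.search_runs(
    experiment_names=["sentiment-analysis-v1"],
    filter_string="tags.model_family = 'bert' and tags.researcher = 'team_a'",
)
print(runs_df[["run_id", "tags.experiment_type", "metrics.val_accuracy"]])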
The MLflow UI provides powerful comparison features. You can select multiple runs and view them side-by-side, comparing parameters, metrics, and artifacts. For programmatic comparison:
from mlflow.tracking import MlflowClient
client = MlflowClient()
experiment = client.get_experiment_by_name("sentiment-analysis-v1")
# Get all runs from experiment
runs = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["metrics.val_accuracy DESC"],
    max_results=10
)
# Compare top runs
for run in runs:
    print(f"Run ID: {run.info.run_id}")
    print(f"Accuracy: {run.data.metrics['val_accuracy']:.4f}")
    print(f"LR: {run.data.params['learning_rate']}")
    print("---")
For automated best model selection:
# Find best run by specific metric
best_run = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    filter_string="metrics.val_f1 > 0.85",
    order_by=["metrics.val_f1 DESC"],
    max_results=1
)[0]
# Load best model
model_uri = f"runs:/{best_run.info.run_id}/model"
model = mlflow.sklearn.load_model(model_uri)
This programmatic access enables building automated pipelines that train multiple models, compare results, and deploy the best performer—all tracked in MLflow.
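To close that loop, the selected model can also be promoted to the MLflow Model Registry, provided the tracking server uses a database backend; a brief sketch reusing model_uri from above (the registry name is illustrative):
result = mlflow.register_model(model_uri, "sentiment-classifier")
print(f"Registered {result.name} version {result.version}")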
Nested Runs for Complex Workflows
For complex ML workflows like hyperparameter search, cross-validation, or ensemble training, nested runs provide hierarchical organization. The parent run represents the overall experiment, while child runs represent individual trials:
with mlflow.start_run(run_name="hyperparameter-search") as parent_run:
    mlflow.log_param("search_algorithm", "random_search")
    mlflow.log_param("n_trials", 50)
    best_score = 0
    for trial in range(50):
        # Create nested run for each trial
        with mlflow.start_run(run_name=f"trial_{trial}", nested=True):
            params = sample_hyperparameters()
            mlflow.log_params(params)
            model = train_model(**params)
            score = evaluate_model(model)
            mlflow.log_metric("val_accuracy", score)
            if score > best_score:
                best_score = score
    # Log best score in parent run
    mlflow.log_metric("best_trial_accuracy", best_score)
This creates a clean hierarchy in the MLflow UI where you can drill down from the search overview into individual trials, while keeping high-level summary metrics at the parent level.
For cross-validation:
import numpy as np
from sklearn.model_selection import KFold
with mlflow.start_run(run_name="5-fold-cv") as parent_run:
    kf = KFold(n_splits=5)
    fold_scores = []
    for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
        with mlflow.start_run(run_name=f"fold_{fold}", nested=True):
            X_train_fold = X[train_idx]
            y_train_fold = y[train_idx]
            X_val_fold = X[val_idx]
            y_val_fold = y[val_idx]
            model = train_model(X_train_fold, y_train_fold)
            score = model.score(X_val_fold, y_val_fold)
            mlflow.log_metric("fold_accuracy", score)
            fold_scores.append(score)
    # Log aggregate metrics in parent run
    mlflow.log_metric("mean_cv_accuracy", np.mean(fold_scores))
    mlflow.log_metric("std_cv_accuracy", np.std(fold_scores))
Systematic experiment logging transforms machine learning from an ad-hoc process into a reproducible, analyzable science. MLflow’s flexible tracking system adapts to your workflow—whether you’re running quick experiments in notebooks, orchestrating hyperparameter searches, or managing production model training pipelines. The combination of parameters, metrics, and artifacts provides complete visibility into model development, enabling better decisions about what works and why.
The investment in proper experiment tracking pays immediate dividends. You stop wasting time re-running forgotten experiments, can confidently claim reproducibility, and build institutional knowledge about what approaches succeed in your domain. As your MLflow experiment database grows, it becomes a valuable asset for training new team members, analyzing trends in model performance, and making data-driven decisions about research directions. Start logging your experiments today, and you’ll wonder how you ever developed models without it.