How to Use W&B Sweeps for Hyperparameter Search

Manual hyperparameter tuning — picking learning rate, batch size, and regularisation by intuition and editing config files — is time-consuming and rarely finds the actual optimum. W&B Sweeps automates this with Bayesian optimisation, random search, or grid search, running agents in parallel, logging every trial to your W&B project, and surfacing the best runs in a leaderboard you can inspect visually. This article covers how sweeps work mechanically, how to configure them for real training jobs, how to run distributed sweep agents, and how to interpret sweep results to actually improve your model rather than just collect charts.

How W&B Sweeps Work

A W&B sweep consists of two components: a sweep controller and one or more sweep agents. The controller runs on W&B’s servers and maintains a model of which hyperparameter configurations have been tried and which are promising to try next. Each agent is a process you launch on your own compute — a GPU machine, a Kubernetes pod, a cloud VM — that asks the controller for the next configuration to try, runs your training script with those hyperparameters, reports metrics back to the controller, and then asks for another configuration. This architecture lets you run many agents in parallel across multiple machines while the controller coordinates the search globally.

The three search strategies serve different purposes. Random search samples configurations independently from the specified distributions and is surprisingly effective when you have more than five hyperparameters, because random sampling covers the space more evenly than grid search at the same number of trials. Grid search exhaustively tries every combination in the specified discrete sets, which is only practical when you have a small, well-defined set of discrete choices (for example, three specific learning rates and two weight decay values). Bayesian optimisation fits a probabilistic model (a Gaussian process, in W&B's implementation) to the results of completed trials and uses it to select the next configuration most likely to improve on the current best. Bayesian search is more sample-efficient than random search on low-dimensional, smooth objectives — typically the right choice when each training run is expensive and you want good configurations in fewer total trials.
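For contrast with the Bayesian configuration in the next section, here is what a minimal grid search config looks like — note that grid search takes explicit values lists rather than distributions:

sweep_config_grid = {
    "method": "grid",
    "metric": {"name": "val/loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"values": [1e-4, 3e-4, 1e-3]},
        "weight_decay": {"values": [0.0, 0.01]},
    },
}
# Grid search runs every combination: 3 learning rates x 2 decay values = 6 runs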

Setting Up a Basic Sweep

import wandb

# Define sweep configuration as a dictionary
sweep_config = {
    "method": "bayes",         # "bayes", "random", or "grid"
    "metric": {
        "name": "val/loss",    # metric to optimise (must be logged in your training script)
        "goal": "minimize",    # "minimize" or "maximize"
    },
    "parameters": {
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-5,
            "max": 1e-2,
        },
        "batch_size": {
            "values": [16, 32, 64, 128],
        },
        "weight_decay": {
            "distribution": "log_uniform_values",
            "min": 1e-6,
            "max": 1e-2,
        },
        "warmup_steps": {
            "distribution": "int_uniform",
            "min": 0,
            "max": 500,
        },
        "dropout": {
            "distribution": "uniform",
            "min": 0.0,
            "max": 0.5,
        },
    },
    # Optional: stop underperforming runs early
    "early_terminate": {
        "type": "hyperband",
        "min_iter": 3,         # minimum epochs before a run can be terminated
        "eta": 3,              # aggressiveness; higher = more aggressive pruning
    },
}

# Create the sweep — returns a sweep_id
sweep_id = wandb.sweep(sweep_config, project="my-ml-project")
print(f"Sweep ID: {sweep_id}")

The log_uniform_values distribution samples uniformly in log space between min and max, which is the right choice for learning rates and weight decay because the plausible values span several orders of magnitude and you want as much probability mass on [1e-5, 1e-4] as on [1e-4, 1e-3]. A plain uniform distribution over the same range would concentrate almost all samples near the top of the range and almost never explore small values. The early_terminate block applies Hyperband pruning: runs that are not in the top 1/eta fraction after min_iter steps are terminated, freeing up compute for more promising configurations.
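To see the difference concretely, this standalone snippet (plain NumPy, independent of the W&B API) draws 10,000 samples from each distribution over [1e-5, 1e-2] and checks how many fall below 1e-4:

import numpy as np

rng = np.random.default_rng(0)
lo, hi = 1e-5, 1e-2

# log_uniform_values: sample uniformly in log space, then exponentiate back
log_uniform = np.exp(rng.uniform(np.log(lo), np.log(hi), size=10_000))
# plain uniform over the same range
plain_uniform = rng.uniform(lo, hi, size=10_000)

print(f"{(log_uniform < 1e-4).mean():.1%}")    # ~33% — one decade out of three
print(f"{(plain_uniform < 1e-4).mean():.1%}")  # ~0.9% — small values barely sampled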

Writing a Sweep-Compatible Training Script

Your training script needs one change to work with sweeps: instead of reading hyperparameters from argparse or a config file, read them from wandb.config after calling wandb.init(). The sweep agent populates wandb.config with the configuration the controller assigned before your script runs.

import wandb
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from transformers import get_linear_schedule_with_warmup  # provides the warmup scheduler used below

# Assumed to be defined elsewhere in your project:
# build_model, run_epoch, evaluate, train_dataset, val_loader, NUM_EPOCHS

def train():
    # wandb.init() reads the sweep-assigned config automatically when run by an agent
    run = wandb.init()
    config = wandb.config   # populated by the sweep controller

    # Use config values — never hardcode hyperparameters inside sweep training functions
    model = build_model(dropout=config.dropout)
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=config.learning_rate,
        weight_decay=config.weight_decay,
    )
    train_loader = DataLoader(train_dataset, batch_size=config.batch_size, shuffle=True)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=config.warmup_steps,
        num_training_steps=len(train_loader) * NUM_EPOCHS,
    )

    for epoch in range(NUM_EPOCHS):
        train_loss = run_epoch(model, train_loader, optimizer, scheduler)
        val_loss, val_acc = evaluate(model, val_loader)

        # Log metrics — the sweep controller watches these to build its model
        wandb.log({
            "epoch": epoch,
            "train/loss": train_loss,
            "val/loss": val_loss,      # must match "metric.name" in sweep_config
            "val/accuracy": val_acc,
        })

    run.finish()

# Launch agents — each agent calls train() in a loop until the sweep is done
# Run this on each machine / GPU you want to use for the sweep
wandb.agent(sweep_id, function=train, count=20)  # count: max runs per agent

Running Parallel Agents Across Multiple GPUs

The main advantage of W&B sweeps over tools like Optuna — which needs a shared storage backend before a study can be parallelised across processes — is that parallelising across machines is trivial, because the controller already lives on W&B's servers. Each agent is an independent process with no shared state except the W&B controller — start as many as you have compute for.

# On machine 1 (e.g., a 4xA100 node) — run one agent per GPU
for GPU_ID in 0 1 2 3; do
    CUDA_VISIBLE_DEVICES=$GPU_ID python train.py --sweep_id "$SWEEP_ID" --count 10 &
done
wait  # keep the shell alive until all background agents finish

# On machine 2 — same sweep_id, so its agents join the same sweep
for GPU_ID in 0 1 2 3; do
    CUDA_VISIBLE_DEVICES=$GPU_ID python train.py --sweep_id "$SWEEP_ID" --count 10 &
done
wait

# In train.py: accept sweep_id and count as CLI args for multi-machine use
import argparse, wandb

parser = argparse.ArgumentParser()
parser.add_argument("--sweep_id", type=str, required=True)
parser.add_argument("--count", type=int, default=20)
args = parser.parse_args()

# Pass project/entity too if sweep_id is the bare ID rather than "entity/project/id"
wandb.agent(args.sweep_id, function=train, count=args.count, project="my-ml-project")

If you are on a cluster with a job scheduler like SLURM, you can submit each agent as a separate job array task. This lets you use idle cluster capacity for hyperparameter search without reserving a block of nodes for the duration of the sweep — each job task runs one or more trials and exits when it hits its count limit or when the sweep is complete.

#!/bin/bash
#SBATCH --job-name=wandb_sweep_agent
#SBATCH --array=0-15          # 16 agents
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00

source activate myenv
# SWEEP_ID must be present in the job environment, e.g.
#   sbatch --export=ALL,SWEEP_ID=<sweep_id> sweep_agent.sbatch
python train.py --sweep_id "$SWEEP_ID" --count 5

Hyperband Early Termination in Practice

Early termination is one of the highest-leverage features of sweeps for expensive training jobs. Without it, every configuration runs to completion regardless of how bad the early-epoch validation loss looks. With Hyperband, the sweep controller eliminates the bottom fraction of runs after a minimum number of epochs and reallocates their compute to the survivors. The result is that the same total GPU-hours explore a much larger region of the hyperparameter space compared to running every configuration to completion.

The two key settings are min_iter (the minimum number of logged steps before a run is eligible for termination) and eta (the pruning factor). Setting eta=3 means only the top 1/3 of runs survive each pruning round. For most training jobs, setting min_iter to 20–30% of your total training epochs is a good starting point — enough for the validation loss to distinguish good runs from bad ones, but early enough to save significant compute on the worst configurations. Be cautious with very small values of min_iter for tasks with slow initial learning (like fine-tuning large models from a poor initialisation) — you may eliminate runs that would have caught up.
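The band schedule this implies is easy to sketch — checkpoints fall at roughly min_iter × eta^k logged steps (the exact placement is an implementation detail of W&B's pruner, so treat this as illustrative):

min_iter, eta = 3, 3

# Hyperband-style termination checkpoints: min_iter * eta^k
checkpoints = [min_iter * eta**k for k in range(4)]
print(checkpoints)  # [3, 9, 27, 81]
# At each checkpoint, roughly the bottom (1 - 1/eta) of still-running
# configurations — two thirds, for eta=3 — are terminated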

Interpreting Sweep Results

The W&B sweep UI shows a parallel coordinates plot where each vertical axis is a hyperparameter or metric and each line is one run coloured by the target metric. Runs that performed well show as a coherent band, and you can trace which hyperparameter ranges they came from. The most useful things to look for: first, whether the best runs cluster in a narrow region of the hyperparameter space (indicating convergence — the sweep has found a good region) or are scattered (indicating either more exploration is needed or the objective is noisy). Second, which hyperparameters show a strong gradient in the parallel coordinates plot versus which are flat — flat axes correspond to hyperparameters that do not matter much for your objective, and you can fix them at any reasonable value in subsequent searches.

import wandb

# Programmatically retrieve the best run from a completed sweep
api = wandb.Api()
sweep = api.sweep(f"your_entity/your_project/{sweep_id}")

# Pick the run with the lowest val/loss (the sweep's target metric)
best_run = min(sweep.runs, key=lambda r: r.summary.get("val/loss", float("inf")))
print(f"Best run: {best_run.name}")
print(f"Best val/loss: {best_run.summary['val/loss']:.4f}")
print(f"Config: {dict(best_run.config)}")

# Get all runs as a DataFrame for custom analysis
import pandas as pd
runs_data = []
for run in sweep.runs:
    row = dict(run.config)
    row["val_loss"] = run.summary.get("val/loss", None)
    row["val_acc"]  = run.summary.get("val/accuracy", None)
    row["run_name"] = run.name
    runs_data.append(row)
df = pd.DataFrame(runs_data).dropna(subset=["val_loss"]).sort_values("val_loss")
print(df.head(10))

Common Pitfalls and How to Avoid Them

The most common mistake is searching over too many hyperparameters simultaneously. Adding more axes to the search space grows it exponentially, meaning each additional hyperparameter reduces the density of samples in every region — the same 50 runs cover a 10-dimensional space far more sparsely than a 4-dimensional one. In practice, fix architectural hyperparameters (number of layers, hidden size, attention heads) before running a sweep, and focus the sweep on training dynamics hyperparameters: learning rate, learning rate schedule, weight decay, dropout, and batch size. These have the most impact on final performance and are the right scope for a sweep of 30–100 runs.

The second common issue is not logging enough intermediate metrics. The Hyperband pruner and the Bayesian model both rely on the metric values you log during training — if you only log the final validation loss at the end of training, early termination cannot work and the Bayesian model has less signal to work with. Log your target metric (at minimum) at the end of every epoch. For long training runs, logging every 100 steps gives the controller finer-grained information to work with. The logging overhead is negligible compared to the training compute.
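For long runs, one way to keep the watched metric fresh between epochs is a cheap partial evaluation every 100 steps — sketched below with hypothetical training_step and evaluate_subset helpers (substitute your own step logic and a small fixed validation split):

def run_epoch(model, train_loader, optimizer, scheduler):
    for step, batch in enumerate(train_loader):
        loss = training_step(model, batch, optimizer, scheduler)  # hypothetical helper
        if step % 100 == 0:
            # Evaluate on a small fixed validation subset so "val/loss" —
            # the metric the controller and pruner watch — updates mid-epoch
            wandb.log({
                "val/loss": evaluate_subset(model, val_subset_loader),
                "train/loss_step": float(loss),
            })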

Third: do not run a single sweep and treat the best configuration as final. Bayesian optimisation converges toward a region of the space, but the best individual run is subject to variance from random initialisation and data shuffling. Once the sweep has identified a promising region — a range of learning rates that consistently produce good runs, for example — run three to five final training jobs with the best configuration using different random seeds, and report the mean and standard deviation of those runs as your actual result. The sweep is a search tool, not a replacement for proper evaluation.
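Combining this with the API snippet from the previous section, the final step might look like the following sketch, where best_run comes from that snippet and train_with_config is a hypothetical wrapper that runs full training and returns the final validation loss:

import statistics

seed_results = []
for seed in [0, 1, 2, 3, 4]:
    run = wandb.init(project="my-ml-project",
                     config={**dict(best_run.config), "seed": seed})
    seed_results.append(train_with_config(wandb.config))  # hypothetical helper
    run.finish()

print(f"val/loss: {statistics.mean(seed_results):.4f} "
      f"± {statistics.stdev(seed_results):.4f}")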

Sweep Strategies for Fine-Tuning vs Training from Scratch

The right sweep strategy depends on whether you are training from scratch or fine-tuning a pretrained model, because the optimal hyperparameter ranges and the sensitivity of the objective to each hyperparameter differ substantially between the two settings. When training from scratch, the model is sensitive to learning rate across a wide range — runs with learning rates 10x too high will diverge, and runs 10x too low will converge too slowly or plateau early. The optimal weight decay and warmup settings also have a strong effect on final performance. Bayesian search with log-uniform distributions over learning rate and weight decay is effective here, and you typically need 50–100 runs to find a good configuration for a non-trivial architecture.

Fine-tuning pretrained models is different in two important ways. First, the optimal learning rate range is much narrower and lower — typically 1e-5 to 5e-4 for most transformer fine-tuning tasks, compared to 1e-4 to 1e-1 for training from scratch. Second, fine-tuning is often less sensitive to weight decay and dropout because the pretrained model’s representations are already well-regularised. In practice, for fine-tuning sweeps you can narrow the search ranges significantly based on the model size and the task, reduce the number of hyperparameters to search (often just learning rate and warmup steps are worth sweeping), and get good results with 20–30 runs rather than 100. Using the default AdamW settings from the HuggingFace Trainer as a starting point and running a tight random search around those values is often more efficient than a wide Bayesian sweep when fine-tuning.
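Put together, a fine-tuning sweep config following these rules of thumb might look like this sketch — the ranges come from the guidelines above and should be adjusted for your model and task:

finetune_sweep_config = {
    "method": "random",  # a tight random search often beats a wide Bayesian sweep here
    "metric": {"name": "val/loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-5,
            "max": 5e-4,  # much narrower and lower than from-scratch ranges
        },
        "warmup_steps": {"values": [0, 100, 500]},
        # Fixed at sensible defaults rather than searched
        "weight_decay": {"value": 0.01},
        "dropout": {"value": 0.1},
    },
}
sweep_id = wandb.sweep(finetune_sweep_config, project="my-ml-project")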

Using Sweep Results to Refine Your Search

A sweep is most valuable when treated as an iterative process rather than a one-shot search. The first sweep establishes which regions of the hyperparameter space are promising and which parameters matter. The parallel coordinates plot and the feature importance panel in the W&B sweep UI both help with this: the feature importance panel shows a correlation-based ranking of which hyperparameters account for the most variance in the target metric. If learning rate accounts for 70% of variance and dropout accounts for 3%, you should fix dropout at a reasonable value, narrow the learning rate search range to the region where your best runs clustered, and run a second, focused sweep. This iterative narrowing approach consistently finds better configurations in the same total number of trials compared to running a single wide sweep with many hyperparameters.
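One lightweight way to pick the narrowed range is to reuse the runs DataFrame built in the Interpreting Sweep Results section and bound the next sweep by where the best runs landed:

# df comes from the "Interpreting Sweep Results" snippet, sorted by val_loss
top_runs = df.head(10)
lr_min, lr_max = top_runs["learning_rate"].min(), top_runs["learning_rate"].max()
print(f"Next sweep learning_rate range: [{lr_min:.2e}, {lr_max:.2e}]")
# Use these as min/max of a narrowed log_uniform_values range, and fix
# low-importance parameters with {"value": ...} in the second sweep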

The W&B API also lets you resume a sweep — adding more runs to an existing sweep rather than starting over. If the first 30 runs have given the Bayesian model enough signal to focus its suggestions, adding another 20 runs to the same sweep will explore the high-promise region more densely than starting a new sweep would. You resume a sweep simply by passing the existing sweep_id to wandb.agent() — the controller picks up where it left off and continues to update its model with each new completed run.
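Resuming is the same wandb.agent call — from a different machine, use the fully qualified sweep path:

# Adds more runs to the existing sweep; the controller's model persists
wandb.agent(f"your_entity/your_project/{sweep_id}", function=train, count=20)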

The decision of when to stop a sweep comes down to two signals: whether the best validation metric has stopped improving across the last 15–20 runs (indicating the search has converged), and whether the parallel coordinates plot shows the best runs concentrating in a well-defined region rather than scattered across the full search space. When both are true, you have extracted the useful signal from the sweep and should move to final training runs with the best configuration and multiple seeds rather than continuing to accumulate more sweep trials.
