DDP vs FSDP vs DeepSpeed ZeRO: Choosing the Right Multi-GPU Training Strategy

When a model fits on a single GPU, training is straightforward. When it doesn’t — or when you need to scale throughput across many GPUs — you have three main options: PyTorch’s DistributedDataParallel (DDP), PyTorch’s Fully Sharded Data Parallel (FSDP), and Microsoft’s DeepSpeed with its ZeRO optimizer stages. All three distribute training across multiple GPUs, but they make different trade-offs between memory efficiency, communication overhead, and implementation complexity. Choosing the wrong one costs you OOM errors, slower training than necessary, or weeks of debugging an overly complex setup.

How GPU Memory Is Actually Used During Training

To understand why these three approaches exist, you need to know what consumes GPU memory during training. For a model with N parameters in fp32, the memory breakdown is: 4N bytes for model weights, 4N bytes for gradients, and 8N bytes for Adam optimizer states (first and second moments). That’s 16N bytes total just for the training state, before activations. A 7B parameter model requires roughly 112GB in fp32 — more than any single GPU. In bf16 mixed precision, weights and gradients drop to 2 bytes each, but the Adam moments stay in fp32 and the optimizer typically keeps an fp32 master copy of the weights as well, so the total is still about 16N bytes (2N + 2N + 8N + 4N). For a 7B model that’s still around 112GB. This is the memory problem all three approaches are trying to solve.
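To make the arithmetic concrete, here is a minimal sketch in plain Python using the accounting above:

```python
def training_state_gb(n_params: float, mixed_precision: bool = True) -> float:
    """Memory for weights + gradients + Adam states, excluding activations."""
    if mixed_precision:
        # bf16 weights (2) + bf16 grads (2) + fp32 moments (8) + fp32 master weights (4)
        bytes_per_param = 2 + 2 + 8 + 4
    else:
        # fp32 weights (4) + fp32 grads (4) + fp32 Adam moments (8)
        bytes_per_param = 4 + 4 + 8
    return n_params * bytes_per_param / 1e9

print(training_state_gb(7e9))  # ~112 GB for a 7B model, matching the text
```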

DistributedDataParallel (DDP)

DDP is PyTorch’s standard multi-GPU training approach. Each GPU holds a complete copy of the model, gradients, and optimizer states. Different GPUs process different batches of data. After each backward pass, DDP synchronizes gradients across all GPUs using an all-reduce operation, then each GPU independently runs the optimizer step on its local copy.

DDP does not solve the memory problem — it replicates it. Every GPU needs enough memory to hold the full model, gradients, and optimizer states. What DDP solves is throughput: with 8 GPUs you get 8x the effective batch size and roughly 8x the throughput minus communication overhead. For models that fit comfortably on a single GPU, DDP is the right choice. Its gradient synchronization is highly optimized and it is the simplest distributed training approach to implement and debug.

DDP’s gradient bucketing overlaps communication with backward computation. Rather than synchronizing one gradient at a time, DDP groups gradients into buckets (default 25MB) and starts the all-reduce as soon as a bucket is full. In practice, DDP on 8 A100s achieves 7–7.5x single-GPU throughput. The hard limit: DDP requires the full model to fit on each GPU. A 13B parameter model in bf16, even with optimizer states also held in bf16 (about 8 bytes per parameter), needs over 100GB per GPU — more than an A100 80GB can hold.
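A minimal sketch of the DDP setup and loop, where build_model and the dataloader are placeholders and the model is assumed to return a loss HuggingFace-style:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# launch with: torchrun --nproc_per_node=8 train_ddp.py
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = build_model().to(local_rank)  # placeholder model constructor
model = DDP(model, device_ids=[local_rank], bucket_cap_mb=25)  # 25MB buckets (the default)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for batch in dataloader:  # placeholder: a DistributedSampler-backed DataLoader
    loss = model(**batch).loss
    loss.backward()       # the all-reduce overlaps with this via gradient bucketing
    optimizer.step()
    optimizer.zero_grad()
```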

Fully Sharded Data Parallel (FSDP)

FSDP, introduced in PyTorch 1.12 and significantly improved through PyTorch 2.x, shards model parameters, gradients, and optimizer states across GPUs. With k GPUs, each GPU holds only a 1/k share of the total training state. A model requiring 112GB trained across 8 GPUs means each GPU holds roughly 14GB. This directly solves the memory problem that DDP cannot.

FSDP works by wrapping individual modules. During the forward pass, FSDP all-gathers the parameters for each module from all GPUs, runs the forward computation, then discards the gathered parameters, keeping only the local shard. The same happens during the backward pass. Parameters are gathered and discarded repeatedly — less memory per GPU, more communication.

FSDP has three sharding strategies. FULL_SHARD shards parameters, gradients, and optimizer states — maximum memory savings, most communication. SHARD_GRAD_OP shards only gradients and optimizer states, keeping full parameters on each GPU — less communication at higher memory cost. NO_SHARD is equivalent to DDP. In practice you use FULL_SHARD when you need FSDP for memory.

The wrapping policy determines which modules get their own FSDP unit. The transformer auto-wrap policy wraps each transformer block independently and is almost always correct for transformer models. HuggingFace Accelerate handles this automatically for supported architectures — the recommended way to use FSDP for standard fine-tuning workflows.
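A sketch using the raw PyTorch FSDP API for illustration — LlamaDecoderLayer stands in for whatever block class your model uses, and `model` is assumed to be already built; Accelerate generates the equivalent configuration for you:

```python
import functools
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={LlamaDecoderLayer},  # each decoder block becomes one FSDP unit
)

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params, grads, optimizer state
    auto_wrap_policy=wrap_policy,
    device_id=torch.cuda.current_device(),
)
```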

CPU offloading is FSDP’s escape valve for extreme memory constraints. FSDP can offload sharded parameters and optimizer states to CPU RAM, dramatically reducing GPU memory at the cost of PCIe bandwidth becoming the bottleneck. CPU offload typically reduces throughput by 30–50%. Always profile before enabling it.
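Enabling it is one extra argument on top of the sketch above:

```python
from torch.distributed.fsdp import CPUOffload

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    auto_wrap_policy=wrap_policy,
    cpu_offload=CPUOffload(offload_params=True),  # shards live in CPU RAM between uses
    device_id=torch.cuda.current_device(),
)
```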

DeepSpeed ZeRO

DeepSpeed, from Microsoft Research, takes a similar sharding approach to FSDP but packages it with a broader set of optimizations. The ZeRO (Zero Redundancy Optimizer) stages progressively shard more of the training state.

ZeRO Stage 1 shards only optimizer states. Each GPU holds full model weights and gradients, but Adam states are split across GPUs. This gives roughly 4x memory reduction for optimizer states at minimal communication overhead. Stage 1 is often overlooked — it’s a low-risk starting point with almost no throughput cost and meaningful memory savings.

ZeRO Stage 2 adds gradient sharding. Each GPU handles the optimizer step for its gradient shard, then all-gathers the updated parameters so every GPU ends the step with a full copy. Memory savings are roughly 8x compared to a naive baseline. Communication overhead is comparable to DDP because the gradient all-reduce in DDP and the reduce-scatter plus all-gather in ZeRO Stage 2 have similar total communication volume. Stage 2 is the practical sweet spot: significant memory savings at DDP-comparable throughput.

ZeRO Stage 3 shards parameters as well, equivalent to FSDP FULL_SHARD. Memory scales linearly with GPU count — 16 GPUs gives 16x the effective memory capacity. Communication overhead is highest here because parameter all-gathers happen during every forward and backward pass. For models where even Stage 2 doesn’t fit, Stage 3 is the path forward. ZeRO-Infinity extends Stage 3 with NVMe offloading for research-scale pre-training at hundreds of billions of parameters.
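In practice the stage is a single field in the DeepSpeed JSON config. A minimal sketch — the "auto" values assume the HuggingFace integration, and the flags shown are illustrative rather than tuned recommendations:

```json
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
```

Moving to Stage 3 is a one-line change ("stage": 3), optionally adding "offload_optimizer" and "offload_param" blocks for the offloading paths described above.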

Beyond ZeRO stages, DeepSpeed includes a fused CUDA Adam implementation that is typically faster than PyTorch’s default Adam, tight gradient accumulation integration, and activation checkpointing. These extras are why some teams use DeepSpeed Stage 2 even when FSDP would give similar memory savings — the bundled optimizer can improve throughput meaningfully.

Head-to-Head on the Axes That Matter

Memory efficiency: FSDP FULL_SHARD and ZeRO Stage 3 are equivalent — both shard everything and scale linearly with GPU count. ZeRO Stage 1 and 2 offer intermediate options DDP doesn’t have. If your model fits on a single GPU, DDP wins on simplicity.

Throughput: For models that fit with DDP, DDP is usually fastest. FSDP FULL_SHARD and ZeRO Stage 3 typically reduce throughput 10–20% versus DDP at the same batch size due to parameter all-gathers. ZeRO Stage 2 is competitive with DDP for throughput while giving substantial memory savings — this is the sweet spot for many fine-tuning workloads.

Setup complexity: DDP is easiest — wrap your model in DistributedDataParallel and launch with torchrun. FSDP is straightforward via HuggingFace Accelerate. DeepSpeed requires a JSON config file and has the steepest learning curve, though the HuggingFace Accelerate integration removes most of the pain for standard workflows.

Debugging: DDP is easiest to debug. FSDP can produce confusing errors when wrapping is misconfigured. DeepSpeed Stage 3 has the hardest debugging surface — parameter gathering issues, non-standard state dict formats, and checkpoint load incompatibilities are common pain points for teams new to it.

Checkpoint compatibility: DDP checkpoints are standard PyTorch state dicts. FSDP requires either consolidating shards before saving or saving distributed checkpoints with matching GPU count. DeepSpeed Stage 3 has its own checkpoint format requiring conversion to use outside of DeepSpeed. Both have improved tooling recently, but it’s a real operational consideration when planning your training pipeline.

A Practical Example: Fine-Tuning Llama 3 8B and 70B

Llama 3 8B trained in bf16, with gradients and Adam states also held in bf16 (about 8 bytes per parameter), requires approximately 64GB of training state. On 4x A100 80GB GPUs: with DDP, each GPU needs the full 64GB — fits on an 80GB A100 with about 16GB left for activations, and DDP works well. With FSDP FULL_SHARD, each GPU holds 16GB of sharded state, leaving 64GB for activations and larger batches. With ZeRO Stage 2, each GPU keeps the full 16GB of bf16 weights plus shards of gradients and optimizer states (roughly 28GB total), with potentially faster optimizer steps via DeepSpeed’s fused Adam.

Llama 3 70B changes the picture entirely. Under the same 8-byte-per-parameter accounting, the training state alone is roughly 560GB. You need at least 8x A100 80GB GPUs. With FSDP FULL_SHARD across 8 GPUs, each holds 70GB — tight but feasible. Across 16 GPUs, each holds 35GB with comfortable room for activations. With ZeRO Stage 3 across 16 GPUs, the memory profile is equivalent to FSDP; the choice comes down to whether you want DeepSpeed’s optimizer and offloading options or PyTorch’s native tooling.

The Decision Framework

Use DDP if your model fits on a single GPU including optimizer states and activations at your target batch size. It’s simpler, faster to set up, and easier to debug. Don’t introduce sharding complexity you don’t need.

Use FSDP FULL_SHARD if your model doesn’t fit on a single GPU and you want to stay within PyTorch native tooling. This is the right default for 7B–70B fine-tuning in 2026. Configure it via HuggingFace Accelerate rather than the raw FSDP API.

Use ZeRO Stage 2 if memory is manageable but you want DeepSpeed’s fused optimizer for throughput, or if you need the intermediate memory savings of gradient sharding without full parameter sharding overhead. It’s the best choice when throughput matters more than maximum memory reduction.

Use ZeRO Stage 3 for 70B+ parameter training where FSDP is tight or insufficient, or where you need CPU/NVMe offloading via ZeRO-Infinity. Accept the added debugging overhead as the cost of operating at that scale.

Don’t reach for ZeRO Stage 3 when FSDP works. FSDP has better PyTorch integration, cleaner checkpoint tooling, and is easier to debug. ZeRO Stage 3’s advantages are its offloading options and the DeepSpeed optimizer — if neither is a priority, FSDP is the cleaner choice.

Network Topology and Bandwidth Considerations

The communication pattern each strategy uses matters significantly depending on your hardware topology. DDP’s AllReduce is a well-optimized collective that performs well on NVLink-connected GPUs within a single node — NCCL’s ring-AllReduce saturates NVLink bandwidth efficiently, and gradient synchronization typically takes 5–15% of step time for models up to 13B parameters. Across nodes over InfiniBand, DDP remains efficient because it issues a single bandwidth-optimal AllReduce per step, and gradient bucketing lets most of that communication overlap with the backward pass.

FSDP’s communication pattern is more complex. It issues AllGather calls before each forward and backward pass to reconstitute sharded parameters, and ReduceScatter calls after the backward to re-shard the gradients. The total communication volume is roughly 1.5x that of DDP (two parameter AllGathers plus one gradient ReduceScatter per step, versus the reduce-scatter-plus-all-gather equivalent of a single AllReduce), but because FSDP prefetches the next layer’s parameters while the current layer is computing, the communication can be largely hidden behind compute on fast interconnects. On NVLink this works well; on slower PCIe multi-GPU setups or across nodes over 100GbE Ethernet, the communication overhead can become the bottleneck. Always benchmark on your actual hardware rather than extrapolating from NVLink benchmarks.

DeepSpeed ZeRO’s communication overhead depends heavily on the stage. Stage 1 and Stage 2 add modest overhead over DDP. Stage 3’s AllGather/ReduceScatter on parameters is the most communication-intensive pattern and benefits most from high-bandwidth interconnects. ZeRO-Infinity’s NVMe offloading introduces storage I/O as a bottleneck — NVMe bandwidth (typically 3–7 GB/s per drive) becomes the ceiling for parameter loading, and training throughput drops accordingly. ZeRO-Infinity is a tool for fitting models that otherwise can’t train at all, not for maximizing throughput on models that fit without it.

Checkpointing and Resuming Training

Saving and loading checkpoints reliably is a practical concern that varies significantly across the three strategies. DDP is the simplest: only one process (typically rank 0) needs to save the model, since all processes have identical full model copies. Loading is equally straightforward — load on rank 0 and let DDP distribute.
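A sketch of the usual rank-0 save pattern (the checkpoint path is a placeholder):

```python
import torch
import torch.distributed as dist

if dist.get_rank() == 0:
    # .module unwraps the DDP wrapper, giving a plain PyTorch state dict
    torch.save(model.module.state_dict(), "checkpoint.pt")
dist.barrier()  # hold the other ranks until rank 0 finishes writing
```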

FSDP requires more care. Each process holds only a shard of the model, so naively saving from rank 0 only saves rank 0’s shard. The correct approach is either to use FSDP’s FULL_STATE_DICT option, which gathers all shards to rank 0 before saving (simple but memory-intensive — requires enough CPU RAM to hold the full model), or to use SHARDED_STATE_DICT, which saves each rank’s shard separately and requires the same number of ranks to load correctly. PyTorch’s distributed checkpoint API (torch.distributed.checkpoint) handles sharded saving and loading and is the recommended approach for large models where gathering to rank 0 would OOM.
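A sketch of the FULL_STATE_DICT path described above (gather to rank 0, then save):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    StateDictType,
    FullStateDictConfig,
)

cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
    state = model.state_dict()  # gathers all shards; full dict exists only on rank 0

if dist.get_rank() == 0:
    torch.save(state, "checkpoint.pt")
```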

DeepSpeed has its own checkpoint format that includes optimizer states, ZeRO sharding state, and the model weights together. DeepSpeed checkpoints are not directly loadable with standard PyTorch tools — you need to either use DeepSpeed’s own loading utilities or run a conversion script to extract HuggingFace-compatible weights. This is a real friction point when switching between DeepSpeed training and HuggingFace inference tooling, and worth factoring into your choice if checkpoint portability matters for your workflow.
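DeepSpeed bundles a conversion utility for the extraction step; a sketch, where the checkpoint directory path is a placeholder:

```python
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# consolidates the ZeRO shards into a single fp32 state dict on CPU
state_dict = get_fp32_state_dict_from_zero_checkpoint("path/to/checkpoint_dir")
```

The same logic ships as the zero_to_fp32.py script that DeepSpeed writes into each checkpoint directory.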

Mixed Precision and Memory Efficiency

All three strategies work with bf16 mixed precision training and interact with it in slightly different ways. DDP with bf16 is the most straightforward: depending on the recipe, parameters are either stored in bf16 outright or kept in fp32 with computation autocast to bf16, while optimizer states stay in fp32 as standard. The memory savings from bf16 are significant — roughly halving parameter and gradient memory versus fp32 when weights are stored in bf16 — and the Tensor Core acceleration on Ampere and newer GPUs makes it preferable to fp16 for most use cases, because the wider dynamic range eliminates the need for loss scaling.

FSDP with bf16 enables mixed precision at the sharding level: sharded parameters are stored in bf16, AllGathered for computation in bf16, and gradients reduced in fp32 via the reduce_dtype parameter. This gives the memory benefits of bf16 storage without sacrificing the precision of gradient accumulation. The MixedPrecision policy in FSDP is worth configuring explicitly rather than accepting defaults — matching param_dtype, reduce_dtype, and buffer_dtype to your model’s requirements avoids subtle precision issues that can cause training divergence on some architectures.
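A sketch of an explicit policy along the lines described above, assuming `model` is already built:

```python
import torch
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)

bf16_policy = MixedPrecision(
    param_dtype=torch.bfloat16,   # sharded storage and AllGather in bf16
    reduce_dtype=torch.float32,   # gradient ReduceScatter accumulates in fp32
    buffer_dtype=torch.bfloat16,  # non-parameter buffers in bf16
)

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    mixed_precision=bf16_policy,
    device_id=torch.cuda.current_device(),
)
```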

DeepSpeed’s bf16 integration via the bf16 config block or the ZeRO optimizer’s dtype settings is functional but requires more manual configuration than PyTorch-native mixed precision. The DeepSpeed BF16Optimizer handles the precision casting internally, which means standard PyTorch GradScaler usage doesn’t apply — remove any GradScaler calls from your training loop when using DeepSpeed’s bf16 optimizer or you’ll encounter errors.
