How to Reduce GPU Memory During LLM Training

Running out of GPU memory mid-training is one of the most common blockers for ML engineers working with large models. The error message is unhelpful — CUDA out of memory — and the causes are varied. Memory pressure comes from model weights, optimizer states, gradients, activations, and the KV cache all competing for the same limited VRAM. The good news is that most OOM situations are solvable without buying more hardware. This guide covers the techniques that actually work, in roughly the order you should try them.

Understand Where Your Memory Is Going

Before applying fixes, measure. torch.cuda.memory_summary() gives a snapshot of current allocations. Log torch.cuda.max_memory_allocated() at the end of each training step to see peak usage. For a model with N parameters in bf16 mixed precision with Adam: weights use 2N bytes, gradients 2N bytes, Adam states 8N bytes, and the fp32 master weight copy 4N bytes — totalling 16N bytes before a single activation is computed. A 7B model’s training state alone is 112GB. Knowing this breakdown tells you which lever to pull first.
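
As a rough sketch of both the back-of-envelope estimate and the runtime measurement, assuming model is an already-built nn.Module and a training step has just completed; the helper name is illustrative:

import torch

def estimate_training_state_bytes(num_params: int) -> int:
    # bf16 weights (2N) + bf16 gradients (2N) + fp32 Adam moments (8N)
    # + fp32 master weights (4N) = 16N bytes, before any activations
    return 16 * num_params

num_params = sum(p.numel() for p in model.parameters())
print(f"estimated training state: {estimate_training_state_bytes(num_params) / 1e9:.0f} GB")

# After a training step, compare against what was actually allocated
print(torch.cuda.memory_summary())
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")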

Mixed Precision Training

If you are not already training in bf16 mixed precision, start here. Switching from fp32 to bf16 halves the memory used by weights and activations, and on Ampere and later GPUs bf16 matrix multiplications run 2–8x faster via Tensor Cores. Enable it with torch.autocast and keep an fp32 master copy of the weights for the optimizer step; with autocast the parameters stay in fp32 and only the compute runs in bf16, so this happens automatically. BF16 is preferred over fp16 for LLM training because it has the same exponent range as fp32, avoiding the overflow issues that require loss scaling with fp16.
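
A minimal sketch of a bf16 autocast training step, assuming a HuggingFace-style model whose forward returns an object with a .loss attribute:

import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

for batch in dataloader:
    optimizer.zero_grad(set_to_none=True)
    # Forward pass (and loss) run in bf16; parameters stay in fp32, so no GradScaler is needed
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(**batch).loss
    loss.backward()   # gradients are produced in fp32, matching the parameter dtype
    optimizer.step()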

Gradient Checkpointing

Activations stored during the forward pass for use in the backward pass are typically the largest memory consumer at large batch sizes or long sequences. By default, PyTorch stores all activations from the forward pass until the backward pass completes. For a transformer with 32 layers processing long sequences, this is substantial.

Gradient checkpointing trades compute for memory: instead of keeping every intermediate activation, it stores only the inputs to each checkpointed block and recomputes the rest on the fly during the backward pass. The memory saving is large, often 60–70% of activation memory, at the cost of roughly 30% additional compute. Enable it with model.gradient_checkpointing_enable() in HuggingFace Transformers or torch.utils.checkpoint.checkpoint for custom models. For most LLM fine-tuning workloads this is a near-mandatory optimization once your model exceeds a few billion parameters.
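
Both variants in one hedged sketch; the custom forward method is illustrative, not taken from any particular codebase:

from torch.utils.checkpoint import checkpoint

# HuggingFace Transformers models
model.gradient_checkpointing_enable()
model.config.use_cache = False  # the generation KV cache is incompatible with checkpointing

# Custom models: wrap each block so its internals are recomputed during backward
def forward(self, x):
    for block in self.blocks:
        x = checkpoint(block, x, use_reentrant=False)
    return x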

Gradient Accumulation

Gradient accumulation lets you simulate a large effective batch size without holding multiple batches in memory simultaneously. Instead of computing gradients on a batch of 32 and stepping the optimizer, you compute on 8 micro-batches of 4, accumulate the gradients, then step. Memory usage is determined by the micro-batch size, not the effective batch size. Provided you divide each micro-batch loss by the number of accumulation steps (and the model has no batch-size-dependent layers such as batch normalization), the gradient update is mathematically identical to using the large batch.

One pitfall with DDP: gradient synchronization happens after every backward pass by default. Use the no_sync context manager on accumulation steps to defer synchronization to the final step, avoiding unnecessary all-reduce communication overhead on intermediate steps.
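
A sketch of the accumulation loop with no_sync, assuming model is wrapped in DistributedDataParallel and returns a HuggingFace-style output with .loss:

import contextlib

accumulation_steps = 8  # 8 micro-batches of 4 = effective batch size 32

optimizer.zero_grad(set_to_none=True)
for step, batch in enumerate(dataloader):
    is_update_step = (step + 1) % accumulation_steps == 0
    # Skip the DDP gradient all-reduce on intermediate micro-batches
    sync_context = contextlib.nullcontext() if is_update_step else model.no_sync()
    with sync_context:
        loss = model(**batch).loss / accumulation_steps  # scale so the sum matches the big-batch mean
        loss.backward()
    if is_update_step:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)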

Optimizer Memory Reduction

Adam’s two moment estimates consume 8 bytes per parameter in fp32. For a 7B model that’s 56GB just for optimizer states. Several alternatives cut this significantly.

8-bit Adam from bitsandbytes quantizes optimizer states to 8-bit, reducing their memory footprint by 4x with negligible quality loss. Enable it by swapping torch.optim.AdamW for bitsandbytes.optim.AdamW8bit — no other code changes required. This is one of the highest-leverage, lowest-risk memory optimizations available and should be on by default for any large model training run.
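
The swap is a one-liner; the learning rate and weight decay shown are illustrative:

import bitsandbytes as bnb

# before: optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=2e-5, weight_decay=0.01)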

Adafactor factors the second-moment estimate of each n×m weight matrix into per-row and per-column statistics (n + m values instead of n·m) and, in its default configuration, drops the first moment entirely, cutting optimizer-state memory dramatically relative to Adam. The trade-off is that Adafactor can be less stable, particularly early in training and when its default relative step-size schedule is used. It works well for fine-tuning on standard tasks but warrants careful learning rate tuning. The Lion optimizer is another Adam alternative with a smaller memory footprint (it tracks only momentum, no second moment) and competitive quality on many tasks.
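
A hedged sketch of one common Adafactor setup for fine-tuning, using a fixed learning rate instead of Adafactor's default relative step-size schedule; the hyperparameters are illustrative:

from transformers.optimization import Adafactor

optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,                 # fixed LR; tune this carefully
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
)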

LoRA and QLoRA

If you need to fine-tune a large model on limited hardware, LoRA is the most impactful single intervention. By freezing the base model and training only small low-rank adapter matrices, LoRA reduces trainable parameters to under 1% of total. Optimizer states are only computed for trainable parameters — for a 7B model with r=8 LoRA adapters, Adam states consume megabytes rather than tens of gigabytes.

QLoRA goes further by quantizing the frozen base model to 4-bit NF4, reducing its footprint by roughly 4x. A 7B model that would require 112GB for full fine-tuning can be QLoRA fine-tuned in 10–12GB — within reach of a consumer RTX 4090. The quality trade-off versus full fine-tuning is minimal for most instruction tuning and domain adaptation tasks.
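
A sketch of a QLoRA setup using the bitsandbytes and peft integrations in Transformers; the model id, target modules, and LoRA hyperparameters are illustrative:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",   # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters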

Flash Attention

Standard attention materializes the full N×N attention score matrix in GPU memory, which grows quadratically with sequence length. For long sequences, this dominates memory usage. Flash Attention avoids this by tiling the computation into blocks that fit in fast on-chip SRAM, streaming results to HBM without ever materializing the full matrix. For a 4,096-token sequence, standard attention stores hundreds of megabytes of intermediate scores per layer; Flash Attention reduces this to near zero.

Enable Flash Attention in HuggingFace Transformers with attn_implementation="flash_attention_2". PyTorch 2.0 and later also provides F.scaled_dot_product_attention, which selects a fused Flash Attention kernel automatically on CUDA when the inputs allow it. If you are training on sequences longer than 2,048 tokens without Flash Attention enabled, it should be the first thing you add; both the memory and speed gains are significant.
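
A brief sketch of both paths; the model id is illustrative, and q, k, v stand in for query, key, and value tensors you already have:

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",             # illustrative model id
    torch_dtype=torch.bfloat16,               # Flash Attention requires fp16 or bf16
    attn_implementation="flash_attention_2",  # requires the flash-attn package to be installed
)

# Plain PyTorch: the fused SDPA kernel is chosen automatically when inputs allow it
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)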

Sequence Length and Batch Size Tuning

Activation memory grows steeply with sequence length: the attention score matrices scale quadratically, and every other activation scales linearly, so halving your sequence length typically cuts activation memory by 2–4x. If your task doesn't require the full context length your model supports, truncating sequences is one of the highest-leverage, lowest-cost memory reductions available. Use dynamic padding (pad to the longest sequence in each batch rather than a fixed maximum) to avoid wasting compute and memory on short sequences padded unnecessarily.
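
A sketch of truncation plus dynamic padding with the Transformers tokenizer and collator, assuming dataset is a Hugging Face datasets.Dataset with a "text" column; the max length and model id are illustrative:

import torch
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")  # illustrative
# Truncate to what the task needs, not the model's maximum context
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    remove_columns=["text"],
)
# Pad each batch to its own longest sequence rather than a fixed maximum
collator = DataCollatorWithPadding(tokenizer)
loader = torch.utils.data.DataLoader(dataset, batch_size=4, collate_fn=collator)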

For batch size, reduce it and compensate with gradient accumulation to maintain the same effective batch size. Batch size 1 with 32 accumulation steps produces an equivalent gradient update to batch size 32 at a fraction of the memory cost.

CPU Offloading

As a last resort, CPU offloading moves optimizer states and parameters to CPU RAM, retrieving them to GPU only when needed. DeepSpeed ZeRO Stage 3 with CPU offload and FSDP's cpu_offload option both support this. The cost is real: PCIe bandwidth is well over an order of magnitude slower than GPU HBM, and training throughput commonly drops by 30–60%. But for teams with large CPU RAM (512GB+) and models that otherwise can't fit, it makes otherwise-impossible training runs feasible.

Putting It Together: A Practical Checklist

Apply these in order and check memory after each:

1. Enable bf16 mixed precision.
2. Add gradient checkpointing.
3. Switch to 8-bit Adam.
4. Reduce the micro-batch size and add gradient accumulation.
5. Enable Flash Attention if using sequences over 2,048 tokens.
6. Switch to LoRA if full fine-tuning isn't required.
7. Try QLoRA if memory is still tight.
8. Use CPU offloading only as a last resort.

Most OOM situations resolve within the first three or four steps. The combination of gradient checkpointing, 8-bit Adam, and LoRA reduces the memory required to fine-tune a 7B model from 112GB to under 20GB, achievable on a single A100 40GB with room for activations at reasonable sequence lengths. Each technique is independently useful; together they make large-model fine-tuning accessible on hardware most teams actually have.

Multi-GPU Sharding

If you have multiple GPUs and the model still doesn't fit after applying the above techniques, sharding the training state across GPUs is the next step. FSDP FULL_SHARD and DeepSpeed ZeRO Stage 3 both shard model weights, gradients, and optimizer states across all GPUs, with each GPU holding 1/N of the total. The sharded state scales linearly: 8 GPUs hold 8x as much model state, although activations remain per-GPU. The communication overhead is higher than DDP, but for models that genuinely can't fit otherwise, it's the right tool.

ZeRO Stage 2 is worth considering as an intermediate option: it shards only gradients and optimizer states (not weights), reducing their memory by a factor equal to the number of GPUs at communication overhead comparable to DDP. For 7B–13B models on 4–8 GPUs where optimizer states are the limiting factor, ZeRO Stage 2 is often the sweet spot: meaningful memory savings without the full parameter-gather overhead of Stage 3 or FSDP FULL_SHARD.
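
A minimal FSDP sketch showing both strategies, assuming the process group has already been initialized (for example via torchrun); the mixed precision settings are illustrative:

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy

# SHARD_GRAD_OP shards gradients and optimizer states (ZeRO-2 equivalent);
# FULL_SHARD additionally shards the parameters themselves (ZeRO-3 equivalent)
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.SHARD_GRAD_OP,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16, reduce_dtype=torch.bfloat16),
)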

Monitoring Memory During Training

Debugging memory issues is much easier with proper instrumentation. Add these three logging calls to your training loop: torch.cuda.max_memory_allocated() after the first few steps to see peak activation memory; torch.cuda.memory_reserved() to see how much the CUDA allocator has reserved from the OS; and torch.cuda.reset_peak_memory_stats() at the start of each step to get accurate per-step peaks rather than cumulative maxima. Logging these to your experiment tracker (wandb, MLflow, or similar) lets you see exactly when memory spikes occur — before the forward pass, after backward, or during the optimizer step — which directly tells you which technique to apply next.
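
A sketch of this instrumentation in a training loop; training_step is a stand-in for your own forward/backward/optimizer logic, and wandb is just one possible tracker:

import torch
import wandb

for step, batch in enumerate(dataloader):
    torch.cuda.reset_peak_memory_stats()   # per-step peak, not a cumulative maximum
    loss = training_step(model, batch)     # illustrative: forward + backward + optimizer step
    wandb.log({
        "memory/peak_allocated_gb": torch.cuda.max_memory_allocated() / 1e9,
        "memory/reserved_gb": torch.cuda.memory_reserved() / 1e9,
        "train/loss": loss.item(),         # .item() so the loss graph isn't kept alive
    }, step=step)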

One often-overlooked source of memory leaks during training is Python references keeping tensors alive longer than expected. If memory grows step-over-step without plateauing, check that you are detaching loss values before accumulating them for logging (loss.item(), not loss), and that intermediate tensors aren’t being appended to a list that persists across steps. These are easy bugs to write and hard to spot without per-step memory logging.

Activation Memory in Detail

Activation memory is often the largest and most controllable component of training memory, yet it’s the least well-understood. When PyTorch runs a forward pass, it saves intermediate tensors at each operation so they’re available for gradient computation in the backward pass. For a transformer layer, this includes the input to each sublayer, the attention scores, the softmax output, the attention output before projection, and the intermediate state of the FFN. Each of these tensors scales with batch size and sequence length.

The activation memory for a single transformer layer during training in bf16 is approximately 2 * batch_size * seq_len * hidden_dim * num_sublayers bytes, where num_sublayers accounts for the multiple points at which activations are saved. For a Llama 3 8B layer (hidden_dim=4096) with batch size 4 and sequence length 2048: roughly 2 * 4 * 2048 * 4096 * 10 ≈ 640MB per layer. Multiply by 32 layers and activation memory alone is over 20GB before parameters, gradients, or optimizer states.
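
The same arithmetic as a quick sketch; the helper and the num_sublayers constant follow the rule of thumb above, which ignores the seq_len × seq_len attention scores:

def activation_bytes_per_layer(batch_size, seq_len, hidden_dim, num_sublayers=10, bytes_per_elem=2):
    # bf16 activations saved at roughly 10 points per transformer layer
    return bytes_per_elem * batch_size * seq_len * hidden_dim * num_sublayers

per_layer = activation_bytes_per_layer(batch_size=4, seq_len=2048, hidden_dim=4096)
print(f"{per_layer / 2**20:.0f} MB per layer, {32 * per_layer / 2**30:.1f} GB over 32 layers")
# -> 640 MB per layer, 20.0 GB over 32 layers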

This is why gradient checkpointing has such a large impact on memory: instead of storing every intermediate activation, only the checkpoint-boundary activations are kept, and with checkpoints placed every sqrt(L) layers stored activation memory drops from O(L) to O(sqrt(L)). The recomputation cost is paid in compute time, but the memory savings unlock larger batch sizes or longer sequences that would otherwise be impossible. For long-context training (8K+ tokens), gradient checkpointing is effectively mandatory regardless of model size: activation memory scales quadratically with sequence length for the attention sublayer specifically, since attention scores are seq_len × seq_len tensors (unless Flash Attention is enabled).

CPU and NVMe Offloading

When GPU memory is truly exhausted and you can't reduce further without compromising training quality, CPU offloading moves tensors that aren't immediately needed to CPU RAM. This works because CPU RAM is typically much larger than GPU VRAM — a workstation with 128GB RAM can supplement a 24GB GPU substantially. The cost is PCIe bandwidth: transfers between CPU and GPU over PCIe 4.0 x16 top out at around 32 GB/s in each direction, which is fast enough for optimizer state offloading (which only moves data once per optimizer step) but too slow for activation offloading (which would need to happen every layer forward and backward).

DeepSpeed ZeRO-Offload implements optimizer state and gradient offloading to CPU automatically. For single-GPU fine-tuning of large models, this can make the difference between fitting a 13B model and running OOM. The throughput hit depends on how often the offloaded tensors are accessed — for optimizer states accessed once per step, the overhead is manageable; for frequently accessed tensors, PCIe becomes the bottleneck and training slows significantly.
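
One way to wire this up through the Hugging Face Trainer's DeepSpeed integration; the config keys follow DeepSpeed's schema, and the specific values shown are illustrative:

from transformers import TrainingArguments

ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

args = TrainingArguments(output_dir="out", bf16=True, deepspeed=ds_config)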

For models that don’t fit in GPU memory even with all other optimizations applied, ZeRO-Infinity extends offloading to NVMe storage. NVMe bandwidth (3–7 GB/s) is slower than PCIe RAM transfers, but NVMe capacity is essentially unlimited. This is primarily useful for running inference or evaluation on very large models on consumer hardware, and for training at extreme scales (100B+ parameters) where neither GPU nor CPU memory is sufficient. For typical fine-tuning workloads, ZeRO-Offload to CPU RAM is the relevant tool; NVMe offloading is a last resort with significant throughput implications.
