If you’ve worked with deep learning models in PyTorch, you’ve probably encountered the dreaded error message: “RuntimeError: CUDA out of memory”. This is one of the most common problems when training or fine-tuning models on GPUs. It can be both frustrating and time-consuming, especially when you’re unsure why it’s happening or how to fix it.
In this guide, we’ll explore the PyTorch CUDA out of memory error in depth. You’ll learn why it happens, how to diagnose it, and most importantly, how to prevent and resolve it using practical tips and best practices. Whether you’re training CNNs, LSTMs, or transformer models, this guide will help you optimize memory usage and avoid GPU crashes.
What Is the CUDA Out of Memory Error in PyTorch?
When PyTorch tries to allocate more GPU memory than is available, it throws the following error:
RuntimeError: CUDA out of memory. Tried to allocate X GiB (GPU 0; total capacity Y GiB; already allocated Z GiB; free N MiB; reserved M MiB)
This error means your model, data, or intermediate operations exceed the memory available on your GPU. PyTorch uses a caching allocator that keeps freed memory reserved for reuse rather than returning it immediately, which can make the reported numbers harder to interpret.
Common Causes of CUDA Out of Memory Errors
1. Large Model Architectures
- Deep CNNs, RNNs, and transformers with millions of parameters can consume significant memory.
- Even medium-sized models can cause OOM if batch sizes are too large.
2. Large Batch Sizes
- The most common culprit.
- Each batch of data requires GPU memory for inputs, activations, and gradients, on top of the model weights (a rough size estimate follows this list).
3. No Gradient Clearing
- If you forget to clear gradients with optimizer.zero_grad(), stale gradients accumulate across iterations and the gradient buffers are never released (zero_grad(set_to_none=True) also frees their memory).
4. Storing Outputs or Intermediate Tensors
- Keeping outputs in lists or dictionaries without calling detach() retains their entire autograd graphs in memory.
5. Not Using torch.no_grad() During Inference
- Autograd builds a computation graph for every forward pass by default, which is unnecessary and wastes memory during evaluation.
6. Memory Fragmentation
- Even if free memory is available, fragmentation can prevent large block allocation.
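To see why batch size dominates, a rough back-of-the-envelope estimate helps (this sketch counts only the input tensor; activations, gradients, and optimizer state multiply it several times over):

import torch

# A float32 batch of 64 RGB images at 224x224 resolution
batch = torch.zeros(64, 3, 224, 224)
size_mib = batch.numel() * batch.element_size() / 1024**2
print(f"Input batch alone: {size_mib:.1f} MiB")   # about 36.8 MiB

Halving the batch size halves every per-batch tensor, which is why it is usually the quickest fix.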
How to Diagnose Memory Issues in PyTorch
Before fixing the error, it’s important to understand your memory usage.
Tools to Monitor GPU Usage:
nvidia-smi
- Run this in a terminal to check current GPU memory allocation.
Use PyTorch’s Memory Functions:
print(torch.cuda.memory_summary())
print(torch.cuda.memory_allocated())
print(torch.cuda.memory_reserved())
These help determine how much memory is in use and cached.
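If you prefer a compact readout while training, a small helper like the one below can be dropped into your script (a minimal sketch; the function name is arbitrary, and torch.cuda.mem_get_info() requires a reasonably recent PyTorch release):

import torch

def print_gpu_memory(tag=""):
    # Report allocated/reserved memory plus device-wide free/total figures, in MiB
    allocated = torch.cuda.memory_allocated() / 1024**2
    reserved = torch.cuda.memory_reserved() / 1024**2
    free, total = torch.cuda.mem_get_info()
    print(f"{tag} allocated={allocated:.0f} MiB, reserved={reserved:.0f} MiB, "
          f"free={free / 1024**2:.0f}/{total / 1024**2:.0f} MiB")

print_gpu_memory("after forward pass:")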
Practical Solutions to Fix CUDA Out of Memory Errors
1. Reduce Batch Size
This is the quickest and most effective solution.
train_loader = DataLoader(dataset, batch_size=16) # Reduce to fit GPU memory
If you need large batches for stability, try gradient accumulation:
accum_steps = 4
for i, (inputs, labels) in enumerate(loader):
    outputs = model(inputs)
    loss = criterion(outputs, labels) / accum_steps
    loss.backward()
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
2. Use torch.no_grad() During Inference
Disables gradient calculation and saves memory.
with torch.no_grad():
    output = model(input)
3. Use optimizer.zero_grad() or model.zero_grad()
Always clear gradients to avoid memory buildup:
optimizer.zero_grad()
4. Delete Unused Variables
Tensors that are no longer needed should be deleted and memory cleared.
del variable
import gc
gc.collect()
torch.cuda.empty_cache()
Note: empty_cache() does not increase the memory PyTorch itself can use (the caching allocator already reuses freed blocks); it returns cached memory to the GPU so that other processes, and tools like nvidia-smi, see it as free.
5. Use Mixed Precision Training
Leverage NVIDIA’s Automatic Mixed Precision (AMP) to reduce memory usage.
from torch.cuda.amp import GradScaler, autocast
scaler = GradScaler()
for inputs, labels in loader:
    optimizer.zero_grad()
    with autocast():
        outputs = model(inputs)
        loss = criterion(outputs, labels)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
6. Use detach() to Prevent Tensor History
Don’t store outputs without detaching:
outputs = model(inputs)
store.append(outputs.detach())
7. Use Checkpointing (Recompute Activations)
Trade computation for memory by re-computing intermediate activations.
from torch.utils.checkpoint import checkpoint
output = checkpoint(model, input)
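Checkpointing pays off most when it is applied to segments of a model rather than the whole model in one call. For nn.Sequential models, torch.utils.checkpoint.checkpoint_sequential splits the layers into chunks for you; here is a minimal sketch (layer sizes are arbitrary, and the use_reentrant argument assumes a recent PyTorch version):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Illustrative model; only activations at segment boundaries are kept in memory
model = nn.Sequential(
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 10),
).cuda()

x = torch.randn(32, 1024, device="cuda", requires_grad=True)
out = checkpoint_sequential(model, 2, x, use_reentrant=False)  # split into 2 segments
out.sum().backward()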
8. Clear the CUDA Cache Regularly
import torch
import gc
gc.collect()
torch.cuda.empty_cache()
torch.cuda.ipc_collect()
Use this after validation loops or long evaluations.
9. Simplify the Model or Reduce Parameters
If you’re designing the model from scratch, reduce:
- Number of layers
- Channels per layer
- Hidden dimensions
Use smaller architectures like MobileNet or EfficientNet for constrained environments.
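Before committing to an architecture, it is worth counting parameters and estimating the weight memory alone (gradients roughly double it, and Adam keeps two extra states per parameter). A quick sketch using a torchvision model as an example:

import torchvision

model = torchvision.models.resnet50()   # example architecture
n_params = sum(p.numel() for p in model.parameters())
weight_mib = sum(p.numel() * p.element_size() for p in model.parameters()) / 1024**2
print(f"{n_params / 1e6:.1f}M parameters, ~{weight_mib:.0f} MiB of float32 weights")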
10. Use a Smaller Input Resolution
Lowering image dimensions can drastically reduce memory usage.
transform = transforms.Resize((128, 128))
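In a full pipeline this usually sits inside a Compose; activation memory scales roughly with height times width, so going from 224x224 to 128x128 cuts it by about 3x (a minimal torchvision sketch):

from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize((128, 128)),   # down from e.g. 224x224
    transforms.ToTensor(),
])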
Best Practices to Prevent Future Memory Errors
✅ Set torch.backends.cudnn.benchmark = True
Lets cuDNN benchmark and cache the fastest convolution algorithms when input sizes are fixed. This is primarily a speed optimization; the selected algorithms may need some extra workspace memory, so turn it off if you are right at the limit.
✅ Monitor Training Regularly
Log memory usage to detect slow growth or leaks.
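A simple way to do this is to log the peak allocation once per epoch (a sketch; train_one_epoch and num_epochs are placeholders for your own loop):

import torch

for epoch in range(num_epochs):
    torch.cuda.reset_peak_memory_stats()
    train_one_epoch(model, train_loader)   # placeholder for your training code
    peak_mib = torch.cuda.max_memory_allocated() / 1024**2
    print(f"epoch {epoch}: peak GPU memory {peak_mib:.0f} MiB")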
✅ Use DataLoaders Efficiently
- Set pin_memory=True for faster host-to-GPU transfers
- Increase num_workers to speed up data throughput (a sample configuration follows this list)
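A typical configuration might look like this (the worker count depends on your machine):

from torch.utils.data import DataLoader

train_loader = DataLoader(
    dataset,
    batch_size=16,
    shuffle=True,
    num_workers=4,      # tune to your CPU core count
    pin_memory=True,    # faster host-to-GPU copies
)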
✅ Avoid Memory Leaks in Custom Loops
Be cautious when storing tensors, especially in lists or dictionaries.
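For example, appending raw loss tensors to a list keeps their entire autograd graphs alive; store plain Python numbers (or detached CPU copies) instead. A sketch reusing the model, criterion, and optimizer from the earlier examples:

losses = []
for inputs, targets in train_loader:
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    # losses.append(loss)        # keeps the whole graph in memory
    losses.append(loss.item())   # just a Python float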
Example: Putting It All Together
Here’s an example that incorporates most of the tips:
import torch
import torch.nn as nn
from torch.cuda.amp import GradScaler, autocast

device = torch.device("cuda")
criterion = nn.CrossEntropyLoss()  # adjust to your task

model = MyModel().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = GradScaler()

for epoch in range(10):
    model.train()
    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        with autocast():
            outputs = model(inputs)
            loss = criterion(outputs, targets)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    torch.cuda.empty_cache()  # release cached blocks after each epoch
When to Upgrade Hardware or Move to the Cloud
If you consistently run into memory limitations:
- Upgrade to GPUs with more VRAM (e.g., 16GB+)
- Use multi-GPU setups with DataParallel or DistributedDataParallel (see the sketch after this list)
- Offload large models to cloud platforms like AWS, GCP, or Azure
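As a minimal illustration, DataParallel splits each batch across all visible GPUs with a one-line change (DistributedDataParallel is generally preferred for serious multi-GPU training but requires more setup):

import torch

model = MyModel().cuda()
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)   # replicates the model and splits each batch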
Conclusion
The PyTorch CUDA out of memory error is one of the most frequent challenges in deep learning training. Thankfully, it’s also one of the most manageable—with the right techniques.
From reducing batch size and using mixed precision, to clearing gradients and leveraging checkpointing, there are many effective strategies for preventing and resolving memory issues in PyTorch.
Don’t let OOM errors slow you down. Monitor, optimize, and experiment with your training setup to achieve both memory efficiency and faster model convergence.
FAQs
Q: What does torch.cuda.empty_cache() do?
It releases unused cached memory from PyTorch's caching allocator back to the GPU, so other processes (and tools like nvidia-smi) see it as free. PyTorch itself can already reuse its cache without this call.
Q: Why does memory usage grow over time during training?
You may be storing tensors with autograd history or not clearing gradients properly.
Q: Is it okay to use del and gc.collect() regularly?
Yes, especially after large tensor usage or evaluation phases.
Q: Can mixed precision training hurt model accuracy?
In most cases, no. AMP maintains accuracy by selectively using FP16 and FP32 operations.
Q: How much VRAM is enough for deep learning?
8GB is the bare minimum for small models. 16–24GB is ideal for modern transformers and large datasets.