Why VRAM Is the Binding Constraint
When running an LLM locally or deploying one in production, GPU VRAM is almost always the binding constraint — not compute power, not CPU speed, not disk I/O. The model weights must fit in VRAM to run on the GPU at all. If they do not fit, the inference engine falls back to CPU offloading or system RAM, which is typically 10–50x slower than pure GPU inference. Understanding exactly how much VRAM a given model requires — and what factors affect that number — is fundamental to selecting hardware, estimating costs, and sizing deployments correctly.
This guide gives you the formulas, lookup tables, and rules of thumb to calculate VRAM requirements for any model, at any precision, with any context length, for both inference and training workloads.
The Core Formula: Model Weight Memory
The dominant component of VRAM usage is storing the model’s weights. The calculation is straightforward:
Model weight memory = (number of parameters) × (bytes per parameter)
The bytes per parameter depends on the numerical precision you use:
- FP32 (float32): 4 bytes per parameter — rarely used for inference, mostly for training
- BF16 / FP16 (bfloat16 / float16): 2 bytes per parameter — standard for modern inference
- INT8: 1 byte per parameter — 8-bit quantisation, modest quality trade-off
- INT4 / Q4: 0.5 bytes per parameter — 4-bit quantisation, larger quality trade-off
- Q4_K_M, Q5_K_M, Q8_0: GGUF quantisation formats with sizes between these values
For a 7 billion parameter model at FP16: 7 × 10⁹ × 2 bytes = 14 GB. For the same model at Q4: 7 × 10⁹ × 0.5 bytes = 3.5 GB. This is why quantisation is so widely used — it makes models that would not fit on consumer GPUs suddenly viable.
Quick Reference: Weight Memory by Model Size
Here are the model weight memory requirements for common sizes at each precision. These are the weights only — you need to add KV cache and overhead on top:
Model Size | FP16 | INT8 | Q4 (approx)
------------|---------|---------|------------
1B params | 2 GB | 1 GB | 0.5 GB
3B params | 6 GB | 3 GB | 1.5 GB
7B params | 14 GB | 7 GB | 3.5 GB
8B params | 16 GB | 8 GB | 4.0 GB
13B params | 26 GB | 13 GB | 6.5 GB
20B params | 40 GB | 20 GB | 10.0 GB
34B params | 68 GB | 34 GB | 17.0 GB
70B params | 140 GB | 70 GB | 35.0 GB
72B params | 144 GB | 72 GB | 36.0 GB
405B params | 810 GB | 405 GB | 202 GB
These numbers tell you the minimum VRAM needed just to load the weights. Real deployments require more — add 20–30% overhead for the KV cache, activations, and framework buffers at minimum.
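The table values follow directly from the weight formula. The sketch below regenerates them, assuming decimal gigabytes (10⁹ bytes) as the table does and a flat 0.5 bytes per parameter for Q4. The names `BYTES_PER_PARAM` and `weight_memory_gb` are illustrative and do not come from any library.

```python
# Bytes per parameter for the precisions listed above (Q4 treated as a flat 0.5).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "q4": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Weight memory in decimal GB: parameters x bytes per parameter."""
    return params_billions * 1e9 * BYTES_PER_PARAM[precision] / 1e9


# Print one table row per model size.
for size in (1, 3, 7, 8, 13, 20, 34, 70, 72, 405):
    row = [weight_memory_gb(size, p) for p in ("fp16", "int8", "q4")]
    print(f"{size}B params | " + " | ".join(f"{gb:g} GB" for gb in row))
```

Real GGUF files at Q4_K_M come out somewhat larger than the flat 0.5 bytes assumed here, so treat the Q4 column as a lower bound.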
The KV Cache: The Other Big Consumer
Beyond model weights, the KV (key-value) cache for the attention mechanism is the second major VRAM consumer during inference. The KV cache grows with context length and batch size, and for long-context workloads it can actually exceed the weight memory.
The formula for KV cache memory:
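KV cache = 2 × num_layers × num_kv_heads × head_dim × seq_len × batch_size × bytes_per_element
The leading 2 covers keys and values. num_kv_heads is the number of key/value heads: it equals the number of attention heads under standard multi-head attention, but is smaller in models that use grouped-query attention (GQA).
For practical estimation with multi-head attention, where num_kv_heads × head_dim equals hidden_dim, a simpler FP16 approximation works well:
KV cache ≈ (seq_len × batch_size × num_layers × hidden_dim × 2 × 2) / (1024³) GB
For a model with 32 layers and a 4096 hidden dimension (the shape of Llama 3.1 8B) at batch size 1 and 4096 token context in FP16:
KV cache ≈ (4096 × 1 × 32 × 4096 × 2 × 2) / (1024³)
= 2,147,483,648 bytes / (1024³)
= 2.0 GB
At 32K context that becomes ~16 GB, larger than the model weights at Q4. Llama 3.1 8B itself uses GQA with 8 KV heads of dimension 128, so its real cache is a quarter of this upper bound: about 0.5 GB at 4K and 4 GB at 32K. Either way, the cache grows linearly with context length and batch size, which is why context length has such a dramatic impact on VRAM requirements for long-context workloads.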
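The per-token cost of the cache depends on the number of key/value heads, which can be smaller than the number of attention heads. The sketch below, with the illustrative name `kv_cache_bytes`, compares a 32-layer model under full multi-head attention against the same model with 8 KV heads. The head counts are assumptions for the example, so check the model's configuration file for the real values.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_element: int = 2) -> int:
    """Keys and values (the factor 2) for every layer, KV head, and token."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)


# Full multi-head attention: 32 KV heads x 128 dims = 4096, i.e. hidden_dim.
mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
# Grouped-query attention with 8 KV heads (assumed configuration).
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)

print(f"MHA: {mha / GIB:.2f} GiB, GQA: {gqa / GIB:.2f} GiB")
```

With these inputs the multi-head figure matches the hidden_dim approximation, and the grouped-query cache is four times smaller.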
Total VRAM Estimate: A Practical Formula
For inference, a reliable rule of thumb that accounts for weights, KV cache at typical context lengths, and framework overhead:
Total VRAM ≈ (model weights) × 1.2 + KV cache
The 1.2 multiplier adds 20% for activations, temporary buffers, and CUDA context overhead. Here is a Python calculator you can adapt:
import math
from typing import Optional


def estimate_vram_gb(
    params_billions: float,
    precision: str = "fp16",
    context_length: int = 4096,
    batch_size: int = 1,
    num_layers: Optional[int] = None,
    hidden_dim: Optional[int] = None,
) -> dict:
    """Estimate inference VRAM. Results are in GiB (1024**3 bytes), so 7B at FP16 shows as 13.04, not 14."""
    bytes_per_param = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5, "q4": 0.5, "q8": 1}
    bpp = bytes_per_param.get(precision.lower(), 2)  # unknown precision falls back to 2 bytes
    weight_gb = (params_billions * 1e9 * bpp) / (1024**3)

    # Estimate KV cache if layer/dim info provided.
    # For grouped-query attention models, pass num_kv_heads * head_dim as hidden_dim.
    if num_layers and hidden_dim:
        kv_bytes = 2 * num_layers * hidden_dim * context_length * batch_size * 2  # 2 bytes for fp16
        kv_gb = kv_bytes / (1024**3)
    else:
        # Rough, deliberately pessimistic approximation:
        # ~0.5 GB per billion params per 4K context at fp16, per sequence in the batch
        kv_gb = params_billions * 0.5 * (context_length / 4096) * batch_size

    total_gb = weight_gb * 1.2 + kv_gb
    return {
        "weight_gb": round(weight_gb, 2),
        "kv_cache_gb": round(kv_gb, 2),
        "overhead_gb": round(weight_gb * 0.2, 2),
        "total_gb": round(total_gb, 2),
        # Add 15% headroom, then round up to the next multiple of 8 GB
        "recommended_vram": f"{math.ceil(total_gb * 1.15 / 8) * 8} GB",
    }


# Examples
print(estimate_vram_gb(7, "fp16", 4096))
# {'weight_gb': 13.04, 'kv_cache_gb': 3.5, 'overhead_gb': 2.61, 'total_gb': 19.15, 'recommended_vram': '24 GB'}
print(estimate_vram_gb(70, "int4", 8192))
# {'weight_gb': 32.6, 'kv_cache_gb': 70.0, 'overhead_gb': 6.52, 'total_gb': 109.12, 'recommended_vram': '128 GB'}
# The second result is dominated by the fallback KV heuristic; pass num_layers and hidden_dim for a tighter figure.
Common GPU VRAM Capacities in 2026
Here is where commonly available GPUs sit on the VRAM ladder, and what they can handle:
GPU                       | VRAM   | Max Model (FP16)          | Max Model (Q4)
--------------------------|--------|---------------------------|----------------
RTX 3060                  | 12 GB  | 3B                        | 7B
RTX 3090 / 4090           | 24 GB  | 7B                        | 13B–20B
RTX 4000 Ada (workstation)| 20 GB  | 7B                        | 13B
A10G (AWS g5)             | 24 GB  | 7B                        | 13B–20B
RTX 6000 Ada              | 48 GB  | 20B                       | 34B
A30                       | 24 GB  | 7B                        | 13B–20B
A40                       | 48 GB  | 20B                       | 34B
A100 40GB                 | 40 GB  | 13B                       | 30B
A100 80GB                 | 80 GB  | 34B                       | 70B
H100 80GB                 | 80 GB  | 34B                       | 70B
H100 NVL (2× 94 GB pair)  | 188 GB | 70B                       | 140B+
2× A100 80GB              | 160 GB | 70B                       | 140B
4× A100 80GB              | 320 GB | 140B                      | 405B
8× H100 80GB              | 640 GB | 280B (405B needs 810 GB)  | 405B
“Max Model” here means the largest model whose weights fit with minimal context budget. For longer contexts or larger batch sizes, the effective maximum is smaller.
Multi-GPU Tensor Parallelism
When a model does not fit on a single GPU, tensor parallelism splits the model across multiple GPUs. With tensor parallelism of degree N, each GPU holds 1/N of the weights, so the effective per-GPU VRAM requirement for weights drops by N. However, the GPUs must communicate at each layer, so they need high-bandwidth interconnects — NVLink is strongly preferred over PCIe for tensor parallelism with N ≥ 4.
For planning multi-GPU deployments:
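import math

# Required GPUs = ceil(total_vram_needed / vram_per_gpu)
# Example: 70B FP16 model, 4096 context, batch 1
total_vram_gb = 140 * 1.2 + 35  # weights with 20% overhead + conservative KV estimate = 203 GB
required_gpus = math.ceil(total_vram_gb / 80)  # 3 by capacity on 80 GB cards
# Tensor-parallel degrees usually have to divide the attention head count, so plan for 4 rather than 3.
# With 4× A100 80GB (320 GB total): fits with headroom
# With 2× A100 80GB (160 GB total): does not fit at this estimate; reduce context or quantise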
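Capacity alone gives a lower bound on GPU count. Serving frameworks such as vLLM typically also require the tensor-parallel degree to divide the number of attention heads evenly. The sketch below combines both constraints; the function names are illustrative, and the 64-head figure for a 70B-class model is an assumption for the example.

```python
import math


def required_gpus(total_vram_gb: float, vram_per_gpu_gb: float) -> int:
    """Minimum GPU count by memory capacity alone."""
    return math.ceil(total_vram_gb / vram_per_gpu_gb)


def tensor_parallel_degree(total_vram_gb: float, vram_per_gpu_gb: float,
                           num_attention_heads: int) -> int:
    """Smallest GPU count that meets capacity and divides the head count."""
    n = max(1, required_gpus(total_vram_gb, vram_per_gpu_gb))
    # Step up until the heads split evenly across GPUs.
    while n <= num_attention_heads and num_attention_heads % n != 0:
        n += 1
    if n > num_attention_heads:
        raise ValueError("model needs more GPUs than it has attention heads")
    return n


# 70B FP16 example from above: 203 GB total on 80 GB cards, 64 heads assumed.
print(required_gpus(203, 80))               # capacity minimum
print(tensor_parallel_degree(203, 80, 64))  # usable tensor-parallel degree
```

For the 203 GB example, capacity alone asks for three 80 GB cards, while the divisibility constraint pushes the practical answer to four.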
Training vs. Inference VRAM: A Very Different Calculation
Training a model requires dramatically more VRAM than inference because you need to store not just the weights but also gradients, optimiser states, and activations for backpropagation. For full fine-tuning with AdamW:
Training VRAM ≈ 16 × (parameters in billions) GB with FP16 mixed precision
This “16× rule” breaks down as: 2 bytes for FP16 weights + 2 bytes for FP16 gradients + 4 bytes for FP32 master weights + 8 bytes for AdamW momentum and variance (4 bytes each in FP32) = 16 bytes per parameter. Activations come on top of that, and setups that keep gradients in FP32 need 18 bytes per parameter, so treat 16× as a planning floor rather than an exact figure.
For a 7B model: 7 × 16 = 112 GB — far beyond any single consumer GPU. This is why full fine-tuning of large models requires multi-GPU clusters, and why parameter-efficient methods like LoRA are so important for practical fine-tuning.
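The per-parameter accounting can be written out explicitly. The sketch below assumes mixed-precision AdamW with FP16 gradients by default, which is what yields 16 bytes per parameter. Passing `grad_bytes=4` models FP32 gradients instead. Activations are not included, and the function name is illustrative.

```python
def training_memory_gb(params_billions: float, grad_bytes: int = 2) -> dict:
    """Per-component training memory in GB (billions of params x bytes each)."""
    bytes_per_param = {
        "fp16_weights": 2,
        "gradients": grad_bytes,
        "fp32_master_weights": 4,
        "adamw_momentum": 4,
        "adamw_variance": 4,
    }
    breakdown = {name: params_billions * b for name, b in bytes_per_param.items()}
    breakdown["total"] = sum(breakdown.values())
    return breakdown


for name, gb in training_memory_gb(7).items():
    print(f"{name:>20}: {gb} GB")
```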
With LoRA (only training a small fraction of parameters), the VRAM requirement drops dramatically. A rough estimate for LoRA fine-tuning:
# LoRA VRAM ≈ weight_memory + (rank * 2 * target_layers * hidden_dim * 4 bytes)
# For 7B model, rank=16, targeting attention layers:
# LoRA params ≈ 0.1% of total → gradient/optimiser overhead is negligible
# Practical LoRA VRAM for 7B ≈ 14 GB (weights) + 4 GB (activations) = ~18 GB
# Fits on a single 24 GB GPU at FP16, or 8 GB GPU at Q4 base weights
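To see why the adapter overhead is negligible, count the trainable parameters. Each adapted weight matrix of shape d_in × d_out gains two low-rank factors holding rank × (d_in + d_out) parameters in total. The sketch below assumes a 7B-class shape (32 layers, hidden size 4096) with adapters on two square attention projections per layer; those choices and the function name are illustrative.

```python
def lora_param_count(rank: int, num_layers: int, target_shapes: list) -> int:
    """Trainable LoRA parameters.

    target_shapes lists the (d_in, d_out) shape of each adapted matrix
    within a single layer.
    """
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in target_shapes)
    return num_layers * per_layer


# Assumed setup: rank 16, two 4096 x 4096 projections adapted in each of 32 layers.
params = lora_param_count(rank=16, num_layers=32, target_shapes=[(4096, 4096)] * 2)
fraction = params / 7e9
# Adapter weights, gradients, and AdamW state at 16 bytes per trainable parameter.
adapter_train_gb = params * 16 / 1e9

print(f"{params:,} trainable params ({fraction:.2%} of 7B), "
      f"about {adapter_train_gb:.2f} GB of training state")
```

Adapting more matrices per layer or raising the rank scales the count linearly, but under these assumptions it stays a small fraction of the base weights.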
Quantisation: The VRAM Multiplier
Quantisation is the most effective lever for reducing VRAM requirements. The quality trade-off varies by quantisation level and model size — larger models tolerate more aggressive quantisation with less quality loss than smaller ones.
Practical guidance on quality versus VRAM savings:
Q8 / INT8: Near-identical quality to FP16 in most benchmarks. 50% VRAM savings. Always worth using if you are VRAM-constrained and do not need FP16 precision. The recommended starting point for quantisation.
Q5_K_M / Q6_K: Very small quality gap from FP16, typically imperceptible in practice. 60–65% VRAM savings. The sweet spot for most use cases — much smaller than FP16, very close in quality.
Q4_K_M: The most popular quantisation level. Roughly 70% VRAM savings with a small but measurable quality reduction on some tasks. Most users cannot reliably distinguish Q4_K_M responses from FP16 for general use, though the gap is visible on complex reasoning benchmarks.
Q3 / Q2: Aggressive quantisation with significant quality loss. The model becomes noticeably less capable. Useful when VRAM is extremely limited and some degradation is acceptable, but not recommended for production use cases where quality matters.
For most deployment scenarios, Q4_K_M or Q5_K_M in GGUF format (for Ollama/llama.cpp), or AWQ/GPTQ for GPU-native serving (vLLM, TGI), are the right defaults. They allow you to run models 1–2 size tiers larger than you could at FP16 on the same hardware, with quality that is acceptable for the vast majority of production tasks.
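The savings percentages above can be reproduced from the average bits each format spends per weight. The figures in the sketch below are approximations and should be read as assumptions: Q8_0 stores a 16-bit scale with every 32 8-bit weights, giving 8.5 bits, while the K-quant averages vary slightly from model to model.

```python
# Approximate average bits per weight (assumed values; K-quants vary by model).
APPROX_BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.56,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
}


def quantised_size_gb(params_billions: float, fmt: str) -> float:
    """Approximate weight size in decimal GB for a given format."""
    return params_billions * APPROX_BITS_PER_WEIGHT[fmt] / 8


def savings_vs_fp16(fmt: str) -> float:
    """Fraction of FP16 weight memory saved by the format."""
    return 1 - APPROX_BITS_PER_WEIGHT[fmt] / 16


for fmt in APPROX_BITS_PER_WEIGHT:
    print(f"{fmt:>7}: 13B ≈ {quantised_size_gb(13, fmt):.1f} GB, "
          f"saves {savings_vs_fp16(fmt):.0%} vs FP16")
```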
Apple Silicon: Unified Memory Changes the Equation
Apple Silicon Macs (M1 through M4) use a unified memory architecture: there is no separate GPU VRAM, and the entire memory pool is shared between the CPU and GPU. This means an M3 Max with 128 GB of memory can give most of that pool to LLM inference on the GPU. Not quite all of it is available, because macOS caps how much unified memory the GPU may use (by default roughly two thirds to three quarters of the total, depending on the machine) and the operating system needs its own share.
The practical implication is that Apple Silicon can run larger models than their equivalent VRAM tier would suggest for discrete GPUs. An M3 Max (128 GB) can run 70B models at Q4 reasonably well, even though no discrete GPU with 128 GB VRAM exists at a comparable price point. The tradeoff is throughput — tokens per second on Apple Silicon is typically lower than equivalent VRAM in a dedicated NVIDIA GPU, but it is still very usable for many applications.
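For Apple Silicon, the VRAM formula still applies, but substitute total system memory for VRAM. On a high-memory machine, leave 16–20 GB for the operating system and other applications; the remainder is available for LLM inference. Smaller configurations can reserve less, but they leave correspondingly less room for the model.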
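The same sizing arithmetic carries over once a reserve is subtracted from total memory. The sketch below is a minimal fit check under the weights × 1.2 + KV cache rule of thumb. The function names and the example figures (35 GB of Q4 weights for a 70B model, a 5 GB KV cache) are illustrative assumptions.

```python
def usable_unified_memory_gb(total_gb: float, reserve_gb: float = 16.0) -> float:
    """Memory left for inference after reserving space for the OS and apps."""
    return max(total_gb - reserve_gb, 0.0)


def fits_in_unified_memory(weights_gb: float, kv_cache_gb: float,
                           total_gb: float, reserve_gb: float = 16.0,
                           overhead: float = 1.2) -> bool:
    """Apply the weights x overhead + KV cache rule against usable memory."""
    needed_gb = weights_gb * overhead + kv_cache_gb
    return needed_gb <= usable_unified_memory_gb(total_gb, reserve_gb)


# 70B at Q4 with an assumed 5 GB KV cache on two memory configurations.
print(fits_in_unified_memory(35, 5, total_gb=128))
print(fits_in_unified_memory(35, 5, total_gb=64, reserve_gb=20))
```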
Practical Sizing Examples
Here are complete sizing calculations for common real-world scenarios:
Scenario 1: Developer workstation for local experimentation. Budget: RTX 4090 (24 GB VRAM). Target: best quality model that fits. At Q4_K_M, a 13B model uses roughly 7.5 GB weights × 1.2 overhead + 3 GB KV cache at 4K context ≈ 12 GB, well within budget. At Q5_K_M, a 13B model fits too. A 20B model at Q4_K_M uses ~12 GB weights plus KV cache, which is tight but workable for short contexts. Recommendation: 13B at Q5_K_M for best quality/VRAM ratio.
Scenario 2: Production API server, single A100 80GB. Target: 70B model, 4K context, batch size 4. 70B at FP16 = 140 GB weights, which does not fit. At Q4: 35 GB weights × 1.2 = 42 GB. For the KV cache, Llama-style 70B models have 80 layers and use grouped-query attention with 8 KV heads of dimension 128, so the KV width is 1024 rather than the full 8192 hidden size: 2 × 80 × 1024 × 4096 × 4 × 2 bytes ≈ 5 GB for batch 4. Total ≈ 47 GB, which fits with headroom. (With full multi-head attention the same cache would be about 40 GB and the total would exceed 80 GB.) Recommendation: 70B with 4-bit AWQ on a single A100 80GB.
Scenario 3: Fine-tuning a 7B model with LoRA. Base weights at FP16 = 14 GB. LoRA adds ~0.5 GB for the adapter. Activations and optimiser states for LoRA ≈ 4 GB. Total ≈ 18.5 GB. An RTX 3090 or 4090 (24 GB) handles this comfortably with room for batch size 2–4. Recommendation: 7B base + LoRA rank 16 on single RTX 4090.
Tools for VRAM Estimation
Rather than doing all calculations by hand, several tools automate the process. The Hugging Face Model Memory Calculator at huggingface.co/spaces/hf-accelerate/model-memory-usage estimates training memory for any model on the Hub. For inference, the llama.cpp project maintains a GGUF model size calculator, and vLLM’s documentation includes guidance on KV cache sizing for their framework specifically. For quick estimates, the rule of thumb that a model needs approximately (parameter count in billions × bytes per parameter × 1.2) gigabytes for inference, before adding the KV cache, gives you a number accurate to within 15–20% for most standard transformer architectures at short context lengths. That is sufficient for hardware selection decisions even without exact architectural details.
When VRAM Is Not Enough: Fallback Options
When your model does not fit in VRAM, you have several fallback options before giving up or buying more GPUs.
CPU offloading splits the model between GPU VRAM and system RAM: a fixed number of layers is placed on the GPU and the rest run on the CPU. llama.cpp exposes this through the --n-gpu-layers parameter (Ollama has an equivalent num_gpu option); set it to the number of layers that fit in VRAM. Throughput drops dramatically (often 5–20x, and more when most layers end up on the CPU), but the model runs.
Flash attention computes attention in tiles so that the full sequence-by-sequence score matrix is never materialised. It leaves the KV cache itself unchanged, but it cuts the temporary memory that attention needs, and the saving grows with prompt length, with no quality impact. Most modern inference frameworks support it via a flag. To shrink the KV cache itself, quantise it: an 8-bit cache is half the size of an FP16 one, usually with little quality loss.
Gradient checkpointing (for training) trades compute for memory by not storing all activations and recomputing them during the backward pass. It roughly halves activation memory at the cost of about 33% more compute time.
These fallbacks are not as good as having enough VRAM, but they can make the difference between a model running and not running on the hardware you have today.
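A starting value for --n-gpu-layers can be estimated by dividing the VRAM budget by the per-layer size. The sketch below assumes all layers are the same size and ignores embeddings and the output head, so treat the result as a first guess to adjust by trial. The function name and the 4 GB reserve for KV cache and buffers are assumptions for the example.

```python
def gpu_layers_that_fit(model_size_gb: float, num_layers: int,
                        vram_gb: float, reserve_gb: float = 2.0) -> int:
    """Estimate how many equally sized layers fit in VRAM.

    reserve_gb is held back for the KV cache and framework buffers.
    """
    per_layer_gb = model_size_gb / num_layers
    budget_gb = max(vram_gb - reserve_gb, 0.0)
    return min(num_layers, int(budget_gb // per_layer_gb))


# 70B at Q4 (about 35 GB, 80 layers) on a 24 GB card, keeping 4 GB in reserve.
layers = gpu_layers_that_fit(model_size_gb=35, num_layers=80,
                             vram_gb=24, reserve_gb=4)
print(f"--n-gpu-layers {layers}")
```

If the model then fails to load or runs out of memory at long contexts, lower the value a few layers at a time.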
The fundamental principle to carry through all these calculations is that VRAM is the resource to optimise around, not compute. A GPU with twice the VRAM but half the TFLOPS will often give you better practical performance for LLM inference than one with twice the compute but half the memory — because the compute is wasted if the model cannot fit in VRAM to run at all. Buy for memory first, compute second, when configuring hardware for LLM workloads.