If you’re planning to run large language models locally, the first question you need to answer isn’t about CPU speed or storage—it’s about VRAM. Video memory determines what models you can run, at what quality level, and how responsive they’ll be. Get this wrong and you’ll either overspend on hardware you don’t need or build a system that can’t run the models you want. Let’s cut through the confusion and give you exact numbers for Llama 3, Mistral, and other popular models from 7B to 70B parameters.
Understanding the VRAM Requirement Formula
Before we dive into specific numbers, you need to understand how VRAM requirements are calculated. Unlike regular applications where memory usage is flexible, LLMs have rigid requirements based on a simple formula:
VRAM Required = (Model Parameters × Precision in Bytes) + Overhead
Here’s what each component means:
Model Parameters are the billions of values the model uses to generate text. A 7B model has 7 billion parameters, a 13B has 13 billion, and so on. This is the primary driver of memory usage.
Precision in Bytes determines how much memory each parameter consumes:
- FP16 (16-bit floating point): 2 bytes per parameter
- FP8 (8-bit floating point): 1 byte per parameter
- Q8 (8-bit quantization): 1 byte per parameter
- Q4 (4-bit quantization): 0.5 bytes per parameter
Overhead includes the inference engine, KV cache for context, and system buffers. This typically adds 10-30% depending on your context length and batch size.
Let’s apply this formula to real models.
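If you prefer to let a script do the arithmetic, here's a minimal Python sketch of the same formula. The byte-per-parameter values come from the list above; the flat 20% default overhead is an assumption standing in for KV cache and engine overhead, so treat the output as an estimate rather than a guarantee.

```python
# Minimal sketch of the VRAM formula above. The overhead fraction is an assumption
# (roughly 10-30% depending on context length, batch size, and engine).

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "q8": 1.0, "q4": 0.5}

def estimate_vram_gb(params_billions: float, precision: str, overhead: float = 0.20) -> float:
    """VRAM estimate in GB: weights (parameters x bytes per parameter) plus overhead."""
    weights_gb = params_billions * BYTES_PER_PARAM[precision]
    return weights_gb * (1 + overhead)

print(round(estimate_vram_gb(7, "q8"), 1))    # ~8.4 GB
print(round(estimate_vram_gb(13, "q8"), 1))   # ~15.6 GB
print(round(estimate_vram_gb(70, "q4"), 1))   # ~42.0 GB
```

These estimates land inside the corresponding ranges in the table below.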
Exact VRAM Requirements by Model Size and Precision
| Model Size | FP16 | Q8 | Q4 | Minimum GPU |
|---|---|---|---|---|
| 7B | 14-16GB | 8-10GB | 4-6GB | RTX 3060 12GB (Q8), RTX 4060 8GB (Q4) |
| 13B | 26-30GB | 15-18GB | 10-14GB | RTX 4090 24GB (Q8), RTX 3060 12GB (Q4) |
| 34B | 68-75GB | 38-45GB | 20-25GB | RTX 4090 24GB (Q4 tight), A6000 48GB (Q8) |
| 70B | 140-160GB | 75-85GB | 35-45GB | Multi-GPU (FP16/Q8), 2× RTX 4090 (Q4) |
These numbers include typical overhead for 4K context length. If you’re running longer contexts (8K, 16K, or 32K), add 10-20% more VRAM to these estimates.
Deep Dive: 7B Models (Mistral 7B, Qwen2.5 7B)
The 7B parameter models are the sweet spot for local LLM usage in 2026. They’re capable enough for real work while being accessible on consumer hardware.
FP16 (Full Precision): 14-16GB VRAM
Running a 7B model at full 16-bit precision requires approximately 14GB of VRAM for the model weights alone. With inference overhead and a reasonable context window (4K tokens), you’re looking at 15-16GB total.
Calculation breakdown:
- Model weights: 7 billion × 2 bytes = 14GB
- KV cache (4K context): ~1GB
- Inference engine overhead: ~0.5-1GB
- Total: 15.5-16GB
This is why you can't run a 7B model like Mistral 7B at FP16 on a 12GB GPU: you're roughly 4GB short. However, there's rarely a good reason to run 7B models at full precision. The quality improvement over Q8 is minimal while the VRAM cost is substantial.
Who needs FP16 for 7B models?
- Researchers doing fine-tuning or model analysis
- Benchmarking to establish baseline quality metrics
- Specific use cases where maximum precision matters
For everyone else, quantized versions are the way to go.
Q8 (8-bit Quantization): 8-10GB VRAM
Q8 quantization halves the VRAM requirement while maintaining 95-98% of the model’s quality. For most users, the quality loss is imperceptible in real-world usage.
Calculation breakdown:
- Model weights: 7 billion × 1 byte = 7GB
- KV cache (4K context): ~1GB
- Inference engine overhead: ~0.5-1GB
- Total: 8.5-9GB
This is the ideal configuration for 7B models. An RTX 3060 12GB handles it comfortably with room to spare. An RTX 4060 8GB is right at the edge: a Q8 7B model fits only with a short context window, so Q4 is the safer choice on 8GB cards.
Performance characteristics:
- Token generation speed: Nearly identical to FP16
- Quality loss: Minimal, often undetectable in blind tests
- VRAM saved: ~6-7GB compared to FP16
- Recommended for: Almost everyone running 7B models
Q4 (4-bit Quantization): 4-6GB VRAM
Q4 quantization cuts the memory footprint to a quarter of FP16, making 7B models accessible on GPUs with as little as 6GB of VRAM.
Calculation breakdown:
- Model weights: 7 billion × 0.5 bytes = 3.5GB
- KV cache (4K context): ~1GB
- Inference engine overhead: ~0.5-1GB
- Total: 5-5.5GB
The quality trade-off is more noticeable at Q4 than Q8, but for many use cases it’s still perfectly acceptable.
Quality considerations:
- Slight degradation in reasoning tasks
- More frequent minor coherence issues in very long responses
- Still excellent for coding assistance, writing, and general chat
- Not recommended for tasks requiring maximum accuracy
When to use Q4:
- Limited VRAM (8GB or less)
- Running multiple models simultaneously
- Experimentation and testing
- Non-critical applications where speed matters more than perfect quality
Deep Dive: 13B Models (Llama 2 13B, Vicuna 13B)
Thirteen billion parameter models represent a significant quality jump over 7B models. They show better reasoning, follow complex instructions more reliably, and produce more sophisticated outputs. But this capability comes at a steep VRAM cost.
FP16 (Full Precision): 26-30GB VRAM
The base VRAM requirement for 13B models at FP16 immediately puts you into workstation GPU territory.
Calculation breakdown:
- Model weights: 13 billion × 2 bytes = 26GB
- KV cache (4K context): ~1.5GB
- Inference engine overhead: ~1-1.5GB
- Total: 28.5-29GB
No consumer GPU has this much VRAM. You need either:
- NVIDIA RTX 6000 Ada (48GB)
- NVIDIA A6000 (48GB)
- Two consumer GPUs with the model split across them (no SLI required; NVLink on RTX 3090s helps but isn't essential)
For most people, FP16 13B models simply aren’t an option for local inference.
Q8 (8-bit Quantization): 15-18GB VRAM
Q8 brings 13B models into the realm of possibility for high-end consumer hardware.
Calculation breakdown:
- Model weights: 13 billion × 1 byte = 13GB
- KV cache (4K context): ~1.5GB
- Inference engine overhead: ~1-1.5GB
- Total: 15.5-16GB
The RTX 4090 with 24GB VRAM handles Q8 13B models comfortably. You’ll have enough headroom for longer context windows (8K-16K) and still maintain responsive performance.
Performance characteristics:
- Token generation: 40-55 tokens/second on RTX 4090
- Quality: Excellent, minimal loss from FP16
- Context length: Can support 8K-16K contexts
- Use cases: Professional coding assistance, complex analysis, creative writing
This is the configuration most serious LLM users should target if they’re investing in hardware specifically for local inference.
Q4 (4-bit Quantization): 10-14GB VRAM
Q4 quantization makes 13B models accessible on GPUs like the RTX 3060 12GB, though you’re pushing the limits.
Calculation breakdown:
- Model weights: 13 billion × 0.5 bytes = 6.5GB
- KV cache (4K context): ~1.5GB
- Inference engine overhead: ~1-1.5GB
- Total: 9-9.5GB
On a 12GB GPU, you can run Q4 13B models but with limited headroom. Context length will be restricted to 4K or less, and you won’t be able to run other applications simultaneously.
Trade-offs at Q4:
- Noticeable quality degradation compared to Q8
- More frequent errors in complex reasoning
- Still usable for many tasks
- Context length severely limited on 12GB GPUs
Deep Dive: 70B Models (Llama 3.1 70B)
Seventy billion parameter models deliver state-of-the-art performance that rivals commercial APIs. However, running them locally is challenging even with high-end hardware.
FP16 (Full Precision): 140-160GB VRAM
Running 70B models at full precision requires enterprise-grade hardware.
Calculation breakdown:
- Model weights: 70 billion × 2 bytes = 140GB
- KV cache (4K context): ~4GB
- Inference engine overhead: ~4-6GB
- Total: 148-150GB
You need either:
- 4× A100 40GB GPUs
- 2× A100 80GB GPUs
- 2× H100 80GB GPUs
This is well beyond consumer budgets and really only makes sense for research institutions or companies.
Q8 (8-bit Quantization): 75-85GB VRAM
Q8 70B models still require multiple high-end GPUs.
Calculation breakdown:
- Model weights: 70 billion × 1 byte = 70GB
- KV cache (4K context): ~4GB
- Inference engine overhead: ~3-5GB
- Total: 77-79GB
Minimum configurations:
- 2× A6000 48GB (96GB total, comfortable)
- 4× RTX 3090 24GB (96GB total)
- 2× RTX 4090 24GB (48GB total) only works if you drop to Q4, covered below
Performance will be slower than smaller models due to cross-GPU communication overhead.
Q4 (4-bit Quantization): 35-45GB VRAM
Q4 quantization brings 70B models tantalizingly close to single high-end GPU territory.
Calculation breakdown:
- Model weights: 70 billion × 0.5 bytes = 35GB
- KV cache (4K context): ~4GB
- Inference engine overhead: ~3-4GB
- Total: 42-43GB
An RTX 4090 with 24GB can technically run Q4 70B models using techniques like:
- Offloading some layers to system RAM
- Aggressive KV cache pruning
- Reduced context windows (2K instead of 4K)
However, performance will be poor—expect 5-12 tokens per second, which is borderline usable for interactive work. The quality loss from Q4 quantization is also more pronounced at this scale.
Better approach: Use 2× RTX 4090s (48GB total) for comfortable Q4 70B inference at 15-25 tokens/second.
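To reason about the layer-offloading approach above, a rough back-of-the-envelope split helps: assume the weights are spread evenly across the transformer layers, reserve some VRAM for the KV cache and the engine, and see how many layers fit on the GPU. This is a minimal sketch under those assumptions; the layer count (~80 for 70B-class models) and the 5GB reserve are illustrative values, not measurements from any particular tool.

```python
def split_layers(total_weights_gb: float, n_layers: int, gpu_vram_gb: float,
                 reserve_gb: float = 5.0) -> tuple[int, int]:
    """Rough split of transformer layers between GPU and system RAM.

    reserve_gb approximates KV cache plus engine overhead kept on the GPU.
    Assumes weights are spread evenly across layers, which is only roughly true.
    """
    per_layer_gb = total_weights_gb / n_layers
    gpu_layers = int((gpu_vram_gb - reserve_gb) // per_layer_gb)
    gpu_layers = max(0, min(n_layers, gpu_layers))
    return gpu_layers, n_layers - gpu_layers

# Example: a Q4 70B model (~35GB of weights, assumed 80 layers) on a single 24GB card.
on_gpu, on_cpu = split_layers(35.0, 80, 24.0)
print(f"{on_gpu} layers on GPU, {on_cpu} offloaded to system RAM")  # roughly 43 / 37
```

With roughly half the layers living in system RAM, every token has to wait on the CPU-resident layers, which is exactly why single-GPU throughput collapses.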
VRAM Calculator: Find Your Requirements
Use this formula to calculate VRAM requirements for any model:
| Precision | Bytes per Parameter | Typical Overhead |
|---|---|---|
| FP16 | 2.0 | +15-20% |
| Q8 | 1.0 | +15-20% |
| Q4 | 0.5 | +20-30% |
For longer context windows:
- 8K context: Add 30-40% overhead instead
- 16K context: Add 50-60% overhead instead
- 32K context: Add 80-100% overhead instead
Context length has a significant impact on VRAM usage because the KV cache grows linearly with context size.
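Putting the table and the context adjustments together, here is a minimal sketch of a context-aware estimate. The multipliers are rounded values taken from the ranges above, not measurements.

```python
# Rounded overhead multipliers from the calculator table above (assumed, not measured).
CONTEXT_OVERHEAD = {4096: 0.20, 8192: 0.35, 16384: 0.55, 32768: 0.90}
BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}

def estimate_vram_gb(params_billions: float, precision: str, context: int = 4096) -> float:
    """VRAM estimate in GB with context-dependent overhead."""
    weights_gb = params_billions * BYTES_PER_PARAM[precision]
    return weights_gb * (1 + CONTEXT_OVERHEAD[context])

print(round(estimate_vram_gb(13, "q8", 8192), 1))   # ~17.6 GB
print(round(estimate_vram_gb(13, "q8", 16384), 1))  # ~20.2 GB
```

The 8K and 16K results land in the same ballpark as the 13B figures in the next section.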
How Context Length Affects VRAM Requirements
Context length is often overlooked when calculating VRAM needs, but it can make or break your ability to run a model.
The KV (key-value) cache stores attention information for all previous tokens in the conversation. As context grows, so does this cache.
For a 7B model:
- 2K context: ~0.5GB KV cache
- 4K context: ~1GB KV cache
- 8K context: ~2GB KV cache
- 16K context: ~4GB KV cache
For a 13B model:
- 2K context: ~0.8GB KV cache
- 4K context: ~1.5GB KV cache
- 8K context: ~3GB KV cache
- 16K context: ~6GB KV cache
This means a 13B Q8 model that needs 15.5GB at 4K context actually needs 18GB+ at 8K context and 21GB+ at 16K context.
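Where do these ballpark figures come from? The cache holds a key and a value vector for every layer, KV head, and token, so its size follows directly from the architecture and the cache precision. A minimal sketch with two assumed 7B-class configurations: models using grouped-query attention (Mistral 7B, Llama 3) sit at the low end, older full-attention models at the high end, and the rounded figures above split the difference.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_value: float = 2.0) -> float:
    """KV cache size: keys and values stored for every layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value / 1e9

# Two assumed 7B-class configurations with an FP16 cache at 4K context:
print(kv_cache_gb(32, 8, 128, 4096))   # ~0.5 GB, Mistral-7B-like (grouped-query attention, 8 KV heads)
print(kv_cache_gb(32, 32, 128, 4096))  # ~2.1 GB, Llama-2-7B-like (full attention, 32 KV heads)
```

Grouped-query attention is the main reason newer models are far cheaper to run at long context than older full-attention 7B and 13B models.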
Practical implications:
If you have a 12GB GPU and want to run 13B Q4 models, you’re limited to 4K context maximum. Want 8K context? You need 16GB+ VRAM.
If you have a 24GB GPU running 13B Q8 models, you can comfortably handle 8K contexts but 16K will be tight.
For 70B models, context length management becomes critical. Even at Q4, a 16K context could push you over 50GB VRAM.
Quantization Quality: What You’re Actually Trading
Understanding what you lose with quantization helps you make informed decisions about VRAM trade-offs.
FP16 to Q8: Minimal Quality Loss
The step from FP16 to Q8 is almost free in terms of quality. Most users cannot detect the difference in blind tests.
What’s preserved:
- Reasoning capability
- Instruction following
- Factual accuracy
- Coherence and fluency
- Complex task performance
What changes:
- Extremely subtle numerical precision in edge cases
- Rare minor artifacts in very technical outputs
Bottom line: Unless you’re doing research or have specific precision requirements, Q8 is indistinguishable from FP16 for practical use.
Q8 to Q4: Noticeable but Manageable
The jump from Q8 to Q4 does introduce observable quality degradation, but the severity depends on model size and use case.
For 7B models at Q4:
- Reasoning: Slightly less reliable on complex multi-step problems
- Factual accuracy: Occasional errors increase modestly
- Coherence: Minor issues in very long responses
- Overall: Still highly usable for most applications
For 13B+ models at Q4:
- The quality loss is more pronounced
- Complex reasoning tasks show degradation
- Instruction following can be less precise
- Creative tasks may produce less sophisticated outputs
When Q4 is acceptable:
- Coding assistance (syntax still reliable)
- General chat and Q&A
- Drafting and brainstorming
- Experimentation and testing
When Q8 or higher is needed:
- Complex reasoning or analysis
- Professional writing requiring nuance
- Technical documentation
- Research or academic use
Going Below Q4: Generally Not Recommended
Q3 and Q2 quantization exist but are rarely worth using. The quality degradation becomes severe enough that you’re better off using a smaller model at higher precision.
A 7B model at Q8 will outperform a 13B model at Q2 or Q3 in most tasks while using a comparable amount of VRAM.
Real-World Hardware Recommendations by Use Case
Let’s translate these VRAM requirements into actual GPU recommendations for different user profiles.
Budget Experimentation ($250-400)
Goal: Run 7B models comfortably, occasionally test 13B at Q4
Recommended GPU: RTX 3060 12GB (used $250-300, new $320-380)
What you can run:
- 7B models at Q8: Excellent performance
- 7B models at Q4: Very fast, extra headroom
- 13B models at Q4: Possible with 4K context limit
- Multiple 7B models simultaneously: For comparison testing
Limitations:
- No 13B at Q8
- No 70B models at any precision
- Limited context length on larger models
Serious Hobbyist ($800-1,200)
Goal: Run 13B models at Q8 with good performance
Recommended GPU: RTX 4080 16GB ($900-1,100) or used RTX 3090 24GB ($800-900)
What you can run:
- 7B models at FP16: If you really want maximum quality
- 7B models at Q8: Extremely fast
- 13B models at Q8: Comfortable on the 24GB RTX 3090 with room for 8K context; a tight fit at 4K context on the 16GB RTX 4080
- 13B models at Q4: Very fast with long context
Limitations:
- 70B models run only at Q4, and only with aggressive offloading
- For 70B, a dual-GPU setup is the more realistic path
Professional/Power User ($1,500-2,000)
Goal: Run 13B at Q8 optimally, touch 70B territory
Recommended GPU: RTX 4090 24GB ($1,600-2,000)
What you can run:
- 7B models at FP16: Maximum quality if needed
- 13B models at Q8: Excellent performance, 16K context possible
- 13B models at FP16: With reduced context length
- 34B models at Q4: Usable performance, though VRAM headroom is tight
- 70B models at Q4: Technically possible with optimizations
This is the sweet spot for serious local LLM work without going into enterprise hardware.
Enterprise/Multi-GPU ($3,000+)
Goal: Run 70B models at Q8 or larger models comfortably
Recommended Setup: 2× RTX 4090 24GB ($3,200-4,000 total)
What you can run:
- Everything the single 4090 can do
- 70B models at Q4: Good performance (15-25 tokens/second)
- 70B models at Q8: Only with heavy offloading to system RAM, at a significant speed cost
- Multiple 13B models simultaneously
- Fine-tuning 13B models
Common VRAM Pitfalls to Avoid
1. Forgetting about context length
Don’t just calculate base model VRAM. If you plan to use 8K or 16K contexts regularly, factor that into your GPU purchase.
2. Assuming newer GPUs are always better
The RTX 4060 8GB is newer than the RTX 3060 12GB but worse for LLMs. VRAM capacity matters more than generation for inference.
3. Underestimating overhead
The formulas give you model size, but operating systems, drivers, and inference engines all consume VRAM too. Always budget 15-30% extra.
4. Planning for exactly one model
You’ll want to experiment with different models, run comparisons, or have multiple models loaded. Buy more VRAM than your calculations suggest you need.
5. Ignoring system RAM
Some frameworks can offload layers to system RAM, but this kills performance. Don’t rely on this as a solution—buy enough VRAM.
Conclusion
VRAM requirements for LLMs are straightforward once you understand the formula: model parameters multiplied by bytes per parameter, plus overhead. For most users, 7B models at Q8 (8-10GB VRAM) or 13B models at Q8 (15-18GB VRAM) represent the sweet spot of capability and accessibility. The RTX 3060 12GB handles 7B models excellently, while a 24GB card like the RTX 3090 or RTX 4090 is what it takes to handle 13B models comfortably at Q8.
The key is being honest about what you’ll actually run. If you’re primarily using 7B models, don’t overspend on a 4090. If you know you want 13B quality, don’t try to make a 12GB GPU work. Match your hardware to your actual needs, and you’ll build a system that serves you well for years.