If you’re planning to run large language models locally, the first question you need to answer isn’t about CPU speed or storage—it’s about VRAM. Video memory determines what models you can run, at what quality level, and how responsive they’ll be. Get this wrong and you’ll either overspend on hardware you don’t need or build a system that can’t run the models you want. Let’s cut through the confusion and give you exact numbers for Llama 3, Mistral, and other popular models from 7B to 70B parameters.
Understanding the VRAM Requirement Formula
Before we dive into specific numbers, you need to understand how VRAM requirements are calculated. Unlike regular applications where memory usage is flexible, LLMs have rigid requirements based on a simple formula:
VRAM Required = (Model Parameters × Precision in Bytes) + Overhead
Here’s what each component means:
Model Parameters are the billions of values the model uses to generate text. A 7B model has 7 billion parameters, a 13B has 13 billion, and so on. This is the primary driver of memory usage.
Precision in Bytes determines how much memory each parameter consumes:
- FP16 (16-bit floating point): 2 bytes per parameter
- FP8 (8-bit floating point): 1 byte per parameter
- Q8 (8-bit quantization): 1 byte per parameter
- Q4 (4-bit quantization): 0.5 bytes per parameter
Overhead includes the inference engine, KV cache for context, and system buffers. This typically adds 10-30% depending on your context length and batch size.
Let’s apply this formula to real models.
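If you prefer to let a script do the arithmetic, here's a minimal Python sketch of the same formula. The byte-per-parameter values come from the list above; the flat 20% default overhead is an assumption standing in for KV cache and engine overhead, so treat the output as an estimate rather than a guarantee.

```python
# Minimal sketch of the VRAM formula above. The overhead fraction is an assumption
# (roughly 10-30% depending on context length, batch size, and engine).

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "q8": 1.0, "q4": 0.5}

def estimate_vram_gb(params_billions: float, precision: str, overhead: float = 0.20) -> float:
    """VRAM estimate in GB: weights (parameters x bytes per parameter) plus overhead."""
    weights_gb = params_billions * BYTES_PER_PARAM[precision]
    return weights_gb * (1 + overhead)

print(round(estimate_vram_gb(7, "q8"), 1))    # ~8.4 GB
print(round(estimate_vram_gb(13, "q8"), 1))   # ~15.6 GB
print(round(estimate_vram_gb(70, "q4"), 1))   # ~42.0 GB
```

These estimates land inside the corresponding ranges in the table below.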
Exact VRAM Requirements by Model Size and Precision
| Model Size | FP16 | Q8 | Q4 | Minimum GPU |
|---|---|---|---|---|
| 7B | 14-16GB | 8-10GB | 4-6GB | RTX 3060 12GB (Q8), RTX 4060 8GB (Q4) |
| 13B | 26-30GB | 15-18GB | 10-14GB | RTX 4090 24GB (Q8), RTX 3060 12GB (Q4) |
| 34B | 68-75GB | 38-45GB | 20-25GB | RTX 4090 24GB (Q4 tight), A6000 48GB (Q8) |
| 70B | 140-160GB | 75-85GB | 35-45GB | Multi-GPU (FP16/Q8), 2× RTX 4090 (Q4) |
These numbers include typical overhead for 4K context length. If you’re running longer contexts (8K, 16K, or 32K), add 10-20% more VRAM to these estimates.
Deep Dive: 7B Models (Mistral 7B, Qwen2.5 7B)
The 7B parameter models are the sweet spot for local LLM usage in 2026. They’re capable enough for real work while being accessible on consumer hardware.
FP16 (Full Precision): 14-16GB VRAM
Running a 7B model at full 16-bit precision requires approximately 14GB of VRAM for the model weights alone. With inference overhead and a reasonable context window (4K tokens), you’re looking at 15-16GB total.
Calculation breakdown:
- Model weights: 7 billion × 2 bytes = 14GB
- KV cache (4K context): ~1GB
- Inference engine overhead: ~0.5-1GB
- Total: 15.5-16GB
This is why you can't run a 7B model like Mistral 7B at FP16 on a 12GB GPU: you're roughly 4GB short. However, there's rarely a good reason to run 7B models at full precision. The quality improvement over Q8 is minimal while the VRAM cost is substantial.
Who needs FP16 for 7B models?
- Researchers doing fine-tuning or model analysis
- Benchmarking to establish baseline quality metrics
- Specific use cases where maximum precision matters
For everyone else, quantized versions are the way to go.
Q8 (8-bit Quantization): 8-10GB VRAM
Q8 quantization halves the VRAM requirement while maintaining 95-98% of the model’s quality. For most users, the quality loss is imperceptible in real-world usage.
Calculation breakdown:
- Model weights: 7 billion × 1 byte = 7GB
- KV cache (4K context): ~1GB
- Inference engine overhead: ~0.5-1GB
- Total: 8.5-9GB
This is the ideal configuration for 7B models. An RTX 3060 12GB handles it comfortably with room to spare. An RTX 4060 8GB is right at the edge: a Q8 7B model fits only with a short context window, so Q4 is the safer choice on 8GB cards.
Performance characteristics:
- Token generation speed: Nearly identical to FP16
- Quality loss: Minimal, often undetectable in blind tests
- VRAM saved: ~6-7GB compared to FP16
- Recommended for: Almost everyone running 7B models
Q4 (4-bit Quantization): 4-6GB VRAM
Q4 quantization cuts the memory footprint to a quarter of FP16, making 7B models accessible on GPUs with as little as 6GB of VRAM.
Calculation breakdown:
- Model weights: 7 billion × 0.5 bytes = 3.5GB
- KV cache (4K context): ~1GB
- Inference engine overhead: ~0.5-1GB
- Total: 5-5.5GB
The quality trade-off is more noticeable at Q4 than Q8, but for many use cases it’s still perfectly acceptable.
Quality considerations:
- Slight degradation in reasoning tasks
- More frequent minor coherence issues in very long responses
- Still excellent for coding assistance, writing, and general chat
- Not recommended for tasks requiring maximum accuracy
When to use Q4:
- Limited VRAM (8GB or less)
- Running multiple models simultaneously
- Experimentation and testing
- Non-critical applications where speed matters more than perfect quality
Deep Dive: 13B Models (Llama 2 13B, Vicuna 13B)
Thirteen billion parameter models represent a significant quality jump over 7B models. They show better reasoning, follow complex instructions more reliably, and produce more sophisticated outputs. But this capability comes at a steep VRAM cost.
FP16 (Full Precision): 26-30GB VRAM
The base VRAM requirement for 13B models at FP16 immediately puts you into workstation GPU territory.
Calculation breakdown:
- Model weights: 13 billion × 2 bytes = 26GB
- KV cache (4K context): ~1.5GB
- Inference engine overhead: ~1-1.5GB
- Total: 28.5-29GB
No consumer GPU has this much VRAM. You need either:
- NVIDIA RTX 6000 Ada (48GB)
- NVIDIA A6000 (48GB)
- Two consumer GPUs with the model split across them (no SLI required; NVLink on RTX 3090s helps but isn't essential)
For most people, FP16 13B models simply aren’t an option for local inference.
Q8 (8-bit Quantization): 15-18GB VRAM
Q8 brings 13B models into the realm of possibility for high-end consumer hardware.
Calculation breakdown:
- Model weights: 13 billion × 1 byte = 13GB
- KV cache (4K context): ~1.5GB
- Inference engine overhead: ~1-1.5GB
- Total: 15.5-16GB
The RTX 4090 with 24GB VRAM handles Q8 13B models comfortably. You’ll have enough headroom for longer context windows (8K-16K) and still maintain responsive performance.
Performance characteristics:
- Token generation: 40-55 tokens/second on RTX 4090
- Quality: Excellent, minimal loss from FP16
- Context length: Can support 8K-16K contexts
- Use cases: Professional coding assistance, complex analysis, creative writing
This is the configuration most serious LLM users should target if they’re investing in hardware specifically for local inference.
Q4 (4-bit Quantization): 10-14GB VRAM
Q4 quantization makes 13B models accessible on GPUs like the RTX 3060 12GB, though you’re pushing the limits.
Calculation breakdown:
- Model weights: 13 billion × 0.5 bytes = 6.5GB
- KV cache (4K context): ~1.5GB
- Inference engine overhead: ~1-1.5GB
- Total: 9-9.5GB
On a 12GB GPU, you can run Q4 13B models but with limited headroom. Context length will be restricted to 4K or less, and you won’t be able to run other applications simultaneously.
Trade-offs at Q4:
- Noticeable quality degradation compared to Q8
- More frequent errors in complex reasoning
- Still usable for many tasks
- Context length severely limited on 12GB GPUs
Deep Dive: 70B Models (Llama 3.1 70B)
Seventy billion parameter models deliver state-of-the-art performance that rivals commercial APIs. However, running them locally is challenging even with high-end hardware.
FP16 (Full Precision): 140-160GB VRAM
Running 70B models at full precision requires enterprise-grade hardware.
Calculation breakdown:
- Model weights: 70 billion × 2 bytes = 140GB
- KV cache (4K context): ~4GB
- Inference engine overhead: ~4-6GB
- Total: 148-150GB
You need either:
- 4× A100 40GB GPUs
- 2× A100 80GB GPUs
- 2× H100 80GB GPUs
This is well beyond consumer budgets and really only makes sense for research institutions or companies.
Q8 (8-bit Quantization): 75-85GB VRAM
Q8 70B models still require multiple high-end GPUs.
Calculation breakdown:
- Model weights: 70 billion × 1 byte = 70GB
- KV cache (4K context): ~4GB
- Inference engine overhead: ~3-5GB
- Total: 77-79GB
Minimum configurations:
- 2× A6000 48GB (96GB total, comfortable)
- 4× RTX 3090 24GB (96GB total)
- 2× RTX 4090 24GB (48GB total) only works if you drop to Q4, covered below
Performance will be slower than smaller models due to cross-GPU communication overhead.
Q4 (4-bit Quantization): 35-45GB VRAM
Q4 quantization brings 70B models tantalizingly close to single high-end GPU territory.
Calculation breakdown:
- Model weights: 70 billion × 0.5 bytes = 35GB
- KV cache (4K context): ~4GB
- Inference engine overhead: ~3-4GB
- Total: 42-43GB
An RTX 4090 with 24GB can technically run Q4 70B models using techniques like:
- Offloading some layers to system RAM
- Aggressive KV cache pruning
- Reduced context windows (2K instead of 4K)
However, performance will be poor—expect 5-12 tokens per second, which is borderline usable for interactive work. The quality loss from Q4 quantization is also more pronounced at this scale.
Better approach: Use 2× RTX 4090s (48GB total) for comfortable Q4 70B inference at 15-25 tokens/second.
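To reason about the layer-offloading approach above, a rough back-of-the-envelope split helps: assume the weights are spread evenly across the transformer layers, reserve some VRAM for the KV cache and the engine, and see how many layers fit on the GPU. This is a minimal sketch under those assumptions; the layer count (~80 for 70B-class models) and the 5GB reserve are illustrative values, not measurements from any particular tool.

```python
def split_layers(total_weights_gb: float, n_layers: int, gpu_vram_gb: float,
                 reserve_gb: float = 5.0) -> tuple[int, int]:
    """Rough split of transformer layers between GPU and system RAM.

    reserve_gb approximates KV cache plus engine overhead kept on the GPU.
    Assumes weights are spread evenly across layers, which is only roughly true.
    """
    per_layer_gb = total_weights_gb / n_layers
    gpu_layers = int((gpu_vram_gb - reserve_gb) // per_layer_gb)
    gpu_layers = max(0, min(n_layers, gpu_layers))
    return gpu_layers, n_layers - gpu_layers

# Example: a Q4 70B model (~35GB of weights, assumed 80 layers) on a single 24GB card.
on_gpu, on_cpu = split_layers(35.0, 80, 24.0)
print(f"{on_gpu} layers on GPU, {on_cpu} offloaded to system RAM")  # roughly 43 / 37
```

With roughly half the layers living in system RAM, every token has to wait on the CPU-resident layers, which is exactly why single-GPU throughput collapses.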
VRAM Calculator: Find Your Requirements
Use this formula to calculate VRAM requirements for any model:
| Precision | Bytes per Parameter | Typical Overhead |
|---|---|---|
| FP16 | 2.0 | +15-20% |
| Q8 | 1.0 | +15-20% |
| Q4 | 0.5 | +20-30% |
For longer context windows:
- 8K context: Add 30-40% overhead instead
- 16K context: Add 50-60% overhead instead
- 32K context: Add 80-100% overhead instead
Context length has a significant impact on VRAM usage because the KV cache grows linearly with context size.
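Putting the table and the context adjustments together, here is a minimal sketch of a context-aware estimate. The multipliers are rounded values taken from the ranges above, not measurements.

```python
# Rounded overhead multipliers from the calculator table above (assumed, not measured).
CONTEXT_OVERHEAD = {4096: 0.20, 8192: 0.35, 16384: 0.55, 32768: 0.90}
BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}

def estimate_vram_gb(params_billions: float, precision: str, context: int = 4096) -> float:
    """VRAM estimate in GB with context-dependent overhead."""
    weights_gb = params_billions * BYTES_PER_PARAM[precision]
    return weights_gb * (1 + CONTEXT_OVERHEAD[context])

print(round(estimate_vram_gb(13, "q8", 8192), 1))   # ~17.6 GB
print(round(estimate_vram_gb(13, "q8", 16384), 1))  # ~20.2 GB
```

The 8K and 16K results land in the same ballpark as the 13B figures in the next section.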
How Context Length Affects VRAM Requirements
Context length is often overlooked when calculating VRAM needs, but it can make or break your ability to run a model.
The KV (key-value) cache stores attention information for all previous tokens in the conversation. As context grows, so does this cache.
For a 7B model:
- 2K context: ~0.5GB KV cache
- 4K context: ~1GB KV cache
- 8K context: ~2GB KV cache
- 16K context: ~4GB KV cache
For a 13B model:
- 2K context: ~0.8GB KV cache
- 4K context: ~1.5GB KV cache
- 8K context: ~3GB KV cache
- 16K context: ~6GB KV cache
This means a 13B Q8 model that needs 15.5GB at 4K context actually needs 18GB+ at 8K context and 21GB+ at 16K context.
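Where do these ballpark figures come from? The cache holds a key and a value vector for every layer, KV head, and token, so its size follows directly from the architecture and the cache precision. A minimal sketch with two assumed 7B-class configurations: models using grouped-query attention (Mistral 7B, Llama 3) sit at the low end, older full-attention models at the high end, and the rounded figures above split the difference.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_value: float = 2.0) -> float:
    """KV cache size: keys and values stored for every layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value / 1e9

# Two assumed 7B-class configurations with an FP16 cache at 4K context:
print(kv_cache_gb(32, 8, 128, 4096))   # ~0.5 GB, Mistral-7B-like (grouped-query attention, 8 KV heads)
print(kv_cache_gb(32, 32, 128, 4096))  # ~2.1 GB, Llama-2-7B-like (full attention, 32 KV heads)
```

Grouped-query attention is the main reason newer models are far cheaper to run at long context than older full-attention 7B and 13B models.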
Practical implications:
If you have a 12GB GPU and want to run 13B Q4 models, you’re limited to 4K context maximum. Want 8K context? You need 16GB+ VRAM.
If you have a 24GB GPU running 13B Q8 models, you can comfortably handle 8K contexts but 16K will be tight.
For 70B models, context length management becomes critical. Even at Q4, a 16K context could push you over 50GB VRAM.
Quantization Quality: What You’re Actually Trading
Understanding what you lose with quantization helps you make informed decisions about VRAM trade-offs.
FP16 to Q8: Minimal Quality Loss
The step from FP16 to Q8 is almost free in terms of quality. Most users cannot detect the difference in blind tests.
What’s preserved:
- Reasoning capability
- Instruction following
- Factual accuracy
- Coherence and fluency
- Complex task performance
What changes:
- Extremely subtle numerical precision in edge cases
- Rare minor artifacts in very technical outputs
Bottom line: Unless you’re doing research or have specific precision requirements, Q8 is indistinguishable from FP16 for practical use.
Q8 to Q4: Noticeable but Manageable
The jump from Q8 to Q4 does introduce observable quality degradation, but the severity depends on model size and use case.
For 7B models at Q4:
- Reasoning: Slightly less reliable on complex multi-step problems
- Factual accuracy: Occasional errors increase modestly
- Coherence: Minor issues in very long responses
- Overall: Still highly usable for most applications
For 13B+ models at Q4:
- The quality loss is more pronounced
- Complex reasoning tasks show degradation
- Instruction following can be less precise
- Creative tasks may produce less sophisticated outputs
When Q4 is acceptable:
- Coding assistance (syntax still reliable)
- General chat and Q&A
- Drafting and brainstorming
- Experimentation and testing
When Q8 or higher is needed:
- Complex reasoning or analysis
- Professional writing requiring nuance
- Technical documentation
- Research or academic use
Going Below Q4: Generally Not Recommended
Q3 and Q2 quantization exist but are rarely worth using. The quality degradation becomes severe enough that you’re better off using a smaller model at higher precision.
A 7B model at Q8 will outperform a 13B model at Q2 or Q3 in most tasks while using a comparable amount of VRAM.
Real-World Hardware Recommendations by Use Case
Let’s translate these VRAM requirements into actual GPU recommendations for different user profiles.
Budget Experimentation ($250-400)
Goal: Run 7B models comfortably, occasionally test 13B at Q4
Recommended GPU: RTX 3060 12GB (used $250-300, new $320-380)
What you can run:
- 7B models at Q8: Excellent performance
- 7B models at Q4: Very fast, extra headroom
- 13B models at Q4: Possible with 4K context limit
- Multiple 7B models simultaneously: For comparison testing
Limitations:
- No 13B at Q8
- No 70B models at any precision
- Limited context length on larger models
Serious Hobbyist ($800-1,200)
Goal: Run 13B models at Q8 with good performance
Recommended GPU: RTX 4080 16GB ($900-1,100) or used RTX 3090 24GB ($800-900)
What you can run:
- 7B models at FP16: If you really want maximum quality
- 7B models at Q8: Extremely fast
- 13B models at Q8: Comfortable on the 24GB RTX 3090 with room for 8K context; a tight fit at 4K context on the 16GB RTX 4080
- 13B models at Q4: Very fast with long context
Limitations:
- 70B models run only at Q4, and only with aggressive offloading
- For 70B, a dual-GPU setup is the more realistic path
Professional/Power User ($1,500-2,000)
Goal: Run 13B at Q8 optimally, touch 70B territory
Recommended GPU: RTX 4090 24GB ($1,600-2,000)
What you can run:
- 7B models at FP16: Maximum quality if needed
- 13B models at Q8: Excellent performance, 16K context possible
- 13B models at FP16: With reduced context length
- 34B models at Q4: Usable performance, though VRAM headroom is tight
- 70B models at Q4: Technically possible with optimizations
This is the sweet spot for serious local LLM work without going into enterprise hardware.
Enterprise/Multi-GPU ($3,000+)
Goal: Run 70B models at Q8 or larger models comfortably
Recommended Setup: 2× RTX 4090 24GB ($3,200-4,000 total)
What you can run:
- Everything the single 4090 can do
- 70B models at Q4: Good performance (15-25 tokens/second)
- 70B models at Q8: Only with heavy offloading to system RAM, at a significant speed cost
- Multiple 13B models simultaneously
- Fine-tuning 13B models
Common VRAM Pitfalls to Avoid
1. Forgetting about context length
Don’t just calculate base model VRAM. If you plan to use 8K or 16K contexts regularly, factor that into your GPU purchase.
2. Assuming newer GPUs are always better
The RTX 4060 8GB is newer than the RTX 3060 12GB but worse for LLMs. VRAM capacity matters more than generation for inference.
3. Underestimating overhead
The formulas give you model size, but operating systems, drivers, and inference engines all consume VRAM too. Always budget 15-30% extra.
4. Planning for exactly one model
You’ll want to experiment with different models, run comparisons, or have multiple models loaded. Buy more VRAM than your calculations suggest you need.
5. Ignoring system RAM
Some frameworks can offload layers to system RAM, but this kills performance. Don’t rely on this as a solution—buy enough VRAM.
Conclusion
VRAM requirements for LLMs are straightforward once you understand the formula: model parameters multiplied by bytes per parameter, plus overhead. For most users, 7B models at Q8 (8-10GB VRAM) or 13B models at Q8 (15-18GB VRAM) represent the sweet spot of capability and accessibility. The RTX 3060 12GB handles 7B models excellently, while a 24GB card like the RTX 3090 or RTX 4090 is what it takes to handle 13B models comfortably at Q8.
The key is being honest about what you’ll actually run. If you’re primarily using 7B models, don’t overspend on a 4090. If you know you want 13B quality, don’t try to make a 12GB GPU work. Match your hardware to your actual needs, and you’ll build a system that serves you well for years.