If you’ve ever wondered why your local LLM slows down during long conversations or why context length has such a dramatic impact on performance, the answer lies in something called KV cache. This seemingly technical concept is actually the primary bottleneck determining how fast large language models can generate tokens—and understanding it will help you optimize your LLM setup, whether you’re running models locally or using cloud APIs.
Understanding LLM Token Generation: The Foundation
Before we dive into KV cache specifically, you need to understand how LLMs generate text. Unlike traditional software, which computes a result and returns it in one pass, language models generate text one token at a time, and each step requires looking at all previous tokens in the conversation.
When you prompt a model with “Write a Python function that”, the model:
- Processes your entire prompt (called the prompt processing phase)
- Generates the first token, maybe “def”
- Looks at your prompt + “def” to generate the next token, maybe “calculate”
- Looks at your prompt + “def calculate” to generate the next token, maybe “_”
- Continues this process until completion
Each new token requires the model to “see” all previous tokens. This is where the computational challenge emerges.
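To make that loop concrete, here is a minimal sketch of naive greedy decoding with the Hugging Face transformers library. The gpt2 checkpoint and the 20-token limit are arbitrary choices for illustration; the point is that every pass re-runs the model over the entire sequence built so far.

```python
# Naive greedy decoding: every step re-runs the model over the ENTIRE
# sequence generated so far. gpt2 is used only because it is small;
# the same loop applies to any causal language model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("Write a Python function that", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):
        # Full forward pass over every token so far (no cache reused)
        logits = model(input_ids, use_cache=False).logits
        # Greedily take the most likely next token from the last position
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        # Append it; the next iteration reprocesses everything again
        input_ids = torch.cat([input_ids, next_id], dim=-1)

print(tokenizer.decode(input_ids[0]))
```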
The naive approach would recalculate attention scores for every previous token at each generation step. If your context is 1,000 tokens and you’re generating token 1,001, the model would:
- Calculate attention for all 1,000 previous tokens
- Generate token 1,001
- Calculate attention for all 1,001 previous tokens
- Generate token 1,002
- Calculate attention for all 1,002 previous tokens
- And so on…
This recalculation is incredibly wasteful. The intermediate results for previously processed tokens don't change once computed, yet the naive approach recomputes them at every single step, thousands of times over the course of a long response.
KV cache solves this problem by storing the intermediate attention calculations so they never need to be recomputed.
What Exactly Is KV Cache?
KV cache stands for Key-Value cache. To understand what this means, we need to briefly discuss how transformer attention mechanisms work.
The Attention Mechanism (Simplified)
In transformer models, attention works through three components for each token:
- Query (Q): “What am I looking for?”
- Key (K): “What information do I have?”
- Value (V): “What is that information?”
When generating a new token, the model:
- Creates a Query from the current position
- Compares this Query against the Keys of all previous tokens
- Uses these comparisons to weight the Values
- Combines the weighted Values to inform the next token
The critical insight: the Keys and Values for previously processed tokens never change. Once you’ve processed “Write a Python function”, the K and V representations for those tokens are fixed. Only the new Query changes as you generate each new token.
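A toy, model-free sketch in PyTorch makes the lookup concrete. The shapes below (10 cached tokens, a 64-dimensional head) are invented purely for readability; real models repeat this per head and per layer.

```python
# Attention for ONE new token against the Keys/Values of tokens that
# were already processed. All sizes are illustrative.
import torch
import torch.nn.functional as F

d = 64                           # head dimension (illustrative)
cached_K = torch.randn(10, d)    # Keys for 10 already-processed tokens
cached_V = torch.randn(10, d)    # Values for the same 10 tokens
q = torch.randn(d)               # Query for the token being generated now

# Compare the new Query against every cached Key...
scores = (cached_K @ q) / d**0.5      # shape: (10,)
weights = F.softmax(scores, dim=-1)   # attention weights over past tokens
# ...and blend the cached Values accordingly.
context = weights @ cached_V          # shape: (d,)
```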
KV Cache: Storing Past Calculations
KV cache is simply a data structure that stores these Keys and Values for every token that’s already been processed. Instead of recalculating them for each new token, the model:
- Looks up the cached Keys and Values for all previous tokens
- Calculates only the new Query for the current position
- Performs attention using cached K/V and new Q
- Generates the next token
- Adds the new token’s K/V to the cache
This transforms each generation step from an O(n²) operation (recomputing attention across all n previous tokens, each of which attends to its own predecessors) into an O(n) operation (one new Query attending over n cached Keys and Values).
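Under the same illustrative assumptions as the earlier sketch (Hugging Face transformers, gpt2 standing in for a real model), the cached version of the loop looks roughly like this. Notice that after the prompt has been processed once, each step feeds the model only the newest token plus the cache.

```python
# The same greedy loop, but reusing cached Keys/Values via past_key_values:
# the prompt is processed once, then each step feeds ONLY the newest token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("Write a Python function that", return_tensors="pt").input_ids

with torch.no_grad():
    out = model(input_ids, use_cache=True)            # prefill: builds the cache
    past = out.past_key_values
    next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    generated = [next_id]

    for _ in range(19):
        # Only the single new token goes in; the cache covers the rest
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values                    # cache grows by one entry
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_id)

print(tokenizer.decode(torch.cat([input_ids] + generated, dim=-1)[0]))
```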
Why KV Cache Affects Speed: The Real-World Impact
The performance difference between using KV cache and not using it is staggering.
Without KV cache:
- Llama 3.1 7B generating 100 tokens with 1,000 token context: ~2-3 tokens/second
- Each token requires full reprocessing of all previous tokens
- Generation time scales quadratically with context length
With KV cache:
- Llama 3.1 7B generating 100 tokens with 1,000 token context: 25-40 tokens/second
- Each token only requires attention lookup, not recomputation
- Generation time scales linearly with context length
The speedup is typically 10-20x for interactive workloads. Without KV cache, modern LLMs would be unusably slow for interactive applications.
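You don't have to take those numbers on faith. A crude timing comparison with the cache toggled off shows the gap on whatever hardware you have; gpt2 is again just a small stand-in, and the gap widens with model size and context length.

```python
# Rough timing comparison: generate the same continuation with the KV
# cache enabled and disabled.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
inputs = tokenizer("The quick brown fox", return_tensors="pt")

for use_cache in (True, False):
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=200, do_sample=False,
                       use_cache=use_cache, pad_token_id=tokenizer.eos_token_id)
    print(f"use_cache={use_cache}: {time.perf_counter() - start:.2f}s")
```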
The Memory Trade-Off
KV cache trades memory for speed. Instead of recalculating Keys and Values, you store them, and that storage requirement grows with:
- Context length: More tokens = more cached Keys and Values
- Model size: Larger models have larger K/V representations per token
- Batch size: Processing multiple requests simultaneously multiplies cache requirements
Let’s quantify this with real numbers.
KV Cache Memory Requirements: The Math
The memory required for KV cache can be calculated precisely:
KV Cache Memory Formula
Memory (bytes) = 2 × num_layers × hidden_size × context_length × bytes_per_value × batch_size
Where:
- 2 = Keys and Values (two separate caches)
- num_layers = Number of transformer layers in the model
- hidden_size = Dimension of the model’s hidden state
- context_length = Number of tokens in context
- bytes_per_value = 2 for FP16, 1 for FP8, etc.
- batch_size = Number of concurrent requests
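One caveat before the worked examples: this formula assumes standard multi-head attention, where every head caches its own Keys and Values. Models that use grouped query attention (covered later) cache K/V for only a subset of heads, so their real footprint is several times smaller than the formula suggests. With that assumption stated, a small helper makes the arithmetic easy to reproduce; the default values below mirror the first worked example.

```python
def kv_cache_bytes(num_layers: int, hidden_size: int, context_length: int,
                   bytes_per_value: int = 2, batch_size: int = 1) -> int:
    """KV cache size in bytes, assuming standard multi-head attention
    (Keys and Values cached at the full hidden size in every layer)."""
    return 2 * num_layers * hidden_size * context_length * bytes_per_value * batch_size

# Reproduces the first worked example below: 32 layers, hidden size 4096,
# 4,096-token context, FP16, batch size 1 -> 2,147,483,648 bytes (~2 GB)
print(kv_cache_bytes(num_layers=32, hidden_size=4096, context_length=4096))
```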
Practical Examples
Llama 3.1 7B with 4K context (FP16):
- Layers: 32
- Hidden size: 4096
- Context: 4,096 tokens
- Precision: FP16 (2 bytes)
- Batch size: 1
Memory = 2 × 32 × 4096 × 4096 × 2 × 1 = 2,147,483,648 bytes ≈ 2GB
Llama 3.1 13B with 8K context (FP16):
- Layers: 40
- Hidden size: 5120
- Context: 8,192 tokens
- Precision: FP16 (2 bytes)
- Batch size: 1
Memory = 2 × 40 × 5120 × 8192 × 2 × 1 = 6,710,886,400 bytes ≈ 6.7GB
Llama 3.1 70B with 4K context (FP16):
- Layers: 80
- Hidden size: 8192
- Context: 4,096 tokens
- Precision: FP16 (2 bytes)
- Batch size: 1
Memory = 2 × 80 × 8192 × 4096 × 2 × 1 = 10,737,418,240 bytes ≈ 10.7GB
Why Context Length Matters So Much
Notice how the memory scales linearly with context length. Double your context from 4K to 8K, and your KV cache memory doubles.
| Model | 2K Context | 4K Context | 8K Context | 16K Context |
|---|---|---|---|---|
| Llama 3.1 7B | 1.0 GB | 2.0 GB | 4.0 GB | 8.0 GB |
| Llama 3.1 13B | 1.7 GB | 3.4 GB | 6.7 GB | 13.4 GB |
| Llama 3.1 70B | 5.4 GB | 10.7 GB | 21.5 GB | 42.9 GB |
This is in addition to the model weights themselves. A Llama 3.1 70B Q4 model requires ~35GB for weights. At 16K context, you need another 43GB for KV cache—nearly 80GB total memory.
This is why context length has such a dramatic impact on VRAM requirements and why you can’t just run arbitrarily long contexts even if the model theoretically supports them.
How Context Length Affects Generation Speed
Beyond memory, KV cache size directly impacts generation speed through memory bandwidth constraints.
The bottleneck: Modern LLM inference is memory-bandwidth-bound, not compute-bound. The GPU/CPU can perform the calculations faster than it can fetch the data from memory.
Each token generation requires:
- Loading all cached Keys from memory
- Loading all cached Values from memory
- Computing attention scores
- Writing new K/V to cache
As context length grows, more data must be loaded from memory for each token. This creates a direct relationship between context length and generation speed.
Real-World Performance Impact
RTX 4090 running Llama 3.1 7B Q4:
- 2K context: 95 tokens/second
- 4K context: 87 tokens/second (8% slower)
- 8K context: 76 tokens/second (20% slower)
- 16K context: 61 tokens/second (36% slower)
- 32K context: 47 tokens/second (51% slower)
M2 Max running Llama 3.1 13B Q8:
- 2K context: 46 tokens/second
- 4K context: 44 tokens/second (4% slower)
- 8K context: 39 tokens/second (15% slower)
- 16K context: 31 tokens/second (33% slower)
Per-token generation time grows roughly linearly with context length because the amount of cached data that must be read for each new token grows linearly with the context.
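A back-of-the-envelope model shows why. Assume, purely for illustration, that decoding is entirely bandwidth-bound: every generated token has to stream the model weights plus the current KV cache through memory once. Every number below is a rough assumption rather than a measurement, so treat the output as an optimistic upper bound.

```python
# Crude, bandwidth-only estimate of decode speed. Illustrative inputs:
# ~4 GB of quantized weights, ~0.5 MB of cached K/V per token of context
# (the 7B FP16 figure from the formula above), ~1 TB/s of bandwidth.
GB, MB = 1024**3, 1024**2

weights_bytes = 4 * GB           # weights streamed once per generated token
kv_bytes_per_token = 0.5 * MB    # cached K/V read per token of context
bandwidth = 1000 * GB            # bytes per second the GPU can move

for context in (2048, 4096, 8192, 16384, 32768):
    bytes_per_step = weights_bytes + context * kv_bytes_per_token
    print(f"{context:>6} tokens of context: ~{bandwidth / bytes_per_step:.0f} tok/s upper bound")
```

Even this crude model reproduces the pattern measured above: the fixed cost of streaming the weights dominates at short contexts, and the growing KV cache traffic takes over as the context lengthens.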
Prompt Processing vs Token Generation: Two Different Phases
KV cache behaves differently during the two phases of LLM inference.
Phase 1: Prompt Processing (Prefill)
When you first submit a prompt, the model processes all of its tokens in parallel. For a 1,000-token prompt:
- All 1,000 tokens are processed together
- Attention is computed for the full sequence
- Keys and Values are calculated for all tokens
- These K/V pairs are stored in cache
- This phase is compute-bound and benefits from parallelization
Speed: Measured in tokens/second, typically 500-2,000 t/s on modern hardware
Phase 2: Token Generation (Decode)
After the prompt is processed, generation happens one token at a time:
- Load all cached K/V pairs from memory
- Compute attention for the new token against all cached K/V
- Generate next token
- Add new token’s K/V to cache
- Repeat
This phase is memory-bandwidth-bound and, for a single request, cannot be parallelized across the sequence the way prefill can
Speed: Measured in tokens/second, typically 20-100 t/s on consumer hardware
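The two phases are easy to separate with the same kind of manual loop used earlier (gpt2 again as a small stand-in). The absolute times mean little on a toy model; the shape is what matters: one comparatively expensive prefill, then a series of cheap per-token decode steps.

```python
# Time the prefill pass separately from individual decode steps.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
input_ids = tokenizer("Explain the KV cache. " * 50, return_tensors="pt").input_ids

with torch.no_grad():
    start = time.perf_counter()
    out = model(input_ids, use_cache=True)            # phase 1: prefill
    print(f"prefill of {input_ids.shape[1]} tokens: {time.perf_counter() - start:.3f}s")

    past = out.past_key_values
    next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    for step in range(5):                             # phase 2: decode, one token at a time
        start = time.perf_counter()
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        print(f"decode step {step}: {time.perf_counter() - start:.3f}s")
```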
Why This Matters
The two-phase nature means:
Long prompts: Slow to start (processing 10,000 tokens might take 5-15 seconds), but once generation begins, speed is normal
Short prompts: Near-instant start, but if the conversation grows long, generation slows as KV cache fills with previous messages
This is why chatbots that keep the full conversation history eventually slow down—the KV cache is growing with every exchange, making each new token slower to generate.
KV Cache Optimization Techniques
Understanding KV cache opens up several optimization strategies.
1. Context Length Management
The problem: Letting conversations grow unbounded fills the KV cache and slows generation.
The solution: Implement a context window that keeps only recent messages:
- Keep last 4,096 tokens of conversation
- Discard older messages when limit is reached
- Optionally summarize discarded content into a system prompt
Impact: Maintains consistent generation speed even in multi-hour conversations
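A minimal sketch of that policy, in plain Python. The count_tokens helper is a deliberate placeholder (in practice you would count with your model's tokenizer), and the 4,096-token budget simply matches the figure above.

```python
# Keep only as many recent messages as fit in a token budget, so the KV
# cache stops growing without bound. count_tokens is a crude placeholder.
def count_tokens(text: str) -> int:
    return len(text.split())  # stand-in for len(tokenizer.encode(text))

def trim_history(messages: list[dict], max_tokens: int = 4096,
                 system_prompt: str = "") -> list[dict]:
    """Walk backwards from the newest message, keeping messages until the
    budget runs out; the system prompt (if any) is always preserved."""
    budget = max_tokens - count_tokens(system_prompt)
    kept: list[dict] = []
    for msg in reversed(messages):
        cost = count_tokens(msg["content"])
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    kept.reverse()
    if system_prompt:
        kept.insert(0, {"role": "system", "content": system_prompt})
    return kept
```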
2. Grouped Query Attention (GQA)
Newer models like Llama 3.1 use GQA to reduce KV cache size:
- Traditional multi-head attention: every attention head in every layer has its own Keys and Values
- GQA: groups of Query heads share a single set of Keys and Values, so far fewer K/V tensors need to be cached
Memory savings: 4-8x reduction in KV cache size with minimal quality impact
Llama 3.1 70B uses GQA, which is why its KV cache requirements are more manageable than you’d expect for a model that size.
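The saving is easy to see from the per-token cache size. The configuration below (80 layers, 64 Query heads, 8 shared K/V heads, head dimension 128) matches typical Llama-style 70B models but is meant as an illustration, not a spec sheet.

```python
# Per-token KV cache size: full multi-head attention vs. GQA.
num_layers, head_dim, bytes_fp16 = 80, 128, 2

def bytes_per_token(num_kv_heads: int) -> int:
    # 2 (K and V) * layers * (KV heads * head dim) * bytes per value
    return 2 * num_layers * num_kv_heads * head_dim * bytes_fp16

mha = bytes_per_token(64)  # every query head keeps its own K/V
gqa = bytes_per_token(8)   # 8 query heads share each K/V head
print(f"MHA: {mha/1024:.0f} KiB/token, GQA: {gqa/1024:.0f} KiB/token ({mha/gqa:.0f}x smaller)")
```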
3. Quantizing KV Cache
Just as model weights can be quantized, so can KV cache:
- FP16 KV cache: the common default, 2 bytes per value
- FP8 KV cache: 8-bit floating point, 1 byte per value (50% memory savings)
- INT8 KV cache: 8-bit integer quantization, 1 byte per value
Quality impact: FP8 KV cache typically has minimal impact on output quality while halving memory requirements
Performance impact: Smaller cache = less memory bandwidth needed = faster generation
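The arithmetic is the earlier formula with a smaller bytes_per_value. Using the same 7B / 4K-context example as before (and the same no-GQA assumption), the cache sizes work out as follows.

```python
# KV cache size for the earlier 7B / 4K-context example at different
# cache precisions.
def kv_cache_gb(bytes_per_value: float) -> float:
    return 2 * 32 * 4096 * 4096 * bytes_per_value / 1024**3

for name, size in [("FP16", 2), ("FP8", 1), ("INT8", 1)]:
    print(f"{name:>5}: {kv_cache_gb(size):.1f} GB")
```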
4. Paged Attention (Used by vLLM)
Traditional KV cache implementations allocate one contiguous block of memory per request. Paged attention instead breaks the cache into pages:
- Memory allocated in small chunks (pages)
- Pages can be non-contiguous
- Pages can be shared across requests with common prefixes
- Pages can be swapped to CPU memory if GPU memory is tight
Benefits:
- Better memory utilization
- Support for longer contexts
- Higher throughput for serving multiple requests
This is why production LLM serving frameworks like vLLM significantly outperform naive implementations.
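To picture the bookkeeping, here is a toy sketch of the idea, not vLLM's actual implementation: a shared pool of fixed-size pages plus a per-request block table that maps a request's tokens to whichever pages happened to be free.

```python
# Toy sketch of paged KV cache bookkeeping. Mirrors the idea behind
# paged attention, not any framework's real data structures.
PAGE_SIZE = 16  # tokens stored per page (illustrative)

class PagedKVCache:
    def __init__(self, total_pages: int):
        self.free_pages = list(range(total_pages))    # unused physical pages
        self.block_tables: dict[str, list[int]] = {}  # request id -> physical pages
        self.token_counts: dict[str, int] = {}        # request id -> tokens cached

    def append_token(self, request_id: str) -> tuple[int, int]:
        """Reserve a slot for one new token's K/V; returns (page, slot)."""
        count = self.token_counts.get(request_id, 0)
        table = self.block_tables.setdefault(request_id, [])
        if count % PAGE_SIZE == 0:        # current page is full (or this is token 0)
            if not self.free_pages:
                raise MemoryError("KV cache pool exhausted")
            # Grab any free page: pages need not be contiguous in memory
            table.append(self.free_pages.pop())
        self.token_counts[request_id] = count + 1
        return table[-1], count % PAGE_SIZE

    def release(self, request_id: str) -> None:
        """Return a finished request's pages to the shared pool."""
        self.free_pages.extend(self.block_tables.pop(request_id, []))
        self.token_counts.pop(request_id, None)

cache = PagedKVCache(total_pages=1024)
page, slot = cache.append_token("request-42")
```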
KV Cache in Multi-User Scenarios
For services handling multiple concurrent users, KV cache management becomes critical.
The challenge: Each user’s conversation needs its own KV cache. With batch_size = 32:
- Llama 3.1 13B with 4K context: 3.4GB × 32 = 108.8GB just for KV cache
- This exceeds most GPU memory even before loading model weights
Solutions:
Continuous batching: Admit new requests into the running batch as soon as earlier ones finish, rather than waiting for the entire batch to complete. Maximizes GPU utilization.
Request prioritization: Allocate KV cache to high-priority requests first, queue others.
Prefix caching: If multiple users have the same system prompt, share that portion of KV cache across requests.
Dynamic batching: Adjust batch size based on current KV cache memory usage and incoming request patterns.
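Of these, prefix caching is the easiest to sketch: key the prefilled K/V of the shared system prompt by its content and reuse it for any request that starts with the same text. In the toy version below, compute_prefill is a hypothetical placeholder for whatever prefill call your inference stack exposes.

```python
# Toy sketch of prefix caching: prefill a shared system prompt once and
# reuse its K/V for every request that starts with the same text.
import hashlib

prefix_cache: dict[str, object] = {}  # prompt hash -> cached prefill result

def compute_prefill(text: str) -> object:
    # Placeholder: a real implementation would run the model over `text`
    # and return its K/V tensors. Here we just return a marker object.
    return f"<kv for {len(text)}-char prefix>"

def get_prefix_kv(system_prompt: str) -> object:
    key = hashlib.sha256(system_prompt.encode()).hexdigest()
    if key not in prefix_cache:
        prefix_cache[key] = compute_prefill(system_prompt)  # prefill once
    return prefix_cache[key]                                # reuse afterwards
```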
Measuring KV Cache Impact in Your Setup
If you’re running LLMs locally, you can observe KV cache behavior directly.
Using Ollama
Ollama displays memory usage including KV cache:
```
ollama run llama3.1:7b-q4
>>> How are you?
[Response generates...]

# Check memory usage
ollama ps
```
You'll see each loaded model, its total memory footprint (weights plus the KV cache buffer allocated for the configured context window), and how that memory is split between CPU and GPU.
Because the cache buffer is sized for the configured context length, raising the context limit (the num_ctx parameter) is what makes this number grow; a longer conversation fills the existing buffer rather than enlarging it.
Using llama.cpp
llama.cpp provides detailed performance metrics:
```
./main -m llama-3.1-7b-q4.gguf -p "Hello" -n 100 --verbose
```
Output includes:
- Prompt processing speed (prompt eval time, in tokens/second)
- Generation speed (eval time, in tokens/second)
- KV cache buffer size, reported when the model loads
Monitoring Tools
For NVIDIA GPUs:
```
nvidia-smi dmon -s mu
```
Streams per-GPU memory usage and utilization in real time; the memory column covers everything resident on the GPU, including model weights and the KV cache.
For Apple Silicon:
```
sudo powermetrics --samplers gpu_power -i 1000
```
Reports GPU power and utilization on Apple Silicon. For the unified memory actually held by the inference process (model weights plus KV cache), check the process in Activity Monitor.
Why Some Models Are Faster Despite Being Larger
Understanding KV cache helps explain counterintuitive performance patterns.
Example: Llama 3.1 70B, which uses GQA, can generate faster at a given context length than an older 70B model built on standard multi-head attention, despite the two having similar parameter counts.
Reason: GQA caches Keys and Values for only a small number of shared K/V heads per layer, so the KV cache is several times smaller at the same context length.
A several-times-smaller KV cache means less memory bandwidth consumed per generated token, enabling faster generation even though the model weights are similar in size.
This is why architectural improvements matter just as much as parameter count for inference performance.
Common Misconceptions About KV Cache
Misconception 1: “KV cache only matters for long contexts”
False. KV cache is used from the very first token. Even a 100-token conversation has a 100-token KV cache. The impact scales with length, but it’s always present.
Misconception 2: “Disabling KV cache will reduce memory usage”
Technically true, but generation becomes 10-20x slower, and at short contexts the memory savings are small compared to the model weights anyway. This is never worth it.
Misconception 3: “KV cache is the same as context window”
Related but distinct. The context window is the maximum length the model supports; the KV cache is the mechanism that makes processing that context efficient. A model can support a 32K context window while the current conversation occupies only 2K tokens of cache.
Misconception 4: “More GPU cores = faster generation despite KV cache”
Not necessarily. Generation is memory-bandwidth-bound. An RTX 4090 with 1TB/s bandwidth will outperform two RTX 4060s with combined 544GB/s bandwidth for LLM inference, despite the 4060s having more total CUDA cores.
Conclusion
KV cache is the mechanism that makes modern LLM interaction practical by avoiding redundant computation. By storing the Key and Value calculations from previously processed tokens, models achieve 10-20x faster generation at the cost of linear memory growth with context length. This trade-off—memory for speed—is fundamental to LLM architecture and directly affects every aspect of performance from VRAM requirements to token generation speed.
Understanding KV cache explains why context length has such dramatic impacts on both memory usage and speed, why longer conversations eventually slow down, and why techniques like grouped query attention and paged attention matter for production deployments. Whether you’re running models locally or optimizing cloud API usage, KV cache behavior is the primary factor determining your LLM’s responsiveness.