Inference latency is the wall-clock time from receiving a request to returning the first token (time to first token, TTFT) or the full response (end-to-end latency). For production LLM deployments, latency is often the primary constraint: users notice delays above 200–300ms for first token, and slow generation feels broken even when the output is correct. Reducing latency without degrading output quality requires understanding where time is actually spent and applying targeted optimizations. This guide covers the three highest-leverage levers: KV cache management, request batching, and quantization.
Where Latency Comes From
LLM inference has two distinct phases with different performance characteristics. The prefill phase processes the input prompt in a single forward pass — all input tokens are processed in parallel, leveraging the GPU’s full matrix multiplication throughput. Prefill is compute-bound: a longer prompt means more compute, and while short prompts can see sub-linear time scaling (the GPU has spare parallelism to absorb extra tokens), once the GPU is saturated prefill time grows roughly linearly with prompt length, plus a quadratic attention term that matters at very long contexts. For a 1,000-token prompt on an A100, prefill typically takes 50–150ms.
The decode phase generates output tokens one at a time. Each token requires a forward pass through the full model, but with batch size 1 (for single-request serving) and sequence length 1 for the new token, the arithmetic intensity is extremely low — the GPU spends most of its time moving weights from HBM to compute units rather than doing useful computation. Decode is memory-bandwidth bound. For a 7B model in bf16, generating each token requires reading roughly 14GB of weights from memory; at A100’s 2TB/s HBM bandwidth, that’s approximately 7ms per token, regardless of how many FLOPS the GPU could theoretically perform. This is why decode throughput is almost entirely determined by memory bandwidth and weight size, not raw compute.
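This back-of-the-envelope roofline can be written down directly. The 7B/bf16/A100 figures below are the same illustrative numbers as above; substitute your own model size and measured bandwidth:

```python
# Roofline estimate of decode latency for a memory-bandwidth-bound model:
# every decode step must stream all weights from HBM at least once, so
# weight bytes / bandwidth is a lower bound on time per token.

def decode_ms_per_token(weight_bytes: float, hbm_bandwidth_bytes_per_s: float) -> float:
    """Lower-bound decode latency in milliseconds per generated token."""
    return weight_bytes / hbm_bandwidth_bytes_per_s * 1000

# 7B parameters in bf16 (2 bytes each) on an A100 (~2 TB/s HBM bandwidth)
ms = decode_ms_per_token(7e9 * 2, 2e12)
print(f"{ms:.1f} ms/token")  # ~7.0 ms/token, matching the estimate above
```

Real decode is somewhat slower than this bound (KV cache reads, kernel launch overhead), but the bound tracks observed throughput closely enough to guide hardware and quantization decisions.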
Time to first token is dominated by prefill latency. Tokens per second (generation throughput) is dominated by decode latency. Optimizations that help one may not help the other — understanding which metric matters for your application guides which optimizations to prioritize.
KV Cache: What It Is and Why It Matters
During the decode phase, the model needs the key and value tensors for every previous token to compute attention for the new token. Without caching, this would require recomputing these tensors from scratch on every decode step — processing the full growing sequence each time. The KV cache stores key and value tensors as they’re computed, so each decode step only needs to compute the new token’s query and look up the cached keys and values from all previous tokens.
The KV cache grows linearly with sequence length. For each token generated, the cache adds 2 * num_layers * num_kv_heads * head_dim * sizeof(dtype) bytes. For Llama 3 8B in bf16 (32 layers, 8 KV heads, head dim 128): 2 * 32 * 8 * 128 * 2 = 131KB per token. A 4,096-token conversation consumes roughly 512MB of KV cache — significant but manageable. At 128K tokens, the KV cache alone is 16GB. KV cache memory is the primary constraint on maximum context length and concurrent request capacity.
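The per-token formula can be checked with a few lines, using the Llama 3 8B figures from the text:

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    """KV cache bytes added per token: 2 tensors (K and V) per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(32, 8, 128)      # Llama 3 8B in bf16
print(per_token)                                 # 131072 bytes (~131 KB)
print(per_token * 4096 / 2**30)                  # 0.5 GiB for a 4,096-token context
print(per_token * 131072 / 2**30)                # 16.0 GiB at 128K tokens
```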
PagedAttention, implemented in vLLM, manages KV cache memory in fixed-size pages rather than pre-allocating contiguous blocks per request. This eliminates internal fragmentation (wasted space within pre-allocated blocks) and external fragmentation (unusable gaps between blocks), increasing effective KV cache utilization from 50–60% (naive allocation) to over 90%. Higher KV cache utilization means more concurrent requests at the same GPU memory budget, which directly improves throughput without any change to the model.
KV cache quantization is a more aggressive memory reduction. Quantizing the KV cache from bf16 to an 8-bit format (fp8 or int8) halves its memory footprint with minimal quality impact — attention score computation is relatively robust to 8-bit precision in the keys and values. This either doubles the maximum context length or doubles the concurrent request capacity at the same memory budget. vLLM supports this via the --kv-cache-dtype fp8 flag. For latency-sensitive deployments where context lengths are long and concurrency is high, KV cache quantization is one of the highest-leverage memory optimizations available.
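As a sketch, enabling this in a self-hosted vLLM deployment looks like the following; the model name and port are placeholders, the --kv-cache-dtype flag is the point:

```shell
# Launch a vLLM OpenAI-compatible server with an fp8-quantized KV cache.
# Model name and port are placeholders for your own deployment.
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
    --kv-cache-dtype fp8 \
    --port 8000
```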
Batching Strategies
Batching multiple requests together is the primary way to improve GPU utilization for decode. In decode, a single request uses a tiny fraction of the GPU’s compute — the memory bandwidth bottleneck means the GPU is mostly idle waiting for weight transfers. Running multiple requests in parallel amortizes the weight transfer cost across all requests in the batch, approaching the GPU’s memory bandwidth ceiling and dramatically improving throughput per GPU-hour.
Static batching groups requests into fixed-size batches and processes each batch together from start to finish. This is simple to implement but inefficient: when a short request in the batch finishes, its GPU allocation sits idle waiting for the longest request to complete before the batch is released. For requests with highly variable lengths (common in production), static batching wastes 30–50% of GPU compute on idle padding time.
Continuous batching (also called iteration-level scheduling or in-flight batching) solves this by adding new requests to the batch as existing ones finish, at the token level rather than the batch level. When a request generates its final token, its slot is immediately freed and a new request from the queue takes its place. The GPU stays fully utilized as long as there are requests in the queue. vLLM, TGI, and most production serving frameworks implement continuous batching by default. The throughput improvement over static batching is typically 2–5x for mixed-length workloads.
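A toy scheduler simulation illustrates the difference. The lengths, batch size, and one-step-per-token cost model are all simplifying assumptions, not vLLM's actual scheduler:

```python
def static_batch_steps(lengths: list[int], batch_size: int) -> int:
    """Static batching: each fixed batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batch_steps(lengths: list[int], batch_size: int) -> int:
    """Continuous batching: a finished request frees its slot immediately,
    and a queued request takes its place at the next decode step."""
    queue = list(lengths)
    active: list[int] = []
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:   # refill freed slots
            active.append(queue.pop(0))
        steps += 1                                   # one decode step for the batch
        active = [n - 1 for n in active if n > 1]    # drop finished requests
    return steps

lengths = [10, 200, 15, 180, 12, 190, 20, 170]       # mixed-length workload
print(static_batch_steps(lengths, 4))                # 390 decode steps
print(continuous_batch_steps(lengths, 4))            # 212 decode steps
```

Even in this tiny example, continuous batching cuts total decode steps by roughly 1.8x; the gap widens as length variance and queue depth grow.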
The interaction between batching and latency is a fundamental trade-off. Larger batches improve throughput but increase latency for individual requests: a request that arrives just after a batch starts must wait for the next batch cycle (static) or join an already-in-progress decode phase (continuous). For latency-sensitive applications, keep batch sizes small and use continuous batching to minimize queuing delay. For throughput-optimized deployments (offline batch processing, async workloads), maximize batch size up to the memory limit.
Quantization for Latency
Quantization reduces the precision of model weights, which reduces the memory they occupy and therefore how long it takes to load them from HBM during decode. Since decode is memory-bandwidth bound, loading weights faster directly translates to faster token generation. A 4-bit quantized 7B model has weights of roughly 3.5GB instead of 14GB — loading 4x less data per decode step gives approximately 4x higher token throughput at the same memory bandwidth.
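The same roofline arithmetic gives the expected speedup. This is an upper bound — real deployments lose some of it to dequantization overhead, which is why observed gains are typically 3–4x rather than a clean 4x — and the sizes and bandwidth below are illustrative:

```python
def decode_tokens_per_s(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-roofline upper bound on decode throughput."""
    return bandwidth_bytes_per_s / weight_bytes

bw = 2e12                                       # A100, ~2 TB/s HBM bandwidth
bf16 = decode_tokens_per_s(7e9 * 2.0, bw)       # 7B params, 2 bytes each
q4   = decode_tokens_per_s(7e9 * 0.5, bw)       # 4-bit weights, 0.5 bytes each
print(f"bf16 ~{bf16:.0f} tok/s, 4-bit ~{q4:.0f} tok/s, {q4 / bf16:.1f}x speedup")
```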
GPTQ and AWQ are the two most widely used weight-only quantization methods for LLM inference. Both quantize weights to 4 bits while keeping activations in bf16, which preserves most of the model’s quality while delivering the memory bandwidth benefits of reduced weight size. AWQ (Activation-aware Weight Quantization) generally achieves slightly better quality than GPTQ at the same bit-width by identifying and protecting the most salient weights from aggressive quantization. For most production use cases, AWQ 4-bit is the recommended default: it provides 3–4x throughput improvement over bf16 with perplexity degradation of less than 1% on most benchmarks.
Speculative decoding is a fundamentally different approach to reducing decode latency. A small, fast draft model generates multiple candidate tokens in sequence, then a single forward pass of the large verifier model accepts or rejects the candidates. When the draft model is accurate (which it typically is for predictable outputs), multiple tokens are accepted per verifier forward pass, effectively increasing tokens per second without changing the output distribution. Speculative decoding works best when outputs are predictable (code generation, structured outputs, repetitive text) and less well for highly creative or unpredictable generation where the draft model frequently proposes wrong tokens that the verifier rejects.
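A simplified model makes the dependence on draft accuracy concrete. Assuming each of the k drafted tokens is accepted independently with probability p (an idealization of the actual rejection-sampling scheme), and that the verifier always contributes one token itself on a rejection or exhausted draft:

```python
def expected_tokens_per_pass(p: float, k: int) -> float:
    """Expected tokens emitted per verifier forward pass with k drafted
    tokens, each accepted independently with probability p:
    sum_{i=0}^{k} p**i = (1 - p**(k+1)) / (1 - p)."""
    return (1 - p ** (k + 1)) / (1 - p)

print(expected_tokens_per_pass(0.8, 4))   # predictable output: ~3.36 tokens/pass
print(expected_tokens_per_pass(0.3, 4))   # unpredictable output: ~1.43 tokens/pass
```

With a high acceptance rate, each expensive verifier pass yields several tokens; with a low one, the overhead of running the draft model can erase the gain entirely.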
Prefill Optimization
For applications with long system prompts or fixed context that’s the same across many requests (RAG with static documents, applications with long fixed instructions), prompt caching eliminates redundant prefill computation. The KV cache for the fixed prefix is computed once and reused across all requests that share it. Anthropic’s API, OpenAI’s API, and self-hosted deployments via vLLM’s prefix caching feature all support this. For a 2,000-token system prompt repeated across 1,000 requests, prefix caching eliminates 2 million tokens of prefill computation — a significant latency and cost reduction.
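As a sketch, prefix caching in a self-hosted vLLM deployment is enabled with a single flag; the model name is a placeholder:

```shell
# Launch vLLM with automatic prefix caching: KV cache blocks computed for a
# shared prompt prefix are reused by later requests with the same prefix.
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
    --enable-prefix-caching
```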
Chunked prefill processes long prompts in chunks rather than all at once, interleaving prefill chunks with decode steps from other requests. This prevents long prefills from blocking the decode phase for concurrent requests, reducing tail latency at the expense of slightly higher average prefill time. It’s most beneficial for serving mixed workloads where some requests have very long prompts and others are short — without chunked prefill, a 100K-token prefill blocks all concurrent decode steps for several seconds.
Putting It Together
For latency-optimized single-request serving: use a quantized model (AWQ 4-bit), enable torch.compile with reduce-overhead mode and static KV cache, and minimize batch size. For throughput-optimized batch serving: maximize batch size with continuous batching, use PagedAttention for efficient KV cache management, enable prefix caching for shared context, and consider speculative decoding for predictable output types. For balanced production serving: continuous batching with moderate batch sizes, AWQ quantization, PagedAttention, and KV cache quantization to fp8 if context lengths are long. These aren’t mutually exclusive — most of these optimizations compose cleanly and applying all of them simultaneously delivers the combined benefit.
Measuring Latency Correctly
Latency benchmarks for LLM inference are frequently misleading because they conflate different metrics or measure under conditions that don’t reflect production. Time to first token (TTFT) and time per output token (TPOT) are the two metrics that matter for user experience, and they need to be measured separately. TTFT is dominated by prefill and is largely independent of output length; TPOT is determined by decode throughput and is roughly constant per token for a given model and hardware configuration.
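A minimal harness for measuring the two metrics separately might look like this. The streaming client is faked here; substitute any iterable that yields tokens as your server streams them:

```python
import time

def measure_ttft_tpot(stream) -> tuple[float, float]:
    """Return (TTFT, mean TPOT) in seconds for a non-empty token stream."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in stream:
        now = time.perf_counter()
        if first is None:
            first = now            # first token arrived: TTFT endpoint
        count += 1
    ttft = first - start
    tpot = (now - first) / (count - 1) if count > 1 else 0.0
    return ttft, tpot

# Fake stream standing in for a real streaming client:
def fake_stream():
    time.sleep(0.05)               # simulated prefill
    for _ in range(10):
        time.sleep(0.01)           # simulated per-token decode
        yield "tok"

ttft, tpot = measure_ttft_tpot(fake_stream())
print(f"TTFT {ttft * 1000:.0f} ms, TPOT {tpot * 1000:.1f} ms")
```

To benchmark under load, run many of these measurements concurrently at your target request rate and report the P99, not the mean.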
Always measure under realistic load. A single isolated request to an otherwise-idle GPU will show better latency than the same request arriving during peak traffic, because under load the request must wait in queue and competes for KV cache memory with concurrent requests. Benchmark at your target concurrency level — if you plan to serve 20 concurrent users, benchmark at 20 concurrent requests, not at 1. P99 latency under load is the metric that determines whether your SLA is met in production, not median latency on isolated requests.
When comparing optimization techniques, control carefully for what’s being held constant. A comparison of bf16 versus AWQ 4-bit that doesn’t hold batch size constant is misleading — the quantized model may allow a larger batch, which improves throughput but increases per-request latency. Test at the same batch size to isolate the quantization effect, then separately test at the optimal batch size for each configuration to understand the full trade-off space.
Hardware Choice and Latency
The memory bandwidth of the GPU is the primary determinant of decode latency for memory-bandwidth-bound generation. A100 80GB provides 2TB/s HBM bandwidth; H100 SXM provides 3.35TB/s — roughly 1.6x more bandwidth, which translates to approximately 1.6x higher decode tokens per second for the same model and batch size. H100’s higher bandwidth is the primary reason for its higher throughput on LLM inference, more so than its higher FLOPS count (which matters more for prefill and training).
For latency-sensitive single-request serving where cost is a secondary concern, H100s provide meaningfully lower TTFT and TPOT compared to A100s. For throughput-optimized batch inference where the goal is maximizing tokens per dollar, the cost-per-token difference between A100 and H100 cloud instances narrows considerably when both are running at optimal batch sizes. Calculate cost per million tokens at your target concurrency for both hardware options before committing — the right choice depends on your workload’s specific balance of prefill-heavy versus decode-heavy requests and whether you’re optimizing for latency SLA or cost efficiency.