How to Reduce LLM Inference Latency: Flash Attention, Speculative Decoding, and KV Cache Optimisation

The Three Components of LLM Latency

LLM inference latency has three distinct components that require different optimisation strategies. Time to first token (TTFT) is the delay between sending a request and receiving the first token of the response — dominated by prefill time, the cost of processing all input tokens. Time per output token (TPOT) is the average time to generate each subsequent token — dominated by memory bandwidth and KV cache size. Total response latency is TTFT plus (TPOT × number of output tokens). For interactive applications, TTFT matters most for perceived responsiveness; for batch processing, total throughput matters more than either individual metric.
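As a quick worked example of the formula above, the sketch below plugs in two illustrative workloads. The function name and every number here are assumptions chosen for illustration, not measurements.

```python
def total_latency_s(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Total response latency = TTFT + (TPOT x number of output tokens)."""
    return ttft_s + tpot_s * output_tokens


# Hypothetical chat request: short prompt, long answer, so decoding dominates.
chat = total_latency_s(ttft_s=0.3, tpot_s=0.025, output_tokens=400)

# Hypothetical classification request: long prompt, tiny answer, so prefill dominates.
classify = total_latency_s(ttft_s=2.0, tpot_s=0.025, output_tokens=4)

print(f"chat: {chat:.2f}s, classify: {classify:.2f}s")
```

In the first case almost all of the time is spent decoding, so TPOT is the lever; in the second, nearly all of it is TTFT.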

Most optimisation techniques target either prefill (reducing TTFT) or decoding (reducing TPOT), and it is important to know which bottleneck you are actually hitting before applying optimisations. Profile your application on real traffic and measure TTFT and TPOT separately — the right fix depends on which one is unacceptable.
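One way to measure the two metrics separately is to time a streaming response on the client side. The following is a minimal sketch: `measure_stream` and `fake_stream` are hypothetical helpers, and `fake_stream` merely stands in for whatever streaming client your stack provides.

```python
import time
from typing import Iterable, Iterator


def measure_stream(tokens: Iterable[str]) -> dict:
    """Consume a token stream and report TTFT and TPOT in seconds."""
    start = time.perf_counter()
    first_token_at = None
    last_token_at = start
    count = 0
    for _ in tokens:
        last_token_at = time.perf_counter()
        if first_token_at is None:
            first_token_at = last_token_at
        count += 1
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    decode_time = last_token_at - first_token_at
    return {
        "ttft_s": first_token_at - start,
        # TPOT is averaged over the tokens that follow the first one.
        "tpot_s": decode_time / (count - 1) if count > 1 else 0.0,
        "output_tokens": count,
    }


def fake_stream(prefill_s: float, per_token_s: float, n: int) -> Iterator[str]:
    """Stand-in for a streaming client: slow first token, steady afterwards."""
    time.sleep(prefill_s)
    for i in range(n):
        if i > 0:
            time.sleep(per_token_s)
        yield f"tok{i}"


stats = measure_stream(fake_stream(prefill_s=0.05, per_token_s=0.01, n=6))
print(stats)
```

Wrapping a real client's token iterator in `measure_stream` gives per-request numbers that can be logged and compared before and after each optimisation.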

Flash Attention: The Baseline Optimisation

Flash Attention is a memory-efficient attention algorithm that restructures the standard attention computation to avoid materialising the full attention matrix in GPU high-bandwidth memory (HBM). Standard attention reads and writes the N×N attention matrix multiple times, which makes it bandwidth-bound for long sequences. Flash Attention fuses the operations and tiles the computation so that intermediate values stay in the much faster on-chip SRAM, dramatically reducing HBM reads and writes.

The practical impact: Flash Attention reduces the HBM memory used by attention from O(N²) to O(N) and delivers 2–4x faster attention for long sequences with no change in output quality. It computes exact attention rather than an approximation, so its results match standard attention up to floating-point rounding. For sequences of 4K tokens, the speedup is noticeable; for sequences of 32K or 128K tokens, Flash Attention is essential for practical throughput.
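The core idea behind the tiling can be sketched in plain Python for a single query vector. This is an illustration of the running-softmax trick only, not the actual fused GPU kernel; the function names and the block size are assumptions.

```python
import math
import random


def naive_attention(q, keys, values):
    """Reference: materialise every score for one query, then softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def tiled_attention(q, keys, values, block_size=4):
    """Stream over K/V in blocks, keeping only a running max, sum and output."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new running max.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]


random.seed(0)
dim, seq_len = 8, 37
q = [random.gauss(0, 1) for _ in range(dim)]
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]

reference = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, block_size=4)
max_diff = max(abs(a - b) for a, b in zip(reference, tiled))
print(f"max abs difference: {max_diff:.2e}")
```

The tiled version never holds more than one block of scores at a time, which is what lets the real kernel keep its working set in SRAM while still producing the same result as the full softmax.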

Flash Attention 2 and 3 extend the original with further optimisations for newer GPU architectures. Most modern inference frameworks enable it by default:

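# PyTorch: SDPA picks a Flash Attention kernel automatically in 2.0+ when inputs are eligible.
# To force it, use sdpa_kernel (PyTorch 2.3+; replaces the deprecated torch.backends.cuda.sdp_kernel).
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel
# Example inputs shaped (batch, heads, seq_len, head_dim), bf16 on a CUDA device
q, k, v = (torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.bfloat16) for _ in range(3))
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    output = torch.nn.functional.scaled_dot_product_attention(q, k, v)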

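# vLLM: enabled by default, no configuration needed

# Transformers: request it when loading the model (requires the flash-attn package)
import torch
from transformers import AutoModelForCausalLM

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # example model; substitute your own
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)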

If you are running a production LLM inference stack and Flash Attention is not enabled, enabling it is the single highest-impact, lowest-risk latency improvement available. Check your framework’s documentation to confirm it is active.

Speculative Decoding: Faster Generation via Draft Models

Speculative decoding uses a small, fast “draft” model to speculatively generate several tokens ahead, then verifies them with the large target model in a single forward pass. When the draft model’s predictions are correct — which happens frequently for common words, phrases, and continuations — you get multiple tokens for roughly one large model forward pass, significantly reducing generation latency.

The mechanics: the draft model generates k candidate tokens (typically 4–8). The target model processes them all in parallel in one forward pass and accepts the longest prefix that matches what it would have generated, then generates the next token from the point of divergence. For text distributions where the draft model is reliable (common phrases, code boilerplate, templated outputs), acceptance rates of 70–85% are achievable, translating to 2–3x speedups on generation latency.
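The draft-then-verify loop described above can be sketched with toy "models" that map a token context to the next token. Everything here is hypothetical and greedy-only: a real implementation scores all k draft positions in one batched forward pass of the target model, whereas this sketch loops over them for clarity.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token context to the next token


def speculative_step(target_next: NextToken, draft_next: NextToken,
                     context: List[int], k: int) -> List[int]:
    """One draft-then-verify round under greedy decoding; returns the new tokens."""
    # 1. The draft model proposes k tokens autoregressively.
    draft: List[int] = []
    for _ in range(k):
        draft.append(draft_next(context + draft))

    # 2. The target model checks each proposed position in order.
    accepted: List[int] = []
    for tok in draft:
        expected = target_next(context + accepted)
        if tok != expected:
            # 3. First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every draft token matched, so the same target pass yields one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


def generate(target_next, draft_next, prompt, n, k):
    out = list(prompt)
    while len(out) - len(prompt) < n:
        out += speculative_step(target_next, draft_next, out, k)
    return out[:len(prompt) + n]


# Toy target: count upwards modulo 10. Toy draft: same, but wrong after a 3.
def target(ctx):
    return (ctx[-1] + 1) % 10


def draft(ctx):
    return 9 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


first_round = speculative_step(target, draft, [0], k=5)
spec_output = generate(target, draft, [0], n=12, k=5)

plain_output = [0]
for _ in range(12):
    plain_output.append(target(plain_output))

print(first_round, spec_output == plain_output)
```

Under greedy decoding the output is identical to running the target model alone; the saving comes from each round producing several tokens per target pass whenever the draft agrees.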

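# vLLM speculative decoding
# (flag names vary between vLLM versions; newer releases may expect a JSON --speculative-config instead)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --speculative-model meta-llama/Llama-3.2-1B-Instruct \
  --num-speculative-tokens 5 \
  --tensor-parallel-size 4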

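# Hugging Face Transformers (assisted generation)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "meta-llama/Llama-3.1-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(target_id)

# In this basic setup the draft model shares the target model's tokenizer
draft_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
target_model = AutoModelForCausalLM.from_pretrained(
    target_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Example prompt; tokenise it and move it to the target model's device
inputs = tokenizer("Explain speculative decoding briefly.", return_tensors="pt").to(target_model.device)

# Pass the draft model via the assistant_model argument of generate()
outputs = target_model.generate(
    **inputs,
    assistant_model=draft_model,
    do_sample=False,
    max_new_tokens=128,
)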

Speculative decoding is most beneficial at low batch sizes where the GPU is underutilised during decoding. At high batch sizes (many concurrent requests), the GPU is already fully utilised and the parallel verification step provides less benefit. It is primarily a latency optimisation for interactive applications, not a throughput optimisation for batch inference.

KV Cache Optimisation

The KV cache stores the key and value tensors for every token processed so far, prompt and generated tokens alike, so the model does not have to recompute attention inputs for the full history at each step. It is essential for efficient autoregressive generation but can consume enormous amounts of GPU memory for long sequences or large batches, often more than the model weights themselves for long-context workloads.
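To get a feel for the sizes involved, the cache footprint can be estimated from the model shape. The sketch below assumes a grouped-query-attention model shaped like Llama-3.1-70B (80 layers, 8 KV heads, head dimension 128); those figures and the workload are assumptions, so check them against your model's config.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   num_tokens: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache K and V for num_tokens tokens (the 2 covers K and V)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


GIB = 1024 ** 3

# Assumed Llama-3.1-70B-like shape, BF16 cache (2 bytes per value).
per_token = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, num_tokens=1)

# Hypothetical workload: 32 concurrent sequences of 32K tokens each.
tokens_in_flight = 32 * 32 * 1024
bf16_total = kv_cache_bytes(80, 8, 128, tokens_in_flight, bytes_per_value=2)
fp8_total = kv_cache_bytes(80, 8, 128, tokens_in_flight, bytes_per_value=1)

print(f"per token: {per_token / 1024:.0f} KiB")
print(f"BF16 cache: {bf16_total / GIB:.0f} GiB, 8-bit cache: {fp8_total / GIB:.0f} GiB")
```

Under these assumptions the BF16 cache for that workload is larger than the roughly 140 GB of BF16 weights for a 70B-parameter model, which is why cache management matters so much for long-context serving.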

PagedAttention (used by vLLM) manages KV cache memory as fixed-size blocks rather than contiguous allocations, eliminating fragmentation and allowing the GPU memory to be fully utilised across many concurrent sequences. If you are not using vLLM or another PagedAttention-based framework, switching to one is one of the highest-impact changes for production serving efficiency.
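The block-based bookkeeping can be illustrated with a toy allocator. This is a sketch of the idea only, not vLLM's implementation; the class and method names are hypothetical, and no actual tensors are stored.

```python
class PagedKVAllocator:
    """Toy block table: each sequence maps to a list of fixed-size physical blocks."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of cached tokens

    def append_token(self, seq_id: str) -> None:
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:
            # The last block is full (or there is none yet): grab any free block.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; a scheduler would queue or preempt")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool for immediate reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    alloc.append_token("A")  # 20 tokens need 2 blocks
for _ in range(5):
    alloc.append_token("B")  # 5 tokens need 1 block

blocks_used_by_a = len(alloc.block_tables["A"])
free_before = len(alloc.free_blocks)
alloc.free("A")
free_after = len(alloc.free_blocks)
print(blocks_used_by_a, free_before, free_after)
```

Because blocks need not be contiguous, memory is claimed one block at a time as a sequence grows, and at most one partially filled block per sequence is wasted.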

KV cache quantisation stores the cache in INT8 or FP8 rather than BF16, halving the memory required (both formats use one byte per value instead of two) with a small quality impact:

# vLLM KV cache quantisation
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --kv-cache-dtype fp8_e5m2

Prefix caching reuses the KV cache for a shared prompt prefix (system prompt, few-shot examples) across multiple requests, so the model processes the prefix only once rather than on every call. This reduces TTFT for requests with long shared prefixes to near zero for the cached portion:

# vLLM prefix caching
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-prefix-caching

For applications with a large fixed system prompt, prefix caching alone can reduce TTFT by 40–60% on warm requests.
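The reuse logic behind prefix caching can be sketched as a lookup keyed on token prefixes at block granularity. This is a toy model under stated assumptions: `PrefixCache` is hypothetical, a set of keys stands in for the stored KV tensors, and a production system would chain block hashes rather than store whole prefixes as keys.

```python
class PrefixCache:
    """Toy prefix cache that tracks which block-aligned prefixes have been seen."""

    def __init__(self, block_size: int = 4):
        self.block_size = block_size
        self.cached = set()  # stands in for a map from prefix key to KV blocks

    def process(self, tokens):
        """Return (tokens_served_from_cache, tokens_that_need_prefill)."""
        hit_tokens = 0
        still_hitting = True
        full = len(tokens) - len(tokens) % self.block_size
        for end in range(self.block_size, full + 1, self.block_size):
            key = tuple(tokens[:end])
            if still_hitting and key in self.cached:
                hit_tokens = end
            else:
                # After the first miss, later blocks cannot be reused either.
                still_hitting = False
                self.cached.add(key)
        return hit_tokens, len(tokens) - hit_tokens


cache = PrefixCache(block_size=4)
system_prompt = list(range(100, 112))  # 12 made-up token ids shared by all requests

cold = cache.process(system_prompt + [1, 2, 3])
warm = cache.process(system_prompt + [7, 8, 9, 10, 11])
print(cold, warm)
```

The first request pays for the whole prompt; the second only prefills the tokens after the shared prefix, which is where the TTFT saving on warm requests comes from.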

Continuous Batching and Scheduling

Traditional static batching groups requests into fixed batches and processes them together — all sequences start and finish as a unit. This wastes GPU slots: short sequences finish early but wait for long ones before the batch is released. Continuous batching (iteration-level scheduling) replaces completed sequences with new requests at each generation step, keeping the GPU fully utilised regardless of sequence length variation.
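A small simulation makes the difference concrete. The sketch below counts decoding steps for a made-up mix of output lengths with four batch slots; the function names and the workload are assumptions, and it ignores prefill cost and memory limits.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Each fixed batch runs until its longest sequence finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_batching_steps(lengths, slots):
    """Free slots are refilled from the queue at every generation step."""
    queue = deque(lengths)
    active = []  # remaining tokens for each running sequence
    steps = 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1
        # Every active sequence emits one token; drop those that just finished.
        active = [r - 1 for r in active if r > 1]
    return steps


# Hypothetical traffic: mostly short answers with an occasional long one.
output_lengths = [100, 10, 10, 10, 100, 10, 10, 10]

static_steps = static_batching_steps(output_lengths, slots=4)
continuous_steps = continuous_batching_steps(output_lengths, slots=4)
print(static_steps, continuous_steps)  # 200 110
```

With this workload the static scheduler needs 200 steps because each batch waits for its 100-token sequence, while the continuous scheduler finishes in 110 because short requests free their slots as soon as they complete.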

vLLM, TGI, and most modern serving frameworks implement continuous batching by default. If you are running a custom inference loop without it, the throughput difference is significant — often 2–3x better utilisation on variable-length production traffic. The implementation is not trivial to build from scratch, which is the strongest practical reason to use a production-grade framework rather than rolling your own serving layer.

Model Quantisation for Latency

Quantisation reduces model weight precision, decreasing the amount of data the GPU must load per forward pass and directly improving bandwidth-bound inference throughput. AWQ (Activation-aware Weight Quantisation) and GPTQ are the most widely deployed INT4 quantisation methods for GPU inference:

# Load AWQ-quantised model in vLLM
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-2-70B-Chat-AWQ \
  --quantization awq

# Load GPTQ-quantised model
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-2-70B-Chat-GPTQ \
  --quantization gptq

INT4 quantisation typically reduces TPOT by 30–50% compared to BF16 for the same model on the same GPU, by reducing the memory bandwidth required per token. The quality trade-off is small for most production tasks — AWQ specifically minimises quality degradation by accounting for activation magnitudes when choosing quantisation scales.
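The bandwidth argument can be turned into a back-of-the-envelope bound: if every weight must be read once per generated token, per-token time cannot be lower than weight bytes divided by memory bandwidth. The sketch below uses a 70B-parameter model and a hypothetical 2,000 GB/s of aggregate bandwidth; both figures are assumptions.

```python
def weight_gigabytes(num_params: float, bits_per_weight: float) -> float:
    """Size of the weights in GB at a given precision."""
    return num_params * bits_per_weight / 8 / 1e9


def decode_floor_ms(num_params: float, bits_per_weight: float,
                    bandwidth_gb_s: float) -> float:
    """Lower bound on per-token time if all weights are read once per token."""
    return weight_gigabytes(num_params, bits_per_weight) / bandwidth_gb_s * 1e3


PARAMS = 70e9          # assumed model size
BANDWIDTH_GB_S = 2000  # hypothetical aggregate memory bandwidth

bf16_floor = decode_floor_ms(PARAMS, 16, BANDWIDTH_GB_S)
int4_floor = decode_floor_ms(PARAMS, 4, BANDWIDTH_GB_S)
print(f"BF16 floor: {bf16_floor:.1f} ms/token, INT4 floor: {int4_floor:.1f} ms/token")
```

The idealised bound suggests a 4x gap, which is larger than the 30–50% reduction quoted above. The gap is plausible because the bound ignores dequantisation work, KV cache reads, per-group scale storage, and other fixed per-step overheads, so treat it as a ceiling on the benefit rather than a prediction.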

Tensor Parallelism and Pipeline Parallelism

For models that span multiple GPUs, how you parallelise the computation significantly affects latency. Tensor parallelism splits individual weight matrices across GPUs, requiring all-reduce communication in every layer. It reduces both TTFT and TPOT as GPUs are added, though less than proportionally, because the communication overhead limits scaling efficiency beyond 4–8 GPUs on a single node. It is the standard approach for latency-optimised serving of 70B models across 4–8 GPUs.

Pipeline parallelism assigns different layers to different GPUs, sending activations between them sequentially. It has lower communication overhead than tensor parallelism but adds pipeline bubble latency — idle time at the boundaries of pipeline stages. Pipeline parallelism is more efficient for throughput-optimised batch processing but worse for interactive latency than tensor parallelism.

For production interactive serving, tensor parallelism within a single node is the standard. Use the smallest tensor parallel degree that fits your model in GPU memory while meeting your latency SLA — adding more GPUs beyond what is needed for memory only adds communication overhead without proportional latency benefit.
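The memory side of that rule of thumb can be sketched as a simple search over candidate degrees. The helper below is hypothetical, the 10% overhead reserve and the example sizes are assumptions, and it checks memory fit only: whether the chosen degree meets your latency SLA still has to be measured.

```python
def min_tensor_parallel_degree(weights_gb: float, kv_cache_gb: float,
                               gpu_memory_gb: float,
                               candidates=(1, 2, 4, 8),
                               overhead_fraction: float = 0.1):
    """Smallest candidate degree whose combined memory fits weights plus KV cache."""
    usable_per_gpu = gpu_memory_gb * (1 - overhead_fraction)
    required = weights_gb + kv_cache_gb
    for tp in candidates:
        if required <= tp * usable_per_gpu:
            return tp
    return None  # does not fit on a single node with these candidates


# Assumed example: 70B model on 80 GB GPUs with a 40 GB KV cache budget.
bf16_degree = min_tensor_parallel_degree(weights_gb=140, kv_cache_gb=40, gpu_memory_gb=80)
int4_degree = min_tensor_parallel_degree(weights_gb=35, kv_cache_gb=40, gpu_memory_gb=80)
print(bf16_degree, int4_degree)
```

Under these assumptions, quantising the weights lets the same model fit at a lower parallel degree, which also reduces the all-reduce traffic per layer.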

Measuring and Monitoring Latency in Production

Effective latency optimisation requires accurate measurement. Log TTFT and TPOT separately for every request, alongside input token count, output token count, and batch size at the time of the request. Together, these fields let you identify whether latency spikes come from prefill bottlenecks (high input token counts), decoding bottlenecks (high output token counts), or queue saturation (large concurrent batch sizes). vLLM exposes these metrics via its Prometheus endpoint automatically, and tools like Grafana make it straightforward to visualise p50, p95, and p99 latencies in real time. Treating latency as a first-class production metric, with alerting on SLA violations and regular analysis of the tail distribution, is what separates teams that systematically improve their LLM infrastructure from those that react to latency complaints after users notice them.
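If you are aggregating request logs yourself rather than relying on a metrics backend, the tail percentiles are simple to compute. The following is a minimal sketch using the nearest-rank method; the record field names and the synthetic data are assumptions.

```python
import math


def percentile(values, pct: int):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarise(records, field: str) -> dict:
    values = [r[field] for r in records]
    return {f"p{p}": percentile(values, p) for p in (50, 95, 99)}


# Synthetic log: 100 requests with TTFT from 1 ms to 100 ms and a constant TPOT.
records = [
    {"ttft_ms": i, "tpot_ms": 20, "input_tokens": 10 * i, "output_tokens": 50, "batch_size": 8}
    for i in range(1, 101)
]

ttft_summary = summarise(records, "ttft_ms")
tpot_summary = summarise(records, "tpot_ms")
print(ttft_summary, tpot_summary)
```

Running the same summary over slices of the log, for example only requests with long prompts or only those served at large batch sizes, shows which dimension the tail latency tracks.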

The Latency Optimisation Hierarchy

When approaching LLM latency optimisation, address these in order for the best return on effort. First, confirm Flash Attention is enabled: this is free, low-risk, and meaningful. Second, enable prefix caching if your application has a shared system prompt or few-shot examples. Third, evaluate quantisation (AWQ or GPTQ INT4) if you are on a GPU where reading model weights is the bandwidth bottleneck. Fourth, add speculative decoding if your application is interactive and served at low concurrency, where the GPU is underutilised during decoding. Fifth, tune your tensor parallel degree to the minimum that fits your model with adequate KV cache headroom. Sixth, consider KV cache quantisation for long-context workloads where cache memory is the binding constraint.

Each of these steps can be applied independently, and the combined effect of all six can reduce latency by 60–75% compared to a naive out-of-the-box deployment with none of them enabled. The first three — Flash Attention, prefix caching, and quantisation — are almost always worth doing. The latter three are situation-dependent and should be applied based on profiling data showing they address your specific bottleneck.

Hardware Choices and Their Impact on Latency

Optimisation at the software level has limits set by the underlying hardware. For TTFT (prefill), compute throughput is the bottleneck, so faster GPUs with higher TFLOPS directly improve prefill time: the H100, with roughly 3x the BF16 throughput of the A100, cuts TTFT roughly in half for long prompts. For TPOT (decoding), memory bandwidth is the bottleneck, and the H100’s roughly 64% higher bandwidth over the A100 translates directly into faster token generation. The same logic explains why an Apple Silicon Mac Studio with an M3 Ultra, at around 800 GB/s of memory bandwidth, delivers TPOT within reach of an A100 despite much lower compute: decoding is bandwidth-bound. When purchasing or renting hardware for latency-sensitive LLM inference, prioritise memory bandwidth over raw TFLOPS for decoding workloads, and balance both for mixed prefill/decode workloads. The relationship between hardware specs and real-world latency is more direct for LLM inference than for almost any other GPU workload, making the bandwidth number on the spec sheet a reliable proxy for decoding performance.
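The ratios quoted above can be checked against approximate public datasheet figures, and the compute side can be turned into a rough prefill floor. In the sketch below the peak specs (SXM parts, dense BF16), the 8B model size, and the 8K-token prompt are all assumptions, and real TTFT will sit well above a peak-FLOPS floor.

```python
# Approximate peak specs from public datasheets; treat these as assumptions.
GPUS = {
    "A100-80GB": {"bf16_tflops": 312, "bandwidth_gb_s": 2039},
    "H100-SXM": {"bf16_tflops": 989, "bandwidth_gb_s": 3350},
}


def prefill_floor_s(num_params: float, prompt_tokens: int, bf16_tflops: float) -> float:
    """Compute-bound floor, using roughly 2 FLOPs per parameter per prompt token."""
    return 2 * num_params * prompt_tokens / (bf16_tflops * 1e12)


a100, h100 = GPUS["A100-80GB"], GPUS["H100-SXM"]
flops_ratio = h100["bf16_tflops"] / a100["bf16_tflops"]
bandwidth_ratio = h100["bandwidth_gb_s"] / a100["bandwidth_gb_s"]

# Hypothetical workload: 8B-parameter model, 8,192-token prompt.
a100_prefill = prefill_floor_s(8e9, 8192, a100["bf16_tflops"])
h100_prefill = prefill_floor_s(8e9, 8192, h100["bf16_tflops"])

print(f"compute ratio: {flops_ratio:.2f}x, bandwidth ratio: {bandwidth_ratio:.2f}x")
print(f"prefill floor: A100 {a100_prefill:.2f}s, H100 {h100_prefill:.2f}s")
```

The compute ratio is the upper limit on the prefill improvement and the bandwidth ratio the upper limit on the decoding improvement; measured gains are typically smaller because kernels rarely reach peak utilisation.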

Latency SLAs and User Experience

Setting the right latency targets requires understanding what users actually notice. Research on perceived responsiveness consistently shows that responses beginning within 200–300ms feel immediate; responses starting in 500–1,000ms feel fast but slightly delayed; and responses starting after 2 seconds feel slow. For LLM chat interfaces, TTFT is the metric that drives perceived responsiveness — users notice when the first token is slow far more than when generation speed is moderate. Optimise TTFT first for interactive applications, then TPOT, then total response length. A 3-second TTFT followed by 80 tokens per second feels slower to users than a 300ms TTFT followed by 20 tokens per second, even if the full response arrives at the same time — the streaming experience shapes perception as much as the final result.
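The closing comparison can be checked with a little arithmetic: the two configurations finish at the same moment only at one particular response length. The helper below is a hypothetical illustration that solves for it.

```python
def break_even_tokens(ttft_a_s: float, rate_a_tps: float,
                      ttft_b_s: float, rate_b_tps: float) -> float:
    """Output length at which two (TTFT, tokens/sec) profiles finish together.

    Solves ttft_a + n / rate_a == ttft_b + n / rate_b for n.
    """
    return (ttft_a_s - ttft_b_s) / (1 / rate_b_tps - 1 / rate_a_tps)


# Profile A: 3 s TTFT then 80 tokens/s. Profile B: 300 ms TTFT then 20 tokens/s.
n = break_even_tokens(3.0, 80, 0.3, 20)
finish_a = 3.0 + n / 80
finish_b = 0.3 + n / 20
print(f"break-even at about {n:.0f} tokens, both finishing near {finish_a:.1f}s")
```

For responses shorter than the break-even length, the low-TTFT profile is faster in total as well as in feel; beyond it, the high-throughput profile finishes sooner even though it starts later.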
