The deployment of large language models (LLMs) in production environments has become increasingly critical for businesses seeking to leverage AI capabilities. However, one of the most significant challenges organisations face is managing inference speed—the time it takes for a model to generate predictions or responses. Slow inference not only degrades user experience but also increases computational costs and limits scalability. Understanding how to optimise inference speed is essential for anyone working with LLMs in real-world applications.
Inference speed optimisation isn’t just about making models faster; it’s about finding the right balance between performance, quality, and resource utilisation. Whether you’re deploying a chatbot, building a content generation system, or implementing semantic search, the strategies you employ can mean the difference between a responsive application and one that frustrates users with lag times.
Understanding Inference Bottlenecks
Before diving into optimisation techniques, it’s crucial to understand where bottlenecks occur during inference. The inference process involves loading model weights into memory, processing input tokens, generating outputs token-by-token, and managing the key-value cache that stores attention computations. Each of these stages presents opportunities for optimisation.
The most significant bottleneck in LLM inference is typically memory bandwidth. Large models require fetching billions of parameters from memory for each forward pass, and the speed at which this data can be transferred often limits overall performance. This is particularly pronounced in autoregressive generation, where each new token requires another complete pass through the model. Understanding this fundamental constraint helps explain why many optimisation techniques focus on reducing memory access or improving memory efficiency.
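To see how memory bandwidth dominates, a rough back-of-the-envelope calculation helps. The sketch below (plain Python, with illustrative numbers for a 7B-parameter model and an A100/H100-class GPU) estimates the lower bound on per-token latency imposed purely by streaming the weights from memory; the bandwidth figure is an assumption, not a measurement.

```python
# Rough lower bound on per-token decode latency, assuming every weight must be
# read from GPU memory once per generated token. Numbers are illustrative.

params = 7e9                  # 7B-parameter model
bytes_fp16 = 2                # FP16/BF16 weights
bytes_int8 = 1                # 8-bit quantized weights
hbm_bandwidth = 2.0e12        # ~2 TB/s, roughly A100/H100-class HBM

def min_latency_per_token(bytes_per_param: float) -> float:
    """Time just to stream the weights once (ignores compute and KV cache)."""
    return params * bytes_per_param / hbm_bandwidth

print(f"FP16 : {min_latency_per_token(bytes_fp16) * 1e3:.1f} ms/token")  # ~7 ms
print(f"INT8 : {min_latency_per_token(bytes_int8) * 1e3:.1f} ms/token")  # ~3.5 ms
```

Halving the bytes read per parameter roughly halves this floor, which is exactly why quantization (next section) is such a direct lever on decode speed.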
Model Quantization: Reducing Precision for Speed
Quantization stands as one of the most effective techniques for accelerating LLM inference. This approach reduces the precision of model weights and activations from 32-bit or 16-bit floating-point numbers to 8-bit integers or even lower bit-widths. The benefits are substantial: reduced memory footprint, faster memory transfers, and the ability to leverage specialised integer arithmetic hardware.
8-bit quantization has emerged as a sweet spot for many applications, offering 2-4x speedups while maintaining model quality that’s nearly indistinguishable from full-precision versions. Modern quantization techniques like LLM.int8() and GPTQ employ sophisticated algorithms to identify and preserve outlier features that are critical for model performance, while aggressively quantizing the majority of weights.
For even more aggressive optimisation, 4-bit quantization can deliver 4-8x memory reduction and corresponding speedups. While this level of compression introduces more quality degradation, formats such as GGUF (GPT-Generated Unified Format) and techniques like AWQ (Activation-aware Weight Quantization) have made 4-bit models surprisingly capable for many use cases. The key is to evaluate whether the quality-speed tradeoff aligns with your specific application requirements.
Practical implementation considerations:
- Start with 8-bit quantization as a baseline—it offers excellent speedups with minimal quality loss
- Use calibration datasets that represent your actual use cases when applying quantization
- Test quantized models thoroughly against your quality metrics before deployment
- Consider dynamic quantization for inputs and activations alongside static weight quantization
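One common route to weight quantization is the Hugging Face transformers + bitsandbytes integration. The following is a minimal sketch, assuming bitsandbytes is installed and using an illustrative model name; exact keyword arguments vary by library version.

```python
# Minimal sketch: loading a model with 8-bit (or 4-bit) weights via
# transformers + bitsandbytes. The model name is an illustrative placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # hypothetical model choice

# 8-bit baseline: a good speed/quality tradeoff for most applications.
int8_config = BitsAndBytesConfig(load_in_8bit=True)

# More aggressive 4-bit NF4 variant; evaluate quality before adopting.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=int8_config,  # swap in nf4_config to test 4-bit
    device_map="auto",
)

inputs = tokenizer("Inference speed matters because", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```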
⚡ Quantization impact example: typical speedup from quantization on a 7B parameter model.
Batching Strategies for Throughput Optimization
While individual request latency is important, throughput—the number of requests processed per second—is equally critical for production systems. Batching allows you to process multiple requests simultaneously, amortising the overhead of model loading and leveraging parallel processing capabilities of modern GPUs.
Static batching groups requests together until a batch size threshold is met or a timeout occurs. This approach is straightforward to implement and can significantly improve throughput. For example, processing 8 requests in a single batch might only take 2-3x longer than processing a single request, effectively multiplying your throughput by 3-4x.
However, static batching introduces latency for requests that must wait for a batch to fill. This is where continuous batching (also called dynamic batching or iteration-level batching) becomes powerful. Instead of waiting for all sequences in a batch to complete, continuous batching allows finished sequences to be removed and new ones added at each generation step. This dramatically improves GPU utilisation and reduces average latency.
Advanced systems like vLLM and TensorRT-LLM implement PagedAttention, which manages the key-value cache more efficiently by breaking it into blocks that can be dynamically allocated and shared across sequences. This enables much larger batch sizes without running out of memory, as the cache memory isn’t preallocated for maximum sequence lengths.
Key batching considerations:
- Implement continuous batching for production systems handling variable-length sequences
- Monitor your request patterns to determine optimal batch size thresholds
- Use techniques like prefix caching to share computation across requests with common prefixes
- Balance batch size against latency requirements—larger batches improve throughput but increase individual request latency
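Frameworks such as vLLM expose continuous batching and PagedAttention without any manual scheduling code. Below is a minimal sketch of its offline batch API, assuming vLLM is installed, the model fits on the available GPU, and the model name is an illustrative placeholder.

```python
# Minimal sketch of vLLM's offline batch API. The engine handles continuous
# batching and PagedAttention internally; the model name is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # hypothetical model choice
sampling = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Summarise the benefits of continuous batching.",
    "Explain PagedAttention in one paragraph.",
    "List three ways to reduce LLM inference latency.",
]

# Requests of different lengths are scheduled together; finished sequences
# release their cache blocks so new requests can join mid-flight.
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text.strip()[:80], "...")
```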
KV Cache Optimization
The key-value (KV) cache stores computed attention keys and values from previous tokens, allowing the model to avoid recomputing them during autoregressive generation. However, this cache can consume enormous amounts of memory—sometimes exceeding the model weights themselves for long sequences. Optimising KV cache management is crucial for both speed and memory efficiency.
Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) reduce the size of the KV cache by sharing key and value projections across multiple attention heads. Instead of each attention head maintaining its own K and V tensors, MQA uses a single shared pair, while GQA groups heads to share K and V within groups. This can reduce KV cache size by 4-8x with minimal impact on model quality, directly translating to faster inference and support for longer contexts.
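The savings are easy to estimate from the cache's shape. The sketch below uses assumed Llama-2-7B-like dimensions (32 layers, 32 query heads, head dimension 128) and compares per-sequence cache size under standard multi-head attention, GQA with 8 KV heads, and MQA.

```python
# KV cache size per sequence: 2 tensors (K and V) per layer,
# each of shape [num_kv_heads, seq_len, head_dim], in FP16.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed Llama-2-7B-like shapes: 32 layers, 32 query heads, head_dim 128.
layers, head_dim, seq_len = 32, 128, 4096

for name, kv_heads in [("MHA (32 KV heads)", 32), ("GQA (8 KV heads)", 8), ("MQA (1 KV head)", 1)]:
    gib = kv_cache_bytes(layers, kv_heads, head_dim, seq_len) / 2**30
    print(f"{name:18s}: {gib:.2f} GiB for a 4096-token sequence")
# MHA ≈ 2.0 GiB, GQA ≈ 0.5 GiB, MQA ≈ 0.06 GiB per sequence.
```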
KV cache quantization applies similar principles as weight quantization but to the cached attention tensors. Since the cache is generated dynamically during inference, this requires careful management to avoid accumulating errors. Recent research has shown that 8-bit or even 4-bit quantization of the KV cache can be highly effective, especially when combined with per-channel or per-token scaling factors.
For applications with predictable request patterns, prefix caching can eliminate redundant computation. When multiple requests share common prefixes (like system prompts or document contexts), the KV cache for these prefixes can be computed once and reused. This is particularly powerful for chatbots with lengthy system instructions or retrieval-augmented generation systems that prepend the same context documents to multiple queries.
Speculative Decoding and Parallel Processing
Traditional autoregressive generation is inherently sequential—each token must be generated before the next can begin. Speculative decoding challenges this limitation by using a smaller, faster “draft” model to propose multiple tokens, which are then verified in parallel by the target model. When the draft model’s predictions are accurate, you effectively generate multiple tokens in a single step of the larger model.
This technique works remarkably well because smaller models often predict simple tokens correctly (common words, punctuation, obvious continuations), and the parallel verification is memory-bandwidth efficient. In practice, speculative decoding can achieve 2-3x speedups on certain tasks with no quality degradation, since the final output is always verified by the target model.
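One way to try this without writing the verification loop yourself is Hugging Face transformers' assisted generation, which accepts a smaller draft model. The sketch below is illustrative: the model pairing is an assumption, and the draft must use the same tokenizer and vocabulary as the target.

```python
# Rough sketch of speculative (assisted) decoding with transformers.
# Model choices are illustrative; draft and target must share a tokenizer.
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "meta-llama/Llama-2-7b-hf"            # hypothetical target model
draft_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"   # hypothetical smaller draft

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, device_map="auto")

inputs = tokenizer("The main bottleneck in LLM inference is", return_tensors="pt").to(target.device)

# The draft proposes several tokens per step; the target verifies them in one
# parallel forward pass and keeps the longest accepted prefix.
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```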
Parallel sampling techniques like Medusa add multiple prediction heads to the model, allowing it to predict several future tokens simultaneously. While this requires model architecture modifications, it can deliver substantial speedups without requiring a separate draft model.
Flash Attention and Kernel Optimization
At a lower level, optimising the attention mechanism itself yields significant performance gains. Flash Attention reorganises the attention computation to be more memory-efficient by fusing operations and reducing the number of memory reads and writes. This algorithm-hardware co-design approach delivers 2-4x speedups for attention operations while using less memory.
The key insight of Flash Attention is that the standard attention implementation is memory-bound, not compute-bound. By breaking the attention computation into blocks and carefully orchestrating data movement between GPU memory hierarchies (HBM, SRAM), Flash Attention maximises the use of fast on-chip memory and minimises slow HBM accesses. The latest version, Flash Attention-3, further optimises these patterns for modern GPU architectures.
Beyond attention, custom CUDA kernels and framework-level optimisations can accelerate other bottlenecks. Libraries like FasterTransformer, TensorRT-LLM, and DeepSpeed provide hand-tuned kernels for common operations. Using these optimised implementations rather than generic framework operations can yield 20-50% additional speedups.
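In many stacks, picking up these kernels is a configuration change rather than custom CUDA work. The sketch below shows two common routes; whether they are available depends on your transformers/PyTorch versions, the flash-attn package, and your GPU, so treat it as an assumption-laden example rather than a guaranteed recipe.

```python
# Two common ways to enable fused attention kernels; availability depends on
# library versions and hardware, so treat this as a sketch.
import torch
from transformers import AutoModelForCausalLM

# 1) Ask transformers for FlashAttention-2 (errors out if the flash-attn
#    package or hardware support is missing).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",          # illustrative model choice
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

# 2) Or rely on PyTorch's fused scaled_dot_product_attention directly,
#    which dispatches to an efficient kernel when one is available.
q = torch.randn(1, 32, 128, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, 32, 128, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, 32, 128, 64, device="cuda", dtype=torch.bfloat16)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
```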
🎯 Optimization Priority Matrix
- Quick wins (high impact, low effort): 8-bit quantization, Flash Attention libraries, continuous batching
- Next steps (moderate effort): KV cache optimization, PagedAttention, speculative decoding
- Advanced (highest effort): custom kernels, model architecture modifications, 4-bit quantization
Hardware Selection and Infrastructure Considerations
The choice of hardware fundamentally impacts inference speed. Modern GPUs like NVIDIA’s H100 or A100 offer massive memory bandwidth and specialised Tensor Cores for accelerated matrix operations. However, the optimal hardware depends on your specific workload, budget, and deployment scale.
For smaller models or budget-conscious deployments, consumer GPUs like the RTX 4090 can provide excellent performance-per-dollar ratios. The key is matching your model’s memory requirements and computational characteristics to the hardware’s capabilities. A quantized 7B model might run efficiently on a 24GB GPU, while a 70B model might require multiple A100s or aggressive quantization to fit on consumer hardware.
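A quick way to sanity-check whether a model fits a given GPU is to estimate weight memory from parameter count and precision, then leave headroom for the KV cache, activations, and framework overhead. A rough sketch with illustrative numbers:

```python
# Rough weight-memory estimate for hardware sizing. Real deployments also
# need headroom for the KV cache, activations, and framework overhead.
def weight_gib(params_billions: float, bits_per_param: int) -> float:
    return params_billions * 1e9 * bits_per_param / 8 / 2**30

for params in (7, 13, 70):
    for bits in (16, 8, 4):
        print(f"{params:>3}B @ {bits:>2}-bit ≈ {weight_gib(params, bits):6.1f} GiB")
# 7B @ 4-bit ≈ 3.3 GiB fits a 24 GB card with room to spare; 70B @ 16-bit
# ≈ 130 GiB needs multiple 80 GB GPUs, while 70B @ 4-bit ≈ 33 GiB fits one.
```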
CPU inference has improved dramatically with optimised libraries like llama.cpp and GGML. While generally slower than GPU inference, CPUs offer advantages for certain scenarios: no GPU availability requirements, ability to serve multiple small concurrent requests, and excellent support for quantized models. For applications where sub-second latency isn’t critical, CPU inference can be cost-effective.
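For CPU serving, llama.cpp's Python bindings (llama-cpp-python) are a common entry point. A minimal sketch follows; the GGUF file path is a placeholder and the thread/context settings are assumptions to tune for your machine.

```python
# Minimal CPU inference sketch with llama-cpp-python and a quantized GGUF file.
# The model path is a placeholder; n_ctx and n_threads depend on your machine.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",  # placeholder path
    n_ctx=2048,     # context window
    n_threads=8,    # match your physical core count
)

result = llm("Explain why CPU inference can be cost-effective:", max_tokens=64)
print(result["choices"][0]["text"])
```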
Don’t overlook the importance of model serving frameworks like vLLM, TensorRT-LLM, or Text Generation Inference (TGI). These frameworks implement many optimisations out-of-the-box, including continuous batching, PagedAttention, and optimised kernels. Simply switching from a basic serving setup to one of these specialised frameworks can deliver 2-5x throughput improvements with minimal code changes.
Measuring and Monitoring Performance
Effective optimisation requires rigorous measurement. Track multiple metrics to get a complete picture: first-token latency (time to generate the first output token), inter-token latency (time between subsequent tokens), throughput (requests per second), and memory utilisation. Different optimisations affect these metrics differently—quantization primarily improves throughput and memory, while speculative decoding primarily reduces inter-token latency.
Create realistic benchmarks that reflect your actual use cases. Synthetic benchmarks with uniform sequence lengths and simple prompts may not capture the performance characteristics of real user requests. Include a variety of prompt lengths, output lengths, and complexity levels in your test suite.
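A simple way to capture first-token and inter-token latency is to time a streaming generation loop. The sketch below is framework-agnostic: `stream_tokens` is a hypothetical stand-in for whatever streaming call your serving stack provides.

```python
# Generic timing harness for first-token and inter-token latency.
# `stream_tokens` is a stand-in for your actual streaming API.
import time
from typing import Callable, Iterable

def measure_latency(stream_tokens: Callable[[str], Iterable[str]], prompt: str) -> dict:
    start = time.perf_counter()
    first_token_latency = None
    gaps = []
    last = start
    for _ in stream_tokens(prompt):
        now = time.perf_counter()
        if first_token_latency is None:
            first_token_latency = now - start   # time to first token
        else:
            gaps.append(now - last)             # inter-token gap
        last = now
    return {
        "first_token_s": first_token_latency,
        "mean_inter_token_s": sum(gaps) / len(gaps) if gaps else None,
        "total_s": last - start,
    }
```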
Conclusion
Optimising LLM inference speed requires a multi-faceted approach that combines model-level techniques like quantization, algorithmic improvements like Flash Attention, and infrastructure choices around batching and hardware selection. The strategies you prioritise should align with your specific constraints—whether that’s minimising latency for interactive applications, maximising throughput for batch processing, or reducing costs for large-scale deployments.
Start with high-impact, low-complexity optimisations like 8-bit quantization and optimised serving frameworks, then progressively implement more advanced techniques as needed. Measure rigorously, test thoroughly, and remember that the goal isn’t just raw speed but achieving the right balance between performance, quality, and resource efficiency for your particular application. With the right combination of techniques, you can often achieve 5-10x or greater improvements in inference speed while maintaining model quality.