Deploying large language models at scale presents a fundamental challenge: how do you serve thousands or millions of requests efficiently without requiring a data center full of expensive GPUs? Raw LLM inference is computationally intensive: a single forward pass through a model like GPT-3 or Llama-70B involves hundreds of billions of floating-point operations per token. Naive approaches that process requests individually waste computational resources and deliver poor throughput. The key to economically viable LLM deployments lies in two critical optimization strategies: batching requests to maximize hardware utilization, and caching results to avoid redundant computation.
These optimizations aren’t just nice-to-haves—they’re essential for production systems. Without effective batching, your GPUs sit idle most of the time despite burning through your cloud budget. Without intelligent caching, you recompute identical or similar responses repeatedly. Together, these strategies can improve throughput by 10-100x while dramatically reducing costs. Let’s explore how to implement batching and caching effectively for high-throughput LLM inference.
Understanding LLM Inference Characteristics
Before diving into optimizations, understanding how LLM inference works reveals why batching and caching matter so much.
The two-phase inference process:
LLM inference happens in two distinct phases with different computational characteristics. The prefill phase processes the input prompt, computing key-value (KV) cache entries for each token. This phase is compute-bound: matrix multiplications dominate, and GPUs can achieve high utilization. The cost of the projection and feed-forward layers grows linearly with prompt length, while the attention computation grows quadratically, so very long prompts are disproportionately expensive.
The decode phase generates output tokens one at a time, using the cached keys and values from previous tokens. Each new token requires a full forward pass through the model, but only for a single token. This phase is memory-bound—you’re constantly reading model weights and KV cache from memory, but doing relatively little computation per read. GPU utilization drops significantly during decode.
This memory-bound nature of the decode phase is crucial. If you’re processing one request at a time, your expensive GPU is mostly waiting on memory accesses, wasting computational capacity. Batching multiple requests together puts this idle capacity to work: the model weights are read from memory once per decode step and reused for every request in the batch, so the cost of each read is amortized across many tokens.
Memory vs. compute trade-offs:
LLMs require enormous memory to store model weights: a 70B-parameter model needs about 140GB in FP16 for the weights alone. The KV cache for attention mechanisms adds memory requirements that scale with sequence length and batch size. A single request with a 2048-token context can need anywhere from several hundred MB to several GB of KV cache, depending on the model's layer count, number of key-value heads, and numeric precision.
These memory requirements create a constraint on batching—you can’t batch unlimited requests because you’ll run out of GPU memory. Finding the optimal batch size requires balancing memory capacity against computational efficiency. Too small, and you waste compute. Too large, and you run out of memory or increase latency unacceptably.
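The KV cache arithmetic behind this constraint can be sketched in a few lines. The layer count, head counts, and head dimension below are assumed values for a 70B-class model, chosen for illustration; the function names are hypothetical, and the 20 GiB of free memory is an arbitrary example figure.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to hold keys and values for every layer and token."""
    # The leading 2 accounts for storing both a key and a value vector.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


def max_batch_size(free_bytes, per_request_bytes):
    """Largest number of requests whose KV cache fits in the free memory."""
    return free_bytes // per_request_bytes


GIB = 1024 ** 3

# Assumed shapes: 80 layers, head_dim 128, FP16 values, 2048-token context.
full_heads = kv_cache_bytes(80, 64, 128, 2048)  # 64 KV heads, no sharing
grouped = kv_cache_bytes(80, 8, 128, 2048)      # 8 KV heads (grouped-query attention)

print(full_heads / GIB)                   # 5.0
print(grouped / GIB)                      # 0.625
print(max_batch_size(20 * GIB, grouped))  # 32
```

Under these assumptions, sharing key-value heads cuts the per-request cache by 8x, which translates directly into a larger feasible batch.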
Latency vs. throughput trade-offs:
Single-request latency and system throughput often conflict. Processing requests individually minimizes latency—each request starts immediately and completes as fast as possible. But this approach delivers poor throughput because GPU utilization is low.
Batching improves throughput by processing multiple requests together, but increases individual request latency. Requests must wait for others in the batch to complete, and batch processing takes longer than single-request processing. Effective batching strategies navigate this trade-off, maximizing throughput while keeping latency acceptable for your use case.
Static Batching: The Foundation
Static batching is the simplest form of batching—accumulate requests until you reach a target batch size, then process the entire batch together.
How static batching works:
Incoming requests enter a queue. When the queue reaches a configured batch size (say, 32 requests) or a timeout expires (perhaps 50ms), you process the batch. All requests in the batch run through prefill together, then all decode together, generating tokens in parallel until all requests reach their stopping criteria or maximum length.
This parallelism comes from the fact that transformer architectures can efficiently process multiple sequences simultaneously. The matrix multiplications that dominate computation naturally batch: during memory-bound decode, multiplying a (batch_size × seq_len × hidden_dim) tensor with weight matrices takes nearly as long as processing a single sequence, until the GPU's compute or memory limits are reached.
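The size-or-timeout policy described above might look like the following minimal sketch. It uses only the standard library; `collect_batch` and its parameters are hypothetical names, and the timer here starts when collection begins rather than when the first request arrives, which is one of several reasonable choices.

```python
import queue
import time


def collect_batch(pending, max_batch_size=32, timeout_s=0.05):
    """Gather requests until the batch is full or the timeout expires."""
    batch = []
    deadline = time.monotonic() + timeout_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # timeout expired: run with whatever we have
        try:
            batch.append(pending.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived before the deadline
    return batch


pending = queue.Queue()
for request_id in range(5):
    pending.put(request_id)

print(collect_batch(pending, max_batch_size=3, timeout_s=0.5))   # [0, 1, 2]
print(collect_batch(pending, max_batch_size=3, timeout_s=0.05))  # [3, 4]
```

The second call returns a partial batch after the timeout, which is the behavior that bounds queueing latency when traffic is light.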
Implementation considerations:
Static batching requires padding all sequences in a batch to the same length, typically the longest sequence in the batch. This padding adds computational waste. If your batch contains one request with 1000 tokens and 31 requests with 50 tokens, you’re padding 31 sequences to 1000 tokens: each of those sequences is 95% padding, and about 92% of the batch’s token slots do no useful work.
Modern frameworks implement optimizations to reduce padding overhead. Attention masking ensures padded positions don’t contribute to attention computations. Some implementations use packed sequences, concatenating all sequences and using offset arrays to indicate boundaries, eliminating padding entirely.
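To make the padding cost concrete, the sketch below computes the wasted fraction for the example batch (each short sequence is 95% padding, and the batch as a whole about 92%) and shows the packed layout with an offset array. The function names are illustrative, not taken from any framework.

```python
def padding_waste(lengths):
    """Fraction of token slots that are padding when padding to the longest."""
    padded_slots = max(lengths) * len(lengths)
    return 1 - sum(lengths) / padded_slots


def pack(sequences):
    """Concatenate sequences and record where each one starts and ends."""
    flat, offsets = [], [0]
    for seq in sequences:
        flat.extend(seq)
        offsets.append(len(flat))
    return flat, offsets


lengths = [1000] + [50] * 31
print(round(padding_waste(lengths), 3))  # 0.92

flat, offsets = pack([[1, 2, 3], [4], [5, 6]])
print(flat)     # [1, 2, 3, 4, 5, 6]
print(offsets)  # [0, 3, 4, 6]
```

Sequence `i` occupies `flat[offsets[i]:offsets[i + 1]]`, so no slot in the packed buffer is wasted.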
Batch size optimization:
Finding the optimal static batch size requires experimentation with your specific workload, hardware, and model. Start with small batches (4-8) and gradually increase while monitoring GPU memory usage and throughput. You’ll find a sweet spot where throughput plateaus: beyond that batch size, the GPU's compute units or the bandwidth needed to read the growing KV cache become the bottleneck, and further increases provide diminishing returns.
For most models on modern GPUs, optimal batch sizes for inference fall between 16 and 64. Larger models with more parameters can often benefit from larger batches because the increased compute per token helps amortize memory access costs, provided enough memory remains for the KV cache after the weights are loaded.
⚡ Batching Impact on Throughput
Batch Size 8: 30-50 tokens/second (3-5x improvement)
Batch Size 32: 80-120 tokens/second (8-12x improvement)
Batch Size 64: 100-150 tokens/second (diminishing returns, approaching memory bandwidth limit)
Note: Actual numbers vary significantly by model size, hardware, and sequence length
Dynamic Batching: Continuous Processing
Static batching has a major limitation—requests that finish early must wait for the entire batch to complete. For generative tasks where output lengths vary significantly, this wastes resources and increases latency. Dynamic batching addresses this by allowing requests to enter and exit the batch continuously.
Continuous batching fundamentals:
Also called “iteration-level batching,” continuous batching processes requests at the granularity of individual decode steps rather than entire sequences. After generating each token, the system checks if any requests have finished. Completed requests exit the batch, and new requests from the queue enter to fill empty slots.
This approach dramatically improves resource utilization. A batch of 32 requests might complete at very different rates—some generate 10 tokens, others 500 tokens. Continuous batching keeps the batch size stable by constantly adding new work as old requests complete, maintaining high GPU utilization throughout.
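A toy simulation shows the effect. It counts decode steps only, assumes every step costs the same regardless of how many slots are occupied, and assumes each request generates at least one token; the function names are hypothetical.

```python
from collections import deque


def static_steps(output_lens, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(output_lens), batch_size):
        total += max(output_lens[i:i + batch_size])
    return total


def continuous_steps(output_lens, batch_size):
    """Decode steps when finished requests are replaced after every step."""
    waiting = deque(output_lens)
    active = []  # tokens still to generate for each running request
    steps = 0
    while waiting or active:
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())  # fill empty slots
        steps += 1
        active = [left - 1 for left in active if left > 1]  # drop finished
    return steps


lens = [10, 500, 10, 500]
print(static_steps(lens, batch_size=2))      # 1000
print(continuous_steps(lens, batch_size=2))  # 520
```

With static batching the two short requests each hold a slot idle for 490 steps; with continuous batching those slots are refilled as soon as they free up.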
Implementation with paged attention:
Continuous batching interacts powerfully with paged attention—a technique that manages KV cache memory more efficiently. Traditional attention stores KV cache for each request in contiguous memory, requiring large pre-allocated buffers. This creates memory fragmentation and limits flexibility.
Paged attention, pioneered by systems like vLLM, breaks KV cache into fixed-size blocks (typically 16-64 tokens per block). Blocks can be stored anywhere in memory and referenced through a page table, similar to virtual memory in operating systems. This enables several optimizations:
- Efficient memory sharing: Different requests can share identical prefixes (like system prompts) by pointing to the same cached blocks
- Reduced fragmentation: Small memory blocks are easier to allocate without waste
- Dynamic growth: KV cache grows incrementally as sequences generate, rather than pre-allocating maximum length
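The block-table idea can be illustrated with a toy allocator. This is a sketch of the bookkeeping only (no tensors, no prefix sharing), and the class and method names are hypothetical rather than vLLM's actual API.

```python
class PagedKVCache:
    """Toy block allocator: maps each sequence to a list of physical blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token; returns (block id, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:
            # Last block is full (or no block yet): grow by one block.
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.block_size

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=4)
for _ in range(5):
    cache.append_token("request-a")
print(len(cache.block_tables["request-a"]))  # 2
print(len(cache.free_blocks))                # 2
cache.release("request-a")
print(len(cache.free_blocks))                # 4
```

Because blocks are allocated one at a time as tokens arrive, a sequence never reserves more than one partially filled block.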
Managing variable-length sequences:
Dynamic batching naturally handles variable sequence lengths without padding waste. Each position in the batch can have different current lengths. The attention mechanism processes only valid tokens, skipping past-the-end positions automatically.
This flexibility comes at a cost—implementation complexity increases significantly. You need careful bookkeeping of which sequences are active, where their KV cache blocks reside, and when to add/remove sequences from the batch. Frameworks like vLLM and TensorRT-LLM handle this complexity, providing dynamic batching with straightforward APIs.
Scheduling and prioritization:
With continuous batching, deciding which requests to add to the batch becomes a scheduling problem. Simple first-come-first-served (FCFS) scheduling works but isn’t optimal. More sophisticated approaches consider:
- Request priority: Premium users get faster service
- Estimated completion time: Prefer adding short requests to maintain batch fluidity
- Cache locality: Prioritize requests that share prompt prefixes with current batch members
- Fairness: Prevent starvation of low-priority or long requests
Good schedulers balance these factors, maximizing throughput while meeting service level objectives for different request classes.
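One way to combine priority with fairness is to let waiting time gradually improve a request's score. The sketch below is an assumption-laden illustration: the `Pending` record, the `aging_per_s` rate, and the linear aging rule are all hypothetical choices, not a description of any particular scheduler.

```python
from dataclasses import dataclass


@dataclass
class Pending:
    request_id: str
    priority: int     # lower value = more urgent
    arrival_s: float  # arrival timestamp in seconds


def pick_next(pending, free_slots, now_s, aging_per_s=0.5):
    """Choose which waiting requests fill the free batch slots."""

    def effective_priority(req):
        # Waiting lowers the score, so old low-priority work is not starved.
        return req.priority - aging_per_s * (now_s - req.arrival_s)

    chosen = sorted(pending, key=effective_priority)[:free_slots]
    return [req.request_id for req in chosen]


waiting = [
    Pending("old-low", priority=5, arrival_s=0.0),
    Pending("new-high", priority=1, arrival_s=9.0),
    Pending("new-low", priority=5, arrival_s=9.0),
]
print(pick_next(waiting, free_slots=2, now_s=10.0))  # ['old-low', 'new-high']
```

The long-waiting low-priority request is admitted ahead of the fresh high-priority one; lowering `aging_per_s` shifts the balance back toward strict priority.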
Semantic Caching: Avoiding Redundant Computation
Batching maximizes throughput for unique requests, but many real-world workloads contain repetition. Caching responses to identical or similar prompts eliminates redundant computation entirely, providing orders of magnitude speedup.
Exact match caching:
The simplest caching strategy stores prompt-response pairs. When a request arrives, hash the prompt and check the cache. If found, return the cached response instantly without inference. This works well for truly identical prompts—common in applications with templated queries or finite question sets.
Implementation is straightforward using distributed caches like Redis or Memcached. The challenge is cache invalidation—responses become stale if the model updates or if responses should vary (like creative generation tasks with temperature > 0). Time-based expiration or version tagging handles staleness.
Exact caching can capture a surprisingly large fraction of real-world queries. By some estimates, 20-40% of chatbot queries are duplicates or near-duplicates, although only the true duplicates hit an exact cache. For workloads with this kind of repetition, exact caching provides massive cost savings with minimal implementation effort.
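A minimal in-process version of this cache is sketched below, with a dictionary standing in for Redis or Memcached. The key layout (prompt plus model version plus sampling parameters) and the `ExactCache` class are assumptions made for illustration; including the model version in the key is one way to implement version tagging.

```python
import hashlib
import json
import time


class ExactCache:
    """Prompt-response cache with time-based expiration."""

    def __init__(self, ttl_s=3600.0, clock=time.monotonic):
        self._store = {}  # key -> (response, expiry time)
        self._ttl_s = ttl_s
        self._clock = clock

    @staticmethod
    def key(prompt, model_version, params):
        # Model version and sampling parameters are part of the key, so a
        # model update or a different temperature never reuses old entries.
        payload = json.dumps(
            {"prompt": prompt, "model": model_version, "params": params},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        response, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # stale: drop and report a miss
            return None
        return response

    def put(self, key, response):
        self._store[key] = (response, self._clock() + self._ttl_s)


cache = ExactCache(ttl_s=60.0)
k = ExactCache.key("What are your hours?", "model-v1", {"temperature": 0})
if cache.get(k) is None:
    cache.put(k, "We are open 9-5.")  # would come from inference
print(cache.get(k))  # We are open 9-5.
```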
Approximate semantic caching:
Many prompts differ textually but are semantically equivalent or highly similar. “What’s the weather?” and “What is the current weather?” should produce similar responses. Approximate caching uses embeddings to identify semantically similar prompts and returns cached responses for sufficiently similar queries.
The process involves:
- Embed incoming prompts using a fast embedding model (e.g., sentence transformers)
- Search for similar embeddings in a vector index or database (e.g., Faiss, Pinecone)
- If similarity exceeds threshold, return cached response
- Otherwise, compute response and cache it with its embedding
This approach requires calibrating the similarity threshold. Too strict, and you miss caching opportunities. Too loose, and you return inappropriate responses to genuinely different queries. The optimal threshold depends on your domain and tolerance for imperfect matches.
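The lookup flow above can be sketched with a linear scan over stored embeddings. The `embed` callable is a placeholder for a real embedding model, the toy two-dimensional vectors exist only to make the example runnable, and the 0.9 threshold is an arbitrary starting point; a production system would use an approximate nearest-neighbor index instead of scanning.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed, threshold=0.9):
        self.embed = embed          # callable: prompt -> list of floats
        self.threshold = threshold
        self.entries = []           # (embedding, response) pairs

    def lookup(self, prompt):
        """Return the best cached response above the threshold, else None."""
        query = self.embed(prompt)
        best_score, best_response = -1.0, None
        for vector, response in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def store(self, prompt, response):
        self.entries.append((self.embed(prompt), response))


# Toy embeddings standing in for a real model's output.
toy_vectors = {
    "What's the weather?": [1.0, 0.0],
    "What is the current weather?": [0.99, 0.1],
    "Reset my password": [0.0, 1.0],
}
cache = SemanticCache(embed=toy_vectors.__getitem__, threshold=0.9)
cache.store("What's the weather?", "Sunny, 22 degrees.")
print(cache.lookup("What is the current weather?"))  # Sunny, 22 degrees.
print(cache.lookup("Reset my password"))             # None
```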
Prefix caching for shared contexts:
Many requests share common prefixes—system prompts, conversation history, or document contexts. Computing attention over these shared prefixes repeatedly wastes resources. Prefix caching stores KV cache for common prefixes, allowing new requests to start from the cached state rather than recomputing from scratch.
For example, if your chatbot uses a 500-token system prompt, every request processes those 500 tokens. With prefix caching, compute the KV cache for the system prompt once, then reuse it for all subsequent requests. This eliminates hundreds of tokens’ worth of prefill computation per request.
Implementation requires careful cache management. Prefixes are shared across requests, so you can’t modify their cached state. When multiple requests need the same prefix, create new KV cache for the unique portions that follow. Paged attention systems handle this naturally by sharing cached blocks across requests.
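One common way to find reusable prefixes is to hash the prompt block by block, chaining each hash to the previous one so that a block only matches when everything before it matches too. The sketch below tracks hashes only, not actual KV tensors; the function names and the block size of 4 are illustrative assumptions.

```python
import hashlib


def block_hashes(token_ids, block_size=16):
    """Chained hashes of each full block; a hash identifies the whole prefix."""
    hashes = []
    previous = b""
    full_len = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full_len, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(previous + repr(block).encode("utf-8")).digest()
        hashes.append(digest)
        previous = digest
    return hashes


def cached_prefix_len(token_ids, cached_hashes, block_size=16):
    """Number of leading tokens whose KV blocks are already cached."""
    matched = 0
    for digest in block_hashes(token_ids, block_size):
        if digest not in cached_hashes:
            break
        matched += 1
    return matched * block_size


system_prompt = list(range(8))  # stand-in token ids
first = system_prompt + [100, 101, 102, 103]
cached = set(block_hashes(first, block_size=4))

second = system_prompt + [200, 201, 202, 203, 204]
print(cached_prefix_len(second, cached, block_size=4))  # 8
```

The second request can skip prefill for its first 8 tokens and only compute KV entries for the part that differs.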
Multi-level caching strategies:
Production systems often combine multiple caching layers for maximum efficiency:
- L1: Exact match cache: Check for identical prompts (Redis/Memcached)
- L2: Semantic cache: Check for similar prompts (vector database + embeddings)
- L3: Prefix cache: Reuse KV cache for shared prompt prefixes
- L4: Partial generation cache: Cache intermediate token sequences for common patterns
Each layer catches different types of redundancy. Exact cache hits return in milliseconds. Semantic cache hits return in tens of milliseconds. Prefix cache hits reduce inference latency by 30-80% depending on prefix length. Combining these layers maximizes cost savings across diverse workload patterns.
💾 Caching Impact on Cost and Latency
Exact Match Cache:
• Latency: 1-5ms (vs 100-500ms inference)
• Cost: ~99.5% reduction per hit
• Hit Rate: 20-40% for typical chatbot workloads
Semantic Cache:
• Latency: 10-50ms (vs 100-500ms inference)
• Cost: ~95% reduction per hit
• Hit Rate: Additional 10-20% beyond exact matches
Prefix Cache:
• Latency: 30-80% reduction in inference time
• Cost: Proportional to prefix length / total length
• Hit Rate: Depends on prompt structure commonality
Advanced Batching Techniques
Beyond basic dynamic batching, several advanced techniques push throughput even higher for specific scenarios.
Speculative decoding:
Speculative decoding uses a smaller, faster “draft” model to predict multiple tokens ahead, then verifies these predictions with the target model in parallel. When predictions are correct, you effectively generate multiple tokens per forward pass, significantly increasing throughput.
The draft model generates k candidate tokens (typically 4-8). The target model scores all candidates in a single forward pass, accepts the longest run of drafted tokens that agrees with its own predictions, and discards the rest. With a correctly implemented acceptance rule, the output matches what the target model would have produced on its own, so quality is preserved; for highly predictable text, speculation can speed up generation 2-3x.
Speculation can run for multiple requests simultaneously, since parallel verification fits the batch processing paradigm. The gains are largest at small to moderate batch sizes, where decode leaves compute idle; as batches grow and the GPU becomes compute-bound, verifying tokens that end up rejected competes with useful work and the benefit shrinks.
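The accept-and-reject step can be sketched for the greedy case, where a drafted token is accepted only if it equals the target model's top choice at that position. This is a simplified illustration with hypothetical names; sampling-based variants use a probabilistic acceptance rule instead of exact comparison.

```python
def accept_drafted(draft, target_preds):
    """Greedy verification of drafted tokens.

    draft: k tokens proposed by the draft model.
    target_preds: k + 1 tokens from one target forward pass, where
        target_preds[i] is the target's choice after the prefix plus draft[:i].
    Returns the tokens to append to the sequence this step.
    """
    accepted = []
    for proposed, expected in zip(draft, target_preds):
        if proposed != expected:
            break  # first disagreement: discard this and later drafts
        accepted.append(proposed)
    # The target's own token at the stopping position is always valid, so
    # every step yields at least one token and at most k + 1.
    accepted.append(target_preds[len(accepted)])
    return accepted


print(accept_drafted([5, 6, 7, 8], [5, 6, 9, 1, 2]))  # [5, 6, 9]
print(accept_drafted([5, 6], [5, 6, 7]))              # [5, 6, 7]
print(accept_drafted([1], [2, 3]))                    # [2]
```

In the first call two drafted tokens are accepted and the target's correction is added, so one target forward pass produced three tokens.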
Chunked prefill:
Very long prompts (thousands of tokens) create problems for batching—their prefill phase takes so long that other requests experience high latency waiting. Chunked prefill breaks long prompts into smaller chunks, interleaving their processing with decode steps from other requests.
For example, split a 4096-token prompt into eight 512-token chunks. Process one chunk, then resume decoding for other requests in the batch, then process another chunk, and so on. This prevents any single request from monopolizing the GPU and keeps latency more consistent across requests.
The trade-off is slightly increased total processing time for the long prompt due to less efficient attention computation across chunks. But for workloads mixing short and long prompts, chunked prefill significantly improves fairness and average latency.
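The interleaving described above can be written as a simple schedule generator. This sketch only produces the order of work items; the function name and the `decode_steps_between` parameter are hypothetical, and real engines often mix a prefill chunk and decode tokens within the same step under a token budget instead.

```python
def chunked_prefill_schedule(prompt_len, chunk_size, decode_steps_between=1):
    """Alternate prefill chunks of one long prompt with decode steps for others."""
    schedule = []
    for start in range(0, prompt_len, chunk_size):
        end = min(start + chunk_size, prompt_len)
        schedule.append(("prefill", (start, end)))
        if end < prompt_len:
            # Let the rest of the batch make progress before the next chunk.
            schedule.extend([("decode", None)] * decode_steps_between)
    return schedule


plan = chunked_prefill_schedule(4096, 512)
print(sum(1 for kind, _ in plan if kind == "prefill"))  # 8
print(sum(1 for kind, _ in plan if kind == "decode"))   # 7
print(plan[:3])  # [('prefill', (0, 512)), ('decode', None), ('prefill', (512, 1024))]
```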
Priority-based batching:
Not all requests are equally important. User-facing chatbot queries might need low latency, while batch processing jobs tolerate higher latency. Priority-based batching uses separate queues for different priority levels and allocates GPU resources proportionally.
High-priority requests form small batches with low timeout thresholds, minimizing latency. Lower-priority requests accumulate into larger batches, maximizing throughput but accepting higher latency. The scheduler dynamically balances GPU time between priority classes based on workload and SLA requirements.
This approach requires careful tuning to prevent starvation of low-priority requests while ensuring high-priority requests meet their latency targets. Token bucket or weighted fair queueing algorithms provide good balance in practice.
Monitoring and Optimization
Deploying batching and caching strategies effectively requires comprehensive monitoring to identify bottlenecks and tune parameters.
Key metrics to track:
Monitor these metrics to understand system performance and identify optimization opportunities:
Throughput metrics:
- Tokens per second (total system throughput)
- Requests per second
- Batch size distribution (actual sizes processed)
- GPU utilization percentage
Latency metrics:
- Time-to-first-token (TTFT) – how quickly users see the first response token
- Time-per-output-token (TPOT) – generation speed after the first token
- End-to-end latency distribution (p50, p95, p99)
- Queue wait time before batch processing begins
Caching metrics:
- Cache hit rate by cache layer
- Cache lookup latency
- Cache size and eviction rate
- Saved inference cost from cache hits
Resource metrics:
- GPU memory utilization
- GPU compute utilization during prefill vs decode
- CPU utilization for pre/post-processing
- Network bandwidth for distributed inference
These metrics reveal optimization opportunities. Low GPU utilization suggests larger batches could help. High p99 latency might indicate some requests are starving in the queue. Poor cache hit rates suggest adjusting similarity thresholds or adding cache layers.
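The latency metrics can be derived from per-token timestamps. The sketch below assumes you record the arrival time and the emission time of each output token; the helper names are hypothetical, and the percentile uses the simple nearest-rank definition.

```python
import math


def ttft(arrival_s, token_times_s):
    """Time-to-first-token: delay from arrival to the first output token."""
    return token_times_s[0] - arrival_s


def tpot(token_times_s):
    """Time-per-output-token: average gap between tokens after the first."""
    if len(token_times_s) < 2:
        return 0.0
    return (token_times_s[-1] - token_times_s[0]) / (len(token_times_s) - 1)


def percentile(values, pct):
    """Nearest-rank percentile, e.g. pct=99 for p99."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


token_times = [0.5, 0.6, 0.7]  # seconds since some shared clock origin
print(ttft(0.0, token_times))  # 0.5
print(round(tpot(token_times), 3))  # 0.1

latencies_ms = list(range(1, 101))
print(percentile(latencies_ms, 50), percentile(latencies_ms, 99))  # 50 99
```

Tracking TTFT and TPOT separately matters because they respond to different knobs: queueing and prefill dominate the first, batch size and decode speed the second.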
A/B testing configuration changes:
When tuning batching parameters or caching strategies, use A/B testing to measure impact on real workloads. Route a fraction of traffic to experimental configurations and compare metrics. Small changes to batch timeout, queue depth, or cache similarity threshold can have outsized impacts—positive or negative.
Test one variable at a time to understand causal relationships. If you change batch size and cache threshold simultaneously, you won’t know which caused observed performance changes. Methodical experimentation leads to better-optimized systems.
Adaptive parameter tuning:
Some systems implement adaptive tuning that adjusts parameters based on real-time metrics. If queue depth grows rapidly, reduce batch timeout to process requests more quickly. If GPU utilization drops, increase batch size to pack more work into each batch.
Adaptive tuning must be conservative to avoid instability. Rapid parameter changes can cause oscillations where the system bounces between extreme configurations. Use exponential smoothing of metrics and gradual parameter adjustments with lower and upper bounds to maintain stability.
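A conservative controller along these lines might look like the sketch below, which adjusts the batch timeout from a smoothed queue depth. The class name, thresholds, and step size are illustrative assumptions, and the rule that lengthens the timeout again when the queue is nearly empty is an addition made here so the controller can move in both directions.

```python
class TimeoutTuner:
    """Adjusts batch timeout from exponentially smoothed queue depth."""

    def __init__(self, timeout_ms=50.0, lo_ms=10.0, hi_ms=100.0, alpha=0.2,
                 high_depth=100.0, low_depth=10.0, step_ms=5.0):
        self.timeout_ms = timeout_ms
        self.lo_ms, self.hi_ms = lo_ms, hi_ms  # hard bounds on the timeout
        self.alpha = alpha                     # smoothing factor in (0, 1]
        self.high_depth, self.low_depth = high_depth, low_depth
        self.step_ms = step_ms                 # max change per update
        self.smoothed_depth = None

    def update(self, queue_depth):
        if self.smoothed_depth is None:
            self.smoothed_depth = float(queue_depth)
        else:
            self.smoothed_depth = (self.alpha * queue_depth
                                   + (1 - self.alpha) * self.smoothed_depth)
        if self.smoothed_depth > self.high_depth:
            # Queue is backing up: dispatch batches sooner.
            self.timeout_ms = max(self.lo_ms, self.timeout_ms - self.step_ms)
        elif self.smoothed_depth < self.low_depth:
            # Queue is nearly empty: wait longer to form fuller batches.
            self.timeout_ms = min(self.hi_ms, self.timeout_ms + self.step_ms)
        return self.timeout_ms


tuner = TimeoutTuner(alpha=0.5)
print(tuner.update(200))  # 45.0
print(tuner.update(200))  # 40.0
print(tuner.update(0))    # 40.0 (one quiet sample is smoothed away)
```

The gap between `low_depth` and `high_depth` acts as a dead band, which is what keeps the controller from oscillating around a single threshold.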
Combining Strategies for Maximum Impact
The real power emerges when combining batching and caching strategies synergistically. These optimizations aren’t mutually exclusive—they complement each other to provide multiplicative benefits.
Caching within batches:
Check caches before adding requests to batches. If a request hits the exact or semantic cache, return the cached response immediately without ever entering the inference batch. This keeps batch slots available for requests that truly need inference, improving batch efficiency.
For prefix cache hits, mark these requests as having partial pre-computation. When batching, group requests sharing the same cached prefix together when possible to maximize reuse of the shared KV cache blocks.
Batching similar requests:
When multiple requests share a significant prompt prefix, batch them together preferentially. KV cache blocks can only be shared when the leading tokens are identical, so detect overlap by comparing or hashing token prefixes; embedding similarity from the semantic cache does not imply a shared prefix. Grouping such requests maximizes prefix cache reuse and keeps the shared blocks resident while they are needed.
For example, if five users send the same system prompt and document followed by different questions, batch these together if possible. Five paraphrases of “What’s the weather in Seattle?”, by contrast, are better served by the semantic cache, because their token sequences diverge within the first few words.
Graduated cache warming:
Use background processing to pre-compute and cache responses for common queries during low-traffic periods. This “cache warming” ensures high hit rates during peak traffic, reducing inference load exactly when it matters most.
Identify frequently occurring prompts from historical data, compute responses in large batches during off-peak hours, and populate your cache. Combined with batching, cache warming can reduce peak infrastructure requirements by 50-70% for workloads with predictable query patterns.
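The warming workflow can be sketched as two small steps: pick the frequent prompts from a log, then fill the cache in batches. Everything here is illustrative: `generate_batch` stands in for a real batched inference call, the cache is a plain dictionary, and the lowercase-and-strip normalization is an assumption that must match whatever normalization the live lookup path applies.

```python
from collections import Counter


def prompts_to_warm(history, top_n=100, min_count=2):
    """Most frequent prompts in the log that occur at least min_count times."""
    counts = Counter(prompt.strip().lower() for prompt in history)
    return [p for p, count in counts.most_common(top_n) if count >= min_count]


def warm_cache(cache, prompts, generate_batch, batch_size=32):
    """Compute responses in large batches and store them in the cache."""
    for i in range(0, len(prompts), batch_size):
        chunk = prompts[i:i + batch_size]
        for prompt, response in zip(chunk, generate_batch(chunk)):
            cache[prompt] = response


history = ["Hi", "hi ", "reset password", "reset password",
           "reset password", "rare question"]
popular = prompts_to_warm(history, top_n=2)
print(popular)  # ['reset password', 'hi']

cache = {}
# Placeholder for a real model call that answers a list of prompts.
warm_cache(cache, popular, generate_batch=lambda ps: [p.upper() for p in ps])
print(cache["hi"])  # HI
```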
Conclusion
Batching and caching represent the foundational optimizations enabling economically viable LLM inference at scale. Dynamic batching with continuous request processing maximizes GPU utilization by maintaining full batches throughout the decode phase, while intelligent caching at multiple levels eliminates redundant computation for duplicate or similar requests. Together, these strategies deliver 10-100x improvements in throughput and proportional cost reductions, transforming LLM inference from prohibitively expensive to operationally feasible for real-world applications.
Effective implementation requires understanding your workload characteristics, carefully monitoring key metrics, and iteratively tuning parameters to balance throughput, latency, and resource utilization. Start with exact match caching and basic dynamic batching, then incrementally add semantic caching, prefix caching, and advanced batching techniques as your scale and sophistication grow. With thoughtful optimization, you can serve millions of requests daily on modest infrastructure while maintaining excellent user experience and manageable costs.