Continuous Batching for LLM Inference: How It Works and When to Use It

Continuous batching is the single most impactful throughput optimization for LLM inference servers, and understanding how it works is essential for anyone operating LLMs at scale. Before continuous batching, production LLM serving systems typically used static batching: the server waited to accumulate a fixed number of requests, then processed them all together as a single batch. This approach is efficient for conventional neural networks where all inputs and outputs have the same shape, but catastrophically inefficient for autoregressive language models — because different requests generate different numbers of tokens, the entire batch has to wait for the longest sequence to finish before returning any results or accepting new requests. A request that needs 5 tokens blocks a batch slot that could have been freed and reused for a new request after those 5 steps. Continuous batching, introduced by Orca (Yu et al., 2022), eliminates this waste by allowing individual sequences to leave and join the batch at each generation step.

How Static Batching Wastes GPU Compute

To understand why continuous batching matters, consider what happens with static batching under realistic traffic. Suppose you’re serving a batch of 8 requests. Request A generates 500 tokens (a long document summary), requests B through H each generate 20 tokens (short answers). With static batching, requests B through H finish after 20 steps, but the batch continues running for another 480 steps to finish request A. During those 480 steps, 7 out of 8 batch slots are occupied by finished requests that are contributing nothing — you’re using 87.5% of your batch capacity to hold dead sequences while new requests queue up waiting. GPU utilization stays high (the hardware is busy running A), but throughput is low because the batch capacity isn’t being used for productive work.
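The waste is easy to quantify. Here is a quick sanity check of the example above in Python (the request lengths are the hypothetical ones from this paragraph); measured over the full run, 84% of slot-steps are spent holding finished sequences:

# Slot-step accounting for the static batch described above.
output_lens = [500] + [20] * 7            # request A plus requests B through H
steps = max(output_lens)                  # the batch runs until the longest finishes
total_slot_steps = steps * len(output_lens)
productive = sum(output_lens)             # steps where a slot actually emitted a token
wasted = total_slot_steps - productive
print(f"wasted slot-steps: {wasted}/{total_slot_steps} ({wasted / total_slot_steps:.0%})")
# wasted slot-steps: 3360/4000 (84%)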

The problem compounds under variable traffic. When requests have heavy-tailed length distributions (a few very long requests mixed with many short ones), static batching creates systematic bottlenecks where short requests incur latency proportional to the longest request in their batch — entirely unrelated to their own computation needs. Padding makes this worse: since all sequences in a static batch must have the same length to form a rectangular tensor, shorter sequences are padded to the longest sequence’s length, wasting compute on padding tokens.

How Continuous Batching Works

Continuous batching, also called iteration-level scheduling or in-flight batching, processes one generation step at a time across all active requests, and makes a scheduling decision at each step about which requests to include. After each step, finished sequences (those that generated an EOS token or hit max length) are evicted from the batch, and new requests from the queue are added to fill the freed slots. The batch composition changes at every step rather than being fixed for the duration of a decode phase.

The key insight enabling this is that the KV cache for each sequence is independent. Sequence A’s KV cache doesn’t interfere with sequence B’s — they each have their own cache entries that grow as generation proceeds. When sequence B finishes and is evicted, its KV cache is freed. When a new sequence C is added, it initializes a fresh KV cache and runs its prefill phase (processing the entire prompt in parallel as one forward pass) before joining the decode batch. The decode batch is therefore a mixed batch of sequences at different generation steps, each with a different KV cache length, processing one new token per step in synchrony.
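A toy simulation makes the scheduling loop concrete. This sketch is illustrative, not vLLM's internals: the Request class stands in for real sequences, and prefill and decode are reduced to counters. What it demonstrates is the core behavior — slots are recycled the moment a sequence finishes:

from collections import deque
from dataclasses import dataclass

MAX_NUM_SEQS = 4  # decode-batch capacity (illustrative; vLLM's --max-num-seqs)

@dataclass
class Request:
    name: str
    output_len: int          # tokens this request will generate (simulated)
    generated: int = 0

    @property
    def finished(self) -> bool:
        return self.generated >= self.output_len

def serve(queue: deque) -> None:
    """Iteration-level scheduling: batch membership can change at every step."""
    active: list = []
    step = 0
    while active or queue:
        # 1. Admit queued requests into freed slots. In a real server each
        #    admission runs a prefill pass to build the request's KV cache.
        while queue and len(active) < MAX_NUM_SEQS:
            active.append(queue.popleft())

        # 2. One synchronized decode step: every active sequence emits a token.
        for req in active:
            req.generated += 1
        step += 1

        # 3. Evict finished sequences, freeing their slots (and KV cache).
        for req in [r for r in active if r.finished]:
            active.remove(req)
            print(f"step {step}: {req.name} done ({req.output_len} tokens)")

serve(deque([Request("A", 500)] + [Request(c, 20) for c in "BCDEFGH"]))

Running this shows B, C, and D finishing at step 20 and their slots being handed to E, F, and G on the very next step, while A keeps generating — the scenario that static batching cannot handle.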

Implementing this correctly requires solving the KV cache memory management problem: how do you allocate and free KV cache memory dynamically as sequences come and go? vLLM’s PagedAttention algorithm (Kwon et al., 2023) is the dominant solution — it manages KV cache memory in fixed-size pages (analogous to virtual memory pages), allocating and freeing pages as sequences grow and complete. This eliminates the memory fragmentation that would occur with contiguous KV cache allocation and allows vLLM to maintain near-100% GPU memory utilization for KV cache without over-allocation.
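The core bookkeeping fits in a few lines. A minimal sketch of the page-pool idea, assuming a fixed page size and a shared free list (this is the concept, not vLLM's actual implementation):

PAGE_SIZE = 16  # tokens per page (vLLM calls these "blocks")

class KVPagePool:
    """KV cache memory as fixed-size pages drawn from a shared free list.
    Any free page can hold any sequence's entries, so nothing fragments."""

    def __init__(self, num_pages: int):
        self.free = list(range(num_pages))
        self.pages: dict = {}    # seq_id -> list of page numbers, in order
        self.lengths: dict = {}  # seq_id -> tokens stored so far

    def append_token(self, seq_id: str) -> int:
        """Record one more token's KV entries, taking a fresh page only
        when the sequence crosses a page boundary."""
        n = self.lengths.get(seq_id, 0)
        if n % PAGE_SIZE == 0:  # current page full (or first token)
            if not self.free:
                raise MemoryError("pool exhausted: preempt or queue the sequence")
            self.pages.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1
        return self.pages[seq_id][-1]  # page holding this token's K and V

    def release(self, seq_id: str) -> None:
        """Sequence finished: all its pages go straight back to the pool."""
        self.free.extend(self.pages.pop(seq_id, []))
        self.lengths.pop(seq_id, None)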

Prefill vs Decode Phases

Every LLM request has two distinct phases. The prefill phase processes the entire input prompt in a single forward pass — all prompt tokens are processed in parallel, and the KV cache for all prompt tokens is computed at once. This phase is highly parallelizable and compute-bound. The decode phase generates one token at a time, auto-regressively — each step takes the previously generated token as input, looks up the full KV cache for all previous tokens (prompt + generated so far), and produces the next token. This phase is memory-bandwidth-bound: the bottleneck is reading the KV cache from HBM, not performing floating-point operations.
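A rough arithmetic-intensity estimate makes the distinction concrete (the parameter count, FLOP rule of thumb, and H100 figures below are approximations):

# Rough roofline estimate for one forward pass of an ~8B-parameter model, fp16.
params = 8e9
weight_bytes = params * 2              # fp16 weights

# Decode: each step reads all weights (plus the KV cache, which lowers the
# ratio further) to produce a single token.
flops_per_token = 2 * params           # ~2 FLOPs per parameter per token
decode_intensity = flops_per_token / weight_bytes
print(f"decode: ~{decode_intensity:.0f} FLOP/byte")    # ~1

# Prefill: a 2048-token prompt reads the weights once but does 2048x the math.
prompt_len = 2048
prefill_intensity = flops_per_token * prompt_len / weight_bytes
print(f"prefill: ~{prefill_intensity:.0f} FLOP/byte")  # ~2048

# An H100 sustains roughly 1000 fp16 TFLOPs against ~3.35 TB/s of HBM
# bandwidth, a ridge point near 300 FLOP/byte: decode sits far below it
# (memory-bandwidth-bound), prefill far above it (compute-bound).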

Continuous batching mixes prefill and decode operations in the same batch, which creates a tension. Prefill for a long prompt is computationally expensive and can dominate a batch step, adding latency for requests that are already in the decode phase. Modern LLM serving systems handle this in different ways: vLLM originally ran prefill and decode in separate batches (prefill on arrival, then join the decode batch); newer systems like Sarathi-Serve use chunked prefill, splitting long prompts into smaller chunks and interleaving them with decode steps to smooth out the latency spikes that long prefills cause for concurrent requests, an approach recent vLLM versions have adopted as well.
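A sketch of the chunked-prefill scheduling idea along the lines of Sarathi-Serve (the token budget, class, and function names here are illustrative, not any system's API):

from dataclasses import dataclass

TOKEN_BUDGET = 512  # max tokens the model processes per iteration (illustrative)

@dataclass
class Prefill:
    name: str
    remaining: int  # prompt tokens not yet run through the model

def plan_iteration(num_decode_seqs: int, pending: list) -> list:
    """Decode sequences are scheduled first (one token each); leftover budget
    is filled with chunks of pending prompts, so a long prefill is spread
    across steps instead of stalling in-flight decodes."""
    budget = TOKEN_BUDGET - num_decode_seqs
    chunks = []
    for p in pending:
        if budget <= 0:
            break
        take = min(budget, p.remaining)
        chunks.append((p.name, take))
        p.remaining -= take
        budget -= take
    return chunks

# A 4000-token prompt is processed 312 tokens at a time alongside 200 decodes:
pending = [Prefill("long-doc", 4000)]
for step in range(3):
    print(f"step {step}: {plan_iteration(200, pending)}")
# step 0: [('long-doc', 312)] ... and so on until the prompt is consumed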

Throughput vs Latency Tradeoffs

Continuous batching significantly improves throughput — the number of requests served per second — but has a nuanced effect on latency. For short requests that would have been blocked behind long ones in a static batch, latency improves substantially since they can now complete and be returned without waiting. For requests in a fully loaded system where the decode batch is always at maximum capacity, time-to-first-token (TTFT) may increase because new requests wait in queue for a decode slot to open, whereas in a static batching system they would have been picked up at the next batch boundary.

The practical consequence: continuous batching optimizes for throughput and average latency. If your SLA is primarily about throughput (requests per second) or average latency, continuous batching is strictly better than static batching. If your SLA is about worst-case latency or TTFT under heavy load, you need to carefully set concurrency limits and queue depths to bound how long new requests wait. vLLM's --max-num-seqs parameter controls the maximum number of sequences in the decode batch at once; setting it too high lets the decode batch grow large under heavy load, lengthening each step and degrading both time-per-output-token and TTFT.

Continuous Batching in vLLM

vLLM implements continuous batching by default — there's nothing to configure to enable it. The relevant configuration parameters tune its behavior:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --max-num-seqs 256 \
    --max-num-batched-tokens 8192 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 4096

Here --max-num-seqs caps concurrent sequences in the decode batch, --max-num-batched-tokens caps total tokens processed per iteration (prefill + decode), --gpu-memory-utilization sets the fraction of total GPU memory vLLM may use (the KV cache pool gets whatever remains after model weights), and --max-model-len caps sequence length (prompt + output).

The most impactful parameter is --max-num-batched-tokens, which caps the total compute per step. Setting it too low underutilizes the GPU on decode-heavy workloads; too high causes OOM when multiple long prefills arrive simultaneously. A starting point is 4x the model's typical prompt length times your target concurrency, then tune based on measured GPU utilization and latency percentiles. The --gpu-memory-utilization parameter controls how much GPU HBM vLLM claims in total; whatever remains after model weights goes to the KV cache (PagedAttention's page pool), so higher values mean more concurrent sequences fit in memory. Leaving some headroom (0.90 rather than 0.95) prevents OOM from activation memory spikes during prefill of long prompts.
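Applied to concrete numbers, the starting-point heuristic looks like this (the prompt length and concurrency below are placeholders; measure your own traffic):

# Starting point for --max-num-batched-tokens per the heuristic above.
typical_prompt_len = 512   # measure from your real traffic, not a guess
target_concurrency = 4     # prefills you want to absorb per iteration
print(4 * typical_prompt_len * target_concurrency)  # 8192; tune from here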

Measuring the Impact

The throughput improvement from continuous batching over static batching depends heavily on your request length distribution. For workloads with highly variable output lengths (short and long responses mixed), improvements of 5–10x in requests per second are common. For workloads with uniformly short outputs (e.g., classification with 1–3 token outputs), the improvement is smaller since static batching’s waste is minimal when all sequences finish at nearly the same step. Measure your specific workload’s improvement by benchmarking vLLM (or TGI, which also implements continuous batching) against a naive static batching baseline using your actual traffic distribution — synthetic benchmarks with uniform output lengths will understate the benefit for real workloads.

vLLM's benchmarks/benchmark_serving.py script provides a standard benchmark that measures tokens per second under configurable request rate, input length, and output length distributions. Run it with --request-rate inf to measure peak throughput (maximum requests per second the server can sustain) and with --request-rate N to measure latency at a given load level. Compare throughput across different --max-num-seqs and --max-num-batched-tokens settings to find the configuration that maximizes throughput while keeping p95 TTFT within your latency budget.
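As a starting point, invocations along these lines work against a running server (flag names can change between vLLM versions, so check --help for your installed release):

# Peak throughput: all requests issued immediately.
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --num-prompts 1000 \
    --request-rate inf

# Latency at a fixed load, e.g. 8 requests per second.
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --num-prompts 1000 \
    --request-rate 8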

Disaggregated Prefill and Decode

One of the active research and engineering frontiers in LLM serving is disaggregating the prefill and decode phases onto separate hardware. The motivation is that prefill and decode have fundamentally different hardware requirements: prefill is compute-bound and benefits from high FLOPS (a large GPU running at peak arithmetic throughput), while decode is memory-bandwidth-bound and benefits from high HBM bandwidth regardless of compute throughput. Running both on the same GPU means you’re always under-utilizing one dimension — during prefill you’re not using the memory bandwidth efficiently, and during decode you’re not using the compute units efficiently.

Disaggregated serving assigns prefill requests to a pool of “prefill GPUs” and decode requests to a “decode GPU” pool. When a request arrives, it is first processed by a prefill GPU (which runs the full prompt through the model in parallel), then its KV cache is transferred over the network to a decode GPU which handles autoregressive generation. This transfer adds latency (the KV cache for a 4096-token prompt on a 70B-class model is on the order of a gigabyte in fp16, so even a 400 Gbps InfiniBand link adds tens of milliseconds), but it allows you to size your prefill and decode fleets independently based on your actual workload characteristics. Systems like Mooncake (used in production at Kimi) and disaggregated vLLM have demonstrated that for workloads with heavy prefill (long documents, RAG with large retrieved chunks), this architecture improves overall throughput by 30–50% compared to mixing prefill and decode on the same GPU pool. For most teams not operating at hyperscale, this is not worth the infrastructure complexity, but understanding it contextualizes why continuous batching alone doesn’t fully solve the serving efficiency problem at very large scale.
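The transfer cost is easy to estimate. A back-of-envelope calculation, assuming Llama-3-70B-like geometry and an fp16 KV cache (fp8 halves these numbers):

# Back-of-envelope KV transfer cost. Assumed geometry: Llama-3-70B-like,
# 80 layers, 8 KV heads (GQA), head dim 128, fp16 cache (fp8 halves this).
layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V
kv_bytes = kv_bytes_per_token * 4096          # 4096-token prompt

link_bytes_per_s = 400 / 8 * 1e9              # 400 Gbps InfiniBand, nominal
transfer_ms = kv_bytes / link_bytes_per_s * 1e3
print(f"KV cache: {kv_bytes / 1e9:.2f} GB, transfer: {transfer_ms:.0f} ms")
# KV cache: 1.34 GB, transfer: 27 ms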

Continuous Batching vs Request Batching in Practice

A common point of confusion is the relationship between continuous batching (iteration-level scheduling) and request-level batching. They are complementary, not alternatives. Continuous batching describes how the serving system manages the sequence of generation steps — allowing sequences to leave and join at each step rather than running a fixed batch to completion. Request batching describes how many requests are processed in parallel at any given step. Both vLLM and TGI use continuous batching as their scheduling algorithm and process multiple requests in parallel (a batch of sequences) at each step. The batch size at any given step is determined by how many sequences are currently in the decode phase, bounded by --max-num-seqs.

The practical implication for engineers configuring LLM servers is that the right mental model is queue management rather than batch size tuning. The question is not “what batch size should I use?” but rather “how many concurrent sequences can my GPU handle while meeting my latency SLA?” That answer depends on the model size (larger models have larger KV cache per sequence, limiting concurrency), the sequence length distribution (longer sequences consume more KV cache memory per slot), and the latency budget (more concurrent sequences = longer per-step time = higher latency per request). The --max-num-seqs parameter is your primary lever for this tradeoff. Start by estimating your GPU’s KV cache capacity in tokens (GPU memory remaining after model weights, divided by the KV cache bytes per token) and set --max-num-seqs so that the total KV cache across all sequences doesn’t exceed 90% of that capacity. Then measure actual throughput and latency at your expected request rate and adjust from there.
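That estimate is a short script. Illustrative numbers below: an ~8B-parameter model in fp16 (32 layers, 8 KV heads, head dim 128) on an 80 GB GPU; substitute your own model geometry and hardware:

# Estimate KV capacity and a conservative --max-num-seqs.
gpu_mem_gb, weights_gb = 80, 16
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V

kv_budget = (gpu_mem_gb - weights_gb) * 1e9 * 0.90   # 90% of what's left
capacity_tokens = kv_budget / kv_bytes_per_token

max_model_len = 4096
print(f"{capacity_tokens:,.0f} tokens of KV capacity, "
      f"~{capacity_tokens / max_model_len:.0f} sequences if all hit max length")
# 439,453 tokens of KV capacity, ~107 sequences if all hit max length

Since most sequences finish well short of max_model_len, a higher setting is usually safe in practice; the measurement step above is what validates the final number.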
