What Are the Two Steps of LLM Inference?

Large language models like GPT-4, Claude, and Llama generate text through a process that appears seamless to users but actually unfolds in two distinct computational phases: the prefill phase and the decode phase. Understanding these two steps is fundamental to grasping how LLMs work, why they behave the way they do, and what engineering challenges arise when deploying them at scale. The dichotomy between these phases shapes everything from latency characteristics to hardware requirements, from optimization strategies to cost structures. Each phase exhibits fundamentally different computational patterns, memory access behaviors, and bottlenecks—making them targets for entirely different optimization techniques.

For practitioners building systems around LLMs, understanding this two-phase structure illuminates why certain optimizations work, why costs scale the way they do, and how to design architectures that maximize throughput while minimizing latency. For users curious about the technology behind AI assistants, this knowledge demystifies why responses sometimes start slowly then flow smoothly, or why batch processing differs dramatically from interactive generation. The distinction between prefill and decode represents one of the most important architectural characteristics of autoregressive language models.

The Fundamental Architecture of Autoregressive Generation

Before examining the two phases in detail, it’s essential to understand the autoregressive nature of LLM text generation. Unlike models that process an entire input and produce a complete output in one step, autoregressive models generate text sequentially—one token at a time—with each new token depending on all previously generated tokens.

Token-by-Token Generation forms the core of how modern LLMs work. When you ask an LLM “What is the capital of France?”, the model doesn’t generate the complete answer “The capital of France is Paris” simultaneously. Instead, it generates:

  1. “The” (first token)
  2. “capital” (second token, conditioned on “The”)
  3. “of” (third token, conditioned on “The capital”)
  4. “France” (fourth token, conditioned on “The capital of”)
  5. “is” (fifth token, conditioned on “The capital of France”)
  6. “Paris” (sixth token, conditioned on “The capital of France is”)

Each token depends on all preceding tokens, creating a sequential dependency chain that prevents full parallelization of the generation process. This autoregressive structure directly gives rise to the two-phase inference pattern.
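The loop above can be sketched in a few lines of Python. The `next_token` function here is a toy stand-in for a real model's forward pass (a real LLM produces a probability distribution over its vocabulary, not a fixed word), but it shows the sequential dependency: each step consumes everything generated so far.

```python
# Toy sketch of autoregressive generation. next_token() stands in for a
# full forward pass; here it simply replays the example answer from the text.
ANSWER = ["The", "capital", "of", "France", "is", "Paris"]

def next_token(tokens):
    # A real model conditions on ALL previous tokens to predict the next one.
    return ANSWER[len(tokens)] if len(tokens) < len(ANSWER) else "<eos>"

def generate(max_new_tokens=10):
    tokens = []
    for _ in range(max_new_tokens):
        tok = next_token(tokens)   # step t depends on tokens 1..t-1
        if tok == "<eos>":
            break
        tokens.append(tok)
    return tokens

print(" ".join(generate()))  # The capital of France is Paris
```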

The Transformer Architecture Context helps explain why these phases differ so dramatically. Transformer models process sequences through self-attention mechanisms that compare every position in a sequence with every other position. For an input of length n, this requires computing n² attention weights. The computational and memory characteristics of these attention operations differ significantly depending on whether you’re processing many tokens in parallel (prefill) or generating one token at a time (decode).

The key-value (KV) cache plays a central role in understanding both phases. During generation, the model must attend to all previous tokens—both from the original prompt and from already-generated output. Rather than recomputing representations of these previous tokens repeatedly, the model stores their “keys” and “values” in the KV cache, accessing them during subsequent generation steps. Managing this cache efficiently becomes crucial for performance.

The Two-Phase Inference Process

Phase 1: Prefill

Input: full prompt; processing: parallel; output: first token + KV cache

  • Process all prompt tokens at once
  • Compute-bound operation
  • High GPU utilization
  • Build KV cache for the context
  • One-time cost per request

Phase 2: Decode

Input: previous token; processing: sequential; output: next token

  • Generate one token at a time
  • Memory-bandwidth-bound
  • Lower GPU utilization
  • Read from the KV cache
  • Repeats until completion

Phase 1: The Prefill Phase – Processing the Prompt

The prefill phase represents the first step of LLM inference, where the model processes the entire input prompt in a single forward pass through the network. This phase is compute-intensive and highly parallelizable, taking advantage of modern GPU architectures designed for large matrix operations.

Parallel Processing of Input Tokens defines the prefill phase’s computational character. When you submit a prompt with 1,000 tokens, the model doesn’t process them sequentially—it processes all 1,000 tokens simultaneously in parallel. Each layer of the transformer processes all input positions at once, computing attention weights between every pair of positions and applying feedforward transformations to all tokens concurrently.

This parallelism makes the prefill phase compute-bound. The GPU’s arithmetic units (tensor cores, CUDA cores) work at near-maximum capacity, performing billions of floating-point operations per second. Matrix multiplications dominate the computation—transforming input embeddings through attention mechanisms and feedforward layers. Modern GPUs excel at these operations, achieving high utilization and throughput.

Building the KV Cache represents the prefill phase’s critical output beyond the first generated token. As the model processes each layer of the transformer for the input tokens, it computes “key” and “value” representations for every position. Rather than discarding these after computing the first token, the model stores them in the KV cache.

For a prompt with n tokens passing through a model with l layers, each having d dimensions for keys and values, the KV cache requires storing 2 × l × n × d values. This memory grows linearly with prompt length and model depth. A 7-billion parameter model with 32 layers, 4096 dimensions per layer, and a 2048-token prompt requires approximately 1GB of KV cache storage (in FP16 precision, at 2 bytes per value).
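As a quick check, the formula can be evaluated directly. This assumes 2 bytes per value (FP16) and one key and one value vector of the full hidden width per layer per token, i.e. standard multi-head attention without grouped-query or multi-query variants, which shrink the cache considerably:

```python
def kv_cache_bytes(n_layers, n_tokens, hidden_dim, bytes_per_value=2):
    # factor of 2 = keys AND values; FP16 stores each value in 2 bytes
    return 2 * n_layers * n_tokens * hidden_dim * bytes_per_value

size = kv_cache_bytes(n_layers=32, n_tokens=2048, hidden_dim=4096)
print(f"{size / 2**30:.2f} GiB")  # 1.00 GiB for the example above
```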

This cache construction is why long prompts incur significant one-time cost during prefill. A 10-token prompt might prefill in 50ms, while a 10,000-token prompt could take 5 seconds—the computation scales with sequence length. However, this cost is paid only once per generation request.

Computational Characteristics of prefill make it fundamentally different from decode:

  • High throughput: Processing many tokens simultaneously achieves high tokens-per-second rates
  • Compute-bound: Limited by GPU’s arithmetic throughput rather than memory bandwidth
  • Batch-friendly: Multiple requests can be batched effectively, processing different prompts in parallel
  • Predictable latency: Time to complete scales linearly with prompt length
  • High GPU utilization: Typically achieving 60-90% of peak compute capacity

For production systems, prefill phase optimization focuses on maximizing batch sizes to amortize the fixed overhead across multiple requests, and using efficient attention implementations like FlashAttention that reduce memory access costs while maintaining compute intensity.

The Role of Attention in Prefill deserves special attention. Self-attention in the prefill phase computes relationships between all pairs of input tokens, requiring O(n²) operations for n tokens. Each token attends to every other token, computing attention weights that determine how much each position should influence the representation of each other position.

This quadratic complexity means that doubling prompt length quadruples the attention computation cost. Techniques like sparse attention, sliding window attention, or linear attention attempt to reduce this, but standard transformers process full O(n²) attention during prefill. For long documents (10,000+ tokens), this becomes computationally expensive, which is why many models have context length limits and why extending context windows requires architectural innovations.

Phase 2: The Decode Phase – Generating Tokens Sequentially

After prefill produces the first token, the decode phase takes over, generating subsequent tokens one at a time in an autoregressive manner. This phase exhibits dramatically different computational characteristics than prefill, creating distinct optimization challenges and performance bottlenecks.

Sequential Token Generation defines the decode phase’s fundamental constraint. Unlike prefill’s parallel processing, decode must generate tokens sequentially because each token depends on all previous tokens. The model:

  1. Takes the most recently generated token as input
  2. Computes its representation through all transformer layers
  3. Attends to all previous tokens (from both original prompt and generated output) using the KV cache
  4. Produces a probability distribution over the vocabulary
  5. Samples or selects the next token
  6. Repeats until reaching a stopping condition (max length, end-of-sequence token, etc.)

This sequential dependency prevents parallelization across tokens within a single generation sequence. You cannot generate the 10th output token until you’ve generated tokens 1-9, because token 10 depends on them. This creates the characteristic “one token at a time” behavior users observe when watching LLM outputs stream.
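The cache's role in this loop can be sketched with shapes only, no real tensor math: each decode step computes keys and values for just the newest token and appends them, so attention reads the cache instead of recomputing every earlier position.

```python
# Sketch of decode-phase cache usage for ONE attention layer (shapes only).
kv_cache = [("k", "v")] * 1000           # entries built during prefill (1,000-token prompt)

def decode_step(step):
    new_kv = (f"k_{step}", f"v_{step}")  # computed for the newest token only
    kv_cache.append(new_kv)              # cache grows by one entry per token
    return len(kv_cache)                 # attention span = all cached tokens

spans = [decode_step(t) for t in range(5)]
print(spans)  # [1001, 1002, 1003, 1004, 1005] -> the attended span grows each step
```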

Memory Bandwidth Bottleneck represents the decode phase’s primary performance constraint. For each generated token, the model must:

  • Load all model weights from memory (potentially hundreds of gigabytes)
  • Read the entire KV cache containing representations of all previous tokens
  • Perform computations on a single token position
  • Store updated KV cache entries for the new token

The amount of computation per token is relatively small compared to the amount of data that must be moved from memory. This makes decode memory-bandwidth-bound rather than compute-bound. GPUs spend more time waiting for data to arrive from memory than performing arithmetic operations, resulting in lower utilization (often 20-40% of peak compute) compared to the prefill phase.
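A back-of-envelope model makes the bottleneck concrete: each decode step must stream every model weight through memory, so per-token latency is roughly bounded below by weight bytes divided by memory bandwidth. The numbers here are illustrative, and the estimate ignores the KV cache and any compute overlap:

```python
def min_seconds_per_token(params_billions, bytes_per_param, bandwidth_gb_s):
    # lower bound: every weight must cross the memory bus once per token
    weight_gb = params_billions * bytes_per_param
    return weight_gb / bandwidth_gb_s

t = min_seconds_per_token(7, 2, 1000)     # 7B params, FP16, ~1 TB/s HBM
print(f"{t * 1000:.0f} ms/token, at most {1 / t:.0f} tokens/s")
```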

The Growing KV Cache Challenge intensifies as generation progresses. With each new token, the KV cache grows—requiring more memory and more bandwidth to access. If you generate 1,000 tokens, by the end you’re reading 1,000 entries from the KV cache for each layer’s attention computation. This growing memory requirement and access cost is why long generations slow down and why KV cache management becomes critical for production systems.

For a 70-billion parameter model serving a batch of concurrent requests with multi-thousand-token contexts, the KV cache can consume tens of gigabytes of memory. This cache grows with both batch size and context length, and at scale it can rival the memory required for the model weights themselves. Managing this cache efficiently—through techniques like paging (PagedAttention), quantization, or selective retention—becomes essential for scalable deployment.

Latency Per Token in decode phase determines the user-perceived generation speed. If the model generates tokens at 20 tokens per second, a 200-token response takes 10 seconds. This per-token latency depends on:

  • Model size (larger models have more weights to load from memory)
  • Memory bandwidth (faster memory enables quicker weight and KV cache access)
  • Batch size (batching the decode steps of many requests together improves aggregate throughput but can slightly increase per-request latency)
  • Implementation optimizations (kernel fusion, quantization, efficient attention)

Typical decode speeds range from 5-10 tokens/second for large models on consumer GPUs to 50-100+ tokens/second for smaller models on high-end datacenter GPUs with optimized inference engines.

Batching During Decode presents interesting trade-offs. Unlike prefill, where prompts of different lengths can be padded and processed together in a single pass, decode batching is more complex because different requests generate different numbers of tokens and complete at different times. Some requests might need 50 tokens while others need 500.

Advanced batching strategies like continuous batching (implemented in systems like vLLM) dynamically manage batches, adding new requests as slots become available when other requests complete. This maximizes GPU utilization during decode by ensuring the memory bandwidth is shared across as many requests as possible, improving overall throughput even if individual request latency increases slightly.

The Performance Asymmetry and Its Implications

The stark difference between prefill and decode phases creates a fundamental performance asymmetry that shapes everything about LLM deployment and optimization. Understanding this asymmetry helps explain many seemingly puzzling aspects of LLM behavior and system design.

Latency Characteristics differ dramatically between phases. Prefill latency scales with prompt length but is a one-time cost paid upfront. Decode latency accumulates token by token throughout generation. For a 1,000-token prompt and 500-token generation:

  • Prefill might take 2 seconds (processing 1,000 tokens in parallel)
  • Decode might take 25 seconds (generating 500 tokens at 20 tokens/second)

The total time-to-first-token (TTFT) is dominated by prefill, while time-to-completion is dominated by decode. This asymmetry explains why users experience a brief pause before generation starts, followed by smooth streaming of output tokens.
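The example works out as simple arithmetic:

```python
output_tokens = 500
prefill_seconds = 2.0                       # 1,000-token prompt processed in parallel
decode_rate = 20                            # tokens per second, sequential
decode_seconds = output_tokens / decode_rate

ttft = prefill_seconds                      # time-to-first-token: prefill-dominated
total = prefill_seconds + decode_seconds    # time-to-completion: decode-dominated
print(ttft, total)  # 2.0 27.0
```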

Different Optimization Strategies target each phase:

For Prefill:

  • Batch multiple requests together to amortize fixed overhead
  • Use FlashAttention or other memory-efficient attention implementations
  • Optimize matrix multiplication kernels for large batch operations
  • Leverage mixed-precision or quantization for faster compute

For Decode:

  • Implement efficient KV cache management (PagedAttention, quantized cache)
  • Optimize memory bandwidth usage and data layout
  • Use speculative decoding to generate multiple tokens per step
  • Apply continuous batching to maximize throughput across requests

Hardware Implications of the two-phase pattern influence infrastructure choices. The prefill phase benefits from high compute throughput—GPUs with more tensor cores, higher TFLOPS ratings. The decode phase benefits from high memory bandwidth—GPUs with HBM (High Bandwidth Memory), larger memory buses.

This explains why some GPUs excel at inference despite lower compute specs—their memory bandwidth characteristics match decode phase requirements better. Conversely, training-optimized GPUs with extreme compute capacity may not provide proportional inference benefits if memory bandwidth is the bottleneck.

Cost Structures reflect the asymmetry. Cloud providers increasingly price LLM inference based on both input tokens (prefill) and output tokens (decode), recognizing that these consume different resources. Input token pricing reflects compute costs, while output token pricing reflects the cumulative memory bandwidth costs and the longer time GPU resources remain allocated during sequential generation.

💡 Why Longer Prompts Don’t Necessarily Mean Slower Generation

A common misconception is that longer prompts result in proportionally slower overall generation. While prefill time increases linearly with prompt length, the decode speed (tokens per second) remains relatively constant—it’s determined by model size and hardware, not prompt length. A 10,000-token prompt might take 5 seconds to prefill, but then generates at the same 20 tokens/second as a 100-token prompt. The long prompt affects time-to-first-token but not the per-token generation rate afterward. However, the growing KV cache from both prompt and generated tokens means that very long contexts do eventually slow decode slightly as more historical tokens must be attended to.

Practical Implications for Developers and Users

Understanding the two-phase structure of LLM inference provides practical insights for both developers building systems around LLMs and users interacting with them.

Prompt Engineering Considerations become clearer with phase awareness. Since prefill is a one-time cost, including substantial context in your prompt doesn’t slow per-token generation—only the initial startup. This means:

  • Rich prompts with many examples (few-shot learning) don’t penalize generation speed
  • Including relevant documentation or context improves quality without affecting decode throughput
  • The cost is in time-to-first-token, not in ongoing generation speed

However, extremely long prompts increase memory requirements for the KV cache, potentially limiting batch sizes or requiring more GPU memory, indirectly affecting cost and throughput at scale.

Streaming Responses leverage the decode phase’s sequential nature. Because tokens are generated one at a time, they can be streamed to users as they’re produced. This dramatically improves perceived latency—users see the first token after prefill completes (perhaps 1-2 seconds) rather than waiting for the entire 500-token response to finish (25 seconds). Modern inference APIs expose streaming interfaces specifically to exploit this characteristic.
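In code, streaming is naturally expressed as a generator that yields each token as decode produces it, so the caller can display output after the first step instead of waiting for the last. The `decode_step` callable here is a hypothetical stand-in for one forward pass:

```python
def stream_generate(decode_step, max_tokens=500):
    for _ in range(max_tokens):
        tok = decode_step()       # one sequential decode step
        if tok is None:           # end-of-sequence
            return
        yield tok                 # hand the token to the caller immediately

# Toy decode_step that emits three tokens then stops:
tokens = iter(["Hello", ",", " world"])
stream = stream_generate(lambda: next(tokens, None))
print(next(stream))  # "Hello" arrives after a single step, not after all 500
```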

Batch Processing vs Interactive Use benefit from understanding phase differences. For batch processing many documents where latency doesn’t matter, you can:

  • Maximize prefill batch sizes to process many prompts simultaneously
  • Use higher decode batch sizes to generate multiple sequences in parallel
  • Accept higher per-request latency in exchange for dramatically higher throughput

For interactive applications where responsiveness matters:

  • Minimize batch sizes to reduce time-to-first-token
  • Prioritize decode phase optimization for smooth token streaming
  • Use separate infrastructure for interactive vs batch workloads

Speculative Decoding represents an advanced technique that attempts to overcome decode phase limitations. The idea is to use a smaller, faster “draft” model to speculatively generate multiple tokens, then verify them with the larger “target” model in parallel. If the draft tokens are correct (which happens frequently for straightforward continuations), you effectively generate multiple tokens per decode step, improving throughput.

This technique works precisely because it converts some of the sequential decode phase into parallelizable verification, better utilizing GPU compute during decode. Implementations can achieve 2-3x speedup for certain generation patterns.
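The accept/verify loop can be sketched as a toy. The `draft` and `verify` functions are hypothetical stand-ins: the draft model proposes k tokens cheaply, and the target model checks them all in one parallel forward pass, keeping the longest prefix it agrees with.

```python
# Toy sketch of one speculative-decoding step (no real models involved).
TRUTH = list("speculation")            # what the target model "wants" to emit

def draft(context, k):                 # cheap draft model; happens to be right here
    return TRUTH[len(context):len(context) + k]

def verify(context, proposed):         # target model: accept until first mismatch
    accepted = []
    for i, tok in enumerate(proposed):
        if TRUTH[len(context) + i] != tok:
            break
        accepted.append(tok)
    return accepted

context = list("spec")
accepted = verify(context, draft(context, k=4))
print(accepted)  # ['u', 'l', 'a', 't'] -> 4 tokens from one target-model pass
```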

Quantization Impact differs between phases. Weight quantization (reducing model weights from FP16 to INT8 or INT4) primarily benefits the decode phase by reducing memory bandwidth requirements—fewer bytes to load per token. This is why quantized models show larger relative speedups during decode than prefill.

Activation quantization (quantizing the intermediate values during forward pass) can benefit both phases but requires careful calibration to avoid quality degradation. The prefill phase’s compute intensity makes some activation quantization strategies less effective since compute, not memory, is the bottleneck.

Advanced System Considerations

Production LLM inference systems incorporate sophisticated techniques to manage the two-phase pattern at scale, addressing challenges that arise when serving thousands of concurrent requests.

KV Cache Management Strategies become critical at scale. Naive implementations allocate contiguous memory for each request’s KV cache, leading to severe memory fragmentation as requests of different lengths come and go. PagedAttention, implemented in systems like vLLM, borrows ideas from virtual memory management in operating systems—allocating KV cache in fixed-size pages that can be allocated and freed dynamically.

This paging approach:

  • Sharply reduces memory waste from fragmentation and worst-case pre-allocation
  • Enables memory sharing for identical prefixes (useful for batch queries with shared system prompts)
  • Allows dynamic memory allocation matching actual needs rather than pre-allocating worst-case memory
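The core idea can be sketched as a toy block allocator. The page size and pool size here are illustrative, and real systems like vLLM add reference counting for shared prefixes, eviction, and much more:

```python
class PagedKVCache:
    """Toy paged allocator: each request maps its tokens onto fixed-size pages."""
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # pool of physical page ids
        self.tables = {}                      # request id -> (token count, pages)

    def append_token(self, req_id):
        count, pages = self.tables.get(req_id, (0, []))
        if count % self.block_size == 0:      # current page full: claim a new one
            pages = pages + [self.free.pop()]
        self.tables[req_id] = (count + 1, pages)

    def release(self, req_id):                # request done: pages return to the pool
        _, pages = self.tables.pop(req_id)
        self.free.extend(pages)

cache = PagedKVCache(num_blocks=8)
for _ in range(40):                           # 40 tokens -> ceil(40/16) = 3 pages
    cache.append_token("req-A")
print(len(cache.free))                        # 5 pages still free for other requests
cache.release("req-A")
print(len(cache.free))                        # 8: nothing stays reserved
```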

Continuous Batching maximizes decode phase throughput by maintaining GPU utilization as requests complete. Traditional batching waits for all requests in a batch to complete before starting new ones—wasting GPU resources as the batch size decreases. Continuous batching immediately adds new requests into available slots, maintaining consistently high batch sizes and GPU utilization.

The implementation complexity involves managing requests at different generation stages, handling memory allocation dynamically, and scheduling attention computations efficiently across heterogeneous request lengths.
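A toy simulation shows the payoff. Static batching of the four requests below in pairs would take 8 decode steps, because each pair holds its slots until its slowest member finishes; continuous batching refills freed slots immediately and finishes the same work in 6:

```python
from collections import deque

def continuous_batching_steps(requests, max_batch):
    """requests: list of (id, tokens to generate). Returns decode steps consumed."""
    queue, active, steps = deque(requests), {}, 0
    while queue or active:
        while queue and len(active) < max_batch:   # refill any free slot right away
            rid, need = queue.popleft()
            active[rid] = need
        steps += 1                                 # one decode step for the whole batch
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]                    # finished: slot frees this step
    return steps

print(continuous_batching_steps([("a", 2), ("b", 6), ("c", 2), ("d", 2)], max_batch=2))
# 6 steps, versus 8 for static batches {a,b} then {c,d}
```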

Mixed Batch Scheduling runs prefill and decode in the same batch on the same GPU, carefully balancing compute-bound prefill operations with memory-bound decode operations. Since prefill uses GPU compute heavily while decode uses memory bandwidth heavily, interleaving them can improve overall resource utilization. However, this requires sophisticated scheduling to avoid interference and ensure fairness.

Multi-GPU Strategies must account for phase differences. Model parallelism (splitting the model across multiple GPUs) affects both phases similarly—each GPU processes its portion of the model for each token. Tensor parallelism and pipeline parallelism each have different characteristics for prefill vs decode, with tensor parallelism generally providing more consistent performance across both phases.

Caching at the System Level can optimize repeated prefills. If multiple users submit requests with identical prefixes (common with shared system prompts), the system can cache the prefilled KV cache for that prefix and reuse it across requests. This effectively makes the shared prefix zero-cost for subsequent requests, significantly improving throughput for certain usage patterns.
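The idea can be sketched by keying prefill results on the prefix tokens. Token counts here stand in for actual prefill compute, and the function names are illustrative, not any particular system's API:

```python
prefix_cache = {}          # hash of prefix tokens -> its prefilled KV cache
tokens_prefilled = 0       # stand-in for prefill compute spent

def prefill(tokens):
    global tokens_prefilled
    tokens_prefilled += len(tokens)
    return f"kv[{len(tokens)}]"          # placeholder for a real KV cache

def prefill_shared(system_prompt, user_suffix):
    key = hash(tuple(system_prompt))
    if key not in prefix_cache:          # pay for the shared prefix only once
        prefix_cache[key] = prefill(system_prompt)
    return prefix_cache[key], prefill(user_suffix)   # only the suffix each time

system = ["You", "are", "a", "helpful", "assistant"]        # 5 shared tokens
for suffix in (["hi"], ["help", "me"], ["thanks"]):
    prefill_shared(system, suffix)
print(tokens_prefilled)  # 9 tokens prefilled, versus 19 with no prefix reuse
```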

The Future Evolution of Two-Phase Inference

While the prefill-decode dichotomy is fundamental to autoregressive generation, ongoing research explores ways to mitigate its limitations or find alternative architectures.

Parallel Decoding Approaches attempt to generate multiple tokens simultaneously, reducing the sequential bottleneck. Techniques like:

  • Speculative decoding (mentioned earlier) uses draft models to propose multiple tokens
  • Medusa adds multiple prediction heads to the model, allowing it to predict several tokens ahead
  • Parallel sampling methods generate multiple candidate sequences and select the best

These approaches trade increased computation per step for reduced sequential steps, potentially improving latency when compute is available but the sequential bottleneck is limiting.

Alternative Architectures like state-space models (Mamba, RWKV) or retrieval-augmented generation attempt to change the computational pattern. These architectures process sequences with different complexity characteristics, potentially altering or eliminating the traditional prefill-decode distinction. However, they come with their own trade-offs in model quality, training efficiency, or other dimensions.

Hardware Co-design increasingly optimizes specifically for LLM inference patterns. Custom accelerators designed for inference can optimize memory layouts, prefetching strategies, and compute patterns specifically for the prefill-decode workflow. This co-design approach recognizes that general-purpose GPUs, while capable, aren’t optimized for the specific access patterns of LLM inference.

Conclusion

The two steps of LLM inference—prefill and decode—represent fundamentally different computational patterns with distinct characteristics, bottlenecks, and optimization strategies. Prefill processes the entire input prompt in parallel, computing representations and building the KV cache in a compute-intensive operation that happens once per request. Decode generates output tokens sequentially, one at a time, in a memory-bandwidth-bound process that repeats until generation completes. This dichotomy shapes everything from hardware selection to system architecture, from pricing models to user experience design.

Understanding these two phases transforms how developers approach LLM deployment and optimization. Recognizing that prefill is compute-bound while decode is memory-bound explains why certain optimizations work, why performance characteristics differ between short and long generations, and why batch processing strategies vary between interactive and throughput-oriented use cases. As LLMs become increasingly central to applications across industries, this foundational knowledge of how inference actually works—not as a single monolithic operation but as two distinct phases with different characteristics—becomes essential for building efficient, scalable, and cost-effective systems.
