Speculative Decoding for Faster LLM Token Generation

Large language models generate text one token at a time in an autoregressive fashion: each token depends on all previous tokens, creating a sequential bottleneck that prevents parallelization. No matter how powerful the GPU, a model that needs to generate 100 tokens must perform 100 sequential forward passes through the entire network. This sequential dependency makes inference slow and expensive, particularly for large models where each forward pass takes significant time.

Speculative decoding offers an elegant solution to this seemingly insurmountable constraint. By using a smaller, faster “draft” model to predict multiple tokens speculatively, then verifying those predictions with the target model in parallel, speculative decoding can generate multiple tokens per target forward pass when the predictions are accurate. The technique achieves 2-3x speedups in practice without any loss in output quality: the output distribution is provably identical to what the target model would produce autoregressively. The trick is a clever exploitation of the fact that small and large models tend to agree on easy tokens, which is what makes parallelizing an inherently sequential process achievable in practice.

The Sequential Bottleneck in Autoregressive Generation

Before diving into speculative decoding, it’s crucial to understand why standard autoregressive generation is inherently sequential and why this creates such significant performance limitations.

How autoregressive generation works:

When generating text, a language model processes an input prompt and generates the first output token by computing probability distributions over the vocabulary. It samples from this distribution (or takes the most likely token), appends this token to the context, and repeats the process. Each new token requires:

  1. Running a forward pass over the context (with a key-value cache only the newest token is processed, but the full model weights must still be read)
  2. Computing next-token probabilities
  3. Sampling/selecting a token
  4. Appending to context for the next iteration

The critical constraint is step 1—you cannot compute the next token until you know the current token, because the current token is part of the context for the next token. This creates a dependency chain where token N+1 depends on token N, which depends on token N-1, and so on.
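
As a concrete picture of that dependency chain, here is a minimal sketch of the generation loop in Python. The `next_token_probs` function is a stand-in for a real transformer forward pass (it just returns random probabilities over a toy vocabulary); the point is that each iteration must finish before the next can begin.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 50_000

def next_token_probs(context: list[int]) -> np.ndarray:
    """Stand-in for the expensive transformer forward pass over `context`."""
    logits = rng.normal(size=VOCAB_SIZE)              # placeholder logits
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                            # softmax over the vocabulary

def generate(prompt_ids: list[int], max_new_tokens: int) -> list[int]:
    context = list(prompt_ids)
    for _ in range(max_new_tokens):                   # strictly sequential loop
        probs = next_token_probs(context)             # steps 1-2: forward pass + probabilities
        token = int(rng.choice(VOCAB_SIZE, p=probs))  # step 3: sample a token
        context.append(token)                         # step 4: extend the context
    return context[len(prompt_ids):]

print(generate([101, 2023, 2003], max_new_tokens=5))
```

Even on a GPU, nothing inside this loop can be overlapped across iterations for a single sequence, because each call consumes the token produced by the previous one.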

Why parallelization doesn’t help:

Modern GPUs excel at parallel computation. When processing a batch of sequences, or when computing attention across many tokens, GPUs can perform thousands of operations simultaneously. But this parallelism doesn’t help with the sequential token generation problem.

During prefill (processing the input prompt), GPUs are well-utilized—all prompt tokens can be processed in parallel. But during decode (generating output tokens), each token requires a separate forward pass. You can batch multiple independent sequences together, but for a single sequence, generation remains strictly sequential.

The memory-bound reality:

For large models, token generation is memory-bound rather than compute-bound. Each forward pass requires loading billions of parameters from GPU memory, but performs relatively little computation per loaded parameter during decode. The GPU’s computational capacity sits largely idle while waiting for memory transfers.

This memory-boundedness means even the fastest GPUs can’t dramatically speed up single-sequence generation. A newer card with higher memory bandwidth does generate tokens faster in rough proportion to that bandwidth, but whether you run an older V100 or a cutting-edge H100, single-sequence decoding leaves most of the chip’s compute idle and stays far below what the hardware could deliver with better parallelism.
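
A back-of-the-envelope calculation makes this ceiling concrete. The sketch below assumes a 70B-parameter model stored in FP16 and that every generated token requires streaming all weights from GPU memory once; the bandwidth figures are illustrative round numbers, not exact specifications.

```python
PARAMS = 70e9                  # illustrative 70B-parameter model
BYTES_PER_PARAM = 2            # FP16 weights
weight_bytes = PARAMS * BYTES_PER_PARAM             # ~140 GB read per decode step

# Illustrative HBM bandwidth classes (round numbers, not exact GPU specs)
for name, bw_gb_s in [("~0.9 TB/s class", 900),
                      ("~2 TB/s class", 2000),
                      ("~3.3 TB/s class", 3300)]:
    seconds_per_token = weight_bytes / (bw_gb_s * 1e9)
    print(f"{name}: <= {1 / seconds_per_token:.0f} tokens/s per sequence")
```

The 20-50 tokens/s figures quoted below for 70B-class models generally come from sharding the model across several GPUs so their memory bandwidth adds up; the per-device ceiling is what makes the decode phase memory-bound.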

Quantifying the problem:

A typical large language model (e.g., 70B parameters) might generate tokens at 20-50 tokens per second on high-end GPUs. For a 500-token response, that’s 10-25 seconds of sequential processing. When serving thousands of users, this sequential bottleneck directly limits throughput and requires expensive GPU infrastructure to maintain acceptable latency.

The cost implications are enormous. If you could double token generation speed, you’d halve infrastructure costs for inference workloads. This makes optimization techniques like speculative decoding incredibly valuable for production deployments.

⏱️ Autoregressive Generation Timeline

Prefill Phase (Parallel): Process 1,000 input tokens → 20ms

Decode Phase (Sequential):
• Token 1: 50ms forward pass + sampling
• Token 2: 50ms forward pass + sampling
• Token 3: 50ms forward pass + sampling
• … (100 tokens total) → 5,000ms (5 seconds)

Problem: Decode phase dominates latency, cannot be parallelized in standard generation

The Core Idea Behind Speculative Decoding

Speculative decoding transforms the sequential bottleneck into a parallel verification problem by predicting multiple tokens ahead and checking them all at once.

The speculation-verification loop:

The key insight is this: if you could predict what the target model would generate, you could verify multiple predictions in a single parallel forward pass. This converts a sequential process (generate token 1, then 2, then 3) into a parallel one (predict tokens 1-5, verify all predictions simultaneously).

Speculative decoding uses a small, fast “draft” model to generate these predictions. The draft model runs quickly because it’s much smaller (perhaps 1B parameters vs. 70B for the target). It generates k candidate tokens speculatively (typically k=4-8). Then the target model verifies all k candidates in a single batched forward pass.

If all candidates are accepted, you’ve effectively generated multiple tokens in the time of one target model forward pass plus the cheap draft model predictions; in fact, the verification pass also yields one extra token from the target model itself, so a fully accepted round produces k+1 new tokens. If some candidates are rejected, you keep the accepted prefix, sample a replacement token at the first rejected position from an adjusted distribution (described under the acceptance criterion below), and start the next round from there.
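
The control flow of one round looks roughly like the sketch below, written for the simpler greedy-decoding case so that “verification” is just an equality check. The `draft_argmax` and `target_argmax` functions are toy stand-ins for real models; the probabilistic acceptance rule used when sampling is covered in the acceptance criterion subsection below.

```python
V = 100                                        # tiny toy vocabulary

def draft_argmax(ctx):                         # stand-in for the small draft model
    return (sum(ctx) * 31) % V

def target_argmax(ctx):                        # stand-in for the large target model;
    return (sum(ctx) * 31 + (len(ctx) % 7 == 0)) % V   # it occasionally disagrees

def speculative_round(context, k=5):
    """One speculation-verification round (greedy-decoding variant)."""
    # 1. Draft model proposes k tokens, cheaply and sequentially.
    ctx, drafted = list(context), []
    for _ in range(k):
        drafted.append(draft_argmax(ctx))
        ctx.append(drafted[-1])

    # 2. Target model scores all k+1 positions in ONE forward pass.
    #    (Called per position here for clarity; a real engine batches this.)
    targets = [target_argmax(context + drafted[:i]) for i in range(k + 1)]

    # 3. Accept the longest prefix where draft and target agree; at the first
    #    disagreement take the target's own token, or take the bonus (k+1)-th
    #    token if everything matched. Every round yields at least one token.
    accepted = []
    for i in range(k):
        if drafted[i] == targets[i]:
            accepted.append(drafted[i])
        else:
            accepted.append(targets[i])
            break
    else:
        accepted.append(targets[k])
    return context + accepted

print(speculative_round(list(range(10)), k=5))
```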

Why this works: model agreement on easy tokens:

Speculative decoding exploits a key observation: large and small models often agree on “easy” tokens. When the next token is highly predictable from context—like completing common phrases, following clear grammar, or continuing obvious patterns—both models make the same prediction.

For example, given “The capital of France is”, both models will predict “Paris” with high confidence. Similarly, closing punctuation, common function words, and continuation of established patterns are highly predictable. Small models succeed at these easy predictions despite lacking the reasoning capabilities of larger models.

When predictions are easy, speculation succeeds at high rates (often 60-80% acceptance), yielding substantial speedups. When predictions are hard, speculation fails more often, but failure is relatively cheap—you only waste the time of the fast draft model, not the expensive target model.

The acceptance criterion:

Verification isn’t just checking if tokens match exactly. Speculative decoding uses a probabilistic acceptance criterion based on the probability distributions from both models. A candidate token is accepted if the target model assigns it reasonable probability relative to the draft model’s prediction.

Specifically, for each candidate token, compare:

  • p_target(token): Target model’s probability for this token
  • p_draft(token): Draft model’s probability for this token

Accept the candidate if p_target(token) ≥ p_draft(token); otherwise accept it with probability p_target/p_draft. When a candidate is rejected, a replacement token is sampled from the normalized residual distribution max(0, p_target - p_draft) over the vocabulary. Together, these two rules ensure the final distribution of generated text matches what the target model would produce autoregressively: output quality is identical, not just similar.
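
In code, the acceptance rule together with the residual resampling step fits in a few lines. The sketch below uses toy five-token distributions and then checks empirically that the tokens it produces follow the target distribution p rather than the draft distribution q; `spec_sample` and the toy p and q are illustrative, not a library API.

```python
import numpy as np

rng = np.random.default_rng(0)

def spec_sample(p: np.ndarray, q: np.ndarray) -> int:
    """Sample one token: the draft proposes from q, the target accepts or rejects
    against p. The returned token is distributed exactly according to p."""
    x = int(rng.choice(len(q), p=q))                 # draft's proposal
    if rng.uniform() < min(1.0, p[x] / q[x]):        # accept with prob min(1, p/q)
        return x
    residual = np.maximum(p - q, 0.0)                # otherwise resample from the
    residual /= residual.sum()                       # normalized residual max(0, p - q)
    return int(rng.choice(len(p), p=residual))

# Empirical check on a toy 5-token vocabulary: sampled frequencies should
# match the target distribution p, not the draft distribution q.
p = np.array([0.50, 0.20, 0.15, 0.10, 0.05])
q = np.array([0.25, 0.25, 0.20, 0.20, 0.10])
counts = np.bincount([spec_sample(p, q) for _ in range(100_000)], minlength=5)
print("target p :", p)
print("empirical:", np.round(counts / counts.sum(), 3))
```

Running it shows the empirical frequencies tracking p closely, which is the same equivalence argument stated more formally below.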

Mathematical guarantees:

This acceptance mechanism provides a strong guarantee: the output distribution is exactly the same as standard autoregressive sampling from the target model. This isn’t an approximation. With greedy decoding the output is token-for-token identical to standard generation, and with sampling the distribution over outputs is provably unchanged, so speculative decoding can be deployed without worrying about quality degradation or behavioral changes.

Implementation Strategies and Variations

While the core concept is straightforward, implementing speculative decoding effectively requires addressing several practical challenges and considering different implementation strategies.

Choosing the draft model:

The draft model must be carefully selected to balance speed and prediction accuracy. Several approaches work:

Smaller version of target architecture: Use a smaller model from the same family (e.g., Llama-7B as draft for Llama-70B). These models share architectural choices and training data characteristics, improving prediction agreement. However, maintaining separate model weights increases memory usage.

Quantized target model: Use an aggressively quantized version of the target model (like 4-bit or 3-bit quantization) as the draft. This requires less additional memory since you’re using the same base model. Quantization reduces both memory and compute, making draft predictions very fast while maintaining reasonable accuracy.

Distilled specialist models: Train a small model specifically to predict the target model’s behavior. This student model is distilled from the teacher (target) model using knowledge distillation. Distilled models can achieve better prediction agreement than generic small models of similar size.

Early-exit strategies: Some implementations use early exit from the target model itself—stop computation after a few layers for draft predictions. This eliminates separate model memory overhead but provides less speedup since early-exit draft predictions aren’t as fast as truly small models.

Determining speculation depth:

How many tokens should the draft model predict before verification? This k parameter critically affects performance:

Too small (k=1-2): The fixed cost of a target verification pass is spent checking only a token or two. Even with perfect acceptance, the speedup is minimal.

Too large (k=10+): Acceptance rates drop rapidly for later tokens because errors compound: once one drafted token is wrong, every subsequent prediction is conditioned on a bad prefix. Beyond a certain point, most of the extra draft work produces tokens that end up rejected.

Optimal range (k=4-8): Empirically, this range balances verification efficiency against acceptance rates. With typical acceptance rates around 60-80%, speculating 5-6 tokens ahead means accepting 3-5 on average, providing meaningful speedup without excessive rejection overhead.

The optimal k depends on your specific models and workload. Easy generation tasks (like code completion with clear syntax rules) support larger k. Harder tasks (like creative writing or reasoning) benefit from smaller k due to lower agreement between models.
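
A simple way to see the diminishing returns is to assume each drafted token is accepted independently with probability alpha, an idealization used in the speculative decoding literature rather than a property of real workloads. Under that assumption, the expected number of tokens produced per round (accepted drafts plus the one token the verification pass always yields) flattens out quickly as k grows:

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    # Accepted drafted tokens plus the one token every verification pass yields,
    # assuming an independent per-token acceptance probability alpha.
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.5, 0.7, 0.8, 0.9):
    row = ", ".join(f"k={k}: {expected_tokens_per_round(alpha, k):.2f}"
                    for k in (2, 4, 6, 8, 10))
    print(f"alpha={alpha} -> {row}")
```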

Batching speculative verification:

A powerful optimization batches verification across multiple speculation attempts. Rather than verifying one set of k candidates at a time, accumulate several candidate sets and verify them all in a large batch. This increases target model utilization during verification, amortizing fixed overhead across more candidates.

For example, with k=5 and batch size 4, you verify 20 candidate tokens in a single batched forward pass. This trades off latency (waiting to accumulate batches) against throughput (processing more candidates per verification), making it ideal for serving multiple requests concurrently.
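
A shape-level sketch of that batching, with a stub standing in for the batched target pass (the 70% agreement figure and the `batched_target_argmax` helper are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_requests, V = 5, 4, 100

# Each concurrent request has k drafted tokens waiting for verification.
drafts = rng.integers(0, V, size=(n_requests, k))            # shape (4, 5)

def batched_target_argmax(batch):
    """Stub for ONE batched target forward pass over all 20 candidates.
    Returns the target's predicted token at each position, agreeing ~70% of the time."""
    noise = rng.integers(0, V, size=batch.shape)
    agree = rng.random(batch.shape) < 0.7
    return np.where(agree, batch, noise)

targets = batched_target_argmax(drafts)                       # single verification pass

# Per request, accept the longest prefix where draft and target agree.
matches = drafts == targets
accepted_len = np.where(matches.all(axis=1), k,
                        matches.argmin(axis=1))               # index of first mismatch
print("accepted per request:", accepted_len)
```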

🚀 Speculative Decoding Flow

Step 1: Draft model predicts k=5 tokens → 5ms

Step 2: Target model verifies all 5 tokens in parallel → 50ms

Outcome (70% acceptance): Accept first 3-4 tokens → 3.5 tokens for 55ms

Standard generation: 1 token for 50ms

Tokens per target forward pass: ~3.5 instead of 1

Total: 2.5-3x faster generation at identical output quality

Performance Analysis and Optimization

Understanding where speedups come from and how to maximize them requires analyzing the performance characteristics of speculative decoding in detail.

Speedup factors and dependencies:

Speculative decoding speedup depends on several interacting factors:

Draft-target speed ratio: If the draft model runs 10x faster than the target (typical for 1B vs 70B models), speculating k=5 tokens costs only 0.5 target model forward passes worth of time. This favorable ratio is crucial—using a draft model only 2x faster wouldn’t provide enough advantage.

Acceptance rate: With 80% acceptance over k=5 tokens, you expect to accept 4 tokens per speculation attempt. With 50% acceptance, you’d accept only 2-3 tokens. Higher acceptance rates directly translate to better speedups. This rate varies by task difficulty and model agreement.

Verification overhead: The target model verification pass must process all candidate tokens. If batching and parallelization aren’t efficient, verification overhead can eat into speedup gains. Well-optimized implementations minimize this overhead through careful tensor operations and memory management.

Memory bandwidth utilization: Since generation is memory-bound, speculative decoding must efficiently use memory bandwidth. Loading model weights for one forward pass that verifies k candidates is more efficient than k sequential forward passes. This bandwidth advantage is fundamental to why the technique works.
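
These factors can be folded into a rough analytical model. The sketch below assumes a constant per-token acceptance rate alpha, a draft pass that costs c target-passes of time (c = 0.1 for a draft roughly 10x faster), and a verification pass that costs about as much as one ordinary decode step, which is plausible in the memory-bound regime; it is an estimate for intuition, not a benchmark.

```python
def estimated_speedup(alpha: float, k: int, c: float) -> float:
    # Expected tokens per round divided by its cost in target-pass units:
    # k draft passes (each costing c) plus one target verification pass.
    tokens_per_round = (1 - alpha ** (k + 1)) / (1 - alpha)
    cost_per_round = k * c + 1.0
    return tokens_per_round / cost_per_round

for alpha in (0.6, 0.7, 0.8):
    best = max(range(1, 13), key=lambda k: estimated_speedup(alpha, k, c=0.1))
    print(f"alpha={alpha}: best k={best}, "
          f"speedup ~ {estimated_speedup(alpha, best, 0.1):.2f}x")
```

Under these assumptions the optimum lands around k=3-6 with speedups between roughly 1.7x and 2.5x, consistent with the ranges quoted earlier, and the same function shows how quickly the gains erode if the draft model is not much faster than the target.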

When speculative decoding excels:

Certain scenarios provide exceptional speedups:

Code generation: Programming languages have strict syntax rules that both draft and target models learn well. Closing brackets, indentation, common keywords, and function signatures are highly predictable, leading to 70-90% acceptance rates and 3-4x speedups.

Translation: Source-target language pairs with clear correspondences enable high agreement. Grammatical structures and vocabulary translations are relatively deterministic, supporting effective speculation.

Factual question answering: When answers are concise and factual, models often agree. Simple factual completions like dates, names, or numerical values match between models reliably.

Templated generation: Generating structured outputs (JSON, XML, form data) with predictable formats shows high agreement. The rigid structure constrains generation space, increasing prediction accuracy.

When speculation struggles:

Some tasks inherently resist speculation:

Creative writing: Open-ended creative tasks lack constraint, reducing agreement. Different word choices, stylistic variations, and narrative directions mean models often diverge in predictions.

Complex reasoning: Multi-step reasoning where the target model’s stronger capabilities matter shows low draft-target agreement. The small model can’t follow the large model’s reasoning chain.

High-entropy distributions: When next-token distributions are flat (many tokens have similar probability), both models are uncertain, and their predictions often differ even if neither is clearly wrong.

Technical specialized domains: If the draft model wasn’t trained on specialized vocabulary or domain knowledge that the target model has, agreement suffers in those domains.

Practical Deployment Considerations

Deploying speculative decoding in production systems requires addressing several operational and engineering challenges.

Memory requirements:

Running two models simultaneously increases memory usage. A 70B target model requires ~140GB in FP16. Adding a 7B draft model needs ~14GB more. This 10% memory increase is manageable but must be planned for in infrastructure sizing.

Optimization techniques help:

  • Quantization: Use 4-bit quantization for draft models, reducing their memory footprint to ~3-4GB
  • Shared vocabulary embeddings: If models share vocabulary, reuse embedding layers between draft and target
  • Model sharding: In distributed inference, shard models across GPUs with careful placement to minimize communication overhead

Latency considerations:

While throughput improves significantly with speculative decoding, time-to-first-token (TTFT) latency might increase slightly. The first speculation-verification cycle must complete before producing any tokens, and if the initial speculation fails, you’ve added latency with no benefit.

For latency-sensitive applications:

  • Use smaller speculation depths (k=3-4) to reduce speculation overhead
  • Implement adaptive strategies that skip speculation for the first few tokens when context suggests low agreement
  • Combine with techniques like continuous batching to amortize latency across concurrent requests

Integration with other optimizations:

Speculative decoding combines synergistically with other inference optimizations:

Flash Attention: Memory-efficient attention speeds up both draft and target models, compounding speedup benefits. Faster verification means speculation overhead decreases, improving net speedup.

KV Cache optimization: Efficient key-value caching reduces memory bandwidth for attention operations, making verification more efficient. Paged attention works well with speculation by efficiently managing cache for speculative tokens.

Quantization: Quantizing both models reduces memory bandwidth requirements, making the memory-bound generation faster. This increases baseline speed but also improves speculation speedup since verification becomes faster.

Batching: Speculative decoding naturally batches verification across multiple candidate tokens. Combining this with batching across multiple user requests provides further throughput improvements.

Monitoring and debugging:

Production deployments should monitor speculation effectiveness:

Acceptance rate: Track what fraction of speculated tokens are accepted. Declining acceptance rates indicate changing workload patterns or model mismatch issues.

Speedup factor: Measure end-to-end speedup compared to baseline generation. This real-world metric captures all overheads and optimizations.

Per-request characteristics: Different request types may show vastly different speculation effectiveness. Understanding these patterns enables routing optimizations or adaptive speculation strategies.

Draft model performance: Monitor draft model latency and correctness independently. If draft predictions slow down or degrade in quality, investigate before speculation effectiveness suffers.
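
A minimal sketch of what such tracking might look like in application code; `SpeculationStats` is a hypothetical helper, and a real serving stack would export these values as metrics rather than printing them.

```python
from dataclasses import dataclass

@dataclass
class SpeculationStats:
    """Hypothetical per-deployment counters for speculation effectiveness."""
    drafted: int = 0
    accepted: int = 0
    rounds: int = 0
    target_passes_saved: int = 0

    def record_round(self, k: int, n_accepted: int) -> None:
        self.rounds += 1
        self.drafted += k
        self.accepted += n_accepted
        # Each round costs one target pass but yields n_accepted + 1 tokens,
        # so n_accepted target passes were avoided (ignoring draft-model cost).
        self.target_passes_saved += n_accepted

    @property
    def acceptance_rate(self) -> float:
        return self.accepted / max(self.drafted, 1)

stats = SpeculationStats()
for k, n in [(5, 4), (5, 2), (5, 5), (5, 1)]:    # toy per-round outcomes
    stats.record_round(k, n)
print(f"acceptance rate: {stats.acceptance_rate:.0%}, "
      f"target passes saved: {stats.target_passes_saved} over {stats.rounds} rounds")
```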

Advanced Techniques and Variants

Research continues advancing speculative decoding with techniques that push performance further or adapt the core idea to new scenarios.

Adaptive speculation depth:

Rather than using fixed k, adaptive approaches dynamically adjust speculation depth based on predicted acceptance probability. After a few tokens, the system learns how predictable the current generation is and adjusts k accordingly.

For highly predictable sequences, increase k to maximize speedup. For unpredictable sequences, reduce k to avoid wasting effort on rejections. This adaptation improves average-case performance by matching strategy to content difficulty.
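
A minimal sketch of one such heuristic: grow k while recent rounds are mostly accepted, shrink it when rejections become common. The thresholds and the `AdaptiveDepth` class are illustrative choices, not a published algorithm.

```python
from collections import deque

class AdaptiveDepth:
    def __init__(self, k_init=5, k_min=2, k_max=10, window=20):
        self.k = k_init
        self.k_min, self.k_max = k_min, k_max
        self.recent = deque(maxlen=window)      # fraction accepted in recent rounds

    def next_k(self) -> int:
        return self.k

    def update(self, k_used: int, n_accepted: int) -> None:
        self.recent.append(n_accepted / k_used)
        rate = sum(self.recent) / len(self.recent)
        if rate > 0.8 and self.k < self.k_max:       # drafts almost always accepted:
            self.k += 1                              # speculate deeper
        elif rate < 0.5 and self.k > self.k_min:     # frequent rejections:
            self.k -= 1                              # speculate shallower

ctrl = AdaptiveDepth()
for accepted in [5, 5, 4, 5, 5, 2, 1, 2]:            # toy outcomes per round
    k = ctrl.next_k()
    ctrl.update(k, min(accepted, k))
print("depth after these rounds:", ctrl.k)
```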

Multi-candidate speculation:

Instead of the draft model proposing a single sequence of k tokens, it can propose multiple candidate sequences (like beam search). The target model verifies all sequences in parallel, accepting the most probable valid sequence.

This increases speculation cost but improves acceptance when generation has multiple plausible paths. For tasks where the draft model is uncertain but the target model has clear preferences, multi-candidate speculation recovers speedup that single-sequence speculation would miss.

Ensemble speculation:

Use multiple draft models with different characteristics—perhaps one specialized for factual content, another for reasoning, a third for creative text. Based on context classification, select the most appropriate draft model or ensemble their predictions.

Different draft models excel in different scenarios. Switching between them or combining their predictions intelligently improves overall speculation effectiveness across diverse workloads.

Iterative refinement:

Some approaches iterate speculation—after accepting k tokens, immediately speculate the next k using the updated context. This creates a pipeline where draft prediction and target verification overlap, hiding draft model latency.

Careful implementation can achieve higher effective throughput by keeping both models continuously busy, though this increases implementation complexity and memory requirements for managing concurrent speculations.

Conclusion

Speculative decoding fundamentally changes the economics of LLM inference by breaking the sequential bottleneck through probabilistic prediction and parallel verification. By using small, fast draft models to predict token sequences and verifying them with target models in parallel, the technique achieves 2-3x speedups without any quality degradation—the mathematical guarantees ensure output distributions remain identical to standard autoregressive sampling. This speedup directly translates to reduced infrastructure costs, lower latency, and higher throughput for production LLM deployments, making previously uneconomical applications viable.

The technique’s effectiveness varies by task characteristics, with structured and predictable generation showing exceptional gains while creative and high-entropy tasks benefit less. However, even modest average speedups of 1.5-2x provide substantial value at scale, and ongoing research continues improving speculation strategies, acceptance rates, and integration with complementary optimizations. For any organization running LLM inference at scale, implementing speculative decoding has become essentially mandatory—the combination of significant speedup, zero quality trade-off, and moderate implementation complexity makes it one of the highest-ROI optimizations available for transformer inference systems.
