When an LLM generates text with sampling enabled, it does not deterministically pick the single most probable next token. Instead, it samples from a probability distribution shaped by several parameters: temperature, top-p, top-k, and optionally others such as repetition penalty. Getting these parameters right determines whether outputs are coherent but bland, creative but incoherent, or well-calibrated for a specific task. Understanding what each parameter actually does — at the level of logits and probability distributions — lets you tune generation behaviour systematically rather than by trial and error.
The Logit Pipeline: From Raw Scores to Token Probabilities
At each generation step, the LLM computes a logit vector of shape (vocab_size,) — one raw score per token in the vocabulary. For GPT-2-scale models this is 50,257 tokens; for Llama 3 it is 128,256. These logits are the unnormalised output of the final linear projection and can range from large negative values to large positive values. Before sampling, the logits pass through a sequence of transformations: temperature scaling, top-k truncation, top-p (nucleus) truncation, and finally softmax to convert to probabilities. Order matters: top-k and top-p are applied to the temperature-scaled logits before softmax, not after.
import torch
import torch.nn.functional as F

def sample_token(
    logits: torch.Tensor,                    # shape: (vocab_size,)
    temperature: float = 1.0,
    top_k: int = 0,
    top_p: float = 1.0,
    repetition_penalty: float = 1.0,
    input_ids: torch.Tensor | None = None,   # 1-D tensor of context token ids, for repetition penalty
) -> int:
    """Full sampling pipeline: repetition penalty → temperature → top-k → top-p → sample."""
    logits = logits.clone().float()  # work in float32 for precision

    # 1. Repetition penalty: downscale logits for tokens already in the context
    if repetition_penalty != 1.0 and input_ids is not None:
        for token_id in set(input_ids.tolist()):
            if logits[token_id] > 0:
                logits[token_id] /= repetition_penalty
            else:
                logits[token_id] *= repetition_penalty

    # 2. Temperature scaling (T = 0 is treated as greedy decoding, not a division by zero)
    if temperature <= 0:
        return int(torch.argmax(logits).item())
    if temperature != 1.0:
        logits = logits / temperature

    # 3. Top-k filtering: keep only the k highest-logit tokens
    if top_k > 0:
        top_k = min(top_k, logits.size(-1))
        kth_value = torch.topk(logits, top_k).values[-1]
        logits[logits < kth_value] = float('-inf')

    # 4. Top-p (nucleus) filtering: keep smallest set of tokens whose cumulative prob reaches p
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        # Mark tokens whose cumulative prob exceeds top_p, then shift the mask right
        # so the token that crosses the threshold is kept (the top token is never removed)
        to_remove = cumulative_probs > top_p
        to_remove[1:] = to_remove[:-1].clone()
        to_remove[0] = False
        # Restore original token ordering before masking
        indices_to_remove = to_remove.scatter(0, sorted_idx, to_remove)
        logits[indices_to_remove] = float('-inf')

    # 5. Sample from the resulting distribution
    probs = F.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1).item())
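To see the pipeline in context, here is a minimal generation loop built on sample_token. It is a sketch rather than a production decoder: it assumes a Hugging Face causal LM (GPT-2 is used purely for illustration) and recomputes the full forward pass at every step instead of using a KV cache.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

input_ids = tokenizer("The quickest way to learn sampling is", return_tensors="pt").input_ids[0]
for _ in range(30):
    with torch.no_grad():
        logits = model(input_ids.unsqueeze(0)).logits[0, -1]  # next-token logits, shape (vocab_size,)
    next_id = sample_token(
        logits, temperature=0.8, top_k=50, top_p=0.9,
        repetition_penalty=1.1, input_ids=input_ids,
    )
    input_ids = torch.cat([input_ids, torch.tensor([next_id])])
    if next_id == tokenizer.eos_token_id:
        break
print(tokenizer.decode(input_ids))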
Temperature: Sharpening and Flattening the Distribution
Temperature is the most influential sampling parameter. Dividing logits by temperature T before softmax changes the sharpness of the probability distribution: temperature below 1.0 makes the distribution sharper (more probability mass on the top tokens), temperature above 1.0 makes it flatter (probability mass spread more evenly across tokens). As temperature approaches 0, the distribution collapses to a point mass on the single highest-logit token, which is equivalent to greedy decoding; implementations treat temperature 0 as a special case (greedy argmax) rather than dividing the logits by zero. At temperature 1.0, the distribution is the model's raw output, unchanged.
import torch
import torch.nn.functional as F

# Simulate how temperature affects a simple 6-token distribution
logits = torch.tensor([3.2, 2.8, 1.5, 0.4, -0.3, -1.2])
temperatures = [0.3, 0.7, 1.0, 1.5, 2.0]
for T in temperatures:
    probs = F.softmax(logits / T, dim=-1)
    entropy = -(probs * probs.log()).sum().item()
    print(f"T={T:.1f} | probs: {probs.numpy().round(3)} | entropy: {entropy:.3f}")
# Approximate output:
# T=0.3 | probs: [0.789 0.208 0.003 ...] | entropy: 0.530   (almost deterministic)
# T=0.7 | probs: [0.595 0.336 0.053 ...] | entropy: 0.909   (concentrated)
# T=1.0 | probs: [0.511 0.343 0.093 ...] | entropy: 1.136   (raw model output)
# T=1.5 | probs: [0.418 0.320 0.135 ...] | entropy: 1.391   (more spread)
# T=2.0 | probs: [0.360 0.295 0.154 ...] | entropy: 1.533   (closer to uniform)
In practice: use temperature 0.0 to 0.3 for tasks requiring factual accuracy and consistency (code generation, structured output, function calling). Use 0.7 to 1.0 for balanced generation (chat, instruction following). Use 1.0 to 1.5 for creative writing where diversity matters. Avoid temperatures above 1.5 for most production uses — very high temperatures produce outputs where low-probability tokens are sampled frequently, causing incoherence. Temperature also interacts with the model’s calibration: a well-calibrated model at temperature 1.0 is appropriately uncertain; if you have a model that produces overconfident logits, raising temperature slightly can improve output quality.
Top-k: Hard Truncation to the k Most Probable Tokens
Top-k sampling truncates the vocabulary to the k tokens with the highest logits before sampling. All other logits are set to negative infinity (zero probability after softmax). Top-k is a blunt instrument: at each generation step, exactly k tokens are eligible regardless of whether those k tokens collectively hold 99% or 55% of the probability mass. When the model is highly confident (a narrow, peaked distribution), a top-k of 50 still leaves 50 tokens eligible even though only 2 or 3 carry meaningful probability. Conversely, when the model is uncertain (a flat distribution), top-k of 50 may be too restrictive. This insensitivity to context is why top-p generally outperforms top-k when a single parameter is needed, but top-k is still useful as a hard upper bound when combined with top-p.
import torch
import torch.nn.functional as F

def top_k_filter(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Mask all logits outside the top-k to -inf (zero probability after softmax)."""
    if k == 0:
        return logits
    values, _ = torch.topk(logits, k=min(k, logits.size(-1)))
    min_top_k = values[-1]
    return logits.masked_fill(logits < min_top_k, float('-inf'))

# Example: observe how top-k behaves differently on peaked vs flat distributions
peaked = torch.tensor([5.0, 2.0, 1.5, 1.0, 0.8, 0.5])   # confident model
flat = torch.tensor([1.1, 1.0, 0.9, 0.8, 0.7, 0.6])     # uncertain model
for name, dist in [('peaked', peaked), ('flat', flat)]:
    filtered = top_k_filter(dist.clone(), k=3)
    probs = F.softmax(filtered, dim=-1)
    print(f"{name} top-3: {probs.numpy().round(3)}")
# Approximate output (zeros for the filtered tokens omitted):
# peaked top-3: [0.926, 0.046, 0.028]  (still very concentrated)
# flat top-3:   [0.367, 0.332, 0.301]  (roughly uniform over the top 3)
Top-p (Nucleus Sampling): Adaptive Vocabulary Truncation
Top-p sampling, also called nucleus sampling, selects the smallest set of tokens whose cumulative probability exceeds a threshold p. The vocabulary is sorted by probability descending, tokens are accumulated until the cumulative probability exceeds p, and all remaining tokens are discarded. At each step the nucleus size adapts to the shape of the distribution: when the model is confident, the nucleus might contain only 3–5 tokens; when uncertain, it might contain 100+. This adaptive behaviour is why top-p is generally preferred over top-k for general-purpose generation.
import torch
import torch.nn.functional as F

def top_p_filter(logits: torch.Tensor, p: float) -> torch.Tensor:
    """Nucleus sampling: keep the smallest set of tokens with cumulative prob >= p."""
    if p >= 1.0:
        return logits
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
    # Mark tokens for removal once cumulative prob exceeds p, then shift the mask right
    # so the token that crosses the threshold stays in the nucleus
    to_remove = cumulative_probs > p
    to_remove[1:] = to_remove[:-1].clone()
    to_remove[0] = False
    # Restore original ordering and mask the removed tokens
    indices_to_remove = to_remove.scatter(0, sorted_indices, to_remove)
    return logits.masked_fill(indices_to_remove, float('-inf'))

# Demonstrate adaptive nucleus size
peaked = F.log_softmax(torch.tensor([5.0, 2.0, 1.5, 0.5, 0.3, 0.1]), dim=-1)
flat = F.log_softmax(torch.tensor([1.1, 1.0, 0.9, 0.8, 0.7, 0.6]), dim=-1)
for name, logprobs in [('peaked', peaked), ('flat', flat)]:
    filtered = top_p_filter(logprobs.clone(), p=0.9)
    n_tokens = (filtered > float('-inf')).sum().item()
    print(f"{name} nucleus at p=0.9: {n_tokens} tokens in nucleus")
# peaked: 1 token in nucleus (the top token alone already covers p=0.9)
# flat: 6 tokens in nucleus (the model is uncertain, so every token in this toy vocabulary stays eligible)
Typical top-p values: 0.9 is a solid default for general-purpose chat. 0.95 gives slightly more diversity. 0.7 to 0.8 gives tighter, more focused outputs. Values below 0.6 can cause repetition because the nucleus is so small that the model repeatedly samples from a tiny set of tokens. When using top-p, many practitioners also set top-k as a safety bound (e.g., top-k=50, top-p=0.9) to prevent pathologically large nuclei from forming when the distribution is very flat.
Repetition Penalty and Min-p
Repetition penalty discourages the model from repeating tokens that have already appeared in the context. Tokens present in the input are divided (if their logit is positive) or multiplied (if negative) by the penalty factor before temperature scaling. A penalty of 1.0 is no penalty; 1.2 to 1.4 is a moderate penalty commonly used for open-ended generation to prevent repetitive loops. Penalties above 1.5 tend to over-correct, making the model avoid useful words that legitimately recur in the text. Min-p is a newer alternative to top-p that filters by setting a minimum probability threshold relative to the top token: any token with probability less than min-p times the top token's probability is excluded. This keeps the nucleus anchored to the model's current confidence level and performs well in benchmarks against top-p at similar settings.
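Min-p takes only a few lines to implement. The sketch below is illustrative; min_p_filter is a name chosen here, not a library function.

import torch
import torch.nn.functional as F

def min_p_filter(logits: torch.Tensor, min_p: float) -> torch.Tensor:
    """Keep tokens whose probability is at least min_p times the top token's probability."""
    probs = F.softmax(logits, dim=-1)
    threshold = min_p * probs.max()
    return logits.masked_fill(probs < threshold, float('-inf'))

# With min_p around 0.05-0.1: on a confident step the threshold is high and few tokens
# survive; on an uncertain step the top probability is low, so more tokens survive.
logits = torch.tensor([3.2, 2.8, 1.5, 0.4, -0.3, -1.2])
surviving = (min_p_filter(logits, min_p=0.1) > float('-inf')).sum().item()
print(surviving)  # 3 tokens survive for these logits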
Recommended Settings by Task
For code generation and structured output: temperature 0.0–0.2, top-p 1.0, top-k 0 (effectively greedy or near-greedy). The task has a correct answer and diversity is counterproductive. For function calling and JSON extraction: temperature 0.0, no sampling parameters needed — use greedy decoding. For instruction-following chat: temperature 0.7, top-p 0.9, no top-k. This balances coherence with enough diversity to avoid repetitive phrasing. For creative writing: temperature 1.0–1.2, top-p 0.95, repetition penalty 1.1–1.2. For code completion in an IDE (where partial context matters): temperature 0.2–0.4, top-p 0.95 — lower temperature than chat to keep suggestions grounded. The most common mistake is using the same sampling parameters across all tasks; production systems should have different generation configs per endpoint based on the nature of the task.
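One way to act on this is to keep a per-task preset table and look it up per endpoint. The task names and exact values below are illustrative defaults drawn from the guidance above, not a fixed prescription.

# Illustrative per-task sampling presets (keys and values are examples, not a standard)
SAMPLING_PRESETS = {
    "code_generation":  {"do_sample": False},                                   # greedy
    "json_extraction":  {"do_sample": False},                                   # greedy
    "chat":             {"do_sample": True, "temperature": 0.7, "top_p": 0.9},
    "creative_writing": {"do_sample": True, "temperature": 1.1, "top_p": 0.95,
                         "repetition_penalty": 1.15},
    "ide_completion":   {"do_sample": True, "temperature": 0.3, "top_p": 0.95},
}

def generation_kwargs(task: str) -> dict:
    """Return the sampling configuration for an endpoint, falling back to chat defaults."""
    return SAMPLING_PRESETS.get(task, SAMPLING_PRESETS["chat"])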
How Sampling Parameters Interact in Practice
Temperature, top-k, and top-p are not independent: they interact in ways that can produce unexpected results when set carelessly. The most important interaction to understand is that temperature is applied before top-p and top-k. This means a low temperature sharpens the distribution first, which then reduces the effective size of the nucleus when top-p is applied. At temperature 0.3 and top-p 0.9, the nucleus will typically contain only 1–3 tokens because the sharpened distribution concentrates probability mass very quickly. Conversely, a high temperature flattens the distribution, which means top-p 0.9 will produce a large nucleus — potentially hundreds of tokens — before the cumulative probability threshold is reached. If you want tight, deterministic outputs, reducing temperature is more effective than reducing top-p, because temperature directly controls the entropy of the distribution. Top-p's role is to serve as a safety net against the model generating extremely low-probability tokens under any temperature setting.
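The interaction is easy to verify numerically. The sketch below reuses the top_p_filter function from the nucleus sampling section on a toy logit vector; the exact counts depend on the logits, but the trend holds: lower temperature means a smaller nucleus at the same top-p.

import torch

# Reuses top_p_filter defined in the nucleus sampling section above
logits = torch.tensor([2.0, 1.8, 1.5, 1.2, 0.9, 0.5, 0.1, -0.4])
for T in [0.3, 0.7, 1.0, 1.5]:
    filtered = top_p_filter(logits / T, p=0.9)
    nucleus_size = (filtered > float('-inf')).sum().item()
    print(f"T={T:.1f} -> nucleus size at top_p=0.9: {nucleus_size}")
# Approximate output for these toy logits: T=0.3 -> 3, T=0.7 -> 5, T=1.0 -> 6, T=1.5 -> 7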
Top-k and top-p can be combined, in which case both filters apply and the final candidate set is their intersection, so whichever filter is more restrictive determines the outcome. Setting top-k 50 and top-p 0.9 means the model first keeps the top 50 tokens (by top-k), then within those 50 further restricts to the smallest set summing to 90% probability (by top-p). This combined setting is the default in many production deployments because it bounds worst-case behaviour: even if top-p would select 200 tokens on a very flat distribution, top-k 50 prevents it. The overhead of computing both filters is negligible compared to the cost of the forward pass.
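Chaining the two filters from the earlier sections is a one-liner; top_k_filter and top_p_filter here are the helper functions defined above, not library calls.

import torch

def combined_filter(logits: torch.Tensor, k: int = 50, p: float = 0.9) -> torch.Tensor:
    """Apply top-k as a hard bound, then top-p adaptively within the surviving tokens."""
    return top_p_filter(top_k_filter(logits, k), p)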
Sampling in Batch Inference
When you run LLM inference over a batch of requests with different sampling parameters, each request can have its own temperature, top-p, and top-k values. Modern inference servers like vLLM apply sampling parameters per-sequence at the logit processing stage, before token selection. The logit tensor is shaped (batch_size, vocab_size) and temperature scaling, top-k, and top-p are applied as batched operations — temperature is a per-sequence scalar broadcast over the vocab dimension, top-k and top-p filtering are applied per row. This means there is no meaningful throughput penalty for using different sampling settings across requests in the same batch, though very high top-k values (thousands) do require sorting more of the vocabulary per step.
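A minimal shape-level sketch of per-sequence temperature scaling over a batched logit tensor looks like this; it illustrates the broadcasting, not any particular server's implementation.

import torch
import torch.nn.functional as F

batch_size, vocab_size = 4, 128_256
logits = torch.randn(batch_size, vocab_size)            # (batch_size, vocab_size)
temperatures = torch.tensor([0.2, 0.7, 1.0, 1.3])       # one temperature per sequence

# Broadcast each sequence's temperature over the vocab dimension
scaled = logits / temperatures.unsqueeze(-1)             # (batch_size, vocab_size)
probs = F.softmax(scaled, dim=-1)
next_tokens = torch.multinomial(probs, num_samples=1)    # (batch_size, 1), one token per row
# Top-k and top-p filtering would similarly be applied row by row before sampling.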
One subtle issue in batch inference with sampling: different sequences in the same batch can take different lengths due to different stopping criteria (e.g., one sequence reaches an end-of-sequence token early). Inference servers handle this via sequence packing and masking, but the important point for production systems is that sampling randomness means you cannot predict sequence length in advance. This is why latency SLOs for sampled generation are typically defined in terms of the first token latency and per-token latency rather than total latency — total latency depends on the sampled sequence length, which is non-deterministic unless you use greedy decoding with a fixed max_tokens.
Setting Sampling Parameters via the Transformers Library
In Hugging Face Transformers, sampling parameters are set in the GenerationConfig object passed to model.generate(). Temperature requires setting do_sample=True; without it, temperature has no effect because greedy decoding is used by default regardless of the temperature value. This is a common source of confusion — setting temperature=0.7 without do_sample=True produces greedy decoding, not temperature-0.7 sampling.
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
inputs = tokenizer("The capital of France is", return_tensors="pt")

# Greedy decoding (deterministic)
greedy_cfg = GenerationConfig(max_new_tokens=50)
greedy_out = model.generate(**inputs, generation_config=greedy_cfg)

# Temperature sampling (requires do_sample=True)
sample_cfg = GenerationConfig(
    max_new_tokens=50,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.1,
)
sampled_out = model.generate(**inputs, generation_config=sample_cfg)

print(tokenizer.decode(greedy_out[0]))
print(tokenizer.decode(sampled_out[0]))
Debugging Degenerate Outputs with Sampling Parameters
Two types of output failure are directly attributable to sampling parameters and are worth knowing how to diagnose. The first is repetitive loops: the model repeatedly generates the same phrase or sentence. This is caused by the model's probability distribution collapsing onto a small set of high-probability tokens, which sampling keeps re-selecting. The fix is to increase repetition penalty (1.2–1.4), raise temperature slightly, or switch from top-k to top-p if top-k is set too low. The second failure mode is incoherent or hallucinated outputs that jump between topics. This is caused by temperature being too high or top-p being too large, allowing low-probability tokens to be sampled that steer the generation in random directions. The fix is to reduce temperature toward 0.7–0.8 or tighten top-p to 0.85–0.90.
For production applications, it is worth logging the distribution entropy at each generation step when debugging bad outputs — high entropy steps indicate where the model was most uncertain and therefore most likely to have gone wrong. Most inference libraries do not expose per-step entropy directly, but you can approximate it by logging the probability of the sampled token: consistently low probabilities (below 0.05 for the selected token) indicate a very flat distribution and suggest either overly high temperature or a genuinely ambiguous context that may benefit from retrieval augmentation or better context framing.
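With Transformers you can approximate this diagnostic by asking generate() to return the per-step scores. The sketch below assumes the model, tokenizer, and inputs from the previous section; the scores it inspects are the processed logits after temperature, top-k, and top-p have been applied.

import torch
import torch.nn.functional as F

# Assumes `model`, `tokenizer`, and `inputs` from the previous section
out = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    return_dict_in_generate=True,
    output_scores=True,  # return the processed logits for each generated step
)
new_tokens = out.sequences[0, inputs["input_ids"].shape[1]:]  # strip the prompt
for step, (token_id, step_scores) in enumerate(zip(new_tokens, out.scores)):
    probs = F.softmax(step_scores[0], dim=-1)
    token_prob = probs[token_id].item()
    nonzero = probs > 0
    entropy = -(probs[nonzero] * probs[nonzero].log()).sum().item()
    print(f"step {step:3d} | p(sampled token)={token_prob:.3f} | entropy={entropy:.3f}")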
A useful discipline for production LLM applications is to treat sampling parameters as configuration rather than code — store them in a config file or database keyed by application type, and make them easy to adjust without redeployment. When a user reports bad output quality, the first diagnostic step is to check what sampling configuration was used and whether it is appropriate for the task. Many quality complaints resolve quickly once the correct temperature and top-p for the task are identified and applied consistently.