Prefix Tuning vs Prompt Tuning vs P-Tuning: Soft Prompt Methods Compared

Prefix tuning, prompt tuning, and P-Tuning are three parameter-efficient fine-tuning (PEFT) methods that adapt large language models by training a small set of continuous “soft” tokens instead of updating the model weights. All three emerged in 2021–2022, roughly contemporaneously with LoRA, as alternatives to full fine-tuning, and all three remain relevant in specific situations — particularly when you want task-specific adaptation without modifying the base model at all, making it possible to serve many tasks from a single frozen model instance. Despite superficial similarities, they differ substantially in where the soft tokens are inserted, how they’re optimized, and what tradeoffs they impose.

The Core Idea: Soft Prompts

Traditional prompting prepends discrete text tokens to the input — hard prompts like “Translate to French:” that are fixed and human-readable. Soft prompting replaces or augments these with continuous vectors in the embedding space that are optimized by gradient descent. These learnable vectors have the same dimensionality as token embeddings but don’t correspond to any real token — they’re arbitrary floating-point vectors that the optimizer shapes to steer the model’s behavior. During inference, the soft tokens are prepended to (or interleaved with) the input, and the frozen model processes them as if they were normal embeddings. The entire task adaptation is encoded in these vectors, which are typically 10–100 tokens and require only kilobytes to store.
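The mechanics fit in a few lines of PyTorch. This is a minimal illustrative sketch of the idea, not any particular library’s implementation; the module name and initialization scale are assumptions:

import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable vectors prepended to token embeddings (illustrative sketch)."""
    def __init__(self, num_virtual_tokens, hidden_dim):
        super().__init__()
        # same dimensionality as token embeddings, but tied to no real token
        self.prompt = nn.Parameter(torch.randn(num_virtual_tokens, hidden_dim) * 0.02)

    def forward(self, token_embeds):
        # token_embeds: (batch, seq_len, hidden_dim)
        prompt = self.prompt.unsqueeze(0).expand(token_embeds.size(0), -1, -1)
        return torch.cat([prompt, token_embeds], dim=1)

Only the prompt parameter is trained; the frozen model consumes the concatenated sequence exactly as it would ordinary embeddings.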

Prompt Tuning

Prompt tuning (Lester et al., 2021) is the simplest of the three. It prepends a fixed number of learnable soft tokens to the input sequence at the embedding layer only. The model weights are fully frozen; only the soft token embeddings are trained. The soft tokens are shared across all examples in a task and act as a task-specific prefix that conditions the model’s behavior.

The key finding from the original paper is that prompt tuning scales well with model size. For small models (under 1B parameters), it underperforms full fine-tuning significantly. For models above ~10B parameters, it matches full fine-tuning performance while updating only tens of thousands of parameters (20 virtual tokens at hidden size 4096 come to 81,920 trainable values). This makes it particularly attractive for very large frozen models where full fine-tuning would be prohibitively expensive, and for multi-tenant serving where many tasks share a single base model and each task’s adaptation is just a small vector loaded at inference time.

Training prompt tuning with HuggingFace PEFT:

from peft import PromptTuningConfig, PromptTuningInit, get_peft_model, TaskType
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,
    prompt_tuning_init_text="Classify the sentiment of the following review:",
    num_virtual_tokens=20,
    tokenizer_name_or_path="meta-llama/Meta-Llama-3-8B",
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# trainable params: 81,920 || all params: 8,030,343,168 || trainable%: 0.0010

Initialization matters significantly. Initializing from a relevant text string (PromptTuningInit.TEXT) rather than from random vectors converges faster and reaches better final performance, particularly on smaller datasets. The soft tokens are prepended to every input during both training and inference, with the input sequence shifted to accommodate them within the model’s context window.

Prefix Tuning

Prefix tuning (Li and Liang, 2021) extends the soft prompt idea to all layers of the transformer, not just the input embedding layer. Instead of prepending learned vectors only at the input, prefix tuning prepends learned key-value pairs to the attention computation at every transformer layer. These prefix key-value pairs are separate from the input tokens: the input positions attend to them at every layer, so the prefix steers attention throughout the network without occupying input positions itself.

This is a meaningfully different inductive bias. In prompt tuning, the soft tokens can only influence the model through the attention mechanism at the first layer — deeper layers see only the model’s own intermediate representations, not the soft tokens directly. In prefix tuning, the soft prefix is present in every layer’s key-value cache, giving it direct influence over the attention patterns throughout the entire forward pass. This makes prefix tuning substantially more expressive than prompt tuning, at the cost of more parameters (key and value prefix vectors at every layer vs. a single sequence of vectors at the embedding layer).

from peft import PrefixTuningConfig, get_peft_model, TaskType

# assumes `model` is a freshly loaded base model, not the prompt-tuned wrapper above
peft_config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=30,         # prefix length per layer
    prefix_projection=True,        # train via the reparameterization MLP (see below)
    encoder_hidden_size=512,       # MLP hidden size; only used when prefix_projection=True
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# the prefix saved per task is 30 tokens × 32 layers × 2 (K and V) × 4096 ≈ 7.9M values
# for a 7B-class model; the much larger projection MLP is discarded after training

A practical detail: training prefix tuning directly is unstable because the per-layer prefix parameters are unconstrained high-dimensional vectors. The original paper addresses this with a reparameterization trick — the actual trainable parameters are a smaller MLP that maps to the per-layer prefix vectors. At inference time, the MLP is discarded and the computed prefix vectors are saved directly. This stabilizes training at the cost of slightly more complexity during setup.
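Conceptually, the reparameterization looks like the sketch below; the shapes assume 32 layers and hidden size 4096, and the structure mirrors (but does not reproduce) PEFT’s prefix encoder:

import torch
import torch.nn as nn

num_tokens, hidden, num_layers, mlp_hidden = 30, 4096, 32, 512
prefix_encoder = nn.Sequential(
    nn.Embedding(num_tokens, hidden),                # one row per virtual token
    nn.Linear(hidden, mlp_hidden),
    nn.Tanh(),
    nn.Linear(mlp_hidden, num_layers * 2 * hidden),  # K and V prefix per layer
)
prefixes = prefix_encoder(torch.arange(num_tokens))  # (30, 32 * 2 * 4096)
# after training, these prefix values are computed once and saved; the MLP is discarded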

P-Tuning

P-Tuning (Liu et al., 2021) and its successor P-Tuning v2 (Liu et al., 2022) take a different approach. Rather than prepending soft tokens to the input, P-Tuning interleaves learnable tokens with the natural language prompt at chosen positions, using a small LSTM or MLP to model dependencies between the soft tokens. The motivation is that natural language prompts have internal structure — the soft tokens that represent “city” and “capital” in a fact retrieval prompt are semantically related — and modeling these dependencies helps the optimizer find better solutions.

P-Tuning v2 is the more practically relevant version. It drops the LSTM/MLP dependency model, which was found to add little benefit over direct optimization, and instead applies prefix-tuning-style soft tokens at every transformer layer. Combined with changes to the training setup (such as replacing the verbalizer with a standard classification head), this makes per-layer soft prompts effective at smaller model scales for NLU tasks like NER, question answering, and classification, precisely the regime where the original prompt tuning underperformed significantly.
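In HuggingFace PEFT, the original P-Tuning is implemented as PromptEncoderConfig; P-Tuning v2’s per-layer approach is structurally what PrefixTuningConfig provides. A minimal setup (the hyperparameters here are illustrative):

from peft import PromptEncoderConfig, get_peft_model, TaskType

peft_config = PromptEncoderConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,
    encoder_reparameterization_type="MLP",  # the original paper used an LSTM; "LSTM" also works
    encoder_hidden_size=512,
)
model = get_peft_model(model, peft_config)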

Direct Comparison

The three methods sit at different points on a tradeoff curve between expressiveness and parameter count. Prompt tuning is the most lightweight — roughly 10K–100K parameters — and works well only with very large models (10B+) on tasks where the base model already has strong relevant capabilities. Prefix tuning is more expressive and works reasonably well across model sizes (1B+), at the cost of a few million parameters (scaling with prefix length, layer count, and hidden size) and the reparameterization complexity. P-Tuning v2 is the most expressive of the three and is the best choice for NLU tasks at smaller model scales, where prompt tuning and prefix tuning both degrade significantly.
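These counts are easy to derive from the model shape. A quick check, assuming 32 layers and hidden size 4096 (a 7B-class model):

hidden, layers = 4096, 32
prompt_tuning = 20 * hidden                 # 81,920 params: embedding-layer tokens only
prefix_tuning = 30 * layers * 2 * hidden    # 7,864,320 params: K and V prefixes at every layer
print(prompt_tuning, prefix_tuning)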

In practice, all three have been largely supplanted by LoRA for most fine-tuning use cases. LoRA adapts the weight matrices directly via low-rank decomposition and achieves better task performance at comparable parameter counts across nearly all model sizes and task types. The remaining niche for soft-prompt methods is multi-task serving from a single frozen model: because prompt/prefix tuning doesn’t modify the base model at all, you can serve N tasks by switching a small task-specific vector (a few MB at most) while keeping one model copy in memory. With LoRA, you need to merge or switch adapters, which adds serving complexity. If your primary constraint is serving infrastructure cost — single model instance, many task variants, low latency — soft-prompt methods are worth considering. For single-task fine-tuning quality, use LoRA or QLoRA.

When to Use Each

Use prompt tuning when you have a model above 10B parameters, a large labeled dataset, and need to serve many task variants from a single model endpoint. The essentially zero parameter overhead per task makes it operationally attractive at scale.

Use prefix tuning when you need better task performance than prompt tuning provides at smaller model scales (1B–10B), can afford a few million additional parameters per task, and still want the shared-frozen-model serving benefit.

Use P-Tuning v2 specifically for NLU classification and extraction tasks (NER, relation extraction, QA) at smaller model scales where prefix tuning also underperforms — it was designed for exactly this regime and outperforms the others there.

For everything else — instruction following, code generation, domain adaptation, conversational tasks — LoRA or QLoRA will outperform all three soft-prompt methods at comparable or lower parameter counts and is simpler to implement and deploy.

Training Stability and Practical Considerations

All three soft-prompt methods share a training stability problem that doesn’t exist to the same degree with LoRA. Because the soft tokens are a very small number of parameters relative to the frozen model, all of the task’s gradient signal is concentrated in them, and training can oscillate early on. The standard recipe is a learning rate far higher than you would use for LoRA (PEFT’s soft-prompt examples use values around 3e-2, versus the 1e-4 to 5e-4 typical for LoRA), a slow warmup over the first 5–10% of training steps, and a cosine schedule. Prompt tuning in particular is sensitive to initialization — random initialization often leads to slow convergence or poor local minima, whereas initializing from task-relevant vocabulary tokens (or the full task description string with PromptTuningInit.TEXT) gives the optimizer a much better starting point. For prefix tuning, the reparameterization MLP is important for stability; optimizing the per-layer prefix vectors directly is markedly less stable and tends to underperform.
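As a concrete starting point, a Trainer configuration along these lines is plausible; the specific values are assumptions to tune per task, not canonical settings:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="prompt-tuning-out",
    learning_rate=3e-2,              # soft prompts want a far higher LR than LoRA
    warmup_ratio=0.06,               # slow warmup over ~6% of steps
    lr_scheduler_type="cosine",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    logging_steps=10,
)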

Dataset size requirements differ across the three methods. Prompt tuning with large models (10B+) can work surprisingly well with as few as 1,000 labeled examples, because the large model’s strong priors do most of the work and the soft tokens only need to steer the existing capability. At smaller model scales, all three methods generally need more data than LoRA to reach competitive performance — the soft tokens are a less expressive adaptation mechanism when the model has weaker priors to build on. If your labeled dataset has fewer than 500 examples, LoRA with a small rank (r=4 or r=8) will outperform soft-prompt methods across virtually all model sizes and tasks.

Inference Overhead

At inference time, soft-prompt methods add a small but measurable overhead. The soft tokens occupy positions in the context window — 20 prompt tuning tokens mean 20 fewer positions available for the actual input and generated output. For tasks with very long inputs or when you’re operating near the context limit, this matters. For prefix tuning, the per-layer prefixes add to the KV cache size for every generation step: 30 prefix tokens with 32 layers on a model with hidden dimension 4096 adds roughly 30 × 32 × 4096 × 2 (K and V) × 2 bytes (bfloat16) ≈ 15MB to the KV cache per sequence. For a high-throughput serving scenario with hundreds of concurrent requests, this adds up to gigabytes of additional KV cache memory.
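The arithmetic, as a quick sanity check:

# KV cache overhead of the per-layer prefixes (shapes as in the text above)
tokens, layers, hidden = 30, 32, 4096
kv, bytes_per = 2, 2                  # K and V; bfloat16
overhead = tokens * layers * hidden * kv * bytes_per
print(f"{overhead / 2**20:.1f} MiB per sequence")  # 15.0 MiB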

Prompt tuning’s soft tokens are processed like any other input tokens, so they too add entries to the KV cache at every layer: 20 virtual tokens cost roughly what 20 extra input tokens would. The per-token cache cost is therefore similar across the methods; prompt tuning’s inference advantage lies in per-task storage (kilobytes rather than megabytes) and in typically using shorter prefixes. When you’re optimizing for serving throughput at scale, the choice between these methods should factor in KV cache memory as carefully as it factors in parameter storage.

The Multi-Task Serving Case

The scenario where soft-prompt methods genuinely win over LoRA is large-scale multi-task serving from a single frozen model. Imagine you need to serve 1,000 different task variants (domain-specific classifiers, style adapters, format-specific extractors) from a single LLM backend. With LoRA, each adapter is 10–100MB, and switching adapters requires loading different weight deltas — either merging into the base model (which can’t happen concurrently) or maintaining separate adapter modules (which adds memory and latency). With prompt tuning, each task adaptation is 10–100KB of soft token embeddings, and switching tasks at inference time is just a lookup and prepend operation with essentially zero overhead. The base model processes all requests identically; only the prepended vectors differ. This property is genuinely useful for LLM API providers, multi-tenant serving platforms, and any setup where the number of task variants is large and the serving infrastructure is shared. Outside this scenario, the expressiveness advantages of LoRA make it the default choice for nearly all fine-tuning workloads.
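A sketch of that serving pattern, assuming each task’s soft prompt is stored as a plain tensor (the file names and helper function are hypothetical):

import torch

# hypothetical per-task store: task_id -> (num_virtual_tokens, hidden_dim) tensor
soft_prompts = {
    "sentiment": torch.load("prompts/sentiment.pt"),
    "ner": torch.load("prompts/ner.pt"),
}

def build_inputs_embeds(task_id, input_ids, model):
    # embed the real tokens with the frozen model's own embedding layer
    tok_embeds = model.get_input_embeddings()(input_ids)             # (B, T, H)
    prompt = soft_prompts[task_id].to(tok_embeds.device, tok_embeds.dtype)
    prompt = prompt.unsqueeze(0).expand(tok_embeds.size(0), -1, -1)  # (B, P, H)
    return torch.cat([prompt, tok_embeds], dim=1)

# every request runs through the same frozen model; only the prepended vectors differ:
# model.generate(inputs_embeds=build_inputs_embeds("ner", input_ids, model), max_new_tokens=64)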
