Fine-tuning adapts a pre-trained model to a specific task or domain. The question is not whether to fine-tune — it’s how much of the model to update, and at what cost. Full fine-tuning updates every parameter, which produces the best possible adaptation but requires GPU memory proportional to the full model. LoRA and QLoRA are parameter-efficient alternatives that update a small number of additional parameters while leaving the base model frozen, enabling fine-tuning on hardware that couldn’t run full fine-tuning at all. Understanding what each approach actually does — not just that “LoRA is more efficient” — lets you make the right choice for your task, your hardware, and your quality requirements.
Full Fine-Tuning
Full fine-tuning updates all model parameters via gradient descent on your training data. Every weight in the network is eligible to change. This gives the optimizer maximum flexibility to adapt the model to your task distribution, and it produces the highest quality adaptation for tasks that differ substantially from the pre-training distribution.
The memory cost is substantial. Full fine-tuning in bf16 with Adam requires approximately 16 bytes per parameter: 2 bytes for weights, 2 bytes for gradients, 8 bytes for Adam first and second moments, and 4 bytes for the fp32 master weight copy. For a 7B model, that is 112GB before activations. This requires multiple high-memory GPUs and either DDP (if the model fits per-GPU) or FSDP/DeepSpeed for sharding.
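The arithmetic is simple enough to script as a sanity check. The only inputs are the parameter count and the per-parameter byte costs listed above; activations and framework overhead are excluded:

```python
params = 7e9  # 7B-parameter model

bytes_per_param = (
    2    # bf16 weights
    + 2  # bf16 gradients
    + 8  # Adam first and second moments (fp32)
    + 4  # fp32 master weight copy
)

print(f"{params * bytes_per_param / 1e9:.0f} GB")  # -> 112 GB
```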
Full fine-tuning also risks catastrophic forgetting: the model's general capabilities degrade as it overfits to the fine-tuning distribution, particularly with small datasets. Regularization techniques like weight decay and early stopping mitigate this, but the risk still warrants careful evaluation on tasks beyond the fine-tuning target. For instruction-following models, full fine-tuning on a narrow task often noticeably degrades general instruction-following quality.
When is full fine-tuning the right choice? When your task is far from the pre-training distribution, when you have enough data (tens of thousands of examples or more), when you have the hardware, and when you need the absolute best quality on the target task without constraints on preserving the base model’s general capabilities.
LoRA: Low-Rank Adaptation
LoRA, introduced by Hu et al. in 2021, rests on a key observation: the weight updates learned during fine-tuning have low intrinsic rank. Rather than updating a full weight matrix W directly, LoRA freezes W and adds a parallel branch with two small matrices: the effective weight becomes W + BA, where B is d×r and A is r×k, with rank r much smaller than d or k. Only A and B are trained. The original W is never modified.
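To make the shape of the idea concrete, here is a minimal sketch of a LoRA-adapted linear layer in PyTorch. This illustrates the math, not the PEFT implementation; the LoRALinear name and the init scale are our own choices:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper: effective weight is W + (alpha/r) * B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze W (and bias); they are never modified
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # small random init
        self.B = nn.Parameter(torch.zeros(d, r))  # zero init: adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the low-rank update, computed as x A^T B^T to avoid
        # materializing the full d x k product B @ A.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```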
The rank r is the key hyperparameter. With r=8 on a 7B model, LoRA adds roughly 4–8 million trainable parameters — less than 0.1% of total. Because only A and B are trained, gradient computation and optimizer states are only needed for these small matrices. The memory saving is dramatic: a 7B model requiring 112GB for full fine-tuning can be LoRA fine-tuned in around 16GB, feasible on a single A100 40GB or a consumer RTX 4090 24GB with gradient checkpointing.
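The parameter count is easy to verify. Assuming a Llama-7B-shaped model (32 layers, hidden size 4096, square projections) with LoRA on the query and value projections only:

```python
d = k = 4096   # assumed hidden size of a 7B Llama-style model
layers = 32    # assumed layer count
r = 8
modules = 2    # q_proj and v_proj per layer

per_module = r * (d + k)              # A is r x k, B is d x r
total = per_module * modules * layers
print(f"{total:,} trainable parameters")   # 4,194,304
print(f"{total / 7e9:.3%} of a 7B model")  # ~0.060%
```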
LoRA is typically applied to the query and value projection matrices in attention layers, though current best practice applies it to all linear layers (query, key, value, and output projections, plus the MLP layers) for better quality. The PEFT library's target_modules="all-linear" setting handles this automatically for supported architectures. The alpha hyperparameter scales the LoRA update: the effective delta is (alpha/r) × BA. Setting alpha equal to r is a reasonable default; some practitioners use alpha = 2r for stronger adaptation.
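In code, assuming a recent PEFT release that supports the all-linear shorthand (the checkpoint name is just an example):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=8,                          # rank of the update matrices
    lora_alpha=8,                 # alpha = r gives a scaling factor of 1
    target_modules="all-linear",  # adapt every linear layer, not just q/v
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # reports trainable vs. total parameter counts
```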
A key practical advantage of LoRA is mergeability. After training, you can merge the adapter back into the base model: W_new = W + BA. The merged model is identical in size and architecture to the base, runs at the same speed with no adapter overhead, and is production-ready. Alternatively, keep adapters separate and swap them at inference time if you need multiple task-specific adapters sharing one base model — an efficient pattern for multi-tenant serving.
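Keeping adapters separate and swapping them looks roughly like this with PEFT; the adapter paths and names here are placeholders:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# One frozen base model, multiple task adapters swapped at request time.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base, "adapters/summarize", adapter_name="summarize")
model.load_adapter("adapters/classify", adapter_name="classify")

model.set_adapter("summarize")  # route the next requests through this adapter
# ... generate ...
model.set_adapter("classify")   # switch tasks without reloading the base model
```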
LoRA quality versus full fine-tuning depends on the task. For instruction following, chat adaptation, and domain adaptation with sufficient examples, LoRA quality is close to full fine-tuning and often indistinguishable in production evaluation. For tasks requiring deep structural adaptation — very narrow domains with specialized vocabulary far from pre-training, or complex multi-step reasoning chains — full fine-tuning can show meaningful quality advantages.
QLoRA: Quantized LoRA
QLoRA, introduced by Dettmers et al. in 2023, extends LoRA by quantizing the frozen base model weights to 4-bit NormalFloat (NF4), a data type optimized for normally distributed weights. The LoRA adapter matrices A and B remain in bf16. During the forward pass, the frozen 4-bit weights are dequantized to bf16 on the fly for computation, then discarded. The base model’s memory footprint drops by roughly 4x compared to bf16, while the LoRA adapters are small enough that their bf16 storage is negligible.
The practical impact is large. QLoRA makes it feasible to fine-tune a 7B model on a single 24GB consumer GPU (RTX 3090 or 4090), a 13B model on a 48GB GPU, and a 70B model on a single 80GB A100. The original paper demonstrated that a 65B model fine-tuned with QLoRA on a single A100 matched the quality of full fine-tuning on instruction following benchmarks — the result that made QLoRA widely adopted practically overnight.
QLoRA adds two more techniques. Double quantization quantizes the quantization constants themselves, saving roughly 0.37 additional bits per parameter. Paged optimizers use NVIDIA unified memory to page optimizer states between GPU and CPU when GPU memory is under pressure, preventing OOM during long training runs with memory spikes. Both are available through bitsandbytes and the Hugging Face integration, and once enabled they are transparent to the user.
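Wired together in the Hugging Face stack, a QLoRA setup looks roughly like this; the checkpoint name is an example, and the paged optimizer is selected through the optim argument:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments

# Frozen base in 4-bit NF4 with double quantization, dequantized to bf16 for compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",        # NormalFloat4
    bnb_4bit_use_double_quant=True,   # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",       # example checkpoint
    quantization_config=bnb_config,
)

# Paged optimizer states to absorb memory spikes during training.
args = TrainingArguments(output_dir="qlora-out", optim="paged_adamw_32bit")
# ... attach a LoraConfig with get_peft_model as usual, then train.
```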
The throughput cost of QLoRA is real. Dequantizing 4-bit weights to bf16 on every forward and backward pass adds compute overhead; QLoRA trains roughly 30–50% slower than bf16 LoRA at the same batch size. For single-GPU fine-tuning this is acceptable; the alternative is not running at all. For multi-GPU setups where you have enough memory to run bf16 LoRA, the throughput loss makes QLoRA less attractive.
Practical LoRA Hyperparameters
Rank (r) controls adapter capacity. r=8 is a solid default for most fine-tuning tasks. For tasks requiring more structural adaptation, r=16 or r=32 improves quality. Going beyond r=64 rarely helps and begins to approach the cost of updating more parameters directly. For simple tasks such as format adaptation or style transfer, r=4 is sufficient.
Target modules: applying LoRA to all linear layers consistently outperforms applying it only to attention projections. Use target_modules="all-linear" in PEFT for supported architectures and don't overthink this.
Learning rate for LoRA should be higher than for full fine-tuning — typically 1e-4 to 3e-4 — because only the small adapter matrices are updating. With a cosine schedule and 3–5% warmup this works well across most tasks. If training loss doesn’t decrease in the first 50–100 steps, increase it; if loss is noisy, decrease it.
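As a starting configuration, the guidance above translates to something like the following; the specific values are the assumptions just stated, not universal defaults:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="lora-out",
    learning_rate=2e-4,            # within the 1e-4 to 3e-4 range above
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,             # 3% warmup
    num_train_epochs=3,            # assumed; tune per task
    per_device_train_batch_size=4,
    bf16=True,
    logging_steps=10,              # watch the first 50-100 steps for a loss decrease
)
```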
Dropout (lora_dropout in PEFT) adds regularization useful for small datasets under 1,000 examples; set to 0.05–0.1 in that case. For larger datasets it has minimal effect and can be set to 0.
Head-to-Head Comparison
Memory (7B model): Full fine-tuning needs roughly 112GB (bf16 weights + Adam states). LoRA in bf16 needs around 16GB. QLoRA needs 10–12GB on a single GPU — within reach of an RTX 4090.
Training throughput: Full fine-tuning is fastest per step when hardware is unconstrained. LoRA is close, with minor overhead from the adapter branch. QLoRA is 30–50% slower than LoRA due to dequantization on every pass.
Quality: Full fine-tuning is the ceiling. LoRA is typically within 1–3% on standard benchmarks for instruction and task fine-tuning. QLoRA matches LoRA quality in most evaluations — NF4 quantization of the frozen base model does not meaningfully degrade adapter learning.
Serving after training: all three approaches can yield a standalone model that serves identically; full fine-tuning produces one directly, while LoRA and QLoRA produce one after merging. LoRA adapters kept separate enable efficient multi-adapter serving on one base model. Full fine-tuning requires a separate model copy per task.
The Decision Framework
Use QLoRA if you are fine-tuning on a single consumer GPU, or on a single A100 80GB with a 7B+ model. It's the only practical option under these constraints, and the quality loss versus LoRA is minimal. It's also the right choice for rapid experimentation where you want to iterate on many adapter configurations cheaply.
Use LoRA if you have enough GPU memory to load the base model in bf16 and throughput matters. LoRA trains faster than QLoRA, and its adapters are trained against the full-precision base weights, so merging introduces no quantization mismatch. For production fine-tuning pipelines running many jobs, LoRA's throughput advantage compounds significantly over time.
Use full fine-tuning if you have the hardware, a large dataset, and a task substantially different from the pre-training distribution. Also necessary when you need to update the model’s vocabulary, modify the embedding layer, or add new architectural components — things LoRA cannot do on frozen weights.
Don’t default to full fine-tuning because it sounds more thorough. For the majority of practical fine-tuning tasks in 2026 — instruction tuning, domain adaptation, style adaptation, task-specific formatting — LoRA produces quality indistinguishable from full fine-tuning in production evaluation, at a fraction of the cost. Start with LoRA, evaluate on your task, and only escalate to full fine-tuning if there’s a measurable gap worth the hardware investment.
Common Mistakes to Avoid
Training LoRA with too low a learning rate is the most common mistake. Because only a tiny fraction of parameters are updating, the gradient signal needs to be stronger than it would be for full fine-tuning. If you port a full fine-tuning learning rate (1e-5 or 2e-5) directly to LoRA without adjustment, training will be sluggish and underfit. Start at 1e-4 and adjust from there.
Using too small a rank for a complex task is the second most common issue. An r=4 adapter trying to teach a model legal document analysis from scratch will hit a quality ceiling quickly. If your validation loss plateaus well above an acceptable level and you’ve already tuned learning rate and epochs, try doubling the rank before concluding that LoRA can’t match full fine-tuning for your task.
Forgetting to evaluate on tasks outside the fine-tuning distribution catches teams by surprise. Even with LoRA, there can be subtle shifts in model behavior on tasks unrelated to fine-tuning — particularly if the training data has a strong stylistic signal. Always include a regression benchmark covering general capabilities alongside your task-specific evaluation, regardless of which fine-tuning approach you use.
Rank Selection and Target Modules
One of the most consequential decisions in LoRA fine-tuning is which modules to apply adapters to and at what rank. The default in most frameworks is to apply LoRA to the query and value projection matrices (q_proj and v_proj) in every attention layer. This is a reasonable starting point but not always optimal. Applying LoRA to all linear layers, including the key projection, output projection, and the up/down projections in the FFN, typically achieves better task performance at the same total parameter count (for example, a lower rank spread across more matrices), because the broader adapter coverage lets the model make targeted adjustments across more of its computation.
Rank is the primary knob for trading parameter count against adaptation capacity. Rank 8 is the conventional default and works well for single-task fine-tuning on datasets of 1,000–100,000 examples. Rank 16 or 32 becomes worth considering when fine-tuning on complex multi-turn tasks, when training data is large and diverse, or when you observe that rank 8 models plateau before the training loss is fully converged. Ranks above 64 rarely provide additional benefit for standard instruction fine-tuning — at that point the adapter has enough capacity to overfit, and you’re better served by full fine-tuning or a higher-quality dataset.
The lora_alpha parameter controls the scaling of the adapter output: the adapter's contribution is multiplied by lora_alpha / rank. Setting alpha equal to rank (the default in many configs) gives a scaling factor of 1. Setting alpha to twice the rank is a common practice that doubles the adapter's contribution to the output, which can help when the base model needs significant steering. If fine-tuning is slow to converge or the adapter has little effect on model behavior, try doubling alpha before increasing rank; it's a cheaper adjustment.
Merging Adapters for Inference
LoRA adapters can be merged back into the base model weights before deployment, eliminating the adapter overhead entirely at inference time. The merge is mathematically exact: W_merged = W_base + (lora_alpha / rank) * B * A. The merged model is identical to the adapter model in a forward pass but runs at full base model speed with no additional memory or compute cost for the adapter matrices.
Merging is straightforward with PEFT’s merge_and_unload() method. The tradeoff is that the merged model can’t be easily un-merged if you want to swap adapters — each merged variant is a separate full model copy. For production serving where you have a single fine-tuned variant and want maximum inference throughput, always merge before deployment. For multi-task serving where you need to switch between adapters on the fly, keep adapters separate and load them dynamically.
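A minimal merge workflow with PEFT, with placeholder checkpoint and adapter paths:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base in bf16, attach the trained adapter, merge, save.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16  # example checkpoint
)
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")

merged = model.merge_and_unload()       # folds (alpha/r) * B @ A into each target weight
merged.save_pretrained("merged-model")  # plain checkpoint; no PEFT needed at load time
```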
QLoRA adapters can also be merged, but the merge happens in the dequantized weight space. The resulting merged model is a full-precision (or bf16) model, not a quantized one — merging a QLoRA adapter produces a larger model than the quantized base. If you need a quantized merged model for deployment, the workflow is: train with QLoRA, merge the adapter into dequantized weights, then re-quantize the merged model using GPTQ or AWQ. This is a common production pattern: QLoRA for memory-efficient training, re-quantization for memory-efficient serving.