Why Fine-Tune an LLM?
A base or instruction-tuned LLM is a generalist — it can do many things reasonably well but few things exceptionally well for your specific domain. Fine-tuning trains the model on examples from your use case so it learns your terminology, output format, tone, and task patterns. The result is a model that performs better on your specific task, often by a significant margin, while costing less to run per query (smaller fine-tuned models can outperform larger base models on the target task) and requiring shorter prompts (the model already understands the context it was trained on).
The key question before fine-tuning is whether you actually need it. Many teams reach for fine-tuning prematurely, when better prompting, few-shot examples, or RAG would have achieved the same goal at a fraction of the cost and complexity. Fine-tuning is the right answer when: you have a highly specialised domain with vocabulary or reasoning patterns not well represented in the base model; you need consistent output format that prompting alone cannot reliably enforce; you want to reduce prompt length and inference cost at scale; or you need a model that embodies a specific persona or communication style across thousands of interactions.
The Three Approaches: Full Fine-Tuning, LoRA, and QLoRA
Three techniques dominate practical LLM fine-tuning in 2026, and they differ primarily in what they update and how much memory they require.
Full fine-tuning updates all model parameters. You start with pre-trained weights and run gradient descent on your training data, modifying every weight in the network. This gives the model the most room to adapt to your domain and usually the highest quality, but it requires enormous VRAM (roughly 16 bytes per parameter with mixed-precision Adam, or about 8× the size of the FP16 weights) and risks catastrophic forgetting, where the model loses general capabilities as it specialises.
LoRA (Low-Rank Adaptation) freezes the original model weights and injects small trainable rank-decomposition matrices into specific layers (typically the attention weight matrices). Only these adapter matrices are updated during training. Because they represent a tiny fraction of total parameters (typically 0.1–1%), the training memory and compute requirements are drastically lower than full fine-tuning. Quality on the target task is close to full fine-tuning for most applications, while general capabilities are largely preserved.
QLoRA (Quantised LoRA) takes LoRA further by loading the base model in 4-bit quantisation rather than FP16/BF16. The quantised weights are frozen; only the LoRA adapters train, in higher precision (typically BF16). This reduces the base model memory from 2 bytes per parameter (FP16) to roughly 0.5 bytes (4-bit), making it possible to fine-tune a 70B model on a single 80 GB GPU or, with little headroom, on two 24 GB consumer GPUs. The quality cost of quantisation is modest for most tasks, so QLoRA typically lands slightly below standard LoRA.
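The low-rank update that both LoRA and QLoRA train can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the PEFT implementation: the layer size, rank, and scaling values are arbitrary, and the initialisation (small random A, zero B) follows the usual LoRA convention so that the adapter starts as a no-op.

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in = 64, 64   # toy layer size (illustrative, not a real model dimension)
r, alpha = 4, 8        # LoRA rank and scaling factor

W = rng.normal(size=(d_out, d_in))       # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable, small random init
B = np.zeros((d_out, r))                 # trainable, zero init so the adapter starts as a no-op


def lora_forward(x: np.ndarray) -> np.ndarray:
    """Frozen path plus the scaled low-rank update: W x + (alpha / r) * B (A x)."""
    return W @ x + (alpha / r) * (B @ (A @ x))


x = rng.normal(size=(d_in,))

# With B initialised to zero, the adapted layer matches the frozen layer exactly.
starts_as_noop = bool(np.allclose(lora_forward(x), W @ x))

full_params = W.size               # parameters a full fine-tune would update
adapter_params = A.size + B.size   # parameters LoRA actually trains
print(starts_as_noop, full_params, adapter_params)
```

In this toy layer the adapter is 512 parameters against 4,096 frozen ones. On real attention matrices with thousands of rows and columns, the same construction gives the sub-1% trainable fraction described above.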
Memory Requirements: Side-by-Side Comparison
Method | 7B Model | 13B Model | 70B Model | Required GPU
------------|------------|------------|------------|------------------
Full FT | ~112 GB | ~210 GB | ~1,120 GB | 4–8× A100 80GB+
LoRA (FP16) | ~20 GB | ~34 GB | ~160 GB | 1× 24GB / 2× 80GB
QLoRA (Q4) | ~8 GB | ~14 GB | ~48 GB | 1× 12GB / 1× 80GB
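Full fine-tuning is only practical on multi-GPU clusters, even at 7B. LoRA of a 7B model fits on a single 24 GB consumer GPU (RTX 3090/4090), while 13B at ~34 GB needs a larger card or two consumer GPUs. QLoRA of 70B fits on a single A100 80GB or, with little headroom, across two consumer 24 GB GPUs. For the LoRA and QLoRA rows, the two entries in the Required GPU column are for the 7B and 70B models respectively. These numbers are approximate and assume the Adam optimiser with gradient checkpointing enabled; actual requirements vary with sequence length, batch size, and LoRA rank.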
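The arithmetic behind the table can be sketched as follows. The 16 bytes per parameter figure for full fine-tuning assumes the common mixed-precision Adam breakdown (2 bytes of FP16 weights, 2 bytes of gradients, 12 bytes of FP32 master weights and two Adam moments). The helper names are illustrative, and the estimates deliberately ignore activations, adapters, and framework overhead.

```python
GB = 1e9  # decimal gigabytes, matching the table above


def full_finetune_gb(n_params: float) -> float:
    """Mixed-precision Adam: 2 (weights) + 2 (gradients) + 12 (FP32 master copy
    and two Adam moments) = 16 bytes per parameter."""
    return n_params * 16 / GB


def frozen_weights_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory for frozen base weights only; excludes adapters, activations,
    and optimiser state, which is why the table's LoRA/QLoRA rows are higher."""
    return n_params * bytes_per_param / GB


for name, n in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    print(
        name,
        f"full FT ~{full_finetune_gb(n):.0f} GB",
        f"FP16 weights ~{frozen_weights_gb(n, 2.0):.0f} GB",
        f"4-bit weights ~{frozen_weights_gb(n, 0.5):.0f} GB",
    )
```

The gap between the weights-only figures and the LoRA and QLoRA rows in the table is the training overhead: activations, adapter gradients, and optimiser state for the adapters.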
LoRA Fine-Tuning with the Hugging Face Ecosystem
The PEFT (Parameter-Efficient Fine-Tuning) library from Hugging Face is the standard tool for LoRA and QLoRA. Here is a complete working example for fine-tuning Llama 3.1 8B with LoRA on a supervised instruction dataset:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer
from datasets import load_dataset
# Load model and tokenizer
model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,            # Rank: higher = more parameters, better quality, more memory
    lora_alpha=32,   # Scaling factor (typically 2x rank)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Which layers to adapt
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output (approx., for these four attention projections at r=16):
# trainable params: 13,631,488 || all params: 8,043,892,736 || trainable%: 0.1695
# Load and format dataset
dataset = load_dataset("your_dataset", split="train")
# Training arguments
training_args = TrainingArguments(
    output_dir="./lora-llama-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch size = 16
    learning_rate=2e-4,
    bf16=True,
    logging_steps=50,
    save_strategy="epoch",
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
)
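# Note: these keyword arguments match older TRL releases. Newer releases take
# dataset_text_field and the sequence-length limit via SFTConfig and accept
# the tokenizer as processing_class, so check the version you have installed.
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=2048,
)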
trainer.train()
trainer.model.save_pretrained("./lora-adapter")
After training, the lora-adapter directory contains only the adapter weights: typically 50–200 MB for rank 16 on 7B–13B models, growing with the base model's width and depth but always a small fraction of its size. The base model weights are unchanged and reusable. To use the fine-tuned model:
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "./lora-adapter")
# Optionally merge adapters into base model for faster inference
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-model")
Merging the adapter into the base model produces a standard model file that loads without the PEFT library and runs at full inference speed with no adapter overhead.
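To see where the adapter size comes from, the parameter count can be worked out by hand. The sketch below assumes the published Llama 3.1 8B layer shapes (hidden size 4096, key/value projection width 1024, MLP width 14336, 32 layers); the function name is illustrative. Each adapted linear layer of shape (d_in, d_out) contributes r × (d_in + d_out) parameters.

```python
def lora_param_count(shapes: dict, r: int, n_layers: int) -> int:
    """Each adapted (d_in, d_out) linear layer adds r * (d_in + d_out) parameters."""
    return n_layers * sum(r * (d_in + d_out) for d_in, d_out in shapes.values())


# Assumed Llama 3.1 8B projection shapes as (d_in, d_out)
attention_shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}
mlp_shapes = {
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}

attention_only = lora_param_count(attention_shapes, r=16, n_layers=32)
all_linear = lora_param_count({**attention_shapes, **mlp_shapes}, r=16, n_layers=32)

for label, n in [("attention only", attention_only), ("all linear layers", all_linear)]:
    # 4 bytes per parameter if the adapter is saved in FP32, 2 bytes in BF16
    print(f"{label}: {n:,} params, ~{n * 4 / 1e6:.0f} MB FP32, ~{n * 2 / 1e6:.0f} MB BF16")
```

Under these assumptions, adapting only the four attention projections gives about 13.6 million trainable parameters, and extending target_modules to the MLP projections raises that to about 41.9 million, so the choice of target modules moves the adapter size as much as the rank does.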
QLoRA: Fine-Tuning 70B on a Single GPU
QLoRA combines 4-bit quantisation with LoRA to make large model fine-tuning accessible on a single GPU. The key addition is loading the base model with BitsAndBytes quantisation:
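import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,         # Nested quantisation for extra memory savings
    bnb_4bit_quant_type="nf4",              # NormalFloat4: better than INT4 for LLMs
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute in BF16 despite 4-bit storage
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.3-70B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
# Prepare the quantised model for training before attaching adapters
model = prepare_model_for_kbit_training(model)
# Then apply LoRA as before; add the remaining LoraConfig arguments
# (task type, dropout, bias, extra target modules) as in the LoRA example
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)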
A 70B model with QLoRA requires approximately 40–48 GB of GPU memory for the quantised weights plus adapter training overhead. This fits on a single A100 80GB with headroom, or, with very little headroom, across two RTX 3090s (48 GB total) using device_map="auto" to distribute layers across the cards.
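The 0.5 bytes per parameter figure comes from storing each weight as a 4-bit code plus a shared scale per block of weights. The sketch below illustrates that idea with plain blockwise absmax quantisation onto a uniform integer grid. It is a simplification and not the NF4 scheme that bitsandbytes uses (NF4 places its 16 levels non-uniformly), and it keeps the codes in int8 rather than packing two per byte. The function names and block size are assumptions for illustration.

```python
import numpy as np


def quantize_4bit(weights: np.ndarray, block_size: int = 64):
    """Blockwise absmax quantisation to signed 4-bit codes in [-7, 7]."""
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid dividing by zero for all-zero blocks
    codes = np.clip(np.round(blocks / scales), -7, 7).astype(np.int8)
    return codes, scales


def dequantize_4bit(codes: np.ndarray, scales: np.ndarray, shape) -> np.ndarray:
    """Recover approximate weights: code * per-block scale."""
    return (codes.astype(np.float32) * scales).reshape(shape)


rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(128, 64)).astype(np.float32)

codes, scales = quantize_4bit(w)
w_hat = dequantize_4bit(codes, scales, w.shape)

# Rounding moves each weight by at most half a quantisation step of its block.
max_err = float(np.abs(w - w_hat).max())
print(f"max abs error: {max_err:.6f}, largest half-step: {float(scales.max()) / 2:.6f}")
```

During QLoRA training the frozen weights stay in this compact form and are dequantised on the fly for each matrix multiplication, which is what the bnb_4bit_compute_dtype setting controls.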
Cost Comparison: DIY vs. Managed Fine-Tuning Services
Fine-tuning can be done on your own hardware, on cloud GPU instances, or through managed fine-tuning APIs from the major providers. The cost profile differs substantially:
Approach | 7B LoRA (3 epochs, 10K samples) | 70B QLoRA (same run)
-----------------------|----------------------------------|----------------
Own RTX 4090 (24GB) | ~$0.50 electricity | Not feasible
Own A100 80GB | ~$1.20 electricity | ~$4 electricity
AWS p3.2xlarge (V100) | ~$8 compute | ~$30 (multi-GPU)
AWS p4d.24xlarge | ~$100 compute | ~$50 compute
Google Cloud A100 | ~$15 compute | ~$60 compute
OpenAI fine-tuning | ~$50–$100 (gpt-4o-mini) | N/A (not offered)
Anthropic fine-tuning | ~$100–$300 (Claude) | N/A
Together AI fine-tune | ~$4 (Llama 8B) | ~$20 (Llama 70B)
For small to medium datasets (under 100K samples), every approach is cheap in absolute terms, ranging from under a dollar to a few hundred dollars per run. For large-scale fine-tuning (millions of samples, multiple epochs), DIY on owned hardware or spot instances becomes significantly more economical. Managed fine-tuning APIs (OpenAI, Anthropic) are the most expensive per token but require the least operational overhead: you upload a JSONL file and get an endpoint back.
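A rough way to reproduce estimates like these for your own setup is to multiply run time by an hourly rate, or by power draw and electricity price for owned hardware. The sketch below is only that arithmetic; the run time, wattage, and prices are placeholder assumptions, not measurements or quoted rates.

```python
def cloud_cost_usd(hours: float, hourly_rate_usd: float) -> float:
    """Rented GPU: wall-clock hours times the instance's hourly price."""
    return hours * hourly_rate_usd


def electricity_cost_usd(hours: float, watts: float, usd_per_kwh: float) -> float:
    """Owned GPU: energy used (kWh) times the electricity price."""
    return hours * (watts / 1000.0) * usd_per_kwh


# Placeholder inputs: substitute your own measured run time and local prices.
run_hours = 2.0
print(f"owned GPU:  ${electricity_cost_usd(run_hours, watts=500, usd_per_kwh=0.20):.2f}")
print(f"rented GPU: ${cloud_cost_usd(run_hours, hourly_rate_usd=4.00):.2f}")
```

Neither figure includes hardware purchase cost or engineering time, which is often the larger term for a first experiment.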
Preparing Your Training Data
Data quality matters far more than training duration or hyperparameter choices for fine-tuning quality. A common mistake is assembling a large dataset of mediocre examples rather than a smaller dataset of excellent ones. For most supervised fine-tuning tasks, 500–5,000 high-quality examples outperform 50,000 noisy ones.
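The standard format for instruction fine-tuning is a JSONL file where each line is one conversation. The records below are wrapped across several lines for readability; in the actual file each record must sit on a single line: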
{"messages": [
{"role": "system", "content": "You are a technical support specialist for Acme Software."},
{"role": "user", "content": "How do I reset my API key?"},
{"role": "assistant", "content": "To reset your API key: navigate to Settings → Security → API Keys. Click 'Regenerate Key' next to the key you want to reset. Note that the old key will be invalidated immediately."}
]}
{"messages": [
{"role": "user", "content": "I'm getting a 429 error."},
{"role": "assistant", "content": "A 429 error means you've exceeded your rate limit. Your current plan allows 100 requests per minute. You can either wait for the rate limit window to reset (it resets every 60 seconds), or upgrade to a higher tier for more requests per minute."}
]}
Key data quality checks before training: verify that every example demonstrates the exact behaviour you want (tone, format, length, domain focus); remove or fix examples where the assistant response is generic, too long, or off-topic; ensure the dataset represents the distribution of inputs you expect in production; and include some examples of correct refusal behaviour for out-of-scope queries if that matters for your use case.
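Some of these checks can be automated before any human review. The following is a minimal structural validator for the messages format shown above, using only the standard library. The specific rules (allowed roles, non-empty content, final turn from the assistant) are assumptions about what a sensible dataset looks like, and it does not judge tone or correctness, which still need a manual audit.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}


def validate_record(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty means it passed)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or non-string content")

    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message is not from the assistant")
    return problems


def validate_lines(lines) -> dict[int, list[str]]:
    """Map 1-based line numbers to their problems, skipping blank and clean lines."""
    report = {}
    for number, line in enumerate(lines, start=1):
        if line.strip():
            problems = validate_record(line)
            if problems:
                report[number] = problems
    return report


sample = [
    '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]}',
    '{"messages": [{"role": "user", "content": "Hi"}]}',
    "not json at all",
]
report = validate_lines(sample)
for number, problems in report.items():
    print(number, problems)
```

To check a real file, pass the open file handle to validate_lines, since iterating over a file yields one line at a time.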
Key Hyperparameters and How to Set Them
Learning rate: 1e-4 to 3e-4 is the typical range for LoRA fine-tuning. Higher rates learn faster but risk instability; lower rates are safer but may need more epochs. Start with 2e-4 and monitor training loss.
LoRA rank (r): Controls the size of the adapter matrices. Higher rank means more trainable parameters, which is more expressive but uses more memory. Rank 8–16 works for most tasks; rank 64–128 suits tasks requiring significant domain adaptation. Start with 16.
Epochs: 1–5 is the typical range. More epochs risk overfitting on small datasets. Monitor validation loss and stop when it stops improving. For datasets under 1,000 examples, 3–5 epochs with early stopping is safe.
Max sequence length: Set to the maximum input+output length you expect in production. Longer sequences use more memory. Truncating training examples to 2,048 tokens covers most instruction-following use cases.
Batch size: Use the largest batch size that fits in VRAM. Gradient accumulation lets you simulate larger batches when VRAM is limited: a batch size of 4 with 4 accumulation steps = effective batch size of 16.
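These settings interact, so it helps to work out how many optimiser steps a run will actually take before launching it. The sketch below does that arithmetic for the values used in the earlier training example, assuming a hypothetical dataset of 10,000 examples and a single GPU. Trainers differ slightly in how they round partial batches, so treat the result as an estimate.

```python
import math


def training_schedule(
    n_examples: int,
    per_device_batch: int,
    grad_accum: int,
    epochs: int,
    warmup_ratio: float,
    n_gpus: int = 1,
) -> dict:
    """Estimate effective batch size and optimiser step counts for a run."""
    effective_batch = per_device_batch * grad_accum * n_gpus
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


schedule = training_schedule(
    n_examples=10_000,   # hypothetical dataset size
    per_device_batch=4,
    grad_accum=4,
    epochs=3,
    warmup_ratio=0.03,
)
print(schedule)
```

With under 2,000 optimiser steps in total, a logging interval of 50 steps gives a few dozen loss readings, which is enough to spot instability early.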
Evaluating Fine-Tuned Models
Never deploy a fine-tuned model based solely on training loss. Training loss tells you the model fit the training data; it says nothing about whether it improved on your actual task or degraded general capabilities. Always evaluate on a held-out test set of examples the model never saw during training, using the same metrics you care about in production.
Also evaluate general capability degradation. Run your fine-tuned model through a set of general knowledge, reasoning, and instruction-following examples from outside your training domain. If performance on these degrades noticeably, you have over-specialised: reduce the number of training epochs or the learning rate, or mix some general-purpose examples into the dataset. LoRA fine-tuning typically degrades general capabilities much less than full fine-tuning, but it is not immune.
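from transformers import pipeline

# Compare base vs. fine-tuned on a held-out test set.
# merged_model and tokenizer come from the earlier steps; test_set is your own
# list of dicts with a "prompt" key. Add device and generation arguments as needed.
base_pipe = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")
ft_pipe = pipeline("text-generation", model=merged_model, tokenizer=tokenizer)

for example in test_set:
    # return_full_text=False returns only the completion, not prompt + completion
    base_output = base_pipe(example["prompt"], return_full_text=False)[0]["generated_text"]
    ft_output = ft_pipe(example["prompt"], return_full_text=False)[0]["generated_text"]
    # Score both against ground truth using your chosen metric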
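The scoring step is left open above because the right metric depends on the task. As one possible choice, the sketch below implements token-level F1 against a reference answer and averages it over a test set. The example outputs and references are invented for illustration, and for structured outputs an exact-match or schema check would usually be more appropriate.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall, case-insensitive."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)  # 1.0 only if both are empty
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def mean_score(outputs: list[str], references: list[str]) -> float:
    """Average token F1 across paired outputs and references."""
    return sum(token_f1(o, r) for o, r in zip(outputs, references)) / len(references)


# Invented examples standing in for real model outputs
references = ["reset the key in settings", "wait sixty seconds"]
base_outputs = ["please contact support", "wait sixty seconds or upgrade"]
ft_outputs = ["reset the key in settings", "wait sixty seconds"]

base_mean = mean_score(base_outputs, references)
ft_mean = mean_score(ft_outputs, references)
print(f"base: {base_mean:.3f}  fine-tuned: {ft_mean:.3f}")
```

Running the same function over a general-capability set from outside your domain gives a comparable number for the degradation check.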
When to Use Each Method
Use LoRA when: you have 500–100K training examples, your task needs the base model's general knowledge preserved, you are working with 7B–13B models (or up to 70B if you have multiple 80 GB GPUs), and you have at least 24 GB of VRAM available. This covers the majority of practical fine-tuning use cases.
Use QLoRA when: you want to fine-tune a 70B model on limited hardware (single A100 or two consumer GPUs), or when VRAM is the binding constraint for any model size. Accept a small additional quality reduction compared to FP16 LoRA.
Use full fine-tuning when: you have a very large, high-quality dataset (millions of examples), you need maximum adaptation to a radically different domain, you have access to a multi-GPU cluster, and the quality improvement over LoRA is measurable on your specific task. In practice, LoRA is so close in quality for most tasks that full fine-tuning is only justified for a narrow set of cases.
Use managed APIs when: you want minimal operational overhead, you are fine-tuning GPT-4o mini or Claude on proprietary data, your dataset is small enough that the cost difference versus DIY is modest, and your organisation cannot run GPU workloads internally. The per-example cost is higher but the total cost of a first fine-tuning experiment is often lower when you factor in engineering time.
Unsloth: 2–5x Faster LoRA Training
Unsloth is an open-source library that optimises LoRA and QLoRA training kernels for significantly faster training and lower memory usage than the standard Hugging Face PEFT implementation. It is a drop-in replacement that requires minimal code changes:
pip install unsloth
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,  # QLoRA
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
    lora_dropout=0,  # Unsloth works best with no dropout
    bias="none",
)
# Then use with SFTTrainer as normal
Unsloth reports 2–5x faster training and 50–70% less VRAM usage compared to standard PEFT, achieved through hand-written GPU kernels (implemented largely in Triton) and memory-efficient attention implementations. For long training runs on consumer hardware, this can meaningfully reduce the total wall-clock time and electricity cost. The library actively supports Llama 3, Mistral, Phi, and other major model families with model-specific optimisations.
Serving Fine-Tuned Models
After fine-tuning and merging adapters, your model is a standard Hugging Face model directory that mainstream inference frameworks can load, provided you also save the tokenizer into the same directory with tokenizer.save_pretrained. For production serving, the same options apply as for any other model (vLLM, Ollama, TGI), with no fine-tuning-specific changes required.
For Ollama, create a Modelfile pointing to your merged model:
FROM /path/to/merged-model
SYSTEM "You are a technical support specialist for Acme Software. ..."
PARAMETER temperature 0.1
PARAMETER num_ctx 4096
ollama create acme-support -f Modelfile
ollama run acme-support
Fine-tuned models typically benefit from lower temperature settings (0.0–0.3) compared to general-purpose use, because the goal is consistent, predictable outputs rather than creative variation. Set this in the Modelfile or as a default parameter in your inference configuration.
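The effect of temperature is easy to see numerically. Sampling divides the model's logits by the temperature before the softmax, so a low value concentrates probability on the top token. The sketch below uses three made-up logits to show the difference between temperature 1.0 and 0.1; it illustrates the scaling only and is not tied to any particular inference engine.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

p_default = softmax_with_temperature(logits, 1.0)
p_low = softmax_with_temperature(logits, 0.1)

print("T=1.0:", [round(p, 3) for p in p_default])
print("T=0.1:", [round(p, 3) for p in p_low])
```

At temperature 0.1 nearly all of the probability mass sits on the highest-scoring token, which is why low settings give the repeatable outputs a fine-tuned support or extraction model usually needs.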
Common Mistakes and How to Avoid Them
The most common fine-tuning failures follow predictable patterns.
Training on too little data: Under 200 examples, the model memorises rather than generalises. Either collect more data or use few-shot prompting instead.
Training for too many epochs on small datasets: Validation loss starts rising while training loss continues falling, a clear sign of overfitting. Use early stopping.
Not evaluating general capability degradation: A model that performs perfectly on your task but cannot follow basic instructions or answer general questions has been over-tuned. Reduce epochs.
Including too much noise in training data: Examples where the expected output is inconsistent, incorrectly formatted, or off-topic teach the model bad habits. Audit a random sample of your training data, clean what you find, and compare model quality before and after cleaning.
Using too high a rank without enough data: High rank (64+) with a small dataset overfits quickly. Match rank to dataset size: rank 8–16 for under 10K examples, higher ranks only with larger datasets.
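Early stopping, recommended above for small datasets, amounts to tracking the best validation loss and halting once it has failed to improve for a set number of evaluations. The sketch below shows that logic on a list of per-epoch validation losses; the loss values, patience, and function name are illustrative assumptions. In a Hugging Face training loop, the EarlyStoppingCallback in transformers plays the same role.

```python
def early_stopping_epoch(val_losses, patience: int = 2, min_delta: float = 0.0):
    """Return (best_epoch, best_loss), stopping once the loss has failed to
    improve by more than min_delta for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # later epochs are never examined
    return best_epoch, best_loss


# Illustrative losses: improving for three epochs, then overfitting
val_losses = [1.20, 0.95, 0.90, 0.93, 0.97, 1.05]
best_epoch, best_loss = early_stopping_epoch(val_losses, patience=2)
print(f"keep the checkpoint from epoch {best_epoch} (val loss {best_loss})")
```

The checkpoint to deploy is the one from the best epoch, not the last one trained, so save a checkpoint at every evaluation.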
Fine-Tuning vs. RAG: Choosing the Right Tool
Fine-tuning and RAG are complementary rather than competing approaches, but they address different problems. RAG is the right tool when the information the model needs changes frequently, is too voluminous to fit in weights, or needs to be cited and verified. Fine-tuning is the right tool when the model needs to adopt a consistent style, format, or reasoning pattern, or when domain vocabulary and concepts are so specialised that base model knowledge is inadequate. For many production applications, the optimal architecture combines both: a fine-tuned model that has adopted the right persona and output format, paired with a RAG system that provides current, verifiable facts. Neither alone covers all cases, and the decision to fine-tune should be made based on what problem you are actually solving rather than as a default first step.