Synthetic data generation with LLMs has moved from a research curiosity to a standard production technique for fine-tuning. The core problem it solves is data scarcity: high-quality labelled training data is expensive to produce, and most organisations have far more unlabelled domain content than labelled examples. LLMs can bridge this gap by generating plausible (query, answer), (instruction, response), or (input, output) pairs from your existing content — not as a replacement for real data, but as a way to bootstrap a fine-tuning dataset when real labelled data is scarce, or to augment a small real dataset with synthetic examples in underrepresented categories.
The technique works because modern LLMs are good enough at following formatting instructions and reasoning about domain content that the synthetic pairs they generate capture most of the signal that real human-labelled pairs would provide — with the important caveat that LLM-generated labels inherit the LLM’s biases and failure modes. Synthetic data is most useful as a starting point that you then filter, deduplicate, and selectively augment with real labels on the hardest or most important examples. Using it as a complete substitute for real data produces models that are fluent but may be confidently wrong in ways that real data would have corrected.
Generating Instruction-Response Pairs
The most common use case is generating (instruction, response) pairs for instruction fine-tuning. The Alpaca and Self-Instruct approaches generate instructions from a small seed set by prompting a teacher LLM to produce diverse task descriptions and completions. For domain-specific fine-tuning, start from your own content rather than a generic seed set — sample passages from your documentation, codebase, or knowledge base and prompt the LLM to generate tasks that would be answered by that content:
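import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

GEN_PROMPT = """Given the passage below, generate 3 diverse instruction-response pairs.
Vary question type: factual, how-to, comparison, troubleshooting.
Return a JSON array with 'instruction' and 'response' keys only.
Passage:\n{passage}"""

def generate_pairs(passage):
    r = client.messages.create(
        model='claude-sonnet-4-20250514', max_tokens=1500,
        temperature=0.8,  # inside the 0.7-0.9 range that favours diverse instructions
        messages=[{'role': 'user', 'content': GEN_PROMPT.format(passage=passage)}])
    # Strip optional markdown code fences before parsing the JSON array
    text = r.content[0].text.strip().replace('```json', '').replace('```', '').strip()
    return json.loads(text)

# corpus_passages: your own list of passage strings sampled from docs or a knowledge base
all_pairs = []
for passage in corpus_passages:
    for pair in generate_pairs(passage):
        pair['source'] = passage  # keep provenance so responses can be checked against the source
        all_pairs.append(pair)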
Temperature matters significantly. Use 0.7–0.9 for diversity in instruction types — lower temperatures produce repetitive formats even when the content varies. A common failure mode is generating instructions that all start with “What is” or “How do I” — include explicit diversity constraints in the prompt to counteract this. Run multiple passes and deduplicate by embedding cosine similarity to maximise coverage.
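One way to notice the "What is" / "How do I" pattern early is to measure how often each opening phrase appears. The sketch below is a minimal illustration, not part of the original pipeline: the function names, the two-word opener, and the 25% threshold are all assumptions chosen for the example.

```python
from collections import Counter

def opener_distribution(instructions, n_words=2):
    """Share of instructions that begin with each lowercase n-word opener."""
    openers = Counter(
        " ".join(text.lower().split()[:n_words])
        for text in instructions
        if text.strip()
    )
    total = sum(openers.values())
    return {opener: count / total for opener, count in openers.most_common()}

def overused_openers(instructions, max_share=0.25, n_words=2):
    """Openers whose share exceeds max_share (an illustrative threshold)."""
    dist = opener_distribution(instructions, n_words)
    return {opener: share for opener, share in dist.items() if share > max_share}

# Hypothetical sample of generated instructions
sample = [
    "What is a learning rate schedule?",
    "What is gradient clipping?",
    "What is weight decay?",
    "Compare AdamW and SGD for fine-tuning.",
    "Troubleshoot a loss that goes to NaN.",
]
flagged = overused_openers(sample)  # only "what is" exceeds the 25% share here
```

If an opener is flagged, tighten the diversity constraints in the generation prompt before running further passes.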
Generating Preference Data for DPO
Direct Preference Optimization requires (prompt, chosen_response, rejected_response) triples. Generating preference data synthetically is harder because the rejected response needs to be plausibly wrong rather than obviously bad — a clearly incorrect rejection does not teach the model useful distinctions. The standard approach is to generate multiple candidate responses with different system prompts or sampling parameters, then use a judge LLM to rank them:
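def generate_preference_triple(instruction):
    # Vary the system prompt so the candidates differ in style and depth
    system_prompts = [
        'You are a precise, concise technical assistant.',
        'You are a helpful assistant. Be thorough and include examples.',
        'Answer briefly in 2-3 sentences.',
        'Provide a detailed explanation with caveats.',
    ]
    candidates = []
    for system_prompt in system_prompts:  # renamed from `sys` to avoid shadowing the stdlib module name
        r = client.messages.create(model='claude-sonnet-4-20250514', max_tokens=600,
            system=system_prompt, messages=[{'role': 'user', 'content': instruction}])
        candidates.append(r.content[0].text)
    # The judge only sees the first 300 characters of each candidate
    judge = (
        f"Rank these {len(candidates)} responses best to worst for accuracy and utility.\n"
        f"Instruction: {instruction}\n"
        + '\n'.join(f'[{i+1}] {c[:300]}' for i, c in enumerate(candidates))
        + '\nReturn JSON: {"ranking": [best_idx, ..., worst_idx]} 1-based.'
    )
    # Same model used as judge for brevity; prefer a separate, stronger judge in practice
    jr = client.messages.create(model='claude-sonnet-4-20250514', max_tokens=60,
        messages=[{'role': 'user', 'content': judge}])
    # Strip optional code fences before parsing, as in generate_pairs
    text = jr.content[0].text.strip().replace('```json', '').replace('```', '').strip()
    ranking = json.loads(text)['ranking']
    # Rankings are 1-based: first entry is the best candidate, last is the worst
    return {'prompt': instruction, 'chosen': candidates[ranking[0]-1], 'rejected': candidates[ranking[-1]-1]}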
The quality of the judge model is the bottleneck for preference data quality. Using the same model as both generator and judge introduces circular reasoning — the judge will prefer responses in its own style regardless of actual quality. Use the strongest available judge (ideally human reviewers for the most important preference pairs) and validate that the judge’s rankings correlate with human preferences on a sample before using synthetic preference data at scale.
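Validating the judge against human preferences can start with a plain agreement rate on a small audited sample. The sketch below assumes you have, for each sampled prompt, the index of the candidate the judge ranked first and the index a human reviewer preferred. The function name and the 0.8 bar are illustrative assumptions, not values from the text.

```python
def judge_human_agreement(judge_choices, human_choices):
    """Fraction of sampled prompts where the judge's top pick matches the human's."""
    if len(judge_choices) != len(human_choices):
        raise ValueError("judge and human label lists must be the same length")
    if not judge_choices:
        raise ValueError("need at least one labelled example")
    matches = sum(j == h for j, h in zip(judge_choices, human_choices))
    return matches / len(judge_choices)

# Hypothetical audit sample: index of the preferred candidate for each prompt
judge_top = [0, 2, 1, 3, 0]
human_top = [0, 2, 3, 3, 0]

agreement = judge_human_agreement(judge_top, human_top)
MIN_AGREEMENT = 0.8  # illustrative bar; choose one that fits your task
judge_is_usable = agreement >= MIN_AGREEMENT
```

A raw agreement rate ignores chance agreement and position bias, so treat it as a first screen rather than a full validation.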
Filtering and Quality Control
Raw synthetic output needs filtering before it goes into a training dataset. The most common quality issues are: responses that hallucinate facts not in the source passage, instructions that are too vague or too similar to each other, responses that are too short to be useful, and formatting errors where the LLM failed to follow the output schema. A practical filtering pipeline runs four checks in sequence: schema validation (parse the JSON, reject malformed output), length filter (reject instructions under 10 words or responses under 50 words), deduplication by embedding similarity (reject pairs where the instruction cosine similarity to any existing pair exceeds 0.92), and a quality classifier.
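The first two checks, schema validation and the length filter, need nothing beyond the standard library. The sketch below is one possible shape for them: the function names are assumptions, while the 10-word and 50-word minimums are the thresholds quoted above.

```python
import json

REQUIRED_KEYS = {"instruction", "response"}

def parse_pairs(raw_text):
    """Schema validation: return the well-formed pairs, or [] if the output is malformed."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    return [
        item for item in data
        if isinstance(item, dict)
        and REQUIRED_KEYS <= set(item)
        and all(isinstance(item[key], str) for key in REQUIRED_KEYS)
    ]

def passes_length_filter(pair, min_instruction_words=10, min_response_words=50):
    """Length filter: reject instructions or responses that are too short."""
    return (
        len(pair["instruction"].split()) >= min_instruction_words
        and len(pair["response"].split()) >= min_response_words
    )

def basic_filter(raw_text):
    """Run both cheap checks; dedup and the quality classifier come afterwards."""
    return [pair for pair in parse_pairs(raw_text) if passes_length_filter(pair)]
```

Running these cheap checks first means the more expensive embedding and classifier stages only see well-formed candidates.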
For the quality classifier, train a small binary classifier on a few hundred manually labelled examples (good/bad synthetic pairs) and use it to score the full synthetic dataset. A BERT-base or DeBERTa-v3-small model fine-tuned as a binary classifier is fast enough to score millions of pairs in hours and typically filters out 10–30% of synthetic pairs that pass the basic checks but are low quality. This is cheaper and more consistent than running an LLM-as-judge on every example, which becomes expensive at scale. Reserve LLM-as-judge for borderline cases flagged by the classifier.
# Deduplication by embedding similarity (the third check in the pipeline)
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('BAAI/bge-small-en-v1.5')

def deduplicate_pairs(pairs, threshold=0.92):
    """Greedy dedup: keep the first pair in each group of near-duplicate instructions."""
    instructions = [p['instruction'] for p in pairs]
    # Unit-normalise the embeddings so a dot product equals cosine similarity
    embeddings = model.encode(instructions, batch_size=256, show_progress_bar=True,
                              normalize_embeddings=True)
    keep = np.ones(len(pairs), dtype=bool)
    for i in range(len(pairs)):
        if not keep[i]:
            continue
        # Compare pair i against every later pair and drop the near-duplicates
        sims = embeddings[i+1:] @ embeddings[i]
        keep[i+1:] &= sims <= threshold
    return [p for p, k in zip(pairs, keep) if k]

filtered = deduplicate_pairs(all_pairs)
print(f'Kept {len(filtered)}/{len(all_pairs)} after dedup')
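For the final check, the classifier's scores can be split into three bands so that only borderline pairs reach the LLM judge. The sketch below assumes each pair already carries a quality score between 0 and 1 from the classifier. The function name and both thresholds are hypothetical and should be tuned against your labelled examples.

```python
def triage_by_score(scored_pairs, reject_below=0.3, keep_above=0.7):
    """Split (pair, score) tuples into keep / borderline / reject buckets."""
    buckets = {"keep": [], "borderline": [], "reject": []}
    for pair, score in scored_pairs:
        if score >= keep_above:
            buckets["keep"].append(pair)
        elif score < reject_below:
            buckets["reject"].append(pair)
        else:
            # Borderline scores are the ones worth an LLM-as-judge call
            buckets["borderline"].append(pair)
    return buckets

# Hypothetical classifier output
scored = [
    ({"instruction": "A"}, 0.95),
    ({"instruction": "B"}, 0.55),
    ({"instruction": "C"}, 0.10),
]
buckets = triage_by_score(scored)
```

Widening the borderline band raises judge cost and narrows the classifier's role, so the two thresholds are effectively a budget control.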
Data Mixing: Synthetic vs Real
The optimal ratio of synthetic to real data depends on the quality gap between them and the task type. For instruction following on factual domain content, a mix of 80% synthetic and 20% real human-labelled pairs often matches the performance of 100% real data at a fraction of the cost: the synthetic data handles common patterns while the real data anchors the model on the hardest and most important examples. For tasks requiring precise factual accuracy (medical, legal, financial), the bar for relying on synthetic data is much higher, and the real-data fraction should be larger. A practical rule: if a model fine-tuned only on synthetic data scores within 5% of a model fine-tuned only on human-labelled data on your evaluation benchmark, the synthetic data is good enough to use as the primary training source. If the gap is larger, invest in more real labels rather than generating more synthetic data.
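The ratio and the 5% rule are both simple arithmetic. The sketch below keeps every real example and samples enough synthetic ones to hit a target real-data fraction, then checks the evaluation gap. The function names are assumptions, and reading "within 5%" as a relative gap (rather than 5 absolute points) is one interpretation of the rule.

```python
import random

def sample_synthetic_for_ratio(synthetic, real, real_fraction=0.2, seed=0):
    """Pick the synthetic subset that makes `real` about real_fraction of the mix."""
    if not 0 < real_fraction <= 1:
        raise ValueError("real_fraction must be in (0, 1]")
    wanted = round(len(real) * (1 - real_fraction) / real_fraction)
    n_synthetic = min(len(synthetic), wanted)  # use everything if there is too little
    return random.Random(seed).sample(list(synthetic), n_synthetic)

def synthetic_is_sufficient(synthetic_only_score, real_only_score, max_relative_gap=0.05):
    """True if the synthetic-only model is within the allowed relative gap."""
    gap = (real_only_score - synthetic_only_score) / real_only_score
    return gap <= max_relative_gap

# Hypothetical datasets: 20 real examples, 200 synthetic candidates
real = [f"real-{i}" for i in range(20)]
synthetic = [f"syn-{i}" for i in range(200)]
subset = sample_synthetic_for_ratio(synthetic, real, real_fraction=0.2)
```

The function returns the synthetic subset separately rather than a shuffled mix, so it works with either a single combined run or staged training.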
Curriculum ordering matters for mixed datasets. Training on synthetic data first and then fine-tuning on real data (two-stage training) often outperforms shuffling synthetic and real data together. The synthetic data teaches the model the task format and common patterns; the real data then corrects the specific failure modes and biases that the synthetic data introduced. This mirrors how humans learn: broad exposure first, correction of specific errors second. Implement this with two separate training runs rather than trying to weight samples within a single run, so that the learning rate schedule and epoch count can be optimised independently for each stage.
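Structurally, two-stage training is just two calls to whatever training routine you already use, each with its own hyperparameters. In the sketch below, `train_fn` is a hypothetical callable standing in for your trainer, and the learning rates and epoch counts are placeholders rather than recommendations.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    learning_rate: float
    epochs: int

# Placeholder hyperparameters; tune each stage independently
SYNTHETIC_STAGE = Stage("synthetic", learning_rate=2e-5, epochs=2)
REAL_STAGE = Stage("real", learning_rate=5e-6, epochs=3)

def two_stage_finetune(model, synthetic_data, real_data, train_fn,
                       synthetic_stage=SYNTHETIC_STAGE, real_stage=REAL_STAGE):
    """Run two separate training runs: synthetic first, then real."""
    # train_fn(model, dataset, stage) -> trained model (hypothetical signature)
    model = train_fn(model, synthetic_data, synthetic_stage)
    model = train_fn(model, real_data, real_stage)
    return model
```

Keeping the stages as separate runs also gives you a checkpoint after the synthetic stage, which is useful for measuring how much the real data adds.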
Evol-Instruct and Complexity Scaling
WizardLM’s Evol-Instruct technique improves synthetic dataset quality by iteratively rewriting instructions to be more complex, specific, or constrained. Instead of generating instructions directly from passages, start with simple instructions and apply evolution operations: add constraints (“answer in under 50 words”), increase specificity (“explain for an audience of senior ML engineers”), add reasoning requirements (“explain why, not just how”), or combine multiple subtasks. Each evolution step produces a harder instruction that requires more sophisticated responses. Fine-tuning on evolved instructions produces models that handle difficult, nuanced queries much better than models trained on simple synthetic instructions, because the training distribution covers the complexity range users actually need.
The practical implementation generates a base instruction set, applies 2–3 rounds of evolution using an LLM (“make this instruction more complex and specific without changing the topic”), filters evolved instructions for coherence, generates responses for each evolved instruction, and filters the (instruction, response) pairs for quality. The total cost per training example is 3–5x higher than simple generation, but the resulting model quality improvement is typically worth it for high-value fine-tuning tasks where the base synthetic approach falls short.
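The evolution loop can be sketched independently of any particular model. In the example below, `rewrite_fn` is a hypothetical callable that sends a prompt to an LLM and returns its text. The operation names and prompt wording are illustrative, not WizardLM's original prompts, and the coherence filter is reduced to a trivial "non-empty and changed" check.

```python
import random

# Illustrative evolution operations, loosely following the ones described above
EVOLUTION_OPS = {
    "add_constraint": "Rewrite the instruction so it includes one explicit constraint, such as a length or format limit.",
    "increase_specificity": "Rewrite the instruction so it targets a specific audience or scenario.",
    "add_reasoning": "Rewrite the instruction so the answer must explain why, not just how.",
    "combine_subtasks": "Rewrite the instruction so it requires completing two related subtasks.",
}

def build_evolution_prompt(instruction, op):
    """Compose the rewrite prompt for one evolution step."""
    return (
        f"{EVOLUTION_OPS[op]}\n"
        "Keep the topic unchanged. Return only the rewritten instruction.\n\n"
        f"Instruction: {instruction}"
    )

def evolve(instruction, rewrite_fn, rounds=2, seed=0):
    """Apply up to `rounds` evolution steps; return the lineage from base to hardest."""
    rng = random.Random(seed)
    lineage = [instruction]
    for _ in range(rounds):
        op = rng.choice(sorted(EVOLUTION_OPS))
        candidate = rewrite_fn(build_evolution_prompt(lineage[-1], op)).strip()
        # Minimal stand-in for a coherence filter: stop on empty or unchanged output
        if not candidate or candidate == lineage[-1]:
            break
        lineage.append(candidate)
    return lineage
```

Returning the whole lineage rather than only the final instruction lets you train on several difficulty levels of the same task, and makes it easy to inspect where an evolution went off topic.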
Domain Coverage and Gap Analysis
One of the most impactful uses of synthetic data is targeted coverage of underrepresented task types or topic areas. Before generating data blindly from your corpus, analyse your evaluation benchmark to find where the baseline model fails most often — these failure clusters define the coverage gaps that synthetic data should fill. If your model consistently fails on multi-step reasoning questions, generate examples specifically requiring chain-of-thought. If it fails on edge cases in a specific subdomain, generate examples focused on that subdomain. This targeted approach produces much better fine-tuned models than uniform corpus sampling, because it concentrates the training signal where the model needs improvement rather than reinforcing patterns it already handles well.
A practical gap analysis workflow: run your base model on your evaluation set, cluster the failing examples by topic and question type using embedding similarity, rank the clusters by failure rate times cluster size, and generate synthetic examples proportional to each cluster’s need. Clusters with 80%+ failure rate and many examples are the highest-priority targets. Clusters with low failure rate can be addressed with fewer synthetic examples or skipped entirely — adding training examples for things the model already handles well wastes data budget without improving performance.
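Once the clusters exist, the ranking and allocation steps reduce to a few lines. The sketch below assumes clustering has already been done and that each cluster reports how many evaluation examples failed and how many were evaluated. It reads "cluster size" as the number of failing examples in the cluster, which is one interpretation. The cluster names and the skip threshold are hypothetical.

```python
def allocate_synthetic_budget(clusters, total_budget, min_failure_rate=0.1):
    """Split a generation budget across clusters by failure rate x failing examples.

    clusters maps a cluster name to (n_failed, n_evaluated).
    Clusters below min_failure_rate are skipped entirely.
    """
    priority = {}
    for name, (n_failed, n_evaluated) in clusters.items():
        failure_rate = n_failed / n_evaluated
        if failure_rate >= min_failure_rate:
            priority[name] = failure_rate * n_failed
    total_priority = sum(priority.values())
    if total_priority == 0:
        return {}
    ranked = sorted(priority.items(), key=lambda item: item[1], reverse=True)
    return {name: round(total_budget * score / total_priority) for name, score in ranked}

# Hypothetical evaluation results per cluster: (failed, evaluated)
clusters = {
    "multi_step_reasoning": (40, 50),
    "billing_edge_cases": (10, 40),
    "basic_lookup": (2, 100),
}
plan = allocate_synthetic_budget(clusters, total_budget=1000)
```

Because each share is rounded independently, the allocations can differ from the budget by a few examples. That is usually acceptable for planning purposes.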
Cost and Latency Considerations at Scale
Generating training data at scale has real cost implications. A dataset of 100,000 instruction-response pairs with an average response length of 300 tokens requires roughly 30M output tokens (100,000 × 300), before counting the input tokens for prompts and source passages. At typical API rates this is a meaningful budget line, especially if you also run a judge pass for quality scoring. Submitting requests through Anthropic's Message Batches API (or an equivalent) cuts per-token cost by 50% in exchange for asynchronous processing, a trade-off that almost always suits dataset construction. Plan generation as batch jobs rather than real-time calls.
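A back-of-the-envelope estimator makes the budget concrete before any generation starts. In the sketch below, the per-million-token prices and the average input length are placeholders, not real rates. Substitute your provider's current pricing and your own prompt sizes.

```python
def estimate_generation_cost(n_pairs, avg_input_tokens, avg_output_tokens,
                             usd_per_m_input, usd_per_m_output, batch_discount=0.0):
    """Estimate token volume and cost for one generation pass."""
    input_tokens = n_pairs * avg_input_tokens
    output_tokens = n_pairs * avg_output_tokens
    full_price = (input_tokens * usd_per_m_input
                  + output_tokens * usd_per_m_output) / 1_000_000
    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "usd": full_price * (1 - batch_discount),
    }

# Placeholder prices in USD per million tokens; NOT actual API rates
PLACEHOLDER_RATES = {"usd_per_m_input": 1.0, "usd_per_m_output": 5.0}

# 100,000 pairs at 300 output tokens each; 800 input tokens is an assumed prompt size
standard = estimate_generation_cost(100_000, 800, 300, **PLACEHOLDER_RATES)
batched = estimate_generation_cost(100_000, 800, 300, batch_discount=0.5,
                                   **PLACEHOLDER_RATES)
```

Run the same estimate for the judge pass, since its input tokens include every candidate response it has to read.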
For teams with strict cost constraints, a tiered generation strategy works well: use a smaller, cheaper model (like claude-haiku-4-5-20251001) for initial bulk generation and a larger model (claude-sonnet-4-20250514) only for the judge pass and for rewriting the top-priority coverage gaps. The quality difference for straightforward instruction-response generation on well-constrained topics is smaller than the cost difference, so the cheaper model handles the majority of examples adequately. Reserve the expensive model for complex reasoning tasks, preference data judging, and evolved instruction generation where quality differences are most pronounced.
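The tiering itself can be a small routing table in front of the generation code. The task labels below are hypothetical names for the categories mentioned above. Only the two model identifiers come from the text.

```python
BULK_MODEL = "claude-haiku-4-5-20251001"
STRONG_MODEL = "claude-sonnet-4-20250514"

# Hypothetical task labels for work that justifies the more expensive model
STRONG_MODEL_TASKS = {
    "judge",
    "evolve_instruction",
    "complex_reasoning",
    "priority_gap_rewrite",
}

def pick_model(task_type):
    """Route a generation task to the cheap or the strong model."""
    return STRONG_MODEL if task_type in STRONG_MODEL_TASKS else BULK_MODEL
```

Centralising the choice in one function makes it easy to move a task type between tiers after comparing output quality on a sample.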
Avoiding Common Failure Modes
Several failure modes recur across synthetic data generation projects and are worth anticipating explicitly.
Sycophancy amplification is the most insidious. If the generator LLM tends toward agreeable, affirmative responses, the fine-tuned model will inherit and amplify this tendency: it learns that agreeableness is the norm in the training distribution and generalises it beyond what the data intended. Mitigate by including examples where the correct response is a correction, a refusal, or an expression of uncertainty, and by explicitly prompting the generator to produce responses that disagree with incorrect premises.
Topic drift is another common issue. Generators gravitate toward whatever is most heavily represented, both in their own training data and in the source corpus, so a synthetic dataset generated from a domain corpus will overrepresent that corpus's most common topics and underrepresent rare but important edge cases. Track topic coverage across the synthetic set with a topic model or keyword frequency analysis and identify underrepresented areas for targeted generation.
Finally, length bias: LLMs instructed to generate training responses tend to produce longer responses than necessary, a tendency usually attributed to preference tuning that rewards longer answers. Fine-tuning on these produces a model that over-explains. Explicitly constrain response length in generation prompts and include length diversity (short, medium, long responses) as a generation objective.
Track these issues with automated checks built into your generation pipeline rather than catching them during model evaluation — the earlier in the pipeline you detect a systematic generation problem, the less data you have to regenerate. A simple keyword frequency monitor on generated instructions and a response length histogram are cheap to build and catch the most common drift and bias patterns before they contaminate the full dataset.
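Both monitors fit in a few lines of standard-library Python. The sketch below is a minimal version: the bucket edges, the keyword list, and the function names are assumptions, and word counts stand in for token counts.

```python
import bisect
from collections import Counter

def length_histogram(responses, edges=(50, 150, 400)):
    """Count responses per word-count bucket defined by ascending edges."""
    labels = (
        [f"<{edges[0]}"]
        + [f"{lo}-{hi - 1}" for lo, hi in zip(edges, edges[1:])]
        + [f">={edges[-1]}"]
    )
    counts = Counter(
        labels[bisect.bisect_right(edges, len(text.split()))] for text in responses
    )
    return {label: counts.get(label, 0) for label in labels}

def keyword_coverage(instructions, keywords):
    """Share of instructions that mention each keyword (case-insensitive substring)."""
    lowered = [text.lower() for text in instructions]
    if not lowered:
        return {keyword: 0.0 for keyword in keywords}
    return {
        keyword: sum(keyword.lower() in text for text in lowered) / len(lowered)
        for keyword in keywords
    }
```

Running both after every generation batch and comparing against the previous batch is usually enough to spot a drift toward long responses or a narrowing set of topics.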