Examples of LLM Techniques: From Prompting to Fine-Tuning and Beyond

Large language models have evolved from simple text completion tools into sophisticated systems capable of reasoning, coding, and complex task execution. But understanding the theory behind LLMs is vastly different from knowing how to actually use them effectively. The gap between reading about transformer architectures and building production systems is filled with practical techniques—specific methods that practitioners use daily to extract maximum value from these models.

This article explores concrete examples of LLM techniques across the spectrum of sophistication, from zero-shot prompting that requires no training data to advanced fine-tuning methods that reshape model behavior. Rather than superficially covering every technique, we’ll dive deep into the most impactful approaches with real examples you can adapt to your own projects. Whether you’re building a chatbot, automating document analysis, or creating code generation tools, these techniques form the foundation of effective LLM implementation.

Prompting Techniques: The Foundation

Prompting is how you communicate with language models, and the way you craft prompts dramatically affects output quality. While it might seem straightforward to “just ask the model,” sophisticated prompting techniques can mean the difference between useless garbage and production-ready results.

Zero-Shot Prompting

Zero-shot prompting is the simplest technique: you describe the task and the model attempts it without any examples. This works surprisingly well for straightforward tasks where the model already has the necessary knowledge from pre-training.

Example: Sentiment Analysis

Analyze the sentiment of this customer review and classify it as positive, negative, or neutral:

Review: "The product arrived two weeks late and the packaging was damaged, but once I got it working, the quality exceeded my expectations."

Sentiment:

The model will typically identify this as mixed or neutral, recognizing both negative (shipping issues) and positive (quality) elements. Zero-shot works here because sentiment analysis is a well-understood task that appears frequently in the model’s training data.

Zero-shot prompting excels at classification, summarization, simple Q&A, and format transformation. It fails when tasks require domain-specific knowledge the model hasn’t encountered, highly specialized reasoning patterns, or precise output formatting that the model doesn’t naturally follow.

Few-Shot Prompting

Few-shot prompting provides examples within the prompt itself, showing the model the exact pattern you want. This dramatically improves performance on tasks where zero-shot struggles, essentially teaching the model through demonstration.

Example: Custom Entity Extraction

Extract the product name, issue type, and urgency from customer support tickets.

Ticket: "My SuperClean 3000 vacuum won't turn on after yesterday's use. Need help ASAP!"
Product: SuperClean 3000
Issue: Won't power on
Urgency: High

Ticket: "The EcoWash dishwasher has a small leak under the door. Not urgent but should be looked at."
Product: EcoWash
Issue: Water leak
Urgency: Low

Ticket: "Can someone tell me if the SmartBrew coffee maker is compatible with third-party pods?"
Product: SmartBrew
Issue: Compatibility question
Urgency: Medium

Ticket: "My AirFlow purifier makes a grinding noise and has been for the past week. Really annoying."
Product:

The model will follow this exact structure for the final ticket, extracting “AirFlow,” “Unusual noise,” and “Medium” by pattern-matching the examples. Few-shot prompting is particularly powerful because it works with any model—no training or fine-tuning required.

The number of examples matters. Three to five examples typically provide the best balance—enough to establish the pattern without consuming too much context window or introducing noise. Ten examples rarely outperform five, and single-example prompting (one-shot) often fails to establish clear patterns.
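
A few-shot prompt like the one above is usually assembled programmatically. A minimal sketch, where `build_few_shot_prompt` and its field names are hypothetical helpers rather than any library's API:

```python
def build_few_shot_prompt(instruction: str, examples: list[dict], query: str) -> str:
    """Assemble an instruction, worked examples, and a new query into one prompt."""
    parts = [instruction, ""]
    for ex in examples:
        parts += [
            f'Ticket: "{ex["ticket"]}"',
            f"Product: {ex['product']}",
            f"Issue: {ex['issue']}",
            f"Urgency: {ex['urgency']}",
            "",
        ]
    # End with the new ticket and the first output label so the model
    # continues the established pattern.
    parts += [f'Ticket: "{query}"', "Product:"]
    return "\n".join(parts)

prompt = build_few_shot_prompt(
    "Extract the product name, issue type, and urgency from customer support tickets.",
    [
        {"ticket": "My SuperClean 3000 vacuum won't turn on. Need help ASAP!",
         "product": "SuperClean 3000", "issue": "Won't power on", "urgency": "High"},
    ],
    "My AirFlow purifier makes a grinding noise.",
)
```

Keeping examples in a list makes it easy to swap them per task or trim them when the context window is tight.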

Chain-of-Thought Prompting

Chain-of-thought (CoT) prompting encourages models to show their reasoning process step-by-step before reaching conclusions. This technique is transformative for complex reasoning tasks where jumping directly to an answer leads to errors.

Example: Multi-Step Problem Solving

A company has 150 employees. They want to organize a team-building event where each team has exactly 8 members. If they can't form a complete team, those employees will help with event coordination instead. How many complete teams can they form, and how many coordinators will there be? Let's think step by step.

Step 1: Determine how many complete teams of 8 can be formed from 150 employees.
150 ÷ 8 = 18.75

Step 2: Since we can only form complete teams, we take the whole number.
Complete teams = 18

Step 3: Calculate how many employees are in complete teams.
18 teams × 8 members = 144 employees

Step 4: Calculate remaining employees who will be coordinators.
150 - 144 = 6 coordinators

Answer: The company can form 18 complete teams with 6 coordinators.

Simply ending the prompt with “Let’s think step by step” or “Let’s solve this carefully” activates the chain-of-thought behavior. The model breaks down complex problems into manageable steps, dramatically reducing errors on mathematical reasoning, logical puzzles, and multi-hop questions.

CoT becomes essential when dealing with problems that require multiple reasoning steps, numerical calculations, or combining information from different parts of the context. For simple questions, it adds unnecessary verbosity without improving accuracy.

Self-Consistency and Voting

Self-consistency takes chain-of-thought further by generating multiple reasoning paths and selecting the most common answer. This technique reduces errors from individual reasoning mistakes by leveraging the “wisdom of the crowd” within a single model.

How it works in practice:

  1. Generate 5-10 chain-of-thought responses to the same question
  2. Extract the final answer from each reasoning path
  3. Select the answer that appears most frequently

For the team-building problem above, you might get 7 responses saying “18 teams, 6 coordinators” and 3 responses with arithmetic errors leading to different answers. Self-consistency would correctly choose the majority answer. This technique is particularly valuable for high-stakes decisions where accuracy matters more than speed.
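
The three steps reduce to a short voting loop. A minimal sketch, assuming any `generate` callable that returns a chain-of-thought response ending in an `Answer:` line:

```python
from collections import Counter

def self_consistent_answer(generate, prompt, n_samples=10):
    """Sample several reasoning paths, return the majority final answer
    plus its agreement rate."""
    answers = []
    for _ in range(n_samples):
        response = generate(prompt)
        # Extract the final answer from the last "Answer:" line.
        for line in reversed(response.splitlines()):
            if line.startswith("Answer:"):
                answers.append(line.removeprefix("Answer:").strip())
                break
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)
```

In practice `generate` would call the LLM with a nonzero temperature so the reasoning paths actually differ; the agreement rate doubles as a rough confidence signal.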

Prompting Techniques Comparison

Technique        | Setup Effort | Best For                      | Cost per Query
-----------------|--------------|-------------------------------|---------------
Zero-Shot        | Minimal      | Common tasks, classifications | Low
Few-Shot         | Low          | Custom formats, domain tasks  | Medium
Chain-of-Thought | Low          | Complex reasoning, math       | Medium-High
Self-Consistency | Medium       | High-stakes accuracy needs    | Very High

Practical tip: Start with zero-shot for rapid prototyping. If results are inadequate, add 3-5 examples (few-shot). For reasoning tasks, add “Let’s think step by step” (CoT). Reserve self-consistency for critical decisions where 10x cost is justified by accuracy needs.

Retrieval-Augmented Generation (RAG)

RAG techniques address a fundamental LLM limitation: models only know what’s in their training data. When you need up-to-date information, company-specific knowledge, or access to private documents, RAG retrieves relevant information and provides it to the model as context.

Basic RAG Architecture

The standard RAG pattern involves three steps: when a user asks a question, embed the question into a vector, search a vector database for relevant documents, and inject those documents into the LLM prompt along with the original question. The model generates answers grounded in the retrieved information rather than relying solely on parametric knowledge.

Example: Company Knowledge Base

Imagine a customer service bot with access to product documentation:

Retrieved Documents:
[Doc 1] The ProMax 5000 warranty covers manufacturing defects for 24 months from purchase date. Water damage and physical damage are not covered.

[Doc 2] To initiate a warranty claim, customers must provide proof of purchase and photos of the defect. Claims are processed within 5-7 business days.

User Question: How long is the warranty on the ProMax 5000, and what does it cover?

Answer based on the provided documentation:

The model generates an accurate answer combining information from both documents, something impossible with its training data alone. This technique is foundational for chatbots, documentation Q&A systems, and internal knowledge assistants.
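
The embed-search-inject flow can be sketched end to end. Here a toy word-overlap "embedding" stands in for a real embedding model and vector database, purely to show the data flow:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words vector; a real system would call an embedding model.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the question."""
    q = embed(question)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_rag_prompt(question: str, docs: list[str]) -> str:
    """Inject retrieved documents into the prompt alongside the question."""
    context = "\n\n".join(f"[Doc {i}] {d}" for i, d in enumerate(docs, 1))
    return (f"Retrieved Documents:\n{context}\n\n"
            f"User Question: {question}\n\n"
            "Answer based on the provided documentation:")
```

Swapping `embed` for a real embedding model and `retrieve` for a vector-database query preserves the same structure.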

Advanced RAG: Hybrid Search and Reranking

Basic RAG using pure vector similarity often misses relevant documents due to the semantic gap between questions and answers. Hybrid search combines vector similarity with traditional keyword matching, capturing both semantic meaning and exact term matches.

The hybrid approach:

  1. Vector search finds semantically similar documents
  2. BM25 keyword search finds documents with exact term matches
  3. Combine results with weighted scoring (e.g., 70% vector, 30% keyword)
  4. Apply a reranking model to final candidates

Reranking models are specialized cross-encoders that score question-document pairs more accurately than simple vector similarity. While too slow for initial retrieval over millions of documents, they excel at refining a candidate list of 50-100 documents down to the best 5-10.

This two-stage retrieve-then-rerank pattern dramatically improves RAG quality. In production systems, hybrid search with reranking typically achieves 20-40% better answer accuracy compared to pure vector search.
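
The weighted-scoring step amounts to score fusion over two ranked lists. A minimal sketch with min-max normalization, using the 70/30 weights from the example above:

```python
def fuse_scores(vector_scores: dict, keyword_scores: dict,
                w_vector: float = 0.7, w_keyword: float = 0.3) -> dict:
    """Combine vector-similarity and BM25-style scores into one ranking.
    Scores are min-max normalized first so the two scales are comparable."""
    def normalize(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    v, k = normalize(vector_scores), normalize(keyword_scores)
    return {doc: w_vector * v.get(doc, 0.0) + w_keyword * k.get(doc, 0.0)
            for doc in set(v) | set(k)}

fused = fuse_scores(
    vector_scores={"doc_a": 0.91, "doc_b": 0.55, "doc_c": 0.12},
    keyword_scores={"doc_b": 12.4, "doc_c": 3.1, "doc_a": 0.8},
)
candidates = sorted(fused, key=fused.get, reverse=True)[:2]  # pass to the reranker
```

The top candidates then go to the cross-encoder reranker; other fusion schemes (e.g., reciprocal rank fusion) work the same way structurally.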

Hypothetical Document Embeddings (HyDE)

HyDE is a counterintuitive technique that improves retrieval quality by first having the LLM generate a hypothetical answer to the question, then using that generated answer to search for real documents. The insight is that answers are semantically closer to other answers than questions are to answers.

Example workflow:

  1. User asks: “What are the return shipping costs for international orders?”
  2. LLM generates hypothetical answer: “For international returns, customers are responsible for return shipping costs, which typically range from $15-40 depending on destination country and package weight.”
  3. Embed this hypothetical answer and search the knowledge base
  4. Retrieved documents are more relevant because we’re matching answer-to-answer rather than question-to-answer
  5. Generate final answer using actual retrieved documents

HyDE performs particularly well when questions are vague, abstract, or phrased differently from how documentation is written. The technique adds one extra LLM call but significantly improves retrieval quality in challenging scenarios.
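
The whole workflow fits in a few lines. Here `generate`, `embed`, and `search` are stand-ins for an LLM call, an embedding model, and a vector index:

```python
def hyde_retrieve(question, generate, embed, search, k=5):
    """Retrieve by embedding a hypothetical answer rather than the question."""
    hypothetical = generate(
        "Write a short, plausible answer to this question, as if quoting "
        f"documentation:\n\n{question}"
    )
    return search(embed(hypothetical), k=k)
```

The final answer is then generated from the retrieved documents exactly as in basic RAG; the hypothetical text is discarded once retrieval is done.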

Prompt Engineering Patterns

Beyond basic prompting techniques, specific patterns have emerged for structuring prompts to handle complex tasks reliably. These patterns are reusable templates that work across different domains and use cases.

The Persona Pattern

Assigning the model a specific role or expertise changes its response style and knowledge emphasis. This simple technique can dramatically improve output quality for specialized tasks.

Example: Technical vs. Business Explanations

You are a senior software architect with 15 years of experience in distributed systems. Explain database sharding to a junior developer in a way that emphasizes practical considerations and common pitfalls.

versus:

You are a business consultant explaining technology to non-technical executives. Explain database sharding focusing on business benefits and cost implications, avoiding technical jargon.

The same question receives vastly different answers based on the assigned persona. The first emphasizes implementation details and edge cases; the second focuses on scalability ROI and operational costs. Use personas whenever your task has a clear stakeholder or requires specific expertise.

The Template Pattern

Template prompts provide rigid structure that the model must fill in, ensuring consistent output formatting. This technique is essential when LLM output feeds into downstream systems that expect specific formats.

Example: Structured Data Extraction

Extract information from the following email and fill in this JSON template. If information is not available, use null.

Email: "Hi Sarah, I'd like to schedule a call next Tuesday at 2pm to discuss the Q4 budget proposal. Please send me the financial projections beforehand. Thanks, Michael"

Template:
{
  "sender_name": "",
  "recipient_name": "",
  "action_requested": "",
  "deadline": "",
  "meeting_date": "",
  "meeting_time": "",
  "topic": "",
  "attachments_mentioned": []
}

The model fills in the template with extracted information, producing machine-readable output that’s trivial to parse. This pattern eliminates the unreliability of free-form output when you need structured data.
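
On the consuming side, the model's reply still needs defensive parsing, since models sometimes wrap JSON in extra prose. A sketch whose field names match the template above:

```python
import json

TEMPLATE_FIELDS = {
    "sender_name", "recipient_name", "action_requested", "deadline",
    "meeting_date", "meeting_time", "topic", "attachments_mentioned",
}

def parse_template_response(raw: str) -> dict:
    """Extract the JSON object from a model reply and validate its fields."""
    start, end = raw.find("{"), raw.rfind("}") + 1
    if start == -1 or end == 0:
        raise ValueError("no JSON object found in model output")
    data = json.loads(raw[start:end])
    missing = TEMPLATE_FIELDS - set(data)
    if missing:
        raise ValueError(f"model omitted fields: {sorted(missing)}")
    return data
```

Failing loudly on missing fields lets you retry the extraction rather than silently passing incomplete records downstream.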

The Verification Pattern

For high-stakes outputs, the verification pattern has the model generate an answer, then critically evaluate its own response for accuracy and completeness. This self-checking reduces errors and hallucinations.

Example: Medical Information Verification

Question: What are the common side effects of metformin?

First, provide your answer:
[Model generates answer]

Now, verify your answer by checking for:
1. Did you include only medically documented side effects?
2. Did you avoid making claims about severity without proper context?
3. Did you include appropriate disclaimers about consulting healthcare providers?
4. Are there any statements you're uncertain about?

Verification:
[Model evaluates its own response]

Final answer (incorporating verification feedback):

The model catches its own errors, adds missing disclaimers, and moderates overconfident statements. This two-stage generate-then-verify pattern is particularly valuable in medical, legal, and financial domains where accuracy is critical.

Fine-Tuning Techniques

When prompting techniques hit their limits, fine-tuning reshapes model behavior by continuing training on task-specific data. Fine-tuning requires more effort than prompting but unlocks capabilities impossible through prompts alone.

Supervised Fine-Tuning (SFT)

Supervised fine-tuning trains the model on input-output pairs that demonstrate desired behavior. This is the most common fine-tuning approach, appropriate when you have hundreds to thousands of examples of correct outputs for your specific task.

When to use SFT:

  • You have 500+ high-quality labeled examples
  • The task requires consistent output formatting that prompting can’t reliably achieve
  • Domain-specific terminology or style is critical
  • You need to reduce inference costs by using a smaller fine-tuned model instead of prompting a large model

Example use case: Legal Document Classification

A law firm might collect 2,000 contracts manually labeled by experts into categories (employment, NDA, licensing, partnership). Fine-tuning a smaller model on this data creates a specialized classifier that outperforms prompting GPT-4 at roughly one-tenth the cost per classification.

The process involves formatting your examples into prompt-completion pairs, splitting into train/validation sets (80/20), training for 2-5 epochs with appropriate learning rate, and evaluating on held-out data. Modern fine-tuning APIs from OpenAI, Anthropic, and others make this process accessible without deep ML expertise.
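
The formatting-and-splitting step can be sketched as follows. The prompt/completion record shape here is illustrative; check your provider's fine-tuning documentation for the exact schema it expects:

```python
import json
import random

def prepare_sft_dataset(examples, train_path, val_path,
                        val_fraction=0.2, seed=42):
    """Write labeled examples as prompt/completion JSONL, split train/val."""
    records = [
        {"prompt": f"Classify this contract:\n\n{ex['text']}\n\nCategory:",
         "completion": f" {ex['label']}"}
        for ex in examples
    ]
    random.Random(seed).shuffle(records)  # shuffle before splitting
    cut = int(len(records) * (1 - val_fraction))
    for path, subset in ((train_path, records[:cut]), (val_path, records[cut:])):
        with open(path, "w") as f:
            for record in subset:
                f.write(json.dumps(record) + "\n")
    return cut, len(records) - cut
```

Fixing the shuffle seed keeps the split reproducible across training runs, which matters when comparing hyperparameters.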

Parameter-Efficient Fine-Tuning (PEFT)

PEFT techniques like LoRA (Low-Rank Adaptation) fine-tune only a small subset of model parameters rather than the entire model. This dramatically reduces computational costs and makes fine-tuning large models practical on consumer hardware.

LoRA advantages:

  • Trains 100x fewer parameters than full fine-tuning
  • Maintains base model performance on general tasks
  • Multiple LoRA adapters can be swapped for different tasks
  • Requires 3-4x less GPU memory during training

Instead of updating billions of parameters, LoRA adds small adapter layers and trains only those. A LoRA adapter is typically 10-100MB instead of multi-GB full model checkpoints, making them easy to distribute and version control.

Practical example: You could maintain a base Llama-2-13B model with separate LoRA adapters for customer support, code generation, and content writing. Each adapter specializes the model for its task while sharing the same base model weights.

Instruction Tuning

Instruction tuning is supervised fine-tuning specifically on diverse instruction-following tasks. Rather than specializing for one task, instruction tuning improves general instruction-following ability across many tasks.

The training data consists of thousands of varied instructions: summarize this article, translate this text, answer this question, write code for this problem. The model learns the meta-skill of understanding and executing novel instructions.

This technique is how base models like GPT-3 become instruction-following assistants like ChatGPT. For practitioners, instruction tuning is valuable when building general-purpose assistants or when you need strong performance across many related tasks without task-specific fine-tuning for each one.

Technique Selection Framework

Start Here: Zero/Few-Shot Prompting

  • If the task is common and the model understands it → Zero-shot
  • If the output format is specific → Few-shot (3-5 examples)
  • Cost: $0.01-0.10 per query | Setup: Minutes

Upgrade: Advanced Prompting

  • If the task requires reasoning → Chain-of-thought
  • If accuracy is critical → Self-consistency
  • If you need current or private data → RAG
  • Cost: $0.10-1.00 per query | Setup: Hours to days

Advanced: Fine-Tuning

  • If you have 500+ labeled examples → Supervised fine-tuning
  • If you need a specialized model → LoRA/PEFT
  • If you’re building a general assistant → Instruction tuning
  • Cost: $100-5000 training + $0.001-0.01 per query | Setup: Days to weeks

Decision Rule: Use the simplest technique that meets your requirements. Prompting is almost always worth trying first—it’s fast to test and often sufficient. Escalate to fine-tuning only when prompting demonstrably fails or costs become prohibitive at scale.

Inference Optimization Techniques

Beyond improving model outputs, practical techniques optimize inference speed and cost—critical concerns for production systems serving real users.

Streaming Responses

Streaming returns model output token-by-token as generation occurs rather than waiting for complete responses. This dramatically improves perceived latency for interactive applications. Users see the first words within 200-500ms instead of waiting 5-10 seconds for full responses.

Most modern LLM APIs support streaming through server-sent events or websockets. The implementation is straightforward: instead of a single response containing complete text, you receive multiple chunks that are concatenated on the client side.

When streaming matters:

  • Chatbots and conversational interfaces
  • Long-form content generation where total generation time exceeds 3-5 seconds
  • Any user-facing application where perceived responsiveness matters

Streaming is purely a UX optimization—it doesn’t change model output or reduce actual generation time, but the psychological impact on users is substantial.
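
Client-side handling reduces to concatenating chunks while updating the UI as each one arrives. A minimal sketch, with `chunks` standing in for the iterator a streaming API returns:

```python
def consume_stream(chunks, on_chunk):
    """Invoke a UI callback for each chunk as it arrives, then return the
    complete response once the stream ends."""
    parts = []
    for chunk in chunks:
        on_chunk(chunk)      # e.g., append the text to the chat window now
        parts.append(chunk)
    return "".join(parts)
```

With a real API client, `chunks` would iterate over server-sent events; the concatenation logic is the same.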

Response Caching

Many applications send similar or identical prompts repeatedly. Response caching stores previous outputs and returns them instantly for matching inputs, eliminating redundant API calls entirely.

Effective caching strategies:

  • Cache complete prompt+completion pairs using content hashing
  • Set TTL (time-to-live) based on information freshness requirements
  • Implement semantic similarity matching for near-duplicate prompts
  • Use tiered caching (in-memory for hot queries, Redis for warm, database for cold)

For applications with recurring questions, caching can reduce LLM API costs by 60-80% while improving response times from seconds to milliseconds. A customer support bot answering “What are your business hours?” doesn’t need to call the LLM for the 100th time.
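
A minimal content-hash cache with TTL, sketched as an in-process dict (a production system would use Redis or a similar shared store):

```python
import hashlib
import time

class ResponseCache:
    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl = ttl_seconds
        self._store = {}

    def _key(self, prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str):
        entry = self._store.get(self._key(prompt))
        if entry is None:
            return None
        response, stored_at = entry
        if time.monotonic() - stored_at > self.ttl:   # entry expired
            return None
        return response

    def put(self, prompt: str, response: str) -> None:
        self._store[self._key(prompt)] = (response, time.monotonic())

def cached_call(cache, call_llm, prompt):
    """Return a cached response when available; otherwise call the model."""
    hit = cache.get(prompt)
    if hit is not None:
        return hit
    response = call_llm(prompt)
    cache.put(prompt, response)
    return response
```

Hashing the full prompt means any change to system instructions or context correctly invalidates the cache entry.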

Prompt Compression

Longer prompts cost more and process slower. Prompt compression techniques reduce token count without losing critical information, directly reducing costs and latency.

Compression approaches:

  • Remove redundant whitespace and formatting
  • Use abbreviations for frequently repeated terms
  • Eliminate unnecessary pleasantries in multi-turn conversations
  • Compress long documents using extractive summarization before injection

For RAG systems, compress retrieved documents by extracting only relevant passages rather than including full documents. Instead of injecting three 2000-token documents (6000 tokens), extract the relevant 200-token sections from each (600 tokens). This 10x compression maintains quality while cutting costs dramatically.
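
A crude extractive filter for that last step, keeping only the sentences with the most term overlap with the question (a stand-in for a proper extractive summarizer or passage ranker):

```python
import re

def extract_relevant_passages(question: str, document: str,
                              max_sentences: int = 3) -> str:
    """Keep the sentences sharing the most terms with the question,
    preserving their original order in the document."""
    q_terms = set(re.findall(r"\w+", question.lower()))
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())

    def overlap(sentence: str) -> int:
        return len(q_terms & set(re.findall(r"\w+", sentence.lower())))

    kept = set(sorted(sentences, key=overlap, reverse=True)[:max_sentences])
    return " ".join(s for s in sentences if s in kept)
```

Even this naive version illustrates the cost mechanics: the injected context shrinks to a few sentences per document while the passages the model actually needs survive.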

Model Selection and Routing

Different tasks require different model capabilities. Sophisticated systems route queries to appropriate models rather than using one model for everything.

Capability-Based Routing

Simple questions don’t need the most powerful (expensive) models. Route based on complexity:

  • Simple classification, extraction → Fast, cheap model (GPT-3.5, Claude Haiku)
  • Complex reasoning, long context → Powerful model (GPT-4, Claude Opus)
  • Code generation, technical tasks → Specialized model (GPT-4, Claude Sonnet)

Implement a lightweight classifier that analyzes incoming queries and routes to appropriate models. This hybrid approach reduces costs 40-70% compared to using the most powerful model for everything while maintaining quality on tasks that need it.

Example routing logic:

  • Question length < 50 tokens and no reasoning keywords → GPT-3.5
  • Contains code, SQL, or technical terms → Claude Sonnet
  • Multiple steps, comparison, or analysis required → GPT-4
  • Default fallback → Mid-tier model
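
That routing logic translates directly into a small classifier. The keyword lists, the whitespace-split token heuristic (a real system would count tokens with the model's tokenizer), and the model names are all illustrative:

```python
import re

REASONING_KEYWORDS = {"compare", "analyze", "analyse", "why", "steps", "versus"}
TECHNICAL_PATTERN = re.compile(r"\b(sql|select|code|function|class|regex)\b",
                               re.IGNORECASE)

def route_query(query: str) -> str:
    words = [w.strip("?,.!").lower() for w in query.split()]
    if TECHNICAL_PATTERN.search(query):
        return "claude-sonnet"      # code / technical tasks
    if any(w in REASONING_KEYWORDS for w in words):
        return "gpt-4"              # multi-step analysis
    if len(words) < 50:
        return "gpt-3.5"            # short, simple queries
    return "mid-tier"               # default fallback
```

Logging the routing decision alongside response quality lets you tune the rules, or replace them with a learned classifier, as traffic grows.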

Cascade Patterns

Cascade patterns try fast/cheap models first and fall back to powerful models only when needed. This optimizes the cost-quality tradeoff by using expensive models minimally.

Typical cascade:

  1. Try GPT-3.5 with confidence scoring
  2. If confidence < 0.7, retry with GPT-4
  3. If still uncertain, escalate to human

The confidence scoring is crucial—use techniques like asking the model to rate its own certainty, checking for hedging language, or using a separate verification model. This pattern typically handles 60-70% of queries with cheap models while maintaining high quality through selective escalation.
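
The cascade reduces to a short escalation function. Here confidence comes from a naive hedging-language check; real systems might use a self-rated score or a separate verifier model instead:

```python
HEDGES = ("i'm not sure", "i am not sure", "it depends", "cannot determine")

def hedging_confidence(response: str) -> float:
    """Naive scorer: treat hedging language as low confidence."""
    text = response.lower()
    return 0.3 if any(h in text for h in HEDGES) else 0.9

def cascade(query, cheap_model, strong_model,
            score=hedging_confidence, threshold=0.7):
    """Try the cheap model first; escalate when confidence is low."""
    answer = cheap_model(query)
    if score(answer) >= threshold:
        return answer, "cheap"
    answer = strong_model(query)            # retry with the stronger model
    if score(answer) >= threshold:
        return answer, "strong"
    return answer, "needs_human_review"     # final escalation
```

Tracking which tier answered each query gives you the data to tune the threshold against real cost and quality numbers.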

Conclusion

The landscape of LLM techniques spans from simple zero-shot prompts to sophisticated fine-tuning pipelines, each with distinct use cases and tradeoffs. Mastering these techniques means knowing when to use which approach: prompting for rapid iteration and flexibility, RAG for knowledge grounding, fine-tuning for specialized performance, and inference optimization for production scalability. The most effective LLM systems combine multiple techniques strategically rather than relying on any single approach.

Success with LLMs comes from methodical experimentation and measurement. Start simple with prompting techniques, measure performance rigorously, and escalate complexity only when simpler methods prove insufficient. The techniques covered here form a practical toolkit for building reliable, cost-effective LLM applications that deliver real value to users.
