Chain-of-Thought Prompting: How It Works, When to Use It, and Advanced Variants

What Is Chain-of-Thought Prompting?

Chain-of-thought (CoT) prompting is a technique that instructs a language model to show its reasoning step by step before producing a final answer. Rather than jumping directly to a conclusion, the model works through the problem explicitly — identifying relevant information, applying logic, considering intermediate results — and only then commits to an answer. The reasoning trace is not just a presentation artifact: it actually changes what answer the model produces, typically making it more accurate on tasks that require multi-step reasoning.

The technique was introduced in a 2022 paper by researchers at Google, who showed that including worked reasoning examples in the prompt substantially improved performance on arithmetic, commonsense reasoning, and symbolic reasoning tasks, particularly for larger models. A follow-up paper the same year found that simply appending “Let’s think step by step” to a prompt, with no examples at all, recovers much of that benefit. CoT has since become one of the most widely used prompting techniques in production LLM applications.

Why Chain-of-Thought Works

Language models generate text token by token, and each token is conditioned on everything that came before it. When a model produces a reasoning trace before its final answer, that trace becomes part of the context for generating the answer. Working through intermediate steps in text gives the model access to those intermediate results when computing the final answer — effectively expanding its working memory beyond what fits in a single forward pass.

This is why CoT helps most on tasks that require sequential reasoning: the model uses its own generated text as a scratchpad. For tasks where the answer can be retrieved directly — factual recall, simple classification — CoT adds latency and tokens without improving accuracy, and can sometimes hurt by giving the model opportunities to reason itself into the wrong answer.

Zero-Shot Chain-of-Thought

The simplest form of CoT requires no examples — just a prompt suffix instructing the model to reason before answering:

prompt = """A store sells apples for 0.50 each and oranges for 0.75 each.
If I buy 4 apples and 3 oranges, how much do I spend in total?

Let's think step by step."""

The phrase “Let’s think step by step” reliably triggers reasoning behaviour in most capable models. Alternatives that also work include “Think through this carefully before answering”, “Work through this problem step by step”, and “First, let’s reason about what we know.” The exact phrasing matters less than the instruction to reason before concluding.
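Since the exact phrasing is worth comparing on your own task, it can help to factor the suffix into a small helper. The sketch below is illustrative only: `with_cot_trigger` and `DEFAULT_TRIGGER` are hypothetical names, not part of any library. It also spells out the arithmetic that a correct reasoning chain for the store example should reproduce.

```python
DEFAULT_TRIGGER = "Let's think step by step."


def with_cot_trigger(question: str, trigger: str = DEFAULT_TRIGGER) -> str:
    """Append a reasoning trigger to a question (hypothetical helper)."""
    return f"{question.strip()}\n\n{trigger}"


question = (
    "A store sells apples for 0.50 each and oranges for 0.75 each.\n"
    "If I buy 4 apples and 3 oranges, how much do I spend in total?"
)
prompt = with_cot_trigger(question)

# The intermediate results a correct chain should contain:
apple_cost = 4 * 0.50                      # 2.00
orange_cost = 3 * 0.75                     # 2.25
expected_total = apple_cost + orange_cost  # 4.25
```

Swapping in another trigger is then a one-argument change, for example `with_cot_trigger(question, "Work through this problem step by step.")`.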

For instruction-tuned models like Claude and GPT-4, you can also include the instruction in the system prompt to apply it globally across all turns:

system_prompt = """You are a helpful assistant. For any question that requires
reasoning or calculation, always work through your thinking step by step before
giving your final answer."""
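With the Anthropic Messages API, the system prompt is passed as a top-level `system` parameter rather than as an entry in `messages`. The sketch below only builds the request arguments, so it runs without network access; `build_request` is a hypothetical helper, and the model ID simply reuses the one that appears later in this article.

```python
SYSTEM_PROMPT = (
    "You are a helpful assistant. For any question that requires "
    "reasoning or calculation, always work through your thinking step by step "
    "before giving your final answer."
)


def build_request(
    question: str,
    system_prompt: str = SYSTEM_PROMPT,
    model: str = "claude-sonnet-4-6",  # assumption: same model ID as used elsewhere in this article
    max_tokens: int = 1024,
) -> dict:
    """Build keyword arguments for a Messages API call (hypothetical helper)."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": system_prompt,  # applies to every turn in the conversation
        "messages": [{"role": "user", "content": question}],
    }


request = build_request("What is 17% of 240?")
# To send it: anthropic.Anthropic().messages.create(**request)
```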

Few-Shot Chain-of-Thought

Few-shot CoT includes worked examples in the prompt, showing the model not just that it should reason step by step, but what that reasoning should look like for your specific task:

prompt = """Solve the following word problems. Show your work step by step.

Problem: A baker makes 24 loaves on Monday and 18 loaves on Tuesday.
She sells 30 loaves. How many does she have left?
Solution: Start with total loaves: 24 + 18 = 42 loaves.
Subtract sold: 42 - 30 = 12 loaves remaining.
Answer: 12 loaves.

Problem: A train travels 60 mph for 2.5 hours. How far does it travel?
Solution: Distance = speed × time = 60 × 2.5 = 150 miles.
Answer: 150 miles.

Problem: {new_problem}
Solution:"""

Few-shot CoT often outperforms zero-shot CoT because the examples teach the model the reasoning format expected for your specific domain. The tradeoff is longer prompts and the need to curate good examples. For high-stakes or complex tasks, the quality of the few-shot examples matters enormously: poorly reasoned examples can teach the model bad reasoning patterns.
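When the example set is curated separately from the prompt text, it can be convenient to assemble the prompt from structured records. This is one possible sketch, not a standard API: `WorkedExample` and `build_few_shot_prompt` are hypothetical names, and the new problem at the end is made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkedExample:
    problem: str
    solution: str  # the reasoning steps, written out
    answer: str


def build_few_shot_prompt(
    examples: List[WorkedExample],
    new_problem: str,
    instruction: str = "Solve the following word problems. Show your work step by step.",
) -> str:
    """Render worked examples followed by the unsolved problem."""
    parts = [instruction, ""]
    for ex in examples:
        parts += [
            f"Problem: {ex.problem}",
            f"Solution: {ex.solution}",
            f"Answer: {ex.answer}",
            "",
        ]
    # End on an open "Solution:" so the model continues with reasoning.
    parts += [f"Problem: {new_problem}", "Solution:"]
    return "\n".join(parts)


examples = [
    WorkedExample(
        problem="A baker makes 24 loaves on Monday and 18 loaves on Tuesday. "
                "She sells 30 loaves. How many does she have left?",
        solution="Start with total loaves: 24 + 18 = 42 loaves. "
                 "Subtract sold: 42 - 30 = 12 loaves remaining.",
        answer="12 loaves.",
    ),
    WorkedExample(
        problem="A train travels 60 mph for 2.5 hours. How far does it travel?",
        solution="Distance = speed × time = 60 × 2.5 = 150 miles.",
        answer="150 miles.",
    ),
]
few_shot_prompt = build_few_shot_prompt(
    examples, "A car travels 45 mph for 2 hours. How far does it travel?"
)
```

Building the string with `join` rather than `str.format` avoids errors when a problem statement happens to contain curly braces.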

Variants and Extensions

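Self-consistency extends CoT by generating multiple independent reasoning chains for the same problem and taking a majority vote over the final answers. Because different reasoning paths can arrive at different answers, aggregating over many samples is more robust than relying on a single chain. The chains must be sampled with a nonzero temperature; otherwise they will be nearly identical and the vote adds nothing. Self-consistency improves accuracy substantially on mathematical reasoning tasks, typically 5–15 percentage points over single-chain CoT, at the cost of 10–40x more tokens. Latency grows by the same factor only if the samples are generated one after another; the calls are independent, so they can be issued in parallel.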

import anthropic
from collections import Counter

client = anthropic.Anthropic()

def self_consistent_answer(problem: str, n_samples: int = 10) -> str:
    answers = []
    for _ in range(n_samples):
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=512,
            messages=[{
                "role": "user",
                "content": f"{problem}\n\nLet's think step by step. At the end, state your final answer clearly as 'Answer: X'."
            }]
        )
        text = response.content[0].text
        # Extract the final answer
        if "Answer:" in text:
            answer = text.split("Answer:")[-1].strip().split("\n")[0]
            answers.append(answer)
    # Return majority vote
    return Counter(answers).most_common(1)[0][0] if answers else "No answer"
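One weakness of voting over raw strings is that "4.25", "$4.25", and "4.25." count as three different answers. A possible refinement, sketched here under the assumption that the answers are numeric, is to normalise each answer before counting. `normalize_answer` and `majority_vote` are hypothetical helpers that operate on already-extracted answer strings, so they run without any API calls.

```python
import re
from collections import Counter
from typing import List, Optional


def normalize_answer(raw: str) -> str:
    """Reduce a free-text answer to a canonical form when it contains a number."""
    match = re.search(r"-?\d+(?:\.\d+)?", raw.replace(",", ""))
    if match is None:
        # Non-numeric answer: fall back to light text normalisation.
        return raw.strip().lower().rstrip(".")
    number = match.group()
    if "." in number:
        # Drop trailing zeros so that 4.50 and 4.5 vote together.
        number = number.rstrip("0").rstrip(".")
    return number


def majority_vote(raw_answers: List[str]) -> Optional[str]:
    """Return the most common normalised answer, or None if there are no answers."""
    votes = Counter(normalize_answer(a) for a in raw_answers)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


samples = ["$4.25", "4.25.", "4.25 dollars", "4.50", "4.25"]
winner = majority_vote(samples)
```

The normaliser keeps only the first number it finds, which is a deliberate simplification; answers with units or multiple numbers would need task-specific handling.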

Tree of Thoughts (ToT) generalises CoT by exploring multiple reasoning branches simultaneously, evaluating the promise of each branch, and pruning dead ends. It is more computationally expensive than self-consistency but better at tasks requiring strategic planning and backtracking — solving puzzles, writing code that satisfies complex constraints, multi-step planning problems.
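The control flow can be illustrated with a beam-search flavour of the idea on a toy problem. In a real system, `propose` and `score` would each be LLM calls (one generating candidate next thoughts, one rating them); here they are plain functions so the sketch is runnable. All names and the toy task (pick numbers from 7, 5, 3 that sum to exactly 15) are assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple

State = Tuple[int, ...]  # a partial solution: the numbers chosen so far


def tree_of_thoughts(
    root: State,
    propose: Callable[[State], List[State]],
    score: Callable[[State], float],
    is_solution: Callable[[State], bool],
    beam_width: int = 3,
    max_depth: int = 4,
) -> Optional[State]:
    """Expand several branches per level, keep the best few, stop at a solution."""
    frontier: List[State] = [root]
    for _ in range(max_depth):
        candidates = [child for state in frontier for child in propose(state)]
        for candidate in candidates:
            if is_solution(candidate):
                return candidate
        # Keep only the most promising branches; everything else is pruned.
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]
        if not frontier:
            return None
    return None


TARGET = 15
CHOICES = (7, 5, 3)


def propose(state: State) -> List[State]:
    # Branches that would overshoot the target are never generated.
    return [state + (c,) for c in CHOICES if sum(state) + c <= TARGET]


def score(state: State) -> float:
    # Closer to the target is more promising.
    return -(TARGET - sum(state))


def is_solution(state: State) -> bool:
    return sum(state) == TARGET


solution = tree_of_thoughts((), propose, score, is_solution)
```

In this run the highest-scoring branch at depth two, (7, 7), turns out to be a dead end because every extension overshoots, and the search succeeds through the next branch in the beam. A single greedy chain would have been stuck at that point.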

ReAct (Reason + Act) interleaves reasoning and tool use: the model reasons about what to do next, takes an action (calls a tool), observes the result, and reasons again. This is the foundation of most modern agent architectures, combining the benefits of CoT reasoning with the ability to gather information from external sources.
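The reason, act, observe cycle can be sketched as a short loop. To keep the example runnable offline, the model is replaced by a scripted stand-in; in practice `model` would wrap an LLM call. The `Action: tool[input]` and `Final Answer:` text conventions, the `calculator` tool, and every function name here are assumptions for illustration, not a fixed protocol.

```python
import operator
import re
from typing import Callable, Optional

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}
NUMBER = r"-?\d+(?:\.\d+)?"


def calculator(expression: str) -> str:
    """Toy tool: evaluates '<number> <op> <number>' without using eval."""
    match = re.fullmatch(rf"\s*({NUMBER})\s*([-+*/])\s*({NUMBER})\s*", expression)
    if match is None:
        return "error: expected '<number> <op> <number>'"
    left, op, right = match.groups()
    return str(OPS[op](float(left), float(right)))


TOOLS = {"calculator": calculator}


def react_loop(model: Callable[[str], str], question: str, max_steps: int = 5) -> Optional[str]:
    """Alternate model steps and tool calls until the model gives a final answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = model(transcript)  # expected: a Thought plus an Action or a Final Answer
        transcript += step + "\n"
        final = re.search(r"Final Answer:\s*(.+)", step)
        if final:
            return final.group(1).strip()
        action = re.search(r"Action:\s*(\w+)\[(.*)\]", step)
        if action:
            name, argument = action.groups()
            tool = TOOLS.get(name)
            observation = tool(argument) if tool else f"error: unknown tool {name}"
            # The observation is appended so the next model step can use it.
            transcript += f"Observation: {observation}\n"
    return None  # step budget exhausted without a final answer


def scripted_model(transcript: str) -> str:
    """Stand-in for an LLM: first requests a calculation, then answers."""
    if "Observation:" not in transcript:
        return "Thought: I need the product of 60 and 2.5.\nAction: calculator[60 * 2.5]"
    observation = transcript.rsplit("Observation:", 1)[1].strip()
    return f"Thought: The tool returned the distance.\nFinal Answer: {observation} miles"


result = react_loop(scripted_model, "A train travels 60 mph for 2.5 hours. How far does it travel?")
```

The `max_steps` cap matters in practice: without it, a model that never emits a final answer would loop indefinitely.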

Program of Thought (PoT) has the model generate code rather than natural-language reasoning, then executes the code to get the answer. For mathematical and algorithmic tasks, this is more reliable than natural-language CoT because the answer comes from code execution rather than the model’s arithmetic, which is a well-known weakness.
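A minimal sketch of the execution half of PoT follows. The `generated_code` string stands in for what a model might return when asked to solve the store problem from earlier by writing Python that assigns its result to a variable named `answer`; that naming convention and the `run_generated_program` helper are assumptions. Emptying `__builtins__` is only a light guard, not a sandbox, so untrusted model output should be run in an isolated process or container.

```python
def run_generated_program(code: str, result_var: str = "answer"):
    """Execute model-written code and return the variable holding the result."""
    namespace = {"__builtins__": {}}  # removes open(), __import__(), etc. NOT a real sandbox
    exec(code, namespace)
    return namespace[result_var]


# Stand-in for model output; in practice this string comes from the LLM.
generated_code = """
apple_cost = 4 * 0.50
orange_cost = 3 * 0.75
answer = apple_cost + orange_cost
"""

total = run_generated_program(generated_code)
```

The model is responsible only for translating the problem into code; the arithmetic itself is done by the interpreter.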

When to Use Chain-of-Thought

CoT reliably helps on tasks that require multiple reasoning steps: arithmetic, logical deduction, commonsense reasoning about sequences of events, code debugging, structured analysis. The benefit scales with task complexity — simple tasks see little improvement, while hard multi-step tasks see the largest gains.

CoT is less useful or counterproductive for tasks where reasoning is not needed: direct factual retrieval, simple classification, tasks where the answer is obvious from the question. For these, CoT adds tokens and latency without improving quality, and occasionally introduces errors by giving the model space to second-guess a correct initial answer.

Model size matters too. CoT reasoning capabilities emerge at scale — smaller models (under ~7B parameters) often produce incoherent or unhelpful reasoning chains that do not improve and sometimes hurt final accuracy. If you are using a smaller model, test whether CoT actually helps on your specific task rather than assuming it will.
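Testing whether CoT helps comes down to running the same labelled questions with and without the reasoning instruction and comparing accuracy. The harness below is a sketch: `ask` is any function you supply that takes a question and a `use_cot` flag and returns the model's final answer as a string. The `fake_ask` stand-in and its tiny dataset are invented so the example runs offline; its numbers say nothing about any real model.

```python
from typing import Callable, Dict, List, Tuple

Dataset = List[Tuple[str, str]]  # (question, gold answer)


def accuracy(ask: Callable[[str, bool], str], dataset: Dataset, use_cot: bool) -> float:
    """Fraction of questions answered exactly right under one strategy."""
    correct = sum(1 for question, gold in dataset if ask(question, use_cot).strip() == gold)
    return correct / len(dataset)


def compare_strategies(ask: Callable[[str, bool], str], dataset: Dataset) -> Dict[str, float]:
    """Run the same dataset with and without CoT and report the difference."""
    direct = accuracy(ask, dataset, use_cot=False)
    cot = accuracy(ask, dataset, use_cot=True)
    return {"direct": direct, "cot": cot, "gain": cot - direct}


DATASET: Dataset = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("(3 + 4) * 5 - 6", "29"),
    ("17 * 23 + 9", "400"),
]
MULTI_STEP = {"(3 + 4) * 5 - 6", "17 * 23 + 9"}


def fake_ask(question: str, use_cot: bool) -> str:
    """Stand-in model: gets multi-step questions right only when reasoning."""
    if question in MULTI_STEP and not use_cot:
        return "unknown"
    return dict(DATASET)[question]


report = compare_strategies(fake_ask, DATASET)
```

On a real task, use enough examples that the difference is not noise, and keep the answer-extraction logic identical across both strategies.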

Extracting the Final Answer

When using CoT, you need to reliably extract the final answer from the reasoning trace. A few approaches work well. The simplest is to instruct the model to format its final answer in a consistent, easily parseable way:

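prompt += "\n\nAfter your reasoning, state your final answer on a new line starting with 'ANSWER:'"

# Then extract. `response` is the object returned by client.messages.create(...).
response_text = response.content[0].text
if "ANSWER:" in response_text:
    # Use the last marker in case the reasoning itself mentions "ANSWER:".
    final_answer = response_text.split("ANSWER:")[-1].strip().split("\n")[0]
else:
    # Without this guard, the split would return the first line of the reasoning.
    final_answer = None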

For structured tasks, instruct the model to produce JSON for the final answer after its prose reasoning. This gives you the benefits of step-by-step reasoning while making the answer easy to parse programmatically. Alternatively, use a two-step approach: first generate the reasoning chain, then make a second API call that reads the reasoning and extracts only the structured answer — separating the reasoning and extraction steps often improves reliability for complex tasks.
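For the JSON variant, the parsing step has to skip over the prose reasoning and tolerate stray braces in it. One way to do that with the standard library is `json.JSONDecoder.raw_decode`, which parses a JSON value starting at a given index and reports where it ended. The helper name and the sample response below are illustrative assumptions.

```python
import json
from typing import Optional


def extract_last_json_object(text: str) -> Optional[dict]:
    """Return the last top-level JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    result = None
    pos = text.find("{")
    while pos != -1:
        try:
            obj, end = decoder.raw_decode(text, pos)
            if isinstance(obj, dict):
                result = obj
            # Resume after this object so nested objects are not re-parsed.
            pos = text.find("{", end)
        except json.JSONDecodeError:
            # Not valid JSON at this brace (e.g. braces in prose); try the next one.
            pos = text.find("{", pos + 1)
    return result


response_text = (
    "Apples cost 4 * 0.50 = 2.00.\n"
    "Oranges cost 3 * 0.75 = 2.25.\n"
    "The total is 2.00 + 2.25 = 4.25.\n"
    '{"answer": 4.25, "steps": 3}'
)
parsed = extract_last_json_object(response_text)
```

Returning `None` when nothing parses lets the caller decide whether to retry or fall back to the second-call extraction approach.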

Chain-of-Thought in Production

The main practical concerns with CoT in production are cost and latency. A reasoning chain adds 200–1,000 tokens or more to each response, increasing both cost and the time until the final answer arrives. Time to first token is largely unaffected, which is why streaming helps: for latency-sensitive applications, letting users see the reasoning appear in real time, rather than waiting for the full response, significantly improves perceived responsiveness even though total generation time is unchanged.

For high-volume, cost-sensitive applications, consider using CoT selectively: apply it only to requests that the model flags as complex or uncertain, and use direct answering for simpler requests. A fast classifier or the model’s own confidence signal can route requests to the appropriate strategy. This hybrid approach captures most of the accuracy benefit of CoT at a fraction of the cost of applying it universally.
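As a rough illustration of the routing idea, the sketch below uses a keyword-and-number heuristic as a stand-in for the fast classifier. The cue list, the threshold, and the function names are all assumptions; a production router would more likely be a small trained classifier or a cheap model call, validated against labelled traffic.

```python
import re

# Hypothetical cue phrases suggesting multi-step reasoning is needed.
REASONING_CUES = ("how many", "how much", "calculate", "prove", "debug", "why")


def needs_cot(request: str, min_numbers: int = 2) -> bool:
    """Heuristic: route to CoT if the request has a reasoning cue or several numbers."""
    text = request.lower()
    number_count = len(re.findall(r"\d+(?:\.\d+)?", text))
    has_cue = any(cue in text for cue in REASONING_CUES)
    return has_cue or number_count >= min_numbers


def route_prompt(request: str) -> str:
    """Append the reasoning trigger only when the heuristic asks for it."""
    if needs_cot(request):
        return f"{request}\n\nLet's think step by step."
    return request


simple = route_prompt("What is the capital of France?")
complex_ = route_prompt("If I buy 4 apples and 3 oranges, how much do I spend in total?")
```

Whatever the router is, log its decisions so that misrouted requests (hard problems answered directly) can be found and used to improve it.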

Extended Thinking in Modern Models

The most powerful evolution of chain-of-thought in 2025 and 2026 is extended thinking — a mode available in models like Claude and OpenAI’s o-series where the model reasons internally before producing its visible response. Unlike standard CoT where the reasoning is part of the output, extended thinking happens in a separate “thinking” block that can be shown to users or hidden, and the model can reason for much longer without bloating the visible response.

Extended thinking enables qualitatively different reasoning: the model can explore multiple approaches, back up and try again when a path leads nowhere, and spend more compute on hard problems. For tasks that genuinely require deep reasoning — complex mathematics, multi-step code generation, intricate logical puzzles — extended thinking models substantially outperform standard CoT on the same prompt.

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # how much to spend on internal reasoning
    },
    messages=[{
        "role": "user",
        "content": "Prove that there are infinitely many prime numbers."
    }]
)

for block in response.content:
    if block.type == "thinking":
        print("Internal reasoning:", block.thinking[:200], "...")
    elif block.type == "text":
        print("Final response:", block.text)

The tradeoff is significant cost and latency — extended thinking can use thousands of tokens of internal reasoning before producing the final answer. Use it for tasks where reasoning quality clearly matters more than speed, and measure whether the quality improvement justifies the cost for your specific use case.

Evaluating Chain-of-Thought Quality

Evaluating CoT is trickier than evaluating direct answers because you need to assess both the reasoning process and the final answer. A correct answer reached through faulty reasoning is a reliability risk: the model got lucky and may fail on similar problems. A few evaluation approaches are useful.

Step validity checking uses an LLM judge to evaluate whether each reasoning step is logically valid and follows from the previous one.

Faithfulness checking verifies that the final answer actually follows from the reasoning chain rather than contradicting it.

Counterfactual testing modifies a key fact in the problem and checks whether the reasoning chain updates appropriately. Models that have learned to pattern-match rather than reason often produce the same conclusion despite a changed premise.

Building these checks into your evaluation pipeline gives you confidence that CoT is genuinely improving reasoning rather than just generating plausible-looking text that happens to reach the right answer.
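For arithmetic chains, a cheap deterministic check can run alongside an LLM judge. The sketch below is not the LLM-judge approach: it is a simpler, regex-based substitute that finds statements of the form `a op b = c` in a reasoning trace and recomputes them. The function name and the pattern are assumptions, and it only handles single binary operations.

```python
import re
from typing import List, Tuple

NUMBER = r"-?\d+(?:\.\d+)?"
STEP_PATTERN = re.compile(rf"({NUMBER})\s*([-+*/×])\s*({NUMBER})\s*=\s*({NUMBER})")


def check_arithmetic_steps(reasoning: str, tolerance: float = 1e-9) -> List[Tuple[str, bool]]:
    """Return (step text, is_valid) for every 'a op b = c' found in the reasoning."""
    results = []
    for match in STEP_PATTERN.finditer(reasoning):
        left, op, right, claimed = match.groups()
        a, b, c = float(left), float(right), float(claimed)
        if op == "+":
            actual = a + b
        elif op == "-":
            actual = a - b
        elif op in ("*", "×"):
            actual = a * b
        else:
            # Division by zero can never match a claimed result.
            actual = a / b if b != 0 else float("nan")
        results.append((match.group(0), abs(actual - c) <= tolerance))
    return results


good_chain = (
    "Start with total loaves: 24 + 18 = 42 loaves.\n"
    "Subtract sold: 42 - 30 = 12 loaves remaining."
)
bad_chain = "Distance = speed × time = 60 × 2.5 = 160 miles."

good_report = check_arithmetic_steps(good_chain)
bad_report = check_arithmetic_steps(bad_chain)
```

A chain that passes this check can still be wrong (for example, by setting up the wrong equation), so it complements the judge-based checks rather than replacing them.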

Practical Tips for Getting the Most from CoT

A few patterns reliably improve CoT quality in practice.

Ask for uncertainty. Instruct the model to note when it is unsure at any step: “If you are uncertain about any step, say so explicitly.” This produces more calibrated reasoning and helps identify where the chain is weakest.

Specify the reasoning format. For domain-specific tasks, tell the model how to structure its reasoning: what to identify first, what to calculate, what to verify. A structured template reduces variance in reasoning quality.

Include verification steps. Ask the model to check its answer at the end: “After reaching your answer, verify it by working backwards or checking it against the original problem.” This catches a meaningful fraction of arithmetic and logical errors.

Use shorter chains for simpler problems. More reasoning is not always better: for problems that require two or three steps, a long elaborated chain can introduce errors that a tighter chain would avoid. Match the depth of reasoning to the actual complexity of the task rather than defaulting to the longest possible chain.
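These tips compose naturally into a single prompt template. The builder below is one possible arrangement, with hypothetical names and an invented step list; the uncertainty and verification sentences are the ones quoted above.

```python
from typing import List


def build_structured_cot_prompt(
    problem: str,
    steps: List[str],
    ask_uncertainty: bool = True,
    ask_verification: bool = True,
) -> str:
    """Combine a reasoning format, uncertainty request, and verification step."""
    lines = [problem.strip(), "", "Work through the problem using these steps:"]
    lines += [f"{i}. {step}" for i, step in enumerate(steps, start=1)]
    if ask_uncertainty:
        lines.append("If you are uncertain about any step, say so explicitly.")
    if ask_verification:
        lines.append(
            "After reaching your answer, verify it by working backwards "
            "or checking it against the original problem."
        )
    lines.append("State your final answer on a new line starting with 'ANSWER:'.")
    return "\n".join(lines)


structured_prompt = build_structured_cot_prompt(
    "A baker makes 24 loaves on Monday and 18 loaves on Tuesday. "
    "She sells 30 loaves. How many does she have left?",
    steps=[
        "Identify the quantities given.",
        "Compute the total produced.",
        "Subtract the quantity sold.",
    ],
)
```

For simple problems, a shorter step list (or none) keeps the chain tight, in line with the last tip.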

Chain-of-Thought vs. Fine-Tuning

A common question is whether to use CoT prompting or fine-tune a model on task-specific examples. The two approaches are not mutually exclusive. CoT is the right first step — it requires no training data, no infrastructure, and no retraining cycle. Fine-tuning on CoT traces (training the model to produce reasoning chains similar to your best few-shot examples) is a more advanced step that makes sense once you have validated that CoT reasoning genuinely helps your task and you have collected enough high-quality examples to make training worthwhile. Fine-tuned CoT models tend to be faster and cheaper than prompting a large model with long few-shot examples, at the cost of the upfront training investment. For most applications, start with prompting, measure the impact, and only move to fine-tuning if the accuracy gains justify the operational complexity of maintaining a custom model.

The key insight is that chain-of-thought prompting is one of the highest-leverage techniques available to practitioners working with LLMs today — low cost to implement, broadly applicable, and grounded in a clear mechanistic understanding of why it works.
