Why Token Optimisation Matters
Every token sent to an LLM API costs money — both input and output tokens. At low volume, the amounts are trivial. At production scale, they become significant operational expenses that compound with every feature addition, every edge case in prompt design, and every traffic spike. An application that sends 5,000 input tokens per request when 800 would suffice is paying 6x more than necessary on every call. Output tokens cost 3–5x more per token than input tokens on most providers, so verbose responses add cost disproportionately. Token optimisation is the discipline of getting the same quality output with fewer tokens — reducing cost while maintaining the user experience.
Measure First: Understanding Your Token Profile
Before optimising, measure exactly where your tokens are going. Log the token breakdown for every production request: system prompt tokens, conversation history tokens, RAG context tokens, user input tokens, and output tokens. This breakdown reveals which component is driving cost and where optimisation will have the most impact.
import anthropic
from dataclasses import dataclass

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

@dataclass
class TokenProfile:
    system_tokens: int = 0
    history_tokens: int = 0
    context_tokens: int = 0
    user_input_tokens: int = 0
    total_input_tokens: int = 0
    output_tokens: int = 0
    cached_tokens: int = 0

def profile_request(system: str, messages: list, model: str = "claude-sonnet-4-6") -> tuple[str, TokenProfile]:
    response = client.messages.create(
        model=model, max_tokens=1024, system=system, messages=messages
    )
    # The API reports totals only; the per-component fields stay at 0 unless
    # you count each component separately before sending.
    p = TokenProfile(
        total_input_tokens=response.usage.input_tokens,
        output_tokens=response.usage.output_tokens,
        # May be missing or None when nothing was read from cache.
        cached_tokens=getattr(response.usage, "cache_read_input_tokens", 0) or 0,
    )
    return response.content[0].text, p
Typical findings from this analysis: system prompts are 30–60% of total input tokens; RAG context is 20–40%; conversation history grows unbounded without truncation and can dominate long sessions. Knowing which component is largest tells you where to focus first.
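The per-component breakdown described above can be approximated before a request is sent. The sketch below is illustrative only: `estimate_tokens`, `ComponentBreakdown`, and `breakdown` are hypothetical names, and the 1.3 tokens-per-word heuristic is an assumption, not a real tokenizer. For exact counts, use your provider's token-counting endpoint if it offers one.

```python
from dataclasses import dataclass


def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 1.3 tokens per word); NOT a real tokenizer."""
    return round(len(text.split()) * 1.3)


@dataclass
class ComponentBreakdown:
    system: int
    history: int
    context: int
    user_input: int

    def shares(self) -> dict[str, float]:
        """Return each component as a fraction of total estimated input."""
        parts = {
            "system": self.system,
            "history": self.history,
            "context": self.context,
            "user_input": self.user_input,
        }
        total = sum(parts.values()) or 1  # avoid division by zero
        return {name: count / total for name, count in parts.items()}


def breakdown(system: str, history: list[dict], rag_chunks: list[str],
              user_input: str) -> ComponentBreakdown:
    """Estimate tokens per prompt component (assumes string message content)."""
    return ComponentBreakdown(
        system=estimate_tokens(system),
        history=sum(estimate_tokens(m["content"]) for m in history),
        context=sum(estimate_tokens(c) for c in rag_chunks),
        user_input=estimate_tokens(user_input),
    )
```

Logging `breakdown(...).shares()` alongside the totals reported by the API shows which component dominates, which is what determines where to optimise first.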
Optimising System Prompts
System prompts are sent with every single request — even a 100-token reduction multiplied by 100,000 daily requests saves 10 million tokens per day. Audit your system prompt for common bloat patterns. Remove redundant instructions that repeat the same constraint in different words. Remove examples that are covered by the model’s base capability without needing exemplification. Remove formatting instructions that the model already follows reliably without being told. Remove caveats and disclaimers that appear in the prompt but not in actual responses.
A practical optimisation workflow: take your current system prompt, ask a capable model to compress it while preserving all distinct instructions, then test the compressed version against your evaluation suite. A 500-token system prompt can often be reduced to 200 tokens without any measurable quality change. That 300-token saving at 50,000 daily requests saves 15 million input tokens per day — at Claude Sonnet pricing, roughly $45/day or $1,350/month from a single prompt edit.
COMPRESS_PROMPT = """Compress this system prompt to the minimum tokens needed to preserve all distinct instructions.
Remove redundancy, verbose phrasing, and anything the model would do anyway without being told.
Return only the compressed prompt, nothing else.
Original:
{prompt}"""
response = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=2048,
    messages=[{"role": "user", "content": COMPRESS_PROMPT.format(prompt=your_system_prompt)}]
)
compressed = response.content[0].text
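The saving quoted above is simple arithmetic and worth recomputing with your own traffic and prices. A minimal sketch, assuming a price of $3 per million input tokens (the rate implied by the $45/day figure); `daily_saving_usd` is a hypothetical helper name.

```python
def daily_saving_usd(tokens_saved_per_request: int,
                     requests_per_day: int,
                     usd_per_million_tokens: float) -> float:
    """Daily saving from removing a fixed number of tokens from every request."""
    tokens_saved = tokens_saved_per_request * requests_per_day
    return tokens_saved * usd_per_million_tokens / 1_000_000


# 300 tokens removed, 50,000 requests/day, assumed $3 per million input tokens
saving = daily_saving_usd(300, 50_000, 3.00)
print(f"${saving:.2f}/day, ${saving * 30:.2f}/month")  # $45.00/day, $1350.00/month
```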
Controlling Output Length
Output tokens cost 3–5x more than input tokens. Verbose responses are one of the most significant sources of unnecessary API cost — and the model’s default verbosity is often much higher than users need. Several techniques reliably reduce output length without degrading quality.
Explicit length instructions. Simply telling the model how long the response should be is highly effective: “Answer in 2–3 sentences”, “Provide a bullet list of 5 items maximum”, “Write a one-paragraph summary”. Models follow these instructions reliably. Add length guidance to your system prompt for every task type that does not require unlimited length.
max_tokens limits. Setting a hard token ceiling prevents runaway responses but risks cutting off legitimate content mid-sentence. Set max_tokens to 1.5x your typical expected response length — enough headroom that well-formed responses complete, while cutting off unusually verbose ones.
Output format constraints. Requesting JSON or structured output implicitly limits verbosity because the format has no room for explanatory prose. A JSON response to a classification task might use 50 tokens where a prose explanation uses 300.
# Instead of: "Is this review positive or negative?"
# Use: "Classify this review. Return JSON only: {"sentiment": "positive" or "negative", "score": 1-5}"
# Saves ~250 output tokens per request
response = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=50,  # JSON response needs very few tokens
    messages=[{"role": "user", "content":
        'Classify: "This product exceeded my expectations!" Return JSON: {"sentiment":"..","score":1-5}'}]
)
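The `max_tokens` guidance above (about 1.5x the typical response length) can be derived from logged output sizes, and truncation can be detected after the fact. This is a sketch under assumptions: `max_tokens_budget` and `was_truncated` are hypothetical helpers, the median is used as "typical", and the check relies on Anthropic's Messages API reporting `stop_reason == "max_tokens"` when the ceiling is hit.

```python
import statistics


def max_tokens_budget(observed_output_tokens: list[int], headroom: float = 1.5) -> int:
    """Ceiling = headroom x the median observed output length."""
    typical = statistics.median(observed_output_tokens)
    return int(typical * headroom)


def was_truncated(stop_reason: str) -> bool:
    """True when the response hit the max_tokens ceiling instead of ending naturally."""
    return stop_reason == "max_tokens"


# Example: output sizes logged for one task type
budget = max_tokens_budget([100, 120, 200])  # median is 120, so the budget is 180
```

Tracking the share of responses where `was_truncated(response.stop_reason)` is true indicates whether the ceiling is too tight for legitimate answers.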
Conversation History Truncation
In multi-turn conversations, history grows with every exchange. Without management, a long conversation’s context can grow to tens of thousands of tokens — far more than needed for the current question. Three strategies manage this effectively:
Sliding window. Keep only the last N turns. Simple to implement, loses early context. Appropriate when recent context is all that matters.
Summarisation. When history exceeds a threshold, summarise older turns into a compact representation and replace them. Preserves semantic content at the cost of detail.
Selective retention. Keep only turns that established important context (decisions, key facts, constraints) and discard filler exchange. More accurate than sliding window but requires logic to identify what matters.
def truncate_history(messages: list[dict], max_tokens: int = 2000) -> list[dict]:
    """Keep the most recent messages that fit within the token budget."""
    def token_estimate(m: dict) -> float:
        # Rough estimate: ~1.3 tokens per whitespace-separated word.
        return len(m["content"].split()) * 1.3

    total = 0.0
    kept = []
    for msg in reversed(messages):  # walk from newest to oldest
        t = token_estimate(msg)
        if total + t > max_tokens:
            break
        kept.append(msg)
        total += t
    kept.reverse()  # restore chronological order
    return kept
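The function above implements the sliding-window strategy. The summarisation strategy can be sketched in a similar way. Everything here is an assumption for illustration: `summarise_old_turns` is a hypothetical helper, the summariser is passed in as a plain callable (in practice it would wrap a call to a cheap model), and the summary is injected as a user message, which may need adjusting if your provider enforces strict role alternation.

```python
from typing import Callable


def summarise_old_turns(
    messages: list[dict],
    summarise: Callable[[str], str],
    keep_recent: int = 4,
) -> list[dict]:
    """Replace all but the last `keep_recent` messages with one summary message."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    if len(messages) <= keep_recent:
        return list(messages)  # nothing old enough to summarise

    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in old)
    summary = {
        "role": "user",
        "content": f"Summary of earlier conversation: {summarise(transcript)}",
    }
    return [summary] + recent
```

Passing the summariser as an argument keeps the truncation logic testable without network access and lets you swap models freely.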
RAG Context Compression
Retrieved document chunks are often the largest component of input token cost in RAG applications. Retrieved chunks may contain redundant information, irrelevant surrounding context, or formatting that adds tokens without adding meaning. Compressing retrieved context before injecting it into the prompt can reduce RAG input costs by 30–50%:
COMPRESS_CONTEXT = """Extract only the information relevant to this query from the text below.
Remove redundant sentences, headers, and formatting. Keep only facts that help answer the query.
Query: {query}
Text: {chunk}
Return only the relevant extracted content:"""
def compress_chunk(query: str, chunk: str) -> str:
    if len(chunk.split()) < 50:
        return chunk  # Don't compress very short chunks
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Cheap model for compression
        max_tokens=300,
        messages=[{"role": "user", "content": COMPRESS_CONTEXT.format(query=query, chunk=chunk)}]
    )
    return response.content[0].text
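Use a cheap, fast model (Haiku, Flash) for compression, and check that the compression call costs less than the tokens it removes. Compressing a 1,000-token chunk to 300 tokens removes 700 input tokens from the downstream call, which at $3 per million input tokens is about $0.0021. The compression call itself reads the full 1,000-token chunk and generates roughly 300 output tokens, so it pays off only when the compression model's rates are a small fraction of the main model's, or when the compressed chunk is reused across several downstream calls. Run the numbers for your own model pair before enabling it.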
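Whether compression pays for itself depends on the price gap between the two models. The sketch below computes the net saving per chunk. `compression_net_saving_usd` is a hypothetical helper, and all prices in the example are illustrative placeholders rather than quoted rates. It also ignores the small fixed overhead of the compression instructions themselves.

```python
def compression_net_saving_usd(
    chunk_tokens: int,
    compressed_tokens: int,
    main_input_price: float,    # USD per million input tokens, main model
    small_input_price: float,   # USD per million input tokens, compression model
    small_output_price: float,  # USD per million output tokens, compression model
    reuse_count: int = 1,       # main-model calls that reuse the compressed chunk
) -> float:
    """Net USD saved per chunk; negative means compression costs more than it saves."""
    saved = (chunk_tokens - compressed_tokens) * main_input_price * reuse_count
    spent = chunk_tokens * small_input_price + compressed_tokens * small_output_price
    return (saved - spent) / 1_000_000


# Illustrative placeholder prices only; substitute your providers' current rates.
cheap = compression_net_saving_usd(1000, 300, 3.00, 0.25, 1.25)   # positive: worth it
pricey = compression_net_saving_usd(1000, 300, 3.00, 1.00, 5.00)  # negative: not worth it
```

A negative result means the chunk should be passed through uncompressed, or that compression only makes sense when the compressed chunk is reused across several calls.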
Prompt Caching: Eliminating Repeated Input Costs
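For applications with a shared system prompt or document context sent on every request, prompt caching is the single highest-leverage token optimisation. Anthropic and OpenAI both offer it. Cached input tokens are billed at a fraction of the standard input price: about 10% on Anthropic, with the exact discount varying by provider and model. On Anthropic, writing a prefix to the cache costs somewhat more than a normal input token, so caching pays off once the prefix is reused.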
response = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=1024,
    system=[{
        "type": "text",
        "text": your_large_system_prompt,
        "cache_control": {"type": "ephemeral"}  # Cache this prefix
    }],
    messages=[{"role": "user", "content": user_question}]
)
saved = response.usage.cache_read_input_tokens or 0  # None/0 when nothing was read from cache
# $3 per million input tokens = $0.003 per 1K; cached reads cost ~10%, so ~90% is saved.
print(f"Cached tokens this call: {saved} (~{saved * 0.003 * 0.9 / 1000:.4f} USD saved)")
For a 5,000-token system prompt at 50,000 daily requests on Claude Sonnet: without caching, that is 250 million input tokens per day, or about $750/day at $3 per million tokens. With a 90% cache hit rate and cached tokens billed at 10% of the normal price, the same traffic costs roughly $142.50/day (before any surcharge on cache writes). That is a saving of about $600/day, or roughly $18,000/month, from a small code change.
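The caching arithmetic can be reproduced with a few lines, which makes it easy to plug in your own hit rate and prices. This is a sketch under assumptions: `daily_prompt_cost_usd` is a hypothetical helper, $3 per million input tokens is assumed for the main model, and `write_price_ratio` is a placeholder for any cache-write surcharge (set to 1.0 here, meaning none).

```python
def daily_prompt_cost_usd(
    prompt_tokens: int,
    requests_per_day: int,
    usd_per_million: float,
    hit_rate: float = 0.0,             # fraction of requests served from cache
    cached_price_ratio: float = 0.10,  # cached reads billed at 10% of normal price
    write_price_ratio: float = 1.0,    # multiplier for cache misses/writes (assumption)
) -> float:
    """Daily cost of sending a shared prompt prefix, with optional caching."""
    total_tokens = prompt_tokens * requests_per_day
    hit_cost = total_tokens * hit_rate * usd_per_million * cached_price_ratio
    miss_cost = total_tokens * (1 - hit_rate) * usd_per_million * write_price_ratio
    return (hit_cost + miss_cost) / 1_000_000


uncached = daily_prompt_cost_usd(5_000, 50_000, 3.00)              # about 750 USD/day
cached = daily_prompt_cost_usd(5_000, 50_000, 3.00, hit_rate=0.9)  # about 142.5 USD/day
print(f"Saving: ${uncached - cached:.2f}/day")
```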
Choosing Smaller Models for Appropriate Tasks
Not every task needs a frontier model. A significant fraction of production LLM tasks — classification, extraction, simple summarisation, format conversion — can be handled by economy models (Claude Haiku, Gemini Flash, GPT-4o mini) at 10–50x lower cost with comparable quality. The discipline of task-appropriate model selection is one of the most impactful token cost optimisations available. Build a routing layer that sends simple tasks to economy models and reserves standard/frontier models for genuinely complex requests. Track quality metrics per model tier to confirm that economy models deliver acceptable results for the tasks you assign them — the data will tell you definitively whether downgrading a task type is safe.
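A routing layer can start as a simple lookup on task type. The sketch below is a minimal illustration, not a production router: `ECONOMY_TASKS`, `TIERS`, `Route`, and `route` are hypothetical names, the model identifiers are the ones used elsewhere in this article, and the long-input threshold is an added assumption that sends very large inputs to the stronger tier.

```python
from dataclasses import dataclass

# Task types the article lists as suitable for economy models.
ECONOMY_TASKS = {"classification", "extraction", "simple_summarisation", "format_conversion"}

# Model identifiers reused from the examples in this article.
TIERS = {"economy": "claude-haiku-4-5-20251001", "standard": "claude-sonnet-4-6"}


@dataclass(frozen=True)
class Route:
    tier: str
    model: str


def route(task_type: str, input_tokens: int, long_input_threshold: int = 8_000) -> Route:
    """Send simple, reasonably sized tasks to the economy tier; everything else to standard."""
    if task_type in ECONOMY_TASKS and input_tokens <= long_input_threshold:
        return Route("economy", TIERS["economy"])
    return Route("standard", TIERS["standard"])
```

Logging the chosen tier with each request makes it possible to compare quality metrics per tier, which is how you confirm a downgrade is safe.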
Batch API for 50% Off Async Workloads
For any processing that does not need real-time responses (nightly document processing, offline analysis, data enrichment pipelines), the Batch API from Anthropic and OpenAI offers 50% off standard pricing. Submit the requests in bulk (OpenAI takes a JSONL file; Anthropic takes a list of requests), poll for completion, and retrieve the results. The same model, same quality, half the price. No code changes beyond the batch submission logic. For applications processing large document corpora, the batch API is the single easiest 50% cost reduction available.
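Most of the batch submission logic is building the request list. The sketch below builds requests in the shape Anthropic's Message Batches API is understood to accept (a `custom_id` plus a `params` object mirroring a normal Messages call); verify the shape against the current documentation before relying on it. `build_batch_requests` and the summarisation prompt are hypothetical, and the submission call is shown only as a comment because it needs an API key.

```python
def build_batch_requests(documents: dict[str, str], model: str,
                         max_tokens: int = 300) -> list[dict]:
    """Build one batch request per document, keyed by a unique custom_id."""
    return [
        {
            "custom_id": doc_id,  # used to match results back to inputs
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [
                    {"role": "user", "content": f"Summarise in 3 sentences:\n\n{text}"}
                ],
            },
        }
        for doc_id, text in documents.items()
    ]


# Submission (not run here; requires the anthropic package and an API key):
# batch = client.messages.batches.create(
#     requests=build_batch_requests(docs, "claude-haiku-4-5-20251001")
# )
```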
The Token Optimisation Priority Order
Address these in sequence for the best return on effort. First, enable prompt caching for any shared system prompt — usually one line of code, often the biggest absolute saving. Second, compress your system prompt — test against your eval suite, often 30–50% reduction with no quality loss. Third, implement conversation history truncation if your application has multi-turn sessions. Fourth, add output length instructions to your prompts for all task types that do not require unlimited length. Fifth, compress RAG context chunks before injection for document-heavy applications. Sixth, route appropriate tasks to economy model tiers. Seventh, use the batch API for any async processing workloads. Working through this list systematically for a production application typically reduces total token costs by 60–80% without any measurable degradation in output quality.
Few-Shot vs. Zero-Shot Prompting
Few-shot examples in prompts can dramatically improve quality for complex or unusual tasks — but they add tokens for every single request. Before including few-shot examples in a production prompt, verify that they are actually necessary. Run the task zero-shot with a clear instruction and compare the quality to a few-shot version on your evaluation set. For tasks where modern frontier models perform well zero-shot — classification, translation, summarisation of well-defined formats — the few-shot examples add tokens without adding quality. Reserve few-shot examples for tasks where the model genuinely needs them: unusual output formats, domain-specific terminology, or tasks where "show, don't tell" outperforms even a detailed instruction. If examples are needed, use the minimum number that achieves the quality target — often 1–2 examples produce 90% of the benefit of 5–10.
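Once you have measured quality at several example counts, picking the minimum that meets the target is mechanical. A minimal sketch: `min_examples_for_target` is a hypothetical helper, and the scores in the usage example are made-up placeholders standing in for results from your own evaluation set.

```python
from typing import Optional


def min_examples_for_target(score_by_num_examples: dict[int, float],
                            target: float) -> Optional[int]:
    """Smallest example count whose measured score meets the target, or None."""
    passing = [k for k, score in score_by_num_examples.items() if score >= target]
    return min(passing) if passing else None


# Placeholder scores for illustration only; measure these on your own eval set.
measured = {0: 0.82, 1: 0.90, 2: 0.93, 5: 0.94}
best = min_examples_for_target(measured, target=0.90)  # 1 example is enough here
```

A result of 0 means the zero-shot prompt already meets the target and the examples can be dropped entirely.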
Testing Optimisations Without Degrading Quality
Every token optimisation carries a risk of quality degradation. The only reliable way to validate that a change is safe is empirical testing on a representative sample. Before deploying any optimisation — shorter prompts, smaller models, truncated context, compressed RAG chunks — run your evaluation suite and compare quality metrics between the old and new configurations. Track not just average quality but the tail distribution: does the optimisation degrade rare edge cases disproportionately, even if average quality holds? A shadow deployment that runs both configurations on live traffic simultaneously and compares outputs is the gold standard for validating optimisations before fully switching over. Token optimisation done right is a continuous improvement cycle, not a one-time change — the teams that maintain the lowest cost-per-quality-point run small experiments regularly, validate them rigorously, and deploy the winners systematically.
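Comparing the tail as well as the average can be automated. The sketch below assumes each configuration has been scored per example on the same evaluation set, with higher scores being better. `compare_quality` and its thresholds are hypothetical and should be tuned to your own metrics.

```python
import statistics


def compare_quality(
    baseline: list[float],
    candidate: list[float],
    max_mean_drop: float = 0.01,
    max_tail_drop: float = 0.03,
    tail_fraction: float = 0.10,
) -> dict:
    """Compare mean quality and the mean of the worst-scoring fraction of examples."""
    def tail_mean(scores: list[float]) -> float:
        n_worst = max(1, int(len(scores) * tail_fraction))
        return statistics.mean(sorted(scores)[:n_worst])

    mean_drop = statistics.mean(baseline) - statistics.mean(candidate)
    tail_drop = tail_mean(baseline) - tail_mean(candidate)
    return {
        "mean_drop": mean_drop,
        "tail_drop": tail_drop,
        "safe": mean_drop <= max_mean_drop and tail_drop <= max_tail_drop,
    }
```

A candidate whose average holds but whose worst cases collapse is reported as unsafe, which is the failure mode an average-only check misses.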
Tracking the Impact of Optimisations
Token optimisation without measurement is guesswork. Before implementing any change, record your baseline: average input tokens per request, average output tokens per request, and total monthly cost broken down by application component. After each optimisation, compare these numbers against baseline. A system prompt compression that reduces average input tokens from 800 to 320 should be visible immediately in your token usage logs — if it is not, the compression did not deploy correctly or a different component is dominating your input token count. Build a simple weekly token report that tracks these metrics over time, flags any week where cost-per-request increases by more than 10%, and summarises the impact of any optimisations deployed that week. The discipline of measurement transforms token optimisation from an occasional cost-cutting exercise into a continuous operational practice that compounds over time as each improvement builds on the last.
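The weekly report's alert rule can be sketched in a few lines. `WeekStats` and `flag_cost_increases` are hypothetical names, and the inputs are assumed to come from your own usage logs or billing export.

```python
from dataclasses import dataclass


@dataclass
class WeekStats:
    label: str
    requests: int
    total_cost_usd: float

    @property
    def cost_per_request(self) -> float:
        return self.total_cost_usd / self.requests if self.requests else 0.0


def flag_cost_increases(weeks: list[WeekStats], threshold: float = 0.10) -> list[str]:
    """Return labels of weeks whose cost per request rose more than `threshold` week on week."""
    flagged = []
    for prev, cur in zip(weeks, weeks[1:]):
        if prev.cost_per_request > 0 and \
                cur.cost_per_request > prev.cost_per_request * (1 + threshold):
            flagged.append(cur.label)
    return flagged
```

Comparing cost per request instead of total cost keeps traffic growth from triggering false alarms.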