LLM Response Caching: How to Cut API Costs and Latency with Exact, Semantic, and Prompt Caching

Why Caching Matters for LLM Applications

LLM API calls are expensive and slow compared to almost every other operation in a software stack. A single call to a frontier model can cost between a fraction of a cent and several cents, take 1–5 seconds to complete, and involve significant computational overhead on the provider’s side. For applications handling substantial traffic, these costs compound quickly — tens of thousands of API calls per day translate into meaningful infrastructure spend, and even modest latency improvements have a noticeable impact on user experience.

Caching addresses both problems simultaneously. When you can reuse a previous response instead of making a new API call, you avoid the API charge entirely and respond in milliseconds rather than seconds. The challenge is identifying when reuse is safe: LLM outputs are not deterministic, but many application queries are effectively identical or share large common prefixes that can be cached at the provider level. Understanding the different caching strategies and where each applies is one of the more practically impactful optimisations available to LLM application developers.

Exact Match Caching

The simplest form of caching stores the full prompt as a hash key and the model’s response as the value. When an identical prompt arrives, return the cached response immediately without calling the API. This is appropriate when the same prompt is submitted repeatedly — FAQ bots, product description generators with fixed templates, classification tasks applied to a document corpus.

import hashlib, json, redis, anthropic

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)
client = anthropic.Anthropic()

def cached_completion(prompt: str, model: str = "claude-sonnet-4-6", ttl: int = 3600) -> str:
    # Key on model + prompt so a model switch never serves old responses
    key = "llm:" + hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    cached = cache.get(key)
    if cached is not None:  # explicit None check: any stored value counts as a hit
        return json.loads(cached)
    # Cache miss: call the API, then store the response with an expiry
    response = client.messages.create(
        model=model, max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    result = response.content[0].text
    cache.setex(key, ttl, json.dumps(result))
    return result

The TTL should reflect how quickly your data changes. For static content, a multi-day TTL is reasonable. For queries about current events or user-specific data, use a much shorter TTL or skip caching entirely. Always include the model name in the cache key — if you switch models, cached responses from the old model should not be served.
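The same reasoning applies to every other input that changes the output, not just the model name. The sketch below shows one way to build a key from all of them; `make_cache_key` and its parameter list are illustrative assumptions, not part of any SDK, and you would extend the payload with whatever request options your application actually varies.

```python
import hashlib
import json


def make_cache_key(model: str, messages: list, system: str = "",
                   temperature: float = 0.0, max_tokens: int = 1024) -> str:
    """Build a deterministic cache key from every input that affects the output."""
    payload = {
        "model": model,
        "system": system,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    # sort_keys and fixed separators keep the serialisation stable across runs
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return "llm:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Serialising to canonical JSON before hashing means two requests that differ only in dictionary key order still map to the same entry, while any change to the system prompt or sampling settings produces a new key.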

Semantic Caching

Exact match caching only works when prompts are literally identical. Semantic caching extends this by finding cached responses to queries that are similar in meaning. When a new query arrives, embed it, search the cache for embeddings within a cosine similarity threshold, and return the cached response if a sufficiently similar query was seen before.

from openai import OpenAI
import numpy as np

openai_client = OpenAI()

def embed(text: str) -> list[float]:
    return openai_client.embeddings.create(
        input=text, model="text-embedding-3-small"
    ).data[0].embedding

class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

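    def get(self, query: str) -> str | None:
        q = np.array(embed(query))
        best_score, best_response = -1.0, None
        for emb, response in self.entries:
            e = np.array(emb)
            # Cosine similarity; a bare dot product only matches it for unit-length vectors
            score = float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e)))
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None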

    def set(self, query: str, response: str):
        self.entries.append((embed(query), response))

The similarity threshold is the key parameter to tune. Too high (0.99+) and you get very few cache hits. Too low (0.85) and you return cached responses for queries that actually need different answers. For most applications, 0.92–0.96 is a reasonable starting range. In production, use a vector database like Pinecone, Weaviate, or pgvector rather than an in-memory list for scalable approximate nearest-neighbour search at high request volumes.
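One way to choose a threshold is to label a small sample of query pairs by hand and measure how each candidate value behaves. The following is a minimal sketch under that assumption: `sweep_thresholds` is a hypothetical helper, and the `labelled` scores are made-up illustrative values, not measurements.

```python
def sweep_thresholds(pairs, thresholds):
    """Evaluate candidate thresholds on hand-labelled pairs.

    pairs: list of (similarity, same_answer) tuples, where same_answer says
    whether the two queries really should share a cached response.
    Returns a list of (threshold, hit_rate, false_hit_rate).
    """
    results = []
    for t in thresholds:
        # Labels of the pairs that would count as cache hits at this threshold
        hits = [same for sim, same in pairs if sim >= t]
        hit_rate = len(hits) / len(pairs) if pairs else 0.0
        wrong = sum(1 for same in hits if not same)
        false_hit_rate = wrong / len(hits) if hits else 0.0
        results.append((t, hit_rate, false_hit_rate))
    return results


# Illustrative labels only
labelled = [(0.99, True), (0.97, True), (0.94, True),
            (0.93, False), (0.88, False), (0.86, True)]

for t, hit_rate, false_rate in sweep_thresholds(labelled, [0.85, 0.92, 0.96]):
    print(f"threshold={t:.2f} hit_rate={hit_rate:.2f} false_hit_rate={false_rate:.2f}")
```

The false-hit rate is the number to watch: it is the share of cache hits that would have returned an answer to the wrong question.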

Provider-Side Prompt Caching

Both Anthropic and OpenAI offer server-side prompt caching that operates at the token level rather than the response level. When you send a request with a long system prompt or a large injected document, the provider caches the computed key-value representations of those tokens. Subsequent calls that share the same cached prefix are processed faster and billed at a reduced input token rate for the cached tokens. The size of the discount depends on the provider and model, and some providers charge a premium on the request that first writes the cache, so check current pricing before estimating savings.

With Anthropic, mark the sections you want cached using a cache-control breakpoint:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": your_large_system_prompt,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": user_question}]
)
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
print(f"Cache created tokens: {response.usage.cache_creation_input_tokens}")

The ephemeral cache has a lifetime of five minutes by default, and the timer is refreshed each time the cached prefix is read. For applications that process many queries against the same large document (a customer support bot grounded in a product manual, a code assistant with a large codebase injected), prompt caching reduces both latency, because cached tokens skip re-computation, and cost. The savings compound with traffic: an application making 10,000 calls per day against a 50,000-token system prompt saves roughly 90% of the input token cost for that prefix, provided requests arrive often enough to keep the cache warm.
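The 90% figure can be checked with a few lines of arithmetic. This sketch assumes cache reads are billed at 10% of the base input price and cache writes at 125%; the base price of $3 per million tokens and the 50 cache writes per day are hypothetical placeholders, so substitute your provider's current numbers.

```python
def prefix_cost(calls: int, prefix_tokens: int, price_per_mtok: float,
                cache_writes: int = 0,
                read_multiplier: float = 0.10,
                write_multiplier: float = 1.25) -> tuple:
    """Return (uncached_cost, cached_cost) in dollars for the shared prefix only."""
    per_call = prefix_tokens / 1_000_000 * price_per_mtok
    uncached = calls * per_call
    # Requests that write the cache pay the premium; all others pay the read rate
    cached = (cache_writes * write_multiplier
              + (calls - cache_writes) * read_multiplier) * per_call
    return uncached, cached


uncached, cached = prefix_cost(calls=10_000, prefix_tokens=50_000,
                               price_per_mtok=3.0, cache_writes=50)
saving = 1 - cached / uncached
print(f"uncached=${uncached:.2f} cached=${cached:.2f} saving={saving:.1%}")
```

The more often the cache expires and has to be rewritten, the further the saving drops below 90%, which is why steady traffic matters as much as prefix length.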

Caching Embeddings

For RAG applications, embedding costs can rival or exceed completion costs when you are embedding large corpora or re-embedding on every query. Two caching patterns help significantly. First, cache document embeddings persistently: once you have embedded a chunk of text, store the vector alongside the text and never embed it again unless the text changes. A simple content hash as the cache key ensures you automatically re-embed only when content actually changes. Second, cache query embeddings within a session: if a user resubmits a query or returns to an earlier phrasing, its embedding may already be in a short-lived cache, saving a round-trip to the embedding API. A query that has been reworded, however slightly, is new text and still needs a fresh embedding.

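import hashlib

def get_or_create_embedding(text: str, cache: dict) -> list[float]:
    # Content hash as the key: identical text always maps to the same entry
    key = hashlib.sha256(text.encode()).hexdigest()
    if key not in cache:
        cache[key] = embed(text)  # only call the embedding API on a miss
    return cache[key]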

For production scale, back this cache with Redis or a database rather than an in-memory dict, and persist it across application restarts. A corpus of 100,000 chunks embedded once and cached indefinitely costs a fraction of re-embedding on every deployment or application restart.
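As one possible shape for a persistent version, the sketch below uses SQLite from the standard library so that vectors survive restarts. `EmbeddingStore` is a hypothetical class, the embedding function is passed in rather than assumed, and the model name is folded into the key as an added assumption so that switching embedding models does not return vectors from the old one.

```python
import hashlib
import json
import sqlite3
from typing import Callable, List


class EmbeddingStore:
    """Persistent embedding cache keyed by a hash of model name and text."""

    def __init__(self, path: str, embed_fn: Callable[[str], List[float]], model: str):
        self.embed_fn = embed_fn
        self.model = model
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS embeddings "
            "(key TEXT PRIMARY KEY, vector TEXT NOT NULL)"
        )
        self.db.commit()

    def _key(self, text: str) -> str:
        return hashlib.sha256(f"{self.model}:{text}".encode("utf-8")).hexdigest()

    def get(self, text: str) -> List[float]:
        key = self._key(text)
        row = self.db.execute(
            "SELECT vector FROM embeddings WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # cache hit: no API call
        vector = self.embed_fn(text)   # cache miss: embed once and persist
        self.db.execute(
            "INSERT OR REPLACE INTO embeddings (key, vector) VALUES (?, ?)",
            (key, json.dumps(vector)),
        )
        self.db.commit()
        return vector
```

Storing vectors as JSON text keeps the example short; a production store would more likely use a binary column or a vector-aware database.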

Caching in Multi-Turn Conversations

Multi-turn conversations create a caching challenge because the full prompt grows with each turn. A conversation with 20 turns has a different cache key from the same conversation with 19 turns, even though the first 19 turns are identical. Provider-side prompt caching handles this naturally: the shared prefix (all turns up to the last one) is cached, and only the new turn needs to be processed. The match is on an exact prefix, so your application benefits only if it resends the earlier turns unchanged with each request; summarising, compressing, or reordering the history changes the prefix and forces a cache miss. Depending on the provider, you may also need to mark where the cached prefix ends, as with Anthropic’s cache-control breakpoint shown earlier.

For your own application-level cache, a practical approach is to cache at the turn level rather than the conversation level: store each assistant response keyed by the conversation state that produced it (a hash of every message up to and including the latest user message). When the conversation branches or a user edits a previous message, the key changes, the cache misses gracefully, and a new call is made. This requires more careful key design but enables reuse across users who reach the same point in a conversation with the same messages.
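A turn-level cache along these lines can be sketched in a few lines. `conversation_key` and `TurnCache` are hypothetical names, and the in-memory dict stands in for whatever store you actually use.

```python
import hashlib
import json
from typing import Dict, List, Optional


def conversation_key(model: str, messages: List[dict]) -> str:
    """Hash the model plus every message up to and including the latest user turn."""
    canonical = json.dumps(messages, sort_keys=True, separators=(",", ":"))
    return "turn:" + hashlib.sha256(f"{model}\n{canonical}".encode("utf-8")).hexdigest()


class TurnCache:
    """Stores one assistant reply per exact conversation state."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def get(self, model: str, messages: List[dict]) -> Optional[str]:
        return self._store.get(conversation_key(model, messages))

    def set(self, model: str, messages: List[dict], reply: str) -> None:
        self._store[conversation_key(model, messages)] = reply


history = [
    {"role": "user", "content": "How do I reset my password?"},
    {"role": "assistant", "content": "Open Settings and choose Security."},
    {"role": "user", "content": "I can't find Settings."},
]
turn_cache = TurnCache()
turn_cache.set("example-model", history, "It is under the profile icon.")

# An edited earlier message yields a different key, so the lookup misses
edited = [dict(history[0], content="How do I change my email?")] + history[1:]
print(turn_cache.get("example-model", history))
print(turn_cache.get("example-model", edited))
```

Because the key covers the whole history, any two users who arrive at the same state share an entry, and any divergence falls through to a fresh call.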

What Not to Cache

Not every LLM response should be cached, and applying caching indiscriminately can cause subtle but serious bugs. The following categories need careful handling.

Personalised responses. If the system prompt or context includes user-specific information — account details, preferences, history — caching responses across users means user A might see user B’s personalised output. Always include user-identifying information in the cache key, or avoid caching personalised responses entirely.

Time-sensitive queries. “What’s the weather today?” or “What are the latest AI papers?” require fresh responses. Cache these with a very short TTL or not at all. The most elegant approach is to classify queries as time-sensitive before deciding whether to cache.

Stochastic-by-design outputs. Some applications deliberately use temperature above 0 to produce varied creative outputs. Caching these defeats their purpose — a user asking for a creative story idea who gets the same cached response every time will notice. Reserve caching for deterministic or near-deterministic tasks.

Safety-relevant responses. If your application makes moderation decisions or safety classifications, be especially cautious about caching. A cached “safe” classification for a piece of content can mask subtle variations that a fresh model call would catch. For high-stakes safety applications, freshness is usually worth the cost.
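These four rules can be collected into a single gate that runs before any cache read or write. The sketch below is one possible arrangement: `Request`, `cache_policy`, and the keyword list are all hypothetical, and keyword matching is a deliberately crude stand-in for a real time-sensitivity classifier.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Crude illustrative hints; a real system would use a proper classifier
TIME_SENSITIVE_HINTS = ("today", "latest", "right now", "current", "this week")


@dataclass
class Request:
    prompt: str
    temperature: float = 0.0
    user_id: Optional[str] = None   # set when the context is personalised
    safety_critical: bool = False   # moderation or safety classification


def cache_policy(req: Request) -> Tuple[bool, Optional[str]]:
    """Return (cacheable, key_scope). key_scope is None when caching is skipped."""
    if req.safety_critical:
        return False, None          # freshness is worth the cost
    if req.temperature > 0:
        return False, None          # varied output is the point
    lowered = req.prompt.lower()
    if any(hint in lowered for hint in TIME_SENSITIVE_HINTS):
        return False, None          # needs a fresh answer
    if req.user_id is not None:
        return True, f"user:{req.user_id}"  # never share across users
    return True, "shared"
```

The returned scope is meant to be prefixed onto the cache key, so a personalised response can only ever be read back by the user it was generated for.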

Measuring Cache Performance

Track cache metrics as first-class operational data alongside latency and cost. The most important metrics are hit rate (what fraction of requests were served from cache), cost savings (how much you saved versus making fresh API calls), and latency improvement (the difference in response time between cache hits and misses).

from dataclasses import dataclass, field

@dataclass
class CacheMetrics:
    hits: int = 0
    misses: int = 0
    total_saved_tokens: int = 0
    hit_latencies: list[float] = field(default_factory=list)
    miss_latencies: list[float] = field(default_factory=list)

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total > 0 else 0.0

    @property
    def avg_hit_latency_ms(self) -> float:
        return sum(self.hit_latencies) / len(self.hit_latencies) * 1000 if self.hit_latencies else 0.0

    @property
    def avg_miss_latency_ms(self) -> float:
        return sum(self.miss_latencies) / len(self.miss_latencies) * 1000 if self.miss_latencies else 0.0

A hit rate below 10–15% suggests your cache keys are too specific or your query distribution is too diverse for caching to add value. A hit rate above 80% suggests you may be over-caching — check whether cached responses are staying fresh enough for your use case. The sweet spot for most applications is 30–60%, representing genuine query reuse without staleness problems.
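Metrics like these are only useful if every lookup feeds them. As a rough illustration, the hypothetical `timed_lookup` helper below wraps a cache read, times both paths, and updates plain counters; the dict cache and the `fetch` callable are stand-ins for your real cache and API call.

```python
import time
from typing import Callable, Dict


def timed_lookup(key: str, cache: Dict[str, str],
                 fetch: Callable[[str], str], stats: Dict[str, float]) -> str:
    """Serve from cache when possible, recording counts and elapsed time."""
    start = time.perf_counter()
    if key in cache:
        value = cache[key]
        stats["hits"] += 1
        stats["hit_seconds"] += time.perf_counter() - start
        return value
    value = fetch(key)              # the expensive path, e.g. an API call
    cache[key] = value
    stats["misses"] += 1
    stats["miss_seconds"] += time.perf_counter() - start
    return value


stats = {"hits": 0, "misses": 0, "hit_seconds": 0.0, "miss_seconds": 0.0}
store: Dict[str, str] = {}
for query in ["a", "b", "a"]:
    timed_lookup(query, store, lambda k: k.upper(), stats)

hit_rate = stats["hits"] / (stats["hits"] + stats["misses"])
print(f"hit rate: {hit_rate:.0%}")
```

Timing inside the wrapper rather than at the call site keeps hit and miss latencies comparable, since both include the cache read itself.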

Combining Caching Strategies

The most effective production caching setups layer multiple strategies. Provider-side prompt caching handles the shared system prompt and document prefix cheaply and automatically. Exact match caching catches repeated identical queries at the application layer before they reach the API. Semantic caching catches near-duplicate queries that exact match misses. And embedding caching prevents redundant embedding API calls for your document corpus.

Each layer has different implementation complexity, different hit rates, and different risk profiles. Start with provider-side prompt caching — it requires minimal code changes and delivers immediate savings for any application with a substantial system prompt. Add exact match caching next for known high-frequency queries. Add semantic caching only if you have measured that near-duplicate queries are a meaningful fraction of your traffic and the additional complexity is justified by the savings.
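The application-side layers can be composed as a simple fall-through. In this sketch the caches and the model call are injected as callables, so `layered_completion` and its parameter names are assumptions for illustration rather than a fixed interface; provider-side prompt caching would happen inside `call_model` and is not shown.

```python
from typing import Callable, Optional, Tuple


def layered_completion(
    prompt: str,
    exact_get: Callable[[str], Optional[str]],
    exact_set: Callable[[str, str], None],
    semantic_get: Callable[[str], Optional[str]],
    semantic_set: Callable[[str, str], None],
    call_model: Callable[[str], str],
) -> Tuple[str, str]:
    """Try the cheapest layer first. Returns (response, source)."""
    cached = exact_get(prompt)
    if cached is not None:
        return cached, "exact"

    cached = semantic_get(prompt)
    if cached is not None:
        # Promote to the exact cache so the next identical prompt skips embedding
        exact_set(prompt, cached)
        return cached, "semantic"

    response = call_model(prompt)
    exact_set(prompt, response)
    semantic_set(prompt, response)
    return response, "api"
```

Returning the source alongside the response makes it straightforward to report a separate hit rate for each layer, which is what tells you whether the semantic layer is earning its complexity.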

Cache Invalidation Strategies

Cache invalidation is the hard part of any caching system, and LLM caches are no exception. When should a cached response be considered stale? The answer depends on what can change: the model itself, your system prompt, the underlying data the model was summarising, or the user’s context.

The simplest approach is TTL-based expiry — every cached entry expires after a fixed duration, forcing a fresh API call. TTL is easy to implement and reason about, but it is a blunt instrument: it expires responses that are still valid and keeps responses that have become stale if the underlying data changes mid-TTL. For most conversational applications, a TTL of 15–60 minutes strikes a reasonable balance between freshness and hit rate.

Event-driven invalidation is more precise: when something changes — a product price updates, a document is edited, a user changes their preferences — explicitly delete or mark stale the cache entries that depend on that change. This keeps cached responses fresh as long as they remain valid, regardless of age. The tradeoff is implementation complexity: you need to track which cache entries depend on which data, and trigger invalidation when that data changes. For applications backed by a database, change data capture (CDC) patterns — listening to database write events and triggering cache invalidation — make this manageable without per-feature custom logic.
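The dependency tracking described here amounts to a reverse index from data sources to cache keys. A minimal in-memory sketch follows; `DependencyCache` and the source identifiers such as `product:42` are hypothetical, and a real deployment would call `invalidate` from a CDC consumer or an application event handler.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set


class DependencyCache:
    """Cache whose entries record which data sources they were built from."""

    def __init__(self) -> None:
        self._values: Dict[str, str] = {}
        # Reverse index: source id -> cache keys that depend on it
        self._dependents: Dict[str, Set[str]] = defaultdict(set)

    def set(self, key: str, value: str, depends_on: Iterable[str]) -> None:
        self._values[key] = value
        for source in depends_on:
            self._dependents[source].add(key)

    def get(self, key: str) -> Optional[str]:
        return self._values.get(key)

    def invalidate(self, source: str) -> int:
        """Drop every entry built from `source`. Returns how many were removed."""
        removed = 0
        for key in self._dependents.pop(source, set()):
            if key in self._values:
                del self._values[key]
                removed += 1
        return removed


dep_cache = DependencyCache()
dep_cache.set("q:price-of-widget", "It costs $10.", depends_on=["product:42"])
dep_cache.set("q:compare-widgets", "Widget A is cheaper.",
              depends_on=["product:42", "product:43"])
dep_cache.set("q:return-policy", "30 days.", depends_on=["doc:returns"])

# A price update on product 42 drops both answers that used it
print(dep_cache.invalidate("product:42"))
```

Entries that depend on several sources are removed as soon as any one of them changes, which errs on the side of freshness.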

For prompt caching at the provider level, invalidation is simpler because the provider controls the cache lifetime. Anthropic’s ephemeral cache expires automatically once five minutes pass without a request reading it. If you update your system prompt, the next request creates a new cache entry for the new version, and the old cached version simply stops being used and ages out as requests move to the new prompt. You do not need to explicitly invalidate it.

A practical middle ground that works for most applications is layered expiry: a short TTL (5–15 minutes) for highly dynamic content, a medium TTL (1–4 hours) for moderately dynamic content, and a long TTL (24 hours or more) for effectively static content like documentation summaries or product descriptions. Tag each cache entry with a content type at write time, and apply the appropriate TTL automatically. This avoids the complexity of event-driven invalidation while providing much better freshness control than a single global TTL.
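Layered expiry needs little more than a lookup table from content type to TTL. The sketch below uses an in-memory store with an injectable clock so the behaviour is easy to test; `TieredTTLCache`, the type names, and the specific durations are illustrative choices within the ranges given above.

```python
import time
from typing import Callable, Dict, Optional, Tuple

# Illustrative tiers, in seconds
TTL_SECONDS = {
    "dynamic": 10 * 60,         # highly dynamic content
    "moderate": 2 * 60 * 60,    # moderately dynamic content
    "static": 24 * 60 * 60,     # documentation summaries, product descriptions
}


class TieredTTLCache:
    """In-memory cache that picks a TTL from the entry's content type."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}  # key -> (expires_at, value)

    def set(self, key: str, value: str, content_type: str) -> None:
        # Unknown types fall back to the shortest TTL, the safest default
        ttl = TTL_SECONDS.get(content_type, TTL_SECONDS["dynamic"])
        self._entries[key] = (self._clock() + ttl, value)

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: evict lazily on read
            return None
        return value
```

With Redis as the backing store, the same table would simply supply the `ttl` argument to `setex` at write time.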

The Business Case for LLM Caching

For teams trying to justify the engineering investment in caching infrastructure, the numbers are usually compelling. A typical RAG application with a 10,000-token system prompt and 30% cache hit rate on queries reduces input token costs by roughly 40–50% through a combination of prompt caching and query-level caching. At meaningful traffic volumes — tens of thousands of daily users — this translates directly to thousands of dollars per month in infrastructure savings. Latency improvements are harder to quantify financially but directly affect user engagement: cache hits that respond in under 100 milliseconds versus fresh API calls at 2–4 seconds represent the difference between a snappy, interactive experience and one that feels sluggish. For most LLM applications operating at scale, caching is not a nice-to-have optimisation — it is a prerequisite for running economically.
