LLM Routing: How to Send Every Request to the Right Model and Cut API Costs

What Is LLM Routing?

LLM routing is the practice of directing each request to the most appropriate model rather than sending every request to the same one. Instead of using a single frontier model for everything, a router classifies incoming requests by complexity, cost sensitivity, or task type and dispatches them accordingly — simple queries go to a fast, cheap model; complex reasoning tasks go to a more capable, expensive one.

The motivation is straightforward economics. Frontier models cost 10–100x more per token than smaller models, but the majority of queries in most production applications do not require frontier-level capability. A customer support bot fielding questions like “What are your business hours?” is massively overpaying if every one of those goes to a frontier model. Routing simple queries to a smaller model and reserving the large model for genuinely complex work can reduce costs by 50–80% with minimal quality impact on the overall user experience.

Three Routing Strategies

Rule-based routing classifies requests using explicit rules — keyword lists, regex patterns, input length thresholds. It is fast, cheap, and transparent, but requires manual maintenance and struggles with edge cases. It works best for applications with clearly separable query types distinguishable by surface features.
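A rule-based router can be sketched in a few lines. The patterns, keywords, and length threshold below are hypothetical placeholders chosen for illustration, not values from a real deployment; they would need tuning against your own traffic.

```python
import re

# Hypothetical rules for illustration only; tune them to your own query mix.
FAST_PATTERNS = [
    re.compile(r"^\s*(hi|hello|hey|thanks|thank you)\b", re.IGNORECASE),
    re.compile(r"\b(business hours|opening hours)\b", re.IGNORECASE),
]
ADVANCED_PATTERN = re.compile(r"\b(architecture|refactor|prove|trade-?off)\b", re.IGNORECASE)
LONG_INPUT_CHARS = 2000  # assumed threshold: very long inputs go to the large tier


def rule_based_route(query: str) -> str:
    """Return "fast", "standard", or "advanced" using surface features only."""
    # Check the expensive tier first so a greeting followed by a hard
    # question is not under-served.
    if len(query) > LONG_INPUT_CHARS or ADVANCED_PATTERN.search(query):
        return "advanced"
    if any(p.search(query) for p in FAST_PATTERNS):
        return "fast"
    # Anything the rules do not recognise goes to the middle tier.
    return "standard"
```

Defaulting unrecognised queries to the middle tier is a design choice: it limits the damage from the edge cases that rules inevitably miss.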

Model-based routing uses a small, fast classifier to predict which tier a request belongs to. The classifier is trained on labelled examples of easy vs. hard queries. It is more accurate than rules and generalises better, at the cost of a small inference overhead on every request.

Cascade routing sends every request to the small model first. If the response meets a quality threshold, it is returned. If not, the request escalates to the large model. The large model then handles only what the small model could not, as far as the quality check can tell, but escalated requests pay for both calls and wait for them in sequence, which can roughly double their latency.

Building a Classifier Router

import anthropic
from enum import Enum

client = anthropic.Anthropic()

class Tier(Enum):
    FAST = "fast"
    STANDARD = "standard"
    ADVANCED = "advanced"

ROUTER_PROMPT = """Classify this user query into one of three tiers:

FAST: Simple, factual requests requiring no reasoning. Examples: greetings, simple lookups, yes/no questions.
STANDARD: Moderate complexity requiring some reasoning. Examples: explanations, summaries, multi-step instructions.
ADVANCED: Complex reasoning, creative work, or tasks requiring nuanced judgment. Examples: code architecture, complex analysis.

Query: {query}
Respond with only one word: FAST, STANDARD, or ADVANCED."""

MODEL_MAP = {
    Tier.FAST: "claude-haiku-4-5-20251001",
    Tier.STANDARD: "claude-sonnet-4-6",
    Tier.ADVANCED: "claude-opus-4-6",
}

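def route(query: str) -> Tier:
    # Classify with the cheapest model; the reply should be a single tier name.
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=10,
        messages=[{"role": "user", "content": ROUTER_PROMPT.format(query=query)}]
    )
    # Tolerate trailing punctuation such as "FAST." in the reply.
    tier_str = response.content[0].text.strip().strip(".").upper()
    # Fall back to STANDARD if the reply is not a recognised tier name.
    return Tier[tier_str] if tier_str in Tier.__members__ else Tier.STANDARD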

def routed_completion(query: str, system: str = "") -> tuple[str, Tier]:
    tier = route(query)
    # Only pass a system prompt when one is provided, rather than sending an empty string.
    extra = {"system": system} if system else {}
    response = client.messages.create(
        model=MODEL_MAP[tier], max_tokens=1024,
        messages=[{"role": "user", "content": query}], **extra
    )
    return response.content[0].text, tier

The routing call itself uses the cheapest model, because its only job is classification, which small models generally handle well. It adds one extra round trip per request, roughly 100–200ms. For latency-sensitive applications, run the router and fast-tier model in parallel, cancelling the fast-tier call if the router returns anything other than FAST.

Cascade Routing: Try Small First

def cascade_completion(query: str, quality_threshold: float = 0.8) -> tuple[str, str]:
    small_response = client.messages.create(
        model="claude-haiku-4-5-20251001", max_tokens=1024,
        messages=[{"role": "user", "content": query}]
    ).content[0].text

    quality = judge_quality(query, small_response)
    if quality >= quality_threshold:
        return small_response, "haiku"

    large_response = client.messages.create(
        model="claude-opus-4-6", max_tokens=1024,
        messages=[{"role": "user", "content": query}]
    ).content[0].text
    return large_response, "opus"

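def judge_quality(query: str, response: str) -> float:
    # Ask the small model to grade the answer on a 0.0-1.0 scale.
    result = client.messages.create(
        model="claude-haiku-4-5-20251001", max_tokens=16,
        messages=[{"role": "user", "content": f"Rate this response quality 0.0-1.0. Query: {query}\nResponse: {response}\nReturn only a number."}]
    ).content[0].text.strip()
    try:
        # Clamp to the expected range in case the judge returns something like 8 or -1.
        return min(1.0, max(0.0, float(result)))
    except ValueError:
        # Unparseable reply: 0.5 is below the default threshold, so the request escalates.
        return 0.5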

Routing by Task Type

Beyond complexity tiers, routing by task type sends different kinds of work to models optimised for that task — code generation to a code-specialised model, multilingual queries to a model with strong coverage of that language, long document work to a model with a large context window:

TASK_MODEL_MAP = {
    "code_generation": "claude-sonnet-4-6",
    "code_review": "claude-opus-4-6",
    "translation": "claude-sonnet-4-6",
    "summarisation": "claude-haiku-4-5-20251001",
    "analysis": "claude-opus-4-6",
    "factual_qa": "claude-haiku-4-5-20251001",
    "creative_writing": "claude-opus-4-6",
}

Task-type routing is often more reliable than complexity routing because the categories are more clearly defined and model specialisation differences are more predictable than raw capability differences.
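Whatever produces the task label (a classifier, a rule, or an explicit field from the caller), the lookup benefits from normalisation and a safe default. The sketch below repeats a subset of the mapping so that it runs on its own; `DEFAULT_MODEL` and `model_for_task` are hypothetical names introduced for illustration.

```python
# Subset of the mapping above, repeated so this sketch is self-contained.
TASK_MODEL_MAP = {
    "code_generation": "claude-sonnet-4-6",
    "summarisation": "claude-haiku-4-5-20251001",
    "analysis": "claude-opus-4-6",
}
DEFAULT_MODEL = "claude-sonnet-4-6"  # assumed fallback for unrecognised labels


def model_for_task(task_label: str) -> str:
    """Normalise a task label and look up its model, falling back to the default."""
    key = task_label.strip().lower().replace(" ", "_")
    return TASK_MODEL_MAP.get(key, DEFAULT_MODEL)
```

Falling back to a mid-tier model rather than raising keeps an unexpected label from turning into a user-facing error.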

Multi-Provider Routing

Routing does not have to stay within one provider’s model family. Multi-provider routing distributes requests across OpenAI, Anthropic, Google, and open-source models to optimise for cost, capability, latency, or availability. Different providers have different pricing structures, different model strengths, and different uptime guarantees — routing across them gives you more levers to pull.

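A multi-provider router needs an abstraction layer that presents a uniform interface regardless of which provider handles the call. The OpenAI-compatible chat completions format is the practical standard: Mistral, Together AI, Groq, and others expose it, and Anthropic offers an OpenAI SDK compatibility endpoint. Anthropic documents that layer as intended mainly for testing and comparison, so check its feature coverage before relying on it in production: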

from openai import OpenAI

# Each provider configured as a separate client
providers = {
    "openai": OpenAI(api_key="..."),
    "anthropic": OpenAI(api_key="...", base_url="https://api.anthropic.com/v1"),
    "groq": OpenAI(api_key="...", base_url="https://api.groq.com/openai/v1"),
    "together": OpenAI(api_key="...", base_url="https://api.together.xyz/v1"),
}

PROVIDER_MODEL_MAP = {
    "fast_cheap": ("groq", "llama-3.1-8b-instant"),
    "fast_capable": ("anthropic", "claude-haiku-4-5-20251001"),
    "standard": ("openai", "gpt-4o-mini"),
    "advanced": ("anthropic", "claude-opus-4-6"),
}

def multi_provider_completion(query: str, tier: str = "standard") -> str:
    provider_name, model = PROVIDER_MODEL_MAP[tier]
    client = providers[provider_name]
    response = client.chat.completions.create(
        model=model, max_tokens=1024,
        messages=[{"role": "user", "content": query}]
    )
    return response.choices[0].message.content

Multi-provider routing also enables automatic fallback: if one provider’s API is down or slow, the router can redirect to an equivalent model at another provider. This improves reliability significantly for high-availability applications without requiring you to build a full multi-cloud infrastructure.
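A minimal sketch of the fallback idea, assuming each provider is wrapped in a plain callable that takes the query and returns text. The `Backend` alias and `complete_with_fallback` are hypothetical names, and real code would likely add timeouts and distinguish retryable from non-retryable errors.

```python
from typing import Callable, Sequence

# A label plus a callable that takes the query and returns the completion text.
# In practice the callable would wrap a provider client; here it is left abstract.
Backend = tuple[str, Callable[[str], str]]


def complete_with_fallback(query: str, backends: Sequence[Backend]) -> tuple[str, str]:
    """Try each backend in order; return (text, label) from the first that succeeds."""
    errors: list[str] = []
    for label, call in backends:
        try:
            return call(query), label
        except Exception as exc:  # broad on purpose: timeouts, 5xx, rate limits
            errors.append(f"{label}: {exc!r}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))
```

Returning the label alongside the text makes it possible to track how often the fallback path is taken.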

Cost Modelling and Threshold Calibration

Before implementing a router, model the expected cost savings explicitly. The saving from routing a request to the small model is: (large_model_cost_per_token - small_model_cost_per_token) × tokens. With Claude Haiku at roughly 1/20th the cost of Claude Opus (an illustrative ratio; actual ratios vary by model generation, so check current pricing), routing a request to the small model saves approximately 95% of that request's cost, whatever its length. If 60% of your traffic is FAST-tier and the rest goes to the large model, the overall cost reduction is roughly 0.60 × 0.95 = 57%, which remains significant even after accounting for the routing overhead.
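The same arithmetic as a small function, under the simplifying assumptions that requests have similar token counts across tiers and that everything not routed to the small model goes to the large one. `router_overhead` is a hypothetical parameter expressing the routing call's cost as a fraction of one large-model call.

```python
def blended_savings(fast_fraction: float, cost_ratio: float, router_overhead: float = 0.0) -> float:
    """Fractional cost reduction versus sending everything to the large model.

    fast_fraction:   share of traffic routed to the small model (0..1)
    cost_ratio:      small-model cost divided by large-model cost (e.g. 1/20)
    router_overhead: per-request routing cost as a fraction of one large-model call
    """
    # Cost of the routed setup, with the all-large baseline normalised to 1.0.
    routed_cost = fast_fraction * cost_ratio + (1.0 - fast_fraction) + router_overhead
    return 1.0 - routed_cost


print(f"{blended_savings(0.60, 1 / 20):.0%}")  # 60% of traffic at 1/20th the cost
```

Plugging in your own traffic split and price ratio gives a quick upper bound on what routing can save before any engineering work starts.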

Calibrate quality thresholds empirically. Run your router against a labelled evaluation set and plot quality vs. cost at different escalation thresholds. Find the threshold that maximises cost savings while keeping quality above your acceptable floor. A 5-point drop in average quality score is meaningless if users cannot perceive it; a 2-point drop may be unacceptable if it affects a safety-critical response category. The right threshold is application-specific and cannot be determined without measurement.
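One way to run that sweep, sketched under the assumption that each evaluation item already carries the judge's score for the small-model answer plus ground-truth quality labels for both models' answers. `EvalItem` and `sweep_thresholds` are hypothetical names, and escalation rate stands in for cost.

```python
from dataclasses import dataclass


@dataclass
class EvalItem:
    judge_score: float    # score the cascade's judge gave the small-model answer
    small_quality: float  # labelled quality of the small-model answer
    large_quality: float  # labelled quality of the large-model answer


def sweep_thresholds(items, thresholds, quality_floor):
    """Return (threshold, escalation_rate, mean_quality) with the lowest escalation
    rate whose mean quality is at or above quality_floor, or None if none qualifies."""
    best = None
    for t in thresholds:
        # An item escalates when the judge scores the small answer below the threshold.
        escalated = [item.judge_score < t for item in items]
        mean_quality = sum(
            item.large_quality if esc else item.small_quality
            for item, esc in zip(items, escalated)
        ) / len(items)
        rate = sum(escalated) / len(items)
        if mean_quality >= quality_floor and (best is None or rate < best[1]):
            best = (t, rate, mean_quality)
    return best
```

A `None` result means no threshold in the sweep meets the floor, which usually indicates that the judge's scores do not track real quality well enough for this cascade.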

Measuring Routing Performance

Track routing metrics as first-class operational data. The key metrics are: tier distribution and escalation rate (what fraction of requests reach each tier, and what fraction are escalated past the small model), cost per request by tier, quality scores by tier, and overall quality versus baseline (sending everything to the large model). A rising escalation rate over time may indicate that your query distribution is shifting toward harder requests, or that your router’s classifier is degrading. Either case warrants investigation.

Also monitor for systematic routing errors. If the router consistently misclassifies a particular query type, for example routing difficult domain-specific questions as FAST or simple greetings as ADVANCED, add those examples to the router’s training data or adjust its classification rules. Misclassification hurts in both directions: over-routing pays for capability the request did not need, and under-routing degrades the answer. A router that is wrong on 10% of requests is therefore adding cost or actively degrading user experience on every one of those requests.
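A minimal in-memory sketch of these metrics, assuming some requests are later given a hand-labelled "correct" tier so that misroutes can be counted. `RoutingMetrics` is a hypothetical class; a production system would more likely emit the same fields to an existing metrics pipeline.

```python
from collections import Counter, defaultdict


class RoutingMetrics:
    """Accumulates per-tier counts, cost, and quality, plus router misclassifications."""

    def __init__(self):
        self.requests = Counter()
        self.cost = defaultdict(float)
        self.quality = defaultdict(list)
        self.confusion = Counter()  # (predicted_tier, labelled_tier) -> count

    def record(self, tier, cost_usd, quality=None, labelled_tier=None):
        self.requests[tier] += 1
        self.cost[tier] += cost_usd
        if quality is not None:
            self.quality[tier].append(quality)
        if labelled_tier is not None:
            self.confusion[(tier, labelled_tier)] += 1

    def tier_share(self, tier):
        """Fraction of all recorded requests that went to this tier."""
        total = sum(self.requests.values())
        return self.requests[tier] / total if total else 0.0

    def misroute_rate(self):
        """Fraction of labelled requests whose routed tier differs from the label."""
        labelled = sum(self.confusion.values())
        wrong = sum(n for (pred, true), n in self.confusion.items() if pred != true)
        return wrong / labelled if labelled else 0.0
```

The `confusion` counter also shows which direction the errors run, which matters because under-routing and over-routing have different costs.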

RouteLLM: A Dedicated Routing Framework

RouteLLM is an open-source framework from LMSYS that provides pre-trained routing models specifically designed for LLM cost optimisation. It offers several router architectures (including matrix factorisation, BERT-based classifiers, and causal LLM classifiers) trained on human preference data. RouteLLM routers are calibrated to a cost threshold: you specify what fraction of calls should go to the strong model, and the router adjusts its threshold to hit that target. This makes the cost/quality trade-off explicit and controllable rather than requiring manual threshold tuning. For teams that want routing without building a custom classifier, RouteLLM is one of the more mature off-the-shelf options.

When Routing Is and Is Not Worth It

Routing is worth implementing when you have meaningful traffic, your query distribution genuinely spans a wide range of complexity, and the quality gap between your small and large model is small enough that routing easy queries to the small model does not noticeably degrade user experience. It pays for itself quickly at scale: an application making 100,000 daily API calls where 60% route to a model that costs 1/20th as much saves the cost equivalent of roughly 57,000 calls per day.

Routing is not worth implementing for low-traffic applications where API costs are not a significant concern, for applications where all queries are uniformly complex, or where any quality difference between tiers is unacceptable. Adding a routing layer to a small application prematurely adds operational complexity — another model to monitor, another failure mode to handle, another component to maintain — without proportional benefit. Start with a single model, measure your actual costs and query distribution once you have real traffic, and add routing when the data makes a clear case for it.

Latency vs. Cost Trade-offs in Routing

Routing decisions involve a fundamental tension between latency and cost that varies by application type. For synchronous user-facing interfaces — chat applications, search, real-time assistants — latency is paramount and cascades that add a full model call on escalation may be unacceptable. For asynchronous background tasks — document processing, batch classification, overnight report generation — latency is irrelevant and cascade routing is ideal because you get maximum cost savings without any user experience penalty.

For synchronous applications, classifier-based routing that runs in parallel with the fast model is the right architecture. Fire the router and the fast-tier model simultaneously. If the router returns FAST before the fast model completes, you have essentially zero routing overhead. If the router returns any other tier, cancel the fast model call and await the model for that tier. This “speculative” routing pattern combines the latency benefits of the fast model with the quality of a larger model when needed, at the cost of occasionally wasting the fast model call.
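The speculative pattern can be sketched with asyncio. The three callables (`classify`, `call_fast`, `call_tier`) are hypothetical async wrappers around your router and model clients and are passed in so the control flow stays independent of any one SDK.

```python
import asyncio
from typing import Awaitable, Callable


async def speculative_completion(
    query: str,
    classify: Callable[[str], Awaitable[str]],
    call_fast: Callable[[str], Awaitable[str]],
    call_tier: Callable[[str, str], Awaitable[str]],
) -> tuple[str, str]:
    """Start the fast-tier call and the router together; keep the fast
    answer only if the router says the query belongs in the FAST tier."""
    fast_task = asyncio.create_task(call_fast(query))
    try:
        tier = await classify(query)
    except Exception:
        fast_task.cancel()  # do not leave the speculative call running
        raise
    if tier == "FAST":
        return await fast_task, tier
    # Any other tier: abandon the speculative call and use the routed model.
    fast_task.cancel()
    return await call_tier(tier, query), tier
```

Cancelling the task stops the client from waiting on the response; whether the provider still bills for tokens already generated depends on the provider, so treat the speculative call as a potential sunk cost.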

For asynchronous workloads, cascade routing is simpler and more cost-effective. Process your entire batch through the small model, collect all responses, run quality checks in parallel across the batch, and then send only the failed items to the large model for reprocessing. This batched cascade approach maximises throughput and minimises cost for bulk processing pipelines that do not have per-item latency requirements.
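A sketch of the batched cascade, assuming thread-based parallelism is adequate for I/O-bound API calls. `call_small`, `call_large`, and `judge` are hypothetical callables standing in for your model and quality-check wrappers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def batched_cascade(
    queries: list[str],
    call_small: Callable[[str], str],
    call_large: Callable[[str], str],
    judge: Callable[[str, str], float],
    threshold: float = 0.8,
    max_workers: int = 8,
) -> list[tuple[str, str]]:
    """Run the whole batch through the small model, then re-run only the
    items that fail the quality check on the large model."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        small = list(pool.map(call_small, queries))
        scores = list(pool.map(judge, queries, small))
        failed = [i for i, score in enumerate(scores) if score < threshold]
        large = list(pool.map(call_large, [queries[i] for i in failed]))
    # Start from the small-model answers and overwrite the escalated ones.
    results = [(text, "small") for text in small]
    for i, text in zip(failed, large):
        results[i] = (text, "large")
    return results
```

Results keep the input order, and each one is tagged with the tier that produced it, so the escalation rate falls out of the output directly.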

A/B Testing Your Router

Before fully committing to a routing configuration, run a shadow deployment: send a percentage of production traffic through the router while simultaneously sending the same requests to your current single-model setup. Compare quality metrics, latency distributions, and cost across both paths on real traffic before switching over entirely. Shadow deployments reveal routing failure modes that your eval set missed — real user queries are always more diverse and surprising than your test set, and a problem that affects 0.5% of traffic only shows up reliably once you have significant production volume.
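A shadow deployment can be as simple as the sketch below: the user always receives the baseline answer, and a sampled fraction of requests is also run through the router and logged for offline comparison. `shadow_compare`, its parameters, and the log format are hypothetical, and a real system would run the shadow call off the request path so that it adds no user-facing latency.

```python
import random
from typing import Callable, Optional


def shadow_compare(
    query: str,
    baseline: Callable[[str], str],
    candidate: Callable[[str], str],
    log: list,
    sample_rate: float = 0.05,
    rng: Optional[random.Random] = None,
) -> str:
    """Always serve the baseline answer; for a sampled fraction of requests,
    also run the routed path and record both answers for later comparison."""
    answer = baseline(query)
    source = rng if rng is not None else random
    if source.random() < sample_rate:
        try:
            shadow = candidate(query)
        except Exception as exc:  # a shadow failure must never affect the user
            shadow = f"<error: {exc!r}>"
        log.append({"query": query, "baseline": answer, "shadow": shadow})
    return answer
```

Logging shadow failures instead of raising them also surfaces failure modes in the routed path before any user sees them.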

After launch, continue running periodic A/B experiments to recalibrate the router as your query distribution evolves. A customer base that started with simple queries may develop more sophisticated usage patterns over time, shifting the distribution toward harder queries. A router calibrated on six-month-old traffic data may be routing too aggressively to the small model relative to your current workload. Treat the router’s configuration as a living parameter that gets updated based on ongoing measurement rather than a one-time setup decision.

Routing as Part of a Broader Cost Strategy

Routing is most effective when combined with the other cost reduction techniques described elsewhere — prompt caching for shared prefixes, output length control for verbose tasks, and batching for asynchronous workloads. A routing strategy that directs 60% of requests to the small model, combined with prompt caching that reduces input token costs by 40% for the cached prefix, and output length controls that cut average response length by 20%, can together reduce total API spend by 70–75% compared to unoptimised single-model usage. No single technique achieves this alone; the combination does. Model the cumulative impact of each technique on your actual usage pattern before deciding which to implement first, and sequence implementation by expected return on engineering effort rather than theoretical maximum savings.
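To see how these techniques compound, the sketch below multiplies their effects under loudly stated assumptions: the 1/20 price ratio used earlier, spend split evenly between input and output tokens, all input tokens falling in the cached prefix, and the same mix across tiers. None of these assumptions come from measured data, so substitute your own numbers.

```python
def combined_cost_factor(
    fast_fraction: float = 0.60,      # share of requests routed to the small model
    cost_ratio: float = 1 / 20,       # assumed small/large price ratio
    input_share: float = 0.5,         # assumed share of spend that is input tokens
    cached_input_share: float = 1.0,  # assumed share of input tokens in the cached prefix
    cache_discount: float = 0.40,     # input cost reduction on the cached prefix
    output_reduction: float = 0.20,   # cut in average response length
) -> float:
    """Remaining spend as a fraction of the unoptimised single-model baseline."""
    routing = fast_fraction * cost_ratio + (1 - fast_fraction)
    input_cost = input_share * (1 - cache_discount * cached_input_share)
    output_cost = (1 - input_share) * (1 - output_reduction)
    # The techniques are treated as independent, so their factors multiply.
    return routing * (input_cost + output_cost)
```

With the defaults the remaining spend is about 0.30 of the baseline, a reduction of roughly 70%, and shifting more of the spend toward cacheable input tokens pushes the reduction slightly higher.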

The core insight is that LLM routing is fundamentally a resource allocation problem — matching the capability level of the tool to the difficulty of the task. Just as experienced engineers do not use the most powerful server for every workload, effective LLM applications do not use the most capable model for every query. Getting that match right, and measuring it continuously, is one of the highest-leverage operational practices available to teams running LLMs at scale.