LLM Routing in Production: Balancing Cost and Quality with Model Cascades

Not every query needs GPT-4-class capability. Routing queries to cheaper, faster models when the task is simple, and escalating to a larger model only when necessary, can cut LLM API costs by as much as 50-80% with minimal quality loss on workloads dominated by simple queries. This is the core idea behind LLM routing: classifying each incoming request by its complexity and capability requirements, then dispatching it to the most cost-effective model that can handle it well. This article covers how to implement a router, how to design cascade fallback logic, and how to measure the cost-quality tradeoff empirically so you can calibrate the router threshold for your specific workload.

The Case for Routing

In most production LLM workloads, the query distribution is highly skewed. In a customer support system, 70% of queries might be answerable by a fast, cheap model (order status, simple FAQs, factual lookups), 20% might need moderate reasoning (policy clarification, complaint handling), and 10% might be complex queries that genuinely benefit from a frontier model (edge cases, multi-turn disambiguation, nuanced judgement calls). Paying frontier-model prices for the easy 70% is the obvious waste to eliminate. The challenge is building a router that is accurate enough that easy queries sent to the cheaper model lose little quality, while genuinely hard queries still reach the capable model.

Classifier-Based Routing

The most reliable approach to routing is training a lightweight classifier that predicts whether a query requires a large model. The classifier input is the query text; the output is a routing decision (cheap model vs expensive model) or a continuous score that is thresholded. The classifier should be much cheaper to run than either model it routes to: a sentence-transformer embedding plus a logistic regression head, or a small BERT-class model, is appropriate.

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import numpy as np
import pickle

class EmbeddingRouter:
    """
    Lightweight query router using sentence embeddings + logistic regression.
    Trains on labelled examples of (query, needs_large_model: bool).
    Inference latency: ~5-15ms — negligible vs LLM call.
    """
    def __init__(self, embed_model='all-MiniLM-L6-v2'):
        self.embedder = SentenceTransformer(embed_model)
        self.classifier = LogisticRegression(C=1.0, max_iter=500)
        self.scaler = StandardScaler()
        self.fitted = False

    def fit(self, queries: list[str], labels: list[bool]):
        """labels: True = needs large model, False = small model sufficient."""
        embeddings = self.embedder.encode(queries, show_progress_bar=True)
        embeddings_scaled = self.scaler.fit_transform(embeddings)
        self.classifier.fit(embeddings_scaled, labels)
        self.fitted = True
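        # Training accuracy is optimistic; check accuracy on a held-out split
        # before trusting the router in production.
        train_acc = self.classifier.score(embeddings_scaled, labels)
        print(f"Router train accuracy: {train_acc:.3f}")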
        return self

    def predict_proba(self, query: str) -> float:
        """Returns probability that query needs the large model."""
        emb = self.embedder.encode([query])
        emb_scaled = self.scaler.transform(emb)
        return float(self.classifier.predict_proba(emb_scaled)[0, 1])

    def route(self, query: str, threshold: float = 0.5) -> str:
        """Returns 'large' or 'small'."""
        p = self.predict_proba(query)
        return 'large' if p >= threshold else 'small'

    def save(self, path: str):
        with open(path, 'wb') as f:
            pickle.dump({'classifier': self.classifier, 'scaler': self.scaler}, f)

    def load(self, path: str):
        with open(path, 'rb') as f:
            data = pickle.load(f)
        self.classifier = data['classifier']
        self.scaler = data['scaler']
        self.fitted = True
        return self

Building the Training Dataset for the Router

The router classifier needs labelled examples of easy versus hard queries. Three practical approaches generate these labels. The first is human annotation: sample a few hundred queries from your production logs and have domain experts label whether a cheap model would suffice. This is accurate but slow. The second is model-agreement labelling: run both your cheap and expensive model on a held-out set of queries, evaluate both outputs (by human raters or an LLM judge), and label queries where the cheap model was rated as good as the expensive model as easy, and queries where the expensive model was clearly better as hard. The third is proxy labelling: use a fast heuristic — query length, presence of specific keywords, complexity signals — as a cheap label generator, then iteratively refine with a small amount of human correction.

import anthropic
import json

def label_query_difficulty(query: str, cheap_response: str,
                            expensive_response: str,
                            judge_model='claude-sonnet-4-20250514') -> bool:
    """
    Use an LLM judge to label whether the cheap model's response
    was adequate. Returns True if large model is needed.
    """
    client = anthropic.Anthropic()
    prompt = f"""Compare two responses to this query and decide if the cheaper response is adequate.

Query: {query}

Cheap model response: {cheap_response}

Expensive model response: {expensive_response}

Respond with JSON only: {{"needs_large_model": true/false, "reason": "brief reason"}}
Set needs_large_model to true only if the expensive response is meaningfully better."""

    response = client.messages.create(
        model=judge_model, max_tokens=200,
        messages=[{'role': 'user', 'content': prompt}]
    )
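    # The judge may wrap its JSON in prose or a Markdown fence; parse only
    # the outermost {...} span.
    raw = response.content[0].text
    result = json.loads(raw[raw.find('{'):raw.rfind('}') + 1])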
    return result['needs_large_model']
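
The third approach, proxy labelling, can be sketched with a few cheap signals. The function below is illustrative only: the word-count cutoff and the keyword list are assumptions rather than values from any benchmark, and the labels it produces are noisy starting points meant to be corrected by a human reviewer.

```python
import re

# Hypothetical signals; tune these against your own logs.
HARD_KEYWORDS = ("why", "compare", "explain", "step by step", "debug")
LONG_QUERY_WORDS = 60

def proxy_label(query: str) -> bool:
    """Noisy proxy label: True = probably needs the large model."""
    text = query.lower()
    n_words = len(text.split())
    # More than one question mark suggests a multi-part question
    n_questions = text.count("?")
    has_keyword = any(re.search(rf"\b{re.escape(kw)}\b", text)
                      for kw in HARD_KEYWORDS)
    return n_words > LONG_QUERY_WORDS or n_questions > 1 or has_keyword

def build_proxy_dataset(queries: list[str]) -> tuple[list[str], list[bool]]:
    """Return (queries, labels) in the shape EmbeddingRouter.fit expects."""
    return queries, [proxy_label(q) for q in queries]
```

Labels produced this way are best treated as a first draft: reviewing even a small sample by hand shows which signals misfire on your traffic.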

Cascade Routing: Try Cheap First, Escalate If Unsure

An alternative to a trained classifier is a cascade: always try the cheap model first, then escalate to the expensive model only if the cheap model signals low confidence or the response fails a quality gate. This requires no labelled training data and adapts automatically as the cheap model’s capabilities change. The tradeoff is latency and duplicated spend: every request incurs a cheap model call, and the escalated fraction also incurs an expensive model call, so those queries pay for both calls and wait for both in sequence. For latency-sensitive applications this is often unacceptable; for batch or async workloads it is a natural fit.
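
The cost side of this tradeoff is simple arithmetic. Assuming every request pays for the cheap call and a fraction e of requests also pays for the expensive call, the expected cost per request is c_cheap + e * c_expensive, and the cascade beats always calling the expensive model only while e < 1 - c_cheap / c_expensive. The sketch below computes both quantities; the function names and example prices are illustrative assumptions, and it ignores the extra tokens spent on the confidence instruction.

```python
def cascade_expected_cost(cheap_cost: float, expensive_cost: float,
                          escalation_rate: float) -> float:
    """Expected cost per request: all requests pay cheap, escalated ones pay both."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be in [0, 1]")
    return cheap_cost + escalation_rate * expensive_cost

def break_even_escalation_rate(cheap_cost: float, expensive_cost: float) -> float:
    """Escalation rate above which the cascade costs more than always-expensive."""
    return 1.0 - cheap_cost / expensive_cost

# Example with made-up per-request prices (not real vendor pricing)
cheap, expensive = 0.001, 0.015
cost_at_30pct = cascade_expected_cost(cheap, expensive, 0.30)
break_even = break_even_escalation_rate(cheap, expensive)
```

With these example prices the cascade stays cheaper until roughly 93% of requests escalate, so cost is rarely the limiting factor for a cascade; latency usually is.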

import anthropic
from dataclasses import dataclass

@dataclass
class RouterConfig:
    cheap_model: str = 'claude-haiku-4-5-20251001'
    expensive_model: str = 'claude-sonnet-4-20250514'
    confidence_threshold: float = 0.85
    max_tokens_cheap: int = 1024
    max_tokens_expensive: int = 4096

class CascadeRouter:
    """
    Try cheap model first. If it expresses uncertainty or the
    quality gate fails, escalate to the expensive model.
    """
    def __init__(self, config: RouterConfig = None):
        self.config = config or RouterConfig()
        self.client = anthropic.Anthropic()
        self.stats = {'cheap_only': 0, 'escalated': 0}

    def _ask_with_confidence(self, query: str, model: str,
                              max_tokens: int) -> tuple[str, float]:
        """Ask model to respond and also rate its own confidence 0-1."""
        system = (
            "Answer the user's question. At the end of your response, "
            "add a single line: CONFIDENCE: <number between 0 and 1> indicating how "
            "confident you are in the accuracy and completeness of your answer."
        )
        response = self.client.messages.create(
            model=model, max_tokens=max_tokens,
            system=system,
            messages=[{'role': 'user', 'content': query}]
        )
        text = response.content[0].text
        # Extract confidence score from last line
        lines = text.strip().split('\n')
        confidence = 0.5
        if lines[-1].startswith('CONFIDENCE:'):
            try:
                confidence = float(lines[-1].split(':')[1].strip())
                text = '\n'.join(lines[:-1]).strip()
            except ValueError:
                pass
        return text, confidence

    def route(self, query: str) -> dict:
        # First try cheap model
        cheap_response, confidence = self._ask_with_confidence(
            query, self.config.cheap_model, self.config.max_tokens_cheap
        )
        if confidence >= self.config.confidence_threshold:
            self.stats['cheap_only'] += 1
            return {'response': cheap_response, 'model': self.config.cheap_model,
                    'confidence': confidence, 'escalated': False}
        # Escalate to expensive model
        expensive_response, exp_confidence = self._ask_with_confidence(
            query, self.config.expensive_model, self.config.max_tokens_expensive
        )
        self.stats['escalated'] += 1
        return {'response': expensive_response, 'model': self.config.expensive_model,
                'confidence': exp_confidence, 'escalated': True,
                'cheap_confidence': confidence}

    def escalation_rate(self) -> float:
        total = self.stats['cheap_only'] + self.stats['escalated']
        return self.stats['escalated'] / total if total > 0 else 0.0
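
The CascadeRouter above escalates on self-reported confidence only. The quality gate mentioned earlier can be a plain function run on the cheap response before it is returned. The checks below (minimum length, hedging phrases, an unterminated final sentence) are hypothetical examples of such a gate, not an established standard; real gates are usually task-specific, such as schema validation for structured output.

```python
# Hypothetical phrases suggesting the cheap model could not answer.
UNCERTAIN_PHRASES = ("i'm not sure", "i am not sure", "i don't know",
                     "i cannot", "as an ai")

def passes_quality_gate(response: str, min_chars: int = 20) -> bool:
    """Return True if the cheap response looks usable; False means escalate."""
    text = response.strip()
    if len(text) < min_chars:
        return False  # empty or suspiciously short
    lowered = text.lower()
    if any(phrase in lowered for phrase in UNCERTAIN_PHRASES):
        return False  # model signalled it could not answer
    if text[-1] not in ".!?)\"'`":
        return False  # likely cut off by the max_tokens limit
    return True
```

Wired into the cascade, the escalation condition would become: confidence below the threshold, or passes_quality_gate(cheap_response) returning False.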

Calibrating the Routing Threshold

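The threshold in both the classifier router and the cascade router controls the cost-quality tradeoff, but it points in opposite directions in the two designs. In the classifier router, the threshold is applied to the probability that a query needs the large model, so a lower threshold routes more queries to the expensive model (better quality, higher cost) and a higher threshold routes more to the cheap model (lower cost, more risk of quality degradation). In the cascade router, the threshold is applied to the cheap model’s self-reported confidence, so the effect is reversed: raising confidence_threshold escalates more queries to the expensive model. The right threshold depends on your application’s tolerance for quality loss and your cost budget.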

To calibrate empirically, collect a sample of 200-500 queries representative of your production distribution, get both cheap and expensive model responses, and have them rated (by humans or an LLM judge). For each threshold value, compute: (1) the fraction of queries that would be routed to the cheap model, (2) the fraction of those cheap-routed queries where the cheap model response was rated as adequate. The threshold that gives you an acceptable quality floor at the minimum cost is the operating point to deploy. In practice, routing thresholds of 0.5-0.7 on a well-calibrated classifier often achieve 60-75% cheap routing while keeping quality degradation below 5%.

Cost Tracking and Monitoring in Production

from collections import defaultdict
import time

class RoutingMonitor:
    """Track routing decisions, costs, and latency in production."""
    def __init__(self, cheap_cost_per_1k: float = 0.001,
                 expensive_cost_per_1k: float = 0.015):
        self.costs = {'cheap': cheap_cost_per_1k, 'expensive': expensive_cost_per_1k}
        self.records = defaultdict(list)

    def log(self, model_tier: str, input_tokens: int,
             output_tokens: int, latency_ms: float):
        total_tokens = input_tokens + output_tokens
        cost = self.costs[model_tier] * total_tokens / 1000
        self.records[model_tier].append({
            'cost': cost, 'tokens': total_tokens, 'latency_ms': latency_ms
        })

    def summary(self) -> dict:
        total_cost = sum(r['cost'] for tier in self.records.values() for r in tier)
        n_cheap = len(self.records['cheap'])
        n_expensive = len(self.records['expensive'])
        total = n_cheap + n_expensive
        avg_latency = {tier: sum(r['latency_ms'] for r in recs) / len(recs)
                       for tier, recs in self.records.items() if recs}
        # Counterfactual: what if everything went to expensive model?
        all_tokens = sum(r['tokens'] for tier in self.records.values() for r in tier)
        counterfactual_cost = self.costs['expensive'] * all_tokens / 1000
        savings = counterfactual_cost - total_cost
        return {
            'total_requests': total,
            'cheap_pct': 100 * n_cheap / total if total else 0,
            'total_cost_usd': round(total_cost, 4),
            'counterfactual_cost_usd': round(counterfactual_cost, 4),
            'estimated_savings_usd': round(savings, 4),
            'avg_latency_ms': avg_latency,
        }

When to Use a Trained Router vs a Cascade

Use a trained classifier router when: you have a stable, well-defined query distribution that you can collect labels for; latency is a hard constraint (the router adds only 5-15ms); and the cost budget is tight enough that paying for cheap model calls on escalated queries is wasteful. Use a cascade router when: the query distribution is diverse or rapidly evolving, you cannot afford the labelling effort to train a good classifier, or you need a system that self-adapts as model capabilities change. Hybrid approaches work well in practice — a fast rule-based pre-filter for obviously simple or obviously complex queries (based on length, keywords, or query type), a trained classifier for the middle ground, and a cascade fallback for genuinely ambiguous cases.

To detect when your router needs retraining, monitor the escalation rate (for cascades) or the cheap-routing fraction (for classifiers) over time. A steadily climbing escalation rate suggests that the query distribution is shifting toward harder queries, or that the cheap model’s self-reported confidence has drifted. Retraining the router on recent labelled data every few weeks is a reasonable maintenance cadence for high-volume production systems.
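
A rolling-window check is one simple way to watch for this kind of drift. The class below is a sketch under assumed parameters: the window size, the baseline rate, and the alert tolerance are all hypothetical and should come from your own traffic.

```python
from collections import deque

class EscalationDriftMonitor:
    """Rolling escalation rate compared against a fixed baseline."""
    def __init__(self, baseline_rate: float, window: int = 1000,
                 tolerance: float = 0.10):
        self.baseline_rate = baseline_rate
        self.tolerance = tolerance          # absolute increase that triggers an alert
        self.events = deque(maxlen=window)  # 1 = escalated, 0 = cheap only

    def record(self, escalated: bool) -> None:
        self.events.append(1 if escalated else 0)

    def current_rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

    def drifted(self) -> bool:
        """True once the window is full and the rate exceeds baseline + tolerance."""
        full = len(self.events) == self.events.maxlen
        return full and self.current_rate() > self.baseline_rate + self.tolerance
```

The same structure works for a classifier router by recording whether each query was sent to the large model instead of whether it was escalated.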

Rule-Based Pre-Filtering Before the Router

Before any learned router runs, a fast rule-based pre-filter can eliminate the obvious cases cheaply. Queries that are obviously simple — single-sentence factual lookups, short templated requests, queries matching known easy patterns — can be routed directly to the cheap model without ever touching the classifier. Queries that are obviously complex — very long inputs, requests containing code blocks, multi-document analysis tasks — can bypass the classifier and go directly to the expensive model. This pre-filter reduces the classifier’s job to only the ambiguous middle ground, which both improves classification accuracy (the classifier is only asked to decide hard cases it was trained for) and reduces overall routing latency on the clear cases.

import re

def rule_based_prefilter(query: str) -> str | None:
    """
    Fast rule-based routing for obvious cases.
    Returns 'cheap', 'expensive', or None (undecided, use classifier).
    """
    n_words = len(query.split())
    # Obviously simple: short, no code, no multi-step indicators.
    # Patterns are matched against the lowercased query, so keep them lowercase.
    simple_patterns = [
        r'^what is\b', r'^define\b', r'^how do i\b.{0,40}$',
        r'^translate\b', r'^summari[sz]e\b.{0,60}$'
    ]
    # Obviously complex: long, code, reasoning-heavy keywords
    complex_keywords = ['implement', 'debug', 'refactor', 'architecture',
                         'tradeoffs', 'compare and contrast', 'multi-step',
                         'given the following code']
    has_code = '```' in query or 'def ' in query or 'class ' in query
    if n_words < 15 and any(re.match(p, query.lower()) for p in simple_patterns):
        return 'cheap'
    if n_words > 300 or has_code or any(kw in query.lower() for kw in complex_keywords):
        return 'expensive'
    return None  # let the trained classifier decide

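def route_query(query: str, router: 'EmbeddingRouter',
                threshold: float = 0.5) -> str:
    """Returns 'cheap' or 'expensive'."""
    prefilter_result = rule_based_prefilter(query)
    if prefilter_result is not None:
        return prefilter_result
    # EmbeddingRouter.route returns 'large'/'small'; map to the pre-filter's
    # labels so callers always see one vocabulary.
    decision = router.route(query, threshold=threshold)
    return 'expensive' if decision == 'large' else 'cheap'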

Routing for RAG Pipelines

RAG applications have a natural two-level routing opportunity that is different from general chat routing. At the query level, the same cost-complexity routing applies: simple factual questions can be answered by a cheap model if the retrieved context is good, while complex synthesis tasks need a larger model. At the retrieval level, you can route between dense retrieval (embedding search) and sparse retrieval (BM25) based on query type: keyword-heavy queries with technical terminology tend to be better served by BM25 or hybrid search, while semantic or paraphrased queries favour dense retrieval. Combining both levels of routing — retrieval strategy and generation model — into a single routing decision is an underused optimisation that can significantly improve both quality and cost simultaneously.

A practical implementation is a two-stage RAG router: first classify the query as keyword-heavy or semantic (this can be a simple heuristic — queries with low type-token ratio and technical terms are keyword-heavy), route retrieval accordingly, then use the retrieved context quality score (e.g., the max cosine similarity of the top chunk to the query) as an additional input to the generation routing decision. High-confidence retrieval with a simple query often means the cheap model can generate a good answer from the context without needing frontier-model reasoning; low-confidence retrieval combined with a complex query is the case that most benefits from a capable model that can reason across uncertain context.
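
The two-stage idea can be sketched in a few lines. Everything here is an assumption made for illustration: the token-shape heuristic for spotting keyword-heavy queries is a stand-in for whatever classifier you prefer, the 0.3 and 0.75 cutoffs are arbitrary, and query_is_complex is expected to come from your generation router.

```python
import re

def is_keyword_heavy(query: str, min_technical_share: float = 0.3) -> bool:
    """Hypothetical heuristic: share of tokens that look like identifiers,
    error codes, acronyms, or version strings."""
    tokens = query.split()
    if not tokens:
        return False
    technical = [t for t in tokens
                 if re.search(r"\d|_|::|\.\w", t) or (t.isupper() and len(t) > 1)]
    return len(technical) / len(tokens) >= min_technical_share

def choose_retriever(query: str) -> str:
    """Stage 1: pick sparse retrieval for keyword-heavy queries, dense otherwise."""
    return "bm25" if is_keyword_heavy(query) else "dense"

def choose_generator(query_is_complex: bool, top_similarity: float,
                     min_similarity: float = 0.75) -> str:
    """Stage 2: cheap model only when the query is simple AND retrieval looks confident."""
    if not query_is_complex and top_similarity >= min_similarity:
        return "cheap"
    return "expensive"
```

Similarity scores are only comparable within one retriever and embedding model, so min_similarity needs to be calibrated separately for each retrieval path.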

A/B Testing Your Router

Before fully deploying a router, run a shadow evaluation to catch silent quality regressions. In shadow mode, every query is answered by both the cheap model and the expensive model (or by the router decision and a full-expensive baseline); both responses are logged, but only the router’s chosen response is returned to the user. A random sample of these pairs is then rated to verify that the router’s cheap-routing quality is acceptable before shadow mode is turned off. A shadow period of around two weeks catches routing errors that aggregate metrics miss, for example a specific query sub-type that the router consistently misclassifies as easy when it is actually hard for the cheap model. Catching these before they affect 100% of traffic is the difference between a smooth router launch and one that requires a rollback.

RouteLLM and Open-Source Routers

RouteLLM is an open-source framework from LMSYS that provides pre-trained routers for choosing between a strong and a weak model. Its published routers were trained on the GPT-4 versus Mixtral 8x7B decision, and the authors report that they generalise to other model pairs. The framework includes several router architectures (a similarity-weighted ranking router, a matrix factorisation router, a BERT-based classifier, and a causal LLM classifier) and provides benchmark evaluations (MMLU, MT-Bench, GSM8K) for measuring cost-quality tradeoffs. For teams that do not want to build their own training pipeline, RouteLLM offers a reasonable starting point; the main caveat is that its pre-trained routers were calibrated on one specific model pair and may need fine-tuning on your model pair and query distribution to achieve the claimed cost savings on your workload.

The practical lesson from RouteLLM’s benchmarks is that the achievable savings depend heavily on the task. On MMLU (multiple-choice knowledge questions), routing 50% of queries to the weaker model achieves near-identical accuracy to routing everything to GPT-4, because the weaker model handles factual recall well. On coding benchmarks, the cheap model falls off faster and the router needs a lower threshold to maintain quality, meaning a smaller fraction of queries can be safely routed to the cheap model. Profiling your own workload by task type before setting a single global routing threshold is worth the effort: a single threshold optimised for your average query mix will underperform task-specific thresholds by a meaningful margin in heterogeneous production systems.
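
Task-specific thresholds can be as simple as a lookup table in front of the classifier router. The task names and threshold values below are hypothetical placeholders; real values should come from calibrating each task type separately.

```python
# Hypothetical task types and thresholds; derive real values from per-task calibration.
TASK_THRESHOLDS = {"faq": 0.7, "policy": 0.5, "coding": 0.3}
DEFAULT_THRESHOLD = 0.5

def route_with_task_threshold(score: float, task_type: str) -> str:
    """score = router's probability that the large model is needed."""
    threshold = TASK_THRESHOLDS.get(task_type, DEFAULT_THRESHOLD)
    return "large" if score >= threshold else "small"
```

Here the same score of 0.4 sends an FAQ query to the small model but a coding query to the large one, which reflects the lower threshold needed where the cheap model degrades faster.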

Routing Thresholds Are a Business Decision, Not Just a Technical One

The final routing threshold should be set jointly by engineering and product, not decided unilaterally by whoever builds the router. A threshold of 0.6 that routes 70% of queries to the cheap model is a business decision that trades some quality risk for meaningful cost reduction. Product needs to sign off on the acceptable quality floor — ideally backed by user metrics like task completion rate or satisfaction scores, not just automated LLM-judge ratings. Engineering owns calibrating the router to hit that floor reliably. The two-week shadow evaluation period is the right moment to involve stakeholders: present the actual quality distribution at several threshold values, let the team agree on an acceptable operating point, and then deploy with confidence rather than optimising in production after launch. Setting the threshold once at launch and never revisiting it is one of the most common routing mistakes — a quarterly review of the quality distribution at your deployed threshold, compared against the most recent quarter of ratings, keeps the router working as the query mix and model capabilities evolve.
