LLM Observability in Production: How to Monitor Quality, Cost, and Latency

Why LLM Observability Is Different from Traditional Monitoring

Traditional software monitoring tracks binary outcomes: did the function return? Did the API respond within SLA? Did the database query succeed? LLM applications add a third dimension that traditional monitoring ignores entirely: quality. A response can be returned quickly, at low cost, with a 200 status code — and still be factually wrong, off-topic, or harmful. Standard APM tools cannot detect this. LLM observability requires tracking not just whether the system responded, but how well it responded, and connecting that quality signal back to the specific prompts, models, and configurations that produced it.

The four pillars of LLM observability are cost (how much each request and each application component costs), latency (time to first token and total generation time), quality (is the output actually good?), and reliability (error rates, rate limit hits, timeouts). Most teams instrument cost and latency first because they are easy to measure. Quality is harder — it requires either human evaluation, automated scoring with a judge model, or task-specific metrics — but it is ultimately the most important signal for production LLM systems.

What to Log for Every LLM Request

The minimum logging schema for production LLM observability:

import time
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone

import anthropic

@dataclass
class LLMTrace:
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
    model: str = ""
    input_tokens: int = 0
    output_tokens: int = 0
    cached_tokens: int = 0
    ttft_ms: float = 0.0       # Time to first token
    total_ms: float = 0.0      # Total wall time
    input_cost_usd: float = 0.0
    output_cost_usd: float = 0.0
    prompt_hash: str = ""      # Hash of system prompt (not full text)
    user_id: str = ""
    session_id: str = ""
    application: str = ""
    success: bool = True
    error: str = ""

client = anthropic.Anthropic()
# Prices in USD per million tokens; verify against your provider's current price list
COSTS = {"claude-sonnet-4-6": {"input": 3.0, "output": 15.0, "cache": 0.30}}

def traced_completion(messages, system="", model="claude-sonnet-4-6", **kwargs) -> tuple[str, LLMTrace]:
    trace = LLMTrace(model=model)
    # Default max_tokens, but let the caller override it without a duplicate-keyword error
    kwargs.setdefault("max_tokens", 1024)
    t0 = time.perf_counter()
    try:
        resp = client.messages.create(model=model,
            system=system, messages=messages, **kwargs)
        trace.total_ms = (time.perf_counter() - t0) * 1000
        u = resp.usage
        trace.input_tokens = u.input_tokens
        trace.output_tokens = u.output_tokens
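        # cache_read_input_tokens can be absent or None when caching is not used
        trace.cached_tokens = getattr(u, "cache_read_input_tokens", 0) or 0
        c = COSTS.get(model, {"input": 3.0, "output": 15.0, "cache": 0.30})
        # Cache reads are reported separately from input_tokens and billed at the cache rate
        trace.input_cost_usd = ((trace.input_tokens / 1e6) * c["input"]
                                + (trace.cached_tokens / 1e6) * c["cache"])
        trace.output_cost_usd = (trace.output_tokens / 1e6) * c["output"]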
        return resp.content[0].text, trace
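    except Exception as e:
        # Record the failure and attach the trace to the exception so the caller
        # can still log it; otherwise the trace is lost when the error propagates.
        trace.total_ms = (time.perf_counter() - t0) * 1000
        trace.success = False
        trace.error = str(e)
        e.llm_trace = trace
        raise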
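
The schema above includes ttft_ms, but a non-streaming call cannot populate it: the first token is only observable when the response is streamed. The following is a minimal sketch of the timing logic, assuming the response arrives as an iterator of text chunks (with the Anthropic SDK this would typically be the text stream exposed by client.messages.stream). The names measure_stream and fake_stream are hypothetical and exist only for illustration.

```python
import time
from typing import Iterable, Iterator, Optional

def measure_stream(chunks: Iterable[str],
                   started_at: Optional[float] = None) -> tuple[str, float, float]:
    """Consume streamed text chunks; return (text, ttft_ms, total_ms).

    started_at should be a time.perf_counter() reading taken just before the
    request was sent. If omitted, timing starts when this function is called.
    """
    t0 = time.perf_counter() if started_at is None else started_at
    ttft_ms = 0.0
    parts: list[str] = []
    for chunk in chunks:
        if not parts:
            # First chunk received: this is the time to first token
            ttft_ms = (time.perf_counter() - t0) * 1000
        parts.append(chunk)
    total_ms = (time.perf_counter() - t0) * 1000
    return "".join(parts), ttft_ms, total_ms

def fake_stream() -> Iterator[str]:
    # Stand-in for a real model stream, used only to exercise the timer
    time.sleep(0.02)
    yield "Hello"
    time.sleep(0.02)
    yield ", world"

text, ttft_ms, total_ms = measure_stream(fake_stream())
```

The two values can then be copied into trace.ttft_ms and trace.total_ms. Passing started_at matters in practice, because the request is usually sent before the first chunk is requested from the iterator.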

Tracking Quality with LLM-as-Judge

The most scalable approach to automated quality monitoring is using a second LLM to evaluate the outputs of your production LLM. The judge model scores responses on dimensions relevant to your application — correctness, relevance, completeness, tone — and returns structured scores that can be aggregated into dashboards and used to set quality alerts.

JUDGE_PROMPT = """Evaluate this LLM response on a scale of 1-5 for each criterion.
User question: {question}
LLM response: {response}

Rate each dimension:
- Relevance: Does it answer the actual question? (1=off-topic, 5=directly answers)
- Accuracy: Is the information correct? (1=wrong, 5=factually accurate)  
- Completeness: Is the answer complete? (1=missing key info, 5=comprehensive)
- Tone: Is the tone appropriate? (1=inappropriate, 5=perfect)

Return JSON only: {{"relevance":N,"accuracy":N,"completeness":N,"tone":N,"explanation":"..."}}"""

import json

def judge_response(question: str, response: str, model="claude-haiku-4-5-20251001") -> dict:
    result = client.messages.create(
        model=model, max_tokens=256,
        messages=[{"role":"user","content":JUDGE_PROMPT.format(
            question=question, response=response)}]
    )
    return json.loads(result.content[0].text)

Use a cheap, fast model for judging — Claude Haiku or GPT-4o mini — to keep the evaluation overhead low. Run the judge on 5–10% of production traffic rather than every request to balance coverage with cost. For critical applications, run the judge on 100% and treat low-scoring responses as alerts requiring human review.
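
Two practical details are worth sketching here: choosing which requests to judge, and handling judge output that is not clean JSON. The example below is one possible approach, not a definitive implementation. It assumes traces carry a string trace_id and that the judge returns the four dimensions named in the prompt above; should_judge and parse_judge_scores are hypothetical helper names.

```python
import hashlib
import json
from typing import Optional

DIMENSIONS = ("relevance", "accuracy", "completeness", "tone")

def should_judge(trace_id: str, sample_rate: float = 0.10) -> bool:
    """Deterministically select roughly sample_rate of traces for judging."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate

def parse_judge_scores(raw: str) -> Optional[dict]:
    """Extract and validate the judge's JSON; return None if it is unusable."""
    # Tolerate extra text or code fences around the JSON object
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for dim in DIMENSIONS:
        score = data.get(dim)
        # Reject missing, non-integer, or out-of-range scores
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            return None
    return data
```

Hashing the trace ID rather than calling a random number generator means the same request is always either in or out of the sample, which keeps reruns and backfills consistent. Counting how often parse_judge_scores returns None is itself a useful health metric for the judge.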

Cost Monitoring and Alerting

LLM cost spikes are one of the most common operational surprises. A prompt that grows unexpectedly long, a caching configuration that stopped working, or a traffic spike can all cause costs to balloon before anyone notices. Track cumulative daily spend against a budget threshold and alert when it exceeds 80%:

import redis
from datetime import date

r = redis.Redis()

def record_cost(cost_usd: float, application: str):
    today = date.today().isoformat()
    r.incrbyfloat(f"llm_cost:{application}:{today}", cost_usd)
    r.expire(f"llm_cost:{application}:{today}", 86400 * 7)

def check_budget(application: str, daily_budget: float) -> bool:
    today = date.today().isoformat()
    spent = float(r.get(f"llm_cost:{application}:{today}") or 0)
    if spent > daily_budget * 0.8:
        # send_alert(f"{application} at {spent/daily_budget:.0%} of daily budget")
        pass
    return spent < daily_budget

Break cost tracking down by application component (which endpoint or feature), model (are expensive models being called when cheap ones would suffice?), and user segment (are specific users generating disproportionate costs?). These breakdowns reveal optimisation opportunities that aggregate cost numbers hide.
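
As an illustration of these breakdowns, the sketch below groups logged traces by an arbitrary dimension. It assumes each trace has been converted to a dictionary (for example with dataclasses.asdict) containing the cost fields from the schema above. The function name cost_breakdown and the sample values are hypothetical.

```python
from collections import defaultdict

def cost_breakdown(traces: list[dict], dimension: str) -> list[tuple[str, float]]:
    """Total cost in USD per value of dimension, most expensive first."""
    totals: dict[str, float] = defaultdict(float)
    for t in traces:
        key = t.get(dimension) or "unknown"
        totals[key] += t.get("input_cost_usd", 0.0) + t.get("output_cost_usd", 0.0)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Illustrative sample data, not real measurements
sample_traces = [
    {"application": "search", "model": "claude-sonnet-4-6", "user_id": "u1",
     "input_cost_usd": 0.03, "output_cost_usd": 0.06},
    {"application": "search", "model": "claude-haiku-4-5", "user_id": "u2",
     "input_cost_usd": 0.001, "output_cost_usd": 0.002},
    {"application": "chat", "model": "claude-sonnet-4-6", "user_id": "u1",
     "input_cost_usd": 0.02, "output_cost_usd": 0.05},
]

by_model = cost_breakdown(sample_traces, "model")
by_application = cost_breakdown(sample_traces, "application")
by_user = cost_breakdown(sample_traces, "user_id")
```

In a real deployment the same grouping would more likely be a SQL GROUP BY over the trace table; the point is that the trace must carry the dimension at logging time, because it cannot be recovered afterwards.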

Latency Percentiles and SLA Tracking

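Mean latency is a misleading metric for LLM applications because the distribution is wide and right-skewed: a small number of very slow requests pull the mean well above what a typical request sees, while the mean still understates how slow the worst requests are. Track p50, p95, and p99 latencies separately. Tail latency matters because a user who makes many requests will hit the p99 regularly, even though any single request is usually close to the p50. An application with a 500 ms p50 TTFT and an 8-second p99 TTFT has a latency problem that no single average reveals.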

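from collections import deque

class LatencyTracker:
    def __init__(self, window_size: int = 1000):
        self.window_size = window_size
        # deque(maxlen=...) discards the oldest sample automatically in O(1)
        self.ttft_samples: deque[float] = deque(maxlen=window_size)
        self.total_samples: deque[float] = deque(maxlen=window_size)

    def record(self, ttft_ms: float, total_ms: float):
        self.ttft_samples.append(ttft_ms)
        self.total_samples.append(total_ms)

    @staticmethod
    def _pct(sorted_samples: list[float], q: float) -> float:
        # Nearest-rank percentile; the index is clamped to the last element
        idx = min(int(len(sorted_samples) * q), len(sorted_samples) - 1)
        return sorted_samples[idx]

    def percentiles(self) -> dict:
        if not self.ttft_samples:
            return {}
        out = {}
        # Report both TTFT and total latency over the sliding window
        for name, samples in (("ttft", self.ttft_samples), ("total", self.total_samples)):
            s = sorted(samples)
            for label, q in (("p50", 0.50), ("p95", 0.95), ("p99", 0.99)):
                out[f"{name}_{label}"] = self._pct(s, q)
        return out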

Observability Platforms: Build vs. Buy

Several dedicated LLM observability platforms have emerged that provide dashboards, tracing, evaluation, and alerting without custom instrumentation. LangSmith (from LangChain) is among the most widely used, offering automatic tracing of LangChain and LangGraph applications, a prompt playground for debugging, and human annotation tools for building evaluation datasets. Langfuse is an open-source alternative deployable on your own infrastructure, which matters for teams with data residency requirements that prevent sending traces to third-party SaaS. Weights & Biases Weave extends W&B's established ML experiment tracking into LLM tracing and evaluation. Arize Phoenix is open-source with strong RAG evaluation tooling, particularly for tracing retrieval quality in addition to generation quality.

All of these platforms accept traces via SDK instrumentation and provide similar core capabilities: request tracing, cost tracking, latency dashboards, and evaluation pipelines. The choice between them is primarily driven by: whether open-source self-hosting is required, which LLM frameworks you use (LangSmith integrates most naturally with LangChain), and whether you want LLM observability integrated with your existing ML experiment tracking (Weave if you already use W&B).

For teams starting from scratch, LangSmith's free tier is the fastest path to meaningful observability — a few lines of environment variables enable automatic tracing with no code changes if you use LangChain:

import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-langsmith-key"
os.environ["LANGCHAIN_PROJECT"] = "production-rag-app"
# All LangChain calls now automatically traced in LangSmith

Building a Quality Dashboard

An effective LLM quality dashboard tracks six metrics in real time: daily request volume, segmented by application and model; average quality score from the judge model, with a 7-day rolling trend; cost per request by model tier, flagging when expensive models are called for queries that should route to cheap ones; p99 TTFT latency, with an SLA threshold line; error rate (API failures, rate limits, timeouts) as a percentage of total requests; and cache hit rate for applications using prompt caching, since a falling cache hit rate often indicates a prompt format change that broke the caching configuration. Export these metrics to Grafana or Datadog alongside your other infrastructure metrics so LLM health is visible in the same operational context as your other production services.

Regression Testing and Eval Suites

Observability is reactive — it catches problems after they reach production. Complementing it with a proactive evaluation suite that runs on every significant prompt or model change catches regressions before deployment. Maintain a curated set of 50–200 representative queries with expected outputs or quality criteria. Run the eval suite in CI on every prompt change. Track eval scores over time alongside production quality metrics — if they diverge, your eval suite may not be representative of production traffic. The combination of pre-deployment evals and production monitoring creates a feedback loop: production failures identify gaps in the eval suite; improved evals catch issues before they reach production. This loop, consistently maintained, is what separates LLM applications that improve systematically from those that require reactive firefighting.
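
A minimal sketch of such an eval suite is shown below. It assumes the simplest possible quality criterion, required substrings in the output, and a generate callable that wraps whatever prompt and model are under test. EvalCase, run_eval_suite, and stub_model are hypothetical names; a real suite would usually combine several criteria, including judge-model scores.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    query: str
    must_contain: list[str]  # simple criterion: substrings the answer must include

def run_eval_suite(cases: list[EvalCase],
                   generate: Callable[[str], str],
                   min_pass_rate: float = 0.9) -> dict:
    """Run every case through generate and report the pass rate and failures."""
    failures = []
    for case in cases:
        output = generate(case.query).lower()
        missing = [s for s in case.must_contain if s.lower() not in output]
        if missing:
            failures.append({"query": case.query, "missing": missing})
    pass_rate = 1 - len(failures) / len(cases) if cases else 0.0
    return {"pass_rate": pass_rate,
            "passed": pass_rate >= min_pass_rate,
            "failures": failures}

def stub_model(query: str) -> str:
    # Stand-in for the real prompt + model call, used only for illustration
    answers = {"What is the capital of France?": "The capital of France is Paris."}
    return answers.get(query, "I am not sure.")

cases = [
    EvalCase("What is the capital of France?", ["Paris"]),
    EvalCase("What is the capital of Japan?", ["Tokyo"]),
]
report = run_eval_suite(cases, stub_model)
```

In CI, the job would fail the build when report["passed"] is false and print report["failures"] so the regression is visible in the build log.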

Prompt Versioning and Change Management

Prompt changes are the most common source of quality regressions in production LLM applications, and they are often not treated with the same rigour as code changes. A prompt that performed well in testing can degrade significantly with a minor wording change, a different model version, or a shift in the production traffic distribution. Treat every prompt change as a code change: version it in source control, test it against your eval suite before deployment, deploy it with a staged rollout that compares quality metrics between old and new versions, and have a rollback plan. Store prompts in a prompt registry — either a dedicated tool like LangSmith's prompt hub or simply a database table with version history — so you can identify exactly which prompt version produced a given output for any historical request. Without this, debugging quality regressions becomes a guessing game.
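
The "database table with version history" option can be sketched in a few lines. The example below keeps versions in memory and keys them by a content hash, which is also a natural value for the prompt_hash field in the trace schema. PromptRegistry and its methods are hypothetical names, and a production version would persist to a database rather than a dictionary.

```python
import hashlib
from datetime import datetime, timezone
from typing import Optional

class PromptRegistry:
    """In-memory prompt store keyed by content hash (illustrative only)."""

    def __init__(self):
        self._versions: dict[str, dict] = {}

    def register(self, name: str, text: str) -> str:
        # The hash is derived from the prompt text, so identical prompts
        # always map to the same version identifier
        prompt_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
        self._versions.setdefault(prompt_hash, {
            "name": name,
            "text": text,
            "created_at": datetime.now(timezone.utc).isoformat(),
        })
        return prompt_hash

    def lookup(self, prompt_hash: str) -> Optional[dict]:
        return self._versions.get(prompt_hash)

registry = PromptRegistry()
v1 = registry.register("support-bot", "You are a helpful support agent.")
v2 = registry.register("support-bot", "You are a concise, helpful support agent.")
```

Logging the returned hash with every request is what makes it possible to look up the exact prompt text behind any historical output.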

Observability for RAG Applications

RAG applications have additional observability requirements beyond standard LLM monitoring. The retrieval step can fail independently of the generation step — returning irrelevant chunks, missing relevant documents, or retrieving the right documents but at the wrong granularity. Track retrieval quality metrics separately: the number of chunks retrieved, the similarity scores of the top-k results, and whether the retrieved chunks actually contained the information needed to answer the question (measurable via judge evaluation). A response that is accurate but not grounded in the retrieved context (the model answered from training knowledge rather than the retrieved documents) is a different failure mode from a response that faithfully summarises irrelevant retrieved documents. Distinguishing these two failure modes requires logging both the retrieved context and the final response, and evaluating each independently. Most RAG observability platforms (Arize Phoenix, Langfuse) support this retrieval-specific evaluation natively.
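
One way to make the two failure modes explicit is to score retrieval and grounding separately and classify each request from the pair. The sketch below assumes two judge scores on the same 1-5 scale used earlier: context_relevance (did the retrieved chunks contain the needed information?) and groundedness (is the response supported by those chunks?). The function name, labels, and threshold are hypothetical.

```python
def classify_rag_outcome(context_relevance: int, groundedness: int,
                         threshold: int = 4) -> str:
    """Map two independent judge scores to a coarse RAG failure category."""
    good_retrieval = context_relevance >= threshold
    grounded = groundedness >= threshold
    if good_retrieval and grounded:
        return "ok"
    if grounded:
        # The model faithfully used chunks that did not contain the answer
        return "faithful_to_irrelevant_context"
    if good_retrieval:
        # Relevant chunks were retrieved but the answer did not rely on them
        return "ungrounded_generation"
    return "retrieval_failure_and_ungrounded"

outcomes = [
    classify_rag_outcome(5, 5),
    classify_rag_outcome(2, 5),
    classify_rag_outcome(5, 2),
    classify_rag_outcome(1, 1),
]
```

Counting these categories over time shows whether quality problems should be fixed in the retriever (chunking, embeddings, top-k) or in the generation prompt.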

The Minimum Viable Observability Stack

For teams just getting started, a minimal but effective observability stack consists of four components: a trace logger that records request, response, tokens, cost, and latency to a database for every production call; a daily cost report emailed to whoever owns the LLM budget, breaking down spend by application and model; a weekly quality sample in which a team member reviews and scores 20–30 random production responses; and a Grafana or Datadog dashboard showing daily request volume, error rate, and p95 latency. This stack takes roughly two days to build and catches the vast majority of production problems. Add automated quality scoring, regression test suites, and finer-grained alerting incrementally, as growth in traffic and system complexity justifies the extra investment.

Cost Attribution in Multi-Tenant Applications

For SaaS products where LLM costs are a component of serving each customer, accurate cost attribution by customer or tenant is essential for understanding unit economics and pricing decisions. Tag every LLM call with the customer ID and aggregate cost by customer daily. This reveals which customers are disproportionately expensive to serve — a common finding is that 10–20% of customers generate 60–80% of LLM costs, often because their usage patterns involve longer contexts or more complex queries. This data directly informs pricing model decisions: should you introduce usage-based pricing? Are some customers subsidised by others? Is the average serving cost per customer below your pricing floor? Without per-customer cost attribution, these questions are unanswerable. With it, they become routine operational metrics that inform product and commercial decisions.
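
Cost concentration is easy to measure once cost is aggregated per customer. The sketch below computes the share of total spend generated by the most expensive fraction of customers. The function name top_customer_share and the sample figures are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def top_customer_share(cost_by_customer: dict[str, float],
                       top_fraction: float = 0.2) -> float:
    """Fraction of total cost generated by the top top_fraction of customers."""
    costs = sorted(cost_by_customer.values(), reverse=True)
    total = sum(costs)
    if total == 0:
        return 0.0
    # Round before ceil to avoid floating-point artefacts such as 2.0000000001
    top_n = max(1, math.ceil(round(len(costs) * top_fraction, 9)))
    return sum(costs[:top_n]) / total

# Illustrative daily costs in USD for ten customers
daily_costs = {"c0": 400.0, "c1": 300.0}
daily_costs.update({f"c{i}": 37.5 for i in range(2, 10)})

share = top_customer_share(daily_costs, top_fraction=0.2)
```

With these sample figures, the top two of ten customers account for 700 of 1,000 USD, so share is 0.7.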

Setting Up Alerts That Actually Matter

Alert fatigue is a real risk — too many low-signal alerts and teams start ignoring all of them. For LLM observability, five alert conditions are worth configuring and few others. An alert when daily spend exceeds 80% of budget before end of day. An alert when error rate exceeds 5% over a 15-minute window. An alert when p99 TTFT exceeds your SLA threshold for more than 5 minutes. An alert when automated quality scores fall below a threshold for two consecutive hours of traffic. And an alert when cache hit rate drops more than 20 percentage points from its 7-day baseline. These five alerts cover the most common and impactful failure modes for production LLM applications. Start with these and add additional alerts only when you identify a recurring failure mode that none of them would have caught. Keep the signal-to-noise ratio high by resisting the temptation to alert on everything measurable.
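
The five conditions can be expressed as a single rule function over pre-aggregated metrics. This is a sketch under the assumption that a metrics pipeline already computes the windowed values; MetricsSnapshot, its field names, and the alert labels are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class MetricsSnapshot:
    spend_fraction: float            # today's spend divided by the daily budget
    error_rate_15m: float            # failed / total requests, last 15 minutes
    minutes_p99_over_sla: float      # how long p99 TTFT has been above the SLA
    hours_quality_below_min: float   # how long judge scores have been below threshold
    cache_hit_rate: float            # current cache hit rate, 0..1
    cache_hit_rate_baseline: float   # 7-day baseline cache hit rate, 0..1

def evaluate_alerts(m: MetricsSnapshot) -> list[str]:
    """Return the names of the alert conditions that currently fire."""
    alerts = []
    if m.spend_fraction > 0.80:
        alerts.append("budget_80_percent")
    if m.error_rate_15m > 0.05:
        alerts.append("error_rate")
    if m.minutes_p99_over_sla > 5:
        alerts.append("p99_ttft_sla")
    if m.hours_quality_below_min >= 2:
        alerts.append("quality_drop")
    # A drop of more than 20 percentage points from the baseline
    if m.cache_hit_rate_baseline - m.cache_hit_rate > 0.20:
        alerts.append("cache_hit_rate_drop")
    return alerts

healthy = MetricsSnapshot(0.40, 0.01, 0, 0, 0.70, 0.75)
degraded = MetricsSnapshot(0.85, 0.02, 7, 0, 0.45, 0.75)
```

Keeping the rules in one small function makes the full alert surface reviewable at a glance, which helps resist adding low-signal alerts over time.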

Tracing Across the Full Request Lifecycle

Production LLM applications rarely make a single model call per user request. A RAG query involves an embedding call, a vector store retrieval, and a generation call. An agent might make 5–15 tool calls and model calls before producing a final response. Tracing connects all of these into a single request span so you can see the full picture: how long retrieval took versus generation, which tool call added the most latency, and where cost accumulated across a multi-step workflow.

OpenTelemetry is the standard for distributed tracing, and LLM observability platforms increasingly support it. LangSmith and Langfuse both accept OpenTelemetry spans, so LLM calls can be instrumented with the same tooling as the rest of your backend. Because the instrumentation is vendor-neutral, the same spans can also be exported to a general-purpose tracing backend such as Jaeger or Honeycomb, where an LLM application's performance profile appears next to your database queries and API calls. That makes it easy to identify whether a slow user-facing response is caused by retrieval latency, LLM generation time, or something entirely outside the LLM layer, such as a slow downstream API call.

The practical benefit of unified tracing is faster debugging. Without it, an engineer investigating a slow request must correlate timestamps manually across separate log streams for the LLM provider, the vector database, and the application server. With distributed tracing, the entire causal chain is visible in a single waterfall diagram: which step took how long, in what order, and where the bottleneck was. For complex agentic systems with many steps, this can be the difference between minutes and hours of debugging, which is enough to justify the instrumentation overhead.
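
To make the idea of nested spans concrete, the sketch below implements a tiny tracer using only the standard library. It is a stand-in for illustration and is not the OpenTelemetry API; in production the same structure would be produced by OpenTelemetry SDK spans. The Tracer class and the span names are hypothetical, the sleeps simulate work, and the class is not thread-safe.

```python
import time
from contextlib import contextmanager

class Tracer:
    """Minimal nested-span recorder (illustrative, single-threaded)."""

    def __init__(self):
        self.spans: list[dict] = []
        self._stack: list[dict] = []

    @contextmanager
    def span(self, name: str):
        record = {
            "name": name,
            "parent": self._stack[-1]["name"] if self._stack else None,
            "duration_ms": 0.0,
        }
        self.spans.append(record)
        self._stack.append(record)
        t0 = time.perf_counter()
        try:
            yield record
        finally:
            # Always close the span, even if the wrapped step raised
            record["duration_ms"] = (time.perf_counter() - t0) * 1000
            self._stack.pop()

tracer = Tracer()
with tracer.span("rag_request"):
    with tracer.span("embed_query"):
        time.sleep(0.005)   # simulated embedding call
    with tracer.span("vector_search"):
        time.sleep(0.01)    # simulated retrieval
    with tracer.span("generate"):
        time.sleep(0.04)    # simulated model generation

children = [s for s in tracer.spans if s["parent"] == "rag_request"]
slowest = max(children, key=lambda s: s["duration_ms"])
```

Here slowest identifies the step that dominated the request, which is the question a waterfall view answers visually.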

Investing in distributed tracing early — before your LLM system grows complex enough to make manual log correlation impractical — is one of the highest-leverage operational decisions you can make for the long-term maintainability of an LLM application.
