Distributed tracing with OpenTelemetry is the right observability primitive for LLM applications — better suited than logs or metrics for understanding latency, failures, and cost across multi-step pipelines. A trace captures the full execution path of a single request as a tree of spans: one span for the HTTP handler, child spans for each retrieval step, each LLM call, each tool invocation. You can see exactly where time is spent, which component failed, and what inputs and outputs flowed through each step — all correlated by a single trace ID. This guide covers instrumenting a Python LLM application with OpenTelemetry from scratch: setting up the SDK, creating spans around LLM calls and retrieval steps, recording token counts and latency, and exporting to a backend.
OpenTelemetry Concepts for LLM Applications
OpenTelemetry’s data model has three core primitives: traces (a directed acyclic graph of spans representing one request), metrics (aggregated numerical measurements), and logs (structured event records). For LLM observability, traces are the most valuable — they preserve causal structure and timing in a way that aggregated metrics cannot. A single LLM application request might fan out to 3 retrieval calls, 2 LLM calls, and 1 tool call; a trace shows the dependency graph and the latency of each step. Metrics complement traces for alerting and capacity planning: p99 LLM latency, token throughput, error rate. The OpenLLMetry project (opentelemetry-instrumentation-openai and related packages) provides auto-instrumentation for common LLM SDKs, but understanding the underlying OpenTelemetry API lets you instrument custom components and add domain-specific attributes that auto-instrumentation misses.
Setup and Installation
pip install opentelemetry-sdk opentelemetry-exporter-otlp opentelemetry-instrumentation-fastapi
# For auto-instrumentation of OpenAI and Anthropic clients:
pip install opentelemetry-instrumentation-openai opentelemetry-instrumentation-anthropic
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
# Configure the tracer provider with service metadata
resource = Resource.create({
    "service.name": "my-llm-app",
    "service.version": "1.0.0",
    "deployment.environment": "production",
})
provider = TracerProvider(resource=resource)
# Export to any OTLP-compatible backend (Jaeger, Tempo, Honeycomb, Datadog, etc.)
exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)
The BatchSpanProcessor buffers spans and exports them asynchronously in batches — use this in production to avoid blocking the request thread on export. In development, SimpleSpanProcessor (synchronous) is easier to debug. The OTLP exporter sends spans to any OpenTelemetry-compatible backend: Jaeger and Grafana Tempo for self-hosted, Honeycomb or Datadog for managed. The endpoint and format (gRPC vs HTTP) depend on your backend — check its documentation for the correct OTLP receiver address.
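For development, a minimal sketch of the synchronous alternative, which needs no backend at all and prints each span to stdout as soon as it ends:
# Development-mode setup: spans print to the console synchronously
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))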
Instrumenting LLM Calls
The OpenTelemetry semantic conventions for generative AI (GenAI semconv, currently in experimental status) define standard attribute names for LLM spans. Using these ensures your spans are interpretable by observability tools that understand the GenAI conventions:
import anthropic
from opentelemetry import trace
from opentelemetry.semconv._incubating.attributes import gen_ai_attributes as GenAI
tracer = trace.get_tracer(__name__)
client = anthropic.Anthropic()
def call_llm(prompt: str, system: str | None = None) -> str:
    with tracer.start_as_current_span("llm.call") as span:
        span.set_attribute(GenAI.GEN_AI_SYSTEM, "anthropic")
        span.set_attribute(GenAI.GEN_AI_REQUEST_MODEL, "claude-sonnet-4-20250514")
        span.set_attribute(GenAI.GEN_AI_REQUEST_MAX_TOKENS, 1024)
        span.set_attribute("llm.prompt_length", len(prompt))
        try:
            response = client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=1024,
                system=system or "You are a helpful assistant.",
                messages=[{"role": "user", "content": prompt}],
            )
            # Record usage attributes on the span
            span.set_attribute(GenAI.GEN_AI_USAGE_INPUT_TOKENS, response.usage.input_tokens)
            span.set_attribute(GenAI.GEN_AI_USAGE_OUTPUT_TOKENS, response.usage.output_tokens)
            # The semconv attribute is plural and takes an array of finish reasons
            span.set_attribute(GenAI.GEN_AI_RESPONSE_FINISH_REASONS, [response.stop_reason])
            span.set_status(trace.StatusCode.OK)
            return response.content[0].text
        except Exception as e:
            span.set_status(trace.StatusCode.ERROR, str(e))
            span.record_exception(e)
            raise
Tracing RAG Pipelines
Multi-step pipelines benefit most from tracing because the latency breakdown across retrieval, reranking, and generation is not visible from a single metric. Nest child spans inside a parent span to capture the full pipeline:
def rag_query(question: str) -> str:
    with tracer.start_as_current_span("rag.pipeline") as pipeline_span:
        pipeline_span.set_attribute("rag.question", question)
        # Retrieval span
        with tracer.start_as_current_span("rag.retrieve") as retrieval_span:
            docs = vector_store.similarity_search(question, k=5)
            retrieval_span.set_attribute("rag.num_docs_retrieved", len(docs))
            # Assumes the vector store attaches a relevance score to each document
            retrieval_span.set_attribute("rag.top_score", docs[0].score if docs else 0)
        # Reranking span
        with tracer.start_as_current_span("rag.rerank") as rerank_span:
            reranked = reranker.rerank(question, docs, top_n=3)
            rerank_span.set_attribute("rag.num_docs_reranked", len(reranked))
        # Build context and call LLM
        context = "\n".join(d.page_content for d in reranked)
        with tracer.start_as_current_span("rag.generate") as gen_span:
            gen_span.set_attribute("rag.context_length", len(context))
            answer = call_llm(
                prompt=f"Context:\n{context}\nQuestion: {question}",
                system="Answer based only on the provided context.",
            )
            gen_span.set_attribute("rag.answer_length", len(answer))
        pipeline_span.set_attribute("rag.success", True)
        return answer
What to Record on Spans
Span attributes are queryable in your observability backend — choose them based on what questions you want to answer. Token counts (input and output) let you compute cost per request and identify expensive queries. Retrieval scores and document counts help diagnose retrieval quality issues. Model names and versions enable A/B comparison between model versions in the same trace backend. User IDs or session IDs (appropriately anonymised) let you correlate traces with specific users for debugging. Avoid recording raw prompt and completion text as span attributes in production — they can be large (blowing up storage costs), may contain PII, and are better handled via a dedicated prompt logging system with appropriate access controls. Record lengths and hashes instead, and store full prompt/completion content in a separate system with stricter retention and access policies.
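As a concrete pattern, a small helper along these lines works; the attribute names here are illustrative rather than part of the GenAI conventions:
import hashlib

def record_prompt_metadata(span, prompt: str) -> None:
    # Length and a stable fingerprint are queryable without storing the raw text
    span.set_attribute("llm.prompt_length", len(prompt))
    span.set_attribute("llm.prompt_sha256", hashlib.sha256(prompt.encode("utf-8")).hexdigest())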
Exporting to Different Backends
The OTLP exporter is backend-agnostic — the same instrumentation code works with any OTLP-compatible backend by changing the endpoint. For self-hosted setups, run Grafana Tempo (trace storage) with Grafana (visualisation) and the OpenTelemetry Collector as an intermediary for buffering and routing. For managed setups, Honeycomb has the best out-of-the-box experience for LLM traces with its high-cardinality query engine — querying by token count ranges or model version is much faster than in Jaeger. Datadog and New Relic both support OTLP natively. LangSmith and Langfuse are LLM-specific observability platforms that accept OpenTelemetry traces and provide LLM-aware visualisations (prompt/completion viewers, token cost tracking, evaluation integration) that general-purpose tracing backends lack.
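One practical detail: the OTLP exporter also reads the standard OTEL_EXPORTER_OTLP_* environment variables, so switching backends is usually a configuration change rather than a code change. A sketch, with Honeycomb's endpoint and header shown as an example (check your backend's documentation for the exact values):
# With no constructor arguments, the exporter falls back to environment variables:
#   OTEL_EXPORTER_OTLP_ENDPOINT=https://api.honeycomb.io:443
#   OTEL_EXPORTER_OTLP_HEADERS=x-honeycomb-team=YOUR_API_KEY
exporter = OTLPSpanExporter()
provider.add_span_processor(BatchSpanProcessor(exporter))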
Auto-Instrumentation Tradeoffs
The opentelemetry-instrumentation-openai and opentelemetry-instrumentation-anthropic packages patch the SDK clients automatically, creating spans for every LLM call without any code changes. This is convenient for getting started but has limitations: auto-instrumentation creates spans at the SDK call boundary only, so it doesn’t capture retrieval, tool calls, or any custom logic in your pipeline. The span attributes it records are fixed by the library version and may not include everything you need. For production use, a hybrid approach works best: use auto-instrumentation to cover the LLM call itself (token counts, latency, model metadata), and add manual spans around retrieval, reranking, and business logic. This gives you coverage of the full pipeline with minimal boilerplate for the parts that auto-instrumentation handles well.
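Enabling the auto-instrumentation is a one-time call at startup. A sketch, assuming the OpenLLMetry packages from the install step earlier:
from opentelemetry.instrumentation.anthropic import AnthropicInstrumentor
from opentelemetry.instrumentation.openai import OpenAIInstrumentor

# Patches the SDK clients so every LLM call produces a span automatically
AnthropicInstrumentor().instrument()
OpenAIInstrumentor().instrument()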
Propagating Trace Context Across Service Boundaries
In distributed LLM systems where the application server calls a separate inference service, trace context must be propagated in HTTP headers so that spans on both sides share the same trace ID and appear in the same trace in your observability backend. OpenTelemetry handles this through propagators — the W3C TraceContext propagator is the default and most widely supported. When making an HTTP call to your inference server, inject the current span context into the outgoing headers; on the receiving side, extract it and use it as the parent for the server-side span.
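A minimal sketch of the client side, using httpx against a hypothetical internal inference endpoint:
import httpx
from opentelemetry.propagate import inject

def call_inference_service(payload: dict) -> dict:
    headers: dict = {}
    inject(headers)  # writes the W3C traceparent (and tracestate) headers
    resp = httpx.post("http://inference:8000/generate", json=payload, headers=headers)
    resp.raise_for_status()
    return resp.json()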
If you’re using the OpenAI-compatible API endpoint of vLLM or TGI, these servers don’t natively propagate OpenTelemetry context, so the trace will break at that boundary. The practical solution is to instrument your client-side call to the inference server as a single span covering the full round-trip latency, and treat the inference server as a black box from the tracing perspective. For internal inference services you control, adding a short OpenTelemetry middleware to the FastAPI or Starlette app that extracts the incoming trace context gives you end-to-end traces across the full stack with minimal effort — the opentelemetry-instrumentation-fastapi package handles this automatically.
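On the receiving side, instrumenting the app is a single call; a sketch for FastAPI:
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)  # extracts incoming trace context into a server span per request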
Context propagation also matters for async and background tasks. If you kick off a background job (e.g., chunking and indexing a document after an upload), the trace context from the HTTP request won’t automatically propagate into the background thread or Celery task. You need to explicitly extract the current context at the point of task creation and pass it into the task, then re-attach it at the start of the background execution. OpenTelemetry’s context module provides context.attach() and context.detach() for this pattern. Missing this step means background task spans appear as orphaned root spans in your backend rather than as children of the originating request trace.
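A sketch of that pattern with a thread pool; the span and task names are illustrative:
from concurrent.futures import ThreadPoolExecutor
from opentelemetry import context as otel_context

executor = ThreadPoolExecutor(max_workers=4)

def index_document(doc_id: str, parent_ctx) -> None:
    token = otel_context.attach(parent_ctx)  # re-attach the originating request's context
    try:
        with tracer.start_as_current_span("index.document") as span:
            span.set_attribute("doc.id", doc_id)
            ...  # chunk, embed, and index the document
    finally:
        otel_context.detach(token)

def handle_upload(doc_id: str) -> None:
    # Capture the context while the request span is still active, then pass it explicitly
    ctx = otel_context.get_current()
    executor.submit(index_document, doc_id, ctx)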
Sampling Strategies for Production
Recording every span from every request becomes expensive at high traffic volumes, both in storage and in the small overhead of span creation and export. OpenTelemetry’s sampling API lets you control what fraction of traces are recorded. The Python SDK’s default sampler is ParentBased with an always-on root: it records every new trace, but propagates the sampling decision from the parent span when one exists (useful when an upstream service controls sampling). TraceIdRatioBased samples a fixed fraction of traces, deterministically, based on the trace ID.
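Configuring head-based sampling is a constructor argument on the tracer provider. A minimal sketch that records 10% of new traces while honouring an upstream parent’s decision:
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Replaces the provider construction from the setup section
provider = TracerProvider(
    resource=resource,
    sampler=ParentBased(root=TraceIdRatioBased(0.1)),  # sample 10% of root traces
)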
For LLM applications, a more useful strategy is tail-based sampling: record all traces but only export those that meet certain criteria, such as high latency, errors, or high token cost. Tail-based sampling requires a collector in the middle (the OpenTelemetry Collector) that buffers complete traces and applies the sampling decision after the fact. This is more operationally complex than head-based sampling, but it avoids discarding traces for requests that only later turn out to be interesting (e.g., a request that looks fast at the start but times out later). For most teams at moderate scale, a good starting point is a head-based 10–20% sample rate combined with a Collector policy that always keeps error traces; the error-keeping rule has to live in the Collector because a head-based sampler decides at span start, before the outcome is known. This reduces storage costs meaningfully while preserving full fidelity for the errors you most want to investigate.
Alerts and SLOs from Trace Data
Traces are not just for debugging — they’re the source of truth for your service level objectives. From your trace data you can derive: p50/p95/p99 end-to-end latency per endpoint, per-step latency breakdown (retrieval vs reranking vs LLM generation), error rate by error type, token cost per request and per user, and cache hit rates if you instrument your prompt cache. Most observability backends let you create metric pipelines from trace data using span attribute filters — in Grafana this is done with TraceQL, in Honeycomb with derived columns and SLO alerts, in Datadog with trace analytics. The advantage of deriving metrics from traces rather than instrumenting metrics separately is that when you investigate an alert, you can immediately drill from the aggregated metric into the individual traces that caused the spike, with full context about what the request was doing at each step.
Connecting Traces to Evaluation and Retraining
The most underused application of LLM tracing is closing the feedback loop into your evaluation and retraining pipeline. Every trace captures the full input–output pair for an LLM call, along with latency, token counts, and any downstream signals you instrument (user thumbs up/down, task completion rates, follow-up questions indicating the answer was unclear). This is exactly the data you need for offline evaluation datasets and for identifying failure modes. Rather than building a separate logging system for prompt and completion capture, you can route your traces to a storage backend (S3, BigQuery, Postgres) alongside your observability backend, and query them for examples matching criteria you care about: long-latency requests, high-token-count requests, requests where the user sent a follow-up within 10 seconds suggesting the answer was unhelpful.
Langfuse is particularly well-suited for this use case because it combines OpenTelemetry-compatible trace ingestion with a prompt management interface, an annotation tool for human labelling of traces, and an evaluation runner that lets you run LLM-as-judge evaluations over your production traces. If you instrument your application with OpenTelemetry and export to Langfuse, you get a complete pipeline from production trace collection to human annotation to automated evaluation with minimal additional infrastructure. The key is to instrument your application thoroughly enough that the traces contain the information you need to evaluate quality — at minimum, the full prompt, the full completion, the model version, and any retrieval context used. With that in place, traces become the connective tissue between your production system and your evaluation and improvement workflow.
Where to Start
If you have an existing LLM application with no tracing today, the fastest path to value is: install the OTLP exporter and the auto-instrumentation package for your LLM SDK, point it at a free Langfuse cloud instance, and add one manual parent span around your top-level request handler. That gets you end-to-end latency, token counts, and error rates in under an hour with no backend infrastructure to manage. From there, add child spans for retrieval and tool calls as you identify questions that the initial traces can’t answer. The investment compounds quickly — every span you add makes the next debugging session faster, and the production traces you accumulate become an increasingly valuable dataset for evaluation and improvement. Start simple, instrument incrementally, and let the questions your traces can’t yet answer drive what you instrument next.