Evaluating LLM outputs is easy to do badly. The most common approach — a handful of handwritten test cases and a vibe check — gives false confidence and catches regressions only after they’ve reached users. Building an eval dataset from production logs is harder upfront but produces evaluations that reflect what your model actually encounters in the real world. This guide covers how to do it systematically: what to collect, how to sample, how to label, and how to turn a raw log dump into an eval suite you can run on every deployment.
Why Production Logs Are the Right Source
Handwritten eval sets have two fundamental problems. First, they reflect what you imagined users would ask, not what they actually ask. Real user inputs are messier, more varied, and hit edge cases you wouldn’t think to write. Second, they go stale — a benchmark written for your v1 model tells you less and less as your product evolves. Production logs solve both problems: they capture real distribution drift as your user base grows, and surface the failure modes that actually affect users rather than the ones you anticipated.
The goal is not to log everything — it’s to build a stratified sample that covers your real input distribution, including the long tail. A dataset of 500 carefully sampled and labeled production examples is more valuable for catching regressions than 5,000 handwritten ones that all look similar.
What to Log
At minimum, log the full prompt (including system prompt and any injected context), the model’s raw output, and a timestamp. For structured applications, also log task type if available from your routing layer, model version and parameters (temperature, max tokens), latency, and any post-processing applied before output reached the user.
Log prompt and completion together as a unit — you need both to evaluate quality. For RAG applications, log which chunks were retrieved and their scores; retrieval quality and generation quality are separate failure modes. For multi-turn conversations, log the full conversation history as context — single-turn evaluation of responses that depend on prior turns produces misleading results.
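For concreteness, here is a minimal sketch of a per-request log record in Python; the field names and the JSONL sink are illustrative assumptions, not a required schema.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional
import json

@dataclass
class LLMLogRecord:
    """One prompt/completion pair logged as a single unit."""
    request_id: str
    timestamp: float
    system_prompt: str
    messages: list                     # full conversation history, including prior turns
    completion: str                    # raw model output, before post-processing
    model: str
    temperature: float
    max_tokens: int
    latency_ms: float
    task_type: Optional[str] = None    # from the routing layer, if available
    retrieved_chunks: list = field(default_factory=list)  # RAG: chunk ids + scores
    post_processing: Optional[str] = None  # what was applied before the user saw it

def log_record(record: LLMLogRecord, path: str = "llm_logs.jsonl") -> None:
    # Append-only JSONL keeps logging simple and easy to sample from later.
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")
```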
Handle PII deliberately. Production logs contain sensitive user data. Establish a logging policy before you start: what to redact automatically (email addresses, phone numbers, names via NER), what to store with access controls, and how long to retain. Many teams skip this and then can’t use their logs for eval because the data is too sensitive to share with annotators.
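A sketch of the rule-based part of such a policy; the regexes are illustrative and will miss some formats, and name redaction needs an NER pass (spaCy or similar) on top of this.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text: str) -> str:
    """Replace obvious PII patterns with placeholder tokens before storage."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    # Names require an NER model rather than regex; not shown here.
    return text
```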
Sampling Strategy
Don’t sample uniformly at random. A random sample mirrors your traffic distribution, so modal inputs dominate and the tail barely appears. You want stratified sampling that guarantees coverage across the full distribution, including rare input types.
Start by clustering your logged inputs. Embed them with a sentence embedding model (BGE, E5, or OpenAI embeddings), cluster with k-means or HDBSCAN, and inspect centroids to understand what each cluster represents. Sample proportionally from large clusters and oversample from small ones — the small clusters contain unusual inputs and edge cases where your model is most likely to fail.
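A sketch of the cluster-then-sample step, assuming sentence-transformers and scikit-learn are available; the embedding model, cluster count, and per-cluster floor are illustrative choices.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def stratified_sample(inputs: list[str], n_clusters: int = 15,
                      per_cluster_min: int = 10, total: int = 400,
                      seed: int = 0) -> list[int]:
    """Cluster logged inputs and sample with a floor per cluster, so small
    clusters (edge cases) are oversampled relative to their traffic share."""
    model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # any sentence embedder works
    embeddings = model.encode(inputs, normalize_embeddings=True)
    labels = KMeans(n_clusters=n_clusters, random_state=seed,
                    n_init="auto").fit_predict(embeddings)

    rng = np.random.default_rng(seed)
    selected = []
    for c in range(n_clusters):
        idx = np.flatnonzero(labels == c)
        # Proportional share of the remaining budget, but never below the floor.
        share = int((total - n_clusters * per_cluster_min) * len(idx) / len(inputs))
        k = min(len(idx), per_cluster_min + share)
        selected.extend(rng.choice(idx, size=k, replace=False).tolist())
    return selected
```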
Also sample on failure signals. If your application has any quality signal — user thumbs down, session abandonment, clarification requests, downstream parsing errors — sample more heavily from examples that triggered these. These are your actual failure cases and exactly what your eval set should cover.
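A hypothetical weighted-sampling helper for this, assuming each logged record carries a boolean `failure_signal` field aggregated from those quality signals.

```python
import numpy as np

def sample_with_failure_signals(records: list[dict], n: int,
                                failure_boost: float = 4.0, seed: int = 0) -> list[dict]:
    """Weighted sampling without replacement that over-weights examples
    carrying any failure signal."""
    weights = np.array([failure_boost if r.get("failure_signal") else 1.0
                        for r in records])
    probs = weights / weights.sum()
    idx = np.random.default_rng(seed).choice(len(records), size=n,
                                             replace=False, p=probs)
    return [records[i] for i in idx]
```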
A practical target for an initial eval set: 200–500 examples covering 10–15 distinct input categories, with deliberate oversampling of edge cases. This is enough for meaningful regression detection without an overwhelming labeling burden.
Labeling Approaches
Human labeling is the gold standard. Give annotators a clear rubric — not “is this a good response?” but specific criteria: Is the response factually accurate? Does it address all parts of the question? Is it the right length? Does it follow the required output format? A rubric with 4–6 binary or 3-point-scale criteria is easier to apply consistently than a single quality score and gives diagnostic information about which dimensions are failing. Budget 5–15 minutes per example for careful annotation.
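One way to keep such a rubric consistent across annotation tooling and reporting is to define the criteria as data; the criteria below are purely illustrative.

```python
# Illustrative rubric: a handful of binary criteria rather than one quality score.
RUBRIC = [
    {"id": "factually_accurate", "question": "Is every factual claim in the response correct?"},
    {"id": "complete",           "question": "Does the response address all parts of the question?"},
    {"id": "appropriate_length", "question": "Is the response neither padded nor truncated?"},
    {"id": "format_compliant",   "question": "Does the response follow the required output format?"},
]

def score_annotation(labels: dict[str, bool]) -> float:
    """Per-example score is the fraction of rubric criteria passed."""
    return sum(labels[c["id"]] for c in RUBRIC) / len(RUBRIC)
```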
LLM-as-judge uses a stronger model (GPT-4o, Claude Opus, or similar) to score outputs automatically. Provide the judge a detailed rubric in its system prompt, the input, and the output to evaluate, and ask for a structured score with justification. LLM-as-judge is fast and cheap but has known biases — it favors longer hedged responses, can miss domain-specific factual errors, and may be sycophantic toward confident-sounding outputs. Use it for first-pass scoring or low-stakes dimensions, and spot-check with human annotation to calibrate reliability.
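A sketch of a judge call using the OpenAI Python client; the judge model, rubric wording, and JSON schema are assumptions to adapt to your task.

```python
import json
from openai import OpenAI

JUDGE_SYSTEM_PROMPT = """You are evaluating a model response against a rubric.
For each criterion, answer pass or fail and give a one-sentence justification.
Respond in JSON: {"scores": {"<criterion>": {"pass": bool, "why": str}, ...}}"""

def judge(client: OpenAI, user_input: str, model_output: str,
          criteria: list[str], judge_model: str = "gpt-4o") -> dict:
    """First-pass automatic scoring with a stronger model; spot-check with humans."""
    user_msg = (
        f"Criteria: {criteria}\n\n"
        f"User input:\n{user_input}\n\n"
        f"Model output to evaluate:\n{model_output}"
    )
    resp = client.chat.completions.create(
        model=judge_model,
        temperature=0,
        response_format={"type": "json_object"},  # force parseable output
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {"role": "user", "content": user_msg},
        ],
    )
    return json.loads(resp.choices[0].message.content)
```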
Reference-based evaluation compares outputs against a reference answer using ROUGE, BERTScore, or exact match. This works well for tasks with deterministic correct answers (structured extraction, classification, factual recall) and poorly for open-ended generation. Don’t apply reference-based metrics to tasks where multiple valid responses exist — a factually correct summary using different words than the reference will score low on ROUGE despite being a good output.
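A short sketch of both styles, assuming the rouge_score package for ROUGE-L.

```python
from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def exact_match(prediction: str, reference: str) -> bool:
    """Suitable for structured extraction, classification, and factual recall."""
    return prediction.strip().lower() == reference.strip().lower()

def rouge_l(prediction: str, reference: str) -> float:
    """Lexical overlap; penalizes correct answers phrased differently than the reference."""
    return _scorer.score(reference, prediction)["rougeL"].fmeasure
```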
Building the Eval Pipeline
An eval dataset is only useful if you run it regularly. Build a pipeline that runs on every deployment candidate: pull the eval examples, run inference, score the outputs, and compare against the baseline. Track the overall pass rate and per-category scores. A drop in the overall score flags a regression; a drop in a specific category tells you where to look.
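A skeleton of that pipeline; `generate` and `score` stand in for whatever inference and scoring functions your application already has.

```python
from collections import defaultdict
from typing import Callable

def run_eval(examples: list[dict], generate: Callable[[str], str],
             score: Callable[[dict, str], bool]) -> dict:
    """Run every eval example through the candidate model and report pass rates
    overall and per category."""
    per_category = defaultdict(list)
    for ex in examples:
        output = generate(ex["prompt"])
        per_category[ex.get("category", "uncategorized")].append(score(ex, output))

    total = sum(len(v) for v in per_category.values())
    return {
        "overall_pass_rate": sum(sum(v) for v in per_category.values()) / total,
        "per_category": {c: sum(v) / len(v) for c, v in per_category.items()},
    }

def compare_to_baseline(report: dict, baseline: dict, margin: float = 0.03) -> list[str]:
    """Flag overall or per-category drops larger than the allowed margin."""
    regressions = []
    if report["overall_pass_rate"] < baseline["overall_pass_rate"] - margin:
        regressions.append("overall")
    for cat, rate in report["per_category"].items():
        if rate < baseline["per_category"].get(cat, 0.0) - margin:
            regressions.append(cat)
    return regressions
```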
Version your eval set alongside your model. Add new examples as your product evolves, remove examples that are no longer representative, and keep a changelog. Eval sets that grow without pruning become stale and slow. Aim for a set comprehensive enough to catch regressions but small enough to run in under 10 minutes.
Slice-Based Evaluation
Aggregate metrics hide problems. A model with 85% overall quality might perform at 95% on common queries and 55% on edge cases representing your highest-value users. Always break down eval results by input category, user segment, or whatever slices are meaningful for your application. Set per-slice quality thresholds — a model passing the overall bar while failing a critical slice should not go to production.
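A sketch of per-slice threshold checks; the slice names and threshold values are illustrative.

```python
# Per-slice thresholds: passing the overall bar is not enough.
SLICE_THRESHOLDS = {
    "common_queries": 0.90,
    "edge_cases": 0.75,
    "enterprise_accounts": 0.85,  # a high-value segment gets its own bar
}

def failing_slices(per_slice_pass_rates: dict[str, float],
                   thresholds: dict[str, float] = SLICE_THRESHOLDS) -> list[str]:
    """Return the slices below their threshold; any hit blocks the release."""
    return [s for s, floor in thresholds.items()
            if per_slice_pass_rates.get(s, 0.0) < floor]
```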
The worst-performing slice is almost never the one you’d predict upfront. Instrument your eval reporting to surface it automatically, and treat unexpectedly low slice scores as a signal to add more examples from that category. Production log evals and slice-based reporting form a feedback loop: more logs reveal more failure modes, which inform better evals, which catch more regressions before they ship.
Continuous Evaluation vs Periodic Benchmarking
There are two modes of running evals: continuous evaluation on every deployment candidate, and periodic deep-dive benchmarking around major model updates. They serve different purposes, and both are necessary.
Continuous evaluation catches regressions fast. A lightweight eval set (100–200 examples) that runs in 2–3 minutes as part of your CI/CD pipeline means you know within minutes of a code or model change whether quality degraded. The tradeoff is that a small continuous eval set has limited statistical power — it reliably catches large regressions but may miss subtle quality shifts. Treat the continuous eval as a trip wire, not a comprehensive quality report.
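A minimal trip-wire gate, assuming the eval step writes JSON reports under the (hypothetical) file names below; a non-zero exit fails the CI job.

```python
# ci_eval_gate.py -- run in CI after the lightweight eval completes.
import json
import sys

candidate = json.load(open("eval_report_candidate.json"))
baseline = json.load(open("eval_report_baseline.json"))

# A small eval set only reliably detects large drops, so the margin is generous.
MARGIN = 0.05
drop = baseline["overall_pass_rate"] - candidate["overall_pass_rate"]
if drop > MARGIN:
    print(f"Eval regression: pass rate dropped by {drop:.1%}")
    sys.exit(1)  # non-zero exit fails the pipeline
print("Eval gate passed")
```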
Periodic deep-dive evaluations run your full eval suite (500+ examples, multiple labeling methods, human review of failure cases) before major model releases or when you suspect a systematic quality issue. These take hours, not minutes, and require human attention to interpret. Schedule them before any significant model update goes to production and after any incident that suggests a quality problem in a specific input category.
The combination — fast continuous evals for regression detection, periodic deep evals for comprehensive quality assessment — covers both the speed and depth requirements without forcing a false choice between them. Start with the continuous eval set first, since it has the highest leverage per unit of effort, and build toward the deeper eval process as your team and product mature.
Automating Eval Dataset Refresh
A good eval dataset decays. As your product evolves, old examples become less representative and new failure modes emerge that aren’t covered. Build a lightweight refresh process into your team’s workflow: every sprint or every month, pull the latest production logs, run your clustering pipeline, identify any clusters that have grown significantly or any new clusters that didn’t exist before, and add 20–30 new examples from those areas. Label them with your fastest labeling method (LLM-as-judge for speed, human spot-check for calibration) and add them to the running eval set. Prune examples from clusters that are no longer representative of current traffic.
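A sketch of the cluster-growth check, assuming the new log pull is assigned labels by the same clustering model used for the original sample.

```python
from collections import Counter

def clusters_needing_examples(old_labels: list[int], new_labels: list[int],
                              growth_factor: float = 1.5) -> list[int]:
    """Return clusters whose share of traffic grew noticeably between two
    log pulls, or that appeared for the first time."""
    old_counts = Counter(old_labels)
    new_counts = Counter(new_labels)
    flagged = []
    for cluster, count in new_counts.items():
        old_frac = old_counts.get(cluster, 0) / max(len(old_labels), 1)
        new_frac = count / len(new_labels)
        if old_frac == 0 or new_frac > growth_factor * old_frac:
            flagged.append(cluster)
    return flagged
```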
This incremental refresh process keeps your eval set current without requiring a periodic large-scale effort. Teams that treat eval dataset maintenance as a one-time project almost always end up with stale evals within 6 months. Teams that build the refresh into their standard workflow maintain evaluation coverage that genuinely reflects what their model sees in production.
The most underrated investment in LLM evaluation is simply being disciplined about logging from day one. Teams that start logging thoughtfully — capturing full prompts, completions, and quality signals from launch — can build a meaningful eval dataset within weeks of going live. Teams that don’t start logging until they have a quality problem spend weeks backfilling infrastructure before they can even begin building evals. Log early, log comprehensively, and the eval dataset becomes a natural byproduct of a well-instrumented production system.
Handling Label Disagreement
When multiple human annotators label the same example, they will disagree — sometimes frequently. Disagreement is information, not noise to be discarded. High disagreement on a specific example usually signals genuine ambiguity: the input is underspecified, the desired output depends on unstated context, or reasonable people have different intuitions about quality. These high-disagreement examples are often the most diagnostically valuable items in your eval set, because they highlight where your model’s success criteria are themselves unclear.
Track inter-annotator agreement (Cohen’s kappa or Krippendorff’s alpha) across your annotation batches. Agreement below 0.6 kappa on a specific eval dimension suggests the rubric for that dimension is ambiguous and needs clarification. Agreement below 0.4 across the board suggests fundamental disagreement about what quality means for your task — a problem to fix before investing more in labeling. When you refine the rubric in response to low agreement, re-label a sample of previously annotated examples to verify the refinement improved consistency.
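Computing per-dimension agreement for a pair of annotators takes a few lines with scikit-learn; Krippendorff’s alpha (for more than two annotators or missing labels) needs a separate package.

```python
from sklearn.metrics import cohen_kappa_score

def per_dimension_kappa(annotator_a: dict[str, list[int]],
                        annotator_b: dict[str, list[int]]) -> dict[str, float]:
    """Cohen's kappa per rubric dimension; each value is a list of labels
    aligned over the same examples. Values below ~0.6 flag an ambiguous rubric."""
    return {dim: cohen_kappa_score(annotator_a[dim], annotator_b[dim])
            for dim in annotator_a}
```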
For examples with persistent high disagreement even under a clear rubric, you have three options: exclude them from the eval set (if the task is genuinely ambiguous and you can’t resolve it), use the majority label with a disagreement flag (keeping them in the eval but tracking them separately), or use the disagreement itself as a signal by flagging model outputs that match the minority label for human review. The third approach is underused — disagreement items are often exactly the cases where model quality is hardest to assess automatically and most valuable to review manually.
Eval Dataset Size and Statistical Power
A common mistake is treating eval dataset construction as a one-time effort with a fixed target size, rather than as an ongoing process sized to the decisions the eval needs to support. The required dataset size depends on what changes you need to reliably detect. To detect a 5 percentage point quality change at 80% power with a 5% significance level, you need roughly 300–400 examples. To detect a 2 percentage point change, you need over 2,000. For slice-level detection where the slice is 10% of your traffic, multiply the required size by 10.
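These figures follow from the standard normal-approximation sample-size formula for a proportion; a small sketch, assuming a one-sample comparison against a known baseline pass rate.

```python
from scipy.stats import norm

def required_n(baseline: float, detectable_change: float,
               alpha: float = 0.05, power: float = 0.80) -> int:
    """Sample size needed to detect a shift of `detectable_change` in a pass
    rate around `baseline` (one-sample test, normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    n = (z_alpha + z_beta) ** 2 * baseline * (1 - baseline) / detectable_change ** 2
    return int(n) + 1

# Roughly 400 examples to detect a 5-point change around an 85% pass rate,
# and well over 2,000 for a 2-point change.
print(required_n(0.85, 0.05), required_n(0.85, 0.02))
```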
These calculations have a practical implication: if you’re making deployment decisions based on eval results, your eval set needs to be large enough that the differences you care about are statistically distinguishable from noise. A 200-example eval that shows 84% vs 82% quality between two models doesn’t tell you which is better; the difference is within the margin of error. Teams that deploy based on small-eval results and then find no quality difference in production have usually been looking at evaluation noise rather than a genuine model improvement.
Report confidence intervals on eval metrics, not just point estimates. A simple 95% Wilson confidence interval on a proportion (the standard for binary quality metrics) is easy to compute and makes the uncertainty explicit. If two models’ confidence intervals overlap substantially, the eval is not discriminating between them and you need more examples, not just a point estimate comparison. Building CI reporting into your eval pipeline from the start avoids the common pitfall of treating small eval differences as meaningful signals.
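A sketch using statsmodels, which implements the Wilson interval directly; the counts in the example are illustrative.

```python
from statsmodels.stats.proportion import proportion_confint

def pass_rate_with_ci(passes: int, total: int, alpha: float = 0.05):
    """Point estimate plus a Wilson confidence interval for a binary quality metric."""
    low, high = proportion_confint(passes, total, alpha=alpha, method="wilson")
    return passes / total, (low, high)

# Example: 168/200 passes gives 84% with roughly a [78%, 88%] interval,
# wide enough that an 82% result from another model is not distinguishable.
print(pass_rate_with_ci(168, 200))
```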