Ground truth labels are expensive, slow to produce, and often impossible to obtain at scale. For most production LLM applications — customer support bots, document summarizers, code assistants, RAG pipelines — you can’t label every output, and you can’t always define what a “correct” answer even looks like. Evaluating quality without ground truth is therefore not an edge case but the standard operating condition for LLM evaluation in production. This guide covers the practical methods for doing it well: reference-free metrics, model-based evaluation, behavioral testing, and consistency-based approaches.
Why Ground Truth Is Often Unavailable
Ground truth labels require knowing the ideal output for a given input. For closed-form tasks — extracting a specific field from a document, classifying sentiment into fixed categories, answering questions with factual answers — ideal outputs exist and can be labeled. For open-ended tasks — summarization, explanation, creative writing, multi-step reasoning, conversational response — there are many valid outputs and no single ground truth. Labeling one as correct implicitly excludes valid alternatives and introduces annotator bias.
Even for tasks where ground truth exists in principle, obtaining it at the scale of production traffic is impractical. A production LLM application processing 100,000 requests per day cannot have human labels for all of them. Evaluation needs to operate at production scale, which means automated methods that don’t require human annotation on every example. The goal is not to replace human judgment entirely but to automate the bulk of evaluation and focus human review on the cases where it matters most.
LLM-as-Judge
Using a stronger LLM to evaluate outputs from a weaker one (or even from the same model) is currently the most widely used reference-free evaluation method. The evaluator model receives the input, the output to evaluate, and a detailed rubric, and returns a score with a justification. With a well-designed rubric and a capable evaluator (GPT-4o or Claude Opus work well), LLM-as-judge achieves reasonable correlation with human ratings on many tasks.
The key to making LLM-as-judge reliable is rubric design. Vague criteria (“is this a good response?”) produce inconsistent and biased scores. Specific, decomposed criteria produce more reliable results. A good rubric for a customer support response might score separately on: does the response address the customer’s question (yes/no); is the information accurate (yes/no/unclear); is the tone appropriate (1–3 scale); does it include unnecessary information (yes/no); and is it the right length (too short/appropriate/too long). Decomposed criteria are easier to score consistently, and the separate scores give diagnostic information about which dimensions are failing.
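As a concrete sketch, a decomposed rubric can be packed into a single JSON-returning judge prompt. The `call_judge` function below is a hypothetical stub standing in for whatever chat API you use; everything else is plain parsing:

```python
import json

RUBRIC = """Score the support response on each criterion. Reply with JSON only:
{"addresses_question": "yes|no", "accurate": "yes|no|unclear",
 "tone": 1-3, "unnecessary_info": "yes|no",
 "length": "too_short|appropriate|too_long"}"""

def build_judge_prompt(question: str, response: str) -> str:
    # One judge call returns one score per criterion, not a single overall grade.
    return f"{RUBRIC}\n\nCustomer question:\n{question}\n\nResponse to evaluate:\n{response}"

# Stub standing in for a real LLM API call (hypothetical; swap in your client).
def call_judge(prompt: str) -> str:
    return ('{"addresses_question": "yes", "accurate": "yes", "tone": 3, '
            '"unnecessary_info": "no", "length": "appropriate"}')

scores = json.loads(call_judge(build_judge_prompt(
    "Where is my order?", "Your order shipped yesterday and arrives Friday.")))
print(scores["tone"], scores["length"])  # each dimension is separately diagnosable
```

In production, parse failures and out-of-range values should be retried or logged; malformed judge output is itself a useful signal.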
LLM-as-judge has well-documented biases. It favors longer responses over shorter ones even when the shorter response is more appropriate. It favors responses that use confident, authoritative language. It can be sycophantic toward responses that sound similar to the evaluator’s own style. It may miss factual errors in domains where the evaluator has limited knowledge. None of these biases are dealbreakers, but they mean LLM-as-judge scores need calibration against human ratings on a sample before being used for deployment decisions. A judge that agrees with humans 75% of the time on a specific task is useful; one that has a systematic bias toward verbose responses may rank models incorrectly even with high average agreement.
G-Eval and Structured Evaluation Frameworks
G-Eval, introduced by Liu et al. in 2023, improves on naive LLM-as-judge by using chain-of-thought prompting and token probability weighting to produce more calibrated scores. Instead of asking the model to output a score directly, G-Eval asks it to reason step-by-step about the quality criteria and then score, using the probability distribution over score tokens as the final rating rather than the argmax. This reduces the quantization noise of discrete scores and produces scores that are better calibrated and more discriminating between similar-quality outputs.
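The probability-weighting step can be sketched in a few lines. It assumes your API exposes log-probabilities for the score tokens (not all providers do); the logprob values below are illustrative:

```python
import math

def expected_score(score_logprobs: dict) -> float:
    """Probability-weighted rating over discrete score tokens (G-Eval style)."""
    probs = {s: math.exp(lp) for s, lp in score_logprobs.items()}
    total = sum(probs.values())  # renormalise over just the score tokens
    return sum(int(s) * p for s, p in probs.items()) / total

# Hypothetical top-token logprobs for "3", "4", "5" from a single judge call.
s = expected_score({"3": -1.2, "4": -0.4, "5": -2.3})
print(round(s, 2))  # between 3 and 4, rather than a flat argmax of 4
```

The weighted score varies continuously, so two outputs that would both receive an argmax of 4 can still be ranked against each other.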
Frameworks like DeepEval, Ragas (for RAG-specific evaluation), and PromptFlow implement structured evaluation pipelines with pre-built metrics based on LLM-as-judge patterns. These are useful starting points that save implementation time, but their default prompts and rubrics are generic. For production use, always customize the evaluation criteria to your specific task rather than using generic “coherence” or “relevance” metrics that may not capture what actually matters for your application.
Consistency-Based Evaluation
Consistency methods evaluate quality by checking whether a model’s outputs are internally consistent or consistent with known facts, without requiring a reference answer. The core idea: a high-quality response should be self-consistent (doesn’t contradict itself), consistent with the input (doesn’t contradict information provided in the prompt), and consistent across paraphrased versions of the same question.
Self-consistency evaluation generates multiple outputs for the same input (with non-zero temperature) and measures agreement. For factual questions, high variance across generations signals uncertainty — the model doesn’t have a reliable answer. For classification or structured tasks, the majority-vote answer from multiple generations is more reliable than any single generation, and the agreement rate serves as a proxy for confidence. Low self-consistency on a specific input category is a useful signal for identifying where your model is uncertain, which is directly actionable: these are the inputs that most benefit from retrieval augmentation, human review, or explicit uncertainty signaling to the user.
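The agreement computation itself is simple. A minimal sketch (the sampled answers here are canned; in practice they come from N generations at temperature > 0):

```python
from collections import Counter

def self_consistency(answers: list) -> tuple:
    """Majority answer and agreement rate across N sampled generations."""
    counts = Counter(a.strip().lower() for a in answers)
    answer, n = counts.most_common(1)[0]
    return answer, n / len(answers)

# Five generations for the same factual question.
samples = ["Paris", "Paris", "paris", "Lyon", "Paris"]
answer, agreement = self_consistency(samples)
print(answer, agreement)  # paris 0.8 — route low-agreement inputs to review
```

For free-form text the exact-match normalisation above is too strict; swapping in an embedding-similarity or entailment-based match is a common refinement.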
Faithfulness evaluation checks whether a generated response is consistent with the source documents provided in the prompt. This is especially important for RAG applications where hallucination — generating claims not supported by the retrieved context — is the primary failure mode. A faithfulness evaluator takes the retrieved context and the generated response and checks whether every claim in the response can be grounded in the context. Ragas implements this as a decompose-then-verify pipeline: extract individual claims from the response, then verify each claim against the context independently. The faithfulness score is the fraction of claims that are supported.
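The decompose-then-verify loop can be sketched as below. The claim list and the verifier are toy stand-ins for the two LLM calls a real pipeline (such as Ragas) would make, so the substring check is illustrative only:

```python
def faithfulness_score(claims: list, verify) -> float:
    """Fraction of extracted claims supported by the context."""
    if not claims:
        return 1.0
    return sum(verify(c) for c in claims) / len(claims)

context = "The warranty covers parts for 12 months. Labour is not included."

def verify(claim: str) -> bool:
    # Toy verifier: substring check standing in for an LLM entailment call.
    return claim.lower() in context.lower()

claims = ["the warranty covers parts for 12 months",
          "labour is not included",
          "shipping is free"]
score = faithfulness_score(claims, verify)
print(round(score, 2))  # 2 of 3 claims grounded -> 0.67
```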
Behavioral Testing
Behavioral testing evaluates model capabilities through carefully designed input perturbations rather than by scoring individual outputs. The CheckList framework (Ribeiro et al.) defines three test types: minimum functionality tests that check basic capabilities (can the model correctly handle negation?), invariance tests that check outputs are stable under semantically neutral perturbations (changing a name shouldn’t change the answer), and directional tests that check outputs change in the expected direction under meaningful perturbations (adding more negative context should decrease sentiment score).
For LLM applications, behavioral tests are most useful for catching systematic failure modes. Examples: does your summarizer always include the date mentioned in the document? Does your code assistant produce syntactically valid Python? Does your customer support bot correctly identify when a query is out of scope? These tests have binary pass/fail outcomes, don’t require reference answers, and can be generated programmatically at scale. A suite of 500 behavioral tests covering your application’s core capabilities provides more actionable signal than an equivalent number of quality-scored free-form responses, because failures are directly interpretable.
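A few such tests can be written directly. The model stub and inputs below are illustrative; a real suite would call your deployed application, but the pass/fail structure is the same:

```python
import ast

def is_valid_python(code: str) -> bool:
    """Minimum-functionality test: code-assistant output must parse."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def invariance_test(model, original: str, perturbed: str) -> bool:
    """Invariance test: a semantically neutral edit must not change the answer."""
    return model(original) == model(perturbed)

# Toy scope classifier standing in for a real support bot (hypothetical).
model = lambda q: "out_of_scope" if "weather" in q else "in_scope"

assert is_valid_python("def f(x):\n    return x + 1")
assert not is_valid_python("def f(x) return x")
assert invariance_test(model, "What's the weather for Alice?",
                              "What's the weather for Bob?")
print("all behavioral tests passed")
```

Because each test is a plain assertion, the suite slots into any CI runner and failures point at a specific capability rather than a vague quality drop.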
Human Evaluation at Scale
Reference-free automated methods reduce the need for human evaluation but don’t eliminate it. Human evaluation remains necessary for calibrating automated evaluators, for assessing quality dimensions that are difficult to capture in rubrics (naturalness, brand voice, appropriateness of tone in sensitive situations), and for making final deployment decisions on major model updates. The goal is to make human evaluation as efficient as possible by focusing it where automated methods are least reliable.
Side-by-side comparisons (A/B rating between two model outputs for the same input) are more reliable than absolute quality scores. Humans are better at relative judgments than absolute ones, and side-by-side formats reduce inter-annotator variance. For deployment decisions between two model versions, a sample of 200–300 side-by-side comparisons with majority-vote ratings provides statistically meaningful signal. This is a manageable annotation load that can be completed in a day with a small annotation team, making it practical as a pre-deployment gate for major model changes.
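With paired preferences in hand, a plain two-sided sign test (a standard choice, not tied to any particular framework) tells you whether the observed win rate is distinguishable from chance. Ties are dropped before the test; the counts below are illustrative:

```python
from math import comb

def sign_test_p(wins_a: int, wins_b: int) -> float:
    """Two-sided sign test p-value for the null that both models
    are preferred equally often (ties already excluded)."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # P(X <= k) under Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# 180 of 250 non-tied comparisons prefer the candidate model.
p = sign_test_p(180, 70)
print(p < 0.01)  # True: the preference is far from chance
```

At 200–300 comparisons, even a modest but real preference (say 60/40) clears conventional significance thresholds, which is why that sample size works as a deployment gate.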
Putting It Together
A practical reference-free evaluation stack for a production LLM application layers these methods by cost and reliability. Behavioral tests run on every deployment candidate as a fast automated gate — binary pass/fail, runs in minutes, catches regressions in specific capabilities. LLM-as-judge scores run on a sample of production traffic daily — automated, scalable, calibrated against human ratings. Self-consistency monitoring runs continuously in production — no labeling required, surfaces inputs where the model is uncertain in real time. Human side-by-side evaluation runs before major model releases — high quality, manageable cost, authoritative for deployment decisions.
The most common mistake is treating any single method as sufficient. LLM-as-judge misses systematic biases; behavioral tests miss quality degradation in uncovered scenarios; consistency monitoring misses problems that are consistent but wrong. The combination of automated methods at different levels of the evaluation stack, with human evaluation as the final arbiter, gives both the scale to monitor production continuously and the reliability to make deployment decisions with confidence.
Calibrating Your Automated Evaluators
Any automated evaluation method — LLM-as-judge, consistency scoring, or reference-free metrics — needs periodic calibration against human ratings to remain reliable. Calibration means measuring the correlation between your automated scores and human judgments on a held-out sample, and adjusting your scoring rubric, prompt, or thresholds based on what you find. Without calibration, automated scores can drift systematically from what actually matters for your application, and you won’t notice until model quality degrades visibly in production.
A practical calibration workflow: every month or after any significant change to your evaluation pipeline, sample 100–200 examples from production, score them with your automated evaluator, then have human annotators rate the same examples using a clear rubric. Compute Spearman correlation between automated and human scores. Correlation above 0.7 indicates the automated evaluator is reliably tracking human quality judgments. Correlation below 0.5 signals a systematic mismatch that needs investigation — usually either the rubric is misaligned with what annotators care about, or the evaluator model has a bias that’s pulling scores in a consistent direction.
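Spearman correlation is just the Pearson correlation of the ranks, so it needs no dependencies (`scipy.stats.spearmanr` computes the same quantity). A dependency-free sketch with average-rank tie handling, on illustrative scores:

```python
def _ranks(xs: list) -> list:
    # Average ranks so that ties are handled correctly.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i..j, 1-indexed
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(auto_scores: list, human_scores: list) -> float:
    """Spearman correlation = Pearson correlation of the ranks."""
    ra, rh = _ranks(auto_scores), _ranks(human_scores)
    n = len(ra)
    ma, mh = sum(ra) / n, sum(rh) / n
    cov = sum((a - ma) * (h - mh) for a, h in zip(ra, rh))
    va = sum((a - ma) ** 2 for a in ra) ** 0.5
    vh = sum((h - mh) ** 2 for h in rh) ** 0.5
    return cov / (va * vh)

# Automated scores vs. 1-5 human ratings on the same four examples.
r = spearman([0.9, 0.4, 0.7, 0.2], [5, 2, 4, 1])
print(r)  # 1.0: identical rankings
```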
Pay particular attention to calibration on failure cases. Automated evaluators tend to be better at distinguishing good responses from each other than at reliably flagging bad ones. Compute precision and recall separately for the low-score bucket — how often does a low automated score correspond to a genuinely bad response (precision), and what fraction of genuinely bad responses does the automated evaluator catch (recall)? If recall is low, your automated evaluator is missing real quality problems, and you need either a better rubric or more targeted behavioral tests for the failure modes it’s missing.
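The low-score-bucket check is a few lines once both score sets exist. The threshold and score scale below are illustrative; `human_bad` holds the human verdicts on the same sample:

```python
def low_bucket_precision_recall(auto_scores: list, human_bad: list,
                                threshold: float = 0.5) -> tuple:
    """Precision and recall of the automated evaluator at flagging bad responses."""
    flagged = [s < threshold for s in auto_scores]
    tp = sum(f and b for f, b in zip(flagged, human_bad))  # flagged AND truly bad
    precision = tp / sum(flagged) if any(flagged) else 0.0
    recall = tp / sum(human_bad) if any(human_bad) else 1.0
    return precision, recall

auto = [0.2, 0.9, 0.4, 0.8, 0.3]                  # automated scores
bad  = [True, False, False, False, True]          # human judgments
precision, recall = low_bucket_precision_recall(auto, bad)
print(precision, recall)  # flags 3 responses, catches both truly bad ones
```

Here precision is 2/3 (one false alarm) while recall is 1.0; the opposite pattern, high precision with low recall, is the one that hides real quality problems.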
Evaluating RAG Systems Specifically
RAG (retrieval-augmented generation) pipelines have two distinct evaluation targets — retrieval quality and generation quality — and conflating them produces misleading results. A response can be well-written and grounded in the retrieved context while still being wrong if the retrieval step returned irrelevant documents. Evaluating only the final response quality hides retrieval failures behind generation quality.
Evaluate retrieval and generation separately. For retrieval: does the retrieved context contain information relevant to the question (context relevance)? For generation: is the response grounded in the retrieved context (faithfulness), and does it fully address the question given what was retrieved (answer relevance)? Ragas implements all three metrics using LLM-based evaluation and provides a standardized pipeline for RAG evaluation that separates these dimensions. Running Ragas scores on a sample of production RAG calls weekly gives early warning of retrieval quality degradation (often caused by index staleness or embedding model drift) before it becomes visible as response quality problems.
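Once the three scores exist for a call (each from its own LLM-based check), a small triage routine separates retrieval failures from generation failures. The metric names and threshold below are illustrative, not a fixed API:

```python
def diagnose(scores: dict, threshold: float = 0.7) -> str:
    """Triage one scored RAG call into a failure category."""
    # Check retrieval first: a response grounded in irrelevant context is still wrong.
    if scores["context_relevance"] < threshold:
        return "retrieval_failure"
    if scores["faithfulness"] < threshold:
        return "hallucination"        # generation not grounded in the context
    if scores["answer_relevance"] < threshold:
        return "incomplete_answer"    # grounded, but doesn't address the question
    return "ok"

label = diagnose({"context_relevance": 0.3,
                  "faithfulness": 0.9,
                  "answer_relevance": 0.9})
print(label)  # retrieval_failure: well-grounded response, wrong documents
```

Aggregating these labels over a weekly sample turns “response quality dropped” into “retrieval failures doubled,” which points directly at the index or embedding model rather than the prompt.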