Most RAG pipelines fail not because the language model is bad but because the retrieval step is. The model can only work with what it is given: if the retrieved chunks do not contain the information needed to answer the question, the model will hallucinate or produce vague, hedged non-answers regardless of how capable it is. Evaluating a RAG pipeline therefore requires decomposing the system into its two components and measuring each separately: retrieval quality (did the right chunks come back?) and generation quality (did the model produce a faithful, relevant answer from those chunks?). Conflating these into a single end-to-end accuracy metric makes it impossible to diagnose failures: you cannot tell whether to fix the retriever or the prompt.
Retrieval Metrics
Retrieval evaluation measures how well the vector search step returns relevant context. The two core metrics are context precision and context recall, both available in the RAGAS framework. Context precision measures the fraction of retrieved chunks that are actually relevant to the query — a high-precision retriever returns mostly useful chunks with little noise. Context recall measures the fraction of relevant information (as identified in a reference answer) that appears somewhere in the retrieved chunks — a high-recall retriever does not miss important facts. Both matter: a retriever with high recall but low precision overwhelms the context window with noise that degrades generation quality; a retriever with high precision but low recall misses key facts the model needs.
from ragas import evaluate
from ragas.metrics import context_precision, context_recall, faithfulness, answer_relevancy
from datasets import Dataset

# RAGAS expects a dataset with these columns:
# question, answer, contexts (list of retrieved chunks), ground_truth
eval_data = {
    "question": ["What is the refund policy?", "How do I reset my password?"],
    "answer": [
        "You can get a refund within 30 days.",
        "Click 'Forgot Password' on the login page.",
    ],
    "contexts": [
        ["Our refund policy allows returns within 30 days of purchase for a full refund."],
        ["To reset your password, visit the login page and click the 'Forgot Password' link."],
    ],
    "ground_truth": [
        "Customers can get a full refund within 30 days of purchase.",
        "Users can reset their password by clicking 'Forgot Password' on the login page.",
    ],
}

dataset = Dataset.from_dict(eval_data)
results = evaluate(
    dataset,
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)
print(results)
# {'context_precision': 0.92, 'context_recall': 0.88, 'faithfulness': 0.95, 'answer_relevancy': 0.91}
Retrieval metrics require ground truth — you need to know what the correct answer is and which chunks are relevant. Building this ground truth dataset is the main upfront cost of rigorous RAG evaluation. A practical approach is to generate synthetic question-answer pairs from your document corpus using an LLM: prompt the model with each document chunk and ask it to generate 3–5 questions that can be answered from that chunk alone, along with the ground truth answer. This gives you a labelled eval set without manual annotation, though you should review a sample manually to verify quality before relying on it for metric computation.
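A minimal sketch of that synthetic-generation step, assuming an OpenAI-style chat API that returns the requested JSON. The prompt wording and the helper names `build_generation_prompt` and `parse_qa_pairs` are illustrative, not part of RAGAS:

```python
import json

QUESTION_GEN_PROMPT = """You are building an evaluation set for a RAG system.
Given the document chunk below, write {n} questions that can be answered
from this chunk alone, each with its ground-truth answer.
Return JSON: [{{"question": "...", "ground_truth": "..."}}, ...]

Chunk:
{chunk}"""

def build_generation_prompt(chunk, n=3):
    """Fill the synthetic-QA prompt template for one corpus chunk."""
    return QUESTION_GEN_PROMPT.format(n=n, chunk=chunk)

def parse_qa_pairs(raw_response, source_chunk):
    """Parse the model's JSON reply and attach the source chunk as context."""
    pairs = json.loads(raw_response)
    return [
        {"question": p["question"], "ground_truth": p["ground_truth"],
         "contexts": [source_chunk]}
        for p in pairs
    ]
```

Because each generated pair carries its source chunk, the same records double as relevance labels for retrieval metrics: the chunk the question was generated from is, by construction, a relevant chunk.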
Generation Metrics
Generation evaluation measures the quality of the model’s response given the retrieved context. The two most important metrics are faithfulness and answer relevancy. Faithfulness measures whether every claim in the model’s answer is supported by the retrieved context — a faithfulness score below 0.8 typically indicates the model is hallucinating facts not present in the retrieved chunks. Answer relevancy measures whether the answer actually addresses the question asked — low answer relevancy usually means the model is answering a different question than was posed, which often happens when the retrieved chunks are off-topic and the model tries to be helpful anyway.
RAGAS computes both metrics using an LLM judge: faithfulness decomposes the answer into individual claims and checks each against the context; answer relevancy generates candidate questions from the answer and measures their semantic similarity to the original question. This LLM-based evaluation is more expensive than embedding-based metrics but substantially more accurate for detecting subtle hallucinations and relevance failures. For high-volume production evaluation where cost matters, run LLM-based metrics on a random sample (100–500 examples) and use cheaper embedding similarity metrics for full-population monitoring.
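For the cheap full-population tier, one sketch is cosine similarity over precomputed embeddings, plus a fixed-seed sample for the expensive LLM-judged tier. Function names are illustrative; any embedding model can supply the vectors:

```python
import numpy as np

def embedding_relevancy(question_emb, answer_emb):
    """Cosine similarity between question and answer embeddings: a cheap
    proxy for LLM-judged answer relevancy, usable on every request."""
    q = np.asarray(question_emb, dtype=float)
    a = np.asarray(answer_emb, dtype=float)
    return float(q @ a / (np.linalg.norm(q) * np.linalg.norm(a)))

def sample_for_llm_eval(records, n=200, seed=0):
    """Fixed-seed random sample for the LLM-judged metrics, so nightly
    runs stay comparable across days."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(records), size=min(n, len(records)), replace=False)
    return [records[int(i)] for i in idx]
```

The proxy is coarse (a fluent but wrong answer can still embed near the question), so treat it as a drift detector and leave verdicts to the sampled LLM-judged run.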
Diagnosing Retrieval Failures
When context recall is low, the retriever is missing relevant information. The most common causes are chunking strategy, embedding model mismatch, and index configuration. Chunking strategy determines what unit of text the retriever can return — chunks that are too small lose context that spans multiple sentences; chunks that are too large dilute the relevant signal with unrelated content. A useful diagnostic is to compute context recall at multiple chunk sizes (256, 512, 1024 tokens) on your eval set and plot the tradeoff: most corpora have a sweet spot where recall peaks before noise dilutes precision. Embedding model mismatch occurs when the embedding model was not trained on text similar to your domain — a general-purpose embedding model performs poorly on specialised technical or legal corpora. Evaluate domain-specific alternatives (domain-tuned BERT variants, or a fine-tuned embedding model using the approach described in the embedding fine-tuning article) before assuming the retrieval architecture is the problem.
import numpy as np

def evaluate_retrieval_at_k(queries, relevant_docs, retrieved_docs_per_query, k=5):
    """Compute Precision@K and Recall@K for a retrieval system."""
    precisions, recalls = [], []
    for query, relevant, retrieved in zip(queries, relevant_docs, retrieved_docs_per_query):
        top_k = retrieved[:k]
        relevant_set = set(relevant)
        retrieved_set = set(top_k)
        hits = len(relevant_set & retrieved_set)
        precisions.append(hits / k)
        recalls.append(hits / len(relevant_set) if relevant_set else 0.0)
    return {
        f'precision@{k}': np.mean(precisions),
        f'recall@{k}': np.mean(recalls),
    }

# Also compute Mean Reciprocal Rank (MRR) — useful for single-answer queries
def mean_reciprocal_rank(queries, relevant_docs, retrieved_docs_per_query):
    rrs = []
    for relevant, retrieved in zip(relevant_docs, retrieved_docs_per_query):
        relevant_set = set(relevant)
        rr = 0.0  # stays 0 when no relevant doc is retrieved at all
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant_set:
                rr = 1.0 / rank
                break
        rrs.append(rr)
    return np.mean(rrs)
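The chunk-size sweep described earlier can be sketched as follows. It assumes you supply a `retrieve_fn(query, chunk_size)` that queries an index built at that chunk size, plus per-size relevance labels (labels change with chunking, since the chunk ids differ); both names are hypothetical:

```python
import numpy as np

def recall_at_k(relevant, retrieved, k=5):
    """Fraction of relevant chunk ids that appear in the top-k results."""
    rel = set(relevant)
    return len(rel & set(retrieved[:k])) / len(rel) if rel else 0.0

def sweep_chunk_sizes(queries, relevant_by_size, retrieve_fn,
                      sizes=(256, 512, 1024), k=5):
    """Mean recall@k per candidate chunk size.

    relevant_by_size[size][query] lists the relevant chunk ids at that
    size; retrieve_fn(query, size) returns a ranked list of chunk ids
    from an index built at that size. Both are caller-supplied.
    """
    results = {}
    for size in sizes:
        recalls = [recall_at_k(relevant_by_size[size][q], retrieve_fn(q, size), k)
                   for q in queries]
        results[size] = float(np.mean(recalls))
    return results
```

Plotting the resulting dict (size vs. mean recall) is usually enough to spot the sweet spot where recall peaks.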
When precision is low (too many irrelevant chunks returned), the most effective fixes are re-ranking and metadata filtering. Re-ranking runs a cross-encoder model over the top-N retrieved chunks and reorders them by relevance to the query — cross-encoders are more accurate than bi-encoders for relevance scoring because they attend to both the query and the document jointly, but are too slow to run over the full corpus. A standard pipeline retrieves top-20 with the fast bi-encoder, then re-ranks to select the top-5 with a cross-encoder like ms-marco-MiniLM-L-6-v2. Metadata filtering restricts retrieval to a relevant subset of documents before similarity search — filtering by document type, date range, or category can dramatically improve precision when queries are naturally scoped to a subset of the corpus.
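A sketch of the retrieve-then-rerank step. The scoring function is injected so the logic stays model-agnostic; in practice it would be something like `CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2').predict` from sentence-transformers:

```python
def rerank(query, candidates, score_fn, top_k=5):
    """Re-rank bi-encoder candidates with a more accurate pairwise scorer.

    score_fn takes a list of (query, doc) pairs and returns one relevance
    score per pair, e.g. a sentence-transformers CrossEncoder's .predict.
    """
    scores = score_fn([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```

With the standard pipeline from the text, `candidates` is the bi-encoder's top-20 and `top_k=5` is what reaches the generation prompt.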
Diagnosing Generation Failures
When faithfulness is low, the model is generating claims not supported by the retrieved context. The primary cause is that the model is relying on parametric knowledge (what it learned during training) rather than the provided context. This happens most often when the retrieved context is partial — it contains some relevant information but not enough to fully answer the question, so the model fills the gap with what it knows. The fix is a combination of improved retrieval (higher recall to give the model more complete context) and prompt engineering: explicitly instructing the model to answer only from the provided context and to say it does not know if the context is insufficient. Adding a faithfulness self-check — prompting the model to verify each claim in its answer against the context before returning the response — reduces faithfulness failures at the cost of additional latency.
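One way to sketch these prompt-level mitigations. The exact wording below is illustrative, not a tested prompt, and the OpenAI-style message format is an assumption:

```python
GROUNDED_SYSTEM_PROMPT = (
    "Answer using ONLY the context below. If the context does not contain "
    "enough information to answer, reply exactly: \"I don't know based on "
    "the provided documents.\" Do not use outside knowledge.\n\n"
    "Context:\n{context}"
)

SELF_CHECK_PROMPT = (
    "Below is a draft answer and the context it was based on. List every "
    "factual claim in the draft, mark each SUPPORTED or UNSUPPORTED by the "
    "context, and rewrite the answer with all UNSUPPORTED claims removed.\n\n"
    "Context:\n{context}\n\nDraft answer:\n{draft}"
)

def build_messages(question, context):
    """Assemble a grounded chat request (OpenAI-style message list)."""
    return [
        {"role": "system", "content": GROUNDED_SYSTEM_PROMPT.format(context=context)},
        {"role": "user", "content": question},
    ]
```

The self-check is a second model call that takes the draft answer and the same context, which is where the extra latency mentioned above comes from.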
When answer relevancy is low, the model is answering a different question than was asked. This is almost always a retrieval problem: the retrieved chunks are off-topic, so the model produces a response that is faithful to the (irrelevant) context but does not address the actual query. Check context precision — if it is low, fix the retriever first. If context precision is high but answer relevancy is still low, the model may be ignoring the query and producing a summary of the context instead; strengthen the answering instruction in the system prompt to require the model to directly address the specific question asked.
Building a Production RAG Eval Pipeline
A complete RAG eval pipeline runs automatically on every deployment and alerts when key metrics degrade. The minimum viable setup: a fixed eval dataset of 200–500 question-answer pairs covering the main query types for your application; a nightly batch job that runs the full RAG pipeline on the eval set and computes context precision, context recall, faithfulness, and answer relevancy; a dashboard or alerting rule that flags regressions above a threshold (typically a drop of more than 0.05 on any metric). Track metrics over time rather than just at a single snapshot — gradual degradation as your document corpus changes is often invisible in point-in-time measurements but visible as a trend. When the corpus is updated, rerun the full eval immediately rather than waiting for the nightly job, since document changes are the most common cause of sudden retrieval quality drops.
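The regression check itself is only a few lines; this sketch assumes the nightly job produces metrics as plain dicts keyed by metric name:

```python
ALERT_THRESHOLD = 0.05  # flag any metric that drops more than this vs. baseline

def detect_regressions(baseline, current, threshold=ALERT_THRESHOLD):
    """Compare tonight's metrics to the stored baseline; return alert strings."""
    alerts = []
    for metric, base in baseline.items():
        cur = current.get(metric, 0.0)
        drop = base - cur
        if drop > threshold:
            alerts.append(f"{metric}: {base:.2f} -> {cur:.2f} (drop {drop:.2f})")
    return alerts
```

Persist each night's metrics alongside the alerts so the trend over time, not just the latest delta, is available when debugging a gradual degradation.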
For production systems with high query volume, supplement the offline eval set with online metrics computed from actual user interactions: thumbs up/down signals, follow-up question rate (a high rate of “can you clarify” follow-ups often indicates low answer quality), and session abandonment after a response. These online signals do not replace offline evaluation — they are noisier and harder to interpret — but they catch real-world failure modes that synthetic eval sets miss, particularly distribution shift between the eval set queries and the actual queries users ask in production.
Common Eval Mistakes and How to Avoid Them
The most costly RAG eval mistake is evaluating end-to-end accuracy only. A single accuracy score on a held-out Q&A set tells you whether the system is getting better or worse overall, but gives you no information about which component to fix. Teams that optimise on end-to-end accuracy alone often end up tuning the generation prompt when the real problem is retrieval, or swapping embedding models when the real problem is chunking. Always measure retrieval and generation quality separately with independent metrics before drawing conclusions about where to invest improvement effort.
The second most common mistake is using an eval set that does not represent the actual query distribution. RAG systems are typically built by developers who understand the document corpus well — the synthetic questions they generate tend to be well-formed, unambiguous, and directly answerable from single chunks. Real user queries are often fragmentary, ambiguous, misspelled, and require synthesising information from multiple chunks. An eval set built entirely from synthetic questions will overestimate real-world performance. Supplement your synthetic eval set with a sample of actual user queries as soon as your system is in production, even if that means manually labelling relevance for 50–100 real queries. The distribution mismatch between synthetic and real queries is almost always larger than expected.
The third mistake is treating faithfulness as a binary pass/fail rather than a graded metric. A response that makes five claims and supports four of them from context is better than one that supports none — and the unsupported claim in the first case may be trivial (a generic transitional phrase) while all five failures in the second case may be substantive hallucinations. Use RAGAS faithfulness scores as a continuous metric and track the distribution of scores across your eval set rather than just the mean. A mean faithfulness of 0.85 looks acceptable, but if 15 percent of responses score below 0.5, those are the cases most likely to produce harmful outputs and most in need of targeted fixes.
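A sketch of that distribution summary; the 0.5 tail cutoff follows the discussion above, and the extra `p10` percentile is a suggested addition worth tracking:

```python
import numpy as np

def faithfulness_summary(scores, tail_cutoff=0.5):
    """Summarise the faithfulness distribution: the mean hides a bad tail,
    so also report a low percentile and the fraction below the cutoff."""
    scores = np.asarray(scores, dtype=float)
    return {
        "mean": float(scores.mean()),
        "p10": float(np.percentile(scores, 10)),
        "frac_below_cutoff": float((scores < tail_cutoff).mean()),
    }
```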
Finally, do not neglect latency as part of RAG evaluation. A retrieval pipeline that achieves excellent quality metrics but takes four seconds end-to-end is often not viable for interactive use cases. Measure and log retrieval latency, reranker latency, and generation latency separately so that you can identify the bottleneck when SLAs are not met. The quality-latency tradeoff is a design decision — running a cross-encoder reranker over top-50 results improves precision but adds 200–400ms — and that tradeoff should be made explicitly with measured data, not assumed.
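Per-stage timing can be as simple as a context manager around each pipeline step; the stage names and the wrapped calls in the usage comment are illustrative:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(stage, timings):
    """Record wall-clock seconds for one pipeline stage into `timings`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

# Usage, with hypothetical pipeline calls:
# timings = {}
# with timed("retrieval", timings):
#     chunks = retriever.search(query)
# with timed("rerank", timings):
#     chunks = rerank_step(query, chunks)
# with timed("generation", timings):
#     answer = generate(query, chunks)
```

Logging the resulting dict per request makes the bottleneck obvious when an SLA is missed.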
The decision framework for most RAG production issues: if context recall is below 0.80, fix chunking or try a domain-tuned embedding model; if context precision is below 0.75, add re-ranking or metadata filtering; if faithfulness is below 0.85, improve retrieval coverage first then tighten the generation prompt; if answer relevancy is below 0.80 with good context precision, the generation prompt needs to be more directive about answering the specific question asked.
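That framework can be sketched as a small lookup. The thresholds come directly from the text; the metric-dict shape is an assumption:

```python
def diagnose(metrics):
    """Map eval metrics to the next fix, using the thresholds above."""
    recs = []
    if metrics["context_recall"] < 0.80:
        recs.append("fix chunking or try a domain-tuned embedding model")
    if metrics["context_precision"] < 0.75:
        recs.append("add re-ranking or metadata filtering")
    if metrics["faithfulness"] < 0.85:
        recs.append("improve retrieval coverage, then tighten the generation prompt")
    if metrics["answer_relevancy"] < 0.80 and metrics["context_precision"] >= 0.75:
        recs.append("make the generation prompt more directive about the question")
    return recs
```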