Retrieval-Augmented Generation (RAG) is the dominant pattern for grounding LLM outputs in external knowledge — but the gap between a working RAG prototype and a production RAG system that reliably retrieves the right context is large. Most RAG failures are retrieval failures: the LLM gives a wrong or hallucinated answer not because the generation is bad, but because the retrieval returned the wrong documents, or missed the right ones, or retrieved the right documents in the wrong order. This guide covers the retrieval failures that matter most and the specific techniques that fix them.
Naive RAG and Where It Breaks
The naive RAG setup embeds a query, retrieves the top-k chunks by cosine similarity, concatenates them into the context window, and generates. This works well on benchmark datasets where queries are clean and documents are well-structured, but breaks in predictable ways in production. The most common failure modes are: semantic mismatch between how users phrase queries and how relevant documents are written; chunk boundary problems where a relevant passage is split across chunks and neither chunk alone is sufficient; exact keyword misses where a precise technical term or proper noun isn’t captured by semantic similarity; and top-k retrieval that returns semantically similar but factually irrelevant documents because similarity doesn’t equal relevance for all query types.
Hybrid Search: Dense + Sparse Retrieval
The most impactful single improvement to naive RAG is adding sparse (BM25) retrieval alongside dense (embedding) retrieval and fusing the results. Dense retrieval excels at semantic matching — finding documents that express the same concept in different words. Sparse retrieval excels at exact keyword matching — finding documents that contain specific technical terms, product names, version numbers, or proper nouns. These failure modes are complementary: queries that fail dense retrieval often succeed with sparse, and vice versa.
Reciprocal Rank Fusion (RRF) is the standard algorithm for combining dense and sparse retrieval results. RRF converts each retrieval method’s ranked list into scores using the formula 1/(k + rank), where rank is the document’s 1-based position in that list and k is a constant (typically 60), then sums the scores for each document across all retrieval methods and re-ranks by the combined score. RRF is effectively parameter-free (the default k=60 works well across diverse datasets), doesn’t require calibrating scores between different retrieval systems, and consistently outperforms weighted linear combinations of raw scores.
def reciprocal_rank_fusion(results_list: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """
    results_list: list of ranked doc ID lists from each retrieval method
    Returns: merged list of (doc_id, score) sorted by descending score
    """
    scores = {}
    for results in results_list:
        for rank, doc_id in enumerate(results):
            # enumerate is 0-based, so rank + 1 recovers the 1-based rank in 1/(k + rank)
            scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
Elasticsearch and OpenSearch support both BM25 and dense vector search natively, making hybrid retrieval straightforward. Qdrant, Weaviate, and Pinecone also support hybrid search with built-in fusion. For systems already using a vector database without BM25 support, running a separate BM25 index in parallel (using rank_bm25 in Python or a dedicated Elasticsearch instance) and fusing with RRF is a standard pattern. The improvement in recall from hybrid search over dense-only is typically 5–15 percentage points on mixed query types.
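A minimal sketch of that parallel-BM25 pattern, assuming the reciprocal_rank_fusion function above and the rank_bm25 package; dense_retrieve is a hypothetical stand-in for whatever vector-database query you already run:

from rank_bm25 import BM25Okapi

# Toy corpus keyed by doc ID; in practice this comes from your document store.
corpus = {
    "doc1": "reset your API key from the account settings page",
    "doc2": "rotate credentials by issuing a new token via the CLI",
    "doc3": "quarterly revenue grew 12 percent year over year",
}
doc_ids = list(corpus.keys())
bm25 = BM25Okapi([text.split() for text in corpus.values()])

def sparse_retrieve(query: str, n: int = 50) -> list[str]:
    # BM25 scores every document; keep the top-n doc IDs in rank order.
    scores = bm25.get_scores(query.split())
    ranked = sorted(zip(doc_ids, scores), key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:n]]

def dense_retrieve(query: str, n: int = 50) -> list[str]:
    # Hypothetical stand-in: replace with your vector database's top-n query.
    return ["doc2", "doc1"][:n]

query = "how do I reset my API key"
fused = reciprocal_rank_fusion([sparse_retrieve(query), dense_retrieve(query)])
# fused is the merged (doc_id, score) list; doc1 benefits from appearing in both lists.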
Reranking
Retrieval returns candidates; reranking selects the best ones. A cross-encoder reranker takes a (query, document) pair as input and outputs a relevance score — unlike bi-encoders (embedding models) that encode query and document independently and compare in embedding space, cross-encoders attend to both simultaneously, capturing fine-grained query-document interactions at the cost of higher inference latency. Rerankers can’t be used for initial retrieval over large corpora (too slow to score every document), but applied to the top 20–50 candidates from retrieval, they significantly improve precision by re-ordering based on true relevance rather than embedding similarity.
Cohere Rerank, BGE-reranker, and Jina Reranker are among the most widely used cross-encoder rerankers; BGE-reranker-v2-m3 is one of the strongest open-source options on public retrieval benchmarks. Apply reranking after hybrid retrieval: retrieve top-50 with hybrid search, rerank to top-5 for the context window. The combination of hybrid retrieval + reranking typically improves NDCG@5 by 10–20 percentage points over dense-only retrieval without reranking — often the difference between a RAG system that frustrates users and one that reliably answers correctly.
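A sketch of the retrieve-then-rerank step, assuming the sentence-transformers CrossEncoder wrapper and the BGE reranker checkpoint named above; the top-k values mirror the ones in this section:

from sentence_transformers import CrossEncoder

# A cross-encoder scores each (query, document) pair jointly; slower than a
# bi-encoder, so apply it only to the candidates retrieval has already produced.
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    pairs = [(query, doc) for doc in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

# candidates: the top-50 documents from hybrid retrieval
# context_docs = rerank(query, candidates, top_k=5)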
Query Transformation
Many retrieval failures are query failures — the user’s query is ambiguous, uses different terminology than the corpus, or is too short to express the information need precisely. Query transformation uses the LLM to reformulate queries before retrieval rather than sending the raw user query directly.
HyDE (Hypothetical Document Embeddings) generates a hypothetical answer to the query, embeds the hypothetical answer instead of the query, and retrieves documents similar to the hypothetical answer. The intuition: a hypothetical answer uses the same vocabulary, structure, and terminology as the actual documents, so it produces a better retrieval signal than the original query. HyDE is particularly effective for technical domains where user queries (“how do I fix X”) are phrased very differently from the documentation (“to resolve X, configure Y”). The cost is one additional LLM call per query for the hypothetical generation.
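A minimal HyDE sketch, assuming the OpenAI Python client; the model name and prompt wording are illustrative, and embed_and_retrieve stands in for the dense retrieval call you already have:

from openai import OpenAI

client = OpenAI()

def hyde_query(query: str) -> str:
    # Generate a short hypothetical answer; its vocabulary should resemble
    # the corpus more closely than the raw user query does.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model works here
        messages=[{
            "role": "user",
            "content": f"Write a short passage that answers the question: {query}",
        }],
    )
    return response.choices[0].message.content

# Embed the hypothetical answer instead of the query, then retrieve as usual:
# hypothetical = hyde_query("how do I fix a stale read replica?")
# docs = embed_and_retrieve(hypothetical)   # stand-in for your dense retriever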
Multi-query retrieval generates N variants of the original query, retrieves for each, and unions the results before reranking. This improves recall for ambiguous queries that could be interpreted multiple ways. LangChain’s MultiQueryRetriever and LlamaIndex’s equivalent implement this pattern. Use 3–5 query variants; more than 5 rarely adds recall and significantly increases retrieval latency and cost.
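A comparable multi-query sketch, reusing the client and dense_retrieve stand-in above; the prompt, variant count, and union step are assumptions rather than the LangChain or LlamaIndex implementations:

def multi_query_retrieve(query: str, n_variants: int = 3, per_query_k: int = 20) -> list[str]:
    # Ask the LLM for paraphrases that might match differently worded documents.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption
        messages=[{
            "role": "user",
            "content": f"Rewrite this search query {n_variants} different ways, "
                       f"one per line, without commentary: {query}",
        }],
    )
    lines = response.choices[0].message.content.splitlines()
    variants = [query] + [line.strip() for line in lines if line.strip()]

    # Union the candidates across variants, preserving first-seen order,
    # then hand the pooled candidates to the reranker for final ordering.
    seen: dict[str, None] = {}
    for variant in variants:
        for doc_id in dense_retrieve(variant, n=per_query_k):  # stand-in retriever
            seen.setdefault(doc_id, None)
    return list(seen)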
Chunking Strategy
Fixed-size chunking (splitting documents into 512-token chunks with 50-token overlap) is the default but suboptimal for most document types. Semantic chunking splits at natural boundaries — paragraph breaks, section headers, sentence boundaries — producing chunks that contain complete thoughts rather than arbitrary token windows. Small-to-big retrieval embeds small chunks (sentences or short paragraphs) for precise retrieval but returns the surrounding larger chunk (or full section) as context, combining the precision of small-chunk matching with the coherence of larger context. For long documents with multiple distinct topics per section, hierarchical indexing (index summaries of sections, retrieve by summary, then fetch the full section) reduces noise from returning irrelevant paragraphs within long sections.
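A sketch of the small-to-big mapping, with deliberately naive sentence splitting: embed sentence-level records for matching, but keep a pointer from each to its parent section so the larger unit is what gets returned:

import re

def build_small_to_big_index(sections: dict[str, str]) -> list[dict]:
    """Return one record per sentence, each carrying its parent section ID."""
    records = []
    for section_id, text in sections.items():
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if sentence.strip():
                records.append({
                    "embed_text": sentence.strip(),   # what gets embedded and matched
                    "parent_id": section_id,          # what gets returned as context
                })
    return records

# At query time: match against embed_text, then fetch the full parent section
# (deduplicated by parent_id) as the context actually passed to the LLM.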
Retrieval Evaluation
RAG systems are frequently deployed without rigorous retrieval evaluation, which makes it impossible to distinguish retrieval failures from generation failures when the system gives wrong answers. Build a retrieval evaluation set of 100–300 query-document pairs (where you know which documents are relevant for each query) and compute Recall@5 and NDCG@5 before and after each retrieval improvement. Retrieval evaluation is fast and cheap — it doesn’t require running the LLM at all — and it tells you exactly which retrieval changes actually improve the system versus which ones look promising in theory but don’t move the needle in practice. Without this measurement, RAG optimization is guesswork.
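Both metrics need only the ranked doc IDs and the known relevant set, so the evaluation loop is a few lines; this sketch assumes binary relevance labels:

import math

def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    # Binary relevance: gain is 1 if the doc is relevant, 0 otherwise.
    dcg = sum(1.0 / math.log2(i + 2) for i, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0

# eval_set: list of (query, relevant_doc_ids); average both metrics across it
# before and after each retrieval change to see whether the change actually helped.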
Document Preprocessing for Better Retrieval
Retrieval quality is partly determined before the retrieval system ever runs — by how well documents are cleaned and preprocessed before indexing. PDF extraction is a common source of retrieval failures: PDFs with multi-column layouts, tables, headers and footers, or scanned text produce noisy extracted text that embeds poorly. Use a PDF extraction library that handles layout (pdfplumber for tables, unstructured.io for mixed-content PDFs) rather than naive text extraction. For documents with tables, consider indexing the table data separately as structured key-value text rather than as raw table cells — “Column A: Q1 Revenue, Value: 4.2M” embeds more reliably than raw TSV.
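One way to implement the table-to-text idea, assuming the table has already been extracted (for example by pdfplumber) as a header row plus data rows:

def flatten_table(header: list[str], rows: list[list[str]]) -> list[str]:
    """Turn each table row into a self-describing sentence that embeds well."""
    texts = []
    for row in rows:
        pairs = [f"{col}: {val}" for col, val in zip(header, row)]
        texts.append(", ".join(pairs))
    return texts

# Example: header = ["Metric", "Q1", "Q2"], row = ["Revenue", "4.2M", "4.8M"]
# -> "Metric: Revenue, Q1: 4.2M, Q2: 4.8M", indexed as its own chunk.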
Metadata filtering is a simple but high-impact addition to retrieval. If your corpus has natural partitions — by date, department, document type, product, or customer — store these as metadata and filter on them during retrieval. A query about “Q4 2024 earnings” should retrieve only documents from Q4 2024, not semantically similar documents from other periods. Pre-filtering reduces the effective corpus size before semantic search, improving both precision and retrieval speed. Most vector databases support metadata filtering natively with negligible overhead. Adding even 2–3 metadata filters typically improves NDCG@5 by 5–10 points on corpora with natural partitions.
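As one concrete sketch, Qdrant-style pre-filtering; the collection name, payload fields, and query_embedding are placeholders:

from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

qdrant = QdrantClient(url="http://localhost:6333")  # placeholder endpoint

# Restrict the semantic search to Q4 2024 earnings documents before scoring,
# so similar documents from other periods never enter the candidate set.
hits = qdrant.search(
    collection_name="docs",               # placeholder collection
    query_vector=query_embedding,         # embedding of the user query, computed elsewhere
    query_filter=Filter(must=[
        FieldCondition(key="fiscal_quarter", match=MatchValue(value="2024-Q4")),
        FieldCondition(key="doc_type", match=MatchValue(value="earnings")),
    ]),
    limit=50,
)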
Context Window Management
After retrieval and reranking, the selected chunks must fit in the context window along with the system prompt, user query, and generation buffer. A common failure mode is retrieving the right documents but then truncating them to fit context length limits, cutting off the relevant passage. Manage context window budget explicitly: know your total token budget, reserve tokens for system prompt and generation, and use the remainder for retrieved context. For long retrieved passages, extract the most relevant sentences (using a small summarization model or keyword extraction) rather than truncating at a fixed position. Placing the most relevant chunk first and last in the context window (the primacy and recency positions where LLMs attend most strongly) rather than in random or score order is a simple trick that measurably improves generation quality for multi-document contexts.
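A small helper for the primacy/recency trick: take reranked chunks (best first) and interleave them so the strongest chunks end up at the start and end of the assembled context:

def order_for_context(chunks_best_first: list[str]) -> list[str]:
    """Place the highest-scoring chunks at the primacy and recency positions."""
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        # Alternate: ranks 1, 3, 5, ... go to the front; ranks 2, 4, 6, ... to the back.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# Example: ["A", "B", "C", "D", "E"] (A strongest) -> ["A", "C", "E", "D", "B"],
# so A opens the context and B closes it, the two positions LLMs attend to most.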
Failure Mode Analysis
The most productive way to improve a RAG system is systematic failure mode analysis: collect 50–100 examples where the system gave wrong or incomplete answers, manually trace each failure to its root cause (retrieval miss, wrong chunk returned, context truncation, generation error), and address the most common root cause first. In most production RAG systems, retrieval accounts for 60–70% of failures, generation for 20–30%, and preprocessing/chunking for the remainder. This distribution argues strongly for investing in retrieval quality (hybrid search, reranking, query transformation) before optimizing generation. Failure mode analysis also reveals which specific document types or query categories have the highest failure rates, enabling targeted improvements rather than broad changes that may not move the metrics that matter.
Longitudinal tracking of failure modes is as important as the initial analysis. As you fix the most common failure modes, new failure modes become relatively more prominent. Track failure mode distribution over time — if retrieval failures drop from 65% to 40% after adding reranking, that’s a success, but the remaining 40% of retrieval failures may now be a harder category (complex multi-hop queries, cross-document synthesis requirements) that requires different techniques. Keeping a running log of failure mode categories and their prevalence guides the optimization roadmap for a RAG system from initial deployment through maturity.