Most RAG pipelines retrieve the top-k chunks by vector similarity and pass them directly to the language model. This works reasonably well for clean, on-topic queries but fails in a predictable way: embedding similarity captures semantic relatedness, not relevance. A chunk about “model evaluation” is semantically close to a query about “how to evaluate a RAG pipeline” even if it covers offline batch evaluation for classification, which has nothing to do with the RAG-specific metrics the user actually needs. The result is retrieved chunks that are topically adjacent but not actually useful — the language model gets diluted context and produces vaguer answers, or hallucinates to fill the gap.
Cross-encoder reranking fixes this by running a second-stage relevance model that scores each candidate chunk against the full query jointly, rather than comparing independent embeddings. The quality improvement is significant and consistent: in most production RAG systems, adding a cross-encoder reranker on top of vector retrieval improves answer correctness by 15–30% on benchmarks without any changes to the embedding model, vector database, or language model. It is one of the highest-leverage improvements available to a RAG pipeline after baseline retrieval is working.
Bi-Encoders vs Cross-Encoders
Understanding why cross-encoders outperform bi-encoders for reranking requires understanding the architectural difference. A bi-encoder (the standard embedding model used for retrieval) encodes the query and each document independently into fixed-length vectors, then computes similarity via dot product or cosine distance. This is fast — you encode each document once at indexing time and retrieve by approximate nearest-neighbour search — but the query and document never interact during encoding. Subtle relevance signals that depend on how the query and document relate to each other are invisible to a bi-encoder.
A cross-encoder takes the query and document concatenated as a single input and produces a relevance score, allowing full attention between every query token and every document token. This captures fine-grained relevance signals that bi-encoders miss entirely — whether a specific clause in the document directly answers the question, whether the document’s stance on a topic matches the query’s implied intent, whether key terms appear in meaningful proximity. The cost is that cross-encoders cannot be pre-computed: you must run a forward pass for every query-document pair at query time, which is why they are used only for reranking a small candidate set (top 20–50 retrieved chunks) rather than for initial retrieval over millions of documents.
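The precompute-versus-per-pair asymmetry is easy to make concrete with a back-of-envelope count of query-time forward passes (corpus and pool sizes here are illustrative assumptions):

```python
# Illustrative query-time scoring cost for a 1M-document corpus.
CORPUS_SIZE = 1_000_000
RERANK_POOL = 30  # candidates handed to the cross-encoder

# Bi-encoder: documents were embedded offline, so query time costs one
# forward pass (encoding the query) plus an approximate nearest-neighbour
# lookup over the pre-built index.
bi_encoder_passes = 1

# Cross-encoder over the full corpus: one forward pass per (query, document)
# pair, with nothing precomputable -- infeasible at query time.
full_cross_encoder_passes = CORPUS_SIZE

# Two-stage pipeline: bi-encoder retrieval narrows the corpus to a small
# pool, then the cross-encoder scores only that pool.
two_stage_passes = bi_encoder_passes + RERANK_POOL

print(bi_encoder_passes, full_cross_encoder_passes, two_stage_passes)
```

This is why the two stages are complementary rather than competing: the bi-encoder buys tractability, the cross-encoder buys precision over the small set that remains.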
Reranking with sentence-transformers
The sentence-transformers library provides cross-encoder models that are straightforward to drop into any RAG pipeline. CrossEncoder models are specifically trained for query-document relevance scoring:
```python
from sentence_transformers import CrossEncoder

# Load a cross-encoder trained for passage relevance.
# ms-marco models are trained on the Microsoft MARCO passage retrieval dataset.
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', max_length=512)

# Larger, higher-quality alternatives:
# 'cross-encoder/ms-marco-electra-base'
# 'BAAI/bge-reranker-v2-m3' (multilingual, state of the art)
# 'Alibaba-NLP/gte-reranker-modernbert-base' (ModernBERT-based, long context)

def rerank(query: str, chunks: list[str], top_n: int = 5) -> list[tuple[str, float]]:
    """Rerank chunks by cross-encoder relevance score, return top_n."""
    pairs = [(query, chunk) for chunk in chunks]
    scores = reranker.predict(pairs)  # returns an array of relevance scores
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_n]

# Example usage
query = "How does gradient checkpointing reduce memory usage during training?"

# retrieved_chunks comes from your vector database top-k results
retrieved_chunks = [
    "Gradient checkpointing trades compute for memory by not storing all intermediate activations...",
    "Mixed precision training uses float16 for forward pass weights to reduce memory...",
    "The Adam optimiser stores two additional moment vectors per parameter...",
    "Gradient checkpointing recomputes activations during the backward pass instead of caching them...",
    "Model parallelism splits layers across multiple GPUs to distribute memory...",
]

reranked = rerank(query, retrieved_chunks, top_n=3)
for chunk, score in reranked:
    print(f"Score {score:.3f}: {chunk[:80]}...")
```
The cross-encoder scores have no fixed scale — they are raw logits from the relevance head and vary by model. What matters is the relative ordering, not the absolute values. The two gradient checkpointing chunks should rank clearly above the optimiser, mixed precision, and model parallelism chunks for this query, even though all five are semantically similar to “memory during training”.
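If a downstream component does need scores on a fixed scale — for example, dropping chunks below a relevance threshold — a common convention is to squash the raw logits through a sigmoid. A minimal sketch, with made-up logit values:

```python
import math

def to_probability(logit: float) -> float:
    """Map a raw cross-encoder logit into (0, 1) with a sigmoid."""
    return 1.0 / (1.0 + math.exp(-logit))

# Illustrative logits -- real values depend entirely on the model.
raw_scores = [8.2, 1.4, -3.1, -7.9]
probs = [to_probability(s) for s in raw_scores]

# The ordering is unchanged; only the scale is normalised.
assert probs == sorted(probs, reverse=True)
print([round(p, 3) for p in probs])
```

Any threshold you pick against these normalised scores is still model-specific — calibrate it on your own labelled queries rather than reusing a cutoff from another model.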
Integrating Reranking into a Full RAG Pipeline
A complete reranking pipeline retrieves a larger candidate set than you ultimately want to pass to the model — typically 20 to 50 chunks — reranks them all, then passes only the top 3 to 5 to the LLM. Retrieving more candidates than you need is important: the whole value of reranking is that it can surface relevant chunks that landed outside the top-3 by embedding similarity. If you only retrieve 5 and rerank 5, you are just reordering a set that may already be missing the most relevant chunk:
```python
import numpy as np
from sentence_transformers import CrossEncoder
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from langchain.schema import HumanMessage

embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')
vectorstore = Chroma(persist_directory='./chroma_db', embedding_function=embeddings)
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', max_length=512)
llm = ChatOpenAI(model='gpt-4o-mini')

def rag_with_reranking(query: str, retrieve_k: int = 30, final_k: int = 5) -> str:
    # Stage 1: retrieve broad candidate set by embedding similarity
    candidates = vectorstore.similarity_search(query, k=retrieve_k)
    candidate_texts = [doc.page_content for doc in candidates]

    # Stage 2: cross-encoder reranking
    pairs = [(query, text) for text in candidate_texts]
    scores = reranker.predict(pairs)
    ranked_indices = np.argsort(scores)[::-1][:final_k]
    top_docs = [candidates[i] for i in ranked_indices]

    # Stage 3: generate with reranked context
    context = "\n---\n".join(doc.page_content for doc in top_docs)
    prompt = f"""Answer the question using only the provided context.

Context:
{context}

Question: {query}

Answer:"""
    response = llm.invoke([HumanMessage(content=prompt)])
    return response.content

answer = rag_with_reranking("What chunking strategy works best for legal documents?")
print(answer)
```
LangChain ContextualCompressionRetriever
LangChain wraps cross-encoder reranking in a ContextualCompressionRetriever that integrates cleanly with the rest of the LangChain retriever interface, making it easy to swap in reranking without restructuring your pipeline:
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Wrap the cross-encoder as a LangChain compressor
model = HuggingFaceCrossEncoder(model_name='BAAI/bge-reranker-v2-m3')
compressor = CrossEncoderReranker(model=model, top_n=5)

# base_retriever fetches a larger candidate set
base_retriever = vectorstore.as_retriever(search_kwargs={'k': 30})

# Compression retriever applies reranking transparently
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever,
)

# Usage is identical to any other LangChain retriever
docs = compression_retriever.get_relevant_documents(
    "What are the memory implications of attention in long context models?"
)
for doc in docs:
    print(doc.page_content[:120])
```
Latency and Batching Considerations
Cross-encoder inference adds latency to every query. On a CPU, scoring 30 candidate chunks with ms-marco-MiniLM-L-6-v2 takes roughly 200–400ms; on a GPU it drops to 20–50ms. For most RAG applications this is acceptable — users tolerate a 300ms additional wait for meaningfully better answers — but for latency-critical applications (autocomplete, real-time assistants), you need to optimise. The main levers are: reducing the candidate set size (score fewer chunks — 20 instead of 50), using a smaller reranker model (MiniLM-L-6 instead of ELECTRA-base), running inference on GPU, and batching multiple queries together if your application allows it. Avoid the mistake of scoring chunks sequentially in a loop — always batch the full candidate set in a single predict() call, which is significantly faster than N individual forward passes due to padding and kernel launch overhead.
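The batching effect is easy to see with a mock scorer. The per-call overhead and per-pair cost below are stand-in assumptions, not measurements of any real model, but the shape of the result is the point:

```python
import time

# Mock scorer charging a fixed per-call overhead (standing in for kernel
# launch and padding overhead) plus a small per-pair cost.
PER_CALL_OVERHEAD = 0.010  # seconds, paid once per predict() call
PER_PAIR_COST = 0.001      # seconds of useful work per pair

def mock_predict(pairs: list[tuple[str, str]]) -> list[float]:
    time.sleep(PER_CALL_OVERHEAD + PER_PAIR_COST * len(pairs))
    return [0.0] * len(pairs)

pairs = [("query", f"chunk {i}") for i in range(30)]

# Sequential: 30 calls, so the per-call overhead is paid 30 times.
start = time.perf_counter()
for pair in pairs:
    mock_predict([pair])
sequential = time.perf_counter() - start

# Batched: one call, overhead paid once.
start = time.perf_counter()
mock_predict(pairs)
batched = time.perf_counter() - start

print(f"sequential {sequential:.3f}s vs batched {batched:.3f}s")
```

With a real CrossEncoder the same principle applies: pass the whole candidate list to a single predict() call and let the library batch internally rather than looping yourself.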
Choosing a Reranker Model
The ms-marco-MiniLM-L-6-v2 model is the standard starting point: fast, small (22M parameters), and good enough for most English-language RAG applications. For higher quality at the cost of 3–4x more compute, ms-marco-electra-base performs noticeably better on complex multi-sentence queries. For multilingual corpora or state-of-the-art English quality, BAAI/bge-reranker-v2-m3 is the current best open-source option and runs well on a single GPU. For very latency-constrained deployments, Cohere’s Rerank API and Jina AI’s reranker API offer hosted inference with sub-100ms latency and no model management overhead, at the cost of per-query pricing and a network round-trip. In all cases, benchmark the reranker on a sample of real queries from your system before choosing — reranker quality varies significantly by domain and query style, and the ranking on standard benchmarks does not always transfer to specialised corpora.
When Reranking Helps Most — and Least
Reranking produces the largest gains when your queries are complex, multi-part, or require precise matching of specific details within documents. A query like “what is the token limit for Claude 3 Opus for API tier 2 users in 2024?” requires finding a chunk that contains all three specifics — model name, tier, and date — simultaneously. A bi-encoder will retrieve chunks that are semantically close to “token limits” and “Claude API”, but may rank a generic overview chunk higher than the specific chunk that actually answers the question. A cross-encoder can recognise that the specific chunk jointly satisfies all constraints and rank it first.
Reranking helps less when queries are simple and unambiguous and the top-1 bi-encoder result is already correct most of the time. If your eval shows that bi-encoder precision@1 is already above 0.90, adding reranking will produce small incremental gains that may not justify the added latency and infrastructure complexity. Measure first — the 15–30% improvement figure is an average across diverse query sets; some domains and query types see much larger gains and some see almost none. The only way to know is to run the comparison on your own data.
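A minimal precision@1 comparison harness for that measurement looks like this; the chunk ids and rankings are toy data standing in for your labelled eval set:

```python
def precision_at_1(rankings: list[list[str]], gold: list[str]) -> float:
    """Fraction of queries whose top-ranked chunk is the labelled answer."""
    hits = sum(ranked[0] == answer for ranked, answer in zip(rankings, gold))
    return hits / len(gold)

# Human-labelled best chunk id per query (illustrative data).
gold = ["c1", "c7", "c3", "c9"]

# Rankings from bi-encoder retrieval alone vs after cross-encoder reranking.
bi_encoder_rankings = [["c1", "c2"], ["c2", "c7"], ["c3", "c4"], ["c5", "c9"]]
reranked_rankings = [["c1", "c2"], ["c7", "c2"], ["c3", "c4"], ["c9", "c5"]]

baseline = precision_at_1(bi_encoder_rankings, gold)
with_rerank = precision_at_1(reranked_rankings, gold)
print(f"precision@1: {baseline:.2f} -> {with_rerank:.2f}")
```

If the baseline number is already high, the delta from reranking is the direct answer to whether the added latency is worth it.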
Reranking also helps significantly when your embedding model is mismatched to your domain. General-purpose embedding models trained on web text retrieve poorly for highly technical domains — medical literature, legal contracts, code documentation — because domain-specific terminology is underrepresented in their training data. A cross-encoder trained on similar domain data (or fine-tuned on your corpus) can compensate for this mismatch and surface relevant technical chunks that a general embedding model ranked poorly. In this scenario, reranking and domain-specific embedding fine-tuning are complementary improvements that stack: fine-tune the embedder first if you have labelled relevance data, then add reranking on top for a second stage of quality improvement.
Evaluating Reranking Quality
The standard metric for reranking quality is NDCG@k (Normalised Discounted Cumulative Gain at k) which measures both whether relevant documents are retrieved and whether they are ranked near the top. For practical RAG evaluation, precision@1 (is the most relevant chunk ranked first?) and context recall (does the top-k set contain all information needed to answer the question?) are more directly actionable. Use RAGAS or a custom eval harness to measure both before and after adding the reranker on a held-out set of 50–100 representative queries with human-labelled ground truth. The eval investment pays off quickly — it catches reranker misconfiguration (wrong model for your language, too small a candidate set, chunk size mismatch) that is otherwise invisible until user complaints surface in production.
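NDCG@k is simple enough to compute directly. A sketch with illustrative graded relevance labels (0 = irrelevant, 3 = perfect answer); scikit-learn's `ndcg_score` offers the same metric if you prefer a library:

```python
import math

def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain: relevance discounted by log2 of rank."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances: list[float], k: int) -> float:
    """DCG normalised against the ideal (descending-relevance) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance labels of retrieved chunks in ranked order (illustrative data).
before = [0, 3, 1, 0, 2]  # best chunk ranked second, an irrelevant one first
after = [3, 2, 1, 0, 0]   # reranker promotes the best chunks to the top

print(round(ndcg_at_k(before, 5), 3), round(ndcg_at_k(after, 5), 3))
```

The log2 discount is what makes NDCG rank-sensitive: moving a highly relevant chunk from position 2 to position 1 matters far more than reshuffling the tail.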
One common misconfiguration to check: cross-encoders have a max_length parameter, and query-document pairs longer than it are silently truncated. The default for ms-marco-MiniLM is 512 tokens. If your chunks average 400 tokens, a meaningful fraction of pairs will exceed the limit once the query and special tokens are added, and the reranker is scoring incomplete documents. Either reduce chunk size, increase max_length (within the model’s trained context window), or switch to a reranker with a larger context window such as BAAI/bge-reranker-v2-m3, which supports up to 8192 tokens.
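A quick audit for this failure mode can use whitespace word counts multiplied by a rough subword factor as a stand-in for real tokenisation (the 1.4 factor is an assumption; for exact counts, run your chunks through the reranker's own tokenizer):

```python
MAX_LENGTH = 512
SUBWORD_FACTOR = 1.4  # rough assumption: ~1.4 subword tokens per word
SPECIAL_TOKENS = 3    # [CLS] query [SEP] document [SEP] for BERT-style pairs

def estimate_pair_tokens(query: str, chunk: str) -> int:
    """Crude estimate of the tokenised length of a (query, chunk) pair."""
    words = len(query.split()) + len(chunk.split())
    return int(words * SUBWORD_FACTOR) + SPECIAL_TOKENS

def truncation_rate(query: str, chunks: list[str],
                    max_length: int = MAX_LENGTH) -> float:
    """Fraction of pairs estimated to exceed the reranker's max_length."""
    over = sum(estimate_pair_tokens(query, c) > max_length for c in chunks)
    return over / len(chunks)

# Toy corpus: eight short chunks and two ~450-word chunks.
chunks = ["short chunk"] * 8 + [("word " * 450).strip()] * 2
rate = truncation_rate("how does reranking work", chunks)
print(f"{rate:.0%} of pairs likely truncated")
```

If the rate is non-trivial, fix the chunking or the max_length before spending any effort tuning the reranker itself — scores over truncated documents are noise.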
Fine-Tuning a Cross-Encoder on Your Domain
Off-the-shelf cross-encoders are trained on MS MARCO, a web search passage dataset. This gives them strong general English relevance judgement but can underperform on technical or domain-specific corpora where the vocabulary and relevance patterns differ from web search. Fine-tuning a cross-encoder on domain-specific query-document pairs is straightforward and typically yields 5–15% additional improvement over a general model on in-domain queries.
You need labelled pairs: positive examples (query, relevant chunk) and negative examples (query, non-relevant chunk). Negative examples can be hard negatives — chunks that are semantically similar but not actually relevant, which are the hardest cases for the model to learn. Hard negatives are most valuable for training and can be generated automatically by taking bi-encoder top-k results that are not marked relevant — exactly the failure mode the reranker needs to fix. For fine-tuning, use the sentence-transformers CrossEncoder training API with a binary cross-entropy loss on relevance labels or a margin loss on (positive, negative) pairs. Even 500–1000 labelled examples are enough to produce measurable improvement on a domain-specific corpus; the fine-tuning run itself takes under an hour on a single GPU for a MiniLM-sized model.
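The data preparation step can be sketched in plain Python. The queries, chunks, and helper names below are hypothetical, and the resulting pairs map onto sentence-transformers `InputExample(texts=[query, chunk], label=label)` objects for the classic `CrossEncoder.fit` training API:

```python
from dataclasses import dataclass

@dataclass
class TrainPair:
    query: str
    chunk: str
    label: float  # 1.0 = relevant, 0.0 = not relevant

def build_training_pairs(
    labelled: list[tuple[str, str]],        # (query, relevant chunk)
    bi_encoder_topk: dict[str, list[str]],  # query -> bi-encoder top-k chunks
) -> list[TrainPair]:
    """Positives from labels; hard negatives from unlabelled bi-encoder hits."""
    pairs = []
    for query, relevant in labelled:
        pairs.append(TrainPair(query, relevant, 1.0))
        for chunk in bi_encoder_topk.get(query, []):
            if chunk != relevant:
                # Retrieved by the bi-encoder but not labelled relevant:
                # exactly the failure mode the reranker must learn to fix.
                pairs.append(TrainPair(query, chunk, 0.0))
    return pairs

# Hypothetical example data.
labelled = [("how do I rotate an API key?", "To rotate a key, open Settings...")]
topk = {"how do I rotate an API key?": [
    "To rotate a key, open Settings...",
    "API keys authenticate every request...",  # similar, does not answer
]}
pairs = build_training_pairs(labelled, topk)
print(len(pairs))
```

Because the negatives come from your own retriever's mistakes rather than random chunks, the fine-tuned model learns precisely the distinctions your pipeline needs.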
If you do not have labelled query-document pairs, you can generate synthetic ones using an LLM: extract a chunk from your corpus, prompt an LLM to generate 3–5 questions that the chunk answers, and treat those as positive (question, chunk) pairs. Use BM25 to retrieve hard negatives for each synthetic question. This synthetic data approach consistently outperforms using only the off-the-shelf reranker on domain-specific corpora, even when the LLM-generated questions are imperfect, because it calibrates the model to the specific vocabulary and query patterns of your application.
Cross-Encoder Latency and Production Tradeoffs
The main cost of cross-encoders in production is latency: unlike bi-encoders that precompute document embeddings offline, a cross-encoder must perform a full forward pass on every (query, document) pair at query time. For a reranking pool of 100 documents with a 100ms cross-encoder latency per document, sequential reranking would take 10 seconds — obviously unacceptable. The solutions are batching (run all 100 pairs through the cross-encoder in parallel as a single batch), using a smaller/distilled cross-encoder, and capping the reranking pool size. In practice, a reranking pool of 20–50 documents batched through a 6-layer distilled cross-encoder runs in 50–150ms on a single GPU, which is acceptable for most applications. For CPU-only deployments, keep the pool size at 20 or fewer and use a cross-encoder with 4 layers or fewer — models like ms-marco-MiniLM-L-4-v2 are specifically designed for this tradeoff and are available on Hugging Face. If latency remains a bottleneck after these mitigations, consider late interaction models like ColBERT, which precompute token-level embeddings offline while still performing fine-grained query-document interaction at retrieval time — they sit between bi-encoders and cross-encoders on both the quality and latency spectrums and are worth evaluating for latency-sensitive pipelines that need better precision than bi-encoder retrieval alone provides.