Hard Negative Mining for Embedding Model Training

The quality of an embedding model depends less on the loss function than on the quality of the negatives used during training. Random negatives — examples that are clearly unrelated to the query — are easy for even a mediocre model to push apart, so training on them produces minimal gradient signal and slow improvement. Hard negatives are examples that are semantically similar to the query but not the correct match: a document about a related but distinct concept, a passage that shares key vocabulary with the query but answers a different question, or a paraphrase of the query that maps to a different intent. Training on hard negatives forces the model to learn fine-grained distinctions that actually matter in production retrieval, and consistently produces better recall at every rank position compared to training with random negatives at the same batch size and epoch count.

What Makes a Negative Hard

Hard negatives exist on a spectrum. At one end are easy negatives — documents from a completely different domain that share no vocabulary with the query. In the middle are semi-hard negatives — documents that are in the same general topic area but clearly not relevant. At the hard end are false negatives (documents that are actually relevant but not labelled as such) and near-miss negatives (documents that are topically very close but genuinely non-relevant). The sweet spot for training is semi-hard to hard: easy enough that the model can learn from them without being confused, but hard enough that the gradient signal is strong. False negatives are actually harmful — including them as negatives teaches the model to push apart pairs that should be close — and filtering them out before training is important for high-quality embedding models.

There are several standard strategies for mining hard negatives. BM25-mined negatives use a sparse retrieval system to find documents that share lexical overlap with the query but are not the relevant document. Embedding-mined negatives use a current version of the embedding model to retrieve top-k documents for each query and use the non-relevant ones as negatives — this is the most effective strategy but requires an initial model to bootstrap. Cross-encoder-filtered negatives use a powerful cross-encoder reranker to score candidate negatives and include only those with high scores (indicating the cross-encoder thinks they are relevant) as the hardest negatives for training the bi-encoder.

BM25-Mined Hard Negatives

from rank_bm25 import BM25Okapi
from typing import List, Dict
import random

def mine_bm25_hard_negatives(
    queries: List[str],
    corpus: List[str],
    relevant_ids: List[int],   # index of relevant doc for each query
    n_negatives: int = 5,
    top_k: int = 20,
) -> List[Dict]:
    """Mine hard negatives using BM25 lexical retrieval.
    
    Returns list of {"query": str, "positive": str, "negatives": List[str]}
    """
    # Build BM25 index over corpus
    tokenized_corpus = [doc.lower().split() for doc in corpus]
    bm25 = BM25Okapi(tokenized_corpus)

    training_examples = []
    for query, rel_id in zip(queries, relevant_ids):
        tokenized_query = query.lower().split()
        scores = bm25.get_scores(tokenized_query)

        # Rank all docs by BM25 score, exclude the relevant doc
        ranked_ids = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
        hard_neg_ids = [i for i in ranked_ids if i != rel_id][:top_k]

        # Sample from top-k retrieved (not just top-1) for diversity
        selected = random.sample(hard_neg_ids, min(n_negatives, len(hard_neg_ids)))

        training_examples.append({
            "query": query,
            "positive": corpus[rel_id],
            "negatives": [corpus[i] for i in selected],
        })
    return training_examples

# Example
corpus = [
    "Gradient checkpointing reduces memory by recomputing activations during backward pass.",
    "Gradient clipping prevents exploding gradients by scaling the gradient norm.",
    "Gradient accumulation simulates large batch sizes by accumulating gradients.",
    "Momentum-based optimizers maintain a running average of past gradients.",
    "The learning rate schedule controls how the step size changes over training.",
]
queries = ["how to reduce memory during training", "how to stabilise training gradients"]
relevant_ids = [0, 1]

examples = mine_bm25_hard_negatives(queries, corpus, relevant_ids, n_negatives=2)
for ex in examples:
    print(f"Query: {ex['query']}")
    print(f"Positive: {ex['positive'][:60]}")
    for neg in ex['negatives']:
        print(f"  Hard neg: {neg[:60]}")
    print()

BM25 mining is fast, requires no GPU, and works well for datasets where queries and documents share vocabulary. Its main weakness is that it cannot find semantic hard negatives — documents that are conceptually similar but use different terminology. For embedding models intended to handle paraphrases and semantic similarity, BM25-only negatives leave a significant gap in training signal that embedding-based mining fills.

Embedding-Mined Hard Negatives

Once you have an initial embedding model — even a general-purpose one like all-MiniLM-L6-v2 — you can mine hard negatives by embedding your full corpus, running approximate nearest neighbour search for each query, and treating high-ranked non-relevant documents as negatives. This produces semantically hard negatives that BM25 cannot find. The process is iterative: train on these negatives, get a better model, re-mine with the better model, repeat. Each round of mining with an improved model produces harder negatives that push the next training round further.

import torch
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from typing import List, Dict, Tuple

def embed_corpus(
    texts: List[str],
    model: SentenceTransformer,
    batch_size: int = 256,
    normalize: bool = True,
) -> np.ndarray:
    """Embed a list of texts and return a float32 numpy array."""
    embeddings = model.encode(
        texts,
        batch_size=batch_size,
        show_progress_bar=True,
        normalize_embeddings=normalize,
        convert_to_numpy=True,
    )
    return embeddings.astype(np.float32)

def build_faiss_index(embeddings: np.ndarray, use_gpu: bool = False) -> faiss.Index:
    """Build a FAISS inner-product index (for normalized embeddings = cosine similarity)."""
    dim = embeddings.shape[1]
    index = faiss.IndexFlatIP(dim)
    if use_gpu and faiss.get_num_gpus() > 0:
        res = faiss.StandardGpuResources()
        index = faiss.index_cpu_to_gpu(res, 0, index)
    index.add(embeddings)
    return index

def mine_embedding_hard_negatives(
    queries: List[str],
    corpus: List[str],
    relevant_ids: List[int],
    model: SentenceTransformer,
    n_negatives: int = 5,
    top_k: int = 30,
    similarity_threshold: float = 0.95,  # filter near-duplicates (potential false negatives)
) -> List[Dict]:
    """Mine hard negatives using embedding similarity search."""
    print("Embedding corpus...")
    corpus_embs = embed_corpus(corpus, model)
    print("Embedding queries...")
    query_embs = embed_corpus(queries, model)

    print("Building FAISS index...")
    index = build_faiss_index(corpus_embs)

    print("Searching for hard negatives...")
    scores, indices = index.search(query_embs, top_k + 1)  # +1 to account for relevant doc

    training_examples = []
    for i, (query, rel_id) in enumerate(zip(queries, relevant_ids)):
        retrieved = indices[i].tolist()
        retrieved_scores = scores[i].tolist()

        hard_negs = []
        for doc_id, score in zip(retrieved, retrieved_scores):
            if doc_id == rel_id:
                continue  # skip the relevant document
            if score >= similarity_threshold:
                continue  # skip near-duplicates (likely false negatives)
            hard_negs.append(corpus[doc_id])
            if len(hard_negs) >= n_negatives:
                break

        if hard_negs:
            training_examples.append({
                "query": query,
                "positive": corpus[rel_id],
                "negatives": hard_negs,
            })
    return training_examples

# Usage
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
# Mine from your domain corpus
examples = mine_embedding_hard_negatives(queries, corpus, relevant_ids, model)

Training with Hard Negatives Using MultipleNegativesRankingLoss

The standard loss for embedding model training with hard negatives is MultipleNegativesRankingLoss (MNRL), which treats every other example in the batch as an in-batch negative in addition to any explicit hard negatives. This is efficient because a batch of size N provides N-1 negatives per example for free, and adding explicit hard negatives multiplies the effective number of informative negatives per step.

from typing import List, Dict
from sentence_transformers import SentenceTransformer, losses, util, InputExample
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from torch.utils.data import DataLoader

def build_training_dataset(examples: List[Dict]) -> List[InputExample]:
    """Convert mined examples to SentenceTransformers InputExample format.
    
    For MNRL with hard negatives: (anchor, positive, hard_neg_1, hard_neg_2, ...)
    """
    input_examples = []
    for ex in examples:
        # Format: [query, positive, neg1, neg2, ...]
        texts = [ex["query"], ex["positive"]] + ex["negatives"]
        input_examples.append(InputExample(texts=texts))
    return input_examples

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

training_examples = build_training_dataset(mined_examples)
train_dataloader = DataLoader(training_examples, batch_size=64, shuffle=True)

# MNRL with hard negatives
train_loss = losses.MultipleNegativesRankingLoss(
    model=model,
    scale=20.0,          # similarity scale (inverse temperature): higher = sharper distribution
    similarity_fct=util.cos_sim,
)

# Evaluation: use a held-out query/relevant doc set
evaluator = InformationRetrievalEvaluator(
    queries={str(i): q for i, q in enumerate(eval_queries)},
    corpus={str(i): d for i, d in enumerate(eval_corpus)},
    relevant_docs={str(i): {str(rel)} for i, rel in enumerate(eval_relevant_ids)},
    score_functions={"cos_sim": util.cos_sim},
    main_score_function="cos_sim",
    name="eval",
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=evaluator,
    epochs=3,
    evaluation_steps=500,
    warmup_steps=100,
    output_path="./finetuned-embeddings",
    save_best_model=True,
    show_progress_bar=True,
)

Cross-Encoder Filtering for Highest-Quality Negatives

The highest-quality training signal comes from negatives that a powerful cross-encoder scores as highly relevant but that are not actually the labelled positive. These represent the genuinely hardest cases: documents the model should be pushed to distinguish from the true positive. The approach is to take your top-k embedding-mined candidates, score them with a cross-encoder, and select those with scores above a threshold as hard negatives.

from sentence_transformers import CrossEncoder

def filter_with_cross_encoder(
    query: str,
    candidate_docs: List[str],
    cross_encoder_model: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",
    min_score: float = 0.3,    # only keep candidates the CE thinks are relevant
    max_score: float = 0.95,   # avoid near-positives that might be false negatives
) -> List[str]:
    """Use a cross-encoder to identify the hardest negative candidates."""
    ce = CrossEncoder(cross_encoder_model)
    pairs = [(query, doc) for doc in candidate_docs]
    scores = ce.predict(pairs)

    # Keep candidates in the "hard but not false negative" range,
    # sorted by cross-encoder score descending so the hardest come first
    hard_negs_scored = [
        (doc, score) for doc, score in zip(candidate_docs, scores)
        if min_score <= score <= max_score
    ]
    hard_negs_scored.sort(key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in hard_negs_scored]

Cross-encoder filtering is compute-intensive because it requires running inference on every (query, candidate) pair, but it produces the highest-quality negatives of any mining strategy. In practice, a reasonable pipeline is: BM25 mining to get 50 candidates per query quickly, then cross-encoder filtering on those 50 to select the 5–10 hardest. This keeps the expensive cross-encoder inference bounded while producing better negatives than BM25 alone.
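
A compact sketch of that two-stage pipeline, reusing the mine_bm25_hard_negatives and filter_with_cross_encoder functions defined above; the candidate and negative counts and the mine_two_stage name are illustrative choices, not part of any library.

def mine_two_stage(queries, corpus, relevant_ids, n_candidates=50, n_negatives=8):
    """Cheap BM25 candidate generation followed by cross-encoder filtering."""
    # Stage 1: BM25 retrieval produces a broad candidate pool per query
    bm25_examples = mine_bm25_hard_negatives(
        queries, corpus, relevant_ids,
        n_negatives=n_candidates, top_k=n_candidates,
    )
    # Stage 2: run the expensive cross-encoder only on those candidates
    # (for large query sets, load the CrossEncoder once and pass it in)
    training_examples = []
    for ex in bm25_examples:
        hardest = filter_with_cross_encoder(ex["query"], ex["negatives"])
        if hardest:
            training_examples.append({
                "query": ex["query"],
                "positive": ex["positive"],
                "negatives": hardest[:n_negatives],
            })
    return training_examples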

Iterative Hard Negative Mining

The most effective training procedure for embedding models combines all of the above into an iterative loop: train for a few epochs on BM25 negatives, then re-mine with the improved model to get harder embedding-based negatives, train again, re-mine, repeat. Each round uses the current model to find negatives that the current model finds difficult, which means the negatives become progressively harder as the model improves. Most state-of-the-art embedding models (E5, BGE, GTE) were trained with some variant of this iterative mining procedure.

In practice, two to three rounds of iterative mining are usually sufficient; beyond that, the improvements per round diminish and the compute cost of re-mining the full corpus becomes harder to justify. The most important metrics to track across rounds are Recall@10 and MRR on your evaluation set: if they plateau between rounds, mining is no longer providing new signal and training should stop. If they are still improving, another round of mining is worth running. Use the similarity threshold in your mining function to filter potential false negatives at each round: as the model improves, its top-k retrievals will include more unlabelled true positives, and including these as negatives will hurt training rather than help it.
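
As a rough sketch, the loop below wires together the mining and training code from earlier sections; train_one_round and evaluate_recall_at_10 are hypothetical helpers standing in for the model.fit call and held-out evaluation shown above, and the round count and plateau margin are illustrative choices rather than recommended values.

def iterative_mining(model, queries, corpus, relevant_ids, rounds=3):
    """Sketch of the iterative loop: re-mine, train, evaluate, stop on plateau.

    Assumes the model has already had an initial training round on BM25
    negatives. train_one_round and evaluate_recall_at_10 are hypothetical
    helpers wrapping the fit and evaluator code shown earlier.
    """
    best_recall = 0.0
    for round_idx in range(rounds):
        # Re-mine with the current model so negatives track its current weaknesses
        examples = mine_embedding_hard_negatives(
            queries, corpus, relevant_ids, model,
            n_negatives=5, top_k=30, similarity_threshold=0.95,
        )
        model = train_one_round(model, examples)     # model.fit(...) as above
        recall = evaluate_recall_at_10(model)        # held-out evaluation set
        print(f"Round {round_idx + 1}: Recall@10 = {recall:.3f}")
        if recall <= best_recall + 0.005:            # plateau: stop re-mining
            break
        best_recall = recall
    return model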

In-Batch Negatives: Getting the Most from Your Batch Size

Even before adding explicit hard negatives, the choice of batch size has a large impact on embedding model training quality. With MultipleNegativesRankingLoss, each example in the batch serves as a negative for every other example, so a batch of 64 provides 63 negatives per example while a batch of 256 provides 255. The marginal value of each additional in-batch negative decreases as batch size grows, but the jump from small batches (16–32) to large batches (128–256) consistently improves training quality. This is one of the few cases in ML where larger batch size is unambiguously better for the task rather than just faster — the contrastive loss actually gets easier to optimise with more negatives per step because it sees a more complete picture of the embedding space.

The practical implication is that you should maximise batch size first, before worrying about sophisticated mining strategies. If your GPU allows a batch size of 256, train with 256. If it only fits 32, note that plain gradient accumulation does not help here: MNRL computes its loss within each micro-batch, so accumulating 8 steps of 32 still gives each example only 31 in-batch negatives, not 255. To get the contrastive benefit of a large batch on limited memory, use CachedMultipleNegativesRankingLoss, the GradCache-based variant in sentence-transformers, which runs the forward pass in mini-batches but computes the loss over the full logical batch, so a batch of 256 provides 255 in-batch negatives per example while only holding a mini-batch of activations in memory at a time.

from sentence_transformers import SentenceTransformer, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Cached MNRL computes the loss over the full logical batch while running the
# forward pass in mini-batches that fit in GPU memory
train_loss = losses.CachedMultipleNegativesRankingLoss(
    model=model,
    scale=20.0,
    mini_batch_size=32,   # what actually fits on the GPU at once
)

# Logical batch of 256 -> 255 in-batch negatives per example
train_dataloader = DataLoader(training_examples, batch_size=256, shuffle=True)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./finetuned-embeddings",
)

Evaluating the Impact of Hard Negatives

The standard evaluation metrics for retrieval are Recall@k (the fraction of queries for which the relevant document appears in the top-k results), Mean Reciprocal Rank (MRR, which measures the rank of the first relevant result), and NDCG@k (which accounts for the rank of all relevant documents when there are multiple). For most practical embedding model training, Recall@10 is the primary metric because it corresponds directly to the number of chunks retrieved per query in a typical RAG pipeline. A model that improves Recall@10 from 0.70 to 0.80 means that 10% more queries will have the relevant document in the retrieved context window, which translates directly to better answer quality downstream.
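
For reference, both Recall@k and MRR are straightforward to compute once you have ranked retrieval results per query; the sketch below assumes one labelled relevant document per query and a list of ranked document ids per query (for example, the FAISS indices from the mining code).

from typing import List

def recall_at_k(ranked_ids: List[List[int]], relevant_ids: List[int], k: int = 10) -> float:
    """Fraction of queries whose relevant document appears in the top-k results."""
    hits = sum(1 for ranked, rel in zip(ranked_ids, relevant_ids) if rel in ranked[:k])
    return hits / len(relevant_ids)

def mean_reciprocal_rank(ranked_ids: List[List[int]], relevant_ids: List[int]) -> float:
    """Average of 1 / rank of the first relevant result; 0 for queries that miss entirely."""
    total = 0.0
    for ranked, rel in zip(ranked_ids, relevant_ids):
        if rel in ranked:
            total += 1.0 / (ranked.index(rel) + 1)
    return total / len(relevant_ids)

The same two functions can back the ablation described next: run retrieval with each trained model over the held-out queries and compare the resulting Recall@10 values directly.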

To measure the specific contribution of hard negatives, train two models identically except for the negative strategy — one with random negatives only, one with BM25 or embedding-mined hard negatives — and compare Recall@10 on a held-out evaluation set. In practice, hard negatives consistently improve Recall@10 by 5–15 percentage points on domain-specific retrieval tasks, with the largest gains on tasks where the query and non-relevant documents share significant vocabulary overlap. For tasks where all documents are topically distinct from each other, the improvement is smaller because even random in-batch negatives are reasonably hard. Running this ablation on your specific dataset before committing to the full mining pipeline is worthwhile — if your corpus is small and topically diverse, the simpler random negative baseline may be sufficient and you can skip the mining infrastructure entirely.

False Negative Handling

False negatives — documents that are actually relevant to the query but are not in the labelled positive set — are the primary quality risk in hard negative mining. As your embedding model improves through iterative training, it retrieves more truly relevant documents in its top-k results, which means a higher fraction of your mined negatives are actually false negatives. Training on false negatives teaches the model to push apart relevant document pairs, directly harming retrieval quality in a way that is hard to diagnose because your evaluation set uses the same labels as your training set.

The standard mitigation is the similarity threshold filter shown in the mining code above: exclude any candidate whose embedding similarity to the query exceeds 0.95. This catches most false negatives because unlabelled documents that genuinely answer the query tend to score nearly as high against it as the labelled positive does. A more robust approach for datasets with high label noise is to run cross-encoder scoring on your candidate negatives and exclude anything the cross-encoder scores above 0.9; the cross-encoder is better calibrated for semantic relevance than embedding similarity and will catch more false negatives at the cost of extra compute. For production embedding models where quality matters, the extra compute is worth it.
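
A minimal sketch of that cross-encoder filter, assuming the same ms-marco cross-encoder used earlier and treating the 0.9 cut-off as an illustrative default; candidates scored above it are removed from the negative pool rather than kept.

from typing import List
from sentence_transformers import CrossEncoder

def drop_false_negatives(
    query: str,
    candidate_negatives: List[str],
    ce: CrossEncoder,
    exclusion_threshold: float = 0.9,
) -> List[str]:
    """Remove candidates the cross-encoder considers relevant to the query."""
    scores = ce.predict([(query, doc) for doc in candidate_negatives])
    return [
        doc for doc, score in zip(candidate_negatives, scores)
        if score < exclusion_threshold
    ]

# Usage: clean each mined example's negatives before training
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
for ex in examples:
    ex["negatives"] = drop_false_negatives(ex["query"], ex["negatives"], ce)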

Choosing a Mining Strategy for Your Use Case

For a new embedding model project, the practical decision tree is straightforward. Start with BM25 negatives if your corpus is large (over 100k documents) and you need a quick first training run without GPU-intensive mining. Move to embedding-mined negatives for the second round using the model from the first round. Add cross-encoder filtering only if you have a high-quality cross-encoder available for your domain and your evaluation metrics have plateaued on embedding-mined negatives alone. The marginal gain from cross-encoder filtering is real but diminishing — most of the improvement in retrieval quality comes from moving from random to BM25 negatives, and then from BM25 to embedding-mined negatives. Cross-encoder filtering is the final 2–5 percentage point improvement that matters for production systems but is not worth the infrastructure cost for initial experiments. Start simple, evaluate rigorously, and add complexity only when the numbers justify it.
