How to Fine-Tune Embedding Models with Contrastive Learning

Contrastive learning is the dominant training paradigm for embedding models, and understanding how it works is essential for anyone who needs to adapt an off-the-shelf embedding model to a specific domain or retrieval task. General-purpose models like text-embedding-3-large or BGE-large-en are trained on broad web-scale data and work well across many domains, but they consistently underperform domain-adapted models on in-domain retrieval tasks — medical literature, legal documents, code search, proprietary knowledge bases — because the embedding space is shaped by training data distribution. Fine-tuning with contrastive learning reshapes that space to match your specific retrieval patterns, and the performance gains are large enough to matter in production: 10–25% improvement in recall@10 is typical for a well-executed domain adaptation on technical corpora.

The Contrastive Learning Objective

The core idea is simple: pull the embeddings of semantically similar text pairs closer together in the embedding space, and push dissimilar pairs apart. A training example consists of an anchor, a positive (semantically similar to the anchor), and one or more negatives (semantically dissimilar). The model is trained to produce embeddings where the anchor-positive cosine similarity is higher than the anchor-negative similarity. Triplet-style losses enforce this with an explicit margin; the ranking loss described next uses a softmax over similarity scores instead.

The standard loss function is Multiple Negatives Ranking Loss (MNRL), which treats every other item in the batch as a negative for each anchor. Given a batch of (anchor, positive) pairs, the loss for each anchor is the cross-entropy over the similarity scores between the anchor and all positives in the batch — the anchor’s own positive should score highest. This in-batch negative mining is efficient because you get B-1 negatives per example for free from the other batch items, without needing to explicitly sample negatives. The loss function used in Sentence Transformers:

from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Training pairs: (query, relevant_document)
train_examples = [
    InputExample(texts=["what is gradient checkpointing", "Gradient checkpointing trades compute for memory..."]),
    InputExample(texts=["how to reduce vram usage", "Techniques for reducing GPU memory during training..."]),
    # ... thousands more pairs
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./finetuned-bge-base",
)

Batch size is the single most impactful hyperparameter for MNRL. Larger batches mean more in-batch negatives, which makes the training signal harder and the resulting model more discriminative. A batch size of 32 gives 31 negatives per anchor; a batch size of 256 gives 255. In practice, use the largest batch size your GPU memory allows — for a base-size embedding model (110M parameters), batch sizes of 256–1024 are achievable on a single A100 80GB.
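
To make the objective concrete, here is a minimal pure-Python sketch of the in-batch ranking loss described above. It is an illustration under stated assumptions, not the library's implementation: the function names are hypothetical, and the similarity scale of 20 is assumed to mirror the Sentence Transformers default.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mnrl_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i].

    Every other positive in the batch acts as a negative for that anchor.
    The scale value is an assumption for this sketch.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # log-sum-exp with max subtraction for numerical stability
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(anchors)

# Toy batch of two pairs with 2-dimensional embeddings
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive is close to its own anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each positive is close to the wrong anchor

loss_aligned = mnrl_loss(anchors, aligned)
loss_swapped = mnrl_loss(anchors, swapped)
```

The aligned batch yields a loss near zero and the swapped batch a large one. Adding more rows to the batch adds more competing logits per anchor, which is why batch size has such a strong effect on this loss.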

Building a Training Dataset

The quality of your training pairs matters far more than the quantity. A few thousand high-quality (query, relevant passage) pairs will outperform tens of thousands of noisy pairs. The most effective source is real user queries paired with the documents they clicked or marked as relevant. If you have a deployed retrieval system with user interaction logs, mine these first. Even a few hundred such pairs anchored in real usage patterns produce a meaningfully better model than synthetic data alone.

When real user data isn’t available, generate synthetic pairs with an LLM. The standard approach: chunk your corpus into passages, prompt an LLM to generate 2–3 plausible queries a user might ask that would be answered by each passage, and use (generated_query, source_passage) as training pairs. This works surprisingly well because the LLM captures the query-document relationship with reasonable fidelity even if the exact phrasing is synthetic:

import anthropic
from sentence_transformers import InputExample

client = anthropic.Anthropic()

def generate_queries(passage: str, n: int = 3) -> list[str]:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"""Generate {n} search queries that a user might type to find this passage.
Return only the queries, one per line, no numbering or explanation.

Passage:
{passage}"""
        }]
    )
    return [q.strip() for q in response.content[0].text.strip().split("\n") if q.strip()]

# Generate pairs from your corpus
training_pairs = []
for passage in corpus_passages:
    queries = generate_queries(passage)
    for query in queries:
        training_pairs.append(InputExample(texts=[query, passage]))
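
LLM output does not always follow the "no numbering" instruction, and near-duplicate queries add little training signal. The sketch below shows one possible cleanup pass to run on generated lines before building pairs. The helper name `clean_queries` and the `min_words` threshold are hypothetical choices for illustration.

```python
import re

def clean_queries(raw_lines, min_words=2):
    """Strip list markers and quotes, then drop very short or duplicate queries."""
    seen = set()
    cleaned = []
    for line in raw_lines:
        # Remove leading bullets or numbering such as "1.", "2)", "-", "*"
        text = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s*", "", line).strip().strip("\"'")
        key = text.lower()  # case-insensitive duplicate detection
        if len(text.split()) < min_words or key in seen:
            continue
        seen.add(key)
        cleaned.append(text)
    return cleaned

raw = [
    "1. how does gradient checkpointing work",
    "- How does gradient checkpointing work",
    '"reduce GPU memory during training"',
    "memory",
]
queries = clean_queries(raw)
```

Here the second line is dropped as a case-insensitive duplicate and the single-word line is dropped as too short, leaving two usable queries.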

Hard Negative Mining

In-batch negatives from random batch items are easy negatives — they’re typically very different from the anchor and provide a weak training signal once the model has learned basic semantic similarity. Hard negatives are documents that are superficially similar to the anchor (retrieved by the current model but irrelevant) — these are much more informative because they force the model to learn fine-grained distinctions. Training with hard negatives consistently improves retrieval precision over random-negative training, especially on technical domains where many passages are topically related but not directly relevant to specific queries.

The mining process: run your current embedding model over the training corpus, retrieve the top-k documents for each training query, then filter out the true positives. The remaining top-k candidates are hard negatives — documents the model currently thinks are relevant but aren’t. Add them to your training data as explicit negatives alongside the in-batch negatives:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
import torch

def mine_hard_negatives(model, queries, corpus, positives_map, top_k=10):
    """
    queries: list of query strings
    corpus: list of (doc_id, text) tuples
    positives_map: dict mapping query_idx -> list of positive doc_ids
    """
    corpus_ids = [doc_id for doc_id, _ in corpus]
    corpus_texts = [text for _, text in corpus]

    query_embs = model.encode(queries, convert_to_tensor=True, batch_size=64)
    corpus_embs = model.encode(corpus_texts, convert_to_tensor=True, batch_size=64)

    hard_negatives = []
    scores = cos_sim(query_embs, corpus_embs)  # (n_queries, n_corpus)

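    for q_idx in range(len(queries)):
        positive_ids = set(positives_map.get(q_idx, []))
        # Over-fetch by the number of positives so filtering still leaves top_k candidates,
        # and never request more results than the corpus contains
        k = min(top_k + len(positive_ids), len(corpus_ids))
        top_indices = torch.topk(scores[q_idx], k).indices.tolist()
        negatives = [corpus_ids[i] for i in top_indices if corpus_ids[i] not in positive_ids][:top_k]
        hard_negatives.append(negatives)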

    return hard_negatives

Iterative hard negative mining — mine negatives, train, mine again with the updated model — gives the best results but is expensive. A practical compromise is to mine once before training starts using a strong general-purpose model (BGE-large or E5-large) as the initial retriever, then train on those hard negatives without re-mining. This captures most of the benefit at a fraction of the cost.
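
Once negatives are mined, they need to be attached to the training pairs. A minimal sketch of that step follows, assuming the same `positives_map` and per-query negative lists as the mining function above. The helper `build_triplets`, the `doc_text` lookup, and the `max_negs` cap are hypothetical names introduced for illustration.

```python
def build_triplets(queries, positives_map, hard_negatives, doc_text, max_negs=3):
    """Expand each (query, positive) pair into (query, positive, negative) triplets.

    doc_text maps doc_id -> passage text; hard_negatives[q_idx] is a list of doc_ids.
    """
    triplets = []
    for q_idx, query in enumerate(queries):
        for pos_id in positives_map.get(q_idx, []):
            # Cap negatives per pair so a few queries do not dominate the dataset
            for neg_id in hard_negatives[q_idx][:max_negs]:
                triplets.append((query, doc_text[pos_id], doc_text[neg_id]))
    return triplets

queries = ["how to reduce vram usage"]
positives_map = {0: ["d1"]}
hard_negatives = [["d2", "d3"]]
doc_text = {
    "d1": "Techniques for reducing GPU memory during training...",
    "d2": "How to monitor GPU memory with nvidia-smi...",
    "d3": "Choosing a GPU for deep learning workloads...",
}
triplets = build_triplets(queries, positives_map, hard_negatives, doc_text)
```

Each triplet can then be wrapped as `InputExample(texts=[query, positive, negative])`. MultipleNegativesRankingLoss accepts this three-text form and scores the explicit negative alongside the in-batch ones.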

MatryoshkaLoss for Flexible Embedding Dimensions

Matryoshka Representation Learning (MRL) trains the embedding model so that the first N dimensions of the full embedding are themselves a good embedding at dimension N. This means you can truncate a 768-dim embedding to 256 or 128 dimensions at inference time with minimal quality loss, enabling significant storage and search latency reductions for large corpora. Sentence Transformers supports MRL natively through MatryoshkaLoss, which wraps another loss function and adds auxiliary losses at multiple embedding truncation sizes:

from sentence_transformers import losses

base_loss = losses.MultipleNegativesRankingLoss(model)
train_loss = losses.MatryoshkaLoss(
    model,
    base_loss,
    matryoshka_dims=[768, 512, 256, 128, 64],
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    output_path="./matryoshka-bge-base",
)

After training, you can serve the full 768-dim model for high-precision retrieval and a truncated 128-dim version for approximate first-stage retrieval in a multi-stage pipeline. The OpenAI text-embedding-3 models use MRL, which is why they support the dimensions parameter to return truncated embeddings.
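
Truncation at inference time is just a slice followed by re-normalisation, so that cosine or dot-product scores stay comparable. A minimal sketch, using a hypothetical helper name and a toy 4-dimensional vector in place of a real 768-dimensional embedding:

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to normalise
    return [x / norm for x in head]

full = [0.6, 0.0, 0.8, 0.0]          # stand-in for a full-size embedding
short = truncate_embedding(full, 2)  # first two dimensions, unit length
short_norm = math.sqrt(sum(x * x for x in short))
```

The storage saving is linear in the dimension: a 128-dim float32 vector takes 512 bytes, versus 3,072 bytes at 768 dims.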

Evaluation During and After Fine-Tuning

Track retrieval metrics on a held-out evaluation set throughout training, not just at the end. Sentence Transformers’ InformationRetrievalEvaluator computes NDCG@10, MRR@10, and Recall@10 at configurable checkpoints:

from sentence_transformers.evaluation import InformationRetrievalEvaluator

# eval_queries: dict of {query_id: query_text}
# eval_corpus: dict of {doc_id: doc_text}
# eval_relevant: dict of {query_id: set of relevant doc_ids}
evaluator = InformationRetrievalEvaluator(
    queries=eval_queries,
    corpus=eval_corpus,
    relevant_docs=eval_relevant,
    name="domain-eval",
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=evaluator,
    evaluation_steps=500,
    epochs=3,
    output_path="./finetuned-model",
    save_best_model=True,  # keeps the checkpoint with the best evaluator score
)

Watch for overfitting: embedding models fine-tuned on small domain datasets (under 5,000 pairs) can overfit within 1–2 epochs, causing NDCG to peak early then decline. Use save_best_model=True and evaluate every 200–500 steps. If the eval set is too small to give stable metrics, use 5-fold cross-validation over your full labelled set rather than a fixed split.
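
It helps to know exactly what these metrics measure when reading the evaluator output. The sketch below computes them for a single query with binary relevance. It is an illustrative implementation with hypothetical function names, and the library's own definitions may differ in details such as tie handling.

```python
import math

def recall_at_k(ranked_ids, relevant, k=10):
    """Fraction of the relevant documents that appear in the top k results."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0

def mrr_at_k(ranked_ids, relevant, k=10):
    """Reciprocal rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant, k=10):
    """Binary-relevance NDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

ranked = ["d3", "d1", "d7", "d2"]   # retrieval order for one query
relevant = {"d1", "d2"}             # ground-truth relevant documents

recall = recall_at_k(ranked, relevant, k=10)
recall_2 = recall_at_k(ranked, relevant, k=2)
mrr = mrr_at_k(ranked, relevant, k=10)
ndcg = ndcg_at_k(ranked, relevant, k=10)
```

Averaging these per-query values over the evaluation set gives the numbers to track across checkpoints. With only a few dozen queries the averages are noisy, which is the situation where cross-validation becomes useful.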

When Fine-Tuning Pays Off

The case for fine-tuning is strongest when your corpus has specialized vocabulary the base model hasn’t seen frequently — domain-specific acronyms, proprietary product names, technical jargon with precise meanings that differ from general usage. In these cases, the base model’s embedding space maps your terms to their general-language neighbors rather than their domain-specific neighbors, and contrastive fine-tuning directly corrects this. The case is weakest for general-language corpora (news articles, Wikipedia-style content, general Q&A) where the base model already handles the vocabulary well. Before committing to fine-tuning, benchmark the base model on your evaluation set — if recall@10 is above 0.80, the marginal gain from fine-tuning is likely small and the engineering cost may not be worth it. If it’s below 0.65, fine-tuning on even a few thousand domain-specific pairs will produce a meaningful improvement worth the investment.

Choosing a Base Model to Fine-Tune

The base model you start from matters substantially. Larger models have richer representations and respond better to fine-tuning, but they cost more to serve: BGE-large (roughly 335M parameters) produces better embeddings than BGE-base (roughly 110M parameters), but with about three times the parameters it is correspondingly more expensive per query. For most production retrieval systems, BGE-base or E5-base is the right starting point: small enough for fast inference, large enough to benefit meaningfully from fine-tuning. If serving cost is not a concern and recall is the primary metric, start from BGE-large or E5-large. Avoid planning around API-only models like text-embedding-3-large: you can’t access the weights, so domain adaptation requires distillation rather than direct fine-tuning, which is a substantially more complex pipeline.

The architecture of the base model also matters. Bi-encoders (a shared encoder, or separate encoders for queries and documents) are the right architecture for dense retrieval because they allow offline pre-computation of document embeddings. Cross-encoders score query-document pairs jointly and can’t pre-compute document embeddings, making them suitable only for reranking small candidate sets. BGE, E5, and GTE are all bi-encoders and are the correct starting point for fine-tuning a retrieval embedding model. BERT and RoBERTa can be fine-tuned as bi-encoders but require more careful setup since they aren’t pre-trained with contrastive objectives; starting from a model that was already contrastively pre-trained is almost always the better choice.

Multi-Task Fine-Tuning for Retrieval and Classification

If your application needs embeddings that support both semantic search and text classification — a common pattern in document processing pipelines — you can fine-tune with multiple objectives simultaneously. Sentence Transformers supports multiple train objectives by passing a list of (dataloader, loss) tuples to model.fit. The optimizer alternates between objectives each step, which trains a shared embedding space that works for both tasks. In practice, retrieval quality (NDCG) improves slightly versus single-objective training for the retrieval task, and classification accuracy improves similarly, because each objective acts as a regulariser for the other — the model can’t overfit purely to keyword overlap patterns for retrieval when it also needs to capture class-discriminative features for classification.
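
The alternation between objectives can be pictured as a round-robin over the dataloaders. The sketch below is a simplified illustration of that scheduling idea, not the library's training loop: the generator name is hypothetical, and plain lists stand in for dataloaders.

```python
from itertools import cycle

def round_robin_batches(loaders, steps):
    """Yield (objective_index, batch), visiting every objective once per step.

    cycle() restarts a loader when it runs out, so a shorter dataset simply
    repeats. A real shuffled dataloader would reshuffle on restart, which
    this sketch does not model.
    """
    iterators = [cycle(loader) for loader in loaders]
    for _ in range(steps):
        for obj_idx, it in enumerate(iterators):
            yield obj_idx, next(it)

retrieval_batches = ["r1", "r2"]     # stand-ins for retrieval batches
classification_batches = ["c1"]      # stand-in for classification batches
schedule = list(round_robin_batches([retrieval_batches, classification_batches], steps=2))
```

Because every objective contributes one batch per step, a small classification set is revisited far more often than a large retrieval set, which is worth keeping in mind when watching for overfitting on the smaller task.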

A practical note on learning rates: fine-tuning embedding models with contrastive loss is sensitive to learning rate in a way that standard classification fine-tuning is not. The optimal learning rate is typically lower than for classification — 1e-5 to 5e-5 rather than 2e-4 — because the contrastive loss can cause large gradient updates that destabilise the base model’s representations if the learning rate is too high. Use a linear warmup over the first 5–10% of training steps and a cosine decay schedule. If you see the training loss decrease rapidly in the first epoch then plateau or diverge, the learning rate is too high. If the loss decreases very slowly across all epochs, the learning rate is too low or the batch size is too small to provide a useful in-batch negative signal.
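
The warmup-plus-cosine shape described above can be written as a small function. This is a minimal sketch under assumptions: the function name is hypothetical, and the peak learning rate of 2e-5 and warmup fraction of 10% are example values taken from the ranges in the text.

```python
import math

def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_frac=0.1):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, from 0.0 to 1.0
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1000
lr_start = lr_at_step(0, total)      # start of warmup
lr_peak = lr_at_step(100, total)     # end of warmup
lr_mid = lr_at_step(550, total)      # halfway through the decay
lr_end = lr_at_step(1000, total)     # end of training
```

In Sentence Transformers' `model.fit`, a comparable schedule is selected with the `scheduler="warmupcosine"`, `warmup_steps`, and `optimizer_params={"lr": ...}` arguments, so a function like this is mainly useful for plotting or sanity-checking the schedule before a run.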

Deploying a Fine-Tuned Embedding Model

Once fine-tuned, your model is a standard Sentence Transformers model and can be deployed anywhere the base model could be deployed. The most common patterns are: serving via a FastAPI endpoint with the model loaded into GPU memory and batching requests manually; using the HuggingFace Text Embeddings Inference (TEI) server, which provides an optimised production serving layer for embedding models with dynamic batching and optimised inference backends; or pushing to the HuggingFace Hub (as a private model) and serving via the Inference Endpoints API. For most teams, TEI is the best production choice: it handles batching and GPU utilisation automatically, and can reduce embedding latency by 2–4x compared to a naive PyTorch serve loop. Check the TEI documentation for the backends and precision options available for your model architecture. Deploy your fine-tuned model by pointing TEI at your model directory or Hub repo; no code changes are needed.

Version your embedding models carefully. When you update an embedding model, all previously computed document embeddings become stale — they were computed with the old model and are no longer comparable to embeddings from the new model. This means a model update requires re-embedding the entire corpus, which for large corpora (millions of documents) is a multi-hour operation. Plan model updates during low-traffic windows, maintain the old and new embedding indexes in parallel during the transition, and use a feature flag or weighted routing to shift traffic gradually to the new index while validating retrieval quality. Treating embedding model versions as breaking changes in your data pipeline prevents the subtle retrieval degradation that happens when old and new embeddings are mixed in the same index.
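
Two pieces of that rollout are easy to sketch: a guard that refuses to query an index with the wrong model version, and deterministic weighted routing between the old and new indexes. Both helpers below are hypothetical illustrations; the version strings and the hash-bucket scheme are assumptions, not part of any particular serving stack.

```python
import hashlib

def check_compatible(query_model_version, index_model_version):
    """Refuse to compare embeddings produced by different model versions."""
    if query_model_version != index_model_version:
        raise ValueError(
            f"query model {query_model_version!r} does not match "
            f"index model {index_model_version!r}"
        )

def pick_index(query_id, new_index_share):
    """Deterministically route a query to the 'old' or 'new' index.

    new_index_share is the fraction of traffic (0.0 to 1.0) sent to the new index.
    Hashing the query id keeps the routing stable across retries.
    """
    digest = hashlib.sha256(query_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # uniform in [0, 1)
    return "new" if bucket < new_index_share else "old"

route_none = pick_index("user-42:query-7", 0.0)   # rollout not started
route_all = pick_index("user-42:query-7", 1.0)    # rollout complete
route_a = pick_index("user-42:query-7", 0.25)
route_b = pick_index("user-42:query-7", 0.25)     # same input, same route
```

Raising `new_index_share` in small increments while comparing retrieval metrics between the two routes gives a gradual, reversible cutover. The version check catches the case where a query encoder is upgraded before its index has been rebuilt.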
