What Are Embedding Models?
An embedding model converts text — a word, sentence, paragraph, or entire document — into a dense vector of floating-point numbers. That vector is a point in a high-dimensional space, and the key property is that texts with similar meanings end up close together in that space, while texts with different meanings end up far apart. This geometric relationship is what makes embeddings useful: measuring the distance or angle between two vectors tells you how semantically similar the underlying texts are, without any explicit keyword matching.
Embedding models are distinct from generative LLMs. They do not produce text. They produce representations — fixed-length numerical vectors that encode the meaning of the input. These representations are the foundation of a surprisingly large number of AI application patterns: semantic search, retrieval-augmented generation, clustering, classification, deduplication, recommendation systems, and anomaly detection all rely on embeddings at their core.
How Embedding Models Work
Most modern embedding models are transformer-based encoders trained on large text corpora with contrastive learning objectives. Training presents the model with pairs of texts, some semantically similar (a question and its correct answer, two paraphrases of the same sentence) and some dissimilar, and optimises it to produce vectors that are close for similar pairs and distant for dissimilar ones. A common objective is the InfoNCE loss, which the Sentence Transformers library implements as MultipleNegativesRankingLoss; it treats the other examples in a training batch as negatives.
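To make the contrastive objective concrete, here is a minimal sketch of an InfoNCE-style loss with in-batch negatives, written in plain Python for readability. The function name, the toy vectors, and the temperature of 0.05 are illustrative assumptions, not values taken from any particular model's training recipe.

```python
import math

def info_nce_loss(queries, documents, temperature=0.05):
    """Mean InfoNCE loss over a batch where documents[i] is the positive for queries[i].

    Every other document in the batch acts as a negative for that query.
    Vectors are assumed to be unit length, so the dot product is the cosine similarity.
    The temperature value is an illustrative assumption.
    """
    total = 0.0
    for i, query in enumerate(queries):
        # Similarity of this query to every document in the batch, scaled by temperature.
        logits = [
            sum(x * y for x, y in zip(query, doc)) / temperature
            for doc in documents
        ]
        # Log-sum-exp, shifted by the maximum logit for numerical stability.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        # Negative log-softmax probability assigned to the true positive.
        total += log_denominator - logits[i]
    return total / len(queries)

# Toy batch: each query points the same way as its own document.
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# Same vectors, but positives and negatives swapped.
swapped = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss is near zero when each query is closest to its own positive and large when a negative scores higher, which is the pressure that pulls matching pairs together during training.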
The result is a model whose output vectors capture semantic relationships with remarkable fidelity. Synonyms map to nearby points. Sentences about the same topic cluster together. Questions and their answers end up in similar regions of the space, even though they are phrased very differently. The dimensionality of the output vector — typically 384, 768, 1536, or 3072 dimensions depending on the model — determines the richness of the representation: more dimensions can encode finer distinctions, at the cost of more storage and computation.
Key Embedding Models in 2026
The embedding model landscape has matured significantly. A few models stand out for practical use:
text-embedding-3-small and text-embedding-3-large from OpenAI are among the most widely deployed commercial embedding models. The small variant outputs 1536-dimensional vectors and offers an excellent price-to-performance ratio for most applications. The large variant outputs 3072-dimensional vectors and performs better on difficult retrieval tasks, at several times the per-token price (roughly 6.5x at launch list prices).
Voyage AI embeddings, particularly voyage-3 and voyage-3-lite, have scored strongly on retrieval benchmarks such as MTEB (Massive Text Embedding Benchmark) and are a strong choice for retrieval-heavy applications where quality matters more than cost.
Cohere Embed v3 is notable for its input_type parameter: you declare whether a text is a search query or a document (with further types for classification and clustering), and the model embeds it accordingly. This improves retrieval accuracy when the query and document styles differ significantly.
BGE (BAAI General Embedding) models, particularly BGE-M3, are the top open-source options. BGE-M3 supports over 100 languages, handles inputs up to 8192 tokens, and produces embeddings competitive with the best commercial models. It can be run locally with reasonable hardware requirements.
Nomic Embed is another strong open-source option, with a permissive licence and performance that rivals commercial models on many benchmarks. It is a practical choice for applications where data privacy requires local processing.
Generating Embeddings in Practice
Using the OpenAI embeddings API is straightforward:
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str, model: str = "text-embedding-3-small") -> list[float]:
    """Return the embedding vector for a single text."""
    response = client.embeddings.create(input=text, model=model)
    return response.data[0].embedding

vector = embed("How do transformers handle long sequences?")
print(f"Dimension: {len(vector)}")  # 1536
For batch processing, send multiple texts in a single API call rather than calling one at a time:
texts = ["First document", "Second document", "Third document"]
response = client.embeddings.create(input=texts, model="text-embedding-3-small")
vectors = [item.embedding for item in response.data]
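For a large corpus, the batching idea can be sketched as a small helper that slices the input list into fixed-size groups, one group per request. The helper name and the batch size of 100 are illustrative assumptions; check your provider's documented per-request limits before choosing a size.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

corpus = [f"Document {i}" for i in range(250)]
batch_sizes = [len(batch) for batch in batched(corpus, 100)]
print(batch_sizes)

# Each batch would then be sent as one request, for example:
#   client.embeddings.create(input=batch, model="text-embedding-3-small")
```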
For local embeddings using Sentence Transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("BAAI/bge-m3")
vectors = model.encode(["First document", "Second document"], normalize_embeddings=True)
print(vectors.shape) # (2, 1024)
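Normalise embeddings to unit length if you intend to use the dot product as a stand-in for cosine similarity; cosine similarity itself divides out vector magnitude, so it gives the same result either way. OpenAI embeddings are returned normalised to unit length, but other APIs and local models may not normalise unless asked, as with normalize_embeddings=True above.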
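If you are unsure whether a model's output is already unit length, it is cheap to check and to normalise yourself. The following is a minimal dependency-free sketch; the function names and tolerance are illustrative.

```python
import math

def normalise(vector):
    """Scale a vector to unit L2 length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]

def is_unit_length(vector, tolerance=1e-6):
    """Return True if the vector's L2 norm is within `tolerance` of 1."""
    return abs(math.sqrt(sum(x * x for x in vector)) - 1.0) <= tolerance

raw = [3.0, 4.0]
unit = normalise(raw)
print(is_unit_length(raw), is_unit_length(unit))
```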
Similarity Metrics: Cosine, Dot Product, and Euclidean Distance
Three similarity metrics are commonly used with embeddings, and which one to use matters for both correctness and performance.
Cosine similarity measures the angle between two vectors, ignoring their magnitude. It ranges from -1 (opposite directions) to 1 (identical direction). For normalised vectors, cosine similarity equals the dot product, which is why normalisation matters. Cosine similarity is the right default for semantic similarity tasks because it captures directional alignment regardless of vector scale.
Dot product is computationally cheaper than cosine similarity and equivalent to it for unit-length vectors. Many vector databases use dot product internally for this reason. If your embeddings are normalised — which they should be — dot product and cosine similarity are interchangeable.
Euclidean distance (L2 distance) measures the straight-line distance between two points. It is sensitive to vector magnitude, which makes it less suitable for semantic similarity unless you are certain all your vectors have consistent norms. Some specialised models and tasks perform better with Euclidean distance, but cosine similarity is the safer default.
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# For pre-normalised vectors, this simplifies to:
def dot_product(a: list[float], b: list[float]) -> float:
    return float(np.dot(np.array(a), np.array(b)))
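The relationship between the three metrics can be checked numerically. The sketch below uses plain Python and toy two-dimensional vectors (both assumptions for illustration) to show that scaling a vector leaves cosine similarity unchanged while changing the dot product and Euclidean distance, and that for unit vectors the squared Euclidean distance equals 2 minus twice the cosine similarity.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 0.0]
b = [math.cos(math.pi / 3), math.sin(math.pi / 3)]  # unit vector, 60 degrees from a
b_scaled = [10.0 * x for x in b]  # same direction, ten times the magnitude

# Cosine ignores magnitude; dot product and Euclidean distance do not.
print("cosine:   ", cosine(a, b), cosine(a, b_scaled))
print("dot:      ", dot(a, b), dot(a, b_scaled))
print("euclidean:", euclidean(a, b), euclidean(a, b_scaled))

# For unit vectors: squared Euclidean distance = 2 - 2 * cosine similarity.
identity_gap = abs(euclidean(a, b) ** 2 - (2 - 2 * cosine(a, b)))
print("identity gap:", identity_gap)
```

One consequence is that, on normalised vectors, ranking neighbours by Euclidean distance and ranking them by cosine similarity produce the same order.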
Chunking Strategies for Long Documents
Embedding models have maximum input length limits — typically 512 to 8192 tokens depending on the model. Documents that exceed this limit must be split into chunks before embedding. How you chunk significantly affects retrieval quality.
Fixed-size chunking splits text into chunks of N tokens with an optional overlap. It is simple and fast but can split sentences and paragraphs mid-thought, creating chunks that lack coherent meaning.
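A minimal sketch of fixed-size chunking with overlap follows. It operates on a list of tokens; whitespace-separated words stand in for real model tokens here, which is an assumption made to keep the example dependency-free.

```python
def fixed_size_chunks(tokens, chunk_size, overlap=0):
    """Split `tokens` into windows of `chunk_size`, each sharing `overlap` tokens with the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end, to avoid a trailing fragment
        # that is entirely contained in the previous chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks

words = "one two three four five six seven eight nine ten".split()
chunks = fixed_size_chunks(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

In a real pipeline you would count tokens with the tokenizer of the embedding model you use, since word counts and token counts can differ substantially.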
Sentence-based chunking splits on sentence boundaries, producing semantically coherent chunks. The variable chunk sizes can complicate downstream processing but generally produce better retrieval results than fixed-size splitting.
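Sentence-based chunking can be sketched as greedy packing of whole sentences up to a size budget. The regular-expression sentence splitter and the word-count budget below are simplifying assumptions; a production system would more likely use a proper sentence segmenter and the model's tokenizer.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def sentence_chunks(text, max_words):
    """Pack whole sentences into chunks of at most `max_words` words.

    A single sentence longer than the budget becomes its own oversized chunk.
    """
    chunks, current, count = [], [], 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

text = (
    "Embeddings map text to vectors. Similar texts land close together. "
    "Chunking splits long documents. Each chunk is embedded separately."
)
result = sentence_chunks(text, max_words=10)
for chunk in result:
    print(chunk)
```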
Recursive character splitting — the approach used by LangChain’s RecursiveCharacterTextSplitter — tries increasingly fine-grained separators (paragraphs, then sentences, then words) until chunks fit within the target size. It is a practical default that balances coherence with size control.
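The recursive idea can be sketched in a few lines. This is a simplified illustration of the general technique, not LangChain's implementation: the separator list, the character-based size limit, and the choice to drop a separator that falls on a chunk boundary are all assumptions of this sketch.

```python
def recursive_split(text, max_chars, separators=("\n\n", ". ", " ")):
    """Split `text` into pieces of at most `max_chars`, trying coarse separators first.

    Separators that fall on a chunk boundary are dropped in this sketch.
    """
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # Nothing left to split on: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    head, *rest = separators
    pieces = text.split(head)
    if len(pieces) == 1:
        # This separator does not occur; try the next, finer one.
        return recursive_split(text, max_chars, rest)
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + head + piece
        if len(candidate) <= max_chars:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # The piece alone is too large: recurse with finer separators.
            chunks.extend(recursive_split(piece, max_chars, rest))
    if current:
        chunks.append(current)
    return chunks

sample = (
    "First paragraph is short.\n\n"
    "Second paragraph is a bit longer. It has two sentences."
)
pieces = recursive_split(sample, max_chars=40)
for piece in pieces:
    print(repr(piece))
```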
Semantic chunking groups sentences by their embedding similarity, keeping topically related content together and splitting at topic boundaries. It produces the most semantically coherent chunks but requires embedding every sentence to compute the groupings, making it significantly more expensive than syntactic approaches.
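One simple way to approximate semantic chunking is to embed each sentence and start a new chunk wherever the similarity between neighbouring sentences drops below a threshold. The sketch below takes the embedding function as a parameter; the keyword-based toy_embed and the 0.5 threshold are stand-ins for illustration only, not a real model or a recommended setting.

```python
import math

def cosine(a, b):
    denom = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / denom if denom else 0.0

def semantic_chunks(sentences, embed, threshold=0.5):
    """Group consecutive sentences, splitting where adjacent similarity falls below `threshold`."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]  # one embedding call per sentence
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([sentence])  # topic boundary: start a new chunk
        else:
            chunks[-1].append(sentence)
    return chunks

def toy_embed(sentence):
    """Hypothetical stand-in for a real embedding model: two keyword indicators."""
    lowered = sentence.lower()
    return [float("cat" in lowered), float("stock" in lowered)]

sentences = [
    "Cats sleep most of the day.",
    "A cat can jump very high.",
    "Stocks fell sharply today.",
    "The stock market may recover.",
]
groups = semantic_chunks(sentences, toy_embed)
for group in groups:
    print(group)
```

The per-sentence embedding call inside semantic_chunks is where the extra cost of this approach comes from.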
A practical starting point for most applications: chunks of 256–512 tokens with a 20% overlap, split on sentence boundaries. Measure retrieval quality with your actual queries and adjust from there.
Choosing an Embedding Model
Several factors should guide your choice.
Task type matters most. Retrieval tasks (finding documents relevant to a query) benefit from models trained with asymmetric query-document objectives, like Cohere Embed v3 or BGE with instruction prefixes. Similarity tasks (clustering, deduplication, classification) generally work well with symmetric models.
Dimensionality affects storage and search latency. 1536-dimensional vectors need four times the storage of 384-dimensional ones, and the cost of each similarity comparison grows linearly with dimensionality.
Context length determines the maximum chunk size. Models with 8192-token limits like BGE-M3 handle longer passages without requiring aggressive chunking.
Language support is critical for multilingual applications. Not all models perform equally across languages, and some are effectively English-only despite claims to the contrary.
Data privacy determines whether you can use API-hosted models or need to run locally.
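The storage side of the dimensionality trade-off is simple arithmetic. The sketch below assumes 32-bit floats (4 bytes per value) and ignores index structures and metadata, so treat the figures as a lower bound; the corpus size of one million vectors is an arbitrary example.

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Bytes needed to store the raw vectors alone, assuming float32 by default."""
    return num_vectors * dimensions * bytes_per_value

small = raw_vector_bytes(1_000_000, 384)
large = raw_vector_bytes(1_000_000, 1536)
print(f"384 dims:  {small / 1e9:.2f} GB")
print(f"1536 dims: {large / 1e9:.2f} GB")
print(f"ratio:     {large / small:.0f}x")
```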
The MTEB leaderboard at huggingface.co/spaces/mteb/leaderboard is the most widely used public reference for comparing model performance across different task types and languages. It benchmarks models across more than 50 datasets, which helps you shortlist candidates for your specific retrieval or similarity use case rather than relying on marketing claims.
Matryoshka Embeddings and Dimensionality Reduction
A recent development worth knowing is Matryoshka Representation Learning (MRL), which trains embedding models so that the first N dimensions of the vector are already a good representation, for any N. This means you can truncate a 1536-dimensional embedding to 256 dimensions and retain most of the semantic quality — reducing storage and search costs significantly with minimal accuracy loss. OpenAI’s text-embedding-3 models support MRL truncation natively via the dimensions parameter:
response = client.embeddings.create(
    input="Your text here",
    model="text-embedding-3-large",
    dimensions=256,  # Truncate from 3072 to 256 dimensions
)
# 12x storage reduction with modest quality trade-off
For applications where storage or search latency is a bottleneck, MRL truncation offers a practical lever for trading a small amount of accuracy for significant infrastructure savings. Benchmark the accuracy impact with your specific queries before committing to a truncated dimension in production.
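If you have already stored full-length vectors from an MRL-trained model, truncation can also be done client-side: keep the first N values and renormalise so that dot product and cosine similarity still agree. The following is a minimal sketch with an illustrative function name; it is only meaningful for models actually trained with MRL, since truncating other embeddings can degrade quality badly.

```python
import math

def truncate_embedding(vector, dimensions):
    """Keep the first `dimensions` values and rescale the result to unit length."""
    if not 0 < dimensions <= len(vector):
        raise ValueError("dimensions must be between 1 and len(vector)")
    head = list(vector[:dimensions])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head

full = [3.0, 4.0, 12.0]  # toy vector standing in for a real embedding
short = truncate_embedding(full, 2)
print(short)
```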
Fine-Tuning Embedding Models
Off-the-shelf embedding models are trained on general web text and perform well on general retrieval tasks. For specialised domains — legal documents, medical literature, proprietary internal knowledge bases, code — fine-tuning on domain-specific pairs can produce significant quality improvements. The training data consists of positive pairs (query, relevant document) and ideally hard negatives (query, plausible-but-wrong document), which teach the model the subtle distinctions that matter in your domain.
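One common way to obtain hard negatives is to score the corpus with the current model and pick the highest-ranked documents that are not labelled as relevant. The sketch below illustrates that idea with toy vectors; the function name and data are hypothetical, and in practice some mined negatives turn out to be unlabelled positives, so the results are worth spot-checking.

```python
def mine_hard_negatives(query_vec, doc_vecs, positive_ids, k=1):
    """Return indices of the k highest-scoring documents that are not known positives."""
    scored = [
        (sum(q * d for q, d in zip(query_vec, vec)), idx)
        for idx, vec in enumerate(doc_vecs)
        if idx not in positive_ids
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]

query = [1.0, 0.0]
docs = [
    [1.0, 0.0],  # index 0: labelled relevant
    [0.9, 0.1],  # index 1: close to the query but not labelled relevant
    [0.0, 1.0],  # index 2: clearly unrelated (an easy negative)
]
hard = mine_hard_negatives(query, docs, positive_ids={0}, k=1)
print(hard)
```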
The Sentence Transformers library makes fine-tuning accessible without deep ML expertise. You need a dataset of several thousand to tens of thousands of training pairs, a base model, and a GPU with at least 16GB of VRAM for comfortable fine-tuning. For smaller budgets, you can freeze the base model and train only a small projection (adapter) layer on top of its output, which is faster and requires less data while still capturing domain-specific signal.
Before fine-tuning, exhaust simpler options. Instruction prefixes (prepending “Represent this document for retrieval: ” or “Query: ” to your inputs) can significantly improve retrieval quality with BGE and similar models without any training; the exact wording is model-specific, so take it from the model card. Query expansion, where you generate multiple rephrasings of each query before embedding, improves recall at the cost of more embedding calls. These training-free improvements often close most of the gap that fine-tuning addresses, making fine-tuning most valuable when your domain vocabulary is highly specialised and genuinely absent from the base model’s training data.
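Both techniques are small amounts of glue code. In the sketch below, the prefix strings are the ones quoted above and should be treated as placeholders, since each model documents its own; the merge step assumes each rephrasing has already been searched and returned a list of (doc_id, score) pairs, and keeping the best score per document is just one of several reasonable merge rules.

```python
QUERY_PREFIX = "Query: "  # placeholder; use the prefix from your model's card
DOCUMENT_PREFIX = "Represent this document for retrieval: "  # placeholder

def with_prefix(texts, prefix):
    """Prepend an instruction prefix to every text before embedding."""
    return [prefix + text for text in texts]

def merge_expanded_results(result_lists):
    """Merge ranked (doc_id, score) lists from several rephrasings, keeping each document's best score."""
    best = {}
    for results in result_lists:
        for doc_id, score in results:
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

prefixed = with_prefix(["how do refunds work"], QUERY_PREFIX)
merged = merge_expanded_results([
    [("a", 0.9), ("b", 0.5)],  # results for the original query
    [("b", 0.7), ("c", 0.6)],  # results for a rephrasing
])
print(prefixed)
print(merged)
```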
Embedding Models in Production
A few operational concerns become important once embedding models move into production.
Embedding drift occurs when you update your embedding model: all existing embeddings in your vector store become incompatible with the new model’s vector space, requiring a full re-embedding of your entire corpus. Plan for this by version-tagging your embeddings and having a re-indexing pipeline ready before you update the model.
Latency for embedding API calls is typically 50–200ms for short texts, but batching reduces this significantly: embedding 100 texts in one call is much faster per item than 100 individual calls.
Cost scales with total input tokens across all your embedding calls. For large corpora, this can be significant, and caching embeddings for frequently seen texts pays off quickly. Store your embeddings persistently so you never embed the same text twice.
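Version tagging and caching can share one mechanism: key each stored vector by the model name plus a hash of the text. The sketch below uses an in-memory dictionary and a fake embedding function purely for illustration; the class name is hypothetical, and a real deployment would back the cache with a persistent store.

```python
import hashlib

class EmbeddingCache:
    """Cache embeddings under a key that includes the model name, so a model change never reuses stale vectors."""

    def __init__(self, embed_fn, model_name):
        self.embed_fn = embed_fn
        self.model_name = model_name
        self.store = {}  # stand-in for a persistent key-value store
        self.misses = 0

    def _key(self, text):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return f"{self.model_name}:{digest}"

    def get(self, text):
        key = self._key(text)
        if key not in self.store:
            self.misses += 1  # only cache misses trigger an embedding call
            self.store[key] = self.embed_fn(text)
        return self.store[key]

def fake_embed(text):
    """Hypothetical stand-in for a real embedding call."""
    return [float(len(text))]

cache = EmbeddingCache(fake_embed, model_name="model-v1")
first = cache.get("hello")
second = cache.get("hello")  # served from the cache, no second embedding call
print(first, second, cache.misses)
```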
Beyond Retrieval: Other Uses for Embeddings
Retrieval is the most common use case, but embeddings are useful in a broader set of scenarios that are worth having in your toolkit.
Clustering groups similar documents without labels, which is useful for discovering topics in a large corpus, identifying duplicate or near-duplicate content, or segmenting customers by the language they use.
Classification trains a lightweight classifier on top of embeddings as features, often achieving strong results with far less training data than fine-tuning a full language model.
Anomaly detection identifies inputs that are unusual relative to your training distribution, which is useful for detecting prompt injection attempts, out-of-distribution queries, or fraud signals.
Recommendation finds items semantically similar to what a user has engaged with, without needing explicit collaborative filtering signals.
Each of these applications uses the same underlying embedding infrastructure, making the investment in a solid embedding pipeline pay dividends across many different product features simultaneously.
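As a small illustration of how far the same vectors stretch, the sketch below builds a nearest-centroid classifier and a simple anomaly score from a handful of labelled embeddings. The two-dimensional vectors, the labels, and the choice of nearest-centroid over a trained classifier are all assumptions made to keep the example short.

```python
import math

def cosine(a, b):
    denom = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / denom if denom else 0.0

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def classify(vector, centroids):
    """Return the label whose centroid is most similar to `vector`."""
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))

def anomaly_score(vector, centroids):
    """Higher means less similar to every known class (1 minus the best cosine similarity)."""
    return 1.0 - max(cosine(vector, c) for c in centroids.values())

# Toy labelled embeddings standing in for real model output.
centroids = {
    "billing": centroid([[1.0, 0.1], [0.9, 0.0]]),
    "shipping": centroid([[0.0, 1.0], [0.1, 0.9]]),
}
label = classify([0.8, 0.2], centroids)
typical = anomaly_score([1.0, 0.05], centroids)
unusual = anomaly_score([-1.0, -1.0], centroids)
print(label, typical < unusual)
```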