How to Use pgvector with PostgreSQL for LLM Applications: A Complete Guide

What Is pgvector?

pgvector is a PostgreSQL extension that adds vector similarity search to the world’s most popular open-source relational database. Instead of running a separate vector database service alongside your Postgres instance, pgvector lets you store embeddings as a native column type and query them with familiar SQL syntax. This dramatically simplifies the infrastructure stack for RAG and semantic search applications — one database handles both your structured data and your vector search, with ACID transactions, familiar tooling, and no new operational overhead.

For most LLM applications processing up to 10–20 million vector embeddings, pgvector’s performance is excellent and its operational simplicity is unmatched. Teams at larger scale use it in conjunction with carefully tuned HNSW indexes to handle hundreds of millions of vectors. Even at scales where dedicated vector databases like Pinecone or Qdrant might edge it out on raw query speed, many teams choose pgvector for the operational advantages of keeping everything in a single database they already know how to manage.

Installation and Setup

# Install pgvector extension (Ubuntu/Debian)
sudo apt install postgresql-16-pgvector

# Or via Docker
docker run -d   --name pgvector-db   -e POSTGRES_PASSWORD=password   -p 5432:5432   pgvector/pgvector:pg16

-- Enable the extension in your database
CREATE EXTENSION IF NOT EXISTS vector;

-- Create a table with a vector column
CREATE TABLE document_embeddings (
    id SERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    metadata JSONB,
    embedding vector(1536),  -- dimension matches your embedding model
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Create an HNSW index for fast approximate nearest-neighbour search
CREATE INDEX ON document_embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

The m parameter (16) and ef_construction (64) control the HNSW index quality/speed trade-off. Higher values improve recall at the cost of index build time and memory. These defaults work well for most use cases — increase if you find recall is lower than expected.

Inserting Embeddings with Python

import psycopg2, json
from openai import OpenAI

openai_client = OpenAI()
conn = psycopg2.connect("postgresql://postgres:password@localhost:5432/ragdb")

def embed(text: str) -> list[float]:
    return openai_client.embeddings.create(
        input=text, model="text-embedding-3-small"
    ).data[0].embedding

def insert_document(content: str, metadata: dict):
    embedding = embed(content)
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO document_embeddings (content, metadata, embedding) VALUES (%s, %s, %s)",
            (content, json.dumps(metadata), embedding)
        )
    conn.commit()

# Insert documents
documents = [
    ("Our refund policy allows returns within 30 days.", {"source": "policy.pdf", "page": 1}),
    ("Enterprise contracts include a 99.9% SLA guarantee.", {"source": "enterprise_terms.pdf", "page": 3}),
]
for content, meta in documents:
    insert_document(content, meta)

Similarity Search

def search_similar(query: str, top_k: int = 5, threshold: float = 0.7) -> list[dict]:
    query_embedding = embed(query)
    with conn.cursor() as cur:
        cur.execute("""
            SELECT content, metadata,
                   1 - (embedding <=> %s::vector) AS cosine_similarity
            FROM document_embeddings
            WHERE 1 - (embedding <=> %s::vector) > %s
            ORDER BY cosine_similarity DESC
            LIMIT %s
        """, (query_embedding, query_embedding, threshold, top_k))
        rows = cur.fetchall()
    return [{"content": r[0], "metadata": r[1], "score": float(r[2])} for r in rows]

results = search_similar("What is the refund policy?")
for r in results:
    print(f"Score: {r['score']:.3f} | {r['content'][:80]}")

Three distance operators are available: <=> (cosine distance, most common for text), <-> (L2/Euclidean distance), and <#> (negative inner product). Cosine distance is the right default for text embeddings because it measures directional similarity regardless of vector magnitude.

Metadata Filtering

pgvector’s key advantage over dedicated vector databases is that standard SQL WHERE clauses work alongside vector search. Filter by any JSONB metadata field without any special configuration:

def search_with_filter(query: str, source_file: str, top_k: int = 5) -> list[dict]:
    query_embedding = embed(query)
    with conn.cursor() as cur:
        cur.execute("""
            SELECT content, metadata,
                   1 - (embedding <=> %s::vector) AS similarity
            FROM document_embeddings
            WHERE metadata->>'source' = %s
            ORDER BY similarity DESC
            LIMIT %s
        """, (query_embedding, source_file, top_k))
        return [{"content": r[0], "score": float(r[2])} for r in cur.fetchall()]

# Search only within policy documents
results = search_with_filter("refund policy", "policy.pdf")

You can filter on any JSONB path, combine multiple conditions with AND/OR, join with other tables, and use any PostgreSQL function alongside vector search. This flexibility is a significant advantage over dedicated vector databases that offer limited filtering options.

Using pgvector with LangChain

from langchain_community.vectorstores import PGVector
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = PGVector(
    connection_string="postgresql+psycopg2://postgres:password@localhost:5432/ragdb",
    embedding_function=embeddings,
    collection_name="document_embeddings",
    use_jsonb=True
)

# Add documents
texts = ["Our refund policy allows 30-day returns.", "Enterprise SLA guarantees 99.9% uptime."]
metadatas = [{"source": "policy.pdf"}, {"source": "enterprise_terms.pdf"}]
vectorstore.add_texts(texts=texts, metadatas=metadatas)

# Search
results = vectorstore.similarity_search_with_score("refund policy", k=5)
for doc, score in results:
    print(f"Score: {score:.3f} | {doc.page_content[:80]}")

# Use as RAG retriever
retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.7, "k": 5}
)

Using pgvector with LlamaIndex

from llama_index.vector_stores.postgres import PGVectorStore
from llama_index.core import StorageContext, VectorStoreIndex

vector_store = PGVectorStore.from_params(
    database="ragdb", host="localhost", port="5432",
    user="postgres", password="password",
    table_name="llama_embeddings",
    embed_dim=1536
)

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes, storage_context=storage_context)
query_engine = index.as_query_engine(similarity_top_k=5)

Performance Tuning

For optimal pgvector performance in production, a few configuration changes matter significantly. Increase maintenance_work_mem to 1–2 GB for faster HNSW index builds — this is a session-level setting you can apply just before creating the index. Set shared_buffers to 25% of your server’s RAM so frequently-accessed index pages stay in memory. For the HNSW search parameter ef_search, higher values improve recall at the cost of query latency — the default of 40 is balanced, increase to 100+ if recall is insufficient. Monitor query plans with EXPLAIN ANALYZE to confirm the vector index is being used rather than a sequential scan, which happens when the planner thinks scanning the full table is cheaper than using the index (usually only for very small tables).

Bulk Loading for Large Corpora

For inserting millions of embeddings efficiently, use COPY rather than individual INSERTs, and disable the index during bulk load:

-- Temporarily increase work memory for bulk operations
SET maintenance_work_mem = '2GB';

-- Drop index before bulk insert for much faster loading
DROP INDEX IF EXISTS document_embeddings_embedding_idx;

-- Bulk load via psycopg2's copy_from or executemany
with conn.cursor() as cur:
    cur.executemany(
        "INSERT INTO document_embeddings (content, metadata, embedding) VALUES (%s, %s, %s)",
        [(doc, json.dumps(meta), emb) for doc, meta, emb in batch]
    )
conn.commit()

-- Rebuild index after bulk load
CREATE INDEX ON document_embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

Bulk loading without the index and then building it once is typically 5–10x faster than inserting rows one at a time with the index maintained throughout. For corpora of millions of documents, this difference is hours versus days.

pgvector vs. Dedicated Vector Databases

The honest comparison: dedicated vector databases (Pinecone, Qdrant, Weaviate) offer better raw query performance at scale, more sophisticated approximate nearest-neighbour algorithms, and cloud-managed infrastructure with no operational overhead. pgvector wins on operational simplicity (one database instead of two), cost (no separate managed service), SQL flexibility (filter, join, aggregate alongside vector search), and transaction consistency (vectors updated atomically with related data). For teams with existing PostgreSQL infrastructure and corpora under 20 million vectors, pgvector is almost always the right starting point. Add a dedicated vector database only when you have measured that pgvector’s performance is insufficient for your query volume — which, for most production RAG applications, it never will be.

Hybrid Search: Combining Vector and Full-Text Search

PostgreSQL’s built-in full-text search (tsvector/tsquery) complements pgvector’s semantic search perfectly. Hybrid search combines both: vector similarity finds semantically related content, full-text search finds exact keyword matches. Reciprocal Rank Fusion (RRF) merges the two ranked result sets:

-- Add full-text search column
ALTER TABLE document_embeddings ADD COLUMN content_tsv tsvector
    GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;
CREATE INDEX ON document_embeddings USING GIN (content_tsv);

-- Hybrid search with RRF
WITH vector_results AS (
    SELECT id, content, ROW_NUMBER() OVER (ORDER BY embedding <=> %s::vector) AS rank
    FROM document_embeddings
    LIMIT 20
),
fts_results AS (
    SELECT id, content, ROW_NUMBER() OVER (ORDER BY ts_rank(content_tsv, query) DESC) AS rank
    FROM document_embeddings, plainto_tsquery('english', %s) query
    WHERE content_tsv @@ query
    LIMIT 20
)
SELECT COALESCE(v.content, f.content) AS content,
       (COALESCE(1.0/(60+v.rank), 0) + COALESCE(1.0/(60+f.rank), 0)) AS rrf_score
FROM vector_results v FULL OUTER JOIN fts_results f ON v.id = f.id
ORDER BY rrf_score DESC LIMIT 5;

Hybrid search consistently outperforms pure vector search for queries containing specific identifiers, names, dates, or technical terms that may not be well-represented in the embedding space. It is the recommended default retrieval strategy for production RAG applications over general document corpora.

Managing Embedding Model Migrations

When switching embedding models (e.g., from text-embedding-3-small to text-embedding-3-large, or from OpenAI to a self-hosted model), you must re-embed the entire corpus because vectors from different models are not comparable. Plan model migrations as a background job: add a new vector column for the new model, embed documents in batches running alongside production traffic, swap the search queries to use the new column once embedding is complete, then drop the old column. This zero-downtime migration pattern keeps the application serving queries throughout the re-embedding process.

Monitoring and Maintenance

Monitor pgvector query performance with pg_stat_statements — similarity search queries should complete in under 100ms for well-indexed tables of up to 10 million rows. Run VACUUM ANALYZE document_embeddings after bulk inserts to update table statistics and ensure the query planner uses the vector index correctly. Watch for index bloat in high-write workloads — HNSW indexes accumulate dead tuples that require periodic reindexing. Schedule an index rebuild during low-traffic windows if query performance degrades over time in high-write applications. With these operational habits in place, pgvector is a reliable, low-maintenance foundation for production LLM applications at significant scale.

Similarity Search and Distance Operators

pgvector provides three distance operators for querying vector similarity. The cosine distance operator <=> is the right default for text embeddings because it measures angular similarity — how similar two vectors are in direction — regardless of their magnitudes. L2 distance (<->) measures absolute distance in the vector space and is appropriate when magnitude carries meaning, such as with some image embeddings. Inner product (<#>) is used for maximum inner product search, appropriate when embeddings are normalised and you want the fastest possible query. For all practical text RAG applications, stick with cosine distance.

The similarity score returned by pgvector is a distance — lower means more similar for L2 and cosine. To convert cosine distance to a cosine similarity score between 0 and 1, subtract from 1: 1 - (embedding <=> query_vector). A score above 0.85 typically indicates high semantic similarity; below 0.70 the content is often loosely related or irrelevant. Calibrate your threshold by examining the retrieved results for representative queries on your actual corpus — the right cutoff varies by domain and embedding model.

Multi-Tenancy: Per-User Vector Isolation

For SaaS applications where different tenants should not access each other’s documents, add a tenant identifier to the table and filter on it in every query:

-- Add tenant column with index
ALTER TABLE document_embeddings ADD COLUMN tenant_id UUID NOT NULL;
CREATE INDEX ON document_embeddings (tenant_id);

-- Tenant-scoped similarity search
SELECT content, metadata, 1 - (embedding <=> %s::vector) AS similarity
FROM document_embeddings
WHERE tenant_id = %s  -- always filter by tenant first
  AND 1 - (embedding <=> %s::vector) > 0.70
ORDER BY similarity DESC
LIMIT 5;

PostgreSQL’s row-level security (RLS) can enforce tenant isolation at the database level — the database engine itself rejects queries that cross tenant boundaries even if the application layer forgets to add the filter. Enable RLS on the embeddings table and create a policy that restricts access to rows matching the current session’s tenant ID, set via a session variable at connection time. This provides defence-in-depth that is much harder to bypass accidentally than application-layer filtering alone.

Batch Embedding with Async Postgres

For high-throughput embedding pipelines, use async PostgreSQL with connection pooling to maximise insert throughput:

import asyncio, asyncpg
from openai import AsyncOpenAI

openai_client = AsyncOpenAI()

async def embed_batch(texts: list[str]) -> list[list[float]]:
    response = await openai_client.embeddings.create(
        input=texts, model="text-embedding-3-small"
    )
    return [item.embedding for item in response.data]

async def insert_batch(pool: asyncpg.Pool, rows: list[tuple]):
    async with pool.acquire() as conn:
        await conn.executemany(
            "INSERT INTO document_embeddings (content, metadata, embedding) VALUES ($1, $2, $3::vector)",
            rows
        )

async def process_corpus(documents: list[dict], batch_size: int = 100):
    pool = await asyncpg.create_pool("postgresql://user:pass@localhost/ragdb", min_size=5, max_size=20)
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i+batch_size]
        texts = [d["content"] for d in batch]
        embeddings = await embed_batch(texts)
        rows = [(d["content"], d["metadata"], emb) for d, emb in zip(batch, embeddings)]
        await insert_batch(pool, rows)
        print(f"Inserted {i+batch_size}/{len(documents)} documents")
    await pool.close()

This pipeline embeds 100 documents per OpenAI API call (the maximum batch size for text-embedding-3-small) and inserts them into PostgreSQL concurrently via a connection pool. For a corpus of 100,000 documents, this approach completes in roughly 10–15 minutes versus hours for sequential single-document processing.

Choosing the Right Embedding Model for pgvector

The embedding model determines both the vector dimension you configure for your pgvector column and the semantic quality of your search results. The dimension must match exactly — if you embed with text-embedding-3-small (1,536 dimensions), your column must be vector(1536). Switching embedding models after initial deployment requires re-embedding the entire corpus and potentially resizing the column, so the choice at setup time matters.

For most English-language document corpora, text-embedding-3-small is the right starting point — strong quality, 1,536 dimensions, and low cost. Text-embedding-3-large (3,072 dimensions) improves retrieval quality by 5–10% on hard cases at double the storage cost and roughly 7x the API cost. For multilingual corpora, BGE-M3 self-hosted (1,024 dimensions) delivers competitive quality across 100+ languages at near-zero per-embedding cost once deployed. The storage difference is significant at scale: 1 million 1,536-dimensional float32 vectors require roughly 6 GB of storage; 3,072-dimensional vectors require 12 GB. Both fit comfortably on modern hardware, but at 100 million vectors the difference becomes a meaningful infrastructure consideration.

pgvector vs. Dedicated Vector Databases

Dedicated vector databases (Pinecone, Qdrant, Weaviate) offer better raw query performance at extreme scale, more sophisticated ANN algorithms, and cloud-managed infrastructure. pgvector wins on operational simplicity, cost, SQL flexibility, and transaction consistency. For teams with existing PostgreSQL infrastructure and document corpora under 20 million vectors, pgvector is almost always the right starting point — add a dedicated vector database only when you have measured that pgvector’s query latency is insufficient at your specific scale and query volume. For most production RAG applications, that measurement will show pgvector is fast enough, and the operational benefits of staying with one database are worth keeping.

What Is pgvector?

Installation and Setup

Inserting Embeddings with Python

Similarity Search

Metadata Filtering

Using pgvector with LangChain

Using pgvector with LlamaIndex

Performance Tuning

Bulk Loading for Large Corpora

pgvector vs. Dedicated Vector Databases

Hybrid Search: Combining Vector and Full-Text Search

Managing Embedding Model Migrations

Monitoring and Maintenance

Similarity Search and Distance Operators

Multi-Tenancy: Per-User Vector Isolation

Batch Embedding with Async Postgres

Choosing the Right Embedding Model for pgvector

pgvector vs. Dedicated Vector Databases

Leave a Comment Cancel reply