Why LlamaIndex for Production RAG?
LlamaIndex is a data framework built for connecting LLMs to external data sources. While LangChain is a general-purpose LLM application framework, LlamaIndex concentrates on the ingestion, indexing, and retrieval pipeline: the components that determine how well a RAG application works in production. Its abstractions for document loading, chunking strategies, embedding models, vector stores, and query engines are purpose-built for retrieval, which tends to make them more mature and flexible than the equivalent LangChain components for complex RAG use cases.
A production RAG pipeline has six main components: document ingestion (loading and parsing raw documents), chunking (splitting documents into retrievable units), embedding (converting chunks to vectors), indexing (storing vectors in a searchable database), retrieval (finding relevant chunks for a query), and generation (producing an answer grounded in the retrieved context). LlamaIndex provides well-tested implementations of all six, with connectors to over 160 data sources and support for most major vector databases and LLM providers.
Installation and Basic Setup
pip install llama-index llama-index-llms-anthropic llama-index-embeddings-openai
pip install llama-index-vector-stores-postgres llama-index-retrievers-bm25 psycopg2-binary
import os
os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-key"
os.environ["OPENAI_API_KEY"] = "your-openai-key" # for embeddings
from llama_index.core import Settings
from llama_index.llms.anthropic import Anthropic
from llama_index.embeddings.openai import OpenAIEmbedding
# Global settings — applied to all index and query operations
Settings.llm = Anthropic(model="claude-sonnet-4-6", max_tokens=1024)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.chunk_size = 512
Settings.chunk_overlap = 50
Document Ingestion and Indexing
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
# Load documents from a directory
documents = SimpleDirectoryReader("./docs", recursive=True).load_data()
print(f"Loaded {len(documents)} documents")
# Parse into nodes (chunks) with overlap
parser = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = parser.get_nodes_from_documents(documents)
print(f"Split into {len(nodes)} chunks")
# Build a vector index (stores embeddings in memory for development)
index = VectorStoreIndex(nodes, show_progress=True)
# Persist to disk for reuse
index.storage_context.persist(persist_dir="./index_storage")
For production, replace the in-memory vector store with a persistent database. LlamaIndex supports pgvector (PostgreSQL), Pinecone, Weaviate, Qdrant, Chroma, and others through pluggable store backends.
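To make the chunk_size and chunk_overlap settings used above more concrete, here is a minimal sketch of a sliding window over a token list. It is an illustration only, not SentenceSplitter's actual algorithm (which also tries to respect sentence boundaries); the function name and the toy token list are hypothetical.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Slide a fixed-size window over tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Hypothetical toy input: 1,200 placeholder "tokens"
tokens = [f"t{i}" for i in range(1200)]
chunks = split_with_overlap(tokens, chunk_size=512, chunk_overlap=50)

print([len(chunk) for chunk in chunks])
# Consecutive chunks share chunk_overlap tokens at their boundary
print(chunks[0][-50:] == chunks[1][:50])
```

The overlap means a sentence that straddles a chunk boundary is still likely to appear whole in at least one chunk, at the cost of embedding some tokens twice.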
Using pgvector for Production Storage
from llama_index.vector_stores.postgres import PGVectorStore
from llama_index.core import StorageContext, VectorStoreIndex
# Connection details are placeholders; load real credentials from configuration
vector_store = PGVectorStore.from_params(
    database="ragdb", host="localhost", port="5432",
    user="user", password="password",
    table_name="document_embeddings",
    embed_dim=1536,  # text-embedding-3-small dimension
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# Build index — embeddings stored in postgres
index = VectorStoreIndex(nodes, storage_context=storage_context, show_progress=True)
# Reload existing index without re-embedding
index = VectorStoreIndex.from_vector_store(
vector_store=vector_store,
storage_context=StorageContext.from_defaults(vector_store=vector_store)
)
pgvector is the most practical production vector store for most teams: it runs in the same Postgres instance you likely already operate, requires no additional managed service, and supports all standard SQL operations alongside vector similarity search. As a rough guide, pgvector performs well for corpora under about 10 million chunks, provided the embedding column is indexed appropriately. Above that scale, dedicated vector databases like Pinecone or Qdrant may provide better query performance, so benchmark on your own data before migrating.
Query Engines and Retrieval Modes
from llama_index.core import QueryBundle
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor
# Basic retrieval — top-5 most similar chunks
retriever = VectorIndexRetriever(index=index, similarity_top_k=5)
# Filter low-similarity results (score below 0.7)
postprocessor = SimilarityPostprocessor(similarity_cutoff=0.70)
# Full query engine with response synthesis
query_engine = RetrieverQueryEngine(
retriever=retriever,
node_postprocessors=[postprocessor]
)
response = query_engine.query("What is our refund policy for enterprise contracts?")
print(response.response)
print("\nSource nodes:")
for node in response.source_nodes:
    print(f"  Score: {node.score:.3f} | Source: {node.metadata.get('file_name', 'unknown')}")
The similarity threshold is one of the most important parameters to tune. Too low (0.5) and you retrieve irrelevant chunks that confuse the LLM. Too high (0.9) and you miss relevant content that is phrased differently from the query. For most document corpora, 0.70–0.80 is a reasonable starting range, but score distributions vary between embedding models, so check the scores your own setup actually produces before fixing a cutoff. Tune it by examining the retrieved nodes for a representative set of queries and adjusting until the retrieved content is consistently relevant.
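The tuning loop described above can be sketched as a simple sweep: label a sample of retrieved chunks as relevant or not, then compare precision and recall at several candidate cutoffs. The labelled scores below are hypothetical, and sweep_cutoffs is an illustrative helper rather than a LlamaIndex function.

```python
def sweep_cutoffs(scored_results, cutoffs):
    """scored_results: (similarity_score, is_relevant) pairs from manual review.

    Returns {cutoff: (precision, recall)} for chunks kept at or above each cutoff.
    """
    total_relevant = sum(1 for _, relevant in scored_results if relevant)
    report = {}
    for cutoff in cutoffs:
        kept = [relevant for score, relevant in scored_results if score >= cutoff]
        hits = sum(kept)
        precision = hits / len(kept) if kept else 0.0
        recall = hits / total_relevant if total_relevant else 0.0
        report[cutoff] = (round(precision, 2), round(recall, 2))
    return report


# Hypothetical hand-labelled retrieval results for a set of test queries
labelled = [
    (0.91, True), (0.84, True), (0.78, True), (0.74, False),
    (0.72, True), (0.66, False), (0.58, False), (0.52, False),
]

report = sweep_cutoffs(labelled, [0.5, 0.7, 0.8, 0.9])
for cutoff, (precision, recall) in report.items():
    print(f"cutoff={cutoff:.2f} precision={precision:.2f} recall={recall:.2f}")
```

In this toy data a cutoff of 0.7 keeps every relevant chunk while discarding most of the noise, whereas 0.9 keeps only one of the four relevant chunks. Real corpora will show a different trade-off, which is the reason to measure it.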
Advanced Retrieval: Hybrid Search
Pure semantic (vector) search excels at finding conceptually similar content but can miss exact keyword matches — important for queries containing specific identifiers, product names, or technical terms. Hybrid search combines vector similarity with BM25 keyword search, taking the best of both approaches:
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.retrievers import QueryFusionRetriever
# BM25 keyword retriever
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=5)
# Vector semantic retriever
vector_retriever = VectorIndexRetriever(index=index, similarity_top_k=5)
# Fusion retriever — combines both with reciprocal rank fusion
hybrid_retriever = QueryFusionRetriever(
retrievers=[vector_retriever, bm25_retriever],
similarity_top_k=5,
num_queries=1, # 1 = use only the original query; higher values add LLM-generated rewrites
mode="reciprocal_rerank",
use_async=True
)
query_engine = RetrieverQueryEngine(retriever=hybrid_retriever)
response = query_engine.query("What are the SLA terms in contract #45892?")
Hybrid search consistently outperforms pure vector search on real-world document corpora, particularly when queries contain specific identifiers, dates, names, or technical terminology that semantic similarity alone handles poorly.
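For intuition about how reciprocal rank fusion merges the two result lists, here is a minimal standalone sketch. It is not the library's internal code; the constant k=60 is a commonly used default and is an assumption here, as are the chunk IDs.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60, top_k=5):
    """Fuse several ranked lists of IDs: each appearance adds 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:top_k]


# Hypothetical ranked results from the two retrievers
vector_hits = ["chunk_a", "chunk_b", "chunk_c"]
bm25_hits = ["chunk_d", "chunk_b", "chunk_a"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

Because the fusion uses only ranks, it sidesteps the problem that BM25 scores and cosine similarities live on different scales. Chunks that appear in both lists rise to the top even if neither retriever ranked them first.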
Response Synthesis Modes
LlamaIndex offers several response synthesis strategies that control how retrieved context is combined into a final answer. The default mode, compact, fits as many chunks as possible into each LLM call. The refine mode iteratively refines the answer chunk by chunk, which gives better quality for complex multi-document queries but is more expensive. The tree_summarize mode builds a summary tree bottom-up, which is useful for very long document sets. The accumulate mode generates an individual answer per chunk and then combines them, which suits list-style questions where each chunk might contribute a separate answer.
from llama_index.core.response_synthesizers import ResponseMode
# Refine mode for highest quality on complex queries
query_engine = index.as_query_engine(
response_mode=ResponseMode.REFINE,
similarity_top_k=8
)
# Tree summarise for large document sets
query_engine = index.as_query_engine(
response_mode=ResponseMode.TREE_SUMMARIZE,
similarity_top_k=10
)
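The cost difference between the compact and refine modes comes down to how many LLM calls each makes. The sketch below approximates this with a greedy packer; it ignores prompt-template tokens and is not the library's implementation. The chunk sizes and the 3,000-token budget are hypothetical.

```python
def pack_chunks(chunk_token_counts, context_budget):
    """Greedily group consecutive chunks so each group fits within context_budget tokens."""
    groups, current, used = [], [], 0
    for index, size in enumerate(chunk_token_counts):
        if current and used + size > context_budget:
            groups.append(current)
            current, used = [], 0
        current.append(index)
        used += size
    if current:
        groups.append(current)
    return groups


# Hypothetical retrieval result: 8 chunks of 512 tokens each
chunk_sizes = [512] * 8

compact_groups = pack_chunks(chunk_sizes, context_budget=3000)
compact_calls = len(compact_groups)  # one call per packed group
refine_calls = len(chunk_sizes)      # one call per retrieved chunk

print(compact_groups)
print(f"compact: {compact_calls} calls, refine: {refine_calls} calls")
```

With similarity_top_k=8, refine makes roughly four times as many calls as compact in this toy setup, which is why it is best reserved for queries where the quality gain justifies the cost.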
Document Metadata and Filtering
Production RAG pipelines often need to scope retrieval to specific document subsets — by department, date range, document type, or access permission. LlamaIndex supports metadata filtering at query time:
from llama_index.core.vector_stores import MetadataFilter, MetadataFilters, FilterOperator
# Retrieve only from HR department documents published in 2025 or later
filters = MetadataFilters(filters=[
MetadataFilter(key="department", value="HR"),
MetadataFilter(key="year", value=2025, operator=FilterOperator.GTE)
])
query_engine = index.as_query_engine(
filters=filters, similarity_top_k=5
)
response = query_engine.query("What is the parental leave policy?")
Store metadata at ingestion time using LlamaIndex’s Document class with a metadata dict. Useful metadata fields for enterprise documents: source file path, document type (policy/contract/runbook), department, creation date, access level. These fields enable permission-aware retrieval — ensuring that user queries only retrieve documents they are authorised to access.
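To show what permission-aware filtering means in practice, here is a small pure-Python sketch of AND-combined metadata conditions applied to chunk metadata. It mirrors the idea behind metadata filters rather than LlamaIndex's API. The operator table, chunk IDs, and access levels are all hypothetical.

```python
import operator

# Comparison operators supported by this sketch (a small, hypothetical subset)
OPERATORS = {
    "==": operator.eq,
    ">=": operator.ge,
    "in": lambda value, allowed: value in allowed,
}


def matches(metadata, conditions):
    """Return True only if every (key, op, expected) condition holds (AND semantics)."""
    return all(
        key in metadata and OPERATORS[op](metadata[key], expected)
        for key, op, expected in conditions
    )


# Hypothetical chunks carrying the metadata fields described above
chunks = [
    {"id": "hr-2025", "metadata": {"department": "HR", "year": 2025, "access_level": "all-staff"}},
    {"id": "hr-2023", "metadata": {"department": "HR", "year": 2023, "access_level": "all-staff"}},
    {"id": "hr-exec", "metadata": {"department": "HR", "year": 2026, "access_level": "executive"}},
    {"id": "legal-2025", "metadata": {"department": "Legal", "year": 2025, "access_level": "all-staff"}},
]

# The access levels would come from the authenticated user's session
user_access_levels = {"all-staff"}
conditions = [
    ("department", "==", "HR"),
    ("year", ">=", 2025),
    ("access_level", "in", user_access_levels),
]

visible = [chunk["id"] for chunk in chunks if matches(chunk["metadata"], conditions)]
print(visible)
```

The key design point is that the access condition is derived from the user's identity on the server side and applied before retrieval, so restricted chunks never reach the LLM context.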
Evaluating RAG Quality
LlamaIndex includes built-in evaluation modules that measure retrieval and generation quality:
from llama_index.core.evaluation import (
FaithfulnessEvaluator,
RelevancyEvaluator,
CorrectnessEvaluator
)
faithfulness_evaluator = FaithfulnessEvaluator() # Is answer supported by context?
relevancy_evaluator = RelevancyEvaluator() # Is context relevant to query?
query = "What is our data retention policy?"
response = query_engine.query(query)
faith_result = faithfulness_evaluator.evaluate_response(response=response)
rel_result = relevancy_evaluator.evaluate_response(query=query, response=response)
print(f"Faithfulness: {faith_result.score} — {faith_result.feedback}")
print(f"Relevancy: {rel_result.score} — {rel_result.feedback}")
Run these evaluators on a curated set of test questions with known correct answers: the RAG evaluation dataset. By default each evaluator returns a pass/fail judgement per response (a score of 1.0 or 0.0), so the thresholds below apply to the average score across the test set. An average faithfulness score below 0.8 indicates hallucination (answers are not grounded in the retrieved context). An average relevancy score below 0.75 indicates poor retrieval (the retrieved chunks are not relevant to the query). Both metrics should be tracked continuously in production alongside cost and latency.
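Tracking these metrics across a test set can be as simple as averaging per-query scores and comparing them with the thresholds mentioned above. The sketch below assumes you have already collected a list of scores per metric. The helper names and the sample scores are hypothetical.

```python
def pass_rate(scores):
    """Average of per-query scores (for example 1.0 for pass, 0.0 for fail)."""
    return sum(scores) / len(scores) if scores else 0.0


def check_quality(faithfulness_scores, relevancy_scores,
                  min_faithfulness=0.8, min_relevancy=0.75):
    """Summarise both metrics and list any that fall below their threshold."""
    summary = {
        "faithfulness": pass_rate(faithfulness_scores),
        "relevancy": pass_rate(relevancy_scores),
    }
    alerts = []
    if summary["faithfulness"] < min_faithfulness:
        alerts.append("faithfulness below threshold: possible hallucination")
    if summary["relevancy"] < min_relevancy:
        alerts.append("relevancy below threshold: retrieval needs tuning")
    return summary, alerts


# Hypothetical per-query scores from a 10-question evaluation set
faithfulness_scores = [1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
relevancy_scores = [1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0]

summary, alerts = check_quality(faithfulness_scores, relevancy_scores)
print(summary)
print(alerts)
```

Running a check like this on every deployment, and on a sample of production traffic, turns the thresholds into a regression test rather than a one-off measurement.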
Incremental Index Updates
Production document corpora change — new documents are added, existing ones are updated or deleted. LlamaIndex supports incremental updates that avoid re-embedding the entire corpus on every change:
from llama_index.core import StorageContext
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
pipeline = IngestionPipeline(
transformations=[
SentenceSplitter(chunk_size=512, chunk_overlap=50),
Settings.embed_model,
],
vector_store=vector_store,
docstore=storage_context.docstore # tracks which docs are already indexed
)
# Only embeds and indexes new/changed documents
new_docs = SimpleDirectoryReader("./new_docs").load_data()
nodes = pipeline.run(documents=new_docs)
print(f"Indexed {len(nodes)} new chunks")
The ingestion pipeline uses document checksums to detect changes and skip re-embedding unchanged documents. For a corpus of 100,000 documents where 100 change daily, this avoids re-embedding 99,900 documents on each update — reducing both embedding API costs and index update time.
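The change-detection idea can be illustrated with a few lines of standard-library Python: store a content hash per document ID and re-embed only when the hash differs. This is a conceptual sketch rather than the IngestionPipeline's internal code, and it assumes document IDs are stable between runs. The function names and sample documents are hypothetical.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_update(seen_hashes, incoming):
    """Decide which documents need (re)embedding.

    seen_hashes: {doc_id: hash} recorded on the previous run.
    incoming: {doc_id: text} for the current run.
    Returns (doc_ids_to_embed, updated_hash_map).
    """
    to_embed = []
    updated = dict(seen_hashes)
    for doc_id, text in incoming.items():
        digest = content_hash(text)
        if seen_hashes.get(doc_id) != digest:  # new or changed document
            to_embed.append(doc_id)
            updated[doc_id] = digest
    return to_embed, updated


# Hypothetical state from a previous run, followed by a new batch
seen = {"a.md": content_hash("version 1"), "b.md": content_hash("version 1")}
incoming = {"a.md": "version 1", "b.md": "version 2", "c.md": "brand new"}

to_embed, seen = plan_update(seen, incoming)
print(to_embed)
```

Only the changed and new documents are returned, so embedding cost scales with the size of the change rather than the size of the corpus.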
Production Deployment Checklist
Before deploying a LlamaIndex RAG pipeline to production, verify the following:
Persistent vector storage (not in-memory) with appropriate backup policies.
Metadata filtering implemented for any access control requirements.
Similarity thresholds tuned on representative test queries.
Hybrid search enabled if the corpus contains specific identifiers or technical terms.
Evaluation metrics (faithfulness, relevancy) automated and logged for production queries.
Incremental update pipeline for corpus maintenance without full re-indexing.
Response citation included in the UI so users can verify answers against source documents.
Rate limiting and authentication on the query endpoint.
Latency and cost monitoring integrated with your observability stack.
With these in place, LlamaIndex-based RAG pipelines are robust enough for production document Q&A workloads at significant scale.
Sub-Question Query Engine for Complex Questions
Many real-world queries cannot be answered from a single document chunk — they require synthesising information across multiple sources. LlamaIndex’s sub-question query engine breaks complex queries into targeted sub-questions, routes each to the appropriate index, and combines the answers into a final response:
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.query_engine import SubQuestionQueryEngine
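# Create separate indexes per document collection
# (policy_nodes, contract_nodes and faq_nodes are node lists parsed from each collection, as shown earlier)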
policy_index = VectorStoreIndex(policy_nodes)
contract_index = VectorStoreIndex(contract_nodes)
faq_index = VectorStoreIndex(faq_nodes)
tools = [
QueryEngineTool(query_engine=policy_index.as_query_engine(),
metadata=ToolMetadata(name="policy_docs", description="Company policies and procedures")),
QueryEngineTool(query_engine=contract_index.as_query_engine(),
metadata=ToolMetadata(name="contracts", description="Customer and vendor contracts")),
QueryEngineTool(query_engine=faq_index.as_query_engine(),
metadata=ToolMetadata(name="faq", description="Frequently asked questions")),
]
sub_question_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)
response = sub_question_engine.query(
"What is our refund policy for enterprise contracts, and are there any exceptions listed in the FAQ?"
)
The engine automatically determines which indexes to query for each sub-question, executes them in parallel where possible, and synthesises a coherent combined answer with citations from multiple sources. This pattern is particularly valuable for knowledge management applications where different document types live in separate collections but users ask questions that span all of them.
Chat Engine for Multi-Turn Conversations
For conversational RAG applications where users ask follow-up questions that reference previous turns, LlamaIndex provides chat engines that maintain conversation history and use it to contextualise retrieval:
from llama_index.core.chat_engine import CondensePlusContextChatEngine
from llama_index.core.memory import ChatMemoryBuffer
memory = ChatMemoryBuffer.from_defaults(token_limit=4096)
chat_engine = CondensePlusContextChatEngine.from_defaults(
    retriever=retriever,
    memory=memory,
    llm=Settings.llm,
    context_prompt=(
        "You are a helpful assistant with access to company documents. "
        "Use the context below to answer questions accurately. "
        "Always cite the source document when referencing specific information.\n"
        "Context: {context_str}\n"
    ),
)
# Multi-turn conversation
response1 = chat_engine.chat("What is our remote work policy?")
print(response1.response)
response2 = chat_engine.chat("Are there any exceptions for contractors?")
# Automatically uses previous turn context to refine the query
print(response2.response)
The CondensePlusContextChatEngine condenses the conversation history into a standalone query before retrieval, ensuring that follow-up questions like “Are there any exceptions?” retrieve relevant context even without the full conversation history in the retrieval query. This is the most reliable chat engine mode for production conversational RAG — simpler alternatives that pass raw conversation history to the retriever often fail on follow-up questions.
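The condense step can be pictured as a small prompt-building function placed in front of the retriever. The sketch below is an illustration of the idea, not the chat engine's actual prompt or code. The prompt wording, the function names, and the stand-in fake_llm callable are all assumptions.

```python
def build_condense_prompt(history, follow_up):
    """Format prior turns plus the follow-up into a rewrite instruction."""
    transcript = "\n".join(f"{role}: {text}" for role, text in history)
    return (
        "Given the conversation below, rewrite the follow-up as a standalone question.\n\n"
        f"{transcript}\n\n"
        f"Follow-up: {follow_up}\n"
        "Standalone question:"
    )


def condense_question(history, follow_up, complete):
    """Return a standalone query; `complete` is any callable mapping prompt -> text."""
    if not history:
        return follow_up  # the first turn needs no rewriting
    return complete(build_condense_prompt(history, follow_up)).strip()


def fake_llm(prompt):
    # Stand-in for a real LLM call so the sketch runs offline
    return " Are there any exceptions to the remote work policy for contractors? "


history = [
    ("user", "What is our remote work policy?"),
    ("assistant", "Employees may work remotely up to three days per week."),
]
standalone = condense_question(history, "Are there any exceptions for contractors?", fake_llm)
print(standalone)
```

The retriever then searches with the standalone question, which names the policy explicitly, instead of the ambiguous follow-up.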
Streaming Responses for Better UX
For web applications where users wait for RAG responses, streaming the generation token-by-token dramatically improves perceived responsiveness:
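from fastapi import FastAPI
from fastapi.responses import StreamingResponse
app = FastAPI()
# Streaming has to be enabled on the query engine itself
streaming_engine = index.as_query_engine(streaming=True, similarity_top_k=5)
@app.post("/chat")
async def chat(query: str):
    async def generate():
        # aquery is awaited so the event loop is not blocked during retrieval
        streaming_response = await streaming_engine.aquery(query)
        async for token in streaming_response.async_response_gen():
            yield f"data: {token}\n\n"  # a blank line terminates each SSE event
        yield "data: [DONE]\n\n"
    return StreamingResponse(generate(), media_type="text/event-stream")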
Users typically see the first tokens within 300–500ms rather than waiting 3–8 seconds for the full response, a significant improvement in perceived performance for the same underlying retrieval and generation pipeline. Exact figures depend on the model, the prompt size, and network conditions.
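One detail worth handling when streaming over server-sent events is that a token may itself contain a newline, which would otherwise split the event. The helper below is a minimal framing sketch that follows the SSE wire format (each payload line prefixed with "data: ", events terminated by a blank line). The function names and the [DONE] sentinel handling are illustrative assumptions.

```python
def sse_event(payload):
    """Frame a payload as one SSE event, prefixing every line with 'data: '."""
    lines = payload.split("\n")
    return "".join(f"data: {line}\n" for line in lines) + "\n"


def sse_stream(tokens, done_marker="[DONE]"):
    """Yield one SSE event per token, followed by a final done marker."""
    for token in tokens:
        yield sse_event(token)
    yield sse_event(done_marker)


# Hypothetical token stream, including one token that contains a newline
events = list(sse_stream(["Refunds", " are processed.\nSee", " section 4."]))
for event in events:
    print(repr(event))
```

A generator like this can be passed straight to a streaming HTTP response; the client reassembles multi-line data fields by joining them with newlines.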
Common Production Issues and Fixes
Three problems recur frequently in production LlamaIndex deployments.
Retrieval returning irrelevant chunks: often caused by embeddings that fail to capture domain-specific terminology. Fix by evaluating retrieval quality with the RelevancyEvaluator on representative queries, then adjusting chunk size (smaller chunks are more semantically focused), raising the similarity threshold to filter noise, or switching to a domain-appropriate embedding model.
Answers not grounded in retrieved context: the LLM generates from training knowledge rather than retrieved documents. Fix by making the system prompt more explicit about using only the provided context, reducing temperature to zero for factual Q&A, and using the FaithfulnessEvaluator to detect and alert on grounding failures in production.
Slow query times: often caused by re-embedding the query on every request without caching, or by retrieving too many chunks and having the LLM process a large context. Fix by caching query embeddings for repeated queries, reducing top_k from default values to the minimum that maintains quality, and enabling async retrieval where multiple collections are queried in parallel.
Each of these issues has a clear diagnostic and fix. The challenge is identifying which one is responsible for a given quality or latency problem, which is why robust logging of retrieval results alongside generation outputs is essential for any production RAG system.
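One of the latency fixes mentioned above, caching query embeddings for repeated queries, can be sketched with the standard library alone. The wrapper below accepts any embedding function; fake_embed is a hypothetical stand-in for a real embedding API call, and the whitespace normalisation is an assumption about what counts as a repeated query.

```python
from functools import lru_cache


def make_cached_embedder(embed_fn, maxsize=1024):
    """Wrap embed_fn so repeated queries reuse a previously computed vector."""

    @lru_cache(maxsize=maxsize)
    def cached(normalised_query):
        # Tuples are immutable, so callers cannot corrupt the cached vector
        return tuple(embed_fn(normalised_query))

    def embed(query):
        # Collapse whitespace so trivially different strings share a cache entry
        return cached(" ".join(query.split()))

    embed.cache_info = cached.cache_info
    return embed


calls = []


def fake_embed(text):
    # Stand-in for a real embedding API call; records each invocation
    calls.append(text)
    return [float(len(text)), 0.0]


embed = make_cached_embedder(fake_embed)
first = embed("What is our refund policy?")
second = embed("  What is our   refund policy? ")

print(len(calls), embed.cache_info().hits)
```

An in-process cache like this only helps within a single worker. A shared cache would be needed to get the same effect across multiple processes.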
LlamaIndex vs LangChain: Which to Choose for RAG?
The most common question teams face when starting a RAG project is whether to use LlamaIndex or LangChain. The honest answer is that they serve overlapping but distinct use cases. LlamaIndex is the better choice when the core of your application is document ingestion, indexing, and retrieval — its abstractions for chunking strategies, index types, retrieval modes, and evaluation are more mature and purpose-built for this problem than LangChain’s equivalent components. LangChain is the better choice when you are building a broader application that combines RAG with chains, agents, and many external tool integrations, and where the flexibility of LangChain’s composable primitives is more valuable than LlamaIndex’s retrieval specialisation.
Many production teams use both: LlamaIndex for the retrieval pipeline (ingestion, indexing, retrieval, evaluation) and LangChain for the broader application logic (agents, chains, memory, tool integration). The two frameworks interoperate — you can use a LlamaIndex retriever as a LangChain tool and vice versa. Rather than treating the choice as binary, evaluate which framework’s abstractions are better suited to the specific component you are building, and use both where each is stronger.
For a team starting a net-new RAG project in 2026 with no existing framework dependencies, LlamaIndex is the more focused starting point. Its document loaders, chunking strategies, evaluation modules, and retrieval modes cover the full RAG lifecycle with less custom code than equivalent LangChain setups. The production deployment checklist earlier in this guide captures the most important considerations regardless of which framework you choose, and its principles apply equally to both.