Building a Retrieval Augmented Generation (RAG) Pipeline with LLMs

Large Language Models have transformed how we interact with information, but they come with a significant limitation: their knowledge is frozen at the time of training. When you ask an LLM about recent events, proprietary company data, or specialized domain knowledge, it simply cannot provide accurate answers because it has never seen that information. This is where Retrieval Augmented Generation fundamentally changes the game.

RAG represents a paradigm shift in how we deploy LLMs for practical applications. Instead of relying solely on the model’s pre-trained knowledge, RAG systems dynamically retrieve relevant information from external knowledge bases and feed it to the LLM as context. This approach solves the knowledge cutoff problem while dramatically reducing hallucinations and enabling LLMs to work with private, up-to-date, or domain-specific information that was never part of their training data.

Understanding the Core RAG Architecture

At its heart, a RAG pipeline consists of two distinct phases that work in concert to deliver accurate, contextually grounded responses. The first phase involves ingesting your knowledge base and preparing it for efficient retrieval. This means breaking down documents into manageable chunks, converting them into vector embeddings that capture semantic meaning, and storing them in a specialized vector database optimized for similarity search.

The second phase occurs at query time. When a user asks a question, the system converts that question into the same vector space as your documents, searches for the most semantically similar chunks, and passes both the original question and the retrieved context to the LLM. The model then generates a response grounded in the retrieved information rather than relying purely on its training data.

This architecture elegantly solves several problems simultaneously. It gives LLMs access to information they were never trained on, provides source attribution for generated responses, allows you to update your knowledge base without retraining the model, and dramatically reduces the computational cost compared to fine-tuning.

RAG Pipeline Flow

Documents (source data) → Embeddings (vector store) → Retrieval (find context) → LLM (generate answer)

Document Processing and Chunking Strategy

The foundation of any effective RAG system begins with how you process and chunk your documents. This step is far more nuanced than simply splitting text at arbitrary character counts. Your chunking strategy directly impacts retrieval quality, context relevance, and ultimately the accuracy of generated responses.

When you chunk too small, you risk fragmenting important context across multiple pieces, making it difficult for the retrieval system to find complete information. Imagine splitting a paragraph about database indexing strategies right in the middle: the first chunk mentions B-trees without explaining them, and the second chunk discusses their performance characteristics without establishing what they are. Neither chunk alone provides sufficient context for the LLM to generate a useful response.

Conversely, chunking too large creates different problems. Large chunks dilute the semantic focus, making it harder for the vector search to identify precisely relevant information. They also consume more of your LLM’s context window, limiting how many different sources you can include. A 2000-word chunk about cloud computing might contain one relevant paragraph about Kubernetes networking, but the remaining content about EC2 instances and S3 storage adds noise when your user asks specifically about pod-to-pod communication.

The ideal chunk size depends heavily on your content type and use case. Technical documentation with distinct subsections often works well with 500-800 token chunks that align with natural section boundaries. Customer support conversations might benefit from smaller 200-400 token chunks that capture individual exchanges. Research papers could use larger 800-1200 token chunks to maintain the flow of complex arguments.

Beyond size, consider implementing overlapping chunks where the end of one chunk includes the beginning of the next. A 100-150 token overlap ensures that concepts spanning chunk boundaries remain retrievable. You should also preserve document structure in your metadata, storing information like section headings, document titles, dates, and authors. This metadata becomes invaluable for filtering and ranking retrieval results.
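The overlapping-chunk idea above can be sketched in a few lines. For illustration this splits on whitespace "tokens"; a production system would count tokens with the tokenizer that matches your embedding model (for example tiktoken for OpenAI models):

```python
def chunk_text(text, chunk_size=500, overlap=100):
    # Split text into chunks of chunk_size tokens, where each chunk
    # repeats the last `overlap` tokens of the previous one so that
    # concepts spanning a boundary remain retrievable.
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With a chunk size of 4 and an overlap of 2, each chunk shares its last two tokens with the start of the next, so no boundary-spanning phrase is lost.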

Vector Embeddings and Semantic Search

Vector embeddings are the mathematical heart of RAG systems. When you convert text into embeddings, you are transforming human language into high-dimensional numerical vectors that capture semantic meaning. Words and phrases with similar meanings cluster together in this vector space, enabling semantic search that goes far beyond simple keyword matching.
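Closeness in this vector space is typically measured with cosine similarity. A minimal sketch using toy three-dimensional vectors (real embedding models produce hundreds or thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means they point in
    # the same direction, 0.0 means they are orthogonal (unrelated).
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Because the metric ignores vector magnitude, two chunks of very different lengths can still score as semantically close if their embeddings point in the same direction.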

The choice of embedding model significantly impacts your RAG system’s performance. Modern embedding models like OpenAI’s text-embedding-3 or open-source alternatives like BGE create vectors with hundreds or thousands of dimensions. These models have been trained on massive corpora to understand not just word similarity, but contextual relationships, synonyms, and even cross-lingual concepts.

When a user asks “How do I reset my password?”, a vector search can retrieve documents containing phrases like “recover account access” or “restore login credentials” even though they share no common keywords. This semantic understanding is what makes RAG systems feel intelligent and intuitive. The embedding model has learned that these concepts occupy nearby regions in vector space because they appear in similar contexts during training.

The vector database you choose for storing and searching these embeddings matters tremendously for production systems. Purpose-built vector databases like Pinecone, Weaviate, or Qdrant are optimized for approximate nearest neighbor search, which can find similar vectors among millions of entries in milliseconds. They use sophisticated indexing algorithms like HNSW (Hierarchical Navigable Small World) that dramatically reduce search time compared to naive approaches.

For production RAG systems, you need to consider how your vector database handles updates, scales across multiple nodes, and manages different indexes for different collections. You might maintain separate indexes for different document types or time periods, enabling more targeted retrieval. Some systems benefit from hybrid search approaches that combine vector similarity with traditional keyword matching, giving you both semantic understanding and exact term matching when needed.
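One way to sketch the hybrid approach is to blend the vector similarity score with a simple keyword-overlap score. The weighting scheme and the 0.7 default below are illustrative assumptions, not standard values:

```python
def hybrid_score(vector_score, query, chunk_text, alpha=0.7):
    # Blend semantic similarity with keyword overlap: alpha weights the
    # vector score, (1 - alpha) weights the fraction of query terms
    # that appear verbatim in the chunk.
    query_terms = set(query.lower().split())
    chunk_terms = set(chunk_text.lower().split())
    if not query_terms:
        return alpha * vector_score
    keyword_score = len(query_terms & chunk_terms) / len(query_terms)
    return alpha * vector_score + (1 - alpha) * keyword_score
```

Production systems usually replace the naive overlap term with a proper lexical ranker such as BM25, but the blending principle is the same.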

Building the Retrieval Component

The retrieval phase transforms a user query into relevant context for the LLM. This process involves multiple steps, each requiring careful tuning to balance precision and recall. When a query arrives, your system must first convert it into a vector using the same embedding model you used for your documents. This consistency is critical; mixing embedding models produces vectors in different spaces that cannot be meaningfully compared.

Once you have the query vector, you perform a similarity search against your vector database. The most common similarity metric is cosine similarity, which measures the angle between vectors regardless of their magnitude. A cosine similarity of 1.0 means the vectors point in the same direction (maximal semantic similarity), while 0.0 indicates no measured relationship. In practice, you typically retrieve the top k most similar chunks, where k ranges from 3 to 10 depending on your context window size and quality requirements.

However, raw similarity scores alone often prove insufficient for high-quality retrieval. Consider implementing reranking, a second-pass process where a more sophisticated model scores the initially retrieved candidates for actual relevance to the query. Cross-encoder models like those from Cohere or open-source alternatives can dramatically improve precision by understanding the interaction between query and document rather than treating them as independent vectors.

Metadata filtering adds another dimension to retrieval quality. Before or during vector search, you can filter candidates based on document attributes. If a user asks about recent policy changes, you might filter to documents from the last six months. For a technical support query, you might filter to documentation matching the user’s product version. This combination of semantic search and structured filtering provides powerful, targeted retrieval.

from openai import OpenAI
import numpy as np

client = OpenAI()

def retrieve_relevant_chunks(query, chunks_db, top_k=5):
    # Generate embedding for the query
    query_embedding = client.embeddings.create(
        input=query,
        model="text-embedding-3-small"
    ).data[0].embedding
    
    # OpenAI embeddings are normalized to unit length, so the dot
    # product is equivalent to cosine similarity here.
    similarities = []
    for chunk in chunks_db:
        similarity = np.dot(query_embedding, chunk['embedding'])
        similarities.append({
            'chunk': chunk['text'],
            'score': similarity,
            'metadata': chunk['metadata']
        })
    
    # Sort by similarity and return top k results
    similarities.sort(key=lambda x: x['score'], reverse=True)
    return similarities[:top_k]
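Building on the function above, metadata filtering can be applied before scoring. A minimal sketch for the "recent documents" case, assuming each chunk's metadata carries an ISO-format 'date' field (a hypothetical schema, not part of the code above):

```python
from datetime import datetime, timedelta

def filter_recent_chunks(chunks_db, max_age_days=180):
    # Keep only chunks whose metadata date falls within the window.
    # Assumes each chunk stores an ISO-format 'date' string in its
    # metadata; adapt the field name to your own schema.
    cutoff = datetime.now() - timedelta(days=max_age_days)
    return [
        chunk for chunk in chunks_db
        if datetime.fromisoformat(chunk['metadata']['date']) >= cutoff
    ]
```

In a real deployment this kind of filter is usually pushed down into the vector database itself (most support metadata filters during search) rather than applied in application code.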

Prompt Engineering for RAG

The prompt you construct for your LLM determines how effectively it uses retrieved context to generate responses. A poorly designed prompt leads to the model ignoring retrieved information, hallucinating despite having correct context, or failing to acknowledge when it lacks sufficient information to answer.

Your RAG prompt should establish clear instructions about how to handle the retrieved context. Begin by explicitly telling the model that it will receive relevant passages and should prioritize this information over its pre-trained knowledge. Instruct it to cite specific passages when making claims, which provides transparency and enables users to verify information. Crucially, tell the model to acknowledge uncertainty when the retrieved context does not contain sufficient information to answer the query confidently.

The structure of your prompt matters as much as its content. A typical RAG prompt includes system instructions establishing the model’s role and behavior, the retrieved context clearly demarcated and labeled, and the user’s question. Consider including metadata with each context chunk, like document titles and dates, which helps the model provide more specific, attributable responses.

Here’s an example of an effective RAG prompt structure:

def create_rag_prompt(query, retrieved_chunks):
    context = "\n\n".join([
        f"[Source: {chunk['metadata']['title']}]\n{chunk['text']}"
        for chunk in retrieved_chunks
    ])
    
    prompt = f"""You are a helpful assistant that answers questions based on the provided context.
Use the following context to answer the question. If the context doesn't contain enough 
information to answer fully, acknowledge what you can and cannot answer.

Context:
{context}

Question: {query}

Provide a clear, accurate answer based on the context above. Include references to specific 
sources when making claims."""
    
    return prompt

Optimizing Context Window Usage

Modern LLMs offer increasingly large context windows, with some models supporting over 100,000 tokens. However, more context is not always better. Research has shown that LLMs sometimes struggle to effectively use information buried in the middle of very long contexts, a phenomenon called “lost in the middle.” Your RAG system must strategically manage what context reaches the model.

Rather than simply retrieving the top k chunks and concatenating them, consider their combined relevance and diversity. If your top five retrieved chunks all come from the same document section, they likely contain redundant information. A more diverse set of chunks from different sources might provide better coverage of the topic, even if individual chunks score slightly lower on similarity.
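One common technique for trading raw relevance against diversity is Maximal Marginal Relevance (MMR). A greedy sketch, assuming you have precomputed query-chunk scores and a chunk-chunk similarity matrix (the 0.7 lambda default is an illustrative choice):

```python
def mmr_select(query_scores, pairwise_sim, k=3, lambda_=0.7):
    # Greedy Maximal Marginal Relevance: at each step pick the candidate
    # that balances relevance to the query against redundancy with the
    # chunks already selected. query_scores[i] is the query-chunk
    # similarity; pairwise_sim[i][j] is the chunk-chunk similarity.
    candidates = list(range(len(query_scores)))
    selected = []
    while candidates and len(selected) < k:
        def marginal(i):
            redundancy = max((pairwise_sim[i][j] for j in selected), default=0.0)
            return lambda_ * query_scores[i] - (1 - lambda_) * redundancy
        best = max(candidates, key=marginal)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With a near-duplicate pair of top chunks, MMR picks one of them and then prefers a less similar but still relevant chunk over the duplicate.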

Implement dynamic context selection based on the query type and retrieved content quality. For straightforward factual questions, two or three highly relevant chunks might suffice. For complex analytical questions requiring synthesis across multiple sources, you might include more chunks but summarize less critical ones to save tokens. Some advanced RAG systems use a smaller LLM to summarize retrieved chunks before passing them to the primary model, compressing information while preserving key facts.

The order in which you present retrieved chunks also matters. Placing the most relevant chunks at the beginning and end of your context, where LLMs tend to pay more attention, can improve response quality. Some practitioners achieve better results by placing the user’s question after the context rather than before, encouraging the model to thoroughly consider all provided information before formulating its response.
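The edge-placement idea above can be sketched as a simple reordering: take a best-first ranked list and rearrange it so the strongest chunks sit at the beginning and end of the context, with weaker ones in the middle:

```python
def reorder_for_attention(chunks_ranked):
    # Input is ordered best-first. Alternate chunks between the front
    # and the back of the context, so the highest-ranked material lands
    # at the edges where LLMs tend to pay the most attention.
    front, back = [], []
    for i, chunk in enumerate(chunks_ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

The best chunk stays first, the second-best moves to the very end, and the lowest-ranked chunks end up buried in the middle.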

Key RAG Performance Factors

Chunk Size: balance context completeness with retrieval precision.
Embedding Quality: better embeddings find more relevant context.
Retrieval Count: more chunks provide coverage but may add noise.
Prompt Design: clear instructions improve context utilization.

Evaluating and Iterating Your RAG Pipeline

Building a RAG system is only the beginning; continuous evaluation and refinement separate functional systems from truly effective ones. RAG evaluation requires assessing both retrieval quality and generation quality independently, as failures in either component cascade into poor user experiences.

For retrieval evaluation, you need labeled test queries with known relevant documents. Metrics like precision at k (what percentage of retrieved documents are relevant) and recall at k (what percentage of all relevant documents were retrieved) quantify retrieval performance. Mean Reciprocal Rank (MRR) measures how highly your system ranks the most relevant documents. Create a test set by sampling real user queries or synthetically generating questions from your documents, then manually labeling which documents should be retrieved.
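These retrieval metrics are straightforward to compute once you have labeled test queries. A minimal sketch:

```python
def precision_at_k(retrieved, relevant, k):
    # Fraction of the top-k retrieved documents that are relevant.
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    # Fraction of all relevant documents found in the top k.
    top = retrieved[:k]
    return sum(1 for doc in relevant if doc in top) / len(relevant)

def mean_reciprocal_rank(results):
    # results: list of (retrieved_list, relevant_set) pairs, one per
    # query. Each query contributes 1/rank of its first relevant hit
    # (0 if nothing relevant was retrieved).
    total = 0.0
    for retrieved, relevant in results:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(results)
```

Running these over a held-out query set after each configuration change gives you a concrete signal on whether a new chunking or embedding choice actually improved retrieval.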

Generation quality proves harder to measure objectively. You can assess whether generated responses are factually grounded in retrieved context by checking if claims appear in the source chunks. Evaluate citation accuracy by verifying that references point to passages that actually support the claims being made. For subjective qualities like helpfulness and clarity, human evaluation remains the gold standard, though you can use LLMs themselves as evaluators for scaled testing.

Common failure modes provide clues for system improvements. If users frequently ask follow-up questions seeking clarification, your chunks might be too small or your retrieval is missing key context. If responses often include disclaimers about insufficient information despite relevant documents existing, your retrieval strategy needs refinement. If responses contradict themselves or include information not in retrieved chunks, adjust your prompt to emphasize groundedness more strongly.

Implement logging and analytics from day one. Track which queries result in low confidence responses, which retrieved chunks are selected but not used in generation, and which document sections are frequently retrieved together. These patterns reveal opportunities for better chunking, additional metadata tags, or specialized handling for certain query types. Some teams find success with A/B testing different configurations against real user traffic to measure impact on satisfaction and task completion.

Conclusion

Building a production-ready RAG pipeline requires careful attention to each component, from initial document processing through final response generation. The choices you make about chunking strategies, embedding models, retrieval algorithms, and prompt design compound to determine your system’s ultimate effectiveness. While the basic architecture remains consistent across implementations, the optimal configuration depends heavily on your specific content, user needs, and quality requirements.

The investment in building a well-tuned RAG system pays substantial dividends. You gain an LLM application that remains current with your evolving knowledge base, provides attributable responses users can verify, and avoids the costly process of continuous model retraining. By mastering these core concepts and committing to iterative improvement based on real-world usage, you can build RAG systems that deliver reliable, accurate, and truly useful AI-powered experiences.
