How to Build a Private AI Assistant on Your Own Data (Step-by-Step)

Large language models like GPT-4 and Claude are impressive, but they don’t know anything about your company’s internal documents, your personal notes, or your proprietary data. Building a private AI assistant that can actually answer questions based on your specific information requires combining a local LLM with retrieval-augmented generation (RAG). This guide walks you through the complete process, from choosing hardware to querying your data, with working code and practical configurations you can implement today.

Understanding RAG: Why Your AI Needs It

Before diving into implementation, you need to understand why simply asking an LLM about your data doesn’t work and how RAG solves this problem.

The fundamental issue: LLMs are trained on data up to a certain cutoff date. They have no knowledge of your personal documents, company wikis, customer records, or any information created after their training. When you ask “What did the Q3 sales report say about the Midwest region?”, the model has no context to answer—it’s never seen that report.

The naive solution that fails: You might think “I’ll just paste my documents into the prompt!” This works for small amounts of data but fails quickly:

  • Context windows have limits (4K-128K tokens depending on the model)
  • Costs scale linearly with context size for API-based models
  • Processing huge contexts slows generation significantly
  • The model might ignore information buried deep in long contexts
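The linear cost scaling is easy to see with back-of-the-envelope arithmetic (the rate below is a placeholder for illustration, not any provider’s actual price):

```python
# Rough cost model: API providers bill per input token, so cost grows
# linearly with how much context you stuff into each prompt.
PRICE_PER_1K_INPUT_TOKENS = 0.01  # placeholder rate, not a real price

def prompt_cost(context_tokens: int, question_tokens: int = 50) -> float:
    """Cost of a single request at the placeholder rate."""
    total = context_tokens + question_tokens
    return total / 1000 * PRICE_PER_1K_INPUT_TOKENS

# Pasting a whole 100K-token document costs ~25x more per query
# than sending only ~4K tokens of retrieved chunks.
print(f"4K context: ${prompt_cost(4_000):.4f}")
print(f"100K context: ${prompt_cost(100_000):.4f}")
```

Whatever the actual per-token price, the ratio is what matters: retrieval keeps every query at the small end of that curve.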

How RAG solves this: Retrieval-augmented generation splits the process into two steps:

  1. Retrieval: When you ask a question, the system searches your documents and retrieves only the most relevant chunks
  2. Generation: The LLM receives your question plus only the relevant retrieved chunks, generating an answer grounded in your actual data

This approach keeps prompts small, focuses the model on relevant information, and scales to massive document collections.
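The two steps above can be sketched in a few lines. Here the “embedding” is a toy bag-of-words counter and the “LLM” is just a prompt-building stub, purely to show the retrieve-then-generate flow:

```python
from collections import Counter

documents = [
    "Q3 sales in the Midwest region grew 12% year over year.",
    "The remote work policy allows three days per week from home.",
    "Server maintenance is scheduled for the first Sunday each month.",
]

def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts."""
    return Counter(text.lower().split())

def similarity(a: Counter, b: Counter) -> int:
    """Overlap score: number of shared word occurrences."""
    return sum((a & b).values())

def retrieve(question: str, k: int = 1) -> list[str]:
    """Step 1: rank documents by similarity to the question."""
    q = embed(question)
    ranked = sorted(documents, key=lambda d: similarity(q, embed(d)), reverse=True)
    return ranked[:k]

def answer(question: str) -> str:
    """Step 2: build a grounded prompt; a real system sends this to the LLM."""
    context = "\n".join(retrieve(question))
    return f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"

print(answer("What did Q3 sales say about the Midwest region?"))
```

Real systems replace the word-overlap score with dense embeddings and a vector database, but the control flow is exactly this: search first, then generate with only what was found.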

System Architecture: What You’re Building

RAG System Components

1. Document Processing Pipeline
Loads documents → Splits into chunks → Generates embeddings → Stores in vector database
2. Query Pipeline
User question → Generate embedding → Search vector DB → Retrieve relevant chunks → Construct prompt → Generate answer
3. Local LLM
Runs on your hardware, processes prompts with retrieved context, generates answers
4. Vector Database
Stores document embeddings, performs similarity search, returns relevant chunks

Understanding this architecture helps you debug issues and optimize performance as you build your system.

Hardware Requirements and Recommendations

Running a private AI assistant requires sufficient resources for both the LLM and the embedding model.

Minimum configuration:

  • RAM: 16GB (barely sufficient for 7B models)
  • GPU: Optional but recommended—NVIDIA with 8GB+ VRAM or Apple Silicon
  • Storage: 50GB+ for models and vector database
  • CPU: Modern quad-core minimum

Recommended configuration:

  • RAM: 32GB (comfortable for 13B models)
  • GPU: NVIDIA RTX 3060 12GB or Apple M1 Pro/Max
  • Storage: 100GB+ SSD
  • CPU: 6+ cores

Why these specs matter:

The LLM needs memory to load model weights and KV cache. A 7B Q4 model requires ~5GB, while a 13B Q8 needs ~16GB. The embedding model adds another 1-2GB. Running both simultaneously plus your OS and applications necessitates adequate RAM.

GPU acceleration dramatically improves inference speed. A 7B model generates 10-15 tokens/second on CPU versus 40-60 tokens/second on a decent GPU. For interactive use, GPU acceleration transforms the experience from frustrating to fluid.
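You can estimate a model’s weight footprint from its parameter count and quantization width. This is a rough rule of thumb; real files add overhead for the KV cache, runtime, and metadata, which is why the totals quoted above run higher:

```python
def model_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight size: parameter count x bits per weight, in GB."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# 7B at 4-bit (Q4) -> ~3.5GB of weights; with KV cache and runtime
# overhead this lands near the ~5GB figure quoted above.
print(f"7B Q4:  {model_memory_gb(7, 4):.1f} GB weights")
print(f"13B Q8: {model_memory_gb(13, 8):.1f} GB weights")
```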

Step 1: Installing the Foundation Components

We’ll use a Python-based stack that’s well-supported and actively maintained.

Install Ollama for LLM Inference

Ollama provides the simplest path to running local LLMs:

# macOS/Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Windows: download from ollama.ai

Pull the models you’ll use:

# Main LLM for generation
ollama pull llama3.1:8b

# Embedding model for document processing
ollama pull nomic-embed-text

Why these models:

  • Llama 3.1 8B Instruct (the default llama3.1 tag): Excellent instruction-following, good quality, runs on modest hardware
  • nomic-embed-text: High-quality embeddings specifically designed for RAG, 768 dimensions

Verify installation:

ollama list
# Should show both models

Install Python Dependencies

Create a dedicated environment:

# Create virtual environment
python -m venv rag-assistant
source rag-assistant/bin/activate  # On Windows: rag-assistant\Scripts\activate

# Install required packages
pip install langchain langchain-community chromadb sentence-transformers

Package purposes:

  • langchain: Framework for building LLM applications, handles RAG orchestration
  • chromadb: Vector database for storing and searching embeddings
  • sentence-transformers: Embedding generation

Step 2: Preparing Your Documents

Document preparation significantly impacts retrieval quality. Poorly processed documents lead to irrelevant results and hallucinated answers.

Document Collection

Organize your documents in a single directory:

/documents
  ├── company_policies.pdf
  ├── Q3_sales_report.pdf
  ├── product_specifications.docx
  ├── meeting_notes.txt
  └── technical_documentation.md

Supported formats: PDF, DOCX, TXT, MD, HTML, CSV

Document Loading Code

Create document_loader.py:

from langchain_community.document_loaders import (
    DirectoryLoader,
    PyPDFLoader,
    TextLoader,
    UnstructuredWordDocumentLoader
)

def load_documents(directory_path):
    """Load all documents from a directory."""
    
    # PDF files
    pdf_loader = DirectoryLoader(
        directory_path,
        glob="**/*.pdf",
        loader_cls=PyPDFLoader
    )
    
    # Text files (Path.glob doesn't support brace patterns like {txt,md},
    # so plain-text and Markdown files need separate loaders)
    text_loader = DirectoryLoader(
        directory_path,
        glob="**/*.txt",
        loader_cls=TextLoader
    )
    
    # Markdown files
    md_loader = DirectoryLoader(
        directory_path,
        glob="**/*.md",
        loader_cls=TextLoader
    )
    
    # Word documents
    docx_loader = DirectoryLoader(
        directory_path,
        glob="**/*.docx",
        loader_cls=UnstructuredWordDocumentLoader
    )
    
    # Load all documents
    documents = []
    documents.extend(pdf_loader.load())
    documents.extend(text_loader.load())
    documents.extend(md_loader.load())
    documents.extend(docx_loader.load())
    
    print(f"Loaded {len(documents)} documents")
    return documents

# Usage
docs = load_documents("./documents")

This code handles multiple formats and combines them into a unified document list.

Text Chunking Strategy

Large documents must be split into chunks that fit within context windows while maintaining semantic coherence.

Chunking parameters that matter:

  • Chunk size: Number of characters per chunk
  • Chunk overlap: Characters shared between adjacent chunks
  • Separators: How to intelligently split text

from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_documents(documents):
    """Split documents into manageable chunks."""
    
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,        # ~250 tokens
        chunk_overlap=200,      # Maintain context between chunks
        length_function=len,
        separators=["\n\n", "\n", " ", ""]  # Try these in order
    )
    
    chunks = text_splitter.split_documents(documents)
    print(f"Created {len(chunks)} chunks from {len(documents)} documents")
    
    return chunks

# Usage
chunks = chunk_documents(docs)

Why these parameters:

  • 1000 characters (~250 tokens): Small enough to be specific, large enough to contain complete thoughts
  • 200 character overlap: Prevents information loss at chunk boundaries
  • Recursive separators: Tries to split on paragraph breaks first, then sentences, then words

Critical consideration: Chunk size trades off between precision and context. Smaller chunks (500 chars) retrieve more precise information but might lack context. Larger chunks (2000 chars) provide more context but might include irrelevant information. Start with 1000 and adjust based on your documents.
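To see what size and overlap actually do mechanically, here is a bare-bones fixed-window chunker. LangChain’s recursive splitter is smarter about splitting on separators, but the windowing arithmetic is the same:

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Slide a fixed-size window over the text, stepping by size - overlap."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "A" * 2500
chunks = chunk_text(sample, chunk_size=1000, overlap=200)
# Windows start at 0, 800, 1600, 2400 -> 4 chunks; each adjacent pair
# shares 200 characters, so a fact sitting on a boundary appears in both.
print([len(c) for c in chunks])  # [1000, 1000, 900, 100]
```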

Step 3: Building the Vector Database

The vector database stores embeddings and performs similarity searches to find relevant chunks.

Generate Embeddings

Create embedding_generator.py:

from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

def create_vector_store(chunks, persist_directory="./chroma_db"):
    """Create and persist vector store with embeddings."""
    
    # Initialize Ollama embeddings
    embeddings = OllamaEmbeddings(
        model="nomic-embed-text",
        base_url="http://localhost:11434"
    )
    
    # Create vector store
    print("Generating embeddings... (this takes time)")
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=persist_directory
    )
    
    print(f"Vector store created with {vectorstore._collection.count()} embeddings")
    return vectorstore

# Usage
vectorstore = create_vector_store(chunks)

What’s happening:

  1. Each chunk is sent to the embedding model (nomic-embed-text)
  2. The model returns a 768-dimensional vector representing the semantic meaning
  3. Vectors are stored in ChromaDB with metadata linking back to the original text
  4. An index is built for fast similarity search
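The similarity search in step 4 is typically cosine similarity between vectors; a minimal version over plain Python lists shows the idea (the example vectors are made up for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [0.9, 0.1, 0.0]
chunk_vectors = {
    "sales report": [0.8, 0.2, 0.1],   # points the same way as the query
    "lunch menu":   [0.0, 0.1, 0.9],   # nearly orthogonal
}
best = max(chunk_vectors, key=lambda k: cosine_similarity(query, chunk_vectors[k]))
print(best)  # sales report
```

ChromaDB does this over 768-dimensional vectors with an index so it doesn’t have to compare against every chunk, but the scoring is the same idea.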

Performance note: Embedding generation is the slowest step. For 1,000 chunks, expect 5-15 minutes depending on your hardware. The process only runs once—subsequent queries reuse the stored embeddings.

Testing Retrieval

Verify your vector store works before connecting the LLM:

def test_retrieval(vectorstore, query, k=3):
    """Test retrieval with a sample query."""
    
    results = vectorstore.similarity_search(query, k=k)
    
    print(f"\nQuery: {query}\n")
    for i, doc in enumerate(results, 1):
        print(f"Result {i}:")
        print(f"Content: {doc.page_content[:200]}...")
        print(f"Source: {doc.metadata.get('source', 'unknown')}\n")
    
    return results

# Test
test_retrieval(
    vectorstore,
    "What were the Q3 sales numbers?",
    k=3
)

This returns the 3 most similar chunks to your query. Review the results—do they actually relate to the question? If not, you might need to adjust chunking parameters or try different embedding models.

Step 4: Connecting the LLM

Now we integrate the LLM to generate answers based on retrieved context.

Create the RAG Chain

Create rag_assistant.py:

from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

def create_rag_chain(vectorstore):
    """Create the complete RAG chain."""
    
    # Initialize LLM
    llm = Ollama(
        model="llama3.1:8b",
        base_url="http://localhost:11434",
        temperature=0.3  # Lower temperature for more factual responses
    )
    
    # Create retriever
    retriever = vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 4}  # Retrieve top 4 relevant chunks
    )
    
    # Define prompt template
    prompt_template = """You are an AI assistant answering questions based on provided documents. 
Use the following context to answer the question. If you cannot answer based on the context, say so.

Context:
{context}

Question: {question}

Answer: """
    
    PROMPT = PromptTemplate(
        template=prompt_template,
        input_variables=["context", "question"]
    )
    
    # Create RAG chain
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",  # Stuffs all retrieved docs into prompt
        retriever=retriever,
        return_source_documents=True,
        chain_type_kwargs={"prompt": PROMPT}
    )
    
    return qa_chain

# Usage
qa_chain = create_rag_chain(vectorstore)

Component breakdown:

  • LLM configuration: Temperature 0.3 reduces creativity, focusing on factual responses
  • Retriever: Returns top 4 chunks most similar to the query
  • Prompt template: Explicitly instructs the model to use provided context
  • Chain type “stuff”: Concatenates all retrieved documents into the prompt

Querying Your Assistant

def query_assistant(qa_chain, question):
    """Ask a question and get an answer with sources."""
    
    result = qa_chain.invoke({"query": question})
    
    print(f"\nQuestion: {question}\n")
    print(f"Answer: {result['result']}\n")
    print("Sources:")
    for doc in result['source_documents']:
        source = doc.metadata.get('source', 'unknown')
        print(f"- {source}")
    
    return result

# Ask questions
query_assistant(
    qa_chain,
    "What are the company's remote work policies?"
)

query_assistant(
    qa_chain,
    "What were the key takeaways from the Q3 sales report?"
)

The system now:

  1. Converts your question to an embedding
  2. Searches the vector database for similar chunks
  3. Constructs a prompt with the question and retrieved chunks
  4. Generates an answer using the local LLM
  5. Returns both the answer and source documents

Step 5: Building a Simple Interface

Command-line interfaces work for testing but aren’t user-friendly. Let’s create a simple web interface.

Create a Streamlit Interface

Install Streamlit:

pip install streamlit

Create app.py:

import streamlit as st
# These imports assume chunk_documents was saved in document_loader.py
# alongside load_documents; adjust paths to match your files
from document_loader import load_documents, chunk_documents
from embedding_generator import create_vector_store
from rag_assistant import create_rag_chain

# Page config
st.set_page_config(page_title="Private AI Assistant", page_icon="🤖")
st.title("🤖 Private AI Assistant")
st.caption("Ask questions about your documents")

# Initialize session state
if 'qa_chain' not in st.session_state:
    with st.spinner("Loading documents and initializing assistant..."):
        # Load and process documents
        docs = load_documents("./documents")
        chunks = chunk_documents(docs)
        vectorstore = create_vector_store(chunks)
        st.session_state.qa_chain = create_rag_chain(vectorstore)
    st.success("Assistant ready!")

# Chat interface
if 'messages' not in st.session_state:
    st.session_state.messages = []

# Display chat history
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])
        if "sources" in message:
            with st.expander("Sources"):
                for source in message["sources"]:
                    st.text(source)

# Query input
if question := st.chat_input("Ask a question about your documents"):
    # Display user message
    st.session_state.messages.append({"role": "user", "content": question})
    with st.chat_message("user"):
        st.markdown(question)
    
    # Generate response
    with st.chat_message("assistant"):
        with st.spinner("Thinking..."):
            result = st.session_state.qa_chain.invoke({"query": question})
            answer = result['result']
            sources = [doc.metadata.get('source', 'unknown') 
                      for doc in result['source_documents']]
            
            st.markdown(answer)
            with st.expander("Sources"):
                for source in set(sources):
                    st.text(source)
        
        st.session_state.messages.append({
            "role": "assistant",
            "content": answer,
            "sources": list(set(sources))
        })

Run the interface:

streamlit run app.py

This creates a ChatGPT-like interface accessible at http://localhost:8501. The conversation history persists during your session, and source documents are displayed for transparency.

Optimization and Troubleshooting

Improving Retrieval Quality

If your assistant returns irrelevant information or says it can’t answer questions about data that exists:

Adjust retriever parameters:

retriever = vectorstore.as_retriever(
    search_type="mmr",  # Maximum Marginal Relevance for diversity
    search_kwargs={
        "k": 6,           # Retrieve more chunks
        "fetch_k": 20,    # Consider 20 candidates before selecting 6
        "lambda_mult": 0.7  # Balance relevance vs diversity
    }
)

MMR retrieval returns diverse results rather than very similar chunks, helping when information spans multiple documents.
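Conceptually, MMR trades similarity to the query against similarity to chunks already selected. A simplified greedy version, using precomputed similarity scores rather than LangChain’s internal embedding math:

```python
def mmr_select(query_sim, pairwise_sim, k, lambda_mult=0.7):
    """Greedy MMR: each round, pick the candidate maximizing
    lambda * relevance - (1 - lambda) * max similarity to already-selected."""
    selected = []
    candidates = list(range(len(query_sim)))
    while candidates and len(selected) < k:
        def score(c):
            redundancy = max((pairwise_sim[c][s] for s in selected), default=0.0)
            return lambda_mult * query_sim[c] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Three candidates: 0 and 1 are near-duplicates, 2 is different but relevant.
query_sim = [0.9, 0.88, 0.7]          # relevance to the query
pairwise = [[1.0, 0.95, 0.1],
            [0.95, 1.0, 0.1],
            [0.1, 0.1, 1.0]]
print(mmr_select(query_sim, pairwise, k=2))  # [0, 2]: skips the duplicate
```

With lambda_mult closer to 1.0, redundancy is ignored and the near-duplicate gets picked; lowering it is what buys the diversity described above.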

Experiment with chunk sizes:

  • Smaller chunks (500 chars): Better for specific facts and numbers
  • Larger chunks (1500 chars): Better for contextual understanding

Try different embedding models:

ollama pull all-minilm  # Faster but less accurate
ollama pull mxbai-embed-large  # Slower but higher quality

Handling Large Document Collections

For 10,000+ documents, optimization becomes critical:

Use batch embedding generation:

from tqdm import tqdm
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

def create_vector_store_batch(chunks, batch_size=100):
    embeddings = OllamaEmbeddings(model="nomic-embed-text")
    vectorstore = Chroma(
        embedding_function=embeddings,
        persist_directory="./chroma_db"
    )
    
    # Add documents in batches so progress is visible and a failure
    # partway through doesn't discard all prior work
    for i in tqdm(range(0, len(chunks), batch_size)):
        batch = chunks[i:i+batch_size]
        vectorstore.add_documents(batch)
    
    return vectorstore

Enable persistent caching:

ChromaDB automatically persists to disk, but ensure you’re not regenerating embeddings:

# Check if database exists
import os
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = OllamaEmbeddings(model="nomic-embed-text")

if os.path.exists("./chroma_db"):
    vectorstore = Chroma(
        persist_directory="./chroma_db",
        embedding_function=embeddings
    )
    print("Loaded existing vector store")
else:
    vectorstore = create_vector_store(chunks)

Improving Answer Quality

Adjust LLM temperature:

  • Temperature 0.1-0.3: Very factual, less creative
  • Temperature 0.5-0.7: Balanced
  • Temperature 0.8-1.0: More creative, less reliable
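Temperature works by rescaling the model’s token logits before sampling; lower values sharpen the distribution toward the top choice. A quick illustration with softmax:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature, then normalize into probabilities."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.3, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: top-token probability {probs[0]:.2f}")
# At T=0.3 the top token dominates; at T=1.0 alternatives keep real mass.
```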

Enhance the prompt:

prompt_template = """You are a helpful AI assistant with access to company documents.

Instructions:
- Answer based ONLY on the provided context
- If the context doesn't contain relevant information, clearly state this
- Quote specific sections when appropriate
- Be concise but thorough

Context:
{context}

Question: {question}

Detailed Answer: """

Use a larger model:

If quality is insufficient, move up in model size. Llama 3.1's next size is 70B, which needs roughly 40GB+ of memory even quantized:

ollama pull llama3.1:70b

If that exceeds your hardware, a mid-size model in the 13-14B range from another family (for example, qwen2.5:14b) is a reasonable middle ground on a 32GB machine. The quality improvement over 8B is noticeable, especially for complex queries requiring reasoning.

Security and Privacy Considerations

Since you’re building a private assistant, ensure data remains private:

Data never leaves your machine: Ollama runs entirely locally. No API calls to external services. No telemetry. Your documents stay on your hardware.

Access control: The Streamlit interface runs on localhost by default. If exposing to a network:

streamlit run app.py --server.address 0.0.0.0 --server.port 8501

Implement authentication and use HTTPS in production.

Sensitive document handling: If processing confidential data:

  • Use encrypted storage for the vector database
  • Ensure adequate system security (firewall, antivirus, user permissions)
  • Consider running on an isolated machine without internet access

Model security: Downloaded models are cached in ~/.ollama. Verify model integrity if security is critical.
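A minimal integrity check is to record a SHA-256 digest of each model blob after a trusted first download and compare it later. The path and digest below are placeholders, not real values:

```python
import hashlib
from pathlib import Path

def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1MB chunks so multi-gigabyte blobs fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical usage against a blob in Ollama's cache directory:
# blob = Path.home() / ".ollama" / "models" / "blobs" / "sha256-..."
# assert sha256_of_file(blob) == KNOWN_GOOD_DIGEST
```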

Complete Working Example

Here’s the minimal complete code to get started:

# setup_rag.py
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# 1. Load documents
loader = DirectoryLoader("./documents", glob="**/*.txt", loader_cls=TextLoader)
documents = loader.load()

# 2. Chunk documents
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)

# 3. Create vector store
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./db")

# 4. Create RAG chain
llm = Ollama(model="llama3.1:8b", temperature=0.3)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

prompt = PromptTemplate(
    template="Context: {context}\n\nQuestion: {question}\n\nAnswer:",
    input_variables=["context", "question"]
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt}
)

# 5. Query
result = qa_chain.invoke({"query": "Your question here"})
print(result['result'])

This short script creates a functional RAG system. Expand from this foundation.

Conclusion

Building a private AI assistant that understands your specific data is more accessible than ever thanks to local LLMs and vector databases. The RAG architecture—combining document retrieval with language model generation—creates systems that answer questions accurately based on your documents while running entirely on your hardware. This guide covered the complete pipeline from document processing through embeddings to query generation, giving you a working system that respects privacy while delivering ChatGPT-like capabilities on your proprietary data.

Start with the minimal example provided, test it with your documents, and iterate based on results. The beauty of this approach is modularity—swap embedding models, try different LLMs, adjust retrieval parameters, or enhance the interface without rewriting the entire system. Your private AI assistant will improve as you refine each component, ultimately becoming an indispensable tool for navigating your personal or organizational knowledge base.
