How to Use Ollama with LangChain

LangChain is one of the most popular Python frameworks for building LLM-powered applications. It provides abstractions for chains, prompts, memory, and agents that work with any language model backend, including local models via Ollama. This guide covers the key LangChain patterns with Ollama: chains, RAG, and simple agents.

Installation

pip install langchain langchain-ollama langchain-community chromadb langgraph
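
The examples assume the chat and embedding models are already available locally; if not, pull them with Ollama first:

ollama pull llama3.2
ollama pull nomic-embed-text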

Basic LLM and Chat Model

from langchain_ollama import OllamaLLM, ChatOllama
from langchain_core.messages import HumanMessage, SystemMessage

# Raw LLM (generate)
llm = OllamaLLM(model='llama3.2')
response = llm.invoke('Why is Python popular?')
print(response)

# Chat model (preferred for instruction-following)
chat = ChatOllama(model='llama3.2', temperature=0.3)
messages = [
    SystemMessage(content='You are a concise technical writer.'),
    HumanMessage(content='Explain Python in one sentence.')
]
response = chat.invoke(messages)
print(response.content)

Prompt Templates and Chains

from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

chat = ChatOllama(model='llama3.2', temperature=0.3)

# Build a chain: prompt | model | parser
prompt = ChatPromptTemplate.from_messages([
    ('system', 'You are an expert {domain} consultant. Be concise.'),
    ('human', '{question}')
])

chain = prompt | chat | StrOutputParser()

result = chain.invoke({
    'domain': 'software architecture',
    'question': 'What is the strangler fig pattern?'
})
print(result)

# Streaming
for chunk in chain.stream({'domain': 'devops', 'question': 'Explain blue-green deployment'}):
    print(chunk, end='', flush=True)

RAG: Chat with Your Documents

from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Embeddings with Ollama
embeddings = OllamaEmbeddings(model='nomic-embed-text')

# Build a vector store from documents
docs = [
    Document(page_content='LangChain supports multiple LLM backends including Ollama.'),
    Document(page_content='Ollama runs models locally for privacy and offline use.'),
    Document(page_content='nomic-embed-text is a fast local embedding model.'),
]
vectorstore = Chroma.from_documents(docs, embedding=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={'k': 2})

# RAG chain
chat = ChatOllama(model='llama3.2', temperature=0)
template = '''Answer using only this context:
{context}

Question: {question}'''
prompt = ChatPromptTemplate.from_template(template)

def format_docs(docs):
    return '\n\n'.join(d.page_content for d in docs)

rag_chain = (
    {'context': retriever | format_docs, 'question': RunnablePassthrough()}
    | prompt | chat | StrOutputParser()
)

print(rag_chain.invoke('What is nomic-embed-text?'))

Conversation Memory

from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.chat_history import InMemoryChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_core.output_parsers import StrOutputParser

chat = ChatOllama(model='llama3.2')
prompt = ChatPromptTemplate.from_messages([
    ('system', 'You are a helpful assistant.'),
    MessagesPlaceholder('history'),
    ('human', '{input}')
])

chain = prompt | chat | StrOutputParser()

store = {}
def get_history(session_id):
    if session_id not in store:
        store[session_id] = InMemoryChatMessageHistory()
    return store[session_id]

with_memory = RunnableWithMessageHistory(
    chain,
    get_history,
    input_messages_key='input',
    history_messages_key='history'
)

cfg = {'configurable': {'session_id': 'user-123'}}
print(with_memory.invoke({'input': 'My name is Alex.'}, config=cfg))
print(with_memory.invoke({'input': 'What is my name?'}, config=cfg))

Simple Agent with Tools

from langchain_ollama import ChatOllama
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent  # requires the langgraph package

# Use a model that supports tool calling (llama3.2 does)
chat = ChatOllama(model='llama3.2', temperature=0)

@tool
def get_word_count(text: str) -> int:
    '''Count words in a text string.'''
    return len(text.split())

@tool
def to_uppercase(text: str) -> str:
    '''Convert text to uppercase.'''
    return text.upper()

agent = create_react_agent(chat, [get_word_count, to_uppercase])

result = agent.invoke({
    'messages': [('user', 'Count the words in "hello world foo bar" then uppercase the result')]
})
print(result['messages'][-1].content)

Why LangChain with Ollama?

LangChain provides abstractions that handle the boilerplate of LLM-powered application development: prompt management, output parsing, retrieval, memory, and tool use. Using raw Ollama API calls for complex pipelines — multi-step chains, RAG with dynamic retrieval, agents that call tools — requires significant custom code for each pattern. LangChain standardises these patterns so you can compose them with minimal glue code. The LangChain Expression Language (LCEL) pipe syntax (prompt | model | parser) is particularly clean for expressing data transformation pipelines that pass content through multiple processing steps.

The langchain-ollama package is the official integration, maintained and updated alongside new LangChain releases. It provides ChatOllama for chat models, OllamaLLM for completion-style models, and OllamaEmbeddings for embeddings, all implementing LangChain’s standard interfaces so they drop in anywhere a LangChain-compatible model or embedding is expected. This means community-contributed LangChain components — vector stores, document loaders, output parsers, tools — all work with Ollama without any modification.

LCEL and the Pipe Pattern

LangChain Expression Language is the core composition mechanism in modern LangChain. The pipe operator (|) chains components together: each component’s output becomes the next component’s input. A typical chain is prompt | model | parser — the prompt template formats the input into messages, the model generates a response, and the parser extracts the useful content from the model’s output object. This pattern is composable and lazy: defining a chain does not execute it; execution happens when you call .invoke(), .stream(), or .batch(). Streaming works naturally through LCEL — chain.stream(input) yields chunks as the model generates them without any special configuration.
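
As a quick illustration of the three execution modes, reusing the chain defined in the Prompt Templates section (the inputs below are made up for the example):

questions = [
    {'domain': 'databases', 'question': 'What is a write-ahead log?'},
    {'domain': 'networking', 'question': 'What does TCP slow start do?'},
]

single = chain.invoke(questions[0])        # one string
answers = chain.batch(questions)           # list of strings, one per input
for chunk in chain.stream(questions[1]):   # text chunks as they are generated
    print(chunk, end='', flush=True)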

Choosing Between ChatOllama and OllamaLLM

Use ChatOllama for virtually all new applications. It wraps Ollama’s /api/chat endpoint, which supports multi-turn conversation history through a list of messages with roles (system, human, assistant). OllamaLLM wraps the older /api/generate endpoint, which sends a single text prompt without role structure. Most instruction-following models are trained on the chat format and perform significantly better with ChatOllama — the role structure tells the model how to interpret the input and what kind of response is expected. Use OllamaLLM only for completion-style tasks where you are extending a text prefix rather than having a conversation.

RAG Architecture with Local Embeddings

The RAG chain in this article uses three local components: OllamaEmbeddings with nomic-embed-text for generating document and query embeddings, Chroma as the local vector store for similarity search, and ChatOllama for answering questions with retrieved context. All three run locally — no cloud API calls, no data leaving your machine. This is the canonical local RAG architecture for LangChain projects and scales from small document sets (hundreds of documents, in-memory Chroma) to larger collections (tens of thousands of documents, persistent Chroma with a local directory).

For production RAG, add document loaders (langchain-community includes loaders for PDFs, Word docs, web pages, and many other formats) and text splitters to handle documents longer than the embedding model’s context window. The splitter breaks long documents into overlapping chunks before embedding, ensuring that long documents are fully indexed rather than truncated. The chunk size and overlap are the main parameters to tune based on your documents and the queries you expect.
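
A sketch of that setup, assuming a plain-text source file (the path, chunk size, and overlap are illustrative and should be tuned for your corpus):

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_ollama import OllamaEmbeddings

docs = TextLoader('notes.txt').load()
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
chunks = splitter.split_documents(docs)

embeddings = OllamaEmbeddings(model='nomic-embed-text')
vectorstore = Chroma.from_documents(
    chunks,
    embedding=embeddings,
    persist_directory='./chroma_db',   # persist to disk for larger collections
)
retriever = vectorstore.as_retriever(search_kwargs={'k': 4})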

LangChain vs Direct Ollama Calls

When should you use LangChain versus calling Ollama directly? Use LangChain when you need: RAG with a vector store retrieval step, conversation memory that persists across multiple turns, agent behaviour where the model decides which tools to call, complex multi-step pipelines where LCEL’s composition reduces boilerplate, or integration with LangChain’s broad ecosystem of loaders, parsers, and tools. Use direct Ollama API calls (via the Python library or REST) for simpler tasks: single-turn Q&A, batch text processing, embedding generation without retrieval, and straightforward chat interfaces. LangChain adds real value for complex orchestration but is overkill for simple inference calls — adding a framework dependency for a task that five lines of direct API code handles cleanly is not worth it.
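
For comparison, a single-turn question with the official ollama Python package (pip install ollama) needs no framework at all; the question text here is just an example:

import ollama

response = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Why is Python popular?'}],
)
print(response['message']['content'])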

Performance Considerations

LangChain’s abstractions add minimal overhead compared to direct API calls — the main cost is the Python object creation and the LCEL composition, which adds microseconds rather than milliseconds to each call. The dominant performance factor remains Ollama’s inference speed, which is identical whether you call it through LangChain or directly. For RAG applications, the embedding retrieval step (Chroma vector search) adds 10–50ms per query for typical collection sizes — fast enough to be imperceptible to users. For large Chroma collections (100k+ documents), use persistent Chroma with a local directory so the index survives restarts, and benchmark retrieval latency as the collection grows; Chroma’s approximate (HNSW) index keeps search fast, but memory use increases with collection size.

The LangChain Ecosystem for Local AI

LangChain’s value compounds as you use more of its ecosystem. Once you have Ollama integrated as the model backend, you gain access to: document loaders for PDFs, Markdown, web pages, CSV files, and dozens of other formats; text splitters that handle chunking intelligently for different document types; output parsers that extract structured data from model responses; and integrations with external tools and APIs. All of these work the same way whether your model backend is Ollama, OpenAI, or Anthropic — the abstraction layer is the point. For teams that build multiple AI-powered applications, the consistency of the LangChain patterns across applications reduces cognitive overhead and allows knowledge and components to be shared across projects. The investment in learning LangChain’s patterns pays dividends across every AI application you build with it.
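
As one small example of the output-parser piece, a sketch that extracts structured JSON with a generic parser (the prompt wording and field names are made up; format='json' asks Ollama to constrain its output to valid JSON):

from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser

chat = ChatOllama(model='llama3.2', temperature=0, format='json')
prompt = ChatPromptTemplate.from_template(
    'Extract the product name and price from this text. '
    'Reply as JSON with keys "name" and "price". Text: {text}'
)
extract_chain = prompt | chat | JsonOutputParser()
print(extract_chain.invoke({'text': 'The Widget Pro costs $49.99.'}))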

Using LangChain with Multiple Models

One of LangChain’s practical strengths is making it easy to use different models for different parts of a pipeline. A common pattern is using a fast small model for retrieval-augmented generation (where the context is constrained and the task is straightforward) and a larger model for complex reasoning steps. With Ollama and LangChain, switching models is a one-line change — create a different ChatOllama instance with the desired model name and plug it into the chain. This makes it easy to experiment with model selection for different tasks within the same application without restructuring the pipeline:

from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

fast_model = ChatOllama(model='qwen2.5:3b', temperature=0)   # small model for simple tasks
slow_model = ChatOllama(model='llama3.2', temperature=0.3)   # larger model for nuanced generation

# Illustrative prompts; shape them to suit your pipeline
classify_prompt = ChatPromptTemplate.from_template('Classify the sentiment of: {text}')
synth_prompt = ChatPromptTemplate.from_template('Write a polished summary of: {text}')

classify_chain = classify_prompt | fast_model | StrOutputParser()   # fast model for classification
synthesize_chain = synth_prompt | slow_model | StrOutputParser()    # slow model for final synthesis

Testing LangChain Pipelines

LangChain chains are testable without a running Ollama instance by using fake language models. The langchain_core.language_models.fake_chat_models module provides FakeListChatModel and related fakes that return predefined responses, letting you test your chain logic (prompt formatting, retrieval, output parsing) without inference overhead:

from langchain_core.language_models.fake_chat_models import FakeListChatModel
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Replace ChatOllama with a fake that returns canned responses
fake_chat = FakeListChatModel(responses=['Paris'])
prompt = ChatPromptTemplate.from_template('Answer briefly: {question}')
test_chain = prompt | fake_chat | StrOutputParser()
assert test_chain.invoke({'question': 'Capital of France?'}) == 'Paris'

When to Move Beyond LangChain

LangChain is excellent for getting complex AI pipelines working quickly, but some teams eventually outgrow it for production systems. Signs that you may want to drop to direct Ollama calls: the abstraction is hiding behaviour you need to control precisely (streaming chunk timing, exact token counts, specific request parameters); the framework version updates are breaking your application frequently; or the overhead of LangChain’s object model is measurable in your performance profiling. Direct Ollama API calls give you exact control at the cost of more boilerplate. The right answer depends on your project’s complexity and your team’s maintenance preferences — LangChain is excellent for most projects and a reasonable choice to start with, moving to direct calls only when you have a specific reason to do so.

Streaming in LangChain Applications

Streaming is important for user-facing applications where response latency is visible. LCEL chains stream automatically when you call .stream() instead of .invoke(). For web applications, you can stream LangChain responses to the browser using Server-Sent Events or HTTP chunked transfer encoding. FastAPI and Flask both support streaming responses that work naturally with LangChain’s streaming interface. The key is that every component in the chain must be streaming-compatible — ChatOllama is, StrOutputParser is, and most LangChain output parsers are. Custom components you write need to implement the streaming interface if they will be used in streaming chains, but most standard LangChain components handle this transparently. For non-streaming components in the middle of a chain (retrieval, tool calls), the chain accumulates those results before streaming the final model output — which is the correct behaviour since those steps must complete before generation can begin.
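
A minimal sketch of the web side, assuming FastAPI and the chain from the Prompt Templates section (the endpoint path and request model are illustrative):

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class Ask(BaseModel):
    domain: str
    question: str

@app.post('/ask')
def ask(body: Ask):
    def token_stream():
        # chain.stream() yields text chunks as the model generates them
        for chunk in chain.stream({'domain': body.domain, 'question': body.question}):
            yield chunk
    return StreamingResponse(token_stream(), media_type='text/plain')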

LangChain in Production

For production LangChain applications backed by Ollama, the key operational concerns are: ensuring Ollama is running and the required model is loaded before the application starts (use the preloading pattern from the keep-alive article), configuring appropriate timeouts on the ChatOllama client for user-facing applications, and logging chain inputs and outputs for debugging and quality monitoring. LangChain’s built-in tracing (LangSmith) is excellent for debugging complex chains during development — it shows each step’s input, output, and latency. For production, implement your own lightweight logging that captures the information you need without sending data to external services, consistent with the local-first philosophy of the Ollama-based stack.
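
A minimal sketch of the preload step, assuming Ollama at its default local address (the keep_alive value and warm-up prompt are illustrative):

from langchain_ollama import ChatOllama

chat = ChatOllama(model='llama3.2', keep_alive='30m')  # keep the model resident between requests

def preload_model() -> None:
    # A throwaway call at startup forces Ollama to load the model,
    # so the first user request does not pay the load cost.
    try:
        chat.invoke('ping')
    except Exception as exc:
        raise RuntimeError('Ollama is not reachable or llama3.2 is not pulled') from exc

preload_model()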

Getting Started

Run the pip install from the Installation section, pull llama3.2 and nomic-embed-text in Ollama, and try the basic chain and RAG examples from this article. The LCEL pipe syntax takes 15–20 minutes to feel natural; after that, composing new pipelines becomes fast and intuitive. Start with the simplest chain that solves your problem and add memory, retrieval, and tool use incrementally as your application’s requirements grow. The patterns scale from prototype to production with the same code structure: add persistent Chroma, better document loaders, and more sophisticated prompts as your application matures. LangChain’s documentation is comprehensive and the community produces a large volume of examples, making it relatively easy to find patterns for whatever you want to build, even with a local Ollama backend rather than a cloud model.

LangChain’s Position in the Local AI Stack

LangChain sits in the orchestration layer between your application code and the inference layer (Ollama). It is not required — you can build sophisticated AI applications with direct Ollama API calls — but it provides valuable abstractions for patterns that recur across many projects. The LCEL pipe syntax makes complex pipelines readable and composable. The memory abstractions handle the conversation history bookkeeping that every multi-turn application needs. The retrieval abstractions standardise the RAG pattern that appears in most knowledge-intensive applications. And the agent abstractions provide a framework for tool use that would require significant custom code to implement from scratch. For teams building multiple AI applications, investing in learning LangChain pays dividends across all of them — the patterns transfer, the components are reusable, and the debugging tools apply equally to every project.
