Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1 that accepts the same request format, returns the same response structure, and works with the official OpenAI Python and JavaScript SDKs. For most applications, switching from OpenAI to local Ollama inference requires changing exactly two lines: the base_url and the api_key. This guide covers the full migration path: basic chat completions, streaming, embeddings, function calling, and the gaps to be aware of.
Python SDK: The Two-Line Change
# Before: OpenAI cloud
from openai import OpenAI
client = OpenAI(api_key='sk-...')

# After: Ollama local — two lines changed
from openai import OpenAI
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # required by the SDK but ignored by Ollama
)

# Everything else stays identical
response = client.chat.completions.create(
    model='llama3.2',  # change the model name to an Ollama model
    messages=[{'role': 'user', 'content': 'Hello!'}]
)
print(response.choices[0].message.content)
JavaScript/TypeScript SDK
// Before
import OpenAI from 'openai';
const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// After
import OpenAI from 'openai';
const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama'
});

// Usage identical
const response = await client.chat.completions.create({
  model: 'llama3.2',
  messages: [{ role: 'user', content: 'Hello!' }]
});
console.log(response.choices[0].message.content);
Streaming
# Streaming works identically
stream = client.chat.completions.create(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Count to 5'}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='', flush=True)
print()
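If you also need the complete response text, accumulating the deltas is a small wrapper around the same loop. A minimal sketch; it assumes only the chunk shape the SDK yields, nothing Ollama-specific:

```python
def collect_stream(stream) -> str:
    """Print streamed deltas as they arrive and return the full text."""
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:  # the final chunk's delta content can be None
            print(delta, end='', flush=True)
            parts.append(delta)
    print()
    return ''.join(parts)
```

This lets you stream to the user while still logging or caching the assembled response.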
Embeddings
# Embeddings via OpenAI SDK pointed at Ollama
response = client.embeddings.create(
    model='nomic-embed-text',  # Ollama embedding model
    input='The quick brown fox'
)
vector = response.data[0].embedding
print(f'Embedding dimension: {len(vector)}')

# Batch embeddings
texts = ['first document', 'second document', 'third document']
response = client.embeddings.create(
    model='nomic-embed-text',
    input=texts
)
vectors = [item.embedding for item in response.data]
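The usual next step after embedding is comparing vectors, and cosine similarity needs nothing beyond the standard library. This sketch makes no assumptions about Ollama; it works on any lists of floats:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# e.g. rank the batch-embedded documents against a query vector:
# scores = [cosine_similarity(query_vec, v) for v in vectors]
```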
Environment Variable Pattern
For applications already reading configuration from environment variables, you can switch between OpenAI and Ollama without code changes:
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.getenv('OPENAI_BASE_URL', 'https://api.openai.com/v1'),
    api_key=os.getenv('OPENAI_API_KEY', 'ollama')
)
model = os.getenv('LLM_MODEL', 'gpt-4o-mini')
# Use OpenAI (default)
export OPENAI_API_KEY=sk-...
export LLM_MODEL=gpt-4o-mini
# Switch to Ollama
export OPENAI_BASE_URL=http://localhost:11434/v1
export OPENAI_API_KEY=ollama
export LLM_MODEL=llama3.2
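The same pattern can be wrapped in a small factory so the provider choice lives in one place. A sketch; the function name and the LLM_MODEL variable are conventions from the snippet above, not anything the SDKs require:

```python
import os

def resolve_llm_config() -> dict:
    """Read provider settings from the environment, defaulting to OpenAI cloud."""
    return {
        'base_url': os.getenv('OPENAI_BASE_URL', 'https://api.openai.com/v1'),
        'api_key': os.getenv('OPENAI_API_KEY', 'ollama'),
        'model': os.getenv('LLM_MODEL', 'gpt-4o-mini'),
    }

# cfg = resolve_llm_config()
# client = OpenAI(base_url=cfg['base_url'], api_key=cfg['api_key'])
```

Centralising the lookup means every part of the application (and every test) sees the same provider decision.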
LangChain Migration
# Before: LangChain + OpenAI
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model='gpt-4o-mini', temperature=0.7)

# After: LangChain + Ollama (via the OpenAI-compatible endpoint)
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    model='llama3.2',
    temperature=0.7,
    openai_api_key='ollama',
    openai_api_base='http://localhost:11434/v1'
)

# Or use the native LangChain Ollama integration
from langchain_ollama import ChatOllama
llm = ChatOllama(model='llama3.2', temperature=0.7)

# Everything downstream (chains, agents, tools) works unchanged
from langchain_core.messages import HumanMessage
response = llm.invoke([HumanMessage(content='Hello!')])
print(response.content)
LlamaIndex Migration
# Before
from llama_index.llms.openai import OpenAI
llm = OpenAI(model='gpt-4o-mini')

# After
from llama_index.llms.ollama import Ollama
llm = Ollama(model='llama3.2', request_timeout=120.0)

# Or via the OpenAI-compatible endpoint
from llama_index.llms.openai_like import OpenAILike
llm = OpenAILike(
    model='llama3.2',
    api_base='http://localhost:11434/v1',
    api_key='ollama'
)
What the Compatibility Layer Does and Does Not Support
The Ollama OpenAI-compatible endpoint supports: chat completions (/v1/chat/completions), text completions (/v1/completions), embeddings (/v1/embeddings), and model listing (/v1/models). Streaming, temperature, top_p, stop sequences, and max_tokens parameters all work as expected.
What is not supported or behaves differently:
- Function calling / tool use: Ollama supports function calling, but only with models specifically trained for it (llama3.2 and qwen2.5 support it; older models do not).
- Vision / image inputs: supported only with vision-capable models (llava, gemma3, qwen2.5vl).
- Logprobs: not supported in Ollama’s compatibility layer.
- Fine-tuned model IDs: Ollama uses its own model naming rather than OpenAI’s fine-tuned model ID format.
- Usage/billing fields: token counts are returned, but billing is obviously not applicable.
Why Migrate to Local Inference?
The OpenAI-compatible endpoint is not just a convenience — it is the key that makes migration reversible and risk-free. You can switch to Ollama for development and testing, verify that your application works correctly with local models, and revert to OpenAI for production with a single environment variable change if the quality is not sufficient. This reversibility removes the main risk from trying local inference: you are not committing to a local-only architecture, just adding a local option alongside the cloud one.
The practical reasons to migrate vary by use case. For development and testing, local inference eliminates API costs during the iteration cycle, where you run the same prompts hundreds of times while tuning them and the surrounding code; running 10,000 test requests at GPT-4o-mini pricing adds up faster than expected. For privacy-sensitive applications, local inference keeps all data on your infrastructure without any configuration beyond pointing the client at Ollama. For latency-sensitive applications, eliminating the network round trip to OpenAI’s servers reduces first-token latency significantly, particularly for users geographically distant from the API endpoints.
Migrating Function Calling
Function calling (tool use) is the most complex part of the migration because it requires a model that was specifically trained for it. Not all Ollama models support function calling — use Llama 3.2, Qwen2.5, or Mistral for tool use. The API format is identical to OpenAI’s:
tools = [{
    'type': 'function',
    'function': {
        'name': 'get_weather',
        'description': 'Get current weather for a city',
        'parameters': {
            'type': 'object',
            'properties': {
                'city': {'type': 'string', 'description': 'City name'}
            },
            'required': ['city']
        }
    }
}]

response = client.chat.completions.create(
    model='llama3.2',  # must be a tool-capable model
    messages=[{'role': 'user', 'content': 'What is the weather in Paris?'}],
    tools=tools,
    tool_choice='auto'
)

# Handle tool calls
if response.choices[0].finish_reason == 'tool_calls':
    tool_call = response.choices[0].message.tool_calls[0]
    print(f'Tool: {tool_call.function.name}')
    print(f'Args: {tool_call.function.arguments}')
Tool use quality with local 7–8B models is functional but less reliable than GPT-4o on complex multi-tool scenarios. For simple single-tool use cases, Llama 3.2 handles it well. For complex agentic workflows with multiple tools and conditional logic, test carefully — local models are more likely to call the wrong tool or format arguments incorrectly on edge cases.
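The arguments field arrives as a JSON string, so completing the round trip means parsing it, running your function, and sending a 'tool' message back. A minimal dispatch sketch; get_weather and the registry are hypothetical stand-ins for your own tools:

```python
import json

def get_weather(city: str) -> str:
    """Hypothetical local implementation of the get_weather tool."""
    return f'Sunny in {city}'  # replace with a real weather lookup

TOOL_REGISTRY = {'get_weather': get_weather}

def run_tool_call(tool_call) -> dict:
    """Execute one tool call and build the 'tool' message to send back."""
    args = json.loads(tool_call.function.arguments)  # arguments is a JSON string
    result = TOOL_REGISTRY[tool_call.function.name](**args)
    return {
        'role': 'tool',
        'tool_call_id': tool_call.id,
        'content': result,
    }

# messages.append(response.choices[0].message)  # the assistant's tool request
# messages.append(run_tool_call(tool_call))     # your tool's result
# followup = client.chat.completions.create(
#     model='llama3.2', messages=messages, tools=tools)
```

Because local models format arguments incorrectly more often, it is worth wrapping json.loads in error handling that sends a correction message back to the model rather than crashing.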
Testing the Migration
Before switching production traffic, run a parallel test: send the same requests to both OpenAI and Ollama and compare the responses. This gives you concrete, task-specific quality data:
import os
from openai import OpenAI

openai_client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])
ollama_client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

test_prompts = [
    'Summarise this document in 3 bullet points: ...',
    'Write a Python function to parse a CSV file...',
    'Translate this paragraph to French: ...'
]

for prompt in test_prompts:
    oai = openai_client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{'role': 'user', 'content': prompt}]
    ).choices[0].message.content
    local = ollama_client.chat.completions.create(
        model='llama3.2',
        messages=[{'role': 'user', 'content': prompt}]
    ).choices[0].message.content
    print(f'--- Prompt: {prompt[:50]}...')
    print(f'OpenAI: {oai[:200]}')
    print(f'Ollama: {local[:200]}')
    print()
When Local Inference Is Not a Suitable Replacement
Honest assessment: there are cases where the quality gap between OpenAI’s models and locally runnable models is too large to bridge with a simple migration. Complex reasoning tasks that require GPT-4-class capability — multi-step mathematical proofs, code generation for novel algorithms, nuanced analysis of complex documents — show a clear quality advantage for the larger cloud models that cannot currently be replicated locally without a 70B+ model requiring high-end hardware. For these tasks, a hybrid approach is more appropriate than a full migration: use local models for the portions of your application where 7–8B quality is sufficient (the majority of requests in most applications) and route only the complex cases to the cloud API. This hybrid approach reduces API costs by 70–90% while preserving cloud quality where it actually matters.
The Migration Checklist
Before switching:
- confirm your model name mapping (OpenAI model names to Ollama model names)
- verify the Ollama endpoint is accessible from your application’s deployment environment
- test streaming if your application uses it
- test embeddings with nomic-embed-text if your application uses ada-002
- test function calling if your application uses tools

After switching:
- monitor response quality on a representative sample of real requests
- check latency (local inference eliminates network latency, but inference speed depends on your hardware)
- verify that your error handling covers Ollama-specific errors like model not found and context length exceeded

The migration is typically straightforward — most applications work correctly with only the two-line SDK change — but the checklist ensures you have not missed a feature that your application depends on.
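The endpoint-accessibility and model-availability checks can be automated with a quick probe of the /v1/models endpoint. A standard-library sketch; the function names are illustrative, and the tag-stripping reflects that Ollama model IDs usually carry a suffix like ':latest':

```python
import json
import urllib.request

def installed_models(models_response: dict) -> set[str]:
    """Extract model IDs from a /v1/models response, with and without the tag."""
    ids = set()
    for m in models_response.get('data', []):
        ids.add(m['id'])
        ids.add(m['id'].split(':')[0])  # 'llama3.2:latest' also matches 'llama3.2'
    return ids

def missing_models(base_url: str, required: list[str]) -> list[str]:
    """Return the required models NOT available at base_url."""
    with urllib.request.urlopen(f'{base_url}/models', timeout=5) as resp:
        body = json.load(resp)
    available = installed_models(body)
    return [m for m in required if m not in available]

# missing = missing_models('http://localhost:11434/v1', ['llama3.2', 'nomic-embed-text'])
# if missing:
#     raise RuntimeError(f'Pull these models first: {missing}')
```

Running this at application startup turns a silent misconfiguration into a clear error before any traffic is served.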
Model Name Mapping
When migrating, you need to choose Ollama model equivalents for the OpenAI models your application currently uses. Here are the practical mappings based on capability tier:
- gpt-4o / gpt-4 → llama3.3:70b (on 64GB+ hardware) or mistral-nemo:12b (on 12GB+ VRAM) for the best locally available quality
- gpt-4o-mini → llama3.2 or qwen2.5:7b — strong general-purpose 7–8B models
- gpt-3.5-turbo → llama3.2 or gemma3:4b — fast, capable for straightforward tasks
- text-embedding-ada-002 / text-embedding-3-small → nomic-embed-text — 768-dim embeddings, widely supported
- text-embedding-3-large → mxbai-embed-large — 1024-dim embeddings for higher precision RAG
The mapping is not exact — local models at the same tier have different strengths and weaknesses compared to OpenAI’s models. Llama 3.2 is stronger than gpt-3.5-turbo on some tasks and weaker on others. Test on your specific task distribution rather than assuming a tier-to-tier correspondence.
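The tier mapping can live in code as a translation table, so the rest of the application keeps using OpenAI model names. This sketch simply encodes the suggestions from this section; adjust the table to your own benchmarks:

```python
# OpenAI model name -> suggested Ollama equivalent (from the tiers above)
MODEL_MAP = {
    'gpt-4o': 'llama3.3:70b',
    'gpt-4': 'llama3.3:70b',
    'gpt-4o-mini': 'llama3.2',
    'gpt-3.5-turbo': 'llama3.2',
    'text-embedding-ada-002': 'nomic-embed-text',
    'text-embedding-3-small': 'nomic-embed-text',
    'text-embedding-3-large': 'mxbai-embed-large',
}

def local_model_for(openai_name: str) -> str:
    """Translate an OpenAI model name to its local equivalent, or fail loudly."""
    try:
        return MODEL_MAP[openai_name]
    except KeyError:
        raise ValueError(f'No local mapping defined for {openai_name!r}')
```

Failing loudly on unmapped names is deliberate: a silent fallback would hide exactly the kind of capability mismatch the checklist is meant to catch.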
Cost and Performance Trade-offs
The financial case for migration depends heavily on your usage volume and hardware situation. At low volume (fewer than 100,000 tokens per day), OpenAI’s API costs are modest enough that the migration effort may not be worthwhile purely on cost grounds. At medium volume (1–10 million tokens per day), local inference becomes clearly cost-competitive, paying back the hardware investment in weeks to months depending on the model tier used. At high volume (100M+ tokens per day), local inference is dramatically cheaper regardless of hardware cost — the compute cost of running inference locally is a fraction of cloud API pricing at scale.
Latency is more nuanced. Eliminating network latency to OpenAI’s servers reduces first-token latency, but local inference speed depends entirely on your hardware. A 7B model on an RTX 3080 generates at 50–80 tokens/sec, comparable to or faster than streaming from OpenAI’s API for most users. A 13B or 70B model on the same hardware is slower. For latency-critical applications, benchmark your specific hardware with your target model before committing to the migration.
The Reversibility Advantage
Perhaps the most underappreciated aspect of the OpenAI-compatible endpoint is that it makes the migration completely reversible. Because switching between OpenAI and Ollama requires only changing the base_url and api_key (typically environment variables), you can A/B test between cloud and local inference in production, roll back instantly if quality issues emerge, and gradually migrate traffic from cloud to local as you gain confidence in the local model’s quality for your specific use case. This is the right way to approach the migration for production applications — not a single cutover, but a gradual shift supported by quality monitoring that lets you make data-driven decisions about which requests are best served locally versus in the cloud.
Setting Up for Production
For a production deployment where Ollama serves inference requests from an application server, a few additional considerations matter beyond the basic SDK swap. First, ensure Ollama is configured to listen on all interfaces rather than just localhost if the application and Ollama run on separate hosts: set OLLAMA_HOST=0.0.0.0 before starting Ollama. Second, consider setting up a reverse proxy (Nginx or Caddy) in front of Ollama to add TLS, request logging, and basic authentication for internal deployments — Ollama has no built-in authentication on its API. Third, pre-load your production models at server startup using keep-alive to eliminate cold-start latency on the first request: send a minimal request with keep_alive=-1 during your server’s startup sequence. Fourth, set appropriate timeouts in your application’s HTTP client — local inference can take longer than OpenAI’s API for large outputs, and default timeouts that work for cloud inference may cause spurious errors on long local generations. A 120-second timeout is a safe starting point for 7B models generating up to 1,000 tokens.
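The pre-loading step can be scripted against Ollama's native /api/generate endpoint: a request with no prompt and keep_alive=-1 loads the model and keeps it resident. A standard-library sketch, assuming the default port; the function names are illustrative:

```python
import json
import urllib.request

def preload_payload(model: str) -> dict:
    """A generate request with no prompt loads the model; keep_alive=-1 pins it."""
    return {'model': model, 'keep_alive': -1}

def preload(model: str, host: str = 'http://localhost:11434') -> None:
    """Load a model into memory so the first real request has no cold start."""
    req = urllib.request.Request(
        f'{host}/api/generate',
        data=json.dumps(preload_payload(model)).encode(),
        headers={'Content-Type': 'application/json'},
    )
    urllib.request.urlopen(req, timeout=120)  # generous timeout for first load

# Call during your server's startup sequence:
# for model in ('llama3.2', 'nomic-embed-text'):
#     preload(model)
```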
The migration from OpenAI to Ollama is among the most straightforward infrastructure changes in the local AI ecosystem, precisely because the OpenAI-compatible endpoint was designed to make it so. The investment in testing and the environment variable pattern described here creates a foundation where you can make the switch confidently, validate the results empirically, and build toward a local-first architecture while retaining the cloud as a fallback for the cases where it genuinely adds value.
Getting Started
The fastest path to validating whether Ollama is a viable replacement for your application: update your development environment to point at Ollama (two environment variable changes), run your test suite, and review the outputs. If your tests pass and the output quality is acceptable, you have a working local inference setup in under an hour. If specific tests fail or quality is insufficient, you have concrete data about which parts of your application need the cloud model — which is precisely the information you need to make a thoughtful, application-specific decision about hybrid versus full local deployment. The OpenAI-compatible endpoint makes this evaluation risk-free and reversible, which is the right foundation for any infrastructure decision with quality implications.
The broader shift this enables — moving AI inference from a recurring operational expense to a fixed infrastructure cost — compounds over time as usage grows. Every request your application handles locally after the initial hardware investment costs effectively nothing, while cloud API requests scale linearly with usage. For applications with growing AI feature adoption, the economics of local inference improve every month, making it one of the few infrastructure decisions where the right time to start is sooner rather than later.
The combination of zero marginal cost per request, full data privacy, and a migration path that requires only two changed lines of code makes the case for local inference compelling for any application with non-trivial AI usage volume. The OpenAI-compatible endpoint means you can start evaluating that case today without any architectural commitment, and the two-line migration demonstrates that local AI infrastructure can be genuinely accessible without sacrificing the flexibility your application needs.