How to Use Ollama’s OpenAI-Compatible API

Ollama exposes an OpenAI-compatible REST API, which means any tool, library, or application that works with the OpenAI API can be pointed at a local Ollama instance simply by changing the base URL and model name. This includes the official OpenAI Python and JavaScript SDKs, LangChain, LlamaIndex, Continue, and hundreds of other tools. You get local, private, free inference behind a familiar interface.

The Endpoint

Ollama’s OpenAI-compatible endpoint runs at http://localhost:11434/v1. The key routes are:

POST http://localhost:11434/v1/chat/completions   # Chat (GPT-3.5/4 compatible)
POST http://localhost:11434/v1/completions        # Text completion
POST http://localhost:11434/v1/embeddings         # Embeddings
GET  http://localhost:11434/v1/models             # List available models

Ollama must be running (ollama serve or the Ollama app) and you need at least one model pulled before making requests. No real API key is required: raw HTTP requests can omit the Authorization header entirely, and the OpenAI SDKs only need a placeholder string because they refuse to construct a client without one.
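
If you want to see the no-auth behaviour for yourself before touching the SDK, a minimal raw request works without any Authorization header at all. This is a sketch using the third-party requests library and assumes llama3.2 has already been pulled:

import requests

# Plain HTTP request to the chat completions route, with no Authorization header
resp = requests.post(
    'http://localhost:11434/v1/chat/completions',
    json={
        'model': 'llama3.2',
        'messages': [{'role': 'user', 'content': 'Say hello in five words.'}],
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()['choices'][0]['message']['content'])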

Using the OpenAI Python SDK

The OpenAI Python SDK works with Ollama out of the box. Set base_url to Ollama’s endpoint and api_key to any non-empty string:

from openai import OpenAI

# Point the OpenAI SDK at your local Ollama instance
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',  # required but ignored by Ollama
)

# Chat completion — identical to OpenAI API usage
response = client.chat.completions.create(
    model='llama3.2',
    messages=[
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': 'Explain gradient descent in 2 sentences.'}
    ],
    temperature=0.7,
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Write a haiku about Python.'}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='', flush=True)

Embeddings

from openai import OpenAI
import numpy as np

client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

# Pull a dedicated embedding model first: ollama pull nomic-embed-text
response = client.embeddings.create(
    model='nomic-embed-text',
    input='The quick brown fox jumps over the lazy dog',
)
embedding = response.data[0].embedding
print(f'Embedding dimension: {len(embedding)}')

# Batch embeddings
texts = ['Python is great', 'JavaScript is fast', 'Rust is safe']
response = client.embeddings.create(model='nomic-embed-text', input=texts)
embeddings = np.array([d.embedding for d in response.data])
print(f'Shape: {embeddings.shape}')
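
As a follow-up illustration rather than anything Ollama-specific, cosine similarity between the vectors from the batch call above takes one line of NumPy (this continues from the embeddings and texts variables defined above):

# Cosine similarity between the first two embeddings (illustrative only)
a, b = embeddings[0], embeddings[1]
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f'Similarity("{texts[0]}", "{texts[1]}"): {similarity:.3f}')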

Switching Existing Code from OpenAI to Ollama

If you have existing code using the OpenAI API, switching to Ollama requires changing exactly two things: the base_url and the model name. Everything else — message format, streaming, parameters — stays identical.

# BEFORE: OpenAI API
from openai import OpenAI
client = OpenAI(api_key='sk-...')
response = client.chat.completions.create(
    model='gpt-4o',
    messages=[{'role': 'user', 'content': 'Hello'}]
)

# AFTER: Local Ollama — only two lines change
from openai import OpenAI
client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
response = client.chat.completions.create(
    model='llama3.2',          # change model name to an Ollama model
    messages=[{'role': 'user', 'content': 'Hello'}]
)

Using with LangChain

from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# LangChain ChatOpenAI pointed at Ollama
llm = ChatOpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',
    model='llama3.2',
    temperature=0.3,
)
response = llm.invoke('What is RAG in one sentence?')
print(response.content)

# LangChain embeddings via Ollama
embeddings = OpenAIEmbeddings(
    base_url='http://localhost:11434/v1',
    api_key='ollama',
    model='nomic-embed-text',
    check_embedding_ctx_length=False,  # skip tiktoken pre-tokenisation, which Ollama cannot decode
)
vector = embeddings.embed_query('test query')
print(f'Vector length: {len(vector)}')

Using with LlamaIndex

from llama_index.llms.openai import OpenAI as LlamaOpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

llm = LlamaOpenAI(
    api_base='http://localhost:11434/v1',
    api_key='ollama',
    model='llama3.2',
)
# Note: some LlamaIndex versions validate model names against OpenAI's own
# catalogue; if this raises an unknown-model error, the OpenAILike class from
# the llama-index-llms-openai-like package accepts arbitrary model names.

embed_model = OpenAIEmbedding(
    api_base='http://localhost:11434/v1',
    api_key='ollama',
    model='nomic-embed-text',
)

response = llm.complete('Explain transformers briefly.')
print(response)

Environment Variable Approach

If you want to switch between OpenAI and Ollama without changing code, use environment variables:

# Switch to local Ollama
export OPENAI_API_BASE=http://localhost:11434/v1
export OPENAI_API_KEY=ollama

# Switch back to real OpenAI
export OPENAI_API_BASE=https://api.openai.com/v1
export OPENAI_API_KEY=sk-your-real-key

import os
from openai import OpenAI

# Reads from environment — no hardcoded URLs
client = OpenAI(
    base_url=os.getenv('OPENAI_API_BASE', 'https://api.openai.com/v1'),
    api_key=os.getenv('OPENAI_API_KEY', 'ollama'),
)
# Same code runs against either OpenAI or Ollama depending on env vars

Limitations vs the Real OpenAI API

The Ollama endpoint does not match the real OpenAI API feature for feature. Function calling and tool use work with models that support them natively (such as Llama 3.1 and later), but the response format may differ slightly from OpenAI’s. Vision/image inputs work with multimodal models like LLaVA, but images must be passed as base64 strings in the message content. The logprobs and n parameters are ignored, and fine-tuned model IDs obviously do not transfer. For the vast majority of chat, completion, and embedding use cases these limitations do not matter; the API is a drop-in replacement that covers most real-world usage.
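
For reference, here is a hedged sketch of what tool calling through this endpoint looks like using the standard OpenAI tools format. It assumes a tool-capable model such as llama3.1 is pulled locally, and the exact response shape may differ from OpenAI’s:

from openai import OpenAI

client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

# Describe a single tool in the standard OpenAI tools format
tools = [{
    'type': 'function',
    'function': {
        'name': 'get_weather',
        'description': 'Get the current weather for a city',
        'parameters': {
            'type': 'object',
            'properties': {'city': {'type': 'string'}},
            'required': ['city'],
        },
    },
}]

response = client.chat.completions.create(
    model='llama3.1',  # assumes a tool-capable model is pulled locally
    messages=[{'role': 'user', 'content': 'What is the weather in Paris?'}],
    tools=tools,
)

# The model may answer directly or request a tool call
message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    print(call.function.name, call.function.arguments)
else:
    print(message.content)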

Why the OpenAI Compatibility Layer Matters

The OpenAI API has become the de facto standard interface for LLM applications. The vast majority of open-source LLM tooling — LangChain, LlamaIndex, Continue, PrivateGPT, Open Interpreter, and hundreds of smaller projects — was built to target the OpenAI API. By implementing the same interface, Ollama makes itself a drop-in backend for this entire ecosystem without any of those tools needing to add explicit Ollama support. This is strategically important: it means the local LLM ecosystem can reuse the enormous investment that went into building OpenAI-compatible tooling, rather than starting from scratch with a new API standard.

From a practical standpoint, it also means you can prototype with the real OpenAI API during development and switch to Ollama for production or privacy-sensitive deployments with minimal code changes. Teams that build internal tools on the OpenAI API can add a configuration option that redirects to a local Ollama instance for on-premises deployment — same codebase, different endpoint.

Choosing the Right Model for API Use

When using Ollama through the OpenAI-compatible API programmatically, a few model considerations differ from interactive use. For chat completions in automated pipelines, lower temperatures (0.1–0.3) produce more consistent, predictable outputs that are easier to parse downstream. For embedding use cases, always use a dedicated embedding model rather than a general chat model — nomic-embed-text is the standard choice with Ollama, producing 768-dimensional embeddings that work well with most vector databases. For applications that need structured JSON output, models with strong instruction following (Llama 3.2, Qwen 2.5) reliably produce valid JSON when the system prompt explicitly requires it and temperature is set low.
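
The prompt-driven JSON approach described above looks roughly like this in practice. The schema and example text are illustrative, and the parse is defensive because even well-behaved models occasionally emit invalid JSON:

import json
from openai import OpenAI

client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

response = client.chat.completions.create(
    model='llama3.2',
    messages=[
        {'role': 'system', 'content': 'Reply ONLY with a JSON object of the form '
                                      '{"sentiment": "positive|negative|neutral", "confidence": 0.0}.'},
        {'role': 'user', 'content': 'The battery life on this laptop is fantastic.'},
    ],
    temperature=0.1,
)

raw = response.choices[0].message.content
try:
    result = json.loads(raw)
    print(result['sentiment'], result['confidence'])
except json.JSONDecodeError:
    print('Model returned non-JSON output:', raw)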

One thing to be aware of with the Ollama OpenAI-compatible API is that the model field in your request must exactly match the name of a model you have pulled locally. Unlike the real OpenAI API, where model names are stable globally, Ollama model names include version tags: llama3.2 and llama3.2:latest both resolve to the same model, but llama3.2:1b requires you to have pulled that specific variant. Run ollama list to see the exact names available on your machine.
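
Because a typo in the model name only surfaces as an error at request time, it can be worth checking the available names programmatically through the same /v1/models route. The target name here is just an example:

from openai import OpenAI

client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

# Equivalent to `ollama list`, but through the OpenAI-compatible route
available = [m.id for m in client.models.list().data]
print('Local models:', available)

target = 'llama3.2'  # example model name
if target not in available:
    raise SystemExit(f'{target} is not pulled locally; run: ollama pull {target}')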

Building a Local AI Proxy with LiteLLM

LiteLLM is a proxy server that sits between your application and multiple LLM backends, presenting a unified OpenAI-compatible interface. Running LiteLLM in front of Ollama lets you add API key authentication, rate limiting, logging, cost tracking, and the ability to fall back to a cloud provider if the local model is unavailable — all without changing your application code.

pip install litellm[proxy]

# Start LiteLLM proxy pointing at Ollama
litellm --model ollama/llama3.2 --port 8000

# Your app now points at LiteLLM instead of Ollama directly
from openai import OpenAI
client = OpenAI(base_url='http://localhost:8000', api_key='any-key')
response = client.chat.completions.create(
    model='ollama/llama3.2',
    messages=[{'role': 'user', 'content': 'Hello'}]
)

The LiteLLM approach is particularly useful in team environments where you want centralised control over which models are accessible, usage logging for billing or auditing purposes, or the ability to transparently swap between local and cloud models based on request load or model availability. It adds one extra process to manage but provides operational flexibility that direct Ollama API access does not.

Testing Your Ollama API Setup

Before integrating into an application, it is worth verifying the API works correctly with a quick curl test. These commands confirm Ollama is running, the model is available, and the OpenAI-compatible endpoint is responding correctly:

# Check available models
curl http://localhost:11434/v1/models | python3 -m json.tool

# Quick chat completion test
curl http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Say hi"}]}'

# Embedding test
curl http://localhost:11434/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"model": "nomic-embed-text", "input": "test sentence"}'

If the models endpoint returns an empty list, you need to pull a model first with ollama pull llama3.2. If you get a connection refused error, Ollama is not running — start it with ollama serve or open the Ollama desktop app.

Performance Tuning for API Use

When calling Ollama through the API in a loop or from multiple concurrent clients, a few settings make a meaningful difference to throughput and latency. The most important is ensuring Ollama is running with GPU acceleration — if inference is falling back to CPU you will see tokens-per-second in single digits rather than the 30–80 typical on a mid-range GPU. Check by running ollama ps while a request is in flight; it shows which device the model is running on and the current memory utilisation.
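
For a quick client-side estimate without watching ollama ps, you can time a request and divide by the token count reported in the usage field, which Ollama normally populates. A rough sketch:

import time
from openai import OpenAI

client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

start = time.perf_counter()
response = client.chat.completions.create(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Explain what a KV cache is.'}],
)
elapsed = time.perf_counter() - start

# Rough tokens-per-second estimate; includes prefill time, so it slightly
# understates pure generation speed for long prompts
usage = response.usage
if usage is not None:
    print(f'{usage.completion_tokens} tokens in {elapsed:.1f}s '
          f'(~{usage.completion_tokens / elapsed:.1f} tok/s)')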

For concurrent requests, Ollama queues them by default and processes one at a time. If you need to serve multiple simultaneous API clients, set the OLLAMA_NUM_PARALLEL environment variable to allow concurrent inference. Be aware this multiplies VRAM usage proportionally — running two requests in parallel with a 7B model effectively requires twice the KV cache memory. On a 24GB GPU you can typically handle two or three parallel requests with a 7B model before hitting memory limits.
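
A rough way to observe this behaviour is to fire several requests from threads and compare the total wall-clock time with and without OLLAMA_NUM_PARALLEL set. A sketch, with timings that will of course vary by hardware:

import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model='llama3.2',
        messages=[{'role': 'user', 'content': prompt}],
        max_tokens=50,
    )
    return response.choices[0].message.content

prompts = ['Define latency.', 'Define throughput.', 'Define bandwidth.']

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(ask, prompts))
print(f'{len(results)} requests took {time.perf_counter() - start:.1f}s')
# Without OLLAMA_NUM_PARALLEL the requests queue, so total time is roughly the
# sum of the three; with it set to 3 they overlap, at the cost of more VRAM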

Response latency for the first token (time to first token, TTFT) is dominated by the prefill phase — processing the input prompt. For applications where the system prompt is long and repeated across many requests, Ollama’s prompt caching helps significantly: if consecutive requests share an identical prefix (same system prompt), Ollama reuses the cached KV state for that prefix rather than recomputing it. Structure your prompts to keep the system prompt at the beginning and vary only the user content to maximise cache hits.
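
In code, exploiting the cache simply means keeping the long system prompt byte-for-byte identical across calls and varying only the user message. SYSTEM_PROMPT below is a placeholder for whatever fixed instructions your application uses:

from openai import OpenAI

client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

# Long, fixed system prompt: identical across requests so its prefill can be reused
SYSTEM_PROMPT = (
    'You are a support assistant for ExampleCorp. Follow the style guide: '
    'be concise, cite the relevant manual section, never promise refunds.'
)

def answer(question: str) -> str:
    response = client.chat.completions.create(
        model='llama3.2',
        messages=[
            {'role': 'system', 'content': SYSTEM_PROMPT},  # shared cached prefix
            {'role': 'user', 'content': question},         # only this part varies
        ],
    )
    return response.choices[0].message.content

for q in ['How do I reset my password?', 'Where is the invoice history?']:
    print(answer(q))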

Practical Use Cases

The Ollama OpenAI-compatible API unlocks several practical workflows that are otherwise awkward with the native Ollama API. Text classification pipelines that process thousands of documents locally benefit from the familiar interface and easy batching. Internal developer tools that use an LLM for code review, documentation generation, or test case writing can run entirely on-premises with no data leaving the organisation. Automated content pipelines that previously relied on OpenAI can be migrated to Ollama for cost savings, with the cloud API kept as a fallback for edge cases that exceed local model capability. Embedding pipelines for semantic search, RAG systems, and recommendation engines all work with the embeddings endpoint, replacing OpenAI’s ada-002 embeddings with local alternatives at zero ongoing cost once the hardware investment is made. The combination of full API compatibility, zero per-token cost, and complete data privacy makes the Ollama OpenAI-compatible endpoint one of the most practical features for teams looking to reduce LLM infrastructure costs without a major code rewrite.

Async Usage with the OpenAI SDK

For high-throughput applications that call Ollama from async Python code — FastAPI services, async pipelines, concurrent document processors — the OpenAI SDK’s async client works identically to the sync version, just with await syntax. This lets you fire off multiple Ollama requests concurrently from a single async application without blocking on each response.

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

async def classify(text: str) -> str:
    response = await client.chat.completions.create(
        model='llama3.2',
        messages=[
            {'role': 'system', 'content': 'Classify the sentiment as positive, negative, or neutral. Reply with one word only.'},
            {'role': 'user', 'content': text}
        ],
        temperature=0.0,
        max_tokens=5,
    )
    return response.choices[0].message.content.strip()

async def batch_classify(texts: list[str]) -> list[str]:
    # Run all classifications concurrently
    tasks = [classify(t) for t in texts]
    return await asyncio.gather(*tasks)

# Example usage
texts = ['Great product!', 'Terrible experience.', 'It was okay.']
results = asyncio.run(batch_classify(texts))
for text, label in zip(texts, results):
    print(f'{label:10} | {text}')

Keep in mind that Ollama processes requests sequentially by default even with concurrent async calls — they queue up rather than running truly in parallel unless you set OLLAMA_NUM_PARALLEL. For batch processing where you care about total throughput rather than individual latency, the async approach still helps because you can pipeline the HTTP overhead while Ollama processes each request in sequence.

Getting Started: The Shortest Path

If you are new to using Ollama’s API and want the shortest path to a working setup, here is the complete sequence. Install Ollama from ollama.com, pull a model with ollama pull llama3.2, verify it is running with curl http://localhost:11434/v1/models, install the OpenAI Python package with pip install openai, then copy the two-line client setup from the SDK example above. That is genuinely all that is required to have a fully functional, free, private LLM API running on your machine. The OpenAI SDK handles connection management, retries, and response parsing exactly as it does for the real OpenAI API — you get the same developer experience with none of the per-token cost or data privacy concerns. For most developers, the most surprising thing about the Ollama OpenAI-compatible API is how little there is to configure — the hard part is choosing which model to use, not setting up the integration.
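
Put together, the entire shortest path is one short script, assuming llama3.2 has been pulled and pip install openai has been run:

from openai import OpenAI

# Point the SDK at the local Ollama endpoint and ask for a single completion
client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
response = client.chat.completions.create(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Give me one tip for writing clear Python.'}],
)
print(response.choices[0].message.content)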

Security Note for Network-Exposed Setups

By default Ollama listens only on localhost, so the API is not accessible from other machines. If you set OLLAMA_HOST=0.0.0.0 to expose it on your local network, be aware there is no authentication on the API by default — any machine on your network can send requests to it. For home networks this is generally acceptable, but for office or shared networks it is worth either keeping Ollama on localhost and using a reverse proxy with basic auth in front of it, or using LiteLLM as a proxy that adds API key authentication. Never expose a default Ollama API directly to the public internet without authentication, as it would allow anyone to run unlimited inference on your hardware.
