Ollama exposes a REST API on port 11434 that gives you direct programmatic control over model management and inference. Compared with the OpenAI-compatible endpoint at /v1, the native Ollama API adds capabilities such as streaming with full metadata, model loading/unloading control, embedding generation, and process inspection. This reference covers the core endpoints, each with a working curl example.
Base URL and Prerequisites
All endpoints are relative to http://localhost:11434. Ollama must be running — either via the desktop app or ollama serve in a terminal. No authentication is required by default. Test the connection with:
curl http://localhost:11434/api/tags
GET /api/tags — List Models
Returns all locally available models with their size, modification time, and digest.
curl http://localhost:11434/api/tags
import requests

# Print each local model with its on-disk size
models = requests.get('http://localhost:11434/api/tags').json()
for m in models['models']:
    print(m['name'], f"{m['size']/1e9:.1f}GB")
POST /api/chat — Chat Completion
The main inference endpoint. Accepts a model name, message history, and optional parameters.
# Non-streaming
curl http://localhost:11434/api/chat \
-d '{
"model": "llama3.2",
"messages": [{"role": "user", "content": "Why is the sky blue?"}],
"stream": false
}'
# Streaming (default)
curl http://localhost:11434/api/chat \
-d '{
"model": "llama3.2",
"messages": [{"role": "user", "content": "Count to 5"}]
}'
import requests, json

def chat(model, messages, stream=False, **options):
    payload = {'model': model, 'messages': messages, 'stream': stream}
    if options:
        payload['options'] = options
    r = requests.post('http://localhost:11434/api/chat', json=payload, stream=stream)
    if stream:
        # Each streamed line is one JSON object; print content as it arrives
        for line in r.iter_lines():
            if line:
                chunk = json.loads(line)
                print(chunk['message']['content'], end='', flush=True)
                if chunk.get('done'):
                    break
    else:
        return r.json()['message']['content']

# With parameters
chat('llama3.2', [{'role':'user','content':'Hello'}],
     temperature=0.5, num_ctx=8192)
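Because /api/chat takes the whole message history on every call, a multi-turn conversation only works if the client appends each assistant reply before sending the next user turn. The following is a minimal sketch of that bookkeeping. The Conversation class and the injected send callable are hypothetical names introduced for illustration; send is assumed to behave like the chat function above and return the reply text.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

class Conversation:
    """Keeps the running message history for a multi-turn /api/chat exchange."""

    def __init__(self, model: str, send: Callable[[str, List[Message]], str]):
        self.model = model
        self.send = send  # hypothetical: any callable that returns the reply text
        self.messages: List[Message] = []

    def ask(self, text: str) -> str:
        self.messages.append({'role': 'user', 'content': text})
        reply = self.send(self.model, self.messages)
        # Store the assistant turn so the next request carries full context
        self.messages.append({'role': 'assistant', 'content': reply})
        return reply
```

Injecting send keeps the history logic separate from the HTTP call, so it can be exercised without a running server.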
POST /api/generate — Raw Text Completion
Simpler than /api/chat — takes a raw prompt string rather than a message array. Useful for completion-style tasks.
curl http://localhost:11434/api/generate \
-d '{
"model": "llama3.2",
"prompt": "The capital of France is",
"stream": false
}'
r = requests.post('http://localhost:11434/api/generate',
json={'model':'llama3.2','prompt':'The capital of France is','stream':False})
print(r.json()['response'])
POST /api/embeddings — Generate Embeddings
curl http://localhost:11434/api/embeddings \
-d '{
"model": "nomic-embed-text",
"prompt": "The quick brown fox"
}'
def embed(text: str, model: str = 'nomic-embed-text') -> list[float]:
    r = requests.post('http://localhost:11434/api/embeddings',
                      json={'model': model, 'prompt': text})
    return r.json()['embedding']

vec = embed('machine learning is fascinating')
print(f'Dimension: {len(vec)}')
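Embedding vectors are usually compared rather than inspected directly. As an illustration of what to do with the list returned above, here is a small pure-Python cosine similarity. The function name is an assumption of this sketch, and for large batches a numerical library would likely be a better fit.

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError('vectors must have the same dimension')
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)
```

With the embed helper above, two texts could be compared as cosine_similarity(embed(text_a), embed(text_b)).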
POST /api/pull — Download a Model
# Pull with streaming progress
curl http://localhost:11434/api/pull \
-d '{"name": "llama3.2"}'
# Pull specific tag
curl http://localhost:11434/api/pull \
-d '{"name": "llama3.2:3b-instruct-q4_K_M"}'
def pull_model(name: str):
    r = requests.post('http://localhost:11434/api/pull',
                      json={'name': name}, stream=True)
    for line in r.iter_lines():
        if line:
            status = json.loads(line)
            # Progress lines carry byte counts; skip the math if total is missing or 0
            if status.get('total') and 'completed' in status:
                pct = 100 * status['completed'] / status['total']
                print(f"\r{status.get('status','')} {pct:.1f}%", end='')
            else:
                print(status.get('status',''))

pull_model('qwen2.5-coder:7b')
DELETE /api/delete — Remove a Model
curl -X DELETE http://localhost:11434/api/delete \
-d '{"name": "llama3.2"}'
POST /api/copy — Duplicate a Model
# Copy/rename a model
curl http://localhost:11434/api/copy \
-d '{"source": "llama3.2", "destination": "my-llama"}'
GET /api/ps — Running Models
Shows which models are currently loaded in memory, their VRAM usage, and expiry time.
curl http://localhost:11434/api/ps
ps = requests.get('http://localhost:11434/api/ps').json()
for m in ps.get('models', []):
    print(m['name'], f"{m.get('size_vram',0)/1e9:.1f}GB VRAM",
          'expires:', m.get('expires_at',''))
POST /api/show — Model Details
Returns the full Modelfile, parameters, template, and metadata for a model.
curl http://localhost:11434/api/show \
-d '{"name": "llama3.2"}'
# Get a model's Modelfile programmatically
details = requests.post('http://localhost:11434/api/show',
json={'name': 'llama3.2'}).json()
print(details['modelfile'])
print('Parameters:', details.get('parameters',''))
POST /api/create — Create a Model from a Modelfile
modelfile = """FROM llama3.2
SYSTEM You are a concise assistant.
PARAMETER temperature 0.3
"""
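# Note: some newer Ollama releases replace the 'modelfile' field with structured
# fields such as 'from', 'system', and 'parameters'. Check the API docs for your version.
r = requests.post('http://localhost:11434/api/create',
                  json={'name': 'concise-llama', 'modelfile': modelfile}, stream=True)
for line in r.iter_lines():
    if line:
        print(json.loads(line).get('status',''))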
Key Options for /api/chat and /api/generate
Both inference endpoints accept an options object with these parameters:
{
"options": {
"temperature": 0.7,
"top_p": 0.9,
"top_k": 40,
"num_ctx": 8192,
"num_predict": 512,
"repeat_penalty": 1.1,
"seed": 42,
"stop": ["\n\n", ""]
}
}
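When options are assembled from user input or configuration, it helps to send only the keys that were actually set and to catch misspelled names early. The sketch below shows one way to do that. build_payload and KNOWN_OPTIONS are hypothetical names, and the allow-list deliberately contains only the keys from the example above rather than every option Ollama accepts.

```python
from typing import Any, Dict, List

# Option keys taken from the example above; extend as needed
KNOWN_OPTIONS = {'temperature', 'top_p', 'top_k', 'num_ctx',
                 'num_predict', 'repeat_penalty', 'seed', 'stop'}

def build_payload(model: str, messages: List[Dict[str, str]],
                  stream: bool = False, **options: Any) -> Dict[str, Any]:
    """Build an /api/chat request body, rejecting option names not in the allow-list."""
    unknown = set(options) - KNOWN_OPTIONS
    if unknown:
        raise ValueError(f'unknown option(s): {sorted(unknown)}')
    payload: Dict[str, Any] = {'model': model, 'messages': messages, 'stream': stream}
    # Drop None values so unset options are simply not sent
    chosen = {k: v for k, v in options.items() if v is not None}
    if chosen:
        payload['options'] = chosen
    return payload
```

The returned dictionary can be passed directly as the json argument of requests.post.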
Streaming Response Format
When streaming is enabled (the default), each response line is a JSON object. For /api/chat:
{"model":"llama3.2","created_at":"...","message":{"role":"assistant","content":"Hello"},"done":false}
{"model":"llama3.2","created_at":"...","message":{"role":"assistant","content":" there"},"done":false}
{"model":"llama3.2","created_at":"...","message":{"role":"assistant","content":""},"done":true,
"total_duration":2341000000,"load_duration":123000000,
"prompt_eval_count":12,"eval_count":45,"eval_duration":1980000000}
The final chunk (where done: true) includes performance statistics: total_duration, load_duration, prompt_eval_count (input tokens), eval_count (output tokens), and eval_duration. Use these to calculate tokens per second: eval_count / (eval_duration / 1e9).
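To make the format concrete, the sketch below joins the content fragments of a stream and applies the tokens-per-second formula to the final chunk. collect_stream is a hypothetical helper name. It takes any iterable of JSON lines, so the sample here uses literal strings shaped like the chunks shown above instead of a live response.

```python
import json
from typing import Iterable, Tuple

def collect_stream(lines: Iterable[str]) -> Tuple[str, float]:
    """Join streamed /api/chat chunks; return (text, tokens_per_second)."""
    parts = []
    tok_per_sec = 0.0
    for line in lines:
        if not line.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get('message', {}).get('content', ''))
        if chunk.get('done'):
            duration_ns = chunk.get('eval_duration', 0)
            if duration_ns > 0:
                tok_per_sec = chunk.get('eval_count', 0) / (duration_ns / 1e9)
            break
    return ''.join(parts), tok_per_sec

sample = [
    '{"message":{"role":"assistant","content":"Hello"},"done":false}',
    '{"message":{"role":"assistant","content":" there"},"done":false}',
    '{"message":{"role":"assistant","content":""},"done":true,'
    '"eval_count":45,"eval_duration":1980000000}',
]
text, rate = collect_stream(sample)
print(text, f'{rate:.1f} tok/s')
```

With a live request, the same function should work on r.iter_lines(), since json.loads also accepts the bytes that requests yields.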
Health Check
# Simple health check — returns 'Ollama is running' if up
curl http://localhost:11434/
Use this in scripts to verify Ollama is running before attempting model operations. Pair with /api/ps to check if a specific model is loaded and avoid the cold-start latency on the first request.
Why Use the Native API Instead of the OpenAI-Compatible Endpoint?
Ollama exposes two API surfaces: the native API at /api/* and the OpenAI-compatible API at /v1/*. The OpenAI-compatible endpoint is the right choice when you want to use existing tools and SDKs that target the OpenAI API — it requires minimal configuration and covers the most common use cases. The native API is better when you need capabilities the compatibility layer does not expose: model management (pulling, deleting, copying, creating models programmatically), the /api/ps endpoint for inspecting running models and VRAM usage, access to complete performance statistics on each response, and fine-grained control over the streaming format. If you are building a tool that manages models — a model manager UI, an automated deployment script, a monitoring dashboard — the native API gives you everything you need. For pure inference, the OpenAI-compatible endpoint is simpler and better-supported by third-party tooling.
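One practical difference between the two surfaces is where sampling parameters live: the native API nests them under options, while OpenAI-style requests put them at the top level under slightly different names. The sketch below translates a native /api/chat payload into an OpenAI-style request body. The mapping table is an assumption based on the common OpenAI field names, and options without an obvious counterpart are reported back instead of being silently discarded.

```python
from typing import Any, Dict, List, Tuple

# Native option name -> OpenAI-style request field (assumed mapping)
OPTION_TO_OPENAI = {
    'temperature': 'temperature',
    'top_p': 'top_p',
    'seed': 'seed',
    'stop': 'stop',
    'num_predict': 'max_tokens',
}

def to_openai_request(native: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Translate a native /api/chat payload; return (request, dropped_option_names)."""
    request = {
        'model': native['model'],
        'messages': native['messages'],
        # The native API streams by default, so make that explicit
        'stream': native.get('stream', True),
    }
    dropped = []
    for name, value in native.get('options', {}).items():
        if name in OPTION_TO_OPENAI:
            request[OPTION_TO_OPENAI[name]] = value
        else:
            dropped.append(name)
    return request, sorted(dropped)
```

Options that end up in the dropped list, such as num_ctx, are the kind of control for which staying on the native API is the simpler choice.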
A Minimal Python Client
import requests, json

class OllamaClient:
    def __init__(self, base='http://localhost:11434'):
        self.base = base

    def _body(self, model, messages, stream, opts):
        return {'model': model, 'messages': messages, 'stream': stream, 'options': opts}

    def chat(self, model, messages, **opts):
        # Non-streaming: returns the full reply as a string
        r = requests.post(f'{self.base}/api/chat',
                          json=self._body(model, messages, False, opts))
        return r.json()['message']['content']

    def chat_stream(self, model, messages, **opts):
        # Streaming: a generator that yields content fragments as they arrive.
        # Kept separate from chat() because a function containing yield
        # always returns a generator, even on its non-streaming path.
        r = requests.post(f'{self.base}/api/chat',
                          json=self._body(model, messages, True, opts), stream=True)
        for line in r.iter_lines():
            if line:
                c = json.loads(line)
                yield c['message']['content']
                if c.get('done'):
                    break

    def embed(self, text, model='nomic-embed-text'):
        return requests.post(f'{self.base}/api/embeddings',
                             json={'model': model, 'prompt': text}).json()['embedding']

    def models(self):
        return [m['name'] for m in requests.get(f'{self.base}/api/tags').json().get('models', [])]

    def running(self):
        return requests.get(f'{self.base}/api/ps').json().get('models', [])

client = OllamaClient()
print('Available:', client.models())
print(client.chat('llama3.2', [{'role':'user','content':'Hello'}]))
Tracking Performance Statistics
Every completed response includes timing and token statistics in the final chunk. These are invaluable for monitoring inference performance and diagnosing slow requests:
def chat_with_stats(model, prompt):
    r = requests.post('http://localhost:11434/api/chat',
                      json={'model': model,
                            'messages': [{'role': 'user', 'content': prompt}],
                            'stream': False})
    d = r.json()
    ns = 1_000_000_000  # durations are reported in nanoseconds
    return {
        'response': d['message']['content'],
        'input_tokens': d.get('prompt_eval_count', 0),
        'output_tokens': d.get('eval_count', 0),
        'load_sec': d.get('load_duration', 0) / ns,
        'total_sec': d.get('total_duration', 0) / ns,
        'tok_per_sec': d.get('eval_count', 0) / max(d.get('eval_duration', 1) / ns, 0.001)
    }

result = chat_with_stats('llama3.2', 'Explain transformers briefly')
print(f"{result['tok_per_sec']:.1f} tok/s | {result['input_tokens']} in / {result['output_tokens']} out")
The load_duration field distinguishes a cold start (model loading from disk — several seconds) from a warm request (model already in VRAM — near zero). Use /api/ps before time-sensitive requests to check whether the model is already loaded and avoid unexpected first-request latency.
Error Handling
def safe_chat(model, messages):
    try:
        r = requests.post('http://localhost:11434/api/chat',
                          json={'model': model, 'messages': messages, 'stream': False},
                          timeout=120)
        r.raise_for_status()
        d = r.json()
        if 'error' in d:
            print(f'Ollama error: {d["error"]}')
            return None
        return d['message']['content']
    except requests.exceptions.ConnectionError:
        print('Cannot connect — is Ollama running?')
    except requests.exceptions.Timeout:
        print('Request timed out — model may be loading')
    except requests.exceptions.HTTPError as e:
        print(f'HTTP {e.response.status_code}: {e.response.text}')
    return None
Checking If a Model Is Loaded Before Requesting
For latency-sensitive applications, check /api/ps before sending a request. If the model is not listed as running, the first request will incur a cold-start load penalty — for a 7B model this is typically 3–8 seconds. You can either accept this or pre-warm the model by sending a short dummy request after Ollama starts:
def is_model_loaded(model_name, base='http://localhost:11434'):
    running = requests.get(f'{base}/api/ps').json().get('models', [])
    return any(m['name'].startswith(model_name) for m in running)

def ensure_loaded(model, base='http://localhost:11434'):
    if not is_model_loaded(model, base):
        print(f'Pre-warming {model}...')
        # A throwaway request forces the model to load into memory
        requests.post(f'{base}/api/generate',
                      json={'model': model, 'prompt': ' ', 'stream': False})
    print('Ready')

ensure_loaded('llama3.2')
Concurrent Requests and OLLAMA_NUM_PARALLEL
By default Ollama processes one request at a time and queues additional ones. For multi-user setups or applications that send parallel requests, set the OLLAMA_NUM_PARALLEL environment variable before starting Ollama. Setting it to 2 allows two simultaneous requests at the cost of roughly doubling VRAM usage for the KV cache. For most personal and small-team use cases sequential processing is fine — the queue handles bursts without errors, and each request gets full GPU bandwidth rather than competing with concurrent inference. Only set OLLAMA_NUM_PARALLEL above 1 if you have profiled your workload and confirmed that queue depth is a real bottleneck rather than inference speed itself.
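The VRAM cost of parallelism can be estimated with the standard key/value cache formula: two tensors per layer, each sized by the number of KV heads, the head dimension, and the context length. The sketch below assumes each parallel slot gets its own full context of num_ctx tokens and that cache entries are stored as 16-bit values. The model shape is a hypothetical example chosen only for illustration, so substitute the values for your own model.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   num_ctx: int, num_parallel: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Rough KV cache size in bytes for num_parallel slots of num_ctx tokens each."""
    # Factor of 2: one key tensor and one value tensor per layer
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_ctx * num_parallel

# Hypothetical model shape, chosen only for illustration
shape = dict(num_layers=28, num_kv_heads=8, head_dim=128, num_ctx=8192)
for parallel in (1, 2, 4):
    gb = kv_cache_bytes(**shape, num_parallel=parallel) / 1e9
    print(f'OLLAMA_NUM_PARALLEL={parallel}: ~{gb:.2f} GB KV cache')
```

The estimate covers only the cache. Model weights are loaded once and shared, which is why doubling parallelism grows the cache but does not double total memory use.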
A Quick Reference Card
All endpoints at a glance: GET /api/tags to list models, POST /api/chat for chat with message history, POST /api/generate for raw text completion, POST /api/embeddings for embeddings, POST /api/pull to download a model, DELETE /api/delete to remove one, POST /api/copy to duplicate, GET /api/ps to see running models, POST /api/show to inspect a model’s Modelfile and metadata, POST /api/create to create a model from a Modelfile, and GET / for a simple health check. These eleven endpoints cover everything you need to build, manage, and monitor a complete local LLM application stack on top of Ollama.
Using the API for Model Management Automation
The model management endpoints make it straightforward to automate your Ollama setup. A common pattern in team environments is a setup script that pulls the required models on first run, verifies they are available, and exits gracefully if Ollama is not running. This prevents the frustration of team members running an application that silently fails because the required model was never pulled.
import json
import sys

import requests

REQUIRED_MODELS = ['llama3.2', 'nomic-embed-text', 'qwen2.5-coder:7b']
BASE = 'http://localhost:11434'

def check_ollama():
    try:
        r = requests.get(BASE, timeout=3)
        return r.status_code == 200
    except requests.exceptions.RequestException:
        # Covers refused connections as well as timeouts
        return False

def list_models():
    return {m['name'] for m in
            requests.get(f'{BASE}/api/tags').json().get('models', [])}

def pull_if_missing(model):
    available = list_models()
    # Check if any variant of the model is present
    if any(a.startswith(model.split(':')[0]) for a in available):
        print(f' {model}: already available')
        return
    print(f' {model}: pulling...')
    r = requests.post(f'{BASE}/api/pull', json={'name': model}, stream=True)
    for line in r.iter_lines():
        if line:
            status = json.loads(line).get('status', '')
            if 'pulling' in status or 'success' in status:
                print(f' {status}')

if __name__ == '__main__':
    if not check_ollama():
        print('ERROR: Ollama is not running. Start it with: ollama serve')
        sys.exit(1)
    print('Ensuring required models are available...')
    for model in REQUIRED_MODELS:
        pull_if_missing(model)
    print('All models ready.')
Run this script as part of your project’s setup or CI pipeline to ensure a consistent model environment across machines. The check for existing models uses a prefix match on the part of the name before the colon, so llama3.2 matches llama3.2:latest or llama3.2:3b. Be aware that it also matches differently named models that share the prefix, such as llama3.2-vision, so adapt the matching logic to your versioning requirements.
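If prefix matching is too loose for your setup, one alternative is to compare fully tagged names. The sketch below assumes that a name without a tag refers to the latest tag, and it looks for the tag only in the last path segment so that a registry host with a port is not mistaken for one. normalize and is_available are hypothetical helper names.

```python
from typing import Iterable

def normalize(name: str) -> str:
    """Append ':latest' when the final path segment carries no tag (assumed default)."""
    last_segment = name.rsplit('/', 1)[-1]
    return name if ':' in last_segment else f'{name}:latest'

def is_available(required: str, available: Iterable[str]) -> bool:
    """Exact match on model name plus tag, after normalizing both sides."""
    return normalize(required) in {normalize(a) for a in available}
```

With this check, a requirement of qwen2.5-coder:7b is satisfied only by that exact tag, not by any other qwen2.5-coder variant.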
Building a Model Status Dashboard
The /api/ps and /api/tags endpoints provide enough information to build a simple status dashboard that shows which models are available, which are currently loaded, their VRAM usage, and when the loaded models will be evicted from memory. This is useful in multi-user setups where you want visibility into what is running without SSH-ing into the server:
import requests

def model_dashboard(base='http://localhost:11434'):
    available = requests.get(f'{base}/api/tags').json().get('models', [])
    running = requests.get(f'{base}/api/ps').json().get('models', [])
    running_names = {m['name'] for m in running}
    print('=== Ollama Model Dashboard ===')
    print(f'Total models: {len(available)}')
    print(f'Loaded in memory: {len(running)}')
    print()
    if running:
        print('LOADED (in VRAM/RAM):')
        for m in running:
            vram_gb = m.get('size_vram', 0) / 1e9
            expires = m.get('expires_at', 'unknown')
            # Trim the timestamp to whole seconds for display
            print(f" {m['name']:40} {vram_gb:.1f}GB VRAM expires: {expires[:19]}")
        print()
    print('ALL AVAILABLE MODELS:')
    for m in available:
        loaded = '* LOADED' if m['name'] in running_names else ''
        size_gb = m.get('size', 0) / 1e9
        print(f" {m['name']:40} {size_gb:.1f}GB {loaded}")

model_dashboard()
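A raw expiry timestamp is harder to read at a glance than a countdown. The sketch below converts an expires_at value into seconds remaining. It assumes the field is an RFC 3339 style timestamp with optional fractional seconds and either Z or a numeric UTC offset, and it drops the fractional part so the parsing also works on older Python versions. seconds_until is a hypothetical helper name.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Date and time, optional fractional seconds, then 'Z' or a +HH:MM / -HH:MM offset
_TS = re.compile(r'^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.\d+)?(Z|[+-]\d{2}:\d{2})$')

def seconds_until(expires_at: str, now: Optional[datetime] = None) -> float:
    """Seconds from now until expires_at; negative if already past."""
    m = _TS.match(expires_at)
    if not m:
        raise ValueError(f'unrecognized timestamp: {expires_at!r}')
    offset = '+00:00' if m.group(2) == 'Z' else m.group(2)
    # Fractional seconds are dropped; sub-second precision is not needed here
    expiry = datetime.fromisoformat(m.group(1) + offset)
    if now is None:
        now = datetime.now(timezone.utc)
    return (expiry - now).total_seconds()
```

In the dashboard this could replace the truncated timestamp with something like f'{seconds_until(expires) / 60:.0f} min left', guarded against entries where the field is missing.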
Rate Limiting Considerations
Unlike commercial APIs, Ollama has no per-client rate limiting built in: it accepts requests as they arrive and queues whatever it cannot process immediately. The queue is bounded by the OLLAMA_MAX_QUEUE environment variable (512 by default), and once it is full the server rejects further requests with a 503 error. The practical limit is hardware: if the queue grows faster than inference completes, latency climbs and memory pressure can build up. For high-throughput batch processing, use a semaphore to limit concurrent outstanding requests to a number that your hardware can handle. As a rough guide, keep outstanding requests below OLLAMA_NUM_PARALLEL plus two or three queued. Beyond that, the queue adds latency without increasing throughput.
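The semaphore approach can be sketched with the standard library alone. In the example below, run_bounded is a hypothetical helper and send stands for whatever function performs a single request, for instance a wrapper around /api/generate. The thread pool is deliberately larger than the limit so that the semaphore, not the pool size, is what caps the number of requests in flight.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List

def run_bounded(prompts: Iterable[str], send: Callable[[str], str],
                max_outstanding: int = 3) -> List[str]:
    """Run send(prompt) for every prompt with at most max_outstanding in flight."""
    gate = threading.Semaphore(max_outstanding)

    def guarded(prompt: str) -> str:
        with gate:  # blocks while the limit is reached
            return send(prompt)

    with ThreadPoolExecutor(max_workers=8) as pool:
        # pool.map returns results in the same order as the input prompts
        return list(pool.map(guarded, prompts))
```

Setting the pool size equal to the limit would achieve the same cap. The explicit semaphore is mainly useful when the pool is shared with other work.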
The Ollama REST API is intentionally simple and stable: the endpoints described here have not changed significantly since Ollama’s initial public release, and Ollama’s maintainers have stated that backward compatibility is a priority. Building on the native API directly, rather than through a compatibility shim, gives you access to Ollama-specific features as they are added and makes your integration easier to debug, since there is no translation layer between your requests and what Ollama actually receives.
Integrating the Ollama API into a Web Application
A common pattern is embedding Ollama API calls into a lightweight backend service that your frontend communicates with, rather than calling Ollama directly from the browser. This keeps the Ollama port internal to your server, adds a layer where you can implement authentication, logging, and rate limiting, and avoids CORS issues that arise when a browser tries to call localhost on a different port. FastAPI makes this straightforward:
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import requests, json

app = FastAPI()
OLLAMA = 'http://localhost:11434'

class ChatRequest(BaseModel):
    model: str = 'llama3.2'
    messages: list
    temperature: float = 0.7

@app.post('/chat')
def chat(req: ChatRequest):
    def generate():
        # Relay Ollama's streamed chunks to the client as plain text
        r = requests.post(f'{OLLAMA}/api/chat',
                          json={'model': req.model, 'messages': req.messages,
                                'stream': True, 'options': {'temperature': req.temperature}},
                          stream=True)
        for line in r.iter_lines():
            if line:
                chunk = json.loads(line)
                yield chunk['message']['content']
                if chunk.get('done'):
                    break
    return StreamingResponse(generate(), media_type='text/plain')

@app.get('/models')
def models():
    r = requests.get(f'{OLLAMA}/api/tags')
    return [m['name'] for m in r.json().get('models', [])]

# Run: uvicorn app:app --port 8080
This pattern is the backbone of most self-hosted LLM web applications — a thin FastAPI wrapper around Ollama that your frontend communicates with over a standard HTTP port, with the Ollama API entirely internal to your server network.
Testing Your API Integration
Before wiring the Ollama API into a production application, a few minutes of manual testing with curl catches the most common configuration issues. Test each endpoint in order: first the health check to confirm Ollama is running, then /api/tags to confirm models are available, then a simple non-streaming /api/chat call with stream: false to confirm inference works, then a streaming call to confirm streaming works end-to-end. If any of these fail, the error message from Ollama usually identifies the issue precisely — a missing model returns a clear 404-style error, a context length overflow returns a specific error message, and a CUDA out-of-memory error appears in both the response and the Ollama server logs. This sequential test approach takes two minutes and saves significant debugging time compared to discovering issues inside a larger application where the error path is harder to isolate.
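The same ordered sequence can be scripted once the manual curl pass works. The sketch below is a small runner that executes named checks in order and stops at the first failure, mirroring the health check, then tags, then chat progression described above. run_checks is a hypothetical helper, and the checks are plain callables so that the HTTP calls themselves stay in your own code.

```python
from typing import Callable, List, Tuple

Check = Tuple[str, Callable[[], bool]]

def run_checks(checks: List[Check]) -> Tuple[bool, List[str]]:
    """Run checks in order; stop at the first failure. Returns (ok, log)."""
    log = []
    for name, check in checks:
        try:
            passed = bool(check())
        except Exception as exc:  # a raised error counts as a failure
            log.append(f'FAIL {name}: {exc}')
            return False, log
        if not passed:
            log.append(f'FAIL {name}')
            return False, log
        log.append(f'ok   {name}')
    return True, log
```

With requests, the first two checks might be written as lambda: requests.get(base, timeout=3).ok and lambda: bool(requests.get(f'{base}/api/tags').json().get('models')), where base is your Ollama URL.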
The Ollama API in Production
The Ollama REST API is intentionally simple and has remained stable since the project’s initial public release — the core endpoints have not had breaking changes, which makes it safe to build on without worrying about frequent migrations. For production deployments, the combination of the native API’s model management capabilities and the OpenAI-compatible endpoint’s broad tool support covers virtually every use case: use /api/pull, /api/ps, and /api/show for operational management, and use /v1/chat/completions for application inference where you want to reuse existing SDK integrations. This layered approach gives you the operational visibility of the native API with the ecosystem compatibility of the OpenAI-compatible surface, without needing to choose one or the other. The eleven native endpoints and three OpenAI-compatible endpoints together constitute a complete, self-contained LLM infrastructure that runs entirely on hardware you control, with no external dependencies beyond the model weights themselves.
Getting Started
If you are new to the Ollama API, the fastest path to a working integration is: ensure Ollama is running (ollama serve), install the requests library (pip install requests), and copy the minimal OllamaClient class from this article. With those three steps you have a working Python client that handles chat, embeddings, and model listing. Add the model management functions as you need them — most applications only ever use /api/chat and /api/embeddings in production, with /api/pull and /api/ps reserved for setup scripts and monitoring. The complete API surface is small enough that a developer familiar with REST APIs can be productive with it in under an hour, which is one of Ollama’s most underrated design achievements — a powerful local LLM runtime with an API simple enough to use without reading extensive documentation. That simplicity, combined with the zero-cost inference and full data privacy that local deployment provides, is why the Ollama API has become the standard interface for the growing ecosystem of local AI applications and developer tools. Bookmark the eleven endpoints, keep the curl examples handy, and you have everything you need to build reliable, private, cost-free AI applications on your own infrastructure. The API documentation at docs.ollama.com stays current with each release if you need the authoritative reference for any new endpoints added after this article was written.