Ollama exposes a REST API on port 11434 that you can call directly with curl, any HTTP client, or the official SDKs. This reference covers every endpoint with practical examples — useful when building applications, debugging integrations, or scripting model management tasks.
Base URL and Authentication
The base URL is http://localhost:11434 by default. There is no authentication — any process on the machine can call the API. For remote access, set OLLAMA_HOST=0.0.0.0:11434 and control access with firewall rules. All requests use standard HTTP methods and JSON bodies. Responses are JSON, or newline-delimited JSON (NDJSON) for streaming endpoints.
GET / — Health Check
curl http://localhost:11434/
# Response: Ollama is running
POST /api/generate — Text Generation
# Non-streaming
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
# Streaming (default)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Tell me a joke."
}'
# With options
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Complete this code:",
  "system": "You are a Python expert.",
  "options": {
    "temperature": 0.2,
    "num_ctx": 8192,
    "top_p": 0.9
  },
  "keep_alive": "1h",
  "stream": false
}'
POST /api/chat — Chat Completion
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4"},
    {"role": "user", "content": "What about 3+3?"}
  ],
  "stream": false
}'
# With structured output (format parameter)
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Extract: John, john@example.com"}],
  "format": {
    "type": "object",
    "properties": {
      "name": {"type": "string"},
      "email": {"type": "string"}
    },
    "required": ["name", "email"]
  },
  "stream": false
}'
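In Python, the same structured-output request looks like the sketch below. The format-constrained reply arrives as a JSON string inside message.content, which you parse yourself; the schema mirrors the curl example above.

import json, requests

schema = {
    'type': 'object',
    'properties': {'name': {'type': 'string'}, 'email': {'type': 'string'}},
    'required': ['name', 'email'],
}
r = requests.post('http://localhost:11434/api/chat', json={
    'model': 'llama3.2',
    'messages': [{'role': 'user', 'content': 'Extract: John, john@example.com'}],
    'format': schema,
    'stream': False,
})
extracted = json.loads(r.json()['message']['content'])  # reply body is a JSON string
print(extracted['name'], extracted['email'])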
POST /api/embeddings — Generate Embeddings
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "The quick brown fox"
}'
# Returns: {"embedding": [0.123, -0.456, ...]}
Newer Ollama releases also expose POST /api/embed, which accepts a string or an array of strings under input and returns an embeddings array of vectors; /api/embeddings remains available for backwards compatibility.
GET /api/tags — List Local Models
curl http://localhost:11434/api/tags
# Returns all pulled models with size, digest, modified_at
POST /api/pull — Pull a Model
# Pull with progress stream
curl http://localhost:11434/api/pull -d '{"name": "llama3.2"}'
# Pull without streaming
curl http://localhost:11434/api/pull -d '{"name": "llama3.2", "stream": false}'
DELETE /api/delete — Delete a Model
curl -X DELETE http://localhost:11434/api/delete \
  -d '{"name": "llama3.2"}'
POST /api/copy — Copy a Model
curl http://localhost:11434/api/copy -d '{
  "source": "llama3.2",
  "destination": "my-custom-llama"
}'
POST /api/create — Create from Modelfile
curl http://localhost:11434/api/create -d '{
  "name": "mario",
  "modelfile": "FROM llama3.2\nSYSTEM You are Mario from Super Mario Bros."
}'
# Note: recent Ollama releases replace the modelfile string with structured fields,
# e.g. {"model": "mario", "from": "llama3.2", "system": "You are Mario..."};
# check the API docs for the version you are running.
GET /api/ps — Show Running Models
curl http://localhost:11434/api/ps
# Returns currently loaded models with size_vram and expires_at (the keep_alive expiry)
POST /api/show — Show Model Info
curl http://localhost:11434/api/show -d '{"name": "llama3.2"}'
# Returns modelfile, parameters, template, details (family, params, quant)
OpenAI-Compatible Endpoints
# Chat completions (OpenAI format)
curl http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
# List models (OpenAI format)
curl http://localhost:11434/v1/models
# Embeddings (OpenAI format)
curl http://localhost:11434/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"model": "nomic-embed-text", "input": "Hello world"}'
Using the API from Python
import requests

BASE = 'http://localhost:11434'

# Non-streaming generate
def generate(prompt: str, model='llama3.2') -> str:
    r = requests.post(f'{BASE}/api/generate',
                      json={'model': model, 'prompt': prompt, 'stream': False})
    return r.json()['response']

# Chat
def chat(messages: list, model='llama3.2') -> str:
    r = requests.post(f'{BASE}/api/chat',
                      json={'model': model, 'messages': messages, 'stream': False})
    return r.json()['message']['content']

# Embeddings
def embed(text: str, model='nomic-embed-text') -> list:
    r = requests.post(f'{BASE}/api/embeddings',
                      json={'model': model, 'prompt': text})
    return r.json()['embedding']

# List models
def list_models() -> list:
    return requests.get(f'{BASE}/api/tags').json()['models']

# Running models
def running_models() -> list:
    return requests.get(f'{BASE}/api/ps').json()['models']

print(generate('Why is Python popular?'))
print([m['name'] for m in list_models()])
print(len(embed('hello world')), 'dimensions')
Streaming Response Parsing
import requests, json

def stream_chat(messages: list, model='llama3.2'):
    with requests.post(
        'http://localhost:11434/api/chat',
        json={'model': model, 'messages': messages, 'stream': True},
        stream=True
    ) as resp:
        for line in resp.iter_lines():
            if line:
                chunk = json.loads(line)
                if not chunk.get('done'):
                    print(chunk['message']['content'], end='', flush=True)
    print()  # newline at end

stream_chat([{'role': 'user', 'content': 'Count to 5 slowly.'}])
Key Request Parameters
The options object in /api/generate and /api/chat accepts the following commonly used parameters. temperature (0.0–2.0) controls randomness: lower values produce more deterministic output, useful for extraction and coding tasks. num_ctx sets the context window size in tokens; it defaults to the model's built-in setting but can be raised up to the model's maximum (4K, 8K, 32K, or 128K depending on the model). top_p (0.0–1.0) is nucleus sampling; 0.9 is a sensible default. repeat_penalty (1.0–1.5) penalises repeated tokens, helping prevent the model from getting stuck in loops; 1.1 is a common setting. seed (integer) fixes the sampler's random seed so that repeated requests with identical parameters return identical output, even at non-zero temperature.
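To illustrate seed: two requests with identical parameters and the same seed should return identical text, even at a non-zero temperature. A minimal sketch (same machine, same model build):

import requests

def generate_seeded(prompt: str, seed: int) -> str:
    r = requests.post('http://localhost:11434/api/generate', json={
        'model': 'llama3.2',
        'prompt': prompt,
        'options': {'temperature': 0.7, 'seed': seed},
        'stream': False,
    })
    return r.json()['response']

a = generate_seeded('Name three colours.', seed=42)
b = generate_seeded('Name three colours.', seed=42)
print(a == b)  # should print True: identical seed and sampling parameters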
Response Fields
The /api/generate response includes: response (the generated text), done (true on the final chunk), prompt_eval_count (input tokens), eval_count (output tokens), eval_duration (generation time in nanoseconds), and total_duration (total request time in nanoseconds). Divide eval_count by eval_duration converted to seconds to get tokens per second — useful for benchmarking model performance. The /api/chat response wraps the output in message.content and carries the same metadata fields.
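For example, a quick benchmark that derives tokens per second from those fields (both durations are nanoseconds):

import requests

r = requests.post('http://localhost:11434/api/generate', json={
    'model': 'llama3.2', 'prompt': 'Why is the sky blue?', 'stream': False})
stats = r.json()
tok_per_sec = stats['eval_count'] / (stats['eval_duration'] / 1e9)
print(f"{stats['eval_count']} tokens in {stats['total_duration'] / 1e9:.1f}s total "
      f"-> {tok_per_sec:.1f} tokens/sec")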
Error Handling
import requests

try:
    r = requests.post('http://localhost:11434/api/generate',
                      json={'model': 'nonexistent-model', 'prompt': 'hello', 'stream': False},
                      timeout=30)
    r.raise_for_status()
    print(r.json()['response'])
except requests.exceptions.ConnectionError:
    print('Ollama not running — start with: ollama serve')
except requests.exceptions.HTTPError as e:
    print(f'API error {e.response.status_code}: {e.response.json().get("error")}')
except requests.exceptions.Timeout:
    print('Request timed out — model may be loading, try again')
Quick Reference Table
Summary of all endpoints:

GET    /                 health check
POST   /api/generate     text completion
POST   /api/chat         chat completion
POST   /api/embeddings   embeddings
GET    /api/tags         list local models
POST   /api/pull         download a model
DELETE /api/delete       remove a model
POST   /api/copy         duplicate a model
POST   /api/create       create from Modelfile
GET    /api/ps           loaded models
POST   /api/show         model details
POST   /api/push         push to registry

The OpenAI-compatible endpoints at /v1/ mirror the standard OpenAI API format for drop-in compatibility with existing tools and libraries that target the OpenAI API.
Why Use the Raw API Instead of the SDK?
The official Ollama Python and JavaScript libraries are the recommended way to integrate Ollama in most applications — they handle streaming, error types, and type safety automatically. The raw REST API earns its keep in specific situations: shell scripts where curl is available but a Python or Node.js runtime is not; environments where installing pip or npm packages is restricted; integration tests that verify the exact HTTP interface; debugging sessions where you need to isolate whether a problem lives in your application code or in the API itself; and languages without an official SDK, where you must implement the integration yourself. Understanding the raw API also clarifies what the SDKs do under the hood, which makes SDK-based applications easier to debug and their error messages easier to read.
Calling the API from JavaScript (fetch)
// Non-streaming chat
async function chat(messages, model = 'llama3.2') {
  const res = await fetch('http://localhost:11434/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model, messages, stream: false })
  });
  const data = await res.json();
  return data.message.content;
}

// Streaming: buffer partial lines, since a network chunk can end mid-JSON-object
async function streamChat(messages, model = 'llama3.2') {
  const res = await fetch('http://localhost:11434/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model, messages, stream: true })
  });
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop(); // keep the trailing partial line for the next chunk
    for (const line of lines.filter(Boolean)) {
      const chunk = JSON.parse(line);
      if (!chunk.done) process.stdout.write(chunk.message.content);
    }
  }
}

// Model management
async function listModels() {
  const res = await fetch('http://localhost:11434/api/tags');
  const data = await res.json();
  return data.models.map(m => m.name);
}

async function pullModel(name) {
  const res = await fetch('http://localhost:11434/api/pull', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ name, stream: false })
  });
  return res.json();
}
Shell Script Examples
#!/bin/bash
# quick-ask.sh — bash quick-ask.sh "What is Docker?"
# Note: naive quoting; a question containing double quotes will break the JSON body.
QUESTION="$1"
RESPONSE=$(curl -s http://localhost:11434/api/generate \
  -d "{\"model\":\"llama3.2\",\"prompt\":\"$QUESTION\",\"stream\":false}" \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['response'])")
echo "$RESPONSE"

# list-models.sh
curl -s http://localhost:11434/api/tags \
  | python3 -c "import sys,json; [print(m['name'], f'{m[\"size\"]//1e9:.1f}GB') for m in json.load(sys.stdin)['models']]"

# check-ollama.sh — exits 0 if running, 1 if not
curl -sf http://localhost:11434/ > /dev/null && echo 'Ollama running' || { echo 'Ollama not running'; exit 1; }
Rate Limiting and Concurrent Requests
Early Ollama versions processed one request at a time per model, queueing concurrent requests and serving them sequentially; since the 0.2 release, OLLAMA_NUM_PARALLEL controls how many requests a model serves concurrently (it defaults to 4, or 1 on memory-constrained machines, and can be set explicitly). Each parallel slot needs enough VRAM for its own KV cache (roughly N× the context window's memory overhead for N slots). For applications that need consistent response times under concurrent load, either configure parallel requests with adequate VRAM, or implement a client-side queue that manages concurrency at the application layer. The API itself does not return errors for excess concurrent requests: it queues them and serves them in order, so clients eventually receive responses even under heavy load, just with higher latency for requests further back in the queue.
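A minimal client-side queue, as a sketch: a thread pool whose worker count matches the server's parallel slots, so excess requests wait in your application instead of piling up on Ollama.

import requests
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 2  # match your OLLAMA_NUM_PARALLEL setting

def ask(prompt: str) -> str:
    r = requests.post('http://localhost:11434/api/generate',
                      json={'model': 'llama3.2', 'prompt': prompt, 'stream': False},
                      timeout=300)
    r.raise_for_status()
    return r.json()['response']

prompts = ['Define REST.', 'Define JSON.', 'Define HTTP.']
with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
    for answer in pool.map(ask, prompts):
        print(answer[:80])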
The API as Infrastructure
The Ollama REST API is intentionally simple and stable — nearly all of the curl commands that worked in Ollama 0.1 continue to work in current versions. This stability makes it a reliable foundation for building applications, scripts, and integrations without worrying about breaking changes between Ollama updates. The OpenAI-compatible /v1/ endpoints add compatibility with the broader ecosystem of tools and libraries that target the OpenAI API, giving you access to a large catalogue of existing integrations with minimal configuration changes. Together, the native Ollama API and the OpenAI compatibility layer cover virtually every integration scenario you are likely to encounter when building local AI applications.
Practical Tips for API Integrations
A few patterns that save time when building on the Ollama API. First, always test with stream: false during development — streaming responses are more complex to parse and debug. Once the logic is correct with non-streaming, add streaming for production if the use case benefits from it. Second, pull the response time metrics from every response (eval_duration, eval_count) and log them — this gives you a performance baseline that makes it immediately obvious when a model upgrade, hardware change, or configuration tweak affects throughput. Third, implement retry logic with exponential backoff for production integrations — Ollama occasionally fails to respond if it is in the middle of a model load or is under memory pressure, and a simple retry after 2–5 seconds resolves most transient failures without surfacing errors to users.
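A sketch of that retry pattern, retrying only connection errors and timeouts with exponential backoff; HTTP errors such as an unknown model name propagate immediately rather than being retried:

import time
import requests

def generate_with_retry(payload: dict, retries: int = 3, base_delay: float = 2.0) -> str:
    for attempt in range(retries + 1):
        try:
            r = requests.post('http://localhost:11434/api/generate',
                              json=payload, timeout=120)
            r.raise_for_status()
            return r.json()['response']
        except (requests.ConnectionError, requests.Timeout):
            if attempt == retries:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s, ...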
Fourth, use the /api/ps endpoint in your application’s health check to verify not just that Ollama is running but that your required model is loaded and ready. A response latency spike often coincides with a model being unloaded and reloaded — if your health check confirms the model is loaded, you can distinguish between Ollama being slow and Ollama needing a warm-up request before serving traffic. Fifth, for applications that process sensitive data, audit what you log: Ollama itself does not log prompt content, but your application’s HTTP logging middleware may capture request bodies. Disable body logging for Ollama API calls in production or ensure your logging pipeline handles the content appropriately.
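Returning to the fourth tip, a readiness-check sketch that treats a loaded model as the ready signal and fires a warm-up request when it is not:

import requests

def model_ready(model: str = 'llama3.2') -> bool:
    """True only if Ollama is up and the model is currently loaded."""
    try:
        models = requests.get('http://localhost:11434/api/ps', timeout=5).json()['models']
    except requests.RequestException:
        return False
    return any(m['name'].startswith(model) for m in models)

if not model_ready():
    # warm-up request so the first real user does not pay the model-load time
    requests.post('http://localhost:11434/api/generate',
                  json={'model': 'llama3.2', 'prompt': 'hi', 'stream': False})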
Versioning and Stability
Ollama does not version its API with explicit version numbers in the path (unlike /api/v2/ patterns). The API has remained remarkably stable across Ollama's major versions: new endpoints are additive, and breaking changes to existing endpoints have been rare (the restructured /api/create request format is the main exception). The OpenAI-compatible /v1/ endpoints follow OpenAI's API conventions and are similarly stable. The practical implication: you can usually update Ollama on your server without expecting existing API integrations to break. Review the Ollama release notes for each update; changes to the REST API are infrequent and documented when they occur. This stability is intentional — Ollama's positioning as infrastructure that other tools build on top of requires a stable API contract, and the development team has maintained that commitment consistently.
Multimodal: Sending Images via the API
For vision-capable models (llava, moondream, gemma3), the /api/generate and /api/chat endpoints accept base64-encoded images:
IMAGE_B64=$(base64 -i photo.jpg)  # macOS; on Linux use: base64 -w0 photo.jpg (avoids line wrapping)
curl http://localhost:11434/api/generate -d "{
\"model\": \"llava\",
\"prompt\": \"What is in this image?\",
\"images\": [\"$IMAGE_B64\"],
\"stream\": false
}"
import base64, requests

# Same request from Python, using the raw API for consistency with this reference
with open('photo.jpg', 'rb') as f:
    img_b64 = base64.b64encode(f.read()).decode()

r = requests.post('http://localhost:11434/api/generate', json={
    'model': 'llava',
    'prompt': 'Describe this image in detail.',
    'images': [img_b64],
    'stream': False,
})
print(r.json()['response'])
The /api/push Endpoint
# Push a local model to a registry (requires account at ollama.com)
curl http://localhost:11434/api/push -d '{
  "name": "your-username/your-model:latest"
}'
Pushing models to the Ollama registry is useful for sharing custom Modelfile-based models with your team or the public. The model must be tagged with your registry username prefix before pushing. This workflow — create locally with a Modelfile, test, push to share — is the standard pattern for distributing custom system prompts or model configurations across a team without sharing the Modelfile manually.
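The full sequence, sketched with requests; your-username is a placeholder for your ollama.com account name, and pushing assumes your local Ollama keys are registered with that account:

import requests

BASE = 'http://localhost:11434'

# Tag the local model under your registry namespace...
requests.post(f'{BASE}/api/copy',
              json={'source': 'mario', 'destination': 'your-username/mario:latest'})

# ...then push it (stream false returns a single summary response)
r = requests.post(f'{BASE}/api/push',
                  json={'name': 'your-username/mario:latest', 'stream': False})
print(r.json())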
Building a Thin API Wrapper
For teams that want to add authentication, rate limiting, or logging on top of Ollama’s unauthenticated API, a minimal proxy in any language adds these capabilities without modifying Ollama itself. A 20-line FastAPI or Express wrapper that validates an API key header and forwards valid requests to Ollama’s port is the simplest approach — it sits between team members’ clients and the Ollama instance, enforcing access control without the complexity of a full API gateway. The Ollama API’s simplicity makes proxying straightforward: forward the request body unchanged, stream the response back unchanged, and log the metadata (model, token counts, latency) from the response fields. This pattern converts Ollama’s single-machine tool into team infrastructure with minimal engineering effort.
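A sketch of such a wrapper with FastAPI and httpx, under stated assumptions: it handles non-streaming requests only (a streaming version would forward chunks with StreamingResponse), and the x-api-key header and PROXY_API_KEY environment variable are illustrative names, not Ollama conventions.

import os

import httpx
from fastapi import FastAPI, HTTPException, Request, Response

app = FastAPI()
OLLAMA = 'http://localhost:11434'
API_KEY = os.environ['PROXY_API_KEY']  # set this before starting the proxy

@app.post('/api/{path:path}')
async def proxy(path: str, request: Request) -> Response:
    # Reject requests without the shared key, then forward the body unchanged
    if request.headers.get('x-api-key') != API_KEY:
        raise HTTPException(status_code=401, detail='invalid API key')
    body = await request.body()
    async with httpx.AsyncClient(timeout=300) as client:
        upstream = await client.post(f'{OLLAMA}/api/{path}', content=body)
    return Response(content=upstream.content,
                    status_code=upstream.status_code,
                    media_type='application/json')

Run it with uvicorn and point clients at the proxy's port instead of 11434; logging model names, token counts, and latency from the forwarded response bodies can be layered on in the same handler.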
Testing and Debugging API Integrations
When an API integration is not behaving as expected, the most efficient debugging workflow is to reproduce the issue with a raw curl command. Strip away application code, SDKs, and abstractions until you can make the exact same request with curl and observe the raw response; this isolates whether the problem is in your application logic or in the Ollama API itself. Common issues to check: model name typos (use /api/tags to confirm exact names, including tags), context window overflow (responses that cut off mid-sentence often indicate the num_ctx limit was hit — increase it or shorten the prompt), temperature set too high for deterministic tasks (extraction and classification need 0.0 or very low temperature), and streaming parse errors (buffer partial lines, since a network read can end mid-object or deliver several newline-delimited objects at once).
For integration tests, use the / health endpoint to verify Ollama is running before running test suites, and the /api/ps endpoint to confirm your test model is loaded. Mocking the Ollama API in unit tests is straightforward since every endpoint follows the same JSON request/response pattern — a simple HTTP mock that returns a pre-defined JSON response is sufficient for testing application logic without a running Ollama instance. The API’s simplicity and stability make it one of the easier local AI backends to test against reliably.
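A unit-test sketch along those lines with the standard library's unittest.mock; myapp is a hypothetical module holding the generate() helper from the Python section above:

from unittest.mock import MagicMock, patch

from myapp import generate  # hypothetical module under test

def test_generate_parses_response():
    fake = MagicMock()
    fake.json.return_value = {'response': 'mocked text', 'done': True}
    # Patch requests.post as seen from the module under test
    with patch('myapp.requests.post', return_value=fake) as mock_post:
        assert generate('any prompt') == 'mocked text'
        mock_post.assert_called_once()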
Getting Started
With Ollama running, every endpoint in this reference is immediately available: no configuration, no API keys, no setup beyond ollama serve. Start with the health check and /api/tags to confirm everything is working, then move to /api/generate or /api/chat for your first completion requests. The consistent JSON request/response format means that once you understand one endpoint, the pattern transfers directly to the others. Bookmark the Ollama GitHub repository for the authoritative API documentation, which is updated with each release to reflect new parameters and endpoints. The reference in this article covers the stable core that has remained consistent across versions: a foundation that scales from simple scripts to production applications without requiring a different underlying API or a more complex integration layer. That consistency is one of the most underrated qualities of Ollama as a local AI platform.