FastAPI has become the go-to Python framework for building APIs quickly — it is fast, well-documented, and generates OpenAPI docs automatically. Paired with Ollama, it gives you a clean way to expose a local LLM as an HTTP API that any client can consume: a web frontend, a mobile app, a CLI tool, or another service. This guide walks through building a production-ready FastAPI wrapper around Ollama, including streaming endpoints, request validation, authentication, and async design patterns throughout.
The result is a self-hosted AI API that behaves like a cloud LLM endpoint but runs entirely on your own hardware. Because it sits in front of Ollama, it also gives you a natural place to add rate limiting, logging, prompt sanitisation, and model routing logic that you cannot easily add to Ollama itself.
Project Setup
Create a project directory and install the required packages:
mkdir ollama-api && cd ollama-api
python -m venv venv && source venv/bin/activate
pip install fastapi uvicorn httpx python-dotenv pydantic
We use httpx for async HTTP requests to Ollama rather than the requests library, because requests is synchronous and would block FastAPI’s event loop. Create a .env file for configuration:
OLLAMA_BASE_URL=http://localhost:11434
DEFAULT_MODEL=llama3.2
API_KEY=your-secret-key-here
Pydantic Models
Define the request and response shapes using Pydantic. FastAPI uses these for automatic validation and OpenAPI schema generation:
from pydantic import BaseModel, Field
from typing import Optional

class Message(BaseModel):
    role: str
    content: str

class ChatRequest(BaseModel):
    model: str = Field(default="llama3.2")
    messages: list[Message]
    stream: bool = False
    temperature: Optional[float] = None

class ChatResponse(BaseModel):
    model: str
    message: Message
    done: bool

The Field(default=...) on model means callers do not have to specify a model if they are happy with the default, while still allowing it to be overridden per request. FastAPI will include these defaults in the generated OpenAPI docs automatically.
Basic Chat Endpoint
Here is a minimal FastAPI application with a single chat endpoint that proxies requests to Ollama:
import os, httpx
from fastapi import FastAPI, HTTPException
from dotenv import load_dotenv

load_dotenv()
OLLAMA_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
app = FastAPI(title="Local AI API", version="1.0.0")

@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    payload = {
        "model": request.model,
        "messages": [m.model_dump() for m in request.messages],
        "stream": False,
    }
    # Forward temperature through Ollama's "options" object when the caller sets it
    if request.temperature is not None:
        payload["options"] = {"temperature": request.temperature}
    async with httpx.AsyncClient(timeout=120) as client:
        response = await client.post(f"{OLLAMA_URL}/api/chat", json=payload)
    if response.status_code != 200:
        raise HTTPException(status_code=502, detail="Ollama request failed")
    return response.json()

Run it with uvicorn main:app --reload and visit http://localhost:8000/docs to see the auto-generated interactive API documentation. FastAPI’s Swagger UI lets you test the endpoint directly from the browser, so no curl commands are needed during development.
Streaming Endpoint with Server-Sent Events
For a streaming endpoint, use FastAPI’s StreamingResponse and an async generator that reads tokens from Ollama as they arrive:
import json
from fastapi.responses import StreamingResponse

async def ollama_stream(request: ChatRequest):
    payload = {
        "model": request.model,
        "messages": [m.model_dump() for m in request.messages],
        "stream": True,
    }
    # Forward temperature through Ollama's "options" object when the caller sets it
    if request.temperature is not None:
        payload["options"] = {"temperature": request.temperature}
    async with httpx.AsyncClient(timeout=120) as client:
        async with client.stream("POST", f"{OLLAMA_URL}/api/chat", json=payload) as response:
            # Ollama streams one JSON object per line
            async for line in response.aiter_lines():
                if not line:
                    continue
                chunk = json.loads(line)
                token = chunk.get("message", {}).get("content", "")
                if token:
                    yield f"data: {json.dumps({'token': token})}\n\n"
                if chunk.get("done"):
                    yield "data: [DONE]\n\n"
                    break

@app.post("/chat/stream")
async def chat_stream(request: ChatRequest):
    return StreamingResponse(
        ollama_stream(request),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"}
    )

The X-Accel-Buffering: no header prevents nginx from buffering the response before forwarding it to the client. The Cache-Control: no-cache header prevents CDNs from caching SSE responses. Because this endpoint accepts a POST with a JSON body, the browser’s native EventSource API (which only issues GET requests) cannot call it directly. Browser clients can read the stream with fetch and a ReadableStream reader, and other clients can use any SSE library that supports POST.
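To make the wire format concrete, here is a minimal client-side sketch that turns the SSE lines emitted by /chat/stream into tokens. The function name parse_sse_tokens and the canned sample lines are illustrative assumptions, not part of the API above; a real client would feed it lines read from the HTTP response.

```python
import json
from typing import Iterable, Iterator


def parse_sse_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield token strings from SSE lines shaped like the /chat/stream output."""
    for raw in lines:
        line = raw.strip()
        # SSE separates events with blank lines; anything that does not
        # start with "data:" (blank lines, comments, other fields) is skipped.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel emitted by the server
        yield json.loads(payload)["token"]


# Canned lines standing in for a live response body.
sample = [
    'data: {"token": "Hel"}',
    "",
    'data: {"token": "lo"}',
    "",
    "data: [DONE]",
    "",
]
text = "".join(parse_sse_tokens(sample))
```

With httpx, the lines could come from response.iter_lines() inside a client.stream("POST", ...) block, so the same parser works for a CLI tool or another backend service.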
API Key Authentication
Use FastAPI’s dependency injection to add API key authentication to protected endpoints:
from fastapi import Depends, HTTPException, Security
from fastapi.security import APIKeyHeader

API_KEY = os.getenv("API_KEY", "")
api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)

async def verify_api_key(key: str = Security(api_key_header)):
    if not API_KEY:
        return  # Auth disabled if no key configured
    if key != API_KEY:
        raise HTTPException(status_code=401, detail="Invalid API key")

@app.post("/chat", response_model=ChatResponse, dependencies=[Depends(verify_api_key)])
async def chat(request: ChatRequest):
    ...

If API_KEY is empty in the environment, authentication is skipped entirely, which is useful for local development. In production, set a strong random key in your .env file and require it in the X-API-Key header with every request.
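One optional hardening step for the key check is a constant-time comparison, which avoids leaking how many leading characters of a guess were correct. The sketch below uses the standard library's secrets.compare_digest; the helper name keys_match is an assumption for illustration, and it could replace the key != API_KEY test inside verify_api_key.

```python
import secrets
from typing import Optional


def keys_match(provided: Optional[str], expected: str) -> bool:
    """Compare API keys in constant time; a missing key never matches."""
    if provided is None:
        return False
    # Encode to bytes so non-ASCII input does not raise TypeError.
    return secrets.compare_digest(
        provided.encode("utf-8"), expected.encode("utf-8")
    )
```

The None check matters because APIKeyHeader is created with auto_error=False, so a request without the header reaches the dependency with no key at all.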
Shared HTTP Client with Lifespan
Creating a new httpx.AsyncClient per request is wasteful. Use FastAPI’s lifespan context manager to create one shared client at startup:
from contextlib import asynccontextmanager
from typing import AsyncGenerator

http_client: httpx.AsyncClient

@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncGenerator:
    global http_client
    http_client = httpx.AsyncClient(timeout=120)
    yield
    await http_client.aclose()

app = FastAPI(title="Local AI API", lifespan=lifespan)

The shared client reuses TCP connections across requests, reducing latency and avoiding connection pool overhead. The aclose() call in the cleanup phase ensures connections are cleanly released when the server shuts down, preventing connection leak warnings in Uvicorn’s logs.
CORS and Embeddings
If a browser frontend will call your API directly, add CORS middleware:
from fastapi.middleware.cors import CORSMiddleware

app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:3000", "https://yourapp.com"],
    allow_credentials=True,
    allow_methods=["POST", "GET"],
    allow_headers=["*"],
)

Restrict allow_origins to specific domains rather than the wildcard * in production. Wildcards expose your API to requests from any origin, which matters if you have API key authentication that you do not want bypassed by malicious websites using a user’s browser as a proxy.
Add an embeddings endpoint to expose Ollama’s vector generation capability:
class EmbedRequest(BaseModel):
    model: str = "nomic-embed-text"
    input: str

@app.post("/embed", dependencies=[Depends(verify_api_key)])
async def embed(request: EmbedRequest):
    response = await http_client.post(
        f"{OLLAMA_URL}/api/embed",
        json={"model": request.model, "input": request.input}
    )
    if response.status_code != 200:
        raise HTTPException(status_code=502, detail="Embedding failed")
    return response.json()

Exposing embeddings through your FastAPI layer rather than directly to clients gives you a single authenticated interface for all Ollama capabilities. Clients do not need to know which machine Ollama is running on, what port it uses, or which model handles embeddings; all of that is encapsulated behind your API.
Request Logging Middleware
Add a middleware layer to log every request with its latency, which is useful for identifying slow prompts and debugging production issues:
import time, logging
from fastapi import Request

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ollama-api")

@app.middleware("http")
async def log_requests(request: Request, call_next):
    start = time.perf_counter()  # monotonic clock, unaffected by system time changes
    response = await call_next(request)
    duration = time.perf_counter() - start
    logger.info(
        f"{request.method} {request.url.path} "
        f"status={response.status_code} "
        f"duration={duration:.2f}s"
    )
    return response

This middleware logs every request path, HTTP method, response status code, and how long the request took in seconds. For a chat endpoint backed by a large model, you will often see durations of 10 to 60 seconds. That is completely normal, but useful to track over time to spot regressions when you change models or hardware. For streaming endpoints the logged duration does not cover the full generation: call_next returns as soon as the response has started, before the generator has finished yielding, so the figure reflects the time taken to begin the response rather than the time to produce every token.
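If the total generation time of a streamed response matters, one option is to time the generator itself rather than the middleware call. The following is a sketch under that assumption; timed_stream and its on_complete callback are hypothetical names, and the demo uses a fake token source instead of Ollama.

```python
import asyncio
import time
from typing import AsyncIterator, Callable, List, Tuple


async def timed_stream(
    source: AsyncIterator[str],
    on_complete: Callable[[float, int], None],
) -> AsyncIterator[str]:
    """Re-yield chunks from source, then report total seconds and chunk count."""
    start = time.perf_counter()
    count = 0
    try:
        async for chunk in source:
            count += 1
            yield chunk
    finally:
        # Runs when the source is exhausted and also when the consumer
        # stops early and the generator is closed.
        on_complete(time.perf_counter() - start, count)


async def _demo() -> Tuple[List[str], List[Tuple[float, int]]]:
    async def fake_tokens() -> AsyncIterator[str]:
        for token in ("a", "b", "c"):
            await asyncio.sleep(0)
            yield token

    reports: List[Tuple[float, int]] = []
    wrapped = timed_stream(fake_tokens(), lambda s, n: reports.append((s, n)))
    chunks = [c async for c in wrapped]
    return chunks, reports


chunks, reports = asyncio.run(_demo())
```

In the streaming endpoint, the wrapper would sit between ollama_stream(request) and StreamingResponse, with on_complete calling logger.info.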
Model Routing
One of the most useful things you can do at the FastAPI layer is route requests to different models based on the task. Rather than exposing raw model names to clients, define named task endpoints that internally select the right model:
MODEL_ROUTES = {
    "chat": "llama3.2",
    "code": "qwen2.5-coder:7b",
    "embed": "nomic-embed-text",  # embedding model: intended for /embed rather than /api/chat
    "fast": "llama3.2:1b",  # smaller tag than the default llama3.2 used for "chat"
}

class TaskRequest(BaseModel):
    task: str = "chat"
    messages: list[Message]

@app.post("/task")
async def task_chat(request: TaskRequest, _=Depends(verify_api_key)):
    model = MODEL_ROUTES.get(request.task)
    if not model:
        raise HTTPException(status_code=400, detail=f"Unknown task: {request.task}")
    response = await http_client.post(
        f"{OLLAMA_URL}/api/chat",
        json={"model": model, "messages": [m.model_dump() for m in request.messages], "stream": False}
    )
    # Surface upstream failures instead of returning Ollama's error body as a 200
    if response.status_code != 200:
        raise HTTPException(status_code=502, detail="Ollama request failed")
    return response.json()

This pattern decouples clients from model names entirely. When you want to swap the coding model from qwen2.5-coder:7b to something newer, you update one line in MODEL_ROUTES and all clients benefit immediately without any changes on their end. It also makes it easy to add task-specific system prompts: the routing layer can prepend a different system message depending on which task was requested, giving each model the context it needs to perform well without clients having to manage prompts themselves.
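The task-specific system prompt idea can be sketched as a small pure function. The prompt texts, the SYSTEM_PROMPTS table, and the with_system_prompt helper are all illustrative assumptions; the decision to drop client-supplied system messages is one possible policy, not the only reasonable one.

```python
from typing import Dict, List

# Hypothetical prompts for illustration; the task names mirror MODEL_ROUTES.
SYSTEM_PROMPTS: Dict[str, str] = {
    "chat": "You are a helpful assistant.",
    "code": "You are a senior engineer. Reply with code and brief comments.",
}


def with_system_prompt(task: str, messages: List[dict]) -> List[dict]:
    """Return a new message list with the task's system prompt prepended."""
    prompt = SYSTEM_PROMPTS.get(task)
    if prompt is None:
        return list(messages)  # no prompt configured for this task
    # Drop client-supplied system messages so the server-side prompt wins.
    user_messages = [m for m in messages if m.get("role") != "system"]
    return [{"role": "system", "content": prompt}] + user_messages
```

Inside task_chat, the result would replace the plain list of dumped messages in the JSON payload sent to Ollama.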
Testing with pytest
FastAPI has excellent testing support via its TestClient. For async endpoints, use httpx.AsyncClient with ASGITransport to test without spinning up a real server:
import pytest
from httpx import AsyncClient, ASGITransport
from unittest.mock import AsyncMock, patch
from main import app

@pytest.mark.asyncio
async def test_chat_endpoint():
    mock_response = {
        "model": "llama3.2",
        "message": {"role": "assistant", "content": "Hello!"},
        "done": True
    }
    # ASGITransport does not run the lifespan handler, so main.http_client is
    # never assigned; create=True lets patch define it for the test.
    with patch("main.http_client", create=True) as mock_client:
        mock_client.post = AsyncMock(
            return_value=AsyncMock(
                status_code=200,
                json=lambda: mock_response
            )
        )
        async with AsyncClient(
            transport=ASGITransport(app=app), base_url="http://test"
        ) as client:
            resp = await client.post(
                "/chat",
                json={"messages": [{"role": "user", "content": "Hi"}]}
            )
    assert resp.status_code == 200
    assert resp.json()["message"]["content"] == "Hello!"

The patch("main.http_client", create=True) context manager replaces the shared Ollama client with a mock that returns a fixed response, so your tests run without a real Ollama instance. This assumes the /chat endpoint has been switched to the shared http_client introduced in the lifespan section. This is the right pattern for unit testing FastAPI endpoints: you test the routing, validation, and response shaping logic in isolation from the actual LLM. Add pytest-asyncio to your dev dependencies and set asyncio_mode = auto in pytest.ini (or asyncio_mode = "auto" under [tool.pytest.ini_options] in pyproject.toml) to avoid decorating every async test manually.
Running in Production with Gunicorn
For production deployments, run FastAPI under Gunicorn with Uvicorn workers rather than Uvicorn alone. Gunicorn manages multiple worker processes, which improves resilience — if one worker crashes, the others continue serving requests while Gunicorn restarts the failed process automatically:
pip install gunicorn
gunicorn main:app -w 2 -k uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000
Keep the worker count low, typically 2, because the bottleneck is Ollama rather than FastAPI. Each worker holds its own connection pool to Ollama, and Ollama serves only a small number of requests per model at once (configurable through its OLLAMA_NUM_PARALLEL setting) and queues the rest. Running 8 workers does not give you 8x throughput; it mostly means more requests waiting in Ollama’s queue. Two workers is a reasonable default: each Uvicorn worker already handles many concurrent requests asynchronously, and the second keeps the API available if the first crashes or is being restarted.
Health Check Endpoint
Add a lightweight health check endpoint that verifies connectivity to Ollama. Load balancers and container orchestrators like Kubernetes use this to determine whether the pod is ready to receive traffic:
@app.get("/health")
async def health():
    try:
        response = await http_client.get(
            f"{OLLAMA_URL}/api/tags", timeout=5
        )
        ollama_ok = response.status_code == 200
    except Exception:
        ollama_ok = False
    return {
        "status": "ok" if ollama_ok else "degraded",
        "ollama": ollama_ok
    }

The health endpoint uses a short 5-second timeout rather than the 120-second default, because a health check that takes more than a few seconds to respond is not useful for orchestration systems that need a quick answer. Returning a structured JSON body with both an overall status and the individual Ollama connectivity check makes it easy to see at a glance whether a failure is in your FastAPI layer or in the Ollama connection.
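One caveat worth illustrating: Kubernetes HTTP probes decide on the status code, not the body, and treat any code from 200 up to (but not including) 400 as success. A "degraded" body returned with a 200 would therefore still count as healthy. The sketch below maps the check to a status code; readiness_result is a hypothetical helper name, and using 503 for the degraded case is an assumption about the desired behaviour.

```python
from typing import Tuple


def readiness_result(ollama_ok: bool) -> Tuple[int, dict]:
    """Map the Ollama connectivity check to an HTTP status code and JSON body."""
    status_code = 200 if ollama_ok else 503
    body = {"status": "ok" if ollama_ok else "degraded", "ollama": ollama_ok}
    return status_code, body
```

In the endpoint, the pair could be returned as JSONResponse(status_code=code, content=body) from fastapi.responses, so a readiness probe stops routing traffic while Ollama is unreachable.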
Deploying Behind nginx
In production, put nginx in front of your FastAPI application to handle TLS termination, request buffering, and static file serving. A minimal nginx configuration for the API looks like this:
server {
    listen 443 ssl;
    server_name api.yourdomain.com;

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_buffering off;       # Required for SSE streaming
        proxy_read_timeout 300s;   # Allow long LLM generations
        proxy_send_timeout 300s;
    }
}

The proxy_buffering off directive is critical for the streaming endpoint. With buffering enabled nginx holds the entire response before forwarding it, which means clients see nothing until generation is complete. The extended proxy_read_timeout and proxy_send_timeout values prevent nginx from dropping long-running LLM requests before they finish. The default nginx timeout of 60 seconds is too short for large models generating lengthy responses.
Rate Limiting
Without rate limiting, a single user can flood your API with requests and starve everyone else of Ollama’s processing capacity. Add per-client rate limiting using the slowapi library, which integrates cleanly with FastAPI:
pip install slowapi
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/chat")
@limiter.limit("10/minute")
async def chat(request: Request, body: ChatRequest, _=Depends(verify_api_key)):
    ...

The 10/minute limit means each IP address can make at most 10 chat requests per minute. Requests over the limit receive a 429 status code; creating the limiter with headers_enabled=True also adds rate limit headers such as Retry-After, indicating when the client can try again. The get_remote_address key function uses the client IP address as the rate limit key. If your API sits behind a reverse proxy, every request appears to come from the proxy’s address, so configure the proxy to forward the real client IP via the X-Forwarded-For header and update the key function to read from it.
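A proxy-aware key function might look like the sketch below. The name client_ip and the trust_proxy flag are assumptions for illustration, and the helper expects lowercase header names in a plain mapping (Starlette's request.headers is case-insensitive, so that detail only matters in tests).

```python
from typing import Mapping


def client_ip(
    headers: Mapping[str, str], peer_ip: str, trust_proxy: bool = True
) -> str:
    """Pick the rate-limit key: first X-Forwarded-For entry, else the socket peer."""
    if trust_proxy:
        forwarded = headers.get("x-forwarded-for", "")
        # The header looks like "client, proxy1, proxy2"; the leftmost entry
        # is the original client as reported by the first proxy.
        first = forwarded.split(",")[0].strip()
        if first:
            return first
    return peer_ip
```

A slowapi key function receives the Request object, so a thin wrapper could pass request.headers and request.client.host to this helper. Only trust the header when your own proxy sets or overwrites it, since a client can otherwise send a forged value. The nginx configuration earlier in this guide forwards X-Real-IP, which could be read in the same way.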
Structured Output Endpoint
Add an endpoint that uses Ollama’s JSON schema mode to return structured data rather than free-form text. This is useful for classification, entity extraction, or any task where your application needs to parse the model’s output:
class StructuredRequest(BaseModel):
    messages: list[Message]
    # "schema" shadows a BaseModel attribute, so store it under another name
    # while keeping "schema" as the key callers send
    json_schema: dict = Field(alias="schema")  # JSON Schema object
    model: str = "llama3.2"

@app.post("/structured", dependencies=[Depends(verify_api_key)])
async def structured(request: StructuredRequest):
    response = await http_client.post(
        f"{OLLAMA_URL}/api/chat",
        json={
            "model": request.model,
            "messages": [m.model_dump() for m in request.messages],
            "format": request.json_schema,
            "stream": False,
        }
    )
    if response.status_code != 200:
        raise HTTPException(status_code=502, detail="Ollama error")
    data = response.json()
    # Ollama constrains output to the schema, but a truncated generation can still be invalid JSON
    try:
        return json.loads(data["message"]["content"])
    except json.JSONDecodeError:
        raise HTTPException(status_code=502, detail="Model returned invalid JSON")

The caller passes both the conversation messages and a JSON Schema object. Ollama constrains the model’s output to match the schema, and the endpoint parses the content field and returns the structured object directly. Clients receive typed data they can use immediately without any additional parsing logic on their end.
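As an illustration of what a caller might send, the sketch below builds a request body for a sentiment classifier and runs a shallow check on a result. The schema, the example prompt, and the missing_required helper are hypothetical; the check only looks at top-level required keys and is not full JSON Schema validation.

```python
import json

# Hypothetical schema for a sentiment classifier.
sentiment_schema = {
    "type": "object",
    "properties": {
        "label": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "confidence": {"type": "number"},
    },
    "required": ["label", "confidence"],
}

# Body a client would POST to /structured.
request_body = {
    "messages": [
        {"role": "user", "content": "Classify: 'The update broke my build.'"}
    ],
    "schema": sentiment_schema,
}


def missing_required(result: dict, schema: dict) -> list:
    """Return required keys absent from result (a shallow check only)."""
    return [key for key in schema.get("required", []) if key not in result]


# Stand-in for a parsed /structured response.
example_result = json.loads('{"label": "negative", "confidence": 0.9}')
problems = missing_required(example_result, sentiment_schema)
```

A client that needs stricter guarantees could validate the response against the same schema with a dedicated JSON Schema library before using it.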
Putting It All Together
A production FastAPI wrapper around Ollama is fewer than 200 lines of Python but gives you a substantial amount of infrastructure: automatic API documentation, request validation, API key authentication, streaming support, rate limiting, request logging, health checks, and a clean deployment story with Gunicorn and nginx. Each of these pieces is independently useful and can be added incrementally: start with the basic chat endpoint, add authentication when you share the API with others, add rate limiting when usage grows, and add model routing when you have multiple models for different tasks.
Because the API follows standard HTTP conventions and uses SSE for streaming, any client that can make HTTP requests can consume it, with no SDK required. A JavaScript frontend can use fetch for both endpoints, reading the streaming response body incrementally; the native EventSource API only issues GET requests, so it cannot send the POST body that /chat/stream expects. A mobile app written in Swift, Kotlin, or Dart can use its native HTTP client. A Python script can use requests or httpx. This interoperability is one of the most practical advantages of building the integration layer in FastAPI rather than coupling clients directly to Ollama’s API.
Extending the API Further
Once your FastAPI wrapper is running, there are several natural extensions worth considering. Adding a simple in-memory cache keyed on the request hash means repeated identical prompts return instantly without hitting Ollama at all — useful for demo environments where the same questions come up repeatedly. Adding a /models endpoint that proxies Ollama’s /api/tags response lets clients discover which models are available without direct Ollama access. And adding per-user API keys stored in a database — rather than a single shared key from the environment — lets you track usage per client, revoke access for individual users, and implement tiered rate limits based on user tier. Each of these additions is a few dozen lines of Python and builds naturally on the foundation established by the patterns in this guide.
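The in-memory cache idea can be sketched with the standard library alone. The names cache_key and TTLCache, and the 300-second default expiry, are assumptions for illustration rather than a prescribed design.

```python
import hashlib
import json
import time
from typing import Any, Dict, List, Optional, Tuple


def cache_key(model: str, messages: List[dict]) -> str:
    """Hash the parts of a request that determine the response."""
    canonical = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class TTLCache:
    """Tiny in-memory cache with per-entry expiry; not shared across workers."""

    def __init__(self, ttl_seconds: float = 300.0) -> None:
        self.ttl = ttl_seconds
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired; evict lazily on read
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._store[key] = (time.monotonic() + self.ttl, value)
```

An endpoint would compute the key from the incoming request, return the cached body on a hit, and store Ollama's response on a miss. Two caveats: each Gunicorn worker keeps its own copy of the cache, and caching changes behaviour for sampled generations, where identical prompts would otherwise produce varying answers.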
The combination of FastAPI and Ollama gives you a genuinely production-grade local AI API with minimal operational complexity — no cloud costs, no data leaving your infrastructure, and full control over every layer of the stack.