How to Serve Local LLMs as an API (FastAPI + Ollama)

Running large language models locally gives you privacy, control, and independence from cloud services. But to unlock the full potential of local LLMs, you need to expose them through a robust API that applications can consume reliably. Combining FastAPI—Python’s modern, high-performance web framework—with Ollama’s efficient LLM serving capabilities creates a production-ready API that rivals commercial offerings while keeping everything on your infrastructure.

This comprehensive guide walks you through building a complete API service for local LLMs, covering everything from basic setup to production-grade features like streaming responses, conversation management, and error handling. Whether you’re building internal tools, developing AI-powered applications, or creating services for clients, this architecture provides a solid foundation that scales from prototype to production.

Why FastAPI and Ollama Make a Perfect Combination

Before diving into implementation, understanding why this technology pairing works so well clarifies the architectural decisions we’ll make throughout the guide.

FastAPI’s Strengths for LLM APIs

FastAPI brings several capabilities that perfectly match LLM API requirements. Its asynchronous nature handles long-running LLM inference without blocking other requests—critical when model responses take several seconds. The framework’s automatic OpenAPI documentation generates interactive API docs that make testing and integration straightforward. Built-in request validation using Pydantic models ensures clients send properly formatted data, preventing errors before they reach your LLM.

Performance-wise, FastAPI rivals Node.js and Go frameworks despite being Python-based. This speed matters when you’re adding API overhead on top of already-slow LLM inference. The framework’s streaming support enables real-time token generation, creating responsive experiences where users see text appear progressively rather than waiting for complete responses.

Ollama’s Advantages for Local Serving

Ollama simplifies local LLM deployment dramatically. It handles model downloading, quantization selection, and GPU acceleration automatically, abstracting away complexity that would otherwise require substantial setup work. The service runs as a background daemon, maintaining loaded models in memory for fast subsequent requests.

Ollama’s HTTP API provides a clean interface for programmatic access, making integration straightforward. The platform supports dozens of popular models from Llama and Mistral to specialized variants, all accessible through consistent interfaces. Memory management is intelligent—Ollama loads models on-demand and unloads them when resources are needed elsewhere.

Together, FastAPI and Ollama create a stack that’s easy to develop on, performant in production, and maintainable long-term.

Setting Up Your Development Environment

Proper environment setup prevents common issues and establishes good practices from the start.

Installing Ollama

Begin by installing Ollama on your system. Visit ollama.ai and download the appropriate installer for your platform—macOS, Linux, or Windows. The installation process takes under a minute and automatically starts Ollama as a background service.

Verify the installation by running:

ollama --version

You should see the version number confirming Ollama is installed correctly. The service runs on http://localhost:11434 by default.

Pull a model to ensure everything works:

ollama pull llama2

This downloads Llama 2 7B, a capable model that runs well on most systems. The download takes a few minutes depending on your internet connection.

Setting Up Python Environment

Create a dedicated Python environment for your API project:

python -m venv llm-api-env
source llm-api-env/bin/activate  # On Windows: llm-api-env\Scripts\activate

Install required packages:

pip install fastapi uvicorn pydantic httpx python-dotenv

These packages provide FastAPI for the web framework, Uvicorn as the ASGI server, Pydantic for data validation, HTTPX for making requests to Ollama, and python-dotenv for configuration management.
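
Before writing any FastAPI code, it can be worth sanity-checking that httpx can reach Ollama from this environment. Here is a minimal sketch (assuming Ollama is running locally and llama2 has been pulled) that calls Ollama's /api/generate endpoint directly—the same endpoint our API will wrap:

import httpx

# Direct call to Ollama's generate endpoint (no FastAPI involved yet)
response = httpx.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Say hello in one sentence.", "stream": False},
    timeout=120.0,  # model loading plus generation can take a while
)
response.raise_for_status()
print(response.json()["response"])

If this prints a short greeting, both Ollama and your Python environment are ready.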

Building the Basic API Structure

Start with a minimal working API that demonstrates core concepts, then expand functionality incrementally.

Creating the Foundation

Create a file named main.py with the basic FastAPI application structure:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import httpx
from typing import Optional

app = FastAPI(
    title="Local LLM API",
    description="API for serving local LLMs via Ollama",
    version="1.0.0"
)

OLLAMA_BASE_URL = "http://localhost:11434"

class ChatRequest(BaseModel):
    message: str
    model: str = "llama2"
    temperature: Optional[float] = 0.7
    max_tokens: Optional[int] = 500

class ChatResponse(BaseModel):
    response: str
    model: str
    tokens_used: Optional[int] = None

@app.get("/")
async def root():
    return {
        "message": "Local LLM API is running",
        "docs": "/docs"
    }

@app.get("/models")
async def list_models():
    """List available Ollama models"""
    async with httpx.AsyncClient() as client:
        try:
            response = await client.get(f"{OLLAMA_BASE_URL}/api/tags")
            return response.json()
        except httpx.RequestError as e:
            raise HTTPException(status_code=503, detail="Ollama service unavailable")

@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    """Generate a chat response from the LLM"""
    async with httpx.AsyncClient(timeout=120.0) as client:
        try:
            ollama_request = {
                "model": request.model,
                "prompt": request.message,
                "stream": False,
                "options": {
                    "temperature": request.temperature,
                    "num_predict": request.max_tokens
                }
            }
            
            response = await client.post(
                f"{OLLAMA_BASE_URL}/api/generate",
                json=ollama_request
            )
            response.raise_for_status()
            
            result = response.json()
            
            return ChatResponse(
                response=result["response"],
                model=request.model,
                tokens_used=result.get("eval_count")
            )
            
        except httpx.RequestError as e:
            raise HTTPException(status_code=503, detail="Failed to connect to Ollama")
        except httpx.HTTPStatusError as e:
            raise HTTPException(status_code=e.response.status_code, detail=str(e))

This foundation implements three endpoints: a health check, model listing, and basic chat functionality. The code uses Pydantic models for request/response validation, ensuring type safety and automatic documentation.

Testing the Basic API

Start the server with Uvicorn:

uvicorn main:app --reload --host 0.0.0.0 --port 8000

The --reload flag enables auto-restart on code changes during development. Visit http://localhost:8000/docs to access the interactive API documentation that FastAPI generates automatically.

Test the chat endpoint using the documentation interface or curl:

curl -X POST "http://localhost:8000/chat" \
  -H "Content-Type: application/json" \
  -d '{
    "message": "Explain quantum computing in simple terms",
    "model": "llama2",
    "temperature": 0.7
  }'

You should receive a JSON response with the LLM’s generated text.
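
The exact wording varies from run to run, but the body follows the ChatResponse model defined earlier—something like this (values are illustrative):

{
  "response": "Quantum computing uses qubits, which can hold a mix of 0 and 1 at the same time...",
  "model": "llama2",
  "tokens_used": 142
}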

API Architecture Overview

A request passes through three layers: the client application (a web, mobile, or desktop app), the FastAPI server (request handling and validation), and the Ollama service (LLM inference and serving).
Request Flow:
1. Client sends HTTP request to FastAPI endpoint
2. FastAPI validates request data using Pydantic models
3. FastAPI forwards request to Ollama’s HTTP API
4. Ollama runs inference using loaded LLM
5. Response flows back through FastAPI to client

Implementing Streaming Responses

Streaming transforms user experience by showing text as the model generates it, eliminating the wait for complete responses.

Understanding Streaming Architecture

Ollama streams its output as newline-delimited JSON chunks over a single HTTP connection. Our endpoint will read those chunks and relay them to clients as Server-Sent Events (SSE), a format browsers and HTTP clients can consume progressively. FastAPI’s StreamingResponse class handles this pattern elegantly.

Adding Streaming Endpoint

Extend your API with a streaming endpoint:

from fastapi.responses import StreamingResponse
import json

async def stream_ollama_response(model: str, prompt: str, temperature: float, max_tokens: int):
    """Generator function for streaming LLM responses"""
    async with httpx.AsyncClient(timeout=120.0) as client:
        ollama_request = {
            "model": model,
            "prompt": prompt,
            "stream": True,
            "options": {
                "temperature": temperature,
                "num_predict": max_tokens
            }
        }
        
        async with client.stream(
            "POST",
            f"{OLLAMA_BASE_URL}/api/generate",
            json=ollama_request
        ) as response:
            async for line in response.aiter_lines():
                if line:
                    try:
                        chunk = json.loads(line)
                        if "response" in chunk:
                            # Format as Server-Sent Event
                            yield f"data: {json.dumps({'text': chunk['response']})}\n\n"
                    except json.JSONDecodeError:
                        continue

@app.post("/chat/stream")
async def chat_stream(request: ChatRequest):
    """Stream chat responses in real-time"""
    return StreamingResponse(
        stream_ollama_response(
            request.model,
            request.message,
            request.temperature,
            request.max_tokens
        ),
        media_type="text/event-stream"
    )

This implementation creates a generator that yields chunks as they arrive from Ollama, formatted as Server-Sent Events that browsers and HTTP clients can consume progressively.

Client-Side Streaming Consumption

Here’s how browser clients can consume the stream using the Fetch API (the built-in EventSource class only supports GET requests, so it can’t send our POST body):

const response = await fetch('http://localhost:8000/chat/stream', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    message: 'Explain machine learning',
    model: 'llama2'
  })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // Simplified parsing: assumes each chunk contains whole SSE lines
  for (const line of decoder.decode(value).split('\n')) {
    if (line.startsWith('data: ')) {
      const data = JSON.parse(line.slice(6));
      console.log(data.text);  // Display incremental text
    }
  }
}

Or with Python’s HTTPX:

import httpx
import json

with httpx.stream(
    "POST",
    "http://localhost:8000/chat/stream",
    json={"message": "Explain neural networks", "model": "llama2"},
    timeout=None  # generation can outlast httpx's default 5-second timeout
) as response:
    for line in response.iter_lines():
        if line.startswith("data: "):
            data = json.loads(line[6:])
            print(data["text"], end="", flush=True)

Adding Conversation Management

Stateless APIs lose conversation context between requests. Implementing conversation management enables multi-turn dialogues where the model remembers previous exchanges.

Designing Conversation Storage

For production systems, use databases like Redis or PostgreSQL. For this guide, we’ll use in-memory storage to demonstrate the pattern:

from datetime import datetime
from typing import Dict, List
import uuid

# In-memory conversation storage
conversations: Dict[str, List[dict]] = {}

class ConversationRequest(BaseModel):
    message: str
    conversation_id: Optional[str] = None
    model: str = "llama2"
    temperature: Optional[float] = 0.7
    max_tokens: Optional[int] = 500

class ConversationResponse(BaseModel):
    response: str
    conversation_id: str
    model: str
    message_count: int

@app.post("/chat/conversation", response_model=ConversationResponse)
async def chat_with_conversation(request: ConversationRequest):
    """Chat with conversation history maintained"""
    
    # Get or create conversation ID
    conv_id = request.conversation_id or str(uuid.uuid4())
    
    # Initialize conversation if new
    if conv_id not in conversations:
        conversations[conv_id] = []
    
    # Add user message to history
    conversations[conv_id].append({
        "role": "user",
        "content": request.message,
        "timestamp": datetime.utcnow().isoformat()
    })
    
    # Build context from conversation history
    context = "\n".join([
        f"{'User' if msg['role'] == 'user' else 'Assistant'}: {msg['content']}"
        for msg in conversations[conv_id][-10:]  # Last 10 messages
    ])
    
    async with httpx.AsyncClient(timeout=120.0) as client:
        ollama_request = {
            "model": request.model,
            "prompt": context + "\nAssistant:",
            "stream": False,
            "options": {
                "temperature": request.temperature,
                "num_predict": request.max_tokens
            }
        }
        
        response = await client.post(
            f"{OLLAMA_BASE_URL}/api/generate",
            json=ollama_request
        )
        result = response.json()
        
        # Add assistant response to history
        conversations[conv_id].append({
            "role": "assistant",
            "content": result["response"],
            "timestamp": datetime.utcnow().isoformat()
        })
        
        return ConversationResponse(
            response=result["response"],
            conversation_id=conv_id,
            model=request.model,
            message_count=len(conversations[conv_id])
        )

@app.get("/conversation/{conversation_id}")
async def get_conversation(conversation_id: str):
    """Retrieve conversation history"""
    if conversation_id not in conversations:
        raise HTTPException(status_code=404, detail="Conversation not found")
    return {"conversation_id": conversation_id, "messages": conversations[conversation_id]}

@app.delete("/conversation/{conversation_id}")
async def delete_conversation(conversation_id: str):
    """Delete conversation history"""
    if conversation_id in conversations:
        del conversations[conversation_id]
        return {"message": "Conversation deleted"}
    raise HTTPException(status_code=404, detail="Conversation not found")

This implementation maintains conversation state across requests, formats history as context for the model, and provides endpoints for retrieving and deleting conversations.
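
For a production variant of the same pattern, the in-memory dictionary can be swapped for Redis so that history survives restarts and is shared across worker processes. A minimal sketch, assuming a local Redis instance and the redis package (pip install redis); the key format and 24-hour TTL are illustrative choices:

import json
import redis.asyncio as redis

redis_client = redis.from_url("redis://localhost:6379/0", decode_responses=True)

async def append_message(conversation_id: str, role: str, content: str) -> None:
    """Append one message to a conversation list and refresh its expiry."""
    key = f"conversation:{conversation_id}"
    await redis_client.rpush(key, json.dumps({"role": role, "content": content}))
    await redis_client.expire(key, 60 * 60 * 24)  # keep conversations for 24 hours

async def get_recent_messages(conversation_id: str, limit: int = 10) -> list[dict]:
    """Fetch the last `limit` messages for building prompt context."""
    key = f"conversation:{conversation_id}"
    raw = await redis_client.lrange(key, -limit, -1)
    return [json.loads(item) for item in raw]

The endpoint logic stays the same; only the reads and writes against the conversations store change.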

Implementing Error Handling and Retry Logic

Production APIs need robust error handling that gracefully manages failures and provides informative error messages.

Comprehensive Error Handling

from fastapi import Request, status
from fastapi.responses import JSONResponse
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@app.exception_handler(HTTPException)
async def http_exception_handler(request: Request, exc: HTTPException):
    """Custom HTTP exception handler"""
    logger.error(f"HTTP error: {exc.status_code} - {exc.detail}")
    return JSONResponse(
        status_code=exc.status_code,
        content={
            "error": exc.detail,
            "status_code": exc.status_code,
            "path": str(request.url)
        }
    )

@app.exception_handler(Exception)
async def general_exception_handler(request: Request, exc: Exception):
    """Catch-all exception handler"""
    logger.error(f"Unexpected error: {str(exc)}", exc_info=True)
    return JSONResponse(
        status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
        content={
            "error": "Internal server error",
            "detail": "An unexpected error occurred"
        }
    )

Adding Retry Logic

Network issues or temporary Ollama unavailability should trigger automatic retries. Install tenacity, which provides the retry decorator:

pip install tenacity

Then wrap the Ollama call:

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
async def call_ollama_with_retry(url: str, payload: dict):
    """Call Ollama with automatic retry on failure"""
    async with httpx.AsyncClient(timeout=120.0) as client:
        response = await client.post(url, json=payload)
        response.raise_for_status()
        return response.json()

This retry mechanism attempts failed requests up to three times with exponential backoff, handling transient network issues gracefully.
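
To put the helper to work, the /chat endpoint can delegate its Ollama call to it. A sketch of how the earlier handler might change:

@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    """Generate a chat response, retrying transient Ollama failures"""
    result = await call_ollama_with_retry(
        f"{OLLAMA_BASE_URL}/api/generate",
        {
            "model": request.model,
            "prompt": request.message,
            "stream": False,
            "options": {
                "temperature": request.temperature,
                "num_predict": request.max_tokens,
            },
        },
    )
    return ChatResponse(
        response=result["response"],
        model=request.model,
        tokens_used=result.get("eval_count"),
    )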

Adding Authentication and Rate Limiting

Production APIs require authentication to control access and rate limiting to prevent abuse.

Simple API Key Authentication

from fastapi import Security, HTTPException
from fastapi.security import APIKeyHeader
import os

API_KEY = os.getenv("API_KEY", "your-secret-key-here")
api_key_header = APIKeyHeader(name="X-API-Key")

async def verify_api_key(api_key: str = Security(api_key_header)):
    """Verify API key from request header"""
    if api_key != API_KEY:
        raise HTTPException(
            status_code=status.HTTP_401_UNAUTHORIZED,
            detail="Invalid API key"
        )
    return api_key

# Apply to protected endpoints
@app.post("/chat", dependencies=[Security(verify_api_key)])
async def chat(request: ChatRequest):
    # Existing implementation
    pass
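
Clients then pass the key in the X-API-Key header on every request; for example, from Python (the key value is a placeholder):

import httpx

response = httpx.post(
    "http://localhost:8000/chat",
    headers={"X-API-Key": "your-secret-key-here"},  # placeholder; use your real key
    json={"message": "Explain quantum computing in simple terms", "model": "llama2"},
    timeout=120.0,
)
print(response.json())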

Rate Limiting

Install slowapi for rate limiting:

pip install slowapi

Implement rate limits:

from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/chat")
@limiter.limit("10/minute")
async def chat(request: Request, chat_request: ChatRequest):
    # Existing implementation
    pass

This configuration limits each IP address to 10 requests per minute, preventing abuse while allowing legitimate usage.
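
If you’re also using API key authentication, a reasonable variation is to rate-limit per key rather than per IP, so clients behind a shared NAT aren’t lumped together. A sketch of a custom key function—slowapi just needs a callable that maps a request to a string:

from fastapi import Request
from slowapi import Limiter
from slowapi.util import get_remote_address

def api_key_or_ip(request: Request) -> str:
    """Rate-limit by API key when present, falling back to client IP."""
    return request.headers.get("X-API-Key") or get_remote_address(request)

limiter = Limiter(key_func=api_key_or_ip)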

Production Deployment Checklist

1. Environment Configuration: Use environment variables for secrets and configure proper logging levels.
2. Error Handling: Implement comprehensive error catching, retry logic, and informative error messages.
3. Security Measures: Add API key authentication, implement rate limiting, and enable CORS appropriately.
4. Performance Optimization: Configure proper timeouts, implement caching where appropriate, and monitor resource usage.
5. Monitoring & Logging: Track request metrics, log errors and slow queries, and set up alerts for failures.
6. Documentation: Maintain API docs, create usage examples, and document deployment procedures.

Deploying to Production

Moving from development to production requires additional considerations around reliability, monitoring, and deployment.

Running with Gunicorn for Production

Uvicorn alone isn’t ideal for production. Use Gunicorn as a process manager:

pip install gunicorn

Create a startup script start.sh:

#!/bin/bash
gunicorn main:app \
  --workers 4 \
  --worker-class uvicorn.workers.UvicornWorker \
  --bind 0.0.0.0:8000 \
  --timeout 120 \
  --access-logfile - \
  --error-logfile -

This configuration runs 4 worker processes, sets appropriate timeouts for LLM inference, and logs to stdout for container environments. Keep in mind that each worker is a separate process, so the in-memory conversation store from earlier is not shared between them—another reason to move it to Redis or a database in production.

Containerization with Docker

Create a Dockerfile:

FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8000

CMD ["gunicorn", "main:app", "--workers", "4", "--worker-class", "uvicorn.workers.UvicornWorker", "--bind", "0.0.0.0:8000"]
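
The Dockerfile copies a requirements.txt that we haven’t written down yet. A minimal version covering the packages used in this guide might look like the following (pin exact versions for reproducible builds):

fastapi
uvicorn
gunicorn
pydantic
httpx
python-dotenv
tenacity
slowapi
prometheus-client
prometheus-fastapi-instrumentator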

Build and run:

docker build -t llm-api .
docker run --network host llm-api

The --network host flag lets the container reach Ollama running on the host machine, and with host networking the API is exposed on port 8000 directly, so no -p mapping is needed. This works as described on Linux; on macOS and Windows, where Docker runs inside a VM, publish the port with -p 8000:8000 and point OLLAMA_BASE_URL at http://host.docker.internal:11434 instead.

Environment Configuration

Create a .env file for configuration:

API_KEY=your-production-api-key
OLLAMA_BASE_URL=http://localhost:11434
LOG_LEVEL=INFO
CORS_ORIGINS=https://yourdomain.com

Load these in your application:

from dotenv import load_dotenv
import os

load_dotenv()

API_KEY = os.getenv("API_KEY")
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
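
The .env file also defines CORS_ORIGINS, but nothing reads it yet. If browser-based clients will call the API from another origin, here is a sketch of wiring it up with FastAPI’s CORSMiddleware (the comma-separated origin format is an assumption of this example):

from fastapi.middleware.cors import CORSMiddleware

cors_origins = os.getenv("CORS_ORIGINS", "").split(",")

app.add_middleware(
    CORSMiddleware,
    allow_origins=[origin.strip() for origin in cors_origins if origin.strip()],
    allow_methods=["*"],
    allow_headers=["*"],
)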

Monitoring and Observability

Production systems require monitoring to identify issues before they impact users.

Adding Request Metrics

Install the instrumentation libraries:

pip install prometheus-fastapi-instrumentator prometheus-client

Then wire them into the app:

from prometheus_client import Counter, Histogram
from prometheus_fastapi_instrumentator import Instrumentator

# Custom metrics you can increment/observe inside your own handlers
request_count = Counter('api_requests_total', 'Total API requests', ['endpoint', 'status'])
request_duration = Histogram('api_request_duration_seconds', 'Request duration', ['endpoint'])

# Instrument FastAPI with default HTTP metrics and expose them at /metrics
Instrumentator().instrument(app).expose(app)

This exposes Prometheus-compatible metrics at /metrics for monitoring request rates, latencies, and error rates.
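
The Instrumentator covers standard HTTP metrics automatically; the custom request_count and request_duration metrics defined above are for measurements you record yourself. A sketch of how the chat endpoint might use them, building on the retry helper from earlier:

import time

@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    """Chat endpoint instrumented with the custom Prometheus metrics"""
    start = time.perf_counter()
    try:
        result = await call_ollama_with_retry(
            f"{OLLAMA_BASE_URL}/api/generate",
            {"model": request.model, "prompt": request.message, "stream": False},
        )
        request_count.labels(endpoint="/chat", status="200").inc()
        return ChatResponse(
            response=result["response"],
            model=request.model,
            tokens_used=result.get("eval_count"),
        )
    finally:
        # Record how long the request took, success or failure
        request_duration.labels(endpoint="/chat").observe(time.perf_counter() - start)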

Structured Logging

Implement structured logging for better log analysis:

import json
from datetime import datetime

class StructuredLogger:
    @staticmethod
    def log(level: str, message: str, **kwargs):
        log_entry = {
            "timestamp": datetime.utcnow().isoformat(),
            "level": level,
            "message": message,
            **kwargs
        }
        print(json.dumps(log_entry))

logger = StructuredLogger()

# Usage
logger.log("INFO", "Chat request processed", 
           model="llama2", 
           tokens=150, 
           duration_ms=2340)

Structured logs integrate seamlessly with log aggregation systems like Elasticsearch or Datadog.

Conclusion

Building a production-ready API for serving local LLMs combines FastAPI’s modern web capabilities with Ollama’s efficient model serving to create a powerful, self-hosted alternative to commercial AI APIs. The architecture we’ve built supports streaming responses for responsive user experiences, maintains conversation context for multi-turn dialogues, implements robust error handling and retry logic, and includes security measures like authentication and rate limiting—all the features necessary for real-world applications.

This foundation scales from personal projects to production deployments serving thousands of requests. Start with the basic implementation, add features as your requirements grow, and don’t hesitate to customize the architecture for your specific use case. With this setup, you maintain complete control over your AI infrastructure while delivering experiences that rival cloud-based services, all while keeping your data private and your costs predictable.
