How to Use Ollama with Flask

Flask is Python’s most popular micro web framework: minimal, flexible, and well suited to building APIs and small web applications quickly. Pairing it with Ollama gives you a self-hosted AI backend that you can expose to a web frontend, another service, or a mobile app. This guide walks through building a production-ready Flask API around Ollama: a synchronous chat endpoint, a streaming endpoint using Server-Sent Events, API key authentication, rate limiting, and deployment with Gunicorn. By the end you will have a fully functional local AI API that any HTTP client can consume.

Flask is the right choice when you want a lightweight API without the scaffolding and conventions of Django, or when you already know Flask from other projects. The patterns here are plain, idiomatic Flask, with no heavy frameworks or unnecessary abstractions.

Setup

pip install flask httpx python-dotenv
ollama pull llama3.2

Create a .env file for configuration:

OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3.2
API_KEY=your-secret-key
FLASK_DEBUG=1

Basic Chat Endpoint

Here is a minimal Flask application with a chat endpoint:

import os, httpx
from flask import Flask, request, jsonify
from dotenv import load_dotenv

load_dotenv()

app = Flask(__name__)
OLLAMA_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
MODEL = os.getenv("OLLAMA_MODEL", "llama3.2")
API_KEY = os.getenv("API_KEY", "")

def check_auth():
    if not API_KEY:
        return True
    return request.headers.get("X-API-Key") == API_KEY

@app.route("/chat", methods=["POST"])
def chat():
    if not check_auth():
        return jsonify({"error": "Unauthorized"}), 401
    data = request.get_json()
    if not data or "messages" not in data:
        return jsonify({"error": "messages required"}), 400
    try:
        with httpx.Client(timeout=120) as client:
            resp = client.post(
                f"{OLLAMA_URL}/api/chat",
                json={"model": MODEL, "messages": data["messages"], "stream": False}
            )
            resp.raise_for_status()
        return jsonify(resp.json())
    except httpx.ConnectError:
        return jsonify({"error": "Ollama not running"}), 503
    except Exception as e:
        return jsonify({"error": str(e)}), 500

if __name__ == "__main__":
    app.run(debug=True)

Run with python app.py and test with curl -X POST http://localhost:5000/chat -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"Hello"}]}'. The auth check passes through when API_KEY is empty, so you can develop without credentials and add them for production by setting the environment variable.
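The == comparison in check_auth can, in principle, leak timing information about the key. A minimal sketch of a constant-time variant follows, assuming the same "empty key means open access" behaviour described above. The helper name is_authorized is hypothetical; hmac.compare_digest comes from the standard library.

```python
import hmac

def is_authorized(provided_key, configured_key):
    """Return True if the request may proceed.

    An empty configured key disables auth, matching check_auth above.
    """
    if not configured_key:
        return True
    if not provided_key:
        return False
    # compare_digest avoids short-circuiting on the first differing byte.
    return hmac.compare_digest(provided_key.encode(), configured_key.encode())
```

Inside check_auth this could be called as is_authorized(request.headers.get("X-API-Key"), API_KEY).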

Streaming with Server-Sent Events

Add a streaming endpoint using Flask’s Response with a generator:

import json
from flask import Response, stream_with_context

@app.route("/chat/stream", methods=["POST"])
def chat_stream():
    if not check_auth():
        return jsonify({"error": "Unauthorized"}), 401
    data = request.get_json(silent=True) or {}
    messages = data.get("messages", [])
    if not messages:
        return jsonify({"error": "messages required"}), 400

    def generate():
        with httpx.Client(timeout=120) as client:
            with client.stream(
                "POST", f"{OLLAMA_URL}/api/chat",
                json={"model": MODEL, "messages": messages, "stream": True}
            ) as resp:
                for line in resp.iter_lines():
                    if not line:
                        continue
                    chunk = json.loads(line)
                    token = chunk.get("message", {}).get("content", "")
                    if token:
                        yield f"data: {json.dumps({'token': token})}\n\n"
                    if chunk.get("done"):
                        yield "data: [DONE]\n\n"
                        break

    return Response(
        stream_with_context(generate()),
        mimetype="text/event-stream",
        headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"}
    )

The stream_with_context wrapper maintains the Flask request context inside the generator, which is required for accessing request objects within the streaming function. Without it, accessing request context variables inside the generator raises a RuntimeError. The X-Accel-Buffering: no header disables nginx buffering when the Flask app is deployed behind nginx, ensuring tokens reach the client as they are generated.
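On the consuming side, a client reads the stream line by line and stops at the [DONE] marker. The sketch below shows one way to parse the event format produced by the endpoint above. The function name iter_tokens and the sample data are illustrative assumptions, not part of the original API.

```python
import json

def iter_tokens(lines):
    """Yield tokens from an iterable of SSE lines until the [DONE] marker."""
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip the blank separator lines between events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        yield json.loads(payload)["token"]

# Simulated stream, as a client would see it line by line.
sample = [
    'data: {"token": "Hel"}', "",
    'data: {"token": "lo"}', "",
    "data: [DONE]", "",
]
reply = "".join(iter_tokens(sample))
```

With httpx, the lines argument could be resp.iter_lines() from a client.stream("POST", ...) call against /chat/stream.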

A Reusable OllamaClient

Extract the Ollama logic into a reusable class and share one module-level instance across requests:

class OllamaClient:
    def __init__(self, base_url: str, model: str):
        self.base_url = base_url
        self.model = model

    def chat(self, messages: list, model: str = None) -> dict:
        with httpx.Client(timeout=120) as client:
            resp = client.post(
                f"{self.base_url}/api/chat",
                json={"model": model or self.model, "messages": messages, "stream": False}
            )
            resp.raise_for_status()
            return resp.json()

    def embed(self, text: str, model: str = "nomic-embed-text") -> list:
        with httpx.Client(timeout=60) as client:
            resp = client.post(
                f"{self.base_url}/api/embed",
                json={"model": model, "input": text}
            )
            resp.raise_for_status()
            return resp.json()["embeddings"][0]

ollama = OllamaClient(OLLAMA_URL, MODEL)

# Use in any route:
@app.route("/embed", methods=["POST"])
def embed():
    if not check_auth():
        return jsonify({"error": "Unauthorized"}), 401
    data = request.get_json()
    text = data.get("text", "")
    if not text:
        return jsonify({"error": "text required"}), 400
    embedding = ollama.embed(text)
    return jsonify({"embedding": embedding, "dimensions": len(embedding)})

Using a module-level singleton for the Ollama client is a common Flask pattern. The object holds only configuration (base URL and default model), so sharing it across requests is safe. Each call to chat or embed still creates a new httpx client, which gives up connection reuse but keeps the code simple. For a higher-performance API, refactor to hold one long-lived httpx.Client that is created once at startup and closed when the process shuts down, so its connection pool is reused across requests.

Rate Limiting with Flask-Limiter

Add rate limiting to prevent API abuse:

pip install flask-limiter

from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

limiter = Limiter(
    app=app,
    key_func=get_remote_address,
    default_limits=["100 per hour"],
    storage_uri="memory://"
)

# Add the limiter decorator to the existing /chat route (do not define the route twice):
@app.route("/chat", methods=["POST"])
@limiter.limit("10 per minute")
def chat():
    ...

The 10 per minute limit applies per IP address. Requests over the limit receive a 429 response; to also send a Retry-After header, pass headers_enabled=True to the Limiter. For production, use Redis as the storage backend (storage_uri="redis://localhost:6379") so rate limit state persists across server restarts and is shared across multiple Gunicorn workers. The memory backend is fine for single-worker development but resets on every restart and, because each process keeps its own counters, does not enforce limits correctly with multiple workers.

CORS for Frontend Access

If a browser-based frontend will call your Flask API directly, add CORS support:

pip install flask-cors

from flask_cors import CORS
CORS(app, origins=["http://localhost:3000", "https://yourapp.com"])

Restrict origins to your actual frontend domains in production. Allowing all origins with CORS(app) is convenient for development but exposes the API to cross-site requests from any origin when deployed. The Flask-CORS library handles preflight OPTIONS requests automatically and adds the necessary headers to all responses.

Deploying with Gunicorn

Flask’s built-in development server is not suitable for production. Deploy with Gunicorn:

pip install gunicorn
gunicorn app:app -w 2 --bind 0.0.0.0:8000 --timeout 180

Keep the worker count low; 2 is a reasonable default for an Ollama-backed API. More workers rarely improve throughput because Ollama serves only a limited number of requests in parallel (how many depends on its configuration and available memory) and queues the rest, so extra workers just mean more processes blocked waiting on Ollama. The --timeout 180 flag prevents Gunicorn from killing workers that are waiting on slow Ollama responses; the default 30-second timeout is too short for large models generating lengthy responses.

Put Nginx in front of Gunicorn for TLS termination, request buffering, and static file serving. Set proxy_buffering off in the Nginx config for the streaming endpoint, and extend proxy_read_timeout to at least 300 seconds to accommodate long LLM generations without Nginx dropping the connection mid-stream.

Flask vs FastAPI for Ollama APIs

Flask and FastAPI are the two most common Python choices for building APIs around Ollama. FastAPI has automatic OpenAPI documentation, native async support, and Pydantic-based request validation. Flask has a simpler mental model, a larger ecosystem of extensions, and is easier to get started with if you already know it. For a new project where the primary interface is an Ollama API, FastAPI’s async-first design and built-in streaming support make it marginally better suited. For a project where the Ollama API is one part of a larger Flask application, staying in Flask avoids adding a second web framework to the stack. Both work well — the patterns in this guide translate directly to FastAPI with minimal changes to the route structure and response handling.

Adding Conversation Memory with Flask Sessions

Flask’s built-in session system provides a natural place to store per-user conversation history without a database. Sessions are stored in signed cookies by default — the conversation history is kept on the client side, signed with the app’s secret key to prevent tampering. For short conversations this works well, but conversation history can grow large enough to exceed browser cookie limits (typically 4KB). For longer conversations, switch to server-side sessions using Flask-Session with a Redis backend, which stores session data server-side and sends only a session ID in the cookie.
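One way to stay under the cookie limit is to drop the oldest messages before saving the history. The sketch below assumes a budget of 3000 bytes of JSON, an arbitrary figure chosen to leave headroom for signing and encoding overhead; the helper name trim_history is hypothetical.

```python
import json

def trim_history(history, max_bytes=3000):
    """Drop the oldest messages until the JSON-encoded history fits max_bytes."""
    trimmed = list(history)  # copy so the caller's list is not mutated
    while len(trimmed) > 1 and len(json.dumps(trimmed).encode("utf-8")) > max_bytes:
        trimmed.pop(0)
    return trimmed

# Five 1000-character messages are far too large for a cookie.
history = [{"role": "user", "content": "x" * 1000} for _ in range(5)]
short = trim_history(history)
```

Trimming loses early context, so it suits casual chat better than tasks that depend on the start of the conversation.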

Implement conversation memory in the chat endpoint by reading the session history at the start of each request, appending the new user message, sending the full history to Ollama, and then saving the assistant reply back to the session. Add a separate endpoint or a !reset message handler that clears the session history for the requesting user. Session-based history is tied to the browser’s session cookie, so the same user in a different browser or after clearing cookies starts with a fresh conversation, which is usually the behaviour you want for a web-accessible chat API.
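The steps above can be sketched as follows. This is a self-contained illustration, not the guide's own code: the routes /chat/session and /chat/reset, the message field, and the ask_ollama stand-in (which just echoes instead of calling Ollama) are all assumptions.

```python
from flask import Flask, jsonify, request, session

app = Flask(__name__)
app.secret_key = "change-me"  # required for signed session cookies; placeholder value

def ask_ollama(messages):
    """Stand-in for the real Ollama call (hypothetical helper); returns reply text."""
    return f"echo: {messages[-1]['content']}"

@app.route("/chat/session", methods=["POST"])
def chat_with_memory():
    data = request.get_json(silent=True) or {}
    text = data.get("message", "")
    if not text:
        return jsonify({"error": "message required"}), 400
    history = session.get("history", [])
    history.append({"role": "user", "content": text})
    reply = ask_ollama(history)  # the full history goes to the model
    history.append({"role": "assistant", "content": reply})
    session["history"] = history  # reassign so Flask marks the session as modified
    return jsonify({"reply": reply, "turns": len(history) // 2})

@app.route("/chat/reset", methods=["POST"])
def reset():
    session.pop("history", None)
    return jsonify({"status": "cleared"})
```

In a real endpoint, ask_ollama would post the history to /api/chat and return the content of the message field from the response.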

Structured Output Endpoint

Add an endpoint that uses Ollama’s JSON schema mode to extract structured data from text — useful for classification, entity extraction, and form processing:

@app.route("/extract", methods=["POST"])
def extract():
    if not check_auth():
        return jsonify({"error": "Unauthorized"}), 401
    data = request.get_json()
    text = data.get("text", "")
    schema = data.get("schema", {})
    if not text or not schema:
        return jsonify({"error": "text and schema required"}), 400
    try:
        with httpx.Client(timeout=60) as client:
            resp = client.post(
                f"{OLLAMA_URL}/api/chat",
                json={
                    "model": MODEL,
                    "messages": [{"role": "user", "content": f"Extract information from: {text}"}],
                    "format": schema,
                    "stream": False
                }
            )
            resp.raise_for_status()
        # json is already imported at module level for the streaming endpoint
        content = resp.json()["message"]["content"]
        return jsonify({"result": json.loads(content)})
    except Exception as e:
        return jsonify({"error": str(e)}), 500

Callers pass both the text to extract from and the JSON Schema describing what to extract. Passing the schema in the format field constrains the model to return JSON conforming to that schema, which the endpoint parses and returns as a structured response object. This is more reliable than asking the model to return JSON via a prompt instruction, which works most of the time but occasionally produces malformed output or extra prose around the JSON.
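For illustration, a caller's request body might look like the sketch below. The person schema, the sample text, and both helper names are hypothetical; missing_required is a cheap client-side sanity check, not full JSON Schema validation.

```python
import json

# Hypothetical schema: pull a name and an age out of free text.
person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

def build_extract_request(text, schema):
    """Build the JSON body a caller would POST to /extract."""
    return {"text": text, "schema": schema}

def missing_required(result, schema):
    """Return required keys absent from a parsed result."""
    return [key for key in schema.get("required", []) if key not in result]

body = build_extract_request("Maria is 34 years old.", person_schema)
encoded = json.dumps(body)  # what actually goes over the wire
```

A caller can run missing_required on the result field of the response before trusting it downstream.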

Health Check and Model Status

Add a health check endpoint that reports whether Ollama is reachable and which models are available:

@app.route("/health")
def health():
    try:
        with httpx.Client(timeout=5) as client:
            resp = client.get(f"{OLLAMA_URL}/api/tags")
            models = [m["name"] for m in resp.json().get("models", [])]
            ollama_ok = True
    except Exception:
        models = []
        ollama_ok = False
    return jsonify({
        "status": "ok" if ollama_ok else "degraded",
        "ollama": ollama_ok,
        "models": models,
        "default_model": MODEL
    }), 200 if ollama_ok else 503

The 5-second timeout on the health check is intentional — a health endpoint that takes longer than a few seconds to respond is not useful for load balancers and container orchestrators that need a quick answer. Returning a 503 status when Ollama is unreachable lets infrastructure tools mark the instance as unhealthy and route traffic elsewhere, which is the correct behaviour for a degraded service.

Error Handling and Logging

Add proper error handling and request logging to make the API production-ready:

import logging, time
from flask import g

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@app.before_request
def before_request():
    g.start_time = time.time()

@app.after_request
def after_request(response):
    duration = time.time() - g.start_time
    logger.info(f"{request.method} {request.path} {response.status_code} {duration:.2f}s")
    return response

@app.errorhandler(429)
def rate_limit_handler(e):
    return jsonify({"error": "Rate limit exceeded", "retry_after": getattr(e, "retry_after", None)}), 429

@app.errorhandler(500)
def server_error(e):
    logger.error(f"Server error: {e}")
    return jsonify({"error": "Internal server error"}), 500

The before/after request hooks log every request with its method, path, status code, and duration. For an Ollama-backed API where response times range from under a second to over a minute for long generations, tracking duration in logs is essential for understanding performance and diagnosing slow requests. The custom 429 handler returns JSON instead of the default HTML error page. Whether retry_after is populated depends on the Flask-Limiter and Werkzeug versions in use; if it comes back as null, clients should fall back to the Retry-After header (when enabled) or their own backoff logic.

Testing Flask Ollama Routes

Flask’s built-in test client makes it easy to write unit tests for the API routes without running a real server. Mock the httpx calls to avoid needing a running Ollama instance in tests:

import pytest
from unittest.mock import patch, MagicMock
from app import app

@pytest.fixture
def client():
    app.config["TESTING"] = True
    with app.test_client() as c:
        yield c

def test_chat_returns_reply(client):
    mock_resp = MagicMock()
    mock_resp.json.return_value = {
        "message": {"role": "assistant", "content": "Hello!"},
        "done": True
    }
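    # Pin API_KEY so the test does not depend on the environment or .env file.
    with patch("httpx.Client") as mock_client, patch("app.API_KEY", "test-key"):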
        mock_client.return_value.__enter__.return_value.post.return_value = mock_resp
        resp = client.post("/chat",
            json={"messages": [{"role": "user", "content": "Hi"}]},
            headers={"X-API-Key": "test-key"}
        )
    assert resp.status_code == 200

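def test_wrong_api_key_returns_401(client):
    # Force a known key so the auth check is active regardless of the environment.
    with patch("app.API_KEY", "test-key"):
        resp = client.post("/chat", json={"messages": []},
                           headers={"X-API-Key": "wrong-key"})
    assert resp.status_code == 401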

The Flask test client sends requests directly to the application without going through the network, making tests fast and reliable. Patching httpx at the module level intercepts all HTTP calls to Ollama and returns fixed responses, keeping the tests isolated from external services. Run with pytest and the suite completes in under a second regardless of whether Ollama is installed on the test machine.

Adding Model Selection

Expose a /models endpoint that lists available Ollama models, and let callers specify a model per request rather than using only the default. This makes the API more flexible — callers can switch between a fast small model for quick queries and a larger model for complex analysis without any server-side configuration changes. Fetch the model list from Ollama’s /api/tags endpoint and return the names. In the chat endpoint, read the model field from the request body and pass it to the OllamaClient, falling back to the default model if none is specified. Add validation that rejects unknown model names to prevent callers from specifying non-existent models and getting cryptic Ollama error messages in return.
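The validation step might look like the sketch below. Both function names are hypothetical, and the sample data assumes /api/tags lists names with a tag suffix such as :latest, so the bare name is accepted as well.

```python
def model_names(tags_response):
    """Extract model names from a parsed /api/tags response."""
    return [m["name"] for m in tags_response.get("models", [])]

def resolve_model(requested, available, default):
    """Pick the model for a request, or raise ValueError for unknown names."""
    if not requested:
        return default
    if requested in available:
        return requested
    # Accept "llama3.2" when the server lists "llama3.2:latest".
    if f"{requested}:latest" in available:
        return f"{requested}:latest"
    raise ValueError(f"unknown model: {requested}")

# Sample data shaped like an /api/tags response (illustrative values).
tags = {"models": [{"name": "llama3.2:latest"}, {"name": "nomic-embed-text:latest"}]}
available = model_names(tags)
```

In the chat route, a ValueError from resolve_model would map to a 400 response with a clear error message.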

Blueprint Structure for Larger Projects

As your Flask Ollama API grows, organise it with Blueprints to keep related routes together and make the codebase easier to navigate. Create a blueprints/ directory and split routes into files by function: chat.py for conversation endpoints, embed.py for embedding endpoints, admin.py for model management and health checks. Register each Blueprint on the app factory with a URL prefix — /api/chat, /api/embed, /api/admin — and the route definitions within each Blueprint use relative paths. This structure scales cleanly from a handful of routes to dozens without the single-file app becoming unmanageable. Each Blueprint can also have its own error handlers, before-request hooks, and middleware, keeping concerns separated at the Blueprint level rather than mixing them all at the application level.
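A minimal sketch of that layout, collapsed into one file so it runs on its own. The placeholder handlers and the create_app factory name are assumptions; in a real project each Blueprint would live in its own module under blueprints/.

```python
from flask import Blueprint, Flask, jsonify

# In a real project these would live in blueprints/chat.py and blueprints/admin.py.
chat_bp = Blueprint("chat", __name__)
admin_bp = Blueprint("admin", __name__)

@chat_bp.route("/", methods=["POST"])
def chat():
    return jsonify({"reply": "placeholder"})  # a real handler would call Ollama

@admin_bp.route("/health")
def health():
    return jsonify({"status": "ok"})

def create_app():
    """App factory: build the app and attach each Blueprint under its prefix."""
    app = Flask(__name__)
    app.register_blueprint(chat_bp, url_prefix="/api/chat")
    app.register_blueprint(admin_bp, url_prefix="/api/admin")
    return app
```

With Gunicorn, the factory could be referenced as gunicorn "app:create_app()" instead of app:app.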

What Flask Gives You Over Raw httpx

You could call Ollama directly from any script with httpx and skip Flask entirely. Flask earns its place when you need to expose Ollama over HTTP to multiple clients — a web frontend, a mobile app, a browser extension, a CLI tool, or another service. The Flask layer adds the HTTP server, routing, authentication, rate limiting, CORS headers, error formatting, and logging that you would otherwise have to build yourself. It also gives you a single stable URL that abstracts away the Ollama server address and port, so if you move Ollama to a different machine or change its configuration, only the Flask server’s environment variables need updating — all clients continue pointing at the same Flask URL without any changes.

For teams where different developers use different languages — some Python, some JavaScript, some Go — a Flask API around Ollama gives everyone a consistent HTTP interface regardless of whether their language has a good Ollama client library. The streaming SSE endpoint works from any language that supports HTTP streaming. The JSON endpoints work from any HTTP client. The API key authentication and rate limiting protect the shared Ollama instance from being overwhelmed by any single consumer. Flask turns Ollama from a tool you run locally into a shared service your whole team can use.
