How to Stream Ollama Responses over WebSockets

Ollama’s streaming API sends tokens as newline-delimited JSON as they are generated. Exposing this stream to browser clients via WebSocket gives users real-time token display without polling. This guide covers building a Python WebSocket server (using FastAPI + WebSockets) that proxies Ollama streaming to connected browser clients.
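To make the newline-delimited JSON format concrete, here is a minimal sketch of how a stream of such lines can be reduced to text fragments. The helper name iter_tokens and the sample lines are illustrative assumptions; the samples are hand-written and abbreviated, keeping only the message and done fields that the endpoints in this article read.

```python
import json
from typing import Iterable, Iterator


def iter_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragment carried by each NDJSON line of a chat stream."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between JSON objects
        chunk = json.loads(line)
        token = chunk.get("message", {}).get("content", "")
        if token:
            yield token
        if chunk.get("done"):
            break  # the final object is marked with done=true


# Hand-written, abbreviated sample lines (assumed shape, not captured output).
sample = [
    '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    '{"message": {"role": "assistant", "content": "lo"}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]

if __name__ == "__main__":
    print("".join(iter_tokens(sample)))
```

Each line is a complete JSON object, so the parser never has to buffer partial objects across lines; that is what makes the format convenient to forward one fragment at a time.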

Why WebSockets for Ollama

HTTP streaming (Server-Sent Events) works for one-way server-to-client token delivery, but WebSockets enable bidirectional communication — the client can send a message, the server starts inference and streams tokens back, and the client can cancel mid-stream or send a follow-up before generation completes. This bidirectional model is a better fit for interactive chat interfaces than SSE, particularly when you want features like stop-generation buttons or real-time typing indicators. WebSockets also maintain a persistent connection that avoids HTTP handshake overhead for every message, which matters for low-latency chat applications.

FastAPI WebSocket Server

pip install fastapi uvicorn ollama websockets httpx
# server.py
import asyncio
import json
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from fastapi.responses import HTMLResponse
import ollama

app = FastAPI()

@app.websocket("/ws/chat")
async def chat_websocket(websocket: WebSocket):
    await websocket.accept()
    try:
        while True:
            data = await websocket.receive_json()
            message = data.get("message", "")
            model = data.get("model", "llama3.2")

            # Stream tokens back to the client. ollama.AsyncClient yields chunks
            # without blocking the event loop; iterating the synchronous
            # ollama.chat() generator here would stall every other connection.
            stream = await ollama.AsyncClient().chat(
                model=model,
                messages=[{"role": "user", "content": message}],
                stream=True,
            )
            async for chunk in stream:
                token = chunk["message"]["content"]
                if token:
                    await websocket.send_json({"type": "token", "content": token})

            await websocket.send_json({"type": "done"})

    except WebSocketDisconnect:
        pass

# uvicorn server:app --reload

Async Streaming with httpx

import httpx
import asyncio

@app.websocket("/ws/chat/async")
async def chat_ws_async(websocket: WebSocket):
    await websocket.accept()
    try:
        while True:
            data = await websocket.receive_json()
            async with httpx.AsyncClient(timeout=120) as client:
                async with client.stream(
                    "POST",
                    "http://localhost:11434/api/chat",
                    json={
                        "model": data.get("model", "llama3.2"),
                        "messages": [{"role": "user", "content": data["message"]}],
                        "stream": True
                    }
                ) as response:
                    async for line in response.aiter_lines():
                        if not line:
                            continue
                        chunk = json.loads(line)
                        token = chunk.get("message", {}).get("content", "")
                        if token:
                            await websocket.send_json({"type": "token", "content": token})
                        if chunk.get("done"):
                            await websocket.send_json({"type": "done"})
                            break
    except WebSocketDisconnect:
        pass

Browser Client (Vanilla JS)

<!-- index.html -->
<div id="output"></div>
<input id="msg" type="text" placeholder="Ask something..." />
<button onclick="send()">Send</button>
<!-- Closing the socket ends the stream, but the page must be reloaded before sending again -->
<button onclick="ws.close()">Stop</button>

<script>
const ws = new WebSocket("ws://localhost:8000/ws/chat");
const output = document.getElementById("output");

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === "token") {
    output.textContent += data.content;
  } else if (data.type === "done") {
    output.textContent += "\n\n";
  }
};

function send() {
  const message = document.getElementById("msg").value;
  output.textContent = "";
  ws.send(JSON.stringify({ message, model: "llama3.2" }));
}
</script>

Connection Manager for Multiple Clients

class ConnectionManager:
    def __init__(self):
        self.active: list[WebSocket] = []

    async def connect(self, ws: WebSocket):
        await ws.accept()
        self.active.append(ws)

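    def disconnect(self, ws: WebSocket):
        # Guard against removing a socket that was already dropped
        if ws in self.active:
            self.active.remove(ws)

    async def broadcast(self, message: dict):
        # Iterate over a copy so a disconnect during a send does not skip clients
        for ws in list(self.active):
            await ws.send_json(message)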

manager = ConnectionManager()

@app.websocket("/ws/shared")
async def shared_chat(websocket: WebSocket):
    await manager.connect(websocket)
    try:
        while True:
            data = await websocket.receive_json()
            # Broadcast each token to all connected clients
            async with httpx.AsyncClient(timeout=120) as client:
                async with client.stream("POST", "http://localhost:11434/api/chat",
                    json={"model": "llama3.2",
                          "messages": [{"role": "user", "content": data["message"]}],
                          "stream": True}) as resp:
                    async for line in resp.aiter_lines():
                        if line:
                            chunk = json.loads(line)
                            token = chunk.get("message", {}).get("content", "")
                            if token:
                                await manager.broadcast({"type": "token", "content": token})
                    # Tell every client that the response has finished
                    await manager.broadcast({"type": "done"})
    except WebSocketDisconnect:
        manager.disconnect(websocket)

WebSocket vs Server-Sent Events for Ollama Streaming

Both WebSockets and Server-Sent Events (SSE) deliver streaming tokens to the browser, but they differ in important ways for AI chat applications. SSE is simpler — it is a one-way HTTP stream from server to browser, with built-in reconnection and no additional protocol overhead. WebSockets are bidirectional — the same persistent connection carries both the user’s messages to the server and the model’s token stream back to the browser. For simple single-turn Q&A (user sends a question, gets a streamed answer), SSE is the simpler choice. For interactive chat with conversation history, real-time cancellation of in-flight responses, typing indicators, or multi-user shared sessions, WebSockets provide a cleaner architecture. The FastAPI WebSocket approach in this article handles all of these patterns with the same underlying connection.

Cancellation Mid-Stream

import asyncio

@app.websocket("/ws/chat/cancellable")
async def cancellable_chat(websocket: WebSocket):
    await websocket.accept()
    stream_task = None
    try:
        while True:
            data = await websocket.receive_json()

            # Cancel any in-flight stream
            if stream_task and not stream_task.done():
                stream_task.cancel()

            if data.get("type") == "cancel":
                await websocket.send_json({"type": "cancelled"})
                continue

            async def stream_response(message: str):
                async with httpx.AsyncClient(timeout=120) as client:
                    async with client.stream("POST",
                        "http://localhost:11434/api/chat",
                        json={"model": "llama3.2",
                              "messages": [{"role": "user", "content": message}],
                              "stream": True}) as resp:
                        async for line in resp.aiter_lines():
                            if line:
                                chunk = json.loads(line)
                                token = chunk.get("message", {}).get("content", "")
                                if token:
                                    await websocket.send_json({"type": "token", "content": token})
                        await websocket.send_json({"type": "done"})

            stream_task = asyncio.create_task(stream_response(data["message"]))
    except WebSocketDisconnect:
        if stream_task:
            stream_task.cancel()

React Frontend with WebSocket Hook

// hooks/useOllamaChat.ts
import { useState, useEffect, useRef } from 'react';

export function useOllamaChat(url: string) {
  const [tokens, setTokens] = useState('');
  const [isStreaming, setIsStreaming] = useState(false);
  const ws = useRef<WebSocket | null>(null);

  useEffect(() => {
    ws.current = new WebSocket(url);
    ws.current.onmessage = (e) => {
      const data = JSON.parse(e.data);
      if (data.type === 'token') setTokens(prev => prev + data.content);
      else if (data.type === 'done') setIsStreaming(false);
    };
    return () => ws.current?.close();
  }, [url]);

  const send = (message: string) => {
    setTokens('');
    setIsStreaming(true);
    ws.current?.send(JSON.stringify({ message }));
  };

  const cancel = () => {
    ws.current?.send(JSON.stringify({ type: 'cancel' }));
    setIsStreaming(false);
  };

  return { tokens, isStreaming, send, cancel };
}

Authentication and Security

Browsers send cookies with the WebSocket handshake, but the browser WebSocket API provides no way to set custom headers such as Authorization. For token-based authentication, a common workaround is to pass a token as a query parameter on the initial connection URL (ws://localhost:8000/ws/chat?token=...) and validate it in the FastAPI WebSocket handler before accept(). Query strings tend to end up in access logs, so prefer short-lived tokens. In production, always use wss:// (WebSocket Secure) rather than plain ws://: configure TLS termination at your reverse proxy (NGINX, Caddy) and forward to the FastAPI application over plain WebSocket. WebSocket handshakes are not subject to CORS, so check the Origin header in the handler against an allow-list before accept(). This prevents cross-site WebSocket hijacking, where a malicious page opens a WebSocket to your API using the visitor’s credentials.
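The checks described above can be sketched with the standard library alone. This is a minimal illustration, not a complete authentication scheme: SECRET_KEY, ALLOWED_ORIGINS, the token format, and the helper names sign, verify, and origin_allowed are all hypothetical, and a real deployment would more likely use an established token format with an expiry.

```python
import hashlib
import hmac
from typing import Optional

# Hypothetical values for illustration only.
SECRET_KEY = b"change-me"
ALLOWED_ORIGINS = {"https://your-app.example.com"}


def _signature(user_id: str) -> str:
    return hmac.new(SECRET_KEY, user_id.encode(), hashlib.sha256).hexdigest()


def sign(user_id: str) -> str:
    """Return a token of the form '<user_id>.<hex signature>'."""
    return f"{user_id}.{_signature(user_id)}"


def verify(token: Optional[str]) -> Optional[str]:
    """Return the user id if the signature matches, otherwise None."""
    if not token or "." not in token:
        return None
    user_id, _, sig = token.rpartition(".")
    if not user_id:
        return None
    # Compare as bytes in constant time to avoid leaking timing information.
    if hmac.compare_digest(sig.encode(), _signature(user_id).encode()):
        return user_id
    return None


def origin_allowed(origin: Optional[str]) -> bool:
    """Accept only handshakes whose Origin header is on the allow-list."""
    return origin in ALLOWED_ORIGINS
```

Inside a handler, the inputs would come from websocket.query_params.get("token") and websocket.headers.get("origin"); if either check fails, calling await websocket.close(code=1008) before accept() rejects the handshake.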

Performance and Scaling

A single FastAPI process handles many concurrent WebSocket connections efficiently via asyncio, unlike synchronous frameworks where each connection would occupy a thread. The bottleneck for concurrent WebSocket AI sessions is not the Python event loop but Ollama’s inference capacity: a single Ollama instance serves at most OLLAMA_NUM_PARALLEL simultaneous requests per loaded model (the default varies by Ollama version, so check the documentation for the release you run). For multi-user WebSocket deployments, either increase OLLAMA_NUM_PARALLEL (needs more VRAM), deploy multiple Ollama instances with a load balancer, or implement a server-side queue that serialises inference requests while maintaining individual WebSocket connections to all clients. The connection manager pattern from this article is the foundation for the queued approach: extend it to hold pending requests and process them sequentially when the previous inference completes.

Getting Started

Install FastAPI, uvicorn, and httpx, copy the async WebSocket endpoint from this article into server.py, and run uvicorn server:app --reload. Open the HTML client in a browser with its WebSocket URL pointing at the endpoint you copied (for example /ws/chat/async), type a question, and watch tokens stream back in real time via WebSocket. Add the cancellation pattern when you need stop-generation functionality, and the React hook when integrating with a React frontend. The WebSocket server approach works with any Ollama model and any frontend framework: swap the model name in the server and build the UI as needed for your application.

Deployment with NGINX WebSocket Proxy

Production WebSocket deployments require a reverse proxy configured to forward WebSocket upgrade requests correctly. NGINX handles this with specific directives:

server {
    listen 443 ssl;
    # ssl_certificate and ssl_certificate_key must also be set here for TLS
    server_name your-app.example.com;

    location /ws/ {
        proxy_pass http://localhost:8000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_read_timeout 300s;  # Allow long inference
        proxy_send_timeout 300s;
    }

    location / {
        proxy_pass http://localhost:8000;
    }
}

The critical headers are Upgrade: websocket and Connection: upgrade — without these, NGINX will close the connection after the initial HTTP response rather than upgrading it to a WebSocket. The extended proxy timeouts prevent NGINX from killing long-running inference streams mid-generation.

Reconnection Handling in the Browser

class ReconnectingWebSocket {
  constructor(url, options = {}) {
    this.url = url;
    this.retryDelay = options.retryDelay || 2000;
    this.maxRetries = options.maxRetries || 5;
    this.retries = 0;
    this.ontoken = options.ontoken || (() => {});
    this.ondone = options.ondone || (() => {});
    this.connect();
  }

  connect() {
    this.ws = new WebSocket(this.url);
    this.ws.onmessage = (e) => {
      const data = JSON.parse(e.data);
      if (data.type === 'token') this.ontoken(data.content);
      else if (data.type === 'done') this.ondone();
    };
    this.ws.onclose = () => {
      if (this.retries < this.maxRetries) {
        this.retries++;
        setTimeout(() => this.connect(), this.retryDelay);
      }
    };
    this.ws.onopen = () => { this.retries = 0; };
  }

  send(message) {
    if (this.ws.readyState === WebSocket.OPEN) {
      this.ws.send(JSON.stringify({ message }));
    }
  }
}

const chat = new ReconnectingWebSocket('wss://your-app.example.com/ws/chat', {
  ontoken: (t) => document.getElementById('output').textContent += t,
  ondone: () => console.log('Stream complete')
});

WebSocket vs Polling: When to Use Each

WebSockets are the right choice for interactive chat interfaces, collaborative features, and any case where you want immediate cancellation of in-flight responses. For simpler use cases — a form that submits text and waits for a response, a background processing status indicator — long-polling or SSE may be simpler and sufficient. The rule of thumb: if users are watching tokens appear as the model generates them in a chat-like interface, use WebSockets. If AI processing is a background step users are waiting for (like a document analysis), SSE or even a simple poll-for-completion pattern may be adequate and easier to implement. Choose the simplest mechanism that meets your UX requirements.

WebSocket Patterns for Specific AI Use Cases

Different AI chat features benefit from different WebSocket message designs. For a simple single-user chat, the patterns in this article — send a message, receive tokens, receive done — are sufficient. For collaborative AI features where multiple users see the same AI response in real time (pair programming with AI, shared brainstorming), the broadcast connection manager extends this to multiple simultaneous clients. For agent-style interactions where the AI performs multiple steps with intermediate updates, add message types for step announcements ({type: 'step', description: 'Searching documentation...'}) alongside token chunks. For voice-to-text-to-AI pipelines, add audio chunk reception and Whisper transcription before sending to Ollama. The WebSocket message protocol is entirely under your control — design it around your UX requirements and extend it as your application evolves.
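One way to keep such a protocol manageable is to funnel every outgoing message through a single constructor and every incoming message through a single dispatcher. The sketch below assumes the message types used in this article plus a hypothetical "error" type; the function names make_message and handle_message are illustrative, not part of any library.

```python
import json
from typing import Any

# Message types used in this article, plus a hypothetical "error" type.
MESSAGE_TYPES = {"token", "step", "done", "cancelled", "error"}


def make_message(msg_type: str, **fields: Any) -> str:
    """Serialise one protocol message; the 'type' field is always present."""
    if msg_type not in MESSAGE_TYPES:
        raise ValueError(f"unknown message type: {msg_type}")
    return json.dumps({"type": msg_type, **fields})


def handle_message(raw: str, state: dict) -> None:
    """Update a receiver's state according to the message type."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "token":
        state["text"] += msg["content"]
    elif kind == "step":
        state["steps"].append(msg["description"])
    elif kind == "done":
        state["finished"] = True


state = {"text": "", "steps": [], "finished": False}
for raw in (
    make_message("step", description="Searching documentation..."),
    make_message("token", content="Found "),
    make_message("token", content="it."),
    make_message("done"),
):
    handle_message(raw, state)
```

Rejecting unknown types at construction time means a typo in a message name fails on the server rather than being silently ignored by the client.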

Scaling WebSocket Connections

A single FastAPI process with asyncio handles hundreds to thousands of concurrent WebSocket connections efficiently: the ASGI model is non-blocking, so idle connections waiting for user input cost minimal resources. The limiting factor is inference parallelism: a single Ollama instance serves a limited number of simultaneous inference requests (controlled by OLLAMA_NUM_PARALLEL, whose default varies by Ollama version). For a multi-user WebSocket chat application with many simultaneous active conversations, implement a server-side inference queue: WebSocket connections are cheap and can wait, but inference requests must be serialised or limited to the number Ollama can serve in parallel. The connection manager pattern in this article is the foundation. Extend it with an asyncio.Queue for pending inference requests and workers that drain the queue at Ollama’s rate. This architecture handles arbitrary numbers of WebSocket clients while respecting Ollama’s inference capacity, with each client experiencing the natural queue latency rather than errors or timeouts.
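The queue-and-workers idea can be sketched with asyncio alone. In this illustration the inference call is replaced by a stand-in coroutine (fake_inference), and MAX_PARALLEL, worker, submit, and demo are hypothetical names; in a real server each job would be the coroutine that streams one response to its own WebSocket.

```python
import asyncio
from typing import Any, Awaitable, Callable

Job = Callable[[], Awaitable[Any]]
MAX_PARALLEL = 1  # assumed to match what the Ollama instance can serve


async def worker(queue: asyncio.Queue) -> None:
    """Run queued jobs one at a time until cancelled."""
    while True:
        job, done = await queue.get()
        try:
            result = await job()
        except Exception as exc:
            if not done.done():
                done.set_exception(exc)  # surface the failure to the caller
        else:
            if not done.done():
                done.set_result(result)
        finally:
            queue.task_done()


async def submit(queue: asyncio.Queue, job: Job) -> Any:
    """Enqueue a job and wait until a worker has finished it."""
    done = asyncio.get_running_loop().create_future()
    await queue.put((job, done))
    return await done


async def demo() -> tuple:
    queue: asyncio.Queue = asyncio.Queue()
    workers = [asyncio.create_task(worker(queue)) for _ in range(MAX_PARALLEL)]
    order: list = []

    async def fake_inference(name: str) -> str:
        # Stand-in for a streaming call to Ollama.
        order.append(f"start {name}")
        await asyncio.sleep(0.01)
        order.append(f"end {name}")
        return name

    jobs = [submit(queue, lambda n=n: fake_inference(n)) for n in ("a", "b", "c")]
    results = await asyncio.gather(*jobs)
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results, order


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With one worker, no job starts before the previous one has ended; raising MAX_PARALLEL to match Ollama's parallelism lets that many jobs overlap while the rest wait in the queue.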

WebSocket vs SSE: A Practical Decision Guide

For most Ollama streaming use cases, Server-Sent Events are simpler and sufficient — they require less code, have built-in browser reconnection, and work through HTTP proxies without special configuration. Use WebSockets when you specifically need bidirectional communication: sending follow-up messages while a response is streaming, cancelling in-flight generation, typing indicators, or multi-user shared sessions. If your use case is simply ‘user submits a prompt and watches tokens appear’, SSE is the correct tool. If your use case involves richer interaction — cancellation buttons, real-time collaboration, mid-stream follow-up questions — WebSockets are worth the additional complexity. The connection manager and cancellation patterns in this article handle all the WebSocket-specific use cases cleanly, providing a solid foundation to build on without over-engineering a simple streaming interface.

Conversation History over WebSockets

Maintaining multi-turn conversation history over a persistent WebSocket connection is straightforward — store the history server-side in memory keyed by connection, and include it in every Ollama request:

from collections import defaultdict

histories: dict[int, list] = defaultdict(list)  # keyed by id(websocket), an int

@app.websocket("/ws/chat/history")
async def chat_with_history(websocket: WebSocket):
    await websocket.accept()
    conn_id = id(websocket)  # Unique ID per connection
    try:
        while True:
            data = await websocket.receive_json()
            if data.get("type") == "clear":
                histories[conn_id] = []
                await websocket.send_json({"type": "cleared"})
                continue

            histories[conn_id].append({"role": "user", "content": data["message"]})
            full_response = ""

            async with httpx.AsyncClient(timeout=120) as client:
                async with client.stream("POST",
                    "http://localhost:11434/api/chat",
                    json={"model": "llama3.2",
                          "messages": histories[conn_id],
                          "stream": True}) as resp:
                    async for line in resp.aiter_lines():
                        if line:
                            chunk = json.loads(line)
                            token = chunk.get("message", {}).get("content", "")
                            if token:
                                full_response += token
                                await websocket.send_json({"type": "token", "content": token})

            histories[conn_id].append({"role": "assistant", "content": full_response})
            # Trim to last 20 messages
            if len(histories[conn_id]) > 20:
                histories[conn_id] = histories[conn_id][-20:]
            await websocket.send_json({"type": "done"})
    except WebSocketDisconnect:
        histories.pop(conn_id, None)  # Clean up on disconnect; no error if no history exists

Testing WebSocket Endpoints

FastAPI’s TestClient supports WebSocket testing via with client.websocket_connect('/ws/chat') as ws. For integration tests, you need Ollama running with a small, fast model. For unit tests, mock the httpx streaming call to return canned responses. Note that TestClient is synchronous even though the endpoint is async, so tests that use it are plain functions and need no async marker. Async test functions are only needed when you call coroutines directly; mark them with @pytest.mark.asyncio (pytest-asyncio) or @pytest.mark.anyio (the AnyIO pytest plugin). Test the connection, message flow, and disconnection handling separately to keep tests focused and fast.
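One way to unit-test the streaming logic without Ollama or a live socket is to factor the relay loop into a function that takes an async iterator of lines and an async send callable, then feed it fakes. The names relay, fake_lines, and record below are hypothetical; the loop body mirrors the httpx endpoint in this article.

```python
import asyncio
import json
from typing import AsyncIterator, Awaitable, Callable


async def relay(
    lines: AsyncIterator[str],
    send: Callable[[dict], Awaitable[None]],
) -> None:
    """Forward tokens from NDJSON lines to `send`, then emit a done message."""
    async for line in lines:
        if not line:
            continue
        chunk = json.loads(line)
        token = chunk.get("message", {}).get("content", "")
        if token:
            await send({"type": "token", "content": token})
        if chunk.get("done"):
            break
    await send({"type": "done"})


async def fake_lines() -> AsyncIterator[str]:
    """Stand-in for response.aiter_lines() with canned output."""
    yield json.dumps({"message": {"content": "Hi"}, "done": False})
    yield ""
    yield json.dumps({"message": {"content": "!"}, "done": False})
    yield json.dumps({"message": {"content": ""}, "done": True})


def test_relay_forwards_tokens_then_done() -> None:
    sent: list = []

    async def record(message: dict) -> None:
        sent.append(message)  # stand-in for websocket.send_json

    asyncio.run(relay(fake_lines(), record))
    assert sent == [
        {"type": "token", "content": "Hi"},
        {"type": "token", "content": "!"},
        {"type": "done"},
    ]


if __name__ == "__main__":
    test_relay_forwards_tokens_then_done()
```

Because the test drives the coroutine with asyncio.run, it is an ordinary synchronous test function and needs no async marker.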

The WebSocket + Ollama Stack in Production

The FastAPI + WebSocket + Ollama combination is a workable stack for real-time AI chat interfaces in production. FastAPI handles WebSocket connections efficiently with asyncio, supports authentication and middleware, and deploys cleanly behind NGINX or any reverse proxy. The httpx async streaming approach avoids blocking the event loop during inference, allowing many concurrent connections to be managed by a single FastAPI process. Ollama provides the local inference layer with a straightforward streaming API. The three components are independently deployable, testable, and replaceable: you can swap Ollama for another OpenAI-compatible backend, or FastAPI for another ASGI framework, without rewriting the WebSocket protocol or client code. This separation of concerns keeps the stack maintainable as your application grows from a prototype into a production system with real users. The added complexity pays for itself in a more responsive chat UX than a polling-based design can offer.

WebSocket vs REST for AI APIs

When building an AI feature for a web application, the choice between WebSocket and REST (with polling or SSE) affects both the user experience and the backend architecture. REST endpoints are simpler to implement, cache, and debug; they work well for fire-and-forget AI tasks where the result is short or where the user is willing to wait for a complete response. WebSockets add connection management complexity but enable the interactive streaming UX that users expect from chat interfaces: tokens appear as they are generated, rather than after a 15–30 second wait for a complete response, and the user can interrupt at any point. The streaming UX is not just cosmetic. Progress feedback generally reduces perceived wait time, so a streamed response tends to feel faster than a complete response delivered with the same total latency. For any AI feature where users are actively watching the response generate, WebSocket streaming is worth the additional implementation effort. For background processing, reporting, and classification tasks where users do not watch generation, REST is the right choice.

WebSocket streaming is the foundation for building AI applications that feel fast and interactive. The async FastAPI approach in this article scales to hundreds of simultaneous connections with a single Python process, and the reconnection and authentication patterns make it production-ready. Build the simple version first, validate it with real users, and add the complexity of cancellation and multi-client broadcasting only when your use case requires it — and the patterns in this article provide a solid foundation for whatever complexity you ultimately need.

The WebSocket pattern in this article handles the core use case cleanly — bidirectional streaming with cancellation support and multi-client broadcast capabilities that cover most production AI chat requirements without unnecessary complexity.
