How to Build a Local AI API with Ollama and Express

Express is the most widely used Node.js web framework — minimal, flexible, and with a massive ecosystem of middleware. Building an Ollama API with Express gives you a lightweight HTTP layer that any client can consume: a web frontend, a mobile app, a CLI tool, or another service. This guide walks through a production-ready Express API for Ollama covering chat endpoints, streaming with Server-Sent Events, API key authentication, rate limiting, and deployment. The patterns here complement the Svelte and SvelteKit guides on this site — Express is a natural backend for any frontend framework that needs a server-side Ollama proxy.

Setup

mkdir ollama-express && cd ollama-express
npm init -y
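npm install express node-fetch@2 dotenv express-rate-limit cors helmet
npm install -D nodemon

The node-fetch package is pinned to version 2 because version 3 is published as an ES module only and cannot be loaded with require(), which the code in this guide uses. Create a .env file with the following values:

OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3.2
API_KEY=your-secret-key
PORT=3001

Run during development with npx nodemon app.js.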

Basic Chat Endpoint

Here is a minimal Express server with a chat endpoint, auth middleware, and proper error handling:

require("dotenv").config();
const express = require("express");
const fetch = require("node-fetch");
const cors = require("cors");
const helmet = require("helmet");

const app = express();
app.use(express.json());
app.use(helmet());
app.use(cors({ origin: process.env.ALLOWED_ORIGINS?.split(",") || "*" }));

const OLLAMA = process.env.OLLAMA_BASE_URL;
const MODEL = process.env.OLLAMA_MODEL;
const API_KEY = process.env.API_KEY;

const auth = (req, res, next) => {
  if (!API_KEY || req.headers["x-api-key"] === API_KEY) return next();
  res.status(401).json({ error: "Unauthorized" });
};

app.post("/api/chat", auth, async (req, res) => {
  const { messages, model } = req.body;
  if (!messages?.length) return res.status(400).json({ error: "messages required" });
  try {
    const resp = await fetch(`${OLLAMA}/api/chat`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model: model || MODEL, messages, stream: false })
    });
    // Pass Ollama's status through so upstream errors are not reported as 200.
    res.status(resp.status).json(await resp.json());
  } catch (err) {
    const status = err.code === "ECONNREFUSED" ? 503 : 500;
    res.status(status).json({ error: err.message });
  }
});

app.listen(process.env.PORT || 3001);

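The helmet() middleware adds security headers including X-Content-Type-Options, X-Frame-Options, and Strict-Transport-Security automatically. The ECONNREFUSED check returns a 503 when Ollama is not running, giving clients a meaningful status rather than a generic 500. Note that the auth middleware lets every request through when API_KEY is unset. That is convenient for local development, but the variable must be set before the server is reachable by anyone else.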

Streaming with SSE

Add a streaming endpoint that forwards tokens to clients as they arrive:

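app.post("/api/chat/stream", auth, async (req, res) => {
  const { messages, model } = req.body;
  // Validate before the SSE headers go out so a bad request gets a normal 400.
  if (!messages?.length) return res.status(400).json({ error: "messages required" });
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("X-Accel-Buffering", "no");
  res.flushHeaders();
  try {
    const resp = await fetch(`${OLLAMA}/api/chat`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model: model || MODEL, messages, stream: true })
    });
    if (!resp.ok) throw new Error(`Ollama responded with ${resp.status}`);
    const decoder = new TextDecoder(); // handles multi-byte characters split across chunks
    let buffer = "";                   // holds a JSON line split across chunks
    for await (const chunk of resp.body) {
      buffer += decoder.decode(chunk, { stream: true });
      const lines = buffer.split("\n");
      buffer = lines.pop(); // the last piece may be incomplete; keep it for the next chunk
      for (const line of lines.filter(Boolean)) {
        try {
          const data = JSON.parse(line);
          const token = data?.message?.content ?? "";
          if (token) res.write(`data: ${JSON.stringify({ token })}\n\n`);
          if (data.done) { res.write("data: [DONE]\n\n"); res.end(); return; }
        } catch { /* skip malformed lines */ }
      }
    }
    res.end(); // upstream closed without sending a done message
  } catch (err) {
    res.write(`data: ${JSON.stringify({ error: err.message })}\n\n`);
    res.end();
  }
});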

The flushHeaders() call sends the HTTP headers immediately, establishing the SSE connection before any data arrives. Without it, Node.js holds the headers until the first write, so the client sees nothing until Ollama produces its first token, which can take several seconds while a model loads. The X-Accel-Buffering: no header disables nginx response buffering when deployed behind a reverse proxy.
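Because the streaming endpoint is a POST, the browser EventSource API (which only issues GET requests) cannot consume it directly; a client has to read the response body itself. The following is a minimal sketch of such a client, assuming the server from this guide on localhost:3001 and a runtime with a global fetch (browsers or Node.js 18+). The names createSseParser and streamChat are illustrative, not part of any library.

```javascript
// Incremental parser for the "data: ...\n\n" frames written by the
// streaming endpoint. A frame may be split across network chunks, so the
// parser keeps the unfinished tail between calls.
function createSseParser() {
  let buffer = "";
  return function feed(text) {
    buffer += text;
    const frames = buffer.split("\n\n");
    buffer = frames.pop(); // incomplete frame, if any
    const events = [];
    for (const frame of frames) {
      if (!frame.startsWith("data: ")) continue;
      const payload = frame.slice("data: ".length);
      events.push(payload === "[DONE]" ? { done: true } : JSON.parse(payload));
    }
    return events;
  };
}

// Example consumer. The URL and API key are placeholders (assumptions).
async function streamChat(messages, onToken) {
  const resp = await fetch("http://localhost:3001/api/chat/stream", {
    method: "POST",
    headers: { "Content-Type": "application/json", "x-api-key": "your-secret-key" },
    body: JSON.stringify({ messages })
  });
  const feed = createSseParser();
  const decoder = new TextDecoder();
  const reader = resp.body.getReader();
  while (true) {
    const { value, done } = await reader.read();
    if (done) return;
    for (const event of feed(decoder.decode(value, { stream: true }))) {
      if (event.done) return;
      if (event.error) throw new Error(event.error);
      if (event.token) onToken(event.token);
    }
  }
}
```

A caller would typically append each token to the UI as it arrives, for example streamChat(messages, (t) => process.stdout.write(t)) in a CLI.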

Rate Limiting and Additional Endpoints

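Add rate limiting to protect against abuse. Express runs middleware in registration order, so place these lines above the route definitions; a limiter registered after a route never runs for it: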

const rateLimit = require("express-rate-limit");
const limiter = rateLimit({ windowMs: 60000, max: 10, standardHeaders: true });
app.use("/api/", limiter);

The standardHeaders: true option includes RateLimit-* headers in responses so clients know their current limit and remaining quota. Add embeddings and health check endpoints to complete the API surface. The health endpoint uses AbortSignal.timeout(5000) — available natively in Node.js 18+ — to give Ollama 5 seconds to respond before returning a 503. Expose a /api/models endpoint that proxies Ollama’s /api/tags response so frontends can populate model dropdowns dynamically without hardcoding model names — when a user pulls a new model, it appears automatically.
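The health and model-list endpoints described above could look roughly like this. It is a sketch rather than a definitive implementation: the handlers are written as factories that receive the fetch implementation and the Ollama base URL, a structure chosen here so they can be tried without a running Ollama instance. The names healthHandler and modelsHandler are illustrative, and the embeddings endpoint is left out.

```javascript
// Health check: report 503 if Ollama does not answer within 5 seconds.
function healthHandler(fetchImpl, baseUrl) {
  return async (req, res) => {
    try {
      const resp = await fetchImpl(`${baseUrl}/api/tags`, {
        signal: AbortSignal.timeout(5000)
      });
      if (!resp.ok) throw new Error(`Ollama responded with ${resp.status}`);
      res.json({ status: "ok" });
    } catch (err) {
      res.status(503).json({ status: "unavailable", error: err.message });
    }
  };
}

// Model list: proxy Ollama's /api/tags and return only the model names.
function modelsHandler(fetchImpl, baseUrl) {
  return async (req, res) => {
    try {
      const resp = await fetchImpl(`${baseUrl}/api/tags`);
      const { models = [] } = await resp.json();
      // e.g. { models: ["llama3.2:latest"] }, enough for a dropdown
      res.json({ models: models.map((m) => m.name) });
    } catch (err) {
      res.status(503).json({ error: err.message });
    }
  };
}
```

In app.js these would be wired up with app.get("/api/health", healthHandler(fetch, OLLAMA)) and app.get("/api/models", auth, modelsHandler(fetch, OLLAMA)). Leaving the health endpoint unauthenticated is a common choice for load balancer probes, but it is a judgment call.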

Request Logging and Deployment

Add lightweight per-request logging using a finish hook. Register it before the routes so it sees every request:

app.use((req, res, next) => {
  const start = Date.now();
  res.on("finish", () =>
    console.log(`${req.method} ${req.path} ${res.statusCode} ${Date.now()-start}ms`)
  );
  next();
});

Deploy persistently with PM2: pm2 start app.js --name ollama-api && pm2 save && pm2 startup. PM2 restarts the process on crash and, once the command printed by pm2 startup has been run, starts it on boot. Log rotation is not built in; add it with pm2 install pm2-logrotate. Put Nginx in front for HTTPS and set proxy_buffering off for the streaming endpoint. To run 2 to 4 cluster workers, add -i 2 (or up to -i 4) to the pm2 start command. Requests may still queue on Ollama's side, depending on its concurrency settings, but the Express layer stays responsive and can serve health checks and other fast endpoints while Ollama is busy with a long generation request.

TypeScript Support

The Express ecosystem has good TypeScript support via @types/express and @types/node. Adding TypeScript to the project gives you typed request and response objects, autocomplete for Express middleware, and typed access to environment variables. Use ts-node or tsx for development and compile to JavaScript for production with tsc. Define interfaces for your request body types, for example interface ChatRequest { messages: Message[]; model?: string }, and use them as generic parameters on the route handler: app.post<{}, {}, ChatRequest>("/api/chat", auth, async (req, res) => { ... }). TypeScript then catches mistakes in how the handler uses req.body at compile time, such as a misspelled field or a wrong type. It does not check what a client actually sends, because types are erased during compilation, so the handler still needs runtime validation of the request body.
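Whatever types the handler declares, the body that arrives over HTTP is unchecked at runtime, so a small validator is a useful companion to the ChatRequest shape. Here is one possible sketch in plain JavaScript; validateChatBody is a hypothetical helper, and the role list reflects the roles Ollama's chat API accepts as far as this guide assumes.

```javascript
const ROLES = new Set(["system", "user", "assistant", "tool"]);

// Returns an error string for an invalid chat request body, or null if valid.
function validateChatBody(body) {
  if (!body || typeof body !== "object") return "body must be a JSON object";
  const { messages, model } = body;
  if (!Array.isArray(messages) || messages.length === 0) {
    return "messages must be a non-empty array";
  }
  for (const [i, m] of messages.entries()) {
    if (!m || !ROLES.has(m.role)) return `messages[${i}].role is invalid`;
    if (typeof m.content !== "string") return `messages[${i}].content must be a string`;
  }
  if (model !== undefined && typeof model !== "string") return "model must be a string";
  return null;
}
```

In a route handler, call it first and return a 400 with the message when the result is not null.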

Why Express for Ollama

Express occupies a specific niche in the Ollama ecosystem: it is the right backend when you are already working in JavaScript or TypeScript and want a thin, familiar HTTP layer without learning a new language or framework. For a React, Svelte, or Vue frontend that needs a server-side Ollama proxy, an Express backend means the same language and toolchain throughout the project. For a Node.js CLI tool that occasionally needs to expose an API, Express is trivial to add. For a team that knows Node.js and needs to ship quickly, Express is the fastest path from Ollama to a documented, secured, rate-limited API that other services can consume reliably.

Middleware Architecture

Express's middleware stack is its defining feature: every request passes through a sequence of functions, each of which can inspect, modify, or terminate the request-response cycle. For an Ollama API, the middleware stack gives you a clean place to add cross-cutting concerns without touching the route handlers themselves. Authentication runs before each protected route handler and either passes through or returns 401. Rate limiting either passes through or returns 429; a limiter mounted with app.use ahead of the routes runs before the per-route auth middleware, so unauthenticated floods are throttled as well. Request logging runs in a finish hook after the response. Error handling is registered last and catches errors that route handlers pass along. Each middleware function has a single responsibility and can be added or removed without modifying any other middleware or any route handler, though the order of registration determines the order of execution.
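The error-handling stage at the end of the stack can be as small as the following sketch. The function name and the choice to hide messages for 5xx errors are assumptions made for illustration.

```javascript
// Express identifies error-handling middleware by its four-argument
// signature. Register it after all routes: app.use(errorHandler).
function errorHandler(err, req, res, next) {
  // If a streaming response has already started, defer to Express's
  // default handler, which closes the connection.
  if (res.headersSent) return next(err);
  // Errors from express.json() (e.g. malformed JSON) carry a 4xx status.
  const status = err.status || err.statusCode || 500;
  // Hide internal details for server errors; echo the message for client errors.
  const message = status >= 500 ? "Internal server error" : err.message;
  if (status >= 500) console.error(err);
  res.status(status).json({ error: message });
}
```

In Express 4, an error thrown inside an async handler reaches this middleware only if the handler catches it and calls next(err); Express 5 forwards rejected promises automatically.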

For an Ollama-specific middleware addition, consider adding a request validation layer that checks whether the specified model is available before forwarding to Ollama. If a client requests llama3.2:70b but only llama3.2 is pulled, the validation middleware can return a 400 with a helpful message, such as "model llama3.2:70b not available, available models: llama3.2", rather than forwarding the request and relaying Ollama's own error. Fetch the available models once on server startup and refresh periodically, caching the list in memory so the validation check is a simple Set lookup rather than an API call on every request. Ollama lists models with their tag, for example llama3.2:latest, so normalize names before comparing.
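A model validation layer along these lines might be sketched as follows. The factory name createModelValidator and its arguments are hypothetical; the fetch implementation is passed in so the logic can be exercised without Ollama, and the sketch assumes /api/tags returns an object with a models array whose entries have a name field.

```javascript
// Ollama lists models with an explicit tag ("llama3.2:latest"); a bare
// name in a request is treated here as the "latest" tag.
const normalize = (name) => (name.includes(":") ? name : `${name}:latest`);

function createModelValidator(fetchImpl, baseUrl, defaultModel) {
  let available = new Set();

  // Reload the model list from Ollama's /api/tags endpoint.
  async function refresh() {
    const resp = await fetchImpl(`${baseUrl}/api/tags`);
    const { models = [] } = await resp.json();
    available = new Set(models.map((m) => m.name));
  }

  function middleware(req, res, next) {
    const requested = normalize(String(req.body?.model || defaultModel));
    // An empty set means the list has not loaded yet; let the request through.
    if (available.size === 0 || available.has(requested)) return next();
    res.status(400).json({
      error: `model ${requested} not available, available models: ${[...available].join(", ")}`
    });
  }

  return { refresh, middleware };
}
```

At startup, create one validator, call refresh() once, and schedule further refreshes with setInterval (a one-minute interval is a reasonable starting point). Then insert validator.middleware between auth and the chat handler.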

Conversation Sessions

For applications where multiple users each maintain their own conversation history, use Express sessions to store per-user message history server-side. The express-session middleware with a Redis store (via connect-redis) gives you scalable, persistent sessions that survive server restarts. Each session stores the message history array, and the chat endpoint reads from and writes to the session on every request. Set a reasonable session expiry — 24 hours is a sensible default — so inactive conversations are cleaned up automatically without manual management.
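As a rough sketch of the session-backed variant, the handler below reads and writes a history array on req.session, which express-session provides once its middleware is installed. Several things here are assumptions for illustration: the handler accepts a single content string rather than a full messages array, chatWithOllama is a hypothetical helper that resolves to the assistant's reply message, and MAX_HISTORY is an arbitrary cap.

```javascript
const MAX_HISTORY = 40; // assumed cap so long conversations stay bounded

// Factory for a chat handler that keeps per-user history in the session.
function sessionChatHandler(chatWithOllama) {
  return async (req, res) => {
    const { content } = req.body || {};
    if (typeof content !== "string" || !content) {
      return res.status(400).json({ error: "content required" });
    }
    // Work on a copy so a failed call leaves the stored history untouched.
    const history = [...(req.session.history || []), { role: "user", content }];
    try {
      const reply = await chatWithOllama(history);
      history.push(reply);
      // Keep only the most recent messages.
      req.session.history = history.slice(-MAX_HISTORY);
      res.json({ message: reply });
    } catch (err) {
      res.status(502).json({ error: err.message });
    }
  };
}
```

The session middleware itself would be registered with something like app.use(session({ secret: process.env.SESSION_SECRET, resave: false, saveUninitialized: false, cookie: { maxAge: 24 * 60 * 60 * 1000 } })), where SESSION_SECRET is an assumed environment variable. The Redis store is passed through the store option; its constructor differs between connect-redis versions, so follow the documentation for the version you install.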

For single-user local deployments where sessions are not needed, simply accept the full conversation history as part of the request body — the client sends all previous messages with each new request, and the server passes them directly to Ollama without any server-side state. This stateless approach is simpler to implement and debug, and it is the right architecture for an API consumed by a single frontend application that manages conversation state itself.

WebSocket Support for Bidirectional Streaming

Server-Sent Events work well for Ollama streaming because the communication is one-directional — the server streams tokens to the client, and the client sends new messages via a separate POST request. For use cases that need true bidirectional communication — a voice assistant where audio and text flow in both directions, a multi-agent system where the client and server exchange intermediate results — upgrade to WebSockets using the ws library alongside Express.

Adding WebSocket support to an Express server is straightforward: create a WebSocket server that shares the Express HTTP server, handle incoming messages to start Ollama streaming calls, and emit tokens back over the WebSocket connection as they arrive. The WebSocket approach adds complexity compared to SSE — connection management, reconnection logic, and message framing — but provides a richer communication model when the use case requires it. For most Ollama chat applications, SSE is simpler and sufficient.
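The per-message logic for such a WebSocket server could be sketched like this. The socket argument is anything with a send(string) method, such as a connection object from the ws library. streamTokens is a hypothetical async generator that yields tokens from a streaming Ollama call, and the message shapes ({ type: "token" }, { type: "done" }, { type: "error" }) are an invented framing, not a standard.

```javascript
// Handle one incoming WebSocket message by streaming tokens back.
async function handleSocketMessage(socket, raw, streamTokens) {
  let payload;
  try {
    payload = JSON.parse(raw.toString());
  } catch {
    return socket.send(JSON.stringify({ type: "error", error: "invalid JSON" }));
  }
  if (!Array.isArray(payload?.messages) || payload.messages.length === 0) {
    return socket.send(JSON.stringify({ type: "error", error: "messages required" }));
  }
  try {
    for await (const token of streamTokens(payload.messages, payload.model)) {
      socket.send(JSON.stringify({ type: "token", token }));
    }
    socket.send(JSON.stringify({ type: "done" }));
  } catch (err) {
    socket.send(JSON.stringify({ type: "error", error: err.message }));
  }
}
```

With the ws package, the wiring is roughly: keep the return value of app.listen() as server, create new WebSocketServer({ server }), and in the "connection" event attach socket.on("message", (raw) => handleSocketMessage(socket, raw, streamTokens)). WebSocket connections do not pass through Express middleware, so the API key has to be checked separately during the connection handshake.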

Caching Repeated Responses

At low temperature settings, Ollama’s responses to identical prompts are deterministic or nearly so. Adding a cache layer in Express can dramatically improve response time for common queries — frequently asked questions, standard code generation prompts, or analysis of frequently referenced documents. Use the node-cache package for in-memory caching with TTL support: generate a cache key from a hash of the messages array and model name, check the cache before calling Ollama, and store the result after a successful call. Set the TTL based on how frequently the content changes — an hour for conversational responses, a day for document analysis, indefinitely for deterministic code generation prompts.
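To make the caching flow concrete, here is a minimal sketch. It uses a small Map-based TTL cache as a stand-in for node-cache so the example has no dependencies; cacheKey, TtlCache, and withCache are illustrative names, and callOllama is a hypothetical function that performs the actual request.

```javascript
const { createHash } = require("node:crypto");

// Deterministic key: the same model and messages always hash to the same value.
function cacheKey(model, messages) {
  return createHash("sha256").update(JSON.stringify({ model, messages })).digest("hex");
}

// Minimal TTL cache on top of Map, standing in for node-cache.
class TtlCache {
  constructor(ttlMs) {
    this.ttlMs = ttlMs;
    this.entries = new Map();
  }
  get(key, now = Date.now()) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expires <= now) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return entry.value;
  }
  set(key, value, now = Date.now()) {
    this.entries.set(key, { value, expires: now + this.ttlMs });
  }
}

// Wrap a function that calls Ollama so repeated requests hit the cache.
function withCache(cache, callOllama) {
  return async (model, messages) => {
    const key = cacheKey(model, messages);
    const hit = cache.get(key);
    if (hit !== undefined) return hit;
    const result = await callOllama(model, messages);
    cache.set(key, result); // only successful calls reach this line
    return result;
  };
}
```

A one-hour cache for conversational responses would be new TtlCache(60 * 60 * 1000). Note that this sketch never evicts entries that are not read again, which node-cache handles with its periodic check.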

Cache only non-streaming responses — streaming responses are harder to cache correctly because the cached content needs to be replayed in the SSE format. For streaming endpoints, consider implementing a “warm” cache that pre-generates responses for common prompts in the background, storing the complete response text. When a cached response is available for a streaming request, replay it token by token with artificial delays to simulate streaming — this preserves the streaming UX while serving cached content at a fraction of the inference cost.
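Replaying a cached response over SSE can be sketched as below. The function name replayAsSse, the 20 ms default delay, and splitting on whitespace to approximate tokens are all assumptions; real model tokens do not align with word boundaries, but the client only sees text fragments either way.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Replay cached text in the same SSE frame format the live endpoint emits.
// res is an Express response whose SSE headers have already been sent.
async function replayAsSse(res, text, delayMs = 20) {
  // Each piece is a word plus its surrounding whitespace, so the pieces
  // concatenate back to the original text.
  const pieces = text.match(/\s*\S+\s*/g) || [];
  for (const token of pieces) {
    res.write(`data: ${JSON.stringify({ token })}\n\n`);
    if (delayMs > 0) await sleep(delayMs);
  }
  res.write("data: [DONE]\n\n");
  res.end();
}
```

In the streaming route, check the cache after flushing the headers and call replayAsSse(res, cachedText) instead of contacting Ollama when there is a hit.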

Monitoring and Observability

For a production Express Ollama API, add metrics collection to understand how the service is behaving over time. Track request counts by endpoint and status code, response time percentiles (p50, p95, p99), Ollama error rates, and model usage distribution. The prom-client package makes it straightforward to expose these metrics in Prometheus format at a /metrics endpoint, and from there they flow into Grafana dashboards through the standard Prometheus scraping mechanism described in the separate Prometheus and Grafana monitoring guide on this site.

Even without a full metrics stack, structured logging with a library like pino gives you queryable logs that are far more useful than console.log output. Log each request with its method, path, status code, duration, model name, and input token count estimate. Log Ollama errors with the full error message and the model that was requested. These logs become invaluable when debugging production issues — understanding whether a spike in errors is caused by a specific model, a specific endpoint, or a specific time of day requires the structured data that console.log cannot provide.
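The shape of such a structured log record is easy to prototype before committing to a library. The sketch below writes one JSON line per request to stdout, which is roughly what pino does out of the box; it does not use pino itself, the field names are illustrative, and the token estimate assumes about four characters per token.

```javascript
// Build one JSON-serializable record per request.
function buildLogRecord(req, res, durationMs) {
  const messages = Array.isArray(req.body?.messages) ? req.body.messages : [];
  const chars = messages.reduce((n, m) => n + String(m?.content ?? "").length, 0);
  return {
    time: new Date().toISOString(),
    method: req.method,
    path: req.path,
    status: res.statusCode,
    durationMs,
    model: req.body?.model ?? null,
    inputTokensEst: Math.ceil(chars / 4) // rough estimate, not a tokenizer
  };
}

// Middleware factory; the write function is injectable for testing.
function structuredLogger(write = (line) => process.stdout.write(line + "\n")) {
  return (req, res, next) => {
    const start = Date.now();
    res.on("finish", () =>
      write(JSON.stringify(buildLogRecord(req, res, Date.now() - start)))
    );
    next();
  };
}
```

Register it with app.use(structuredLogger()) after express.json() so the body is already parsed when the finish hook fires, and before the routes so every request is logged.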

Testing Express Ollama Routes

Test Express routes without a running Ollama instance using Jest or Vitest with Supertest for HTTP assertions and Jest mocks for the fetch calls. The standard pattern is to export the Express app from app.js without calling listen(), import it in test files, and use Supertest’s request(app).post("/api/chat") to make test requests directly against the app without starting a server. Mock node-fetch at the module level to return fixed JSON responses, and write tests that verify the route logic — correct status codes, proper error handling for missing fields, authentication rejection for wrong API keys, and correct response structure for successful calls.

For integration tests that call a real Ollama instance, use a separate test configuration that only runs when the OLLAMA_INTEGRATION_TESTS=1 environment variable is set. These tests verify that the full end-to-end flow works — the Express middleware chain, the Ollama call, and the response formatting — on a machine where Ollama is actually running. Keeping integration tests separate from unit tests means the fast unit test suite runs on every commit regardless of infrastructure, while the integration tests run in dedicated CI environments where Ollama is available.

The Express Ollama API in this guide is a solid foundation for any team building AI-powered applications in JavaScript or TypeScript. The middleware stack handles the operational concerns — authentication, rate limiting, logging, CORS — while keeping route handlers clean and focused on the Ollama interaction. Add TypeScript, structured logging, metrics, and session management as your application grows, and you have a production-grade local AI API that scales with your needs without requiring a cloud provider or per-query pricing.

Start with the basic chat endpoint, get it working against your Ollama instance, then add authentication and rate limiting before sharing the URL with anyone else. Add the streaming endpoint when your frontend is ready to consume SSE responses. Add caching and structured logging when you have enough traffic to benefit from them. Each addition is independent and can be layered in without restructuring what you have already built — which is the Express way, and the reason it has remained the standard Node.js web framework for over a decade.

Express has stayed relevant for over a decade because it does one thing very well: it gives you a minimal, composable HTTP layer and gets out of your way. For an Ollama API that will evolve as your AI capabilities grow, that simplicity is a feature, not a limitation.
