How to Use Ollama with JavaScript and Node.js

The official Ollama JavaScript library makes it straightforward to call local models from Node.js applications, browser-side code, and any JavaScript runtime. It supports chat completions, text generation, embeddings, streaming, and model management — the same feature set as the Python library but idiomatic JavaScript with full TypeScript support. This guide covers installation, the core API, streaming in a web context, and practical patterns for building Node.js applications on top of local LLMs.

Installation

npm install ollama
# or
pnpm add ollama
# or
yarn add ollama

Basic Chat

import ollama from 'ollama';

// Simple chat completion
const response = await ollama.chat({
  model: 'llama3.2',
  messages: [{ role: 'user', content: 'Explain async/await in JavaScript in 2 sentences.' }]
});
console.log(response.message.content);

// With system prompt
const response2 = await ollama.chat({
  model: 'llama3.2',
  messages: [
    { role: 'system', content: 'You are a concise technical assistant. No markdown.' },
    { role: 'user', content: 'What is a closure?' }
  ],
  options: { temperature: 0.3 }
});
console.log(response2.message.content);

Streaming Responses

import ollama from 'ollama';

// Stream tokens as they generate
const stream = await ollama.chat({
  model: 'llama3.2',
  messages: [{ role: 'user', content: 'Write a short poem about JavaScript.' }],
  stream: true
});

for await (const chunk of stream) {
  process.stdout.write(chunk.message.content);
}
console.log(); // newline at end

Text Generation (Non-Chat)

// Raw completion — useful for classification and structured output
const result = await ollama.generate({
  model: 'llama3.2',
  prompt: 'Classify the sentiment as positive, negative, or neutral: "This product is amazing!"\nSentiment:',
  stream: false,
  options: { temperature: 0.0, stop: ['\n'] }
});
console.log(result.response.trim()); // e.g. 'positive'

Embeddings

import ollama from 'ollama';

// Single embedding
const result = await ollama.embeddings({
  model: 'nomic-embed-text',
  prompt: 'The quick brown fox'
});
console.log(`Embedding dimension: ${result.embedding.length}`);

// Cosine similarity helper
function cosineSimilarity(a, b) {
  const dot = a.reduce((sum, val, i) => sum + val * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
  const normB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
  return dot / (normA * normB);
}

const [e1, e2] = await Promise.all([
  ollama.embeddings({ model: 'nomic-embed-text', prompt: 'machine learning' }),
  ollama.embeddings({ model: 'nomic-embed-text', prompt: 'deep learning' })
]);
console.log('Similarity:', cosineSimilarity(e1.embedding, e2.embedding).toFixed(3));

Model Management

import ollama from 'ollama';

// List available models
const { models } = await ollama.list();
models.forEach(m => console.log(m.name, `${(m.size/1e9).toFixed(1)}GB`));

// Show model details
const details = await ollama.show({ model: 'llama3.2' });
console.log(details.modelfile);

// Pull a model with progress
const stream = await ollama.pull({ model: 'llama3.2', stream: true });
for await (const progress of stream) {
  if (progress.total) {
    const pct = ((progress.completed / progress.total) * 100).toFixed(1);
    process.stdout.write(`\rPulling: ${pct}%`);
  }
}
console.log('\nDone.');

Streaming in an Express Server

import express from 'express';
import ollama from 'ollama';

const app = express();
app.use(express.json());

app.post('/chat', async (req, res) => {
  const { messages, model = 'llama3.2' } = req.body;

  // Set headers for Server-Sent Events
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  try {
    const stream = await ollama.chat({ model, messages, stream: true });
    for await (const chunk of stream) {
      res.write(`data: ${JSON.stringify({ content: chunk.message.content })}\n\n`);
    }
    res.write('data: [DONE]\n\n');
  } catch (err) {
    res.write(`data: ${JSON.stringify({ error: err.message })}\n\n`);
  } finally {
    res.end();
  }
});

app.listen(3000, () => console.log('Server running on :3000'));

Consuming the Stream from the Browser

// Frontend JavaScript — consume SSE stream from the Express server above
async function chat(messages) {
  const response = await fetch('/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages })
  });

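  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  const output = document.getElementById('output');
  let buffer = ''; // holds a partial line when a network chunk ends mid-event

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact across chunk boundaries
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop(); // the last piece may be incomplete; keep it for the next read
    for (const line of lines) {
      if (!line.startsWith('data: ')) continue;
      const data = line.slice(6);
      if (data === '[DONE]') return;
      try {
        const { content, error } = JSON.parse(data);
        if (error) output.textContent += `\n[Error: ${error}]`;
        else if (content) output.textContent += content;
      } catch {
        // Ignore lines that are not valid JSON
      }
    }
  }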
}

chat([{ role: 'user', content: 'Hello!' }]);

Custom Ollama Client (Different Host)

import { Ollama } from 'ollama';

// Connect to a remote Ollama server (team server, Docker, etc.)
const client = new Ollama({ host: 'http://192.168.1.100:11434' });

const response = await client.chat({
  model: 'llama3.2',
  messages: [{ role: 'user', content: 'Hello from a remote server!' }]
});
console.log(response.message.content);

Multi-Turn Conversation with History

import ollama from 'ollama';
import * as readline from 'readline/promises';

const rl = readline.createInterface({ input: process.stdin, output: process.stdout });
const history = [];

console.log('Chat with Llama 3.2 (type "quit" to exit)');

while (true) {
  const input = await rl.question('You: ');
  if (input.toLowerCase() === 'quit') break;

  history.push({ role: 'user', content: input });

  const stream = await ollama.chat({
    model: 'llama3.2',
    messages: history,
    stream: true
  });

  process.stdout.write('Assistant: ');
  let assistantReply = '';
  for await (const chunk of stream) {
    process.stdout.write(chunk.message.content);
    assistantReply += chunk.message.content;
  }
  console.log();
  history.push({ role: 'assistant', content: assistantReply });
}

rl.close();

Using with TypeScript

The ollama package ships TypeScript types. Import the Message and ChatResponse types (ChatRequest and GenerateRequest are exported as well) for full type safety:

import ollama, { Message, ChatResponse } from 'ollama';

async function chat(messages: Message[]): Promise<ChatResponse> {
  return ollama.chat({ model: 'llama3.2', messages });
}

const messages: Message[] = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'What is TypeScript?' }
];

const response = await chat(messages);
console.log(response.message.content);

When to Use the JS Library vs the OpenAI SDK

Use the native Ollama JS library when you need model management operations (pull, list, show, delete), want the native streaming format with Ollama-specific metadata, or are building a Node.js tool specifically around Ollama. Use the OpenAI SDK pointed at Ollama’s /v1 endpoint when you want your code to be portable between Ollama and cloud providers, or when you are integrating with a framework (LangChain.js, LlamaIndex.TS) that already targets the OpenAI API. For pure inference in a new project, either works well — the native library is slightly leaner and has better streaming ergonomics in JavaScript, while the OpenAI SDK has broader ecosystem compatibility.
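To make the portability argument concrete, here is a minimal sketch that calls Ollama’s OpenAI-compatible /v1/chat/completions endpoint with plain fetch rather than either SDK. The function name openAICompatChat, its option names, and the placeholder API key are assumptions made for illustration; in principle, changing baseUrl and apiKey is what would point the same code at another OpenAI-compatible provider.

```javascript
// Sketch: call an OpenAI-compatible chat endpoint with plain fetch.
// `openAICompatChat` and its options are hypothetical names, not part of any SDK.
async function openAICompatChat(messages, {
  baseUrl = 'http://localhost:11434/v1', // Ollama's OpenAI-compatible prefix
  model = 'llama3.2',
  apiKey = 'ollama',                     // placeholder; hosted providers need a real key
  fetchImpl = fetch                      // injectable so the function can be tested offline
} = {}) {
  const res = await fetchImpl(`${baseUrl}/chat/completions`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${apiKey}`
    },
    body: JSON.stringify({ model, messages, stream: false })
  });
  if (!res.ok) {
    throw new Error(`Request failed with HTTP ${res.status}`);
  }
  const data = await res.json();
  // OpenAI-style responses put the reply under choices[0].message.content
  return data.choices[0].message.content;
}
```

Calling openAICompatChat([{ role: 'user', content: 'Hello' }]) with the defaults should return the model’s reply as a string, assuming Ollama is running locally with llama3.2 pulled.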

Why JavaScript for Local LLMs?

JavaScript is an unusual choice for LLM integration — Python dominates the ML tooling ecosystem. But for web developers and Node.js backend engineers, calling local LLMs from JavaScript is the natural fit. You can integrate Ollama directly into your existing Express or Fastify server, call it from a Next.js API route, use it in a Bun or Deno script, or consume it directly from browser code via fetch. The official Ollama JavaScript library is maintained by the Ollama team and tracks the native API closely, so it supports everything the Python library supports — including the same streaming format, model management operations, and keep-alive settings.

The practical benefit for web developers is that you can build a complete local AI application — frontend, backend, and inference — without switching languages or runtimes. A Next.js application with an Ollama integration is a single codebase in a single language, which is meaningfully simpler to maintain than a hybrid Python backend with a JavaScript frontend.

Performance Characteristics

The JavaScript library communicates with Ollama over HTTP, the same mechanism as the Python library and curl. There is no meaningful performance difference between languages at the client level, because the bottleneck is almost always Ollama’s inference speed rather than HTTP client overhead. A Node.js application calling Ollama should see essentially the same tokens-per-second and first-token latency as a Python application calling the same model on the same hardware.
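If you want to check throughput on your own hardware, the timing metadata on a completed (non-streamed) response is enough. The sketch below assumes the eval_count, eval_duration, load_duration, and total_duration fields that Ollama reports, with durations in nanoseconds; summarizeTimings is a hypothetical helper name and the sample numbers are made up.

```javascript
// Sketch: derive throughput figures from the timing metadata on a completed response.
// Durations in Ollama responses are reported in nanoseconds.
function summarizeTimings(response) {
  const NS_PER_SECOND = 1e9;
  const { eval_count = 0, eval_duration = 0, load_duration = 0, total_duration = 0 } = response;
  return {
    // Generated tokens divided by the time spent generating them
    tokensPerSecond: eval_duration > 0 ? eval_count / (eval_duration / NS_PER_SECOND) : 0,
    // Time spent loading the model; a large value indicates a cold start or model swap
    loadSeconds: load_duration / NS_PER_SECOND,
    totalSeconds: total_duration / NS_PER_SECOND
  };
}

// Hand-written numbers for illustration, not a real measurement
const sample = { eval_count: 120, eval_duration: 4e9, load_duration: 5e8, total_duration: 5e9 };
console.log(summarizeTimings(sample));
// { tokensPerSecond: 30, loadSeconds: 0.5, totalSeconds: 5 }
```

Passing the object returned by ollama.chat or ollama.generate to the same helper gives comparable figures for a JavaScript client and any other client hitting the same server.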

What does differ is the concurrency model. Node.js’s event loop handles many in-flight requests without threads: if multiple users call your Ollama-backed API simultaneously, each pending call is just an awaited promise, so the server stays responsive while Ollama works. This does not make inference itself parallel (Ollama still decides how many requests it runs at once), but for web applications where the main concern is handling concurrent users, Node.js is a good fit for the Ollama integration layer.

Error Handling Patterns

The Ollama library throws errors for connection failures, missing models, and server errors. Wrapping calls in try-catch and providing meaningful error messages is important for production applications:

import ollama from 'ollama';

async function safeChat(model, messages) {
  try {
    const response = await ollama.chat({ model, messages });
    return { ok: true, content: response.message.content };
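  } catch (err) {
    // Node's fetch reports connection failures as "fetch failed", with the code on err.cause
    if (err.cause?.code === 'ECONNREFUSED' || err.message.includes('ECONNREFUSED')) {
      return { ok: false, error: 'Ollama is not running. Start it with: ollama serve' };
    }
    // A missing model comes back as HTTP 404 with a message that names the model
    if (err.status_code === 404 || /not found/i.test(err.message)) {
      return { ok: false, error: `Model '${model}' not found. Pull it with: ollama pull ${model}` };
    }
    return { ok: false, error: err.message };
  }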
}

const result = await safeChat('llama3.2', [{ role: 'user', content: 'Hello' }]);
if (result.ok) {
  console.log(result.content);
} else {
  console.error('Error:', result.error);
}

Structured JSON Output

Getting reliable JSON from a local model requires a clear prompt and a low temperature. Passing format: 'json' in the request also asks Ollama to constrain the output to valid JSON. The generate endpoint is convenient for structured output because you supply the whole instruction as a single prompt string rather than a message list; note that Ollama still applies the model’s prompt template unless you also set raw: true.

import ollama from 'ollama';

async function extractJson(text, schema) {
  const prompt = `Extract the following information from this text and return ONLY valid JSON, no other text:\nSchema: ${JSON.stringify(schema)}\nText: ${text}\nJSON:`;
  const result = await ollama.generate({
    model: 'llama3.2',
    prompt,
    format: 'json', // ask Ollama to constrain the output to valid JSON
    stream: false,
    options: { temperature: 0.0 }
  });
  try {
    return JSON.parse(result.response.trim());
  } catch {
    // Fall back to the first {...} span if the model added extra text
    const match = result.response.match(/\{[\s\S]*\}/);
    if (!match) return null;
    try {
      return JSON.parse(match[0]);
    } catch {
      return null; // still not valid JSON
    }
  }
}

const data = await extractJson(
  'John Smith called at 3pm on Tuesday about the Q4 report delay.',
  { name: 'string', time: 'string', topic: 'string' }
);
console.log(data); // e.g. { name: 'John Smith', time: '3pm Tuesday', topic: 'Q4 report delay' } (exact values vary by model)
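Parsing successfully does not guarantee that the model returned the fields you asked for, so it can help to check the result before using it. The following is a minimal sketch of such a check for the flat { field: 'type' } schema shape used in the example; checkShape is a hypothetical helper, not part of the Ollama library, and it only understands typeof-style type names.

```javascript
// Sketch: check a parsed object against a flat { field: 'type' } schema.
// `checkShape` is a hypothetical helper; nested schemas are out of scope here.
function checkShape(value, schema) {
  if (value === null || typeof value !== 'object' || Array.isArray(value)) {
    return { valid: false, problems: ['result is not a JSON object'] };
  }
  const problems = [];
  for (const [key, expectedType] of Object.entries(schema)) {
    if (!(key in value)) {
      problems.push(`missing field: ${key}`);
    } else if (typeof value[key] !== expectedType) {
      problems.push(`field ${key} should be ${expectedType}, got ${typeof value[key]}`);
    }
  }
  return { valid: problems.length === 0, problems };
}

const exampleSchema = { name: 'string', time: 'string', topic: 'string' };
console.log(checkShape({ name: 'John Smith', time: '3pm Tuesday' }, exampleSchema));
// { valid: false, problems: [ 'missing field: topic' ] }
```

When the check fails, one option is to retry the request once with the list of problems appended to the prompt; another is to return null and let the caller decide.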

Building a Reusable Ollama Client Class

import { Ollama } from 'ollama';

export class LocalLLM {
  constructor({ host = 'http://localhost:11434', model = 'llama3.2', systemPrompt = null } = {}) {
    this.client = new Ollama({ host });
    this.model = model;
    this.systemPrompt = systemPrompt;
    this.history = [];
    if (systemPrompt) this.history.push({ role: 'system', content: systemPrompt });
  }

  async chat(userMessage, { keepHistory = true } = {}) {
    const messages = keepHistory
      ? [...this.history, { role: 'user', content: userMessage }]
      : [{ role: 'user', content: userMessage }];
    const response = await this.client.chat({ model: this.model, messages });
    if (keepHistory) {
      this.history.push({ role: 'user', content: userMessage });
      this.history.push({ role: 'assistant', content: response.message.content });
    }
    return response.message.content;
  }

  async *stream(userMessage) {
    this.history.push({ role: 'user', content: userMessage });
    const gen = await this.client.chat({ model: this.model, messages: this.history, stream: true });
    let full = '';
    for await (const chunk of gen) {
      yield chunk.message.content;
      full += chunk.message.content;
    }
    this.history.push({ role: 'assistant', content: full });
  }

  clearHistory() {
    this.history = this.systemPrompt ? [{ role: 'system', content: this.systemPrompt }] : [];
  }
}

// Usage
const llm = new LocalLLM({ model: 'llama3.2', systemPrompt: 'You are a concise assistant.' });
console.log(await llm.chat('What is Node.js?'));
console.log(await llm.chat('What are its main use cases?')); // has context from previous

Choosing the Right Model for JS Applications

For Node.js backend applications where the server calls Ollama on behalf of users, model choice follows the same logic as Python applications: use the best model your server hardware supports. For browser-side or Electron applications where the LLM runs on the end user’s machine, model size matters more: assume at most 8GB of available RAM on the target machine and choose accordingly. Llama 3.2 3B is the safe default for end-user hardware; Llama 3.1 8B works well for users with 16GB+ RAM (Llama 3.2 itself ships text models only in 1B and 3B sizes). For applications where you control the hardware (server deployments), the full model range is available.
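The sizing rule above can be turned into a small selection helper. This is only a sketch: pickModel is a hypothetical function, and the thresholds and model tags are illustrative assumptions rather than official guidance.

```javascript
// Sketch: choose a model tag from the machine's total RAM.
// Thresholds and tags are illustrative assumptions, not official guidance.
function pickModel(totalRamBytes) {
  const gib = totalRamBytes / 1024 ** 3;
  if (gib >= 16) return 'llama3.1:8b'; // larger model for well-equipped machines
  if (gib >= 8) return 'llama3.2:3b';  // safe default for typical end-user hardware
  return 'llama3.2:1b';                // smallest option for constrained machines
}

console.log(pickModel(8 * 1024 ** 3));  // llama3.2:3b
console.log(pickModel(32 * 1024 ** 3)); // llama3.1:8b
```

In Node.js or an Electron main process, os.totalmem() from node:os returns the total memory in bytes and can be passed straight in.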

Using Ollama JS with Popular Frameworks

The Ollama JavaScript library works in any Node.js-compatible environment. Here are the integration patterns for the most common setups developers actually use.

Next.js API Route (App Router): Create a route handler that streams responses back to the client, enabling real-time token display in your React components without a separate backend server:

// app/api/chat/route.js
import ollama from 'ollama';

export async function POST(request) {
  const { messages } = await request.json();
  const encoder = new TextEncoder();

  const stream = new ReadableStream({
    async start(controller) {
      try {
        const ollamaStream = await ollama.chat({
          model: 'llama3.2', messages, stream: true
        });
        for await (const chunk of ollamaStream) {
          controller.enqueue(encoder.encode(chunk.message.content));
        }
        controller.close();
      } catch (err) {
        // Surface Ollama failures to the client instead of leaving the stream open
        controller.error(err);
      }
    }
  });

  return new Response(stream, {
    headers: { 'Content-Type': 'text/plain; charset=utf-8' }
  });
}

Fastify route with streaming:

import Fastify from 'fastify';
import ollama from 'ollama';

const app = Fastify();

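app.post('/chat', async (request, reply) => {
  const { messages } = request.body;
  // Take over the raw response so Fastify does not try to send its own reply
  reply.hijack();
  reply.raw.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache'
  });
  try {
    const stream = await ollama.chat({ model: 'llama3.2', messages, stream: true });
    for await (const chunk of stream) {
      // JSON-encode the content and end each SSE event with a blank line
      reply.raw.write(`data: ${JSON.stringify({ content: chunk.message.content })}\n\n`);
    }
  } catch (err) {
    reply.raw.write(`data: ${JSON.stringify({ error: err.message })}\n\n`);
  } finally {
    reply.raw.end();
  }
});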

await app.listen({ port: 3000 });

Working with the Ollama JS Library in Deno and Bun

The Ollama library is compatible with both Deno and Bun without modification. In Deno, import it from npm:

// Deno
import ollama from 'npm:ollama';
const response = await ollama.chat({
  model: 'llama3.2',
  messages: [{ role: 'user', content: 'Hello from Deno!' }]
});
console.log(response.message.content);

In Bun, install and import as normal — Bun’s npm compatibility handles it transparently. Bun’s faster startup time makes it particularly useful for Ollama-backed CLI scripts where you want quick startup without the overhead of a full Node.js process initialisation.

Rate Limiting and Queue Management for Web Applications

When building a web application backed by Ollama, account for how Ollama schedules requests. Depending on the Ollama version and the OLLAMA_NUM_PARALLEL setting, a model may process only one request at a time or a small number in parallel. Once that limit is reached, additional requests queue up: a queued user does not get a first token until an earlier response completes. For small teams or personal tools this is acceptable. For public-facing applications, put a queue in front of Ollama with a maximum wait time and a clear message to users when the server is busy:

import PQueue from 'p-queue';
import ollama from 'ollama';

// Limit to 1 concurrent Ollama call (matches Ollama's default sequential processing)
const queue = new PQueue({ concurrency: 1 });

export async function queuedChat(messages, timeoutMs = 30000) {
  return queue.add(
    () => ollama.chat({ model: 'llama3.2', messages }),
    { timeout: timeoutMs }
  );
}

The p-queue library gives you concurrency control, timeouts, and visibility into the current queue size in a few lines. For higher-traffic applications, increase Ollama’s OLLAMA_NUM_PARALLEL setting and match the queue concurrency accordingly, but remember that each additional parallel slot increases memory use, because Ollama allocates context for every slot.
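If you also want to refuse work when the backlog is already long, and to give up on requests that have waited too long for a slot, a small hand-rolled queue is one way to do it. The sketch below has no dependencies; createBoundedQueue, BusyError, and the maxPending and maxWaitMs options are hypothetical names, and it runs one task at a time.

```javascript
// Sketch: a single-worker queue that rejects instead of letting requests pile up.
// `BusyError` and `createBoundedQueue` are hypothetical names for illustration.
class BusyError extends Error {}

function createBoundedQueue({ maxPending = 10, maxWaitMs = 30000 } = {}) {
  const waiting = []; // jobs that have not started yet
  let running = false;

  function runNext() {
    if (running || waiting.length === 0) return;
    const job = waiting.shift();
    clearTimeout(job.timer); // the job is starting, so it can no longer time out waiting
    running = true;
    Promise.resolve()
      .then(job.task)
      .then(job.resolve, job.reject)
      .finally(() => {
        running = false;
        runNext();
      });
  }

  return function enqueue(task) {
    return new Promise((resolve, reject) => {
      if (waiting.length >= maxPending) {
        reject(new BusyError('Server is busy, please try again shortly.'));
        return;
      }
      const job = { task, resolve, reject, timer: null };
      // Give up if the job is still waiting (not yet started) after maxWaitMs
      job.timer = setTimeout(() => {
        const index = waiting.indexOf(job);
        if (index !== -1) {
          waiting.splice(index, 1);
          reject(new BusyError('Timed out waiting for a free slot.'));
        }
      }, maxWaitMs);
      waiting.push(job);
      runNext();
    });
  };
}
```

A route handler would call enqueue(() => ollama.chat({ model: 'llama3.2', messages })) and translate a BusyError into an HTTP 503 with the error message as the user-facing text.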

Getting Started

The shortest path to a working Ollama JavaScript integration is three steps: run npm install ollama, ensure Ollama is running with a model pulled (ollama pull llama3.2), and copy the basic chat example from this article. The library handles connection management, JSON serialisation, and streaming transparently — you interact with it through clean async JavaScript rather than raw HTTP. From that starting point, add streaming for real-time display, error handling for production robustness, and the queue management pattern if you need to handle concurrent users. The full API surface — chat, generate, embeddings, model management — is covered by the library with consistent, idiomatic JavaScript patterns throughout.

Testing Ollama-Backed JavaScript Code

Testing code that calls Ollama requires either a running Ollama instance (integration tests) or mocking the library (unit tests). For unit tests, mock the ollama module to return predictable responses without needing a live Ollama server. This makes tests fast and runnable in CI environments where Ollama is not installed:

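// Using Jest with native ES modules (node --experimental-vm-modules)
import { jest } from '@jest/globals';

// LocalLLM calls `new Ollama(...)`, so the mock must export an Ollama class
const chat = jest.fn().mockResolvedValue({
  message: { role: 'assistant', content: 'Mocked response' }
});
const embeddings = jest.fn().mockResolvedValue({
  embedding: new Array(768).fill(0.1)
});

// In ESM, jest.mock is not hoisted; register the mock, then import dynamically
jest.unstable_mockModule('ollama', () => ({
  Ollama: jest.fn(() => ({ chat, embeddings }))
}));

const { LocalLLM } = await import('./local-llm.js');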

test('chat returns response', async () => {
  const llm = new LocalLLM({ model: 'llama3.2' });
  const response = await llm.chat('Hello');
  expect(response).toBe('Mocked response');
});

For integration tests that test against a real Ollama instance, use a lightweight model (moondream2 or a 1B-class model) and mark the tests as integration tests that are skipped in normal CI runs. This pattern — fast unit tests with mocks plus slower integration tests run separately — is the most practical approach for Ollama-backed applications.

Practical Considerations for Production Node.js Apps

A few things to keep in mind when deploying a Node.js application that calls Ollama in production. First, set appropriate timeouts on your HTTP requests — very long generations can take minutes, and a client that disconnects while waiting should not leave a dangling Ollama request consuming resources indefinitely. The Ollama library does not automatically cancel requests when the calling code is interrupted, so implement AbortController-based cancellation for long requests in server environments. Second, log the load_duration from Ollama responses to monitor cold starts and model swaps in production. Third, if your application serves multiple users, implement a health check endpoint that verifies Ollama is running and the required model is available — a 200 OK from your application health check should guarantee that Ollama is responding correctly, not just that your Node.js process started successfully. These operational habits separate a demo-grade integration from a production-ready one.
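The health check described above can be sketched with a single request to Ollama’s model-list endpoint (GET /api/tags), bounded by a timeout. The function name checkOllama, its options, and the 2-second default are assumptions for illustration; the fetchImpl parameter exists only so the function can be exercised without a live server.

```javascript
// Sketch: verify that Ollama answers and that the required model is installed.
// `checkOllama` and its options are hypothetical; GET /api/tags lists installed models.
async function checkOllama({
  host = 'http://localhost:11434',
  model = 'llama3.2',
  timeoutMs = 2000,
  fetchImpl = fetch // injectable for offline tests
} = {}) {
  try {
    const res = await fetchImpl(`${host}/api/tags`, {
      signal: AbortSignal.timeout(timeoutMs) // abort if Ollama does not answer in time
    });
    if (!res.ok) {
      return { ok: false, reason: `Ollama responded with HTTP ${res.status}` };
    }
    const { models = [] } = await res.json();
    // Installed names carry a tag, e.g. "llama3.2:latest"
    const found = models.some(m => m.name === model || m.name.startsWith(`${model}:`));
    return found ? { ok: true } : { ok: false, reason: `Model ${model} is not installed` };
  } catch (err) {
    return { ok: false, reason: `Ollama is unreachable: ${err.message}` };
  }
}
```

An application health route can then return 200 only when checkOllama() resolves with ok: true, and 503 with the reason otherwise.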

The JavaScript Ecosystem Advantage

One underappreciated benefit of the Ollama JavaScript library is access to the broader npm ecosystem alongside your LLM integration. Need to process PDFs before passing them to the model? Use pdf-parse. Need to extract web content? Use cheerio or playwright. Need a vector store for RAG? Use vectra (a pure TypeScript in-memory vector store) or the JavaScript clients for Chroma, Qdrant, or Pinecone. All of these integrate naturally with the Ollama JS library in a single Node.js application, without the language boundary that exists in Python-based stacks where the JS frontend must communicate with a Python backend via API. For full-stack JavaScript developers, this means building a complete local AI application — document ingestion, embedding, retrieval, generation, and streaming frontend display — entirely in JavaScript, with no Python runtime required and no language boundary to cross. The Ollama JS library is the inference piece that completes this stack, and its close parity with the Python library means that patterns and examples from the much larger Python LLM community translate straightforwardly to JavaScript with minimal adaptation.
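For small document sets, the retrieval step of a RAG pipeline does not need a dedicated vector store at all. The sketch below does a brute-force similarity search over precomputed embeddings; topK is a hypothetical helper, and the three-dimensional vectors are toy stand-ins for the embeddings an embedding model would return.

```javascript
// Sketch: brute-force nearest-neighbour search over precomputed embeddings.
// `topK` is a hypothetical helper; a real vector store adds persistence and indexing.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function topK(queryEmbedding, documents, k = 3) {
  return documents
    .map(doc => ({ text: doc.text, score: cosineSimilarity(queryEmbedding, doc.embedding) }))
    .sort((x, y) => y.score - x.score) // highest similarity first
    .slice(0, k);
}

// Toy 3-dimensional vectors stand in for real embeddings
const docs = [
  { text: 'Invoices are due within 30 days.', embedding: [1, 0, 0] },
  { text: 'The office closes at 6pm.', embedding: [0, 1, 0] },
  { text: 'Late invoices incur a fee.', embedding: [0.9, 0.1, 0] }
];
// Prints the two invoice-related texts, most similar first
console.log(topK([1, 0, 0], docs, 2).map(d => d.text));
```

In a real application, the document and query embeddings would come from an embedding model such as nomic-embed-text, and the top results would be pasted into the prompt as context.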

The Ollama JavaScript library’s active development and the growing local AI ecosystem mean that JavaScript is increasingly a first-class choice for local LLM applications — not a second-best option forced on web developers who cannot use Python. The patterns in this article — streaming, error handling, structured output, queue management, and testing — cover the practical requirements of real production deployments, and the reusable client class provides a solid starting point that can be adapted to any specific application’s needs without reimplementing the boilerplate from scratch. Install the library, pull a model, and you have a working local AI integration in the same language you use for the rest of your stack — no context switching, no inter-process communication, and no cloud dependency required to get started. The combination of Ollama’s zero-configuration local inference and the npm ecosystem’s breadth of supporting libraries makes JavaScript an increasingly compelling choice for developers building the next generation of local AI applications. With model quality improving each quarter and hardware costs falling, the case for self-hosted AI in JavaScript applications grows stronger with every release cycle.

For teams evaluating whether to adopt local LLMs, the JavaScript library lowers the barrier further: any Node.js developer can integrate Ollama in an afternoon without learning new tooling, which makes it straightforward to run internal proof-of-concepts before committing to a production deployment.
