Building a React or Next.js frontend that streams responses from a local Ollama model gives you a complete local AI application with a polished UI — no cloud API, no monthly subscription, and all data staying on your machine. This guide covers the key patterns: streaming token display in React, a Next.js API route that proxies Ollama, and a complete chat interface component.
The CORS Problem and the Solution
Browsers block a web page from reading responses fetched from a different origin, and a different port counts as a different origin (localhost:3000 → localhost:11434), so direct calls to Ollama are subject to CORS. Ollama only sends the necessary CORS headers for origins in its OLLAMA_ORIGINS allow-list. The clean solution is to set that environment variable before starting Ollama:
# Allow your dev server to call Ollama directly
OLLAMA_ORIGINS=http://localhost:3000 ollama serve
# Or allow all origins (dev only — not for production)
OLLAMA_ORIGINS=* ollama serve
Alternatively, proxy requests through your Next.js API route (shown below), which runs server-side where CORS does not apply.
Option 1: Direct Browser Call (with CORS enabled)
// React component — calls Ollama directly from the browser
'use client'; // needed when this component lives in a Next.js App Router page
import { useState } from 'react';
export default function Chat() {
const [messages, setMessages] = useState([]);
const [input, setInput] = useState('');
const [loading, setLoading] = useState(false);
async function sendMessage() {
if (!input.trim() || loading) return;
const userMsg = { role: 'user', content: input };
const newMessages = [...messages, userMsg];
setMessages(newMessages);
setInput('');
setLoading(true);
// Add empty assistant message to stream into
setMessages(prev => [...prev, { role: 'assistant', content: '' }]);
const response = await fetch('http://localhost:11434/api/chat', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
model: 'llama3.2',
messages: newMessages,
stream: true
})
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
while (true) {
const { done, value } = await reader.read();
if (done) break;
// A chunk can end mid-line, so buffer partial JSON lines between reads
buffer += decoder.decode(value, { stream: true });
const lines = buffer.split('\n');
buffer = lines.pop();
for (const line of lines.filter(Boolean)) {
const chunk = JSON.parse(line);
if (chunk.message?.content) {
setMessages(prev => {
const updated = [...prev];
updated[updated.length - 1] = {
...updated[updated.length - 1],
content: updated[updated.length - 1].content + chunk.message.content
};
return updated;
});
}
}
}
setLoading(false);
}
return (
<div style={{ maxWidth: 640, margin: '0 auto', padding: 16 }}>
{messages.map((m, i) => (
<div key={i} style={{ margin: '8px 0' }}>
<strong>{m.role === 'user' ? 'You' : 'Assistant'}:</strong> {m.content}
</div>
))}
<div style={{ display: 'flex', gap: 8 }}>
<input
value={input}
onChange={e => setInput(e.target.value)}
onKeyDown={e => e.key === 'Enter' && sendMessage()}
placeholder="Type a message..."
style={{ flex: 1, padding: '8px 12px', borderRadius: 8, border: '1px solid #ddd' }}
disabled={loading}
/>
<button onClick={sendMessage} disabled={loading}>Send</button>
</div>
</div>
);
}
Option 2: Next.js API Route Proxy
// app/api/chat/route.js (Next.js App Router)
export async function POST(request) {
const { messages, model = 'llama3.2' } = await request.json();
const ollamaResponse = await fetch('http://localhost:11434/api/chat', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ model, messages, stream: true })
});
// Stream the response directly to the client
return new Response(ollamaResponse.body, {
headers: {
'Content-Type': 'text/plain; charset=utf-8',
'Transfer-Encoding': 'chunked'
}
});
}
// Frontend — call /api/chat instead of Ollama directly
// Replace the fetch URL in the component above:
const response = await fetch('/api/chat', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ messages: newMessages, model: 'llama3.2' })
});
// The rest of the streaming code stays the same
Using the AI SDK (Vercel)
Vercel’s AI SDK provides React hooks for streaming AI responses with much less boilerplate:
npm install ai @ai-sdk/openai
// app/api/chat/route.js
import { createOpenAI } from '@ai-sdk/openai';
import { streamText } from 'ai';
const ollama = createOpenAI({
baseURL: 'http://localhost:11434/v1',
apiKey: 'ollama'
});
export async function POST(req) {
const { messages } = await req.json();
const result = streamText({ model: ollama('llama3.2'), messages });
return result.toDataStreamResponse();
}
// React component using the useChat hook
'use client'; // App Router client component
import { useChat } from 'ai/react';
export default function Chat() {
const { messages, input, handleInputChange, handleSubmit, isLoading } = useChat();
return (
<div>
{messages.map(m => (
<div key={m.id}>
<strong>{m.role}:</strong> {m.content}
</div>
))}
<form onSubmit={handleSubmit}>
<input value={input} onChange={handleInputChange} placeholder="Type a message..." disabled={isLoading} />
</form>
</div>
);
}
The AI SDK’s useChat hook handles streaming, message history, loading states, and error handling — replacing ~50 lines of manual streaming code with a few lines of hook usage. This is the recommended approach for production Next.js applications.
Why Build a Custom UI Instead of Using Open WebUI?
Open WebUI is the quickest path to a polished local AI interface — install it, point it at Ollama, and you have a production-quality chat UI in minutes. Building a custom React or Next.js frontend makes sense when you need something different: a domain-specific UI embedded in an existing application, a workflow-specific interface that integrates AI with other app features, a customer-facing tool where you need full control over the UX, or a developer tool where the AI is one feature among many. The custom approach is more work but gives you complete control over what the user sees and how AI integrates with the rest of your product.
The Streaming Architecture
Streaming is not optional for AI UIs; it is the difference between an interface that feels responsive and one that feels broken. A 200-word response is roughly 300 tokens, which at 40 tokens/sec takes about 7–8 seconds to generate. Without streaming, the user stares at a spinner for those seconds and then the full response appears at once. With streaming, the user starts reading after the first token, well under a second, and the response feels almost instantaneous even though total generation time is the same.
The implementation uses the browser’s Streams API (response.body.getReader()) supported in all modern browsers. Each chunk from Ollama’s streaming response is a JSON object on its own line. The functional state updater pattern — setMessages(prev => ...) — ensures concurrent updates from rapid token arrivals are applied correctly without race conditions in React state.
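For reference, the newline-delimited chunks Ollama emits from /api/chat look roughly like this (the field values below are illustrative; the final chunk also carries timing and token-count statistics):
// Two intermediate chunks followed by the final one (values illustrative)
{"model":"llama3.2","created_at":"2025-01-01T12:00:00Z","message":{"role":"assistant","content":"Hel"},"done":false}
{"model":"llama3.2","created_at":"2025-01-01T12:00:00Z","message":{"role":"assistant","content":"lo"},"done":false}
{"model":"llama3.2","created_at":"2025-01-01T12:00:01Z","message":{"role":"assistant","content":""},"done":true,"eval_count":42,"eval_duration":1050000000}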
Model Selector Component
import { useState, useEffect } from 'react';
function ModelSelector({ selected, onChange }) {
const [models, setModels] = useState([]);
useEffect(() => {
fetch('http://localhost:11434/api/tags')
.then(r => r.json()).then(d => setModels(d.models || []))
.catch(() => {});
}, []);
return (
<select value={selected} onChange={e => onChange(e.target.value)}>
{models.map(m => (
<option key={m.name} value={m.name}>{m.name}</option>
))}
</select>
);
}
When the user switches models, clear conversation history — mixing messages from different models produces unpredictable results since different models use different chat templates.
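A minimal sketch of that wiring, assuming the parent component owns both the selected model and the message list:
// Sketch: switching models starts a fresh conversation
const [model, setModel] = useState('llama3.2');
function handleModelChange(nextModel) {
setModel(nextModel);
setMessages([]); // different models use different chat templates, so reset history
}
// <ModelSelector selected={model} onChange={handleModelChange} />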
Abort / Stop Generation
const [ctrl, setCtrl] = useState(null);
async function send() {
const ac = new AbortController();
setCtrl(ac);
const res = await fetch('/api/chat', {
method: 'POST', body: JSON.stringify({ messages }),
signal: ac.signal
});
// ... streaming code (wrap the read loop in try/catch: aborting rejects the pending read with an AbortError)
setCtrl(null);
}
const stop = () => { ctrl?.abort(); setCtrl(null); setLoading(false); };
// In JSX
{loading && <button onClick={stop}>Stop</button>}
When to Use the AI SDK vs Manual Streaming
Vercel’s AI SDK useChat hook handles all streaming complexity, provides sensible defaults, and is actively maintained — it is the right choice for most Next.js applications. The manual streaming approach is useful when you need more control: custom message formats, Ollama-specific features the SDK does not expose, or when building a React app without Next.js where the SDK’s server components are not available. Start with the AI SDK and only drop to manual streaming if you hit a specific limitation.
Deployment
For personal use, run next dev alongside Ollama — hot reloading works perfectly for daily use. For a shared team tool, build and serve the Next.js app as a production build alongside the Ollama systemd service on the same server. The Next.js app proxies AI requests to Ollama, keeping all inference local while giving team members a clean URL to access from their browsers. This combination — a few hundred lines of React, a Next.js API route, and Ollama for inference — gives you a complete local AI application you own end-to-end, with no recurring cloud costs and no data leaving your infrastructure.
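As a sketch of that team setup, assuming the app lives at /opt/chat-app on the same Linux server as Ollama (the path is a placeholder):
# Build the production bundle and serve it next to the Ollama service
cd /opt/chat-app
npm run build
npm run start   # next start, port 3000 by default
Wrapping that start command in its own systemd unit, mirroring the Ollama one, keeps the UI running across reboots.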
Error Handling for Production
The basic chat component above handles the happy path — what happens when Ollama responds successfully. Production UIs need to handle failure cases: Ollama not running (connection refused), model not found (wrong model name), context window exceeded (prompt too long), and slow hardware causing timeouts. Each of these produces a different user experience if left unhandled:
A connection refused error (Ollama not running) should show a clear message like “Cannot connect to the AI server — make sure Ollama is running” rather than a generic network error. A model not found error should suggest running ollama pull [model]. A timeout (hardware taking longer than expected) should reassure the user that generation is ongoing rather than appearing broken. Build error handling that maps these failure modes to helpful, specific messages rather than generic “Something went wrong” text.
async function sendMessage() {
setLoading(true); setError(null);
try {
const res = await fetch('/api/chat', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ messages }),
signal: AbortSignal.timeout(120000)
});
if (!res.ok) throw new Error(await res.text());
// ... streaming
} catch (err) {
const msg = err.name === 'TimeoutError'
? 'Request timed out — model may be loading, try again'
: err.message.includes('fetch')
? 'Cannot connect to Ollama. Is it running?'
: err.message;
setError(msg);
setMessages(prev => prev.slice(0, -1)); // remove empty assistant message
} finally { setLoading(false); }
}
Persisting Conversations
The basic component keeps conversation history only in React state — it resets on page refresh. For a tool you use regularly, persisting conversations to localStorage or to a backend database makes the chat history available across sessions. The simplest approach is localStorage:
// Load persisted messages on mount
useEffect(() => {
const saved = localStorage.getItem('chat-history');
if (saved) setMessages(JSON.parse(saved));
}, []);
// Persist on every message change
useEffect(() => {
if (messages.length > 0)
localStorage.setItem('chat-history', JSON.stringify(messages));
}, [messages]);
// Clear button
const clearHistory = () => {
setMessages([]);
localStorage.removeItem('chat-history');
};
For multi-conversation support (multiple named chats), store conversations in an indexed structure with timestamps. The messages are plain JSON objects that serialise cleanly, making localStorage a natural fit for moderate amounts of conversation history. For very long histories (hundreds of exchanges across many sessions), consider a simple SQLite backend — Next.js’s API routes make it straightforward to add better-sqlite3 for persistent storage without a full database server.
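One possible shape for that indexed structure, keyed by conversation id (all names here are illustrative):
// Sketch: multiple named conversations in one localStorage entry
const conversations = {
'conv-1714000000000': {
title: 'Refactoring ideas',
createdAt: 1714000000000,
updatedAt: 1714003600000,
messages: [{ role: 'user', content: 'Hi' }, { role: 'assistant', content: 'Hello!' }]
}
};
localStorage.setItem('conversations', JSON.stringify(conversations));
// List chats newest-first for a sidebar
const all = JSON.parse(localStorage.getItem('conversations') || '{}');
const sorted = Object.entries(all).sort((a, b) => b[1].updatedAt - a[1].updatedAt);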
The Full Stack: What You Are Actually Building
When you put together the pieces from this article — the React chat component, the Next.js API route, CORS handling, error states, model selection, abort, and persistence — you have built a complete local AI application. It runs entirely on your own hardware, costs nothing per query, keeps all conversation data on your machine, and gives you a polished interface that is indistinguishable from a commercial AI chat product to anyone who uses it.
The total code — excluding node_modules — is under 300 lines. The infrastructure is a single Next.js process and Ollama running on the same machine. The operational complexity is minimal: two processes, zero cloud dependencies, zero API keys to manage, and zero per-token costs that scale with usage. For developers building tools for themselves or their team, this is arguably the best ratio of capability to complexity available in the current AI tooling landscape. The barrier to entry has never been lower, and the results have never been better.
Performance Considerations
React re-renders on every state update, and streaming responses trigger many rapid updates: one per token, which at 40–80 tokens/sec means 40–80 re-renders per second during generation. For most applications this is fine; React's reconciliation is fast and modern browsers handle it smoothly. If you notice jank during fast streaming (more likely with complex UIs that render many elements), batch tokens before updating state, for example by accumulating the response in a ref and flushing to state every 100 ms. The other easy win is to extract each message into its own component wrapped in React.memo, so only the message currently receiving tokens re-renders on each update instead of every message in the list.
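A sketch of the ref-and-flush approach, assuming the same messages state as the chat component above (the 100 ms interval is arbitrary):
// Inside the chat component (assumes useRef and useEffect are imported from 'react')
const pendingRef = useRef('');
useEffect(() => {
// Flush accumulated tokens to state every 100 ms instead of once per token
const id = setInterval(() => {
if (!pendingRef.current) return;
const text = pendingRef.current;
pendingRef.current = '';
setMessages(prev => {
const updated = [...prev];
const last = updated[updated.length - 1];
updated[updated.length - 1] = { ...last, content: last.content + text };
return updated;
});
}, 100);
return () => clearInterval(id);
}, []);
// In the streaming loop, replace the per-token setMessages call with:
// pendingRef.current += chunk.message.content;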
Markdown rendering is the other performance consideration. If your assistant messages contain Markdown (code blocks, bullet points, headers), rendering them with a library like react-markdown during streaming is computationally expensive because the Markdown AST is reparsed on every token. A practical pattern is to display raw text during streaming and switch to rendered Markdown only when generation is complete — the done flag in the final Ollama streaming chunk tells you when to switch. This gives you the best of both worlds: fast streaming display and formatted final output.
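A sketch of that switch using react-markdown, assuming an isStreaming flag that you flip to false when the done: true chunk arrives:
// Raw text while tokens arrive, rendered Markdown once generation completes (sketch)
import ReactMarkdown from 'react-markdown';
function AssistantMessage({ content, isStreaming }) {
if (isStreaming) return <div style={{ whiteSpace: 'pre-wrap' }}>{content}</div>;
return <ReactMarkdown>{content}</ReactMarkdown>;
}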
Getting Started in 20 Minutes
The minimum path to a working Ollama React app: create a Next.js project (npx create-next-app@latest), create the API route at app/api/chat/route.js with the proxy code from this article, paste the streaming chat component into a page, and run next dev. With Ollama running and a model pulled, you will have a streaming chat interface backed by a local model in under 20 minutes. Add the model selector when you want to switch models from the UI. Add localStorage persistence when you want history across sessions. Add error handling when you are ready to share the tool with others who might encounter unexpected states. The iterative approach — get something working first, then improve it — is the fastest path to a useful tool, and the building blocks in this article are designed to be added one at a time rather than all at once.
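The same path as commands (the project and model names are just examples):
# Scaffold the project and pull a model
npx create-next-app@latest my-local-chat
cd my-local-chat
ollama pull llama3.2
# Add app/api/chat/route.js (the proxy code above) and the chat component, then:
npm run dev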
Connecting to a Remote Ollama Server
If Ollama runs on a different machine than your Next.js app — a team server, a home lab machine, or a beefy workstation — update the fetch URL in the API route from http://localhost:11434 to the server’s IP address or hostname. Store the Ollama endpoint in an environment variable rather than hardcoding it:
# .env.local
OLLAMA_BASE_URL=http://192.168.1.100:11434
OLLAMA_MODEL=llama3.2
// app/api/chat/route.js
const OLLAMA = process.env.OLLAMA_BASE_URL || 'http://localhost:11434';
const MODEL = process.env.OLLAMA_MODEL || 'llama3.2';
export async function POST(request) {
const { messages } = await request.json();
const res = await fetch(`${OLLAMA}/api/chat`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ model: MODEL, messages, stream: true })
});
return new Response(res.body, {
headers: { 'Content-Type': 'text/plain; charset=utf-8' }
});
}
This environment-variable pattern also makes it trivial to switch between a local development Ollama and a team server in production — just change the environment variable and redeploy. The frontend code is entirely unchanged regardless of where Ollama runs, since all requests go through the Next.js API route which abstracts the Ollama location.
What This Unlocks
The ability to build React and Next.js applications on top of local Ollama inference means that any web developer can now build AI-powered tools without cloud API credentials, without per-token billing, and without data leaving their infrastructure. The patterns in this article — streaming, error handling, model selection, persistence, remote endpoint configuration — are the building blocks of every serious local AI web application. Master them once, and you can ship a new AI-powered tool in an afternoon. The local inference ecosystem has matured to the point where the main constraint is the developer’s imagination rather than the tooling available, and that shift represents a genuine change in what individual developers and small teams can build and deploy on their own terms.
The React and Next.js ecosystem’s support for local LLM integration has matured rapidly — the Vercel AI SDK, the official Ollama npm package, and the broader tooling around streaming responses have all stabilised in 2025–2026 in ways that make building reliable AI interfaces straightforward for any JavaScript developer. What required significant custom engineering a year ago is now a handful of npm installs and a few hundred lines of code, and the quality of the resulting experience is genuinely competitive with commercial AI products. The investment in learning these patterns pays dividends across every AI-powered tool you build going forward — and the best time to start building is with the simplest working version today.