Dart and Flutter have matured into a serious cross-platform development stack. If you are building a Flutter app and want to add AI capabilities without depending on a cloud API — no monthly bill, no data leaving the device or the local network — Ollama gives you a simple HTTP interface that any Dart application can call. This guide walks through everything from basic HTTP calls to streaming token output in a Flutter chat UI, with production-ready patterns throughout.
The approach works for Flutter apps on desktop (macOS, Windows, Linux) where Ollama runs on the same machine, and for mobile apps (iOS, Android) where Ollama runs on a nearby machine on the local network. The Dart code is identical in both cases — only the base URL changes.
Prerequisites
You will need Ollama installed and running with at least one model pulled — ollama pull llama3.2 gets you a solid general-purpose model. On the Dart side, you need the Flutter SDK or the standalone Dart SDK if you are building a pure Dart application. The only external package we will use is http, which is the standard HTTP client for Dart and Flutter projects.
Adding the http Package
Add http to your pubspec.yaml and run flutter pub get:
dependencies:
  flutter:
    sdk: flutter
  http: ^1.2.0
If you are building a pure Dart CLI app rather than a Flutter app, the same package works: just omit the flutter SDK entry and run dart pub get instead.
Basic Chat Completion
Here is the minimal Dart code to send a message to Ollama and get a full response:
import 'dart:convert';
import 'package:http/http.dart' as http;

const _baseUrl = 'http://localhost:11434';
const _model = 'llama3.2';

/// Sends a single user prompt and returns the complete reply.
Future<String> chat(String prompt) async {
  final response = await http.post(
    Uri.parse('$_baseUrl/api/chat'),
    headers: {'Content-Type': 'application/json'},
    body: jsonEncode({
      'model': _model,
      'messages': [
        {'role': 'user', 'content': prompt},
      ],
      'stream': false, // Ask for one JSON object instead of a stream.
    }),
  );
  if (response.statusCode != 200) {
    throw Exception('Ollama error: ${response.statusCode}');
  }
  final data = jsonDecode(response.body) as Map<String, dynamic>;
  return data['message']['content'] as String;
}
The http.post call waits for the full response before returning. The await does not block Flutter's UI thread, but for longer generations the user sees nothing until the whole reply has arrived, so the app can appear stuck. That is why streaming matters in a Flutter context: it lets you show tokens as they arrive and gives the user feedback throughout the generation.
Streaming Responses
Ollama streams responses as newline-delimited JSON — each line is a JSON object containing one token and a done flag. Dart’s http package supports streaming via http.Request and send(), giving you access to the response body as a stream of bytes:
Stream<String> chatStream(String prompt) async* {
  final client = http.Client();
  try {
    final request = http.Request('POST', Uri.parse('$_baseUrl/api/chat'));
    request.headers['Content-Type'] = 'application/json';
    request.body = jsonEncode({
      'model': _model,
      'messages': [
        {'role': 'user', 'content': prompt},
      ],
      'stream': true,
    });
    final response = await client.send(request);
    if (response.statusCode != 200) {
      throw Exception('Ollama error: ${response.statusCode}');
    }
    // Decode bytes to text first, then split the text into lines.
    final lines = response.stream
        .transform(utf8.decoder)
        .transform(const LineSplitter());
    await for (final line in lines) {
      if (line.trim().isEmpty) continue;
      final chunk = jsonDecode(line) as Map<String, dynamic>;
      final token = chunk['message']?['content'] as String? ?? '';
      if (token.isNotEmpty) yield token;
      if (chunk['done'] == true) break;
    }
  } finally {
    client.close();
  }
}
The async* and yield keywords make this a generator function producing a Stream<String>. Each yielded value is the text fragment from one chunk, usually a single token. utf8.decoder converts the raw byte stream to strings, and LineSplitter then splits that text into individual lines, so a JSON object that arrives split across two network chunks is still parsed as one line. Closing the client in a finally block ensures the connection is released even if the stream is cancelled mid-generation.
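To see why the decoder has to come before the line splitter, the parsing half of chatStream can be exercised on an in-memory byte stream without touching the network. This is only a sketch: tokensFromBytes and sampleChunks are hypothetical helper names, and the sample data merely mimics the shape of Ollama's streamed lines.

```dart
import 'dart:convert';

/// Parses newline-delimited JSON chunks and yields each non-empty token.
/// Hypothetical helper: the same pipeline as chatStream, minus the HTTP call.
Stream<String> tokensFromBytes(Stream<List<int>> bytes) async* {
  final lines = bytes
      .transform(utf8.decoder) // bytes -> text
      .transform(const LineSplitter()); // text -> complete lines
  await for (final line in lines) {
    if (line.trim().isEmpty) continue;
    final chunk = jsonDecode(line) as Map<String, dynamic>;
    final token = chunk['message']?['content'] as String? ?? '';
    if (token.isNotEmpty) yield token;
    if (chunk['done'] == true) break;
  }
}

/// Builds a fake response whose first JSON line is split across two chunks,
/// the way a real network read might deliver it.
Stream<List<int>> sampleChunks() {
  const first = '{"message":{"content":"Hel"},"done":false}\n';
  const second = '{"message":{"content":"lo"},"done":true}\n';
  return Stream<List<int>>.fromIterable([
    utf8.encode(first.substring(0, 10)),
    utf8.encode(first.substring(10) + second),
  ]);
}
```

Collecting tokensFromBytes(sampleChunks()) yields the two fragments "Hel" and "lo", even though the first JSON object arrived in two pieces. Splitting on raw network chunks instead of lines would have tried to decode half an object.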
Building a Flutter Chat UI
With the streaming function in place, wiring it into a Flutter widget is straightforward. The key is updating state incrementally as tokens arrive. Here is a complete chat screen that appends tokens to the last message as they stream in:
import 'package:flutter/material.dart';

class ChatScreen extends StatefulWidget {
  const ChatScreen({super.key});

  @override
  State<ChatScreen> createState() => _ChatScreenState();
}

class _ChatScreenState extends State<ChatScreen> {
  final _messages = <Map<String, String>>[];
  final _controller = TextEditingController();
  bool _loading = false;

  @override
  void dispose() {
    _controller.dispose();
    super.dispose();
  }

  Future<void> _send() async {
    final text = _controller.text.trim();
    if (text.isEmpty || _loading) return;
    _controller.clear();
    setState(() {
      _messages.add({'role': 'user', 'content': text});
      // Empty placeholder that the streamed tokens are appended to.
      _messages.add({'role': 'assistant', 'content': ''});
      _loading = true;
    });
    final idx = _messages.length - 1;
    try {
      await for (final token in chatStream(text)) {
        if (!mounted) return; // The screen was closed mid-stream.
        setState(() {
          _messages[idx]['content'] = (_messages[idx]['content'] ?? '') + token;
        });
      }
    } catch (e) {
      if (mounted) {
        setState(() {
          _messages[idx]['content'] = 'Error: $e';
        });
      }
    } finally {
      if (mounted) {
        setState(() => _loading = false);
      }
    }
  }

  @override
  Widget build(BuildContext context) {
    return Scaffold(
      appBar: AppBar(title: const Text('Local AI Chat')),
      body: Column(
        children: [
          Expanded(
            child: ListView.builder(
              itemCount: _messages.length,
              itemBuilder: (_, i) {
                final msg = _messages[i];
                final isUser = msg['role'] == 'user';
                return Align(
                  alignment: isUser ? Alignment.centerRight : Alignment.centerLeft,
                  child: Container(
                    margin: const EdgeInsets.all(4),
                    padding: const EdgeInsets.all(10),
                    decoration: BoxDecoration(
                      color: isUser ? Colors.blue[100] : Colors.grey[200],
                      borderRadius: BorderRadius.circular(12),
                    ),
                    child: Text(msg['content'] ?? ''),
                  ),
                );
              },
            ),
          ),
          Padding(
            padding: const EdgeInsets.all(8),
            child: Row(
              children: [
                Expanded(
                  child: TextField(
                    controller: _controller,
                    onSubmitted: (_) => _send(),
                    decoration: const InputDecoration(hintText: 'Ask something...'),
                  ),
                ),
                IconButton(
                  onPressed: _loading ? null : _send,
                  icon: const Icon(Icons.send),
                ),
              ],
            ),
          ),
        ],
      ),
    );
  }
}
The assistant message is added to the list immediately with empty content, and each token from the stream appends to it via setState. This gives the typewriter effect users expect from modern AI chat interfaces. The send button is disabled while a response is streaming, preventing overlapping requests. The mounted checks avoid calling setState after the screen has been disposed, which would otherwise throw if the user navigates away mid-generation.
Building an OllamaService Class
For anything beyond a simple demo, extract the Ollama logic into a service class that can be injected, tested, and reused across widgets:
class OllamaService {
  OllamaService({
    this.baseUrl = 'http://localhost:11434',
    this.model = 'llama3.2',
    http.Client? client, // Optional, so tests can inject a mock client.
  }) : _client = client ?? http.Client();

  final String baseUrl;
  final String model;
  final http.Client _client;

  /// Sends the full message list and returns the complete reply.
  Future<String> chat(List<Map<String, String>> messages) async {
    final r = await _client.post(
      Uri.parse('$baseUrl/api/chat'),
      headers: {'Content-Type': 'application/json'},
      body: jsonEncode({'model': model, 'messages': messages, 'stream': false}),
    );
    if (r.statusCode != 200) throw Exception('HTTP ${r.statusCode}');
    return jsonDecode(r.body)['message']['content'] as String;
  }

  /// Streams the reply for the given message list, one fragment at a time.
  Stream<String> chatStream(List<Map<String, String>> messages) async* {
    final req = http.Request('POST', Uri.parse('$baseUrl/api/chat'));
    req.headers['Content-Type'] = 'application/json';
    req.body = jsonEncode({'model': model, 'messages': messages, 'stream': true});
    final res = await _client.send(req);
    if (res.statusCode != 200) throw Exception('HTTP ${res.statusCode}');
    final lines = res.stream.transform(utf8.decoder).transform(const LineSplitter());
    await for (final line in lines) {
      if (line.trim().isEmpty) continue;
      final chunk = jsonDecode(line) as Map<String, dynamic>;
      final token = chunk['message']?['content'] as String? ?? '';
      if (token.isNotEmpty) yield token;
      if (chunk['done'] == true) break;
    }
  }

  void dispose() => _client.close();
}
This class reuses a single http.Client instance across all requests, which is more efficient than creating a new client per call because the client can keep connections open and reuse them. The optional client parameter exists so that tests can inject a mock. Call dispose() when the service is no longer needed, typically in the widget’s dispose() method or when the app closes.
Multi-Turn Conversation History
To support multi-turn conversations, pass the full message history to Ollama rather than just the latest user message. The simplest approach is a list of message maps that grows with each exchange. Wrap this in a ConversationManager to keep the widget code clean:
class ConversationManager {
  ConversationManager({String? systemPrompt})
      : _history = systemPrompt != null
            ? [{'role': 'system', 'content': systemPrompt}]
            : [];

  final List<Map<String, String>> _history;

  /// Read-only snapshot of the conversation so far.
  List<Map<String, String>> get messages => List.unmodifiable(_history);

  void addUser(String content) => _history.add({'role': 'user', 'content': content});
  void addAssistant(String content) => _history.add({'role': 'assistant', 'content': content});

  /// Overwrites the content of the most recent message.
  void updateLast(String content) => _history.last['content'] = content;

  /// Clears the history; pass [systemPrompt] again to keep the persona.
  void reset({String? systemPrompt}) {
    _history.clear();
    if (systemPrompt != null) _history.add({'role': 'system', 'content': systemPrompt});
  }
}
In the widget, create a ConversationManager as an instance variable, call addUser before sending and addAssistant after streaming completes. Pass manager.messages to OllamaService.chatStream to include the full context with every request. The system prompt is stored as the first message, so it shapes the model’s persona on every request. Note that reset() clears the whole history, including the system prompt, so pass the prompt to reset() again if the persona should survive. Because the system message lives in the same list, skip entries whose role is system when rendering the chat.
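The ordering described above (record the user message, stream the reply, then record the assistant message) can be sketched on a plain list. This is an illustration rather than part of the guide's classes: runTurn, ChatStreamFn, and fakeStream are hypothetical names, and fakeStream stands in for OllamaService.chatStream so the example runs without a server.

```dart
/// Signature shared by any function that streams a reply for a message list.
typedef ChatStreamFn = Stream<String> Function(
    List<Map<String, String>> messages);

/// Runs one conversation turn against [history] and returns the full reply.
/// Hypothetical helper; with ConversationManager the two add calls would be
/// addUser and addAssistant.
Future<String> runTurn(
  List<Map<String, String>> history,
  String userText,
  ChatStreamFn stream,
) async {
  history.add({'role': 'user', 'content': userText});
  final reply = StringBuffer();
  // Send a copy so the request is unaffected by later edits to history.
  await for (final token in stream(List.of(history))) {
    reply.write(token);
  }
  // Only a completed reply is stored, so a failed stream leaves no
  // half-written assistant message behind.
  history.add({'role': 'assistant', 'content': reply.toString()});
  return reply.toString();
}

/// Stand-in for OllamaService.chatStream, used only for this illustration.
Stream<String> fakeStream(List<Map<String, String>> messages) =>
    Stream.fromIterable(['Hi', ' there']);
```

Starting from a history that holds one system message, a single runTurn call leaves three entries: system, user, assistant. The next turn then sends all three plus the new user message.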
Connecting from a Mobile Device
When running the Flutter app on a physical iOS or Android device, localhost refers to the device itself, not your development machine. To connect to Ollama running on your desktop, use the machine’s local network IP address. Find it with ip addr on Linux, ipconfig on Windows, or System Settings → Network on macOS (System Preferences on older versions).
You also need to tell Ollama to listen on all network interfaces. Set the environment variable OLLAMA_HOST=0.0.0.0 before starting Ollama, or add it to your systemd service file. Make sure your firewall permits inbound connections on port 11434 from devices on your local network.
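Before debugging the app itself, it can help to confirm from the device that Ollama is reachable and that the model is installed. The sketch below assumes Ollama's GET /api/tags endpoint, which lists locally available models under a models array with a name field. The helper names are hypothetical and 192.168.1.100 is a placeholder address.

```dart
import 'dart:convert';

/// Builds the Ollama base URL for a host on the local network.
/// The default port 11434 matches the address used throughout this guide.
Uri ollamaBaseUrl(String host, {int port = 11434}) =>
    Uri(scheme: 'http', host: host, port: port);

/// Returns true if [tagsJson], the body of GET /api/tags, lists [model].
/// A name without a tag is treated as matching any tag of that model.
bool hasModel(String tagsJson, String model) {
  final data = jsonDecode(tagsJson) as Map<String, dynamic>;
  final names = (data['models'] as List? ?? const [])
      .map((m) => (m as Map<String, dynamic>)['name'] as String);
  return names.any((name) => name == model || name.startsWith('$model:'));
}
```

In the app, fetch ollamaBaseUrl('192.168.1.100').resolve('/api/tags') with http.get and pass the response body to hasModel. A connection error at that point suggests a network or OLLAMA_HOST problem, while a false result suggests the model still needs to be pulled.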
Android Network Security Configuration
On Android 9 (API level 28) and later, cleartext HTTP (non-HTTPS) requests are blocked by default, so connections to a local network address require an explicit declaration. Create android/app/src/main/res/xml/network_security_config.xml:
<?xml version="1.0" encoding="utf-8"?>
<network-security-config>
    <domain-config cleartextTrafficPermitted="true">
        <domain includeSubdomains="false">192.168.1.100</domain>
    </domain-config>
</network-security-config>
Reference it in your AndroidManifest.xml inside the application tag by adding android:networkSecurityConfig="@xml/network_security_config". Replace the IP address with your machine’s actual local IP. On iOS, App Transport Security exceptions for local addresses can be added in Info.plist if needed, for example with the NSAllowsLocalNetworking key, though iOS is generally more permissive for local network addresses.
Generating Embeddings
Ollama’s embeddings endpoint follows the same request pattern as the chat endpoint, minus the streaming. Pull an embedding model with ollama pull nomic-embed-text, then call /api/embed:
// Add this method to OllamaService; it uses the class's _client and baseUrl.
Future<List<double>> embed(String text) async {
  final response = await _client.post(
    Uri.parse('$baseUrl/api/embed'),
    headers: {'Content-Type': 'application/json'},
    body: jsonEncode({'model': 'nomic-embed-text', 'input': text}),
  );
  if (response.statusCode != 200) {
    throw Exception('HTTP ${response.statusCode}');
  }
  final data = jsonDecode(response.body) as Map<String, dynamic>;
  // 'embeddings' holds one vector per input; a single string yields one.
  final vector = (data['embeddings'] as List).first as List;
  // JSON numbers may decode as int, so convert each element explicitly.
  return vector.map((v) => (v as num).toDouble()).toList();
}
Embeddings are useful in Flutter apps for semantic search within a local document collection, finding similar notes, or clustering user-generated content. The returned vector is a plain List<double> which you can store in memory, persist to SQLite with sqflite, or compare against other vectors using cosine similarity computed directly in Dart.
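Cosine similarity needs nothing beyond dart:math. The following is a minimal sketch of the comparison step, assuming the vectors come from the same embedding model and therefore have the same length. Both function names are illustrative.

```dart
import 'dart:math' as math;

/// Cosine similarity between two equal-length vectors, in the range [-1, 1].
/// Returns 0 when either vector has zero magnitude.
double cosineSimilarity(List<double> a, List<double> b) {
  if (a.length != b.length) {
    throw ArgumentError('Vectors must have the same length.');
  }
  var dot = 0.0;
  var normA = 0.0;
  var normB = 0.0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA == 0.0 || normB == 0.0) return 0.0;
  return dot / (math.sqrt(normA) * math.sqrt(normB));
}

/// Returns the indices of [candidates] ordered from most to least similar
/// to [query].
List<int> rankBySimilarity(List<double> query, List<List<double>> candidates) {
  final scores = [for (final c in candidates) cosineSimilarity(query, c)];
  final order = List<int>.generate(candidates.length, (i) => i);
  order.sort((x, y) => scores[y].compareTo(scores[x]));
  return order;
}
```

For semantic search, embed the query once, then call rankBySimilarity against the stored document vectors and show the top few indices. A linear scan like this is usually adequate for the few thousand vectors a local notes app is likely to hold.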
Error Handling and Timeouts
HTTP requests to Ollama can fail for several reasons — Ollama is not running, the model has not been pulled, or the device is not reachable over the network. Handle these cases explicitly to give users useful feedback:
import 'dart:async'; // Provides TimeoutException.

// A method on OllamaService that wraps chat() with a timeout and readable errors.
Future<String> safeChat(String prompt) async {
  try {
    return await chat([
      {'role': 'user', 'content': prompt},
    ]).timeout(const Duration(seconds: 60));
  } on TimeoutException {
    return 'Request timed out. Is Ollama running and the model loaded?';
  } on http.ClientException catch (e) {
    return 'Could not reach Ollama: ${e.message}';
  } catch (e) {
    return 'Unexpected error: $e';
  }
}
The timeout() method on Future throws a TimeoutException if the request takes longer than the specified duration. It stops waiting but does not cancel the underlying HTTP request. Sixty seconds is a reasonable ceiling for most use cases: if Ollama has not responded in that time, something has likely gone wrong. For streaming requests, apply the timeout to the initial connection rather than the full stream, since streaming responses can legitimately take longer than a minute for very long generations.
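One way to time out only the connection phase is to separate "wait for the response to start" from "consume the stream". The sketch below shows that split with plain futures and streams. streamAfterConnect and fakeConnect are hypothetical names, and the 30-second default is an arbitrary assumption.

```dart
import 'dart:async';

/// Waits for [connect] with a timeout, then forwards the stream untimed.
/// Hypothetical helper: [connect] stands for whatever opens the response,
/// for example client.send(request) mapped to a token stream.
Stream<T> streamAfterConnect<T>(
  Future<Stream<T>> connect, {
  Duration connectTimeout = const Duration(seconds: 30), // assumed default
}) async* {
  // Throws TimeoutException if the response has not started in time.
  final stream = await connect.timeout(connectTimeout);
  // No timeout here: a long generation may legitimately run for minutes.
  yield* stream;
}

/// Simulates a server that starts responding after [delay] with two tokens.
Future<Stream<String>> fakeConnect(Duration delay) =>
    Future.delayed(delay, () => Stream.fromIterable(['Hi', '!']));
```

With the http package, the connect future would be client.send(request) followed by the utf8.decoder and LineSplitter transforms, since send() completes when the response begins rather than when the body ends. If stalls in the middle of a generation are also a concern, Stream.timeout can bound the gap between consecutive chunks.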
Desktop vs Mobile Considerations
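On desktop Flutter (macOS, Windows, Linux), Ollama can run on the same machine as the app. This is the most capable setup: you have access to the full GPU and all available RAM, and the app connects to http://localhost:11434 with no network configuration needed. Windows and Linux desktop apps have no HTTP security restrictions equivalent to Android’s network security config, so plain HTTP connections work out of the box. On macOS, Flutter apps run in the App Sandbox, so add the com.apple.security.network.client entitlement to macos/Runner/DebugProfile.entitlements and macos/Runner/Release.entitlements before the app can make outgoing requests.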
On mobile, the constraint is network latency and the need for Ollama to be reachable over WiFi. Keep the model small enough that generation feels responsive — llama3.2:3b typically generates 30 to 50 tokens per second on a modern desktop with a mid-range GPU, which feels fast even over a local network connection. Larger models produce better output but slower responses, so the right trade-off depends on your use case.
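Rather than relying on a rule-of-thumb figure, you can measure generation speed on your own hardware. The sketch below assumes the final chunk of an Ollama response (the one with done set to true) carries eval_count and eval_duration fields, with the duration in nanoseconds. The function name is illustrative.

```dart
/// Computes generation speed from the final response chunk (done == true).
/// Returns null when the timing fields are missing or zero.
double? tokensPerSecond(Map<String, dynamic> finalChunk) {
  final count = finalChunk['eval_count'] as num?;
  final nanos = finalChunk['eval_duration'] as num?;
  if (count == null || nanos == null || nanos <= 0) return null;
  // eval_duration is reported in nanoseconds, so convert to seconds first.
  return count / (nanos / 1e9);
}
```

Logging this value for each candidate model gives a concrete basis for the size versus speed trade-off on the machine that actually hosts Ollama.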
Using Riverpod for State Management
For larger Flutter apps using Riverpod, expose the OllamaService as a provider and manage conversation state in an AsyncNotifier. Define a provider at the top level of your app:
import 'package:flutter_riverpod/flutter_riverpod.dart';

final ollamaProvider = Provider((ref) => OllamaService());

final chatProvider =
    AsyncNotifierProvider<ChatNotifier, List<Map<String, String>>>(ChatNotifier.new);

class ChatNotifier extends AsyncNotifier<List<Map<String, String>>> {
  @override
  Future<List<Map<String, String>>> build() async => [];

  Future<void> send(String userMessage) async {
    final svc = ref.read(ollamaProvider);
    final current = state.value ?? [];
    final msgs = [
      ...current,
      {'role': 'user', 'content': userMessage},
      {'role': 'assistant', 'content': ''}, // Placeholder filled by the stream.
    ];
    state = AsyncData(msgs);
    final idx = msgs.length - 1;
    // Send everything except the empty assistant placeholder.
    await for (final token in svc.chatStream(msgs.sublist(0, idx))) {
      // Publish a new list on every token so watchers are notified.
      final updated = List<Map<String, String>>.from(state.value!);
      updated[idx] = {
        'role': 'assistant',
        'content': (updated[idx]['content'] ?? '') + token,
      };
      state = AsyncData(updated);
    }
  }
}
This pattern keeps all AI state outside the widget tree, making it easy to share conversation state across screens and to test the chat logic independently of the UI. Widgets simply watch chatProvider and rebuild whenever the message list updates, which happens on every token during streaming.
Production Considerations
For production Flutter apps distributed to end users, embedding Ollama as a local dependency is not yet practical — model weights are too large and the runtime setup is too complex for a typical app install flow. The practical sweet spot today is internal tools, developer utilities, home automation dashboards, and applications targeting technically sophisticated users who are comfortable running local services.
For consumer-facing apps, a hybrid approach works well: use Ollama for latency-sensitive features like autocomplete or on-device draft generation, and fall back to a remote endpoint for complex queries requiring a larger model. The OllamaService abstraction makes switching base URLs straightforward — you can even make the endpoint runtime-configurable via a settings screen, letting power users point the app at their own Ollama instance while everyone else uses a shared server.
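The hybrid approach can be sketched as a small wrapper that tries the local endpoint first and falls back to the remote one. This is an illustration under assumptions: chatWithFallback is a hypothetical helper, the two callbacks stand for calls on two differently configured OllamaService instances, and the 10-second limit is arbitrary.

```dart
import 'dart:async';

/// Tries [local] first and falls back to [remote] on error or timeout.
/// Hypothetical helper; the 10-second default is an arbitrary assumption.
Future<String> chatWithFallback(
  Future<String> Function() local,
  Future<String> Function() remote, {
  Duration localTimeout = const Duration(seconds: 10),
}) async {
  try {
    return await local().timeout(localTimeout);
  } on Exception {
    // Covers TimeoutException, connection failures, and HTTP errors that
    // the local call throws as exceptions.
    return remote();
  }
}
```

A call site might look like chatWithFallback(() => localService.chat(messages), () => remoteService.chat(messages)), where both service instances are assumptions of this example. Errors from the remote call are deliberately left to propagate so the UI can report them.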
Testing Your Dart Ollama Client
Unit testing Dart HTTP code is straightforward with MockClient from package:http/testing.dart. If the OllamaService constructor accepts an optional http.Client and falls back to http.Client() when none is given, a test can pass in a mock client that returns fixed JSON responses without needing a running Ollama instance. This keeps your tests fast and deterministic on CI machines that do not have Ollama installed.
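A minimal sketch of that setup follows. InjectableOllamaService is a hypothetical, trimmed-down class written for this example rather than the guide's full service, and fakeOllama is an illustrative helper. MockClient and its request handler come from the http package's testing library.

```dart
import 'dart:convert';
import 'package:http/http.dart' as http;
import 'package:http/testing.dart';

/// Trimmed-down service whose HTTP client can be swapped in tests.
/// Hypothetical class for illustration only.
class InjectableOllamaService {
  InjectableOllamaService({
    this.baseUrl = 'http://localhost:11434',
    this.model = 'llama3.2',
    http.Client? client,
  }) : _client = client ?? http.Client();

  final String baseUrl;
  final String model;
  final http.Client _client;

  Future<String> chat(List<Map<String, String>> messages) async {
    final r = await _client.post(
      Uri.parse('$baseUrl/api/chat'),
      headers: {'Content-Type': 'application/json'},
      body: jsonEncode({'model': model, 'messages': messages, 'stream': false}),
    );
    if (r.statusCode != 200) throw Exception('HTTP ${r.statusCode}');
    return jsonDecode(r.body)['message']['content'] as String;
  }
}

/// Returns a MockClient that answers every request with a fixed reply.
MockClient fakeOllama(String reply) => MockClient((request) async {
      return http.Response(
        jsonEncode({
          'message': {'role': 'assistant', 'content': reply},
          'done': true,
        }),
        200,
        headers: {'content-type': 'application/json'},
      );
    });
```

In a test, construct InjectableOllamaService(client: fakeOllama('hi')) and expect chat() to return 'hi'. The handler also receives the outgoing request, so a stricter test can decode request.body and check that the model name and message history were sent as intended.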
For integration testing a Flutter chat UI, use Flutter’s built-in WidgetTester to pump the widget, enter text into the TextField, tap the send button, and then pump enough frames to let the async stream complete. The key is providing the OllamaService via a constructor parameter or using a dependency injection solution like Riverpod’s ProviderScope override, so the test can substitute a mock service that returns a fixed stream of tokens without touching the network. This pattern lets you test the full streaming UI behaviour — including the typewriter effect and the loading state — entirely in isolation.
Choosing a Model for Flutter Applications
Model choice for a Flutter application depends primarily on where Ollama is running and what the app needs to do. For a desktop Flutter app running Ollama on the same machine with a dedicated GPU, you have significant flexibility: models up to 8B parameters run comfortably and generate tokens fast enough for a good user experience. For mobile clients connecting over WiFi, network latency is added on top of generation time, so smaller models that generate faster are generally preferable even if a larger model is available on the host machine.
For general-purpose chat in a Flutter app, llama3.2:3b is a reliable default. It handles conversation, summarisation, and basic reasoning well, and generates quickly enough that the streaming UI feels responsive rather than slow. For coding assistance features within a developer tool built with Flutter, qwen2.5-coder:7b is tuned for code and generally produces better code output, at roughly twice the parameter count. For document summarisation or analysis where quality matters more than speed, llama3.1:8b is worth the extra generation time (the llama3.2 text models come only in 1B and 3B sizes). You can make the model name a configuration option in your OllamaService and let users switch between models from a settings screen.
Structured JSON Output
For features that need to parse the model’s output programmatically, such as extracting entities from text, classifying user input, or generating structured data, Ollama’s structured output mode constrains the model to produce output conforming to a schema you specify. Pass a format field alongside your messages in the request body containing a JSON Schema object. The response content is then a JSON string matching that schema, which you can pass to jsonDecode and cast to a typed Dart map. Keep a try-catch around the decode anyway, since a reply cut short by a token limit can still be incomplete.
This is meaningfully more reliable than prompting the model to respond in JSON format. Prompt-based JSON generation works most of the time, but occasionally produces markdown fences around the JSON, extra explanatory prose, or subtly malformed output. Schema-constrained generation avoids these failure modes in normal operation, making it the right choice whenever downstream Dart code depends on parsing the model’s response into a structured form. Define your schema as a Dart map, encode it to JSON with jsonEncode, and decode the response with jsonDecode. No additional packages are required.
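As a concrete illustration, here is a sketch of the request and parsing sides for a sentiment classifier. The schema, the prompt wording, and the function names are all assumptions made for this example. The one thing taken from Ollama's API is the format field carrying a JSON Schema object.

```dart
import 'dart:convert';

/// JSON Schema for a hypothetical sentiment classifier.
const sentimentSchema = {
  'type': 'object',
  'properties': {
    'label': {
      'type': 'string',
      'enum': ['positive', 'neutral', 'negative'],
    },
    'confidence': {'type': 'number'},
  },
  'required': ['label', 'confidence'],
};

/// Builds the /api/chat request body with the schema in the format field.
String buildClassifyRequest(String text, {String model = 'llama3.2'}) {
  return jsonEncode({
    'model': model,
    'messages': [
      {'role': 'user', 'content': 'Classify the sentiment of: $text'},
    ],
    'format': sentimentSchema,
    'stream': false,
  });
}

/// Extracts the label from a non-streaming /api/chat response body.
/// The message content is itself a JSON string, so it is decoded twice.
String parseLabel(String responseBody) {
  final outer = jsonDecode(responseBody) as Map<String, dynamic>;
  final content = outer['message']['content'] as String;
  final result = jsonDecode(content) as Map<String, dynamic>;
  return result['label'] as String;
}
```

Post the string from buildClassifyRequest to /api/chat exactly as in the basic chat example, then hand the response body to parseLabel. Mentioning the expected fields in the prompt as well as in the schema tends to help the model fill them sensibly.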