How to Use Ollama with Rust

Rust is an increasingly popular choice for systems programming, CLI tools, and high-performance web services. If you are building a Rust application and want to add local LLM capabilities without a cloud dependency, Ollama exposes a straightforward HTTP API that any Rust HTTP client can call. This guide covers everything from basic chat completions to streaming responses and building a reusable async client — all using idiomatic Rust with Tokio and the Reqwest library.

Rust’s ownership model and async ecosystem make it a particularly good fit for working with streaming AI responses. You get fine-grained control over buffering, backpressure, and error handling without the runtime overhead of a garbage collector. The patterns here are production-ready and work whether you are building a CLI tool, a REST API, or an embedded service.

Prerequisites

You will need Ollama installed and running with at least one model pulled; ollama pull llama3.2 is a good starting point. On the Rust side, you need a recent stable Rust toolchain (1.75 or later) and Cargo. The dependencies we will use are reqwest for HTTP, tokio for the async runtime, serde and serde_json for JSON serialisation, futures-util for working with async streams, and anyhow for error handling.

Cargo.toml Setup

Add the following to your Cargo.toml:

[dependencies]
reqwest = { version = "0.12", features = ["json", "stream"] }
tokio = { version = "1", features = ["full"] }
serde = { version = "1", features = ["derive"] }
serde_json = "1"
futures-util = "0.3"
anyhow = "1"

The stream feature on Reqwest enables the bytes_stream() method that the streaming examples below rely on to read the response body incrementally. Without the feature, that method is not available and those examples will not compile.

Defining the Types

Start by defining the structs that map to Ollama’s JSON request and response shapes. Using serde’s derive macros keeps the serialisation boilerplate minimal:

use serde::{Deserialize, Serialize};

// Message is sent in requests and received in responses, so it needs both derives.
#[derive(Serialize, Deserialize, Clone)]
pub struct Message {
    pub role: String,
    pub content: String,
}

#[derive(Serialize)]
struct ChatRequest {
    model: String,
    messages: Vec<Message>,
    stream: bool,
}

#[derive(Deserialize)]
struct ChatResponse {
    message: Message,
    done: bool,
}

#[derive(Deserialize)]
struct StreamChunk {
    message: Message,
    done: bool,
}

Both ChatResponse and StreamChunk share the same shape in Ollama’s API — the difference is how you consume them. When streaming is disabled you receive one JSON object for the entire response; when streaming is enabled you receive one JSON object per line as tokens are generated. Keeping them as separate structs makes the intent clear at each call site.

Basic Chat Completion

Here is a minimal async function that sends a prompt to Ollama and returns the full response as a String:

use reqwest::Client;

const BASE_URL: &str = "http://localhost:11434";
const MODEL: &str = "llama3.2";

async fn chat(client: &Client, prompt: &str) -> anyhow::Result<String> {
    let request = ChatRequest {
        model: MODEL.to_string(),
        messages: vec![Message {
            role: "user".to_string(),
            content: prompt.to_string(),
        }],
        stream: false,
    };

    let response: ChatResponse = client
        .post(format!("{BASE_URL}/api/chat"))
        .json(&request)
        .send()
        .await?
        .json()
        .await?;

    Ok(response.message.content)
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = Client::new();
    let reply = chat(&client, "What is Rust's ownership model?").await?;
    println!("{reply}");
    Ok(())
}

We pass the Client by reference rather than creating a new one per call. Reqwest’s Client manages an internal connection pool — reusing it across calls avoids the overhead of establishing a new TCP connection for every request, which matters when you are making many sequential calls to Ollama.

Streaming Responses

Streaming from Ollama in Rust means reading the response body line by line as bytes arrive. Reqwest’s bytes_stream() method returns an async stream of byte chunks, which you can split on newlines and deserialise incrementally:

use futures_util::StreamExt;

async fn chat_stream(client: &Client, prompt: &str) -> anyhow::Result<String> {
    let request = ChatRequest {
        model: MODEL.to_string(),
        messages: vec![Message {
            role: "user".to_string(),
            content: prompt.to_string(),
        }],
        stream: true,
    };

    let response = client
        .post(format!("{BASE_URL}/api/chat"))
        .json(&request)
        .send()
        .await?;

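    let mut stream = response.bytes_stream();
    // Buffer raw bytes so a multi-byte UTF-8 character split across chunks is not corrupted.
    let mut buf: Vec<u8> = Vec::new();
    let mut full_response = String::new();

    'outer: while let Some(chunk) = stream.next().await {
        buf.extend_from_slice(&chunk?);

        while let Some(pos) = buf.iter().position(|&b| b == b'\n') {
            // Take the complete line, including its newline, out of the buffer.
            let raw: Vec<u8> = buf.drain(..=pos).collect();
            let line = String::from_utf8_lossy(&raw);
            let line = line.trim();

            if line.is_empty() { continue; }

            if let Ok(chunk) = serde_json::from_str::<StreamChunk>(line) {
                print!("{}", chunk.message.content);
                // Stdout is line-buffered, so flush to show each token as it arrives.
                std::io::Write::flush(&mut std::io::stdout())?;
                full_response.push_str(&chunk.message.content);
                // Stop reading once Ollama marks the response complete.
                if chunk.done { break 'outer; }
            }
        }
    }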
    println!();
    Ok(full_response)
}

The manual line-splitting logic is necessary because TCP does not guarantee that each network chunk aligns with a newline boundary. A single chunk from bytes_stream() might contain half a JSON object, a full object, or multiple objects. The buffer accumulates bytes until a newline is found, then slices out the complete line and attempts to deserialise it. This is the standard pattern for consuming newline-delimited JSON streams in Rust.
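The splitting step can also be pulled out into a small helper that is testable without a network connection. The following is a minimal sketch using only the standard library; LineBuffer is a hypothetical name, not part of Reqwest or Ollama. It buffers raw bytes rather than a String, on the assumption that a chunk boundary can fall in the middle of a multi-byte UTF-8 character.

```rust
/// Accumulates raw bytes and yields complete newline-terminated lines.
/// `LineBuffer` is a hypothetical helper, not part of Reqwest or Ollama.
#[derive(Default)]
pub struct LineBuffer {
    buf: Vec<u8>,
}

impl LineBuffer {
    /// Appends one network chunk and returns every line it completed.
    pub fn push(&mut self, chunk: &[u8]) -> Vec<String> {
        self.buf.extend_from_slice(chunk);
        let mut lines = Vec::new();
        while let Some(pos) = self.buf.iter().position(|&b| b == b'\n') {
            // Take the line, including its trailing newline, out of the buffer.
            let raw: Vec<u8> = self.buf.drain(..=pos).collect();
            let line = String::from_utf8_lossy(&raw).trim().to_string();
            if !line.is_empty() {
                lines.push(line);
            }
        }
        // Any bytes after the last newline stay buffered for the next chunk.
        lines
    }
}
```

Each call to push takes whatever bytes_stream() yielded, and each returned String is one complete JSON line ready for serde_json::from_str.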

Building a Reusable OllamaClient

For real applications, wrap the HTTP logic in a struct that holds configuration and the shared Reqwest client:

pub struct OllamaClient {
    client: Client,
    base_url: String,
    model: String,
}

impl OllamaClient {
    pub fn new(base_url: impl Into<String>, model: impl Into<String>) -> Self {
        Self {
            client: Client::builder()
                .timeout(std::time::Duration::from_secs(120))
                .build()
                .expect("Failed to build HTTP client"),
            base_url: base_url.into(),
            model: model.into(),
        }
    }

    pub async fn chat(&self, messages: &[Message]) -> anyhow::Result<String> {
        let req = ChatRequest {
            model: self.model.clone(),
            messages: messages.to_vec(),
            stream: false,
        };
        let res: ChatResponse = self
            .client
            .post(format!("{}/api/chat", self.base_url))
            .json(&req)
            .send()
            .await?
            .error_for_status()?
            .json()
            .await?;
        Ok(res.message.content)
    }

    pub async fn chat_stream(
        &self,
        messages: &[Message],
        mut on_token: impl FnMut(&str),
    ) -> anyhow::Result<String> {
        let req = ChatRequest {
            model: self.model.clone(),
            messages: messages.to_vec(),
            stream: true,
        };
        let response = self
            .client
            .post(format!("{}/api/chat", self.base_url))
            .json(&req)
            .send()
            .await?
            .error_for_status()?;

        let mut stream = response.bytes_stream();
        // Buffer raw bytes so multi-byte UTF-8 characters split across chunks survive.
        let mut buf: Vec<u8> = Vec::new();
        let mut full = String::new();

        'outer: while let Some(chunk) = stream.next().await {
            buf.extend_from_slice(&chunk?);
            while let Some(pos) = buf.iter().position(|&b| b == b'\n') {
                let raw: Vec<u8> = buf.drain(..=pos).collect();
                let line = String::from_utf8_lossy(&raw);
                let line = line.trim();
                if line.is_empty() { continue; }
                if let Ok(c) = serde_json::from_str::<StreamChunk>(line) {
                    on_token(&c.message.content);
                    full.push_str(&c.message.content);
                    // Stop reading once Ollama marks the response complete.
                    if c.done { break 'outer; }
                }
            }
        }
        Ok(full)
    }
}

The chat_stream method accepts a callback closure on_token that is called for each token as it arrives. This is more ergonomic than returning an async stream from a method on a struct, which requires working around Rust’s lifetime rules. The callback pattern lets callers print tokens, push them to a channel, or accumulate them however they need without any lifetime gymnastics.

The error_for_status() call converts 4xx and 5xx HTTP responses into errors automatically. Without it, the body of a 404 (model not found) or 500 (Ollama internal error) response would be passed to the JSON deserialiser, fail to match ChatResponse, and surface as a confusing missing-field error instead of the HTTP status that actually explains the problem.
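To illustrate the callback pattern without a running server, here is a small sketch that forwards tokens into a standard-library channel. The function fake_stream is a hypothetical stand-in with the same callback shape as chat_stream; it feeds a fixed list of tokens to the closure instead of calling Ollama.

```rust
use std::sync::mpsc;

/// Hypothetical stand-in for `OllamaClient::chat_stream`: feeds fixed tokens
/// to the callback and returns the concatenated text.
fn fake_stream(tokens: &[&str], mut on_token: impl FnMut(&str)) -> String {
    let mut full = String::new();
    for t in tokens {
        on_token(t);
        full.push_str(t);
    }
    full
}

/// Forwards every token into a channel and returns what the receiver saw.
fn forward_to_channel(tokens: &[&str]) -> (String, Vec<String>) {
    let (tx, rx) = mpsc::channel::<String>();
    // The closure owns the sender and sends each token to the receiving side.
    let full = fake_stream(tokens, move |tok| {
        // A send error only means the receiver was dropped, so it is ignored here.
        let _ = tx.send(tok.to_string());
    });
    // The sender was dropped together with the closure, so this iteration ends.
    (full, rx.iter().collect())
}
```

In an async application the same closure shape would work with a Tokio channel; the standard-library channel is used here only to keep the sketch dependency-free.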

Multi-Turn Conversation

Multi-turn conversations work by maintaining a Vec<Message> and passing it in full with each request. A simple conversation struct manages the history:

pub struct Conversation {
    client: OllamaClient,
    history: Vec<Message>,
}

impl Conversation {
    pub fn new(client: OllamaClient, system_prompt: Option<&str>) -> Self {
        let mut history = Vec::new();
        if let Some(prompt) = system_prompt {
            history.push(Message { role: "system".into(), content: prompt.into() });
        }
        Self { client, history }
    }

    pub async fn send(&mut self, user_message: &str) -> anyhow::Result<String> {
        self.history.push(Message { role: "user".into(), content: user_message.into() });
        let reply = self.client.chat(&self.history).await?;
        self.history.push(Message { role: "assistant".into(), content: reply.clone() });
        Ok(reply)
    }

    pub fn reset(&mut self, keep_system: bool) {
        if keep_system && self.history.first().map(|m| m.role == "system").unwrap_or(false) {
            self.history.truncate(1);
        } else {
            self.history.clear();
        }
    }
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = OllamaClient::new("http://localhost:11434", "llama3.2");
    let mut convo = Conversation::new(client, Some("You are a concise Rust expert."));
    println!("{}", convo.send("What is a lifetime in Rust?").await?);
    println!("{}", convo.send("How does it relate to borrowing?").await?);
    Ok(())
}

The reset method optionally preserves the system prompt at index 0 when clearing history. Truncating to length 1 leaves the original system message in place, so there is no need to clone it and push it back after a full clear. It is a small optimisation, but it also guarantees the system prompt stays exactly as it was originally set.

Using Ollama in an Axum Web Service

If you are building a web API with Axum, you can share the OllamaClient across request handlers using Axum’s state injection. Add axum = "0.7" to your dependencies, then wrap the client in an Arc so it can be cloned cheaply across threads:

use axum::{
    extract::State,
    routing::post,
    Json, Router,
};
use std::sync::Arc;

#[derive(Clone)]
struct AppState {
    ollama: Arc<OllamaClient>,
}

#[derive(Deserialize)]
struct ChatInput { prompt: String }

#[derive(Serialize)]
struct ChatOutput { reply: String }

async fn handle_chat(
    State(state): State<AppState>,
    Json(input): Json<ChatInput>,
) -> Json<ChatOutput> {
    let messages = vec![Message { role: "user".into(), content: input.prompt }];
    let reply = state.ollama.chat(&messages).await.unwrap_or_else(|e| e.to_string());
    Json(ChatOutput { reply })
}

#[tokio::main]
async fn main() {
    let state = AppState {
        ollama: Arc::new(OllamaClient::new("http://localhost:11434", "llama3.2")),
    };
    let app = Router::new()
        .route("/chat", post(handle_chat))
        .with_state(state);
    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}

The Arc<OllamaClient> is cloned for each request handler invocation, but the underlying data — the Reqwest client and its connection pool — is shared. This means all concurrent requests share the same pool of HTTP connections to Ollama, which is both memory-efficient and correct. Tokio schedules each handler on its thread pool, so concurrent requests are handled without blocking.

Generating Embeddings

Ollama’s /api/embed endpoint returns a vector representation of any input text. Add these types and a method to your client:

#[derive(Serialize)]
struct EmbedRequest { model: String, input: String }

#[derive(Deserialize)]
struct EmbedResponse { embeddings: Vec<Vec<f32>> }

impl OllamaClient {
    pub async fn embed(&self, text: &str) -> anyhow::Result<Vec<f32>> {
        let req = EmbedRequest {
            model: "nomic-embed-text".into(),
            input: text.into(),
        };
        let res: EmbedResponse = self
            .client
            .post(format!("{}/api/embed", self.base_url))
            .json(&req)
            .send()
            .await?
            .error_for_status()?
            .json()
            .await?;
        res.embeddings.into_iter().next()
            .ok_or_else(|| anyhow::anyhow!("No embedding returned"))
    }
}

Pull the model first with ollama pull nomic-embed-text. Using f32 rather than f64 for the embedding values halves the memory footprint of each vector with negligible precision loss for retrieval tasks — cosine similarity and dot product comparisons work equally well at 32-bit float precision.
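Once you have two embedding vectors, comparing them is plain arithmetic. The following is a minimal sketch of cosine similarity over f32 slices using only the standard library; the function name and the choice to return None for mismatched or zero-length vectors are assumptions made for illustration.

```rust
/// Cosine similarity between two vectors of equal length.
/// Returns `None` if the lengths differ or either vector has zero magnitude.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}
```

A retrieval step would call embed on the query, compute this score against each stored vector, and keep the highest-scoring entries.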

Error Handling with anyhow

The examples above use anyhow::Result for ergonomic error handling. In a library crate you would typically define a custom error type with thiserror instead, so callers can match on specific error variants. For a binary or CLI tool, anyhow is the right choice — it propagates errors with ? and attaches context with .context("what we were doing"):

// The .context() method comes from the anyhow::Context trait.
use anyhow::Context;

let reply = client
    .chat(&messages)
    .await
    .context("Failed to get response from Ollama")?;

When the error propagates to main, anyhow prints the full chain of context messages, making it immediately clear what the program was attempting when the error occurred. This is significantly more useful than a bare reqwest::Error that just says “connection refused” with no indication of which part of your application triggered it.
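For the library-crate case, the sketch below shows roughly what a matchable error type could look like. It is written by hand with the standard library only, which is approximately what thiserror would generate from derive attributes; OllamaError, its variants, and is_retryable are hypothetical names chosen for illustration.

```rust
use std::fmt;

/// Hypothetical error type for a library crate that wraps Ollama.
#[derive(Debug)]
pub enum OllamaError {
    /// The server could not be reached.
    Connection(String),
    /// The server replied with a non-success HTTP status.
    Status(u16),
    /// The response body was not the JSON that was expected.
    Decode(String),
}

impl fmt::Display for OllamaError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            OllamaError::Connection(msg) => write!(f, "could not reach Ollama: {msg}"),
            OllamaError::Status(code) => write!(f, "Ollama returned HTTP status {code}"),
            OllamaError::Decode(msg) => write!(f, "could not decode response: {msg}"),
        }
    }
}

impl std::error::Error for OllamaError {}

/// Example of a caller matching on variants to decide whether to retry.
pub fn is_retryable(err: &OllamaError) -> bool {
    match err {
        OllamaError::Connection(_) => true,
        OllamaError::Status(code) => *code >= 500,
        OllamaError::Decode(_) => false,
    }
}
```

The retry policy shown is only an example; the point is that callers can branch on the variant, which an opaque anyhow::Error does not make convenient.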

Testing with a Mock Server

For unit tests that do not require a running Ollama instance, use the wiremock crate to spin up a mock HTTP server that returns fixed JSON responses. Add wiremock = "0.6" to your [dev-dependencies], then write a test like this:

#[cfg(test)]
mod tests {
    use super::*;
    use wiremock::matchers::{method, path};
    use wiremock::{Mock, MockServer, ResponseTemplate};

    #[tokio::test]
    async fn test_chat_parses_response() {
        let server = MockServer::start().await;
        Mock::given(method("POST"))
            .and(path("/api/chat"))
            .respond_with(ResponseTemplate::new(200).set_body_json(serde_json::json!({
                "model": "llama3.2",
                "message": {"role": "assistant", "content": "Hello from Rust!"},
                "done": true
            })))
            .mount(&server)
            .await;

        let client = OllamaClient::new(server.uri(), "llama3.2");
        let messages = vec![Message { role: "user".into(), content: "Hi".into() }];
        let reply = client.chat(&messages).await.unwrap();
        assert_eq!(reply, "Hello from Rust!");
    }
}

The mock server starts on a random port and its URI is injected into the OllamaClient constructor. This keeps the test completely isolated from any running Ollama instance and ensures it passes consistently on CI. The MockServer shuts down automatically when it goes out of scope at the end of the test function.

Structured Output

Ollama supports JSON schema-constrained generation, which forces the model to produce output conforming to a schema you specify. This is particularly useful in Rust because you can deserialise the constrained output directly into a typed struct without any manual parsing. Build your request body with a format field containing a JSON Schema object, set stream: false, and deserialise response.message.content with serde_json::from_str::<YourStruct>. Because the schema constrains generation, the output should match your struct’s shape, but you should still handle the deserialisation error: a reply cut off by a token limit, or a mismatch between the schema and the struct, will fail to parse.

This pairs well with Rust’s type system — define a struct with #[derive(Deserialize)], generate a JSON Schema from it using the schemars crate, pass that schema to Ollama, and deserialise the response directly. The entire pipeline from struct definition to typed response involves no manual string parsing, which is exactly the kind of end-to-end type safety that makes Rust a pleasure to work with for data-intensive AI features.

Performance Considerations

Rust’s async runtime adds very little overhead to the Ollama integration. The main performance variables are model size, hardware, and whether the model is already loaded in Ollama’s memory. The first request to a model triggers loading from disk, which can take several seconds for larger models. Subsequent requests reuse the loaded model and are significantly faster. You can control how long Ollama keeps a model in memory with the keep_alive parameter in the request body — set it to -1 to keep the model loaded indefinitely, or 0 to unload it immediately after each request.

For a Rust web service handling concurrent users, keep in mind that Ollama processes only a limited number of requests per model in parallel (controlled by the OLLAMA_NUM_PARALLEL environment variable, and often just one on constrained hardware), so additional requests are queued. Tokio will not block: each handler awaits its Ollama response asynchronously while other handlers can run concurrently. But generation throughput is bounded on Ollama’s side. For small teams or low-traffic internal tools this is fine. For higher throughput, running multiple Ollama instances on different ports and distributing requests across them is the most straightforward scaling path.
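Distributing requests across several instances can be as simple as rotating through a list of base URLs. The sketch below is one possible approach using only the standard library; RoundRobin is a hypothetical helper, and it does not account for instance health or queue depth.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hands out base URLs in rotation. `RoundRobin` is a hypothetical helper.
pub struct RoundRobin {
    urls: Vec<String>,
    next: AtomicUsize,
}

impl RoundRobin {
    /// Returns `None` when no URLs are supplied, since there is nothing to pick.
    pub fn new(urls: Vec<String>) -> Option<Self> {
        if urls.is_empty() {
            None
        } else {
            Some(Self { urls, next: AtomicUsize::new(0) })
        }
    }

    /// Picks the next URL; callable through `&self` from many tasks at once.
    pub fn pick(&self) -> &str {
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.urls.len();
        &self.urls[i]
    }
}
```

A handler would call pick() to choose the base URL for each request, for example http://localhost:11434 and a second instance on an assumed port such as 11435.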

Choosing the Right Model

Model selection for a Rust application follows the same logic as any other Ollama client. For a CLI tool where you are the only user, the model can be as large as your GPU can handle, because there is no latency pressure from concurrent users. For a web service handling multiple users, smaller models that generate faster reduce queue depth and keep response times predictable. A model like llama3.2:3b generates tokens fast enough that streaming feels responsive even under moderate load, while a larger model like llama3.1:8b produces noticeably better output at the cost of slower generation.

For code-focused Rust tooling (generating boilerplate, explaining compiler errors, suggesting refactors), code-specialised models like qwen2.5-coder:7b tend to outperform general-purpose models of similar size. They generally handle Rust-specific concepts like lifetimes, trait bounds, and the borrow checker better, and their suggestions are more likely to compile without modification, though you should still check the output with the compiler. Pull the model with ollama pull qwen2.5-coder:7b and pass the model name as a parameter when constructing your OllamaClient.

Building a CLI Chat Tool

One of the most practical uses of a Rust Ollama client is a terminal chat application. Rust’s ecosystem has excellent CLI libraries — clap for argument parsing, rustyline for readline-style input with history, and indicatif for progress spinners while waiting for a response. Combined with the streaming chat_stream method and a Conversation struct, you can build a fully-featured local AI chat tool in under 200 lines of Rust that starts instantly, consumes minimal memory, and handles Ctrl+C gracefully via Tokio’s signal handling utilities.
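The core of such a tool is a loop that reads a line and decides what to do with it. As a rough sketch of that decision step, using only the standard library, the function below maps one input line to an action; the Command enum and the /quit, /exit, and /reset commands are hypothetical conventions, not something Ollama or rustyline defines.

```rust
/// What the chat loop should do with one line of input.
#[derive(Debug, PartialEq)]
pub enum Command<'a> {
    /// Leave the program.
    Quit,
    /// Clear the conversation history.
    Reset,
    /// Nothing was typed; prompt again.
    Empty,
    /// Send this text to the model.
    Prompt(&'a str),
}

/// Classifies a raw input line, ignoring surrounding whitespace.
pub fn parse_line(line: &str) -> Command<'_> {
    match line.trim() {
        "" => Command::Empty,
        "/quit" | "/exit" => Command::Quit,
        "/reset" => Command::Reset,
        text => Command::Prompt(text),
    }
}
```

In the full tool, Prompt would be passed to the conversation’s send method or to chat_stream, and Reset would call reset(true) to keep the system prompt.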

The resulting binary is a single self-contained executable with no runtime dependencies beyond Ollama itself. This is one of the practical advantages of Rust for local AI tooling — you can distribute the binary to colleagues or install it on a server without worrying about Python virtual environments, Node version managers, or JVM installations. The binary starts in milliseconds and the only setup required is running ollama serve.

One final practical note: alongside the native /api/chat endpoint used throughout this guide, Ollama also exposes an OpenAI-compatible endpoint at /v1/chat/completions. The two are not interchangeable: the OpenAI format wraps the reply in a choices array and streams server-sent events rather than newline-delimited JSON, so the request and response structs and the stream parser differ. If you expect to deploy against a remote OpenAI-compatible provider later, write your client against /v1/chat/completions from the start; switching providers then comes down to changing the base URL, model name, and API key, while your conversation management and business logic stay the same. This makes Ollama a good choice for local development and testing even if you plan to deploy against a different provider in production.