How to Use Ollama with Kotlin

Kotlin has become the language of choice for Android development and is increasingly popular on the server side thanks to frameworks like Ktor and Spring Boot. If you are building a Kotlin application and want to add local LLM capabilities without routing requests through a cloud API, Ollama is the most straightforward way to do it. You get a local HTTP API that any Kotlin HTTP client can talk to, with no external dependencies or billing involved.

This guide covers everything you need to connect a Kotlin application to Ollama — from basic chat completions using simple HTTP calls, through streaming responses with coroutines, to building a reusable client that works in both Ktor server applications and Android projects. All the examples use idiomatic Kotlin with coroutines and are production-ready starting points rather than throwaway demos.

Prerequisites

You will need Ollama installed and running on your machine with at least one model pulled. The examples use llama3.2 but any model works. On the Kotlin side you need JDK 17 or later and either Gradle or Maven for dependency management. The HTTP client we will use is ktor-client, which is the natural choice for Kotlin projects — it is coroutine-native, multiplatform-compatible, and has first-class support for streaming responses.

Project Setup with Gradle

Create a new Kotlin project and add the following to your build.gradle.kts:

plugins {
    kotlin("jvm") version "2.0.0"
    kotlin("plugin.serialization") version "2.0.0"
    application
}

dependencies {
    implementation("io.ktor:ktor-client-core:2.3.12")
    implementation("io.ktor:ktor-client-cio:2.3.12")
    implementation("io.ktor:ktor-client-content-negotiation:2.3.12")
    implementation("io.ktor:ktor-serialization-kotlinx-json:2.3.12")
    implementation("org.jetbrains.kotlinx:kotlinx-coroutines-core:1.8.1")
    implementation("org.jetbrains.kotlinx:kotlinx-serialization-json:1.7.1")
}

application {
    mainClass.set("MainKt")
}

We are using the CIO (Coroutine I/O) engine for the Ktor client because it is pure Kotlin, has no native dependencies, and handles streaming well. If you are building an Android app, swap ktor-client-cio for ktor-client-android and pass Android instead of CIO to the HttpClient constructor. Apart from engine-specific settings, the rest of the code stays the same.

Data Classes for the Ollama API

Before making any requests, define the data classes that map to Ollama’s JSON structures. Using kotlinx.serialization keeps everything type-safe and eliminates manual JSON parsing.

import kotlinx.serialization.SerialName
import kotlinx.serialization.Serializable

@Serializable
data class Message(
    val role: String,
    val content: String
)

@Serializable
data class ChatRequest(
    val model: String,
    val messages: List<Message>,
    val stream: Boolean = false
)

@Serializable
data class ChatResponse(
    val model: String,
    val message: Message,
    val done: Boolean,
    @SerialName("done_reason") val doneReason: String? = null
)

@Serializable
data class StreamChunk(
    val model: String,
    val message: Message,
    val done: Boolean
)

The @SerialName annotation on doneReason maps the snake_case JSON field name to a camelCase Kotlin property. This is the only place you need to handle the naming mismatch — everywhere else in your code you use idiomatic Kotlin naming.
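To see the mapping in action, the sketch below decodes a hand-written sample payload. The JSON string is illustrative rather than captured from a real Ollama response, and the classes are trimmed copies of the ones above.

```kotlin
import kotlinx.serialization.SerialName
import kotlinx.serialization.Serializable
import kotlinx.serialization.decodeFromString
import kotlinx.serialization.json.Json

@Serializable
data class Message(val role: String, val content: String)

@Serializable
data class ChatResponse(
    val model: String,
    val message: Message,
    val done: Boolean,
    @SerialName("done_reason") val doneReason: String? = null
)

// Illustrative payload written by hand; real responses carry additional fields.
val sampleJson = """
    {"model":"llama3.2",
     "message":{"role":"assistant","content":"Hi there"},
     "done":true,
     "done_reason":"stop"}
""".trimIndent()

// The snake_case key done_reason ends up in the camelCase property doneReason.
fun decodeSample(): ChatResponse = Json.decodeFromString<ChatResponse>(sampleJson)

fun main() {
    val response = decodeSample()
    println(response.doneReason)       // stop
    println(response.message.content)  // Hi there
}
```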

Basic Chat Completion

Here is a minimal function that sends a prompt to Ollama and returns the full response as a string:

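import io.ktor.client.HttpClient
import io.ktor.client.call.body
import io.ktor.client.engine.cio.CIO
import io.ktor.client.plugins.contentnegotiation.ContentNegotiation
import io.ktor.client.request.post
import io.ktor.client.request.setBody
import io.ktor.http.ContentType
import io.ktor.http.contentType
import io.ktor.serialization.kotlinx.json.json
import kotlinx.coroutines.runBlocking
import kotlinx.serialization.json.Json

val client = HttpClient(CIO) {
    install(ContentNegotiation) {
        // encodeDefaults = true makes sure "stream": false is actually written to the
        // request body; kotlinx.serialization omits default-valued properties otherwise,
        // and Ollama streams the response unless stream is explicitly false.
        json(Json { ignoreUnknownKeys = true; encodeDefaults = true })
    }
}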

suspend fun chat(prompt: String, model: String = "llama3.2"): String {
    val response: ChatResponse = client.post("http://localhost:11434/api/chat") {
        contentType(ContentType.Application.Json)
        setBody(ChatRequest(
            model = model,
            messages = listOf(Message(role = "user", content = prompt)),
            stream = false
        ))
    }.body()
    return response.message.content
}

fun main() = runBlocking {
    val reply = chat("What is the Kotlin coroutines dispatcher?")
    println(reply)
}

The ignoreUnknownKeys = true setting on the JSON serializer is important — Ollama’s API response includes several fields beyond what we have modelled, and without this flag the deserializer would throw on any unrecognised key. This is the standard approach when consuming external APIs with kotlinx.serialization.
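A related serializer setting is worth checking. By default kotlinx.serialization omits properties that still hold their default value, so a request built with stream = false may be sent without a stream field at all, and Ollama's chat endpoint streams unless told otherwise. The sketch below, using trimmed copies of the request classes, shows the difference that encodeDefaults makes:

```kotlin
import kotlinx.serialization.Serializable
import kotlinx.serialization.encodeToString
import kotlinx.serialization.json.Json

@Serializable
data class Message(val role: String, val content: String)

@Serializable
data class ChatRequest(
    val model: String,
    val messages: List<Message>,
    val stream: Boolean = false
)

val request = ChatRequest("llama3.2", listOf(Message("user", "hi")))

// Default configuration: properties equal to their default value are omitted.
val withoutDefaults: String = Json.encodeToString(request)

// encodeDefaults = true writes every property, including stream = false.
val withDefaults: String = Json { encodeDefaults = true }.encodeToString(request)

fun main() {
    println(withoutDefaults) // {"model":"llama3.2","messages":[{"role":"user","content":"hi"}]}
    println(withDefaults)    // {"model":"llama3.2","messages":[{"role":"user","content":"hi"}],"stream":false}
}
```

Setting encodeDefaults = true on the Json instance passed to ContentNegotiation is one way to make sure stream = false reaches Ollama; removing the default value from the stream property is another.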

Streaming Responses with Coroutines

Streaming is where Kotlin’s coroutine model really shines. Rather than waiting for the full response before doing anything, you can process each token as it arrives using a Flow. This integrates naturally with Ktor’s streaming support and gives you a clean, backpressure-aware pipeline.

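// Additional imports for streaming (Ktor 2.x), to go with the ones at the top of the file:
import io.ktor.client.request.preparePost
import io.ktor.client.statement.bodyAsChannel
import io.ktor.utils.io.ByteReadChannel
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flow

// Standalone Json instance used to decode each streamed line by hand.
val json = Json { ignoreUnknownKeys = true }

fun chatStream(prompt: String, model: String = "llama3.2"): Flow<String> = flow {
    client.preparePost("http://localhost:11434/api/chat") {
        contentType(ContentType.Application.Json)
        setBody(ChatRequest(
            model = model,
            messages = listOf(Message(role = "user", content = prompt)),
            stream = true
        ))
    }.execute { response ->
        // Ollama streams one JSON object per line.
        val channel: ByteReadChannel = response.bodyAsChannel()
        while (!channel.isClosedForRead) {
            val line = channel.readUTF8Line() ?: break
            if (line.isBlank()) continue
            val chunk = json.decodeFromString<StreamChunk>(line)
            emit(chunk.message.content)
            if (chunk.done) break
        }
    }
}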

fun main() = runBlocking {
    chatStream("Explain Kotlin sealed classes").collect { token ->
        print(token)
        System.out.flush()
    }
    println()
}

The Flow emits each token string as it arrives from Ollama. The System.out.flush() call after each print is a safeguard: whether partial lines reach the terminal immediately depends on how stdout is buffered in your environment, and an explicit flush guarantees that tokens appear incrementally. In a server context you would typically collect the flow and send each chunk to the client over a WebSocket or SSE connection rather than printing to stdout.
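Callers often want both behaviours at once: show tokens as they arrive and keep the complete reply afterwards. A minimal sketch of that, where fakeTokenStream is a stand-in for chatStream so the example runs without Ollama, and collectReply is a hypothetical helper name:

```kotlin
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flowOf
import kotlinx.coroutines.flow.fold
import kotlinx.coroutines.flow.onEach
import kotlinx.coroutines.runBlocking

// Stand-in for chatStream(): emits a fixed sequence of tokens.
fun fakeTokenStream(): Flow<String> =
    flowOf("Sealed ", "classes ", "restrict ", "inheritance.")

// Forwards each token to onToken as it arrives and returns the concatenated text.
suspend fun collectReply(tokens: Flow<String>, onToken: (String) -> Unit = {}): String =
    tokens
        .onEach { onToken(it) }
        .fold(StringBuilder()) { acc, token -> acc.append(token) }
        .toString()

fun main() = runBlocking {
    val full = collectReply(fakeTokenStream()) { print(it) }
    println()
    println("Total length: ${full.length}")
}
```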

Building a Reusable OllamaClient

For real applications you want a proper client class that encapsulates the HTTP logic, manages the Ktor client lifecycle, and is easy to configure. Here is a clean implementation that covers both blocking and streaming use cases:

class OllamaClient(
    private val baseUrl: String = "http://localhost:11434",
    private val defaultModel: String = "llama3.2"
) {
    private val httpClient = HttpClient(CIO) {
        install(ContentNegotiation) {
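            // encodeDefaults = true ensures default-valued fields such as stream = false are sent.
            json(Json { ignoreUnknownKeys = true; encodeDefaults = true })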
        }
        engine { requestTimeout = 120_000 }
    }
    private val json = Json { ignoreUnknownKeys = true }

    suspend fun chat(messages: List<Message>, model: String = defaultModel): String {
        val response: ChatResponse = httpClient.post("$baseUrl/api/chat") {
            contentType(ContentType.Application.Json)
            setBody(ChatRequest(model = model, messages = messages, stream = false))
        }.body()
        return response.message.content
    }

    fun chatStream(messages: List<Message>, model: String = defaultModel): Flow<String> = flow {
        httpClient.preparePost("$baseUrl/api/chat") {
            contentType(ContentType.Application.Json)
            setBody(ChatRequest(model = model, messages = messages, stream = true))
        }.execute { response ->
            val channel = response.bodyAsChannel()
            while (!channel.isClosedForRead) {
                val line = channel.readUTF8Line() ?: break
                if (line.isBlank()) continue
                val chunk = json.decodeFromString<StreamChunk>(line)
                emit(chunk.message.content)
                if (chunk.done) break
            }
        }
    }

    fun close() = httpClient.close()
}

The requestTimeout = 120_000 setting gives Ollama up to two minutes to start responding before Ktor times out the request. This matters for larger models that take a few seconds to load into memory on first use. You can tune this based on your hardware — machines with the model already loaded rarely need more than 10 to 15 seconds.
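If you later switch engines, an engine-independent alternative is Ktor's HttpTimeout plugin. The sketch below is one possible configuration; buildClientWithTimeouts is a hypothetical helper, and the millisecond values are assumptions to tune for your hardware rather than recommendations from Ollama.

```kotlin
import io.ktor.client.HttpClient
import io.ktor.client.engine.cio.CIO
import io.ktor.client.plugins.HttpTimeout

// Builds a client whose timeouts are configured through the HttpTimeout plugin
// instead of the CIO-specific engine block.
fun buildClientWithTimeouts(): HttpClient = HttpClient(CIO) {
    install(HttpTimeout) {
        connectTimeoutMillis = 5_000     // fail fast if nothing is listening on the port
        requestTimeoutMillis = 120_000   // upper bound for the whole request
        socketTimeoutMillis = 60_000     // maximum silence between two data packets
    }
}
```

Keep in mind that a whole-request limit also applies to streamed responses, so long generations may need a larger value.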

Multi-Turn Conversation

Multi-turn conversations work by accumulating the message history and sending the full list with each request. A simple conversation manager class handles this cleanly:

class Conversation(
    private val client: OllamaClient,
    private val systemPrompt: String? = null,
    private val model: String = "llama3.2"
) {
    private val history = mutableListOf<Message>()

    init {
        systemPrompt?.let { history.add(Message(role = "system", content = it)) }
    }

    suspend fun send(userMessage: String): String {
        history.add(Message(role = "user", content = userMessage))
        val reply = client.chat(history, model)
        history.add(Message(role = "assistant", content = reply))
        return reply
    }

    fun reset() {
        history.clear()
        systemPrompt?.let { history.add(Message(role = "system", content = it)) }
    }
}

fun main() = runBlocking {
    val client = OllamaClient()
    val convo = Conversation(
        client = client,
        systemPrompt = "You are a concise Kotlin expert. Keep answers brief."
    )
    println(convo.send("What is a data class?"))
    println(convo.send("How does it differ from a regular class?"))
    client.close()
}

The system prompt is added once during initialisation and preserved across resets. When reset() is called the history is cleared but the system prompt is immediately re-added, so the model’s persona is maintained even after a conversation is cleared. This is the right pattern for applications where users can start fresh without the underlying configuration changing.

Using Ollama in a Ktor Server

If you are building a backend with Ktor, exposing Ollama through an HTTP endpoint is straightforward. Here is a minimal Ktor application that proxies chat requests to Ollama and streams the response back using Server-Sent Events:

fun main() {
    val ollama = OllamaClient()
    embeddedServer(Netty, port = 8080) {
        install(ContentNegotiation) { json() }
        routing {
            post("/chat") {
                val body = call.receive<ChatRequest>()
                val reply = ollama.chat(body.messages, body.model)
                call.respond(mapOf("reply" to reply))
            }
            post("/chat/stream") {
                val body = call.receive<ChatRequest>()
                call.respondTextWriter(contentType = ContentType.Text.EventStream) {
                    ollama.chatStream(body.messages, body.model).collect { token ->
                        write("data: $token\n\n")
                        flush()
                    }
                }
            }
        }
    }.start(wait = true)
}

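The /chat/stream endpoint uses SSE framing: each token is prefixed with data: and followed by two newlines. Because the route is a POST, a browser’s native EventSource, which only issues GET requests, cannot call it directly; a browser client would read the stream with fetch or an SSE library that supports POST. Tokens that contain newline characters also need to be split across several data: lines, otherwise they end the event early. The flush() call after each write is essential: without it Ktor buffers the response and the client sees nothing until generation is complete.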
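Model output can itself contain newlines, which would terminate an SSE event early if written verbatim. A minimal sketch of a framing helper (sseEvent is a hypothetical name, not part of Ktor) that emits one data: field per line of the payload:

```kotlin
// Formats one SSE event. Each line of the payload gets its own "data:" field,
// and a blank line terminates the event.
fun sseEvent(payload: String): String =
    payload.lines().joinToString(separator = "") { "data: $it\n" } + "\n"

fun main() {
    print(sseEvent("Hello"))           // one data: field, then a blank line
    print(sseEvent("line 1\nline 2"))  // two data: fields, then a blank line
}
```

Inside the streaming route, write(sseEvent(token)) would then replace the hand-built string.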

Using Ollama on Android

The same OllamaClient class works on Android with minimal changes. Swap the CIO engine for ktor-client-android in your dependencies, and call the client from a coroutine scope — never from the main thread. A typical ViewModel looks like this:

class ChatViewModel : ViewModel() {
    private val ollama = OllamaClient(baseUrl = "http://192.168.1.100:11434")
    private val convo = Conversation(ollama)

    val messages = MutableStateFlow<List<Pair<String, String>>>(emptyList())
    val isLoading = MutableStateFlow(false)

    fun send(userInput: String) {
        viewModelScope.launch {
            isLoading.value = true
            val reply = convo.send(userInput)
            messages.value = messages.value + (userInput to reply)
            isLoading.value = false
        }
    }

    override fun onCleared() = ollama.close()
}

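On Android, Ollama runs on a separate machine on your local network, since a phone rarely has the resources to run these models itself. Point baseUrl at your desktop or server’s local IP address, make sure Ollama is bound to 0.0.0.0 rather than just localhost by setting OLLAMA_HOST=0.0.0.0 before starting it, and ensure your firewall allows traffic on port 11434. The app also needs the INTERNET permission, and because Android 9 and later block cleartext HTTP by default, you must explicitly allow it for this host, for example through a network security configuration.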

Error Handling

Production code needs to handle network failures and unexpected responses gracefully. Wrapping Ollama calls in runCatching and mapping exceptions to meaningful messages keeps error handling explicit without forcing callers to use try-catch everywhere:

suspend fun safeChat(prompt: String): Result<String> = runCatching {
    chat(listOf(Message("user", prompt)))
}.onFailure { e ->
    when (e) {
        is java.net.ConnectException -> println("Ollama is not running. Start with: ollama serve")
        is io.ktor.client.plugins.HttpRequestTimeoutException -> println("Timed out — model may still be loading")
        else -> println("Unexpected error: ${e.message}")
    }
}

Result comes from the Kotlin standard library. The caller can check result.isSuccess or call result.getOrDefault("Sorry, something went wrong.") to handle the failure case in the UI layer without additional exception-handling boilerplate. One caveat: runCatching catches every Throwable, including the CancellationException that coroutines use for cancellation, so code that must stay cancellable should rethrow it.
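Because runCatching catches every Throwable, it also swallows the CancellationException that coroutines rely on. One possible variant that lets cancellation propagate is sketched below; runCatchingCancellable is a hypothetical helper, not a standard library function.

```kotlin
import kotlin.coroutines.cancellation.CancellationException

// Like runCatching, but rethrows CancellationException so coroutine
// cancellation is not silently converted into a failed Result.
inline fun <T> runCatchingCancellable(block: () -> T): Result<T> =
    try {
        Result.success(block())
    } catch (e: CancellationException) {
        throw e
    } catch (e: Throwable) {
        Result.failure(e)
    }

fun main() {
    // A failing call falls back to the default message.
    val failed: Result<String> = runCatchingCancellable { error("connection refused") }
    println(failed.getOrDefault("Sorry, something went wrong."))  // Sorry, something went wrong.

    // A successful call returns its value and ignores the default.
    val ok = runCatchingCancellable { "Hello!" }
    println(ok.getOrDefault("unused"))  // Hello!
}
```

Since the function is inline, the block may call suspend functions when it is used inside a suspend function.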

Generating Embeddings

Ollama also exposes an embeddings endpoint that returns a vector representation of any input text. This is useful for semantic search, clustering, or feeding into a RAG pipeline. Add these data classes and a method to your client:

@Serializable
data class EmbedRequest(val model: String, val input: String)

@Serializable
data class EmbedResponse(val embeddings: List<List<Double>>)

suspend fun embed(text: String, model: String = "nomic-embed-text"): List<Double> {
    val response: EmbedResponse = httpClient.post("$baseUrl/api/embed") {
        contentType(ContentType.Application.Json)
        setBody(EmbedRequest(model = model, input = text))
    }.body()
    return response.embeddings.first()
}

Pull the embedding model first with ollama pull nomic-embed-text. It is a lightweight model optimised for retrieval tasks and produces 768-dimensional vectors. If you need higher-dimensional embeddings for better semantic fidelity, mxbai-embed-large produces 1024-dimensional vectors and is also available through Ollama.
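For semantic search, embeddings are usually compared with cosine similarity. A minimal sketch, assuming the vectors are held as List<Double>; cosineSimilarity is a hypothetical helper and the two-dimensional inputs are toy values, not real embeddings:

```kotlin
import kotlin.math.sqrt

// Cosine similarity of two equal-length vectors, in the range [-1, 1].
// Returns 0.0 if either vector has zero length, to avoid dividing by zero.
fun cosineSimilarity(a: List<Double>, b: List<Double>): Double {
    require(a.size == b.size) { "Vectors must have the same dimension" }
    var dot = 0.0
    var normA = 0.0
    var normB = 0.0
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    if (normA == 0.0 || normB == 0.0) return 0.0
    return dot / (sqrt(normA) * sqrt(normB))
}

fun main() {
    println(cosineSimilarity(listOf(1.0, 0.0), listOf(1.0, 0.0)))  // 1.0
    println(cosineSimilarity(listOf(1.0, 0.0), listOf(0.0, 1.0)))  // 0.0
}
```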

Testing with MockEngine

Ktor’s MockEngine makes it easy to unit test your Ollama client without running a real Ollama instance. Add the io.ktor:ktor-client-mock artifact (and kotlinx-coroutines-test for runTest) as test dependencies, refactor OllamaClient to accept an optional HttpClient parameter in its constructor, then inject a mock client in tests:

val mockEngine = MockEngine { _ ->
    respond(
        content = """{"model":"llama3.2","message":{"role":"assistant","content":"Hello!"},"done":true}""",
        status = HttpStatusCode.OK,
        headers = headersOf(HttpHeaders.ContentType, "application/json")
    )
}

@Test
fun testChatParsesResponse() = runTest {
    val mockClient = HttpClient(mockEngine) {
        install(ContentNegotiation) { json(Json { ignoreUnknownKeys = true }) }
    }
    val ollama = OllamaClient(httpClient = mockClient)
    val result = ollama.chat(listOf(Message("user", "Hi")))
    assertEquals("Hello!", result)
}

This pattern leaves production behaviour unchanged, since the default client is still used when nothing is injected, while making the client fully testable without any running infrastructure. The mock engine intercepts all HTTP requests and returns fixed responses, so your tests run fast and deterministically regardless of whether Ollama is installed on the CI machine.
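The constructor refactor mentioned above might look like the following sketch. It is trimmed to the chat method, repeats the data classes so it stands alone, and uses a hypothetical defaultHttpClient helper for the production configuration.

```kotlin
import io.ktor.client.HttpClient
import io.ktor.client.call.body
import io.ktor.client.engine.cio.CIO
import io.ktor.client.plugins.contentnegotiation.ContentNegotiation
import io.ktor.client.request.post
import io.ktor.client.request.setBody
import io.ktor.http.ContentType
import io.ktor.http.contentType
import io.ktor.serialization.kotlinx.json.json
import kotlinx.serialization.Serializable
import kotlinx.serialization.json.Json

@Serializable
data class Message(val role: String, val content: String)

@Serializable
data class ChatRequest(val model: String, val messages: List<Message>, val stream: Boolean = false)

@Serializable
data class ChatResponse(val model: String, val message: Message, val done: Boolean)

// Builds the client used in production when no other client is injected.
fun defaultHttpClient(): HttpClient = HttpClient(CIO) {
    install(ContentNegotiation) {
        json(Json { ignoreUnknownKeys = true; encodeDefaults = true })
    }
}

class OllamaClient(
    private val baseUrl: String = "http://localhost:11434",
    private val defaultModel: String = "llama3.2",
    // Tests pass a client backed by MockEngine; production code omits the argument.
    private val httpClient: HttpClient = defaultHttpClient()
) {
    suspend fun chat(messages: List<Message>, model: String = defaultModel): String {
        val response: ChatResponse = httpClient.post("$baseUrl/api/chat") {
            contentType(ContentType.Application.Json)
            setBody(ChatRequest(model = model, messages = messages, stream = false))
        }.body()
        return response.message.content
    }

    fun close() = httpClient.close()
}
```

Existing call sites such as OllamaClient() or OllamaClient(baseUrl = "...") keep compiling because the new parameter has a default.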

Choosing the Right Model for Your Use Case

Not every Kotlin application needs the same model. For a developer assistant integrated into a Ktor backend where latency matters, a smaller model like llama3.2:3b responds quickly and handles most coding and question-answering tasks well. For an Android app on a local network, the bottleneck is usually the WiFi connection to the machine running Ollama rather than generation speed, so you can afford a larger model without noticeably impacting the user experience.

For code-specific tasks — explaining Kotlin functions, generating boilerplate, reviewing snippets — models like qwen2.5-coder:7b tend to outperform general-purpose models of the same size. They are fine-tuned on code and produce more accurate, idiomatic output for programming tasks. Pull them with ollama pull qwen2.5-coder:7b and pass the model name as a parameter to your OllamaClient calls.

If your application serves multiple users simultaneously, for example a Ktor API used by a team, keep in mind that Ollama handles only a limited number of requests per model at once. Depending on the Ollama version and settings such as OLLAMA_NUM_PARALLEL, that limit can be as low as one. Additional requests are queued internally, so concurrent calls normally wait rather than fail. For a small team this is usually acceptable. For higher concurrency, consider running multiple Ollama instances behind a load balancer, or switching to a model small enough that generation time stays under a second or two per request.
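On the application side, you can also cap how many calls are in flight so that queueing happens in your own code, where it is visible. A minimal sketch using a coroutine Semaphore; LimitedCaller and observedPeak are hypothetical names, and delay stands in for a real call to OllamaClient.chat:

```kotlin
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.sync.Semaphore
import kotlinx.coroutines.sync.withPermit
import java.util.concurrent.atomic.AtomicInteger

// Wraps a suspending call so that at most maxConcurrent invocations run at once.
class LimitedCaller(maxConcurrent: Int) {
    private val permits = Semaphore(maxConcurrent)
    suspend fun <T> call(block: suspend () -> T): T = permits.withPermit { block() }
}

// Runs `jobs` simulated requests and returns the highest concurrency observed.
suspend fun observedPeak(jobs: Int, maxConcurrent: Int): Int = coroutineScope {
    val limiter = LimitedCaller(maxConcurrent)
    val running = AtomicInteger(0)
    val peak = AtomicInteger(0)
    (1..jobs).map {
        async {
            limiter.call {
                val now = running.incrementAndGet()
                peak.updateAndGet { current -> maxOf(current, now) }
                delay(20) // stand-in for a call to OllamaClient.chat
                running.decrementAndGet()
            }
        }
    }.awaitAll()
    peak.get()
}

fun main() = runBlocking {
    println("Peak concurrency: ${observedPeak(jobs = 10, maxConcurrent = 2)}")
}
```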

Structured Output with JSON Schema

Ollama supports constrained generation using JSON schema, which forces the model to produce output that matches a specific structure. This is useful when you need reliable, parseable output, for example extracting structured data from text, classifying content, or generating configuration objects. Pass the schema in the format field of your request body alongside the messages. With JSON schema mode enabled, the response content is constrained to valid JSON conforming to your schema, so Json.decodeFromString should succeed on any complete response. Keeping a guard for truncated output, for example when generation stops at a token limit, is still prudent.

This is significantly more reliable than asking the model to respond in JSON format via the prompt, which works most of the time but occasionally produces malformed output or extra prose around the JSON. For production applications where downstream code depends on parsing the model’s output, using schema-constrained generation is the right approach.
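A request carrying a schema could be modelled as in the sketch below. StructuredChatRequest, Sentiment, and the schema itself are hypothetical examples introduced for illustration; only the format field name comes from the description above.

```kotlin
import kotlinx.serialization.Serializable
import kotlinx.serialization.decodeFromString
import kotlinx.serialization.encodeToString
import kotlinx.serialization.json.Json
import kotlinx.serialization.json.JsonObject
import kotlinx.serialization.json.add
import kotlinx.serialization.json.buildJsonObject
import kotlinx.serialization.json.put
import kotlinx.serialization.json.putJsonArray
import kotlinx.serialization.json.putJsonObject

@Serializable
data class Message(val role: String, val content: String)

// Hypothetical request type: the chat request extended with a `format` field.
@Serializable
data class StructuredChatRequest(
    val model: String,
    val messages: List<Message>,
    val format: JsonObject,
    val stream: Boolean
)

// Hypothetical target type for the model's answer.
@Serializable
data class Sentiment(val label: String, val confidence: Double)

// JSON schema describing the Sentiment shape.
val sentimentSchema: JsonObject = buildJsonObject {
    put("type", "object")
    putJsonObject("properties") {
        putJsonObject("label") { put("type", "string") }
        putJsonObject("confidence") { put("type", "number") }
    }
    putJsonArray("required") { add("label"); add("confidence") }
}

fun buildRequest(text: String) = StructuredChatRequest(
    model = "llama3.2",
    messages = listOf(Message("user", "Classify the sentiment of: $text")),
    format = sentimentSchema,
    stream = false
)

// Parsing can still fail, for example on truncated output, so return null on error.
fun parseSentiment(content: String): Sentiment? =
    runCatching { Json.decodeFromString<Sentiment>(content) }.getOrNull()

fun main() {
    println(Json.encodeToString(buildRequest("I love coroutines")))
    println(parseSentiment("""{"label":"positive","confidence":0.9}"""))
}
```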

Keeping the Integration Maintainable

As your Ollama integration grows, a few architectural habits keep the code maintainable. Keep the OllamaClient as a single shared instance rather than creating a new one per request — the underlying Ktor client manages a connection pool and creating a new instance per request wastes resources and adds latency. In a Spring Boot application, register it as a @Bean; in Ktor, create it at application startup and pass it through to your route handlers via dependency injection or a simple top-level property.

Define your prompts as constants or load them from files rather than embedding them as strings directly in your business logic. This makes it easy to iterate on prompts without touching application code, and if you want to move to a different model later, you can test prompt variations in isolation. A simple approach is a prompts/ directory in your resources folder with one file per prompt template, loaded at startup with javaClass.getResourceAsStream.
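One way to implement this is sketched below. The prompts/<name>.txt layout and the {{placeholder}} syntax are assumptions made for the example, and renderPrompt and Prompts are hypothetical names.

```kotlin
// Replaces {{name}} placeholders with values from the map.
// Placeholders without a matching key are left untouched.
fun renderPrompt(template: String, values: Map<String, String>): String =
    Regex("""\{\{(\w+)\}\}""").replace(template) { match ->
        values[match.groupValues[1]] ?: match.value
    }

object Prompts {
    // Reads prompts/<name>.txt from the classpath and fails loudly if it is missing.
    fun load(name: String): String =
        Prompts::class.java.getResourceAsStream("/prompts/$name.txt")
            ?.bufferedReader()
            ?.use { it.readText() }
            ?: error("Prompt template not found: $name")
}

fun main() {
    val template = "Explain {{topic}} to a {{audience}}."
    println(renderPrompt(template, mapOf("topic" to "Flow", "audience" to "beginner")))
    // Explain Flow to a beginner.
}
```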

Finally, consider adding request logging at the OllamaClient level using Ktor’s Logging plugin during development. Seeing exactly what messages are sent to Ollama and what comes back makes debugging prompt issues much faster than trying to infer what happened from the model’s output alone. Disable verbose logging in production and keep only error-level logs to avoid leaking user data into application logs.
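A possible configuration is sketched below. It assumes the io.ktor:ktor-client-logging artifact has been added to the build, which the dependency list earlier in this guide does not include, and buildLoggingClient is a hypothetical helper. Ktor's LogLevel has no error-only setting, so the non-verbose branch turns the plugin's output off entirely.

```kotlin
import io.ktor.client.HttpClient
import io.ktor.client.engine.cio.CIO
import io.ktor.client.plugins.logging.DEFAULT
import io.ktor.client.plugins.logging.LogLevel
import io.ktor.client.plugins.logging.Logger
import io.ktor.client.plugins.logging.Logging

// verbose = true logs full request and response bodies (development only);
// verbose = false disables the plugin's output so prompts do not reach the logs.
fun buildLoggingClient(verbose: Boolean): HttpClient = HttpClient(CIO) {
    install(Logging) {
        logger = Logger.DEFAULT
        level = if (verbose) LogLevel.BODY else LogLevel.NONE
    }
}
```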

Next Steps

With a working OllamaClient in place, the most productive next step is adding model selection to your application so users or configuration can switch between models without a code change. Store the model name in your application config, such as an application.conf file in Ktor or an environment variable, and pass it into OllamaClient at startup. This lets you experiment with different models for different tasks: a small fast model for low-latency autocomplete, a larger model for complex reasoning, and a code-tuned model for anything involving source code.
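For the environment-variable route, a small resolver keeps the lookup in one place. In this sketch OLLAMA_MODEL is a hypothetical variable name chosen for the example, not one that Ollama itself reads.

```kotlin
// Picks the model name from the environment, falling back to a default
// when the variable is missing or blank.
fun resolveModel(
    env: Map<String, String> = System.getenv(),
    default: String = "llama3.2"
): String = env["OLLAMA_MODEL"]?.takeIf { it.isNotBlank() } ?: default

fun main() {
    println(resolveModel(mapOf("OLLAMA_MODEL" to "qwen2.5-coder:7b")))  // qwen2.5-coder:7b
    println(resolveModel(emptyMap()))                                   // llama3.2
}
```

The result can then be passed as defaultModel when constructing OllamaClient.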

If you are building a Kotlin backend that other services or a frontend will consume, consider adding a thin caching layer in front of your Ollama calls for prompts that are frequently repeated with the same input. A simple ConcurrentHashMap keyed by the prompt hash works for development; for production, a short-lived Redis cache avoids redundant generation and keeps response times consistent even when the same question is asked repeatedly by different users.
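The in-memory variant can be sketched in a few lines. CachedChat is a hypothetical wrapper; it keys on the exact prompt text rather than a hash for simplicity, and the generate parameter stands in for a call such as OllamaClient.chat.

```kotlin
import java.util.concurrent.ConcurrentHashMap
import kotlinx.coroutines.runBlocking

// Caches replies by exact prompt text. Entries never expire in this sketch.
class CachedChat(private val generate: suspend (String) -> String) {
    private val cache = ConcurrentHashMap<String, String>()

    suspend fun ask(prompt: String): String {
        cache[prompt]?.let { return it }
        val reply = generate(prompt)
        // putIfAbsent keeps the first stored reply if two callers raced on the same prompt.
        return cache.putIfAbsent(prompt, reply) ?: reply
    }
}

fun main() = runBlocking {
    var calls = 0
    val chat = CachedChat { prompt -> calls++; "echo: $prompt" }
    chat.ask("hello")
    chat.ask("hello")
    println("Model invoked $calls time(s)")  // Model invoked 1 time(s)
}
```

Caching only makes sense when identical prompts should yield identical answers, so it fits best with deterministic settings or FAQ-style questions.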
