How to Use Ollama with Scala

Scala is a powerful language that sits at the intersection of object-oriented and functional programming, widely used for data engineering, distributed systems, and backend services. Ollama exposes a simple HTTP API that any Scala HTTP client can call, and Scala’s strong type system lets you model the request and response shapes precisely with case classes. This guide covers connecting Scala applications to Ollama using Sttp and Circe, streaming responses, building a reusable client, and patterns for Cats Effect and ZIO codebases.

Scala’s functional programming idioms pair well with LLM API design: use Either for error handling without exceptions, model streaming token output as a composable pipeline, and express the full request lifecycle as a chain of pure transformations. The patterns here work across Scala 2.13 and Scala 3 with only minor syntax differences; the @main entry points in the examples are Scala 3 syntax, so on 2.13 wrap them in an object with a main method instead.

Project Setup with sbt

Add the following to your build.sbt. We use Sttp for HTTP, Circe for JSON, and the Sttp Circe integration to wire them together:

libraryDependencies ++= Seq(
  "com.softwaremill.sttp.client4" %% "core"           % "4.0.0",
  "com.softwaremill.sttp.client4" %% "circe"          % "4.0.0",
  "com.softwaremill.sttp.client4" %% "okhttp-backend" % "4.0.0",
  "io.circe"                      %% "circe-generic"  % "0.14.9",
  "io.circe"                      %% "circe-parser"   % "0.14.9"
)

Sttp’s OkHttp backend runs synchronous requests on the JVM without requiring Akka or other heavyweight runtimes. For applications already using Cats Effect, add the sttp cats module and swap to sttp.client4.httpclient.cats.HttpClientCatsBackend; for ZIO, add the sttp zio module and use sttp.client4.httpclient.zio.HttpClientZioBackend. The request description stays the same regardless of backend; only the type returned by send changes (a plain Response, an IO, or a ZIO Task).

Defining Types with Circe

Define case classes for the API shapes. Circe’s semi-automatic derivation generates JSON encoders and decoders with a single import:

import io.circe.generic.semiauto._
import io.circe.{Decoder, Encoder}

case class Message(role: String, content: String)
object Message {
  implicit val enc: Encoder[Message] = deriveEncoder
  implicit val dec: Decoder[Message] = deriveDecoder
}

case class ChatRequest(model: String, messages: List[Message], stream: Boolean)
object ChatRequest { implicit val enc: Encoder[ChatRequest] = deriveEncoder }

case class ChatResponse(model: String, message: Message, done: Boolean)
object ChatResponse { implicit val dec: Decoder[ChatResponse] = deriveDecoder }

In Scala 3 you can use the derives keyword instead: case class Message(role: String, content: String) derives Encoder.AsObject, Decoder. Both produce identical runtime behaviour — the choice is a matter of Scala version and team style preference.

Basic Chat Completion

Here is a synchronous chat function using Sttp’s OkHttp backend:

import sttp.client4._
import sttp.client4.circe._
import sttp.client4.okhttp.OkHttpSyncBackend
import scala.concurrent.duration._

val backend = OkHttpSyncBackend()
val baseUrl = "http://localhost:11434"

def chat(prompt: String, model: String = "llama3.2"): Either[String, String] = {
  val req = ChatRequest(model, List(Message("user", prompt)), stream = false)
  val response = basicRequest
    .post(uri"$baseUrl/api/chat")
    .body(asJson(req)) // sttp 4 has no implicit body serialisers; encode with Circe explicitly
    .response(asJson[ChatResponse])
    .readTimeout(120.seconds)
    .send(backend)
  response.body.map(_.message.content).left.map(_.getMessage)
}

@main def run(): Unit =
  chat("What is Scala's type system?") match {
    case Right(reply) => println(reply)
    case Left(err)    => println(s"Error: $err")
  }

The asJson[ChatResponse] handler deserialises the response body with Circe. The 120-second read timeout accommodates large models: Sttp’s default read timeout is one minute, which a slow model or a cold model load can exceed, causing spurious failures. The .left.map(_.getMessage) call converts Sttp’s ResponseException to a plain string, giving callers a clean Either[String, String] to pattern match on.

Streaming Responses

For streaming, set stream = true and read the newline-delimited JSON response line by line. The example below uses Java’s built-in HttpURLConnection rather than Sttp and processes each chunk as it arrives:

import io.circe.parser.decode
import io.circe.syntax._ // provides .asJson on any value with an Encoder
import scala.io.Source

def chatStream(prompt: String, onToken: String => Unit, model: String = "llama3.2"): Unit = {
  val req = ChatRequest(model, List(Message("user", prompt)), stream = true)
  val reqBody = req.asJson.noSpaces

  // URI.create(...).toURL avoids the URL constructor, which is deprecated on recent JDKs
  val connection = java.net.URI.create(s"$baseUrl/api/chat").toURL.openConnection()
      .asInstanceOf[java.net.HttpURLConnection]
  connection.setRequestMethod("POST")
  connection.setRequestProperty("Content-Type", "application/json")
  connection.setDoOutput(true)
  connection.setReadTimeout(120000)
  connection.getOutputStream.write(reqBody.getBytes("UTF-8"))

  val source = Source.fromInputStream(connection.getInputStream, "UTF-8")
  try {
    source.getLines().foreach { line =>
      if (line.nonEmpty) {
        decode[ChatResponse](line).foreach { chunk =>
          onToken(chunk.message.content)
        }
      }
    }
  } finally {
    source.close()
    connection.disconnect()
  }
}

@main def stream(): Unit =
  chatStream("Explain Scala implicits", token => { print(token); Console.flush() })

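This uses Java’s built-in HttpURLConnection for streaming, so no streaming-capable Sttp backend is needed. Source.fromInputStream reads the response line by line, and decode[ChatResponse] from Circe’s parser module returns a Right for lines that match the case class and a Left for anything else. Calling foreach on that Either skips lines that fail to decode, which keeps the loop short but also hides any error object the server sends, so log the Left case in production code. Note that getInputStream throws an IOException when the server responds with an error status.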
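The loop above drops every line that fails to decode. A slightly stricter variant, sketched here, classifies each line instead. It assumes Ollama reports failures as a JSON object with a single error string field; parseLine and the StreamLines wrapper are hypothetical names, and the case classes are local copies so the block compiles on its own with only Circe on the classpath.

```scala
import io.circe.Decoder
import io.circe.generic.semiauto.deriveDecoder
import io.circe.parser.parse

object StreamLines {
  // Local copies of the guide's types so the sketch is self-contained.
  final case class Message(role: String, content: String)
  final case class ChatResponse(model: String, message: Message, done: Boolean)

  implicit val messageDecoder: Decoder[Message] = deriveDecoder
  implicit val chatResponseDecoder: Decoder[ChatResponse] = deriveDecoder

  /** None for blank lines, Left for malformed lines or error objects, Right for a chunk. */
  def parseLine(line: String): Option[Either[String, ChatResponse]] =
    if (line.trim.isEmpty) None
    else Some(
      parse(line).left.map(_.getMessage).flatMap { json =>
        // Assumption: an error response looks like {"error": "some message"}.
        json.hcursor.get[String]("error") match {
          case Right(msg) => Left(s"Ollama error: $msg")
          case Left(_)    => json.as[ChatResponse].left.map(_.getMessage)
        }
      }
    )
}
```

Inside the streaming loop, a call such as StreamLines.parseLine(line) could replace the bare decode, forwarding Right chunks to onToken and logging Left values rather than discarding them.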

Building a Reusable OllamaClient

Wrap the HTTP logic into a class that holds configuration and the shared backend:

class OllamaClient(baseUrl: String = "http://localhost:11434",
                   defaultModel: String = "llama3.2") {

  private val backend = OkHttpSyncBackend()

  def chat(messages: List[Message], model: String = defaultModel): Either[String, String] = {
    val req = ChatRequest(model, messages, stream = false)
    basicRequest
      .post(uri"$baseUrl/api/chat")
      .body(asJson(req)) // sttp 4: encode the body explicitly with Circe
      .response(asJson[ChatResponse])
      .readTimeout(120.seconds)
      .send(backend)
      .body
      .map(_.message.content)
      .left.map(_.getMessage)
  }

  def embed(text: String, model: String = "nomic-embed-text"): Either[String, List[Double]] = {
    case class EmbedReq(model: String, input: String)
    case class EmbedResp(embeddings: List[List[Double]])
    import io.circe.generic.auto._
    basicRequest
      .post(uri"$baseUrl/api/embed")
      .body(asJson(EmbedReq(model, text))) // sttp 4: encode the body explicitly with Circe
      .response(asJson[EmbedResp])
      .readTimeout(60.seconds)
      .send(backend)
      .body
      .map(_.embeddings.headOption.getOrElse(Nil))
      .left.map(_.getMessage)
  }

  def close(): Unit = backend.close()
}

Using io.circe.generic.auto._ for the inline case classes avoids the boilerplate of companion object instances for types that are only used in one place. The shared backend pools connections across requests, so multiple sequential calls reuse the same TCP connection rather than opening a new one each time.

Multi-Turn Conversation

Wrap the client in a conversation class that tracks history:

class Conversation(client: OllamaClient,
                   systemPrompt: Option[String] = None,
                   model: String = "llama3.2") {

  private var history: List[Message] =
    systemPrompt.map(p => Message("system", p)).toList

  def send(userMessage: String): Either[String, String] = {
    history = history :+ Message("user", userMessage)
    client.chat(history, model) match {
      case Right(reply) =>
        history = history :+ Message("assistant", reply)
        Right(reply)
      case Left(err) =>
        history = history.dropRight(1) // remove failed user message
        Left(err)
    }
  }

  def reset(): Unit =
    history = systemPrompt.map(p => Message("system", p)).toList
}

@main def converse(): Unit = {
  val client = new OllamaClient()
  val convo  = new Conversation(client, Some("You are a concise Scala expert."))
  println(convo.send("What is a type class?"))
  println(convo.send("Give me a concrete example."))
  client.close()
}

Rolling back the history on error — dropping the user message if the chat call fails — keeps the history in a consistent state so the next call does not send a dangling user turn with no assistant response. This is an important detail for multi-turn reliability: a failed request should leave the conversation in the same state as before the request was made.

Using Cats Effect

For applications built on Cats Effect, wrap the client in IO and use the async Sttp backend:

import cats.effect._
import scala.concurrent.duration._
import sttp.client4._
import sttp.client4.circe._
import sttp.client4.httpclient.cats.HttpClientCatsBackend

object OllamaCats extends IOApp.Simple {

  def chat(prompt: String): IO[Either[String, String]] =
    HttpClientCatsBackend.resource[IO]().use { backend =>
      val req = ChatRequest("llama3.2", List(Message("user", prompt)), stream = false)
      basicRequest
        .post(uri"http://localhost:11434/api/chat")
        .body(asJson(req)) // sttp 4: encode the body explicitly with Circe
        .response(asJson[ChatResponse])
        .readTimeout(120.seconds)
        .send(backend)
        .map(_.body.map(_.message.content).left.map(_.getMessage))
    }

  def run: IO[Unit] =
    chat("Explain Cats Effect IO").flatMap {
      case Right(reply) => IO.println(reply)
      case Left(err)    => IO.println(s"Error: $err")
    }
}

The resource pattern ensures the HTTP backend is properly closed when the IO completes, even if an exception is thrown. For production use, create the backend once at application startup with Resource and pass it to every component that calls Ollama (constructor or function parameters work well), rather than creating a new backend per request.
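As a rough sketch of the shared-backend advice, the version below acquires the backend once in run and hands it to chat as a parameter. It assumes the sttp cats module is on the classpath and uses the sttp 4 form .body(asJson(...)); SharedBackendSketch and the two prompts are illustrative names only, and the case classes are repeated so the block stands alone.

```scala
import cats.effect.{IO, IOApp}
import cats.syntax.all._
import io.circe.{Decoder, Encoder}
import io.circe.generic.semiauto.{deriveDecoder, deriveEncoder}
import scala.concurrent.duration._
import sttp.client4._
import sttp.client4.circe._
import sttp.client4.httpclient.cats.HttpClientCatsBackend

object SharedBackendSketch extends IOApp.Simple {
  final case class Message(role: String, content: String)
  final case class ChatRequest(model: String, messages: List[Message], stream: Boolean)
  final case class ChatResponse(model: String, message: Message, done: Boolean)

  implicit val messageEncoder: Encoder[Message] = deriveEncoder
  implicit val messageDecoder: Decoder[Message] = deriveDecoder
  implicit val requestEncoder: Encoder[ChatRequest] = deriveEncoder
  implicit val responseDecoder: Decoder[ChatResponse] = deriveDecoder

  // The backend is a parameter, so every call reuses the instance created in `run`.
  def chat(backend: Backend[IO], prompt: String): IO[Either[String, String]] =
    basicRequest
      .post(uri"http://localhost:11434/api/chat")
      .body(asJson(ChatRequest("llama3.2", List(Message("user", prompt)), stream = false)))
      .response(asJson[ChatResponse])
      .readTimeout(120.seconds)
      .send(backend)
      .map(_.body.map(_.message.content).left.map(_.getMessage))

  def run: IO[Unit] =
    // Acquired once; released when `use` finishes, whether it succeeds or fails.
    HttpClientCatsBackend.resource[IO]().use { backend =>
      List("What is a Resource?", "What is a Fiber?")
        .traverse(chat(backend, _))
        .flatMap(_.traverse_ {
          case Right(reply) => IO.println(reply)
          case Left(err)    => IO.println(s"Error: $err")
        })
    }
}
```

In a larger application the same backend value would typically be passed into service constructors at startup, so its lifetime matches the application's.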

Using ZIO

The ZIO version follows the same structure with ZIO-specific types. Use HttpClientZioBackend and return ZIO[Any, String, String] to make the error type explicit in the signature:

import zio._
import sttp.client4._
import sttp.client4.circe._
import sttp.client4.httpclient.zio.HttpClientZioBackend

object OllamaZio extends ZIOAppDefault {

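  def chat(prompt: String): ZIO[Any, String, String] =
    // scoped() requires a Scope and can fail with a Throwable, so close the
    // scope here and map that failure into the String error channel.
    ZIO.scoped {
      HttpClientZioBackend.scoped().mapError(_.getMessage).flatMap { backend =>
        val req = ChatRequest("llama3.2", List(Message("user", prompt)), stream = false)
        basicRequest
          .post(uri"http://localhost:11434/api/chat")
          .body(asJson(req)) // sttp 4: encode the body explicitly with Circe
          .response(asJson[ChatResponse])
          // Spelled out in full: with zio._ imported, 120.seconds is a java.time.Duration
          .readTimeout(scala.concurrent.duration.Duration(120, "seconds"))
          .send(backend)
          .mapError(_.getMessage)
          .flatMap(resp => ZIO.fromEither(
            resp.body.map(_.message.content).left.map(_.getMessage)
          ))
      }
    }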

  def run: ZIO[Any, Any, Unit] =
    chat("What is ZIO?").foldZIO(
      err   => Console.printLine(s"Error: $err"),
      reply => Console.printLine(reply)
    )
}

Encoding the error type as String in the ZIO signature rather than using Throwable keeps error handling explicit and forces callers to handle failure cases at the type level. For more complex applications with multiple error types, define a sealed trait for Ollama errors and use ZIO[Any, OllamaError, String] to make the full error vocabulary visible in the type signature.
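One possible shape for such a sealed trait is sketched below using only the standard library. The case names and the fromThrowable and fromStatus helpers are hypothetical, and treating HTTP 404 as "model not found" is an assumption about Ollama's behaviour rather than something this guide establishes.

```scala
import java.net.{ConnectException, SocketTimeoutException}
import java.net.http.HttpTimeoutException
import scala.annotation.tailrec

object OllamaErrors {
  sealed trait OllamaError { def message: String }
  final case class Unreachable(message: String) extends OllamaError
  final case class TimedOut(message: String) extends OllamaError
  final case class ModelNotFound(model: String) extends OllamaError {
    def message: String = s"model '$model' is not available"
  }
  final case class Unexpected(message: String) extends OllamaError

  /** Classifies a failure, walking the cause chain because HTTP clients often wrap JDK exceptions. */
  @tailrec
  def fromThrowable(t: Throwable): OllamaError = t match {
    case e: SocketTimeoutException => TimedOut(describe(e))
    case e: HttpTimeoutException   => TimedOut(describe(e))
    case e: ConnectException       => Unreachable(describe(e))
    case e if e.getCause != null && (e.getCause ne e) => fromThrowable(e.getCause)
    case e => Unexpected(describe(e))
  }

  /** Assumption: a 404 from the chat endpoint means the requested model is missing. */
  def fromStatus(status: Int, model: String, body: String): OllamaError =
    if (status == 404) ModelNotFound(model) else Unexpected(s"HTTP $status: $body")

  private def describe(t: Throwable): String =
    Option(t.getMessage).getOrElse(t.getClass.getSimpleName)
}
```

In the ZIO example, replacing .mapError(_.getMessage) with .mapError(OllamaErrors.fromThrowable) would turn the signature into ZIO[Any, OllamaError, String], and callers could then pattern match on the specific failure.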

Testing with ScalaTest and WireMock

Use WireMock to stub the Ollama HTTP endpoint in tests so your suite runs without a real Ollama instance:

// build.sbt
libraryDependencies ++= Seq(
  "com.github.tomakehurst" % "wiremock-jre8" % "2.35.2" % Test,
  "org.scalatest"         %% "scalatest"     % "3.2.19" % Test
)

// OllamaClientSpec.scala
import com.github.tomakehurst.wiremock.WireMockServer
import com.github.tomakehurst.wiremock.client.WireMock._
import org.scalatest.funsuite.AnyFunSuite
import org.scalatest.BeforeAndAfterAll

class OllamaClientSpec extends AnyFunSuite with BeforeAndAfterAll {
  val wm = new WireMockServer(8089)

  override def beforeAll(): Unit = {
    wm.start()
    wm.stubFor(post(urlEqualTo("/api/chat"))
      .willReturn(okJson(
        """{"model":"llama3.2","message":{"role":"assistant","content":"Hello!"},"done":true}"""
      )))
  }

  override def afterAll(): Unit = wm.stop()

  test("chat returns response content") {
    val client = new OllamaClient(baseUrl = "http://localhost:8089")
    val result = client.chat(List(Message("user", "Hi")))
    assert(result == Right("Hello!"))
    client.close()
  }
}

Passing the WireMock server’s URL as baseUrl to OllamaClient means no code changes are needed in the client to make it testable: the URL is the only injection point. WireMock starts on a fixed port and stubs the response shape Ollama returns. The stub only matches POST requests to /api/chat, so a request with the wrong method or path gets WireMock’s default 404 and the assertion fails; to check the request explicitly, add a wm.verify(postRequestedFor(urlEqualTo("/api/chat"))) call. Because no model is involved, the suite runs quickly and passes on any CI machine regardless of whether Ollama is installed.
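To check what was actually sent rather than only what came back, WireMock's verify API can be applied after the call. The sketch below is a standalone illustration under a few assumptions: it uses a dynamic port instead of 8089, sends the request with the JDK's HttpClient so it does not depend on OllamaClient, and the VerifySketch name is made up for the example.

```scala
import com.github.tomakehurst.wiremock.WireMockServer
import com.github.tomakehurst.wiremock.client.WireMock._
import com.github.tomakehurst.wiremock.core.WireMockConfiguration.options
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object VerifySketch {
  def main(args: Array[String]): Unit = {
    // A dynamic port avoids clashes when several suites run on the same machine.
    val wm = new WireMockServer(options().dynamicPort())
    wm.start()
    try {
      wm.stubFor(post(urlEqualTo("/api/chat")).willReturn(okJson(
        """{"model":"llama3.2","message":{"role":"assistant","content":"Hello!"},"done":true}"""
      )))

      val body =
        """{"model":"llama3.2","messages":[{"role":"user","content":"Hi"}],"stream":false}"""
      val request = HttpRequest
        .newBuilder(URI.create(s"http://localhost:${wm.port()}/api/chat"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build()
      val client = HttpClient.newBuilder().version(HttpClient.Version.HTTP_1_1).build()
      val response = client.send(request, HttpResponse.BodyHandlers.ofString())
      assert(response.statusCode() == 200)

      // Throws a VerificationException unless exactly one matching request was received.
      wm.verify(1, postRequestedFor(urlEqualTo("/api/chat"))
        .withRequestBody(matchingJsonPath("$.model", equalTo("llama3.2"))))
    } finally wm.stop()
  }
}
```

Inside the ScalaTest suite, the same wm.verify call could follow client.chat(...) in the test body.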

Structured Output

Ollama’s JSON schema mode constrains the model to produce output conforming to a schema, which pairs naturally with Scala’s case classes. Pass a JSON Schema object in the format field alongside your messages, then decode the response content into your target type using Circe. When the schema matches your case class structure, decode[YourType](response.message.content) should normally succeed, but keep handling the Left case: a response cut off by a token limit, or a mismatch between the schema and the case class, still produces a decoding failure. This is one of the most practically useful patterns for Scala services that use Ollama to extract or classify data from unstructured text, giving you typed results with no manual parsing code.
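A minimal sketch of this pattern with Circe alone might look as follows. The Sentiment type, its schema, the prompt wording, and the StructuredOutputSketch wrapper are all hypothetical, and the HTTP call is left out so the block only shows how the request body is built and how the returned content is decoded. Even with a schema in place, keeping the parse step as an Either costs little.

```scala
import io.circe.{Decoder, Encoder, Json}
import io.circe.generic.semiauto.{deriveDecoder, deriveEncoder}
import io.circe.parser.decode
import io.circe.syntax._

object StructuredOutputSketch {
  final case class Message(role: String, content: String)
  // `format` carries the JSON Schema, as described in the text above.
  final case class StructuredChatRequest(
    model: String, messages: List[Message], stream: Boolean, format: Json)
  // Hypothetical target type for the model's answer.
  final case class Sentiment(label: String, confidence: Double)

  implicit val messageEncoder: Encoder[Message] = deriveEncoder
  implicit val requestEncoder: Encoder[StructuredChatRequest] = deriveEncoder
  implicit val sentimentDecoder: Decoder[Sentiment] = deriveDecoder

  // Hand-written schema mirroring the Sentiment case class.
  val sentimentSchema: Json = Json.obj(
    "type" -> Json.fromString("object"),
    "properties" -> Json.obj(
      "label"      -> Json.obj("type" -> Json.fromString("string")),
      "confidence" -> Json.obj("type" -> Json.fromString("number"))
    ),
    "required" -> Json.arr(Json.fromString("label"), Json.fromString("confidence"))
  )

  /** JSON body to POST to the chat endpoint. */
  def requestBody(text: String, model: String = "llama3.2"): String =
    StructuredChatRequest(
      model,
      List(Message("user", s"Classify the sentiment of: $text")),
      stream = false,
      format = sentimentSchema
    ).asJson.noSpaces

  /** Decodes the assistant message content; Left covers truncated or mismatched output. */
  def parseContent(content: String): Either[String, Sentiment] =
    decode[Sentiment](content).left.map(e => s"Unexpected model output: ${e.getMessage}")
}
```

With the OllamaClient shown earlier, parseContent would be applied to the string returned by chat once the client sends a request that includes the format field.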

Performance Considerations

Scala applications running on the JVM start up more slowly than native binaries, but once running they benefit from JIT compilation, which keeps the client-side cost of repeated Ollama calls (serialisation, HTTP handling, JSON decoding) low; end-to-end latency is still dominated by model inference. For long-running services such as a Scala HTTP API, a Spark job processing many documents, or an Akka actor system, the JVM warm-up cost is paid once and amortised across thousands of requests. For short-lived CLI tools or scripts, consider using GraalVM native image to compile your Scala application to a native binary that starts in milliseconds, which makes it practical to invoke from shell scripts or other automation contexts where JVM startup time would otherwise be a problem.

Using Ollama in a Spark Job

One of the most compelling Scala use cases for Ollama is batch processing large document collections in a Spark job. Each executor can maintain its own OllamaClient instance created lazily to avoid serialisation issues, and call Ollama on the worker node. This works well when each worker has access to an Ollama instance — either running locally on GPU-equipped worker nodes or accessible over the local network.

For best performance, use Spark’s mapPartitions to create one client per partition, process all rows in the partition with that client, and close it after the partition is complete. Because the iterator returned from mapPartitions is consumed lazily, close the client only after every row has actually been processed. This reduces connection overhead from one setup per row to one setup per partition, which for large datasets is typically orders of magnitude fewer connections. Combined with Ollama’s fast inference for small models, this pattern makes it practical to run LLM classification or summarisation over millions of documents in a Spark batch job without needing to call a cloud API.
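The per-partition function can be sketched without Spark itself, since mapPartitions only needs a function from Iterator to Iterator. In the sketch below, Classifier is a hypothetical minimal interface (the OllamaClient above would need a thin adapter), and the partition is materialised before the client is closed, which is one simple way to avoid closing it while rows are still pending. That trades memory for simplicity and is an assumption, not the only option.

```scala
object PartitionSketch {
  // Hypothetical minimal interface; an adapter around OllamaClient could implement it.
  trait Classifier extends AutoCloseable {
    def classify(text: String): Either[String, String]
  }

  /** Processes one partition with a single client, closing it once all rows are done. */
  def processPartition(newClient: () => Classifier)(
      rows: Iterator[String]): Iterator[(String, Either[String, String])] =
    if (!rows.hasNext) Iterator.empty // skip client setup for empty partitions
    else {
      val client = newClient()
      try {
        // toVector forces every call to run before `finally` closes the client.
        rows.map(row => row -> client.classify(row)).toVector.iterator
      } finally client.close()
    }

  // With Spark on the classpath this would be wired in roughly as
  //   documents.mapPartitions(processPartition(() => makeClient()))
  // where `documents` is an RDD[String] and `makeClient` is a serialisable factory
  // (both names are placeholders).
}
```

For partitions too large to hold in memory, an iterator wrapper that closes the client when hasNext first returns false would be the lazier alternative.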

Choosing Between Effect Systems

The choice between plain Scala, Cats Effect, and ZIO for your Ollama integration depends primarily on what the rest of your codebase uses. If you are writing a standalone script or a simple service, plain synchronous Scala with the OkHttp backend is the fastest path — no effect system ceremony, straightforward error handling with Either, and easy to understand for team members who are less familiar with functional programming.

For Cats Effect codebases, the IO-based approach integrates cleanly with your existing resource management patterns. The Resource[IO, Backend] pattern ensures HTTP connections are always properly released, and composing multiple Ollama calls with parMapN for concurrent embedding or classification requests fits naturally into the Cats Effect model. For ZIO codebases, the explicit error channel in ZIO[R, E, A] is particularly valuable — it forces every caller to handle network failures and model errors explicitly rather than letting them propagate as unhandled exceptions. ZIO’s built-in retry combinators are also directly useful: wrapping an Ollama call with .retry(Schedule.recurs(3)) adds automatic retries for transient failures with a single line of code.
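The plain synchronous path has no built-in counterpart to ZIO's retry combinator, but a small helper gets close. The following is a sketch only: RetrySketch, retry, and the doubling backoff are illustrative choices, and it retries on every Left without distinguishing transient failures from permanent ones.

```scala
import scala.annotation.tailrec

object RetrySketch {
  /** Runs `call`, retrying up to `maxRetries` more times on Left and doubling the delay each time. */
  @tailrec
  def retry[A](maxRetries: Int, delayMillis: Long = 500L)(
      call: () => Either[String, A]): Either[String, A] =
    call() match {
      case Left(_) if maxRetries > 0 =>
        Thread.sleep(delayMillis)
        retry(maxRetries - 1, delayMillis * 2)(call)
      case result => result // success, or the last failure once retries run out
    }
}
```

With the OllamaClient from earlier, a call might look like RetrySketch.retry(3)(() => client.chat(messages)), which makes at most four attempts in total.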

Model Selection for Scala Use Cases

Model choice for Scala applications follows the same logic as other languages but with one Scala-specific consideration: Scala developers often work with complex, verbose codebases (large Play or Akka applications, intricate type class hierarchies, complex sbt build configurations), and general-purpose models sometimes struggle to give accurate advice about Scala-specific patterns. Code-tuned models like qwen2.5-coder:7b tend to handle Scala better than general models of the same size, particularly for questions about implicits, type classes, effect system patterns, and build tooling.

For data processing use cases — classifying documents in a Spark job, extracting entities from text, generating structured summaries — the quality difference between a 7B and a 70B model is more noticeable than it is for code tasks. If you have the hardware to run a larger model, the improvement in extraction accuracy is often worth the slower throughput. For interactive coding assistance in an IDE, the 7B model strikes the best balance between speed and quality for most Scala development tasks.

Keeping the Integration Maintainable

As your Scala Ollama integration grows, a few design habits keep it maintainable. Define a trait for the Ollama client interface and provide the real HTTP implementation plus a mock implementation for testing — this is the standard Scala dependency injection pattern and it means your tests never need to start a real Ollama instance. Keep prompt templates in a dedicated object or resource files rather than embedding them as string literals in your business logic, so they can be reviewed, versioned, and changed independently of application code. And model the possible failure modes of each Ollama call explicitly in your error type — distinguishing between a network failure, a timeout, and a model not found error lets callers implement appropriate fallback behaviour rather than treating all failures the same way. These patterns are not specific to Ollama, but they matter more when the underlying dependency is a local service that can be unavailable, slow, or misconfigured in ways that cloud APIs typically are not.
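The first two habits can be illustrated with a short standard-library sketch. ChatClient, StubChatClient, Prompts, and summarise are hypothetical names chosen for this example; a real implementation of the trait would delegate to the HTTP client shown earlier.

```scala
object ClientInterfaceSketch {
  final case class Message(role: String, content: String)

  // Business logic depends on this trait rather than on a concrete HTTP client.
  trait ChatClient {
    def chat(messages: List[Message]): Either[String, String]
  }

  // In-memory stand-in for tests; `reply` decides what each call returns.
  final class StubChatClient(reply: List[Message] => Either[String, String])
      extends ChatClient {
    def chat(messages: List[Message]): Either[String, String] = reply(messages)
  }

  // Prompt templates live in one place, separate from the code that uses them.
  object Prompts {
    def summarise(text: String): String =
      s"Summarise the following text in one sentence:\n$text"
  }

  def summarise(client: ChatClient, text: String): Either[String, String] =
    client.chat(List(Message("user", Prompts.summarise(text)))).map(_.trim)
}
```

A test can then exercise summarise with a StubChatClient that returns a canned Right or Left, with no server and no network involved.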

Scala’s combination of a powerful type system, a rich ecosystem of HTTP and JSON libraries, and support for both object-oriented and functional programming styles makes it a strong foundation for building Ollama integrations that are correct, composable, and easy to extend. Whether you are adding local AI to an existing Akka or Play application, processing documents in a Spark pipeline, or building a new service from scratch with Cats Effect or ZIO, the patterns in this guide give you everything you need to get started and scale confidently.
