How to Use Ollama with Haskell

Haskell is an unusual choice for AI integration work — but that is part of what makes it interesting. Its strong type system, purely functional model, and lazy evaluation make it excellent for building reliable data pipelines, and Ollama’s simple HTTP API is easy to call from any language with an HTTP client. This guide covers connecting a Haskell application to Ollama for chat completions, streaming responses, and embeddings, using idiomatic Haskell with the http-conduit and aeson libraries.

Haskell’s type safety pairs well with LLM integration because you can encode the structure of Ollama’s API as types. The compiler then checks that every request you build and every response field you read is consistent with those types, so many mistakes that would surface at runtime in a dynamically typed language become compile errors. The one check that remains at runtime is decoding: if Ollama returns JSON that does not match your types, aeson reports a decode failure that your code must handle.

Prerequisites

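You need Ollama running with at least one model pulled (the examples use llama3.2), and GHC with Cabal or Stack. Add the following dependencies to your .cabal file or package.yaml. Later sections also use transformers (for ExceptT), async, hspec, wreq, and lens, so add those when you reach them: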

build-depends:
    base
  , aeson
  , http-conduit
  , http-client
  , bytestring
  , text
  , conduit
  , conduit-extra

Defining Types with Aeson

Start by defining the Haskell data types that map to Ollama’s API shapes. Using DeriveGeneric and aeson’s generic deriving keeps serialisation boilerplate minimal:

{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE OverloadedStrings #-}

module Ollama where

import Data.Aeson (FromJSON, ToJSON, decode, encode, object, (.=))
import Data.Text (Text)
import GHC.Generics (Generic)

data Message = Message
  { role    :: Text
  , content :: Text
  } deriving (Show, Generic)

instance FromJSON Message
instance ToJSON   Message

data ChatRequest = ChatRequest
  { model    :: Text
  , messages :: [Message]
  , stream   :: Bool
  } deriving (Show, Generic)

instance ToJSON ChatRequest

data ChatResponse = ChatResponse
  { message :: Message
  , done    :: Bool
  } deriving (Show, Generic)

instance FromJSON ChatResponse

Deriving Generic tells GHC to generate the structural representation that aeson needs. The empty FromJSON and ToJSON instance declarations then fall back to aeson’s generic defaults, which use the record field names directly as JSON keys, so no manual field mapping is required. If an Ollama field uses snake_case (for example total_duration) and your Haskell record uses camelCase, write the instance body explicitly with genericParseJSON defaultOptions { fieldLabelModifier = camelTo2 '_' } so that aeson converts the names for you.
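As a minimal sketch of the field-name conversion just described, the following decodes two of the snake_case statistics fields that Ollama attaches to a finished response. The ChatStats record and the example value are hypothetical names introduced for illustration; camelTo2, fieldLabelModifier, and genericParseJSON are aeson functions.

```haskell
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE OverloadedStrings #-}

import Data.Aeson
  ( FromJSON (..), camelTo2, decode, defaultOptions
  , fieldLabelModifier, genericParseJSON )
import GHC.Generics (Generic)

-- Hypothetical record: camelCase in Haskell, snake_case on the wire.
data ChatStats = ChatStats
  { totalDuration :: Maybe Integer  -- JSON key "total_duration"
  , evalCount     :: Maybe Int      -- JSON key "eval_count"
  , done          :: Bool           -- JSON key "done" (unchanged)
  } deriving (Show, Eq, Generic)

-- camelTo2 '_' rewrites each record field name before it is used as a key.
instance FromJSON ChatStats where
  parseJSON = genericParseJSON
    defaultOptions { fieldLabelModifier = camelTo2 '_' }

-- Unknown keys such as "model" are ignored by the generic decoder.
example :: Maybe ChatStats
example = decode
  "{\"done\":true,\"total_duration\":1200,\"eval_count\":42,\"model\":\"llama3.2\"}"
```

Because the statistics are declared as Maybe fields, the same type should also decode intermediate streaming chunks, which omit them.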

Basic Chat Completion

Here is a function that sends a prompt to Ollama and returns the response text:

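import Network.HTTP.Simple
  ( getResponseBody, httpLBS, parseRequest
  , setRequestBodyJSON, setRequestMethod )
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.IO as TIO

-- The explicit import list matters once Network.HTTP.Client is imported
-- later in the same module: both modules export names such as withResponse
-- with different types, and importing both wholesale makes them ambiguous.

baseUrl :: String
baseUrl = "http://localhost:11434"

chat :: Text -> IO (Either String Text)
chat prompt = do
  let req = ChatRequest
        { model    = "llama3.2"
        , messages = [Message { role = "user", content = prompt }]
        , stream   = False
        }
  request <- parseRequest (baseUrl ++ "/api/chat")
  -- setRequestBodyJSON encodes the body and sets the Content-Type header.
  let request' = setRequestMethod "POST"
               $ setRequestBodyJSON req request
  response <- httpLBS request'
  case decode (getResponseBody response) :: Maybe ChatResponse of
    Nothing   -> return (Left "Failed to decode response")
    Just resp -> return (Right (content (message resp)))

main :: IO ()
main = do
  result <- chat "What is Haskell's type system?"
  case result of
    Left err  -> putStrLn ("Error: " ++ err)
    -- Print the Text itself; show would add quotes and escape newlines.
    Right txt -> TIO.putStrLn txt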

The httpLBS function performs a synchronous HTTP request and returns the response body as a lazy ByteString. We decode it with aeson’s decode, which returns Maybe ChatResponse: Nothing if decoding fails, Just resp if it succeeds. Returning Either String Text from chat makes the decode failure explicit at the call site rather than throwing an exception, which is idiomatic Haskell error handling. Transport failures such as a refused connection are still raised by http-conduit as HttpException, so catch that separately if Ollama may be offline.

Streaming Responses with Conduit

Streaming from Ollama in Haskell is most naturally expressed using conduit, which provides composable streaming data pipelines. The response body arrives as a stream of byte chunks which we split on newlines and decode as JSON objects:

import Conduit
import qualified Data.Conduit.Binary as CB
import Network.HTTP.Client
import qualified Network.HTTP.Client as HC
import Network.HTTP.Client.Conduit (bodyReaderSource)
import qualified Data.Text.IO as TIO
import System.IO (hFlush, stdout)

chatStream :: Text -> IO ()
chatStream prompt = do
  manager <- newManager defaultManagerSettings
  let body = ChatRequest "llama3.2" [Message "user" prompt] True
  initReq <- parseRequest (baseUrl ++ "/api/chat")
  let req = initReq
        { method         = "POST"
        , requestBody    = RequestBodyLBS (encode body)
        , requestHeaders = [("Content-Type", "application/json")]
        }
  -- Qualified, because Network.HTTP.Simple exports a different withResponse.
  -- The connection stays open while the callback consumes the body.
  HC.withResponse req manager $ \resp ->
    runConduit
      $ bodyReaderSource (responseBody resp)  -- stream of raw byte chunks
      .| CB.lines                             -- one JSON object per line
      .| mapM_C (\line ->
           case decode (BL.fromStrict line) :: Maybe ChatResponse of
             -- Print the raw Text; show would wrap every token in quotes.
             Just chunk -> TIO.putStr (content (message chunk)) >> hFlush stdout
             Nothing    -> return ()
         )
  putStrLn ""

The conduit pipeline reads bytes from the response body, splits them on newlines, and attempts to decode each line as a ChatResponse. Successful decodes print the token content immediately and flush stdout so output appears incrementally. The final chunk, which carries done: true along with timing metadata, decodes like any other and typically has empty content, so it prints nothing. Lines that fail to decode, such as blank lines, are silently skipped. Be aware that this also hides error objects, so log the Nothing case while debugging.
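The line-splitting stage is worth seeing in isolation, because network chunks rarely line up with JSON object boundaries. The sketch below runs the same kind of pipeline purely, over an in-memory list of chunks, and collects the reply instead of printing it. The Chunk type and collectReply function are hypothetical names for this illustration; it assumes the conduit, conduit-extra, and aeson packages.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Conduit (concatMapC, foldC, runConduitPure, yieldMany, (.|))
import Data.Aeson (FromJSON (..), decodeStrict, withObject, (.:))
import qualified Data.ByteString as BS
import qualified Data.Conduit.Binary as CB
import Data.Text (Text)

-- Hypothetical minimal view of one streamed line: only message.content.
newtype Chunk = Chunk Text

instance FromJSON Chunk where
  parseJSON = withObject "Chunk" $ \o -> do
    msg <- o .: "message"
    Chunk <$> msg .: "content"

-- Reassemble lines across chunk boundaries, decode each line,
-- drop lines that fail to decode, and concatenate the content.
collectReply :: [BS.ByteString] -> Text
collectReply chunks = runConduitPure
   $ yieldMany chunks
  .| CB.lines
  .| concatMapC (\line -> [t | Just (Chunk t) <- [decodeStrict line]])
  .| foldC
```

Feeding it a first chunk that ends in the middle of a JSON object still yields the complete text, since CB.lines buffers partial lines until the newline arrives.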

Building a Reusable Client

For production use, wrap configuration and the shared HTTP manager in a record type:

data OllamaConfig = OllamaConfig
  { ollamaBaseUrl :: String
  , ollamaModel   :: Text
  , ollamaManager :: Manager
  }

newOllamaConfig :: String -> Text -> IO OllamaConfig
newOllamaConfig url mdl = do
  mgr <- newManager defaultManagerSettings
        { managerResponseTimeout = responseTimeoutMicro 120000000 }
  return OllamaConfig
    { ollamaBaseUrl = url
    , ollamaModel   = mdl
    , ollamaManager = mgr
    }

ollamaChat :: OllamaConfig -> [Message] -> IO (Either String Text)
ollamaChat cfg msgs = do
  let body = ChatRequest (ollamaModel cfg) msgs False
  initReq <- parseRequest (ollamaBaseUrl cfg ++ "/api/chat")
  let req = initReq
        { method         = "POST"
        , requestBody    = RequestBodyLBS (encode body)
        , requestHeaders = [("Content-Type", "application/json")]
        }
  resp <- httpLbs req (ollamaManager cfg)
  case decode (responseBody resp) :: Maybe ChatResponse of
    Nothing -> return (Left "Failed to decode Ollama response")
    Just r  -> return (Right (content (message r)))

The 120-second response timeout accommodates large models that take time to generate lengthy responses. Reusing the Manager across calls lets http-client pool and reuse TCP connections, avoiding the cost of a new connection handshake for every request.

Multi-Turn Conversation

Use an IORef to hold the conversation history across iterations of the input loop:

import Data.IORef
import qualified Data.Text.IO as TIO
import System.IO (hFlush, stdout)

runConversation :: OllamaConfig -> Maybe Text -> IO ()
runConversation cfg systemPrompt = do
  let initial = maybe [] (\p -> [Message "system" p]) systemPrompt
  histRef <- newIORef initial
  let loop = do
        putStr "You: " >> hFlush stdout
        userInput <- TIO.getLine  -- read the line directly as Text
        -- Append the user turn before the request so it is part of the context.
        modifyIORef histRef (++ [Message "user" userInput])
        hist <- readIORef histRef
        result <- ollamaChat cfg hist
        case result of
          Left err  -> putStrLn ("Error: " ++ err)
          Right txt -> do
            -- Print the Text itself; show would add quotes and escapes.
            TIO.putStrLn ("Assistant: " <> txt)
            modifyIORef histRef (++ [Message "assistant" txt])
        loop
  loop

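The IORef holding history is updated twice per exchange: the user message is appended before the request so that it is included in the context, and the assistant reply is appended after the response arrives. The recursive loop therefore sends the full conversation history with every request. If a request fails, the user message stays in the history and is resent with the next turn. The loop has no exit command, so it runs until the process is interrupted or stdin reaches end of file. This is a clean interactive REPL pattern that works well for a command-line Ollama client.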

Generating Embeddings

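Ollama’s embeddings endpoint follows the same pattern as the chat endpoint. Define the request and response types and add a function that takes an OllamaConfig. The hand-written ToJSON instance needs object and (.=) from Data.Aeson in scope, and the embedding model name is hard-coded here because it differs from the chat model stored in the config: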

data EmbedRequest = EmbedRequest
  { embedModel :: Text
  , input      :: Text
  } deriving (Generic)

instance ToJSON EmbedRequest where
  toJSON r = object ["model" .= embedModel r, "input" .= input r]

data EmbedResponse = EmbedResponse
  { embeddings :: [[Double]]
  } deriving (Show, Generic)

instance FromJSON EmbedResponse

ollamaEmbed :: OllamaConfig -> Text -> IO (Either String [Double])
ollamaEmbed cfg txt = do
  let body = EmbedRequest "nomic-embed-text" txt
  initReq <- parseRequest (ollamaBaseUrl cfg ++ "/api/embed")
  let req = initReq
        { method         = "POST"
        , requestBody    = RequestBodyLBS (encode body)
        , requestHeaders = [("Content-Type", "application/json")]
        }
  resp <- httpLbs req (ollamaManager cfg)
  case decode (responseBody resp) :: Maybe EmbedResponse of
    Nothing -> return (Left "Failed to decode embedding response")
    Just r  -> case embeddings r of
      (v:_) -> return (Right v)
      []    -> return (Left "Empty embeddings list")

Pull the embedding model first with ollama pull nomic-embed-text. The function returns the first embedding vector as a [Double] list. In Haskell you can compute cosine similarity between two vectors with a simple list fold — no external linear algebra library needed for basic semantic search.
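The cosine similarity mentioned above can be sketched with a single strict fold over the zipped vectors. The function name and the choice to return Nothing for mismatched lengths or zero vectors are assumptions made for this illustration; only base is required.

```haskell
{-# LANGUAGE BangPatterns #-}

import Data.List (foldl')

-- Cosine similarity of two embedding vectors.
-- Returns Nothing when the lengths differ or either vector has zero norm,
-- since the similarity is undefined in those cases.
cosineSimilarity :: [Double] -> [Double] -> Maybe Double
cosineSimilarity xs ys
  | length xs /= length ys   = Nothing
  | normX == 0 || normY == 0 = Nothing
  | otherwise                = Just (dot / (sqrt normX * sqrt normY))
  where
    -- One pass accumulates the dot product and both squared norms.
    (dot, normX, normY) = foldl' step (0, 0, 0) (zip xs ys)
    step (!d, !nx, !ny) (x, y) = (d + x * y, nx + x * x, ny + y * y)
```

To rank stored documents against a query, apply cosineSimilarity to the query vector and each stored vector and sort by the result. For very large collections an unboxed vector type would be faster than lists.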

Testing with Hspec and Mock Responses

Test your Ollama integration without a running instance by separating JSON handling from the HTTP call. A clean way to do that is to parameterise the client over the IO action that performs the request, so tests can substitute a function that returns a fixed ByteString. The simplest test needs no HTTP layer at all and just decodes a canned response:

import Test.Hspec
import qualified Data.ByteString.Lazy.Char8 as BLC

mockOllamaResponse :: BLC.ByteString
mockOllamaResponse = BLC.pack
  "{\"model\":\"llama3.2\",\"message\":{\"role\":\"assistant\",\"content\":\"Hello!\"},\"done\":true}"

spec :: Spec
spec = describe "Ollama client" $ do
  it "decodes a valid response" $ do
    let result = decode mockOllamaResponse :: Maybe ChatResponse
    -- shouldBe reports both values on failure, unlike a shouldSatisfy predicate.
    fmap (content . message) result `shouldBe` Just "Hello!"

This tests the JSON parsing logic, the most likely source of breakage when Ollama’s response format changes. For integration tests that make real HTTP calls, use hspec’s before hook to check that Ollama is reachable and skip the suite if it is not, keeping CI green even on machines without Ollama installed.
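Parameterising the request function, as suggested at the start of this section, might look like the sketch below. Transport, Reply, chatWith, and mockTransport are hypothetical names introduced here, and the request body is built with object rather than the ChatRequest type so that the block is self-contained.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Aeson
  ( FromJSON (..), eitherDecode, encode, object, withObject, (.:), (.=) )
import qualified Data.ByteString.Lazy as BL
import Data.Text (Text)

-- Hypothetical alias: anything that turns a request body into a response body.
type Transport = BL.ByteString -> IO BL.ByteString

-- Minimal view of the response: only message.content.
newtype Reply = Reply { replyText :: Text }

instance FromJSON Reply where
  parseJSON = withObject "Reply" $ \o -> do
    msg <- o .: "message"
    Reply <$> msg .: "content"

-- The client logic no longer knows how the bytes travel.
chatWith :: Transport -> Text -> IO (Either String Text)
chatWith send prompt = do
  let body = encode $ object
        [ "model"    .= ("llama3.2" :: Text)
        , "messages" .= [object ["role" .= ("user" :: Text), "content" .= prompt]]
        , "stream"   .= False
        ]
  raw <- send body
  pure (replyText <$> eitherDecode raw)

-- Canned transport for tests; it ignores the request body.
mockTransport :: Transport
mockTransport _ =
  pure "{\"message\":{\"role\":\"assistant\",\"content\":\"Hello!\"},\"done\":true}"
```

In production the Transport would wrap the http-client call; in an hspec test, chatWith mockTransport "hi" can be compared against Right "Hello!" without any network access.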

Error Handling with ExceptT

For applications that chain multiple Ollama calls — embed a document, compare it to a stored vector, then generate a contextualised response — use ExceptT to thread error handling through the pipeline without nested case expressions:

import Control.Monad.Trans.Except
import qualified Data.Text.IO as TIO

ragPipeline :: OllamaConfig -> Text -> Text -> ExceptT String IO Text
ragPipeline cfg query context = do
  -- The vector is unused in this sketch; a retrieval step would compare it
  -- against stored vectors to choose the context.
  _queryVec <- ExceptT (ollamaEmbed cfg query)
  let prompt = "Context: " <> context <> "\n\nQuestion: " <> query
  ExceptT (ollamaChat cfg [Message "user" prompt])

runRag :: OllamaConfig -> Text -> Text -> IO ()
runRag cfg query context = do
  result <- runExceptT (ragPipeline cfg query context)
  case result of
    Left err  -> putStrLn ("Pipeline failed: " ++ err)
    -- Print the Text itself; show would add quotes and escapes.
    Right txt -> TIO.putStrLn txt

ExceptT String IO sequences IO actions while propagating the first Left error — if the embedding call fails, the chat call is never made and the error bubbles up to runRag. This is Haskell’s idiomatic alternative to exception-based error propagation, and it makes the failure modes of your pipeline explicit in the type signature rather than implicit in the runtime behaviour.
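When several stages share the String error type, it can be hard to tell which one failed. One option, sketched here with stand-in functions so that it runs offline, is to tag each stage with withExceptT from transformers. PipelineError, fakeEmbed, fakeChat, and demo are hypothetical names; the counter exists only to show that the second stage is skipped after a failure.

```haskell
import Control.Monad.Trans.Except (ExceptT (..), runExceptT, withExceptT)
import Data.IORef (IORef, modifyIORef, newIORef, readIORef)

-- Hypothetical error type recording which stage failed.
data PipelineError
  = EmbedFailed String
  | ChatFailed String
  deriving (Show, Eq)

-- Stand-ins for ollamaEmbed and ollamaChat. Each bumps a call counter.
fakeEmbed :: IORef Int -> String -> IO (Either String [Double])
fakeEmbed calls _ = modifyIORef calls (+ 1) >> pure (Left "connection refused")

fakeChat :: IORef Int -> String -> IO (Either String String)
fakeChat calls q = modifyIORef calls (+ 1) >> pure (Right ("echo: " ++ q))

-- withExceptT maps the error of one stage into the shared error type.
pipeline :: IORef Int -> String -> ExceptT PipelineError IO String
pipeline calls query = do
  _vec <- withExceptT EmbedFailed (ExceptT (fakeEmbed calls query))
  withExceptT ChatFailed (ExceptT (fakeChat calls query))

-- Returns the pipeline result together with the number of stages that ran.
demo :: IO (Either PipelineError String, Int)
demo = do
  calls  <- newIORef 0
  result <- runExceptT (pipeline calls "hello")
  n      <- readIORef calls
  pure (result, n)
```

Running demo gives a Left tagged with EmbedFailed and a call count of 1, confirming that the chat stage never executed.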

When Haskell Makes Sense for Ollama Integration

Haskell is not the obvious first choice for LLM integration, but it earns its place in specific contexts. If you are building a data processing pipeline where correctness matters, such as extracting structured data from documents, classifying text according to a schema, or transforming content at scale, Haskell’s type system catches structural errors at compile time, and streaming libraries such as conduit let you process large datasets without holding them in memory all at once. Conduit’s streaming model is also a natural fit for processing Ollama’s streaming responses as part of a larger data pipeline.

For CLI tools built with libraries like optparse-applicative, GHC compiles to a native executable that needs no interpreter or separate language runtime on the target machine (binaries typically run to several megabytes and link against a few system libraries unless you build statically). A Haskell-based Ollama CLI with rich argument parsing, prompt templating, and conversation history management is a legitimate and practical target. The development experience is different from Python or JavaScript: the compiler is strict and the iteration cycle slower, but the resulting binary is fast, and the type checker rules out whole classes of errors that dynamic languages only reveal at runtime.

For web services, Haskell frameworks like Servant and Yesod can expose the Ollama client functions from this guide over HTTP, and the type-level API specification in Servant means your API contract is checked at compile time. If your team already uses Haskell, integrating Ollama into an existing Haskell backend is straightforward with the patterns shown here. If you are choosing Haskell specifically for the Ollama integration, the learning curve is steeper than the alternatives, but the payoff in type-level correctness and composable stream processing is real for the right class of problem.

Working with the Servant Web Framework

If you are building an HTTP API in Haskell to expose Ollama to other services, Servant is among the most type-safe options in the Haskell ecosystem. With Servant, your API is defined as a type, and the compiler verifies that your handler implementations match the declared routes exactly. Mismatches between route declarations and handler types become compile errors rather than runtime failures, which is a meaningful safety guarantee for production services.

A minimal Servant API wrapping Ollama would declare a POST /chat endpoint that accepts a JSON request body and returns a JSON response, with the types wired through the ReqBody and JSON combinators. Because a server decodes requests and encodes responses, the request type needs a FromJSON instance and the response type a ToJSON instance, the reverse of what the client code above requires. The handler calls ollamaChat, turns a Left result into an HTTP error with throwError, and wraps the reply text in the response type. The entire API contract (method, path, request body type, response type) is encoded in one type-level expression that Servant uses to derive both the server implementation signature and, if you use servant-client, the client bindings for other Haskell services that need to call your API.
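The description above can be sketched as follows, assuming the servant-server package. PromptBody, ReplyBody, ChatAPI, and Backend are hypothetical names for this wrapper API, not Ollama’s own shapes, and the Ollama call is passed in as a function so that the handler can be exercised without a running server.

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TypeOperators #-}

import Control.Monad.IO.Class (liftIO)
import Data.Aeson (FromJSON, ToJSON)
import qualified Data.ByteString.Lazy.Char8 as BLC
import Data.Text (Text)
import GHC.Generics (Generic)
import Servant

-- Hypothetical wire types: {"prompt": "..."} in, {"reply": "..."} out.
newtype PromptBody = PromptBody { prompt :: Text } deriving (Show, Generic)
newtype ReplyBody  = ReplyBody  { reply  :: Text } deriving (Show, Generic)

instance FromJSON PromptBody  -- the server decodes requests
instance ToJSON   ReplyBody   -- and encodes responses

-- The whole contract in one type: POST /chat with JSON in and out.
type ChatAPI = "chat" :> ReqBody '[JSON] PromptBody :> Post '[JSON] ReplyBody

-- Stand-in for something like (\p -> ollamaChat cfg [Message "user" p]).
type Backend = Text -> IO (Either String Text)

chatHandler :: Backend -> PromptBody -> Handler ReplyBody
chatHandler backend (PromptBody p) = do
  result <- liftIO (backend p)
  case result of
    -- Report upstream failures as 502 Bad Gateway with the message as body.
    Left err  -> throwError err502 { errBody = BLC.pack err }
    Right txt -> pure (ReplyBody txt)

app :: Backend -> Application
app backend = serve (Proxy :: Proxy ChatAPI) (chatHandler backend)
```

To serve it, pass app to a WAI server such as warp’s run function. If the handler’s argument or return type stops matching ChatAPI, the serve call no longer type-checks.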

Using Wreq as an Alternative HTTP Client

For simpler use cases where the full power of http-conduit is not needed, the wreq library provides a more concise interface. Its lens-based API for setting headers and reading response bodies is ergonomic for straightforward request-response patterns:

import Network.Wreq
import Control.Lens
import qualified Data.ByteString.Lazy as BL

chatWreq :: Text -> IO (Either String Text)
chatWreq prompt = do
  let opts  = defaults & header "Content-Type" .~ ["application/json"]
      body  = encode (ChatRequest "llama3.2" [Message "user" prompt] False)
  resp <- postWith opts (baseUrl ++ "/api/chat") body
  case decode (resp ^. responseBody) :: Maybe ChatResponse of
    Nothing -> return (Left "Decode failed")
    Just r  -> return (Right (content (message r)))

The lens operators & and .~ for setting options feel natural once you are comfortable with Haskell lenses. wreq handles redirects and response decompression automatically, and its API is considerably more concise than http-conduit for non-streaming use cases. Keep this function in its own module, because Network.Wreq’s responseBody lens shares its name with the http-client record field used earlier. For streaming, stick with http-conduit: wreq’s incremental reader, foldGet, covers GET requests only, while /api/chat requires POST, so it cannot process Ollama’s newline-delimited stream.

Parallel Embedding with Async

For batch embedding tasks — indexing a collection of documents for semantic search, for example — Haskell’s async library makes it easy to run multiple embedding requests concurrently. Ollama queues concurrent requests internally, so you are not going to overwhelm it, but sending requests in parallel rather than sequentially means you get results as fast as Ollama can process them:

import Control.Concurrent.Async (mapConcurrently)

embedDocuments :: OllamaConfig -> [Text] -> IO [Either String [Double]]
embedDocuments cfg docs = mapConcurrently (ollamaEmbed cfg) docs

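mapConcurrently runs each ollamaEmbed call in its own lightweight thread and collects the results in input order, so you can zip the result with the original documents to build your index without any additional bookkeeping. For a list of 50 documents this can reduce wall-clock time severalfold (3 to 5 times is plausible), but the gain depends on your hardware and on how many requests Ollama processes in parallel, which is a server-side setting (OLLAMA_NUM_PARALLEL). Two caveats: mapConcurrently starts one thread per document with no upper bound, and if any call throws an exception, such as an HttpException, the remaining calls are cancelled and the exception is rethrown.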
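For large document sets it may be preferable to cap the number of in-flight requests. A minimal sketch using a semaphore from base together with the async package follows; mapConcurrentlyBounded is a hypothetical helper name, not part of async.

```haskell
import Control.Concurrent.Async (mapConcurrently)
import Control.Concurrent.QSem (newQSem, signalQSem, waitQSem)
import Control.Exception (bracket_)

-- Like mapConcurrently, but at most `limit` actions run at the same time.
-- `limit` must be at least 1, otherwise no action can ever start.
mapConcurrentlyBounded :: Int -> (a -> IO b) -> [a] -> IO [b]
mapConcurrentlyBounded limit action xs = do
  sem <- newQSem limit
  -- bracket_ releases the semaphore slot even if the action throws.
  mapConcurrently (bracket_ (waitQSem sem) (signalQSem sem) . action) xs
```

With the function from this section, mapConcurrentlyBounded 4 (ollamaEmbed cfg) docs would keep at most four embedding requests open while still returning results in input order. A thread is still created per document, but idle threads are cheap in GHC.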

Practical Tips for Haskell and Ollama

A few practical notes that save time when getting started. First, enable OverloadedStrings at the top of every module that deals with the API — without it, string literals in Haskell are String rather than Text, and you will spend time adding pack calls everywhere. Second, when decoding Ollama responses with aeson, prefer eitherDecode over decode in production code — it returns Either String a with a descriptive error message rather than just Nothing, making it much easier to diagnose response format issues when Ollama returns an unexpected shape.
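To illustrate the second tip, the sketch below uses eitherDecode with a small hand-written parser that also recognises the error objects Ollama returns, which have the shape {"error": "..."}. The Outcome type and parseOutcome function are hypothetical names for this example.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative ((<|>))
import Data.Aeson (FromJSON (..), eitherDecode, withObject, (.:))
import qualified Data.ByteString.Lazy as BL
import Data.Text (Text)

-- Hypothetical result type: either a reply or an error reported by Ollama.
data Outcome
  = Reply Text
  | ApiError Text
  deriving (Show, Eq)

instance FromJSON Outcome where
  parseJSON = withObject "Outcome" $ \o ->
        (ApiError <$> o .: "error")                       -- {"error": "..."}
    <|> (Reply <$> (o .: "message" >>= (.: "content")))   -- normal response

-- Left carries aeson's description of why the body matched neither shape.
parseOutcome :: BL.ByteString -> Either String Outcome
parseOutcome = eitherDecode
```

With decode, a malformed body and an error object would both collapse to Nothing. Here the caller can distinguish a reply, an error reported by Ollama, and a body that could not be parsed at all.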

Third, Ollama’s API returns additional fields in its responses beyond what is modelled here, including timing metadata, token counts, and context information. By default aeson’s generic deriving ignores unknown fields during decoding, so these extra fields cause no issues. If you want to access them, add them to your ChatResponse type as Maybe fields: they will be populated when present and set to Nothing when absent, handling both older and newer Ollama versions gracefully. Their JSON names are snake_case (total_duration, eval_count), so either give the record fields exactly those names or configure a fieldLabelModifier.

Finally, consider using Data.Aeson.KeyMap when you need to inspect raw response fields without fully modelled types — it gives you a dynamic key-value map you can query directly, which is useful for exploratory work or when you want to log the full raw response alongside your parsed result for debugging purposes.
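A short sketch of that kind of exploratory inspection, assuming aeson 2.0 or later (where Data.Aeson.KeyMap was introduced). The helper names rawField and fieldNames are hypothetical.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Aeson (Object, Value (..), decode)
import qualified Data.Aeson.Key as Key
import qualified Data.Aeson.KeyMap as KM
import qualified Data.ByteString.Lazy as BL
import Data.List (sort)
import Data.Text (Text)

-- Look up one top-level field of a raw response without a modelled type.
rawField :: Key.Key -> BL.ByteString -> Maybe Value
rawField key raw = decode raw >>= KM.lookup key

-- List the top-level field names, sorted, e.g. for logging.
-- Returns an empty list when the body is not a JSON object.
fieldNames :: BL.ByteString -> [Text]
fieldNames raw =
  maybe [] (sort . map Key.toText . KM.keys) (decode raw :: Maybe Object)
```

Logging fieldNames for a response that fails to decode is a quick way to spot a renamed or missing field.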

With these patterns in place, a Haskell application can cover the main Ollama use cases: simple one-shot completions in a script, streaming responses in a web service, multi-turn conversations in a CLI tool, and batch embedding pipelines for semantic search. The aeson and http-conduit ecosystem is stable and well maintained, so this code should need little upkeep as the libraries evolve. Ollama’s API can still change between releases, which is where the decoding tests shown earlier earn their keep.