Ollama provides an official Go library that wraps the REST API with idiomatic Go types, streaming support, and full model management capabilities. If you are building a Go application that needs local LLM inference — a CLI tool, a backend service, a data processing pipeline — the Go library is the cleanest path to Ollama integration without dropping to raw HTTP calls.
Installation
go get github.com/ollama/ollama/api
Basic Chat
package main
import (
"context"
"fmt"
"log"
"github.com/ollama/ollama/api"
)
func main() {
client, err := api.ClientFromEnvironment()
if err != nil {
log.Fatal(err)
}
messages := []api.Message{
{Role: "user", Content: "Why is Go popular for backend services?"},
}
req := &api.ChatRequest{
Model: "llama3.2",
Messages: messages,
}
var fullResponse string
err = client.Chat(context.Background(), req, func(resp api.ChatResponse) error {
fmt.Print(resp.Message.Content)
fullResponse += resp.Message.Content
return nil
})
if err != nil {
log.Fatal(err)
}
fmt.Println()
}
Non-Streaming Response
func chatSync(client *api.Client, model, prompt string) (string, error) {
    stream := false
    messages := []api.Message{{Role: "user", Content: prompt}}
    req := &api.ChatRequest{Model: model, Messages: messages, Stream: &stream}
    var result string
    // With streaming disabled, the callback fires once with the complete message.
    err := client.Chat(context.Background(), req, func(resp api.ChatResponse) error {
        result = resp.Message.Content
        return nil
    })
    return result, err
}
Generate (Raw Completion)
func generate(client *api.Client, model, prompt string) (string, error) {
stream := false
req := &api.GenerateRequest{
Model: model,
Prompt: prompt,
Stream: &stream,
}
var output string
err := client.Generate(context.Background(), req, func(resp api.GenerateResponse) error {
output += resp.Response
return nil
})
return output, err
}
Embeddings
func embed(client *api.Client, model, text string) ([]float64, error) {
req := &api.EmbeddingRequest{
Model: model,
Prompt: text,
}
resp, err := client.Embeddings(context.Background(), req)
if err != nil {
return nil, err
}
return resp.Embedding, nil
}
// Usage:
vec, _ := embed(client, "nomic-embed-text", "Go is a statically typed language")
fmt.Printf("Dimensions: %d\n", len(vec))
Model Management
// List available models
models, err := client.List(context.Background())
if err != nil {
    log.Fatal(err)
}
for _, m := range models.Models {
    fmt.Printf("%-40s %.1f GB\n", m.Name, float64(m.Size)/1e9)
}
// Pull a model with progress
pullReq := &api.PullRequest{Model: "llama3.2"}
err = client.Pull(context.Background(), pullReq, func(resp api.ProgressResponse) error {
    if resp.Total > 0 {
        pct := float64(resp.Completed) / float64(resp.Total) * 100
        fmt.Printf("\rDownloading: %.1f%%", pct)
    }
    return nil
})
if err != nil {
    log.Fatal(err)
}
// Show running models (the /api/ps endpoint)
running, err := client.ListRunning(context.Background())
if err != nil {
    log.Fatal(err)
}
for _, m := range running.Models {
    fmt.Printf("%s (expires %s)\n", m.Name, m.ExpiresAt)
}
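A common pattern that combines these calls is checking whether a model is already present and pulling it only if it is missing. The helper below is an illustrative sketch — ensureModel is not part of the library, and it assumes "strings" is imported alongside the packages above:
// ensureModel pulls a model only if it does not already appear in the local list.
func ensureModel(ctx context.Context, client *api.Client, model string) error {
    list, err := client.List(ctx)
    if err != nil {
        return err
    }
    for _, m := range list.Models {
        // Listed names include the tag, e.g. "llama3.2:latest".
        if m.Name == model || strings.TrimSuffix(m.Name, ":latest") == model {
            return nil // already available locally
        }
    }
    return client.Pull(ctx, &api.PullRequest{Model: model}, func(resp api.ProgressResponse) error {
        fmt.Printf("\r%s: %s", model, resp.Status)
        return nil
    })
}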
Custom Client (Remote Ollama)
import "net/url"
remoteURL, _ := url.Parse("http://192.168.1.100:11434")
client := api.NewClient(remoteURL, &http.Client{})
Building a CLI Tool
package main
import (
"bufio"
"context"
"fmt"
"os"
"github.com/ollama/ollama/api"
)
func main() {
client, err := api.ClientFromEnvironment()
if err != nil {
    fmt.Fprintln(os.Stderr, "failed to create Ollama client:", err)
    os.Exit(1)
}
scanner := bufio.NewScanner(os.Stdin)
var history []api.Message
fmt.Println("Chat with llama3.2 (type 'quit' to exit)")
for {
fmt.Print("You: ")
if !scanner.Scan() {
break
}
input := scanner.Text()
if input == "quit" {
break
}
history = append(history, api.Message{Role: "user", Content: input})
req := &api.ChatRequest{Model: "llama3.2", Messages: history}
var reply string
fmt.Print("Assistant: ")
if err := client.Chat(context.Background(), req, func(resp api.ChatResponse) error {
    fmt.Print(resp.Message.Content)
    reply += resp.Message.Content
    return nil
}); err != nil {
    fmt.Fprintln(os.Stderr, "\nchat error:", err)
    continue
}
fmt.Println()
history = append(history, api.Message{Role: "assistant", Content: reply})
}
}
Why Use Go for Ollama Integrations?
Go is a natural fit for Ollama integrations that need to be distributed as standalone binaries — CLI tools, background daemons, system utilities, and microservices. A Go binary that wraps Ollama can be compiled for any target platform (Linux, macOS, Windows, ARM) and distributed as a single executable with no runtime dependency. Compare this to a Python script that requires a Python installation and virtual environment, or a Node.js tool that requires Node.js and node_modules. For tooling that you want to share with colleagues or deploy on servers without managing language runtimes, Go’s single-binary distribution model is a significant practical advantage.
Go’s concurrency primitives — goroutines and channels — also pair naturally with Ollama’s streaming API. A Go service that processes multiple Ollama requests concurrently, manages timeouts with contexts, and handles graceful shutdown fits cleanly into Go’s concurrency model without the callback complexity that concurrent streaming creates in other languages. The standard library’s context package integrates directly with Ollama’s API — every Ollama client method accepts a context, so request cancellation, timeouts, and deadline propagation all work through Go’s standard patterns.
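As a small illustration of that context integration, the sketch below cancels an in-flight streaming request when the process receives Ctrl-C or SIGTERM. It assumes the client and req setup from the Basic Chat example, plus imports of os, os/signal, and syscall:
// Tie the request lifetime to OS signals; cancelling the context aborts the stream.
ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
defer stop()

err := client.Chat(ctx, req, func(resp api.ChatResponse) error {
    fmt.Print(resp.Message.Content)
    return nil
})
if err != nil && ctx.Err() != nil {
    fmt.Println("\ninterrupted, shutting down cleanly")
}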
Error Handling Pattern
package main
import (
"context"
"errors"
"fmt"
"net"
"time"
"github.com/ollama/ollama/api"
)
func chatWithTimeout(model, prompt string) (string, error) {
client, err := api.ClientFromEnvironment()
if err != nil {
return "", fmt.Errorf("failed to create client: %w", err)
}
ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
defer cancel()
messages := []api.Message{{Role: "user", Content: prompt}}
req := &api.ChatRequest{Model: model, Messages: messages}
var result string
err = client.Chat(ctx, req, func(resp api.ChatResponse) error {
result += resp.Message.Content
return nil
})
if err != nil {
var netErr *net.OpError
if errors.As(err, &netErr) {
return "", fmt.Errorf("ollama not running: %w", err)
}
if errors.Is(ctx.Err(), context.DeadlineExceeded) {
return "", fmt.Errorf("request timed out after 2 minutes")
}
return "", err
}
return result, nil
}
Concurrent Batch Processing
func batchProcess(client *api.Client, model string, prompts []string) []string {
results := make([]string, len(prompts))
sem := make(chan struct{}, 3) // max 3 concurrent requests
var wg sync.WaitGroup
for i, prompt := range prompts {
wg.Add(1)
go func(idx int, p string) {
defer wg.Done()
sem <- struct{}{}
defer func() { <-sem }()
messages := []api.Message{{Role: "user", Content: p}}
req := &api.ChatRequest{Model: model, Messages: messages}
var out string
if err := client.Chat(context.Background(), req, func(r api.ChatResponse) error {
    out += r.Message.Content
    return nil
}); err != nil {
    out = "error: " + err.Error() // keep slot alignment; collect real errors in production
}
results[idx] = out
}(i, prompt)
}
wg.Wait()
return results
}
Structured Output
import (
"encoding/json"
"github.com/ollama/ollama/api"
)
type ContactInfo struct {
Name string `json:"name"`
Email string `json:"email"`
Phone string `json:"phone,omitempty"`
}
func extractContact(client *api.Client, text string) (*ContactInfo, error) {
schema := map[string]any{
"type": "object",
"properties": map[string]any{
"name": map[string]string{"type": "string"},
"email": map[string]string{"type": "string"},
"phone": map[string]string{"type": "string"},
},
"required": []string{"name", "email"},
}
messages := []api.Message{{Role: "user", Content: "Extract contact info: " + text}}
req := &api.ChatRequest{
Model: "llama3.2", Messages: messages,
Format: schema,
Options: map[string]any{"temperature": 0},
}
var raw string
client.Chat(context.Background(), req, func(r api.ChatResponse) error {
raw += r.Message.Content
return nil
})
var contact ContactInfo
if err := json.Unmarshal([]byte(raw), &contact); err != nil {
return nil, err
}
return &contact, nil
}
HTTP Server Wrapping Ollama
package main
import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    "github.com/ollama/ollama/api"
)
type ChatRequest struct {
Message string `json:"message"`
Model string `json:"model"`
}
func chatHandler(ollamaClient *api.Client) http.HandlerFunc {
return func(w http.ResponseWriter, r *http.Request) {
var req ChatRequest
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
    http.Error(w, "invalid JSON body", http.StatusBadRequest)
    return
}
if req.Model == "" {
    req.Model = "llama3.2"
}
// net/http switches to chunked encoding automatically when we flush incrementally.
w.Header().Set("Content-Type", "text/plain; charset=utf-8")
messages := []api.Message{{Role: "user", Content: req.Message}}
ollamaClient.Chat(r.Context(), &api.ChatRequest{
Model: req.Model, Messages: messages,
}, func(resp api.ChatResponse) error {
if _, err := fmt.Fprint(w, resp.Message.Content); err != nil {
return err
}
w.(http.Flusher).Flush()
return nil
})
}
}
func main() {
    client, err := api.ClientFromEnvironment()
    if err != nil {
        log.Fatal(err)
    }
    http.HandleFunc("/chat", chatHandler(client))
    log.Fatal(http.ListenAndServe(":8080", nil))
}
Using the OLLAMA_HOST Environment Variable
The api.ClientFromEnvironment() constructor reads the OLLAMA_HOST environment variable, defaulting to http://localhost:11434 if unset. This means your Go binary can be pointed at a remote Ollama server without recompilation:
# Point to a remote Ollama server
export OLLAMA_HOST=http://192.168.1.100:11434
./my-ollama-tool
# Or inline
OLLAMA_HOST=http://team-server:11434 ./my-ollama-tool
For applications with more complex routing (different models on different servers, fallback logic), use the manual client constructor with a parsed URL. The environment variable approach is sufficient for most tools where the server location is a deployment-time concern rather than a runtime decision.
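A small constructor helper keeps both options in one place. This is an illustrative sketch — newOllamaClient is not part of the library, and it assumes net/url, net/http, and fmt are imported:
// newOllamaClient uses an explicit host when given one, otherwise falls back to
// OLLAMA_HOST / localhost via ClientFromEnvironment.
func newOllamaClient(host string) (*api.Client, error) {
    if host == "" {
        return api.ClientFromEnvironment()
    }
    u, err := url.Parse(host)
    if err != nil {
        return nil, fmt.Errorf("invalid Ollama host %q: %w", host, err)
    }
    return api.NewClient(u, http.DefaultClient), nil
}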
Testing Ollama Integrations in Go
// Use a test HTTP server to mock Ollama in unit tests
import (
"encoding/json"
"net/http"
"net/http/httptest"
"net/url"
"testing"
"github.com/ollama/ollama/api"
)
func TestChatSync(t *testing.T) {
// Mock Ollama server
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
if r.URL.Path == "/api/chat" {
w.Header().Set("Content-Type", "application/x-ndjson")
json.NewEncoder(w).Encode(api.ChatResponse{
Message: api.Message{Role: "assistant", Content: "Hello from mock!"},
Done: true,
})
}
}))
defer server.Close()
u, _ := url.Parse(server.URL)
client := api.NewClient(u, server.Client())
result, err := chatSync(client, "llama3.2", "Hi")
if err != nil {
t.Fatal(err)
}
if result != "Hello from mock!" {
t.Errorf("expected 'Hello from mock!', got %q", result)
}
}
Distributing a Go Ollama Tool
# Build for the current platform
go build -o my-ai-tool .
# Cross-compile for Linux AMD64
GOOS=linux GOARCH=amd64 go build -o my-ai-tool-linux .
# Cross-compile for macOS ARM64 (Apple Silicon)
GOOS=darwin GOARCH=arm64 go build -o my-ai-tool-mac-arm .
# Cross-compile for Windows
GOOS=windows GOARCH=amd64 go build -o my-ai-tool.exe .
# Embed version info at build time
go build -ldflags "-X main.Version=1.0.0 -X main.BuildDate=$(date -u +%Y-%m-%d)" -o my-ai-tool .
The resulting binaries require no installation — users copy the file to any location in their PATH and run it immediately. This distribution model is one of Go's strongest practical advantages for developer tooling: you can build a useful Ollama-powered CLI tool and share it as a single file download that works on any platform without any setup instructions beyond "download and run".
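The -X flags in the last command assume matching package-level variables in main. A minimal sketch of what they override (printVersion is an illustrative helper):
// Defaults are used for a plain `go build`; -ldflags overrides them at link time.
var (
    Version   = "dev"
    BuildDate = "unknown"
)

func printVersion() {
    fmt.Printf("my-ai-tool %s (built %s)\n", Version, BuildDate)
}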
Go vs Python vs JavaScript for Ollama Projects
Each language has a clear sweet spot for Ollama integration. Python is the best choice for data science workflows, Jupyter notebooks, ML pipelines, and projects with heavy dependencies on the Python ML ecosystem (NumPy, pandas, scikit-learn). JavaScript is the best choice for web applications, Next.js projects, and browser-based tools where the AI is one feature of a larger web product. Go is the best choice for CLI tools, system utilities, backend microservices, and anything you want to distribute as a binary without runtime dependencies. The Ollama API is identical across all three — if you need to switch languages for a specific project, the concepts and request patterns transfer directly with only syntax changes. Choose the language that fits your team's existing expertise and the target deployment environment, not the one with the most Ollama examples online.
Integrating Ollama with Existing Go Services
Adding Ollama to an existing Go service is straightforward because the client is just an HTTP client — it does not require a global singleton, special initialisation, or framework integration. Create the client once at service startup and pass it to handlers as a dependency. The client is concurrency-safe and can be shared across goroutines. For services that need to handle Ollama being temporarily unavailable (during restarts or model swaps), wrap requests in retry logic using exponential backoff — the standard pattern for unreliable dependencies in Go:
func chatWithRetry(client *api.Client, req *api.ChatRequest, maxAttempts int) (string, error) {
var lastErr error
for attempt := range maxAttempts {
var result string
err := client.Chat(context.Background(), req, func(r api.ChatResponse) error {
result += r.Message.Content
return nil
})
if err == nil {
return result, nil
}
lastErr = err
backoff := time.Duration(1<
Logging and Observability
For production Go services using Ollama, log the key performance metrics from each request — prompt token count, completion token count, and total duration. These metrics are available in the final streaming response chunk (resp.Done == true) and give you visibility into model performance, cost estimation, and anomaly detection (unusually long requests may indicate a model stuck in a loop):
err = client.Chat(ctx, req, func(resp api.ChatResponse) error {
result += resp.Message.Content
if resp.Done {
log.Printf("model=%s prompt_tokens=%d completion_tokens=%d duration_ms=%d",
resp.Model,
resp.PromptEvalCount,
resp.EvalCount,
resp.TotalDuration.Milliseconds(), // time.Duration reported as milliseconds
)
}
return nil
})
The Go Ecosystem for AI Development in 2026
Go's AI ecosystem is less mature than Python's but growing, particularly for infrastructure and tooling use cases. The Ollama Go library is the highest-quality Go AI integration available for local models. For vector databases, pgvector has a Go client, and Chroma has a community Go client. For embedding pipelines, calling Ollama's embeddings endpoint from Go is as simple as the Python equivalent. For production services that need to process documents, images, or audio at scale with local AI, Go's performance and concurrency characteristics make it a genuinely compelling choice — you get the AI capabilities of the Python ecosystem's best models (served by Ollama) with Go's operational characteristics (low memory overhead, fast startup, easy deployment). This combination is increasingly attractive for teams building AI infrastructure that needs to run reliably at scale without the operational overhead of Python services.
Writing Idiomatic Go with the Ollama Library
The Ollama Go library follows Go conventions closely — methods return errors as the last value, streaming uses callbacks rather than channels, and the client is designed to be created once and reused. A few patterns are worth knowing when writing idiomatic Go with the library. First, always check errors from Chat and Generate calls — unlike some streaming APIs that swallow errors, the Ollama library returns an error from the callback-based streaming methods if the connection drops mid-stream. Second, the callback function runs synchronously in the calling goroutine unless you spawn a goroutine inside it, so you can safely write to a strings.Builder inside the callback without mutex protection. Third, the Done field on the response struct marks the final chunk — check it when you need the performance metrics (EvalCount, PromptEvalCount, TotalDuration), which are only populated on that final chunk.
For production code, wrap the Ollama client in your own interface so you can swap the implementation in tests without needing a running Ollama instance. Define an LLMClient interface with the methods you use (Chat, Generate, Embeddings) and implement it both with the real Ollama client and with a mock that returns fixtures. This is standard Go dependency injection and makes your AI-powered code significantly more testable than code that calls the Ollama API directly from business logic functions. The Ollama library's callback-based streaming is simple enough that writing a mock implementation takes about 20 lines of code — a worthwhile investment for any production service.
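A minimal sketch of that interface, with the method signature copied from the real client so *api.Client satisfies it directly (LLMClient and mockLLM are illustrative names, not part of the library):
// LLMClient covers only the calls this service actually uses; *api.Client
// already satisfies it, so production code passes the real client unchanged.
type LLMClient interface {
    Chat(ctx context.Context, req *api.ChatRequest, fn api.ChatResponseFunc) error
}

// mockLLM is a test double that streams a single canned chunk.
type mockLLM struct{ reply string }

func (m mockLLM) Chat(_ context.Context, _ *api.ChatRequest, fn api.ChatResponseFunc) error {
    return fn(api.ChatResponse{
        Message: api.Message{Role: "assistant", Content: m.reply},
        Done:    true,
    })
}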
Real-World Go + Ollama Projects
The Go + Ollama combination excels in several project types. Command-line developer tools — code reviewers, documentation generators, commit message writers — benefit from Go's fast startup and single-binary distribution. Background processing daemons that classify, tag, or summarise incoming data (emails, support tickets, news items) benefit from Go's low memory overhead when running as a long-lived service. API gateways that add AI capabilities to existing services without requiring the client to know about Ollama benefit from Go's HTTP handling strengths. And infrastructure tooling — monitoring scripts, log analysers, automated report generators — benefits from Go's easy deployment and cross-compilation. In all of these cases, the Ollama library provides clean access to local AI inference without the runtime overhead or distribution complexity of equivalent Python tooling.
Getting Started
Run go get github.com/ollama/ollama/api in any Go module, create a client with api.ClientFromEnvironment(), and call client.Chat() with your messages and a callback to handle the streaming response. The first working integration takes under 30 minutes for a Go developer already familiar with HTTP client patterns. From there, add structured output using the Format field for extraction tasks, add context-based timeouts for production reliability, and consider the HTTP server wrapper pattern if you need to expose Ollama capabilities to other services or team members. The Go library's idiomatic API and the single-binary distribution model make it a strong choice for local AI tooling that needs to work reliably across environments and be easy to deploy and share — building with it feels as natural and productive as any other Go project.
The local AI ecosystem in 2026 has reached a point where the tools and patterns described in this article are stable, well-supported, and ready for production use. Whether you reach for Go, Python, JavaScript, or another language entirely, the underlying Ollama API remains the same: consistent, version-stable, and designed to support exactly the kind of practical applications this series has covered. That foundation will serve you well as Ollama and the broader local AI ecosystem continue to evolve and new models raise the quality ceiling of what local inference can achieve.