Ollama exposes a simple REST API (plus an OpenAI-compatible endpoint), so it works from any language with an HTTP client. Java applications can integrate Ollama through the Spring AI framework, which provides a clean abstraction over the Ollama API with Spring Boot autoconfiguration, or via direct HTTP calls using Spring’s RestClient or WebClient for more control.
Option 1: Spring AI (Recommended)
<!-- pom.xml -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <!-- Spring AI 1.0.0 GA artifact name; pre-1.0 milestones used spring-ai-ollama-spring-boot-starter -->
    <artifactId>spring-ai-starter-model-ollama</artifactId>
    <version>1.0.0</version>
</dependency>
# application.yml
spring:
  ai:
    ollama:
      base-url: http://localhost:11434
      chat:
        options:
          model: llama3.2
          temperature: 0.7
      embedding:
        options:
          model: nomic-embed-text
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Flux;

@RestController
public class ChatController {

    private final ChatClient chatClient;

    public ChatController(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    @PostMapping("/chat")
    public String chat(@RequestBody String message) {
        return chatClient.prompt()
                .user(message)
                .call()
                .content();
    }

    // Streaming variant: emits tokens as they arrive
    @PostMapping("/chat/stream")
    public Flux<String> chatStream(@RequestBody String message) {
        return chatClient.prompt()
                .user(message)
                .stream()
                .content();
    }
}
Prompt Templates and System Prompts
@Service
public class AssistantService {

    private final ChatClient chatClient;

    public AssistantService(ChatClient.Builder builder) {
        this.chatClient = builder
                .defaultSystem("You are a helpful Java developer assistant. Be concise.")
                .build();
    }

    public String explainCode(String code) {
        return chatClient.prompt()
                .user(u -> u.text("Explain this Java code:\n\n{code}").param("code", code))
                .call()
                .content();
    }
}
Embeddings
@Service
public class EmbeddingService {

    private final EmbeddingModel embeddingModel;

    public EmbeddingService(EmbeddingModel embeddingModel) {
        this.embeddingModel = embeddingModel;
    }

    public float[] embed(String text) {
        return embeddingModel.embed(text);
    }

    // Cosine similarity: 1.0 means identical direction, 0 means orthogonal
    public double cosineSimilarity(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
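To make the two methods concrete, here is a minimal usage sketch (the strings are invented for illustration):

// Hypothetical usage: compare two texts semantically
float[] a = embeddingService.embed("How do I reverse a list in Java?");
float[] b = embeddingService.embed("Reversing an ArrayList");
double similarity = embeddingService.cosineSimilarity(a, b); // closer to 1.0 = more similar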
Option 2: Direct HTTP with RestClient
@Service
public class OllamaDirectService {

    private final RestClient restClient;

    // Inject Spring Boot's auto-configured RestClient.Builder so the client
    // uses the application's Jackson configuration
    public OllamaDirectService(RestClient.Builder builder) {
        this.restClient = builder
                .baseUrl("http://localhost:11434")
                .build();
    }

    public record ChatMessage(String role, String content) {}
    public record ChatRequest(String model, List<ChatMessage> messages, boolean stream) {}

    // Ollama's /api/chat response carries extra fields (model, created_at, timings);
    // ignore the ones this record does not map
    @JsonIgnoreProperties(ignoreUnknown = true)
    public record ChatResponse(ChatMessage message, boolean done) {}

    public String chat(String userMessage) {
        var request = new ChatRequest(
                "llama3.2",
                List.of(new ChatMessage("user", userMessage)),
                false
        );
        var response = restClient.post()
                .uri("/api/chat")
                .body(request)
                .retrieve()
                .body(ChatResponse.class);
        return response.message().content();
    }
}
Why Spring AI for Ollama Integration
Spring AI is the official Spring framework for AI integration, providing abstractions over language models, embedding models, and vector stores that work with multiple backends including Ollama, OpenAI, Anthropic, and others. The key advantage over direct HTTP calls is the abstraction layer — your application code depends on the ChatClient and EmbeddingModel interfaces, not on Ollama-specific API details. This means you can switch from Ollama to a cloud model (or back) by changing configuration rather than rewriting application code. For enterprise Java teams, this portability and the Spring Boot autoconfiguration that handles client setup automatically are significant practical benefits.
Spring AI’s reactive support (Flux-based streaming) integrates naturally with Spring WebFlux for building non-blocking AI-powered endpoints — particularly important for long-running LLM responses where blocking threads would limit throughput. The Spring AI ecosystem also includes vector store integrations (pgvector, Redis, MongoDB Atlas) for RAG applications, which allow you to build document Q&A systems without writing custom embedding pipeline code.
RAG with Spring AI and PGVector
<!-- Additional dependency for PGVector -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <!-- Spring AI 1.0.0 GA artifact name; milestones used spring-ai-pgvector-store-spring-boot-starter -->
    <artifactId>spring-ai-starter-vector-store-pgvector</artifactId>
    <version>1.0.0</version>
</dependency>
@Service
public class DocumentQaService {

    private final ChatClient chatClient;
    private final VectorStore vectorStore;

    public DocumentQaService(ChatClient.Builder builder, VectorStore vectorStore) {
        this.chatClient = builder.build();
        this.vectorStore = vectorStore;
    }

    public void indexDocument(String content, String documentId) {
        var doc = new Document(content, Map.of("documentId", documentId));
        vectorStore.add(List.of(doc));
    }

    public String answerQuestion(String question) {
        // Retrieve the most relevant chunks for the question
        var docs = vectorStore.similaritySearch(
                SearchRequest.builder().query(question).topK(3).build()
        );
        String context = docs.stream()
                .map(Document::getText)
                .collect(Collectors.joining("\n\n"));
        return chatClient.prompt()
                .system("Answer based only on the provided context.")
                .user(u -> u.text("Context:\n{context}\n\nQuestion: {question}")
                        .param("context", context)
                        .param("question", question))
                .call()
                .content();
    }
}
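A minimal usage sketch of the service (the documents and question are invented for illustration):

// Hypothetical usage: index a couple of snippets, then ask a question over them
qaService.indexDocument("Our refund window is 30 days from delivery.", "policy-1");
qaService.indexDocument("Support is available Monday to Friday, 9am to 5pm CET.", "policy-2");
String answer = qaService.answerQuestion("How long do customers have to request a refund?");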
Structured Output
public record ProductInfo(
        String name,
        double price,
        boolean inStock,
        List<String> tags
) {}

@Service
public class ExtractionService {

    private final ChatClient chatClient;

    public ExtractionService(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    public ProductInfo extractProduct(String text) {
        return chatClient.prompt()
                .user("Extract product info from: " + text)
                .call()
                .entity(ProductInfo.class); // Spring AI generates the format instructions and parses the JSON
    }
}
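A quick usage sketch (the input text is invented for illustration):

// Hypothetical input; with a capable model the parsed record should look like:
// ProductInfo[name=UltraWidget 3000, price=49.99, inStock=true, tags=[gadget, bestseller]]
ProductInfo info = extractionService.extractProduct(
        "The UltraWidget 3000 costs $49.99, is in stock, and is tagged gadget and bestseller.");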
Testing Spring AI Applications
@SpringBootTest
class ChatServiceTest {

    @Autowired
    private ChatClient.Builder chatClientBuilder;

    @Test
    void testChatResponse() {
        // Integration test against a real Ollama instance;
        // for unit tests, swap in a mock ChatClient instead
        var chatClient = chatClientBuilder.build();
        String response = chatClient.prompt()
                .user("Reply with exactly: OK")
                .call().content();
        assertThat(response).containsIgnoringCase("ok");
    }
}

// For unit tests without Ollama:
@TestConfiguration
class MockAiConfig {

    @Bean
    ChatClient chatClient() {
        // FakeChatModel is a hand-written test stub implementing Spring AI's
        // ChatModel interface and returning canned responses
        return ChatClient.builder(new FakeChatModel()).build();
    }
}
Getting Started
Add the Spring AI Ollama starter to your pom.xml, configure the base URL and model in application.yml, and inject ChatClient into your service classes. The Spring Boot autoconfiguration handles client creation and connection pooling automatically. Start with the basic chat and embedding examples, then add RAG with a vector store if your application needs document Q&A. The Spring AI documentation covers all supported vector stores and provides migration guides for switching between Ollama and cloud model backends — useful for hybrid deployments where you use local models in development and cloud models in production, or vice versa.
Conversation History in Spring AI
@Service
public class ConversationService {

    private final ChatClient chatClient;

    // Note: the per-session lists are not synchronized; this is fine for one
    // request per session at a time, otherwise add per-session locking
    private final Map<String, List<Message>> sessions = new ConcurrentHashMap<>();

    public ConversationService(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    public String chat(String sessionId, String userMessage) {
        var history = sessions.computeIfAbsent(sessionId, k -> new ArrayList<>());
        history.add(new UserMessage(userMessage));
        String response = chatClient.prompt()
                .messages(history)
                .call()
                .content();
        history.add(new AssistantMessage(response));
        // Trim to the last 20 messages (copy into a new list rather than
        // storing a subList view, which would pin the backing list)
        if (history.size() > 20) {
            sessions.put(sessionId, new ArrayList<>(history.subList(history.size() - 20, history.size())));
        }
        return response;
    }

    public void clearSession(String sessionId) {
        sessions.remove(sessionId);
    }
}
Async and WebFlux Integration
@RestController
@RequestMapping("/ai")
public class ReactiveAiController {

    private final ChatClient chatClient;

    public ReactiveAiController(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    // Streaming endpoint: returns Server-Sent Events
    @GetMapping(value = "/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    public Flux<String> stream(@RequestParam String prompt) {
        return chatClient.prompt()
                .user(prompt)
                .stream()
                .content();
    }

    // Non-blocking wrapper around the blocking call() API
    @PostMapping("/generate")
    public Mono<ResponseEntity<String>> generate(@RequestBody String prompt) {
        return Mono.fromCallable(() ->
                        chatClient.prompt().user(prompt).call().content()
                ).subscribeOn(Schedulers.boundedElastic())
                .map(ResponseEntity::ok);
    }
}
Configuration for Different Environments
# application-dev.yml (local Ollama)
spring:
  ai:
    ollama:
      base-url: http://localhost:11434
      chat:
        options:
          model: qwen2.5-coder:7b

# application-prod.yml (team Ollama server)
spring:
  ai:
    ollama:
      base-url: http://ai-server.internal:11434
      chat:
        options:
          model: llama3.2
          temperature: 0.3
          num-ctx: 16384
Spring Boot’s profile system makes it straightforward to use a local Ollama in development and a shared team Ollama server in production without any code changes — just activate the appropriate profile. This pattern also allows gradual migration: start with local Ollama for development, add a team server for staging, and decide based on performance and cost whether production should use Ollama or a cloud model backend. Spring AI’s model-agnostic interfaces make the cloud option available without architectural changes if you ever need it.
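Profile activation is standard Spring Boot rather than anything Spring AI specific; a minimal sketch:

# Set the default profile in application.yml...
spring:
  profiles:
    active: dev
# ...or override it at deploy time with the SPRING_PROFILES_ACTIVE
# environment variable or the --spring.profiles.active=prod command-line flag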
Java vs Python for Ollama Projects
For teams already working in Java, Spring AI with Ollama is the most natural integration path — it follows Spring conventions, integrates with existing Spring infrastructure (dependency injection, configuration management, security), and does not require a separate Python service for AI capabilities. For greenfield AI projects without a Java constraint, Python’s broader ML ecosystem (LangChain, Pydantic, faster-whisper, the official Ollama Python library) offers more AI-specific tooling with less ceremony. The choice is primarily about your team’s existing stack and skills — if you know Spring Boot, Spring AI with Ollama is an excellent option. If you are building AI-first applications without an existing Spring codebase, Python’s AI ecosystem provides more ready-made components for common tasks.
Ollama with Kotlin
If your team uses Kotlin rather than Java, Spring AI works identically — Kotlin’s coroutines interoperate with Spring’s reactive Flux for streaming responses (via the kotlinx-coroutines-reactor bridge), and Kotlin data classes work as well as Java records for the request/response types. All Spring AI annotations and beans work in Kotlin with no configuration changes, and Kotlin’s null safety forces explicit handling of absent values in extracted entities and structured output parsing.
Performance Considerations for Java
Java’s performance characteristics are well-suited to Ollama integration. The JVM’s JIT compilation and efficient thread pool management handle concurrent requests well. Spring WebFlux’s non-blocking I/O is particularly important for streaming responses — a blocking REST controller would hold a thread for the entire duration of an LLM response (potentially 30–120 seconds), which limits throughput to the number of available threads. WebFlux with Flux-based streaming releases threads between chunks, allowing many concurrent streaming responses with a small thread pool. For production deployments serving more than a handful of concurrent users, use the Flux-based streaming endpoints rather than blocking calls even when the response content is the same.
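To make the arithmetic concrete: with Tomcat’s default pool of 200 request threads and responses averaging 60 seconds, a blocking controller caps out at 200 in-flight responses and roughly three new requests per second sustained. The same hardware serving Flux-based streams is limited by memory and the Ollama server’s own capacity rather than by the thread pool.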
Spring AI in the Broader Ecosystem
Spring AI’s design as a model-agnostic abstraction layer means your investment in building a Spring AI application is not tied to Ollama’s availability or quality. If Ollama adds a new feature you want to use, update the starter and use it. If a new local inference backend (llama.cpp server, vLLM, LM Studio’s API) becomes preferable for your use case, switching is a configuration change rather than a rewrite. If you need cloud model quality for specific tasks, add an OpenAI or Anthropic Spring AI backend alongside Ollama and route requests to the appropriate backend based on task type. This flexibility is the central value proposition of the abstraction layer — it keeps your application code stable while the rapidly evolving AI backend ecosystem changes around it.
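As an illustration, here is a hedged sketch of task-based routing between two backends, assuming both the Ollama and OpenAI starters are on the classpath. OllamaChatModel, OpenAiChatModel, and ChatClient.create() are real Spring AI types; the routing rule itself is invented for the example, and injecting the model beans directly avoids relying on a single auto-configured ChatClient.Builder when two backends are present.

@Service
public class RoutingChatService {

    private final ChatClient localClient;
    private final ChatClient cloudClient;

    public RoutingChatService(OllamaChatModel ollama, OpenAiChatModel openAi) {
        this.localClient = ChatClient.create(ollama);
        this.cloudClient = ChatClient.create(openAi);
    }

    public String chat(String message, boolean highStakes) {
        // Invented routing rule: send high-stakes requests to the cloud backend
        var client = highStakes ? cloudClient : localClient;
        return client.prompt().user(message).call().content();
    }
}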
Getting the Most from Spring AI with Ollama
The recommended progression for a new Spring AI + Ollama project: start with the basic chat endpoint and verify it works with a small model; add streaming with the Flux-based endpoint; add embeddings with nomic-embed-text for any feature that needs semantic search; add a vector store integration when you need RAG; and add structured output with entity extraction when you need to parse model responses into Java objects. Each step is incremental and independently testable. The Spring AI documentation is comprehensive and the community is active — most integration questions have documented answers, and the consistency of the Spring Boot autoconfiguration model means troubleshooting follows familiar patterns for any Spring developer.
The Spring AI Roadmap
Spring AI is actively developed by the Spring team with regular releases. Features added since the 1.0 release include improved streaming abstractions, additional vector store integrations, multimodal support for models like Gemma 3 and LLaVA, and better structured output support with JSON schema generation from Java types. As Ollama adds new model features (tool use, vision, structured output), Spring AI typically adds support within one or two releases. Following the Spring AI changelog alongside Ollama release notes ensures you are aware of new capabilities as they become available. The combination of Spring’s long-term support commitment and Ollama’s active development makes the Spring AI + Ollama stack a reliable long-term foundation for enterprise Java AI applications — stable enough for production, with a clear path to adopting new capabilities as the ecosystem matures.
For Java developers who have not yet integrated AI into their Spring applications, the low barrier to getting started with Spring AI and Ollama — a dependency, a configuration property, and an injected interface — means there is no longer a meaningful technical reason to delay. The question is which use cases in your existing applications would benefit from natural language understanding, content generation, or document processing, not whether Java can integrate with local AI models. The answer to the second question is clearly yes, and Spring AI with Ollama is how you get there.
Error Handling and Resilience
@Service
public class ResilientChatService {

    private final ChatClient chatClient;

    public ResilientChatService(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    public String chat(String message) {
        try {
            return chatClient.prompt()
                    .user(message)
                    .call()
                    .content();
        } catch (ResourceAccessException e) {
            // Spring's RestClient surfaces connectivity failures (connection
            // refused, timeouts) as ResourceAccessException;
            // ServiceUnavailableException here is an application-defined exception
            throw new ServiceUnavailableException("AI service not available — try again shortly");
        } catch (Exception e) {
            throw new RuntimeException("AI request failed: " + e.getMessage(), e);
        }
    }
}
// With Spring Retry for automatic retries (requires the spring-retry
// dependency and @EnableRetry on a configuration class)
@Retryable(retryFor = Exception.class, maxAttempts = 3, backoff = @Backoff(delay = 2000))
public String chatWithRetry(String message) {
    return chatClient.prompt().user(message).call().content();
}
Migrating from OpenAI to Ollama in Spring
If you have an existing Spring application using Spring AI with OpenAI, migrating to Ollama for local development is straightforward. Replace the OpenAI starter with the Ollama starter in pom.xml, update application.yml to point at localhost with your chosen model name, and run. The ChatClient and EmbeddingModel interfaces are identical across backends — your service classes and controllers require no changes. The main adjustment is model naming (Ollama uses names like llama3.2 instead of OpenAI’s gpt-4o) and the set of options available (Ollama-specific parameters like num_ctx versus OpenAI’s max_tokens). For developers who want to use Ollama locally and OpenAI in production, Spring Boot profiles make this environment-based switching straightforward without any code changes — just different configuration in different profile YAML files.
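For the hybrid setup, a sketch of the per-profile configuration. The property keys are the documented Spring AI ones; the model names are examples, and both starters need to be on the classpath for this to work:

# application-dev.yml: local Ollama
spring:
  ai:
    ollama:
      base-url: http://localhost:11434
      chat:
        options:
          model: llama3.2

# application-prod.yml: OpenAI
spring:
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      chat:
        options:
          model: gpt-4o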
Next Steps
Add the Spring AI Ollama starter, configure application.yml with your Ollama URL and model, inject ChatClient into a service class, and write a single endpoint that calls it. Deploy it alongside your existing Spring Boot application — no separate process, no additional runtime, just a new dependency and configuration block. From that starting point, add embeddings, RAG, streaming, and structured output incrementally as your use cases require them. The Spring AI documentation provides runnable examples for each feature, and the consistent patterns across all Spring AI components mean each new capability you add feels like a natural extension of what you already built rather than a new framework to learn.
Why This Matters for Enterprise Java Teams
For enterprise Java teams evaluating local AI, Spring AI with Ollama answers the most common objections: data leaves the organisation (it does not — Ollama is local), the Java ecosystem has no AI tooling (Spring AI provides it), the setup is too complex for existing infrastructure (the starter is a single dependency with familiar Spring Boot conventions), and local models are not good enough for business use (7–13B models handle a wide range of business NLP tasks adequately, and larger models are available for higher-quality requirements). The combination removes the barriers that have kept AI capabilities out of many Java enterprise applications while providing a clear migration path to cloud models if quality requirements exceed what local hardware can deliver. For teams ready to add AI to their Spring applications, this is the stack to start with — and the consistency of the Spring conventions means the learning curve is shorter than any other Java AI framework currently available.
The combination of Spring AI’s familiar abstractions, Ollama’s local inference, and Spring Boot’s autoconfiguration model creates an on-ramp to enterprise Java AI development that is faster than any alternative currently available — and the patterns you learn building your first Spring AI + Ollama feature transfer directly to more complex applications as your team’s experience with local AI grows.
Spring AI’s investment in model-agnostic interfaces and comprehensive Spring Boot autoconfiguration has lowered the barrier to AI integration in Java to the point where adding a chat or summarisation feature to an existing Spring application is comparable in effort to adding a database query — a few lines of dependency and configuration, then straightforward service method calls using familiar patterns.