How to Use Ollama with Django

Django is one of the most widely used Python web frameworks, and adding local LLM capabilities to a Django application is straightforward with Ollama. Whether you want an AI-powered chat endpoint, automatic content summarisation, intelligent search, or document analysis, Ollama provides a local HTTP API you can call from anywhere in a Django project: views, models, management commands, signals, or Celery tasks. This guide covers the core integration patterns: a synchronous view, an async view with streaming, a reusable service class, Django REST Framework integration, and background processing with Celery.

Running Ollama locally means your Django application’s AI features work without any cloud API costs or data leaving your server. For production deployments, Ollama runs as a systemd service on the same machine or on a dedicated GPU server on the local network, and your Django app calls it over HTTP just as it would any other internal service.

Setup

Install the required packages:

pip install django httpx djangorestframework

Add your Ollama configuration to settings.py:

OLLAMA_BASE_URL = "http://localhost:11434"
OLLAMA_DEFAULT_MODEL = "llama3.2"
OLLAMA_TIMEOUT = 120

Keeping configuration in settings means you can override it per environment — a development machine pointing at localhost, a production server pointing at a dedicated GPU host — without touching application code.
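One way to make that per-environment override concrete is to read the three values from environment variables and fall back to the defaults shown above. The sketch below is only an illustration: the helper name ollama_settings is hypothetical, and reusing the setting names as environment variable names is an assumption, not something Django or Ollama requires.

```python
import os


def ollama_settings(env=os.environ):
    """Return the Ollama settings, preferring environment overrides.

    Hypothetical helper: the environment variable names simply mirror the
    setting names used in this guide.
    """
    return {
        "OLLAMA_BASE_URL": env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
        "OLLAMA_DEFAULT_MODEL": env.get("OLLAMA_DEFAULT_MODEL", "llama3.2"),
        # Environment values are strings, so convert the timeout to an int.
        "OLLAMA_TIMEOUT": int(env.get("OLLAMA_TIMEOUT", "120")),
    }


# In settings.py the values would be unpacked into module-level names:
_cfg = ollama_settings()
OLLAMA_BASE_URL = _cfg["OLLAMA_BASE_URL"]
OLLAMA_DEFAULT_MODEL = _cfg["OLLAMA_DEFAULT_MODEL"]
OLLAMA_TIMEOUT = _cfg["OLLAMA_TIMEOUT"]
```

With this in place, a production host could export OLLAMA_BASE_URL with the GPU server's address while development machines keep the localhost default.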

A Reusable OllamaService

Create a service class in myapp/services/ollama.py that encapsulates all Ollama interactions:

import httpx
from django.conf import settings

class OllamaService:
    def __init__(self):
        self.base_url = settings.OLLAMA_BASE_URL
        self.model = settings.OLLAMA_DEFAULT_MODEL
        self.timeout = settings.OLLAMA_TIMEOUT

    def chat(self, messages: list, model: str = None) -> str:
        """Synchronous chat — use in regular views and management commands."""
        with httpx.Client(timeout=self.timeout) as client:
            resp = client.post(
                f"{self.base_url}/api/chat",
                json={"model": model or self.model, "messages": messages, "stream": False}
            )
            resp.raise_for_status()
            return resp.json()["message"]["content"]

    async def achat(self, messages: list, model: str = None) -> str:
        """Async chat — use in async views."""
        async with httpx.AsyncClient(timeout=self.timeout) as client:
            resp = await client.post(
                f"{self.base_url}/api/chat",
                json={"model": model or self.model, "messages": messages, "stream": False}
            )
            resp.raise_for_status()
            return resp.json()["message"]["content"]

    def embed(self, text: str, model: str = "nomic-embed-text") -> list:
        """Generate embeddings for semantic search."""
        with httpx.Client(timeout=60) as client:
            resp = client.post(
                f"{self.base_url}/api/embed",
                json={"model": model, "input": text}
            )
            resp.raise_for_status()
            return resp.json()["embeddings"][0]

ollama = OllamaService()

The module-level ollama instance acts as a singleton — import it wherever you need it. Django does not have a built-in dependency injection container, so a module-level singleton is the standard pattern for service classes. The raise_for_status() calls convert HTTP error responses into exceptions that Django’s error handling can catch and log appropriately.

A Synchronous Chat View

Here is a standard Django view that accepts a POST request and returns an AI response:

import json

import httpx  # needed for the exception classes caught below
from django.http import JsonResponse
from django.views import View
from django.views.decorators.csrf import csrf_exempt
from django.utils.decorators import method_decorator
from .services.ollama import ollama

@method_decorator(csrf_exempt, name="dispatch")
class ChatView(View):
    def post(self, request):
        try:
            body = json.loads(request.body)
            prompt = body.get("prompt", "").strip()
            if not prompt:
                return JsonResponse({"error": "prompt required"}, status=400)

            messages = [{"role": "user", "content": prompt}]
            reply = ollama.chat(messages)
            return JsonResponse({"reply": reply})

        except httpx.ConnectError:
            # Ollama is unreachable (not started, wrong host or port)
            return JsonResponse({"error": "Ollama is not running"}, status=503)
        except httpx.HTTPStatusError as e:
            # Ollama answered with a 4xx/5xx, e.g. an unknown model name
            return JsonResponse({"error": str(e)}, status=502)
        except json.JSONDecodeError:
            return JsonResponse({"error": "Invalid JSON"}, status=400)

Wire it up in urls.py:

from django.urls import path
from .views import ChatView

urlpatterns = [
    path("api/chat/", ChatView.as_view(), name="chat"),
]

The explicit error handling on ConnectError and HTTPStatusError gives API consumers a meaningful status code and message rather than a generic 500. Django’s default exception handler would catch unhandled exceptions and log them, but returning a 503 for “Ollama not running” is much more actionable for a client application than an opaque server error.

Streaming with an Async View

Django has supported async views since 3.1, and from 4.2 onwards StreamingHttpResponse accepts an async iterator when served over ASGI. Use StreamingHttpResponse with an async generator to stream tokens from Ollama to the browser:

import json
from django.http import StreamingHttpResponse
from django.views.decorators.csrf import csrf_exempt
from django.views.decorators.http import require_POST
import httpx
from django.conf import settings

@csrf_exempt
@require_POST
async def chat_stream(request):
    body = json.loads(request.body)
    prompt = body.get("prompt", "")

    async def token_generator():
        async with httpx.AsyncClient(timeout=120) as client:
            async with client.stream(
                "POST",
                f"{settings.OLLAMA_BASE_URL}/api/chat",
                json={
                    "model": settings.OLLAMA_DEFAULT_MODEL,
                    "messages": [{"role": "user", "content": prompt}],
                    "stream": True
                }
            ) as resp:
                async for line in resp.aiter_lines():
                    if not line:
                        continue
                    chunk = json.loads(line)
                    token = chunk.get("message", {}).get("content", "")
                    if token:
                        yield f"data: {json.dumps({'token': token})}\n\n"
                    if chunk.get("done"):
                        yield "data: [DONE]\n\n"
                        break

    return StreamingHttpResponse(
        token_generator(),
        content_type="text/event-stream",
        headers={"X-Accel-Buffering": "no", "Cache-Control": "no-cache"}
    )

Async views run under python manage.py runserver, which is enough for development, but in production you need an ASGI server such as Uvicorn or Daphne for streaming to behave properly. Configure settings.ASGI_APPLICATION and deploy with uvicorn myproject.asgi:application. If your project uses WSGI in production, keep streaming in a separate microservice or use Server-Sent Events via a Celery task instead.
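On the consuming side, the stream produced by the view is a sequence of "data: ..." lines separated by blank lines and terminated by a [DONE] sentinel. The sketch below shows one way a Python client might decode that framing; the function name iter_sse_tokens is hypothetical, and the parser handles only the fields this particular view emits, not the full Server-Sent Events format.

```python
import json
from typing import Iterable, Iterator


def iter_sse_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield tokens from 'data: ...' lines until the [DONE] sentinel."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)["token"]


# Lines as they would arrive from the streaming view:
sample = [
    'data: {"token": "Hel"}', "",
    'data: {"token": "lo"}', "",
    "data: [DONE]", "",
]
print("".join(iter_sse_tokens(sample)))
```

Against a live server, the same function could be fed the line iterator of a streamed httpx response, for example resp.iter_lines() inside a client.stream("POST", ...) block.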

Django REST Framework Integration

For API projects using Django REST Framework, create a proper serializer and API view:

from rest_framework import serializers, status
from rest_framework.views import APIView
from rest_framework.response import Response
from rest_framework.permissions import IsAuthenticated
from .services.ollama import ollama

class ChatRequestSerializer(serializers.Serializer):
    prompt = serializers.CharField(max_length=4000)
    model = serializers.CharField(required=False)
    system = serializers.CharField(required=False)

class ChatResponseSerializer(serializers.Serializer):
    reply = serializers.CharField()
    model = serializers.CharField()

class ChatAPIView(APIView):
    permission_classes = [IsAuthenticated]

    def post(self, request):
        serializer = ChatRequestSerializer(data=request.data)
        serializer.is_valid(raise_exception=True)
        data = serializer.validated_data

        messages = []
        if data.get("system"):
            messages.append({"role": "system", "content": data["system"]})
        messages.append({"role": "user", "content": data["prompt"]})

        try:
            reply = ollama.chat(messages, model=data.get("model"))
            return Response({"reply": reply, "model": data.get("model", ollama.model)})
        except Exception as e:
            return Response({"error": str(e)}, status=status.HTTP_502_BAD_GATEWAY)

DRF’s serializer handles input validation automatically — raise_exception=True returns a 400 with field-level error details if the prompt is missing or too long. The IsAuthenticated permission class ensures only logged-in users can call the endpoint, which is important when exposing an AI endpoint to a team or the public to prevent abuse.

Background Processing with Celery

For long-running AI tasks — processing a large document, generating a full report, batch-classifying many items — run Ollama calls in Celery tasks so they don’t block web request threads:

from celery import shared_task
from .services.ollama import ollama
from .models import Document

@shared_task(bind=True, max_retries=3)
def summarise_document(self, document_id: int) -> str:
    try:
        doc = Document.objects.get(pk=document_id)
        messages = [
            {"role": "system", "content": "Summarise the following document concisely."},
            {"role": "user", "content": doc.content[:8000]}
        ]
        summary = ollama.chat(messages)
        doc.summary = summary
        doc.save(update_fields=["summary"])
        return f"Summarised document {document_id}"
    except Document.DoesNotExist:
        return f"Document {document_id} not found"
    except Exception as exc:
        raise self.retry(exc=exc, countdown=60)

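# In a view (views.py), trigger the task asynchronously:
from django.http import JsonResponse

def upload_document(request):
    doc = Document.objects.create(content=request.POST["content"])
    # .delay() only enqueues the task; the summary is written later by a worker
    summarise_document.delay(doc.pk)
    return JsonResponse({"id": doc.pk, "status": "processing"})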

The max_retries=3 and retry(countdown=60) pattern automatically retries failed tasks up to three times with a 60-second delay. If Ollama is temporarily unavailable — restarting, loading a model, or under heavy load — the task retries rather than failing permanently. The view returns immediately with the document ID and a “processing” status, and the client can poll a status endpoint or receive a webhook notification when the summary is ready.
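A fixed 60-second delay is the simplest policy. If Ollama tends to stay busy for longer stretches, a growing delay is a common variation. The helper below is a sketch of that idea rather than part of the task above: the name retry_countdown and the 900-second cap are assumptions chosen for illustration.

```python
def retry_countdown(retries: int, base: int = 60, cap: int = 900) -> int:
    """Seconds to wait before the next attempt: base, 2*base, 4*base, ... up to cap."""
    return min(base * (2 ** retries), cap)


# Delays for the first, second and third retry of a task:
for attempt in range(3):
    print(attempt, retry_countdown(attempt))

# Inside the Celery task, the fixed countdown could then be replaced with:
#     raise self.retry(exc=exc, countdown=retry_countdown(self.request.retries))
```

Celery exposes the number of retries already performed as self.request.retries on bound tasks, which is what makes the per-attempt delay possible.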

Using Embeddings for Semantic Search

Store document embeddings in your database and use cosine similarity for semantic search. With PostgreSQL and the pgvector extension, you can store and query embeddings natively:

# Install pgvector support first:
#   pip install pgvector

# models.py
from django.db import models
from pgvector.django import VectorField

from .services.ollama import ollama

class Article(models.Model):
    title = models.CharField(max_length=255)
    content = models.TextField()
    embedding = VectorField(dimensions=768, null=True)

    def save(self, *args, **kwargs):
        # Compare against None: a stored vector may come back as a NumPy array,
        # whose truth value is ambiguous.
        if self.embedding is None:
            self.embedding = ollama.embed(self.content[:2000])
        super().save(*args, **kwargs)

# Semantic search view
from pgvector.django import CosineDistance

def semantic_search(request):
    query = request.GET.get("q", "")
    if not query:
        return JsonResponse({"results": []})
    query_embedding = ollama.embed(query)
    results = Article.objects.order_by(
        CosineDistance("embedding", query_embedding)
    )[:10]
    return JsonResponse({"results": [
        {"id": a.id, "title": a.title} for a in results
    ]})

The nomic-embed-text model produces 768-dimensional vectors, matching the dimensions=768 in the model field. Pull it with ollama pull nomic-embed-text before running the server. The CosineDistance ordering from pgvector performs the vector similarity search entirely in PostgreSQL, making it fast even for large article collections without any external vector database.
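To make the ordering concrete, cosine distance is one minus the cosine similarity of two vectors, so 0 means the vectors point the same way and larger values mean less similar. The pure-Python sketch below mirrors that calculation on toy two-dimensional vectors; it is for illustration only, since in the view above the database does this work.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy "embeddings" keyed by document name (made-up values)
docs = {"a": [1.0, 0.0], "b": [0.6, 0.8], "c": [0.0, 1.0]}
query = [1.0, 0.1]

# Smallest distance first, which is the same ordering the queryset asks for
ranked = sorted(docs, key=lambda name: cosine_distance(docs[name], query))
print(ranked)
```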

A Management Command for Batch Processing

Django management commands are ideal for one-off or scheduled batch AI tasks. Create myapp/management/commands/generate_summaries.py:

from django.core.management.base import BaseCommand
from myapp.models import Article
from myapp.services.ollama import ollama

class Command(BaseCommand):
    help = "Generate AI summaries for articles that don't have one"

    def add_arguments(self, parser):
        parser.add_argument("--limit", type=int, default=10)
        parser.add_argument("--model", type=str, default=None)

    def handle(self, *args, **options):
        # Assumes Article has a summary text field that is empty until processed.
        # Evaluate the queryset once so the total stays stable while rows are updated.
        articles = list(Article.objects.filter(summary="")[:options["limit"]])
        total = len(articles)
        self.stdout.write(f"Processing {total} articles...")

        for i, article in enumerate(articles, 1):
            try:
                summary = ollama.chat(
                    [{"role": "user", "content": f"Summarise in 2 sentences:\n\n{article.content[:4000]}"}],
                    model=options["model"]
                )
                article.summary = summary
                article.save(update_fields=["summary"])
                self.stdout.write(f"[{i}/{total}] {article.title}")
            except Exception as e:
                self.stderr.write(f"Error on {article.title}: {e}")

        self.stdout.write(self.style.SUCCESS("Done."))

Run with python manage.py generate_summaries --limit 50. Management commands work well in cron jobs: 0 2 * * * cd /app && python manage.py generate_summaries --limit 100 processes 100 articles nightly without requiring Celery or any other infrastructure. The --limit argument prevents runaway processing if a large backlog builds up.

Testing Django Ollama Integration

Test your views without a running Ollama instance by mocking the service:

import json
from unittest.mock import patch

from django.test import TestCase

class ChatViewTests(TestCase):
    # django.test.TestCase already provides self.client, so no setUp is needed.

    @patch("myapp.views.ollama.chat")
    def test_chat_returns_reply(self, mock_chat):
        mock_chat.return_value = "Hello from mock Ollama!"
        resp = self.client.post(
            "/api/chat/",
            data=json.dumps({"prompt": "Hi"}),
            content_type="application/json"
        )
        self.assertEqual(resp.status_code, 200)
        data = json.loads(resp.content)
        self.assertEqual(data["reply"], "Hello from mock Ollama!")

    @patch("myapp.views.ollama.chat")
    def test_ollama_down_returns_503(self, mock_chat):
        import httpx
        mock_chat.side_effect = httpx.ConnectError("Connection refused")
        resp = self.client.post(
            "/api/chat/",
            data=json.dumps({"prompt": "Hi"}),
            content_type="application/json"
        )
        self.assertEqual(resp.status_code, 503)

Patching myapp.views.ollama.chat rather than the underlying httpx calls tests the view’s error handling logic at the right level of abstraction. The tests run in milliseconds, pass on any machine regardless of whether Ollama is installed, and verify both the happy path and the failure modes that matter in production.

Performance Considerations

The key performance consideration for Django and Ollama is that LLM inference is slow compared to typical database queries. A chat response might take 5 to 30 seconds depending on model size and prompt length. For synchronous WSGI Django, each gunicorn worker thread is blocked for the entire duration of the Ollama call, reducing the concurrency of your web server proportionally. If you have 4 gunicorn workers and all four are waiting on Ollama, your site becomes unresponsive to other requests.
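The arithmetic behind that warning is simple enough to write down. The sketch below assumes synchronous workers that each handle one blocking Ollama call at a time; the function name is hypothetical and the numbers are the illustrative ones from the paragraph above, not measurements.

```python
def max_requests_per_minute(workers: int, seconds_per_call: float) -> float:
    """Upper bound on Ollama-backed requests completed per minute
    when every worker is blocked for the full duration of each call."""
    return workers * 60.0 / seconds_per_call


for seconds in (5, 30):
    ceiling = max_requests_per_minute(4, seconds)
    print(f"4 workers, {seconds}s per call: at most {ceiling:.0f} requests/minute")
```

Any other traffic competes for the same workers, so the practical ceiling for the rest of the site is lower still.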

The cleanest solutions are: use async Django with Uvicorn (workers are not blocked during awaits), offload to Celery tasks and return job IDs immediately, or dedicate separate gunicorn workers to AI endpoints with a higher timeout. Whichever approach fits your architecture, the OllamaService class encapsulates the HTTP logic so switching between sync, async, and background processing is a matter of calling ollama.chat, await ollama.achat, or summarise_document.delay — the service interface stays the same.

Adding a Django Signal for Auto-Processing

Django signals let you trigger Ollama processing automatically when a model instance is saved, without modifying the save method or the view that creates it. This is useful for generating summaries, embeddings, or tags whenever new content is added to the database:

from django.db.models.signals import post_save
from django.dispatch import receiver
from .models import Article
from .tasks import generate_embedding

@receiver(post_save, sender=Article)
def article_post_save(sender, instance, created, **kwargs):
    if created and instance.content:
        # Queue embedding generation asynchronously
        generate_embedding.delay(instance.pk)

Connect the signal in your app’s AppConfig.ready() method by importing the signals module there. post_save fires after every save, but the created check means the Celery task is queued only when a new article is first created, with no manual trigger needed in any view. The generate_embedding task is assumed to be a Celery task defined in tasks.py, written along the same lines as summarise_document. This pattern keeps the AI processing logic decoupled from the view layer entirely, making it easy to add or remove without touching the views.

Caching Responses for Repeated Queries

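LLM responses for identical inputs are largely repeatable when the temperature is set to 0 (ideally with a fixed seed), and even when they are not, a previously generated answer is usually acceptable. That makes them good candidates for caching. Use Django’s cache framework to store responses and avoid redundant Ollama calls: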

import hashlib

from django.core.cache import cache

from .services.ollama import ollama

def cached_chat(messages: list, model: str = None, ttl: int = 3600) -> str:
    key = "ollama:" + hashlib.sha256(
        str(messages).encode() + (model or "").encode()
    ).hexdigest()[:32]
    cached = cache.get(key)
    # Compare against None so an empty-string reply still counts as a cache hit
    if cached is not None:
        return cached
    result = ollama.chat(messages, model=model)
    cache.set(key, result, ttl)
    return result

A one-hour TTL is reasonable for most use cases — long enough to benefit from caching repeated queries within a session, short enough that stale responses don’t persist indefinitely. For frequently asked questions or document summaries that change rarely, increase the TTL to 24 hours or more. Use Django’s Redis cache backend in production for cache persistence across server restarts and shared cache across multiple Django workers.
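The cache key matters as much as the TTL: two requests should share an entry only when the model and every message match. A possible refinement, sketched below with a hypothetical chat_cache_key helper, is to hash a canonical JSON encoding so that the key does not depend on dictionary key order or on how Python happens to print the list.

```python
import hashlib
import json


def chat_cache_key(messages, model=None, prefix="ollama:"):
    """Build a stable cache key from the model name and message list."""
    payload = json.dumps(
        {"model": model or "", "messages": messages},
        sort_keys=True,            # key order inside each message is irrelevant
        separators=(",", ":"),     # no whitespace differences
        ensure_ascii=False,
    )
    return prefix + hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]


a = chat_cache_key([{"role": "user", "content": "Hi"}], model="llama3.2")
b = chat_cache_key([{"content": "Hi", "role": "user"}], model="llama3.2")
c = chat_cache_key([{"role": "user", "content": "Hi"}], model="mistral")
print(a == b, a == c)
```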

Deployment Checklist

Before deploying a Django application with Ollama integration to production, work through a few configuration items. Set OLLAMA_BASE_URL to point at your GPU server’s internal IP rather than localhost if Ollama runs on a separate machine. Configure the OLLAMA_TIMEOUT high enough for your largest expected prompt — 120 seconds is a safe default, but batch processing endpoints may need longer. Add the Ollama URL to your monitoring stack so you get alerted if the service goes down. Set OLLAMA_KEEP_ALIVE on the Ollama side to a long duration so models stay loaded between requests rather than reloading on every call. And document clearly in your team’s runbook which models need to be pulled and how to restart the service after a server reboot.
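For the monitoring item, a small probe that a health-check endpoint or cron job can call is often enough. The sketch below assumes that Ollama's GET /api/tags endpoint, which lists locally available models, is an acceptable liveness signal; the function name ollama_is_up and the injectable opener parameter are illustrative choices that keep the check easy to test.

```python
import urllib.request


def ollama_is_up(base_url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if the Ollama server answers GET /api/tags with HTTP 200."""
    try:
        with opener(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts and urllib's URLError/HTTPError
        return False


# Example call (hypothetical internal address):
#     ollama_is_up("http://10.0.0.5:11434")
```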

Multi-Turn Conversation State in Django

Django sessions give you a natural place to store per-user conversation history without a database model. Store the message list in the session and retrieve it on each request to maintain multi-turn context across HTTP calls:

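import json

from django.http import JsonResponse
from django.views import View
from django.views.decorators.csrf import csrf_exempt
from django.utils.decorators import method_decorator
from .services.ollama import ollama

@method_decorator(csrf_exempt, name='dispatch')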
class ConversationView(View):
    def post(self, request):
        body = json.loads(request.body)
        user_msg = body.get('message', '').strip()
        if not user_msg:
            return JsonResponse({'error': 'message required'}, status=400)

        history = request.session.get('chat_history', [])
        history.append({'role': 'user', 'content': user_msg})

        try:
            reply = ollama.chat(history)
            history.append({'role': 'assistant', 'content': reply})
            # Keep last 20 messages to avoid session bloat
            request.session['chat_history'] = history[-20:]
            return JsonResponse({'reply': reply})
        except Exception as e:
            history.pop()  # Remove failed user message
            request.session['chat_history'] = history
            return JsonResponse({'error': str(e)}, status=502)

    def delete(self, request):
        request.session.pop('chat_history', None)
        return JsonResponse({'status': 'cleared'})

Django’s session middleware handles the session storage backend, which is database-backed by default; cache-backed sessions (for example on Redis) are a common choice in production. Capping the history at 20 messages prevents the session data from growing unboundedly for long conversations, and rolling back the history on error keeps the session in a consistent state if the Ollama call fails. The DELETE handler clears the conversation history, giving users a way to start fresh without logging out.
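A plain slice is fine for the view above, which never stores a system prompt. If a conversation does start with one, slicing would eventually drop it. The sketch below shows one possible trimming helper that keeps a leading system message; the name trim_history and this keep-the-system-prompt policy are assumptions, not part of the view.

```python
def trim_history(history, max_messages=20):
    """Return at most max_messages items, keeping a leading system message."""
    if len(history) <= max_messages:
        return list(history)
    if history[0].get("role") == "system":
        # Reserve one slot for the system prompt, fill the rest from the end
        tail = history[len(history) - (max_messages - 1):]
        return [history[0]] + list(tail)
    return list(history[len(history) - max_messages:])


conversation = [{"role": "system", "content": "Be brief."}] + [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"msg {i}"}
    for i in range(30)
]
trimmed = trim_history(conversation)
print(len(trimmed), trimmed[0]["role"], trimmed[-1]["content"])
```

In the view, the assignment to request.session['chat_history'] could then use trim_history(history) in place of history[-20:].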

When to Choose Django Over FastAPI for Ollama Integration

FastAPI is often recommended for AI API work because of its async-first design and automatic OpenAPI docs. Django is the right choice when your AI feature is one part of a larger application that already uses Django — an existing CMS, e-commerce platform, or internal tool. Adding Ollama to a Django project means zero additional framework overhead: you get the ORM, admin panel, auth system, form handling, and template engine for free alongside your AI endpoints. FastAPI requires you to rebuild all of that if you need it. For greenfield AI-only API services, FastAPI is a reasonable choice. For adding AI capabilities to an existing Django application, staying in Django avoids the maintenance overhead of running two separate Python web services and keeps your codebase cohesive.

Structuring Your Django Ollama Project

As your Django Ollama integration grows beyond a single view, a clear directory structure keeps the codebase maintainable. A dedicated ai/ app within your Django project works well — place the OllamaService in ai/services.py, Celery tasks in ai/tasks.py, DRF views in ai/views.py, serializers in ai/serializers.py, and URL patterns in ai/urls.py. This keeps all AI-related code in one place and makes it easy to test the AI layer independently of the rest of the application. Register the app in INSTALLED_APPS as 'ai' and include its URLs with a prefix like /api/ai/ in your root URL configuration.

Write integration tests that spin up a test Ollama instance if one is available, and fall back to mocked responses if not. Use an environment variable like OLLAMA_INTEGRATION_TESTS=1 to gate the integration tests so they only run in CI environments where Ollama is available, keeping the main test suite fast for local development. This two-tier testing approach — fast unit tests with mocks always, slower integration tests with a real Ollama when available — is the most practical setup for teams working on Django applications with local AI features.
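The environment-variable gate can be expressed with the standard library's unittest.skipUnless, which Django's test runner honours. The sketch below is a minimal illustration: the class and test names are hypothetical, and the import path myapp.services.ollama follows the layout used earlier in this guide.

```python
import os
import unittest

# Only run against a real Ollama when explicitly requested
RUN_INTEGRATION = os.environ.get("OLLAMA_INTEGRATION_TESTS") == "1"


@unittest.skipUnless(RUN_INTEGRATION, "set OLLAMA_INTEGRATION_TESTS=1 to run")
class OllamaIntegrationTests(unittest.TestCase):
    def test_chat_round_trip(self):
        # Imported lazily so that merely collecting the test does not need Django settings
        from myapp.services.ollama import ollama

        reply = ollama.chat([{"role": "user", "content": "Say hi"}])
        self.assertTrue(reply.strip())
```

Locally the class is reported as skipped, while a CI job that exports OLLAMA_INTEGRATION_TESTS=1 and has Ollama available runs it for real.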

The OllamaService class, the Celery task pattern, the session-based conversation history, and the pgvector semantic search setup described in this guide give you a complete toolkit for adding local AI to any Django application. Start with the synchronous view for the simplest integration, add Celery when response times start affecting user experience, and introduce semantic search when your content collection grows large enough to benefit from it.
