Celery is the most widely used distributed task queue for Python. Combining it with Ollama lets you offload AI inference to background workers, which keeps web endpoints fast, lets you process documents in parallel, and absorbs request bursts gracefully. This guide covers the Celery + Ollama integration pattern for async AI workloads.
Why Async AI Tasks
LLM inference is slow by web standards: a 500-word response is roughly 650 tokens, so at 40 tokens/sec it takes about 16 seconds, and longer on slower hardware. Running inference synchronously in a web request blocks the server thread, limits concurrency, and produces a poor user experience. Moving inference to a background task queue addresses all three: the web endpoint responds immediately with a task ID, workers process inference in parallel, and users poll for results or receive a webhook when the task is done. Celery + Redis is a common Python stack for this pattern.
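The latency figure above is simple arithmetic. As a rough sketch, assuming about 1.3 tokens per English word (a rule of thumb, not a measured value for any particular tokenizer), generation time can be estimated as below. The function name is illustrative only.

```python
def estimate_generation_seconds(words: int, tokens_per_sec: float,
                                tokens_per_word: float = 1.3) -> float:
    """Rough time to generate `words` words at a given decode speed.

    tokens_per_word=1.3 is an assumed rule of thumb for English text.
    Prompt processing time is ignored.
    """
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return words * tokens_per_word / tokens_per_sec


if __name__ == "__main__":
    print(f"{estimate_generation_seconds(500, 40):.2f} s")
```

Measuring tokens/sec on your own hardware and model gives a better input than any rule of thumb.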
Setup
pip install celery redis ollama
redis-server &
Basic Celery + Ollama Task
# tasks.py
from typing import Literal

import ollama
from celery import Celery
from pydantic import BaseModel

app = Celery('ai_tasks', broker='redis://localhost:6379/0', backend='redis://localhost:6379/0')


@app.task(name='generate_summary')
def generate_summary(text: str, model: str = 'llama3.2') -> str:
    response = ollama.chat(
        model=model,
        messages=[{'role': 'user', 'content': f'Summarise in 3 bullet points:\n\n{text}'}],
        options={'temperature': 0.3},
    )
    return response['message']['content']


class Classification(BaseModel):
    category: Literal['invoice', 'contract', 'report', 'email', 'other']
    confidence: Literal['high', 'medium', 'low']


@app.task(name='classify_document')
def classify_document(text: str) -> dict:
    response = ollama.chat(
        model='llama3.2',
        # Only the first 2000 characters are sent to keep the prompt short
        messages=[{'role': 'user', 'content': f'Classify this document:\n\n{text[:2000]}'}],
        # Constrain the reply to the JSON schema of the Pydantic model
        format=Classification.model_json_schema(),
        options={'temperature': 0},
    )
    return Classification.model_validate_json(response['message']['content']).model_dump()
Starting Workers
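# Default: one worker process per CPU core
celery -A tasks worker --loglevel=info
# Or cap the number of tasks that run at the same time
celery -A tasks worker --loglevel=info --concurrency=2
# Optional monitoring web UI at :5555 (pip install flower)
celery -A tasks flower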
FastAPI Integration
from celery.result import AsyncResult
from fastapi import FastAPI

from tasks import app as celery_app, generate_summary

app = FastAPI()


@app.post('/summarise')
async def summarise(text: str, model: str = 'llama3.2'):
    # Queue the task and return immediately; a worker runs the inference
    task = generate_summary.delay(text, model)
    return {'task_id': task.id, 'status': 'queued'}


@app.get('/result/{task_id}')
async def get_result(task_id: str):
    result = AsyncResult(task_id, app=celery_app)
    if result.successful():
        return {'status': 'done', 'result': result.result}
    if result.failed():
        # Calling result.get() here would re-raise the task's exception
        return {'status': 'failed', 'error': str(result.result)}
    return {'status': result.status}
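On the client side, the request/poll pattern comes down to calling the result endpoint until it reports a terminal status. The sketch below is one way to do that, with assumptions stated: `fetch` is a hypothetical callable that returns the JSON body of GET /result/{task_id} as a dict, and 'failed' is an assumed status for endpoints that report task failure. The HTTP call itself is left out so the example stays self-contained.

```python
import time
from typing import Callable, Tuple


def poll_until_done(fetch: Callable[[], dict], interval: float = 1.0,
                    timeout: float = 120.0,
                    terminal: Tuple[str, ...] = ("done", "failed"),
                    sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call `fetch` until it returns a terminal status or the timeout elapses.

    `fetch` is a hypothetical callable returning the endpoint's JSON body.
    `sleep` is injectable so the loop can be tested without waiting.
    """
    deadline = time.monotonic() + timeout
    while True:
        body = fetch()
        if body.get("status") in terminal:
            return body
        # Stop if the next wait would run past the deadline
        if time.monotonic() + interval > deadline:
            raise TimeoutError(f"task not finished after {timeout} s")
        sleep(interval)
```

In a real client, `fetch` would wrap an HTTP GET made with whichever HTTP library the project already uses.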
Batch Processing
from celery import group
documents = ['doc1 text', 'doc2 text', 'doc3 text']
result = group(generate_summary.s(doc) for doc in documents)()
summaries = result.get(timeout=300)
The Case for Background Inference
Web applications are built around the assumption of fast responses — users expect HTTP requests to complete in under a second for interactive features and under a few seconds for complex operations. LLM inference breaks this assumption fundamentally: generating a 200-word response takes 5–30 seconds depending on hardware and model size. Synchronous inference in a web request creates several problems beyond just slow responses. It blocks the worker thread for the entire inference duration, which limits the number of concurrent requests a single process can handle. It makes timeout configuration difficult — you need very long timeouts to accommodate inference, which can mask legitimate infrastructure problems. And it makes retry logic complex, since retrying a timed-out inference request may cause duplicate processing.
Celery solves all of these problems by decoupling request receipt from request processing. The web endpoint becomes a thin layer that validates input, queues a task, and returns a task ID — this takes milliseconds and does not block a thread during inference. Workers pick up tasks from the queue and call Ollama, taking as long as the inference requires without affecting web endpoint availability. This architecture scales naturally: add more workers to increase inference throughput, and the web tier and worker tier scale independently based on their respective bottlenecks.
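Queuing does not remove the duplicate-processing risk on its own, because a client that retries a request can queue the same work twice. One possible mitigation, sketched here under assumptions, is to derive a deterministic key from the task name, model, and input text. The helper name `task_key` is hypothetical and not part of Celery.

```python
import hashlib


def task_key(task_name: str, model: str, text: str) -> str:
    """Deterministic ID for a (task, model, input) triple.

    Identical submissions hash to the same key, so a caller can recognise
    work that was already submitted instead of treating it as new.
    """
    digest = hashlib.sha256()
    for part in (task_name, model, text):
        data = part.encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc") differ
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()
```

Celery's apply_async accepts a task_id argument, so this key could be used as the task ID, which lets a retried request poll the same result rather than a new one. It does not by itself stop the task from being queued twice; that would need a separate check, for example a marker stored in Redis.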
Task Routing for Multiple Models
Different models have different resource requirements — a 1.5B model is fast and fits on CPU, while a 13B model needs a GPU and takes longer. Route tasks to appropriate workers using Celery’s queue system:
# Route heavy tasks to GPU workers, light tasks to CPU workers
app.conf.task_routes = {
    'generate_summary': {'queue': 'cpu'},   # small model
    'classify_document': {'queue': 'cpu'},  # structured output, fast
    # 'analyse_document' is not defined in this article; it stands for a task that uses a larger model
    'analyse_document': {'queue': 'gpu'},   # large model, better quality
}
# Start separate worker pools
# GPU worker:
celery -A tasks worker --queues=gpu --loglevel=info --concurrency=1
# CPU workers (multiple):
celery -A tasks worker --queues=cpu --loglevel=info --concurrency=4
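Static routes work when each task always uses the same model. When the model is a task argument, the queue can instead be chosen per call. The helper below is a minimal sketch, assuming Ollama-style tags such as 'qwen2.5:1.5b' or 'llama2:13b' and an arbitrary 7B cut-off. Both the function name and the threshold are assumptions to adjust for your hardware.

```python
import re

# Matches the size part of tags like 'llama2:13b' or 'qwen2.5:1.5b'
_SIZE_RE = re.compile(r":(\d+(?:\.\d+)?)b\b")


def queue_for_model(model: str, gpu_threshold_b: float = 7.0) -> str:
    """Pick 'gpu' or 'cpu' from the parameter count in a model tag.

    Tags without a parsable size (for example 'llama3.2') fall back to
    the 'cpu' queue. The 7B threshold is an assumed default.
    """
    match = _SIZE_RE.search(model.lower())
    if match is None:
        return "cpu"
    size_b = float(match.group(1))
    return "gpu" if size_b >= gpu_threshold_b else "cpu"
```

The result can be passed to Celery at call time, for example generate_summary.apply_async(args=[text, model], queue=queue_for_model(model)), which overrides the static route for that call.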
Error Handling and Retries
import logging

logger = logging.getLogger(__name__)


@app.task(
    name='generate_summary',
    bind=True,                   # gives the task access to self.request
    max_retries=3,
    default_retry_delay=5,       # seconds between retries
    autoretry_for=(Exception,),  # retry on any exception
)
def generate_summary(self, text: str, model: str = 'llama3.2') -> str:
    try:
        response = ollama.chat(
            model=model,
            messages=[{'role': 'user', 'content': f'Summarise:\n\n{text}'}],
            options={'temperature': 0.3},
        )
        return response['message']['content']
    except Exception as e:
        # Log the error, then re-raise so autoretry_for schedules the retry
        logger.warning('Inference failed (attempt %d): %s', self.request.retries + 1, e)
        raise
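A fixed 5-second delay retries quickly, which may not help if Ollama is overloaded or still loading a model. Spacing retries further apart is a common alternative, and Celery provides the retry_backoff, retry_backoff_max, and retry_jitter task options for it. The function below only illustrates the kind of schedule such a policy produces. It is a sketch with assumed parameter names, not Celery's implementation.

```python
import random
from typing import List, Optional


def backoff_delays(retries: int, base: float = 5.0, factor: float = 2.0,
                   max_delay: float = 60.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Delay in seconds before each retry: base, base*factor, ... capped.

    If `rng` is given, each delay is replaced by a uniform random value in
    [0, delay] so that failed tasks do not all retry at the same moment.
    """
    delays = []
    for attempt in range(retries):
        delay = min(base * factor ** attempt, max_delay)
        if rng is not None:
            delay = rng.uniform(0.0, delay)
        delays.append(delay)
    return delays


if __name__ == "__main__":
    print(backoff_delays(3))
```

With the defaults, three retries wait 5, 10, and 20 seconds, so a short outage is retried quickly while a longer one is not hammered.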
Progress Tracking
@app.task(name='process_batch', bind=True)
def process_batch(self, documents: list) -> list:
    results = []
    total = len(documents)
    for i, doc in enumerate(documents, start=1):
        # Publish progress to the result backend so clients can poll it
        self.update_state(
            state='PROGRESS',
            meta={'current': i, 'total': total, 'percent': round(i / total * 100)},
        )
        response = ollama.chat(
            model='llama3.2',
            messages=[{'role': 'user', 'content': f'Summarise:\n\n{doc}'}],
        )
        results.append(response['message']['content'])
    return results


# Poll progress from FastAPI (here `app` is the FastAPI app, not the Celery app)
@app.get('/progress/{task_id}')
async def progress(task_id: str):
    from celery.result import AsyncResult
    result = AsyncResult(task_id)
    if result.state == 'PROGRESS':
        return result.info  # e.g. {'current': 5, 'total': 20, 'percent': 25}
    return {'state': result.state}
Production Deployment
In production, run Celery workers as systemd services alongside Ollama. Configure worker concurrency to match your hardware: on a single-GPU machine, use concurrency=1 for GPU tasks so that inference requests do not compete for the GPU, and concurrency=4 for CPU tasks. Use Celery’s result expiry setting (result_expires=3600) to clean up completed task results from Redis automatically. Monitor worker health with Flower, which also exposes a Prometheus metrics endpoint; Celery itself does not serve one. The combination of Celery, Redis, and Ollama is a production-proven pattern for async AI workloads that requires no special AI infrastructure, just standard Python and Redis tooling that most backend teams already operate.
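The settings mentioned above can be collected in one place. The dict below is a sketch of a possible starting configuration. The setting names are standard Celery settings, but every value apart from result_expires is an assumption to tune for your own models and hardware.

```python
# Intended to be applied with app.conf.update(CELERY_SETTINGS) in tasks.py.
CELERY_SETTINGS = {
    # Drop stored results after one hour (value from the text above)
    "result_expires": 3600,
    # Reserve one task at a time so long inference jobs are not
    # held back behind another long job on the same worker
    "worker_prefetch_multiplier": 1,
    # Acknowledge after the task finishes, so a crashed worker's task is
    # redelivered; this means a task may run more than once
    "task_acks_late": True,
    # Assumed limits: raise SoftTimeLimitExceeded in the task after 300 s,
    # and terminate the worker process running it after 330 s
    "task_soft_time_limit": 300,
    "task_time_limit": 330,
}
```

Keeping the soft limit below the hard limit gives the task a chance to log and clean up before it is terminated.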
Choosing Between Celery and Alternatives
Celery is not the only Python task queue — alternatives include Dramatiq, Huey, ARQ (async), and RQ (Redis Queue). For Ollama workloads, the choice comes down to team familiarity and infrastructure. Celery is the most widely used and has the most documentation, examples, and integrations. Dramatiq is considered easier to use and less error-prone than Celery for straightforward tasks. ARQ is purpose-built for async Python applications and integrates naturally with FastAPI’s async model. RQ is the simplest option and good for small-scale deployments. All four support Redis as a broker and share the same fundamental pattern: enqueue a task, workers process it, results are stored and retrieved.
For teams already using Celery elsewhere in their stack, adding Ollama tasks to the existing worker pool is trivial — the AI tasks are just another task type alongside database operations, email sending, or report generation. For new projects, Celery’s maturity and ecosystem (Flower monitoring, beat scheduler, chord/group/chain primitives for complex workflows) make it the safe default. If your application is async-first (FastAPI + asyncio), ARQ is worth evaluating as a more idiomatic fit than Celery’s synchronous worker model.
Scaling Workers Horizontally
One of Celery’s strongest operational characteristics is horizontal scaling: adding more workers is as simple as starting the same Celery worker command on additional machines that point at the same Redis broker. For Ollama workloads, this translates to running Ollama on multiple machines and distributing inference tasks across them. Each worker machine runs its own Ollama instance; the Celery task code is identical on all workers, and the OLLAMA_HOST environment variable points each worker at its local Ollama instance. Inference throughput then scales roughly linearly with the number of worker machines, as long as Redis and the network are not the bottleneck, and no central coordination is needed beyond the Redis task queue.
# Use environment-aware Ollama host in tasks
import os

OLLAMA_HOST = os.getenv('OLLAMA_HOST', 'http://localhost:11434')


@app.task(name='generate_summary')
def generate_summary(text: str) -> str:
    client = ollama.Client(host=OLLAMA_HOST)
    response = client.chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': f'Summarise:\n\n{text}'}],
    )
    return response['message']['content']
Scheduling Periodic AI Jobs
Celery Beat (the built-in scheduler) handles periodic tasks — useful for AI workloads that should run on a schedule:
from celery.schedules import crontab

app.conf.beat_schedule = {
    # Re-embed all documents daily at 2am to keep the index fresh
    'reindex-docs-daily': {
        'task': 'reindex_all_documents',
        'schedule': crontab(hour=2, minute=0),
    },
    # Generate daily summaries every morning
    'daily-summary': {
        'task': 'generate_daily_report',
        'schedule': crontab(hour=7, minute=0),
    },
}
# Start the beat scheduler
# celery -A tasks beat --loglevel=info
Getting Started
Install Celery and Redis, copy the task definitions from this article, start Redis locally, and run celery -A tasks worker --loglevel=info. In a separate terminal, call your task with generate_summary.delay('your text here') and watch it process in the worker output. Add the FastAPI endpoints for the async request/poll pattern when you are ready to build the web interface. The Flower monitoring UI (celery -A tasks flower) gives immediate visibility into queued, processing, and completed tasks during development. The whole setup from zero to a working async AI pipeline takes under an hour, and the resulting architecture scales to high throughput without the concurrency complications of trying to run parallel inference in a single synchronous web process.
When Celery Adds Too Much Complexity
Celery is the right tool when you have sustained workloads, need reliability features such as automatic retries, require horizontal scaling across multiple machines, or need task scheduling via Celery Beat. It is overkill for simple cases: if you just want non-blocking inference in a FastAPI application serving a handful of users, running Ollama calls in a thread pool executor with asyncio is simpler and needs no external services. The heuristic: if you need Redis running anyway for other purposes (caching, sessions), adding Celery is low overhead. If Redis would be a new dependency just for Celery, evaluate whether asyncio-based background tasks or a simple in-process queue would suffice for your scale. The patterns in this article are intentionally simple: a single tasks.py file, standard Celery configuration, and FastAPI endpoints that any Python developer can understand and maintain without specialised knowledge of Celery internals.
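The thread pool alternative mentioned above can be sketched with the standard library alone. In this example `blocking_inference` is a hypothetical stand-in for a blocking call such as ollama.chat, so that the code runs without an Ollama server. The pool size of 2 is likewise an assumption.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_inference(prompt: str) -> str:
    """Hypothetical stand-in for a blocking call such as ollama.chat."""
    time.sleep(0.05)  # simulate inference latency
    return f"summary of: {prompt}"


async def summarise_many(prompts: list, max_workers: int = 2) -> list:
    """Run blocking calls in a bounded thread pool without blocking the event loop."""
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [loop.run_in_executor(pool, blocking_inference, p) for p in prompts]
        # gather preserves the order of the input prompts
        return await asyncio.gather(*futures)


if __name__ == "__main__":
    print(asyncio.run(summarise_many(["doc one", "doc two", "doc three"])))
```

In a long-running web application the executor would normally be created once at startup and reused. Unlike a Celery queue, pending work in this setup is lost if the process restarts.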
Real-World Use Cases
The most common production uses of Celery + Ollama follow predictable patterns. Document processing pipelines — PDF uploads, email attachments, user-submitted content — are the archetypal use case: the file arrives via HTTP, an upload task saves it, an extraction task converts it to text, an AI task classifies or summarises it, and results are stored for retrieval. The AI step is naturally async because it is slow and because the user does not need to wait for classification before the upload is confirmed. Email or notification generation is another natural fit: a background task generates a personalised message using an LLM and sends it, entirely decoupled from the web request that triggered it. And analytics pipelines where you apply AI analysis to batches of records overnight — product categorisation, sentiment analysis on reviews, data extraction from unstructured fields — map directly to Celery’s batch processing primitives.
The common thread across all these use cases is that AI is a processing step that happens asynchronously, not a real-time requirement for the user interaction. Recognising this pattern — separating the “user triggered something” event from the “AI processed it” outcome — is the key insight that makes Celery the right architectural tool for most production AI workflows. Once you have this separation, the queue, workers, and result storage that Celery provides give you exactly the infrastructure the pattern needs, with well-understood operational characteristics and a large community of engineers who know how to run it reliably.
Monitoring Celery Workers
Beyond Flower’s web UI, Celery emits task events that an exporter such as the celery-prometheus-exporter package can turn into Prometheus metrics (workers must have events enabled, for example with the -E flag). Key metrics to monitor: active tasks per worker (to detect workers stuck on long-running inference), task success/failure rates (to catch systematic errors in AI tasks), task queue depth (to measure backlog and detect insufficient capacity), and task duration histograms (to track inference time trends). Combined with the Ollama metrics from the Prometheus monitoring article in this series, these give you full observability of the async AI processing pipeline, from task creation through inference to result storage.
Celery in the Broader AI Infrastructure Stack
Celery sits between your web application layer and your AI inference layer, providing the reliability and scalability guarantees that production AI workloads require. It is not an AI-specific tool — it is general-purpose task queue infrastructure that happens to work particularly well for AI workloads because of how well the async task pattern matches the latency characteristics of LLM inference. Teams that adopt Celery for Ollama tasks gain the same benefits for any other slow operation in their application: image processing, PDF generation, email sending, report compilation. The investment in learning Celery pays dividends across the entire application, not just the AI features. For teams building serious production AI applications in Python, Celery + Redis + Ollama is a proven, maintainable, and scalable stack that does not require specialised AI infrastructure knowledge to operate effectively.
The async task pattern is one of those architectural decisions that feels like overhead when you first implement it and indispensable six months later, when your inference volume has grown and you are grateful the web tier never blocked on LLM calls. Building it in from the start, even before you need the scale, costs about an hour and can save weeks of architectural rework later. It is also one of the few architectural decisions that becomes more valuable, not less, as usage scales up, which makes it a sound choice for applications that evolve from prototype to production.
Start simple (one tasks.py file, Redis running locally, a single worker), then grow the complexity only as your actual workload demands it. The pattern described in this article is well established in production Python systems and leaves room to grow with your request volume.