How to Add Image Captioning to Your App with a Local LLM

Vision-capable models — LLaVA, moondream, Gemma 3 — can describe images, extract text, answer questions about photos, and classify visual content, all running locally via Ollama. This guide covers image captioning and visual Q&A with practical Python examples for common use cases.

Prerequisites

# Pull a vision model
ollama pull llava           # 4.7GB — versatile general vision model
ollama pull moondream      # 1.7GB — fast, compact, good for captioning
ollama pull gemma3:4b      # 3.3GB — Google's multimodal model

# Python dependencies: the Ollama client library plus Pillow for image handling
pip install ollama pillow

Basic Image Captioning

import ollama
import base64
from pathlib import Path

def caption_image(image_path: str, model: str = 'moondream') -> str:
    image_data = Path(image_path).read_bytes()
    b64 = base64.b64encode(image_data).decode()
    response = ollama.chat(
        model=model,
        messages=[{
            'role': 'user',
            'content': 'Describe this image in detail.',
            'images': [b64]
        }]
    )
    return response['message']['content']

print(caption_image('photo.jpg'))

Visual Question Answering

def ask_about_image(image_path: str, question: str, model: str = 'llava') -> str:
    b64 = base64.b64encode(Path(image_path).read_bytes()).decode()
    response = ollama.chat(
        model=model,
        messages=[{'role':'user','content':question,'images':[b64]}]
    )
    return response['message']['content']

# Examples
print(ask_about_image('receipt.jpg', 'What is the total amount on this receipt?'))
print(ask_about_image('diagram.png', 'Explain what this diagram shows.'))
print(ask_about_image('screenshot.png', 'What errors or warnings are visible?'))

Batch Processing a Folder of Images

from pathlib import Path
import csv

def batch_caption(folder: str, output_csv: str, model: str = 'moondream') -> None:
    image_exts = {'.jpg','.jpeg','.png','.webp','.gif'}
    images = [f for f in Path(folder).iterdir() if f.suffix.lower() in image_exts]
    print(f'Processing {len(images)} images...')

    with open(output_csv, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['filename','caption'])
        for img in images:
            print(f'  {img.name}...', end='', flush=True)
            try:
                caption = caption_image(str(img), model)
            except Exception as e:
                # Record the failure and keep going rather than aborting the batch
                caption = f'ERROR: {e}'
            writer.writerow([img.name, caption])
            print(' done')

batch_caption('product_photos/', 'captions.csv')

Extracting Text from Images (OCR)

def extract_text_from_image(image_path: str) -> str:
    b64 = base64.b64encode(Path(image_path).read_bytes()).decode()
    response = ollama.chat(
        model='llava',
        messages=[{
            'role':'user',
            'content':'Extract all text visible in this image. Return only the text, preserving formatting.',
            'images':[b64]
        }]
    )
    return response['message']['content']

# Works for: screenshots, photos of whiteboards, business cards, signs
text = extract_text_from_image('whiteboard_photo.jpg')
print(text)

Image Classification

from pydantic import BaseModel
from typing import Literal

class ImageClassification(BaseModel):
    category: Literal['invoice', 'receipt', 'contract', 'photo', 'diagram', 'screenshot', 'other']
    confidence: Literal['high', 'medium', 'low']
    description: str

def classify_image(image_path: str) -> ImageClassification:
    b64 = base64.b64encode(Path(image_path).read_bytes()).decode()
    response = ollama.chat(
        model='llava',
        messages=[{'role':'user','content':'Classify this document/image.','images':[b64]}],
        format=ImageClassification.model_json_schema(),
        options={'temperature': 0}
    )
    return ImageClassification.model_validate_json(response['message']['content'])

result = classify_image('uploaded_doc.jpg')
print(f'{result.category} ({result.confidence}): {result.description}')

Model Comparison: LLaVA vs Moondream vs Gemma 3

Each vision model has strengths for different tasks. Moondream (1.7GB) is the fastest option for high-volume captioning — it generates brief, accurate descriptions quickly and handles common photographic subjects well. Its weakness is complex diagrams and dense text. LLaVA (4.7GB) is the most versatile — it handles diverse image types, detailed visual Q&A, and longer descriptions better than moondream. For most general-purpose vision tasks, LLaVA is the recommended default. Gemma 3 (3.3GB at 4B) offers a good balance of quality and speed, with particularly strong performance on text extraction from images and understanding charts and graphs. For OCR-adjacent tasks, Gemma 3 4B is worth trying alongside LLaVA.
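
A quick way to compare them on your own data is to run the same image through each model and read the outputs side by side. A minimal sketch reusing the caption_image helper from earlier; 'test.jpg' is a placeholder for one of your own images:

# Caption the same image with each model to compare style and accuracy
for model in ['moondream', 'llava', 'gemma3:4b']:
    print(f'--- {model} ---')
    print(caption_image('test.jpg', model=model))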

Why Local Vision Models Are Practically Useful

Image captioning and visual Q&A have historically required cloud APIs (Google Cloud Vision, AWS Rekognition, OpenAI GPT-4V) that charge per image, require sending images to external servers, and may have data retention policies incompatible with private or sensitive images. Local vision models via Ollama eliminate all of these constraints. For applications that process medical images, legal documents, internal company materials, or personal photos, local processing is often not just preferable but required by data governance policies.

The practical quality of vision models in the 2–7B parameter range has improved dramatically. Moondream at 1.7GB produces accurate captions for common photographic subjects and handles text extraction reasonably well. LLaVA handles more complex visual reasoning, diagram interpretation, and detailed visual Q&A. These models are not as capable as GPT-4V on hard vision tasks, but they handle the 80% of common vision use cases — caption generation, basic OCR, document type classification, simple visual Q&A — with quality that is good enough for production use.

Integrating with a Web Application

# FastAPI endpoint for image captioning
# Requires: pip install fastapi uvicorn python-multipart (UploadFile needs python-multipart)
from fastapi import FastAPI, UploadFile
import ollama, base64

app = FastAPI()

@app.post('/caption')
async def caption(file: UploadFile):
    image_bytes = await file.read()
    b64 = base64.b64encode(image_bytes).decode()
    response = ollama.chat(
        model='moondream',
        messages=[{'role':'user','content':'Describe this image.','images':[b64]}]
    )
    return {'caption': response['message']['content']}

# Run: uvicorn app:app --reload
# POST an image: curl -X POST http://localhost:8000/caption -F 'file=@photo.jpg'

Handling Different Image Sources

import requests
import ollama, base64
from pathlib import Path
from io import BytesIO
from PIL import Image

def load_image_b64(source: str) -> str:
    '''Load image from file path or URL and return as base64 string.'''
    if source.startswith(('http://', 'https://')):
        resp = requests.get(source, timeout=10)
        resp.raise_for_status()  # fail loudly on 4xx/5xx instead of captioning an error page
        image_bytes = resp.content
    else:
        image_bytes = Path(source).read_bytes()
    # Optionally resize large images to reduce token usage
    img = Image.open(BytesIO(image_bytes))
    if max(img.size) > 1024:
        img.thumbnail((1024, 1024))
        buf = BytesIO()
        # Convert to RGB first: JPEG cannot store an alpha channel (e.g. RGBA PNGs)
        img.convert('RGB').save(buf, format='JPEG', quality=85)
        image_bytes = buf.getvalue()
    return base64.b64encode(image_bytes).decode()

# Works with both local files and URLs
b64 = load_image_b64('https://example.com/photo.jpg')
response = ollama.chat(
    model='moondream',
    messages=[{'role': 'user', 'content': 'Describe this image.', 'images': [b64]}]
)
print(response['message']['content'])

Resizing Images for Performance

Vision models process images by dividing them into patches. Larger images produce more patches and consume more tokens, which increases both inference time and context window usage. For most captioning tasks, images scaled to 512×512 or 768×768 pixels are sufficient — the model captures the essential content without the overhead of processing a full-resolution image. The load_image_b64 helper above applies this optimisation automatically. For tasks requiring fine detail (small text, intricate diagrams), use higher resolution — 1024×1024 is usually the practical ceiling that balances quality and performance on consumer hardware.
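
If different tasks in your pipeline need different levels of detail, the resize step can be factored out with a configurable maximum side. A small sketch along the lines of the helper above; the resize_for_model name and max_side parameter are illustrative choices, not part of any library:

from io import BytesIO
from PIL import Image

def resize_for_model(image_bytes: bytes, max_side: int = 768) -> bytes:
    '''Downscale an image so its longest side is at most max_side pixels.
    Try ~512-768 for captioning, ~1024 for fine detail such as small text.'''
    img = Image.open(BytesIO(image_bytes))
    if max(img.size) <= max_side:
        return image_bytes  # already small enough, return unchanged
    img.thumbnail((max_side, max_side))  # preserves aspect ratio
    buf = BytesIO()
    img.convert('RGB').save(buf, format='JPEG', quality=85)
    return buf.getvalue()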

Privacy Considerations

Local vision processing is the correct choice for any application handling images that should not be sent to external services. Medical imaging (X-rays, scans, clinical photos), personal identification documents, private correspondence photos, security camera footage, and internal business documents all fall into this category. With local vision models, the images are processed entirely in RAM on your machine — they are never transmitted, logged, or stored anywhere outside your control. For organisations with HIPAA, GDPR, or other regulatory requirements around image data, local vision processing is often the only compliant path for AI-powered image analysis.

Getting Started

Pull moondream for fast captioning (ollama pull moondream) and LLaVA for versatile vision Q&A (ollama pull llava). Run the basic caption_image function on a few images to evaluate quality for your use case. Start with moondream for high-volume or latency-sensitive applications; reach for LLaVA when you need better accuracy on complex visual reasoning. Both integrate with the same API pattern, so switching between them is a one-line model name change.

Practical Use Cases That Work Well Locally

The vision tasks that work best with local models are those where the images are relatively standard and the questions are focused. Photo captioning for product catalogues — generating alt text or search descriptions for product images — is a high-value, high-volume use case where moondream’s speed and reasonable accuracy make it production-viable. Document scanning workflows where you need to extract key information from photos of forms, receipts, or handwritten notes work well with LLaVA when combined with specific extraction prompts. Internal tool screenshots and error message photos, which developers frequently share in team communication, can be automatically transcribed and summarised using the OCR-style prompts from this article. Accessibility tools that describe images for screen reader users benefit from local processing for privacy reasons and the ability to customise description style and detail level through prompt engineering without API constraints.

Tasks that work less well locally: complex medical image analysis requiring radiologist-level interpretation, fine-grained optical character recognition of dense, poorly lit, or damaged documents (where purpose-built OCR tools like Tesseract or cloud OCR APIs outperform general vision LLMs), and scenes requiring real-world knowledge to interpret accurately (identifying specific celebrities, landmarks, or products by name). For these cases, local vision models provide a starting point but may need to be supplemented by or replaced with specialised tools. Use local vision for the common cases it handles well and route edge cases to more capable tools when quality is critical.

Combining Vision with Text Models

Vision models in Ollama produce text output, which means their output can be piped directly into further text processing. A practical pipeline: use moondream to generate a caption for each image in a folder, use a text model (llama3.2) to classify and tag the caption, then store the structured metadata. Or: use LLaVA to extract information from a photo of a receipt (merchant, amount, date, items), then use Pydantic structured output to parse the extracted text into a structured object for storage. The vision model handles the image-to-text step; the text model handles the text-to-structure step. Combining the two produces a pipeline that goes from raw image to structured data entirely locally, which is a genuinely useful capability for document processing, expense tracking, and catalogue management workflows.
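
A minimal sketch of the receipt pipeline described above, reusing the extract_text_from_image helper and assuming llama3.2 has been pulled; the Receipt fields are illustrative, not a fixed schema:

from pydantic import BaseModel

class Receipt(BaseModel):
    merchant: str
    date: str
    total: float
    items: list[str]

def receipt_to_structured(image_path: str) -> Receipt:
    # Step 1: vision model turns the receipt photo into raw text
    raw_text = extract_text_from_image(image_path)
    # Step 2: text model turns the raw text into a structured object
    response = ollama.chat(
        model='llama3.2',
        messages=[{
            'role': 'user',
            'content': f'Extract the receipt fields from this text:\n\n{raw_text}'
        }],
        format=Receipt.model_json_schema(),
        options={'temperature': 0}
    )
    return Receipt.model_validate_json(response['message']['content'])

receipt = receipt_to_structured('receipt.jpg')
print(f'{receipt.merchant}: {receipt.total} ({receipt.date})')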

Vision Model Accuracy Expectations

Setting realistic expectations for local vision model accuracy helps avoid disappointment. For common photographic subjects — people, animals, everyday objects, outdoor scenes — moondream and LLaVA produce accurate, detailed captions consistently. For text extraction from clean, well-lit documents with standard fonts, accuracy is typically 85–95% on clear images, dropping significantly for handwriting, poor lighting, or unusual fonts. For complex diagrams, charts, and technical drawings, LLaVA handles simple cases well but misses subtle details in complex layouts. For face recognition, brand identification, and specific object recognition requiring real-world knowledge, local models lag meaningfully behind cloud vision APIs that are trained on larger and more carefully curated datasets.

The gap between local and cloud vision is smaller than the gap between local and cloud language models for most common tasks. A well-prompted moondream or LLaVA handles 75–85% of typical business vision tasks at acceptable quality — enough for internal tools, automation pipelines, and accessibility features where the cost and privacy benefits of local processing justify minor quality trade-offs compared to a cloud API. Evaluate on your specific images and use cases before deciding whether local vision is sufficient — the quality you experience on your actual data is more informative than any general benchmark number.

Advanced: Multi-Image Comparison

Vision models can compare two images when both are provided in the same message. This opens up use cases like comparing product photos between catalogue versions, checking before-and-after images for changes, or verifying that a rendered design matches a reference:

def compare_images(img_path_a: str, img_path_b: str, question: str = 'What are the differences between these two images?') -> str:
    b64_a = base64.b64encode(Path(img_path_a).read_bytes()).decode()
    b64_b = base64.b64encode(Path(img_path_b).read_bytes()).decode()
    response = ollama.chat(
        model='llava',
        messages=[{
            'role': 'user',
            'content': question,
            'images': [b64_a, b64_b]  # both images in same message
        }]
    )
    return response['message']['content']

print(compare_images('design_v1.png', 'design_v2.png',
    'List all visual differences between version 1 and version 2.'))

Performance on Common Hardware

For a single image caption on modern hardware: moondream generates in 3–8 seconds on CPU (M2 MacBook Pro), 1–3 seconds on Apple Silicon GPU via Metal, and under 1 second on a mid-range NVIDIA GPU. LLaVA is 3–5x slower than moondream on the same hardware due to its larger size. For batch processing, the dominant cost is model load time on first request — subsequent requests in the same session are faster because the model stays loaded (keep-alive). For a batch of 100 images, expect roughly 5–15 minutes with moondream on CPU, or under 3 minutes on a GPU. GPU acceleration makes a significant practical difference for vision workloads compared to text-only tasks because the image processing step benefits more from GPU parallelism than sequential token generation does.
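
To see the cold-load versus warm-request difference on your own hardware, time two consecutive calls. A small sketch using the caption_image helper from earlier:

import time

def timed_caption(path: str, model: str = 'moondream') -> float:
    start = time.perf_counter()
    caption_image(path, model)
    return time.perf_counter() - start

# The first call includes model load time; the second hits the already-loaded model
print(f'cold: {timed_caption("photo.jpg"):.1f}s')
print(f'warm: {timed_caption("photo.jpg"):.1f}s')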

The Broader Vision of Local Vision AI

The availability of capable local vision models represents a genuine shift in what individual developers can build without cloud dependencies. Accessibility tools that describe images for visually impaired users, document management systems that auto-tag and search image content, quality control pipelines that flag visual defects, and research tools that extract data from paper figures — all of these are now buildable with local models and hardware that many developers already own. The privacy and cost advantages of local processing make these applications viable in contexts where cloud vision APIs would be cost-prohibitive or data-governance-incompatible. As vision model quality at the sub-10B parameter range continues to improve, the range of tasks where local vision is genuinely production-viable will expand further, making the patterns in this article increasingly valuable over time.

Getting the Most from Local Vision

Three practices make the most difference in local vision quality. First, write specific prompts: “List all text visible in this image” outperforms “What does this image show?” for OCR tasks, and “Describe any errors, warnings, or unusual elements in this screenshot” outperforms “Describe this image” for developer tools. Second, test your chosen model against a representative sample of 20–30 real images from your use case before committing to it; performance varies significantly across image types, and the benchmark that matters is performance on your actual data. Third, resize images to the 512–1024px range before sending them to the model: this reduces inference time and token usage without meaningfully degrading quality for most tasks, and makes batch processing significantly faster on CPU hardware, where image preprocessing is the bottleneck. Together, these practices consistently produce better results than switching to a larger or different model. Invest the thirty minutes to test and tune them on your specific images before optimising anything else: the returns are immediate and reliable, and good prompting and preprocessing habits carry forward to every vision project you build.
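
To illustrate the first practice, here is the same screenshot sent with a generic and a specific prompt, using the ask_about_image helper from earlier; 'error_screenshot.png' is a placeholder filename:

# Generic prompt: tends to return a vague scene description
print(ask_about_image('error_screenshot.png', 'What does this image show?'))

# Specific prompt: directs the model at the detail you actually need
print(ask_about_image('error_screenshot.png',
    'List any errors, warnings, or unusual elements visible in this screenshot.'))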
