Ollama supports multimodal models that can analyse images alongside text. With a vision model running locally, you can describe images, extract text from screenshots, analyse charts, identify objects, and build image-aware applications — entirely on your own hardware with no API costs. This guide covers which vision models are available, how to use them from the command line and Python, and practical use cases.
Available Vision Models
Several vision-capable models are available through Ollama:
- llava:7b — the original LLaVA model, good general image understanding, runs on 8GB VRAM
- llava:13b — better quality, needs 16GB VRAM
- llava-llama3 — LLaVA fine-tuned on Llama 3, better instruction following
- moondream (moondream2) — very small (1.9B), fast, good for simple descriptions on CPU
- gemma3:4b — Google’s multimodal Gemma 3, strong text+image reasoning
- qwen2.5vl:7b — Qwen2.5 Vision-Language, excellent document and chart understanding
- minicpm-v — strong on document OCR and structured image analysis
# Pull your chosen vision model
ollama pull llava:7b
ollama pull moondream      # moondream2, the smallest/fastest option
ollama pull qwen2.5vl:7b # best for documents and charts
Using Vision Models from the CLI
# Describe an image — include the image path directly in the prompt
# (the CLI has no separate image flag)
ollama run llava:7b "What is in this image? ./photo.jpg"
# Ask a specific question about an image
ollama run llava:7b "What text is visible in this screenshot? ./screenshot.png"
# Analyse a chart
ollama run qwen2.5vl:7b "Summarise the key trends in this chart: ./chart.png"
Using Vision Models via the Ollama Python Library
import ollama
import base64

def encode_image(image_path: str) -> str:
    """Optional helper: the 'images' field also accepts base64-encoded strings,
    but passing the file path directly (as below) is simpler."""
    with open(image_path, 'rb') as f:
        return base64.b64encode(f.read()).decode('utf-8')

def describe_image(image_path: str, model: str = 'llava:7b',
                   prompt: str = 'Describe this image in detail.') -> str:
    response = ollama.chat(
        model=model,
        messages=[{
            'role': 'user',
            'content': prompt,
            'images': [image_path]  # pass the file path directly
        }]
    )
    return response['message']['content']

# Simple usage
description = describe_image('photo.jpg')
print(description)

# With a specific question
answer = describe_image(
    'chart.png',
    model='qwen2.5vl:7b',
    prompt='What does this chart show? List the main data points.'
)
print(answer)
Using the OpenAI-Compatible API with Images
import base64
from openai import OpenAI
from pathlib import Path

client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

def image_to_data_url(image_path: str) -> str:
    suffix = Path(image_path).suffix.lower()
    mime = {'jpg': 'image/jpeg', 'jpeg': 'image/jpeg',
            'png': 'image/png', 'gif': 'image/gif',
            'webp': 'image/webp'}.get(suffix.lstrip('.'), 'image/jpeg')
    with open(image_path, 'rb') as f:
        data = base64.b64encode(f.read()).decode('utf-8')
    return f'data:{mime};base64,{data}'

def analyse_image(image_path: str, question: str,
                  model: str = 'llava:7b') -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{
            'role': 'user',
            'content': [
                {'type': 'text', 'text': question},
                {'type': 'image_url', 'image_url': {'url': image_to_data_url(image_path)}}
            ]
        }]
    )
    return response.choices[0].message.content

# Usage
result = analyse_image('receipt.jpg', 'Extract all line items and total from this receipt')
print(result)
Batch Image Processing
import ollama
from pathlib import Path
import json

def batch_describe_images(image_dir: str, model: str = 'moondream') -> dict:
    results = {}
    image_dir = Path(image_dir)
    image_files = list(image_dir.glob('*.jpg')) + list(image_dir.glob('*.png'))
    print(f'Processing {len(image_files)} images...')
    for img_path in image_files:
        response = ollama.chat(
            model=model,
            messages=[{
                'role': 'user',
                'content': 'Describe this image briefly in one sentence.',
                'images': [str(img_path)]
            }]
        )
        results[img_path.name] = response['message']['content']
        print(f'  {img_path.name}: done')
    return results

descriptions = batch_describe_images('./screenshots', model='moondream')

with open('descriptions.json', 'w') as f:
    json.dump(descriptions, f, indent=2)
Practical Use Cases
Screenshot OCR: Extract text from screenshots without a dedicated OCR tool. Qwen2.5-VL and MiniCPM-V are particularly strong at this, handling handwriting, formatted documents, and code screenshots accurately. For batch processing a folder of screenshots, moondream2 is the fastest option — its small size means low latency even without a GPU.
Chart and diagram analysis: Ask a vision model to summarise the key trends or data points in a chart. Qwen2.5-VL handles complex charts (multi-series line charts, scatter plots, confusion matrices) better than LLaVA at the same size, because its training data included more structured visual content.
Product and inventory cataloguing: Point a vision model at photos of physical items to generate descriptions for a database. This is significantly faster than manual cataloguing for large inventories and produces consistent output format when combined with a structured system prompt.
Document processing: Extract structured data from scanned documents, forms, invoices, and receipts. A well-prompted vision model can identify field names and values, convert tables to JSON, and handle varied document layouts with reasonable accuracy — good enough for a first-pass extraction that a human reviews, not a fully autonomous pipeline.
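As a concrete sketch of that first-pass extraction, the snippet below asks a vision model to return receipt fields as JSON and parses the reply. The field names and the receipt.jpg path are illustrative assumptions; Ollama's format='json' option constrains the response to valid JSON, but it does not guarantee the values are correct, so the human review step still applies.

import json
import ollama

def extract_receipt(image_path: str, model: str = 'qwen2.5vl:7b') -> dict:
    # Ask for a specific JSON shape; format='json' constrains the output to
    # valid JSON (field accuracy still needs a human check).
    prompt = (
        'Extract the merchant name, date, line items (description, quantity, price) '
        'and total from this receipt. Return only JSON with the keys '
        '"merchant", "date", "items", "total".'
    )
    response = ollama.chat(
        model=model,
        messages=[{'role': 'user', 'content': prompt, 'images': [image_path]}],
        format='json'
    )
    return json.loads(response['message']['content'])

# Hypothetical usage
print(extract_receipt('receipt.jpg'))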
Choosing the Right Vision Model
For most everyday image description tasks, LLaVA 7B is the reliable default — well-tested, widely supported, and handles general scenes well. For document and chart analysis, upgrade to Qwen2.5-VL 7B — its structured content understanding is noticeably better. For speed-critical or CPU-only setups, moondream2 is the only practical choice at 1.9B parameters. For the best overall quality when VRAM is not a constraint, Qwen2.5-VL 72B or LLaVA 34B produce significantly more accurate and detailed analyses than their smaller counterparts, though they require 40GB+ of VRAM or RAM to run comfortably.
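If you script over several task types, these recommendations can be captured in a small lookup table. The task labels below are arbitrary names chosen for this sketch; the model tags follow the guidance in this section.

# Task-to-model mapping reflecting the recommendations above;
# the task labels are illustrative names, not Ollama concepts.
VISION_MODEL_FOR_TASK = {
    'general_description': 'llava:7b',
    'documents_and_charts': 'qwen2.5vl:7b',
    'fast_or_cpu_only': 'moondream',
    'max_quality': 'qwen2.5vl:72b',   # needs 40GB+ VRAM/RAM
}

def pick_vision_model(task: str) -> str:
    return VISION_MODEL_FOR_TASK.get(task, 'llava:7b')  # fall back to the reliable default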
How Vision Models Work in Ollama
Multimodal models in Ollama work by combining a vision encoder (which converts the image into a sequence of embedding vectors) with a language model backbone. The image is encoded into a fixed-length sequence of tokens that the language model treats similarly to text tokens — it can attend to them, reason about them, and generate text that refers to specific visual details. The encoding step happens locally before the language model runs, so there is no intermediate network call or external service involved.
The practical implication of this architecture is that image resolution affects both quality and speed. Most vision models resize images to a fixed resolution (typically 336×336 or 448×448 pixels) before encoding them. If you pass a high-resolution image, Ollama resizes it down automatically. For tasks where fine detail matters — reading small text in a screenshot, distinguishing similar-looking objects — pass the highest quality image you have and let the model use what it can. For tasks where only the overall scene matters, lower-resolution images are faster and use less memory with no quality loss.
A key limitation to be aware of: most 7B vision models have a relatively small visual token budget, meaning they can attend to a limited amount of visual detail simultaneously. A 7B model looking at a dense spreadsheet screenshot may miss some rows, or a model looking at a complex circuit diagram may conflate similar-looking components. These are not bugs — they are capacity limitations of the vision encoder relative to the complexity of the image. Larger models (13B+) have more visual token capacity and perform noticeably better on complex structured images.
Streaming Responses from Vision Models
For interactive applications where you want to display the model’s description as it generates rather than waiting for the full response, use streaming with the Ollama Python library or the OpenAI API:
import ollama

def stream_image_description(image_path: str, prompt: str,
                             model: str = 'llava:7b'):
    stream = ollama.chat(
        model=model,
        messages=[{'role': 'user', 'content': prompt, 'images': [image_path]}],
        stream=True
    )
    for chunk in stream:
        print(chunk['message']['content'], end='', flush=True)
    print()  # newline at end

stream_image_description('photo.jpg', 'Describe everything visible in this image.')
Multi-Turn Conversations About Images
You can have a multi-turn conversation about an image by passing the image in the first message and following up with text-only questions. The image stays in context across turns because the full message history, including the first message that carries the image, is re-sent on every call:
import ollama

def image_conversation(image_path: str, model: str = 'llava:7b'):
    # First turn: send the image with an initial prompt
    history = [{
        'role': 'user',
        'content': 'Describe this image.',
        'images': [image_path]
    }]
    response = ollama.chat(model=model, messages=history)
    reply = response['message']['content']
    print(f'Model: {reply}')
    history.append({'role': 'assistant', 'content': reply})

    # Follow-up turns: text-only questions, full history re-sent each time
    while True:
        user_input = input('You: ').strip()
        if not user_input or user_input.lower() == 'quit':
            break
        history.append({'role': 'user', 'content': user_input})
        response = ollama.chat(model=model, messages=history)
        reply = response['message']['content']
        print(f'Model: {reply}')
        history.append({'role': 'assistant', 'content': reply})

image_conversation('diagram.png')
Performance Expectations
Vision model inference is slower than text-only inference for two reasons: the image encoding step adds latency before generation begins, and the visual tokens add to the effective context length, which increases the prefill computation. On a mid-range GPU (RTX 3070 or M2 Pro), expect 5–15 seconds for the first token on a 7B vision model, compared to 1–3 seconds for a text-only 7B model. Generation speed after the first token is similar. For interactive use this latency is noticeable but acceptable. For batch processing where you are running hundreds of images, the throughput is practical — a 7B vision model on an RTX 3080 can process 200–400 images per hour depending on image complexity and description length.
Moondream2 is the exception — at 1.9B parameters it starts generating the first token in 1–2 seconds on a GPU and can process images at several per minute even on CPU-only hardware. For use cases where speed matters more than depth of analysis (classifying images into categories, quick presence/absence detection, simple one-sentence summaries), moondream2 is the practical choice. Its understanding of complex scenes is weaker than LLaVA 7B but it handles straightforward description and classification tasks reliably and much faster.
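If you want to verify these figures on your own hardware, a simple approach is to time the first streamed chunk separately from the full response. The sketch below does that with the Ollama Python library; the numbers you see will depend on your GPU, the model, and the image.

import time
import ollama

def time_vision_request(image_path: str, model: str = 'llava:7b',
                        prompt: str = 'Describe this image.') -> None:
    start = time.perf_counter()
    first_token_at = None
    stream = ollama.chat(
        model=model,
        messages=[{'role': 'user', 'content': prompt, 'images': [image_path]}],
        stream=True
    )
    for chunk in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # image encoding + prefill finished here
    total = time.perf_counter() - start
    print(f'{model}: first token after {first_token_at - start:.1f}s, total {total:.1f}s')

# Compare a small and a mid-sized vision model on the same image
time_vision_request('photo.jpg', model='moondream')
time_vision_request('photo.jpg', model='llava:7b')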
Integrating Vision into a RAG Pipeline
A practical extension of local vision models is adding image-aware retrieval to a RAG pipeline. The approach is to pre-process a collection of images — diagrams, screenshots, product photos — with a vision model to generate text descriptions, then index those descriptions in a vector store alongside regular text documents. Queries retrieve both text documents and image descriptions, and the LLM can answer questions that draw on visual content without needing to process images at query time. This is more practical than trying to embed images directly into a vector store, because text embeddings of image descriptions are well-supported by existing tools and produce good retrieval results for the kinds of questions users actually ask about visual content.
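A minimal sketch of that pre-processing step is shown below, assuming a pulled vision model and the nomic-embed-text embedding model; the in-memory cosine-similarity search stands in for whatever vector store you already use, and the ./images directory and query text are placeholders.

import ollama
from pathlib import Path

def describe(image_path: Path, model: str = 'llava:7b') -> str:
    response = ollama.chat(
        model=model,
        messages=[{'role': 'user',
                   'content': 'Describe this image in two or three sentences.',
                   'images': [str(image_path)]}]
    )
    return response['message']['content']

def embed(text: str) -> list[float]:
    # assumes: ollama pull nomic-embed-text
    return ollama.embeddings(model='nomic-embed-text', prompt=text)['embedding']

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm

# Index: one description + embedding per image (replace with your vector store)
index = []
for img in Path('./images').glob('*.png'):
    text = describe(img)
    index.append({'file': img.name, 'text': text, 'vector': embed(text)})

# Query: retrieve the most relevant image descriptions for a question
query_vec = embed('Which diagram shows the authentication flow?')
for hit in sorted(index, key=lambda d: cosine(query_vec, d['vector']), reverse=True)[:3]:
    print(hit['file'], '-', hit['text'][:80])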
Building an Image Description API with FastAPI
Combining Ollama’s vision capabilities with FastAPI gives you a simple REST endpoint for image analysis that any other service can call. This is useful for building internal tools — a Slack bot that describes images in channels, a content management system that auto-generates alt text, or a quality inspection pipeline that flags anomalies in product photos.
from fastapi import FastAPI, UploadFile, File
from fastapi.responses import JSONResponse
import ollama
import tempfile, os

app = FastAPI()

@app.post('/describe')
async def describe_image(
    file: UploadFile = File(...),
    prompt: str = 'Describe this image in detail.',
    model: str = 'llava:7b'
):
    # Save uploaded file temporarily so Ollama can read it from disk
    with tempfile.NamedTemporaryFile(delete=False, suffix=os.path.splitext(file.filename)[1]) as tmp:
        tmp.write(await file.read())
        tmp_path = tmp.name
    try:
        response = ollama.chat(
            model=model,
            messages=[{'role': 'user', 'content': prompt, 'images': [tmp_path]}]
        )
        return JSONResponse({'description': response['message']['content'], 'model': model})
    finally:
        os.unlink(tmp_path)

# Run: uvicorn app:app --host 0.0.0.0 --port 8001
# Test: curl -X POST http://localhost:8001/describe -F file=@photo.jpg
Getting the Best Results
A few prompting habits improve vision model outputs significantly. First, be specific about what you want extracted — “describe this image” produces a general description, while “list all the text visible in this image, preserving the original formatting” produces a structured text extraction. Second, for structured output like tables or JSON, include the desired format in the prompt: “extract the data from this table and return it as a JSON array with column names as keys”. Third, if the model misses details in a complex image, try cropping to the region of interest before passing it — a 500×300 crop of a data table will be analysed more accurately than the same table embedded in a 3000×2000 full-page screenshot where it occupies 5% of the image area. Fourth, for OCR-heavy tasks, prefer Qwen2.5-VL or MiniCPM-V over LLaVA — they were trained on more document and text-heavy images and produce more accurate character-level transcriptions, especially for non-standard fonts, handwriting, and technical notation.
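To illustrate the cropping advice, the snippet below cuts a region of interest out of a full-page screenshot with Pillow before handing it to the model. The pixel coordinates and file names are placeholders for wherever your table actually sits.

from PIL import Image
import ollama

# Crop the region of interest (left, top, right, bottom are placeholder
# coordinates) so the table fills the frame instead of 5% of a full page.
full_page = Image.open('full_page_screenshot.png')
table_crop = full_page.crop((120, 840, 1620, 1380))
table_crop.save('table_crop.png')

response = ollama.chat(
    model='qwen2.5vl:7b',
    messages=[{'role': 'user',
               'content': 'Extract the data from this table and return it as a '
                          'JSON array with column names as keys.',
               'images': ['table_crop.png']}]
)
print(response['message']['content'])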
Privacy Advantages of Local Vision Processing
Processing images locally with Ollama means the image data never leaves your machine. This matters in several professional contexts. For legal and medical professionals, screenshots of client communications, patient records, or case documents cannot be sent to cloud vision APIs without careful compliance review — local processing sidesteps this entirely. For competitive intelligence and product development work, screenshots of unreleased features, internal dashboards, and competitor analysis cannot be safely processed by cloud services that might log inputs. For journalists and researchers working with sensitive visual materials, local processing eliminates the risk that image content could be accessed by a third party or used to train future models.
The practical quality gap between local vision models and cloud vision APIs (GPT-4 Vision, Claude, Gemini) is real but narrower than it was a year ago. For general image description, chart reading, and screenshot OCR, local models at the 7B tier are good enough for most production use cases with well-crafted prompts. The cloud models still have an edge on highly ambiguous images, complex multi-object scenes with fine-grained distinctions, and tasks requiring broad world knowledge to interpret visual context correctly. For use cases where privacy matters and the visual content is reasonably structured, local vision models are a practical alternative to cloud APIs that is worth evaluating before accepting the privacy tradeoffs of sending images to an external service.
Getting Started
The fastest path to running local vision inference is: pull moondream2 for a quick test (ollama pull moondream), run ollama run moondream "Describe this image: ./yourimage.jpg", and see a description generated in seconds. If the quality is sufficient for your use case, you are done — moondream2's speed makes it practical for interactive use even on modest hardware. If you need better quality, upgrade to llava:7b or qwen2.5vl:7b, which require more VRAM but produce significantly more detailed and accurate analyses. The progression from moondream2 to llava:7b to qwen2.5vl:7b covers most use cases from casual image Q&A to production document processing, and each step is a straightforward model swap with no code changes required if you are using the Ollama Python library or the OpenAI-compatible API.