Ollama makes it easy to pull and run models with a single command, but most users never discover its most powerful feature: the Modelfile. A Modelfile lets you customise any model’s system prompt, default parameters, and stop sequences, then save the result as a named custom model you can run with a single short command. This article walks through everything you can do with a Modelfile, with practical examples for the most common use cases.
What Is a Modelfile?
A Modelfile is a plain text file that defines how Ollama should configure a model when it runs. It is conceptually similar to a Dockerfile — you specify a base model to build from, then layer on customisations. Ollama reads the Modelfile and creates a new named model that can be run with ollama run your-model-name. The customisations are baked in, so you do not have to pass flags or prompts on the command line every time.
Basic Modelfile Structure
# Minimal Modelfile — save as 'Modelfile' (no extension)
FROM llama3.2
SYSTEM """You are a concise technical assistant. Always reply in plain text, no markdown formatting. Keep answers under 3 sentences unless the user explicitly asks for more detail."""
Save that as a file named Modelfile, then create and run your custom model:
# Create the custom model from the Modelfile
ollama create concise-assistant -f Modelfile
# Run it like any other Ollama model
ollama run concise-assistant
# List all your models including custom ones
ollama list
Setting Model Parameters
The PARAMETER instruction controls the model’s generation behaviour. These map directly to the sampling parameters used at inference time.
FROM llama3.2
SYSTEM "You are a helpful coding assistant specialised in Python."
# Temperature: 0.0 = deterministic, 1.0 = creative. Lower is better for code.
PARAMETER temperature 0.2
# top_p: nucleus sampling cutoff
PARAMETER top_p 0.9
# top_k: limit vocabulary to top-k tokens at each step
PARAMETER top_k 40
# num_ctx: context window size in tokens
PARAMETER num_ctx 8192
# num_predict: max tokens to generate (-1 = unlimited)
PARAMETER num_predict 2048
# repeat_penalty: penalise repeated tokens
PARAMETER repeat_penalty 1.1
# stop sequences: model stops generating at these strings
PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|end_of_text|>"
Pointing to Different Base Models
The FROM instruction can reference any model Ollama supports, including specific quantisation variants and local GGUF files.
# Use a specific quantisation — q4_K_M balances size and quality well
FROM llama3.2:3b-instruct-q4_K_M
# Use a larger model if you have sufficient VRAM
FROM qwen2.5-coder:14b
# Point directly to a local GGUF file on disk
FROM /path/to/your/model.gguf
Four Practical Modelfile Examples
Code reviewer with low temperature:
FROM qwen2.5-coder:7b
SYSTEM """You are an expert code reviewer. When given code:
1. Identify bugs and logic errors first
2. Point out performance issues
3. Suggest idiomatic improvements
Be specific: quote the problematic line and explain why."""
PARAMETER temperature 0.1
PARAMETER num_ctx 16384
JSON-only output:
FROM llama3.2
SYSTEM "You always respond with valid JSON only. No markdown, no explanation, no preamble."
PARAMETER temperature 0.0
Document summariser with fixed output format:
FROM llama3.1:8b
SYSTEM """Summarise documents in this exact format:
ONE LINE SUMMARY: (one sentence)
KEY POINTS: (3-5 bullet points)
ACTION ITEMS: (any tasks mentioned, or 'None')
Do not add any other text outside this format."""
PARAMETER temperature 0.3
PARAMETER num_ctx 32768
Low-RAM assistant for older hardware:
FROM llama3.2:3b-instruct-q4_K_S
SYSTEM "You are a helpful assistant. Be concise."
PARAMETER num_ctx 2048
PARAMETER temperature 0.7
PARAMETER num_predict 512
Using the Ollama API with a Custom Model
Once you have created a custom model with ollama create, you can call it through the Ollama REST API exactly like any built-in model. This is useful for scripts and applications — your system prompt is baked into the model, so you do not repeat it in every request.
import requests
def chat(prompt: str, model: str = 'code-reviewer') -> str:
    response = requests.post(
        'http://localhost:11434/api/chat',
        json={
            'model': model,
            'messages': [{'role': 'user', 'content': prompt}],
            'stream': False
        }
    )
    return response.json()['message']['content']
# System prompt from Modelfile is applied automatically
review = chat('Review this: def divide(a, b): return a/b')
print(review)
Updating a Custom Model
To update a custom model, edit the Modelfile and run ollama create again with the same name. The old version is overwritten in place.
# Edit Modelfile, then rebuild to apply changes
ollama create code-reviewer -f Modelfile
# Verify the update
ollama show code-reviewer
# See the full Modelfile of any model — great for understanding defaults
ollama show llama3.2 --modelfile
The ollama show --modelfile trick is particularly useful: run it on any official model to see the default system prompt and parameters it ships with. This is the fastest way to find the correct stop tokens for a given model family.
Seeding Conversation History with MESSAGE
The MESSAGE instruction pre-populates the conversation history with example exchanges before the user’s first message. This is effectively few-shot prompting baked into the model — useful when you want the model to reliably follow a specific format or tone without a lengthy system prompt.
FROM llama3.2
SYSTEM "You are a SQL expert. Respond with the SQL query only, no explanation."
# Seed with examples so the model knows exactly what format to follow
MESSAGE user "Get all users created in the last 7 days"
MESSAGE assistant "SELECT * FROM users WHERE created_at >= NOW() - INTERVAL '7 days';"
MESSAGE user "Count orders by status"
MESSAGE assistant "SELECT status, COUNT(*) as count FROM orders GROUP BY status;"
PARAMETER temperature 0.0
Sharing and Exporting Modelfiles
A Modelfile is just a text file, so sharing it is straightforward. Export it from any model with ollama show --modelfile, commit it to a git repo, and teammates can recreate the model with one command.
# Export to a file
ollama show code-reviewer --modelfile > Modelfile.code-reviewer
# Teammate recreates it
ollama create code-reviewer -f Modelfile.code-reviewer
# Push to the Ollama registry (requires a free account).
# The model must first be copied into your username's namespace.
ollama cp code-reviewer yourusername/code-reviewer
ollama push yourusername/code-reviewer
Full Instruction Reference
The core instructions a Modelfile supports (Ollama also accepts an ADAPTER instruction for applying a LoRA adapter on top of the base model, which is beyond the scope of this article):
# FROM — required. The base model to build on.
FROM llama3.2
# SYSTEM — system prompt prepended to every conversation.
SYSTEM "You are a helpful assistant."
# PARAMETER — inference settings. Key options:
# temperature 0.0-2.0 creativity (default 0.8)
# top_k int top-k sampling (default 40)
# top_p 0.0-1.0 nucleus sampling (default 0.9)
# num_ctx int context window in tokens (default 2048)
# num_predict int max output tokens, -1 unlimited
# repeat_penalty float penalise repetition (default 1.1)
# seed int 0 = random; fixed = reproducible
# stop string stop generation at this token (repeatable)
PARAMETER temperature 0.7
# MESSAGE — seed the conversation history for few-shot prompting.
MESSAGE user "Example question"
MESSAGE assistant "Example answer"
# TEMPLATE — override the prompt template (advanced use only).
# Only needed for models with non-standard chat formats.
TEMPLATE "{{ .System }}{{ .Prompt }}{{ .Response }}"
# LICENSE — informational license string.
LICENSE "MIT"
Common Gotchas
A few things that catch people out. First, num_ctx directly determines RAM usage — the KV cache scales linearly with context length, so if Ollama is falling back to CPU or running out of memory, halving num_ctx is the first thing to try. Second, stop sequences are model-specific — the right tokens depend on the model’s chat template. Run ollama show modelname --modelfile on the base model to find its default stop tokens and copy them into your Modelfile. Third, editing the Modelfile on disk does nothing on its own — you must re-run ollama create for changes to take effect. Fourth, temperature 0.0 is not fully deterministic unless you also set a fixed seed, because GPU floating point operations can vary across runs. Set both temperature 0.0 and seed 42 (or any fixed integer) if you need fully reproducible outputs.
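As a minimal sketch of that last point, the Modelfile below pins both values. The base model and system prompt are placeholders, but with both parameters set, repeated runs of the same prompt should return the same text.
# Reproducibility sketch: fix both temperature and seed
FROM llama3.2
SYSTEM "You are a deterministic extraction assistant."
PARAMETER temperature 0.0
PARAMETER seed 42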
Why Modelfiles Matter for Productivity
The core productivity gain from Modelfiles is that they eliminate the need to re-enter the same context every session. If you are using Ollama for a recurring task — reviewing pull requests, writing commit messages, converting unstructured notes into structured formats — a Modelfile captures your exact prompt engineering once and makes it reusable as a first-class model. Instead of starting every session with a long system prompt copy-pasted from a notes file, you just run ollama run your-model and the model behaves exactly as configured.
This becomes especially valuable when sharing workflows with a team. A well-crafted Modelfile is a reproducible, version-controllable artefact. Commit it to your repository alongside the code it supports — a Modelfile for your project’s code reviewer, one for your documentation writer, one for your test case generator. New team members get a consistent AI assistant behaviour with zero setup beyond running ollama create.
Combining Modelfiles with Ollama’s Multimodal Models
Modelfiles work with multimodal models too — models that accept images as input alongside text. The configuration approach is identical: specify a multimodal base model in the FROM instruction and set your system prompt and parameters as usual. This lets you create specialised vision assistants with baked-in instructions for specific visual tasks.
# Vision model configured for structured image analysis
FROM llava:13b
SYSTEM """You are a precise image analyser. When given an image, describe:
1. Main subject and composition
2. Any text visible in the image (exact transcription)
3. Technical quality: lighting, focus, any issues
Respond in plain text. Be specific and factual."""
PARAMETER temperature 0.1
PARAMETER num_ctx 4096
ollama create image-analyser -f Modelfile
# Pass an image at runtime by including its file path in the prompt
ollama run image-analyser "Describe this image: ./screenshot.png"
Debugging Modelfile Issues
If your custom model is not behaving as expected, the most common issues are: the system prompt being ignored (check that the base model actually supports system prompts — some older models do not), the model stopping too early (wrong or missing stop tokens — copy them from ollama show basemodel --modelfile), or generation being too slow (num_ctx too large for available VRAM — reduce it and rebuild). Running ollama show your-model without flags gives a summary including parameter count, context length, and the system prompt as the model sees it, which is the fastest way to verify your Modelfile was applied correctly.
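A typical debugging loop, using only commands covered earlier, looks like this (the model names are illustrative):
# Check the custom model as Ollama sees it: parameters, context length, system prompt
ollama show code-reviewer
# Compare against the base model's defaults, especially its stop tokens
ollama show qwen2.5-coder:7b --modelfile
# Edit the Modelfile, rebuild, and retest
ollama create code-reviewer -f Modelfile
ollama run code-reviewer "Review this: def divide(a, b): return a/b"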
Choosing the Right Base Model for Your Modelfile
The FROM instruction is the most consequential decision in a Modelfile. The base model determines the fundamental capability ceiling — a system prompt cannot make a 3B model perform like a 70B model, but it can dramatically improve how reliably the 3B model follows your specific instructions within its capability range. For most practical customisation tasks, the right base model depends on three factors: the task type, your available RAM, and whether you prioritise speed or quality.
For coding tasks, the Qwen2.5-Coder and DeepSeek-Coder model families consistently outperform general-purpose models of the same parameter count on code generation and review. A Modelfile built on qwen2.5-coder:7b will produce better code review output than the same Modelfile on llama3.1:8b, even though they are similar in size. For general reasoning, instruction following, and writing tasks, Llama 3.2 and Mistral Instruct are strong base choices that respond well to system prompt customisation. For tasks requiring long context — processing large documents, long codebases — look at models that explicitly support extended context: Llama 3.1 and 3.2 support up to 128K tokens when num_ctx is set accordingly, though RAM requirements scale proportionally.
When using quantised variants in the FROM instruction, the naming convention is consistent across models: Q4_K_M is the recommended default for most hardware (good quality, reasonable RAM), Q5_K_M gives better quality at higher RAM cost, Q8_0 is near-lossless at roughly twice the RAM of Q4_K_M, and Q4_K_S is the smallest reasonable quantisation when RAM is the hard constraint. These can be specified directly in the FROM line as FROM llama3.1:8b-instruct-q4_K_M.
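As an illustration, retargeting a Modelfile is just a matter of swapping the FROM line. The tags below follow the naming convention described above; exact tag availability varies by model, so check the model's page in the Ollama library before relying on one.
# Smallest reasonable footprint, for when RAM is the hard constraint
FROM llama3.1:8b-instruct-q4_K_S
# Recommended default: good quality at reasonable RAM
# FROM llama3.1:8b-instruct-q4_K_M
# Better quality, more RAM
# FROM llama3.1:8b-instruct-q5_K_M
# Near-lossless, roughly twice the RAM of q4_K_M
# FROM llama3.1:8b-instruct-q8_0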
System Prompt Design for Reliable Output
The system prompt in a Modelfile is not just a description of the role — it is the primary mechanism for making model behaviour reliable and consistent. Three principles make the difference between a system prompt that works and one that the model partially ignores. First, lead with the output format constraint before anything else. Models follow format instructions more reliably when they appear at the top of the system prompt rather than buried at the end. If you need JSON output, the first sentence should say so. Second, give explicit negative constraints for the things you most want to avoid — “no markdown”, “do not add preamble”, “do not apologise” — because models have strong defaults toward these behaviours that a positive instruction alone may not suppress. Third, keep the system prompt focused on a single role. A system prompt that tries to make a model be a code reviewer, a documentation writer, and a general assistant simultaneously tends to produce inconsistent behaviour. Better to create three separate Modelfiles and run the appropriate one for each task.
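Put together, a system prompt that applies all three principles might look like the following sketch (the JSON shape and role are illustrative):
FROM llama3.2
SYSTEM """Respond with valid JSON only, matching this shape: {"title": string, "tags": [string], "summary": string}.
Do not use markdown, do not add code fences, do not add any text before or after the JSON.
Your single role is to convert the user's free-text note into that JSON structure."""
PARAMETER temperature 0.0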
Modelfiles vs Passing System Prompts at Runtime
You might wonder why a Modelfile is better than just passing a system prompt at runtime, either through the API or with /set system in an interactive ollama run session. The answer depends on your use case. For one-off or exploratory use, runtime system prompts are more flexible — you can change them without rebuilding a model. For recurring, team-shared, or production use, Modelfiles win on several dimensions. They are version-controllable in git, shareable as a single file, executable as a named model without any flags, and produce more consistent behaviour because the parameters and system prompt are always applied together atomically. There is also a practical reproducibility advantage: a Modelfile committed to a repository guarantees that everyone on the team is using the exact same temperature, context length, stop tokens, and system prompt — no drift from someone forgetting to copy the full prompt or using a slightly different parameter.
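For comparison, this is roughly what the runtime approach looks like in application code: a sketch in which the caller passes a system-role message on every /api/chat request, which is exactly the repetition a Modelfile removes (the prompt text and model name are illustrative).
import requests

SYSTEM_PROMPT = "You are an expert code reviewer. Quote the problematic line and explain why."

def chat_with_runtime_system(prompt: str, model: str = 'qwen2.5-coder:7b') -> str:
    # The system prompt must be supplied on every request. If the calling code
    # forgets it, the model silently falls back to its default behaviour.
    response = requests.post(
        'http://localhost:11434/api/chat',
        json={
            'model': model,
            'messages': [
                {'role': 'system', 'content': SYSTEM_PROMPT},
                {'role': 'user', 'content': prompt},
            ],
            'stream': False,
        },
    )
    return response.json()['message']['content']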
For automated pipelines and scripts, Modelfiles are particularly valuable because they eliminate a category of bugs where the system prompt is accidentally omitted or truncated. When you call ollama run my-model or hit the API with model: my-model, the system prompt is guaranteed to be present regardless of how the calling code is written. This makes Modelfile-backed models substantially more reliable in production integrations than relying on application code to always pass the correct system prompt.
The practical workflow that works well for most teams is: develop and iterate on system prompts at runtime using the API or CLI, then once a prompt is stable and delivering reliable results, promote it into a Modelfile, name the model descriptively, and commit the Modelfile to the project repository. This combines the flexibility of runtime experimentation with the reliability and shareability of a Modelfile-defined model.