Gemma 3 is Google’s third-generation open-weight model family, released in early 2025. It is notable for two things that make it particularly interesting for local deployment: it is genuinely multimodal (the 4B, 12B, and 27B variants all accept image inputs natively), and it achieves strong benchmark results at smaller parameter counts than most competing models. The 4B variant fits comfortably on 6GB of VRAM and outperforms many 7B models from other families on reasoning tasks. This guide covers what makes Gemma 3 worth trying, how to run it with Ollama, and which variant to choose for your hardware.
The Gemma 3 Model Family
Gemma 3 comes in four sizes: 1B, 4B, 12B, and 27B. All except the 1B support image inputs. The models are instruction-tuned variants designed for chat and assistant use — there are also base (pre-trained only) variants available on Hugging Face for fine-tuning purposes, but for local deployment the instruct variants are what you want.
- Gemma 3 1B — text only, fast on CPU, useful for embedded or edge scenarios
- Gemma 3 4B — text and image, fits on 6GB VRAM, strong reasoning for its size
- Gemma 3 12B — text and image, needs 10GB+ VRAM, flagship mid-range option
- Gemma 3 27B — text and image, needs 20GB+ VRAM or Apple Silicon 32GB+, near-frontier quality locally
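The hardware guidance above can be sketched as a small selection helper. The thresholds mirror the list and are approximate; the function is purely illustrative, not part of any API:

```python
def pick_gemma3_variant(vram_gb: float, need_vision: bool = False) -> str:
    """Suggest a Gemma 3 tag for a given VRAM budget (approximate thresholds).

    Follows the guidance above: 4B from ~6GB, 12B from ~10GB, 27B from ~20GB.
    The 1B variant is text-only, so vision needs at least the 4B.
    """
    if vram_gb >= 20:
        return 'gemma3:27b'
    if vram_gb >= 10:
        return 'gemma3:12b'
    if vram_gb >= 6:
        return 'gemma3:4b'
    if need_vision:
        raise ValueError('Vision requires the 4B variant or larger (~6GB VRAM)')
    return 'gemma3:1b'

print(pick_gemma3_variant(8))    # gemma3:4b
print(pick_gemma3_variant(24))   # gemma3:27b
```

Apple Silicon users can treat unified memory minus a few gigabytes of system overhead as the effective budget.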
Running Gemma 3 with Ollama
```shell
# Pull the 4B instruct variant (recommended starting point)
ollama pull gemma3:4b

# Run interactively
ollama run gemma3:4b

# Other sizes
ollama pull gemma3:12b
ollama pull gemma3:27b

# Check which models are loaded and their VRAM usage
ollama ps
```
Using Gemma 3’s Vision Capabilities
```shell
# Describe an image from the CLI — there is no --image flag; include the
# file path in the prompt and Ollama passes it to multimodal models
ollama run gemma3:4b "What is in this image? ./photo.jpg"

# Ask a specific question
ollama run gemma3:4b "What text is visible in this screenshot? ./screen.png"
```
```python
import ollama

# Image analysis with Gemma 3
response = ollama.chat(
    model='gemma3:4b',
    messages=[{
        'role': 'user',
        'content': 'Describe what you see in this image in detail.',
        'images': ['photo.jpg']
    }]
)
print(response['message']['content'])
```
Gemma 3 4B vs Llama 3.1 8B: The Size-Quality Trade-Off
The headline claim for Gemma 3 4B is that it outperforms many 7B and 8B models on reasoning benchmarks despite being half the size. In practice this holds up for certain task types — mathematical reasoning, code generation, and structured logical problems — where Gemma 3’s training approach shows clear advantages. For general conversation and creative writing, the gap is smaller and sometimes reverses in favour of larger models from other families. The practical question for local use is whether Gemma 3 4B produces output you are satisfied with for your specific tasks, at half the VRAM cost of an 8B model.
On 8GB VRAM, Gemma 3 4B leaves 2–3GB of headroom that an 8B model would consume. That headroom can go towards a larger context window (raising num_ctx to 16K or 32K is comfortable for Gemma 3 4B on 8GB of VRAM) or towards keeping an embedding model loaded alongside it. For machines capped at 8GB of VRAM, Gemma 3 4B’s efficiency makes it worth serious evaluation before pushing an 8B model to its limits.
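If you only want the larger window for particular calls, Ollama also accepts per-request options rather than a Modelfile. A minimal sketch, assuming the ollama Python package and a local server with gemma3:4b pulled (the helper names are illustrative):

```python
def make_options(num_ctx: int = 16384, temperature: float = 0.7) -> dict:
    # Per-request options override the model's defaults for this call only
    return {'num_ctx': num_ctx, 'temperature': temperature}

def ask_long_context(prompt: str) -> str:
    import ollama  # deferred: requires the ollama package and a running server
    response = ollama.chat(
        model='gemma3:4b',
        messages=[{'role': 'user', 'content': prompt}],
        options=make_options(num_ctx=16384),  # 16K window within the headroom
    )
    return response['message']['content']
```

Note that a larger num_ctx allocates a larger KV cache for that request, so the headroom discussed above is exactly what this spends.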
Gemma 3 12B: The Sweet Spot for Quality
Gemma 3 12B sits in a similar tier to Mistral Nemo 12B but with native multimodality at the same parameter count. If you are already running a 12B model for its quality advantage over 7–8B models, Gemma 3 12B adds image understanding without increasing the hardware requirement. The 12B variant performs well on all the tasks where 7B models show weakness: complex multi-step reasoning, long document comprehension, maintaining coherence in extended conversations, and following complex structured output formats reliably. At Q4_K_M quantisation it requires approximately 8GB of VRAM, making it accessible on a 10GB or 12GB GPU with some context window headroom.
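One way to exercise the structured-output strength is Ollama's JSON mode, which constrains the reply to valid JSON. A hedged sketch, assuming the ollama Python package and a server with gemma3:12b pulled (the field names and helpers are illustrative):

```python
import json

def has_required_keys(d: dict, keys=('person', 'date', 'topic')) -> bool:
    # Pure validation helper for the extracted record
    return all(k in d for k in keys)

def extract_fields(text: str, model: str = 'gemma3:12b') -> dict:
    import ollama  # deferred: requires the ollama package and a running server
    response = ollama.chat(
        model=model,
        messages=[{
            'role': 'user',
            'content': (
                'Extract the person, date, and topic from this text as JSON '
                f'with keys "person", "date", "topic":\n\n{text}'
            ),
        }],
        format='json',  # ask Ollama to constrain output to valid JSON
    )
    return json.loads(response['message']['content'])
```

On smaller hardware the same call works with gemma3:4b, with somewhat less reliable adherence to complex schemas.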
Gemma 3 27B: Local Frontier Quality
The 27B variant pushes into territory where it genuinely competes with cloud models for many tasks. On Apple Silicon with 32GB of unified memory it runs smoothly at Q4_K_M quantisation, producing responses competitive with GPT-4o-mini class models on most practical tasks. For users with M2 Max or M3 Max machines, Gemma 3 27B is the most capable model they can run at reasonable speed, with Q4_K_M quantisation costing little in quality. At that quantisation, the 27B uses approximately 17GB of VRAM or unified memory — fitting comfortably on a 24GB discrete GPU or a 32GB Apple Silicon system.
Context Window and Multilingual Support
Gemma 3 supports a 128K token context window — the same as Mistral Nemo — making it genuinely capable of processing long documents in a single pass. To use more than the default 2048-token window in Ollama, increase num_ctx via a Modelfile:
```shell
cat > Modelfile.gemma3 << 'EOF'
FROM gemma3:4b
PARAMETER num_ctx 32768
PARAMETER temperature 0.7
EOF
ollama create gemma3-32k -f Modelfile.gemma3
```
Gemma 3 also has strong multilingual support, with Google reporting solid performance in over 140 languages. For applications serving non-English users, Gemma 3 is worth evaluating specifically on your target languages — its multilingual training breadth exceeds most other open models at comparable sizes.
Gemma 3 vs Other Local Models: How to Choose
Choose Gemma 3 4B over Llama 3.1 8B when VRAM is constrained and you need strong reasoning — you get more reasoning capability per gigabyte of VRAM. Choose Gemma 3 12B when you want native multimodality alongside strong general capability and are already targeting the 12B tier. Choose Gemma 3 27B when you have Apple Silicon with 32GB+ or a 24GB GPU and want the best locally runnable quality for combined text and image tasks. Choose Llama 3 or Qwen2.5 over Gemma 3 when creative writing quality, general instruction following, or deep Python/JavaScript code generation is the primary task — those families hold advantages in those specific areas even if Gemma 3 wins on broad reasoning benchmarks.
Google’s Open Weights Commitment
Gemma 3 is released under the Gemma Terms of Use, which permits use in commercial applications for most users — organisations with over one million monthly active users require a separate licence. The weights are available on Hugging Face and through Ollama’s model library with no access gating beyond accepting the terms. This makes Gemma 3 a practical choice for production local deployments without the licensing complexity that surrounds some other open-weight models. The continued investment Google is making in the Gemma family — with each generation bringing meaningful capability improvements — suggests the model series will continue to be a relevant option in the local LLM landscape for the foreseeable future.
Running Gemma 3 in a Python Application
The native Ollama library and the OpenAI-compatible endpoint both work with Gemma 3 without any special configuration. Use the chat endpoint rather than generate — Gemma 3 is a chat-format model and the generate endpoint bypasses the chat template, producing inconsistent formatting on instruction-following tasks.
```python
import ollama

def ask_gemma(question: str, model: str = 'gemma3:4b') -> str:
    response = ollama.chat(
        model=model,
        messages=[{'role': 'user', 'content': question}],
        options={'temperature': 0.7, 'num_ctx': 8192}
    )
    return response['message']['content']

def analyse_image_gemma(image_path: str, question: str) -> str:
    response = ollama.chat(
        model='gemma3:4b',
        messages=[{'role': 'user', 'content': question, 'images': [image_path]}]
    )
    return response['message']['content']

print(ask_gemma('Explain the attention mechanism in two paragraphs.'))
print(analyse_image_gemma('chart.png', 'Summarise the key trends in this chart.'))
```
Gemma 3 on Real Benchmarks
Published evaluations give Gemma 3 4B scores comparable to Llama 3.1 8B on MMLU and reasoning tasks — impressive for half the parameter count. For coding (HumanEval), Qwen2.5-Coder 7B remains stronger. The tasks that most favour Gemma 3 are mathematical reasoning (GSM8K, MATH), instruction following (IFEval), and factual Q&A. For users whose work involves quantitative reasoning, Gemma 3 4B is genuinely the best model in the sub-6GB VRAM tier available today.
On long-form generation, Gemma 3 4B holds quality well across 2,000–3,000 tokens without the coherence degradation that affects some smaller models. This is a consequence of Google’s training data quality and instruction tuning investment, which shows up more clearly in extended generation tasks than in short-answer benchmarks. For a 4B model to sustain structured, coherent output across several pages is genuinely useful in real workflows — summarisation, drafting, analysis — where short-answer quality is a poor proxy for usefulness.
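For long-form output it is worth streaming the response so you can watch coherence directly rather than waiting for the full generation. A sketch using the Ollama client's streaming interface, where each chunk carries a message fragment (helper names are illustrative):

```python
def join_fragments(chunks) -> str:
    # Pure helper: concatenate message fragments from streamed chunks
    return ''.join(c['message']['content'] for c in chunks)

def stream_gemma(prompt: str, model: str = 'gemma3:4b') -> str:
    import ollama  # deferred: requires the ollama package and a running server
    parts = []
    for chunk in ollama.chat(
        model=model,
        messages=[{'role': 'user', 'content': prompt}],
        stream=True,
    ):
        fragment = chunk['message']['content']
        print(fragment, end='', flush=True)  # show tokens as they arrive
        parts.append(fragment)
    return ''.join(parts)
```

Streaming also makes first-token latency visible, which matters more than total throughput for interactive use.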
Document Analysis with Large Context
With a 32K context window configured, Gemma 3 4B can process documents of 20,000–25,000 words in a single pass — research papers, contracts, technical documentation — without chunking. The combination of strong reasoning, large context, and multimodal capability in a single 4B model is Gemma 3’s defining characteristic. No other open model at this parameter count has all three simultaneously.
```python
import ollama

def analyse_document(file_path: str, question: str) -> str:
    with open(file_path) as f:
        content = f.read()
    # Roughly cap input at 20,000 words to stay within the 32K-token window
    words = content.split()
    if len(words) > 20000:
        content = ' '.join(words[:20000])
    return ollama.chat(
        model='gemma3:4b',
        messages=[{'role': 'user', 'content': f'{question}\n\n{content}'}],
        options={'num_ctx': 32768}
    )['message']['content']

print(analyse_document('report.txt', 'What are the three main conclusions?'))
```
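For documents that exceed even the 32K window, a simple fallback is to split by word count and analyse per chunk. The splitter below is a plain sketch; chunk_words mirrors the cap above and is an arbitrary choice, and the second helper assumes the same ollama setup as the previous examples:

```python
def split_by_words(text: str, chunk_words: int = 20000) -> list[str]:
    # Split a document into word-count-bounded chunks for per-chunk analysis
    words = text.split()
    return [
        ' '.join(words[i:i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ] or ['']

def analyse_in_chunks(text: str, question: str) -> list[str]:
    import ollama  # deferred: requires the ollama package and a running server
    return [
        ollama.chat(
            model='gemma3:4b',
            messages=[{'role': 'user', 'content': f'{question}\n\n{chunk}'}],
            options={'num_ctx': 32768},
        )['message']['content']
        for chunk in split_by_words(text)
    ]
```

The per-chunk answers can then be merged with one final summarisation call, a basic map-reduce pattern.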
Getting Started
Pull Gemma 3 4B with ollama pull gemma3:4b, run it with ollama run gemma3:4b, and try a reasoning task you normally use as a benchmark. For vision, include the image path directly in the prompt, for example ollama run gemma3:4b "Describe ./yourfile.jpg". If the 4B output quality is sufficient, you have the most capable sub-6GB VRAM model available with no further configuration needed. If you need more quality and have the hardware, upgrade to 12B — the command is identical, just a different model tag. The jump in quality from 4B to 12B is noticeable on complex multi-step tasks, while the jump from 12B to 27B is smaller and more hardware-intensive.
Gemma 3 is one of the few local model families where the full range from ultralight edge deployment to high-quality professional use is covered by a single coherent architecture — 1B through 27B, all with consistent APIs through Ollama. Starting with 4B and scaling up as your hardware and requirements evolve is a smooth progression rather than a platform switch, which lowers the long-term cost of adopting the family as your primary local model stack.
For anyone who has not tried a Google-trained model locally before, Gemma 3 is the most compelling entry point in the family’s history — the 4B variant in particular represents a genuine step change in what is achievable at this parameter count and hardware tier. Pull it once, run it for a week on your real tasks, and compare the results directly against whatever you currently use — that concrete comparison is worth more than any benchmark table.
That empirical approach — test on your real work rather than relying on published benchmarks — is the right way to evaluate any local model, and Gemma 3 holds up well under it.