Ollama Quantization Explained: Q4 vs Q5 vs Q8 and How to Choose

When you pull a model in Ollama, the model name often includes a tag like :q4_K_M or :q8_0. These tags indicate the quantisation level — how aggressively the model’s weights have been compressed from their original 16-bit or 32-bit floating point representation. Understanding quantisation helps you choose the right model version for your hardware and use case, and explains why a 7B model can weigh anywhere from 3GB to 14GB depending on the tag.

What Quantisation Does

A neural network’s weights are numbers. At full precision (FP32), each weight is a 32-bit floating point number — 4 bytes. At half precision (FP16 or BF16), each weight is 2 bytes. A 7B model at FP16 occupies approximately 14GB of memory (7 billion × 2 bytes). Most consumer GPUs cannot fit this. Quantisation reduces each weight to fewer bits: an 8-bit integer (INT8) halves memory to 7GB; a 4-bit integer (INT4) halves it again to 3.5GB. The trade-off is that the reduced precision introduces rounding errors that slightly degrade the model’s output quality — the degree of degradation depends on how carefully the quantisation was done.
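The arithmetic above can be sketched directly. This is a lower bound on weight memory alone — real GGUF files run slightly larger because per-block scale factors and a few layers kept at higher precision add overhead (which is why Q4_K_M for a 7B model is ~4.5GB rather than 3.5GB):

```python
# Rough memory needed for model weights alone at a given precision.
# Real quantised files are somewhat larger: block scale factors and
# layers kept at higher precision add overhead on top of this figure.
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"7B at {name}: {weight_memory_gb(7, bits):.1f} GB")
# 7B at FP16 → 14.0 GB, INT8 → 7.0 GB, INT4 → 3.5 GB
```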

The GGUF Quantisation Formats

Ollama uses GGUF format models from llama.cpp. The quantisation naming follows this pattern:

  • Q2_K — 2-bit quantisation, extremely small, significant quality loss, only for very constrained hardware
  • Q3_K_S / Q3_K_M / Q3_K_L — 3-bit, small/medium/large variant (larger = better quality)
  • Q4_0 — 4-bit, older method, good size but Q4_K variants are better quality at similar size
  • Q4_K_S — 4-bit K-quantisation, small variant — smallest 4-bit option
  • Q4_K_M — 4-bit K-quantisation, medium variant — the standard recommendation for most users
  • Q5_K_S / Q5_K_M — 5-bit, noticeably better quality than Q4 with modest size increase
  • Q6_K — 6-bit, near full quality for most tasks, good for 24GB+ VRAM
  • Q8_0 — 8-bit, essentially lossless quality, only practically different from FP16 on extreme precision tasks
  • F16 — full half-precision, full quality, 2× the size of Q8_0

The K-Quantisation Difference

K-quantisation (the _K_ in Q4_K_M) is a smarter compression method than simple uniform integer quantisation. Rather than quantising every weight with the same precision, K-quantisation allocates bits unevenly, concentrating more bits on the parts of the model that are most sensitive to precision loss and fewer on those that are more robust. The result is similar or better quality than the older Q4_0 quantisation at the same or smaller file size. This is why Q4_K_M has largely replaced Q4_0 as the standard recommendation — better quality at essentially the same hardware cost.
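The core mechanic — a shared scale factor per block of weights — can be sketched in a few lines. This is an illustrative simplification in the spirit of GGUF's basic Q4_0 scheme, not llama.cpp's actual implementation; the real K-quants add super-block scales and per-layer bit allocation on top of this idea:

```python
# Illustrative per-block 4-bit quantisation: one shared float scale per
# block, 4-bit signed integers (-8..7) for the weights. A simplified
# sketch of the idea, not llama.cpp's code.
def quantize_block_q4(weights):
    scale = max(abs(w) for w in weights) / 7.0
    if scale == 0.0:
        scale = 1.0  # all-zero block: any scale works
    quants = [max(-8, min(7, round(w / scale))) for w in weights]
    return quants, scale

def dequantize_block_q4(quants, scale):
    return [q * scale for q in quants]

block = [0.12, -0.85, 0.03, 0.44, -0.20, 0.91, -0.07, 0.30]
quants, scale = quantize_block_q4(block)
restored = dequantize_block_q4(quants, scale)
# Each restored weight is within scale/2 of the original: that rounding
# error is the quality loss the Q-number measures.
```

Storing eight 4-bit integers plus one scale is far smaller than eight 32-bit floats, and the rounding error stays bounded by half the scale — which is exactly why blocks whose weights matter most deserve more bits.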

Approximate Sizes for a 7B Model

  Quantisation   Size     VRAM needed   Quality vs FP16
  Q2_K           ~2.7GB   4GB+          Noticeably degraded
  Q4_K_S         ~4.1GB   6GB+          Good, minor loss
  Q4_K_M         ~4.5GB   6GB+          Good, minimal loss
  Q5_K_M         ~5.1GB   8GB+          Very good, negligible loss
  Q6_K           ~5.9GB   8GB+          Near lossless
  Q8_0           ~7.0GB   10GB+         Essentially lossless
  F16            ~14GB    18GB+         Full precision

Which Quantisation to Use

Q4_K_M is the right default for most users. It fits a 7B model on 6GB of VRAM, runs fast, and produces quality that is hard to distinguish from higher quantisations on everyday tasks. The quality gap versus Q8_0 on tasks like summarisation, chat, and code generation is small enough that most users cannot reliably tell the difference without side-by-side comparison on specific edge cases.

Q5_K_M is worth the extra VRAM if you have 8GB. The quality step from Q4_K_M to Q5_K_M is more noticeable than the step from Q5_K_M to Q8_0. If you have 8GB of VRAM and notice occasional factual errors or incoherence with Q4_K_M models, Q5_K_M is a cost-effective upgrade.

Q8_0 for tasks where precision matters. If you are using a model for mathematical reasoning, code generation where subtle bugs matter, or any task where you find the model is making errors that seem quantisation-related, Q8_0 is the practical ceiling before diminishing returns set in. It requires roughly 2× the VRAM of Q4_K_M for the same model.

Q2_K and Q3_K only for extreme hardware constraints. The quality degradation at these levels is significant and generally not worth accepting unless you have no alternative — a smaller model at higher quantisation (e.g., a 3B model at Q8_0) will typically produce better results than a 7B model at Q2_K.

Pulling Specific Quantisations in Ollama

# Default pull (Ollama chooses the recommended quantisation)
ollama pull llama3.2

# Specific quantisation tag
ollama pull llama3.2:8b-instruct-q4_K_M
ollama pull llama3.2:8b-instruct-q5_K_M
ollama pull llama3.2:8b-instruct-q8_0

# List available tags for a model
# Browse at: https://ollama.com/library/llama3.2/tags

# See what quantisation you have
ollama show llama3.2 | grep quantization

What Ollama Pulls by Default

When you run ollama pull llama3.2 without a tag, Ollama pulls the model’s default tag — typically Q4_K_M for most models, as it offers the best balance of size, speed, and quality across the widest range of hardware. Check the model’s page on ollama.com/library before pulling to confirm which quantisation the default tag uses.

Does Quantisation Affect Speed?

Yes, but not always in the way you expect. Lower quantisation (fewer bits per weight) means less data to load and less memory bandwidth required during inference, which can actually make Q4 models faster than Q8 models on memory-bandwidth-constrained hardware. On NVIDIA consumer GPUs, which have limited memory bandwidth relative to compute, Q4_K_M models often run 20–40% faster than Q8_0 models of the same parameter count. On Apple Silicon with its high unified memory bandwidth, the speed difference is smaller — Q8_0 models run closer to Q4 speed because the bandwidth bottleneck is less severe. The practical implication: if your goal is maximum tokens per second, Q4_K_M is usually faster than Q8_0 on the same hardware, not slower.
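A back-of-envelope model makes the bandwidth argument concrete: generating each token requires streaming essentially all the weights through memory once, so peak decode speed is roughly memory bandwidth divided by model size. The 448 GB/s figure below is an illustrative round number for a mid-range consumer GPU, not a spec for any particular card:

```python
# Bandwidth-bound ceiling on decode speed: every generated token reads
# (roughly) the whole weight file from memory, so
#     tokens/sec  <=  memory bandwidth / model size.
# 448 GB/s is an illustrative mid-range consumer GPU figure, not a spec.
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

for tag, size_gb in [("Q4_K_M", 4.5), ("Q8_0", 7.0)]:
    print(f"{tag}: at most ~{decode_ceiling_tok_s(448, size_gb):.0f} tok/s")
# The 4.5GB file's ceiling is ~56% higher than the 7.0GB file's — the same
# ratio as the size difference, which is why Q4 often decodes faster.
```

Real throughput falls well below these ceilings (compute, KV-cache traffic, and batching all intervene), but the ratio between the two quantisations tends to hold on bandwidth-limited hardware.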

Real-World Quality Comparison

The quantisation quality debate is often more theoretical than practical. For the tasks that most local LLM users actually do — writing assistance, summarisation, Q&A, coding help, document analysis — the difference between Q4_K_M and Q8_0 of the same model is difficult to detect without systematic testing. Both produce fluent, coherent responses. Both follow instructions reliably. Both make similar types of errors on difficult tasks. The cases where quantisation level makes a meaningful difference are narrower than often claimed: very precise mathematical computation, tasks that require maintaining numerical accuracy across long chains of reasoning, and edge cases in code generation where subtle precision differences affect logical correctness.

A practical test: take ten representative prompts from your actual workflow and run them through the same model at Q4_K_M and Q8_0. Rate the responses without knowing which is which. Most users find the outputs are indistinguishable on their real tasks, which is the most useful information for making a hardware decision — it means Q4_K_M gives you the same practical value at roughly half the VRAM cost of Q8_0.
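A minimal harness for that blind test might look like the sketch below. The /api/generate endpoint and its JSON shape are Ollama's local HTTP API; the model tags in the comments are examples, and the blinding helper simply hides which quantisation produced which response until after you have rated them:

```python
# Blind A/B comparison sketch for two quantisations of the same model.
# query_ollama() posts to the local Ollama HTTP API; blind_pair() shuffles
# the two responses under neutral labels so you rate them without knowing
# which quantisation produced which.
import json
import random
import urllib.request

def query_ollama(model: str, prompt: str,
                 host: str = "http://localhost:11434") -> str:
    payload = json.dumps({"model": model, "prompt": prompt,
                          "stream": False}).encode()
    req = urllib.request.Request(f"{host}/api/generate", payload,
                                 {"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

def blind_pair(resp_q4: str, resp_q8: str, rng=random):
    # Shuffle the (tag, response) pairs, then expose only labels A and B.
    pairs = [("q4_K_M", resp_q4), ("q8_0", resp_q8)]
    rng.shuffle(pairs)
    labels = {"A": pairs[0][1], "B": pairs[1][1]}
    key = {"A": pairs[0][0], "B": pairs[1][0]}  # reveal only after rating
    return labels, key
```

With the server running, you would call query_ollama("llama3.2:8b-instruct-q4_K_M", prompt) and its q8_0 counterpart for each of your ten prompts, rate the blinded A/B outputs, and only then consult the key.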

Quantisation for Embedding Models

Embedding models are less sensitive to quantisation than generative models. Converting text to a fixed-dimensional vector is more robust to precision reduction than generating coherent token sequences. For practical RAG applications, nomic-embed-text at Q4_K_M retrieves relevant documents just as accurately as its F16 counterpart — the cosine similarity rankings are essentially identical for typical document retrieval tasks. Even quality-sensitive users who want Q8_0 for their chat model can confidently use Q4_K_M for their embedding model without any practical loss in retrieval quality.
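One way to verify this on your own corpus: embed the same query and documents with both the Q4_K_M and F16 versions of the embedding model (Ollama exposes an embeddings endpoint for this), then compare the similarity rankings rather than the raw vectors. A plain cosine-similarity helper is all the comparison needs — the claim above is that the rankings come out the same:

```python
# Compare retrieval rankings across quantisations: if cosine similarity
# orders the documents identically for the Q4 and F16 embeddings, the
# quantisation has no practical effect on retrieval, even though the raw
# vector coordinates differ slightly.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def ranking(query_vec, doc_vecs):
    # Document indices sorted from most to least similar to the query.
    sims = [cosine(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: -sims[i])
```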

The Bigger Model at Lower Quantisation vs Smaller Model at Higher Quantisation

A common hardware dilemma: you have 8GB of VRAM. You can run a 7B model at Q5_K_M (5.1GB) or a 13B model at Q2_K (approximately 5.5GB). Which should you choose? In almost every case, the 7B at Q5_K_M produces better results than the 13B at Q2_K. This seems counterintuitive — more parameters should mean more capability — but Q2_K quantisation introduces enough precision loss to significantly degrade the 13B model’s advantages. The relationship is not linear: going from Q4_K_M to Q2_K loses more quality than going from a 7B to a 13B model gains. The general rule holds across model families: prefer a smaller model at higher quantisation over a larger model at very low quantisation (Q2 or Q3) when your VRAM forces a choice between them. The crossover point where a larger model at Q3_K_M beats a smaller model at Q5_K_M is approximately at the 2× parameter count boundary — a 13B at Q3_K_M is roughly comparable to a 7B at Q5_K_M, but not clearly better for most tasks.

Checking Quantisation of Installed Models

# Show quantisation info for a specific model
ollama show llama3.2

# List all models with their sizes (approximate proxy for quantisation)
ollama list

# Show the Modelfile behind a model (base model, parameters, template)
ollama show llama3.2 --modelfile

The ollama show command displays the quantisation type in the model information output. This is useful for verifying what you actually pulled when the model was downloaded without a specific tag, or when a Modelfile was created from a base model whose quantisation you want to confirm.

Future Direction: GGUF Improvements

The GGUF quantisation ecosystem continues to improve. Newer quantisation methods (IQ variants — IQ2_XXS, IQ3_S, etc.) achieve better quality at the same bit count compared to the K-quantisation methods described here, by using importance-weighted quantisation that analyses which weights most affect model output and preserves those with higher precision. As these methods mature and gain broader adoption in the Ollama library, the quality gap between 4-bit and 8-bit quantisations will narrow further. For users today, Q4_K_M remains the practical standard; for users reading this in 2027 or beyond, the equivalent recommendation may be IQ4_XS or a successor method that achieves the same hardware efficiency with even less quality loss.

Summary Decision Guide

  • 6GB VRAM: use Q4_K_M — it is your only practical 7B option and it works well.
  • 8GB VRAM: use Q4_K_M for speed and headroom, or Q5_K_M if you notice quality issues with a specific task.
  • 12GB VRAM: use Q5_K_M or Q6_K for 7B models, or Q4_K_M for 13B models.
  • 16GB+ VRAM: Q8_0 is accessible for 7B models if you want the closest-to-full-quality experience, but Q4_K_M of a 13B model is a better use of the same VRAM for most tasks.
  • Apple Silicon with 16GB unified memory: Q4_K_M for 7B is comfortable; with 32GB you can run Q8_0 for 7B or Q4_K_M for 13B.

The overarching principle: spend your memory budget on more parameters first, then on higher quantisation — parameter count matters more than quantisation precision in the ranges practical on consumer hardware.
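The guide reads naturally as a lookup, which a short sketch can encode. The thresholds below are the rules of thumb from this article, not hard limits:

```python
# VRAM-based starting-point recommendation, mirroring the guide above.
# Thresholds are rules of thumb from this article, not hard limits.
def recommend(vram_gb: float) -> str:
    if vram_gb >= 16:
        return "13B at Q4_K_M (or 7B at Q8_0 for closest-to-full quality)"
    if vram_gb >= 12:
        return "7B at Q5_K_M or Q6_K, or 13B at Q4_K_M"
    if vram_gb >= 8:
        return "7B at Q4_K_M, stepping up to Q5_K_M for quality-sensitive tasks"
    return "7B at Q4_K_M"

print(recommend(8))
```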

Getting Started

If you are new to Ollama and uncertain about quantisation, start with the default pull (no tag). Ollama’s default is Q4_K_M for most models, which is the right choice for the majority of hardware configurations. Run the model for a week on your real tasks. If you notice quality issues that seem model-capability-related rather than prompt-related, try Q5_K_M or Q8_0 and compare. If the model is too slow for interactive use, try Q4_K_S, which is slightly smaller and faster. Quantisation level is one of the easiest model parameters to experiment with — changing it requires only pulling a different tag, and switching between levels gives you direct, concrete data about the quality-speed-memory trade-off on your own hardware and tasks, rather than relying on general guidance that may not reflect your situation.
