GGUF vs GPTQ vs AWQ: Which LLM Format Should You Use?

You found a 70B model you want to run locally. The Hugging Face page lists fifteen different downloads: GGUF Q4_K_M, GGUF Q5_K_S, GPTQ-Int4, AWQ-4bit, and a dozen more. Which one do you download? Download the wrong format and your inference engine refuses to load it. Choose the wrong quantization level and you either waste VRAM or sacrifice quality. Pick a method mismatched to your hardware and you get half the speed you should.

This confusion is understandable. GGUF, GPTQ, and AWQ look like interchangeable options on a download page, but they solve fundamentally different problems. GGUF is a file format optimized for CPU and hybrid inference. GPTQ is a quantization algorithm that compresses weights using error compensation. AWQ is a different algorithm that protects the weights that matter most. The distinction isn’t academic—wrong choices directly determine whether your model runs fast, runs slow, or doesn’t run at all. This guide cuts through the noise and gives you a clear, practical framework for choosing the right format based on your actual setup.

What Quantization Actually Does

Before comparing formats, it helps to understand the underlying process, because it explains why these formats differ in the first place.

A 7B parameter model stored in full 16-bit precision occupies roughly 14GB. Most consumer GPUs have 8–24GB of VRAM. Quantization solves this by reducing each weight from a 16-bit floating-point number to a smaller representation—typically 4 or 8 bits. A 7B model at 4-bit quantization shrinks to roughly 4GB. Same model, same capability, one-quarter the memory.
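
The arithmetic above is simple enough to sketch. This is a rough estimate that ignores file metadata and runtime activation memory:

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough size of stored weights in GB: params x bits / 8.
    Ignores file metadata and runtime activation memory."""
    return n_params * bits_per_weight / 8 / 1e9

print(weight_size_gb(7e9, 16))  # 14.0 -> a 7B model in FP16
print(weight_size_gb(7e9, 4))   # 3.5  -> roughly 4GB once overhead is added
```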

The hard part isn’t shrinking the numbers—it’s doing so without destroying quality. Naively rounding every weight to fewer bits produces garbage output. The model’s intelligence lives in subtle relationships between weights. Destroying those relationships through blunt rounding destroys coherence. Each quantization method solves this differently, and those differences determine speed, quality, and compatibility.
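
A minimal sketch of the naive approach, plain round-to-nearest with one shared scale, shows where the damage comes from (illustrative only; real quantizers work per block or per channel):

```python
def quantize_rtn(weights, bits=4):
    """Naive symmetric round-to-nearest: a single scale for the whole group.
    Every weight is snapped to the nearest point on a coarse integer grid."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]

ws = [0.05, -0.04, 0.9]
print(quantize_rtn(ws))  # the small weights collapse to 0.0; only 0.9 survives
```

One large outlier weight stretches the scale so far that smaller weights round to zero, which is exactly the "subtle relationships destroyed by blunt rounding" problem each method below tries to avoid.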

Why You Can’t Just “Convert” Between Them

GPTQ and AWQ are quantization algorithms—they define how weights get compressed. GGUF is a file format—it defines how compressed weights get stored and loaded. They’re not the same category of thing. You can store GGUF-quantized weights in a GGUF file, but you can’t load a GPTQ-quantized model into llama.cpp expecting GGUF behavior. The tools that load each format are different, the inference engines that run them are different, and the hardware they’re optimized for is different.

GPTQ: The GPU Workhorse

GPTQ (Generative Pre-trained Transformer Quantization) is a post-training quantization method that compresses weights one layer at a time, using calibration data to minimize the error introduced by each compression step.

How GPTQ Works

GPTQ quantizes weights sequentially, adjusting remaining weights in the same layer to compensate for the error introduced by each quantized weight. Think of it like a team of chefs: if one chef adds too much flour by mistake, another compensates by reducing sugar. The cake still works.

The technical mechanism: GPTQ uses approximate second-order information (the Hessian matrix) to determine how much each weight matters. Weights that strongly influence output get more careful treatment. After quantizing one weight, GPTQ updates the remaining weights in that layer to absorb the error.
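
A heavily simplified toy of the compensation idea: quantize weights one at a time and fold each rounding error into the next weight before it is quantized. Real GPTQ distributes the error across the whole layer, weighted by inverse-Hessian information; this sketch only shows the carry-the-error principle.

```python
def quantize_with_compensation(weights, bits=4):
    """Toy sequential quantization with error feedback.
    NOT the real GPTQ update rule: here each rounding error is simply
    handed to the next weight instead of being spread Hessian-weighted."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    out, carry = [], 0.0
    for w in weights:
        target = w + carry                                # absorb previous error
        q = max(-qmax - 1, min(qmax, round(target / scale)))
        out.append(q * scale)
        carry = target - q * scale                        # error for the next weight
    return out

ws = [0.3, -0.7, 0.45, 0.9]
print(quantize_with_compensation(ws))
```

Because each error is absorbed downstream, the sum of the quantized weights tracks the sum of the originals far more closely than independent rounding would, which is the intuition behind the chef analogy above.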

Calibration dependency: GPTQ’s quality depends on the calibration dataset used during quantization. A well-chosen calibration set produces excellent compression; a poorly chosen or undersized one introduces subtle biases that show up as reduced accuracy on real-world benchmarks.

GPTQ Performance in Practice

GPTQ shines on dedicated GPU inference, where it often provides substantial speedups. With the optimized Marlin kernel it is the clear speed winner in the benchmarks cited here: the only method that beats the FP16 baseline (712 tok/s vs 461 tok/s), and the one with the lowest latency.

Where GPTQ stumbles: coding tasks. In the same benchmarks, GPTQ (with or without Marlin) drops to around 46% on HumanEval, roughly 10 points below baseline. If your workflow is code generation, GPTQ’s speed advantage may not justify the quality trade-off.

When to choose GPTQ:

  • You have a dedicated NVIDIA GPU (RTX 30/40 series, A-series)
  • Throughput and latency are your top priorities
  • Your workload is text-heavy (summarization, Q&A, conversation)
  • You’re serving multiple requests simultaneously

AWQ: Smarter Compression

AWQ (Activation-aware Weight Quantization) takes a fundamentally different approach. Instead of treating all weights equally and compensating for errors after the fact, AWQ identifies which weights matter before compressing and protects them.

How AWQ Works

AWQ builds on the observation that “only 1% of weights really matter.” These “salient” weights carry most of the important information. AWQ identifies them by observing activation patterns, the data flowing through the model during inference: channels with high activations indicate critical pathways. AWQ scales these weights up before quantization so they occupy more of the available precision range, accepting slightly coarser treatment of the weights that matter less.
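
A toy illustration of the scaling trick (my own simplification, not the AWQ reference implementation): a small weight on a high-activation channel would normally round to zero, but scaling it up before quantization, then dividing the scale back out, preserves it.

```python
def quantize_rtn(ws, bits=4):
    # plain round-to-nearest over a single shared scale
    qmax = 2 ** (bits - 1) - 1
    step = max(abs(w) for w in ws) / qmax
    return [round(w / step) * step for w in ws]

def awq_style(ws, act_mags, alpha=0.5):
    # scale each channel by activation_magnitude**alpha before quantizing,
    # then divide it back out; in real AWQ the inverse scale is folded into
    # the preceding layer, so inference cost is unchanged
    s = [m ** alpha for m in act_mags]
    deq = quantize_rtn([w * si for w, si in zip(ws, s)])
    return [d / si for d, si in zip(deq, s)]

ws   = [0.05, -0.04, 0.9]   # channel 0 is small but salient...
acts = [8.0, 0.5, 0.5]      # ...because its activations are large
plain, scaled = quantize_rtn(ws), awq_style(ws, acts)
# activation-weighted reconstruction error: what the layer output actually sees
err = lambda q: sum(abs(w - qi) * a for w, qi, a in zip(ws, q, acts))
print(err(plain), err(scaled))  # the activation-aware version has lower error
```

The exponent alpha is a tunable knob here; the real method searches for the best per-layer scaling rather than using a fixed value.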

The result: AWQ achieves 4-bit compression while preserving the model’s generalization ability better than methods that treat all weights identically. Because AWQ does not rely on backpropagation or reconstruction, its quantization process is faster and requires less calibration data.

AWQ Performance in Practice

AWQ generally offers the best accuracy among 4-bit methods on GPU hardware. Its activation-aware approach, which protects a small fraction of salient weights, proves more robust and data-efficient than alternatives, making it a strong choice for quality-sensitive GPU serving in the cloud or at the edge.

However, raw throughput benchmarks tell a more nuanced story. In one vLLM benchmark, AWQ ran at a surprisingly slow 67 tok/s, likely due to the specific kernel implementation. AWQ’s speed depends heavily on which inference engine and kernels you use: with optimized AWQ kernels, performance is competitive with GPTQ; with unoptimized ones, it can be significantly slower.

When to choose AWQ:

  • Quality preservation matters more than raw speed
  • You’re doing creative writing, instruction following, or nuanced tasks
  • You have a modern NVIDIA GPU with good driver support
  • You want the best balance of quality and efficiency at 4-bit

GGUF: The Format That Runs Everywhere

GGUF isn’t a quantization algorithm but a standardized, portable file format that has democratized LLM deployment on consumer-grade hardware. Paired with the llama.cpp inference engine, it excels in CPU-centric and hybrid CPU-GPU environments, offering unmatched ease of use and cross-platform compatibility.

How GGUF Works

GGUF files store quantized model weights in a structured format that llama.cpp can load efficiently. The quantization inside GGUF files uses various algorithms (including k-quant methods), and GGUF offers granular control over precision through its naming convention.

GGUF quantization levels:

  • Q8_0: 8-bit, near-lossless quality, largest file size
  • Q5_K_M: 5-bit with medium k-quant, excellent quality balance
  • Q4_K_M: 4-bit with medium k-quant—the most popular choice
  • Q4_K_S: 4-bit with small k-quant, slightly lower quality, smaller file
  • Q3_K_M: 3-bit, noticeable quality loss, very small file
  • Q2_K: 2-bit, significant degradation, only for extreme memory constraints
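
The naming convention is regular enough to parse mechanically. This hypothetical helper (my own illustration, not part of any llama.cpp tooling) splits a quant tag into its parts:

```python
def parse_quant_tag(tag: str):
    """Split a GGUF quant tag like 'Q4_K_M' into (bits, is_kquant, size_variant).
    'Q8_0' -> (8, False, None); 'Q4_K_M' -> (4, True, 'M')."""
    parts = tag.split("_")                    # 'Q4_K_M' -> ['Q4', 'K', 'M']
    bits = int(parts[0][1:])                  # 'Q4' -> 4
    is_kquant = len(parts) > 1 and parts[1] == "K"
    size = parts[2] if len(parts) > 2 else None
    return bits, is_kquant, size

print(parse_quant_tag("Q4_K_M"))  # (4, True, 'M')
print(parse_quant_tag("Q8_0"))    # (8, False, None)
```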

The _K suffix indicates a “k-quant” variant, and the trailing _S, _M, or _L selects a size/quality tier. These tiers are an early example of mixed-precision quantization: different layers receive different bit-widths based on their importance, producing better quality than uniform quantization at the same average bit-width.

GGUF’s Unique Advantage: Hybrid CPU/GPU

GGUF lets you run an LLM on the CPU while offloading some layers to the GPU for a meaningful speed boost, which makes it especially useful on CPU-only machines and Apple devices.

This layer-offloading is GGUF’s killer feature. On a machine with 8GB VRAM and 32GB system RAM, you can offload the most compute-intensive layers to the GPU while keeping the rest in system memory. No other format handles this gracefully.
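
With llama.cpp’s CLI, the split is a single flag; the model path and layer count below are placeholders for your own setup (`-ngl` is short for `--n-gpu-layers`):

```shell
# keep the model mostly in system RAM, offload 20 layers to the GPU
./llama-cli -m ./models/model-Q4_K_M.gguf -ngl 20 -p "Hello"

# Ollama picks an offload split automatically based on available VRAM
ollama run llama3
```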

GGUF Performance in Practice

GGUF’s performance story is counterintuitive. It scores best among the quantized models at 54.27% on HumanEval, only about 2 points below baseline. This is surprising, since GGUF had the worst perplexity of the three. Perplexity and task performance don’t always correlate: GGUF’s k-quant method seems to preserve reasoning ability better than its perplexity score would suggest.

On pure GPU inference through vLLM, GGUF is slower than GPTQ or AWQ. But that’s not its playing field. On llama.cpp—the engine it was designed for—GGUF runs fast and handles hardware configurations that other formats simply can’t.

When to choose GGUF:

  • You’re on a CPU or Apple Silicon (M1/M2/M3/M4)
  • You have limited VRAM and want CPU/GPU hybrid inference
  • You’re using Ollama, llama.cpp, or similar tools
  • You want maximum flexibility in precision levels
  • Cross-platform compatibility matters

Format Comparison at a Glance

Practical characteristics across the dimensions that actually matter

GPTQ
  • Type: Quantization algorithm
  • Primary engine: vLLM, TGI, Transformers
  • Hardware sweet spot: NVIDIA GPU (dedicated)
  • CPU support: ❌ None
  • Speed (GPU): ⭐⭐⭐⭐⭐ (with Marlin kernel)
  • Quality retention: ⭐⭐⭐☆☆ (coding-sensitive)
  • Ease of use: ⭐⭐⭐☆☆

AWQ
  • Type: Quantization algorithm
  • Primary engine: vLLM, TGI, Transformers
  • Hardware sweet spot: NVIDIA GPU (modern)
  • CPU support: ❌ None
  • Speed (GPU): ⭐⭐⭐☆☆ (kernel-dependent)
  • Quality retention: ⭐⭐⭐⭐⭐ (best at 4-bit)
  • Ease of use: ⭐⭐⭐☆☆

GGUF
  • Type: File format
  • Primary engine: llama.cpp, Ollama
  • Hardware sweet spot: CPU, Apple Silicon, hybrid
  • CPU support: ✅ Full (native)
  • Speed (GPU): ⭐⭐☆☆☆ (not its role)
  • Quality retention: ⭐⭐⭐⭐☆ (strong on reasoning)
  • Ease of use: ⭐⭐⭐⭐⭐

Choosing by Hardware

Your hardware dictates your format more than your preferences do. Here’s the decision logic based on what you actually own.

Apple Silicon (M1 / M2 / M3 / M4)

Use GGUF. No discussion needed. Apple’s Metal framework works with llama.cpp, and GGUF’s CPU/GPU hybrid approach maps perfectly to Apple Silicon’s unified memory architecture. GPTQ and AWQ require CUDA—they simply won’t run on Mac.

Best GGUF level for Apple Silicon: Q4_K_M for 7B models. Q5_K_M if you have an M2 Pro or better with 24GB+ unified memory.

NVIDIA GPU (RTX 30/40 Series, Consumer)

You have options, and the choice depends on your priority.

  • Maximum speed: GPTQ with Marlin kernel. Fastest throughput by a significant margin.
  • Best quality at 4-bit: AWQ. Preserves coherence better, particularly for nuanced generation.
  • Flexibility/ease of use: GGUF via Ollama. Simplest setup, works out of the box, good enough speed for interactive use.

CPU Only (No Discrete GPU)

Use GGUF. GPTQ and AWQ are GPU-only formats. GGUF is the only option here, and a surprisingly capable one. A 7B model at Q4_K_M runs at 5–15 tokens/second on a modern CPU—slow but usable for non-interactive tasks.

Mixed Environments (Some VRAM, Plenty of RAM)

GGUF again. Its layer-offloading lets you put as many layers on the GPU as fit in VRAM and handle the rest in system RAM. This is the most practical approach for machines with 6–12GB VRAM trying to run 13B+ models.

Quantization Levels: The Numbers Behind the Names

Choosing between GGUF/GPTQ/AWQ is only half the decision. Within each format, you choose a bit-width. Here’s what the numbers actually mean for a 7B model:

8-bit (Q8): ~8GB file. Near-perfect quality. Use this if you have the VRAM and quality matters most.

5-bit (Q5_K_M): ~6GB file. An excellent middle ground when you can afford slightly more memory than 4-bit. For production, use Q5_K_M for critical applications and Q4_K_M for cost-sensitive deployments.

4-bit (Q4_K_M): ~4GB file. The default recommendation for most use cases and the best starting point for beginners, offering the best balance of quality and efficiency. Quality retention stays strong: in one measurement, Q4_K_M ran at 15 tokens/second with roughly 95% quality retention compared to full precision on the same hardware.

3-bit and below: Meaningful quality degradation. Only use these if memory is the binding constraint and you’re okay accepting noticeably rougher outputs.
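
Turning this decision into arithmetic: a sketch that picks the highest-quality quant level fitting a memory budget. The effective bits-per-weight values are rough assumptions on my part (k-quants store per-block scales, so effective bpw runs above the nominal bit count); check actual file sizes before relying on them.

```python
# assumed effective bits per weight, including per-block scale overhead
QUANT_BPW = {"Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.85, "Q3_K_M": 3.9, "Q2_K": 2.6}

def pick_quant(n_params: float, budget_gb: float):
    """Return the largest (highest-quality) quant level whose weights fit."""
    for tag, bpw in sorted(QUANT_BPW.items(), key=lambda kv: -kv[1]):
        if n_params * bpw / 8 / 1e9 <= budget_gb:
            return tag
    return None  # nothing fits; you need a smaller model

print(pick_quant(7e9, 8.0))   # a 7B model with an 8GB budget -> Q8_0 fits
print(pick_quant(7e9, 4.5))   # with 4.5GB -> Q4_K_M
```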

Decision Flowchart

🍎 Running on Apple Silicon?
GGUF Q4_K_M via Ollama or llama.cpp. Only viable option, and a great one.
🖥️ Have an NVIDIA GPU, need max speed?
GPTQ-Int4 with Marlin kernel via vLLM. Fastest throughput available.
✍️ Have an NVIDIA GPU, quality is king?
AWQ-4bit. Best quality preservation at 4-bit, especially for creative or nuanced tasks.
💻 CPU only, no GPU at all?
GGUF Q4_K_M. The only format that works here. Slow but functional.
🔀 Limited VRAM, want to run a bigger model?
GGUF Q4_K_M with partial GPU offload. Splits layers between VRAM and system RAM.
🚀 Serving multiple users, production inference?
GPTQ or AWQ via vLLM. Both scale well; pick GPTQ for speed, AWQ for quality.

Common Mistakes and How to Avoid Them

Downloading the Wrong Format for Your Engine

The most common mistake: downloading a GPTQ model and trying to load it in Ollama (which expects GGUF). Or downloading GGUF and trying to use it in vLLM’s GPTQ mode. Match your format to your inference engine before downloading anything.

Engine → Format mapping:

  • Ollama → GGUF
  • llama.cpp → GGUF
  • vLLM → GPTQ or AWQ
  • Hugging Face Transformers → GPTQ or AWQ (with appropriate libraries)

Assuming Higher Bits Always Means Better

Going from Q4 to Q8 doubles your memory usage but may only improve output quality by 2–5% on most tasks. In the benchmarks cited earlier, all methods kept perplexity within 7% of baseline, which makes 4-bit quantization practical for real-world use. For most applications, Q4_K_M is sufficient. Reserve Q8 for tasks where subtle quality differences matter: fine-tuning preparation, benchmark evaluation, or highly sensitive outputs.

Ignoring Kernel Support

Kernels matter more than algorithms: Marlin uses the same GPTQ weights but runs 2.5x faster thanks to optimized CUDA kernels. The quantization algorithm is only half the story. Before settling on GPTQ, check whether your inference engine supports optimized kernels for it. Unoptimized GPTQ inference can actually be slower than full precision—a painful surprise if you haven’t checked.

Testing Only on One Benchmark

Perplexity doesn’t tell the whole story: GGUF had the worst perplexity but second-best HumanEval score. Always test on your actual use case. Perplexity, MMLU, HumanEval—each measures something different. A format that wins on perplexity might lose on code generation. Test the specific task you care about, not just the headline number.

Conclusion

GGUF, GPTQ, and AWQ aren’t competing versions of the same thing—they’re tools designed for different jobs. GGUF is the format that made local LLMs accessible on consumer hardware, excelling on CPUs, Apple Silicon, and hybrid setups. GPTQ delivers the fastest GPU inference when paired with optimized kernels, making it the choice for throughput-sensitive production serving. AWQ preserves model quality most carefully at 4-bit compression, making it the pick when coherence and accuracy matter more than raw speed. Your hardware and workload determine the answer, not marketing claims.

Start with GGUF Q4_K_M if you’re unsure—it’s the most forgiving choice across hardware configurations, and Ollama makes it trivially easy to run. If you find yourself on dedicated NVIDIA hardware and need either maximum speed or maximum quality, switch to GPTQ or AWQ respectively. The formats are cheap to experiment with (downloads, not compute), and running the same model in two formats back-to-back is the fastest way to see which one delivers for your specific use case.
