AWQ vs GPTQ vs bitsandbytes: LLM Quantization Methods Compared

Quantization reduces the memory footprint of large language models by representing weights in lower precision than the bf16 used during training. A 7B model in bf16 requires roughly 14GB of GPU memory; quantized to 4 bits, the same model fits in 3.5GB. This makes the difference between requiring an A100 and running on a consumer GPU. AWQ, GPTQ, and bitsandbytes are the three most widely used quantization libraries, each with a different approach, quality profile, and integration story.

What Weight-Only Quantization Does

All three methods implement weight-only quantization: weights are stored in reduced precision (4-bit or 8-bit integers), but matrix multiplications happen in higher precision after dequantization. LLM decode is memory-bandwidth bound: the bottleneck is loading weights from HBM, not compute. Loading 4-bit weights requires reading 4x less data than bf16, which translates to up to 4x higher token throughput at the same memory bandwidth (in practice somewhat less, since KV cache and activation traffic are unaffected). The dequantization overhead is negligible for bandwidth-bound decode. Quantization methods differ primarily in how they identify and handle the weight channels most sensitive to precision loss.
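As a sanity check on that claim, here is the back-of-the-envelope arithmetic: a sketch that assumes roughly 2TB/s of HBM bandwidth (A100-class) and ignores KV cache and activation traffic.

    # Upper bound on batch-1 decode throughput: every weight is streamed
    # from HBM once per generated token.
    params = 7e9                # 7B parameters
    hbm_bandwidth = 2.0e12      # bytes/s, roughly A100-class HBM

    for name, bytes_per_weight in [("bf16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
        bytes_per_token = params * bytes_per_weight
        print(f"{name}: {bytes_per_token / 1e9:.1f} GB/token, "
              f"~{hbm_bandwidth / bytes_per_token:.0f} tok/s ceiling")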

bitsandbytes

bitsandbytes integrates directly with HuggingFace Transformers and requires no calibration data — loading a model in 8-bit or 4-bit is a single argument change. LLM.int8() keeps outlier activation channels in fp16 while quantizing the remainder to int8, achieving under 1% perplexity degradation on most benchmarks. The 4-bit NF4 (Normal Float 4) format places quantization levels at equal quantiles of a normal distribution, concentrating precision where weight values are dense. Combined with double quantization (quantizing the quantization scales themselves), NF4 achieves better perplexity than standard int4 at the same bit-width and is the format used in QLoRA.
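For reference, the single-argument change looks like this in Transformers. A minimal sketch, with the model id as a placeholder:

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # Normal Float 4
        bnb_4bit_use_double_quant=True,         # quantize the scales themselves
        bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16 after dequant
    )

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Meta-Llama-3-8B",           # placeholder model id
        quantization_config=bnb_config,
        device_map="auto",
    )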

The main limitation of bitsandbytes is inference speed. Its CUDA kernels are not as throughput-optimized as AWQ or GPTQ — on A100 GPUs, bitsandbytes 4-bit inference is typically 20–40% slower than AWQ. bitsandbytes is the right choice for QLoRA fine-tuning and rapid experimentation, but for production serving where throughput matters, AWQ or GPTQ are generally preferable.

GPTQ

GPTQ uses a small calibration dataset to minimize quantization error layer by layer using second-order information (Hessians built from calibration activations), identifying which weights are most sensitive and quantizing them more carefully. Calibration takes 30 minutes to several hours, but pre-quantized GPTQ models for popular bases are widely available on HuggingFace Hub. Inference uses AutoGPTQ or the ExLlamaV2 backend, which has highly optimized 4-bit matrix multiplication kernels. GPTQ quality at 4 bits is typically slightly below AWQ, which explicitly protects the channels with the largest input activations. At 3 bits, where AWQ support is scarce, GPTQ with groupsize=128 remains serviceable, making it the method of choice for 3-bit deployment on extremely memory-constrained hardware like 8GB consumer GPUs.
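Transformers can drive the calibration through its GPTQ integration (it delegates to optimum and the AutoGPTQ kernels under the hood). A sketch, with placeholder model ids:

    from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

    model_id = "meta-llama/Meta-Llama-3-8B"     # placeholder base model
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    gptq_config = GPTQConfig(
        bits=4,
        group_size=128,
        dataset="c4",        # built-in calibration corpus
        tokenizer=tokenizer,
    )

    # Calibration runs during loading; budget minutes to hours by model size.
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=gptq_config, device_map="auto"
    )
    model.save_pretrained("llama-3-8b-gptq-4bit")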

AWQ

AWQ (Activation-aware Weight Quantization) improves on GPTQ by accounting for input activation magnitudes when quantizing. Quantization error in a weight channel is amplified by the magnitude of the corresponding input activation — channels with large activations contribute proportionally more to output error when quantized. AWQ identifies the top 1% of weight channels by activation magnitude and protects them via scaling rather than mixed precision, effectively allocating more quantization precision to sensitive channels without changing the uniform 4-bit storage format. This produces better perplexity than GPTQ at the same bit-width while using the same memory layout and optimized kernels. AWQ with vLLM — which has native AWQ support with fused GEMM kernels — is the standard production stack for 4-bit LLM serving in 2026.
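Serving a pre-quantized AWQ checkpoint with vLLM is close to a one-liner. A sketch using a published AWQ repo as the example model:

    from vllm import LLM, SamplingParams

    # Any pre-quantized AWQ checkpoint from the Hub works here.
    llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")

    outputs = llm.generate(
        ["Explain weight-only quantization in one sentence."],
        SamplingParams(max_tokens=64, temperature=0.7),
    )
    print(outputs[0].outputs[0].text)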

Quality Comparison

At 4-bit precision, quality ranking from best to worst is generally: AWQ ≈ GPTQ with groupsize=32 > GPTQ with groupsize=128 > bitsandbytes NF4. The differences are small — on Llama 3 8B, all three methods achieve perplexity within 2–5% of the bf16 baseline. For most practical tasks, quality differences between methods are hard to distinguish in user studies. The choice between AWQ and GPTQ for inference is therefore made primarily on inference speed and integration rather than quality.

For tasks requiring the highest possible quality at reduced precision — reasoning, code generation, structured output extraction — consider 8-bit quantization (LLM.int8() or GPTQ int8) which reduces memory by 2x with under 0.5% perplexity degradation, rather than 4-bit with its larger quality trade-off. For tasks where quality is flexible and memory is the primary constraint — consumer hardware, maximizing concurrency on limited VRAM — 4-bit AWQ is the right choice.

Choosing in Practice

For production inference serving with vLLM: use AWQ 4-bit. Best-in-class quality at 4 bits, native vLLM support, optimized GEMM kernels, and wide availability of pre-quantized models make this the default. For QLoRA fine-tuning: use bitsandbytes NF4, the format QLoRA is built on, with first-class support for gradient flow through the frozen quantized base model. For 3-bit deployment on consumer GPUs where 4-bit still doesn't fit: use GPTQ with groupsize=128 at 3 bits via ExLlamaV2. For rapid experimentation without separate quantization tooling: bitsandbytes is the path of least resistance with competitive quality.

Quantization and Fine-Tuned Models

Quantizing a fine-tuned model requires re-running calibration on the fine-tuned weights — you cannot reuse quantization from the base model. This is because fine-tuning shifts the weight distributions, and the optimal quantization scaling factors for the base model are no longer optimal for the fine-tuned model. The calibration data should be representative of the fine-tuned model’s target domain, not the general domain used for base model calibration. For AWQ, using 128–512 samples from your fine-tuning dataset as calibration data typically produces better quantization quality than using the default general-domain calibration set, especially if your fine-tuning has significantly shifted the model’s weight distributions toward domain-specific patterns.
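With the AutoAWQ library this amounts to passing your own samples as calibration data. A sketch, assuming a recent AutoAWQ version whose quantize() accepts raw text via calib_data; the model path and the load_domain_samples helper are placeholders:

    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model_path = "my-org/finetuned-llama-8b"    # placeholder fine-tuned checkpoint
    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # A few hundred samples drawn from the fine-tuning domain.
    calib_samples = load_domain_samples(n=256)  # hypothetical helper returning strings

    model.quantize(
        tokenizer,
        quant_config={"zero_point": True, "q_group_size": 128,
                      "w_bit": 4, "version": "GEMM"},
        calib_data=calib_samples,
    )
    model.save_quantized("finetuned-llama-8b-awq")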

When deploying a QLoRA fine-tuned model, the standard workflow is: merge the LoRA adapter into the dequantized base model weights (producing a full bf16 model), then re-quantize the merged model with AWQ or GPTQ using domain-appropriate calibration data. The merged-then-quantized model is typically 5–10% better in quality than a quantized base model with a floating-point adapter on top, because the quantization is optimized for the final merged weights rather than the original base weights.
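The merge step itself is short with PEFT. A sketch with placeholder ids; note the base model is reloaded in bf16 rather than the 4-bit configuration used during training:

    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    # Reload the base in full precision, not the NF4 config used for training.
    base = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
    )

    # Apply the QLoRA adapter, then fold it into the base weights.
    model = PeftModel.from_pretrained(base, "my-org/qlora-adapter")  # placeholder
    merged = model.merge_and_unload()
    merged.save_pretrained("merged-bf16")  # re-quantize this with AWQ or GPTQ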

Quantization-Aware Training

Post-training quantization (what all three methods above implement) applies quantization after training is complete. Quantization-aware training (QAT) instead simulates quantization during training, allowing the model to adapt its weights to minimize the impact of quantization noise. QAT typically produces better quality than post-training quantization at the same bit-width, at the cost of a full training run. For models where post-training 4-bit quantization produces unacceptable quality degradation — typically smaller models below 7B where the capacity loss from quantization is proportionally larger — QAT is worth evaluating. For 7B+ models where post-training AWQ or GPTQ achieves under 3% perplexity degradation, QAT’s additional training cost is rarely justified.
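The mechanism QAT relies on is "fake" quantization: rounding weights to the quantization grid in the forward pass while letting gradients flow through the rounding unchanged (a straight-through estimator). A minimal PyTorch sketch of symmetric per-group fake quantization, assuming the weight count divides evenly into groups:

    import torch

    def fake_quantize(w: torch.Tensor, bits: int = 4, group_size: int = 128) -> torch.Tensor:
        # Simulate per-group quantization in the forward pass while passing
        # gradients straight through the rounding (straight-through estimator).
        g = w.reshape(-1, group_size)        # assumes numel % group_size == 0
        qmax = 2 ** (bits - 1) - 1
        scale = g.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / qmax
        q = torch.clamp(torch.round(g / scale), -qmax - 1, qmax)
        dequant = (q * scale).reshape(w.shape)
        # Forward value is the dequantized weight; backward gradient is identity.
        return w + (dequant - w).detach()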

Group Size and Its Effect on Quality

GPTQ and AWQ both use per-group quantization scales, where a group is a contiguous block of weights that share the same quantization scale factor. Smaller group sizes mean more scale factors (more storage overhead but better quality); larger group sizes mean fewer scale factors (less overhead but coarser quantization). Group size 128 is the standard default: one fp16 scale per 128 four-bit weights adds roughly 3% memory overhead versus the weight storage and produces good quality for most models. Group size 32 produces noticeably better quality, especially for smaller models where each weight matters more, at roughly 12% memory overhead. For a 7B model already quantized to 4 bits, the additional memory from smaller group sizes is typically worth the quality improvement for quality-sensitive deployments.
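The overhead figures follow directly from the storage format. A quick calculation assuming fp16 scales over 4-bit weights (asymmetric formats also store a small per-group zero point, adding slightly more):

    def scale_overhead(group_size: int, weight_bits: int = 4, scale_bits: int = 16) -> float:
        # One fp16 scale is shared by `group_size` weights stored at `weight_bits`.
        return scale_bits / (group_size * weight_bits)

    for g in (128, 64, 32):
        print(f"group_size={g:3d}: {scale_overhead(g):.1%} overhead from scales")
    # group_size=128: 3.1%   group_size=64: 6.2%   group_size=32: 12.5%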

The interaction between group size and model architecture matters. Models with high weight variance within a layer benefit more from small group sizes — the more diverse the weight values in a block, the coarser the approximation with a single scale factor. Attention layers in transformers tend to have higher weight variance than FFN layers, which is why some quantization methods use smaller group sizes specifically for attention weights while using larger group sizes for FFN weights to balance quality and memory efficiency.

Deployment on Consumer Hardware

One of the most practically important applications of quantization is enabling inference on hardware that would otherwise be unable to run large models at all. An RTX 4090 with 24GB VRAM cannot load a 34B parameter model in bf16 (68GB). At 4-bit AWQ (approximately 17GB for 34B), it fits with room for KV cache. At 3-bit GPTQ (approximately 13GB), a 34B model fits on a single 16GB GPU. This democratizes access to large models for individual researchers, small teams, and edge deployment scenarios where cloud inference costs are prohibitive. The quality trade-off at 3-bit is more significant than at 4-bit, but for many applications a slightly degraded 34B model outperforms a full-precision 7B model, making the trade-off worthwhile.

Quantization and Context Length

Quantizing model weights reduces the memory required for the model itself, but the KV cache for long-context inference remains in the same precision as activations (typically bf16) unless KV cache quantization is also applied. For a 7B-class GQA model (Mistral 7B dimensions: 32 layers, 8 KV heads, head dimension 128) in AWQ 4-bit serving 4K-token contexts at batch size 32, the model weights require roughly 3.5GB while the bf16 KV cache requires roughly 17GB; the KV cache dwarfs the model. At 32K-token contexts, the KV cache dominates overwhelmingly. This means that weight quantization alone does not solve the memory problem for long-context serving; KV cache quantization (fp8 or int8 KV cache) must be paired with weight quantization for high-concurrency long-context deployments. vLLM supports both simultaneously: AWQ weights with fp8 KV cache is a well-tested combination that achieves close to 4x memory reduction on the model and 2x on the KV cache relative to bf16 with bf16 KV.
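The KV cache arithmetic, as a quick check (a sketch using Mistral 7B-style GQA dimensions):

    # KV cache = 2 (K and V) * layers * kv_heads * head_dim * bytes per element,
    # per token, times tokens in flight.
    layers, kv_heads, head_dim = 32, 8, 128     # Mistral 7B-style GQA
    ctx, batch = 4096, 32

    def kv_cache_gb(bytes_per_elem: float) -> float:
        per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
        return per_token * ctx * batch / 1e9

    print(f"bf16 KV cache: {kv_cache_gb(2):.1f} GB")   # ~17.2 GB
    print(f"fp8  KV cache: {kv_cache_gb(1):.1f} GB")   # ~8.6 GB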

The quality impact of KV cache quantization is separate from and additive to the quality impact of weight quantization. AWQ 4-bit weight quantization with bf16 KV cache degrades perplexity by roughly 2–5% versus bf16 baseline. Adding fp8 KV cache quantization on top degrades it by an additional 0.5–1%. For most applications this combined degradation is acceptable, and the memory savings enable much higher concurrency or longer contexts than either technique alone.
