Quantization compresses a model by representing its billions of weights using fewer bits. A 7B model in FP16 occupies 14 GB of VRAM; Q4_K_M reduces this to ~4.5 GB — making it runnable on a single consumer GPU — with less than 3.5% perplexity degradation.
How Quantization Works
For post-training quantization (the standard approach), each weight matrix is divided into groups (e.g., 32 or 128 weights per group). Within each group, the full-precision range is linearly mapped onto a limited set of integer values. The mapping parameters (scale and zero-point) are stored alongside the quantized weights. At inference, weights are dequantized on the fly before each multiply-accumulate operation.
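To make the mapping concrete, here is a minimal NumPy sketch of group-wise affine quantization. The group size of 32, the function names, and the simple round-to-nearest scheme are illustrative assumptions; real formats such as GGUF pack the integers and use more elaborate block layouts.

```python
import numpy as np

def quantize_group(weights: np.ndarray, bits: int = 4):
    """Affine (scale + zero-point) quantization of one weight group."""
    qmax = 2**bits - 1                           # 15 for 4-bit
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = max((w_max - w_min) / qmax, 1e-8)    # avoid divide-by-zero for flat groups
    zero_point = round(-w_min / scale)
    q = np.clip(np.round(weights / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_group(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover approximate full-precision weights before the multiply-accumulate."""
    return (q.astype(np.float32) - zero_point) * scale

group = np.random.randn(32).astype(np.float32)   # one 32-weight group
q, scale, zp = quantize_group(group, bits=4)
max_error = np.abs(group - dequantize_group(q, scale, zp)).max()
print(f"max reconstruction error in this group: {max_error:.4f}")
```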
Bit Depths Explained
FP16 / BF16 (16-bit)
No quantization; this is the reference precision. 2 bytes per weight. Best quality; requires the most VRAM.
INT8 / Q8 (8-bit)
1 byte per weight. Negligible quality loss (<1%). The safe choice when you have enough VRAM. Used in bitsandbytes LLM.int8().
INT4 / Q4 (4-bit)
0.5 bytes per weight. ~3–5% perplexity increase with k-quants. The most popular on-premise choice. Fits 7B in ~4–5 GB.
INT3 / Q3 (3-bit)
~10% perplexity increase. Noticeable quality drop. Use only under extreme memory constraints.
INT2 / Q2 (2-bit)
~30%+ perplexity increase. Significant coherence loss. Experimental / research only.
1.58-bit (BitNet / ternary)
Weights constrained to {-1, 0, +1}. Requires training from scratch with a modified objective. Research frontier; not yet practical for on-premise deployment.
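The sizes above follow a simple rule of thumb: weight memory is roughly parameter count × bits per weight ÷ 8, with per-block scales and zero-points folded into the average bits figure; the KV cache and activations add overhead on top, which is why the VRAM figures in the table below run a little higher. A back-of-the-envelope sketch, using the per-quant averages from the next section (the helper name is made up):

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in decimal GB."""
    return n_params * bits_per_weight / 8 / 1e9

# 7B model at a few of the bit depths above (averages include block metadata)
for label, bpw in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.6), ("Q2_K", 2.6)]:
    print(f"{label:7s} ~{weight_size_gb(7e9, bpw):.1f} GB of weights")
```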
GGUF K-Quant Naming Guide
| Name | Avg bits/w | 7B VRAM | Quality | Use when... |
|---|---|---|---|---|
| Q2_K | 2.6 | 2.8 GB | Poor | Under 4 GB RAM only |
| Q3_K_M | 3.3 | 3.3 GB | Fair | Very tight memory |
| Q4_K_S | 4.4 | 4.1 GB | Good | Tight memory, acceptable quality |
| Q4_K_M ✓ | 4.6 | 4.5 GB | Good | Default recommendation |
| Q5_K_M | 5.7 | 5.6 GB | Very good | You have 8 GB VRAM to spare |
| Q6_K | 6.6 | 6.1 GB | Excellent | ≥8 GB VRAM, want near-lossless |
| Q8_0 | 8.5 | 7.7 GB | Near-FP16 | Benchmarking / max quality |
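One illustrative way to use this table: pick the highest-quality quant that still leaves headroom for the KV cache and activations. The dictionary below simply mirrors the 7B column above, and the 1.5 GB headroom default is an assumption, not a rule.

```python
# 7B footprints from the table above, best quality first (GB)
QUANT_VRAM_7B = {
    "Q8_0": 7.7, "Q6_K": 6.1, "Q5_K_M": 5.6,
    "Q4_K_M": 4.5, "Q4_K_S": 4.1, "Q3_K_M": 3.3, "Q2_K": 2.8,
}

def pick_quant(vram_gb: float, headroom_gb: float = 1.5) -> str | None:
    """Return the highest-quality quant that fits, leaving headroom for the KV cache."""
    for name, size_gb in QUANT_VRAM_7B.items():
        if size_gb + headroom_gb <= vram_gb:
            return name
    return None

print(pick_quant(8.0))    # Q6_K
print(pick_quant(6.0))    # Q4_K_M
```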
Quantization Methods Compared
| Method | Format | Target HW | Notes |
|---|---|---|---|
| GGUF k-quants | .gguf | CPU/GPU/Metal | Best for Ollama, LM Studio |
| GPTQ | .safetensors | NVIDIA GPU | Hessian-based, calibration needed |
| AWQ | .safetensors | NVIDIA GPU | Activation-aware scaling; lighter calibration than GPTQ |
| EXL2 | .safetensors | NVIDIA GPU | Mixed precision; among the fastest single-GPU inference |
| bitsandbytes | on-the-fly | NVIDIA GPU | NF4 for QLoRA training |
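For the bitsandbytes row, the usual entry point is passing a `BitsAndBytesConfig` to `from_pretrained`. The sketch below shows the NF4 settings commonly used for QLoRA; the model name is a placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 settings commonly used for QLoRA fine-tuning
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 weight format
    bnb_4bit_compute_dtype=torch.bfloat16,  # precision used for the actual matmuls
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",            # placeholder; any causal LM checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
```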
Practical Decision Tree
- Have an NVIDIA GPU? Use GGUF with Ollama for simplicity, or GPTQ/AWQ with vLLM for maximum throughput.
- Apple Silicon? GGUF Q4_K_M via Ollama or llama.cpp.
- CPU only? GGUF Q4_K_M; expect 3–8 tokens/s on a modern CPU.
- Fine-tuning? bitsandbytes NF4 (QLoRA).
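For the GGUF paths above, a minimal llama-cpp-python sketch; the file name and parameters are placeholders, and Ollama wraps the same underlying llama.cpp runtime behind its own CLI and API.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="./mistral-7b-instruct.Q4_K_M.gguf",  # placeholder file name
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # -1 offloads all layers to the GPU; use 0 for CPU-only
)

out = llm("Q: Why quantize a model? A:", max_tokens=64)
print(out["choices"][0]["text"])
```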