Quantization compresses a model by representing its billions of weights using fewer bits. A 7B model in FP16 occupies 14 GB of VRAM; Q4_K_M reduces this to ~4.5 GB — making it runnable on a single consumer GPU — with less than 3.5% perplexity degradation.
How Quantization Works
For post-training quantization (the standard approach), each weight matrix is divided into groups (e.g., 32 or 128 weights per group). Within each group, the full-precision range is linearly mapped onto a limited set of integer values. The mapping parameters (scale and zero-point) are stored alongside the quantized weights. At inference, weights are dequantized on the fly before each multiply-accumulate operation.
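To make the mapping concrete, here is a minimal NumPy sketch of group-wise affine quantization. The group size of 32, the function names, and the simple round-to-nearest scheme are illustrative assumptions; real formats such as GGUF pack the integers and use more elaborate block layouts.

```python
import numpy as np

def quantize_group(weights: np.ndarray, bits: int = 4):
    """Affine (scale + zero-point) quantization of one weight group."""
    qmax = 2**bits - 1                           # 15 for 4-bit
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = max((w_max - w_min) / qmax, 1e-8)    # avoid divide-by-zero for flat groups
    zero_point = round(-w_min / scale)
    q = np.clip(np.round(weights / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_group(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover approximate full-precision weights before the multiply-accumulate."""
    return (q.astype(np.float32) - zero_point) * scale

group = np.random.randn(32).astype(np.float32)   # one 32-weight group
q, scale, zp = quantize_group(group, bits=4)
max_error = np.abs(group - dequantize_group(q, scale, zp)).max()
print(f"max reconstruction error in this group: {max_error:.4f}")
```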
Bit Depths Explained
FP16 / BF16 (16-bit)
No quantization; this is the reference precision. 2 bytes per weight. Best quality; requires the most VRAM.
INT8 / Q8 (8-bit)
1 byte per weight. Negligible quality loss (<1%). The safe choice when you have enough VRAM. Used in bitsandbytes LLM.int8().
INT4 / Q4 (4-bit)
0.5 bytes per weight. ~3–5% perplexity increase with k-quants. The most popular on-premise choice. Fits 7B in ~4–5 GB.
INT3 / Q3 (3-bit)
~10% perplexity increase. Noticeable quality drop. Use only under extreme memory constraints.
INT2 / Q2 (2-bit)
~30%+ perplexity increase. Significant coherence loss. Experimental / research only.
1.58-bit (BitNet / ternary)
Weights constrained to {-1, 0, +1}. Requires training from scratch with a modified objective. Research frontier; not yet practical for on-premise deployment.
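The sizes above follow a simple rule of thumb: weight memory is roughly parameter count × bits per weight ÷ 8, with per-block scales and zero-points folded into the average bits figure; the KV cache and activations add overhead on top, which is why the VRAM figures in the table below run a little higher. A back-of-the-envelope sketch, using the per-quant averages from the next section (the helper name is made up):

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in decimal GB."""
    return n_params * bits_per_weight / 8 / 1e9

# 7B model at a few of the bit depths above (averages include block metadata)
for label, bpw in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.6), ("Q2_K", 2.6)]:
    print(f"{label:7s} ~{weight_size_gb(7e9, bpw):.1f} GB of weights")
```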
GGUF K-Quant Naming Guide
| Name | Avg bits/w | 7B VRAM | Quality | Use when... |
|---|---|---|---|---|
| Q2_K | 2.6 | 2.8 GB | Poor | Under 4 GB RAM only |
| Q3_K_M | 3.3 | 3.3 GB | Fair | Very tight memory |
| Q4_K_S | 4.4 | 4.1 GB | Good | Tight memory, acceptable quality |
| Q4_K_M ✓ | 4.6 | 4.5 GB | Good | Default recommendation |
| Q5_K_M | 5.7 | 5.6 GB | Very good | You have 8 GB VRAM to spare |
| Q6_K | 6.6 | 6.1 GB | Excellent | ≥8 GB VRAM, want near-lossless |
| Q8_0 | 8.5 | 7.7 GB | Near-FP16 | Benchmarking / max quality |
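One illustrative way to use this table: pick the highest-quality quant that still leaves headroom for the KV cache and activations. The dictionary below simply mirrors the 7B column above, and the 1.5 GB headroom default is an assumption, not a rule.

```python
# 7B footprints from the table above, best quality first (GB)
QUANT_VRAM_7B = {
    "Q8_0": 7.7, "Q6_K": 6.1, "Q5_K_M": 5.6,
    "Q4_K_M": 4.5, "Q4_K_S": 4.1, "Q3_K_M": 3.3, "Q2_K": 2.8,
}

def pick_quant(vram_gb: float, headroom_gb: float = 1.5) -> str | None:
    """Return the highest-quality quant that fits, leaving headroom for the KV cache."""
    for name, size_gb in QUANT_VRAM_7B.items():
        if size_gb + headroom_gb <= vram_gb:
            return name
    return None

print(pick_quant(8.0))    # Q6_K
print(pick_quant(6.0))    # Q4_K_M
```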
Quantization Methods Compared
| Method | Format | Target HW | Notes |
|---|---|---|---|
| GGUF k-quants | .gguf | CPU/GPU/Metal | Best for Ollama, LM Studio |
| GPTQ | .safetensors | NVIDIA GPU | Hessian-based, calibration needed |
| AWQ | .safetensors | NVIDIA GPU | Activation-aware scaling; lighter calibration than GPTQ |
| EXL2 | .safetensors | NVIDIA GPU | Mixed precision; among the fastest single-GPU inference |
| bitsandbytes | on-the-fly | NVIDIA GPU | NF4 for QLoRA training |
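For the bitsandbytes row, the usual entry point is passing a `BitsAndBytesConfig` to `from_pretrained`. The sketch below shows the NF4 settings commonly used for QLoRA; the model name is a placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 settings commonly used for QLoRA fine-tuning
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 weight format
    bnb_4bit_compute_dtype=torch.bfloat16,  # precision used for the actual matmuls
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",            # placeholder; any causal LM checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
```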
Practical Decision Tree
- Have an NVIDIA GPU? Use GGUF with Ollama for simplicity, or GPTQ/AWQ with vLLM for maximum throughput.
- Apple Silicon? GGUF Q4_K_M via Ollama or llama.cpp.
- CPU only? GGUF Q4_K_M; expect 3–8 tokens/s on a modern CPU.
- Fine-tuning? bitsandbytes NF4 (QLoRA).
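For the GGUF paths above, a minimal llama-cpp-python sketch; the file name and parameters are placeholders, and Ollama wraps the same underlying llama.cpp runtime behind its own CLI and API.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="./mistral-7b-instruct.Q4_K_M.gguf",  # placeholder file name
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # -1 offloads all layers to the GPU; use 0 for CPU-only
)

out = llm("Q: Why quantize a model? A:", max_tokens=64)
print(out["choices"][0]["text"])
```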