A model's weights are billions of numbers. By default each is stored in 16 bits (FP16); quantization stores them in fewer — 8, 4, even 2 bits — by mapping the range of values onto a smaller set. The payoff is dramatic VRAM savings (and often speed); the cost is a controlled, usually small, loss of precision. Understanding the levels and formats lets you fit the biggest, best model your hardware can hold.
Quantization levels: the quality/size trade-off
| Level | Size vs FP16 | Quality | Use when |
|---|---|---|---|
| FP16 | 1× (full) | Reference | VRAM is no constraint |
| 8-bit | ~0.5× | Near-lossless | quality-critical, have VRAM |
| 4-bit | ~0.25× | Very good | the default sweet spot |
| 3-bit | ~0.19× | Noticeable loss | squeezing onto small VRAM |
| 2-bit | ~0.13× | Significant loss | last resort only |
The three formats
- GGUF — the llama.cpp format. Runs on CPU, Apple Silicon and mixed CPU+GPU, and is what Ollama and LM Studio use. Most flexible for local/desktop; not the fastest for pure-GPU high-concurrency serving.
- GPTQ — GPU-focused 4-bit quantization, widely supported by vLLM/TGI. Fast pure-GPU inference; a long-standing, well-tooled standard.
- AWQ — activation-aware quantization that often preserves quality slightly better than GPTQ at the same bit-width, with fast GPU inference. Increasingly the default for serving.
How to choose
- Mac / CPU / desktop, easy setup → GGUF (Ollama, LM Studio).
- Production GPU serving, many users → AWQ (or GPTQ) on vLLM/TGI.
- Quality matters most and you have VRAM → 8-bit.
- Default for most local use → a good 4-bit quant (Q4_K_M in GGUF, or 4-bit AWQ).