A model's weights are billions of numbers. By default each is stored in 16 bits (FP16); quantization stores them in fewer — 8, 4, even 2 bits — by mapping the range of values onto a smaller set. The payoff is dramatic VRAM savings (and often speed); the cost is a controlled, usually small, loss of precision. Understanding the levels and formats lets you fit the biggest, best model your hardware can hold.

Quantization levels: the quality/size trade-off

LevelSize vs FP16QualityUse when
FP161× (full)ReferenceVRAM is no constraint
8-bit~0.5×Near-losslessquality-critical, have VRAM
4-bit~0.25×Very goodthe default sweet spot
3-bit~0.19×Noticeable losssqueezing onto small VRAM
2-bit~0.13×Significant losslast resort only

The three formats

  • GGUF — the llama.cpp format. Runs on CPU, Apple Silicon and mixed CPU+GPU, and is what Ollama and LM Studio use. Most flexible for local/desktop; not the fastest for pure-GPU high-concurrency serving.
  • GPTQ — GPU-focused 4-bit quantization, widely supported by vLLM/TGI. Fast pure-GPU inference; a long-standing, well-tooled standard.
  • AWQ — activation-aware quantization that often preserves quality slightly better than GPTQ at the same bit-width, with fast GPU inference. Increasingly the default for serving.

How to choose

  • Mac / CPU / desktop, easy setup → GGUF (Ollama, LM Studio).
  • Production GPU serving, many users → AWQ (or GPTQ) on vLLM/TGI.
  • Quality matters most and you have VRAM → 8-bit.
  • Default for most local use → a good 4-bit quant (Q4_K_M in GGUF, or 4-bit AWQ).