What does quantizing an LLM do?

It stores the model weights in fewer bits (e.g. 4 instead of 16), cutting the VRAM needed for the weights to roughly a quarter with only a small quality loss, so larger models fit on consumer GPUs.

Does 4-bit quantization hurt quality?

For most tasks, modern 4-bit methods lose very little. 8-bit is near-lossless; below 4-bit (3-bit, 2-bit) degradation becomes noticeable, especially on reasoning.

GGUF, GPTQ or AWQ — which should I use?

GGUF for CPU/Mac or mixed CPU-GPU inference via llama.cpp/Ollama; GPTQ or AWQ for fast pure-GPU inference via vLLM/TGI. AWQ often edges out GPTQ on quality at the same bit-width.

LLM Quantization Explained (2026): GGUF vs GPTQ vs AWQ

A model's weights are billions of numbers. By default each is stored in 16 bits (FP16); quantization stores them in fewer (8, 4, even 2 bits) by mapping the range of values onto a smaller set of representable levels. The payoff is a large reduction in VRAM (and often a speedup); the cost is a controlled, usually small, loss of precision. Understanding the levels and formats lets you fit the biggest, best model your hardware can hold.
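To make the idea of mapping values onto a smaller set concrete, here is a minimal sketch of symmetric "absmax" rounding in plain Python. It is an illustration only: the function names and the sample weights are made up for this example, and real formats such as GGUF, GPTQ and AWQ use more elaborate schemes (for instance per-group scales and calibration data).

```python
def quantize(weights, bits):
    """Map floats onto signed integer codes using one shared scale (absmax)."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Round each weight to the nearest representable level and clamp to range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


# Hypothetical sample weights, chosen only for illustration.
weights = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88]

for bits in (8, 4, 2):
    codes, scale = quantize(weights, bits)
    restored = dequantize(codes, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit: codes={codes} max error={worst:.4f}")
```

Running it shows the rounding error growing as the bit count drops, which is the precision cost described above in miniature.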

Quantization levels: the quality/size trade-off

Level	Size vs FP16	Quality	Use when
FP16	1× (full)	Reference	VRAM is no constraint
8-bit	~0.5×	Near-lossless	Quality is critical and VRAM allows
4-bit	~0.25×	Very good	The default sweet spot
3-bit	~0.19×	Noticeable loss	Squeezing onto small VRAM
2-bit	~0.13×	Significant loss	Last resort only
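The size column is simply bits per weight divided by 16. As a rough sketch, the arithmetic for the weights alone looks like this; the 7-billion-parameter model is a hypothetical example and `weight_memory_gb` is an illustrative helper, not part of any library. Real quantized files tend to be somewhat larger because they also store scales and metadata, and the KV cache and activations need VRAM on top of the weights.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 7e9  # hypothetical 7B-parameter model

for bits in (16, 8, 4, 3, 2):
    gb = weight_memory_gb(NUM_PARAMS, bits)
    ratio = bits / 16  # size relative to FP16
    print(f"{bits:>2}-bit: {gb:5.2f} GB ({ratio:.3f}x of FP16)")
```

Under these assumptions the same model needs 14 GB at FP16 and about 3.5 GB at 4-bit, which is the difference between not fitting and fitting on many consumer GPUs.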

The three formats

GGUF — the llama.cpp format. Runs on CPU, Apple Silicon and mixed CPU+GPU, and is what Ollama and LM Studio use. Most flexible for local/desktop; not the fastest for pure-GPU high-concurrency serving.
GPTQ — GPU-focused 4-bit quantization, widely supported by vLLM/TGI. Fast pure-GPU inference; a long-standing, well-tooled standard.
AWQ — activation-aware quantization that often preserves quality slightly better than GPTQ at the same bit-width, with fast GPU inference. Increasingly the default for serving.

How to choose

Mac / CPU / desktop, easy setup → GGUF (Ollama, LM Studio).
Production GPU serving, many users → AWQ (or GPTQ) on vLLM/TGI.
Quality matters most and you have VRAM → 8-bit.
Default for most local use → a good 4-bit quant (Q4_K_M in GGUF, or 4-bit AWQ).
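The rules of thumb above can be written as a small lookup. This is only a sketch: `suggest_quant` and its parameters are hypothetical, and the order in which the rules are checked is an assumption, since the list above does not rank them.

```python
def suggest_quant(hardware, many_users=False, quality_first=False, spare_vram=False):
    """Hypothetical helper mirroring the rules of thumb; not a library API.

    hardware: "mac", "cpu" or "gpu".
    """
    # Assumed priority: an explicit quality requirement wins when VRAM allows.
    if quality_first and spare_vram:
        return "8-bit"
    if hardware in ("mac", "cpu"):
        return "GGUF, e.g. Q4_K_M (Ollama, LM Studio)"
    if hardware == "gpu" and many_users:
        return "4-bit AWQ (or GPTQ) on vLLM/TGI"
    # Fallback: the default for most local use.
    return "4-bit: Q4_K_M in GGUF, or 4-bit AWQ"


print(suggest_quant("mac"))
print(suggest_quant("gpu", many_users=True))
print(suggest_quant("gpu", quality_first=True, spare_vram=True))
print(suggest_quant("gpu"))
```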