Quantization

Optimization

Reducing the numerical precision of model weights from FP16 to INT8 or INT4, dramatically cutting VRAM requirements with only a small trade-off in quality.

Quantization compresses a model by representing its billions of weights using fewer bits. A 7B model in FP16 occupies 14 GB of VRAM; Q4_K_M reduces this to ~4.5 GB — making it runnable on a single consumer GPU — with less than 3.5% perplexity degradation.
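The arithmetic is simple: weight memory ≈ parameters × bits per weight ÷ 8. A quick back-of-envelope check (pure illustration; real GGUF files carry some extra bytes for scales and metadata):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB: params x bits / 8 bits-per-byte."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(weight_memory_gb(7, 16))   # 14.0 -> FP16 baseline
print(weight_memory_gb(7, 4.6))  # ~4.0 -> Q4_K_M weights; the file lands nearer 4.5 GB
                                 #         once per-group scales and metadata are added
```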

How Quantization Works

For post-training quantization (the standard approach), each weight matrix is divided into groups (e.g., 32 or 128 weights per group). Within each group, the full-precision range is linearly mapped onto a limited set of integer values. The mapping parameters (scale and zero-point) are stored alongside the quantized weights. At inference, weights are dequantized on the fly before each multiply-accumulate operation.
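A minimal sketch of one group's round trip, assuming simple per-group affine (scale + zero-point) quantization; real schemes such as GGUF k-quants add nested super-block scales and smarter rounding, but the core mapping is the same:

```python
import numpy as np

def quantize_group(w, bits=4):
    """Map a group of FP32 weights onto integers in [0, 2^bits - 1]."""
    qmax = 2**bits - 1
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / qmax if w_max > w_min else 1.0
    zero_point = round(-w_min / scale)
    q = np.clip(np.round(w / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point            # scale/zero-point stored per group

def dequantize_group(q, scale, zero_point):
    """Recover approximate FP32 weights on the fly at inference time."""
    return scale * (q.astype(np.float32) - zero_point)

w = np.random.randn(128).astype(np.float32)
groups = w.reshape(-1, 32)                 # groups of 32 weights, as in many k-quants
recon = np.concatenate([dequantize_group(*quantize_group(g)) for g in groups])
print("max abs error:", float(np.abs(w - recon).max()))
```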

Bit Depths Explained

FP16 / BF16 (16-bit)

No quantization — the reference precision. 2 bytes per weight. Best quality; requires the most VRAM.

INT8 / Q8 (8-bit)

1 byte per weight. Negligible quality loss (<1%). The safe choice when you have enough VRAM. Used in bitsandbytes LLM.int8().
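For reference, loading a model with LLM.int8() through Hugging Face Transformers is a one-flag change (the model id below is illustrative):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# LLM.int8(): weights held in 8-bit, outlier activation channels computed in FP16.
config_8bit = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",       # illustrative model id
    quantization_config=config_8bit,
    device_map="auto",
)
```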

INT4 / Q4 (4-bit)

0.5 bytes per weight. ~3–5% perplexity increase with k-quants. The most popular on-premise choice. Fits 7B in ~4–5 GB.

INT3 / Q3 (3-bit)

~10% perplexity increase. Noticeable quality drop. Use only under extreme memory constraints.

INT2 / Q2 (2-bit)

~30%+ perplexity increase. Significant coherence loss. Experimental / research only.

1.58-bit (BitNet / ternary)

Weights constrained to {-1, 0, +1}. Requires training from scratch with a modified objective. Research frontier; not yet practical for on-premise deployment.

GGUF K-Quant Naming Guide

| Name | Avg bits/weight | 7B VRAM | Quality | Use when... |
|---|---|---|---|---|
| Q2_K | 2.6 | 2.8 GB | Poor | Under 4 GB RAM only |
| Q3_K_M | 3.3 | 3.3 GB | Fair | Very tight memory |
| Q4_K_S | 4.4 | 4.1 GB | Good | Tight memory, acceptable quality |
| Q4_K_M ✓ | 4.6 | 4.5 GB | Good | Default recommendation |
| Q5_K_M | 5.7 | 5.6 GB | Very good | You have 8 GB VRAM to spare |
| Q6_K | 6.6 | 6.1 GB | Excellent | ≥8 GB VRAM, want near-lossless |
| Q8_0 | 8.5 | 7.7 GB | Near-FP16 | Benchmarking / max quality |
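Once a quant is chosen, loading the file with llama-cpp-python is straightforward; a minimal sketch (the model path is a placeholder for whichever GGUF you downloaded):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,          # context window
    n_gpu_layers=-1,     # offload all layers to GPU when one is available
)

out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```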

Quantization Methods Compared

| Method | Format | Target HW | Notes |
|---|---|---|---|
| GGUF k-quants | .gguf | CPU/GPU/Metal | Best for Ollama, LM Studio |
| GPTQ | .safetensors | NVIDIA GPU | Hessian-based, calibration needed |
| AWQ | .safetensors | NVIDIA GPU | Activation-aware scaling, lighter calibration than GPTQ |
| EXL2 | .safetensors | NVIDIA GPU | Mixed precision, fastest GPU inference |
| bitsandbytes | on-the-fly | NVIDIA GPU | NF4 for QLoRA training |
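The last row is the QLoRA path: weights are loaded as 4-bit NF4 and LoRA adapters are trained on top. A typical configuration with Transformers + bitsandbytes looks like this (the model id is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the QLoRA data type
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in BF16
)

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # illustrative model id
    quantization_config=nf4_config,
    device_map="auto",
)
# LoRA adapters (e.g. via the peft library) are then attached to the frozen 4-bit base.
```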

Practical Decision Tree

Have an NVIDIA GPU? Use GGUF with Ollama for simplicity, or GPTQ/AWQ with vLLM for max throughput.
Apple Silicon? GGUF Q4_K_M via Ollama or llama.cpp.
CPU only? GGUF Q4_K_M — expect 3–8 tokens/s on a modern CPU.
Fine-tuning? bitsandbytes NF4 (QLoRA).
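For the maximum-throughput GPU path, vLLM can load an AWQ checkpoint directly; a minimal sketch (the model id is illustrative):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # illustrative AWQ checkpoint
          quantization="awq")

outputs = llm.generate(["What does Q4_K_M mean?"],
                       SamplingParams(temperature=0.7, max_tokens=128))
print(outputs[0].outputs[0].text)
```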