GGUF is a self-contained binary format that stores model weights, quantization metadata, the tokenizer vocabulary, and model configuration in a single file. It replaced the older GGML format in August 2023 and is maintained by the llama.cpp project.
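To make "self-contained single file" concrete, here is a minimal sketch (in Python, following the published GGUF header layout) that reads just the fixed header: the `GGUF` magic, format version, tensor count, and metadata key/value count. The file path is a placeholder.

```python
import struct

def read_gguf_header(path: str) -> dict:
    """Read the fixed GGUF header that precedes the metadata and tensor sections."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic = {magic!r})")
        # Little-endian: uint32 version, uint64 tensor count, uint64 metadata KV count
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
    return {"version": version, "tensor_count": n_tensors, "metadata_kv_count": n_kv}

print(read_gguf_header("model-Q4_K_M.gguf"))  # placeholder path
```

Everything else in the file, including the quantized tensor data, tokenizer vocabulary, and configuration, follows as typed key/value metadata and tensor blobs, which is why no side-car config files are needed.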
GGUF File Naming Convention
Names typically follow the pattern `model-name-Q{bits}_{variant}.gguf`, with either a hyphen or a dot before the quantization suffix, e.g. `mistral-7b-instruct-v0.2.Q4_K_M.gguf`.
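As a convenience sketch (not part of any spec), a small regex can recover the quantization level from such filenames; the pattern below is an assumption that accepts either separator:

```python
import re

# e.g. "mistral-7b-instruct-v0.2.Q4_K_M.gguf" or "llama-2-7b-Q5_K_M.gguf"
QUANT_RE = re.compile(r"[.-](?P<quant>[QF]\d+(?:_[A-Za-z0-9]+)*)\.gguf$")

def quant_of(filename: str) -> str | None:
    """Return the quantization suffix (e.g. 'Q4_K_M'), or None if absent."""
    m = QUANT_RE.search(filename)
    return m.group("quant") if m else None

print(quant_of("mistral-7b-instruct-v0.2.Q4_K_M.gguf"))  # -> Q4_K_M
print(quant_of("llama-2-7b.Q8_0.gguf"))                  # -> Q8_0
```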
Quantization Levels in GGUF
| Format | Bits/weight | RAM/VRAM (7B model) | Quality | Recommended for |
|---|---|---|---|---|
| Q2_K | 2.6 | ~2.8 GB | Poor | Very low RAM; avoid for production |
| Q3_K_M | 3.3 | ~3.3 GB | Acceptable | Extreme memory constraints only |
| Q4_K_S | 4.4 | ~4.1 GB | Good | Slightly smaller than Q4_K_M at a small quality cost |
| Q4_K_M | 4.6 | ~4.5 GB | Good | Recommended default |
| Q5_K_M | 5.7 | ~5.6 GB | Very good | Best quality/size when you have room |
| Q6_K | 6.6 | ~6.1 GB | Excellent | Near-lossless; 16GB GPU headroom |
| Q8_0 | 8.5 | ~7.7 GB | Near-FP16 | Maximum quality short of FP16; large memory footprint |
| F16 | 16 | ~14 GB | Reference | Validation / training baseline |
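The memory column follows from bits-per-weight: roughly parameters times bpw divided by 8 bytes on disk, plus KV cache and framework overhead at run time. A back-of-the-envelope check using the table's bpw figures:

```python
def gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Estimated weight-file size in GB: params * bpw / 8 bytes."""
    return n_params * bits_per_weight / 8 / 1e9

# bpw values taken from the table above, for a 7B-parameter model
for fmt, bpw in [("Q4_K_M", 4.6), ("Q5_K_M", 5.7), ("Q8_0", 8.5), ("F16", 16.0)]:
    print(f"7B {fmt}: ~{gguf_size_gb(7e9, bpw):.1f} GB of weights")
# Runtime memory adds the KV cache and framework overhead on top of this,
# which is why the table's figures sit slightly above the raw weight size.
```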
K-Quants Explained
The "K" suffix (e.g., Q4_K_M) refers to k-quants — an improved quantization method that uses different precision levels for different layer types. Attention layers get higher precision (k-quantized) while feed-forward layers use standard INT4. This "mixed precision within a single quantization level" gives significantly better quality than naive Q4_0 at the same file size. "_M" means medium grouping size; "_S" is smaller (less VRAM, less quality); "_L" is larger (more quality).
Why It Matters for On-Premise
GGUF is the format of choice for Ollama, LM Studio, and GPT4All — the most common on-premise serving stacks. Q4_K_M is the sweet spot for most use cases: a 7B Q4_K_M fits in 6 GB of RAM (or VRAM), runs at 25–40 tokens/s on a modern GPU, and loses less than 2% quality vs FP16 on standard benchmarks.
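As a usage sketch with llama-cpp-python, one common way to serve a GGUF file from Python (the model path and prompt are placeholders; `n_gpu_layers=-1` offloads all layers to the GPU when one is available):

```python
from llama_cpp import Llama

# Load a Q4_K_M GGUF file; the path is a placeholder for your local model.
llm = Llama(
    model_path="./mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    n_ctx=4096,        # context window size
    n_gpu_layers=-1,   # offload all layers to GPU if available
)

out = llm("Q: What file format does llama.cpp use for models?\nA:", max_tokens=32)
print(out["choices"][0]["text"])
```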