GGUF

Format

GPT-Generated Unified Format — the binary file format used by llama.cpp to store quantized LLM weights, and the de facto standard for CPU and consumer-GPU inference.

GGUF is a self-contained binary format that stores model weights, quantization metadata, the tokenizer vocabulary, and model configuration in a single file. It replaced the older GGML format in August 2023 and is maintained by the llama.cpp project.
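
The header layout is fixed and documented, so the first fields of any GGUF file can be inspected with nothing but the standard library. Below is a minimal sketch in Python (assuming the v2/v3 header layout; "model.gguf" is a placeholder path) that prints the format version, tensor count, and metadata entry count. Everything after these fields is the typed key/value metadata, then the tensor descriptors, then the tensor data itself.

```python
import struct

# Minimal sketch: read the fixed GGUF header (v2/v3 layout, little-endian).
# "model.gguf" is a placeholder path.
with open("model.gguf", "rb") as f:
    magic = f.read(4)                             # b"GGUF" identifies the format
    assert magic == b"GGUF", "not a GGUF file"
    version, = struct.unpack("<I", f.read(4))     # format version (uint32)
    n_tensors, = struct.unpack("<Q", f.read(8))   # tensor count (uint64)
    n_kv, = struct.unpack("<Q", f.read(8))        # metadata key/value count (uint64)

print(f"GGUF v{version}: {n_tensors} tensors, {n_kv} metadata entries")
```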

GGUF File Naming Convention

Names typically follow the pattern model-name-Q{bits}_{variant}.gguf; for example, llama-2-7b-chat-Q4_K_M.gguf is a Llama 2 7B chat model quantized to Q4_K_M (4-bit k-quant, medium variant).
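
When scripting against a directory of models, the quantization label can be recovered from the filename with a small regex. This is a hypothetical helper following the pattern above, not part of any llama.cpp tooling:

```python
import re

# Hypothetical helper (not part of llama.cpp): extract the quantization
# label from a GGUF filename following the naming pattern above.
QUANT_RE = re.compile(r"(Q\d_(?:K_[SML]|K|0|1)|F16|F32)\.gguf$", re.IGNORECASE)

def quant_of(filename: str) -> str | None:
    m = QUANT_RE.search(filename)
    return m.group(1).upper() if m else None

print(quant_of("llama-2-7b-chat-Q4_K_M.gguf"))    # -> Q4_K_M
print(quant_of("mistral-7b-instruct.Q6_K.gguf"))  # -> Q6_K
```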

Quantization Levels in GGUF

| Format | Bits/weight | VRAM (7B model) | Quality | Recommended for |
| --- | --- | --- | --- | --- |
| Q2_K | 2.6 | ~2.8 GB | Poor | Very low RAM; avoid for production |
| Q3_K_M | 3.3 | ~3.3 GB | Acceptable | Extreme memory constraints only |
| Q4_K_S | 4.4 | ~4.1 GB | Good | Smaller variant of Q4_K_M |
| Q4_K_M | 4.6 | ~4.5 GB | Good | Recommended default |
| Q5_K_M | 5.7 | ~5.6 GB | Very good | Best quality/size trade-off when you have the room |
| Q6_K | 6.6 | ~6.1 GB | Excellent | Near-lossless; fits 16 GB GPUs with headroom |
| Q8_0 | 8.5 | ~7.7 GB | Near-FP16 | Maximum quality at INT8; GPU only |
| F16 | 16 | ~14 GB | Reference | Validation / training baseline |
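
The VRAM column follows almost directly from the bits-per-weight column: file size is roughly parameter count times bits per weight divided by 8, and runtime memory adds the KV cache and compute buffers on top. A quick sanity check of the Q4_K_M row, assuming a nominal 7 billion parameters:

```python
# Back-of-envelope check of the table's Q4_K_M row.
# 7e9 parameters is a nominal figure; real "7B" models vary (e.g., ~6.7e9).
params = 7e9
bits_per_weight = 4.6

file_size_gb = params * bits_per_weight / 8 / 1e9
print(f"weights alone: ~{file_size_gb:.1f} GB")  # ~4.0 GB

# Runtime VRAM adds the KV cache and compute buffers on top of the weights,
# which is roughly where the table's ~4.5 GB figure comes from.
```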

K-Quants Explained

The "K" suffix (e.g., Q4_K_M) refers to k-quants — an improved quantization method that uses different precision levels for different layer types. Attention layers get higher precision (k-quantized) while feed-forward layers use standard INT4. This "mixed precision within a single quantization level" gives significantly better quality than naive Q4_0 at the same file size. "_M" means medium grouping size; "_S" is smaller (less VRAM, less quality); "_L" is larger (more quality).

Why It Matters for On-Premise

GGUF is the format of choice for Ollama, LM Studio, and GPT4All — the most common on-premise serving stacks. Q4_K_M is the sweet spot for most use cases: a 7B model at Q4_K_M fits in about 6 GB of RAM (or VRAM) including context, runs at 25–40 tokens/s on a modern consumer GPU, and typically loses under 2% quality vs FP16 on standard benchmarks.
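
As a concrete starting point, a GGUF file can be loaded directly from Python with the llama-cpp-python bindings. A minimal sketch (the model path is a placeholder, and n_gpu_layers=-1 assumes a build with GPU support):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path: any Q4_K_M GGUF works here.
llm = Llama(
    model_path="./llama-2-7b-chat-Q4_K_M.gguf",
    n_ctx=4096,       # context window size
    n_gpu_layers=-1,  # offload all layers to the GPU when available
)

out = llm("Q: What is GGUF? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```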