GGUF

Format

GPT-Generated Unified Format — the binary file format used by llama.cpp to store quantized LLM weights, and the de facto standard for CPU and consumer-GPU inference.

GGUF is a self-contained binary format that stores model weights, quantization metadata, the tokenizer vocabulary, and model configuration in a single file. It replaced the older GGML format in August 2023 and is maintained by the llama.cpp project.
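
The header layout is fixed and documented, so the first fields of any GGUF file can be inspected with nothing but the standard library. Below is a minimal sketch in Python (assuming the v2/v3 header layout; "model.gguf" is a placeholder path) that prints the format version, tensor count, and metadata entry count. Everything after these fields is the typed key/value metadata, then the tensor descriptors, then the tensor data itself.

```python
import struct

# Minimal sketch: read the fixed GGUF header (v2/v3 layout, little-endian).
# "model.gguf" is a placeholder path.
with open("model.gguf", "rb") as f:
    magic = f.read(4)                             # b"GGUF" identifies the format
    assert magic == b"GGUF", "not a GGUF file"
    version, = struct.unpack("<I", f.read(4))     # format version (uint32)
    n_tensors, = struct.unpack("<Q", f.read(8))   # tensor count (uint64)
    n_kv, = struct.unpack("<Q", f.read(8))        # metadata key/value count (uint64)

print(f"GGUF v{version}: {n_tensors} tensors, {n_kv} metadata entries")
```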

GGUF File Naming Convention

Names typically follow the pattern model-name-Q{bits}_{variant}.gguf; for example, llama-2-7b-chat-Q4_K_M.gguf is a Llama 2 7B chat model quantized to Q4_K_M (4-bit k-quant, medium variant).
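
When scripting against a directory of models, the quantization label can be recovered from the filename with a small regex. This is a hypothetical helper following the pattern above, not part of any llama.cpp tooling:

```python
import re

# Hypothetical helper (not part of llama.cpp): extract the quantization
# label from a GGUF filename following the naming pattern above.
QUANT_RE = re.compile(r"(Q\d_(?:K_[SML]|K|0|1)|F16|F32)\.gguf$", re.IGNORECASE)

def quant_of(filename: str) -> str | None:
    m = QUANT_RE.search(filename)
    return m.group(1).upper() if m else None

print(quant_of("llama-2-7b-chat-Q4_K_M.gguf"))    # -> Q4_K_M
print(quant_of("mistral-7b-instruct.Q6_K.gguf"))  # -> Q6_K
```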

Quantization Levels in GGUF

| Format | Bits/weight | VRAM (7B model) | Quality | Recommended for |
| --- | --- | --- | --- | --- |
| Q2_K | 2.6 | ~2.8 GB | Poor | Very low RAM; avoid for production |
| Q3_K_M | 3.3 | ~3.3 GB | Acceptable | Extreme memory constraints only |
| Q4_K_S | 4.4 | ~4.1 GB | Good | Smaller variant of Q4_K_M |
| Q4_K_M | 4.6 | ~4.5 GB | Good | Recommended default |
| Q5_K_M | 5.7 | ~5.6 GB | Very good | Best quality/size trade-off when you have the room |
| Q6_K | 6.6 | ~6.1 GB | Excellent | Near-lossless; fits 16 GB GPUs with headroom |
| Q8_0 | 8.5 | ~7.7 GB | Near-FP16 | Maximum quality at INT8; GPU only |
| F16 | 16 | ~14 GB | Reference | Validation / training baseline |
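
The VRAM column follows almost directly from the bits-per-weight column: file size is roughly parameter count times bits per weight divided by 8, and runtime memory adds the KV cache and compute buffers on top. A quick sanity check of the Q4_K_M row, assuming a nominal 7 billion parameters:

```python
# Back-of-envelope check of the table's Q4_K_M row.
# 7e9 parameters is a nominal figure; real "7B" models vary (e.g., ~6.7e9).
params = 7e9
bits_per_weight = 4.6

file_size_gb = params * bits_per_weight / 8 / 1e9
print(f"weights alone: ~{file_size_gb:.1f} GB")  # ~4.0 GB

# Runtime VRAM adds the KV cache and compute buffers on top of the weights,
# which is roughly where the table's ~4.5 GB figure comes from.
```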

K-Quants Explained

The "K" suffix (e.g., Q4_K_M) refers to k-quants — an improved quantization method that uses different precision levels for different layer types. Attention layers get higher precision (k-quantized) while feed-forward layers use standard INT4. This "mixed precision within a single quantization level" gives significantly better quality than naive Q4_0 at the same file size. "_M" means medium grouping size; "_S" is smaller (less VRAM, less quality); "_L" is larger (more quality).

Why It Matters for On-Premise

GGUF is the format of choice for Ollama, LM Studio, and GPT4All — the most common on-premise serving stacks. Q4_K_M is the sweet spot for most use cases: a 7B model at Q4_K_M fits in about 6 GB of RAM (or VRAM) including context, runs at 25–40 tokens/s on a modern consumer GPU, and typically loses under 2% quality vs FP16 on standard benchmarks.
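
As a concrete starting point, a GGUF file can be loaded directly from Python with the llama-cpp-python bindings. A minimal sketch (the model path is a placeholder, and n_gpu_layers=-1 assumes a build with GPU support):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path: any Q4_K_M GGUF works here.
llm = Llama(
    model_path="./llama-2-7b-chat-Q4_K_M.gguf",
    n_ctx=4096,       # context window size
    n_gpu_layers=-1,  # offload all layers to the GPU when available
)

out = llm("Q: What is GGUF? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```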