GPTQ (Frantar et al., 2022) quantizes model weights to INT4 or INT8 using an approximate second-order optimisation that minimises the reconstruction error for each layer. It produces GPU-optimised files distinct from the GGUF/llama.cpp ecosystem.
How GPTQ Works
For each linear layer, GPTQ solves: "given that I must round these weights to N bits, what rounding minimises the Hessian-weighted squared error?" The Hessian here is the curvature of the layer's reconstruction error with respect to its weights (proportional to XXᵀ, where X is that layer's calibration input), estimated from ~128 calibration samples. The process runs layer by layer (and column by column within each layer); it takes hours on a large model, but only has to be done once per model.
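The snippet below is a minimal NumPy sketch of that core update, not a faithful reimplementation: the function and variable names are illustrative, it assumes symmetric per-row INT4 scales, and it uses a plain matrix inverse instead of the blocked Cholesky tricks real implementations (AutoGPTQ, GPTQModel) rely on. Each column is rounded, and its rounding error is pushed into the not-yet-quantized columns, weighted by the inverse Hessian.

```python
import numpy as np

def quantize_rtn(w, scale):
    # Round-to-nearest onto a symmetric 4-bit grid, then dequantize back.
    return np.clip(np.round(w / scale), -8, 7) * scale

def gptq_layer_sketch(W, X, damp=0.01):
    """Quantize one linear layer's weights column by column (GPTQ-style).
    W: (out_features, in_features), X: (in_features, n_calibration_tokens)."""
    W = W.astype(np.float64).copy()
    n = W.shape[1]
    H = 2.0 * X @ X.T                               # layer-wise Hessian approximation
    H += damp * np.mean(np.diag(H)) * np.eye(n)     # dampening keeps H invertible
    Hinv = np.linalg.inv(H)

    scale = np.abs(W).max(axis=1) / 7.0 + 1e-12     # one symmetric scale per output row
    Q = np.zeros_like(W)
    for j in range(n):
        Q[:, j] = quantize_rtn(W[:, j], scale)
        err = (W[:, j] - Q[:, j]) / Hinv[j, j]
        # Push this column's rounding error into the remaining columns,
        # weighted by the inverse Hessian, so the layer's output error shrinks.
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return Q
```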
GPTQ vs GGUF Comparison
| Dimension | GPTQ | GGUF (Q4_K_M) |
|---|---|---|
| Target hardware | NVIDIA/AMD GPU | CPU, Apple Metal, GPU |
| Inference engine | AutoGPTQ, ExLlamaV2, vLLM | llama.cpp, Ollama, LM Studio |
| GPU throughput | Faster (optimised INT4 GPU kernels) | Slightly slower on pure GPU |
| CPU support | Poor/none | Excellent |
| Quality at INT4 | Slightly better | Very close (K-quants) |
| Quantization cost | Hours + calibration data | Minutes, no calibration data (widely available pre-quantized) |
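In practice, "widely available pre-quantized" means you rarely run GPTQ yourself. As a sketch, Hugging Face transformers (with optimum and a GPTQ backend such as auto-gptq or gptqmodel installed, on a CUDA GPU) can load a pre-quantized checkpoint directly; the repo name below is just an example.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example repo name: any GPTQ-quantized checkpoint on the Hub loads the same way.
model_id = "TheBloke/Llama-2-7B-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# The quantization config stored in the repo tells transformers to use GPTQ kernels.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Quantization lets you"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```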
Alternatives to GPTQ
AWQ (Activation-aware Weight Quantization)
Protects salient weights identified via activation magnitude. Similar quality to GPTQ, faster quantization. Supported by vLLM and TGI.
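A sketch of how that looks in vLLM (the model repo name is illustrative; any AWQ checkpoint works the same way):

```python
from vllm import LLM, SamplingParams

# Illustrative repo name: pass quantization="awq" to use the AWQ kernels.
llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")

params = SamplingParams(max_tokens=64, temperature=0.7)
for out in llm.generate(["Explain AWQ in one sentence:"], params):
    print(out.outputs[0].text)
```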
EXL2 (ExLlamaV2)
A mixed-precision GPTQ variant: it assigns different bit depths per layer based on measured sensitivity, so it can match Q6 quality at a Q4 average bit depth. Among the fastest GPU inference options available.
Why It Matters for On-Premise
If your on-premise server has one or more discrete NVIDIA GPUs and uses vLLM or ExLlamaV2, GPTQ/AWQ models will typically outperform equivalent GGUF models in throughput. For mixed CPU+GPU or Apple Silicon deployments, stick with GGUF.