GPTQ (Frantar et al., 2022) quantizes model weights to INT4 or INT8 using an approximate second-order optimisation that minimises the reconstruction error for each layer. It produces GPU-optimised files distinct from the GGUF/llama.cpp ecosystem.
How GPTQ Works
For each linear layer, GPTQ solves: "given that I must round these weights to N bits, what rounding minimises the Hessian-weighted squared error?" The Hessian here is the curvature of the layer's reconstruction error with respect to its weights (proportional to XXᵀ, where X is that layer's calibration input), estimated from ~128 calibration samples. The process runs layer by layer (and column by column within each layer); it takes hours on a large model, but only has to be done once per model.
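The snippet below is a minimal NumPy sketch of that core update, not a faithful reimplementation: the function and variable names are illustrative, it assumes symmetric per-row INT4 scales, and it uses a plain matrix inverse instead of the blocked Cholesky tricks real implementations (AutoGPTQ, GPTQModel) rely on. Each column is rounded, and its rounding error is pushed into the not-yet-quantized columns, weighted by the inverse Hessian.

```python
import numpy as np

def quantize_rtn(w, scale):
    # Round-to-nearest onto a symmetric 4-bit grid, then dequantize back.
    return np.clip(np.round(w / scale), -8, 7) * scale

def gptq_layer_sketch(W, X, damp=0.01):
    """Quantize one linear layer's weights column by column (GPTQ-style).
    W: (out_features, in_features), X: (in_features, n_calibration_tokens)."""
    W = W.astype(np.float64).copy()
    n = W.shape[1]
    H = 2.0 * X @ X.T                               # layer-wise Hessian approximation
    H += damp * np.mean(np.diag(H)) * np.eye(n)     # dampening keeps H invertible
    Hinv = np.linalg.inv(H)

    scale = np.abs(W).max(axis=1) / 7.0 + 1e-12     # one symmetric scale per output row
    Q = np.zeros_like(W)
    for j in range(n):
        Q[:, j] = quantize_rtn(W[:, j], scale)
        err = (W[:, j] - Q[:, j]) / Hinv[j, j]
        # Push this column's rounding error into the remaining columns,
        # weighted by the inverse Hessian, so the layer's output error shrinks.
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return Q
```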
GPTQ vs GGUF Comparison
| Dimension | GPTQ | GGUF (Q4_K_M) |
|---|---|---|
| Target hardware | NVIDIA/AMD GPU | CPU, Apple Metal, GPU |
| Inference engine | AutoGPTQ, ExLlamaV2, vLLM | llama.cpp, Ollama, LM Studio |
| GPU throughput | Faster (optimised INT4 GPU kernels) | Slightly slower on pure GPU |
| CPU support | Poor/none | Excellent |
| Quality at INT4 | Slightly better | Very close (K-quants) |
| Quantization cost | Hours + calibration data | Minutes, no calibration data (widely available pre-quantized) |
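In practice, "widely available pre-quantized" means you rarely run GPTQ yourself. As a sketch, Hugging Face transformers (with optimum and a GPTQ backend such as auto-gptq or gptqmodel installed, on a CUDA GPU) can load a pre-quantized checkpoint directly; the repo name below is just an example.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example repo name: any GPTQ-quantized checkpoint on the Hub loads the same way.
model_id = "TheBloke/Llama-2-7B-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# The quantization config stored in the repo tells transformers to use GPTQ kernels.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Quantization lets you"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```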
Alternatives to GPTQ
AWQ (Activation-aware Weight Quantization)
Protects salient weights identified via activation magnitude. Similar quality to GPTQ, faster quantization. Supported by vLLM and TGI.
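A sketch of how that looks in vLLM (the model repo name is illustrative; any AWQ checkpoint works the same way):

```python
from vllm import LLM, SamplingParams

# Illustrative repo name: pass quantization="awq" to use the AWQ kernels.
llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")

params = SamplingParams(max_tokens=64, temperature=0.7)
for out in llm.generate(["Explain AWQ in one sentence:"], params):
    print(out.outputs[0].text)
```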
EXL2 (ExLlamaV2)
A mixed-precision GPTQ variant: it assigns different bit depths per layer based on measured sensitivity, so it can match Q6 quality at a Q4 average bit depth. Among the fastest GPU inference options available.
Why It Matters for On-Premise
If your on-premise server has one or more discrete NVIDIA GPUs and uses vLLM or ExLlamaV2, GPTQ/AWQ models will typically outperform equivalent GGUF models in throughput. For mixed CPU+GPU or Apple Silicon deployments, stick with GGUF.