GPTQ

Optimization

A GPU-native post-training quantization method that uses second-order Hessian information to minimize weight-rounding error; at the same bit depth it typically runs faster on GPU than GGUF.

GPTQ (Frantar et al., 2022) quantizes model weights to INT4 or INT8 using an approximate second-order optimization that minimizes the reconstruction error of each layer. It produces GPU-optimized files distinct from the GGUF/llama.cpp ecosystem.

How GPTQ Works

For each linear layer, GPTQ solves: "given that I must round these weights to N bits, what rounding minimizes the Hessian-weighted squared error?" The Hessian (the curvature of the layer's reconstruction error with respect to its weights) is computed from the layer's inputs on a calibration dataset of ~128 samples. The process runs layer by layer (column-wise within each layer) and takes hours for a large model, but only needs to be done once per model.
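
A minimal NumPy sketch of this column-wise procedure for a single layer is below. It follows the shape of the published algorithm (Hessian from calibration inputs, Cholesky factor of its inverse, error compensation into later columns) but omits grouping, lazy batched updates, and activation ordering; the names (`gptq_layer`, `quantize_rtn`, `percdamp`) are illustrative, not taken from any real implementation.

```python
import numpy as np

def quantize_rtn(w, scale, bits=4):
    """Round-to-nearest quantization of a vector onto a symmetric 2^bits grid."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def gptq_layer(W, X, bits=4, percdamp=0.01):
    """Toy single-block GPTQ for one linear layer.

    W: (d_out, d_in) weights; X: (d_in, n_samples) calibration inputs.
    Returns quantized weights Q with the same shape as W.
    """
    d_out, d_in = W.shape
    W = W.astype(np.float64).copy()
    Q = np.zeros_like(W)

    # Hessian of the layer-wise reconstruction error, H = 2 X X^T, with dampening
    # so the inverse stays well conditioned even for small calibration sets.
    H = 2.0 * (X @ X.T)
    H += percdamp * np.mean(np.diag(H)) * np.eye(d_in)

    # Fixed per-row scales taken from the original weights (symmetric, per-channel).
    qmax = 2 ** (bits - 1) - 1
    scales = np.maximum(np.abs(W).max(axis=1) / qmax, 1e-12)

    # Upper-triangular Cholesky factor of H^-1; its rows drive the error updates.
    Hinv = np.linalg.cholesky(np.linalg.inv(H)).T

    for i in range(d_in):
        # Quantize column i: the weights for input dimension i across all rows.
        q = quantize_rtn(W[:, i], scales, bits)
        Q[:, i] = q
        # Hessian-weighted quantization error of this column ...
        err = (W[:, i] - q) / Hinv[i, i]
        # ... folded into the not-yet-quantized columns so the layer's output
        # on the calibration data stays as close as possible to the original.
        if i + 1 < d_in:
            W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])
    return Q
```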

GPTQ vs GGUF Comparison

| Dimension | GPTQ | GGUF (Q4_K_M) |
|---|---|---|
| Target hardware | NVIDIA/AMD GPU | CPU, Apple Metal, GPU |
| Inference engine | AutoGPTQ, ExLlamaV2, vLLM | llama.cpp, Ollama, LM Studio |
| GPU throughput | Faster (INT4 tensor cores) | Slightly slower on pure GPU |
| CPU support | Poor/none | Excellent |
| Quality at INT4 | Slightly better | Very close (K-quants) |
| Quantization cost | Hours + calibration data | Available pre-quantized |

Alternatives to GPTQ

AWQ (Activation-aware Weight Quantization)

Protects salient weight channels identified via activation magnitude. Quality similar to GPTQ, with faster quantization. Supported by vLLM and TGI.
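
As a rough illustration of the idea (not the library's actual code), the sketch below scales up the input channels with the largest average activation magnitude before rounding, so their weights lose less precision; the inverse scales would be folded into the preceding layer. The `alpha` exponent and function names are hypothetical, and real AWQ searches the scaling exponent per layer against reconstruction error.

```python
import numpy as np

def awq_scale_and_quantize(W, X, bits=4, alpha=0.5):
    """Toy AWQ-style quantization of one linear layer.

    W: (d_out, d_in) weights; X: (d_in, n_samples) calibration inputs.
    Returns the effective (dequantized) weights and the per-channel scales.
    """
    qmax = 2 ** (bits - 1) - 1
    # Per-input-channel salience: mean absolute activation on calibration data.
    act_mag = np.maximum(np.abs(X).mean(axis=1), 1e-8)
    s = np.power(act_mag / act_mag.mean(), alpha)   # bigger scale for salient channels
    s = np.clip(s, 1e-4, None)

    W_scaled = W * s[None, :]                       # scale salient channels up
    scales = np.maximum(np.abs(W_scaled).max(axis=1, keepdims=True) / qmax, 1e-12)
    Q = np.clip(np.round(W_scaled / scales), -qmax - 1, qmax) * scales
    return Q / s[None, :], s                        # undo scaling; s is folded upstream
```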

EXL2 (ExLlamaV2)

Mixed-precision GPTQ variant that assigns different bit depths per layer based on sensitivity. Can match Q6-level quality at a Q4 average bit depth. Among the fastest GPU inference options available.
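
The per-layer bit-depth assignment can be pictured as a budgeted allocation problem. The sketch below is a generic greedy allocator under an average-bit budget, not ExLlamaV2's actual measurement-and-search pass; the data layout (`err`, `sizes`) is hypothetical.

```python
def allocate_bits(err, sizes, avg_bits_target):
    """Greedy per-layer bit allocation sketch.

    err[i][b]  -> estimated quantization error of layer i at bit width b
                  (every layer is assumed to list the same candidate widths);
    sizes[i]   -> parameter count of layer i.
    Upgrades whichever layer gains the most error reduction per extra bit
    until the parameter-weighted average-bit budget is exhausted.
    """
    candidates = sorted(err[0].keys())              # e.g. [2, 3, 4, 5, 6, 8]
    assign = [candidates[0]] * len(err)             # start every layer at the floor
    budget = avg_bits_target * sum(sizes)           # total bit budget

    def used():
        return sum(b * s for b, s in zip(assign, sizes))

    while True:
        best, best_gain = None, 0.0
        for i, e in enumerate(err):
            higher = [b for b in candidates if b > assign[i]]
            if not higher:
                continue
            nxt = higher[0]
            extra = (nxt - assign[i]) * sizes[i]
            if used() + extra > budget:
                continue
            gain = (e[assign[i]] - e[nxt]) / extra  # error reduced per extra bit
            if gain > best_gain:
                best, best_gain = (i, nxt), gain
        if best is None:
            return assign                           # budget spent or no improvement left
        assign[best[0]] = best[1]
```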

Why It Matters for On-Premise

If your on-premise server has one or more discrete NVIDIA GPUs and uses vLLM or ExLlamaV2, GPTQ/AWQ models will typically outperform equivalent GGUF models in throughput. For mixed CPU+GPU or Apple Silicon deployments, stick with GGUF.
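
For the GPU-only case, serving a GPTQ checkpoint with vLLM is little more than constructing the engine with a quantization hint. A minimal sketch, where the model id is just an example and should be replaced with your own GPTQ repo or local path:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-GPTQ",  # example GPTQ-quantized checkpoint
    quantization="gptq",                    # vLLM also accepts "awq"
    tensor_parallel_size=1,                 # raise for multi-GPU servers
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Summarize our on-premise deployment options."], params)
print(outputs[0].outputs[0].text)
```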