TurboQuant for LLM Model Compression

TurboQuant is an implementation of a quantization algorithm originally developed for KV-cache, now adapted for model weight compression. The goal is to reduce the memory footprint of large language models (LLMs) without significantly sacrificing accuracy.

Details and Benchmarks

The TurboQuant approach uses 4-bit quantization combined with a second 4-bit pass over the residual error (the "4+4" configuration, 8 bits per weight in total). This allows a good trade-off between compression ratio and accuracy. Benchmark results on Qwen3.5-0.8B with WikiText-103 show:

  • Baseline BF16: PPL 14.29, size 1,504 MB
  • 4+4 bit quantization (with residuals): PPL 14.29, size 762 MB
  • 4-bit quantization (group=full): PPL 16.23, size 361 MB
  • 4-bit quantization (group=128): PPL 16.57, size 381 MB

As the data shows, the 4+4 bit configuration achieves a perplexity (PPL) identical to the BF16 baseline while roughly halving the model size (1,504 MB → 762 MB). The plain 4-bit configurations without residuals compress further (to about a quarter of the original size) but show a clear perplexity degradation (16.23–16.57 vs. 14.29).
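The 4+4 scheme can be illustrated with a minimal NumPy sketch (an illustration of residual quantization in general, not TurboQuant's actual implementation; the group size and symmetric [-8, 7] mapping are assumptions): the weight is quantized to 4 bits per group, then the leftover error is quantized with a second 4-bit pass using its own, much smaller scale.

```python
import numpy as np

def quantize_4bit(x, group_size=128):
    # Symmetric per-group 4-bit quantization: map each group to integers in [-8, 7].
    x = x.reshape(-1, group_size)
    scale = np.abs(x).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0  # avoid division by zero for all-zero groups
    q = np.clip(np.round(x / scale), -8, 7)
    return q, scale

def dequantize(q, scale):
    return q * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)

# First 4-bit pass over the weights.
q1, s1 = quantize_4bit(w)
w_hat = dequantize(q1, s1).reshape(-1)

# Second 4-bit pass over the residual error ("4+4" total).
q2, s2 = quantize_4bit(w - w_hat)
w_hat_44 = w_hat + dequantize(q2, s2).reshape(-1)

err_4 = np.abs(w - w_hat).max()
err_44 = np.abs(w - w_hat_44).max()
assert err_44 < err_4  # the residual pass tightens the approximation
```

Because the residual is much smaller in magnitude than the original weights, its 4-bit grid is far finer, which is why the second pass recovers most of the lost accuracy.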

TurboQuant is proposed as a drop-in replacement for PyTorch's nn.Linear module, which simplifies integration into existing models. For those evaluating on-premise deployments, there are trade-offs to consider, discussed on AI-RADAR at /llm-onpremise.
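A drop-in replacement works because the quantized layer keeps the same forward interface while storing compressed weights internally. The sketch below (NumPy instead of torch, with hypothetical names; not the actual TurboQuant API) shows the shape of such a layer: the constructor performs the 4+4 quantization, and `forward` dequantizes and applies the usual `x @ W.T`.

```python
import numpy as np

class QuantLinear:
    """Illustrative stand-in for a quantized linear layer (assumed design).

    Stores 4-bit integer weights plus a 4-bit residual, each with
    per-group float scales, and dequantizes on the fly in forward().
    """

    def __init__(self, weight, group_size=128):
        w = weight.reshape(-1, group_size)
        # First 4-bit pass.
        self.scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
        self.scale[self.scale == 0] = 1.0
        self.q = np.clip(np.round(w / self.scale), -8, 7).astype(np.int8)
        # Second 4-bit pass over the residual.
        r = w - self.q * self.scale
        self.rscale = np.abs(r).max(axis=1, keepdims=True) / 7.0
        self.rscale[self.rscale == 0] = 1.0
        self.rq = np.clip(np.round(r / self.rscale), -8, 7).astype(np.int8)
        self.shape = weight.shape

    def forward(self, x):
        # Dequantize (base + residual) and apply the standard linear map.
        w = (self.q * self.scale + self.rq * self.rscale).reshape(self.shape)
        return x @ w.T

# Usage: the layer approximates the full-precision matmul closely.
rng = np.random.default_rng(0)
w = rng.standard_normal((256, 128)).astype(np.float32)
x = rng.standard_normal((4, 128)).astype(np.float32)
layer = QuantLinear(w)
y, y_ref = layer.forward(x), x @ w.T
assert np.max(np.abs(y - y_ref)) / np.max(np.abs(y_ref)) < 0.05
```

A real implementation would keep the packed 4-bit tensors on device and fuse dequantization into the matmul kernel rather than materializing the full-precision weight, but the interface contract is the same as nn.Linear.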