Optimizing Dequantization in Llama.cpp

A recent open-source effort adds TurboQuant KV cache quantization to Llama.cpp. The implementation aimed to improve performance, but it ran into a significant bottleneck: dequantization.

At long contexts (32K tokens on an M5 Max), dequantization alone consumed about 40% of decode time. Conventional remedies, including lookup tables (LUTs), SIMD, and fused kernels, yielded no appreciable improvement.
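For reference, the LUT variant of a q4-style dequantizer looks roughly like the sketch below. The block layout and the names BlockQ4, init_lut, and dequant_block are illustrative assumptions, not Llama.cpp's actual q4_0 symbols. Even with the nibble extraction folded into a table load, every value still has to be scaled and written out, so the per-token work barely shrinks:

```cpp
#include <cstdint>

// Illustrative q4-style block: 32 4-bit codes sharing one scale.
// Layout simplified for the sketch; the real format packs nibbles differently.
struct BlockQ4 {
    float   d;        // per-block scale (fp16 in real formats)
    uint8_t qs[16];   // 32 packed 4-bit codes, value decodes as (code - 8) * d
};

// LUT approach: precompute both nibbles of every possible byte, turning
// two shifts/masks per byte into a single table load.
static float lut[256][2];

void init_lut() {
    for (int b = 0; b < 256; ++b) {
        lut[b][0] = float((b & 0x0F) - 8);  // low nibble
        lut[b][1] = float((b >> 4)   - 8);  // high nibble
    }
}

void dequant_block(const BlockQ4 &blk, float out[32]) {
    for (int i = 0; i < 16; ++i) {
        out[2 * i + 0] = lut[blk.qs[i]][0] * blk.d;
        out[2 * i + 1] = lut[blk.qs[i]][1] * blk.d;
    }
}
```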

Leveraging Attention Sparsity

The effective solution was to exploit attention sparsity. Flash Attention computes the softmax weights before it ever reads V, and in long contexts most of those weights are effectively zero. So instead of making dequantization faster, the kernel skips it entirely for positions whose attention weight is negligible.
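A minimal scalar sketch of the idea follows. QuantRow, dequantize_row, and sparse_v_accum are illustrative names rather than the project's API, and a real kernel would apply the same gate inside Flash Attention's tile loop rather than after a full softmax pass:

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Illustrative q8_0-style V row: int8 codes plus one scale.
struct QuantRow {
    float  d;        // per-row scale
    int8_t q[128];   // quantized codes (assumes head_dim <= 128)
};

void dequantize_row(const QuantRow &row, float *out, int head_dim) {
    for (int d = 0; d < head_dim; ++d)
        out[d] = row.q[d] * row.d;
}

// Weighted sum over V for one query. The softmax weights p[] are known
// before V is read, so rows with negligible weight are never dequantized.
void sparse_v_accum(const float *p, const QuantRow *v, size_t n_pos,
                    int head_dim, float eps, float *out) {
    std::vector<float> row(head_dim);
    for (int d = 0; d < head_dim; ++d) out[d] = 0.0f;

    for (size_t i = 0; i < n_pos; ++i) {
        if (p[i] < eps) continue;                 // weight ~ 0: skip dequantization entirely
        dequantize_row(v[i], row.data(), head_dim);
        for (int d = 0; d < head_dim; ++d)
            out[d] += p[i] * row[d];              // accumulate p_i * V_i
    }
}
```

The threshold eps trades a bounded amount of accumulated error for skipped dequantization work; the runs reported here saw no change in perplexity.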

Results on Qwen3.5-35B-A3B (M5 Max) with the TurboQuant KV cache (turbo3) show a 22.8% increase in decode speed at 32K, with no change in perplexity. Similar tests on a standard q8_0 KV cache showed a 5% increase in decode speed, also with no impact on perplexity.

The optimization is not specific to TurboQuant; it exploits attention sparsity directly. Tests on an M2 Pro also showed significant improvements, combining a 4-mag LUT on the K side with the sparse V stack.
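One way to see why the trick is format-agnostic: the gate can be evaluated on the attention logits alone, before V is ever touched, so it never inspects the V encoding at all. Since Flash Attention already tracks the running maximum m of the logits for numerical stability, a position's unnormalized weight is exp(s_i - m), and the comparison can be done in log space. A hedged sketch (the threshold value is illustrative, not taken from the project):

```cpp
#include <cmath>

// Gate on the logit gap: exp(s_i - m) < eps  <=>  s_i - m < log(eps).
// Skips the row before any exp, dequantization, or V access happens.
inline bool negligible(float s_i, float m, float eps = 1e-6f) {
    return s_i - m < std::log(eps);
}
```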

The code and benchmarks are available on GitHub.