RotorQuant: A Faster Alternative to TurboQuant
RotorQuant is a vector quantization technique that uses Clifford rotors to run faster and with far fewer parameters than TurboQuant. Early results show a 10-19x speedup and a 44x reduction in the number of parameters.
The key idea is to replace the d×d random orthogonal matrix with Clifford rotors in Cl(3,0). Instead of a dense matrix multiplication, the vector is divided into groups of 3 dimensions and each group is rotated with a 4-parameter rotor. This approach drastically reduces the number of operations required.
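The grouping-and-rotation step can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the RotorQuant implementation. It relies on the fact that a unit rotor in Cl(3,0) acts on a 3-D vector the same way a unit quaternion does, up to sign convention. The function names `random_rotors` and `apply_rotors`, the zero-padding of the last group, and the random initialization are all hypothetical choices made for this example. The sketch does not attempt to reproduce the parameter count reported in the article.

```python
import numpy as np

def random_rotors(num_groups, rng):
    """Draw one unit rotor (w, x, y, z) per 3-D group: 4 parameters each.

    Assumption: rotors are sampled as normalized Gaussian 4-vectors.
    """
    q = rng.standard_normal((num_groups, 4))
    return q / np.linalg.norm(q, axis=1, keepdims=True)

def apply_rotors(vec, rotors, inverse=False):
    """Rotate each consecutive 3-D group of `vec` with its own rotor.

    `vec` is zero-padded to 3 * num_groups entries (an assumption of this
    sketch), so the result always has length 3 * num_groups.
    """
    g = rotors.shape[0]
    padded = np.zeros(3 * g)
    padded[: vec.shape[0]] = vec
    v = padded.reshape(g, 3)

    w = rotors[:, :1]   # scalar part, shape (g, 1)
    u = rotors[:, 1:]   # bivector part written as a 3-vector, shape (g, 3)
    if inverse:
        u = -u          # reversing a unit rotor gives its inverse

    # Sandwich product R v R~ expanded into cross products:
    # v' = v + 2w (u x v) + 2 u x (u x v)
    t = 2.0 * np.cross(u, v)
    return (v + w * t + np.cross(u, t)).reshape(-1)

rng = np.random.default_rng(0)
d = 128
x = rng.standard_normal(d)
rotors = random_rotors(-(-d // 3), rng)   # ceil(d / 3) groups
y = apply_rotors(x, rotors)
x_back = apply_rotors(y, rotors, inverse=True)[:d]
```

Each group costs a handful of multiply-adds that do not depend on d, so the total work grows linearly with d rather than quadratically as in a dense d×d matmul.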
Results and Performance
Tests on Qwen2.5-3B-Instruct KV cache show:
- Cosine similarity: 0.990 (vs 0.991 for TurboQuant)
- 44x fewer parameters (372 vs 16,399 for d=128)
- Fused CUDA kernel: 10-19x faster than cuBLAS matmul on RTX PRO 4000
- Fused Metal shader: 9-31x faster on Apple M4
- Perfect performance in needle-in-haystack tests
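For readers unfamiliar with the first metric, the snippet below shows how a cosine similarity between an original vector and its quantized reconstruction can be computed. It is a toy sketch: `toy_quantize` is a hypothetical uniform 4-bit scalar quantizer used only as a stand-in, not the quantizer used by RotorQuant or TurboQuant, and the random input does not represent real KV cache data.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def toy_quantize(x, bits=4):
    """Uniform scalar quantizer over [min(x), max(x)] (illustrative stand-in)."""
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (2 ** bits - 1)
    codes = np.round((x - lo) / step)   # integer codes in [0, 2**bits - 1]
    return lo + codes * step            # dequantized reconstruction

rng = np.random.default_rng(0)
x = rng.standard_normal(128)            # stand-in for one KV cache vector
score = cosine_similarity(x, toy_quantize(x))
```

A score close to 1 means the reconstruction points in almost the same direction as the original, which is the property that matters for attention dot products.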
The implementation relies on fused kernels that keep data in registers and avoid extra memory accesses, which lets it outperform TurboQuant even though the latter is already optimized.
Implications
RotorQuant is a promising step forward in vector quantization, offering a large speedup and a much smaller parameter footprint at nearly the same cosine similarity. This could have a notable impact on LLM inference, especially in resource-constrained settings.