Large Language Models (LLMs) based on Mixture of Experts (MoE) architectures offer high performance, but their size poses a challenge for deployment in resource-constrained environments.

KBVQ-MoE: A New Approach to Quantization

KBVQ-MoE is a framework that aims to improve low-bit quantization for MoE-based LLMs. It addresses two main issues:

  1. Redundancy among experts: KBVQ-MoE uses a Karhunen-Loève Transform (KLT), guided by singular value decomposition (SVD), to extract dominant weight components that are shared across experts, eliminating redundant per-expert representations (a sketch of this step follows the list).
  2. Cumulative quantization bias: Vector quantization is applied only to the expert-specific (non-redundant) representations, and the quantized outputs are corrected via channel-wise affine compensation (a sketch of this step also follows the list).

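The following is a minimal sketch of step 1, assuming the shared components are obtained by stacking the expert weight matrices and keeping the top singular directions; the function name extract_shared_basis, the rank hyperparameter, and the toy shapes are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def extract_shared_basis(expert_weights, rank):
    """Stack expert weight matrices, take the top-`rank` right singular
    vectors as a shared (KLT/SVD-style) basis, and return the basis,
    the shared reconstructions, and the expert-specific residuals.

    expert_weights: list of (d_out, d_in) arrays, one per expert.
    """
    stacked = np.concatenate(expert_weights, axis=0)   # (num_experts * d_out, d_in)
    # SVD of the stacked weights; rows of vt span the dominant shared subspace.
    _, _, vt = np.linalg.svd(stacked, full_matrices=False)
    basis = vt[:rank]                                  # (rank, d_in), shared across experts

    shared_parts, residuals = [], []
    for w in expert_weights:
        coeffs = w @ basis.T                           # per-expert coordinates in the shared basis
        shared = coeffs @ basis                        # reconstruction from shared components
        shared_parts.append(shared)
        residuals.append(w - shared)                   # expert-specific part, left for quantization
    return basis, shared_parts, residuals

# Toy usage: 4 experts with 64x128 weights, keeping a rank-16 shared subspace.
rng = np.random.default_rng(0)
experts = [rng.standard_normal((64, 128)) for _ in range(4)]
basis, shared, resid = extract_shared_basis(experts, rank=16)
print(basis.shape, resid[0].shape)                     # (16, 128) (64, 128)
```
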
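And a similarly hedged sketch of step 2, assuming a plain k-means codebook over short weight sub-vectors and a least-squares scale and bias fitted per output channel; the group size, codebook size, and fitting procedure are assumptions for illustration and may differ from KBVQ-MoE's actual scheme.

```python
import numpy as np

def vector_quantize_with_compensation(residual, group=4, codebook_size=16, iters=20):
    """Quantize a (d_out, d_in) expert-specific residual with a k-means codebook
    over length-`group` sub-vectors, then apply channel-wise affine compensation
    (one scale and bias per output channel) to reduce the accumulated bias.
    Illustrative only: KBVQ-MoE's exact codebook construction may differ."""
    d_out, d_in = residual.shape
    assert d_in % group == 0
    vecs = residual.reshape(-1, group)                 # sub-vectors to quantize

    # Plain k-means (Lloyd's algorithm) to learn the codebook.
    rng = np.random.default_rng(0)
    codebook = vecs[rng.choice(len(vecs), codebook_size, replace=False)]
    for _ in range(iters):
        dists = ((vecs[:, None, :] - codebook[None]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for k in range(codebook_size):
            members = vecs[assign == k]
            if len(members):
                codebook[k] = members.mean(0)

    deq = codebook[assign].reshape(d_out, d_in)        # dequantized residual

    # Channel-wise affine compensation: least-squares scale/bias per output row,
    # fitting residual[i] ~ scale[i] * deq[i] + bias[i].
    scale = np.ones(d_out)
    bias = np.zeros(d_out)
    for i in range(d_out):
        a = np.stack([deq[i], np.ones(d_in)], axis=1)
        (scale[i], bias[i]), *_ = np.linalg.lstsq(a, residual[i], rcond=None)
    corrected = scale[:, None] * deq + bias[:, None]
    return corrected, codebook, assign.reshape(d_out, -1), scale, bias

# Toy usage on a synthetic expert-specific residual.
rng = np.random.default_rng(1)
res = 0.05 * rng.standard_normal((64, 128))
corrected, *_ = vector_quantize_with_compensation(res)
print(float(np.abs(res - corrected).mean()))           # mean quantization error after compensation
```
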
Experimental Results

Experimental results on various MoE LLMs demonstrate that KBVQ-MoE preserves accuracy significantly better than existing quantization methods. For example, 3-bit quantization of Qwen1.5-MoE-A2.7B achieves an average accuracy of 67.99%, nearly identical to the FP16 baseline of 68.07%. This highlights KBVQ-MoE's potential for efficient deployment on edge devices and other resource-constrained platforms.