Large Language Models (LLMs) based on Mixture of Experts (MoE) architectures offer high performance, but their size poses a challenge for deployment in resource-constrained environments.
KBVQ-MoE: A New Approach to Quantization
KBVQ-MoE is a framework that aims to improve low-bit quantization for MoE-based LLMs. It addresses two main issues:
- Redundancy among experts: KBVQ-MoE uses a Karhunen-Loève Transform (KLT) guided by singular value decomposition (SVD) to extract and share dominant weight components across experts, eliminating redundant representations.
- Cumulative bias: Vector quantization is applied only to expert-specific (non-redundant) representations, and the quantized outputs are corrected via channel-wise affine compensation; a minimal sketch of both ideas follows this list.
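To make these two steps concrete, here is a minimal NumPy sketch under stated assumptions, not the authors' implementation: each expert weight matrix is modeled as a shared dominant component (obtained via SVD, the standard finite-dimensional way to realize a KLT) plus an expert-specific residual; only the residual is quantized, using a small k-means codebook as a stand-in for the paper's vector quantizer, and a per-output-channel scale and shift is then fit as a simple form of channel-wise affine compensation. All function names, the rank, group size, and codebook size are illustrative assumptions.

```python
# Hypothetical sketch of the two ideas described above (not the KBVQ-MoE code):
# shared low-rank component via SVD + vector-quantized expert-specific residual
# + per-channel affine correction.
import numpy as np

rng = np.random.default_rng(0)

def split_shared_and_residual(expert_weights, rank):
    """Extract a shared dominant subspace across experts via SVD.

    expert_weights: list of (out_dim, in_dim) matrices, one per expert.
    Returns the shared basis, the shared components, and per-expert residuals.
    """
    stacked = np.concatenate(expert_weights, axis=0)         # (E*out, in)
    _, _, vt = np.linalg.svd(stacked, full_matrices=False)   # right singular vectors
    basis = vt[:rank]                                         # (rank, in) shared basis
    shared = [w @ basis.T @ basis for w in expert_weights]    # projection onto basis
    residuals = [w - s for w, s in zip(expert_weights, shared)]
    return basis, shared, residuals

def vector_quantize(residual, group=4, codebook_size=256, iters=10):
    """Quantize a residual matrix with a k-means codebook over length-`group`
    sub-vectors (a minimal stand-in for a trained vector quantizer)."""
    out_dim, in_dim = residual.shape
    vecs = residual.reshape(-1, group)
    codebook = vecs[rng.choice(len(vecs), codebook_size, replace=False)]
    for _ in range(iters):
        dists = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for k in range(codebook_size):
            members = vecs[assign == k]
            if len(members):
                codebook[k] = members.mean(0)
    return codebook[assign].reshape(out_dim, in_dim)

def channelwise_affine_correction(original, quantized):
    """Fit a per-output-channel scale/shift so quantized rows best match the
    original rows in a least-squares sense (simple affine compensation)."""
    scale = np.ones(original.shape[0])
    shift = np.zeros(original.shape[0])
    for c in range(original.shape[0]):
        A = np.stack([quantized[c], np.ones_like(quantized[c])], axis=1)
        (scale[c], shift[c]), *_ = np.linalg.lstsq(A, original[c], rcond=None)
    return scale[:, None] * quantized + shift[:, None]

# Toy usage: 4 experts that share a strong common structure.
common = rng.normal(size=(64, 128))
experts = [common + 0.1 * rng.normal(size=(64, 128)) for _ in range(4)]
basis, shared, residuals = split_shared_and_residual(experts, rank=16)
recon = [s + channelwise_affine_correction(r, vector_quantize(r))
         for s, r in zip(shared, residuals)]
err = np.mean([np.linalg.norm(w - w_hat) / np.linalg.norm(w)
               for w, w_hat in zip(experts, recon)])
print(f"mean relative reconstruction error: {err:.4f}")
```

In this toy setup the shared basis is stored once and reused by every expert, while each expert keeps only codebook indices for its residual; that separation of shared and expert-specific information is what the bullets above describe, though the paper's actual quantizer and compensation scheme may differ in detail.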
Experimental Results
Experimental results on various MoE LLMs demonstrate that KBVQ-MoE preserves accuracy significantly better than existing quantization methods. For example, 3-bit quantization of Qwen1.5-MoE-A2.7B achieves an average accuracy of 67.99%, nearly identical to the FP16 baseline of 68.07%. This highlights KBVQ-MoE's potential for efficient deployment on edge devices and other resource-constrained platforms.