Large Language Models (LLMs) based on Mixture of Experts (MoE) architectures offer high performance, but their size poses a challenge for deployment in resource-constrained environments.

KBVQ-MoE: A New Approach to Quantization

KBVQ-MoE is a framework that aims to improve low-bit quantization for MoE-based LLMs. It addresses two main issues:

  1. Redundancy among experts: KBVQ-MoE uses a Karhunen-Loève Transform (KLT), guided by singular value decomposition (SVD), to extract dominant weight components that are shared across experts, eliminating redundant per-expert representations (a sketch of this step follows the list).
  2. Cumulative quantization bias: Vector quantization is applied only to the expert-specific (non-redundant) representations, and the quantized outputs are corrected via channel-wise affine compensation (a sketch of this step also follows the list).

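The following is a minimal sketch of step 1, assuming the shared components are obtained by stacking the expert weight matrices and keeping the top singular directions; the function name extract_shared_basis, the rank hyperparameter, and the toy shapes are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def extract_shared_basis(expert_weights, rank):
    """Stack expert weight matrices, take the top-`rank` right singular
    vectors as a shared (KLT/SVD-style) basis, and return the basis,
    the shared reconstructions, and the expert-specific residuals.

    expert_weights: list of (d_out, d_in) arrays, one per expert.
    """
    stacked = np.concatenate(expert_weights, axis=0)   # (num_experts * d_out, d_in)
    # SVD of the stacked weights; rows of vt span the dominant shared subspace.
    _, _, vt = np.linalg.svd(stacked, full_matrices=False)
    basis = vt[:rank]                                  # (rank, d_in), shared across experts

    shared_parts, residuals = [], []
    for w in expert_weights:
        coeffs = w @ basis.T                           # per-expert coordinates in the shared basis
        shared = coeffs @ basis                        # reconstruction from shared components
        shared_parts.append(shared)
        residuals.append(w - shared)                   # expert-specific part, left for quantization
    return basis, shared_parts, residuals

# Toy usage: 4 experts with 64x128 weights, keeping a rank-16 shared subspace.
rng = np.random.default_rng(0)
experts = [rng.standard_normal((64, 128)) for _ in range(4)]
basis, shared, resid = extract_shared_basis(experts, rank=16)
print(basis.shape, resid[0].shape)                     # (16, 128) (64, 128)
```
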
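And a similarly hedged sketch of step 2, assuming a plain k-means codebook over short weight sub-vectors and a least-squares scale and bias fitted per output channel; the group size, codebook size, and fitting procedure are assumptions for illustration and may differ from KBVQ-MoE's actual scheme.

```python
import numpy as np

def vector_quantize_with_compensation(residual, group=4, codebook_size=16, iters=20):
    """Quantize a (d_out, d_in) expert-specific residual with a k-means codebook
    over length-`group` sub-vectors, then apply channel-wise affine compensation
    (one scale and bias per output channel) to reduce the accumulated bias.
    Illustrative only: KBVQ-MoE's exact codebook construction may differ."""
    d_out, d_in = residual.shape
    assert d_in % group == 0
    vecs = residual.reshape(-1, group)                 # sub-vectors to quantize

    # Plain k-means (Lloyd's algorithm) to learn the codebook.
    rng = np.random.default_rng(0)
    codebook = vecs[rng.choice(len(vecs), codebook_size, replace=False)]
    for _ in range(iters):
        dists = ((vecs[:, None, :] - codebook[None]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for k in range(codebook_size):
            members = vecs[assign == k]
            if len(members):
                codebook[k] = members.mean(0)

    deq = codebook[assign].reshape(d_out, d_in)        # dequantized residual

    # Channel-wise affine compensation: least-squares scale/bias per output row,
    # fitting residual[i] ~ scale[i] * deq[i] + bias[i].
    scale = np.ones(d_out)
    bias = np.zeros(d_out)
    for i in range(d_out):
        a = np.stack([deq[i], np.ones(d_in)], axis=1)
        (scale[i], bias[i]), *_ = np.linalg.lstsq(a, residual[i], rcond=None)
    corrected = scale[:, None] * deq + bias[:, None]
    return corrected, codebook, assign.reshape(d_out, -1), scale, bias

# Toy usage on a synthetic expert-specific residual.
rng = np.random.default_rng(1)
res = 0.05 * rng.standard_normal((64, 128))
corrected, *_ = vector_quantize_with_compensation(res)
print(float(np.abs(res - corrected).mean()))           # mean quantization error after compensation
```
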
Experimental Results

Experimental results on various MoE LLMs demonstrate that KBVQ-MoE preserves accuracy significantly better than existing quantization methods. For example, 3-bit quantization of Qwen1.5-MoE-A2.7B achieves an average accuracy of 67.99%, nearly identical to the FP16 baseline of 68.07%. This highlights KBVQ-MoE's potential for efficient deployment on edge devices and other resource-constrained platforms.