Mixture-of-Experts (MoE) models have gained popularity as a means of scaling large language models (LLMs) while maintaining sparse activations and reduced per-token compute.

However, in memory-constrained inference settings, expert weights must be offloaded to CPU memory, and the resulting CPU-GPU transfers become a performance bottleneck during decoding. A new study proposes an expert prefetching scheme that leverages the model's currently computed internal representations to speculate on which experts will be needed next, allowing memory transfers to overlap with computation.
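To make the idea concrete, here is a minimal toy sketch (not the paper's implementation) of the core mechanism: an intermediate hidden state is fed through a later layer's router ahead of time, and the top-k experts it selects are prefetched while the current layer's compute proceeds. All names (`router_next`, `top_k`, the perturbation modeling the residual update) are illustrative assumptions, and the "transfer" is only simulated.

```python
import random

def top_k(scores, k):
    """Indices of the k largest scores (toy stand-in for router top-k)."""
    return set(sorted(range(len(scores)), key=lambda i: scores[i])[-k:])

def route(hidden, router):
    """Router logits: dot product of the hidden state with each expert's gate vector."""
    return [sum(h * w for h, w in zip(hidden, gate)) for gate in router]

random.seed(0)
d, n_experts, k = 16, 8, 2
# Hypothetical next-layer router (one gate vector per expert).
router_next = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_experts)]

# Speculate with the representation available NOW, before the next layer runs,
# so the CPU->GPU expert transfer can overlap with current-layer compute.
hidden = [random.gauss(0, 1) for _ in range(d)]
speculated = top_k(route(hidden, router_next), k)  # start prefetching these

# ... current layer's compute would proceed here while weights stream in ...

# The true routing decision uses the slightly updated residual stream.
hidden_true = [h + 0.01 * random.gauss(0, 1) for h in hidden]
actual = top_k(route(hidden_true, router_next), k)

hit_rate = len(speculated & actual) / k  # fraction of prefetches that were useful
print(f"speculative hit rate: {hit_rate:.2f}")
```

Because the residual update between adjacent layers is small relative to the hidden state, the speculated and true expert sets tend to agree, which is the property the prefetching scheme relies on.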

Speculating Experts: More Efficient Inference

The technique, called Speculating Experts, demonstrates that future experts can be reliably predicted from these internal representations. Moreover, executing the speculated experts directly generally maintains downstream task accuracy, which preserves the compute-memory overlap by eliminating the need to re-fetch the experts the router actually selects.

Integrated into an optimized inference engine, the approach achieves up to a 14% reduction in time per output token (TPOT) compared with on-demand loading of experts from CPU memory. For MoEs where speculative execution alone yields suboptimal accuracy, the study examines lightweight estimators that improve expert prediction hit rates and thereby reduce the performance degradation.

The project's code is released as open source on GitHub.