Efficient MoE Inference on Edge Devices
Mixture-of-Experts (MoE) models offer scalable performance but impose steep memory requirements, especially on resource-constrained edge devices. Existing offloading strategies often suffer from I/O bottlenecks because autoregressive expert activation is dynamic and carries little advance information for prefetching.
MoE-SpAc repurposes Speculative Decoding (SD) as a predictive sensor for memory management. The framework integrates a Speculative Utility Estimator to track expert demand, a Heterogeneous Workload Balancer to dynamically partition computation via online integer optimization, and an Asynchronous Execution Engine to unify prefetching and eviction in the same utility space.
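The idea of driving prefetching and eviction from a shared utility signal can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: the class names (`SpeculativeUtilityEstimator`, `ExpertCache`), the exponential-moving-average scoring, and the top-k residency policy are all assumptions made for clarity.

```python
from collections import defaultdict

class SpeculativeUtilityEstimator:
    """Hypothetical sketch: score expert demand from draft-token activations.

    Each speculative (draft) step reveals which experts the drafted tokens
    would route to; an exponential moving average turns those observations
    into a per-expert utility score.
    """
    def __init__(self, num_experts, decay=0.9):
        self.decay = decay
        self.utility = [0.0] * num_experts

    def observe(self, draft_expert_ids):
        counts = defaultdict(int)
        for e in draft_expert_ids:
            counts[e] += 1
        for e in range(len(self.utility)):
            self.utility[e] = (self.decay * self.utility[e]
                               + (1 - self.decay) * counts[e])

class ExpertCache:
    """Unified prefetch/eviction decisions in the same utility space."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = set()

    def plan(self, utility):
        # Keep the top-`capacity` experts by estimated utility resident;
        # the set differences yield the prefetch and eviction lists.
        ranked = sorted(range(len(utility)),
                        key=lambda e: utility[e], reverse=True)
        target = set(ranked[:self.capacity])
        prefetch = target - self.resident
        evict = self.resident - target
        self.resident = target
        return prefetch, evict
```

Because both prefetching and eviction are ranked by the same utility score, the two decisions can never conflict: an expert is evicted only when another expert with strictly higher (or tied) estimated demand displaces it.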
Experimental results on seven benchmarks demonstrate that MoE-SpAc achieves a 42% improvement in tokens per second (TPS) over the state-of-the-art SD-based baseline and a 4.04x average speedup over standard baselines. The code is available on GitHub.