Efficient MoE Inference on Edge Devices
Mixture-of-Experts (MoE) models offer scalable performance but impose steep memory requirements, especially on resource-constrained edge devices. Existing offloading strategies often suffer from I/O bottlenecks because autoregressive expert activation is dynamic and carries little advance information for prefetching.
MoE-SpAc repurposes Speculative Decoding (SD) as a predictive sensor for memory management. The framework integrates a Speculative Utility Estimator to track expert demand, a Heterogeneous Workload Balancer to dynamically partition computation via online integer optimization, and an Asynchronous Execution Engine to unify prefetching and eviction in the same utility space.
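The idea of driving prefetching and eviction from a shared utility signal can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: the class names (`SpeculativeUtilityEstimator`, `ExpertCache`), the exponential-moving-average scoring, and the top-k residency policy are all assumptions made for clarity.

```python
from collections import defaultdict

class SpeculativeUtilityEstimator:
    """Hypothetical sketch: score expert demand from draft-token activations.

    Each speculative (draft) step reveals which experts the drafted tokens
    would route to; an exponential moving average turns those observations
    into a per-expert utility score.
    """
    def __init__(self, num_experts, decay=0.9):
        self.decay = decay
        self.utility = [0.0] * num_experts

    def observe(self, draft_expert_ids):
        counts = defaultdict(int)
        for e in draft_expert_ids:
            counts[e] += 1
        for e in range(len(self.utility)):
            self.utility[e] = (self.decay * self.utility[e]
                               + (1 - self.decay) * counts[e])

class ExpertCache:
    """Unified prefetch/eviction decisions in the same utility space."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = set()

    def plan(self, utility):
        # Keep the top-`capacity` experts by estimated utility resident;
        # the set differences yield the prefetch and eviction lists.
        ranked = sorted(range(len(utility)),
                        key=lambda e: utility[e], reverse=True)
        target = set(ranked[:self.capacity])
        prefetch = target - self.resident
        evict = self.resident - target
        self.resident = target
        return prefetch, evict
```

Because both prefetching and eviction are ranked by the same utility score, the two decisions can never conflict: an expert is evicted only when another expert with strictly higher (or tied) estimated demand displaces it.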
Experimental results on seven benchmarks demonstrate that MoE-SpAc achieves a 42% improvement in tokens per second (TPS) over the state-of-the-art SD-based baseline and a 4.04x average speedup over standard baselines. The code is available on GitHub.