## Adaptive-K: efficient routing for MoE models

Adaptive-K is a new routing scheme designed to reduce the computational cost of Mixture of Experts (MoE) models. Initial results report compute savings of 30% to 52% on models such as Mixtral, Qwen, and OLMoE.

## Resources and implementation

The source code is available on GitHub, and a live demo can be tried on Hugging Face. NVIDIA engineers are evaluating the integration of Adaptive-K into TensorRT-LLM, as the corresponding pull request shows.

MoE models, as the name suggests, combine several specialised sub-networks (the "experts"), typically the feed-forward blocks inside each transformer layer, and activate only a few of them for each input token. Routing is the step that decides which experts process each token, with the aim of preserving accuracy while keeping compute low.
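The article does not explain how Adaptive-K chooses how many experts to use, so the sketch below is only an illustration under stated assumptions. It contrasts conventional fixed top-k routing, where every token uses exactly `k` experts, with a hypothetical adaptive variant that keeps adding experts until their cumulative router probability reaches a `threshold` (similar in spirit to top-p sampling). All names (`top_k_route`, `adaptive_route`, `expert_compute_saving`), the threshold value and the example logits are invented for illustration and are not taken from the Adaptive-K code. The saving estimate also assumes that expert compute scales with the number of experts evaluated per token, ignoring attention and router overhead.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Conventional fixed top-k routing: always pick the k most probable
    experts and renormalise their weights so they sum to 1."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in ranked)
    return [(i, probs[i] / mass) for i in ranked]


def adaptive_route(logits: List[float], threshold: float,
                   k_max: int) -> List[Tuple[int, float]]:
    """Hypothetical adaptive routing (NOT necessarily Adaptive-K's method):
    add experts in order of probability until their cumulative probability
    reaches `threshold`, never using more than `k_max` experts."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen: List[int] = []
    cumulative = 0.0
    for i in ranked[:k_max]:
        chosen.append(i)
        cumulative += probs[i]
        if cumulative >= threshold:
            break
    # Renormalise the selected experts' weights so they sum to 1.
    return [(i, probs[i] / cumulative) for i in chosen]


def expert_compute_saving(tokens: List[List[float]], k: int,
                          threshold: float) -> float:
    """Fraction of expert evaluations saved versus fixed top-k, assuming
    (hypothetically) that compute is proportional to experts evaluated."""
    used = sum(len(adaptive_route(t, threshold, k_max=k)) for t in tokens)
    return 1.0 - used / (k * len(tokens))


if __name__ == "__main__":
    confident = [4.0, 0.5, 0.2, 0.1]  # one expert clearly dominates
    uncertain = [1.0, 0.9, 0.8, 0.7]  # probability spread across experts

    print(len(top_k_route(confident, k=2)))                        # 2
    print(len(adaptive_route(confident, threshold=0.8, k_max=2)))  # 1
    print(len(adaptive_route(uncertain, threshold=0.8, k_max=2)))  # 2
    print(expert_compute_saving([confident, uncertain], k=2, threshold=0.8))  # 0.25
```

In this toy case the confident token needs only one expert, so two tokens use three expert evaluations instead of four, a 25% reduction relative to fixed top-2 routing. The reported 30% to 52% figures come from the project's own measurements, not from this sketch.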

