In a standard transformer, every token passes through every parameter of the feed-forward network (FFN) in each layer. MoE replaces that single FFN with N parallel expert FFNs and a router that selects the top-K experts for each token. This decouples model capacity (total parameters) from per-token compute cost (active parameters).
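To make the decoupling concrete, here is a back-of-the-envelope sketch in Python. The per-expert FFN size and the 8-expert/top-2 config are illustrative assumptions, not taken from any model discussed below.

```python
# Illustrative parameter arithmetic for one MoE layer (hypothetical sizes).
ffn_params_per_expert = 150_000_000  # assumed size of a single expert FFN
n_experts, top_k = 8, 2              # assumed MoE configuration

total_ffn_params = n_experts * ffn_params_per_expert   # parameters you must store
active_ffn_params = top_k * ffn_params_per_expert      # parameters each token actually uses

print(f"stored FFN params:  {total_ffn_params / 1e9:.1f}B")   # 1.2B
print(f"active FFN params:  {active_ffn_params / 1e9:.1f}B")  # 0.3B
# 4x the capacity of a dense FFN of the same active size, at roughly the same per-token compute.
```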
How the Router Works
The router is a small linear layer that takes the token's hidden state as input and outputs N logits (one per expert). The K experts with the highest scores are selected (K=2 in Mixtral; some models route to more, e.g. 8 in DeepSeek-V3, or fewer, e.g. 1 in Llama 4). Only the selected experts are evaluated for that token, although every expert's weights still have to be resident in memory. The selected experts' outputs are weighted by the normalised router scores and summed. The router is trainable and learns to specialise experts during pretraining.
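As a sketch, the routing step can be written in a few lines of PyTorch. This is a minimal illustration of the mechanism described above, not the implementation used by any particular model; the class and argument names are made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Minimal top-K router sketch: one logit per expert, keep the K best."""

    def __init__(self, hidden_size: int, n_experts: int, top_k: int = 2):
        super().__init__()
        # The router is just a linear map from the hidden state to one logit per expert.
        self.gate = nn.Linear(hidden_size, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor):
        # x: (n_tokens, hidden_size)
        logits = self.gate(x)                                    # (n_tokens, n_experts)
        weights, expert_ids = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                     # normalise over the chosen K
        return weights, expert_ids                               # each (n_tokens, top_k)

# The MoE layer then computes, per token: sum over k of weights[:, k] * expert[expert_ids[:, k]](x).
```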
Notable MoE Models
| Model | Total Params | Active Params | Experts | Top-K |
|---|---|---|---|---|
| Mixtral 8×7B | 46.7B | 12.9B | 8 | 2 |
| Mixtral 8×22B | 141B | 39B | 8 | 2 |
| DeepSeek-V3 | 671B | 37B | 256 | 8 |
| Llama 4 Scout | 109B | 17B | 16 | 1 |
| Llama 4 Maverick | 400B | 17B | 128 | 1 |
| Grok-1 | 314B | ~86B | 8 | 2 |
On-Premise Challenges
MoE models require loading all expert weights into memory even though only K are active per token. Mixtral 8×7B needs roughly 94GB in FP16 (around 47GB even at 8-bit), more than a single 80GB A100. Strategies (a rough footprint estimator follows the list):
- GGUF Q4_K_M: Mixtral 8×7B fits in ~26GB — two consumer 16GB GPUs or one 32GB workstation GPU.
- Expert offloading: keep expert weights in CPU RAM, either by holding whole layers or expert tensors on the CPU (as llama.cpp's partial offload does) or by transferring only the active experts to the GPU per token. Slower but functional.
- Multi-GPU tensor parallelism: Split experts across GPUs with vLLM. Requires NVLink for best performance.
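For planning, a quick Python estimate of the weight footprint under different precisions is useful. This counts weights only; KV cache, activations, and runtime overhead come on top, and the bits-per-weight figures are approximations (Q4_K_M averages roughly 4.5 bits per weight).

```python
# Rough weight-only VRAM estimate for MoE checkpoints, assuming all experts must be resident.
def weight_memory_gb(total_params_billions: float, bits_per_param: float) -> float:
    return total_params_billions * 1e9 * bits_per_param / 8 / 1e9

models = [("Mixtral 8x7B", 46.7), ("Mixtral 8x22B", 141.0)]
formats = [("FP16", 16.0), ("8-bit", 8.0), ("Q4_K_M (~4.5 bpw)", 4.5)]

for name, params_b in models:
    for fmt, bits in formats:
        print(f"{name:>14} {fmt:>18}: ~{weight_memory_gb(params_b, bits):.0f} GB")
# Mixtral 8x7B comes out at ~93 GB in FP16, ~47 GB at 8-bit, and ~26 GB at ~4.5 bpw,
# which is why the quantisation and multi-GPU strategies above are needed.
```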
Why It Matters for On-Premise
MoE gives you the quality of a much larger dense model at the inference cost of a much smaller one. DeepSeek-V3 (671B total, 37B active) matches GPT-4o-level quality on many benchmarks while using per-token compute comparable to a 37B dense model. For on-premise, a well-quantised Mixtral 8×7B running on two consumer GPUs significantly outperforms a dense 7B model, making it one of the best quality-per-watt architectures for many deployments.
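One way to see the compute claim is the common rule of thumb that a forward pass costs roughly 2 FLOPs per active parameter per token, so per-token cost tracks active rather than total parameters. The sketch below uses that approximation and ignores attention overhead at long context lengths.

```python
# Approximate per-token forward-pass compute, using the ~2 FLOPs per active parameter rule of thumb.
def flops_per_token(active_params_billions: float) -> float:
    return 2.0 * active_params_billions * 1e9

configs = [("DeepSeek-V3 (MoE)", 37, 671), ("Dense 37B", 37, 37), ("Dense 671B", 671, 671)]
for name, active_b, total_b in configs:
    print(f"{name:>17}: ~{flops_per_token(active_b) / 1e9:.0f} GFLOPs/token, {total_b}B params stored")
# The MoE model pays the per-token compute of a 37B dense model while storing 671B parameters.
```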