In a standard transformer, every token passes through every parameter of the feed-forward network (FFN) in each layer. MoE replaces that single FFN with N parallel expert FFNs and a router that selects the top-K experts for each token. This decouples model capacity (total parameters) from per-token compute cost (active parameters).
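To make the decoupling concrete, here is a back-of-the-envelope sketch in Python. The per-expert FFN size and the 8-expert/top-2 config are illustrative assumptions, not taken from any model discussed below.

```python
# Illustrative parameter arithmetic for one MoE layer (hypothetical sizes).
ffn_params_per_expert = 150_000_000  # assumed size of a single expert FFN
n_experts, top_k = 8, 2              # assumed MoE configuration

total_ffn_params = n_experts * ffn_params_per_expert   # parameters you must store
active_ffn_params = top_k * ffn_params_per_expert      # parameters each token actually uses

print(f"stored FFN params:  {total_ffn_params / 1e9:.1f}B")   # 1.2B
print(f"active FFN params:  {active_ffn_params / 1e9:.1f}B")  # 0.3B
# 4x the capacity of a dense FFN of the same active size, at roughly the same per-token compute.
```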
How the Router Works
The router is a small linear layer that takes the token's hidden state as input and outputs N logits (one per expert). The K experts with the highest scores are selected (K=2 in Mixtral; some models route to more, e.g. 8 in DeepSeek-V3, or fewer, e.g. 1 in Llama 4). Only the selected experts are evaluated for that token, although every expert's weights still have to be resident in memory. The selected experts' outputs are weighted by the normalised router scores and summed. The router is trainable and learns to specialise experts during pretraining.
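As a sketch, the routing step can be written in a few lines of PyTorch. This is a minimal illustration of the mechanism described above, not the implementation used by any particular model; the class and argument names are made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Minimal top-K router sketch: one logit per expert, keep the K best."""

    def __init__(self, hidden_size: int, n_experts: int, top_k: int = 2):
        super().__init__()
        # The router is just a linear map from the hidden state to one logit per expert.
        self.gate = nn.Linear(hidden_size, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor):
        # x: (n_tokens, hidden_size)
        logits = self.gate(x)                                    # (n_tokens, n_experts)
        weights, expert_ids = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                     # normalise over the chosen K
        return weights, expert_ids                               # each (n_tokens, top_k)

# The MoE layer then computes, per token: sum over k of weights[:, k] * expert[expert_ids[:, k]](x).
```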
Notable MoE Models
| Model | Total Params | Active Params | Experts | Top-K |
|---|---|---|---|---|
| Mixtral 8×7B | 46.7B | 12.9B | 8 | 2 |
| Mixtral 8×22B | 141B | 39B | 8 | 2 |
| DeepSeek-V3 | 671B | 37B | 256 | 8 |
| Llama 4 Scout | 109B | 17B | 16 | 1 |
| Llama 4 Maverick | 400B | 17B | 128 | 1 |
| Grok-1 | 314B | ~86B | 8 | 2 |
On-Premise Challenges
MoE models require loading all expert weights into memory even though only K are active per token. Mixtral 8×7B needs roughly 94GB in FP16 (around 47GB even at 8-bit), more than a single 80GB A100. Strategies (a rough footprint estimator follows the list):
- GGUF Q4_K_M: Mixtral 8×7B fits in ~26GB — two consumer 16GB GPUs or one 32GB workstation GPU.
- Expert offloading: keep expert weights in CPU RAM, either by holding whole layers or expert tensors on the CPU (as llama.cpp's partial offload does) or by transferring only the active experts to the GPU per token. Slower but functional.
- Multi-GPU tensor parallelism: Split experts across GPUs with vLLM. Requires NVLink for best performance.
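For planning, a quick Python estimate of the weight footprint under different precisions is useful. This counts weights only; KV cache, activations, and runtime overhead come on top, and the bits-per-weight figures are approximations (Q4_K_M averages roughly 4.5 bits per weight).

```python
# Rough weight-only VRAM estimate for MoE checkpoints, assuming all experts must be resident.
def weight_memory_gb(total_params_billions: float, bits_per_param: float) -> float:
    return total_params_billions * 1e9 * bits_per_param / 8 / 1e9

models = [("Mixtral 8x7B", 46.7), ("Mixtral 8x22B", 141.0)]
formats = [("FP16", 16.0), ("8-bit", 8.0), ("Q4_K_M (~4.5 bpw)", 4.5)]

for name, params_b in models:
    for fmt, bits in formats:
        print(f"{name:>14} {fmt:>18}: ~{weight_memory_gb(params_b, bits):.0f} GB")
# Mixtral 8x7B comes out at ~93 GB in FP16, ~47 GB at 8-bit, and ~26 GB at ~4.5 bpw,
# which is why the quantisation and multi-GPU strategies above are needed.
```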
Why It Matters for On-Premise
MoE gives you the quality of a much larger dense model at the inference cost of a much smaller one. DeepSeek-V3 (671B total, 37B active) matches GPT-4o-level quality on many benchmarks while using per-token compute comparable to a 37B dense model. For on-premise, a well-quantised Mixtral 8×7B running on two consumer GPUs significantly outperforms a dense 7B model, making it one of the best quality-per-watt architectures for many deployments.
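One way to see the compute claim is the common rule of thumb that a forward pass costs roughly 2 FLOPs per active parameter per token, so per-token cost tracks active rather than total parameters. The sketch below uses that approximation and ignores attention overhead at long context lengths.

```python
# Approximate per-token forward-pass compute, using the ~2 FLOPs per active parameter rule of thumb.
def flops_per_token(active_params_billions: float) -> float:
    return 2.0 * active_params_billions * 1e9

configs = [("DeepSeek-V3 (MoE)", 37, 671), ("Dense 37B", 37, 37), ("Dense 671B", 671, 671)]
for name, active_b, total_b in configs:
    print(f"{name:>17}: ~{flops_per_token(active_b) / 1e9:.0f} GFLOPs/token, {total_b}B params stored")
# The MoE model pays the per-token compute of a 37B dense model while storing 671B parameters.
```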