Mixture of Experts (MoE)

Architecture

An architecture where the model has multiple parallel FFN "expert" layers per transformer block, with a router selecting only a subset per token — giving huge parameter counts with low active compute.

In a standard transformer, every token passes through every parameter in the feed-forward network. MoE replaces the single FFN with N parallel experts and a router that selects the top-K for each token. This decouples model capacity (total parameters) from compute cost (active parameters).
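A quick back-of-the-envelope calculation makes the capacity/compute split concrete. This is an illustrative sketch with made-up dimensions, not the configuration of any specific model:

```python
# Sketch: total vs. active FFN parameters for a hypothetical MoE layer.
# All dimensions below are illustrative, not taken from a real model.

def ffn_params(d_model: int, d_ff: int) -> int:
    """Parameters of one FFN expert (up- and down-projection, biases ignored)."""
    return 2 * d_model * d_ff

d_model, d_ff = 4096, 14336    # hypothetical hidden and FFN dims
n_experts, top_k = 8, 2

total = n_experts * ffn_params(d_model, d_ff)   # capacity: all experts exist
active = top_k * ffn_params(d_model, d_ff)      # compute: only routed experts run

print(f"total FFN params:  {total / 1e6:.0f}M")   # grows with n_experts
print(f"active FFN params: {active / 1e6:.0f}M")  # grows only with top_k
```

With 8 experts and top-2 routing, total FFN capacity is 4× the per-token compute; growing `n_experts` raises capacity without touching the compute cost.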

How the Router Works

The router is a small linear layer that takes the token's hidden state as input and outputs N logits (one per expert). The top-K (typically K=2 or K=4) experts with the highest scores are selected, and only those experts are evaluated for that token (all expert weights stay resident in memory, but only K experts' compute runs). Their outputs are weighted by the normalised router scores and summed. The router is trainable and learns to specialise experts during pretraining.
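The routing step above can be sketched in a few lines of numpy. This is a minimal single-token illustration with toy linear "experts", not a real transformer FFN:

```python
# Minimal sketch of top-K MoE routing for one token.
# Experts are toy linear maps here; real experts are full FFN blocks.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

W_router = rng.standard_normal((d_model, n_experts))   # router: one linear layer
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

def moe_forward(h: np.ndarray) -> np.ndarray:
    logits = h @ W_router                    # N scores, one per expert
    top = np.argsort(logits)[-top_k:]        # indices of the K highest-scoring experts
    scores = np.exp(logits[top])
    weights = scores / scores.sum()          # softmax over the selected experts only
    # Only the selected experts are evaluated; outputs are weighted and summed.
    return sum(w * (h @ experts[i]) for w, i in zip(weights, top))

h = rng.standard_normal(d_model)
out = moe_forward(h)
print(out.shape)  # same shape as the input hidden state
```

Note the design choice of normalising the softmax over the selected K scores only; some implementations instead softmax over all N logits before selection.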

Notable MoE Models

Model              Total Params   Active Params   Experts   Top-K
Mixtral 8×7B       46.7B          12.9B           8         2
Mixtral 8×22B      141B           39B             8         2
DeepSeek-V3        671B           37B             256       8
Llama 4 Scout      109B           17B             16        1
Llama 4 Maverick   400B           52B             128       1
Grok 3             ~314B          ~50B

On-Premise Challenges

MoE models require loading all expert weights into memory even though only K are active per token. Mixtral 8×7B needs ~93GB in FP16 (still ~47GB at 8-bit) — far more than a single 40GB A100. Strategies:

  • GGUF Q4_K_M: Mixtral 8×7B fits in ~26GB — two consumer 16GB GPUs or one 32GB workstation GPU.
  • Expert offloading: llama.cpp can offload unused experts to CPU RAM, swapping as needed. Slower but functional.
  • Multi-GPU tensor parallelism: Split experts across GPUs with vLLM. Requires NVLink for best performance.
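The memory figures above follow from a simple bits-per-parameter estimate. The sketch below assumes ~4.5 bits/param as a rough average for Q4_K_M-style quantisation (the actual mixed-precision scheme varies by tensor):

```python
# Rough weight-memory estimates at different precisions.
# 4.5 bits/param for Q4_K_M is an approximation, not an exact figure.

def weight_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate weight size in GB: params * bits / 8 bytes."""
    return params_billions * bits_per_param / 8

mixtral = 46.7  # Mixtral 8x7B total params (billions); all experts resident

print(f"FP16:   {weight_gb(mixtral, 16):.0f} GB")   # ~93 GB
print(f"INT8:   {weight_gb(mixtral, 8):.0f} GB")    # ~47 GB
print(f"Q4_K_M: {weight_gb(mixtral, 4.5):.0f} GB")  # ~26 GB
```

The ~26GB 4-bit estimate is what puts Mixtral 8×7B within reach of two 16GB consumer GPUs, with some headroom needed beyond weights for the KV cache and activations.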

Why It Matters for On-Premise

MoE gives you the quality of a very large dense model at the inference cost of a much smaller one. DeepSeek-V3 (671B total, 37B active) matches GPT-4o quality while using compute equivalent to a 37B dense model. For on-premise, a well-quantised Mixtral 8×7B running on two consumer GPUs outperforms a dense 7B model significantly — making it the best quality-per-watt architecture for many deployments.