⟨MoE⟩ MIXTURE_OF_EXPERTS :: DEPLOYMENT_GUIDE

MoE Models On-Premise

Mixture of Experts models (Qwen3.6-35B-A3B, DeepSeek-V3, Mixtral) behave very differently from dense models on consumer hardware. Understanding sparse routing, active parameter counts, and memory bandwidth is essential before deploying MoE on-premise.

> MOE_ARCHITECTURE_PRIMER

In a standard dense Transformer, every token is processed by every parameter. In a Mixture of Experts model, each token is routed to only a small subset of "expert" FFN blocks by a learned router network.
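
The routing step can be sketched in a few lines. This is a minimal illustration of one common variant (softmax over all experts, keep the top k, renormalize). The function name `route_token` and the example logits are made up for illustration, and real models differ in details such as whether the softmax is applied before or after the top-k selection.

```python
import math

def route_token(router_logits, k):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    # Softmax over all experts (subtract the max for numerical stability).
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the k most probable experts.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize so the selected weights sum to 1.
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# Hypothetical router output for one token across 8 experts.
logits = [0.1, 2.0, -1.0, 1.5, 0.3, -0.5, 0.0, 0.9]
for idx, weight in route_token(logits, k=2):
    print(f"expert {idx}: weight {weight:.3f}")
```

Only the selected experts' FFN blocks run for this token; their outputs are combined using the returned weights.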

This means a model with 35B total parameters may activate only 3.7B parameters per token, so per-token inference compute is close to that of a 4B dense model.

The catch: all experts must be loaded into memory simultaneously, even though most are idle for any given token. The VRAM or RAM requirement is therefore set by total parameters, not active parameters.
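
A rough way to see the split between memory and compute is to work both numbers out from the parameter counts. The sketch below assumes about 4.85 bits per weight for Q4_K_M (an approximate figure that varies by model) and the usual rule of thumb of about 2 FLOPs per active parameter per generated token. The function name is illustrative.

```python
def moe_footprint(total_params_b, active_params_b, bits_per_weight):
    """Contrast what sets memory (total params) with what sets compute (active params)."""
    # Billions of parameters * bytes per parameter = gigabytes of weights.
    weight_gb = total_params_b * bits_per_weight / 8
    active_fraction = active_params_b / total_params_b
    # Rule of thumb: ~2 FLOPs per active parameter per generated token.
    gflops_per_token = 2 * active_params_b
    return weight_gb, active_fraction, gflops_per_token

# 35B total / 3.7B active; 4.85 bits per weight is an assumed Q4_K_M average.
weight_gb, frac, gflops = moe_footprint(35, 3.7, bits_per_weight=4.85)
print(f"weights: ~{weight_gb:.1f} GB, active: {frac:.1%}, compute: ~{gflops:.1f} GFLOPs/token")
```

Under these assumptions, the weights occupy about 21 GB while only about a tenth of them do work on any one token.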

> QWEN3.6-35B-A3.7B ANATOMY
Total parameters: 35B
Active per token: 3.7B
Number of experts: 128
Active experts/token: 8
VRAM needed (Q4_K_M): ~22 GB
Dense equivalent compute: ~4B
Quality vs dense: 34B-class

> MOE_MODELS_AVAILABLE_2026

MODEL | TOTAL / ACTIVE | VRAM (Q4) | LICENSE | TOK/S (RTX 4090) | NOTES
Qwen3.6-35B-A3.7B | 35B / 3.7B | ~22 GB | Apache 2.0 | ~35-45 | Best on single RTX 4090; thinking mode; fits in 24 GB with Q4_K_M
Qwen3.6-30B-A3B | 30B / 3B | ~18 GB | Apache 2.0 | ~40-50 | Fits RTX 4080 Super (16 GB) with Q4_K_S; good for agentic workflows
Mixtral 8x7B (Mistral AI) | 47B / 13B | ~26 GB | Apache 2.0 | ~25-35* | *Needs partial CPU offload on a 24 GB GPU. Proven, widely tested with llama.cpp
Mixtral 8x22B (Mistral AI) | 141B / 39B | ~80 GB+ | Apache 2.0 | ~8-15** | **Requires multi-GPU (2×A6000) or heavy CPU offload. For datacenters.
DeepSeek-V3 (MoE) | 671B / 37B | ~400 GB+ | MIT | N/A* | *Not feasible on consumer hardware. Data center multi-node only (8×H100).

> MOE_SPECIFIC_FAILURE_MODES

Problems that don't appear in cloud demos but surface on local hardware

EXPERT LOADING LATENCY

If experts don't fit in VRAM, llama.cpp offloads some to system RAM. Accessing an offloaded expert then costs a PCIe transfer, and on large batches that transfer, not compute, becomes the bottleneck.

FIX: Use Q4_K_M / Q4_K_S; ensure all experts fit in VRAM. Prefer 24GB+ GPU for 35B MoE.
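
To get a feel for the cost, the transfer time can be estimated from the amount of spilled weight data and the link bandwidth. This is a back-of-the-envelope sketch: the 2 GB spill is a hypothetical figure, the bandwidths are approximate theoretical one-direction values for PCIe Gen 4, and real throughput is lower. How much data actually crosses the bus depends on the runtime and its configuration.

```python
def transfer_ms(megabytes, link_gb_per_s):
    """Milliseconds to move `megabytes` of weights across a link (1 GB = 1000 MB)."""
    return megabytes / 1000 / link_gb_per_s * 1000

# Approximate theoretical one-direction bandwidths; measured throughput is lower.
PCIE4_X16 = 31.5   # GB/s
PCIE4_X8 = 15.75   # GB/s

spill_mb = 2000  # hypothetical: 2 GB of expert weights left in system RAM
for name, bandwidth in [("Gen4 x16", PCIE4_X16), ("Gen4 x8", PCIE4_X8)]:
    print(f"{name}: {transfer_ms(spill_mb, bandwidth):.0f} ms to move the spilled weights once")
```

Even at full x16 bandwidth, moving the spilled weights once takes tens of milliseconds, which is longer than the roughly 25 ms per-token budget implied by 40 tok/s.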

ROUTING COLLAPSE

Aggressive quantization levels (Q2_K, for example) can cause the router to consistently select the same experts, effectively turning the MoE into a small dense model. Quality degrades dramatically without warning.

FIX: Use Q4_K_M or Q6_K, not Q2_K. Run benchmark tasks to verify output quality post-quantization.
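
Alongside task benchmarks, one possible sanity check is to measure how evenly tokens are spread across experts. The sketch below assumes you have some way to log the expert indices chosen per token (that logging hook is hypothetical and not something this guide specifies) and computes a normalized entropy: values near 1.0 mean balanced usage, values near 0 suggest collapse.

```python
import math
from collections import Counter

def routing_entropy(selected_experts, num_experts):
    """Normalized entropy of expert usage: 1.0 = perfectly even, near 0 = collapsed."""
    counts = Counter(selected_experts)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(num_experts)

# Synthetic logs for illustration only.
healthy = [i % 8 for i in range(800)]   # all 8 experts used equally
collapsed = [0] * 790 + [1] * 10        # almost everything goes to expert 0

print(f"healthy:   {routing_entropy(healthy, 8):.3f}")
print(f"collapsed: {routing_entropy(collapsed, 8):.3f}")
```

Comparing this number between the unquantized model and each quantized variant on the same prompts gives an early warning before quality benchmarks finish. Any alert threshold would need to be chosen empirically.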

MEMORY FRAGMENTATION

MoE models with many experts cause VRAM fragmentation over time, especially with variable-length inputs. Sessions may crash with OOM errors even if peak usage looked acceptable.

FIX: Set --flash-attn in llama.cpp; limit max batch size; restart server periodically in production.

ORCHESTRATION THROUGHPUT DROP

In agentic pipelines that chain multiple MoE calls, inter-call latency accumulates. A 35B MoE generating at 40 tok/s across a five-step tool-calling chain delivers much lower effective throughput than a 7B dense model running the same steps.

FIX: Use MoE for planning/reasoning steps; smaller dense models for tool execution and formatting.
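
The effect is easy to quantify with a simple model in which each step pays a fixed overhead (prompt processing, tool latency) on top of generation time. The workload numbers below are hypothetical and only illustrate the arithmetic.

```python
def effective_tok_per_s(steps, tokens_per_step, gen_tok_per_s, overhead_s_per_step):
    """Tokens delivered per wall-clock second across a sequential chain of calls."""
    total_tokens = steps * tokens_per_step
    total_time = steps * (tokens_per_step / gen_tok_per_s + overhead_s_per_step)
    return total_tokens / total_time

# Hypothetical workload: 5 steps, 100 generated tokens each,
# 1.5 s of prompt processing and tool latency per step.
print(effective_tok_per_s(5, 100, 40, 1.5))  # → 25.0
```

In this example, a nominal 40 tok/s drops to 25 tok/s end to end. Shorter outputs per step make the fixed overhead dominate further.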

> PRE_DEPLOYMENT_CHECKLIST

> HARDWARE_VERIFICATION
  • VRAM ≥ (total_params × quant_factor) + KV cache
  • PCIe Gen 4 x16 link (not an x8 riser)
  • RAM ≥ 64GB if CPU offload needed
  • Flash Attention 2 supported (Ampere+ GPU)
  • Cooling adequate for sustained 80%+ GPU load
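
The first checklist item can be turned into a quick fit check. This sketch assumes about 4.85 bits per weight for Q4_K_M, an FP16 KV cache, and 1 GB of headroom for runtime buffers. The layer count, KV head count, and head dimension are placeholder values, so read the real ones from the model's config.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_tokens, bytes_per_elem=2):
    """KV cache size: keys and values (factor 2) for every layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem / 1e9

def fits_in_vram(total_params_b, bits_per_weight, kv_gb, vram_gb, headroom_gb=1.0):
    """Apply the checklist rule: weights + KV cache (+ headroom) must fit in VRAM."""
    weights_gb = total_params_b * bits_per_weight / 8
    needed = weights_gb + kv_gb + headroom_gb
    return needed <= vram_gb, needed

# Placeholder architecture values, not taken from any specific model card.
kv = kv_cache_gb(n_layers=48, n_kv_heads=4, head_dim=128, ctx_tokens=8192)
ok, needed = fits_in_vram(35, 4.85, kv, vram_gb=24)
print(f"KV cache: {kv:.2f} GB, total needed: {needed:.1f} GB, fits in 24 GB: {ok}")
```

The KV cache term grows linearly with context length, so doubling `ctx_tokens` doubles it. That is why the context size has to be balanced against VRAM.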

> QUANTIZATION_SELECTION
  • Q4_K_M — balanced quality/size (recommended)
  • Q6_K — near-lossless, high quality
  • Q4_K_S — smaller, acceptable for reasoning
  • Q2_K — routing collapse risk on MoE models
  • Run quality benchmark after quantization change

> RUNTIME_CONFIG
--n-gpu-layers 99 # all layers to GPU (any value ≥ layer count)
--flash-attn # reduce VRAM ~20%
--ctx-size 8192 # balance ctx vs VRAM
--batch-size 512 # tune for your GPU
--mlock # prevent RAM swapping
--numa distribute # multi-socket servers
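
For scripted deployments, the flags above can be assembled programmatically. This is a sketch under several assumptions: the binary is llama.cpp's `llama-server`, the model path is a placeholder, 99 stands in for "more than the model's layer count", and `--numa` is given the `distribute` strategy. Some newer llama.cpp builds expect `--flash-attn` to carry a value such as `on`, so the helper makes that optional. Check `llama-server --help` for your build.

```python
import shlex

def llama_server_args(model_path, ctx_size=8192, batch_size=512,
                      flash_attn_value=None, multi_socket=False):
    """Assemble the runtime flags listed above into an argv list."""
    args = ["llama-server", "--model", model_path, "--n-gpu-layers", "99"]
    args.append("--flash-attn")
    if flash_attn_value is not None:
        # Assumption: builds that take a value accept e.g. "on".
        args.append(flash_attn_value)
    args += ["--ctx-size", str(ctx_size), "--batch-size", str(batch_size), "--mlock"]
    if multi_socket:
        args += ["--numa", "distribute"]
    return args

# "model.gguf" is a placeholder path.
print(shlex.join(llama_server_args("model.gguf")))
```

The returned list can be passed directly to `subprocess.run` without going through a shell.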