MoE Models On-Premise
Mixture of Experts models (Qwen3.6-35B-A3B, DeepSeek-V3, Mixtral) behave very differently from dense models on consumer hardware. Understanding sparse routing, active parameter counts, and memory bandwidth is essential before deploying MoE on-premise.
> MOE_ARCHITECTURE_PRIMER
In a standard dense Transformer, every token is processed by every parameter. In a Mixture of Experts model, each token is routed to only a small subset of "expert" FFN blocks by a learned router network.
This means a model with 35B total parameters may only activate 3.7B parameters per token — with inference compute cost closer to a 4B dense model.
The catch: all experts must be loaded into memory simultaneously, even if most are idle for any given token. The VRAM or RAM requirement therefore scales with total parameters, not active parameters.
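The routing idea above can be sketched in a few lines. This is a toy illustration, not the router of any specific model: the scalar "experts", the logits, and the choice to renormalize the top-k weights are all assumptions made for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k):
    """Return (expert_index, weight) pairs for the top_k experts of one token."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)  # renormalize over the chosen experts (assumption)
    return [(i, probs[i] / norm) for i in chosen]

def moe_layer(token, experts, router_logits, top_k=2):
    """Weighted sum of the outputs of the selected experts only."""
    out = 0.0
    for idx, weight in route_token(router_logits, top_k):
        out += weight * experts[idx](token)
    return out

# Eight toy experts: all eight must stay resident in memory,
# although only two of them run for this token.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.3, 1.5, -0.5, 0.0, 0.2]  # hypothetical router output
selected = route_token(logits, top_k=2)
output = moe_layer(1.0, experts, logits, top_k=2)
```

Compute scales with `top_k`, while memory scales with `len(experts)`, which is the asymmetry the rest of this page deals with.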
> MOE_MODELS_AVAILABLE_2026
| MODEL | TOTAL / ACTIVE | VRAM (Q4) | LICENSE | TOK/S (RTX 4090) | NOTES |
|---|---|---|---|---|---|
| Qwen3.6-35B-A3.7B | 35B / 3.7B | ~22 GB | Apache 2.0 | ~35-45 | Best on single RTX 4090; thinking mode; fits in 24GB with Q4_K_M |
| Qwen3.6-30B-A3B | 30B / 3B | ~18 GB | Apache 2.0 | ~40-50 | Tight on RTX 4080 Super (16GB) even with Q4_K_S, so expect partial CPU offload there; good for agentic workflows |
| Mixtral 8x7B | 47B / 13B | ~26 GB | Apache 2.0 | ~25-35* | *Does not fit fully in 24GB; needs partial GPU offload with the remaining layers on CPU. Proven, widely tested with llama.cpp |
| Mixtral 8x22B | 141B / 39B | ~80 GB+ | Apache 2.0 | ~8-15** | **Requires multi-GPU (2×A6000) or heavy CPU offload. Datacenter-class hardware. |
| DeepSeek-V3 (MoE) | 671B / 37B | ~400 GB+ | MIT | N/A | Not feasible on consumer hardware. Data center multi-node only (8×H100). |
> MOE_SPECIFIC_FAILURE_MODES
Problems that don't appear in cloud demos but surface on local hardware
If experts don't fit in VRAM, llama.cpp offloads some to RAM. Each expert swap causes a PCIe memory transfer. On large batches, this becomes the bottleneck — not compute.
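A rough back-of-the-envelope estimate shows why link width matters here. The sketch below uses approximate theoretical one-direction PCIe bandwidths; the `efficiency` factor and the 0.5 GB payload are assumptions for illustration, not measurements.

```python
# Approximate theoretical one-direction bandwidth in GB/s.
# Real-world throughput is lower, hence the efficiency factor below.
PCIE_GBPS = {
    ("gen3", 16): 16.0,
    ("gen4", 8): 16.0,
    ("gen4", 16): 32.0,
}

def transfer_ms(payload_gb, gen="gen4", lanes=16, efficiency=0.8):
    """Estimate the time in milliseconds to move payload_gb across the link."""
    bandwidth = PCIE_GBPS[(gen, lanes)] * efficiency  # assumed usable GB/s
    return payload_gb / bandwidth * 1000.0

# Hypothetical case: 0.5 GB of expert weights has to cross the bus.
x16_ms = transfer_ms(0.5, gen="gen4", lanes=16)
x8_ms = transfer_ms(0.5, gen="gen4", lanes=8)
```

Under these assumptions the x8 link takes twice as long as x16 for the same payload, and either figure is large compared with the per-token compute time of a model with only a few billion active parameters.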
Aggressive quantization levels (Q2_K in particular) can cause the router to consistently select the same experts, effectively turning the MoE into a small dense model. Quality degrades dramatically, and nothing in the runtime flags it.
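One way to spot this kind of collapse is to look at how evenly tokens are spread across experts. The sketch below assumes you can log the selected expert index for each token, which may require instrumenting your runtime; the function name and the sample data are hypothetical.

```python
import math
from collections import Counter

def routing_entropy(expert_ids, num_experts):
    """Normalized entropy of expert usage: 1.0 is uniform, 0.0 is a single expert."""
    if not expert_ids or num_experts < 2:
        raise ValueError("need at least one sample and two experts")
    counts = Counter(expert_ids)
    total = len(expert_ids)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return max(0.0, entropy / math.log(num_experts))

# Hypothetical logs of the chosen expert per token, 8 experts in total.
healthy = [0, 1, 2, 3, 4, 5, 6, 7] * 10   # evenly spread
collapsed = [3] * 80                      # one expert takes everything

healthy_score = routing_entropy(healthy, num_experts=8)
collapsed_score = routing_entropy(collapsed, num_experts=8)
```

Comparing the score of a quantized model against the same prompts on a higher-precision build gives a relative signal; where to set an alert threshold is a judgment call that depends on the model.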
MoE models with many experts cause VRAM fragmentation over time, especially with variable-length inputs. Sessions may crash with OOM errors even if peak usage looked acceptable.
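In agentic pipelines that chain multiple MoE calls, inter-call latency accumulates: every step pays for prompt processing and tool execution before any new tokens are generated. A 35B MoE that decodes at 40 tok/s can deliver lower effective throughput across a 5-step tool-calling run than a 7B dense model.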
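The effect of per-step overhead can be made concrete with simple arithmetic. The numbers below (200 tokens per step, 3 seconds of overhead per step) are hypothetical and only illustrate the shape of the calculation.

```python
def effective_tok_per_s(steps, tokens_per_step, decode_tok_s, overhead_s_per_step):
    """Generated tokens divided by wall-clock time across a multi-step agent run.

    overhead_s_per_step lumps together prompt processing, tool execution,
    and any other non-decoding time for one step (an assumption of this sketch).
    """
    total_tokens = steps * tokens_per_step
    total_time = steps * (tokens_per_step / decode_tok_s + overhead_s_per_step)
    return total_tokens / total_time

# Hypothetical 5-step agent run on a model that decodes at 40 tok/s.
with_overhead = effective_tok_per_s(5, 200, 40.0, 3.0)
no_overhead = effective_tok_per_s(5, 200, 40.0, 0.0)
```

With these inputs, each step spends 5 seconds decoding and 3 seconds waiting, so the headline decode rate overstates what the pipeline actually delivers.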
> PRE_DEPLOYMENT_CHECKLIST
- □ VRAM ≥ (total_params × quant_factor) + KV cache, where quant_factor is bytes per parameter at the chosen quantization
- □ PCIe Gen 4 x16 link (not an x8 riser)
- □ RAM ≥ 64GB if CPU offload needed
- □ Flash Attention 2 supported (Ampere+ GPU)
- □ Cooling adequate for sustained 80%+ GPU load
- ✓ Q4_K_M — balanced quality/size (recommended)
- ✓ Q6_K — near-lossless, high quality
- ⚡ Q4_K_S — smaller, acceptable for reasoning
- ✗ Q2_K — routing collapse risk on MoE models
- □ Run quality benchmark after quantization change
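The VRAM rule in the checklist can be turned into a quick estimate. This is a minimal sketch: the bits-per-weight value for Q4_K_M is approximate, the 1 GB runtime overhead is a guess, and the layer and head counts are hypothetical placeholders rather than the configuration of any listed model.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """KV cache size for standard attention: keys and values for every layer."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 1e9

def estimate_vram_gb(total_params_b, bits_per_weight, kv_gb=0.0, overhead_gb=1.0):
    """Weights plus KV cache plus a fixed runtime overhead (assumed), in GB."""
    weights_gb = total_params_b * bits_per_weight / 8.0  # billions of params -> GB
    return weights_gb + kv_gb + overhead_gb

# Hypothetical architecture: 48 layers, 4 KV heads of dimension 128,
# 8192-token context, 16-bit cache entries.
kv = kv_cache_gb(n_layers=48, n_kv_heads=4, head_dim=128, context_len=8192)

# 35B total parameters at roughly 4.8 bits per weight (approximate Q4_K_M figure).
needed = estimate_vram_gb(35, 4.8, kv_gb=1.5, overhead_gb=1.0)
fits_24gb = needed <= 24.0
```

Note that the estimate uses total parameters, not active parameters, and that a longer context or a larger batch grows the KV cache term while the weight term stays fixed.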