Unsloth AI has introduced new optimizations for training Mixture of Experts (MoE) models, achieving speedups of up to 12x and reducing VRAM consumption by over 35%. The gains come from custom Triton kernels and mathematical optimizations, without compromising model accuracy.

Implementation Details

Unsloth's optimizations support various MoE architectures, including gpt-oss, Qwen3 (30B, 235B, VL, Coder), DeepSeek R1/V3, and GLM (4.6-Air, 4.7, Flash). For example, a gpt-oss-20b model can be fine-tuned with only 12.8 GB of VRAM, while Qwen3-30B-A3B (with 16-bit LoRA) requires 63 GB.
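
The fine-tuning workflow itself follows Unsloth's existing API. The sketch below shows roughly how such a run is set up; the checkpoint name, sequence length, and LoRA hyperparameters are illustrative assumptions, not the exact configuration behind the memory figures above.

```python
# Minimal sketch of a LoRA fine-tune with Unsloth (assumed settings).
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gpt-oss-20b",  # assumed checkpoint name
    max_seq_length=2048,               # assumed sequence length
    load_in_4bit=True,                 # 4-bit loading keeps VRAM usage low
)

# Attach LoRA adapters; only these low-rank matrices are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```

From here the model can be passed to a standard Hugging Face-style trainer; the MoE-specific kernel optimizations apply transparently during training.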

Unsloth's kernels are compatible with both data center GPUs (B200, H100) and consumer GPUs, including older RTX 3090s. The effectiveness of the optimizations increases with model size and context length.

Integration and Performance

In collaboration with Hugging Face, Unsloth has standardized MoE training on PyTorch's torch._grouped_mm function. Combined with improvements in Transformers v5, these optimizations deliver further speed and VRAM gains compared to previous versions.
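
Conceptually, a grouped matmul replaces the per-expert loop in an MoE layer with a single fused operation over ragged groups of tokens. The reference sketch below illustrates what that operation computes; the tensor shapes and the explicit loop are assumptions for clarity, not the actual Transformers or Unsloth implementation.

```python
import torch

def grouped_mm_reference(tokens, expert_weights, group_sizes):
    """Reference for what a grouped matmul computes in an MoE layer.

    tokens:         (total_tokens, d_in) hidden states, already sorted by expert
    expert_weights: (num_experts, d_in, d_out), one weight matrix per expert
    group_sizes:    list of token counts routed to each expert
    """
    outputs, start = [], 0
    for e, n in enumerate(group_sizes):
        # Each expert multiplies only its own contiguous slice of tokens.
        outputs.append(tokens[start:start + n] @ expert_weights[e])
        start += n
    return torch.cat(outputs, dim=0)

# torch._grouped_mm (and Unsloth's Triton kernels) perform this same
# computation in one fused kernel instead of launching a matmul per expert.
```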