Mixture of Experts (MoEs) in Transformers
Source: HuggingFace Date: 2026-02-26 URL: https://huggingface.co/blog/moe-transformers
Summary
Library update: Transformers v5 introduces major MoE-specific engineering: a WeightConverter abstraction for dynamic checkpoint loading (66s → 10s for Qwen 110B on a single A100, a 6.6x speedup); a pluggable expert backend system with three execution modes (eager, batched_mm, grouped_mm); expert parallelism via DistributedConfig; and MoE training optimizations with Unsloth (12x faster, 35%+ VRAM reduction, 6x longer context). Key illustration: a 21B-parameter MoE model with 4 active experts achieves ~115 tok/s on an M3 Ultra Mac, which amounts to the per-token compute of a roughly 3.6B dense model with 21B parameters of capacity.
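To make the "4 active experts" figure concrete, here is a minimal sketch of top-k routing for a single token, written in plain Python rather than with Transformers. The function names, the toy scalar experts, and the choice to renormalize the softmax weights over the selected experts are assumptions for illustration. The sketch uses a simple per-expert loop and makes no claim about how the eager, batched_mm, or grouped_mm backends are implemented.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Assumption: weights are renormalized over the selected experts only.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, router_logits, experts, k):
    """Run only the k selected experts and mix their outputs by router weight."""
    out = 0.0
    for idx, weight in route_top_k(router_logits, k):
        out += weight * experts[idx](x)
    return out

# Toy layer (hypothetical): 8 scalar "experts", each just scales its input.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, 1.5, -0.5]

selected = route_top_k(logits, k=4)          # only 4 of the 8 experts run
y = moe_forward(2.0, logits, experts, k=4)

# Figures quoted in the summary above.
active_fraction = 3.6 / 21   # share of parameters used per token
load_speedup = 66 / 10       # checkpoint loading, 66s -> 10s
```

The other four experts are never evaluated for this token, which is why per-token compute tracks the active parameter count (about 3.6B of 21B here) rather than the total.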
Implications
Transformers library trajectory. Treating MoE models as first-class citizens in Transformers v5, rather than bolting them on as custom implementations, is a prerequisite for the ecosystem to adopt MoE architectures properly as they proliferate. The 6.6x weight-loading speedup is immediately practical: loading a 110B-class model such as the benchmarked Qwen 110B for inference no longer requires a 66-second wait.
Open-weights ecosystem health. Expert parallelism support in Transformers, enabled through DistributedConfig with a single enable_expert_parallel=True flag, opens MoE serving to teams that cannot write custom distributed inference code. If this works reliably across the major MoE releases (DeepSeek, Qwen MoE, Mixtral), it significantly lowers the operational complexity of running large sparse models.
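As a rough illustration of what expert parallelism involves, the sketch below assigns experts to devices and looks up which device owns each expert a token was routed to. It is plain Python and does not call Transformers. The helper names and the contiguous-block layout are assumptions for illustration, not a description of the library's actual placement logic.

```python
def shard_experts(num_experts, world_size):
    """Assign contiguous blocks of expert indices to ranks (assumed layout)."""
    if num_experts % world_size != 0:
        raise ValueError("num_experts must be divisible by world_size")
    per_rank = num_experts // world_size
    return {
        rank: list(range(rank * per_rank, (rank + 1) * per_rank))
        for rank in range(world_size)
    }

def owner_rank(expert_index, num_experts, world_size):
    """Return the rank that holds a given expert under the layout above."""
    return expert_index // (num_experts // world_size)

# Hypothetical setup: 32 experts spread over 4 devices, 8 experts each.
placement = shard_experts(num_experts=32, world_size=4)

# A token routed to these 4 experts must have its activations sent to
# every rank that owns one of them.
routed_experts = [3, 9, 17, 30]
target_ranks = sorted({owner_rank(e, 32, 4) for e in routed_experts})
```

Each device stores only its own slice of the expert weights, so memory per device shrinks with the number of devices, at the cost of moving token activations between them. That communication step is the part a library-level flag spares users from writing by hand.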