2026-01-20 · HuggingFace

Differential Transformer V2



Source: HuggingFace · Date: 2026-01-20 · URL: https://huggingface.co/blog/microsoft/diff-attn-v2

Summary

Microsoft’s Differential Transformer V2 redesigns the V1 architecture in three ways. It removes the custom-kernel requirement and is now FlashAttention-compatible. It improves training stability by removing the per-head RMSNorm, which was a source of gradient spikes at large learning rates. It replaces the global lambda with a token-specific, head-wise projected lambda. Preliminary results on production-scale dense and 30B MoE models show 0.02–0.03 lower language modeling loss at 1T tokens. Decoding speed is on par with a standard Transformer. Experiments are ongoing.
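The differential combination can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each head pair runs two ordinary causal softmax attentions over shared keys and values, and that the projected lambda is squashed by a sigmoid before scaling the second output. The function names and the sigmoid gating are assumptions made for illustration. A real model would compute the two attention outputs with a FlashAttention kernel rather than Python loops.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def causal_attention(q, k, v):
    """Plain single-head causal softmax attention on lists of vectors."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        # Token i attends only to positions 0..i.
        scores = [
            sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        w = softmax(scores)
        out.append(
            [sum(w[j] * v[j][c] for j in range(i + 1)) for c in range(len(v[0]))]
        )
    return out


def diff_attention_head(q1, q2, k, v, lam_logits):
    """Hypothetical differential head: attn1 - sigmoid(lambda) * attn2.

    q1, q2     : two query sets that share the same k and v (assumption)
    lam_logits : one scalar per token for this head, standing in for the
                 token-specific, head-wise projected lambda
    """
    a1 = causal_attention(q1, k, v)
    a2 = causal_attention(q2, k, v)
    return [
        [x - sigmoid(lam) * y for x, y in zip(row1, row2)]
        for row1, row2, lam in zip(a1, a2, lam_logits)
    ]
```

Under these assumptions the first token's output is (1 - sigmoid(lambda)) times its own value vector, so a head can shrink its output toward zero. A plain softmax head cannot do this, because its weights must sum to one.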

Implications

Thread: transformers library trajectory. The key improvement over V1 is engineering practicality: FlashAttention compatibility without custom kernels makes DIFF V2 practical to deploy. The 0.02–0.03 loss gap at 1T tokens is small but consistent, and at scale, architectural efficiency gains of this magnitude compound. Relaxing the magnitude constraint that softmax imposes eliminates attention sinks, a known quality problem in long-context settings. If the preliminary results hold on long-context benchmarks, this is a compelling drop-in attention modification for any team training models from scratch. Watch for the formal paper to see whether the gains transfer to downstream task performance.
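For a sense of scale, the loss gap can be converted to a perplexity ratio. The sketch below assumes the reported loss is mean per-token cross-entropy in nats; the source does not state the unit. The helper name is hypothetical.

```python
import math


def perplexity_ratio(loss_gap):
    """New perplexity divided by baseline perplexity for a given loss drop.

    Assumes loss is mean cross-entropy in nats per token, so
    perplexity = exp(loss) and the ratio is exp(-loss_gap).
    """
    return math.exp(-loss_gap)


for gap in (0.02, 0.03):
    r = perplexity_ratio(gap)
    print(f"loss gap {gap:.2f}: perplexity x{r:.4f} ({(1 - r) * 100:.1f}% lower)")
```

Under that assumption, a 0.02–0.03 gap corresponds to roughly 2–3% lower perplexity.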
