2026-04-22 · Google

Decoupled DiLoCo: A new frontier for resilient, distributed AI training

protocols · models · infrastructure

Source: DeepMind · Date: 2026-04-22 · URL: https://deepmind.google/blog/decoupled-diloco/

Summary

Google DeepMind published Decoupled DiLoCo, a distributed training architecture that combines Pathways (asynchronous data flow) with DiLoCo (bandwidth reduction) and cuts inter-datacenter bandwidth requirements from 198 Gbps to 0.84 Gbps across 8 datacenters. In high-failure simulations it maintains 88% goodput, versus 27% for standard data-parallel methods. The approach trained a 12B-parameter model across four US regions 20x faster than synchronization-based methods, and on Gemma 4 models trained with mixed TPU hardware generations it reached 64.1% accuracy versus a 64.4% baseline.
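
For readers unfamiliar with DiLoCo, the bandwidth saving comes from letting each site train locally for many steps and exchanging only a parameter delta once per round. The sketch below is a minimal, synchronous toy version of that idea on a one-parameter quadratic loss. It is not DeepMind's implementation and leaves out the asynchronous, Pathways-style decoupling entirely. Plain SGD stands in for the AdamW inner optimizer of the published DiLoCo recipe, and every name, target value, and hyperparameter here is an illustrative assumption.

```python
import random


def local_steps(theta, target, steps, lr, rng):
    """Run `steps` inner SGD steps on the noisy loss 0.5 * (theta - target)**2."""
    for _ in range(steps):
        grad = (theta - target) + rng.gauss(0.0, 0.1)  # noisy gradient
        theta -= lr * grad
    return theta


def diloco_round(theta_global, momentum, targets, inner_steps, inner_lr,
                 outer_lr, beta, rng):
    """One outer round: sites train locally, then share only parameter deltas."""
    # Each site starts from the shared parameters and trains without communicating.
    local = [local_steps(theta_global, t, inner_steps, inner_lr, rng)
             for t in targets]
    # Pseudo-gradient: the average of (global - local) across sites.
    # In this toy model it is the only value that crosses site boundaries.
    pseudo_grad = sum(theta_global - th for th in local) / len(local)
    # Outer update: SGD with Nesterov momentum applied to the pseudo-gradient.
    momentum = beta * momentum + pseudo_grad
    theta_global -= outer_lr * (pseudo_grad + beta * momentum)
    return theta_global, momentum


def train(rounds=50, seed=0):
    rng = random.Random(seed)
    # Hypothetical per-site optima standing in for different data shards;
    # their mean (1.0) is the point the shared parameter should approach.
    targets = [0.8, 1.0, 1.2, 1.0]
    theta, momentum = 5.0, 0.0
    for _ in range(rounds):
        theta, momentum = diloco_round(
            theta, momentum, targets,
            inner_steps=20, inner_lr=0.1, outer_lr=0.7, beta=0.9, rng=rng)
    return theta


if __name__ == "__main__":
    print(f"shared parameter after training: {train():.3f}")
```

With 20 inner steps per round, the sites in this toy exchange one value per round instead of one gradient per step, which is the kind of communication saving the headline bandwidth figures describe.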

Implications

A 0.84 Gbps requirement puts geographically distributed training within reach of internet-scale links. Cutting cross-datacenter bandwidth roughly 236x (198 → 0.84 Gbps) means training large models no longer requires co-located datacenters with high-speed interconnect. That opens training to internet-scale connectivity, with major implications for distributed ownership and geographic resilience of AI training infrastructure.
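
The reduction factor follows directly from the two bandwidth figures, and a short calculation also shows what the difference means for moving a fixed payload. The sketch below is only arithmetic. The 24 GB payload is a hypothetical figure (for example, 12B parameters at 2 bytes each) chosen for illustration, not a number from the DeepMind post.

```python
def reduction_factor(before_gbps: float, after_gbps: float) -> float:
    """How many times smaller the required bandwidth is."""
    return before_gbps / after_gbps


def transfer_seconds(payload_gigabytes: float, link_gbps: float) -> float:
    """Seconds to move a payload over a link, using 1 byte = 8 bits."""
    return payload_gigabytes * 8 / link_gbps


if __name__ == "__main__":
    BEFORE_GBPS, AFTER_GBPS = 198.0, 0.84   # figures quoted in the summary
    PAYLOAD_GB = 24.0                       # hypothetical payload size

    print(f"reduction: {reduction_factor(BEFORE_GBPS, AFTER_GBPS):.1f}x")
    print(f"{PAYLOAD_GB:.0f} GB at {BEFORE_GBPS} Gbps: "
          f"{transfer_seconds(PAYLOAD_GB, BEFORE_GBPS):.1f} s")
    print(f"{PAYLOAD_GB:.0f} GB at {AFTER_GBPS} Gbps: "
          f"{transfer_seconds(PAYLOAD_GB, AFTER_GBPS):.1f} s")
```

The ratio works out to about 235.7. A link that takes minutes rather than about a second to move the same payload is only workable if synchronization happens rarely, which is the trade DiLoCo-style methods make.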

Goodput of 88% under failure, versus 27%, is the reliability story. Conventional data-parallel training loses most of its useful throughput when nodes fail (27% goodput in high-failure scenarios). Holding 88% goodput under the same conditions means Decoupled DiLoCo can train through hardware failures rather than requiring restarts. At frontier model scale, where training runs take months, that’s an enormous operational advantage.
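
To make the goodput gap concrete, the sketch below converts it into wall-clock time. It assumes goodput means the fraction of wall-clock time spent on useful training progress, which the post summary does not spell out. The 30 days of useful compute is a hypothetical workload, not a figure from the source.

```python
def wall_clock_days(useful_days: float, goodput: float) -> float:
    """Wall-clock days needed to complete `useful_days` of productive training."""
    if not 0.0 < goodput <= 1.0:
        raise ValueError("goodput must be in (0, 1]")
    return useful_days / goodput


if __name__ == "__main__":
    USEFUL_DAYS = 30.0  # hypothetical amount of productive compute

    baseline = wall_clock_days(USEFUL_DAYS, 0.27)   # standard data-parallel
    decoupled = wall_clock_days(USEFUL_DAYS, 0.88)  # Decoupled DiLoCo

    print(f"data-parallel:    {baseline:.1f} days")
    print(f"Decoupled DiLoCo: {decoupled:.1f} days")
    print(f"speed-up from goodput alone: {baseline / decoupled:.2f}x")
```

Under these assumptions the same work takes about 111 days at 27% goodput and about 34 days at 88%, a roughly 3.3x difference that comes from failure handling alone.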

Mixed TPU hardware generation support extends compute lifespan. Being able to mix different TPU versions in a single training run means older hardware doesn’t become instantly worthless when a new TPU generation ships. That’s economically significant for Google’s TPU fleet management and potentially for future multi-org distributed training scenarios.

Watch:

Whether Decoupled DiLoCo gets applied to Gemini 4 or 5 training runs; the Gemma 4 validation suggests it is already being adopted
Implications for multi-organization training collaborations: if 0.84 Gbps is sufficient, geographically and organizationally distributed model training becomes practical
How OpenAI and Anthropic respond with their own distributed training infrastructure, since this is competition on infrastructure capability
