Google DeepMind’s Decoupled DiLoCo: Scaling AI Training with 88% Goodput and Asynchronous Fault Tolerance

These articles are AI-generated summaries. Please check the original sources for full details.

Google DeepMind Introduces Decoupled DiLoCo: An Asynchronous Training Architecture Achieving 88% Goodput Under High Hardware Failure Rates

Google DeepMind has introduced Decoupled DiLoCo, a distributed training architecture designed to address the fragility of synchronization in large-scale AI training. In simulations with 1.2 million chips, the system maintained 88% goodput (the share of time spent on useful training work), compared with 27% for standard methods.

Why This Matters

Traditional data-parallel training relies on blocking synchronization: the slowest chip, or a single failure, stalls thousands of accelerators. As models scale toward hundreds of billions of parameters, a 198 Gbps inter-datacenter bandwidth requirement and the lack of fault isolation make geographically distributed training over standard internet infrastructure nearly impossible. Decoupled DiLoCo addresses these constraints by splitting compute into asynchronous, fault-isolated learner units. Replacing synchronous AllReduce steps with asynchronous outer optimization lets training proceed even when hardware components fail or performance fluctuates across regions.
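The idea of fault-isolated learners feeding an outer update can be illustrated with a toy sketch. This is not DeepMind's implementation and does not use Pathways: the function names (`local_steps`, `outer_round`), the quadratic loss, the plain-SGD inner steps, and the simple averaging of parameter deltas are all assumptions chosen for illustration. The sketch is also round-based, so it does not model true asynchrony; it only shows that an unavailable learner can be skipped rather than stalling the update.

```python
def local_steps(params, target, steps, lr):
    """Run a few inner SGD steps on a toy loss 0.5 * (p - target)^2 per coordinate."""
    p = list(params)
    for _ in range(steps):
        # Gradient of the toy loss with respect to each coordinate is (x - t).
        p = [x - lr * (x - t) for x, t in zip(p, target)]
    return p


def outer_round(global_params, learner_targets, alive,
                inner_steps=5, inner_lr=0.1, outer_lr=1.0):
    """Apply one outer update using only the learners that are currently up."""
    deltas = []
    for target, up in zip(learner_targets, alive):
        if not up:
            continue  # a failed learner is skipped instead of blocking its peers
        local = local_steps(global_params, target, inner_steps, inner_lr)
        # Delta between the shared parameters and this learner's local result.
        deltas.append([g - l for g, l in zip(global_params, local)])
    if not deltas:
        return list(global_params)  # nothing to merge this round
    avg = [sum(col) / len(deltas) for col in zip(*deltas)]
    return [g - outer_lr * d for g, d in zip(global_params, avg)]


if __name__ == "__main__":
    params = [0.0]
    targets = [[1.0], [1.0], [1.0]]  # hypothetical per-learner data
    for round_idx in range(10):
        # Learner 1 is down for rounds 3 to 5, then rejoins.
        alive = [True, not (3 <= round_idx <= 5), True]
        params = outer_round(params, targets, alive)
        print(round_idx, alive, params)
```

In this toy setting the shared parameters keep moving toward the target while one learner is offline, and the learner simply contributes again once it is marked available.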

Key Insights

  • Decoupled DiLoCo reduces inter-datacenter bandwidth requirements from 198 Gbps to 0.84 Gbps, a roughly 236-fold reduction, or more than two orders of magnitude (DeepMind, 2026).
  • The architecture uses asynchronous data flow via Pathways, allowing learner units to perform local gradient steps without blocking on peers.
  • Chaos engineering tests demonstrated a self-healing capability where the system continued training despite losing entire learner units, reintegrating them upon recovery.
  • A 12 billion parameter model was successfully trained across four separate U.S. regions using only 2-5 Gbps of wide-area networking.
  • The system supports heterogeneous hardware, demonstrated by mixing TPU v6e and TPU v5p chips in a single training job without performance degradation.
  • Experiments with Gemma 4 models showed that the resilience gains come with minimal degradation: 64.1% accuracy versus 64.4% for the baseline, a gap of 0.3 percentage points.
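The figures quoted in the list above can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated in this summary; the function names are hypothetical and exist purely for illustration.

```python
import math


def reduction_factor(before_gbps, after_gbps):
    """Return how many times smaller the new bandwidth requirement is."""
    return before_gbps / after_gbps


def orders_of_magnitude(factor):
    """Express a reduction factor as a base-10 order of magnitude."""
    return math.log10(factor)


def accuracy_gap(baseline_pct, candidate_pct):
    """Return the gap in percentage points between two accuracy figures."""
    return baseline_pct - candidate_pct


if __name__ == "__main__":
    factor = reduction_factor(198.0, 0.84)  # figures from the summary
    print(f"bandwidth reduction: {factor:.1f}x")
    print(f"orders of magnitude: {orders_of_magnitude(factor):.2f}")
    print(f"accuracy gap: {accuracy_gap(64.4, 64.1):.1f} percentage points")
```

The reduced 0.84 Gbps requirement also sits below the 2-5 Gbps of wide-area networking reported for the four-region run, which is consistent with training over commodity links.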

Practical Applications

  • Use case: Training large-scale models across geographically distant data centers using commercial internet (2-5 Gbps) instead of custom high-speed WAN.
  • Pitfall: Conventional data-parallel synchronization creates blocking bottlenecks, because a single slow or failed chip stalls every other accelerator; in the 1.2 million-chip simulations, standard methods reached only 27% goodput.
  • Use case: Extending the lifecycle of older hardware by mixing generations, such as TPU v5p and v6e, in a single asynchronous training cluster.
  • Pitfall: Rigid synchronization requirements in standard distributed training prevent the seamless reintegration of chips after hardware failure, leading to resource waste.
