Building Gigawatt-Scale AI Clusters with Backend Aggregation
These articles are AI-generated summaries. Please check the original sources for full details.
Building Prometheus: How Backend Aggregation Enables Gigawatt-Scale AI Clusters
Meta’s Prometheus AI cluster is being built to deliver 1 gigawatt of capacity, and backend aggregation (BAG) is the network layer that connects its thousands of GPUs across multiple data centers and regions. By combining modular hardware, advanced routing, and resilient topologies, BAG delivers both performance and reliability at this scale, with inter-BAG capacities reaching the petabit range.
Why This Matters
Gigawatt-scale AI clusters like Prometheus need a networking infrastructure that is both robust and scalable, a requirement that is often at odds with design models that prioritize simplicity and low cost. Getting this infrastructure wrong is expensive and caps how far the cluster can grow, as the complexity of interconnecting tens of thousands of GPUs makes clear. For instance, a single misconfigured network switch can create a failure domain that affects an entire region, causing substantial downtime and lost revenue.
Key Insights
- BAG is a centralized Ethernet-based super spine network layer that interconnects multiple spine layer fabrics across various data centers and regions, with inter-BAG capacities reaching 16-48 Pbps per region pair.
- Modular chassis equipped with Jericho3 (J3) ASIC line cards provide a high-capacity, scalable, and resilient interconnect, with up to 432x800G ports per chassis.
- Routing within BAG uses eBGP with link bandwidth attributes, which enables Unequal Cost Multipath (UCMP) for efficient load balancing and robust failure handling.
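To make the UCMP idea concrete, here is a minimal sketch of how traffic shares can follow advertised link bandwidth. It is an illustration only, not Meta's implementation: the function name `ucmp_weights`, the next-hop names, and the bandwidth figures are all hypothetical, and a real router derives these weights from the BGP link bandwidth attribute rather than from a Python dictionary.

```python
from fractions import Fraction


def ucmp_weights(path_bandwidth_gbps: dict[str, int]) -> dict[str, Fraction]:
    """Split traffic across next hops in proportion to advertised bandwidth.

    Next hops advertising zero bandwidth (for example after losing all of
    their links) are dropped from the result.
    """
    usable = {hop: bw for hop, bw in path_bandwidth_gbps.items() if bw > 0}
    total = sum(usable.values())
    if total == 0:
        return {}
    return {hop: Fraction(bw, total) for hop, bw in usable.items()}


# Hypothetical next hops: bag-3 has lost half of its links, so it
# advertises half the bandwidth and receives half the share of the others.
healthy = ucmp_weights({"bag-1": 3200, "bag-2": 3200, "bag-3": 1600})
for hop, share in healthy.items():
    print(hop, share)  # bag-1 2/5, bag-2 2/5, bag-3 1/5

# If bag-3 fails completely, traffic rebalances over the remaining paths.
degraded = ucmp_weights({"bag-1": 3200, "bag-2": 3200, "bag-3": 0})
print(degraded)  # bag-1 and bag-2 each carry 1/2
```

Equal-cost multipath would give every next hop the same share regardless of how much capacity sits behind it. Weighting by bandwidth keeps a partially degraded path in service without overloading it.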
Working Example
# BAG Network Topology Example
## Planar Topology
* Connects BAG switches one-to-one between regions
* Offers simplified management but concentrates potential failure domains
## Spread Connection Topology
* Distributes links across multiple BAG switches/planes
* Enhances path diversity and resilience
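The trade-off between the two topologies can be sketched with a toy model. The code below is a simplification under stated assumptions: the switch names (`A0`, `B0`, ...), the helper functions, and the four-switch regions are hypothetical, every link is treated as equal, and the spread layout is modeled as each local switch connecting to `fanout` consecutive remote switches. It only shows how a single remote switch failure affects each local switch's inter-region links.

```python
def planar_links(n: int) -> list[tuple[str, str]]:
    """Planar: switch i in region A connects only to switch i in region B."""
    return [(f"A{i}", f"B{i}") for i in range(n)]


def spread_links(n: int, fanout: int) -> list[tuple[str, str]]:
    """Spread: each switch in region A connects to `fanout` switches in B.

    Assumes fanout <= n so that no link is duplicated.
    """
    return [(f"A{i}", f"B{(i + k) % n}") for i in range(n) for k in range(fanout)]


def surviving_fraction(links: list[tuple[str, str]], failed: str) -> dict[str, float]:
    """Per local switch, the fraction of its links that avoid the failed switch."""
    total: dict[str, int] = {}
    alive: dict[str, int] = {}
    for local, remote in links:
        total[local] = total.get(local, 0) + 1
        if remote != failed:
            alive[local] = alive.get(local, 0) + 1
    return {local: alive.get(local, 0) / total[local] for local in total}


# Planar: losing B0 cuts A0 off entirely while the other planes are untouched.
print(surviving_fraction(planar_links(4), "B0"))
# {'A0': 0.0, 'A1': 1.0, 'A2': 1.0, 'A3': 1.0}

# Spread: losing B0 costs every local switch a quarter of its links.
print(surviving_fraction(spread_links(4, 4), "B0"))
# {'A0': 0.75, 'A1': 0.75, 'A2': 0.75, 'A3': 0.75}
```

In this model the planar layout concentrates the failure on one plane, while the spread layout shares a smaller loss across all switches, which is the path-diversity benefit described above.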
Practical Applications
- Use Case: Meta’s Prometheus AI cluster uses BAG to connect thousands of GPUs across multiple data centers and regions, enabling seamless, high-capacity networking and ensuring the scalability and reliability of the cluster.
- Pitfall: Poorly managed oversubscription ratios degrade performance and limit scalability, for example when oversubscription from L2 to BAG exceeds 4.5:1.
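An oversubscription ratio is the capacity facing downstream divided by the capacity facing upstream. The sketch below shows a simple check against the 4.5:1 figure mentioned above. The function names and the port counts are hypothetical and chosen only so the arithmetic is easy to follow; they do not describe Meta's actual L2 switches.

```python
def oversubscription_ratio(downlink_gbps: float, uplink_gbps: float) -> float:
    """Return downlink capacity divided by uplink capacity."""
    if uplink_gbps <= 0:
        raise ValueError("uplink capacity must be positive")
    return downlink_gbps / uplink_gbps


def exceeds_limit(downlink_gbps: float, uplink_gbps: float, limit: float = 4.5) -> bool:
    """True when the ratio is above the limit (4.5:1 by default)."""
    return oversubscription_ratio(downlink_gbps, uplink_gbps) > limit


# Hypothetical L2 layer: 144 x 800G facing the fabric, 32 x 800G facing BAG.
down = 144 * 800  # 115200 Gbps
up = 32 * 800     # 25600 Gbps
print(oversubscription_ratio(down, up))  # 4.5
print(exceeds_limit(down, up))           # False: exactly at the limit

# Losing four uplinks pushes the same layer past 4.5:1.
print(exceeds_limit(down, 28 * 800))     # True
```

A check like this is most useful when it is run against degraded states as well as the healthy design, since link failures raise the effective ratio.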