OpenAI Releases MRC Protocol: Scaling AI Supercomputing to 131,000 GPUs

These articles are AI-generated summaries. Please check the original sources for full details.

OpenAI Introduces MRC (Multipath Reliable Connection): A New Open Networking Protocol for Large-Scale AI Supercomputer Training Clusters

OpenAI has introduced MRC (Multipath Reliable Connection), a networking protocol developed over two years with partners including NVIDIA and AMD. It lets a supercomputer connect more than 131,000 GPUs using only two switch tiers, which reduces the number of switches and optics the network needs. The protocol is already used in production to train frontier models for ChatGPT and Codex.

Why This Matters

In large-scale AI training, a single delayed data transfer can leave an entire GPU cluster idle, which wastes both money and training capacity. At this scale, network congestion and hardware failures are frequent. MRC addresses them by shifting routing intelligence from the switch to the NIC, so training jobs can survive hardware failures that would traditionally force the whole job to terminate, and performance stays predictable while failures are being handled.

Key Insights

  • Fact: MRC reduces infrastructure costs by requiring only two-thirds of the optics and three-fifths of the switches used by a traditional three-tier network (OpenAI, 2026).
  • Concept: Intelligent Packet-Spray Load Balancing spreads data across hundreds of paths simultaneously, eliminating the single-path congestion bottlenecks found in RoCEv2.
  • Tool: The protocol is implemented on hardware including NVIDIA ConnectX-8, AMD Pollara, and Broadcom Thor Ultra NICs, utilizing SRv6 for static source routing.
  • Concept: Microsecond-level failure recovery is handled by adaptive mechanisms in the NIC. Dynamic routing is disabled in the switches so that it cannot interfere with those NIC-level mechanisms.
  • Fact: During recent frontier model training, OpenAI rebooted four tier-1 switches without coordinating with training teams, as MRC automatically rerouted traffic around the maintenance.
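
The packet-spray and failure-avoidance ideas above can be illustrated with a small sketch. This is a toy model, not MRC's actual algorithm: the class `SprayingSender`, its methods, and the uniform-random path choice are all hypothetical names and simplifications chosen for illustration. It only shows the general idea of a sender (standing in for the NIC) choosing a path per packet and dropping a path from its set once that path is considered failed.

```python
import random
from dataclasses import dataclass, field


@dataclass
class SprayingSender:
    """Toy NIC-side path selector (hypothetical; not MRC's real logic)."""
    num_paths: int
    seed: int = 0
    healthy: set = field(init=False)

    def __post_init__(self):
        # Start with every path considered usable.
        self.healthy = set(range(self.num_paths))
        self._rng = random.Random(self.seed)

    def pick_path(self) -> int:
        # Spray: each packet independently picks one of the healthy paths,
        # instead of pinning the whole flow to a single path.
        if not self.healthy:
            raise RuntimeError("no healthy paths left")
        return self._rng.choice(sorted(self.healthy))

    def mark_failed(self, path: int) -> None:
        # The sender itself stops using a path it believes is broken;
        # no switch-side rerouting is involved in this model.
        self.healthy.discard(path)


def send_flow(sender: SprayingSender, num_packets: int) -> dict:
    """Send one flow and return a {path: packet_count} histogram."""
    counts = {}
    for _ in range(num_packets):
        path = sender.pick_path()
        counts[path] = counts.get(path, 0) + 1
    return counts


sender = SprayingSender(num_paths=256)
before = send_flow(sender, 10_000)   # load spread over many paths
sender.mark_failed(7)                # e.g. a switch on path 7 is rebooted
after = send_flow(sender, 10_000)    # path 7 no longer carries traffic
```

In a single-path scheme, all 10,000 packets of the flow would share one path, so congestion or a failure on that path stalls the whole transfer. In the sketch, no path carries more than a small fraction of the flow, and a failed path is simply removed from the candidate set.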

Practical Applications

  • Use case: Training frontier LLMs on NVIDIA GB200 supercomputers at Microsoft Fairwater and OCI sites using multi-plane 100Gb/s fabrics. Pitfall: Relying on traditional 800Gb/s single-link connections increases the ‘blast radius’ of failures and requires more switch tiers.
  • Use case: Implementing the NSCC congestion control algorithm within the Ultra Ethernet Consortium (UEC) framework for large-scale RDMA. Pitfall: Using standard dynamic routing on switches can interfere with host-based adaptive routing, leading to unstable network performance.
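
The figures quoted in this summary (about 131,000 GPUs in two tiers, two-thirds of the optics, three-fifths of the switches) are consistent with standard folded-Clos arithmetic. The sketch below is one plausible reading, not a calculation taken from the source. It assumes a 51.2 Tb/s switch that can be used either as 64 ports of 800 Gb/s or as 512 ports of 100 Gb/s, a non-blocking topology, and a GPU with one 100 Gb/s port in each of eight planes. The function names are hypothetical.

```python
from fractions import Fraction

SWITCH_CAPACITY_GBPS = 51_200  # assumption: 51.2 Tb/s switch, not stated in the source


def ports(port_speed_gbps: int) -> int:
    """Number of ports the assumed switch offers at a given port speed."""
    return SWITCH_CAPACITY_GBPS // port_speed_gbps


def clos_endpoints(radix: int, tiers: int) -> int:
    """Max endpoints of a non-blocking folded Clos built from radix-port switches.

    Every tier except the top splits its ports half down, half up,
    which gives 2 * (radix / 2) ** tiers endpoints.
    """
    return 2 * (radix // 2) ** tiers


def switch_ports_per_endpoint(tiers: int) -> int:
    # Each non-top tier spends two ports per endpoint (one down, one up);
    # the top tier spends one.
    return 2 * tiers - 1


def links_per_endpoint(tiers: int) -> int:
    # One host link plus one inter-switch link per tier boundary.
    return tiers


# Two tiers of 512 x 100 Gb/s ports: each plane reaches this many endpoints,
# and each GPU has one port per plane.
multi_plane_two_tier = clos_endpoints(ports(100), tiers=2)     # 131072

# The same switch used as 64 x 800 Gb/s ports needs three tiers to get close.
single_link_two_tier = clos_endpoints(ports(800), tiers=2)     # 2048
single_link_three_tier = clos_endpoints(ports(800), tiers=3)   # 65536

# Switch capacity and link bandwidth consumed per GPU, two-tier vs three-tier.
switch_ratio = Fraction(switch_ports_per_endpoint(2), switch_ports_per_endpoint(3))  # 3/5
optics_ratio = Fraction(links_per_endpoint(2), links_per_endpoint(3))                # 2/3
```

Because eight 100 Gb/s ports add up to the same bandwidth as one 800 Gb/s port, the per-endpoint port and link counts can be compared directly between the two designs, which is where the 3/5 and 2/3 ratios come from under these assumptions.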
