Disaggregated Scheduled Fabric (DSF): Scaling Meta’s AI Infrastructure
These articles are AI-generated summaries. Please check the original sources for full details.
Summary: Meta’s Disaggregated Scheduled Fabric (DSF) for Scalable AI Training
Meta has developed Disaggregated Scheduled Fabric (DSF), a network fabric designed to overcome the limitations of traditional Clos-based networks in large-scale AI training. DSF targets the “elephant flows” and “low entropy” traffic patterns common in AI workloads. The architecture separates the network into an Ethernet domain and a “fabric” domain, and uses packet spraying with credit-based congestion control to balance load across the fabric. DSF can interconnect thousands of GPUs within a data center region and across multiple regions, and its “Input Balanced Mode” keeps the fabric balanced when links fail. Future work focuses on further scaling, handling hardware heterogeneity, and improving port utilization.
Overview of DSF
DSF is a network fabric designed specifically for AI training workloads. It addresses the limitations of traditional network architectures like Clos networks, which struggle to efficiently handle the large, complex traffic patterns generated by AI models. DSF achieves scalability and performance by disaggregating network components into distinct, interconnected devices.
Key Concepts:
- Disaggregation: Separates network components (line cards and fabric cards) into independent hardware devices.
- Packet Spraying: Distributes traffic across all available paths within the fabric.
- Credit-Based Congestion Control: Senders transmit only against granted credits, so bandwidth is allocated dynamically according to current path availability and congestion.
- Virtual Output Queuing (VOQ): Manages traffic at fine granularity by keeping an independently scheduled queue for each destination port and service class.
- FBOSS: Meta’s open-source network operating system that orchestrates the DSF fabric.
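The interaction between VOQs and credits can be illustrated with a minimal sketch. This is a toy model, not Meta's or FBOSS's implementation: the class `IngressVOQs`, the function `grant_credits`, the one-credit-per-packet unit, and the round-robin grant policy are all assumptions made for illustration.

```python
from collections import defaultdict, deque


class IngressVOQs:
    """Ingress-side queues, one per (egress_port, traffic_class) pair."""

    def __init__(self):
        self.queues = defaultdict(deque)

    def enqueue(self, egress_port, traffic_class, packet):
        self.queues[(egress_port, traffic_class)].append(packet)

    def backlog(self, egress_port):
        """Number of packets waiting for the given egress port."""
        return sum(len(q) for (port, _), q in self.queues.items()
                   if port == egress_port)

    def send(self, egress_port, credits):
        """Dequeue at most `credits` packets for egress_port.

        Assumption: lower traffic-class numbers are served first.
        """
        sent = []
        for key in sorted(k for k in self.queues if k[0] == egress_port):
            queue = self.queues[key]
            while queue and len(sent) < credits:
                sent.append(queue.popleft())
        return sent


def grant_credits(requests, port_capacity):
    """Egress-side scheduler (hypothetical policy).

    `requests` maps an ingress name to its backlog for one egress port.
    Credits are handed out one at a time in round-robin order, never more
    than an ingress asked for and never more than the port can drain.
    """
    grants = {name: 0 for name in requests}
    remaining = port_capacity
    while remaining > 0:
        progressed = False
        for name in sorted(requests):
            if remaining > 0 and grants[name] < requests[name]:
                grants[name] += 1
                remaining -= 1
                progressed = True
        if not progressed:  # every request is already satisfied
            break
    return grants


# Two ingress devices compete for egress port 7, which can drain 4 packets.
ingress_a, ingress_b = IngressVOQs(), IngressVOQs()
for i in range(5):
    ingress_a.enqueue(7, 0, f"a{i}")
ingress_b.enqueue(7, 1, "b0")

grants = grant_credits(
    {"A": ingress_a.backlog(7), "B": ingress_b.backlog(7)}, port_capacity=4
)
print(grants)
print(ingress_a.send(7, grants["A"]), ingress_b.send(7, grants["B"]))
```

Because an ingress device never sends more than it was granted, the egress port is not oversubscribed in this model; excess traffic waits in the ingress VOQs instead of being dropped inside the fabric.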
Challenges with Traditional IP Fabric for AI Training
Traditional IP fabric architectures encounter several challenges when supporting AI training:
- Elephant Flows: Long-duration, high-bandwidth flows that can cause congestion.
- Low Entropy: Limited variation in flow patterns, leading to hash collisions and sub-optimal load distribution.
- Suboptimal Fabric Utilization: Significant skew in bandwidth utilization across network paths.
Meta explored approaches such as Border Gateway Protocol (BGP) policies and load-aware ECMP schemes, but these brought their own problems: complex tuning, out-of-order packets, and poor behavior in failure scenarios. Traffic engineering solutions proved too complex to scale with network size.
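The effect of low entropy on hash-based load balancing can be shown with a small simulation. This is a sketch under stated assumptions: SHA-256 stands in for a switch's ECMP hash, the flow tuples and sizes are invented, and round-robin cell spraying is a simplification of what a real fabric does.

```python
import hashlib


def ecmp_path(flow, num_paths):
    """Map a flow's 5-tuple to one path (SHA-256 stands in for a switch hash)."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_paths


def ecmp_load(flows, num_paths):
    """Load per path when every byte of a flow follows the flow's hashed path."""
    load = [0] * num_paths
    for flow, size in flows.items():
        load[ecmp_path(flow, num_paths)] += size
    return load


def sprayed_load(flows, num_paths, cell_size):
    """Load per path when traffic is cut into cells and sprayed round-robin.

    Any remainder smaller than one cell is ignored to keep the sketch short.
    """
    load = [0] * num_paths
    next_path = 0
    for size in flows.values():
        for _ in range(size // cell_size):
            load[next_path] += cell_size
            next_path = (next_path + 1) % num_paths
    return load


# Hypothetical workload: four long-lived, equally sized flows (low entropy).
flows = {
    ("10.0.0.1", "10.0.1.1", 50000, 4791, "udp"): 1000,
    ("10.0.0.2", "10.0.1.2", 50001, 4791, "udp"): 1000,
    ("10.0.0.3", "10.0.1.3", 50002, 4791, "udp"): 1000,
    ("10.0.0.4", "10.0.1.4", 50003, 4791, "udp"): 1000,
}
print("ECMP   :", ecmp_load(flows, num_paths=8))
print("Sprayed:", sprayed_load(flows, num_paths=8, cell_size=10))
```

With four flows and eight paths, per-flow hashing can use at most four paths, so at least half the paths stay idle regardless of the hash function, and any collision doubles the load on one path. Spraying places 500 units on each of the eight paths in this example.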
DSF Architecture
DSF employs a two-domain architecture:
- Ethernet Domain: Handles external connectivity and routing using traditional protocols.
- Fabric Domain: Dedicated to high-speed traffic distribution between servers, using packet spraying and credit-based control.
Key Components:
- Interface Nodes (INs) / Rack Disaggregated Switches (RDSWs): Network-facing components responsible for external connectivity and routing.
- Fabric Nodes (FNs) / Fabric Disaggregated Switches (FDSWs): Internal switching elements dedicated to high-speed traffic distribution within the fabric.
Topology:
- AI Zone: A building block consisting of multiple scaling units (SUx). Each SUx contains RDSWs connected via FDSWs.
- DSF L1 Zone: Connects multiple AI zones, with FDSWs as the aggregation layer.
- DSF L2 Zone: Connects multiple DSF L1 zones, with SDSWs (Spine Disaggregated Switches) as the aggregation layer.
- DSF Region: Connects multiple DSF L2 zones through L3 super-spines.
Input Balanced Mode (IBM)
A critical feature of DSF is Input Balanced Mode (IBM), designed to maintain balanced input capacity even during network link failures.
How IBM Works:
- Failure Detection: When a link fails, the affected device reduces its advertised reachability to the affected neighbor(s).
- Randomized Link Selection: The device randomly selects a subset of its available links on which to stop advertising reachability, so the capacity reduction is spread across links rather than concentrated on a few.
- Propagation: The failure information propagates throughout the network, with devices adjusting their link advertisements to maintain balance.
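The link-selection step above can be sketched as follows. This is an illustrative model only: the function name `links_to_withdraw`, the equal-capacity links, and the balance rule used here (advertising input links must not outnumber healthy output links) are assumptions, not details confirmed by the source.

```python
import random


def links_to_withdraw(input_links, healthy_output_links, rng=None):
    """Pick input-side links on which to stop advertising reachability.

    Assumptions: every link has the same capacity, and the device is
    balanced when the number of input links still advertising
    reachability does not exceed the number of healthy output links.
    """
    if rng is None:
        rng = random.Random()
    excess = len(input_links) - len(healthy_output_links)
    if excess <= 0:
        return set()  # already balanced, nothing to withdraw
    # Random choice avoids always penalizing the same neighbors.
    return set(rng.sample(sorted(input_links), excess))


# Hypothetical device with four input and four output links.
inputs = {"in0", "in1", "in2", "in3"}
outputs = {"out0", "out1", "out2", "out3"}

failed = {"out2"}  # one output link goes down
withdrawn = links_to_withdraw(inputs, outputs - failed)
still_advertising = inputs - withdrawn
print("withdrawn:", withdrawn)
print("advertising on:", sorted(still_advertising))
```

In this model a single output-link failure causes exactly one randomly chosen input link to stop advertising reachability, leaving three advertising inputs for three healthy outputs. Upstream devices that see the withdrawal would repeat the same calculation, which corresponds to the propagation step.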
Types of Failures and Propagation:
- FDSW <-> RDSW: The FDSW reduces its capacity to the RDSW, and the RDSW propagates the reduced capacity to other FDSWs.
- FDSW <-> SDSW: Similar to FDSW <-> RDSW, the FDSW reduces its capacity to the SDSW, and the SDSW propagates the reduced capacity to other FDSWs.
- RDSW <-> SDSW: The RDSW reduces its capacity to the SDSW, and the SDSW propagates the reduced capacity to other RDSWs.
Future Directions
- Inter-Region Connectivity: Connecting multiple DSF zones to create larger, interconnected clusters spanning multiple regions. This presents challenges related to heterogeneity in hardware and network configurations.
- Enhanced Port Utilization: Exploring technologies like “Hyperports” to combine multiple 800G ports into a single logical port, improving utilization and reducing the impact of link failures.
- Heterogeneity Management: Addressing the complexities of managing different hardware models within the DSF fabric.
Conclusion
DSF is Meta's answer to the limits of traditional IP fabrics for AI training. By replacing hash-based load balancing with packet spraying and credit-based congestion control, and by adding Input Balanced Mode for failure handling, it aims to provide scalable, resilient, and efficiently utilized AI infrastructure. Ongoing work targets inter-region connectivity, better port utilization, and support for heterogeneous hardware.