Reliability Is an Emergent Property, Not a Root Cause
These articles are AI-generated summaries. Please check the original sources for full details.
Looking for Root Causes is a False Path
In a recent podcast, David Blank-Edelman, program lead for Microsoft’s SRE Academy, argued that searching for a single root cause of system failures is counterproductive. He emphasized that reliability is an emergent property of complex systems, shaped by interactions between technical and socio-technical factors, not by isolating one failure point.
Why This Matters
Traditional root cause analysis assumes a single failure point, but modern systems are too interconnected for this approach. As Blank-Edelman explains, failures often stem from multiple contributing factors, including human decisions, design trade-offs, and unanticipated interactions. For example, the B-12 bomber crash in WWII was not caused by a single mechanical failure but by poorly designed cockpit switches, a socio-technical oversight. Focusing on root causes ignores these systemic issues, leading to recurring failures and missed opportunities for systemic improvement.
Key Insights
- “Reliability is an emergent property of an architecture and can include any property important to the customer, such as availability or durability.” (David Blank-Edelman, 2025)
- “Failures have multiple causes, some of which are socio-technological in nature.” (Podcast transcript, 2025)
- “Temporal used by Stripe, Coinbase” (Example of tools for managing distributed workflows)
Practical Applications
- Use Case: Microsoft’s SRE Academy trains Azure engineers to focus on systemic feedback loops rather than isolated incidents.
- Pitfall: Prematurely blaming human error or a single component in post-incident reviews can mask deeper systemic issues, leading to recurring outages.
References:
Continue reading
Next article
Prevent a page from scrolling while a dialog is open
Related Content
P2P vs. Broker: Scaling Multi-Agent Systems via Pilot Protocol
Multi-agent system inquiries surged 1,445% as teams hit broker bottlenecks, driving a shift toward P2P architectures like Pilot Protocol.
Prioritizing Service Level Indicators Over Objectives for Effective Reliability
Samson Tanimawo argues that SLIs are more critical than SLOs, as poor indicators like healthcheck status fail to reflect true user experience.
System Reliability Lessons from Nigeria's ₦1.92 Trillion Market Crash
Nigeria's stock market lost ₦1.92 trillion following a single regulatory change, offering a masterclass in single points of failure and eventual consistency.