postmortem

The Rules Without Origin

Chapter 1 of 38

Every engineering team has a list of rules. Some are written in style guides. Some live in linter configurations. Some exist only as oral tradition, passed from senior engineer to junior engineer during code review with the phrase “we don’t do that here” and no further explanation.

Do not cast between floating point and integer types without explicit range checking. Do not deploy on Fridays. Never reuse feature flags. Always test your backups. Never trust client-supplied units without validation. Do not allow unbounded recursion in string matching. Keep humans in the loop for safety-critical decisions.

These rules have the weight of commandments and the shelf life of sticky notes. Under deadline pressure, an engineer who cannot explain why a rule exists will break it. Not out of recklessness, but out of rational prioritization. A rule that sounds like a preference gets treated like a preference. A rule that sounds like it was written in the aftermath of someone’s death gets treated differently.

Every rule in the preceding list was written in the aftermath of a specific failure. Some of those failures destroyed equipment. Some destroyed companies. Some killed people. This book investigates twelve of them.

What This Book Does

Each chapter reconstructs one engineering failure from the available historical record. The reconstruction follows a fixed structure:

The system as its engineers understood it. What the system was designed to do. What the architects believed about its behavior. What the documentation and safety analyses claimed. Written from the perspective of the team before the failure, without foreshadowing. The system looked reasonable. The chapter shows why.

The chain. The sequence of decisions, events, and conditions that led to the failure. Stated as discrete events with timestamps where the record supports them. The reader can trace the chain and identify the point where the outcome became inevitable, and understand why nobody saw it at the time.

The mechanism. The technical explanation at the code, architecture, or physics level. This is the deepest section. Race conditions shown as interleaving diagrams. Overflow errors shown with actual numeric values. Cascading failures shown as dependency graphs. This section does not simplify. It explains the failure at the level required to understand it.

What the review missed. What the official investigation concluded, and where that conclusion was incomplete or misdirected. When a review was thorough and correct, the chapter says so. When a review identified a person instead of a system condition, the chapter says that too.

What changed. The traceable consequences: the standard that was written, the language feature that was added, the regulation that was passed, the tool that was built. Direct causation where it exists. Honest uncertainty where it does not.

The rule. A single sentence stating the engineering principle this failure produced, followed by the failure it came from. So the rule is never separated from its origin again.
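As a taste of what "overflow errors shown with actual numeric values" looks like in the mechanism sections, here is a minimal sketch in Python. It is a generic illustration, not a reconstruction of any incident in this book. The function names `wrap_to_int16` and `checked_to_int16` and the reading of 40000.0 are hypothetical, chosen only to make the arithmetic visible.

```python
import math

INT16_MIN, INT16_MAX = -32768, 32767


def wrap_to_int16(value: float) -> int:
    """Unchecked narrowing: keep only the low 16 bits (two's complement)."""
    return (int(value) + 0x8000) % 0x10000 - 0x8000


def checked_to_int16(value: float) -> int:
    """Narrowing with an explicit range check; refuses values that do not fit."""
    if not math.isfinite(value):
        raise OverflowError(f"{value!r} is not a finite number")
    truncated = int(value)  # truncates toward zero
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise OverflowError(f"{value!r} does not fit in a signed 16-bit integer")
    return truncated


if __name__ == "__main__":
    reading = 40000.0  # illustrative value, not from any incident record
    print(wrap_to_int16(reading))  # -25536: plausible-looking, and wrong
    try:
        checked_to_int16(reading)
    except OverflowError as err:
        print("rejected:", err)
```

The unchecked version returns a number that looks like data. The checked version forces the caller to decide what an out-of-range value means, which is the decision the rule exists to provoke.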

Five Positions

This book has a point of view and states it here.

The engineers were not incompetent. This is the most important position and it is never contradicted. The Therac-25 software was written by a programmer who had worked on the previous model’s software and understood radiation therapy machines. The Ariane 5 guidance system was built by a team that had produced a successful guidance system for the Ariane 4. The Knight Capital deployment was executed by engineers who had deployed hundreds of times before. In every case, the people involved were experienced, skilled, and making decisions that were reasonable given what they knew at the time. The purpose of investigating these failures is not to find the person who made the mistake. It is to find the system condition that made the mistake invisible until it was too late.

Hindsight is not analysis. Knowing the outcome makes every warning sign look obvious. The investigative discipline in this book requires reconstructing what was visible to the engineers at the time, what was not, and why the gap existed. Every chapter respects this constraint.

Every engineering rule has a failure behind it. The rule “never reuse a feature flag for a different purpose than its original deployment” sounds like organizational tidiness. It is the lesson extracted from a firm that lost $440 million in 45 minutes because new code repurposed an old flag and, on a production server that never received the new code, that flag activated dead code instead. The rule “always validate unit consistency at integration boundaries” sounds like defensive programming. It is the lesson extracted from a $327 million spacecraft that burned up in the Martian atmosphere because one team reported thruster impulse in pound-force seconds, another expected newton-seconds, and no check existed at the interface.

Rules without origin stories are cargo cult engineering. The engineer follows them when convenient and discards them when pressed. Rules with a body count are different. They persist.
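The unit-consistency rule can be made concrete with a small sketch. This is one possible shape for a boundary check in Python, not the design of any system discussed in this book. The `Impulse` type, the unit strings, and `to_newton_seconds` are hypothetical names introduced for illustration. The conversion factor follows from the definition of the pound-force.

```python
from dataclasses import dataclass

# 1 lbf*s expressed in N*s (0.45359237 kg * 9.80665 m/s^2).
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """A value that cannot cross the interface without naming its unit."""
    value: float
    unit: str  # hypothetical convention: "N*s" or "lbf*s"


def to_newton_seconds(impulse: Impulse) -> float:
    """Normalize at the integration boundary; reject anything unrecognized."""
    if impulse.unit == "N*s":
        return impulse.value
    if impulse.unit == "lbf*s":
        return impulse.value * LBF_S_TO_N_S
    raise ValueError(f"unknown impulse unit: {impulse.unit!r}")


if __name__ == "__main__":
    print(to_newton_seconds(Impulse(10.0, "N*s")))
    print(to_newton_seconds(Impulse(10.0, "lbf*s")))
    try:
        to_newton_seconds(Impulse(10.0, "kgf*s"))
    except ValueError as err:
        print("rejected:", err)
```

A bare float carries no unit, so the receiving side can only assume one. Making the unit part of the value turns that silent assumption into a check that either passes or fails loudly.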

Failures are not random. Twelve failures spanning five decades, six industries, and a dozen technology stacks. The same patterns recur: untested assumptions carried forward from a previous system. Race conditions in systems where the designers believed concurrency was not a factor. Implicit contracts between components that were never written down and never tested. Cost-cutting decisions that removed the redundancy designed to contain exactly the failure that occurred. The specific technologies change. The failure modes do not.

The industry learned something from each. Not always the right thing. Not always quickly. But the line from failure to changed practice is traceable. The Therac-25 accidents helped motivate the medical device software regulation that culminated in IEC 62304, the standard governing medical device software life cycle processes. The Ariane 5 explosion contributed to the adoption of stronger typing disciplines and explicit exception handling requirements in safety-critical systems. Knight Capital’s collapse accelerated the adoption of deployment automation with rollback capability. Log4Shell forced a reckoning with transitive dependency management that the industry had ignored for fifteen years. The learning happened. This book documents the cost of each lesson.

What This Book Is Not

Not a collection of cautionary tales designed to make engineers nervous. Fear is a poor teacher and a worse engineering practice. The purpose is comprehension, not anxiety.

Not a blame assignment exercise. Every chapter identifies system conditions, not guilty individuals. When an official review blamed a person, the chapter examines whether that conclusion was supported by the evidence or whether it was a convenient substitute for systemic analysis.

Not a comprehensive history of software failures. Twelve failures were selected because they are well-documented, technically instructive, and traceable to lasting changes in engineering practice. Hundreds of other failures could have been included. These twelve were chosen because they illuminate the patterns most clearly.

Not a compliance guide. Standards are referenced where failures produced them, but this book does not teach compliance. It teaches the engineering reasoning that compliance frameworks attempt to encode.

How to Read This Book

The chapters are grouped by theme but each stands alone. An engineer interested in deployment failures can read the Knight Capital chapter without reading the Therac-25 chapter first. An engineer interested in supply chain risk can read the Log4Shell and Left-Pad chapters as a pair.

The recommended path through the book is sequential. The patterns accumulate. By the time the reader reaches the conclusion, the individual failures have composed into a taxonomy of system failure modes that applies to any software system the reader will build next.

The code in this book is evidence, not instruction. When a chapter shows code from the Therac-25 control software, which was written in PDP-11 assembly, the purpose is to demonstrate the race condition that killed patients, not to teach assembly programming. When a chapter shows the Ada type conversion that destroyed Ariane 5, the purpose is to show the exact numeric values at the exact moment the exception was raised. The code blocks are annotated with // FAILURE POINT and, where the code is reconstructed from incident reports rather than taken from primary sources, with // RECONSTRUCTED FROM INCIDENT REPORT.

Every diagram in this book shows something that prose cannot show as clearly. A timeline that makes the inevitability visible. An interleaving that reveals the race window. A propagation graph that shows why the failure could not be contained. If a diagram could be replaced by a sentence, it would be.

The rule at the end of each chapter is designed to be remembered. It is one sentence. It is followed by the name of the failure it came from. When an engineer remembers the rule, they remember the failure. When they remember the failure, they understand why the rule exists. That is the mechanism this book relies on, and it is the only mechanism that works under deadline pressure.