What the Review Missed

The initial response to the accidents was not a systemic investigation. It was a sequence of isolated reactions.

After the first reported overdose at the Kennestone Regional Oncology Center in 1985, the hospital contacted AECL. AECL’s engineers could not reproduce the malfunction. They responded that the machine’s safety systems made an overdose “impossible.” This was not a lie. It was a conclusion derived from a safety analysis that assumed the software was correct, the same analysis that had justified removing the hardware interlocks. The reasoning was circular: the machine is safe because the analysis says so, and the analysis says so because it takes for granted that the software, which had worked before, contains no faults.

AECL investigated by reviewing the code and running the machine through its standard test procedures. The standard test procedures did not include an operator correcting a mode selection quickly at the terminal. They tested the machine’s behavior under the assumption that operator inputs arrive at a pace slow enough for the setup task to process each one before the next arrives. The race condition is invisible to any test that makes this assumption.
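Why a slow-paced test cannot see this can be shown with a toy model. The sketch below is not AECL's code; the names `run_session` and `SETUP_TICKS`, the abstract tick clock, and the window length are all assumptions made for illustration. It models a setup task that reads the mode once, stays busy for a fixed window, and ignores any edit that lands inside that window.

```python
SETUP_TICKS = 8  # hypothetical length of the setup window, in abstract ticks


def run_session(edit_at_tick):
    """Simulate one treatment setup (toy model, not the real software).

    The operator first enters 'xray' by mistake, then corrects the mode to
    'electron' at edit_at_tick. Returns (displayed_mode, hardware_mode).
    """
    displayed = "xray"         # what the operator sees on the screen
    pending = "xray"           # value the setup task read at tick 0
    busy_until = SETUP_TICKS   # setup task ignores edits before this tick
    hardware = ""              # what the hardware ends up configured for

    for tick in range(edit_at_tick + SETUP_TICKS + 1):
        if tick == edit_at_tick:
            displayed = "electron"
            if tick >= busy_until:
                # Setup is idle, so the edit is noticed and setup restarts.
                pending = "electron"
                busy_until = tick + SETUP_TICKS
            # Otherwise the edit changes the screen only: this is the bug.
        if tick == busy_until:
            hardware = pending

    return displayed, hardware


# A test that paces its inputs slowly passes:
print(run_session(edit_at_tick=20))  # ('electron', 'electron')
# A fast correction inside the setup window leaves screen and hardware out of step:
print(run_session(edit_at_tick=3))   # ('electron', 'xray')
```

Both calls supply identical input values. Only the timing of the correction differs, which is why a procedure that fixes the pacing of inputs exercises only the first case.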

After the Tyler accidents in 1986, in which two patients were overdosed within weeks of each other, the pattern became visible. Both accidents involved the same sequence: mode entry, correction with the cursor keys, and a rapid return to the command line before the machine had finished setting up. After the second accident, the hospital’s physicist and the operator reconstructed the keystroke sequence from her recollection and reproduced the fault. Only with that sequence in hand could AECL’s engineers trace it to the race condition in the code.

The FDA’s investigation was more thorough but constrained by the era’s regulatory framework. The FDA did not have the authority or the precedent to demand source code review for medical device software. The regulatory category for the Therac-25 treated software as a component of the device, not as a safety-critical system requiring independent verification. The FDA’s primary enforcement mechanism was requiring the manufacturer to issue corrective actions, which AECL did, in stages, over more than two years.

AECL’s corrective actions were incremental. They added a software fix for the race condition. They added hardware interlocks, reinstating the protection that had been removed. They improved error messages. Each fix addressed a specific symptom. No fix addressed the systemic issue: a safety-critical system with a single software safety layer, no formal specification, no independent review, and no testing methodology designed to find concurrency errors.

Six Therac-25 overdose accidents were documented between 1985 and 1987. At least three patients died as a direct result. Others suffered severe radiation injuries.

What Changed

The Therac-25 accidents became one of the most cited case studies in software safety engineering. Their consequences can be traced across three domains.

Regulatory. The FDA fundamentally changed its approach to software in medical devices. The 1997 FDA guidance on software validation, and subsequently the international standard IEC 62304 (Medical device software, Software life cycle processes), established requirements for software development processes in medical devices. IEC 62304 requires software to be classified by safety risk, mandates documented software development plans, requires traceability from requirements to verification, and scales the required verification rigor to the safety class. Requirements of this kind were not in place before the Therac-25 accidents. They exist in large part because of them.

Engineering practice. The Therac-25 case established several principles that are now standard in safety-critical software engineering:

Software must never be the sole safety layer in a system where a software failure can cause physical harm. Independent hardware interlocks or independent software monitors, running on separate hardware, must provide a backup safety check. This is defense in depth, and its adoption as a mandatory requirement in safety-critical systems traces directly to the removal of hardware interlocks on the Therac-25.

Safety cases must account for software failure modes, not just software correctness. The Therac-25 safety case assumed the software was correct and therefore concluded that the hardware interlocks were redundant. A safety case that asked “what happens if the software has a bug?” would have required independent verification at the hardware level.

Concurrency errors require concurrency-aware testing. Standard functional testing, where inputs are provided and outputs are checked, rarely finds race conditions because race conditions depend on timing, not on input values. Finding them requires techniques that deliberately vary or enumerate timing: stress testing, randomized scheduling, and formal methods such as model checking, which explore the possible interleavings of concurrent tasks systematically.
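As a minimal sketch of what exploring interleavings means, the following toy model enumerates every order in which an operator task and a setup task can take their steps, then checks one invariant at the end: the hardware mode must match the displayed mode. The step functions (`edit`, `read`, `configure`), the state dictionary, and the invariant are hypothetical simplifications for illustration, not a reconstruction of the Therac-25 software.

```python
def interleavings(a, b):
    """Yield every merge of step lists a and b that preserves each list's order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def edit(state):
    """Operator corrects the mode on screen."""
    state["mode"] = "electron"
    if state["hardware"] is not None:
        # Setup already finished, so the edit triggers a fresh setup.
        state["hardware"] = state["mode"]


def read(state):
    """Setup task reads the requested mode once."""
    state["snapshot"] = state["mode"]


def configure(state):
    """Setup task configures the hardware from its earlier read."""
    state["hardware"] = state["snapshot"]


def run(schedule):
    state = {"mode": "xray", "snapshot": None, "hardware": None}
    for step in schedule:
        step(state)
    return state


def failing_schedules():
    """Return the step orders that end with hardware and display out of step."""
    bad = []
    for schedule in interleavings([edit], [read, configure]):
        state = run(schedule)
        if state["hardware"] != state["mode"]:
            bad.append([step.__name__ for step in schedule])
    return bad


print(failing_schedules())  # [['read', 'edit', 'configure']]
```

Of the three possible schedules, a functional test typically runs only one or two: the operator finishes before setup starts, or edits after setup is done. The failing schedule, where the edit lands between the read and the configure step, is found only because the harness enumerates orders rather than input values. Real systems have far too many interleavings to list by hand, which is the problem model checkers and randomized schedulers address.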

Academic. Nancy Leveson’s analysis of the Therac-25, published with Clark Turner in 1993, became the foundational text of software safety as an engineering discipline. Leveson’s subsequent work on system safety engineering, including her STAMP (Systems-Theoretic Accident Model and Processes) framework, built directly on the analytical approach developed during the Therac-25 investigation. The move from component-failure models (where you ask “which part broke?”) to systemic models (where you ask “what system condition allowed a component failure to propagate to harm?”) begins with this case.

The Rule

Never rely on software as the sole safety mechanism in a system where software failure can cause physical harm. Independent safety checks, on independent hardware, must verify that the system state is safe before every safety-critical action.

This rule comes from the Therac-25, where removing hardware interlocks because the software “had worked before” created a machine that could deliver roughly 100 times the intended radiation dose when an experienced operator corrected a typo too quickly.
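In code terms, the rule amounts to a final gate that does not trust the controller’s own view of the machine. The sketch below assumes a much simplified machine with two independently sensed quantities, the turntable position and the beam current level. `SensorReadings`, `SAFE_COMBINATIONS`, `interlock_permits`, and `fire_beam` are hypothetical names, and in a real device this check would live in hardware or on a separate processor rather than in the same program as the controller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SensorReadings:
    """Hypothetical readings from sensors the controller software cannot alter."""
    turntable_position: str  # "xray_target" or "electron_scan"
    beam_current: str        # "high" or "low"


# Assumed safe pairings: a high-current beam only with the X-ray target in
# the beam path, a low-current beam only with the electron scanning setup.
SAFE_COMBINATIONS = {
    ("xray_target", "high"),
    ("electron_scan", "low"),
}


def interlock_permits(readings):
    """Independent check: judge safety from sensed state, not from software state."""
    return (readings.turntable_position, readings.beam_current) in SAFE_COMBINATIONS


def fire_beam(controller_says_ready, readings):
    """The beam fires only if the controller and the interlock both agree."""
    if not controller_says_ready:
        return False
    return interlock_permits(readings)


# The controller believes setup is complete, but the sensed state is a
# high-current beam with no target in place. The interlock refuses.
print(fire_beam(True, SensorReadings("electron_scan", "high")))  # False
print(fire_beam(True, SensorReadings("xray_target", "high")))    # True
```

The design point is that `interlock_permits` takes no input from the controller’s variables. A race condition that corrupts the controller’s idea of the mode cannot also corrupt the sensed turntable position, so the two checks fail independently.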