Skip to main content
postmortem

What the Review Missed and What Changed

5 min read Chapter 25 of 38

What the Review Missed

GitLab’s post-incident review is one of the most thorough and transparent in the history of the software industry. The company published a detailed incident report, a public Google Doc tracking the recovery in real-time, and a live YouTube stream of the database restoration. The review correctly identified every cause: the terminal confusion, the pg_dump failure, the replication corruption, the unconfigured snapshots, the dependent S3 uploads. The review did not miss the causes. It identified them exhaustively.

What the review acknowledged honestly, and what makes it valuable as a case study, is that GitLab’s team knew backups were important. They had designed a five-layer backup strategy. They had allocated engineering effort to building backup mechanisms. What they had not done was verify that those mechanisms worked. The gap between “we have backups” and “we have verified, tested, restorable backups” is the lesson.

The review also addressed the human factors component without blame. The engineer who executed the command in the wrong terminal was not disciplined. GitLab publicly stated that the error was a predictable outcome of the tooling and workflow, not an individual failure. This position, consistent with the principles stated in this book’s introduction, treats the engineer’s action as a symptom of system design rather than a cause of the failure.

What Changed

Backup verification as a practice. The GitLab incident became the industry’s canonical example of untested backups. The phrase “your backup strategy is not your backup strategy until you have tested a restore” existed before the incident. After the incident, it had a face and a body count (707 users’ lost data).

The practice that emerged: backup verification must include periodic automated restore tests. A backup that has not been restored is not a backup. It is a hope. The restore test must verify not just that the backup file exists and is non-empty, but that the data can be loaded into a functioning database and that the application can operate against the restored data. This end-to-end verification is the only way to detect the class of silent failures that affected GitLab: a backup file that exists but is corrupted, a backup process that runs but produces empty output, a backup that restores but is missing tables or constraints.

Monitoring of backup processes. The silent pg_dump failure was not detected because no monitoring was configured for the backup process. After GitLab, the practice of monitoring backup process exit codes, output file sizes, and timestamps became standard. A backup process that fails must generate an alert. A backup output file that is smaller than expected must generate an alert. A backup that is older than the expected schedule must generate an alert. These alerts are not difficult to implement. Their absence at GitLab was not a technical limitation. It was a prioritization decision: building features for users took precedence over building monitoring for infrastructure, until the infrastructure failed.

Production access controls. The rm -rf command was executed by a human directly connected to a production server via SSH. After the incident, the industry accelerated its adoption of two practices:

Production database access through bastion hosts or privileged access management systems that record sessions, require multi-factor authentication, and display clear visual indicators of which environment the operator is connected to.

Prohibition of destructive commands on production systems without confirmation. Tools like molly-guard (which requires typing the hostname before allowing shutdown or reboot on a server) and custom wrappers around rm that require confirmation for operations on critical directories are not new, but their adoption increased after the GitLab incident.

Infrastructure as Code for backup configuration. The LVM snapshots and Azure disk snapshots were “not configured,” meaning the capability existed but the configuration was never applied. After the incident, the practice of defining backup configurations in infrastructure-as-code (Terraform, Ansible, or similar tools) and verifying their presence through automated compliance checks became more widely adopted. If the backup configuration is code, it can be reviewed, versioned, and tested. If it is a manual step in a setup checklist, it can be skipped.

GitLab’s own response. GitLab implemented comprehensive backup verification, including automated daily restore tests, monitoring of all backup processes with alerting, and separation of backup storage from the systems being backed up. The company’s transparency during the incident, including the live-streamed recovery, set a standard for incident communication that influenced the broader industry’s approach to public post-mortems.

The Rule

A backup that has never been restored is not a backup. It is an untested hypothesis. Verify backup integrity through automated, periodic restore tests, and monitor every backup process for silent failures. The time to discover that your backups do not work is not during the incident.

This rule comes from GitLab, where five independent backup mechanisms failed simultaneously because none had been tested under the conditions that mattered: an actual need to restore production data.