Migration Rollback and the Pipeline Gate That Catches Breaking Changes
Migration Rollback and the Pipeline Gate That Catches Breaking Changes
The Failure
The catalog team merged a migration that added a NOT NULL constraint to the description column. Existing rows had null descriptions for internal-only products. The migration failed in staging because 847 rows violated the constraint. The team fixed the data in staging and re-ran the migration. Then it failed in production because production had 2,341 rows with null descriptions that staging did not have.
The CI pipeline did not catch this because the test database was empty. Empty databases have no constraint violations. The gate must test migrations against a database with realistic data.
The Mechanism
Rollback Strategy Matrix
| Migration Type | Rollback Approach | Automatic? | Data Loss Risk |
|---|---|---|---|
| Add column | Drop column | Yes | None (new column was empty or backfilled) |
| Drop column | Cannot undo | No | Data lost permanently |
| Rename column | Rename back | Yes | None |
| Add constraint | Drop constraint | Yes | None |
| Add index | Drop index | Yes | None |
| Change column type | Change back | Maybe | Precision loss possible |
| Table split | Drop new tables | Yes | None (old table still has data) |
| Data migration | Reverse update | Maybe | Depends on reversibility |
The critical rule: never drop a column or table in the same deployment that removes the code using it. Always wait at least one deployment cycle. If the code rollback needs the column, it must still exist.
CI Gate Design
The backward-compatibility gate must:
- Start a database with the current production schema
- Seed it with a representative data set (anonymized production snapshot or synthetic data)
- Apply the new migration
- Run the current production code’s tests against the new schema
- Verify no existing queries fail
The Implementation
Backward-Compatibility Check
# HARDENED: Migration backward-compatibility gate
name: migration-check
on:
pull_request:
paths:
- "migrations/**"
jobs:
backward-compat:
runs-on: ubuntu-latest
services:
postgres:
image: postgres:16
env:
POSTGRES_DB: migration_test
POSTGRES_USER: test
POSTGRES_PASSWORD: test
options: >-
--health-cmd="pg_isready -U test"
--health-interval=5s
--health-timeout=5s
--health-retries=10
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Load production schema
run: |
# Use the schema dump from the last production deployment
psql -h localhost -U test -d migration_test \
-f schema/production-schema.sql
- name: Seed with test data
run: |
psql -h localhost -U test -d migration_test \
-f schema/seed-data.sql
- name: Apply new migrations
run: |
# Only apply migrations that are new in this PR
NEW_MIGRATIONS=$(git diff --name-only origin/main...HEAD -- migrations/ | sort)
for migration in $NEW_MIGRATIONS; do
echo "Applying $migration..."
psql -h localhost -U test -d migration_test -f "$migration"
done
- name: Test current code against new schema
run: |
# Pull the current production image and run its tests
docker run --rm --network=host \
-e DATABASE_URL=postgresql://test:test@localhost:5432/migration_test \
ghcr.io/acme/checkout-service:production-latest \
./run-tests.sh integration
- name: Test new code against new schema
run: |
docker build -t checkout-test .
docker run --rm --network=host \
-e DATABASE_URL=postgresql://test:test@localhost:5432/migration_test \
checkout-test \
./run-tests.sh integration
Dangerous Migration Detector
#!/bin/bash
# HARDENED: Detect dangerous migration patterns
set -euo pipefail
MIGRATION_FILE=$1
ERRORS=0
# Check for column drops without a preceding expand phase
if grep -qi "DROP COLUMN" "$MIGRATION_FILE"; then
COLUMN=$(grep -oi "DROP COLUMN [a-z_]*" "$MIGRATION_FILE" | head -1 | awk '{print $3}')
echo "::warning::Migration drops column '$COLUMN'. Verify expand-contract pattern was followed."
# Check if the column is still referenced in code
if grep -r "$COLUMN" src/ --include="*.java" --include="*.go" --include="*.js" -l; then
echo "::error::Column '$COLUMN' is still referenced in code. Cannot drop."
ERRORS=$((ERRORS + 1))
fi
fi
# Check for NOT NULL additions without DEFAULT
if grep -qi "SET NOT NULL" "$MIGRATION_FILE"; then
if ! grep -qi "SET DEFAULT" "$MIGRATION_FILE"; then
echo "::error::NOT NULL constraint added without DEFAULT value. Existing rows may violate constraint."
ERRORS=$((ERRORS + 1))
fi
fi
# Check for table renames (always dangerous)
if grep -qi "RENAME TABLE\|RENAME TO" "$MIGRATION_FILE"; then
echo "::error::Table rename detected. Use expand-contract pattern instead."
ERRORS=$((ERRORS + 1))
fi
# Check for column type changes (may lock table)
if grep -qi "ALTER COLUMN.*TYPE" "$MIGRATION_FILE"; then
echo "::warning::Column type change detected. Verify this won't lock the table on large datasets."
fi
exit $ERRORS
Migration Review Checklist (PR Template)
## Migration Review Checklist
- [ ] Migration is backward-compatible (old code works with new schema)
- [ ] Migration is forward-compatible (new code works if migration is rolled back)
- [ ] No column or table drops without prior expand-contract cycle
- [ ] No NOT NULL constraints without DEFAULT values
- [ ] No column renames (use expand-contract instead)
- [ ] Backfill operations use batched updates (not single UPDATE for all rows)
- [ ] Reverse migration exists and has been tested
- [ ] Migration tested against seed data with realistic volume
- [ ] No `ALTER TABLE ... TYPE` on tables with > 100k rows (use expand-contract)
The Gate
The migration-check workflow runs on every PR that modifies the migrations/ directory. It must pass before the PR can be merged. The dangerous migration detector runs as a separate step that flags patterns known to cause outages.
The combination of automated testing (backward-compat check) and static analysis (dangerous migration detector) catches most migration issues before they reach staging.
The Recovery
Migration fails in production but not in staging: The data is different. The seed data in CI must be updated to cover the edge case. Add the failing scenario to the seed data and fix the migration.
Migration succeeds but causes performance degradation: Common with new indexes on large tables. PostgreSQL’s CREATE INDEX CONCURRENTLY avoids table locks but the index build consumes I/O. Schedule large index creations during low-traffic windows using a time-gated deployment.
Need to roll back a migration that deleted data: You cannot. This is why the dangerous migration detector blocks column and table drops. Always keep a backup of the data before running contract migrations in production. Use pg_dump --table=<table> before the contract phase.