Skip to main content
ship it and sleep

Migration Rollback and the Pipeline Gate That Catches Breaking Changes

6 min read Chapter 27 of 66

Migration Rollback and the Pipeline Gate That Catches Breaking Changes

The Failure

The catalog team merged a migration that added a NOT NULL constraint to the description column. Existing rows had null descriptions for internal-only products. The migration failed in staging because 847 rows violated the constraint. The team fixed the data in staging and re-ran the migration. Then it failed in production because production had 2,341 rows with null descriptions that staging did not have.

The CI pipeline did not catch this because the test database was empty. Empty databases have no constraint violations. The gate must test migrations against a database with realistic data.

The Mechanism

Rollback Strategy Matrix

Migration TypeRollback ApproachAutomatic?Data Loss Risk
Add columnDrop columnYesNone (new column was empty or backfilled)
Drop columnCannot undoNoData lost permanently
Rename columnRename backYesNone
Add constraintDrop constraintYesNone
Add indexDrop indexYesNone
Change column typeChange backMaybePrecision loss possible
Table splitDrop new tablesYesNone (old table still has data)
Data migrationReverse updateMaybeDepends on reversibility

The critical rule: never drop a column or table in the same deployment that removes the code using it. Always wait at least one deployment cycle. If the code rollback needs the column, it must still exist.

CI Gate Design

The backward-compatibility gate must:

  1. Start a database with the current production schema
  2. Seed it with a representative data set (anonymized production snapshot or synthetic data)
  3. Apply the new migration
  4. Run the current production code’s tests against the new schema
  5. Verify no existing queries fail

The Implementation

Backward-Compatibility Check

# HARDENED: Migration backward-compatibility gate
name: migration-check
on:
  pull_request:
    paths:
      - "migrations/**"

jobs:
  backward-compat:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: migration_test
          POSTGRES_USER: test
          POSTGRES_PASSWORD: test
        options: >-
          --health-cmd="pg_isready -U test"
          --health-interval=5s
          --health-timeout=5s
          --health-retries=10

    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Load production schema
        run: |
          # Use the schema dump from the last production deployment
          psql -h localhost -U test -d migration_test \
            -f schema/production-schema.sql

      - name: Seed with test data
        run: |
          psql -h localhost -U test -d migration_test \
            -f schema/seed-data.sql

      - name: Apply new migrations
        run: |
          # Only apply migrations that are new in this PR
          NEW_MIGRATIONS=$(git diff --name-only origin/main...HEAD -- migrations/ | sort)
          for migration in $NEW_MIGRATIONS; do
            echo "Applying $migration..."
            psql -h localhost -U test -d migration_test -f "$migration"
          done

      - name: Test current code against new schema
        run: |
          # Pull the current production image and run its tests
          docker run --rm --network=host \
            -e DATABASE_URL=postgresql://test:test@localhost:5432/migration_test \
            ghcr.io/acme/checkout-service:production-latest \
            ./run-tests.sh integration

      - name: Test new code against new schema
        run: |
          docker build -t checkout-test .
          docker run --rm --network=host \
            -e DATABASE_URL=postgresql://test:test@localhost:5432/migration_test \
            checkout-test \
            ./run-tests.sh integration

Dangerous Migration Detector

#!/bin/bash
# HARDENED: Detect dangerous migration patterns
set -euo pipefail

MIGRATION_FILE=$1
ERRORS=0

# Check for column drops without a preceding expand phase
if grep -qi "DROP COLUMN" "$MIGRATION_FILE"; then
  COLUMN=$(grep -oi "DROP COLUMN [a-z_]*" "$MIGRATION_FILE" | head -1 | awk '{print $3}')
  echo "::warning::Migration drops column '$COLUMN'. Verify expand-contract pattern was followed."

  # Check if the column is still referenced in code
  if grep -r "$COLUMN" src/ --include="*.java" --include="*.go" --include="*.js" -l; then
    echo "::error::Column '$COLUMN' is still referenced in code. Cannot drop."
    ERRORS=$((ERRORS + 1))
  fi
fi

# Check for NOT NULL additions without DEFAULT
if grep -qi "SET NOT NULL" "$MIGRATION_FILE"; then
  if ! grep -qi "SET DEFAULT" "$MIGRATION_FILE"; then
    echo "::error::NOT NULL constraint added without DEFAULT value. Existing rows may violate constraint."
    ERRORS=$((ERRORS + 1))
  fi
fi

# Check for table renames (always dangerous)
if grep -qi "RENAME TABLE\|RENAME TO" "$MIGRATION_FILE"; then
  echo "::error::Table rename detected. Use expand-contract pattern instead."
  ERRORS=$((ERRORS + 1))
fi

# Check for column type changes (may lock table)
if grep -qi "ALTER COLUMN.*TYPE" "$MIGRATION_FILE"; then
  echo "::warning::Column type change detected. Verify this won't lock the table on large datasets."
fi

exit $ERRORS

Migration Review Checklist (PR Template)

## Migration Review Checklist

- [ ] Migration is backward-compatible (old code works with new schema)
- [ ] Migration is forward-compatible (new code works if migration is rolled back)
- [ ] No column or table drops without prior expand-contract cycle
- [ ] No NOT NULL constraints without DEFAULT values
- [ ] No column renames (use expand-contract instead)
- [ ] Backfill operations use batched updates (not single UPDATE for all rows)
- [ ] Reverse migration exists and has been tested
- [ ] Migration tested against seed data with realistic volume
- [ ] No `ALTER TABLE ... TYPE` on tables with > 100k rows (use expand-contract)

The Gate

The migration-check workflow runs on every PR that modifies the migrations/ directory. It must pass before the PR can be merged. The dangerous migration detector runs as a separate step that flags patterns known to cause outages.

The combination of automated testing (backward-compat check) and static analysis (dangerous migration detector) catches most migration issues before they reach staging.

The Recovery

Migration fails in production but not in staging: The data is different. The seed data in CI must be updated to cover the edge case. Add the failing scenario to the seed data and fix the migration.

Migration succeeds but causes performance degradation: Common with new indexes on large tables. PostgreSQL’s CREATE INDEX CONCURRENTLY avoids table locks but the index build consumes I/O. Schedule large index creations during low-traffic windows using a time-gated deployment.

Need to roll back a migration that deleted data: You cannot. This is why the dangerous migration detector blocks column and table drops. Always keep a backup of the data before running contract migrations in production. Use pg_dump --table=<table> before the contract phase.