Migration Rollback and the Pipeline Gate That Catches Breaking Changes

The Failure

The catalog team merged a migration that added a NOT NULL constraint to the description column. Existing rows had null descriptions for internal-only products. The migration failed in staging because 847 rows violated the constraint. The team fixed the data in staging and re-ran the migration. Then it failed in production because production had 2,341 rows with null descriptions that staging did not have.

The CI pipeline did not catch this because the test database was empty. Empty databases have no constraint violations. The gate must test migrations against a database with realistic data.

The Mechanism

Rollback Strategy Matrix

Migration Type	Rollback Approach	Automatic?	Data Loss Risk
Add column	Drop column	Yes	None (new column was empty or backfilled)
Drop column	Cannot undo	No	Data lost permanently
Rename column	Rename back	Yes	None
Add constraint	Drop constraint	Yes	None
Add index	Drop index	Yes	None
Change column type	Change back	Maybe	Precision loss possible
Table split	Drop new tables	Yes	None (old table still has data)
Data migration	Reverse update	Maybe	Depends on reversibility

The critical rule: never drop a column or table in the same deployment that removes the code using it. Always wait at least one deployment cycle. If the code rollback needs the column, it must still exist.

CI Gate Design

The backward-compatibility gate must:

Start a database with the current production schema
Seed it with a representative data set (anonymized production snapshot or synthetic data)
Apply the new migration
Run the current production code’s tests against the new schema
Verify no existing queries fail

The Implementation

Backward-Compatibility Check

# HARDENED: Migration backward-compatibility gate
name: migration-check
on:
  pull_request:
    paths:
      - "migrations/**"

jobs:
  backward-compat:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: migration_test
          POSTGRES_USER: test
          POSTGRES_PASSWORD: test
        options: >-
          --health-cmd="pg_isready -U test"
          --health-interval=5s
          --health-timeout=5s
          --health-retries=10

    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Load production schema
        run: |
          # Use the schema dump from the last production deployment
          psql -h localhost -U test -d migration_test \
            -f schema/production-schema.sql

      - name: Seed with test data
        run: |
          psql -h localhost -U test -d migration_test \
            -f schema/seed-data.sql

      - name: Apply new migrations
        run: |
          # Only apply migrations that are new in this PR
          NEW_MIGRATIONS=$(git diff --name-only origin/main...HEAD -- migrations/ | sort)
          for migration in $NEW_MIGRATIONS; do
            echo "Applying $migration..."
            psql -h localhost -U test -d migration_test -f "$migration"
          done

      - name: Test current code against new schema
        run: |
          # Pull the current production image and run its tests
          docker run --rm --network=host \
            -e DATABASE_URL=postgresql://test:test@localhost:5432/migration_test \
            ghcr.io/acme/checkout-service:production-latest \
            ./run-tests.sh integration

      - name: Test new code against new schema
        run: |
          docker build -t checkout-test .
          docker run --rm --network=host \
            -e DATABASE_URL=postgresql://test:test@localhost:5432/migration_test \
            checkout-test \
            ./run-tests.sh integration

Dangerous Migration Detector

#!/bin/bash
# HARDENED: Detect dangerous migration patterns
set -euo pipefail

MIGRATION_FILE=$1
ERRORS=0

# Check for column drops without a preceding expand phase
if grep -qi "DROP COLUMN" "$MIGRATION_FILE"; then
  COLUMN=$(grep -oi "DROP COLUMN [a-z_]*" "$MIGRATION_FILE" | head -1 | awk '{print $3}')
  echo "::warning::Migration drops column '$COLUMN'. Verify expand-contract pattern was followed."

  # Check if the column is still referenced in code
  if grep -r "$COLUMN" src/ --include="*.java" --include="*.go" --include="*.js" -l; then
    echo "::error::Column '$COLUMN' is still referenced in code. Cannot drop."
    ERRORS=$((ERRORS + 1))
  fi
fi

# Check for NOT NULL additions without DEFAULT
if grep -qi "SET NOT NULL" "$MIGRATION_FILE"; then
  if ! grep -qi "SET DEFAULT" "$MIGRATION_FILE"; then
    echo "::error::NOT NULL constraint added without DEFAULT value. Existing rows may violate constraint."
    ERRORS=$((ERRORS + 1))
  fi
fi

# Check for table renames (always dangerous)
if grep -qi "RENAME TABLE\|RENAME TO" "$MIGRATION_FILE"; then
  echo "::error::Table rename detected. Use expand-contract pattern instead."
  ERRORS=$((ERRORS + 1))
fi

# Check for column type changes (may lock table)
if grep -qi "ALTER COLUMN.*TYPE" "$MIGRATION_FILE"; then
  echo "::warning::Column type change detected. Verify this won't lock the table on large datasets."
fi

exit $ERRORS

Migration Review Checklist (PR Template)

## Migration Review Checklist

- [ ] Migration is backward-compatible (old code works with new schema)
- [ ] Migration is forward-compatible (new code works if migration is rolled back)
- [ ] No column or table drops without prior expand-contract cycle
- [ ] No NOT NULL constraints without DEFAULT values
- [ ] No column renames (use expand-contract instead)
- [ ] Backfill operations use batched updates (not single UPDATE for all rows)
- [ ] Reverse migration exists and has been tested
- [ ] Migration tested against seed data with realistic volume
- [ ] No `ALTER TABLE ... TYPE` on tables with > 100k rows (use expand-contract)

The Gate

The migration-check workflow runs on every PR that modifies the migrations/ directory. It must pass before the PR can be merged. The dangerous migration detector runs as a separate step that flags patterns known to cause outages.

The combination of automated testing (backward-compat check) and static analysis (dangerous migration detector) catches most migration issues before they reach staging.

The Recovery

Migration fails in production but not in staging: The data is different. The seed data in CI must be updated to cover the edge case. Add the failing scenario to the seed data and fix the migration.

Migration succeeds but causes performance degradation: Common with new indexes on large tables. PostgreSQL’s CREATE INDEX CONCURRENTLY avoids table locks but the index build consumes I/O. Schedule large index creations during low-traffic windows using a time-gated deployment.

Need to roll back a migration that deleted data: You cannot. This is why the dangerous migration detector blocks column and table drops. Always keep a backup of the data before running contract migrations in production. Use pg_dump --table=<table> before the contract phase.