Skip to main content
ship it and sleep

Database Migrations in a CD World: Zero-Downtime Schema Changes

5 min read Chapter 25 of 66

Database Migrations in a CD World

Database migrations are the hardest part of continuous deployment. Code deploys take seconds. Database schema changes can lock tables for minutes. Code can run two versions simultaneously during a rolling update. A database has one schema at a time.

The rule: the old code must work with the new schema, and the new code must work with the old schema. If a migration breaks either direction, it is not compatible with continuous deployment.

Expand-contract migration timeline

The Failure

The payments team renamed the amount column to total_amount in the charges table. They created a migration, tested it locally, and deployed. During the rolling update, old pods (expecting amount) and new pods (expecting total_amount) ran simultaneously. Every query from old pods failed with “column ‘amount’ does not exist.” The team rolled back the code, but the database migration had already run. Rolling back the code did not roll back the database. They had to write a reverse migration at 2am.

The Mechanism

The Expand-Contract Pattern

Instead of renaming a column in one step, use three deployments:

Deployment 1 (Expand):

  • Add the new column total_amount
  • Backfill total_amount from amount
  • Add a trigger to keep both columns in sync
  • Code reads from amount, writes to both

Deployment 2 (Migrate):

  • Code reads from total_amount, writes to both
  • All new data goes to total_amount
  • Old code (if rolled back) still works because amount is still populated

Deployment 3 (Contract):

  • Remove the trigger
  • Drop the amount column
  • Code reads and writes only total_amount

Each deployment is independently deployable and rollback-safe. At no point does old code break.

The Implementation

Migration Pipeline Stage

# HARDENED: Database migration as a separate pipeline stage
name: deploy
on:
  push:
    branches: [main]

jobs:
  migrate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Validate migrations
        run: |
          # Check that all migrations are backward-compatible
          docker run --rm \
            -v $PWD/migrations:/migrations \
            ghcr.io/acme/migration-validator:latest \
            validate --dir=/migrations --check-backward-compat

      - name: Run migration against staging
        run: |
          docker run --rm \
            -e DATABASE_URL=${{ secrets.STAGING_DATABASE_URL }} \
            ghcr.io/acme/checkout-service:${{ github.sha }} \
            ./run-migrations.sh --target=staging

      - name: Verify schema state
        run: |
          docker run --rm \
            -e DATABASE_URL=${{ secrets.STAGING_DATABASE_URL }} \
            ghcr.io/acme/checkout-service:${{ github.sha }} \
            ./verify-schema.sh

  deploy-staging:
    needs: [migrate]
    runs-on: ubuntu-latest
    steps:
      - name: Update staging image
        run: echo "Deploy code after migration succeeds"

Expand Migration Example

-- Migration: V2024.03.15.1__expand_total_amount.sql
-- Phase: EXPAND
-- Backward compatible: YES (old code reads 'amount', new column is additive)

-- Step 1: Add new column
ALTER TABLE charges ADD COLUMN total_amount DECIMAL(10,2);

-- Step 2: Backfill from existing data
UPDATE charges SET total_amount = amount WHERE total_amount IS NULL;

-- Step 3: Add sync trigger (PostgreSQL)
CREATE OR REPLACE FUNCTION sync_amount_columns()
RETURNS TRIGGER AS $$
BEGIN
  IF TG_OP = 'INSERT' OR TG_OP = 'UPDATE' THEN
    IF NEW.total_amount IS NOT NULL AND NEW.amount IS NULL THEN
      NEW.amount := NEW.total_amount;
    ELSIF NEW.amount IS NOT NULL AND NEW.total_amount IS NULL THEN
      NEW.total_amount := NEW.amount;
    END IF;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_sync_amount
  BEFORE INSERT OR UPDATE ON charges
  FOR EACH ROW
  EXECUTE FUNCTION sync_amount_columns();

-- Step 4: Add NOT NULL constraint after backfill
ALTER TABLE charges ALTER COLUMN total_amount SET NOT NULL;

Migration Validation in CI

# HARDENED: Migration backward-compatibility check
- name: Test migration backward compatibility
  run: |
    # Start a test database with the current schema
    docker compose -f docker-compose.migration-test.yml up -d postgres
    docker compose -f docker-compose.migration-test.yml exec -T postgres \
      pg_restore --dbname=testdb /schema/current-schema.dump

    # Apply the new migration
    docker run --rm --network=host \
      -e DATABASE_URL=postgresql://test:test@localhost:5432/testdb \
      ghcr.io/acme/checkout-service:${{ github.sha }} \
      ./run-migrations.sh

    # Run the OLD version's tests against the NEW schema
    docker run --rm --network=host \
      -e DATABASE_URL=postgresql://test:test@localhost:5432/testdb \
      ghcr.io/acme/checkout-service:CURRENT_PROD_SHA \
      ./run-tests.sh integration

    echo "Old code works with new schema ✓"

The Gate

Migration validation must pass before the migration runs against staging. The validation:

  1. Applies the migration to a test database
  2. Runs the current production version’s integration tests against the new schema
  3. If old code breaks with the new schema, the migration is not backward-compatible

The pipeline does not proceed to staging deployment until the migration succeeds and the old code is verified to work with the new schema.

The Recovery

Migration fails halfway: Most migration tools (Flyway, Liquibase, golang-migrate) wrap each migration in a transaction. If the migration fails, the transaction rolls back. The database stays at the previous version.

Migration succeeds but the new code fails: The expand-contract pattern ensures the old code still works with the new schema. Roll back the code. The database migration does not need to be reversed because both old and new code work with the expanded schema.

Need to reverse a completed migration: Write a reverse migration that undoes the expand step (drop the new column, remove the trigger). Deploy it. This is why every expand migration should have a corresponding contract-reverse migration ready before deployment.