ship it and sleep

Promotion Gates and Approval Workflows

Chapter 21 of 66

The Failure

The frontend team set up a promotion workflow with a manual approval step. The team lead approved promotions by clicking a button in the GitHub Actions UI. On a busy Thursday, the team lead approved three promotions in 20 minutes without checking staging health. The third promotion deployed a version with a JavaScript bundle that was 4MB larger than the previous version (a debug library left in the build). Page load times increased by 3 seconds. The team lead had approved based on the PR description, not on staging validation results.

Manual approval without automated validation is a rubber stamp. Automated validation without manual approval for production is reckless. The combination of both creates a reliable promotion gate.

The Mechanism

Gate Types

| Gate | Type | When | Blocks On |
|---|---|---|---|
| Health check | Automated | After deploy to target env | Unhealthy pods for > 2 min |
| Smoke test | Automated | After health check passes | Any critical path failure |
| Performance baseline | Automated | After smoke tests pass | p99 latency > 120% of baseline |
| Bundle size check | Automated | During CI | Size increase > 5% without explanation |
| Manual approval | Human | Before production deploy | Reviewer does not approve |
| Wait timer | Automated | After approval | Must wait 10 min before deploy proceeds |
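
The bundle size gate in the table is the one that would have caught the failure described above, so a sketch of it may help. This is a minimal example, not the team's actual check: the function name `bundle_size_gate` is hypothetical, sizes are assumed to be in bytes, and the "without explanation" override from the table is not modeled.

```shell
#!/usr/bin/env bash
# Hypothetical helper: fail when the new bundle grew by more than max_pct percent.
# Arguments: old size in bytes, new size in bytes, optional limit (default 5).
bundle_size_gate() {
  local old_bytes=$1 new_bytes=$2 max_pct=${3:-5}

  # Without a previous size there is nothing to compare against.
  if (( old_bytes <= 0 )); then
    echo "no baseline size; skipping check"
    return 0
  fi

  # Integer comparison: new * 100 > old * (100 + max_pct) means growth above the limit.
  if (( new_bytes * 100 > old_bytes * (100 + max_pct) )); then
    echo "bundle grew from $old_bytes to $new_bytes bytes (limit +${max_pct}%)"
    return 1
  fi

  echo "bundle size ok: $old_bytes -> $new_bytes bytes"
  return 0
}
```

In CI this could be called with the byte counts of the previous and current artifacts, for example from `wc -c < dist/main.js` (the path is an assumption). A 4 MB jump on a bundle of a few megabytes exceeds the 5% limit and fails the job.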

GitHub Environment Protection Rules

GitHub environments support:

  • Required reviewers: One or more GitHub users or teams must approve
  • Wait timer: Minimum delay before the job can proceed (0-43200 minutes)
  • Deployment branches: Restrict which branches can deploy to the environment
  • Custom deployment protection rules: External validation via webhooks

The Implementation

Environment Configuration

Configure environments in GitHub Settings → Environments:

# Illustrative only: GitHub does not read this file. Apply these settings in
# the UI or through the REST API.
environments:
  dev:
    deployment_branch_policy: null
    # No branch policy: any branch can deploy to dev

  staging:
    deployment_branch_policy:
      protected_branches: true
      custom_branch_policies: false
    # Only protected branches (here, main) deploy to staging
    wait_timer: 0

  production:
    deployment_branch_policy:
      protected_branches: true
      custom_branch_policies: false
    required_reviewers:
      - team: platform-leads
    # Approval from any one listed reviewer is sufficient
    wait_timer: 10 # 10 minutes after approval before deploy starts
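
Because these settings live outside the repository, it can help to script them. The sketch below builds the request body for GitHub's "create or update an environment" REST endpoint and sends it with `gh api`. The function name `environment_payload`, the `APPLY` switch, and the `TEAM_ID` variable are assumptions for illustration; the numeric ID of the platform-leads team has to be looked up separately.

```shell
#!/usr/bin/env bash
# Build the JSON body for PUT /repos/{owner}/{repo}/environments/{name}.
# Arguments: wait timer in minutes, numeric ID of the reviewing team.
environment_payload() {
  local wait_minutes=$1 team_id=$2
  printf '{"wait_timer":%d,"reviewers":[{"type":"Team","id":%d}],"deployment_branch_policy":{"protected_branches":true,"custom_branch_policies":false}}\n' \
    "$wait_minutes" "$team_id"
}

# Only call the API when explicitly requested, so the script is safe to dry-run.
# TEAM_ID is a placeholder that must be set by the caller.
if [[ "${APPLY:-0}" == "1" ]]; then
  environment_payload 10 "$TEAM_ID" |
    gh api --method PUT "repos/acme/ecommerce-infra/environments/production" --input -
fi
```

Running the script without `APPLY=1` defines the function and changes nothing, which makes it easy to inspect the payload before applying it.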

Automated Validation Before Promotion

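# HARDENED: Post-deploy validation that gates promotion
name: validate-staging
on:
  workflow_run:
    workflows: ["deploy-staging"]
    types: [completed]

# SERVICE and IMAGE_TAG are used below but were never set in this file.
# Assumption: they are defined elsewhere, for example as repository variables.
env:
  SERVICE: ${{ vars.SERVICE }}
  IMAGE_TAG: ${{ vars.IMAGE_TAG }}

jobs:
  health-check:
    runs-on: ubuntu-latest
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    steps:
      - name: Wait for rollout
        run: |
          # Wait on the service's deployment, not on the triggering branch name.
          kubectl --context=staging -n staging \
            rollout status "deployment/$SERVICE" \
            --timeout=120s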

      - name: Verify health endpoints
        run: |
          for i in {1..10}; do
            STATUS=$(curl -s -o /dev/null -w '%{http_code}' \
              https://staging.acme.com/$SERVICE/health/ready)
            if [[ "$STATUS" != "200" ]]; then
              echo "Health check failed: HTTP $STATUS (attempt $i/10)"
              sleep 10
              continue
            fi
            echo "Health check passed"
            break
          done
          [[ "$STATUS" == "200" ]] || exit 1

  smoke-test:
    runs-on: ubuntu-latest
    needs: [health-check]
    steps:
      - uses: actions/checkout@v4
      - name: Run smoke tests against staging
        run: |
          cd tests/smoke
          ENVIRONMENT=staging npm run smoke

  performance-baseline:
    runs-on: ubuntu-latest
    needs: [smoke-test]
    steps:
      - name: Run lightweight performance check
        run: |
          # Quick Locust run: 10 users, 60 seconds
          docker run --rm \
            -e TARGET_HOST=https://staging.acme.com \
            -e USERS=10 \
            -e DURATION=60 \
            -v $PWD/locust-results:/results \
            ghcr.io/acme/locust-suite:latest

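      - name: Compare with baseline
        run: |
          P99=$(jq -r '.p99_latency' locust-results/stats.json)
          BASELINE=$(curl -sf https://metrics.acme.com/api/v1/baseline/$SERVICE | jq -r '.p99_latency')

          # jq prints "null" for a missing key; fail rather than compare against it.
          if [[ -z "$P99" || -z "$BASELINE" || "$P99" == "null" || "$BASELINE" == "null" ]]; then
            echo "::error::missing latency data (p99=$P99, baseline=$BASELINE)"
            exit 1
          fi

          THRESHOLD=$(echo "$BASELINE * 1.2" | bc -l)

          if (( $(echo "$P99 > $THRESHOLD" | bc -l) )); then
            echo "::error::p99 latency $P99 ms exceeds threshold $THRESHOLD ms (baseline: $BASELINE ms)"
            exit 1
          fi
          echo "Performance check passed: p99=$P99 ms (baseline=$BASELINE ms, threshold=$THRESHOLD ms)"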

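  record-validation:
    runs-on: ubuntu-latest
    needs: [smoke-test, performance-baseline]
    steps:
      - name: Record validation result
        env:
          # gh needs a token, and dispatching to another repository requires one
          # with access to it. Reusing INFRA_REPO_TOKEN here is an assumption.
          GH_TOKEN: ${{ secrets.INFRA_REPO_TOKEN }}
        run: |
          # Quote each field so the shell does not treat [..] as a glob pattern.
          gh api repos/acme/ecommerce-infra/dispatches \
            -f event_type=staging-validated \
            -f "client_payload[service]=$SERVICE" \
            -f "client_payload[image_tag]=$IMAGE_TAG" \
            -f "client_payload[validated_at]=$(date -u +%Y-%m-%dT%H:%M:%SZ)"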
Production Promotion with Approval

# HARDENED: Production promotion with manual approval gate
name: promote-to-production
on:
  repository_dispatch:
    types: [staging-validated]

jobs:
  promote:
    runs-on: ubuntu-latest
    environment: production # Triggers approval workflow
    env:
      # Shell variables do not survive between steps, so the payload values are
      # set once at job level. Passing them through env also keeps them from
      # being interpolated directly into the scripts.
      SERVICE: ${{ github.event.client_payload.service }}
      TAG: ${{ github.event.client_payload.image_tag }}
      VALIDATED_AT: ${{ github.event.client_payload.validated_at }}
    steps:
      - uses: actions/checkout@v4
        with:
          token: ${{ secrets.INFRA_REPO_TOKEN }}

      - name: Update production image
        run: |
          cd "apps/$SERVICE/overlays/production"
          kustomize edit set image \
            "$SERVICE=ghcr.io/acme/$SERVICE:$TAG"

      - name: Commit promotion
        run: |
          git config user.name "promotion-bot"
          git config user.email "[email protected]"
          git add -A
          # github.actor is whoever triggered the run, not the approving reviewer;
          # the approver is visible on the linked workflow run.
          git commit -m "promote(production): $SERVICE → $TAG

          Validated at: $VALIDATED_AT
          Triggered by: ${{ github.actor }}
          Workflow: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
          git push
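
The promotion workflow acts on whatever arrives in `client_payload`, so it may be worth validating those values before they reach `kustomize` or `git`. The sketch below is one way to do that. The function names are hypothetical, the service list is copied from the emergency workflow's options, and the tag pattern follows the documented character rules for container image tags.

```shell
#!/usr/bin/env bash
# Hypothetical allowlist check for the service name in client_payload.
valid_service() {
  case "$1" in
    checkout-service|catalog-service|inventory-service|payments-service) return 0 ;;
    *) return 1 ;;
  esac
}

# Accept only strings that are legal container image tags: letters, digits,
# underscore, period, hyphen; no leading period or hyphen; at most 128 characters.
valid_image_tag() {
  local pattern='^[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}$'
  [[ $1 =~ $pattern ]]
}
```

A first step in the job could run `valid_service "$SERVICE" && valid_image_tag "$TAG" || exit 1`, so a malformed dispatch fails before it touches the infra repository.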

Emergency Bypass

For critical incidents, the platform team can bypass the normal promotion flow:

# HARDENED: Emergency promotion with audit trail
name: emergency-promote
on:
  workflow_dispatch:
    inputs:
      service:
        required: true
        type: choice
        options:
          [
            checkout-service,
            catalog-service,
            inventory-service,
            payments-service,
          ]
      image-tag:
        required: true
        type: string
      incident-id:
        required: true
        type: string
        description: "Incident ID from PagerDuty or incident tracker"

jobs:
  emergency-promote:
    runs-on: ubuntu-latest
    environment: production-emergency # Separate env, requires platform-oncall approval
    env:
      # Free-text inputs go through env so they are not interpolated into scripts.
      SERVICE: ${{ inputs.service }}
      TAG: ${{ inputs.image-tag }}
      INCIDENT_ID: ${{ inputs.incident-id }}
    steps:
      - uses: actions/checkout@v4
        with:
          token: ${{ secrets.INFRA_REPO_TOKEN }}

      - name: Update production image
        run: |
          cd "apps/$SERVICE/overlays/production"
          kustomize edit set image \
            "$SERVICE=ghcr.io/acme/$SERVICE:$TAG"

      - name: Commit with incident reference
        run: |
          git config user.name "emergency-bot"
          git config user.email "[email protected]"
          git add -A
          # github.actor is whoever started the workflow, not the approving reviewer.
          git commit -m "EMERGENCY promote(production): $SERVICE → $TAG

          Incident: $INCIDENT_ID
          Triggered by: ${{ github.actor }}
          ⚠️ Bypassed normal promotion gates"
          git push

      - name: Create follow-up issue
        env:
          # gh needs a token; reusing INFRA_REPO_TOKEN here is an assumption.
          GH_TOKEN: ${{ secrets.INFRA_REPO_TOKEN }}
        run: |
          gh issue create \
            --repo acme/ecommerce-infra \
            --title "Post-incident: Review emergency promotion for $SERVICE" \
            --body "Emergency promotion was used during incident $INCIDENT_ID.
          Verify that all normal gates would have passed.
          Review whether the emergency was justified."

The Gate

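The production promotion job uses environment: production, which activates GitHub’s environment protection rules. Because the promotion workflow starts only on the staging-validated event, an approval request exists only for a build that has already passed the staging health check, smoke test, and performance baseline. The reviewer should still open the validate-staging run and read those results before approving.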

The 10-minute wait timer after approval gives the team time to cancel if they realize something is wrong. It also prevents rapid-fire approvals.

The Recovery

Approved promotion causes an incident: Roll back by reverting the infra repo commit. File a post-incident review to understand why staging validation did not catch the issue. Tighten the gates.

Reviewer is unavailable: Add multiple reviewers to the environment. Any one reviewer can approve. If all reviewers are unavailable, use the emergency promotion workflow with the production-emergency environment (which has a different reviewer list, typically the on-call team).

Emergency bypass is used too frequently: Track emergency promotions. If they happen more than twice a month, the normal promotion flow is too slow or too restrictive. Fix the flow; do not normalize the bypass.
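
Tracking can be as simple as counting commits, since the emergency workflow writes a recognizable subject line. The following is a minimal sketch; the function name `count_emergency_promotions` is hypothetical, and it assumes the commit subject prefix used by the emergency workflow above.

```shell
#!/usr/bin/env bash
# Count commit subjects on stdin that were written by the emergency workflow.
# grep -c exits non-zero when the count is 0, so the status is deliberately ignored.
count_emergency_promotions() {
  grep -c '^EMERGENCY promote(production):' || true
}
```

Run inside the infra repository, `git log --since='30 days ago' --format=%s | count_emergency_promotions` prints the number of bypasses in the last month, which can be compared against the threshold of two.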