Promotion Gates and Approval Workflows

The Failure

The frontend team set up a promotion workflow with a manual approval step. The team lead approved promotions by clicking a button in the GitHub Actions UI. On a busy Thursday, the team lead approved three promotions in 20 minutes without checking staging health. The third promotion deployed a version with a JavaScript bundle that was 4MB larger than the previous version (a debug library left in the build). Page load times increased by 3 seconds. The team lead had approved based on the PR description, not on staging validation results.

Manual approval without automated validation is a rubber stamp. Automated validation without manual approval for production is reckless. The combination of both creates a reliable promotion gate.

The Mechanism

Gate Types

Gate	Type	When	Blocks On
Health check	Automated	After deploy to target env	Unhealthy pods for > 2 min
Smoke test	Automated	After health check passes	Any critical path failure
Performance baseline	Automated	After smoke tests pass	p99 latency > 120% of baseline
Bundle size check	Automated	During CI	Size increase > 5% without explanation
Manual approval	Human	Before production deploy	Reviewer does not approve
Wait timer	Automated	After approval	Must wait 10 min before deploy proceeds
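
The bundle size gate in the table is the one that would have caught the Thursday failure, so it is worth sketching. The following is a minimal shell sketch under stated assumptions: the function names and the usage paths (`baseline/main.js`, `dist/main.js`) are hypothetical, and the "without explanation" override from the table is left out.

```shell
# is_uint VALUE: succeeds only for a non-empty string of digits.
is_uint() {
  case $1 in
    ''|*[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

# bundle_growth_exceeds BASE_BYTES NEW_BYTES [MAX_PERCENT]
# Succeeds (exit 0, meaning "block the build") when NEW_BYTES is more than
# MAX_PERCENT larger than BASE_BYTES. Fails closed: missing, non-numeric,
# or zero baseline input also counts as a violation.
bundle_growth_exceeds() {
  base=$1 new=$2 max_pct=${3:-5}
  if ! is_uint "$base" || ! is_uint "$new" || ! is_uint "$max_pct"; then
    return 0
  fi
  [ "$base" -gt 0 ] || return 0
  # Integer form of: new > base * (1 + max_pct / 100)
  [ $((new * 100)) -gt $((base * (100 + max_pct))) ]
}

# Usage in a CI step (hypothetical paths):
#   base=$(wc -c < baseline/main.js | tr -d ' ')
#   new=$(wc -c < dist/main.js | tr -d ' ')
#   if bundle_growth_exceeds "$base" "$new" 5; then
#     echo "::error::Bundle grew by more than 5% ($base -> $new bytes)"
#     exit 1
#   fi
```

Integer arithmetic avoids a dependency on `bc`, and treating unreadable input as a violation means a missing baseline cannot silently pass the gate.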

GitHub Environment Protection Rules

GitHub environments support:

Required reviewers: One or more GitHub users or teams must approve
Wait timer: Minimum delay before the job can proceed (0-43200 minutes)
Deployment branches: Restrict which branches can deploy to the environment
Custom deployment protection rules: External validation via webhooks

The Implementation

Environment Configuration

Configure environments in GitHub Settings → Environments:

# Illustrative summary of the settings, not a file GitHub reads.
# Set them in the UI or through the REST API.
environments:
  dev:
    deployment_branch_policy: null
    # No branch policy: any branch can deploy to dev

  staging:
    deployment_branch_policy:
      protected_branches: true
      custom_branch_policies: false
    # Only protected branches (here, main) deploy to staging
    wait_timer: 0

  production:
    deployment_branch_policy:
      protected_branches: true
      custom_branch_policies: false
    required_reviewers:
      - team: platform-leads
    # GitHub needs approval from only one of the listed reviewers
    wait_timer: 10 # 10-minute delay before the deploy job starts
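
To apply the production settings through the REST API rather than the UI, the request body can be generated and piped to `gh api`. This is a sketch, not the author's tooling: the function name is hypothetical, the team ID `1234567` in the usage comment is a placeholder (the API expects the numeric ID of `platform-leads`, not its name), and it assumes the environment lives in `acme/ecommerce-infra`.

```shell
# production_env_payload TEAM_ID [WAIT_MINUTES]
# Prints the JSON body for
#   PUT /repos/{owner}/{repo}/environments/{environment_name}
# with one reviewing team, a wait timer, and protected-branch deployment only.
production_env_payload() {
  team_id=$1 wait_minutes=${2:-10}
  cat <<EOF
{
  "wait_timer": ${wait_minutes},
  "reviewers": [{ "type": "Team", "id": ${team_id} }],
  "deployment_branch_policy": {
    "protected_branches": true,
    "custom_branch_policies": false
  }
}
EOF
}

# Usage (not run here; needs a token with admin rights on the repository):
#   production_env_payload 1234567 10 |
#     gh api --method PUT \
#       repos/acme/ecommerce-infra/environments/production --input -
```

Keeping the payload in a script puts the gate configuration under version control, so a loosened wait timer or a changed reviewer list shows up in review.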

Automated Validation Before Promotion

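# HARDENED: Post-deploy validation that gates promotion
name: validate-staging
on:
  workflow_run:
    workflows: ["deploy-staging"]
    types: [completed]

env:
  # Assumption: workflow_run carries no inputs, so the service name and image
  # tag must be set explicitly. These values are placeholders; in a
  # multi-service repo, read them from an artifact uploaded by deploy-staging.
  SERVICE: checkout-service
  IMAGE_TAG: ${{ github.event.workflow_run.head_sha }}

jobs:
  health-check:
    runs-on: ubuntu-latest
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    steps:
      - name: Wait for rollout
        # Assumes the runner already has a kubeconfig with a "staging" context.
        run: |
          kubectl --context=staging -n staging \
            rollout status "deployment/$SERVICE" \
            --timeout=120s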

      - name: Verify health endpoints
        run: |
          for i in {1..10}; do
            STATUS=$(curl -s -o /dev/null -w '%{http_code}' \
              https://staging.acme.com/$SERVICE/health/ready)
            if [[ "$STATUS" != "200" ]]; then
              echo "Health check failed: HTTP $STATUS (attempt $i/10)"
              sleep 10
              continue
            fi
            echo "Health check passed"
            break
          done
          [[ "$STATUS" == "200" ]] || exit 1

  smoke-test:
    runs-on: ubuntu-latest
    needs: [health-check]
    steps:
      - uses: actions/checkout@v4
      - name: Run smoke tests against staging
        run: |
          cd tests/smoke
          # Fresh runners have no node_modules; assumes a package-lock.json here.
          npm ci
          ENVIRONMENT=staging npm run smoke

  performance-baseline:
    runs-on: ubuntu-latest
    needs: [smoke-test]
    steps:
      - name: Run lightweight performance check
        run: |
          # Quick Locust run: 10 users, 60 seconds
          docker run --rm \
            -e TARGET_HOST=https://staging.acme.com \
            -e USERS=10 \
            -e DURATION=60 \
            -v $PWD/locust-results:/results \
            ghcr.io/acme/locust-suite:latest

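      - name: Compare with baseline
        run: |
          P99=$(jq -r '.p99_latency' locust-results/stats.json)
          BASELINE=$(curl -sf "https://metrics.acme.com/api/v1/baseline/$SERVICE" | jq -r '.p99_latency')

          # Fail closed: a missing or non-numeric value must not pass the gate.
          for value in "$P99" "$BASELINE"; do
            if ! [[ "$value" =~ ^[0-9]+([.][0-9]+)?$ ]]; then
              echo "::error::Missing or non-numeric latency (p99='$P99', baseline='$BASELINE')"
              exit 1
            fi
          done

          THRESHOLD=$(echo "$BASELINE * 1.2" | bc -l)
          if (( $(echo "$P99 > $THRESHOLD" | bc -l) )); then
            echo "::error::p99 latency $P99 ms exceeds threshold $THRESHOLD ms (baseline: $BASELINE ms)"
            exit 1
          fi
          echo "Performance check passed: p99=$P99 ms (baseline=$BASELINE ms, threshold=$THRESHOLD ms)"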

  record-validation:
    runs-on: ubuntu-latest
    needs: [smoke-test, performance-baseline]
    steps:
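      - name: Record validation result
        env:
          # gh needs a token, and dispatching to another repository requires one
          # with write access there. Reusing INFRA_REPO_TOKEN is an assumption.
          GH_TOKEN: ${{ secrets.INFRA_REPO_TOKEN }}
        run: |
          gh api repos/acme/ecommerce-infra/dispatches \
            -f event_type=staging-validated \
            -f "client_payload[service]=$SERVICE" \
            -f "client_payload[image_tag]=$IMAGE_TAG" \
            -f "client_payload[validated_at]=$(date -u +%Y-%m-%dT%H:%M:%SZ)"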
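
The performance-baseline gate reduces to one comparison: block when the measured p99 exceeds 120% of the baseline. With a 250 ms baseline the threshold is 250 × 1.2 = 300 ms, so a measurement of 299 ms passes and 301 ms blocks. The sketch below expresses that rule as a standalone function that can be tested outside CI. The function name is hypothetical, and it uses `awk` in place of `bc` so that an unparsable value (for example `null` from `jq`) fails the gate instead of slipping through.

```shell
# latency_within_budget P99_MS BASELINE_MS [FACTOR]
# Succeeds only when both values are non-negative decimal numbers, the
# baseline is greater than zero, and P99_MS <= BASELINE_MS * FACTOR.
# Anything missing or unparsable fails the gate.
latency_within_budget() {
  awk -v p99="$1" -v base="$2" -v factor="${3:-1.2}" 'BEGIN {
    num = "^[0-9]+([.][0-9]+)?$"
    if (p99 !~ num || base !~ num || base + 0 <= 0) exit 1
    if (p99 + 0 <= base * factor) exit 0
    exit 1
  }'
}

# Usage in a workflow step:
#   if ! latency_within_budget "$P99" "$BASELINE"; then
#     echo "::error::p99 $P99 ms is over budget (baseline $BASELINE ms)"
#     exit 1
#   fi
```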

Production Promotion with Approval

# HARDENED: Production promotion with manual approval gate
name: promote-to-production
on:
  repository_dispatch:
    types: [staging-validated]

jobs:
  promote:
    runs-on: ubuntu-latest
    environment: production # Triggers approval workflow
    steps:
      - uses: actions/checkout@v4
        with:
          token: ${{ secrets.INFRA_REPO_TOKEN }}

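      - name: Update production image
        env:
          # Pass payload values through env instead of interpolating them into
          # the script, so a crafted payload cannot inject shell commands.
          SERVICE: ${{ github.event.client_payload.service }}
          TAG: ${{ github.event.client_payload.image_tag }}
        run: |
          cd "apps/$SERVICE/overlays/production"
          kustomize edit set image \
            "$SERVICE=ghcr.io/acme/$SERVICE:$TAG"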

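      - name: Commit promotion
        env:
          # Shell variables do not survive between steps, so define them here.
          SERVICE: ${{ github.event.client_payload.service }}
          TAG: ${{ github.event.client_payload.image_tag }}
          VALIDATED_AT: ${{ github.event.client_payload.validated_at }}
        run: |
          git config user.name "promotion-bot"
          git config user.email "[email protected]"
          git add -A
          # github.actor is whoever sent the dispatch, not the reviewer; the
          # approval itself is recorded on the linked workflow run.
          git commit -m "promote(production): $SERVICE → $TAG

          Validated at: $VALIDATED_AT
          Triggered by: ${{ github.actor }}
          Approval record: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
          git push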

Emergency Bypass

For critical incidents, the platform team can bypass the normal promotion flow:

# HARDENED: Emergency promotion with audit trail
name: emergency-promote
on:
  workflow_dispatch:
    inputs:
      service:
        required: true
        type: choice
        options:
          [
            checkout-service,
            catalog-service,
            inventory-service,
            payments-service,
          ]
      image-tag:
        required: true
        type: string
      incident-id:
        required: true
        type: string
        description: "Incident ID from PagerDuty or incident tracker"

jobs:
  emergency-promote:
    runs-on: ubuntu-latest
    environment: production-emergency # Separate env, requires platform-oncall approval
    steps:
      - uses: actions/checkout@v4
        with:
          token: ${{ secrets.INFRA_REPO_TOKEN }}

      - name: Update production image
        env:
          # image-tag is free text; env keeps it out of the script source.
          SERVICE: ${{ inputs.service }}
          TAG: ${{ inputs.image-tag }}
        run: |
          cd "apps/$SERVICE/overlays/production"
          kustomize edit set image \
            "$SERVICE=ghcr.io/acme/$SERVICE:$TAG"

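      - name: Commit with incident reference
        env:
          SERVICE: ${{ inputs.service }}
          TAG: ${{ inputs.image-tag }}
          INCIDENT_ID: ${{ inputs.incident-id }}
        run: |
          git config user.name "emergency-bot"
          git config user.email "[email protected]"
          git add -A
          # github.actor is the person who started the run; the approver is
          # recorded on the linked workflow run.
          git commit -m "EMERGENCY promote(production): $SERVICE → $TAG

          Incident: $INCIDENT_ID
          Triggered by: ${{ github.actor }}
          Approval record: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
          ⚠️ Bypassed normal promotion gates"
          git push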

      - name: Create follow-up issue
        env:
          # gh needs a token; reusing INFRA_REPO_TOKEN here is an assumption.
          GH_TOKEN: ${{ secrets.INFRA_REPO_TOKEN }}
          SERVICE: ${{ inputs.service }}
          INCIDENT_ID: ${{ inputs.incident-id }}
        run: |
          gh issue create \
            --repo acme/ecommerce-infra \
            --title "Post-incident: Review emergency promotion for $SERVICE" \
            --body "Emergency promotion was used during incident $INCIDENT_ID.
            Verify that all normal gates would have passed.
            Review whether the emergency was justified."

The Gate

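The production promotion job uses environment: production, which activates GitHub’s environment protection rules. Because the workflow starts only on the staging-validated event, a pending approval means the health check, smoke test, and performance baseline have already passed. Before approving, the reviewer should still open the validate-staging run and read those results. The promotion run does not show them by itself, because a job that targets a protected environment runs none of its steps until it is approved.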

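The 10-minute wait timer gives the team a window to cancel the run if they realize something is wrong. It does not stop a reviewer from approving several promotions in quick succession, but it guarantees that none of them reaches production instantly.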
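
One way to put the staging results in front of the reviewer is to write them to the run summary from a job that has no environment and runs before the gated job. The sketch below assumes that arrangement. The function name and its arguments are hypothetical, while `GITHUB_STEP_SUMMARY` is the file path that GitHub Actions provides to every step.

```shell
# write_validation_summary SERVICE IMAGE_TAG HEALTH SMOKE PERF
# Appends a Markdown table to the file named by GITHUB_STEP_SUMMARY, which
# GitHub Actions renders on the run's summary page.
write_validation_summary() {
  {
    printf '### Staging validation for %s (%s)\n\n' "$1" "$2"
    printf '| Gate | Result |\n'
    printf '| --- | --- |\n'
    printf '| Health check | %s |\n' "$3"
    printf '| Smoke test | %s |\n' "$4"
    printf '| Performance baseline | %s |\n' "$5"
  } >> "${GITHUB_STEP_SUMMARY:?GITHUB_STEP_SUMMARY is not set}"
}

# Usage in a step of the ungated job:
#   write_validation_summary "$SERVICE" "$TAG" passed passed "p99 210 ms"
```

The gate results themselves would have to travel with the staging-validated payload or be fetched from the validate-staging run; that plumbing is not shown here.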

The Recovery

Approved promotion causes an incident: Roll back by reverting the infra repo commit. File a post-incident review to understand why staging validation did not catch the issue. Tighten the gates.

Reviewer is unavailable: Add multiple reviewers to the environment. Any one reviewer can approve. If all reviewers are unavailable, use the emergency promotion workflow with the production-emergency environment (which has a different reviewer list, typically the on-call team).

Emergency bypass is used too frequently: Track emergency promotions. If they happen more than twice a month, the normal promotion flow is too slow or too restrictive. Fix the flow; do not normalize the bypass.
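
Because every emergency promotion commits with an "EMERGENCY promote(production)" subject, the infra repo's history can serve as the tracking record. The following is a minimal sketch, assuming it runs in a clone of the infra repo. The function name is hypothetical.

```shell
# count_emergency_promotions [REPO_DIR] [SINCE]
# Prints the number of commits in the window whose message contains
# "EMERGENCY promote(production)". Defaults: current directory, last 30 days.
count_emergency_promotions() {
  git -C "${1:-.}" log --since="${2:-30 days ago}" \
    --fixed-strings --grep='EMERGENCY promote(production)' \
    --format=%H | wc -l | tr -d ' '
}

# Usage: flag the month when the bypass was used more than twice.
#   if [ "$(count_emergency_promotions . '30 days ago')" -gt 2 ]; then
#     echo "Emergency bypass used more than twice in 30 days; review the flow"
#   fi
```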