Promotion Gates and Approval Workflows
The Failure
The frontend team set up a promotion workflow with a manual approval step. The team lead approved promotions by clicking a button in the GitHub Actions UI. On a busy Thursday, the team lead approved three promotions in 20 minutes without checking staging health. The third promotion deployed a version whose JavaScript bundle was 4 MB larger than the previous one because a debug library had been left in the build. Page load times increased by 3 seconds. The team lead had approved based on the PR description, not on staging validation results.
Manual approval without automated validation is a rubber stamp. Automated validation without manual approval for production is reckless. The combination of both creates a reliable promotion gate.
The Mechanism
Gate Types
| Gate | Type | When | Blocks On |
|---|---|---|---|
| Health check | Automated | After deploy to target env | Unhealthy pods for > 2 min |
| Smoke test | Automated | After health check passes | Any critical path failure |
| Performance baseline | Automated | After smoke tests pass | p99 latency > 120% of baseline |
| Bundle size check | Automated | During CI | Size increase > 5% without explanation |
| Manual approval | Human | Before production deploy | Reviewer does not approve |
| Wait timer | Automated | After approval | Must wait 10 min before deploy proceeds |
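The bundle size gate in the table is the one that would have caught the failure above, but this section does not show it. A minimal sketch of the comparison, assuming the CI job already knows the previous and candidate bundle sizes in bytes; the function name bundle_growth_exceeds is hypothetical, and the 5% default comes from the table:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) when the candidate bundle has grown
# by more than the allowed percentage relative to the previous build.
# Arguments: previous size in bytes, candidate size in bytes,
#            maximum growth in whole percent (default 5).
bundle_growth_exceeds() {
  local previous=$1 candidate=$2 max_pct=${3:-5}
  local digits='^[0-9]+$'
  # Refuse to guess when a size is missing, non-numeric, or the baseline is zero.
  if ! [[ $previous =~ $digits && $candidate =~ $digits ]] || (( previous == 0 )); then
    echo "bundle sizes must be positive integers" >&2
    return 2
  fi
  # Integer form of: candidate > previous * (1 + max_pct / 100)
  (( candidate * 100 > previous * (100 + max_pct) ))
}
```

In a CI step this could be used as `if bundle_growth_exceeds "$OLD" "$NEW"; then echo "::error::bundle grew more than 5%"; exit 1; fi`, where OLD and NEW are placeholders for however the pipeline measures the bundle. The "without explanation" exemption in the table would need a separate convention, such as a PR label, that this sketch does not cover.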
GitHub Environment Protection Rules
GitHub environments support:
- Required reviewers: One or more GitHub users or teams must approve
- Wait timer: Minimum delay before the job can proceed (0-43200 minutes, that is, up to 30 days)
- Deployment branches: Restrict which branches can deploy to the environment
- Custom deployment protection rules: External validation via webhooks
The Implementation
Environment Configuration
Configure environments in GitHub Settings → Environments, or through the GitHub REST API. The YAML below is an illustrative summary of the intended settings, not a file that GitHub reads:
# Documentation only: environments are configured in the UI or via the GitHub API
environments:
  dev:
    # No branch policy, so any branch can deploy to dev
    deployment_branch_policy: null
  staging:
    deployment_branch_policy:
      protected_branches: true
      custom_branch_policies: false
    # Only protected branches deploy to staging (here, main)
    wait_timer: 0
  production:
    deployment_branch_policy:
      protected_branches: true
      custom_branch_policies: false
    required_reviewers:
      - team: platform-leads
        minimum_approvals: 1
    wait_timer: 10 # 10 minutes after approval before deploy starts
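Since the settings above are applied through the API rather than read from a file, a sketch of the corresponding call may help. It assumes the gh CLI is authenticated with admin access to the repository and uses the REST endpoint for creating or updating an environment. The function names and the numeric team ID are placeholders, not values from this document:

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the JSON body for the environment update.
# Arguments: wait timer in minutes, numeric ID of the reviewing team.
environment_payload() {
  local wait_minutes=$1 team_id=$2
  printf '{"wait_timer":%d,"reviewers":[{"type":"Team","id":%d}],"deployment_branch_policy":{"protected_branches":true,"custom_branch_policies":false}}\n' \
    "$wait_minutes" "$team_id"
}

# Hypothetical helper: sends the payload with the gh CLI.
# Arguments: owner/repo, environment name, wait timer in minutes, team ID.
apply_environment() {
  local repo=$1 environment=$2 wait_minutes=$3 team_id=$4
  environment_payload "$wait_minutes" "$team_id" |
    gh api --method PUT "repos/$repo/environments/$environment" --input -
}
```

A call would look like `apply_environment acme/ecommerce-infra production 10 "$TEAM_ID"`, where TEAM_ID could be looked up with `gh api orgs/acme/teams/platform-leads --jq .id`. The API takes reviewers by numeric ID, which is why the team slug in the summary above cannot be passed directly.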
Automated Validation Before Promotion
# HARDENED: Post-deploy validation that gates promotion
# SERVICE and IMAGE_TAG are assumed to be provided to the jobs (for example as
# workflow-level env values); this excerpt does not show where they are set.
name: validate-staging
on:
  workflow_run:
    workflows: ["deploy-staging"]
    types: [completed]
jobs:
  health-check:
    runs-on: ubuntu-latest
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    steps:
      - name: Wait for rollout
        run: |
          # Assumes the deployment is named after the service; the triggering
          # run's branch name is not a deployment name
          kubectl --context=staging -n staging \
            rollout status "deployment/$SERVICE" \
            --timeout=120s
      - name: Verify health endpoints
        run: |
          for i in {1..10}; do
            # "|| true" keeps a refused connection from aborting the script
            # under bash -e before the retry loop can run
            STATUS=$(curl -s -o /dev/null -w '%{http_code}' \
              "https://staging.acme.com/$SERVICE/health/ready" || true)
            if [[ "$STATUS" != "200" ]]; then
              echo "Health check failed: HTTP $STATUS (attempt $i/10)"
              sleep 10
              continue
            fi
            echo "Health check passed"
            break
          done
          [[ "$STATUS" == "200" ]] || exit 1
  smoke-test:
    runs-on: ubuntu-latest
    needs: [health-check]
    steps:
      - uses: actions/checkout@v4
      - name: Run smoke tests against staging
        run: |
          cd tests/smoke
          ENVIRONMENT=staging npm run smoke
  performance-baseline:
    runs-on: ubuntu-latest
    needs: [smoke-test]
    steps:
      - name: Run lightweight performance check
        run: |
          # Quick Locust run: 10 users, 60 seconds
          docker run --rm \
            -e TARGET_HOST=https://staging.acme.com \
            -e USERS=10 \
            -e DURATION=60 \
            -v "$PWD/locust-results:/results" \
            ghcr.io/acme/locust-suite:latest
      - name: Compare with baseline
        run: |
          P99=$(jq -r '.p99_latency' locust-results/stats.json)
          BASELINE=$(curl -s "https://metrics.acme.com/api/v1/baseline/$SERVICE" | jq -r '.p99_latency')
          # Threshold is 120% of the recorded baseline
          THRESHOLD=$(echo "$BASELINE * 1.2" | bc -l)
          if (( $(echo "$P99 > $THRESHOLD" | bc -l) )); then
            echo "::error::p99 latency $P99 ms exceeds threshold $THRESHOLD ms (baseline: $BASELINE ms)"
            exit 1
          fi
          echo "Performance check passed: p99=$P99 ms (baseline=$BASELINE ms, threshold=$THRESHOLD ms)"
  record-validation:
    runs-on: ubuntu-latest
    needs: [smoke-test, performance-baseline]
    steps:
      - name: Record validation result
        env:
          # gh reads its token from GH_TOKEN; dispatching to another repository
          # needs a token with access to it (assumed here to be INFRA_REPO_TOKEN)
          GH_TOKEN: ${{ secrets.INFRA_REPO_TOKEN }}
        run: |
          # Quote the bracketed keys so the shell does not treat them as globs
          gh api repos/acme/ecommerce-infra/dispatches \
            -f event_type=staging-validated \
            -f "client_payload[service]=$SERVICE" \
            -f "client_payload[image_tag]=$IMAGE_TAG" \
            -f "client_payload[validated_at]=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
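One weakness of the baseline comparison is worth spelling out. If either JSON document lacks p99_latency, jq -r prints the string null, and bc treats an unknown lowercase name as a variable with value 0. A missing p99 therefore compares as 0 and passes the gate silently. The sketch below validates both numbers before comparing. The function name latency_gate is hypothetical, and awk stands in for bc only to keep the example self-contained:

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints "pass" or "fail" for a measured p99 against a
# baseline, using the 120% rule from the gate table.
# Returns 2 (and prints nothing on stdout) when either value is not a number.
latency_gate() {
  local p99=$1 baseline=$2 factor=${3:-1.2}
  local number='^[0-9]+([.][0-9]+)?$'
  if ! [[ $p99 =~ $number && $baseline =~ $number ]]; then
    echo "non-numeric latency (p99='$p99', baseline='$baseline')" >&2
    return 2
  fi
  # Numeric comparison: fail when p99 is above baseline * factor.
  awk -v p="$p99" -v b="$baseline" -v f="$factor" \
    'BEGIN { if (p > b * f) print "fail"; else print "pass" }'
}
```

With this shape, a missing measurement stops the job with a distinct error instead of being read as a very fast response.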
Production Promotion with Approval
# HARDENED: Production promotion with manual approval gate
name: promote-to-production
on:
  repository_dispatch:
    types: [staging-validated]
jobs:
  promote:
    runs-on: ubuntu-latest
    environment: production # Triggers approval workflow
    env:
      # Payload values are passed through the environment instead of being
      # interpolated into the scripts, so a crafted payload cannot inject
      # shell commands. Job-level env also makes them visible to every step.
      SERVICE: ${{ github.event.client_payload.service }}
      TAG: ${{ github.event.client_payload.image_tag }}
      VALIDATED_AT: ${{ github.event.client_payload.validated_at }}
    steps:
      - uses: actions/checkout@v4
        with:
          token: ${{ secrets.INFRA_REPO_TOKEN }}
      - name: Update production image
        run: |
          cd "apps/$SERVICE/overlays/production"
          kustomize edit set image \
            "$SERVICE=ghcr.io/acme/$SERVICE:$TAG"
      - name: Commit promotion
        run: |
          git config user.name "promotion-bot"
          git config user.email "[email protected]"
          git add -A
          # github.actor is the account that triggered the run, not the
          # reviewer who approved the environment, so it is recorded as such
          git commit -m "promote(production): $SERVICE → $TAG

          Validated at: $VALIDATED_AT
          Triggered by: ${{ github.actor }}
          Workflow: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
          git push
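The promotion workflow changes into a directory named after the service and builds an image reference from the tag, both taken from the dispatch payload. Anyone holding a token that can send repository_dispatch events controls those values, so checking their shape before use seems prudent. A minimal sketch, in which the function names and the service naming rule are assumptions; the tag rule follows the Docker tag grammar:

```shell
#!/usr/bin/env bash
# Hypothetical validator: service names are assumed to be lowercase letters,
# digits, and hyphens, starting with a letter (at most 63 characters).
valid_service_name() {
  local pattern='^[a-z][a-z0-9-]{0,62}$'
  [[ $1 =~ $pattern ]]
}

# Hypothetical validator: image tags may contain letters, digits, underscores,
# periods, and dashes, may not start with a period or dash, and are at most
# 128 characters long.
valid_image_tag() {
  local pattern='^[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}$'
  [[ $1 =~ $pattern ]]
}
```

A first step in the promote job could then run `valid_service_name "$SERVICE" && valid_image_tag "$TAG" || exit 1`, which rejects path traversal such as ../other-app before any file is touched.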
Emergency Bypass
For critical incidents, the platform team can bypass the normal promotion flow:
# HARDENED: Emergency promotion with audit trail
name: emergency-promote
on:
  workflow_dispatch:
    inputs:
      service:
        required: true
        type: choice
        options:
          - checkout-service
          - catalog-service
          - inventory-service
          - payments-service
      image-tag:
        required: true
        type: string
      incident-id:
        required: true
        type: string
        description: "Incident ID from PagerDuty or incident tracker"
jobs:
  emergency-promote:
    runs-on: ubuntu-latest
    environment: production-emergency # Separate env, requires platform-oncall approval
    env:
      # Free-text inputs go through the environment rather than being
      # interpolated into the scripts, which would allow shell injection
      SERVICE: ${{ inputs.service }}
      TAG: ${{ inputs.image-tag }}
      INCIDENT_ID: ${{ inputs.incident-id }}
    steps:
      - uses: actions/checkout@v4
        with:
          token: ${{ secrets.INFRA_REPO_TOKEN }}
      - name: Update production image
        run: |
          cd "apps/$SERVICE/overlays/production"
          kustomize edit set image \
            "$SERVICE=ghcr.io/acme/$SERVICE:$TAG"
      - name: Commit with incident reference
        run: |
          git config user.name "emergency-bot"
          git config user.email "[email protected]"
          git add -A
          # github.actor is whoever started the run; the approving reviewer is
          # not exposed in the github context
          git commit -m "EMERGENCY promote(production): $SERVICE → $TAG

          Incident: $INCIDENT_ID
          Triggered by: ${{ github.actor }}
          ⚠️ Bypassed normal promotion gates"
          git push
      - name: Create follow-up issue
        env:
          # gh needs a token in GH_TOKEN; assumes INFRA_REPO_TOKEN may create issues
          GH_TOKEN: ${{ secrets.INFRA_REPO_TOKEN }}
        run: |
          gh issue create \
            --repo acme/ecommerce-infra \
            --title "Post-incident: Review emergency promotion for $SERVICE" \
            --body "Emergency promotion was used during incident $INCIDENT_ID.
          Verify that all normal gates would have passed.
          Review whether the emergency was justified."
The Gate
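The production promotion job uses environment: production, which activates GitHub’s environment protection rules. The reviewer should check the validation results from staging (health check, smoke test, performance baseline) before approving. Because a job bound to a protected environment does not start until it is approved, those results cannot be produced by the promote job itself. They have to come from the validate-staging run or from an earlier, ungated job that writes them to the workflow summary.
The 10-minute wait timer after approval gives the team time to cancel if they realize something is wrong. It also blunts rapid-fire approvals: three approvals in 20 minutes, as in the failure above, would each still wait out the timer before deploying.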
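For the reviewer to see gate results in the workflow summary, some job that runs before the approval has to write them there. A sketch of a helper that renders the results as a table, assuming each result is passed as a gate=result pair; the function name is hypothetical, while GITHUB_STEP_SUMMARY is the file path the Actions runner provides for step summaries:

```shell
#!/usr/bin/env bash
# Hypothetical helper: renders gate results as a Markdown table.
# Each argument is a "gate=result" pair, for example "health-check=success".
render_gate_summary() {
  local pair
  echo "| Gate | Result |"
  echo "|---|---|"
  for pair in "$@"; do
    # Split on the first "=": gate name before it, result after it.
    printf '| %s | %s |\n' "${pair%%=*}" "${pair#*=}"
  done
}

# In a workflow step, the output would be appended to the summary file:
#   render_gate_summary "health-check=success" "smoke-test=success" \
#     >> "$GITHUB_STEP_SUMMARY"
```

In a job that lists the gate jobs under needs, each result can be filled in from the needs context, for example "health-check=${{ needs.health-check.result }}".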
The Recovery
Approved promotion causes an incident: Roll back by reverting the infra repo commit. File a post-incident review to understand why staging validation did not catch the issue. Tighten the gates.
Reviewer is unavailable: Add multiple reviewers to the environment. Any one reviewer can approve. If all reviewers are unavailable, use the emergency promotion workflow with the production-emergency environment (which has a different reviewer list, typically the on-call team).
Emergency bypass is used too frequently: Track emergency promotions. If they happen more than twice a month, the normal promotion flow is too slow or too restrictive. Fix the flow; do not normalize the bypass.
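Because every emergency promotion commits with the subject prefix EMERGENCY promote(production), the infra repository's history can serve as the tracking record. A minimal sketch of the check, assuming commit subjects arrive on standard input; the function name is hypothetical and the default limit of two comes from the rule above:

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads commit subject lines on stdin, prints how many are
# emergency promotions, and succeeds (exit 0) when the count exceeds the limit.
# Argument: allowed number of emergency promotions (default 2).
emergency_overused() {
  local limit=${1:-2} count
  # grep -c exits non-zero when nothing matches but still prints 0.
  count=$(grep -c '^EMERGENCY promote(production)' || true)
  echo "emergency promotions: $count"
  (( count > limit ))
}
```

Run from a checkout of the infra repository, `git log --since='30 days ago' --format=%s | emergency_overused 2` would report the last month's count. A scheduled workflow could open an issue whenever the function succeeds.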