Hotfix Pipelines and Post-Incident Pipeline Improvements
The Failure
The production incident required a one-line fix. The normal pipeline took 22 minutes: lint, unit tests, integration tests, security scan, performance test, staging deploy, staging smoke test, production deploy. The service was down for the entire 22 minutes because the team did not have a fast path.
A hotfix pipeline targets a fix-to-production time of under 5 minutes by running only the gates that matter during an incident.
The Mechanism
Gate Reduction Rules
| Gate | Normal Pipeline | Hotfix Pipeline | Rationale |
|---|---|---|---|
| Lint | ✓ | ✗ | Style does not cause outages |
| Unit tests | ✓ | ✓ (short mode) | Catch regressions |
| Integration tests | ✓ | ✗ | Too slow for incidents |
| Security scan (critical) | ✓ | ✓ | Non-negotiable |
| Security scan (high) | ✓ | ✗ | Can wait |
| Performance test | ✓ | ✗ | Not relevant to hotfix |
| Staging deploy | ✓ | ✗ | Skip to production |
| Production deploy | ✓ | ✓ | The point |
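The table above can be read as a lookup from gate to action. A minimal sketch in shell, assuming hypothetical gate identifiers (`lint`, `unit-tests`, and so on) that the source does not define:

```shell
#!/usr/bin/env bash
# Sketch: encode the gate-reduction table as a lookup.
# Gate identifiers are hypothetical names chosen for illustration.

# Prints "run" or "skip" for a gate in the hotfix pipeline.
hotfix_gate_action() {
  case "$1" in
    unit-tests|security-scan-critical|production-deploy)
      echo "run" ;;
    lint|integration-tests|security-scan-high|performance-test|staging-deploy)
      echo "skip" ;;
    *)
      # Unknown gates fail loudly rather than being silently skipped.
      echo "unknown gate: $1" >&2
      return 1 ;;
  esac
}
```

Returning an error for unknown gates means a newly added gate has to be classified explicitly before the hotfix pipeline can ignore it.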
Hotfix Branch Convention
hotfix/INCIDENT-123-fix-payment-timeout
The branch name prefix hotfix/ triggers the reduced pipeline. Because that pipeline skips most gates, every hotfix must then be merged to main within 24 hours through a normal PR that runs the full pipeline.
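The convention can be checked mechanically before a branch is pushed. The sketch below routes a Git ref to a pipeline and extracts the incident ID. The `<KEY>-<number>-<slug>` pattern is an assumption inferred from the single example branch name, and `select_pipeline` and `incident_id` are hypothetical helper names:

```shell
#!/usr/bin/env bash
# Sketch: classify a Git ref and extract the incident ID from a hotfix branch.
# The "<KEY>-<number>-<slug>" pattern is assumed from the example branch name.

# Prints "hotfix" for refs under refs/heads/hotfix/, otherwise "full".
select_pipeline() {
  case "$1" in
    refs/heads/hotfix/*) echo "hotfix" ;;
    *)                   echo "full" ;;
  esac
}

# Prints the incident ID (for example INCIDENT-123), or returns 1 when the
# branch name does not follow the assumed pattern.
incident_id() {
  local id
  id=$(printf '%s\n' "$1" |
    sed -n -E 's#^hotfix/([A-Z]+-[0-9]+)-[a-z0-9][a-z0-9-]*$#\1#p')
  [ -n "$id" ] && printf '%s\n' "$id"
}
```

Matching on the full `refs/heads/hotfix/` prefix keeps a branch such as `feature/hotfix/cleanup` on the full pipeline.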
The Implementation
Hotfix Branch Rules
# checkout-service/.github/workflows/ci.yml
# HARDENED: Route to appropriate pipeline based on branch
name: CI
on:
  push:
    branches:
      - main
      - "hotfix/**"
  pull_request:
    branches: [main]
jobs:
  determine-pipeline:
    runs-on: ubuntu-latest
    outputs:
      # Match the branch prefix exactly; contains() would also match a ref
      # that has "hotfix/" elsewhere in its name.
      is-hotfix: ${{ startsWith(github.ref, 'refs/heads/hotfix/') }}
    steps:
      - run: echo "Branch type determined"
  full-pipeline:
    needs: determine-pipeline
    if: needs.determine-pipeline.outputs.is-hotfix != 'true'
    uses: ./.github/workflows/full-ci.yml
  hotfix-pipeline:
    needs: determine-pipeline
    if: needs.determine-pipeline.outputs.is-hotfix == 'true'
    uses: ./.github/workflows/hotfix-ci.yml
    # The called workflow reads repository secrets, so pass them through.
    secrets: inherit
Hotfix CI Workflow
# checkout-service/.github/workflows/hotfix-ci.yml
# HARDENED: Reduced pipeline for emergency fixes
name: Hotfix CI
on:
  workflow_call:
jobs:
  hotfix:
    runs-on: ubuntu-latest
    environment: production # Requires environment approval
    steps:
      - uses: actions/checkout@v4
      - name: Unit tests (short)
        run: go test -short -count=1 ./...
        timeout-minutes: 3
      - name: Critical security scan
        # Consider pinning to a release tag or commit SHA instead of master
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: "fs"
          severity: "CRITICAL"
          exit-code: "1"
        timeout-minutes: 2
      # Added: pushing to ghcr.io requires an authenticated session.
      # The token also needs packages write permission.
      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      # Added: the gha cache backend needs a Buildx builder
      - uses: docker/setup-buildx-action@v3
      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          push: true
          tags: ghcr.io/acme/checkout-service:hotfix-${{ github.sha }}
          cache-from: type=gha
      - name: Deploy to production
        uses: peter-evans/repository-dispatch@v3
        with:
          token: ${{ secrets.INFRA_REPO_TOKEN }}
          repository: acme/ecommerce-infra
          event-type: deploy-service
          client-payload: |
            {
              "service": "checkout-service",
              "tag": "hotfix-${{ github.sha }}",
              "environment": "production",
              "hotfix": true
            }
      - name: Notify team
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": ":ambulance: Hotfix deployed: checkout-service hotfix-${{ github.sha }}"
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
Post-Incident Pipeline Audit
# .github/ISSUE_TEMPLATE/post-incident-pipeline.yml
# HARDENED: Structured pipeline improvement after incidents
name: Post-Incident Pipeline Review
description: Review and improve the pipeline after an incident
body:
  - type: input
    id: incident
    attributes:
      label: Incident ID
    validations:
      required: true
  - type: textarea
    id: timeline
    attributes:
      label: Deployment Timeline
      value: |
        - Deploy started:
        - Issue detected:
        - Rollback initiated:
        - Service recovered:
        - Root cause identified:
        - Hotfix deployed:
  - type: checkboxes
    id: pipeline-gaps
    attributes:
      label: Pipeline Gaps
      options:
        - label: Could a test have caught this?
        - label: Could canary analysis have caught this?
        - label: Could a security scan have caught this?
        - label: Was rollback fast enough?
        - label: Did monitoring detect the issue?
  - type: textarea
    id: improvements
    attributes:
      label: Pipeline Improvements
      description: Specific, actionable changes to the pipeline
      value: |
        1. Add test:
        2. Tighten canary threshold:
        3. Add monitoring:
        4. Update runbook:
Runbook as Code
#!/bin/bash
# runbooks/payments-outage.sh
# HARDENED: Automated runbook for payments service outage
set -euo pipefail

echo "=== Payments Service Outage Runbook ==="
echo ""
echo "Step 1: Check service health"
argocd app get payments-production -o json | jq '.status.health'
echo ""
echo "Step 2: Check recent deploys"
git -C /tmp/ecommerce-infra log --oneline --grep="payments" -5
echo ""
# -r keeps backslashes in the operator's answer literal
read -r -p "Rollback to previous version? (y/n): " ROLLBACK
if [[ "$ROLLBACK" == "y" ]]; then
  bash scripts/argocd-safe-rollback.sh payments-production
fi
echo ""
echo "Step 3: Check pod logs"
kubectl logs -n production -l app=payments-service --tail=50
echo ""
echo "Step 4: Check dependencies"
# -f fails on HTTP errors; --max-time stops a hung dependency from blocking
kubectl exec -n production deploy/payments-service -- \
  curl -sf --max-time 5 http://checkout-service.production.svc.cluster.local/health
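Because the runbook runs under `set -e`, a failing diagnostic command, which is likely during an outage, stops the script before the later steps run. One possible hedge is to wrap each diagnostic in a helper that records the failure and continues. This is a minimal sketch; `run_step` and `FAILED_STEPS` are hypothetical names, not part of the original runbook:

```shell
#!/usr/bin/env bash
# Sketch: run a diagnostic step without letting a failure abort the runbook.
# run_step and FAILED_STEPS are hypothetical names for illustration.

FAILED_STEPS=0

# Usage: run_step "description" command [args...]
run_step() {
  local desc="$1"
  shift
  echo "--- $desc"
  # A command tested by `if` does not trigger `set -e`, so a failure here
  # is recorded and the runbook moves on to the next step.
  if "$@"; then
    return 0
  fi
  echo "WARN: step failed: $desc" >&2
  FAILED_STEPS=$((FAILED_STEPS + 1))
  return 0
}
```

A step would then be written as `run_step "Check pod logs" kubectl logs -n production -l app=payments-service --tail=50`, and the script can report `$FAILED_STEPS` at the end so the operator knows which diagnostics to rerun by hand.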
The Gate
The hotfix pipeline uses GitHub Environments with required reviewers. The production environment requires approval from at least one person on the on-call rotation. This prevents accidental hotfix deploys while keeping the process fast enough for emergencies.
The Recovery
Hotfix pipeline is abused for normal deploys: Monitor hotfix branch frequency. If there are more than 2 hotfixes per month, either the normal pipeline is too slow or the team is taking shortcuts. Fix the root cause.
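The frequency check can be scripted. The sketch below assumes input lines of the form `YYYY-MM-DD <branch>` and uses a hypothetical function name, `hotfix_months_over_limit`. It prints each month whose hotfix count exceeds the limit:

```shell
#!/usr/bin/env bash
# Sketch: flag months with more hotfix branches than the allowed limit.
# Input on stdin: one "YYYY-MM-DD <branch>" pair per line (assumed format).

# Usage: hotfix_months_over_limit [limit]   (limit defaults to 2)
hotfix_months_over_limit() {
  local limit="${1:-2}"
  awk -v limit="$limit" '
    # Count only hotfix branches, with or without a remote prefix.
    $2 ~ /^(origin\/)?hotfix\// { count[substr($1, 1, 7)]++ }
    END { for (m in count) if (count[m] > limit) print m, count[m] }
  ' | sort
}
```

One way to feed it is `git for-each-ref --format='%(committerdate:short) %(refname:short)' refs/remotes/origin/hotfix | hotfix_months_over_limit 2`. That command only sees branches that still exist on the remote and dates them by their last commit, so teams that delete hotfix branches after merging would need another data source, such as deploy events.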
Hotfix breaks something else: The reduced test suite missed a regression. After the incident is resolved, backfill: create a PR from the hotfix branch to main and run the full pipeline. Any failures must be fixed before merge.
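The 24-hour backfill window lends itself to an automated reminder. A minimal sketch, assuming timestamps are passed as Unix epoch seconds; `backfill_overdue` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Sketch: decide whether a hotfix has gone too long without a backfill PR.

# Usage: backfill_overdue <commit_epoch> <now_epoch> [max_hours]
# Returns 0 (overdue) when more than max_hours (default 24) have elapsed.
backfill_overdue() {
  local commit_epoch="$1" now_epoch="$2" max_hours="${3:-24}"
  [ $(( now_epoch - commit_epoch )) -gt $(( max_hours * 3600 )) ]
}
```

In a scheduled job this could be combined with Git: skip the branch if `git merge-base --is-ancestor "$branch" origin/main` succeeds, otherwise call `backfill_overdue "$(git log -1 --format=%ct "$branch")" "$(date +%s)"` and alert when it returns 0. The ancestor test does not recognize squash merges, so repositories that squash would need to track the backfill PR some other way.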
Runbook is outdated: Version runbooks alongside code. Review them during post-incident reviews. If the runbook did not help, update it. If no runbook existed, create one.