Blue-Green Deployments with ArgoCD Sync Waves
The Failure
The team implemented blue-green manually. They deployed the green version, waited for it to be healthy, then ran kubectl patch service to switch the selector. One Friday, the engineer running the deployment patched the wrong service. The internal admin service started pointing at the customer-facing deployment. Admin API calls hit the customer-facing service, which returned 404s for every admin endpoint. The engineer realized the mistake after 8 minutes and patched the correct service.
Manual traffic switching is error-prone. ArgoCD sync waves automate the sequence: deploy green, verify health, switch traffic, tear down blue. Each step depends on the previous step’s success.
The Mechanism
Sync Wave Ordering
ArgoCD sync waves define the order in which resources are applied during a sync operation. Resources with lower wave numbers are applied first. Resources in the same wave are applied in parallel. ArgoCD waits for resources in a wave to be healthy before proceeding to the next wave.
| Wave | Resource | Purpose |
|---|---|---|
| 0 | ConfigMap, Secret | Configuration for the new version |
| 1 | Deployment (green) | New version pods |
| 2 | Service (traffic switch) | Switch traffic to green |
| 3 | Job (smoke test) | Validate the switch |
| 4 | Deployment (blue cleanup) | Scale down old version |
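The ordering rule can be illustrated with a small shell sketch. It only models the grouping by wave number and does not talk to ArgoCD; `wave_plan` and the resource names in the example are hypothetical. Within a wave, ArgoCD applies its own ordering by kind and name, which this sketch does not reproduce (it keeps input order).

```bash
#!/usr/bin/env bash
# Sketch only: models wave grouping, does not contact ArgoCD.
# Reads "<wave> <kind/name>" lines on stdin and prints one line per wave,
# lowest wave number first.
wave_plan() {
  sort -s -n -k1,1 | awk '
    NR == 1 || $1 != wave {
      if (NR > 1) print line      # flush the previous wave
      wave = $1
      line = "wave " wave ": " $2
      next
    }
    { line = line ", " $2 }       # same wave: append to the current line
    END { if (NR > 0) print line }
  '
}

# Example (resource names are hypothetical):
#   printf '2 Service/checkout-service\n1 Deployment/checkout-green\n' | wave_plan
```

Each printed line corresponds to one gate: ArgoCD applies everything on that line, waits for it to be healthy, then moves on to the next.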
Slot Management
Blue and green are not separate deployments that you create and delete. They are two persistent deployments with slot labels. The service selector determines which slot receives traffic. Only the image tag and the service selector change during deployment.
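To see which slot is live at any moment, you can read the Service selector and the image each slot is running. The following is a minimal sketch, assuming `kubectl` access to the cluster; the helper names `active_slot` and `slot_image`, and the `KUBECTL` override variable, are illustrative and not part of the original setup.

```bash
#!/usr/bin/env bash
# Sketch only. KUBECTL can be overridden (for example in tests);
# it defaults to the real kubectl binary.

# Prints the slot label the Service currently selects ("blue" or "green").
active_slot() {
  local service=$1 namespace=$2
  "${KUBECTL:-kubectl}" get service "$service" -n "$namespace" \
    -o jsonpath='{.spec.selector.slot}'
}

# Prints the image of the first container in the <prefix>-<slot> Deployment.
slot_image() {
  local name_prefix=$1 slot=$2 namespace=$3
  "${KUBECTL:-kubectl}" get deployment "${name_prefix}-${slot}" -n "$namespace" \
    -o jsonpath='{.spec.template.spec.containers[0].image}'
}

# Example:
#   active_slot checkout-service production
#   slot_image checkout blue production
```

Because both Deployments persist, the idle slot always shows the previously released image, which is what makes switching back cheap.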
The Implementation
Complete Blue-Green Manifests
# HARDENED: Blue slot deployment (sync-wave 0, always exists)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-blue
  namespace: production
  annotations:
    argocd.argoproj.io/sync-wave: "0"
  labels:
    app: checkout-service
    slot: blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout-service
      slot: blue
  template:
    metadata:
      labels:
        app: checkout-service
        slot: blue
    spec:
      containers:
        - name: checkout
          image: ghcr.io/acme/checkout-service:CURRENT_SHA
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
---
# HARDENED: Green slot deployment (sync-wave 1, updated with new image)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-green
  namespace: production
  annotations:
    argocd.argoproj.io/sync-wave: "1"
  labels:
    app: checkout-service
    slot: green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout-service
      slot: green
  template:
    metadata:
      labels:
        app: checkout-service
        slot: green
    spec:
      containers:
        - name: checkout
          image: ghcr.io/acme/checkout-service:NEW_SHA
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
---
# HARDENED: Service switches to green after green is healthy (sync-wave 2)
apiVersion: v1
kind: Service
metadata:
  name: checkout-service
  namespace: production
  annotations:
    argocd.argoproj.io/sync-wave: "2"
spec:
  selector:
    app: checkout-service
    slot: green
  ports:
    - port: 80
      targetPort: 8080
---
# HARDENED: Post-switch smoke test (sync-wave 3)
apiVersion: batch/v1
kind: Job
metadata:
  name: checkout-smoke-test
  namespace: production
  annotations:
    argocd.argoproj.io/sync-wave: "3"
    argocd.argoproj.io/hook: Sync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: smoke
          image: ghcr.io/acme/smoke-tests:latest
          env:
            - name: TARGET_URL
              value: http://checkout-service.production.svc.cluster.local
          command: ["./run-smoke-tests.sh"]
Promotion Script
#!/bin/bash
# HARDENED: Blue-green promotion script for the infra repo
# Assumes mikefarah yq v4 syntax.
set -euo pipefail

SERVICE=${1:?"usage: $0 SERVICE NEW_TAG"}
NEW_TAG=${2:?"usage: $0 SERVICE NEW_TAG"}
INFRA_DIR="apps/$SERVICE/overlays/production"

# Determine current active slot
CURRENT_SLOT=$(yq '.spec.selector.slot' "$INFRA_DIR/service.yaml")
case "$CURRENT_SLOT" in
  blue)  NEW_SLOT="green" ;;
  green) NEW_SLOT="blue" ;;
  *)
    # A missing or unexpected selector must not silently default to blue
    echo "Unexpected slot '$CURRENT_SLOT' in $INFRA_DIR/service.yaml" >&2
    exit 1
    ;;
esac
echo "Switching $SERVICE from $CURRENT_SLOT to $NEW_SLOT with image tag $NEW_TAG"

# Update the inactive slot's image
yq -i ".spec.template.spec.containers[0].image = \"ghcr.io/acme/$SERVICE:$NEW_TAG\"" \
  "$INFRA_DIR/deployment-$NEW_SLOT.yaml"

# Update the service selector to point to the new slot
yq -i ".spec.selector.slot = \"$NEW_SLOT\"" "$INFRA_DIR/service.yaml"

echo "Updated $INFRA_DIR for blue-green switch: $CURRENT_SLOT → $NEW_SLOT"
The Gate
ArgoCD will not proceed to sync-wave 2 (traffic switch) until the green deployment in sync-wave 1 reports all pods as Ready. The readiness probe verifies the application is accepting HTTP requests. If the green deployment fails to become healthy within the sync timeout, ArgoCD marks the sync as failed and does not switch traffic.
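When debugging a stuck sync, it can help to reproduce the gate by hand. The sketch below approximates it with `kubectl rollout status`, which blocks until the Deployment finishes rolling out or the timeout expires. This is not how ArgoCD evaluates health internally; the function name `rollout_gate`, the 120-second default, and the `KUBECTL` override are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Sketch only: a manual approximation of the wave 1 -> wave 2 gate.
# Returns 0 only if the Deployment finishes rolling out within the timeout.
rollout_gate() {
  local deployment=$1 namespace=$2 timeout=${3:-120s}
  if "${KUBECTL:-kubectl}" rollout status "deployment/$deployment" \
       -n "$namespace" --timeout="$timeout"; then
    echo "gate open: $deployment is healthy"
  else
    echo "gate closed: $deployment did not become healthy in $timeout" >&2
    return 1
  fi
}

# Example:
#   rollout_gate checkout-green production 300s
```

A closed gate here corresponds to the case where ArgoCD stops before the traffic switch, so the Service selector still points at the old slot.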
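The sync-wave 3 smoke test Job runs after the traffic switch. If it fails, ArgoCD marks the sync operation as failed. Traffic is already on the new slot at that point, so the failure does not undo the switch by itself: the team is notified and must initiate a rollback.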
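The contents of `run-smoke-tests.sh` are not shown in this article. As a rough sketch of what such a script might contain, the functions below request the two health paths used by the probes and fail if either does not return a 2xx status. The function names are hypothetical, and a real smoke test would likely exercise business endpoints as well.

```bash
#!/usr/bin/env bash
# Sketch only: a hypothetical shape for run-smoke-tests.sh.

# Returns 0 if GET <base_url><path> answers with a 2xx status, 1 otherwise.
smoke_check() {
  local base_url=$1 path=$2 code
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
    "${base_url}${path}") || code=000
  [[ "$code" == 2* ]]
}

# Runs every check and reports each result. A non-zero return makes the Job
# fail, which in turn fails the sync.
run_smoke_tests() {
  local base_url=$1 failed=0 path
  for path in /health/ready /health/live; do
    if smoke_check "$base_url" "$path"; then
      echo "PASS $path"
    else
      echo "FAIL $path"
      failed=1
    fi
  done
  return "$failed"
}

# Example, using the TARGET_URL variable set in the Job manifest:
#   run_smoke_tests "$TARGET_URL"
```

Because the Job sets `backoffLimit: 0`, a single failing run is final, so the checks should tolerate only as much flakiness as you are willing to treat as a failed release.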
The Recovery
Green deployment fails health checks: ArgoCD never reaches sync-wave 2. Traffic stays on blue. Fix the image and push a new commit.
Smoke test fails after traffic switch: Run the promotion script again with the previous image tag, switching back to blue. Or revert the infra repo commit. ArgoCD syncs the service selector back to the old slot.
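Need instant rollback during an incident: Change the service selector in Git back to the previous slot. Push. ArgoCD syncs in under 30 seconds when a Git webhook triggers the refresh; with polling alone, the change waits for the next reconciliation interval (a few minutes by default) unless you trigger the sync manually. The old version is still running (it was not scaled down yet if you have not reached sync-wave 4).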
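The Git side of that rollback can be sketched as a single revert. This assumes the promotion was the most recent commit on the current branch of the infra repo and that the branch has an upstream configured; the function name `rollback_last_promotion` and the `GIT` override variable are illustrative.

```bash
#!/usr/bin/env bash
# Sketch only. Reverts the most recent commit in the infra repo and pushes it.
# Assumes the promotion commit is HEAD; otherwise revert the specific commit.
rollback_last_promotion() {
  local repo_dir=$1
  "${GIT:-git}" -C "$repo_dir" revert --no-edit HEAD &&
    "${GIT:-git}" -C "$repo_dir" push
}

# Example:
#   rollback_last_promotion ~/src/infra
```

A revert keeps the failed promotion in history, which is useful for the post-incident review. If waiting for ArgoCD to notice the push is too slow, `argocd app sync` with your application's name starts the sync immediately.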