ship it and sleep

Blue-Green Deployments with ArgoCD Sync Waves

5 min read Chapter 23 of 66

The Failure

The team implemented blue-green manually. They deployed the green version, waited for it to be healthy, then ran kubectl patch service to switch the selector. One Friday, the engineer running the deployment patched the wrong service. The internal admin service started pointing at the customer-facing deployment. Admin API calls hit the customer-facing service, which returned 404s for every admin endpoint. The engineer realized the mistake after 8 minutes and patched the correct service.

Manual traffic switching is error-prone. ArgoCD sync waves automate the sequence: deploy green, verify health, switch traffic, tear down blue. Each step depends on the previous step’s success.

The Mechanism

Sync Wave Ordering

ArgoCD sync waves define the order in which resources are applied during a sync operation. Each resource declares its wave with the argocd.argoproj.io/sync-wave annotation; resources without the annotation are in wave 0. Resources with lower wave numbers are applied first, and resources in the same wave are applied together. ArgoCD waits for the resources in a wave to be healthy before proceeding to the next wave.

| Wave | Resource | Purpose |
|------|----------|---------|
| 0 | ConfigMap, Secret | Configuration for the new version |
| 1 | Deployment (green) | New version pods |
| 2 | Service (traffic switch) | Switch traffic to green |
| 3 | Job (smoke test) | Validate the switch |
| 4 | Deployment (blue cleanup) | Scale down old version |
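
To preview the wave order before committing, it can help to list each manifest's wave. The following is a minimal sketch, not an ArgoCD feature: list_waves is a hypothetical helper that reads a multi-document YAML file with awk, assumes the two-space indentation used in this chapter's manifests, and treats a missing annotation as wave 0. It sorts by wave only and ignores ArgoCD's further ordering by phase, kind, and name.

```shell
#!/bin/bash
# list_waves FILE: print "<wave> <kind>/<name>" for every document in FILE,
# lowest wave first. Hypothetical helper for auditing, not part of ArgoCD.
list_waves() {
  awk '
    # Print the document collected so far, then reset for the next one.
    function emit() {
      if (kind != "") print wave, kind "/" name
      kind = ""; name = ""; wave = 0
    }
    BEGIN { wave = 0 }
    /^---/ { emit(); next }
    /^kind:/ { kind = $2 }
    # metadata.name is indented by exactly two spaces; container names are deeper.
    /^  name:/ { if (name == "") name = $2 }
    /argocd\.argoproj\.io\/sync-wave:/ { gsub(/"/, "", $2); wave = $2 }
    END { emit() }
  ' "$1" | sort -s -n -k1,1
}
```

Running list_waves against a file that holds the manifests (the file name is up to you) prints one line per resource, so a Service that would be applied before its Deployment is easy to spot.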

Slot Management

Blue and green are not separate deployments that you create and delete. They are two persistent deployments with slot labels. The service selector determines which slot receives traffic. Only the image tag and the service selector change during deployment.
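
One way to see which slot is live is to read the selector straight from the Service. This is a minimal sketch, assuming the checkout-service Service and production namespace used in this chapter and a kubectl context that points at the right cluster; active_slot and inactive_slot are hypothetical helper names.

```shell
#!/bin/bash
# active_slot SERVICE NAMESPACE: print the slot label the Service currently selects.
active_slot() {
  kubectl get service "$1" --namespace "$2" \
    --output jsonpath='{.spec.selector.slot}'
}

# inactive_slot SLOT: print the other slot; fail on anything that is not blue or green.
inactive_slot() {
  case "$1" in
    blue)  echo green ;;
    green) echo blue ;;
    *)     echo "unknown slot: '$1'" >&2; return 1 ;;
  esac
}
```

For example, inactive_slot "$(active_slot checkout-service production)" prints the slot the next release should be deployed into. Failing on an unexpected value is deliberate: a missing selector should stop a deployment rather than silently default to one slot.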

The Implementation

Complete Blue-Green Manifests

# HARDENED: Blue slot deployment (sync-wave 0, always exists)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-blue
  namespace: production
  annotations:
    argocd.argoproj.io/sync-wave: "0"
  labels:
    app: checkout-service
    slot: blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout-service
      slot: blue
  template:
    metadata:
      labels:
        app: checkout-service
        slot: blue
    spec:
      containers:
        - name: checkout
          image: ghcr.io/acme/checkout-service:CURRENT_SHA
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
---
# HARDENED: Green slot deployment (sync-wave 1, updated with new image)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-green
  namespace: production
  annotations:
    argocd.argoproj.io/sync-wave: "1"
  labels:
    app: checkout-service
    slot: green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout-service
      slot: green
  template:
    metadata:
      labels:
        app: checkout-service
        slot: green
    spec:
      containers:
        - name: checkout
          image: ghcr.io/acme/checkout-service:NEW_SHA
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
---
# HARDENED: Service switches to green after green is healthy (sync-wave 2)
apiVersion: v1
kind: Service
metadata:
  name: checkout-service
  namespace: production
  annotations:
    argocd.argoproj.io/sync-wave: "2"
spec:
  selector:
    app: checkout-service
    slot: green
  ports:
    - port: 80
      targetPort: 8080
---
# HARDENED: Post-switch smoke test (sync-wave 3)
apiVersion: batch/v1
kind: Job
metadata:
  name: checkout-smoke-test
  namespace: production
  annotations:
    argocd.argoproj.io/sync-wave: "3"
    argocd.argoproj.io/hook: Sync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: smoke
          image: ghcr.io/acme/smoke-tests:latest
          env:
            - name: TARGET_URL
              value: http://checkout-service.production.svc.cluster.local
          command: ["./run-smoke-tests.sh"]

Promotion Script

#!/bin/bash
# HARDENED: Blue-green promotion script for the infra repo
set -euo pipefail

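# Fail with a usage message instead of an "unbound variable" error
SERVICE="${1:?usage: $0 SERVICE NEW_TAG}"
NEW_TAG="${2:?usage: $0 SERVICE NEW_TAG}"
INFRA_DIR="apps/$SERVICE/overlays/production"

# Determine current active slot; refuse to guess if the selector is missing or unexpected
CURRENT_SLOT=$(yq '.spec.selector.slot' "$INFRA_DIR/service.yaml")
case "$CURRENT_SLOT" in
  blue)  NEW_SLOT="green" ;;
  green) NEW_SLOT="blue" ;;
  *)     echo "Unexpected slot '$CURRENT_SLOT' in $INFRA_DIR/service.yaml" >&2; exit 1 ;;
esac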

echo "Switching $SERVICE from $CURRENT_SLOT to $NEW_SLOT with image tag $NEW_TAG"

# Update the inactive slot's image
yq -i ".spec.template.spec.containers[0].image = \"ghcr.io/acme/$SERVICE:$NEW_TAG\"" \
  "$INFRA_DIR/deployment-$NEW_SLOT.yaml"

# Update the service selector to point to the new slot
yq -i ".spec.selector.slot = \"$NEW_SLOT\"" "$INFRA_DIR/service.yaml"

echo "Updated $INFRA_DIR for blue-green switch: $CURRENT_SLOT -> $NEW_SLOT"

The Gate

ArgoCD will not proceed to sync-wave 2 (traffic switch) until the green deployment in sync-wave 1 is healthy, meaning its rollout has finished and its pods are Ready. The readiness probe verifies the application is accepting HTTP requests. If the green deployment fails to become healthy, the sync never reaches wave 2, so the Service selector is not changed; the sync ends as failed once it times out or is terminated.
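
The same gate can be checked by hand before trusting it in a pipeline. A minimal sketch: kubectl rollout status blocks until the Deployment's rollout completes or the timeout expires, which approximates the health check ArgoCD applies to a Deployment. The function name wait_for_slot and the 120-second default are assumptions; the checkout-SLOT naming follows the manifests above.

```shell
#!/bin/bash
# wait_for_slot SLOT [TIMEOUT]: succeed only if deployment/checkout-SLOT
# finishes rolling out within TIMEOUT (default 120s, an arbitrary choice).
wait_for_slot() {
  local slot=$1
  local timeout=${2:-120s}
  kubectl rollout status "deployment/checkout-$slot" \
    --namespace production --timeout="$timeout"
}
```

For example, wait_for_slot green 180s && echo "green is ready". A non-zero exit status means the rollout did not complete in time, which is the situation in which ArgoCD would also hold back the traffic switch.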

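The sync-wave 3 smoke test Job runs after the traffic switch, so live traffic is already on green when it executes. If the Job fails, the ArgoCD sync operation is marked as Failed; ArgoCD does not switch traffic back on its own. The team is notified and can initiate a rollback.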

The Recovery

Green deployment fails health checks: ArgoCD never reaches sync-wave 2. Traffic stays on blue. Fix the image and push a new commit.

Smoke test fails after traffic switch: Run the promotion script again with the previous image tag, switching back to blue. Or revert the infra repo commit. ArgoCD syncs the service selector back to the old slot.

Need instant rollback during an incident: Change the service selector in Git back to the previous slot. Push. ArgoCD syncs in under 30 seconds when a Git webhook or a manual sync triggers it; with default polling it can take a few minutes to notice the commit. The old version is still running as long as the sync-wave 4 cleanup has not scaled it down.
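
The Git-based rollback might look like the following. This is a minimal sketch, assuming the promotion was the most recent commit on the current branch of the infra repo and that the ArgoCD application is named checkout-production (a hypothetical name); rollback_last_promotion is likewise a hypothetical helper. The final argocd app sync is optional and only avoids waiting for ArgoCD to notice the commit.

```shell
#!/bin/bash
# rollback_last_promotion REPO_DIR [APP]: revert the latest commit in the infra
# repo, push it, and trigger a sync. Each step runs only if the previous one succeeded.
rollback_last_promotion() {
  local repo_dir=$1
  local app=${2:-checkout-production}   # hypothetical application name
  git -C "$repo_dir" revert --no-edit HEAD &&
    git -C "$repo_dir" push &&
    argocd app sync "$app"
}
```

Reverting rather than editing the selector by hand keeps the rollback symmetric with the promotion: the image tag and the selector go back together, and the Git history records what happened.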