ship it and sleep

Deployment Strategies: Blue-Green, Canary, and Feature Flags as a Release Primitive

Chapter 22 of 66

Deployment Strategies

Kubernetes workloads are typically released with one of three deployment strategies. Rolling updates replace pods incrementally and are the default for a Deployment. Blue-green deployments run two complete environments and switch all traffic at once. Canary deployments send a small percentage of traffic to the new version and increase it in steps.

Rolling updates are fine for stateless services with backward-compatible changes. Blue-green is appropriate when you need instant rollback and can afford double the resources. Canary is appropriate when you need to validate with real traffic before committing.
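
Because rolling updates are the default, the main thing to tune is how many extra pods may be created and how many may be unavailable at once. The fragment below is a minimal sketch, assuming a hypothetical catalog-service Deployment. The name, replica count, port, and image tag are illustrative placeholders, not taken from a real manifest.

```yaml
# Sketch: explicit rolling update settings (names and values are assumptions)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: catalog-service # hypothetical
  namespace: production
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1       # at most one extra pod above the replica count
      maxUnavailable: 0 # never drop below the replica count during the rollout
  selector:
    matchLabels:
      app: catalog-service
  template:
    metadata:
      labels:
        app: catalog-service
    spec:
      containers:
        - name: catalog
          image: ghcr.io/acme/catalog-service:NEW_SHA # placeholder tag
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
```

With these values, Kubernetes starts one new pod, waits for it to pass its readiness probe, and only then removes one old pod. Both versions serve traffic until the last old pod is gone, which is why a bad release reaches a growing share of users as the rollout progresses.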

Feature flags are not a deployment strategy. They are a release strategy. You can deploy code with a feature flag disabled, then enable the flag without deploying. This separates deployment (shipping code) from release (exposing functionality).

[Figure: Canary traffic split timeline]

The Failure

The catalog team used rolling updates for every deployment. A new version introduced a bug in the search API that caused results to return in random order instead of by relevance. During the rolling update, some users hit the old version (correct results) and some hit the new version (random results). The team noticed after 15 minutes when support tickets arrived. They initiated a rollback, which took another 5 minutes of rolling update in reverse. Total exposure: 20 minutes, all users affected for 10 of those minutes.

With a canary deployment, the team would have sent 5% of traffic to the new version. Automated analysis would have detected the latency increase (randomized results took longer) and rolled back automatically. Total exposure: 2 minutes, 5% of users.

The Mechanism

Strategy Comparison

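Strategy       | Resource Cost    | Rollback Speed | Risk Exposure            | Complexity
---------------|------------------|----------------|--------------------------|-------------------------
Rolling update | 1x + surge       | Minutes        | All users during rollout | Low
Blue-green     | 2x               | Seconds        | All users after switch   | Medium
Canary         | 1x + canary pods | Seconds        | % of users               | High
Feature flag   | 1x               | Milliseconds   | Users with flag on       | Medium (code complexity)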

When to Use Which

  • Rolling update: Internal services, batch jobs, non-critical paths
  • Blue-green: Services requiring instant rollback, database-coupled deploys (CH9)
  • Canary: User-facing services, services with measurable SLOs
  • Feature flag: New features, A/B tests, gradual rollouts independent of deployment

The Implementation

Blue-Green with ArgoCD Sync Waves

# HARDENED: Blue-green deployment using ArgoCD sync waves
# The green (new) deployment syncs first, then the service switches traffic
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service-green
  namespace: production
  annotations:
    argocd.argoproj.io/sync-wave: "1"
  labels:
    app: checkout-service
    slot: green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout-service
      slot: green
  template:
    metadata:
      labels:
        app: checkout-service
        slot: green
    spec:
      containers:
        - name: checkout
          image: ghcr.io/acme/checkout-service:NEW_SHA
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
---
# Service switches to green after green deployment is healthy (sync-wave 2)
apiVersion: v1
kind: Service
metadata:
  name: checkout-service
  namespace: production
  annotations:
    argocd.argoproj.io/sync-wave: "2"
spec:
  selector:
    app: checkout-service
    slot: green # Switch traffic to green
  ports:
    - port: 80
      targetPort: 8080

Canary with Argo Rollouts

# HARDENED: Canary rollout with automated analysis
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-service
  namespace: production
spec:
  replicas: 5
  strategy:
    canary:
      canaryService: checkout-service-canary
      stableService: checkout-service-stable
      trafficRouting:
        nginx:
          stableIngress: checkout-service
          additionalIngressAnnotations:
            canary-by-header: X-Canary
      steps:
        - setWeight: 5
        - pause: { duration: 2m }
        - analysis:
            templates:
              - templateName: checkout-success-rate
            args:
              - name: service-name
                value: checkout-service-canary
        - setWeight: 20
        - pause: { duration: 3m }
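        - analysis:
            templates:
              - templateName: checkout-success-rate
            args: # service-name has no default, so each analysis step must pass it
              - name: service-name
                value: checkout-service-canary
        - setWeight: 50
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: checkout-success-rate
              - templateName: checkout-latency
            args:
              - name: service-name
                value: checkout-service-canary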
        - setWeight: 100
  # rollbackWindow is a spec-level field, not part of the canary strategy
  rollbackWindow:
    revisions: 2
  selector:
    matchLabels:
      app: checkout-service
  template:
    metadata:
      labels:
        app: checkout-service
    spec:
      containers:
        - name: checkout
          image: ghcr.io/acme/checkout-service:NEW_SHA
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi

Analysis Template with Prometheus

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 30s
      count: 3
      successCondition: result[0] > 0.99
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",code=~"2.."}[2m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[2m]))
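
The Rollout above also references a checkout-latency template that this chapter does not show. A minimal sketch of what it could look like follows. It assumes the service exports a Prometheus histogram named http_request_duration_seconds with a service label, and it assumes a p95 budget of 500 ms. Both the metric name and the threshold are placeholders to adapt to your own instrumentation and SLO.

```yaml
# Sketch only: the metric name and the 0.5s threshold are assumptions
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-latency
spec:
  args:
    - name: service-name
  metrics:
    - name: p95-latency
      interval: 30s
      count: 3
      successCondition: result[0] < 0.5 # p95 under 500 ms
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket{service="{{args.service-name}}"}[2m])) by (le)
            )
```

This sketch declares the same service-name argument as the success-rate template, so any analysis step that runs it needs to pass that argument. A canary that receives little or no traffic during the query window may not produce a usable value, so the pause before each analysis step should be long enough for real requests to arrive.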

The Gate

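During a canary rollout, an analysis step runs after each pause. The success-rate template runs at every step, and the latency template is added at the 50% step. If the success rate is not above 99% or latency exceeds its threshold, Argo Rollouts automatically aborts the rollout and scales the canary to zero. Traffic returns to the stable version in seconds.

Each analysis run takes three measurements at 30-second intervals (count: 3, interval: 30s). A measurement fails when it does not satisfy the successCondition, here a success rate of 99% or lower. One failed measurement out of three is tolerated (failureLimit: 1). A second failure fails the analysis and aborts the rollout.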

The Recovery

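Canary analysis fails: Argo Rollouts aborts the rollout and returns traffic to the stable version automatically. No manual action is needed to restore service. Check the failed analysis run in the ArgoCD dashboard to see which metric failed. The rollout stays aborted until you push a fixed or reverted image, or retry it.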

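Blue-green switch causes errors: Update the Service selector back to slot: blue (the previous version) and push the change to Git. ArgoCD applies it on the next sync, which takes seconds with a Git webhook or a manual sync but can take up to the polling interval (three minutes by default) without one. This only works while the blue deployment is still running, so keep it up until green is verified. Then investigate the green deployment.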
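
The blue-green rollback amounts to a one-line change in the Service manifest shown earlier. The sketch below assumes the previous version runs as a Deployment whose pods carry the label slot: blue. That label value is implied by the recovery step, but the blue manifest itself is not shown in this chapter.

```yaml
# Sketch: rollback by repointing the Service at the blue slot
# Assumes pods labeled slot: blue are still running and healthy
apiVersion: v1
kind: Service
metadata:
  name: checkout-service
  namespace: production
  annotations:
    argocd.argoproj.io/sync-wave: "2"
spec:
  selector:
    app: checkout-service
    slot: blue # Route traffic back to the previous version
  ports:
    - port: 80
      targetPort: 8080
```

Only the selector changes. The green Deployment stays in place, so its pods and logs remain available for debugging.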

Feature flag causes issues after enabling: Disable the flag. The code is still deployed but the feature is hidden. No deployment or rollback needed. Fix the code, deploy, then re-enable the flag.
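
One way to picture the flag-off path is a ConfigMap that holds the flag state. This is a minimal sketch, assuming the service reads its flags from a mounted file and re-reads that file at runtime. The ConfigMap name, the file key, and the flag name new-checkout-flow are all hypothetical, and many teams would use a dedicated flag service instead.

```yaml
# Sketch: feature flags held in a ConfigMap (all names are hypothetical)
apiVersion: v1
kind: ConfigMap
metadata:
  name: checkout-feature-flags
  namespace: production
data:
  flags.yaml: |
    new-checkout-flow: false # set to true to release, back to false to recover
```

Changing this value touches neither the image nor the pod template, so no rollout happens. Kubernetes refreshes files from a mounted ConfigMap with some delay, so a flag service that evaluates flags per request is the better fit when rollback has to be near-instant.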