ship it and sleep

Progressive Delivery: Argo Rollouts, Traffic Splitting, and Automated Rollback

Chapter 40 of 66

Progressive Delivery with Argo Rollouts

Progressive delivery is canary deployment with automated decision-making. Instead of a human watching dashboards and deciding when to promote, analysis templates query metrics and make the decision programmatically. If the metrics are good, promote. If the metrics are bad, roll back. No human in the loop.

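Argo Rollouts is a Kubernetes controller that adds a Rollout custom resource, a drop-in replacement for the Deployment object. It runs alongside the built-in Deployment controller rather than replacing it. A Rollout supports canary and blue-green strategies with integrated analysis, and a separate Experiment resource covers short-lived comparisons.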

[Figure: Argo Rollouts progressive delivery timeline]

The Failure

The team implemented canary deployments with manual observation. The process:

  1. Deploy canary (5% traffic)
  2. Engineer watches Grafana for 10 minutes
  3. Engineer increases to 25% traffic
  4. Engineer watches Grafana for 10 minutes
  5. Engineer promotes to 100%

The process took 30 minutes of an engineer’s focused attention. On Friday afternoons, the engineer watched for 5 minutes instead of 10. On one Friday, the canary had a gradual memory leak that only became visible after 15 minutes. The engineer promoted at 5 minutes. Production OOM-killed after 2 hours.

Automated analysis does not get tired on Fridays. It runs the same checks every time, for the same duration, with the same thresholds.

The Mechanism

Argo Rollouts CRDs

CRD               Purpose
Rollout           Replaces Deployment, defines canary/blue-green strategy
AnalysisTemplate  Reusable metric query definition
AnalysisRun       Instance of an AnalysisTemplate for a specific rollout
Experiment        Runs multiple ReplicaSets temporarily for A/B comparison

Rollout Lifecycle

  1. Pod template changes (for example, a new image tag) → Rollout creates canary ReplicaSet
  2. Traffic routing updated (5% to canary)
  3. AnalysisRun created from AnalysisTemplate
  4. Analysis queries Prometheus at defined intervals
  5. If all metrics pass → increase traffic weight
  6. Repeat steps 3-5 at each stage
  7. At 100% → scale down old ReplicaSet
  8. If any analysis fails → abort, scale down canary, restore stable

The Implementation

Complete Rollout Specification

# HARDENED: Full Argo Rollout with progressive delivery
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-service
  namespace: production
  labels:
    app.kubernetes.io/name: checkout-service
    app.kubernetes.io/part-of: ecommerce
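spec:
  replicas: 5
  revisionHistoryLimit: 3
  # AnalysisRun retention is set at spec.analysis, not under
  # strategy.canary.analysis (which configures background analysis).
  analysis:
    successfulRunHistoryLimit: 5
    unsuccessfulRunHistoryLimit: 5
  # rollbackWindow is a spec-level field: rolling back to one of the
  # last 3 revisions skips the canary steps and analysis.
  rollbackWindow:
    revisions: 3
  selector:
    matchLabels:
      app: checkout-service
  strategy:
    canary:
      canaryService: checkout-canary
      stableService: checkout-stable
      trafficRouting:
        nginx:
          stableIngress: checkout-ingress
          annotationPrefix: nginx.ingress.kubernetes.io
      # Label pods by role so metrics can be filtered per revision.
      # Assumes Prometheus relabeling copies the pod label "revision"
      # onto the scraped series.
      canaryMetadata:
        labels:
          revision: canary
      stableMetadata:
        labels:
          revision: stable
      steps:
        # Stage 1: Smoke test (5% traffic, 2 min)
        - setWeight: 5
        - pause: { duration: 2m }
        - analysis:
            templates:
              - templateName: error-rate
              - templateName: latency-p99
            args:
              - name: revision
                value: canary

        # Stage 2: Initial validation (20% traffic, 5 min)
        - setWeight: 20
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: error-rate
              - templateName: latency-p99
              - templateName: memory-usage
            args:
              - name: revision
                value: canary

        # Stage 3: Load validation (50% traffic, 10 min)
        - setWeight: 50
        - pause: { duration: 10m }
        - analysis:
            templates:
              - templateName: error-rate
              - templateName: latency-p99
              - templateName: memory-usage
              - templateName: throughput-comparison
            args:
              - name: revision
                value: canary

        # Stage 4: Full promotion
        - setWeight: 100
      abortScaleDownDelaySeconds: 30
      dynamicStableScale: true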
  template:
    metadata:
      labels:
        app: checkout-service
    spec:
      containers:
        - name: checkout
          image: ghcr.io/acme/checkout-service:PLACEHOLDER
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
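
The steps in this spec run analysis inline, so each check only observes the canary for a few minutes. For slow-burning problems such as the memory leak in the failure story, Argo Rollouts also supports background analysis under strategy.canary.analysis, which runs alongside the steps instead of between them. A minimal sketch, reusing the memory-usage template from this chapter; the startingStep value is an illustrative choice, not part of the original spec:

```yaml
# Fragment of the Rollout spec (fields not shown stay unchanged)
spec:
  strategy:
    canary:
      analysis:
        templates:
          - templateName: memory-usage
        # Index into steps: start the background run once the rollout
        # reaches the step at index 1 (the first pause), so the canary
        # is already receiving traffic.
        startingStep: 1
      steps:
        - setWeight: 5
        - pause: { duration: 2m }
        # ...remaining steps as in the full spec
```

Because memory-usage sets count: 5, this background run would finish after five measurements. A metric that keeps interval but omits count continues measuring until the rollout completes, which is closer to what a leak detector needs. A failed background run aborts the rollout in the same way a failed inline step does.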

Analysis Templates

# Error rate must stay below 1%
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
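metadata:
  name: error-rate
  namespace: production
spec:
  # Declared so the Rollout's analysis step can supply {{args.revision}}
  args:
    - name: revision
  metrics: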
    - name: error-rate
      interval: 30s
      count: 5
      successCondition: result[0] < 0.01
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              app="checkout-service",
              revision="{{args.revision}}",
              code=~"5.."}[2m]))
            /
            sum(rate(http_requests_total{
              app="checkout-service",
              revision="{{args.revision}}"}[2m]))
---
# p99 latency must stay below 500ms
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
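metadata:
  name: latency-p99
  namespace: production
spec:
  # Declared so the Rollout's analysis step can supply {{args.revision}}
  args:
    - name: revision
  metrics: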
    - name: latency
      interval: 30s
      count: 5
      successCondition: result[0] < 500
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{
                app="checkout-service",
                revision="{{args.revision}}"}[2m])) by (le)) * 1000
---
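# Memory must stay below 80% of limit
# Note: the pod selector below matches stable and canary pods alike, so this
# is a fleet-wide average. Narrow the pod regex to the canary ReplicaSet to
# isolate the new version.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: memory-usage
  namespace: production
spec: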
  metrics:
    - name: memory
      interval: 60s
      count: 5
      successCondition: result[0] < 0.8
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            avg(container_memory_working_set_bytes{
              pod=~"checkout-service-.*",
              container="checkout",
              namespace="production"})
            /
            avg(kube_pod_container_resource_limits{
              pod=~"checkout-service-.*",
              container="checkout",
              namespace="production",
              resource="memory"})
---
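# Canary throughput must be at least 90% of stable
# Only meaningful at a 50/50 split (Stage 3), where both sides should see
# similar request rates.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: throughput-comparison
  namespace: production
spec: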
  metrics:
    - name: throughput-ratio
      interval: 60s
      count: 3
      successCondition: result[0] > 0.9
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              app="checkout-service",
              revision="canary"}[5m]))
            /
            sum(rate(http_requests_total{
              app="checkout-service",
              revision="stable"}[5m]))

Services for Traffic Splitting

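# Both Services start with the same selector. The Rollouts controller adds a
# rollouts-pod-template-hash selector to each one at runtime so that they
# target only the stable or only the canary ReplicaSet.
apiVersion: v1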
kind: Service
metadata:
  name: checkout-stable
  namespace: production
spec:
  selector:
    app: checkout-service
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: checkout-canary
  namespace: production
spec:
  selector:
    app: checkout-service
  ports:
    - port: 80
      targetPort: 8080
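
The Rollout's trafficRouting section refers to a stable Ingress named checkout-ingress, which must already exist and route to the stable Service. A minimal sketch of what it could look like; the host name and the nginx ingress class are assumptions, not taken from the original setup:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: checkout-ingress
  namespace: production
spec:
  ingressClassName: nginx            # assumed ingress class
  rules:
    - host: checkout.example.com     # hypothetical host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: checkout-stable   # must match stableService in the Rollout
                port:
                  number: 80
```

During a rollout, the controller creates a second, canary Ingress derived from this one and sets the NGINX canary weight annotation on it, so only the stable Ingress needs to be managed by hand.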

The Gate

Each analysis step is a gate. The rollout proceeds to the next traffic weight only if every analysis template in the step passes. Each metric takes count measurements, one every interval. If more than failureLimit measurements fail, the AnalysisRun fails and the rollout aborts.
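
To make the arithmetic concrete, here is the error-rate metric from this chapter annotated with what each field implies. The timing comment assumes the first measurement is taken as soon as the AnalysisRun starts, with no initialDelay set:

```yaml
# Fragment of an AnalysisTemplate's spec.metrics (provider omitted for brevity)
metrics:
  - name: error-rate
    interval: 30s      # wait 30s between measurements
    count: 5           # 5 measurements at 0s, 30s, 60s, 90s, 120s (about 2 minutes)
    failureLimit: 2    # 2 failed measurements are tolerated; a 3rd fails the run
    successCondition: result[0] < 0.01
```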

The three-stage approach tests different concerns at different scales:

  • Stage 1 (5%): Basic health. Is the service responding without errors or latency regressions?
  • Stage 2 (20%): Resource behavior. Is the service's memory usage within bounds?
  • Stage 3 (50%): Throughput parity. Is the canary handling traffic at the same rate as stable?

The Recovery

Rollout aborted: Argo Rollouts automatically scales down canary pods and routes all traffic to stable. Check the AnalysisRun to see which metric failed: kubectl get analysisrun -n production.

Rollout stuck in Paused state: The rollout has reached a pause step with no duration, or an AnalysisRun finished Inconclusive. After checking the metrics yourself, promote manually: kubectl argo rollouts promote checkout-service -n production.
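
The full spec in this chapter uses only timed pauses, so it never waits for a human. If a manual gate is wanted, for example before the final promotion, a pause with no duration holds the rollout until someone promotes it. A sketch of where such a step could go (the placement is an assumption, not part of the original spec):

```yaml
# Fragment of spec.strategy.canary.steps
steps:
  - setWeight: 50
  - pause: { duration: 10m }   # timed pause: resumes automatically
  - pause: {}                  # indefinite pause: waits for a manual promote
  - setWeight: 100
```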

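Need to roll back after full promotion: kubectl argo rollouts undo checkout-service -n production. Argo Rollouts reverts to the previous revision’s ReplicaSet. By default the rollback runs through the canary steps again; with rollbackWindow set, a rollback to a revision inside the window skips the steps and analysis.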

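Analysis returns no data: When the canary has received no traffic, the Prometheus query returns an empty result, and a condition that indexes result[0] typically produces a measurement error rather than a pass or a fail. Errored measurements count against consecutiveErrorLimit (default 4); once it is exceeded, the AnalysisRun ends in Error and the rollout aborts. To handle the empty case explicitly, guard the condition, for example successCondition: len(result) == 0 || result[0] < 0.01. Note that inconclusiveLimit is a different setting: it applies to measurements that satisfy neither successCondition nor failureCondition, and an “Inconclusive” run pauses the rollout instead of aborting it.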
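
One way to make the no-data case explicit is sketched below as a variant of the error-rate template. The template name is hypothetical, and treating an empty result as a pass is a design choice that fits low-traffic services, not a recommendation for every workload:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-guarded   # hypothetical name for this variant
  namespace: production
spec:
  args:
    - name: revision
  metrics:
    - name: error-rate
      interval: 30s
      count: 5
      # An empty result (no canary traffic yet) counts as a pass here.
      # Drop the len(result) guard if "no data" should not pass.
      successCondition: len(result) == 0 || result[0] < 0.01
      failureLimit: 2
      # Consecutive query errors tolerated before the run ends in Error.
      consecutiveErrorLimit: 4
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              app="checkout-service",
              revision="{{args.revision}}",
              code=~"5.."}[2m]))
            /
            sum(rate(http_requests_total{
              app="checkout-service",
              revision="{{args.revision}}"}[2m]))
```

Passing on empty data means a canary that receives no requests at all would sail through this check, so it works best alongside a metric that confirms traffic is actually arriving, such as the throughput comparison.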