ship it and sleep

Traffic Splitting with Istio/Nginx and Automated Rollback Triggers

4 min read Chapter 42 of 66

The Failure

The team configured traffic splitting with Nginx ingress annotations and set the canary weight to 10%. They then noticed that some users hit the canary on half of their requests while others never hit it at all. Nginx’s weighted canary routing is probabilistic per request, not per user: each request is routed independently, so a user making 10 requests might get 1 canary response or 5. For checkout flows that span multiple requests (add to cart → checkout → payment), a user could start on stable and finish on canary, or vice versa.

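Session affinity during canary rollouts keeps a user on the same version for the duration of their session. Istio offers consistent hashing on a cookie or header, which pins a client to a backend within a destination. Nginx offers the canary-by-cookie annotation: a request whose cookie value is always goes to the canary and one whose value is never goes to stable. The ingress controller only reads this cookie, so the application has to set it.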

The Mechanism

Traffic Routing Options

Router | Mechanism | Session Affinity | Weighted Routing | Header Routing
Nginx Ingress | Canary annotations | Cookie-based | Yes (weight annotation) | Yes (header annotation)
Istio VirtualService | Traffic rules | Consistent hash | Yes (weight field) | Yes (match rules)
Traefik | Weighted services | Cookie-based | Yes (weight) | Yes (headers)
AWS ALB | Target groups | Cookie-based | Yes (weight) | No

The Implementation

Nginx Traffic Splitting

# HARDENED: Nginx ingress with canary routing
# Stable ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: checkout-ingress
  namespace: production
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "30"
spec:
  ingressClassName: nginx
  rules:
    - host: api.acme.com
      http:
        paths:
          - path: /api/checkout
            pathType: Prefix
            backend:
              service:
                name: checkout-stable
                port:
                  number: 80
---
# Canary ingress (managed by Argo Rollouts)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: checkout-ingress-canary
  namespace: production
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"
    # Header-based override for testing
    nginx.ingress.kubernetes.io/canary-by-header: "X-Canary"
    nginx.ingress.kubernetes.io/canary-by-header-value: "true"
    # Session affinity: cookie value "always" pins a client to canary, "never" to stable.
    # The controller only reads the cookie; the application must set it.
    nginx.ingress.kubernetes.io/canary-by-cookie: "canary-session"
spec:
  ingressClassName: nginx
  rules:
    - host: api.acme.com
      http:
        paths:
          - path: /api/checkout
            pathType: Prefix
            backend:
              service:
                name: checkout-canary
                port:
                  number: 80
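
The canary ingress above is described as managed by Argo Rollouts. A minimal sketch of the Rollout side follows, assuming a Rollout named checkout-service (the name used later in this chapter) and the stable ingress shown above. The step values are illustrative. The controller generates and updates the canary ingress itself, so the generated ingress name may differ from the hand-written one above.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-service
  namespace: production
spec:
  # replicas, selector, and template omitted for brevity
  strategy:
    canary:
      canaryService: checkout-canary
      stableService: checkout-stable
      trafficRouting:
        nginx:
          # Argo Rollouts copies this ingress and adds the canary annotations
          stableIngress: checkout-ingress
          # Extra canary annotations, written without the nginx prefix
          additionalIngressAnnotations:
            canary-by-header: X-Canary
            canary-by-header-value: "true"
      steps:
        - setWeight: 10             # illustrative weight
        - pause: { duration: 5m }   # illustrative pause
```

With this in place the canary-weight annotation is written by the controller at each setWeight step, so it should not also be edited by hand.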

Istio VirtualService Traffic Splitting

# HARDENED: Istio VirtualService for weighted routing
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout-service
  namespace: production
spec:
  hosts:
    - checkout-service.production.svc.cluster.local
  http:
    # Header-based routing for canary testing
    - match:
        - headers:
            X-Canary:
              exact: "true"
      route:
        - destination:
            host: checkout-canary
            port:
              number: 80

    # Weighted routing for canary traffic
    # (named so the Rollout's routes entry "primary" can find it)
    - name: primary
      route:
        - destination:
            host: checkout-stable
            port:
              number: 80
          weight: 90
        - destination:
            host: checkout-canary
            port:
              number: 80
          weight: 10
---
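# Session affinity via consistent hashing on a cookie.
# This pins a client to one backend pod within a destination host; confirm in
# your mesh that it also keeps a client on one version across the weighted split.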
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: checkout-service
  namespace: production
spec:
  host: checkout-service.production.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      consistentHash:
        httpCookie:
          name: canary-session
          ttl: 3600s

Argo Rollouts with Istio

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-service
  namespace: production
spec:
  # replicas, selector, and template (or workloadRef) omitted for brevity
  strategy:
    canary:
      canaryService: checkout-canary
      stableService: checkout-stable
      trafficRouting:
        istio:
          virtualServices:
            - name: checkout-service
              routes:
                - primary
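          # Host-level splitting: the VirtualService routes to the
          # checkout-stable and checkout-canary Services directly, so no
          # destinationRule block is set here. Subset-level splitting would
          # need a DestinationRule that defines canary and stable subsets.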
      steps:
        - setWeight: 5
        - pause: { duration: 2m }
        - analysis:
            templates:
              - templateName: error-rate
        - setWeight: 20
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: error-rate
              - templateName: latency-p99
        - setWeight: 50
        - pause: { duration: 10m }
        - setWeight: 100
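
The steps above reference an error-rate template that is not shown. A minimal sketch of what it could look like, assuming a Prometheus server at the address below (hypothetical), Istio's standard istio_requests_total metric, and a 1% error budget chosen purely for illustration:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate
  namespace: production
spec:
  args:
    # Default value, so the Rollout steps do not have to pass it
    - name: service-name
      value: checkout-canary
  metrics:
    - name: error-rate
      interval: 60s
      count: 5
      failureLimit: 1
      # Assumed threshold: fewer than 1% of requests return a 5xx
      successCondition: result[0] < 0.01
      provider:
        prometheus:
          # Hypothetical address; point this at your own Prometheus
          address: http://prometheus.monitoring.svc.cluster.local:9090
          query: |
            sum(rate(istio_requests_total{destination_service_name="{{args.service-name}}",response_code=~"5.*"}[2m]))
            /
            sum(rate(istio_requests_total{destination_service_name="{{args.service-name}}"}[2m]))
```

This sketch does not handle the no-traffic case: if the canary has received no requests, the query may return an empty result or NaN, and the measurement will not pass. Decide explicitly whether "no data" should count as success before relying on it.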

Custom Rollback Triggers

Beyond Prometheus metrics, trigger rollback from external systems:

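# Rollback on PagerDuty incident creation
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: no-active-incidents
spec:
  args:
    - name: pd-service-id
    # Assumption: the API token lives in a Secret named pagerduty (key: token)
    - name: pd-token
      valueFrom:
        secretKeyRef:
          name: pagerduty
          key: token
  metrics:
    - name: pagerduty-check
      interval: 60s
      count: 5
      # Pass only while the service has no triggered or acknowledged incidents
      successCondition: len(result) == 0
      failureLimit: 0
      provider:
        web:
          url: "https://api.pagerduty.com/incidents?statuses[]=triggered&statuses[]=acknowledged&service_ids[]={{args.pd-service-id}}"
          headers:
            - key: Authorization
              value: "Token token={{args.pd-token}}"
          # Return the incidents array; the condition above checks its length
          jsonPath: "{$.incidents}"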

To trigger a rollback by hand:

# Abort the in-progress rollout and shift traffic back to stable
kubectl argo rollouts abort checkout-service -n production

# Roll back to the previous revision
kubectl argo rollouts undo checkout-service -n production

The Gate

Traffic splitting is the mechanism that enables progressive gating. At each stage, a larger percentage of users validate the new version. The combination of automated analysis and traffic splitting creates a multi-layered gate:

  1. Technical gate: Metrics (error rate, latency, memory)
  2. Operational gate: No active incidents (PagerDuty check)
  3. User experience gate: Session-based routing ensures users have consistent experiences
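
The operational gate can run alongside the step-based analysis as a background analysis. A minimal sketch of the relevant Rollout fragment, assuming the no-active-incidents template declares pd-service-id as an argument and resolves the API token itself; the service ID shown is a placeholder:

```yaml
# Fragment of the Rollout spec: background analysis next to the steps
strategy:
  canary:
    analysis:
      templates:
        - templateName: no-active-incidents
      # Steps are zero-indexed: wait until step 1 (the first pause) to start
      startingStep: 1
      args:
        - name: pd-service-id
          value: PXXXXXX   # placeholder PagerDuty service ID
```

A failed background analysis aborts the rollout in the same way a failed inline analysis step does, so an open incident stops promotion at whatever weight the canary has reached.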

The Recovery

Traffic split not working (all traffic goes to stable): Check the ingress annotations or VirtualService configuration. A common cause is that the canary ingress does not use the same ingress class, host, and path as the stable ingress. Verify with kubectl get ingress -n production.

Session affinity causes uneven distribution: Cookie-based affinity means returning users always hit the same version. If most traffic is from returning users, the canary gets less traffic than the weight suggests. Increase the canary weight or use a shorter cookie TTL.

Canary receives traffic before it is ready: Add setHeaderRoute to Argo Rollouts to send only test traffic (via header) before enabling weighted routing. Validate with test traffic first, then open to real users.
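
A minimal sketch of that header-first sequence, assuming the Istio VirtualService from earlier in this chapter. The route name canary-header is hypothetical, and the setCanaryScale steps are an assumption added so that a canary pod exists while the traffic weight is still zero:

```yaml
# Fragment of the Rollout spec
strategy:
  canary:
    canaryService: checkout-canary
    stableService: checkout-stable
    trafficRouting:
      # Routes that Argo Rollouts is allowed to create and remove
      managedRoutes:
        - name: canary-header   # hypothetical route name
      istio:
        virtualServices:
          - name: checkout-service
            routes:
              - primary
    steps:
      # Start one canary pod without shifting any weighted traffic
      - setCanaryScale:
          replicas: 1
      # Only requests carrying X-Canary: true reach the canary
      - setHeaderRoute:
          name: canary-header
          match:
            - headerName: X-Canary
              headerValue:
                exact: "true"
      # Wait for manual promotion once test traffic looks healthy
      - pause: {}
      # Return to weight-based scaling, then open to real users
      - setCanaryScale:
          matchTrafficWeight: true
      - setWeight: 5
      - pause: { duration: 2m }
      # A setHeaderRoute with no match block removes the header route
      - setHeaderRoute:
          name: canary-header
```

The indefinite pause is the point where test traffic gets validated; promoting the rollout (for example with kubectl argo rollouts promote) moves it on to the weighted steps.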