ship it and sleep

Argo Rollouts Spec, Analysis Templates, and Metric Providers

4 min read Chapter 41 of 66

The Failure

The team created one AnalysisTemplate per service, with the service name hardcoded in each Prometheus query. By the sixth service, they had six copies of the same template that differed by one string. When a label rename forced a query update, they updated five of the six templates. The sixth service’s canary analysis stopped working because its query still referenced the old label, which no longer existed.

ClusterAnalysisTemplates with arguments solve this. One template, parameterized by service name, reused across all services.

The Mechanism

Analysis Lifecycle

  1. Rollout reaches an analysis step
  2. Argo Rollouts creates an AnalysisRun from the template
  3. AnalysisRun runs metrics at the specified interval
  4. Each measurement is evaluated against successCondition
  5. Each measurement ends in one of four phases: Successful, Failed, Inconclusive, or Error
  6. When count measurements are complete or failureLimit is exceeded → Analysis completes
  7. If Analysis succeeds → Rollout proceeds to next step
  8. If Analysis fails → Rollout aborts

Metric Provider Types

Provider   | Use Case                               | Configuration
Prometheus | Infrastructure and application metrics | Query URL + PromQL
Datadog    | APM metrics, custom metrics            | API key + query
Web        | Custom validation endpoints            | URL + expected response
Job        | Run a Kubernetes Job as analysis       | Job spec
Kayenta    | Statistical canary analysis            | Kayenta server URL
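
The Job row is the least self-explanatory entry in the table, so here is a minimal sketch of a Job-backed metric. It assumes the measurement should pass when the Job exits with code zero and fail otherwise, which is how the Job provider reports results. The template name, container image, command, and the /healthz path are hypothetical placeholders, not part of the original setup.

```yaml
# Sketch only: name, image, command, and the /healthz path are hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-job-check
spec:
  metrics:
    - name: integration-check
      provider:
        job:
          spec:
            backoffLimit: 0 # No retries: one failed pod fails the measurement
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: check
                    image: curlimages/curl:latest
                    # --fail makes curl exit non-zero on HTTP 4xx/5xx
                    command: ["curl", "--fail", "--silent", "http://checkout-canary.production.svc.cluster.local/healthz"]
```

No successCondition is needed here because the Job's exit status is the result.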

The Implementation

ClusterAnalysisTemplate with Arguments

# HARDENED: Reusable analysis template across all services
apiVersion: argoproj.io/v1alpha1
kind: ClusterAnalysisTemplate
metadata:
  name: service-health
spec:
  args:
    - name: service-name
    - name: namespace
      value: production
    - name: error-threshold
      value: "0.01"
    - name: latency-threshold-ms
      value: "500"
  metrics:
    - name: error-rate
      interval: 30s
      count: 10
      successCondition: "result[0] < {{args.error-threshold}}"
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              app="{{args.service-name}}",
              namespace="{{args.namespace}}",
              code=~"5.."}[2m]))
            /
            sum(rate(http_requests_total{
              app="{{args.service-name}}",
              namespace="{{args.namespace}}"}[2m]))

    - name: latency-p99
      interval: 30s
      count: 10
      successCondition: "result[0] < {{args.latency-threshold-ms}}"
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{
                app="{{args.service-name}}",
                namespace="{{args.namespace}}"}[2m])) by (le)) * 1000

Rollout Using ClusterAnalysisTemplate

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-service
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 2m }
        - analysis:
            templates:
              - clusterScope: true
                templateName: service-health
            args:
              - name: service-name
                value: checkout-service
              - name: error-threshold
                value: "0.005" # Stricter for checkout
              - name: latency-threshold-ms
                value: "300" # Stricter for checkout
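
To illustrate the reuse, here is a sketch of a second Rollout pointing at the same template. The service name search-service is hypothetical. It supplies only service-name, so namespace, error-threshold, and latency-threshold-ms fall back to the defaults declared in the template.

```yaml
# Sketch only: search-service is a hypothetical second consumer of the template.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: search-service
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 2m }
        - analysis:
            templates:
              - clusterScope: true
                templateName: service-health
            args:
              # service-name has no default in the template, so it must be set here
              - name: service-name
                value: search-service
```

A later label rename then touches one query in one template instead of one per service.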

Web Provider for Custom Validation

# HARDENED: Custom validation endpoint
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-smoke
spec:
  metrics:
    - name: smoke-test
      interval: 60s
      count: 3
      # jsonPath below already extracts $.status, so result is the string itself
      successCondition: result == "pass"
      failureLimit: 0
      provider:
        web:
          url: http://smoke-test-service.production.svc.cluster.local/validate
          method: POST
          headers:
            - key: Content-Type
              value: application/json
          body: |
            {
              "service": "checkout-service",
              "endpoint": "http://checkout-canary.production.svc.cluster.local",
              "tests": ["health", "cart-add", "cart-checkout"]
            }
          jsonPath: "{$.status}"
          timeoutSeconds: 30

Dry-Run Analysis

Test analysis templates without triggering a rollout:

# Create a standalone AnalysisRun to test the template
kubectl apply -f - <<EOF
apiVersion: argoproj.io/v1alpha1
kind: AnalysisRun
metadata:
  name: test-service-health
  namespace: production
spec:
  metrics:
    - name: error-rate
      interval: 30s
      count: 3
      successCondition: "result[0] < 0.01"
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{app="checkout-service",code=~"5.."}[2m]))
            / sum(rate(http_requests_total{app="checkout-service"}[2m]))
EOF

# Watch the results
kubectl get analysisrun test-service-health -n production -w

The Gate

Each AnalysisRun is a gate. The Rollout controller evaluates the AnalysisRun’s status at each step. The status is computed from the individual metric measurements:

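  • All metrics succeed → AnalysisRun Successful → Rollout proceeds
  • Any metric exceeds failureLimit → AnalysisRun Failed → Rollout aborts
  • A measurement satisfies neither successCondition nor failureCondition (when both are set) → Inconclusive → counts toward inconclusiveLimit; once that limit is exceeded, the Rollout pauses for a manual decision
  • A measurement cannot be evaluated, for example result[0] on an empty Prometheus result → Error → counts toward consecutiveErrorLimit

For services with low traffic, where Prometheus queries may return empty results during off-hours, guard the conditions against empty results (for example with len(result) == 0) and set inconclusiveLimit to decide how many ambiguous measurements the gate tolerates.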
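
As a sketch of how a low-traffic service might handle empty results: a condition that indexes result[0] raises an evaluation error when Prometheus returns nothing, so the conditions below check len(result) first. When the result is empty, neither condition is true and the measurement is recorded as Inconclusive. The limit values are illustrative assumptions, not recommendations from the original setup.

```yaml
# Sketch only: a fragment for spec.metrics; the limits are illustrative.
metrics:
  - name: error-rate
    interval: 30s
    count: 10
    # Empty result: both conditions are false, so the measurement is Inconclusive
    successCondition: "len(result) > 0 && result[0] < 0.01"
    failureCondition: "len(result) > 0 && result[0] >= 0.01"
    failureLimit: 3
    inconclusiveLimit: 5 # Tolerate up to 5 no-data measurements
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          sum(rate(http_requests_total{app="checkout-service",code=~"5.."}[2m]))
          / sum(rate(http_requests_total{app="checkout-service"}[2m]))
```

If no data should count as healthy instead, a single successCondition of the form len(result) == 0 || result[0] < 0.01 is the simpler option, at the cost of passing a canary that received no traffic.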

The Recovery

AnalysisRun returns Inconclusive or Error: The measurements are not producing a clear pass or fail, most often because the Prometheus query returns no data points. Either the canary has no traffic or the query is wrong. Test the query directly against Prometheus. If the service has low traffic, widen the query time window or restrict canary validation to peak hours.

All analyses pass but service is still broken: The analysis metrics do not cover the failure mode. Add more metrics. Common gaps: database connection pool exhaustion, downstream service errors, response body correctness (not just HTTP status codes).
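
As one way to close such a gap, an extra metric can be appended to the shared template. The sketch below watches database connection pool pressure. The metric name hikaricp_connections_pending and the threshold of 5 are assumptions for illustration; substitute whatever your services actually export.

```yaml
# Sketch only: appended under spec.metrics of a template that declares the
# service-name and namespace args. Metric name and threshold are assumptions.
    - name: db-pool-pending
      interval: 30s
      count: 10
      # Treat "no series exported" as a pass; fail when threads queue for connections
      successCondition: "len(result) == 0 || result[0] < 5"
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            max(hikaricp_connections_pending{
              app="{{args.service-name}}",
              namespace="{{args.namespace}}"})
```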

Analysis takes too long: Reduce count or shorten interval. Total analysis time is roughly count × interval, so 10 measurements at 30-second intervals take about 5 minutes. Balance thoroughness with deployment speed.
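
As a sketch of that trade-off, the fragment below cuts the analysis from about 5 minutes to about 2. The specific values are illustrative assumptions. With fewer measurements, failureLimit is lowered as well so that it stays proportionate to count.

```yaml
# Sketch only: values are illustrative, not tuned for any real service.
metrics:
  - name: error-rate
    interval: 20s # Was 30s
    count: 6 # Was 10; 6 x 20s is roughly 2 minutes
    successCondition: "result[0] < 0.01"
    failureLimit: 1 # Was 3; scaled down with count
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          sum(rate(http_requests_total{app="checkout-service",code=~"5.."}[2m]))
          / sum(rate(http_requests_total{app="checkout-service"}[2m]))
```

Because the query still uses a 2-minute rate window, consecutive measurements overlap heavily at a 20-second interval, so shortening the interval much further adds little new signal.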