Argo Rollouts Spec, Analysis Templates, and Metric Providers

The Failure

The team created one AnalysisTemplate per service, with the service name hardcoded in each Prometheus query. By the time they added a sixth service, they had six copies of the same template that differed by a single string. When a label rename forced a change to the Prometheus query, they updated five of the six templates. The sixth service’s canary analysis stopped working because its query still referenced the old label, which no longer existed.

ClusterAnalysisTemplates with arguments solve this: one template, parameterized by service name and reused across all services.

The Mechanism

Analysis Lifecycle

Rollout reaches an analysis step
Argo Rollouts creates an AnalysisRun from the template
AnalysisRun runs metrics at the specified interval
Each measurement is evaluated against successCondition
Each measurement ends as Successful, Failed, Inconclusive, or Error
When count measurements are complete or failureLimit is exceeded → Analysis completes
If Analysis succeeds → Rollout proceeds to next step
If Analysis fails → Rollout aborts
If Analysis is Inconclusive → Rollout pauses until it is manually promoted or aborted
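
The four measurement outcomes depend on which conditions a metric defines. The fragment below is a minimal sketch, not taken from the team's templates: it adds a failureCondition next to the successCondition so that a value matching neither is recorded as Inconclusive. The thresholds and limits are illustrative assumptions.

```yaml
# Sketch of one entry under spec.metrics; thresholds and limits are illustrative.
- name: error-rate
  interval: 30s
  count: 10
  # Successful: error ratio under 1%
  successCondition: "result[0] < 0.01"
  # Failed: error ratio at or above 5%
  failureCondition: "result[0] >= 0.05"
  # Values from 1% up to 5% match neither condition and are recorded as Inconclusive
  failureLimit: 3
  inconclusiveLimit: 2
  provider:
    prometheus:
      address: http://prometheus.monitoring:9090
      query: |
        sum(rate(http_requests_total{app="checkout-service",code=~"5.."}[2m]))
        / sum(rate(http_requests_total{app="checkout-service"}[2m]))
```

A measurement is marked Error when the provider call or the condition expression itself fails, for example when Prometheus cannot be reached.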

Metric Provider Types

Provider	Use Case	Configuration
Prometheus	Infrastructure and application metrics	Server address + PromQL query
Datadog	APM metrics, custom metrics	API and app keys + query
Web	Custom validation endpoints	URL + jsonPath into the JSON response
Job	Run a Kubernetes Job as analysis	Job spec
Kayenta	Statistical canary analysis	Kayenta server URL
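
To illustrate the Job row, here is a minimal sketch of a Job-based metric. The template name, image, and command are hypothetical placeholders that do not come from the source. The metric counts as successful when the Job completes with exit code zero.

```yaml
# Sketch: Job-based metric; name, image, and command are hypothetical placeholders.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: integration-test
spec:
  metrics:
    - name: integration-test
      provider:
        job:
          spec:
            backoffLimit: 0            # do not retry a failed test pod
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: test
                    image: example.com/checkout-tests:latest    # hypothetical image
                    command: ["/bin/run-tests", "--target", "checkout-canary"]  # hypothetical command
```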

The Implementation

ClusterAnalysisTemplate with Arguments

# HARDENED: Reusable analysis template across all services
apiVersion: argoproj.io/v1alpha1
kind: ClusterAnalysisTemplate
metadata:
  name: service-health
spec:
  args:
    - name: service-name
    - name: namespace
      value: production
    - name: error-threshold
      value: "0.01"
    - name: latency-threshold-ms
      value: "500"
  metrics:
    - name: error-rate
      interval: 30s
      count: 10
      successCondition: "result[0] < {{args.error-threshold}}"
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              app="{{args.service-name}}",
              namespace="{{args.namespace}}",
              code=~"5.."}[2m]))
            /
            sum(rate(http_requests_total{
              app="{{args.service-name}}",
              namespace="{{args.namespace}}"}[2m]))

    - name: latency-p99
      interval: 30s
      count: 10
      successCondition: "result[0] < {{args.latency-threshold-ms}}"
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{
                app="{{args.service-name}}",
                namespace="{{args.namespace}}"}[2m])) by (le)) * 1000

Rollout Using ClusterAnalysisTemplate

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-service
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 2m }
        - analysis:
            templates:
              - clusterScope: true
                templateName: service-health
            args:
              - name: service-name
                value: checkout-service
              - name: error-threshold
                value: "0.005" # Stricter for checkout
              - name: latency-threshold-ms
                value: "300" # Stricter for checkout
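
Reuse is where the arguments pay off. As a sketch, a second service (the name payments-service is a hypothetical example, not from the source) could reference the same template and supply only service-name, inheriting the defaults declared in the template for every other argument.

```yaml
# Sketch: second Rollout reusing service-health; payments-service is hypothetical.
# Other Rollout fields (replicas, selector, pod template) are omitted, as in the example above.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments-service
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 2m }
        - analysis:
            templates:
              - clusterScope: true
                templateName: service-health
            args:
              # Only the argument without a default must be supplied.
              # namespace, error-threshold, and latency-threshold-ms fall back
              # to the template defaults (production, 0.01, 500).
              - name: service-name
                value: payments-service
```

With this layout, a label rename in the Prometheus query is a single change in the ClusterAnalysisTemplate rather than one edit per service.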

Web Provider for Custom Validation

# HARDENED: Custom validation endpoint
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-smoke
spec:
  metrics:
    - name: smoke-test
      interval: 60s
      count: 3
      # jsonPath extracts $.status, so result is already that string
      successCondition: result == "pass"
      failureLimit: 0
      provider:
        web:
          url: http://smoke-test-service.production.svc.cluster.local/validate
          method: POST
          headers:
            - key: Content-Type
              value: application/json
          body: |
            {
              "service": "checkout-service",
              "endpoint": "http://checkout-canary.production.svc.cluster.local",
              "tests": ["health", "cart-add", "cart-checkout"]
            }
          jsonPath: "{$.status}"
          timeoutSeconds: 30

Dry-Run Analysis

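Test a template's metrics without triggering a rollout by creating a standalone AnalysisRun. The run below inlines the error-rate metric with the argument values already substituted: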

# Create a standalone AnalysisRun to test the template
kubectl apply -f - <<EOF
apiVersion: argoproj.io/v1alpha1
kind: AnalysisRun
metadata:
  name: test-service-health
  namespace: production
spec:
  metrics:
    - name: error-rate
      interval: 30s
      count: 3
      successCondition: "result[0] < 0.01"
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{app="checkout-service",code=~"5.."}[2m]))
            / sum(rate(http_requests_total{app="checkout-service"}[2m]))
EOF

# Watch the results
kubectl get analysisrun test-service-health -n production -w

The Gate

Each AnalysisRun is a gate. The Rollout controller evaluates the AnalysisRun’s status at each step. The status is computed from the individual metric measurements:

All metrics succeed → AnalysisRun Successful → Rollout proceeds
Any metric exceeds failureLimit → AnalysisRun Failed → Rollout aborts
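Measurement meets neither successCondition nor failureCondition → Inconclusive → counts toward inconclusiveLimit

Configure inconclusiveLimit for services with low traffic, where Prometheus queries may return empty results during off-hours. For an empty result to count as Inconclusive, guard both conditions with len(result) > 0. An unguarded result[0] on an empty result raises an evaluation error, and a metric with only a successCondition treats any unmet measurement as Failed.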
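
A sketch of how a low-traffic metric might be configured follows. It assumes that an empty Prometheus result should be treated as Inconclusive rather than as a failure. Both conditions are guarded with len(result) > 0 so that an empty result matches neither. The limit of 5 is an illustrative assumption.

```yaml
# Sketch of one entry under spec.metrics; the limits are illustrative.
- name: error-rate
  interval: 30s
  count: 10
  # Pass only when data exists and the error ratio is under 1%
  successCondition: "len(result) > 0 && result[0] < 0.01"
  # Fail only when data exists and the error ratio is at or above 1%
  failureCondition: "len(result) > 0 && result[0] >= 0.01"
  failureLimit: 3
  # Empty results match neither condition; tolerate up to 5 such measurements
  inconclusiveLimit: 5
  provider:
    prometheus:
      address: http://prometheus.monitoring:9090
      query: |
        sum(rate(http_requests_total{app="checkout-service",code=~"5.."}[2m]))
        / sum(rate(http_requests_total{app="checkout-service"}[2m]))
```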

The Recovery

AnalysisRun returns Inconclusive or Error: The Prometheus query returned no data points (an unguarded result[0] on an empty result surfaces as Error). Either the canary has no traffic or the query is wrong. Test the query directly against Prometheus. If the service has low traffic, widen the query's time window or restrict canary validation to peak hours.

All analyses pass but the service is still broken: The analysis metrics do not cover the failure mode. Add metrics that do. Common gaps: database connection pool exhaustion, downstream service errors, response body correctness (not just HTTP status codes).

Analysis takes too long: Reduce count or shorten interval. Run time is roughly count × interval, so 10 measurements at 30-second intervals take about 5 minutes. Balance thoroughness against deployment speed.
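
As a rough worked example, assuming run time is approximately count × interval and ignoring query latency, these two hypothetical settings for the same metric trade coverage for speed. The successCondition and provider are omitted for brevity, and the failureLimit values are illustrative.

```yaml
# Sketch: same metric, two pacing options (keep only one in a real template).

# Option A: about 5 minutes (10 measurements, 30s apart)
- name: error-rate
  interval: 30s
  count: 10
  failureLimit: 3

# Option B: about 2.5 minutes (5 measurements, 30s apart)
- name: error-rate
  interval: 30s
  count: 5
  failureLimit: 1
```

Halving count halves the sample size, so a proportionally lower failureLimit keeps the tolerated failure ratio in the same range.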