Argo Rollouts Spec, Analysis Templates, and Metric Providers
The Failure
The team created one AnalysisTemplate per service with hardcoded service names in Prometheus queries. When they added a sixth service, they had six copies of the same template with one string changed. When the Prometheus query needed updating after a label rename, they updated five of the six templates. The sixth service’s canary analysis stopped working because its query still referenced the old label, which no longer existed.
ClusterAnalysisTemplates with arguments solve this. One template, parameterized by service name, reused across all services.
The Mechanism
Analysis Lifecycle
- Rollout reaches an analysis step
- Argo Rollouts creates an AnalysisRun from the template
- AnalysisRun runs metrics at the specified interval
- Each measurement is evaluated against successCondition
- Results: Successful, Failed, Inconclusive, Error
- When count measurements are complete or failureLimit is exceeded → Analysis completes
- If Analysis succeeds → Rollout proceeds to next step
- If Analysis fails → Rollout aborts
Metric Provider Types
| Provider | Use Case | Configuration |
|---|---|---|
| Prometheus | Infrastructure and application metrics | Query URL + PromQL |
| Datadog | APM metrics, custom metrics | API key + query |
| Web | Custom validation endpoints | URL + expected response |
| Job | Run a Kubernetes Job as analysis | Job spec |
| Kayenta | Statistical canary analysis | Kayenta server URL |
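The Job provider is the only row above whose configuration is a full Kubernetes object, so a short sketch may help. The metric passes when the Job completes successfully and fails when the Job fails. The template name, image, and command below are hypothetical placeholders, not part of the original setup.

```yaml
# Sketch only: a Job-based metric. Names, image, and command are hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: integration-test            # hypothetical template name
spec:
  metrics:
  - name: integration-test
    provider:
      job:
        spec:
          backoffLimit: 0           # do not retry a failed test run
          template:
            spec:
              restartPolicy: Never
              containers:
              - name: test
                image: registry.example.com/integration-tests:latest   # hypothetical image
                command: ["/run-tests.sh"]                             # hypothetical entrypoint
```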
The Implementation
ClusterAnalysisTemplate with Arguments
# HARDENED: Reusable analysis template across all services
apiVersion: argoproj.io/v1alpha1
kind: ClusterAnalysisTemplate
metadata:
  name: service-health
spec:
  args:
  - name: service-name
  - name: namespace
    value: production
  - name: error-threshold
    value: "0.01"
  - name: latency-threshold-ms
    value: "500"
  metrics:
  - name: error-rate
    interval: 30s
    count: 10
    successCondition: "result[0] < {{args.error-threshold}}"
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          sum(rate(http_requests_total{
            app="{{args.service-name}}",
            namespace="{{args.namespace}}",
            code=~"5.."}[2m]))
          /
          sum(rate(http_requests_total{
            app="{{args.service-name}}",
            namespace="{{args.namespace}}"}[2m]))
  - name: latency-p99
    interval: 30s
    count: 10
    successCondition: "result[0] < {{args.latency-threshold-ms}}"
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{
              app="{{args.service-name}}",
              namespace="{{args.namespace}}"}[2m])) by (le)) * 1000
Rollout Using ClusterAnalysisTemplate
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-service
spec:
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause: { duration: 2m }
      - analysis:
          templates:
          - clusterScope: true
            templateName: service-health
          args:
          - name: service-name
            value: checkout-service
          - name: error-threshold
            value: "0.005"  # Stricter for checkout
          - name: latency-threshold-ms
            value: "300"    # Stricter for checkout
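To illustrate the reuse, here is a sketch of a second Rollout that references the same template and overrides nothing. The service name inventory-service is hypothetical. Because service-name is the only argument declared without a default, it is the only one that has to be passed; namespace and both thresholds fall back to the template's values. As in the example above, the selector and pod template are omitted.

```yaml
# Sketch only: a second (hypothetical) service reusing service-health with defaults.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: inventory-service           # hypothetical service
spec:
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause: { duration: 2m }
      - analysis:
          templates:
          - clusterScope: true
            templateName: service-health
          args:
          - name: service-name      # the only arg without a default value
            value: inventory-service
```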
Web Provider for Custom Validation
# HARDENED: Custom validation endpoint
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-smoke
spec:
  metrics:
  - name: smoke-test
    interval: 60s
    count: 3
    # jsonPath below extracts $.status, so result is the status string itself
    successCondition: result == "pass"
    failureLimit: 0
    provider:
      web:
        url: http://smoke-test-service.production.svc.cluster.local/validate
        method: POST
        headers:
        - key: Content-Type
          value: application/json
        body: |
          {
            "service": "checkout-service",
            "endpoint": "http://checkout-canary.production.svc.cluster.local",
            "tests": ["health", "cart-add", "cart-checkout"]
          }
        jsonPath: "{$.status}"
        timeoutSeconds: 30
Dry-Run Analysis
Test analysis templates without triggering a rollout:
# Create a standalone AnalysisRun to test the template
kubectl apply -f - <<EOF
apiVersion: argoproj.io/v1alpha1
kind: AnalysisRun
metadata:
  name: test-service-health
  namespace: production
spec:
  metrics:
  - name: error-rate
    interval: 30s
    count: 3
    successCondition: "result[0] < 0.01"
    failureLimit: 1
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          sum(rate(http_requests_total{app="checkout-service",code=~"5.."}[2m]))
          / sum(rate(http_requests_total{app="checkout-service"}[2m]))
EOF
# Watch the results
kubectl get analysisrun test-service-health -n production -w
The Gate
Each AnalysisRun is a gate. The Rollout controller evaluates the AnalysisRun’s status at each step. The status is computed from the individual metric measurements:
- All metrics succeed → AnalysisRun Successful → Rollout proceeds
- Any metric exceeds failureLimit → AnalysisRun Failed → Rollout aborts
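- With both successCondition and failureCondition set, a measurement that satisfies neither → Inconclusive → counts toward inconclusiveLimit; an Inconclusive AnalysisRun pauses the Rollout rather than aborting it
- Metric returns no data → result[0] cannot be evaluated on an empty result → measurement is recorded as Error → counts toward consecutiveErrorLimit
Configure inconclusiveLimit and consecutiveErrorLimit for services with low traffic where Prometheus queries may return empty results during off-hours.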
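As a sketch of where these limits sit, the metric fragment below defines both a success and a failure condition, which leaves a band in between where a measurement is neither and is recorded as Inconclusive. The 0.05 failure threshold and the limit values are assumptions chosen for illustration, not recommendations.

```yaml
# Sketch only: metric fragment with an explicit inconclusive band.
metrics:
- name: error-rate
  interval: 30s
  count: 10
  successCondition: "result[0] < 0.01"     # clearly healthy
  failureCondition: "result[0] >= 0.05"    # clearly unhealthy (assumed threshold)
  failureLimit: 3
  inconclusiveLimit: 2                     # in-between measurements tolerated (assumed value)
  consecutiveErrorLimit: 4                 # consecutive errored measurements tolerated (assumed value)
  provider:
    prometheus:
      address: http://prometheus.monitoring:9090
      query: |
        sum(rate(http_requests_total{app="checkout-service",code=~"5.."}[2m]))
        / sum(rate(http_requests_total{app="checkout-service"}[2m]))
```

With this fragment, an error rate between 1% and 5% counts against inconclusiveLimit instead of failureLimit.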
The Recovery
AnalysisRun returns Inconclusive or Error: The Prometheus query is most likely returning no data points. Either the canary has no traffic or the query is wrong. Test the query directly against Prometheus. If the service has low traffic, widen the query's range window (for example, from [2m] to [10m]) or restrict canary validation to peak hours.
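For the low-traffic case, one possible shape is sketched below: the range window is widened to 10 minutes and the condition checks for an empty result before indexing into it. Treating an empty result as a pass is an assumption made here for illustration; it means a canary that receives no traffic is promoted unverified, so some teams may prefer to leave the guard out and let the measurement fail or error.

```yaml
# Sketch only: wider window plus a guard against empty Prometheus results.
metrics:
- name: error-rate
  interval: 30s
  count: 10
  # An empty result vector passes; otherwise the usual threshold applies.
  successCondition: "len(result) == 0 || result[0] < 0.01"
  failureLimit: 3
  provider:
    prometheus:
      address: http://prometheus.monitoring:9090
      query: |
        sum(rate(http_requests_total{app="checkout-service",code=~"5.."}[10m]))
        / sum(rate(http_requests_total{app="checkout-service"}[10m]))
```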
All analyses pass but service is still broken: The analysis metrics do not cover the failure mode. Add more metrics. Common gaps: database connection pool exhaustion, downstream service errors, response body correctness (not just HTTP status codes).
Analysis takes too long: Reduce count or shorten interval. Total duration is roughly count × interval, so 10 measurements at 30-second intervals take about 5 minutes. Balance thoroughness with deployment speed.