Prometheus and Grafana Dashboards for Pipeline Health
The Failure
The team discussed pipeline performance in retros using anecdotes. “CI feels slow.” “I waited a while for my build.” No data. No trends. No agreement on whether things were getting better or worse. Without a dashboard, pipeline health was invisible.
A Grafana dashboard with four panels turns pipeline health from anecdote into data.
The Mechanism
Four Essential Panels
| Panel | Metric | Visualization |
|---|---|---|
| Build Duration | ci_build_duration_seconds | Time series with p50/p90 |
| Success Rate | ci_build_total{conclusion="success"} | Gauge (%) |
| Queue Time | ci_queue_duration_seconds | Time series |
| Flaky Test Count | ci_test_total{status="flaky"} | Stat panel |
Metrics Flow
GitHub Actions → Pushgateway → Prometheus → Grafana
↑
workflow_run event
The Implementation
Prometheus Configuration
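# prometheus/prometheus.yml
# HARDENED: Scrape CI metrics from Pushgateway
global:
  scrape_interval: 15s
# Load the alert rules from prometheus/alerts.yml (path is relative to this file)
rule_files:
  - "alerts.yml"
scrape_configs:
  - job_name: "pushgateway"
    # Keep the job/workflow labels set at push time instead of overwriting them
    honor_labels: true
    static_configs:
      - targets: ["pushgateway:9091"]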
Pushgateway Metrics Format
# Push metrics from CI
# Note: each push replaces same-named metrics already stored under this grouping
# key (job=ci, workflow=checkout-service), so Pushgateway holds only the latest run.
cat <<EOF | curl --data-binary @- \
  http://pushgateway:9091/metrics/job/ci/workflow/checkout-service
# HELP ci_build_duration_seconds Total build duration
# TYPE ci_build_duration_seconds gauge
ci_build_duration_seconds{conclusion="success"} 480
# HELP ci_queue_duration_seconds Time waiting for runner
# TYPE ci_queue_duration_seconds gauge
ci_queue_duration_seconds 12
# HELP ci_build_total Total builds
# TYPE ci_build_total counter
ci_build_total{conclusion="success"} 1
# HELP ci_test_total Total tests run
# TYPE ci_test_total gauge
ci_test_total{status="passed"} 1188
ci_test_total{status="failed"} 0
ci_test_total{status="flaky"} 3
EOF
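The same payload can be produced from a script rather than a heredoc. The following is a minimal sketch, not a definitive implementation: the function names `render_ci_metrics` and `push_ci_metrics` and their parameters are illustrative assumptions, while the metric names and the Pushgateway URL mirror the curl example above.

```python
from urllib import request


def render_ci_metrics(duration_s, queue_s, conclusion, tests):
    """Render one run's metrics in the Prometheus text exposition format."""
    lines = [
        "# HELP ci_build_duration_seconds Total build duration",
        "# TYPE ci_build_duration_seconds gauge",
        f'ci_build_duration_seconds{{conclusion="{conclusion}"}} {duration_s}',
        "# HELP ci_queue_duration_seconds Time waiting for runner",
        "# TYPE ci_queue_duration_seconds gauge",
        f"ci_queue_duration_seconds {queue_s}",
        "# HELP ci_build_total Total builds",
        "# TYPE ci_build_total counter",
        f'ci_build_total{{conclusion="{conclusion}"}} 1',
        "# HELP ci_test_total Total tests run",
        "# TYPE ci_test_total gauge",
    ]
    for status, count in tests.items():
        lines.append(f'ci_test_total{{status="{status}"}} {count}')
    # The exposition format expects a newline after the last line.
    return "\n".join(lines) + "\n"


def push_ci_metrics(body, workflow, base_url="http://pushgateway:9091"):
    """POST the payload to the same grouping key used by the curl example."""
    url = f"{base_url}/metrics/job/ci/workflow/{workflow}"
    req = request.Request(url, data=body.encode("utf-8"), method="POST")
    with request.urlopen(req, timeout=10) as resp:
        return resp.status


if __name__ == "__main__":
    payload = render_ci_metrics(
        480, 12, "success", {"passed": 1188, "failed": 0, "flaky": 3}
    )
    print(payload, end="")
    # push_ci_metrics(payload, "checkout-service")  # needs a reachable Pushgateway
```

Keeping rendering separate from the HTTP call means the payload can be checked in a unit test without a running Pushgateway.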
Grafana Dashboard JSON
{
  "dashboard": {
    "title": "CI/CD Pipeline Health",
    "panels": [
      {
        "title": "Build Duration (p50 / p90)",
        "type": "timeseries",
        "targets": [
          {
            "expr": "quantile_over_time(0.5, ci_build_duration_seconds[1d])",
            "legendFormat": "p50"
          },
          {
            "expr": "quantile_over_time(0.9, ci_build_duration_seconds[1d])",
            "legendFormat": "p90"
          }
        ],
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
        "fieldConfig": {
          "defaults": {
            "unit": "s",
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 600, "color": "yellow" },
                { "value": 900, "color": "red" }
              ]
            }
          }
        }
      },
      {
        "title": "Success Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum(ci_build_total{conclusion='success'}) / sum(ci_build_total) * 100"
          }
        ],
        "gridPos": { "h": 8, "w": 6, "x": 12, "y": 0 },
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100,
            "thresholds": {
              "steps": [
                { "value": 0, "color": "red" },
                { "value": 80, "color": "yellow" },
                { "value": 95, "color": "green" }
              ]
            }
          }
        }
      },
      {
        "title": "Queue Time",
        "type": "timeseries",
        "targets": [
          {
            "expr": "ci_queue_duration_seconds",
            "legendFormat": "{{workflow}}"
          }
        ],
        "gridPos": { "h": 8, "w": 6, "x": 18, "y": 0 }
      },
      {
        "title": "Flaky Tests (Quarantined)",
        "type": "stat",
        "targets": [
          {
            "expr": "ci_test_total{status='flaky'}"
          }
        ],
        "gridPos": { "h": 4, "w": 6, "x": 12, "y": 8 },
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 5, "color": "yellow" },
                { "value": 15, "color": "red" }
              ]
            }
          }
        }
      }
    ]
  }
}
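Hand-edited gridPos values are easy to get wrong. As a rough sanity check, the sketch below verifies that the four panels fit Grafana's 24-column grid and do not overlap. The helper name `find_layout_problems` is an assumption made for illustration and is not part of any Grafana tooling; the panel positions are copied from the JSON above.

```python
GRID_COLUMNS = 24  # Grafana dashboards are laid out on a 24-column grid


def find_layout_problems(panels):
    """Return a list of problems: panels that leave the grid or overlap."""
    problems = []
    for panel in panels:
        pos = panel["gridPos"]
        if pos["x"] < 0 or pos["x"] + pos["w"] > GRID_COLUMNS:
            problems.append(f'{panel["title"]}: extends past column {GRID_COLUMNS}')
    # Compare every pair of panels once; rectangles overlap only if they
    # intersect on both the x axis and the y axis.
    for i, a in enumerate(panels):
        for b in panels[i + 1:]:
            pa, pb = a["gridPos"], b["gridPos"]
            overlap_x = pa["x"] < pb["x"] + pb["w"] and pb["x"] < pa["x"] + pa["w"]
            overlap_y = pa["y"] < pb["y"] + pb["h"] and pb["y"] < pa["y"] + pa["h"]
            if overlap_x and overlap_y:
                problems.append(f'{a["title"]} overlaps {b["title"]}')
    return problems


panels = [
    {"title": "Build Duration (p50 / p90)", "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}},
    {"title": "Success Rate", "gridPos": {"h": 8, "w": 6, "x": 12, "y": 0}},
    {"title": "Queue Time", "gridPos": {"h": 8, "w": 6, "x": 18, "y": 0}},
    {"title": "Flaky Tests (Quarantined)", "gridPos": {"h": 4, "w": 6, "x": 12, "y": 8}},
]
print(find_layout_problems(panels))  # prints []
```

The top row fills all 24 columns (12 + 6 + 6), and the flaky-test panel sits directly beneath the success-rate gauge.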
Alert Rules
# prometheus/alerts.yml
# HARDENED: Alert on pipeline degradation
groups:
  - name: ci-pipeline
    rules:
      - alert: BuildDurationHigh
        expr: ci_build_duration_seconds > 900
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CI build duration exceeded 15 minutes"
          description: "Workflow {{ $labels.workflow }} took {{ $value }}s"
      - alert: SuccessRateLow
        expr: >
          (sum(ci_build_total{conclusion="success"}) /
          sum(ci_build_total)) * 100 < 80
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "CI success rate below 80%"
      - alert: QueueTimeHigh
        expr: ci_queue_duration_seconds > 300
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CI queue time exceeded 5 minutes"
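The `for` clause means that a brief excursion past the threshold does not page anyone: the condition has to hold at every rule evaluation for the full duration before the alert moves from pending to firing. The sketch below is a simplified model of that behaviour applied to the QueueTimeHigh rule (threshold 300, `for: 10m`). It assumes one rule evaluation per minute, and the function name `first_firing_time` and the sample data are made up for illustration.

```python
def first_firing_time(samples, threshold, for_seconds):
    """Return the first evaluation timestamp at which the alert would fire.

    samples: list of (timestamp_seconds, value) pairs, one per rule evaluation.
    Returns None if the condition never holds for the full `for` duration.
    """
    active_since = None
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts  # condition just became true: alert is pending
            if ts - active_since >= for_seconds:
                return ts  # condition held long enough: alert is firing
        else:
            active_since = None  # condition cleared: the timer starts over
    return None


# Queue time evaluated once a minute for 15 minutes.
# "spike" exceeds 300s for only four evaluations; "backlog" stays high.
spike = [(m * 60, 400 if 2 <= m < 6 else 20) for m in range(15)]
backlog = [(m * 60, 400 if m >= 2 else 20) for m in range(15)]

print(first_firing_time(spike, 300, 600))    # prints None
print(first_firing_time(backlog, 300, 600))  # prints 720
```

In the backlog case the condition first holds at 120 seconds, so the alert fires ten minutes later, at 720 seconds.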
Team Health Score
# scripts/team-health-score.py
# HARDENED: Composite pipeline health metric
def compute_health_score(metrics):
    """Score from 0-100 based on pipeline health indicators."""
    score = 100
    # Build duration penalty
    if metrics["duration_p90_s"] > 900:
        score -= 20
    elif metrics["duration_p90_s"] > 600:
        score -= 10
    # Success rate penalty
    if metrics["success_rate"] < 80:
        score -= 30
    elif metrics["success_rate"] < 95:
        score -= 15
    # Flaky test penalty
    flaky_count = metrics.get("flaky_test_count", 0)
    score -= min(flaky_count * 2, 20)
    # Queue time penalty
    if metrics.get("queue_p90_s", 0) > 300:
        score -= 10
    return max(score, 0)
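The script above takes a ready-made metrics dict, and the text does not say where that dict comes from. One possible way to build it, sketched here under assumptions, is to summarize a list of per-build records. The record fields (`duration_s`, `queue_s`, `conclusion`), the helper names, and the choice of a nearest-rank percentile are all hypothetical; only the output keys match what compute_health_score reads.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_builds(builds, flaky_test_count=0):
    """Build the metrics dict from per-build records (hypothetical record shape)."""
    if not builds:
        raise ValueError("need at least one build record")
    successes = sum(1 for b in builds if b["conclusion"] == "success")
    return {
        "duration_p90_s": percentile([b["duration_s"] for b in builds], 90),
        "queue_p90_s": percentile([b["queue_s"] for b in builds], 90),
        "success_rate": 100 * successes / len(builds),
        "flaky_test_count": flaky_test_count,
    }


# Ten made-up builds: durations 420s to 690s, queue times 10s to 55s, one failure.
builds = [
    {"duration_s": 420 + 30 * i, "queue_s": 10 + 5 * i,
     "conclusion": "failure" if i == 9 else "success"}
    for i in range(10)
]
print(summarize_builds(builds, flaky_test_count=3))
# prints {'duration_p90_s': 660, 'queue_p90_s': 50, 'success_rate': 90.0, 'flaky_test_count': 3}
```

Passed to compute_health_score, this sample would lose 10 points for duration (660 > 600), 15 for success rate (90 < 95), and 6 for three flaky tests, giving 100 - 10 - 15 - 6 = 69.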
The Gate
The dashboard is not a gate; it provides visibility. But it enables a process gate: when the health score drops below 70, the team allocates sprint capacity to pipeline maintenance. This is a team commitment, not an automated blocker.
The Recovery
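Pushgateway loses data on restart: Pushgateway is not a long-term store; by default it keeps pushed metrics in memory. Prometheus scrapes it every 15 seconds, so samples that have already been scraped are preserved in the Prometheus TSDB. Put the Prometheus data directory on persistent storage so that history survives restarts, and consider Pushgateway's --persistence.file flag if the last pushed values must also survive a Pushgateway restart.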
Dashboard shows gaps in data: The workflow_run event does not fire for cancelled runs. Add explicit metric pushes in the CI workflow’s cleanup step using if: always().
Too many alerts cause fatigue: Start with one alert: success rate below 80%. Add more only after the first one has been proven useful.