Prometheus and Grafana Dashboards for Pipeline Health

The Failure

The team discussed pipeline performance in retros using anecdotes. “CI feels slow.” “I waited a while for my build.” No data. No trends. No agreement on whether things were getting better or worse. Without a dashboard, pipeline health was invisible.

A Grafana dashboard with four panels turns pipeline health from anecdote into data the team can track over time.

The Mechanism

Four Essential Panels

Panel	Metric	Visualization
Build Duration	`ci_build_duration_seconds`	Time series with p50/p90
Success Rate	`ci_build_total{conclusion="success"}` / `ci_build_total`	Gauge (%)
Queue Time	`ci_queue_duration_seconds`	Time series
Flaky Test Count	`ci_test_total{status="flaky"}`	Stat panel

Metrics Flow

GitHub Actions → Pushgateway → Prometheus → Grafana
                     ↑
              workflow_run event

The Implementation

Prometheus Configuration

# prometheus/prometheus.yml
# HARDENED: Scrape CI metrics from Pushgateway
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "pushgateway"
    honor_labels: true
    static_configs:
      - targets: ["pushgateway:9091"]

Pushgateway Metrics Format

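# Push metrics from CI (run once per build).
# Note: a POST replaces earlier values of the same metric names within this
# grouping key (job=ci, workflow=checkout-service). The Pushgateway therefore
# holds only the most recent build per workflow; it does not accumulate counters.
cat <<EOF | curl --fail --silent --show-error --data-binary @- \
  http://pushgateway:9091/metrics/job/ci/workflow/checkout-service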

# HELP ci_build_duration_seconds Total build duration
# TYPE ci_build_duration_seconds gauge
ci_build_duration_seconds{conclusion="success"} 480

# HELP ci_queue_duration_seconds Time waiting for runner
# TYPE ci_queue_duration_seconds gauge
ci_queue_duration_seconds 12

# HELP ci_build_total Total builds
# TYPE ci_build_total counter
ci_build_total{conclusion="success"} 1

# HELP ci_test_total Total tests run
# TYPE ci_test_total gauge
ci_test_total{status="passed"} 1188
ci_test_total{status="failed"} 0
ci_test_total{status="flaky"} 3
EOF
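
The same push can be made from a script instead of a heredoc. The following is a minimal sketch, not the workflow's actual reporter: the function names `render_ci_metrics` and `build_push_request` are hypothetical, only two of the metrics above are rendered, and workflow names containing `/` are assumed not to occur.

```python
import urllib.request
from urllib.parse import quote


def render_ci_metrics(conclusion: str, duration_s: float, queue_s: float) -> str:
    """Render one build's metrics in the Prometheus text exposition format."""
    lines = [
        "# TYPE ci_build_duration_seconds gauge",
        f'ci_build_duration_seconds{{conclusion="{conclusion}"}} {duration_s}',
        "# TYPE ci_queue_duration_seconds gauge",
        f"ci_queue_duration_seconds {queue_s}",
    ]
    # Every line of the text format, including the last, ends with a newline.
    return "\n".join(lines) + "\n"


def build_push_request(base_url: str, workflow: str, body: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the grouping key used above."""
    # Percent-encode the workflow name; names containing "/" are not handled here.
    url = f"{base_url}/metrics/job/ci/workflow/{quote(workflow, safe='')}"
    return urllib.request.Request(url, data=body.encode("utf-8"), method="POST")
```

To send it, pass the request to `urllib.request.urlopen(request, timeout=10)` from the CI job; keeping rendering and sending separate makes the payload easy to check without a running Pushgateway.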

Grafana Dashboard JSON

{
  "dashboard": {
    "title": "CI/CD Pipeline Health",
    "panels": [
      {
        "title": "Build Duration (p50 / p90)",
        "type": "timeseries",
        "targets": [
          {
            "expr": "quantile_over_time(0.5, ci_build_duration_seconds[1d])",
            "legendFormat": "p50"
          },
          {
            "expr": "quantile_over_time(0.9, ci_build_duration_seconds[1d])",
            "legendFormat": "p90"
          }
        ],
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
        "fieldConfig": {
          "defaults": {
            "unit": "s",
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 600, "color": "yellow" },
                { "value": 900, "color": "red" }
              ]
            }
          }
        }
      },
      {
        "title": "Success Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum(ci_build_total{conclusion='success'}) / sum(ci_build_total) * 100"
          }
        ],
        "gridPos": { "h": 8, "w": 6, "x": 12, "y": 0 },
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100,
            "thresholds": {
              "steps": [
                { "value": 0, "color": "red" },
                { "value": 80, "color": "yellow" },
                { "value": 95, "color": "green" }
              ]
            }
          }
        }
      },
      {
        "title": "Queue Time",
        "type": "timeseries",
        "targets": [
          {
            "expr": "ci_queue_duration_seconds",
            "legendFormat": "{{workflow}}"
          }
        ],
        "gridPos": { "h": 8, "w": 6, "x": 18, "y": 0 }
      },
      {
        "title": "Flaky Tests (Quarantined)",
        "type": "stat",
        "targets": [
          {
            "expr": "ci_test_total{status='flaky'}"
          }
        ],
        "gridPos": { "h": 4, "w": 6, "x": 12, "y": 8 },
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 5, "color": "yellow" },
                { "value": 15, "color": "red" }
              ]
            }
          }
        }
      }
    ]
  }
}
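
The threshold steps in the JSON above are easy to misread, because the Success Rate gauge runs red to green while the duration and flaky-test panels run green to red. As a rough model of how absolute thresholds are applied (a reading takes the color of the highest step it has reached), here is a small sketch. The helper name `threshold_color` is hypothetical and this is not Grafana code.

```python
# Step lists copied from the dashboard JSON above.
DURATION_STEPS = [
    {"value": 0, "color": "green"},
    {"value": 600, "color": "yellow"},
    {"value": 900, "color": "red"},
]
SUCCESS_RATE_STEPS = [
    {"value": 0, "color": "red"},
    {"value": 80, "color": "yellow"},
    {"value": 95, "color": "green"},
]


def threshold_color(value: float, steps: list[dict]) -> str:
    """Return the color of the highest step whose value is <= the reading."""
    ordered = sorted(steps, key=lambda step: step["value"])
    color = ordered[0]["color"]  # base color for readings below every step
    for step in ordered:
        if value >= step["value"]:
            color = step["color"]
    return color


if __name__ == "__main__":
    print(threshold_color(480, DURATION_STEPS))     # an 8-minute build
    print(threshold_color(85, SUCCESS_RATE_STEPS))  # 85% success rate
```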

Alert Rules

# prometheus/alerts.yml
# HARDENED: Alert on pipeline degradation
groups:
  - name: ci-pipeline
    rules:
      - alert: BuildDurationHigh
        expr: ci_build_duration_seconds > 900
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CI build duration exceeded 15 minutes"
          description: "Workflow {{ $labels.workflow }} took {{ $value }}s"

      - alert: SuccessRateLow
        expr: >
          (sum(ci_build_total{conclusion="success"}) /
           sum(ci_build_total)) * 100 < 80
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "CI success rate below 80%"

      - alert: QueueTimeHigh
        expr: ci_queue_duration_seconds > 300
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CI queue time exceeded 5 minutes"
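
The `for:` clause is worth a closer look in this setup. The Pushgateway keeps exposing the last pushed value until it is overwritten, so a single slow build is scraped again and again and can satisfy `for: 5m` on its own. The sketch below is a simplified model of that behaviour, assuming one sample per rule evaluation; `is_firing` is a hypothetical helper, not part of Prometheus.

```python
def is_firing(samples: list[tuple[float, float]], threshold: float, for_seconds: float) -> bool:
    """Return True if the alert is firing at the last sample.

    samples: (unix_timestamp, value) pairs in time order, one per evaluation.
    The condition must hold at every evaluation for at least for_seconds.
    """
    pending_since = None
    firing = False
    for timestamp, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = timestamp  # condition just became true
            firing = timestamp - pending_since >= for_seconds
        else:
            pending_since = None  # any dip resets the pending timer
            firing = False
    return firing


if __name__ == "__main__":
    # One 950-second build, scraped every 15 seconds for five minutes.
    repeated = [(t, 950.0) for t in range(0, 301, 15)]
    print(is_firing(repeated, threshold=900, for_seconds=300))
```

Under this model, BuildDurationHigh stays active until the next build for that workflow pushes a lower value.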

Team Health Score

# scripts/team-health-score.py
# HARDENED: Composite pipeline health metric
def compute_health_score(metrics):
    """Score from 0-100 based on pipeline health indicators."""
    score = 100

    # Build duration penalty
    if metrics["duration_p90_s"] > 900:
        score -= 20
    elif metrics["duration_p90_s"] > 600:
        score -= 10

    # Success rate penalty
    if metrics["success_rate"] < 80:
        score -= 30
    elif metrics["success_rate"] < 95:
        score -= 15

    # Flaky test penalty
    flaky_count = metrics.get("flaky_test_count", 0)
    score -= min(flaky_count * 2, 20)

    # Queue time penalty
    if metrics.get("queue_p90_s", 0) > 300:
        score -= 10

    return max(score, 0)
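
The `metrics` dictionary has to come from somewhere. One option is the Prometheus HTTP API's instant-query endpoint (`/api/v1/query`). The sketch below only builds the query URLs and parses a response; the names `HEALTH_QUERIES`, `instant_query_url`, and `first_value` are hypothetical, and wrapping the quantiles in `max(...)` to collapse them to one series is an assumption about how the score should be aggregated.

```python
from urllib.parse import urlencode

# PromQL per health-score input, reusing the dashboard expressions above.
HEALTH_QUERIES = {
    "duration_p90_s": "max(quantile_over_time(0.9, ci_build_duration_seconds[1d]))",
    "success_rate": 'sum(ci_build_total{conclusion="success"}) / sum(ci_build_total) * 100',
    "flaky_test_count": 'sum(ci_test_total{status="flaky"})',
    "queue_p90_s": "max(quantile_over_time(0.9, ci_queue_duration_seconds[1d]))",
}


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def first_value(response: dict, default: float = 0.0) -> float:
    """Extract the first sample value from a decoded instant-query response."""
    result = response.get("data", {}).get("result", [])
    if response.get("status") != "success" or not result:
        return default
    # Each instant-vector sample is [unix_timestamp, "value-as-string"].
    return float(result[0]["value"][1])
```

Fetching each URL, decoding the JSON body, and passing it through `first_value` yields a dictionary that `compute_health_score` can consume. Choose the `default` with care: a missing success rate that falls back to 0 would be scored as a failing pipeline.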

The Gate

The dashboard is not a gate; it provides visibility. But it enables a process gate: when the health score drops below 70, the team allocates sprint capacity to pipeline maintenance. This is a team commitment, not an automated blocker.

The Recovery

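Pushgateway loses data on restart: The Pushgateway is a metrics cache, not a long-term store, and by default it keeps pushed metrics only in memory. Because Prometheus scrapes it every 15 seconds, anything pushed before the restart is already in the Prometheus TSDB. Give Prometheus persistent storage so that history survives. To keep the last pushed values across Pushgateway restarts as well, start it with `--persistence.file`.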

Dashboard shows gaps in data: Runs that are cancelled, or that fail before the reporting step, may never push metrics. Add an explicit metric push to the CI workflow's cleanup step and guard it with `if: always()` so it runs regardless of the outcome.

Too many alerts cause fatigue: Start with one alert: success rate below 80%. Add more only after the first one has proven useful.