Pipeline Observability: Metrics, Flaky Tests, and Dashboards

The pipeline is infrastructure. It needs monitoring like any other infrastructure. When the build takes 15 minutes, someone should be alerted, not surprised.

[Figure: Pipeline observability stack]

The Failure

The team’s CI pipeline averaged 8 minutes. Over three months, it crept to 22 minutes. No one noticed because no one tracked it, until a developer complained during a retro. The team investigated and found three causes: Docker layer caching had broken two months earlier (added 6 minutes), a flaky test was being retried 3 times on every run (added 4 minutes), and a dependency mirror was slow (added 4 minutes). Three independent issues, each small enough to ignore, combined to nearly triple build time.

Pipeline metrics would have caught each regression within days.

The Mechanism

Key Pipeline Metrics

Metric	Description	Alert Threshold
Build duration	Total wall-clock time	> 2x baseline
Queue time	Time waiting for a runner	> 5 minutes
Success rate	% of builds that pass	< 90%
Flaky rate	% of tests that fail, then pass on retry	> 5%
Cache hit rate	% of steps using cached results	< 80%
MTTR	Mean time to fix a broken build	> 2 hours
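
The thresholds in the table can be written down as a small check. The following is a minimal sketch rather than a definitive implementation: the `PipelineStats` fields and the `check_thresholds` function are hypothetical names chosen for illustration, and the example values are invented apart from the 8-minute baseline and 22-minute build from the failure story.

```python
from dataclasses import dataclass


@dataclass
class PipelineStats:
    """Aggregated pipeline metrics for one reporting window (hypothetical shape)."""
    build_duration_s: float  # total wall-clock time of a typical build
    queue_time_s: float      # time spent waiting for a runner
    success_rate: float      # fraction of builds that passed, 0.0 to 1.0
    flaky_rate: float        # fraction of tests that failed, then passed on retry
    cache_hit_rate: float    # fraction of steps served from cache
    mttr_s: float            # mean time to fix a broken build


def check_thresholds(stats: PipelineStats, baseline_duration_s: float) -> list[str]:
    """Return the names of the metrics that breach the thresholds in the table."""
    breaches = []
    if stats.build_duration_s > 2 * baseline_duration_s:
        breaches.append("build_duration")
    if stats.queue_time_s > 5 * 60:
        breaches.append("queue_time")
    if stats.success_rate < 0.90:
        breaches.append("success_rate")
    if stats.flaky_rate > 0.05:
        breaches.append("flaky_rate")
    if stats.cache_hit_rate < 0.80:
        breaches.append("cache_hit_rate")
    if stats.mttr_s > 2 * 60 * 60:
        breaches.append("mttr")
    return breaches


# Example: a 22-minute build checked against an 8-minute baseline.
stats = PipelineStats(
    build_duration_s=22 * 60,
    queue_time_s=60,
    success_rate=0.95,
    flaky_rate=0.02,
    cache_hit_rate=0.40,
    mttr_s=30 * 60,
)
print(check_thresholds(stats, baseline_duration_s=8 * 60))
# prints ['build_duration', 'cache_hit_rate']
```

Returning the list of breached metric names, rather than a single boolean, keeps the alert message specific about what regressed.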

Flaky Test Detection

A flaky test is one that passes and fails on the same code. Detecting them requires tracking test results over time:

Test fails on PR → developer retries → test passes → PR merges
Record both results: the test is marked as flaky
After 3 flaky occurrences in 30 days → quarantine the test
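
The rule above amounts to a classification over the attempts recorded for one test on one commit. A minimal sketch, assuming each attempt is stored as the string "pass" or "fail"; `classify_test` is a hypothetical helper, not part of any test framework.

```python
def classify_test(outcomes: list[str]) -> str:
    """Classify one test from its attempts on a single commit.

    `outcomes` lists each attempt in order, e.g. ["fail", "pass"].
    A test is flaky when the same code produced both a failure and a pass.
    """
    if not outcomes:
        return "not_run"
    passed = "pass" in outcomes
    failed = "fail" in outcomes
    if passed and failed:
        return "flaky"
    return "pass" if passed else "fail"


print(classify_test(["fail", "pass"]))          # flaky
print(classify_test(["pass"]))                  # pass
print(classify_test(["fail", "fail", "fail"]))  # fail
```

A test that fails on every attempt is classified as a real failure, not as flaky, so it is never quarantined by this rule.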

The Implementation

Pipeline Metrics Collection

# .github/workflows/metrics.yml
# HARDENED: Collect pipeline metrics
name: Pipeline Metrics
on:
  workflow_run:
    workflows: ["CI"]
    types: [completed]

jobs:
  collect-metrics:
    runs-on: ubuntu-latest
    steps:
      - name: Collect workflow metrics
        uses: actions/github-script@v7
        env:
          PUSHGATEWAY_URL: ${{ secrets.PUSHGATEWAY_URL }}
        with:
          script: |
            const run = context.payload.workflow_run;
            const duration = (new Date(run.updated_at) - new Date(run.created_at)) / 1000;
            const queueTime = (new Date(run.run_started_at) - new Date(run.created_at)) / 1000;

            // Log the full record so it is visible in the job output
            const metrics = {
              workflow: run.name,
              conclusion: run.conclusion,
              duration_seconds: duration,
              queue_seconds: queueTime,
              branch: run.head_branch,
              sha: run.head_sha,
              timestamp: run.created_at,
            };
            core.info(JSON.stringify(metrics));

            // Push to Prometheus Pushgateway.
            // The Pushgateway stores the last pushed value for each group and
            // does not sum pushes, so ci_build_total marks the latest run
            // rather than accumulating a count.
            // The exposition format requires a trailing newline.
            const body = [
              `ci_build_duration_seconds{workflow="${run.name}",conclusion="${run.conclusion}"} ${duration}`,
              `ci_queue_duration_seconds{workflow="${run.name}"} ${queueTime}`,
              `ci_build_total{workflow="${run.name}",conclusion="${run.conclusion}"} 1`,
            ].join('\n') + '\n';

            // Workflow names may contain spaces or slashes, so encode the path segment
            const instance = encodeURIComponent(run.name);
            const response = await fetch(`${process.env.PUSHGATEWAY_URL}/metrics/job/ci/instance/${instance}`, {
              method: 'POST',
              body: body,
            });
            if (!response.ok) {
              core.setFailed(`Pushgateway returned HTTP ${response.status}`);
            }

Flaky Test Tracker

# scripts/flaky-tracker.py
# HARDENED: Track and quarantine flaky tests
import json
import sys
from pathlib import Path
from datetime import datetime, timedelta

FLAKY_DB = Path(".flaky-tests.json")
QUARANTINE_THRESHOLD = 3  # flaky occurrences within the window
WINDOW_DAYS = 30


def load_db():
    if FLAKY_DB.exists():
        return json.loads(FLAKY_DB.read_text())
    return {"tests": {}}


def save_db(db):
    FLAKY_DB.write_text(json.dumps(db, indent=2))


def record_flaky(test_name):
    db = load_db()
    now = datetime.now()
    cutoff = (now - timedelta(days=WINDOW_DAYS)).isoformat()

    entry = db["tests"].setdefault(
        test_name, {"occurrences": [], "quarantined": False}
    )
    entry["occurrences"].append(now.isoformat())

    # Drop occurrences outside the window. ISO 8601 strings written in the
    # same format compare chronologically, so a string comparison is enough.
    entry["occurrences"] = [o for o in entry["occurrences"] if o > cutoff]

    if len(entry["occurrences"]) >= QUARANTINE_THRESHOLD:
        entry["quarantined"] = True
        print(f"⚠ QUARANTINED: {test_name} "
              f"({len(entry['occurrences'])} flaky in {WINDOW_DAYS} days)")

    save_db(db)


def get_quarantined():
    db = load_db()
    return [name for name, data in db["tests"].items() if data.get("quarantined")]


if __name__ == "__main__":
    command = sys.argv[1] if len(sys.argv) > 1 else None
    if command == "record" and len(sys.argv) == 3:
        record_flaky(sys.argv[2])
    elif command == "list-quarantined":
        for t in get_quarantined():
            print(t)
    else:
        sys.exit("usage: flaky-tracker.py record <test-name> | list-quarantined")

JUnit XML Parser for Retry Detection

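# In CI workflow
- name: Run tests with retry
  run: |
    # --retries is not a core pytest flag; this assumes a retry plugin
    # such as pytest-retry is installed
    pytest --junitxml=results.xml --retries=2

- name: Detect flaky tests
  if: always()
  run: |
    python scripts/detect-flaky.py results.xml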

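# scripts/detect-flaky.py
# HARDENED: Detect tests that passed on retry
import subprocess
import sys
import xml.etree.ElementTree as ET


def detect_flaky(junit_xml):
    tree = ET.parse(junit_xml)
    for testcase in tree.iter("testcase"):
        # Assumes the report records each retry as a <rerun> child element;
        # the element name depends on the test runner and reporter plugin.
        reruns = testcase.findall("rerun")
        # A test that still ends in <failure> or <error> is broken, not flaky
        still_failing = (testcase.find("failure") is not None
                         or testcase.find("error") is not None)
        if reruns and not still_failing:
            name = f"{testcase.get('classname')}.{testcase.get('name')}"
            print(f"Flaky: {name} (retried {len(reruns)} times)")
            # Use the current interpreter so the tracker runs in the same environment
            subprocess.run(
                [sys.executable, "scripts/flaky-tracker.py", "record", name],
                check=True,
            )


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: detect-flaky.py <junit-xml>")
    detect_flaky(sys.argv[1])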

The Gate

Pipeline health is not a PR gate—it is a team gate. When pipeline success rate drops below 90% or build duration exceeds 2x baseline, the team pauses feature work to fix the pipeline. This is a process gate, enforced by the dashboard and alerts, not by branch protection.

The Recovery

Metrics collection adds overhead to CI: The metrics job runs as a separate workflow triggered by workflow_run. It does not add time to the main pipeline.

Flaky test database conflicts: Store the flaky test database in a separate branch or use an external datastore (SQLite in an artifact, or a real database). Multiple concurrent PRs writing to the same file will conflict.
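
One way to sidestep file conflicts is to append occurrences as rows instead of rewriting a shared JSON file. The following is a minimal sketch using Python's standard `sqlite3` module; the table name, column names, and function names are assumptions made for illustration, and where the database file lives (an artifact, a shared volume, a server) is left open.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

QUARANTINE_THRESHOLD = 3
WINDOW_DAYS = 30


def open_db(path: str) -> sqlite3.Connection:
    """Open the database and create the occurrences table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS flaky_occurrences ("
        "test_name TEXT NOT NULL, seen_at TEXT NOT NULL)"
    )
    return conn


def record_flaky(conn, test_name, now=None):
    """Append one flaky occurrence; existing rows are never rewritten."""
    now = now or datetime.now(timezone.utc)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO flaky_occurrences (test_name, seen_at) VALUES (?, ?)",
            (test_name, now.isoformat(timespec="seconds")),
        )


def get_quarantined(conn, now=None):
    """Return tests with at least QUARANTINE_THRESHOLD occurrences in the window."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=WINDOW_DAYS)).isoformat(timespec="seconds")
    rows = conn.execute(
        "SELECT test_name FROM flaky_occurrences WHERE seen_at > ? "
        "GROUP BY test_name HAVING COUNT(*) >= ? ORDER BY test_name",
        (cutoff, QUARANTINE_THRESHOLD),
    )
    return [name for (name,) in rows]


# Example with an in-memory database and made-up test names.
conn = open_db(":memory:")
for _ in range(3):
    record_flaky(conn, "tests.test_api.test_timeout")
record_flaky(conn, "tests.test_db.test_connect")
print(get_quarantined(conn))
# prints ['tests.test_api.test_timeout']
```

In this sketch, quarantine status is derived from the rows at query time instead of being stored as a flag, so a test drops out of quarantine once its occurrences age out of the window. That differs from a stored flag that stays set, and is a design choice to revisit if quarantine should be lifted only manually.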

Dashboards show spikes but no root cause: Tag metrics with the commit SHA so that duration spikes can be correlated with code changes. When duration spikes, run git log over the SHA range between the last fast build and the first slow one to find the change that caused it.
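
Once each metric carries a commit SHA, narrowing a spike to a commit range is mechanical. A minimal sketch, assuming build records are available as (sha, duration_seconds) pairs in chronological order; `find_spike_range`, the record shape, and the SHAs in the example are hypothetical.

```python
def find_spike_range(runs, baseline_s, factor=2.0):
    """Return (last_good_sha, first_slow_sha) for the first run above factor x baseline.

    `runs` is a chronological list of (sha, duration_seconds) tuples.
    Returns None when no run exceeds the threshold, or when the very first
    run is already slow (there is no known-good commit to compare against).
    """
    threshold = factor * baseline_s
    last_good = None
    for sha, duration in runs:
        if duration > threshold:
            if last_good is None:
                return None
            return (last_good, sha)
        last_good = sha
    return None


# Example: an 8-minute baseline with a jump to 22 minutes.
runs = [("a1b2c3d", 480), ("d4e5f6a", 510), ("0f9e8d7", 1320), ("7c6b5a4", 1290)]
spike = find_spike_range(runs, baseline_s=480)
if spike:
    good, slow = spike
    # The resulting range can be inspected with git
    print(f"git log --oneline {good}..{slow}")
# prints: git log --oneline d4e5f6a..0f9e8d7
```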