Pass/Fail Criteria as Code

The Failure

The team’s performance budget was documented in Confluence. The CI script checked p99 < 500ms for all endpoints. The checkout endpoint was naturally slower (p99 of 400ms) while the catalog endpoint was fast (p99 of 50ms). A regression doubled catalog latency to 100ms. Still under 500ms. The gate passed. Users noticed the catalog was slower. The global budget was too coarse to catch per-endpoint regressions.

Per-endpoint budgets catch regressions that global budgets miss.

The Mechanism

Budget Hierarchy

Global budget:
  p99 < 500ms
  error_rate < 0.5%

Per-endpoint budgets:
  /api/catalog/products:
    p99 < 80ms
  /api/checkout:
    p99 < 400ms
  /api/payments:
    p99 < 600ms

The checker validates both levels. A request must pass its endpoint budget AND the global budget.

The Implementation

Budget Configuration File

# tests/performance/budgets.yaml
# HARDENED: Performance budgets as code
global:
  p50_ms: 100
  p99_ms: 500
  error_rate_percent: 0.5
  min_rps: 50

endpoints:
  "/api/catalog/products":
    p50_ms: 30
    p99_ms: 80
  "/api/catalog/products/[id]":
    p50_ms: 20
    p99_ms: 50
  "/api/cart/items":
    p50_ms: 50
    p99_ms: 200
  "/api/checkout":
    p50_ms: 200
    p99_ms: 400

Enhanced Budget Checker

# tests/performance/check_budget.py
# HARDENED: Per-endpoint budget checking with baseline comparison
import csv
import json
import sys
import yaml
from pathlib import Path


def load_budgets(path="tests/performance/budgets.yaml"):
    with open(path) as f:
        return yaml.safe_load(f)


def load_baseline(path=".performance-baseline.json"):
    if Path(path).exists():
        with open(path) as f:
            return json.load(f)
    return None


def check(csv_path):
    budgets = load_budgets()
    baseline = load_baseline()
    failures = []
    results = {}

    with open(csv_path) as f:
        reader = csv.DictReader(f)
        for row in reader:
            name = row["Name"]
            p50 = float(row["50%"])
            p99 = float(row["99%"])
            total = int(row["Request Count"])
            fail_count = int(row["Failure Count"])
            error_rate = (fail_count / total * 100) if total > 0 else 0

            results[name] = {"p50": p50, "p99": p99, "error_rate": error_rate}

            # Check endpoint-specific budget
            if name in budgets.get("endpoints", {}):
                ep = budgets["endpoints"][name]
                if "p50_ms" in ep and p50 > ep["p50_ms"]:
                    failures.append(f"{name}: p50 {p50}ms > {ep['p50_ms']}ms")
                if "p99_ms" in ep and p99 > ep["p99_ms"]:
                    failures.append(f"{name}: p99 {p99}ms > {ep['p99_ms']}ms")

            # Check global budget
            if name == "Aggregated":
                g = budgets["global"]
                if p50 > g["p50_ms"]:
                    failures.append(f"Global p50 {p50}ms > {g['p50_ms']}ms")
                if p99 > g["p99_ms"]:
                    failures.append(f"Global p99 {p99}ms > {g['p99_ms']}ms")
                if error_rate > g["error_rate_percent"]:
                    failures.append(f"Error rate {error_rate:.2f}% > {g['error_rate_percent']}%")

            # Check regression from baseline
            if baseline and name in baseline:
                prev_p99 = baseline[name]["p99"]
                regression_pct = ((p99 - prev_p99) / prev_p99 * 100) if prev_p99 > 0 else 0
                if regression_pct > 20:
                    failures.append(
                        f"{name}: p99 regressed {regression_pct:.0f}% "
                        f"({prev_p99}ms → {p99}ms)"
                    )

    # Save current results as new baseline candidate
    with open(".performance-baseline-candidate.json", "w") as f:
        json.dump(results, f, indent=2)

    if failures:
        print("Performance budget EXCEEDED:")
        for f_msg in failures:
            print(f"  ✗ {f_msg}")
        sys.exit(1)
    else:
        print("Performance budget OK")
        # Promote candidate to baseline
        Path(".performance-baseline-candidate.json").rename(
            ".performance-baseline.json"
        )


if __name__ == "__main__":
    check(sys.argv[1])

Trend Detection

# tests/performance/trend.py
# HARDENED: Detect gradual performance degradation
import json
from pathlib import Path


def check_trend(history_dir=".performance-history"):
    """Check if p99 has been increasing over the last N runs."""
    history_path = Path(history_dir)
    if not history_path.exists():
        return

    files = sorted(history_path.glob("*.json"))[-10:]  # Last 10 runs
    if len(files) < 5:
        return

    aggregated_p99 = []
    for f in files:
        data = json.loads(f.read_text())
        if "Aggregated" in data:
            aggregated_p99.append(data["Aggregated"]["p99"])

    if len(aggregated_p99) < 5:
        return

    # Check if p99 has increased in 4 of the last 5 runs
    increases = sum(
        1 for i in range(1, len(aggregated_p99))
        if aggregated_p99[i] > aggregated_p99[i - 1]
    )

    if increases >= 4:
        first = aggregated_p99[0]
        last = aggregated_p99[-1]
        pct = ((last - first) / first * 100) if first > 0 else 0
        print(f"⚠ Performance trend warning: p99 increased {pct:.0f}% "
              f"over last {len(aggregated_p99)} runs ({first}ms → {last}ms)")

The Gate

The budget checker is a two-layer gate:

Absolute budget: Each endpoint must be under its defined maximum
Relative regression: No endpoint can regress more than 20% from baseline

Both must pass for the PR to merge. Trend warnings are advisory—they surface gradual degradation that individual PRs might not trigger.

The Recovery

Budget is too tight for CI environment: CI runners are slower than production hardware. Set CI budgets at 2x production budgets. The goal is to catch regressions, not validate absolute performance.

Baseline keeps ratcheting upward: If every PR is slightly slower and the baseline updates each time, the baseline drifts. Add the 20% regression check to prevent gradual drift.

Different results on each run: Performance tests have variance. Run the test 3 times and take the median. Or increase test duration to reduce variance.