Tracking Performance Trends and the GitLab CI Equivalent

The Symptom

The CI performance gate catches acute regressions: a PR that doubles p99 is blocked immediately. But the fare calculation endpoint has been getting slower for 6 months. Not dramatically. 2 milliseconds per week. No single PR triggers the gate because each PR’s increase is within the 10% threshold. The p99 was 120ms in January. It is 172ms in July. A 43% cumulative regression that arrived in 26 increments of 2ms each.

The product manager reports that fare estimates feel slower. The engineer checks the dashboard. Grafana shows a smooth upward slope over 6 months. No cliff. No single commit to blame. Git bisect is useless because the regression is distributed across hundreds of commits: a new validation rule here (0.3ms), a logging statement there (0.1ms), an extra field in the serialization (0.5ms), a slightly more complex query (1.2ms).

The CI gate compares each PR against a fixed baseline. The baseline was set in January at 120ms with a 200ms threshold. Every PR passes because the threshold is 200ms and the current p99 is 172ms. The gate will not fire until a single PR pushes the p99 past 200ms.
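
The blind spot can be reproduced in a few lines. The following is a minimal sketch, assuming the gate applies both checks mentioned so far: a 10% per-PR relative limit and a 200ms absolute ceiling. The names `gate_passes` and `simulate_drift` and the one-increment-per-week model are illustrative, not the project's actual gate script.

```python
# Hypothetical model of the static gate; names and structure are illustrative.
BASELINE_MS = 120.0       # January baseline from the text
CEILING_MS = 200.0        # absolute threshold from the text
PER_PR_LIMIT_PCT = 10.0   # per-PR relative threshold from the text


def gate_passes(previous_ms: float, current_ms: float) -> bool:
    """Return True when a change stays under both the relative and absolute limits."""
    increase_pct = (current_ms - previous_ms) / previous_ms * 100
    return increase_pct <= PER_PR_LIMIT_PCT and current_ms <= CEILING_MS


def simulate_drift(weeks: int = 26, step_ms: float = 2.0):
    """Apply one small regression per week and record the gate verdict each time."""
    p99 = BASELINE_MS
    verdicts = []
    for _ in range(weeks):
        new_p99 = p99 + step_ms
        verdicts.append(gate_passes(p99, new_p99))
        p99 = new_p99
    cumulative_pct = (p99 - BASELINE_MS) / BASELINE_MS * 100
    return verdicts, p99, cumulative_pct


verdicts, final_p99, cumulative_pct = simulate_drift()
print(f"all increments passed: {all(verdicts)}")
print(f"final p99: {final_p99:.0f}ms, cumulative: {cumulative_pct:+.1f}%")
```

Every one of the 26 increments passes, yet the final p99 is 172ms, a cumulative +43.3%.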

The Cause

The CI performance gate has a static baseline. It catches step-function regressions but is blind to linear drift. Detecting drift requires tracking performance over time and computing trends.

The solution is two-fold:

Store every CI performance run’s results in a database (SQLite, committed alongside the code)
Analyze the trend across the last N runs, flagging when the cumulative change over that window exceeds a threshold

A trend analysis that detects “the p99 for fare-estimate has increased by 15% over the last 30 builds” catches drift that no individual PR triggered.

The Baseline

Performance drift pattern for the fare calculation endpoint:

Month    p99 (ms)    Change    Cumulative
Jan      120         -         0%
Feb      128         +8ms      +7%
Mar      136         +8ms      +13%
Apr      144         +8ms      +20%
May      152         +8ms      +27%
Jun      160         +8ms      +33%
Jul      172         +12ms     +43%

No single month-over-month step exceeds 10% (the largest, June to July, is 7.5%), so no change of this size trips the per-PR gate. The cumulative degradation is invisible to it.

The Fix

SQLite for performance trend storage

# SCALED: scripts/perf_trend.py - Store and analyze performance trends
import sqlite3
import csv
import sys
import argparse
from datetime import datetime, timezone

DB_PATH = "perf_history.db"

def init_db(db_path):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS perf_runs (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            commit_sha TEXT NOT NULL,
            branch TEXT NOT NULL,
            timestamp TEXT NOT NULL,
            endpoint TEXT NOT NULL,
            p50_ms REAL,
            p95_ms REAL,
            p99_ms REAL,
            avg_ms REAL,
            error_rate REAL,
            rps REAL,
            request_count INTEGER
        )
    """)
    conn.execute("""
        CREATE INDEX IF NOT EXISTS idx_endpoint_timestamp
        ON perf_runs(endpoint, timestamp)
    """)
    conn.commit()
    return conn

def store_results(conn, commit_sha, branch, csv_path):
    # One timezone-aware UTC timestamp per run (datetime.utcnow() is deprecated since Python 3.12)
    timestamp = datetime.now(timezone.utc).isoformat()

    with open(csv_path) as f:
        reader = csv.DictReader(f)
        for row in reader:
            name = row.get("Name", "").strip()
            if name == "Aggregated" or not name:
                continue

            request_count = int(row.get("Request Count", 0))
            failure_count = int(row.get("Failure Count", 0))

            conn.execute("""
                INSERT INTO perf_runs
                (commit_sha, branch, timestamp, endpoint,
                 p50_ms, p95_ms, p99_ms, avg_ms,
                 error_rate, rps, request_count)
                VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
            """, (
                commit_sha, branch, timestamp, name,
                float(row.get("50%", 0)),
                float(row.get("95%", 0)),
                float(row.get("99%", 0)),
                float(row.get("Average Response Time", 0)),
                failure_count / max(request_count, 1) * 100,
                float(row.get("Requests/s", 0)),
                request_count
            ))

    conn.commit()

def analyze_trends(conn, lookback_runs=30):
    endpoints = conn.execute("""
        SELECT DISTINCT endpoint FROM perf_runs
        WHERE branch = 'main'
        ORDER BY endpoint
    """).fetchall()

    report_lines = []
    report_lines.append("## Performance Trend Analysis")
    report_lines.append("")
    report_lines.append(
        f"Analyzing last {lookback_runs} builds on main branch."
    )
    report_lines.append("")

    drift_detected = False

    for (endpoint,) in endpoints:
        rows = conn.execute("""
            SELECT p99_ms, timestamp, commit_sha
            FROM perf_runs
            WHERE endpoint = ? AND branch = 'main'
            ORDER BY timestamp DESC
            LIMIT ?
        """, (endpoint, lookback_runs)).fetchall()

        if len(rows) < 5:
            continue

        rows.reverse()  # oldest first

        oldest_p99 = rows[0][0]
        newest_p99 = rows[-1][0]
        change_pct = ((newest_p99 - oldest_p99) / max(oldest_p99, 0.001)) * 100

        # Linear regression slope (ms per build)
        n = len(rows)
        sum_x = sum(range(n))
        sum_y = sum(r[0] for r in rows)
        sum_xy = sum(i * r[0] for i, r in enumerate(rows))
        sum_x2 = sum(i * i for i in range(n))

        denominator = n * sum_x2 - sum_x * sum_x
        slope = (
            (n * sum_xy - sum_x * sum_y) / denominator
            if denominator != 0 else 0
        )

        status = "OK"
        if change_pct > 15:
            status = "DRIFT"
            drift_detected = True
        elif change_pct > 10:
            status = "WARNING"

        report_lines.append(f"### `{endpoint}`")
        report_lines.append("")
        report_lines.append(
            f"- Oldest p99: {oldest_p99:.0f}ms "
            f"({rows[0][2][:7]} on {rows[0][1][:10]})"
        )
        report_lines.append(f"- Current p99: {newest_p99:.0f}ms")
        report_lines.append(f"- Change: {change_pct:+.1f}%")
        report_lines.append(
            f"- Slope: {slope:+.2f} ms/build"
        )
        report_lines.append(f"- Status: **{status}**")
        report_lines.append("")

    if drift_detected:
        report_lines.append("**DRIFT DETECTED.** "
            "Consider updating the baseline or investigating "
            "the gradual regression.")
    else:
        report_lines.append("**No significant drift detected.**")

    return "\n".join(report_lines), drift_detected

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    sub = parser.add_subparsers(dest="command")

    store_cmd = sub.add_parser("store")
    store_cmd.add_argument("--commit", required=True)
    store_cmd.add_argument("--branch", required=True)
    store_cmd.add_argument("--results", required=True)
    store_cmd.add_argument("--db", default=DB_PATH)

    trend_cmd = sub.add_parser("trend")
    trend_cmd.add_argument("--lookback", type=int, default=30)
    trend_cmd.add_argument("--db", default=DB_PATH)
    trend_cmd.add_argument("--output", default="trend_report.md")

    args = parser.parse_args()

    if args.command == "store":
        conn = init_db(args.db)
        store_results(conn, args.commit, args.branch, args.results)
        print(f"Stored results for {args.commit}")
        conn.close()

    elif args.command == "trend":
        conn = init_db(args.db)
        report, has_drift = analyze_trends(conn, args.lookback)
        with open(args.output, "w") as f:
            f.write(report)
        print(report)
        conn.close()
        if has_drift:
            sys.exit(1)
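
The report prints two numbers per endpoint: the oldest-to-newest change and the least-squares slope. They answer different questions. The sketch below, using synthetic series that are not real measurements, shows a case where both series cross the 15% line on the two-point change but only one has a sustained slope. `ols_slope` repeats the closed-form slope from the script; `endpoint_change_pct` is an illustrative helper name.

```python
def ols_slope(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    sum_x = sum(range(n))
    sum_y = sum(values)
    sum_xy = sum(i * v for i, v in enumerate(values))
    sum_x2 = sum(i * i for i in range(n))
    denominator = n * sum_x2 - sum_x * sum_x
    return (n * sum_xy - sum_x * sum_y) / denominator if denominator else 0.0


def endpoint_change_pct(values):
    """Percent change between the first and last value only."""
    return (values[-1] - values[0]) / max(values[0], 0.001) * 100


# Synthetic data: 30 builds each
steady_drift = [120 + 0.7 * i for i in range(30)]   # creeps up every build
single_spike = [120.0] * 29 + [140.0]               # flat, then one noisy run

for label, series in (("steady drift", steady_drift),
                      ("single spike", single_spike)):
    print(f"{label}: change {endpoint_change_pct(series):+.1f}%, "
          f"slope {ols_slope(series):+.2f} ms/build")
```

Both series show a change of roughly +17%, but the slope separates them: about 0.70 ms/build for the drift and about 0.13 ms/build for the spike. Under these assumptions, a single noisy final run can look like drift to a two-point comparison, so it may be worth reading the slope before acting on a DRIFT status.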

Integrating trend analysis into the GitHub Actions workflow

Add these steps after the existing comparison step:

- name: Store results in trend database
  if: github.ref == 'refs/heads/main'
  run: |
    python scripts/perf_trend.py store \
      --commit ${{ github.sha }} \
      --branch main \
      --results results/perf_stats.csv \
      --db perf_history.db

- name: Analyze trends
  if: github.ref == 'refs/heads/main'
  run: |
    python scripts/perf_trend.py trend \
      --lookback 30 \
      --db perf_history.db \
      --output results/trend_report.md
  continue-on-error: true

- name: Commit trend database
  if: github.ref == 'refs/heads/main'
  run: |
    git config user.name "CI Bot"
    git config user.email "[email protected]"
    git add perf_history.db
    git diff --cached --quiet || git commit -m "Update perf history [skip ci]"
    git push

The trend database is committed to the repository. Each build on main adds one row per endpoint. The trend analysis runs on every main build and flags drift when an endpoint’s p99 has risen by more than 15% across the last 30 builds.

Gradual drift detection output

## Performance Trend Analysis

Analyzing last 30 builds on main branch.

### `/api/rides/fare-estimate`

- Oldest p99: 122ms (a1b2c3d on 2024-01-15)
- Current p99: 172ms
- Change: +41.0%
- Slope: +1.67 ms/build
- Status: **DRIFT**

### `/api/drivers/nearby`

- Oldest p99: 185ms (a1b2c3d on 2024-01-15)
- Current p99: 192ms
- Change: +3.8%
- Slope: +0.23 ms/build
- Status: **OK**

### `/api/rides/request`

- Oldest p99: 310ms (a1b2c3d on 2024-01-15)
- Current p99: 328ms
- Change: +5.8%
- Slope: +0.60 ms/build
- Status: **OK**

**DRIFT DETECTED.** Consider updating the baseline or investigating
the gradual regression.

The fare-estimate endpoint shows a +1.67ms per build slope. At 2 builds per week, that is +3.34ms per week, or +174ms per year. The drift detection caught it at +41% after 30 builds. Without trend tracking, it would continue until a single PR pushed the p99 past the 200ms threshold, at which point the developer who happened to write that PR would be blamed for the entire 6-month accumulation.
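
The extrapolation is simple arithmetic and can be checked directly. This sketch assumes the drift stays linear, which is a simplification; the variable names are illustrative.

```python
import math

slope_ms_per_build = 1.67   # from the trend report
builds_per_week = 2         # cadence stated in the text
current_p99_ms = 172.0      # current p99 from the trend report
ceiling_ms = 200.0          # static gate threshold

per_week = slope_ms_per_build * builds_per_week
per_year = per_week * 52

# Builds remaining before the static gate would finally fire
builds_until_ceiling = math.ceil((ceiling_ms - current_p99_ms) / slope_ms_per_build)
weeks_until_ceiling = builds_until_ceiling / builds_per_week

print(f"drift: {per_week:+.2f} ms/week, {per_year:+.0f} ms/year")
print(f"static gate fires in about {builds_until_ceiling} builds "
      f"({weeks_until_ceiling:.1f} weeks)")
```

At this slope the static gate would stay silent for roughly another 17 builds.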

When to update the baseline vs when to investigate

Decision framework:

Scenario                                    Action
Architecture change (added caching)         Update baseline downward
New feature adds legitimate latency         Update baseline upward, document why
Gradual drift, no intentional changes       Investigate, fix root cause
Single PR causes >10% regression            Block PR, fix the code
Threshold too tight (frequent false pos)    Increase threshold, document why

Baseline updates require a PR that modifies perf_thresholds.json with a description of why the baseline changed. The git history of the threshold file is the performance contract’s changelog.

GitLab CI pipeline

# SCALED: .gitlab-ci.yml - Performance regression testing
stages:
  - build
  - test
  - performance
  - trend

variables:
  DOCKER_HOST: tcp://docker:2375
  DOCKER_TLS_CERTDIR: ""

build:
  stage: build
  image: docker:24
  services:
    - docker:24-dind
  script:
    - docker compose -f docker-compose.ci.yml build
  artifacts:
    paths:
      - docker-compose.ci.yml

performance-test:
  stage: performance
  image: docker:24
  services:
    - docker:24-dind
  before_script:
    - apk add --no-cache python3 py3-pip curl
    - pip3 install locust==2.28.0 --break-system-packages
  script:
    - docker compose -f docker-compose.ci.yml up -d
    - |
      echo "Waiting for application..."
      for i in $(seq 1 60); do
        curl -sf http://docker:8080/health/ready && break
        [ $i -eq 60 ] && exit 1
        sleep 2
      done
    - mkdir -p results
    - |
      locust -f locust_ci.py \
        --host=http://docker:8080 \
        --users 50 \
        --spawn-rate 10 \
        --run-time 60s \
        --headless \
        --csv=results/perf
    - python3 scripts/compare_perf.py
      --results results/perf_stats.csv
      --thresholds perf_thresholds.json
      --output results/comparison.md
  after_script:
    - docker compose -f docker-compose.ci.yml down -v
  artifacts:
    paths:
      - results/
    reports:
      performance: results/perf_stats.csv
    when: always

trend-analysis:
  stage: trend
  image: python:3.12-slim
  only:
    - main
  script:
    # perf_trend.py uses only the Python standard library, so no pip install is needed
    - python scripts/perf_trend.py store
      --commit $CI_COMMIT_SHA
      --branch $CI_COMMIT_REF_NAME
      --results results/perf_stats.csv
      --db perf_history.db
    - python scripts/perf_trend.py trend
      --lookback 30
      --db perf_history.db
      --output results/trend_report.md
  artifacts:
    paths:
      - perf_history.db
      - results/trend_report.md
    when: always
  cache:
    key: perf-history
    paths:
      - perf_history.db

GitLab CI differences from GitHub Actions:

Docker-in-Docker: GitLab CI uses docker:dind as a service. The application runs inside Docker-in-Docker. The Locust test connects to docker:8080 (the DinD service hostname) instead of localhost:8080.

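Performance artifacts: GitLab has built-in performance report types that surface metric changes in the merge request widget. They are tied to specific tools: browser_performance (named performance before GitLab 14.0) expects sitespeed.io JSON, and load_performance expects a k6 JSON summary. A Locust CSV does not match either format, so treat the reports: performance entry above as a placeholder and rely on the results/ artifact and the comparison report unless the CSV is converted first.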

Cache for trend database: Instead of committing the SQLite database to the repo (which requires write access to the repository from CI), this pipeline uses the GitLab CI cache. The perf_history.db file persists across pipeline runs via the cache. This is simpler but less durable: the cache can be evicted at any time. For critical trend data, use shared artifact storage or an external database.
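
One hedged way to soften the durability problem is to keep the cache for speed and also write a plain-text SQL dump that can be stored somewhere more permanent. The sketch below uses the standard library's `Connection.iterdump()` and `executescript()`. The helper names `dump_history` and `restore_history`, and the reduced table used in the demo, are assumptions for illustration; where the dump is uploaded is left open.

```python
import sqlite3


def dump_history(conn: sqlite3.Connection) -> str:
    """Serialize the whole database as SQL text (schema plus INSERT statements)."""
    return "\n".join(conn.iterdump()) + "\n"


def restore_history(sql_text: str, db_path: str = ":memory:") -> sqlite3.Connection:
    """Rebuild a database from a dump; db_path should point at a fresh file."""
    conn = sqlite3.connect(db_path)
    conn.executescript(sql_text)
    return conn


# Demo with a reduced, illustrative subset of the perf_runs columns.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE perf_runs ("
    "id INTEGER PRIMARY KEY, commit_sha TEXT, endpoint TEXT, p99_ms REAL)"
)
source.executemany(
    "INSERT INTO perf_runs (commit_sha, endpoint, p99_ms) VALUES (?, ?, ?)",
    [("a1b2c3d", "/api/rides/fare-estimate", 122.0),
     ("e4f5a6b", "/api/rides/fare-estimate", 124.0)],
)
source.commit()

backup_sql = dump_history(source)
restored = restore_history(backup_sql)
restored_rows = restored.execute(
    "SELECT commit_sha, p99_ms FROM perf_runs ORDER BY id"
).fetchall()
print(restored_rows)
```

If the cache is evicted, a job could restore from the most recent dump before storing new results. A text dump also diffs cleanly, which a binary .db file does not.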

Grafana dashboard for performance trends

Query the SQLite database (or Prometheus if using pushgateway) to build a Grafana dashboard:

-- Endpoint latency percentiles over time, one row per build
SELECT
  endpoint,
  timestamp,
  commit_sha,
  p99_ms,
  p95_ms,
  p50_ms
FROM perf_runs
WHERE branch = 'main'
  AND endpoint = '/api/rides/fare-estimate'
ORDER BY timestamp DESC
LIMIT 100;

For Grafana, push metrics to Prometheus using pushgateway after each CI run:

# Push to Prometheus pushgateway after Locust completes
cat <<EOF | curl --data-binary @- http://pushgateway.monitoring:9091/metrics/job/ci-perf/commit/$COMMIT_SHA
# TYPE ci_perf_p99_ms gauge
ci_perf_p99_ms{endpoint="/api/rides/fare-estimate"} 128
ci_perf_p99_ms{endpoint="/api/drivers/nearby"} 185
ci_perf_p99_ms{endpoint="/api/rides/request"} 320
# TYPE ci_perf_p95_ms gauge
ci_perf_p95_ms{endpoint="/api/rides/fare-estimate"} 95
ci_perf_p95_ms{endpoint="/api/drivers/nearby"} 142
ci_perf_p95_ms{endpoint="/api/rides/request"} 280
EOF
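
The values in the payload above are hard-coded. In practice they would come from the Locust CSV. The following is a minimal sketch of building the same text format from CSV rows; the function names are illustrative, and the sample CSV is a trimmed stand-in that keeps only the `Name`, `95%`, and `99%` columns read elsewhere in this section.

```python
import csv
import io


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline for a Prometheus label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def metrics_from_locust_csv(csv_text: str) -> dict:
    """Collect per-endpoint p95/p99 values, skipping the Aggregated row."""
    p95, p99 = {}, {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = (row.get("Name") or "").strip()
        if not name or name == "Aggregated":
            continue
        p95[name] = float(row["95%"])
        p99[name] = float(row["99%"])
    return {"ci_perf_p99_ms": p99, "ci_perf_p95_ms": p95}


def build_payload(metrics: dict) -> str:
    """Render {metric: {endpoint: value}} as text exposition format."""
    lines = []
    for metric_name, per_endpoint in metrics.items():
        lines.append(f"# TYPE {metric_name} gauge")
        for endpoint, value in per_endpoint.items():
            label = escape_label_value(endpoint)
            lines.append(f'{metric_name}{{endpoint="{label}"}} {value}')
    return "\n".join(lines) + "\n"  # the body must end with a newline


# Trimmed sample input for illustration only
sample_csv = (
    "Name,95%,99%\n"
    "/api/rides/fare-estimate,95,128\n"
    "/api/drivers/nearby,142,185\n"
    "Aggregated,200,300\n"
)

payload = build_payload(metrics_from_locust_csv(sample_csv))
print(payload, end="")
```

The resulting text can be piped to the same curl command in place of the heredoc.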

Grafana dashboard panels:

p99 by endpoint over time: Line chart with each endpoint as a series. X-axis: build timestamp. Y-axis: p99 in milliseconds. Shows drift as a gradual upward slope.
p99 by endpoint per commit: Bar chart showing the p99 for each build. Hovering shows the commit SHA. Allows pinpointing which commits introduced latency.
Error rate by endpoint: Line chart tracking error rate. Should be flat at 0%. Any upward movement indicates a reliability regression.
Throughput (RPS): Validates that the CI test is consistent. RPS should be stable across runs. A drop in RPS indicates CI environment issues, not application regressions.
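
The throughput panel's purpose, separating CI noise from real regressions, can also be checked in code before trusting a run. This is a rough sketch: the 10% tolerance, the function name `rps_outliers`, and the sample values are all assumptions for illustration.

```python
from statistics import median


def rps_outliers(rps_by_commit: dict, tolerance_pct: float = 10.0) -> list:
    """Return commits whose RPS deviates from the median by more than tolerance_pct."""
    typical = median(rps_by_commit.values())
    return [
        commit
        for commit, rps in rps_by_commit.items()
        if abs(rps - typical) / typical * 100 > tolerance_pct
    ]


# Synthetic RPS history, keyed by short commit SHA
history = {
    "a1b2c3d": 48.9,
    "b2c3d4e": 49.4,
    "c3d4e5f": 49.1,
    "d4e5f6a": 37.2,   # a run on a slow or contended CI runner
    "e5f6a7b": 49.0,
}

suspect = rps_outliers(history)
print(f"runs to treat with caution: {suspect}")
```

A run flagged this way is a candidate for a re-run rather than for investigation of the application code.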

The Proof

After deploying trend tracking and the GitLab CI pipeline:

Metric                           Before Trends    After Trends       Delta
Gradual drift detection          Never            Automated          Fixed
Time to detect 15% drift         6 months         ~15 builds         -95%
Baseline staleness               6+ months        Updated monthly    Controlled
Performance contract violations  2/quarter        0/quarter          -100%
Engineer time on perf bisect     8 hrs/month      30 min/month       -94%

The fare-estimate drift was identified within 15 builds. The trend report flagged the +41% change. The team investigated and found 12 micro-regressions: 4 added validations (2.1ms total), 3 extra log statements (0.9ms total), 2 additional serialization fields (1.8ms total), and 3 query parameter additions (3.2ms total). Each was individually reasonable. These itemized changes sum to 8.0ms; over the full period the endpoint degraded by 52ms.

The fix was selective: remove 2 unnecessary log statements (0.6ms), batch 2 validations (0.8ms), and optimize the query (2.1ms). Total recovery: 3.5ms. The remaining 48.5ms of the regression was intentional (new features, required validations) and the baseline was updated to 170ms with documentation explaining why.

The trend database now shows a flat line for the fare-estimate endpoint. The slope dropped from +1.67ms/build to +0.12ms/build. The next drift accumulation will be caught at 15% instead of 43%.