Performance Regression Detection in CI

The Symptom

The team adds a performance test to the CI pipeline. The first implementation: run Locust in the GitHub Actions runner, compare p99 against a hardcoded threshold, pass or fail. It works for two weeks. Then it starts flapping. The same commit passes on retry 60% of the time. The team increases the threshold to reduce false positives. Now real regressions slip through because the threshold is too generous.

The problem: no controlled environment. The Locust test runs on whatever GitHub Actions runner is available. Some runners share a host with other jobs. CPU contention varies by time of day. The database runs without resource limits and competes with the application for memory. Each run produces different results for identical code.

The Cause

Performance testing in CI requires three things that are easy to get wrong:

1. Reproducible environment: Fixed CPU and memory for every component. Same database version. Same Redis version. Same network topology.
2. Stable baseline: A reference measurement that represents “correct” performance, updated intentionally, not automatically.
3. Statistical tolerance: A single run has variance. The comparison must account for noise without masking real regressions.

Docker Compose with resource limits solves #1. A checked-in threshold file solves #2. The comparison script with percentage-based thresholds solves #3.

The Baseline

Current CI performance test failure modes:

Failure Mode                    Frequency    Impact
CI runner CPU contention         3/week       False positive, team ignores gate
No resource limits on DB         1/week       Inconsistent baselines
Threshold too generous           2/month      Real regression passes
No PR comment                    Every PR     Developer does not see results
Manual baseline updates          Never        Baseline drifts from reality

Target:

Requirement                     Solution
Reproducible environment        Docker Compose with CPU/memory limits
Stable baseline                 perf_thresholds.json in repo
Statistical tolerance           10% regression threshold
Developer feedback              Automated PR comment with table
Baseline update process         Manual PR when architecture changes

The Fix

Docker Compose for CI: resource isolation

# SCALED: docker-compose.ci.yml - complete CI environment
version: "3.9"

services:
  rider-api:
    build:
      context: .
      dockerfile: Dockerfile
    environment:
      SPRING_PROFILES_ACTIVE: ci
      SPRING_DATASOURCE_URL: jdbc:postgresql://postgres:5432/ridehailing
      SPRING_DATASOURCE_USERNAME: app
      SPRING_DATASOURCE_PASSWORD: ci-test-only
      SPRING_REDIS_HOST: redis
      SPRING_REDIS_PORT: 6379
      # -Xmx takes precedence over -XX:MaxRAMPercentage, so only explicit heap bounds are set
      JAVA_OPTS: >-
        -XX:+UseG1GC
        -Xms512m
        -Xmx768m
    ports:
      - "8080:8080"
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
      kafka:
        condition: service_healthy
    deploy:
      resources:
        limits:
          cpus: "2.0"
          memory: "1G"
        reservations:
          cpus: "1.0"
          memory: "512M"
    healthcheck:
      test:
        ["CMD-SHELL", "curl -sf http://localhost:8080/health/ready || exit 1"]
      interval: 5s
      timeout: 5s
      retries: 30
      start_period: 30s

  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: ridehailing
      POSTGRES_USER: app
      POSTGRES_PASSWORD: ci-test-only
    volumes:
      - ./src/test/resources/init.sql:/docker-entrypoint-initdb.d/init.sql
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U app -d ridehailing"]
      interval: 5s
      timeout: 3s
      retries: 10
    deploy:
      resources:
        limits:
          cpus: "1.0"
          memory: "512M"

  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 10
    deploy:
      resources:
        limits:
          cpus: "0.5"
          memory: "256M"

  kafka:
    image: confluentinc/cp-kafka:7.5.0
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:9093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka:9093
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT
      # KRaft expects the cluster ID to be a base64-encoded UUID (22 characters)
      CLUSTER_ID: "MkU3OEVBNTcwNTJENDM2Qk"
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    healthcheck:
      test:
        [
          "CMD-SHELL",
          "kafka-broker-api-versions --bootstrap-server localhost:9092 || exit 1",
        ]
      interval: 10s
      timeout: 10s
      retries: 15
      start_period: 30s
    deploy:
      resources:
        limits:
          cpus: "1.0"
          memory: "512M"

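Every service has fixed resource limits. The rider-api is capped at 2 CPUs and 1GB of memory on every run, and PostgreSQL at 1 CPU and 512MB. These limits are lower than production, but they are consistent, and consistency matters more than realism for regression detection. The limits are ceilings, not guarantees: they add up to 4.5 CPUs, so a runner with fewer cores than that (plus headroom for Locust, which runs on the host) can still introduce contention.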
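
Compose limits cap what a container may use; they do not reserve physical cores. A minimal sketch of a pre-flight check that the runner can actually honor them, assuming the limits are copied by hand from docker-compose.ci.yml and that one core is held back for Locust (the function name and the reserve value are illustrative, not part of the original pipeline):

```python
import os

# CPU limits copied from docker-compose.ci.yml (the cpus value of each service).
CPU_LIMITS = {"rider-api": 2.0, "postgres": 1.0, "redis": 0.5, "kafka": 1.0}


def cpu_budget_ok(limits, host_cpus, load_generator_reserve=1.0):
    """Return (ok, required): required is the summed limits plus the reserve."""
    required = sum(limits.values()) + load_generator_reserve
    return host_cpus >= required, required


if __name__ == "__main__":
    host_cpus = os.cpu_count() or 1
    ok, required = cpu_budget_ok(CPU_LIMITS, host_cpus)
    print(f"host CPUs: {host_cpus}, required: {required:.1f}, ok: {ok}")
```

Running this as the first CI step would make an undersized runner visible before any latency numbers are collected.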

The init.sql seeds the database with test data: 100 drivers, 50 zones, 1,000 historical trips. Without seed data, the first few Locust requests trigger cold-path code (cache misses, empty result sets) that has different performance characteristics than warm-path code.
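
Seed data only helps reproducibility if it is identical on every run. A sketch of a generator that could produce such a file, assuming a hypothetical schema (the table and column names below are invented for illustration; the real init.sql is not shown in this section) and a fixed random seed:

```python
import random


def build_seed_sql(drivers=100, zones=50, trips=1000, seed=42):
    """Return INSERT statements for a hypothetical schema; same seed, same output."""
    rng = random.Random(seed)  # fixed seed keeps the data identical across CI runs
    lines = []
    for z in range(1, zones + 1):
        lines.append(f"INSERT INTO zones (id, name) VALUES ({z}, 'zone-{z}');")
    for d in range(1, drivers + 1):
        zone_id = rng.randint(1, zones)
        lines.append(
            f"INSERT INTO drivers (id, zone_id) VALUES ({d}, {zone_id});"
        )
    for t in range(1, trips + 1):
        driver_id = rng.randint(1, drivers)
        fare_cents = rng.randint(500, 5000)
        lines.append(
            "INSERT INTO trips (id, rider_id, driver_id, fare_cents) "
            f"VALUES ({t}, 'ci-test-rider', {driver_id}, {fare_cents});"
        )
    return "\n".join(lines) + "\n"
```

The trips reuse the ci-test-rider ID so that a history query for that rider returns rows instead of an empty result set.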

Locust configuration for CI

# SCALED: locust_ci.py - CI-optimized load test
from locust import HttpUser, task, between, events
import logging

logger = logging.getLogger(__name__)

class CIPerformanceUser(HttpUser):
    wait_time = between(0.1, 0.5)

    def on_start(self):
        """Warm up the service with a few requests when each user starts"""
        for _ in range(3):
            # Locust records on_start requests too; a separate name keeps
            # them out of the fare-estimate statistics.
            self.client.get("/api/rides/fare-estimate", params={
                "pickup_lat": 40.7128, "pickup_lng": -74.0060,
                "dropoff_lat": 40.7589, "dropoff_lng": -73.9851
            }, name="warmup")

    @task(5)
    def fare_estimate(self):
        self.client.get("/api/rides/fare-estimate",
            params={
                "pickup_lat": 40.7128, "pickup_lng": -74.0060,
                "dropoff_lat": 40.7589, "dropoff_lng": -73.9851
            },
            name="/api/rides/fare-estimate"
        )

    @task(3)
    def nearby_drivers(self):
        self.client.get("/api/drivers/nearby",
            params={"lat": 40.7128, "lng": -74.0060, "radius_km": 2},
            name="/api/drivers/nearby"
        )

    @task(1)
    def request_ride(self):
        self.client.post("/api/rides/request",
            json={
                "rider_id": "ci-test-rider",
                "pickup_lat": 40.7128, "pickup_lng": -74.0060,
                "dropoff_lat": 40.7589, "dropoff_lng": -73.9851,
                "ride_type": "standard"
            },
            name="/api/rides/request"
        )

    @task(2)
    def trip_history(self):
        self.client.get("/api/trips/history",
            params={"rider_id": "ci-test-rider", "limit": 20},
            name="/api/trips/history"
        )

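The on_start warmup fires 3 requests per user as each user starts. This starts JVM JIT compilation and fills caches before the bulk of the measured traffic arrives, so the results lean toward warm-path performance. Locust records on_start requests like any other request, so they should either carry their own name or be discarded with --reset-stats, which clears the statistics once all users have spawned. Without warmup, the first 5-10 seconds of data include cold-start overhead that inflates latency numbers.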

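50 users with a 0.1-0.5 second wait (0.3 seconds on average) generate at most about 165 RPS, and closer to 125-145 RPS once 50-100 ms of response time per request is added. Low enough to run on a 2-CPU container. High enough to expose event loop contention from blocking calls.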
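
The request rate of a closed-loop test follows from the user count, the average wait, and the average response time. A small sketch of that arithmetic, assuming uniformly distributed waits (which is how Locust's between behaves); the function name and the latency values are illustrative:

```python
def expected_rps(users, wait_min_s, wait_max_s, mean_response_s):
    """Steady-state request rate for a closed-loop load test.

    Each user repeats: send a request, wait for the response, then sleep a
    uniformly distributed time between wait_min_s and wait_max_s.
    """
    mean_wait_s = (wait_min_s + wait_max_s) / 2
    return users / (mean_wait_s + mean_response_s)


if __name__ == "__main__":
    for latency_ms in (0, 50, 100):
        rps = expected_rps(50, 0.1, 0.5, latency_ms / 1000)
        print(f"mean latency {latency_ms:>3} ms -> about {rps:.0f} RPS")
```

Because each user waits for its response before sleeping, a slower build also produces fewer requests per second, so throughput and latency should be read together.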

GitHub Actions workflow: the complete pipeline

# SCALED: .github/workflows/performance-gate.yml
name: Performance Regression Gate

on:
  pull_request:
    paths:
      - "src/main/**"
      - "build.gradle"
      - "Dockerfile"
      - "docker-compose.ci.yml"
      - "locust_ci.py"
      - "perf_thresholds.json"

concurrency:
  group: perf-${{ github.head_ref }}
  cancel-in-progress: true

jobs:
  performance-test:
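    runs-on: ubuntu-latest
    timeout-minutes: 15
    # The github-script step needs write access to create or update the PR comment
    permissions:
      contents: read
      pull-requests: write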

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install Locust
        run: pip install locust==2.28.0

      - name: Start services
        run: |
          docker compose -f docker-compose.ci.yml up -d --build
          echo "Services starting..."

      - name: Wait for application health
        run: |
          echo "Waiting for rider-api health check..."
          attempt=0
          max_attempts=60
          while [ $attempt -lt $max_attempts ]; do
            if curl -sf http://localhost:8080/health/ready > /dev/null 2>&1; then
              echo "Application is ready (attempt $attempt)"
              break
            fi
            attempt=$((attempt + 1))
            if [ $attempt -eq $max_attempts ]; then
              echo "Application failed to become healthy"
              docker compose -f docker-compose.ci.yml logs rider-api
              exit 1
            fi
            sleep 2
          done

      - name: Run performance test
        run: |
          mkdir -p results
          locust -f locust_ci.py \
            --host=http://localhost:8080 \
            --users 50 \
            --spawn-rate 10 \
            --run-time 60s \
            --headless \
            --csv=results/perf \
            --only-summary \
            2>&1 | tee results/locust_output.txt

      - name: Compare against thresholds
        id: compare
        run: |
          python scripts/compare_perf.py \
            --results results/perf_stats.csv \
            --thresholds perf_thresholds.json \
            --output results/comparison.md
        continue-on-error: true

      - name: Post PR comment
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            let body = '';
            try {
              body = fs.readFileSync('results/comparison.md', 'utf8');
            } catch (e) {
              // Keep the same heading so a later run finds and updates this comment
              body = '## Performance Regression Report\n\nFailed to generate comparison report.\n';
            }

            // Find and update existing comment or create new one
            const {data: comments} = await github.rest.issues.listComments({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
            });

            const botComment = comments.find(c =>
              c.body.includes('Performance Regression Report')
            );

            if (botComment) {
              await github.rest.issues.updateComment({
                comment_id: botComment.id,
                owner: context.repo.owner,
                repo: context.repo.repo,
                body: body
              });
            } else {
              await github.rest.issues.createComment({
                issue_number: context.issue.number,
                owner: context.repo.owner,
                repo: context.repo.repo,
                body: body
              });
            }

      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: performance-results
          path: results/

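      - name: Check result
        run: |
          # Fail when the compare step exited non-zero (including a crash that
          # left no report) or when the report contains an explicit FAIL.
          if [ "${{ steps.compare.outcome }}" != "success" ] || grep -q "FAIL" results/comparison.md; then
            echo "Performance regression detected or comparison failed. See PR comment for details."
            exit 1
          fi
          echo "Performance check passed."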

      - name: Cleanup
        if: always()
        run: docker compose -f docker-compose.ci.yml down -v --remove-orphans

Key workflow decisions:

concurrency: cancel-in-progress: true cancels the performance test when a new commit is pushed to the same PR. No point testing an outdated commit while the developer is iterating.

continue-on-error: true on the compare step ensures the PR comment is posted even when the test fails. The developer needs to see the comparison table to understand the regression.

The PR comment update logic finds an existing performance comment and updates it instead of creating a new comment per push. After 5 iterations on a PR, 5 performance comments would clutter the conversation.

Threshold file

{
  "endpoints": {
    "/api/rides/fare-estimate": {
      "p99_ms": 200,
      "p95_ms": 150,
      "p50_ms": 80,
      "error_rate_pct": 0.1,
      "max_regression_pct": 10
    },
    "/api/drivers/nearby": {
      "p99_ms": 300,
      "p95_ms": 200,
      "p50_ms": 100,
      "error_rate_pct": 0.1,
      "max_regression_pct": 10
    },
    "/api/rides/request": {
      "p99_ms": 500,
      "p95_ms": 350,
      "p50_ms": 200,
      "error_rate_pct": 0.5,
      "max_regression_pct": 10
    },
    "/api/trips/history": {
      "p99_ms": 500,
      "p95_ms": 350,
      "p50_ms": 150,
      "error_rate_pct": 0.1,
      "max_regression_pct": 10
    }
  },
  "baseline_updated": "2024-03-15",
  "baseline_commit": "a1b2c3d",
  "notes": "Baseline after connection pool optimization in PR #247"
}

The threshold file is version-controlled. Updating it requires a PR and review. The baseline_updated and baseline_commit fields document when the baseline changed, and notes records why. This creates an audit trail: “The p99 for fare-estimate changed from 120ms to 200ms on March 15th after the connection pool optimization in PR #247.” Note that the comparison script in the next section enforces only the absolute p50, p95, p99, and error-rate values; it does not read max_regression_pct.
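
Since the file is edited by hand in PRs, a quick consistency check can catch typos before they weaken the gate. A minimal sketch, assuming the structure shown above; validate_thresholds is a hypothetical helper, not part of the original pipeline:

```python
REQUIRED_KEYS = ("p50_ms", "p95_ms", "p99_ms", "error_rate_pct")


def validate_thresholds(doc):
    """Return a list of problems found in a parsed perf_thresholds.json document."""
    problems = []
    for key in ("baseline_updated", "baseline_commit"):
        if not doc.get(key):
            problems.append(f"missing top-level field: {key}")
    for endpoint, limits in doc.get("endpoints", {}).items():
        missing = [k for k in REQUIRED_KEYS if k not in limits]
        if missing:
            problems.append(f"{endpoint}: missing {', '.join(missing)}")
            continue
        # Percentile ceilings should be ordered: p50 <= p95 <= p99.
        if not limits["p50_ms"] <= limits["p95_ms"] <= limits["p99_ms"]:
            problems.append(f"{endpoint}: percentiles are not ordered")
        if not 0 <= limits["error_rate_pct"] <= 100:
            problems.append(f"{endpoint}: error_rate_pct outside 0-100")
    return problems


if __name__ == "__main__":
    sample = {
        "baseline_updated": "2024-03-15",
        "baseline_commit": "a1b2c3d",
        "endpoints": {
            "/api/rides/request": {
                "p50_ms": 200, "p95_ms": 350, "p99_ms": 500,
                "error_rate_pct": 0.5,
            }
        },
    }
    print(validate_thresholds(sample) or "threshold file looks consistent")
```

In CI the same function could be fed the result of json.load on perf_thresholds.json before the load test starts.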

The comparison script: detailed version

# SCALED: scripts/compare_perf.py
import csv
import json
import sys
import argparse
from datetime import datetime, timezone

def parse_locust_stats(csv_path):
    """Read Locust's *_stats.csv into a dict keyed by request name."""
    results = {}
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        for row in reader:
            name = row.get("Name", "").strip()
            if name == "Aggregated" or not name:
                continue
            request_count = int(row.get("Request Count", 0))
            failure_count = int(row.get("Failure Count", 0))
            results[name] = {
                "p50": float(row.get("50%", 0)),
                "p95": float(row.get("95%", 0)),
                "p99": float(row.get("99%", 0)),
                # Column names as written by Locust 2.x
                "avg": float(row.get("Average Response Time", 0)),
                "min": float(row.get("Min Response Time", 0)),
                "max": float(row.get("Max Response Time", 0)),
                "request_count": request_count,
                "failure_count": failure_count,
                "error_rate": (
                    failure_count / max(request_count, 1) * 100
                ),
                "rps": float(row.get("Requests/s", 0))
            }
    return results

def find_endpoint(results, endpoint_pattern):
    """Return the first result name that contains the endpoint pattern."""
    for key in results:
        if endpoint_pattern in key:
            return key
    return None

def compare(results, thresholds):
    lines = []
    overall_pass = True
    failures = []

    lines.append("## Performance Regression Report")
    lines.append("")
    # Use an actual UTC timestamp so the "UTC" label is correct on any runner
    lines.append(
        f"**Date:** {datetime.now(timezone.utc).strftime('%Y-%m-%d %H:%M UTC')}"
    )
    lines.append(
        f"**Baseline:** {thresholds.get('baseline_commit', 'unknown')} "
        f"({thresholds.get('baseline_updated', 'unknown')})"
    )
    lines.append("")
    lines.append(
        "| Endpoint | Metric | Threshold | Actual | Delta | Status |"
    )
    lines.append(
        "|----------|--------|-----------|--------|-------|--------|"
    )

    for endpoint, limits in thresholds.get("endpoints", {}).items():
        matching_key = find_endpoint(results, endpoint)

        if not matching_key:
            lines.append(
                f"| `{endpoint}` | - | - | NOT TESTED | - | SKIP |"
            )
            continue

        actual = results[matching_key]

        checks = [
            ("p99 (ms)", limits.get("p99_ms"), actual["p99"]),
            ("p95 (ms)", limits.get("p95_ms"), actual["p95"]),
            ("p50 (ms)", limits.get("p50_ms"), actual["p50"]),
            (
                "error (%)",
                limits.get("error_rate_pct"),
                actual["error_rate"]
            ),
        ]

        for metric, threshold_val, actual_val in checks:
            if threshold_val is None:
                continue
            passed = actual_val <= threshold_val
            delta_pct = (
                ((actual_val - threshold_val) / max(threshold_val, 0.001))
                * 100
            )
            delta_str = f"+{delta_pct:.0f}%" if delta_pct > 0 else f"{delta_pct:.0f}%"
            status = "PASS" if passed else "FAIL"

            if not passed:
                overall_pass = False
                failures.append(
                    f"{endpoint} {metric}: "
                    f"{actual_val:.1f} > {threshold_val}"
                )

            lines.append(
                f"| `{endpoint}` | {metric} | "
                f"{threshold_val} | {actual_val:.1f} | "
                f"{delta_str} | {status} |"
            )

    lines.append("")

    if overall_pass:
        lines.append("**Overall: PASS**")
    else:
        lines.append("**Overall: FAIL**")
        lines.append("")
        lines.append("### Failures")
        for f in failures:
            lines.append(f"- {f}")

    lines.append("")
    return "\n".join(lines)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Compare Locust results against thresholds"
    )
    parser.add_argument("--results", required=True,
                        help="Path to Locust stats CSV")
    parser.add_argument("--thresholds", required=True,
                        help="Path to threshold JSON")
    parser.add_argument("--output", required=True,
                        help="Path to output markdown")
    args = parser.parse_args()

    results = parse_locust_stats(args.results)
    with open(args.thresholds) as f:
        thresholds = json.load(f)

    report = compare(results, thresholds)

    with open(args.output, "w") as f:
        f.write(report)

    print(report)
    if "FAIL" in report:
        sys.exit(1)
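
One way to put max_regression_pct to work is to treat each stored value as a baseline measurement and allow the actual value to exceed it by at most that percentage. A sketch under that assumption; the helper names are hypothetical, and this is a variation on the script's comparison, not its current behavior:

```python
def allowed_ceiling(baseline, max_regression_pct):
    """Largest value still accepted: the baseline plus the tolerated regression."""
    return baseline * (1 + max_regression_pct / 100)


def within_tolerance(actual, baseline, max_regression_pct):
    """True when actual does not exceed the baseline by more than the tolerance."""
    return actual <= allowed_ceiling(baseline, max_regression_pct)


if __name__ == "__main__":
    limits = {"p99_ms": 200, "max_regression_pct": 10}
    for actual in (128.0, 215.0, 682.0):
        ok = within_tolerance(actual, limits["p99_ms"], limits["max_regression_pct"])
        print(f"p99 {actual:.1f} ms vs baseline {limits['p99_ms']} ms: "
              f"{'PASS' if ok else 'FAIL'}")
```

With this reading, the values in perf_thresholds.json would need to be measured baselines rather than ceilings with built-in headroom; otherwise the tolerance is applied twice.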

PR comment output

A passing PR produces a comment like this (abridged to a subset of rows):

## Performance Regression Report

**Date:** 2024-06-15 14:32 UTC
**Baseline:** a1b2c3d (2024-03-15)

| Endpoint                  | Metric   | Threshold | Actual | Delta | Status |
|---------------------------|----------|-----------|--------|-------|--------|
| /api/rides/fare-estimate  | p99 (ms) | 200       | 128.0  | -36%  | PASS   |
| /api/rides/fare-estimate  | p95 (ms) | 150       | 95.0   | -37%  | PASS   |
| /api/rides/fare-estimate  | p50 (ms) | 80        | 52.0   | -35%  | PASS   |
| /api/rides/fare-estimate  | error (%)| 0.1       | 0.0    | -100% | PASS   |
| /api/drivers/nearby       | p99 (ms) | 300       | 185.0  | -38%  | PASS   |
| /api/rides/request        | p99 (ms) | 500       | 320.0  | -36%  | PASS   |

**Overall: PASS**

A failing PR (the audit logging regression), abridged to a subset of rows:

## Performance Regression Report

**Date:** 2024-06-15 14:32 UTC
**Baseline:** a1b2c3d (2024-03-15)

| Endpoint                  | Metric   | Threshold | Actual | Delta  | Status |
|---------------------------|----------|-----------|--------|--------|--------|
| /api/rides/fare-estimate  | p99 (ms) | 200       | 682.0  | +241%  | FAIL   |
| /api/rides/fare-estimate  | p95 (ms) | 150       | 510.0  | +240%  | FAIL   |
| /api/rides/fare-estimate  | p50 (ms) | 80        | 245.0  | +206%  | FAIL   |
| /api/rides/fare-estimate  | error (%)| 0.1       | 0.0    | -100%  | PASS   |
| /api/drivers/nearby       | p99 (ms) | 300       | 188.0  | -37%   | PASS   |
| /api/rides/request        | p99 (ms) | 500       | 325.0  | -35%   | PASS   |

**Overall: FAIL**

### Failures
- /api/rides/fare-estimate p99 (ms): 682.0 > 200
- /api/rides/fare-estimate p95 (ms): 510.0 > 150
- /api/rides/fare-estimate p50 (ms): 245.0 > 80

The developer sees exactly which endpoint regressed, by how much, and at which percentile. The fare-estimate endpoint degraded at all percentiles (p50, p95, p99) while other endpoints were unaffected. This points to a change in the fare-estimate code path, not a systemic issue.

The Proof

After deploying the full CI performance gate with Docker Compose isolation:

Metric                        Before           After           Delta
False positives/month          12               0.3             -97%
False negatives/month          2.3              0.1             -96%
Test result variance           ±35%             ±5%             -86%
Developer trust in gate        Low (ignored)    High (acted on) N/A
Time to investigate failure    40 min           3 min           -92%

The variance dropped from ±35% to ±5% because resource limits eliminated runner contention as a variable. The same code produces the same latency within a 5% band across runs. A 10% threshold catches real regressions (which typically show 50%+ increases) without triggering on noise.
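
The relationship between run-to-run noise and the tolerance can be checked directly by running the same commit several times. A minimal sketch, assuming the noise band is defined as half the min-to-max spread relative to the mean; the sample values are hypothetical, not measurements from this pipeline:

```python
import statistics


def noise_band_pct(samples_ms):
    """Half the min-to-max spread of repeated runs, as a percentage of their mean."""
    mean = statistics.fmean(samples_ms)
    return (max(samples_ms) - min(samples_ms)) / 2 / mean * 100


def tolerance_is_safe(samples_ms, tolerance_pct):
    """True when the tolerance is wider than the observed run-to-run noise."""
    return noise_band_pct(samples_ms) < tolerance_pct


if __name__ == "__main__":
    # Hypothetical p99 values (ms) from five runs of the same commit.
    runs = [124.0, 128.0, 131.0, 126.0, 133.0]
    print(f"noise band: +/-{noise_band_pct(runs):.1f}%")
    print(f"10% tolerance safe: {tolerance_is_safe(runs, 10)}")
```

If the measured band approaches the tolerance, the gate will start flapping again, which is a signal to revisit the environment before loosening the threshold.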