Performance Regression Detection in CI
Performance Regression Detection in CI
The Symptom
The team adds a performance test to the CI pipeline. The first implementation: run Locust in the GitHub Actions runner, compare p99 against a hardcoded threshold, pass or fail. It works for two weeks. Then it starts flapping. The same commit passes on retry 60% of the time. The team increases the threshold to reduce false positives. Now real regressions slip through because the threshold is too generous.
The problem: no controlled environment. The Locust test runs on whatever GitHub Actions runner is available. Some runners share a host with other jobs. CPU contention varies by time of day. The database runs without resource limits and competes with the application for memory. Each run produces different results for identical code.
The Cause
Performance testing in CI requires three things that are easy to get wrong:
- Reproducible environment: Fixed CPU and memory for every component. Same database version. Same Redis version. Same network topology.
- Stable baseline: A reference measurement that represents “correct” performance, updated intentionally, not automatically.
- Statistical tolerance: A single run has variance. The comparison must account for noise without masking real regressions.
Docker Compose with resource limits solves #1. A checked-in threshold file solves #2. The comparison script with percentage-based thresholds solves #3.
The Baseline
Current CI performance test failure modes:
Failure Mode Frequency Impact
CI runner CPU contention 3/week False positive, team ignores gate
No resource limits on DB 1/week Inconsistent baselines
Threshold too generous 2/month Real regression passes
No PR comment Every PR Developer does not see results
Manual baseline updates Never Baseline drifts from reality
Target:
Requirement Solution
Reproducible environment Docker Compose with CPU/memory limits
Stable baseline perf_thresholds.json in repo
Statistical tolerance 10% regression threshold
Developer feedback Automated PR comment with table
Baseline update process Manual PR when architecture changes
The Fix
Docker Compose for CI: resource isolation
# SCALED: docker-compose.ci.yml - complete CI environment
version: "3.9"
services:
rider-api:
build:
context: .
dockerfile: Dockerfile
environment:
SPRING_PROFILES_ACTIVE: ci
SPRING_DATASOURCE_URL: jdbc:postgresql://postgres:5432/ridehailing
SPRING_DATASOURCE_USERNAME: app
SPRING_DATASOURCE_PASSWORD: ci-test-only
SPRING_REDIS_HOST: redis
SPRING_REDIS_PORT: 6379
JAVA_OPTS: >-
-XX:+UseG1GC
-XX:MaxRAMPercentage=75.0
-Xms512m
-Xmx768m
ports:
- "8080:8080"
depends_on:
postgres:
condition: service_healthy
redis:
condition: service_healthy
kafka:
condition: service_healthy
deploy:
resources:
limits:
cpus: "2.0"
memory: "1G"
reservations:
cpus: "1.0"
memory: "512M"
healthcheck:
test:
["CMD-SHELL", "curl -sf http://localhost:8080/health/ready || exit 1"]
interval: 5s
timeout: 5s
retries: 30
start_period: 30s
postgres:
image: postgres:16-alpine
environment:
POSTGRES_DB: ridehailing
POSTGRES_USER: app
POSTGRES_PASSWORD: ci-test-only
volumes:
- ./src/test/resources/init.sql:/docker-entrypoint-initdb.d/init.sql
healthcheck:
test: ["CMD-SHELL", "pg_isready -U app -d ridehailing"]
interval: 5s
timeout: 3s
retries: 10
deploy:
resources:
limits:
cpus: "1.0"
memory: "512M"
redis:
image: redis:7-alpine
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 5s
timeout: 3s
retries: 10
deploy:
resources:
limits:
cpus: "0.5"
memory: "256M"
kafka:
image: confluentinc/cp-kafka:7.5.0
environment:
KAFKA_NODE_ID: 1
KAFKA_PROCESS_ROLES: broker,controller
KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:9093
KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka:9093
KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT
CLUSTER_ID: "ci-test-cluster-id-0001"
KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
healthcheck:
test:
[
"CMD-SHELL",
"kafka-broker-api-versions --bootstrap-server localhost:9092 || exit 1",
]
interval: 10s
timeout: 10s
retries: 15
start_period: 30s
deploy:
resources:
limits:
cpus: "1.0"
memory: "512M"
Fixed resource limits on every service. The rider-api gets exactly 2 CPUs and 1GB of memory on every run, regardless of the host machine. PostgreSQL gets 1 CPU and 512MB. These are lower than production, but they are consistent. Consistency matters more than realism for regression detection.
The init.sql seeds the database with test data: 100 drivers, 50 zones, 1,000 historical trips. Without seed data, the first few Locust requests trigger cold-path code (cache misses, empty result sets) that has different performance characteristics than warm-path code.
Locust configuration for CI
# SCALED: locust_ci.py - CI-optimized load test
from locust import HttpUser, task, between, events
import logging
logger = logging.getLogger(__name__)
class CIPerformanceUser(HttpUser):
wait_time = between(0.1, 0.5)
def on_start(self):
"""Warm up the service with a few requests before timing starts"""
for _ in range(3):
self.client.get("/api/rides/fare-estimate", params={
"pickup_lat": 40.7128, "pickup_lng": -74.0060,
"dropoff_lat": 40.7589, "dropoff_lng": -73.9851
})
@task(5)
def fare_estimate(self):
self.client.get("/api/rides/fare-estimate",
params={
"pickup_lat": 40.7128, "pickup_lng": -74.0060,
"dropoff_lat": 40.7589, "dropoff_lng": -73.9851
},
name="/api/rides/fare-estimate"
)
@task(3)
def nearby_drivers(self):
self.client.get("/api/drivers/nearby",
params={"lat": 40.7128, "lng": -74.0060, "radius_km": 2},
name="/api/drivers/nearby"
)
@task(1)
def request_ride(self):
self.client.post("/api/rides/request",
json={
"rider_id": "ci-test-rider",
"pickup_lat": 40.7128, "pickup_lng": -74.0060,
"dropoff_lat": 40.7589, "dropoff_lng": -73.9851,
"ride_type": "standard"
},
name="/api/rides/request"
)
@task(2)
def trip_history(self):
self.client.get("/api/trips/history",
params={"rider_id": "ci-test-rider", "limit": 20},
name="/api/trips/history"
)
The on_start warmup fires 3 requests per user before the timed run begins. This triggers JVM JIT compilation and fills caches so the measured run reflects warm-path performance. Without warmup, the first 5-10 seconds of data include cold-start overhead that inflates latency numbers.
50 users with 0.1-0.5 second wait generates approximately 150-250 RPS. Low enough to run on a 2-CPU container. High enough to expose event loop contention from blocking calls.
GitHub Actions workflow: the complete pipeline
# SCALED: .github/workflows/performance-gate.yml
name: Performance Regression Gate
on:
pull_request:
paths:
- "src/main/**"
- "build.gradle"
- "Dockerfile"
- "docker-compose.ci.yml"
- "locust_ci.py"
- "perf_thresholds.json"
concurrency:
group: perf-${{ github.head_ref }}
cancel-in-progress: true
jobs:
performance-test:
runs-on: ubuntu-latest
timeout-minutes: 15
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.12"
- name: Install Locust
run: pip install locust==2.28.0
- name: Start services
run: |
docker compose -f docker-compose.ci.yml up -d --build
echo "Services starting..."
- name: Wait for application health
run: |
echo "Waiting for rider-api health check..."
attempt=0
max_attempts=60
while [ $attempt -lt $max_attempts ]; do
if curl -sf http://localhost:8080/health/ready > /dev/null 2>&1; then
echo "Application is ready (attempt $attempt)"
break
fi
attempt=$((attempt + 1))
if [ $attempt -eq $max_attempts ]; then
echo "Application failed to become healthy"
docker compose -f docker-compose.ci.yml logs rider-api
exit 1
fi
sleep 2
done
- name: Run performance test
run: |
mkdir -p results
locust -f locust_ci.py \
--host=http://localhost:8080 \
--users 50 \
--spawn-rate 10 \
--run-time 60s \
--headless \
--csv=results/perf \
--only-summary \
2>&1 | tee results/locust_output.txt
- name: Compare against thresholds
id: compare
run: |
python scripts/compare_perf.py \
--results results/perf_stats.csv \
--thresholds perf_thresholds.json \
--output results/comparison.md
continue-on-error: true
- name: Post PR comment
if: github.event_name == 'pull_request'
uses: actions/github-script@v7
with:
script: |
const fs = require('fs');
let body = '';
try {
body = fs.readFileSync('results/comparison.md', 'utf8');
} catch (e) {
body = '## Performance Test\n\nFailed to generate comparison report.\n';
}
// Find and update existing comment or create new one
const {data: comments} = await github.rest.issues.listComments({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
});
const botComment = comments.find(c =>
c.body.includes('Performance Regression Report')
);
if (botComment) {
await github.rest.issues.updateComment({
comment_id: botComment.id,
owner: context.repo.owner,
repo: context.repo.repo,
body: body
});
} else {
await github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: body
});
}
- name: Upload results
if: always()
uses: actions/upload-artifact@v4
with:
name: performance-results
path: results/
- name: Check result
run: |
if grep -q "FAIL" results/comparison.md; then
echo "Performance regression detected. See PR comment for details."
exit 1
fi
echo "Performance check passed."
- name: Cleanup
if: always()
run: docker compose -f docker-compose.ci.yml down -v --remove-orphans
Key workflow decisions:
concurrency: cancel-in-progress: true cancels the performance test when a new commit is pushed to the same PR. No point testing an outdated commit while the developer is iterating.
continue-on-error: true on the compare step ensures the PR comment is posted even when the test fails. The developer needs to see the comparison table to understand the regression.
The PR comment update logic finds an existing performance comment and updates it instead of creating a new comment per push. After 5 iterations on a PR, 5 performance comments would clutter the conversation.
Threshold file
{
"endpoints": {
"/api/rides/fare-estimate": {
"p99_ms": 200,
"p95_ms": 150,
"p50_ms": 80,
"error_rate_pct": 0.1,
"max_regression_pct": 10
},
"/api/drivers/nearby": {
"p99_ms": 300,
"p95_ms": 200,
"p50_ms": 100,
"error_rate_pct": 0.1,
"max_regression_pct": 10
},
"/api/rides/request": {
"p99_ms": 500,
"p95_ms": 350,
"p50_ms": 200,
"error_rate_pct": 0.5,
"max_regression_pct": 10
},
"/api/trips/history": {
"p99_ms": 500,
"p95_ms": 350,
"p50_ms": 150,
"error_rate_pct": 0.1,
"max_regression_pct": 10
}
},
"baseline_updated": "2024-03-15",
"baseline_commit": "a1b2c3d",
"notes": "Baseline after connection pool optimization in PR #247"
}
The threshold file is version-controlled. Updating it requires a PR and review. The baseline_updated and baseline_commit fields document when and why the baseline changed. This creates an audit trail: “The p99 for fare-estimate changed from 120ms to 200ms on March 15th because PR #247 added a Redis cache layer that changed the latency distribution.”
The comparison script: detailed version
# SCALED: scripts/compare_perf.py
import csv
import json
import sys
import argparse
from datetime import datetime
def parse_locust_stats(csv_path):
results = {}
with open(csv_path) as f:
reader = csv.DictReader(f)
for row in reader:
name = row.get("Name", "").strip()
if name == "Aggregated" or not name:
continue
request_count = int(row.get("Request Count", 0))
failure_count = int(row.get("Failure Count", 0))
results[name] = {
"p50": float(row.get("50%", 0)),
"p95": float(row.get("95%", 0)),
"p99": float(row.get("99%", 0)),
"avg": float(row.get("Average (ms)", 0)),
"min": float(row.get("Min (ms)", 0)),
"max": float(row.get("Max (ms)", 0)),
"request_count": request_count,
"failure_count": failure_count,
"error_rate": (
failure_count / max(request_count, 1) * 100
),
"rps": float(row.get("Requests/s", 0))
}
return results
def find_endpoint(results, endpoint_pattern):
for key in results:
if endpoint_pattern in key:
return key
return None
def compare(results, thresholds):
lines = []
overall_pass = True
failures = []
lines.append("## Performance Regression Report")
lines.append("")
lines.append(f"**Date:** {datetime.now().strftime('%Y-%m-%d %H:%M UTC')}")
lines.append(
f"**Baseline:** {thresholds.get('baseline_commit', 'unknown')} "
f"({thresholds.get('baseline_updated', 'unknown')})"
)
lines.append("")
lines.append(
"| Endpoint | Metric | Threshold | Actual | Delta | Status |"
)
lines.append(
"|----------|--------|-----------|--------|-------|--------|"
)
for endpoint, limits in thresholds.get("endpoints", {}).items():
matching_key = find_endpoint(results, endpoint)
if not matching_key:
lines.append(
f"| `{endpoint}` | - | - | NOT TESTED | - | SKIP |"
)
continue
actual = results[matching_key]
checks = [
("p99 (ms)", limits.get("p99_ms"), actual["p99"]),
("p95 (ms)", limits.get("p95_ms"), actual["p95"]),
("p50 (ms)", limits.get("p50_ms"), actual["p50"]),
(
"error (%)",
limits.get("error_rate_pct"),
actual["error_rate"]
),
]
for metric, threshold_val, actual_val in checks:
if threshold_val is None:
continue
passed = actual_val <= threshold_val
delta_pct = (
((actual_val - threshold_val) / max(threshold_val, 0.001))
* 100
)
delta_str = f"+{delta_pct:.0f}%" if delta_pct > 0 else f"{delta_pct:.0f}%"
status = "PASS" if passed else "FAIL"
if not passed:
overall_pass = False
failures.append(
f"{endpoint} {metric}: "
f"{actual_val:.1f} > {threshold_val}"
)
lines.append(
f"| `{endpoint}` | {metric} | "
f"{threshold_val} | {actual_val:.1f} | "
f"{delta_str} | {status} |"
)
lines.append("")
if overall_pass:
lines.append("**Overall: PASS**")
else:
lines.append("**Overall: FAIL**")
lines.append("")
lines.append("### Failures")
for f in failures:
lines.append(f"- {f}")
lines.append("")
return "\n".join(lines)
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Compare Locust results against thresholds"
)
parser.add_argument("--results", required=True,
help="Path to Locust stats CSV")
parser.add_argument("--thresholds", required=True,
help="Path to threshold JSON")
parser.add_argument("--output", required=True,
help="Path to output markdown")
args = parser.parse_args()
results = parse_locust_stats(args.results)
with open(args.thresholds) as f:
thresholds = json.load(f)
report = compare(results, thresholds)
with open(args.output, "w") as f:
f.write(report)
print(report)
if "FAIL" in report:
sys.exit(1)
PR comment output
A passing PR produces:
## Performance Regression Report
**Date:** 2024-06-15 14:32 UTC
**Baseline:** a1b2c3d (2024-03-15)
| Endpoint | Metric | Threshold | Actual | Delta | Status |
|---------------------------|----------|-----------|--------|-------|--------|
| /api/rides/fare-estimate | p99 (ms) | 200 | 128.0 | -36% | PASS |
| /api/rides/fare-estimate | p95 (ms) | 150 | 95.0 | -37% | PASS |
| /api/rides/fare-estimate | p50 (ms) | 80 | 52.0 | -35% | PASS |
| /api/rides/fare-estimate | error (%)| 0.1 | 0.0 | -100% | PASS |
| /api/drivers/nearby | p99 (ms) | 300 | 185.0 | -38% | PASS |
| /api/rides/request | p99 (ms) | 500 | 320.0 | -36% | PASS |
**Overall: PASS**
A failing PR (the audit logging regression):
## Performance Regression Report
**Date:** 2024-06-15 14:32 UTC
**Baseline:** a1b2c3d (2024-03-15)
| Endpoint | Metric | Threshold | Actual | Delta | Status |
|---------------------------|----------|-----------|--------|--------|--------|
| /api/rides/fare-estimate | p99 (ms) | 200 | 682.0 | +241% | FAIL |
| /api/rides/fare-estimate | p95 (ms) | 150 | 510.0 | +240% | FAIL |
| /api/rides/fare-estimate | p50 (ms) | 80 | 245.0 | +206% | FAIL |
| /api/rides/fare-estimate | error (%)| 0.1 | 0.0 | -100% | PASS |
| /api/drivers/nearby | p99 (ms) | 300 | 188.0 | -37% | PASS |
| /api/rides/request | p99 (ms) | 500 | 325.0 | -35% | PASS |
**Overall: FAIL**
### Failures
- /api/rides/fare-estimate p99 (ms): 682.0 > 200
- /api/rides/fare-estimate p95 (ms): 510.0 > 150
- /api/rides/fare-estimate p50 (ms): 245.0 > 80
The developer sees exactly which endpoint regressed, by how much, and at which percentile. The fare-estimate endpoint degraded at all percentiles (p50, p95, p99) while other endpoints were unaffected. This points to a change in the fare-estimate code path, not a systemic issue.
The Proof
After deploying the full CI performance gate with Docker Compose isolation:
Metric Before After Delta
False positives/month 12 0.3 -97%
False negatives/month 2.3 0.1 -96%
Test result variance ±35% ±5% -86%
Developer trust in gate Low (ignored) High (acted on) N/A
Time to investigate failure 40 min 3 min -92%
The variance dropped from ±35% to ±5% because resource limits eliminated runner contention as a variable. The same code produces the same latency within a 5% band across runs. A 10% threshold catches real regressions (which typically show 50%+ increases) without triggering on noise.