CI/CD as a Safety Gate: Performance Regression Testing in the Pipeline
The Symptom
A pull request passes code review. Two senior engineers approve it. The change adds audit logging to the fare estimation endpoint. The diff is clean: a new service call that writes an audit record to PostgreSQL for every fare estimate request.
The PR merges Monday at 2 PM. Tuesday’s standup mentions nothing unusual. Wednesday morning, a product manager reports that fare estimates feel slower. The on-call engineer checks the dashboard. The fare-estimate endpoint’s p99 has climbed from 120ms to 680ms. A 5.7x regression.
The git bisect takes 40 minutes. The audit logging PR is the culprit. The audit write is synchronous: inside a reactive WebFlux chain, a blocking jdbcTemplate.update() call runs on the Netty event loop thread. Every fare estimate request blocks an event loop thread for 15-20ms while the audit record writes to disk. At 2,000 RPS, that is 30-40 seconds of blocking work arriving every second, and 4 event loop threads can absorb at most 4 seconds of it. Requests queue. Latency climbs.
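The saturation arithmetic can be checked in a few lines. This is a minimal sketch using only the figures quoted in the incident (2,000 RPS, 15-20ms per audit write, 4 event loop threads); the function name is illustrative.

```python
def event_loop_utilization(rps: float, blocking_ms: float, threads: int) -> float:
    """Blocking thread-seconds demanded per second, divided by thread-seconds available."""
    demanded_seconds = rps * blocking_ms / 1000.0
    return demanded_seconds / threads


for blocking_ms in (15, 20):
    demanded = 2000 * blocking_ms / 1000.0
    utilization = event_loop_utilization(2000, blocking_ms, 4)
    print(f"{blocking_ms}ms write: {demanded:.0f}s of blocking per second, "
          f"{utilization:.1f}x what 4 threads can supply")
# 15ms write: 30s of blocking per second, 7.5x what 4 threads can supply
# 20ms write: 40s of blocking per second, 10.0x what 4 threads can supply
```

Any ratio above 1.0 means work arrives faster than the event loop can clear it, so the queue grows for as long as the load lasts.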
The fix is a 3-line change: wrap the audit call in Mono.fromCallable().subscribeOn(Schedulers.boundedElastic()). The PR took 4 minutes to write, 15 minutes to review, and 40 hours to detect in production.
If the CI pipeline had run Locust against the PR branch, the regression would have been caught in 3 minutes.
The Cause
Code review catches logic errors, security issues, and style problems. It does not catch performance regressions. A blocking call inside a reactive chain looks correct. The types align. The tests pass. The audit record is written. No reviewer is going to mentally model event loop thread utilization at 2,000 RPS.
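The same failure mode can be reproduced in miniature with Python's asyncio, used here only as an analogy for the Netty event loop and not as the service's actual stack. In this sketch, `write_audit_record` is a hypothetical stand-in for the blocking JDBC write, and the 20ms sleep and 20 concurrent requests are assumed values chosen to keep the run short.

```python
import asyncio
import time

BLOCKING_SECONDS = 0.02  # assumed stand-in for the 15-20ms audit write
REQUESTS = 20            # assumed number of concurrent requests


def write_audit_record() -> None:
    """Hypothetical blocking I/O call."""
    time.sleep(BLOCKING_SECONDS)


async def handle_blocking() -> None:
    # Runs on the event loop thread: every other request waits behind it.
    write_audit_record()


async def handle_offloaded() -> None:
    # Runs on a worker thread: the event loop stays free to serve requests.
    await asyncio.to_thread(write_audit_record)


async def timed(handler) -> float:
    start = time.perf_counter()
    await asyncio.gather(*(handler() for _ in range(REQUESTS)))
    return time.perf_counter() - start


def run_demo() -> tuple[float, float]:
    blocking = asyncio.run(timed(handle_blocking))
    offloaded = asyncio.run(timed(handle_offloaded))
    return blocking, offloaded


if __name__ == "__main__":
    blocking, offloaded = run_demo()
    print(f"blocking on the loop: {blocking * 1000:.0f}ms")
    print(f"offloaded to threads: {offloaded * 1000:.0f}ms")
```

Both handlers look equally reasonable in a diff and both produce the audit record. Only running them under concurrency shows that the first serializes all 20 requests.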
Performance regression testing requires running the application under load and measuring the results. This is a CI problem, not a review problem. The pipeline should:
- Build the application from the PR branch
- Deploy it in a controlled environment (Docker Compose)
- Run Locust with a defined load profile
- Compare results against a baseline threshold
- Fail the build if any threshold is breached
- Post a comparison table on the PR
The test does not need production-scale load. 50 concurrent users for 60 seconds is enough to detect a 5x regression. The goal is not to find the exact breaking point. The goal is to catch obvious regressions before they reach production.
The diagram above shows how a performance gate fits into the CI/CD pipeline as a hard decision point. After the Locust load test runs, the pipeline checks p99 latency against the baseline threshold. If the PR’s p99 is within 10% of the baseline, the deploy proceeds through staging and canary to production. If it exceeds the threshold, the deploy is blocked, the team is alerted, and a regression report is posted on the PR. This automated gate catches regressions like the audit logging incident in 3 minutes instead of 40 hours.
The Baseline
Current CI pipeline:
| Step | Duration | Catches |
|------|----------|---------|
| Compile | 45s | Syntax errors |
| Unit tests | 90s | Logic errors |
| Integration tests | 120s | API contract violations |
| Static analysis | 30s | Code style, security |
| Container build | 60s | Dockerfile issues |
| Total | ~6 min | Everything except performance |
Missing step: performance regression test.
Target pipeline with performance gate:
| Step | Duration | Catches |
|------|----------|---------|
| Compile | 45s | Syntax errors |
| Unit tests | 90s | Logic errors |
| Integration tests | 120s | API contract violations |
| Static analysis | 30s | Code style, security |
| Container build | 60s | Dockerfile issues |
| Performance test | 180s | Latency and throughput regressions |
| Total | ~9 min | Complete coverage |
3 minutes added to the pipeline. Catches regressions that take 40 hours to detect in production.
Performance thresholds for the rider API:
{
  "endpoints": {
    "GET /api/rides/fare-estimate": {
      "p99_ms": 200,
      "p95_ms": 150,
      "p50_ms": 80,
      "error_rate_pct": 0.1,
      "max_regression_pct": 10
    },
    "GET /api/drivers/nearby": {
      "p99_ms": 300,
      "p95_ms": 200,
      "p50_ms": 100,
      "error_rate_pct": 0.1,
      "max_regression_pct": 10
    },
    "POST /api/rides/request": {
      "p99_ms": 500,
      "p95_ms": 350,
      "p50_ms": 200,
      "error_rate_pct": 0.1,
      "max_regression_pct": 10
    }
  }
}
max_regression_pct: 10 means the PR’s p99 must not exceed the baseline by more than 10%. A baseline of 120ms allows up to 132ms. The audit logging PR’s 680ms is 467% above the baseline, far outside the 10% allowance.
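The comparison script later in this section enforces only the absolute limits (p99_ms and so on), so the relative check is sketched separately here. It assumes a baseline p99 is available from somewhere, for example a stored result from the main branch; the function names and the source of the baseline are illustrative, not part of the pipeline as described.

```python
def allowed_p99(baseline_ms: float, max_regression_pct: float) -> float:
    """Highest p99 the gate accepts for a given baseline."""
    return baseline_ms * (1 + max_regression_pct / 100.0)


def regression_pct(baseline_ms: float, actual_ms: float) -> float:
    """How far the measured p99 sits above (positive) or below (negative) the baseline."""
    return (actual_ms - baseline_ms) / baseline_ms * 100.0


def within_budget(baseline_ms: float, actual_ms: float, max_regression_pct: float) -> bool:
    return actual_ms <= allowed_p99(baseline_ms, max_regression_pct)


print(round(allowed_p99(120, 10), 1))         # 132.0
print(round(regression_pct(120, 680)))        # 467
print(within_budget(120, 680, 10))            # False
```

The absolute and relative checks answer different questions. The absolute limit guards the service-level target, while the relative limit catches a change that stays under the target but still makes the endpoint noticeably slower than it was.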
The Fix
Docker Compose for CI
# SCALED: docker-compose.ci.yml
version: "3.9"
services:
  rider-api:
    build:
      context: .
      dockerfile: Dockerfile
    environment:
      SPRING_PROFILES_ACTIVE: ci
      SPRING_DATASOURCE_URL: jdbc:postgresql://postgres:5432/ridehailing
      SPRING_REDIS_HOST: redis
    ports:
      - "8080:8080"
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    deploy:
      resources:
        limits:
          cpus: "2.0"
          memory: "1G"
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: ridehailing
      POSTGRES_USER: app
      POSTGRES_PASSWORD: ci-test-only
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U app -d ridehailing"]
      interval: 5s
      timeout: 3s
      retries: 10
    deploy:
      resources:
        limits:
          cpus: "1.0"
          memory: "512M"
  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 10
    deploy:
      resources:
        limits:
          cpus: "0.5"
          memory: "256M"
Resource limits are fixed. Every CI run gets the same CPU and memory. Without fixed limits, a CI run on a busy runner might get less CPU, producing slower results that look like a regression. Fixed limits ensure reproducibility.
Locust configuration for CI
# SCALED: Locust test for CI performance gate
from locust import HttpUser, task, between


class CIPerformanceUser(HttpUser):
    wait_time = between(0.1, 0.5)

    @task(5)
    def fare_estimate(self):
        self.client.get(
            "/api/rides/fare-estimate",
            params={
                "pickup_lat": 40.7128, "pickup_lng": -74.0060,
                "dropoff_lat": 40.7589, "dropoff_lng": -73.9851,
            },
            name="/api/rides/fare-estimate",
        )

    @task(3)
    def nearby_drivers(self):
        self.client.get(
            "/api/drivers/nearby",
            params={"lat": 40.7128, "lng": -74.0060, "radius_km": 2},
            name="/api/drivers/nearby",
        )

    @task(1)
    def request_ride(self):
        self.client.post(
            "/api/rides/request",
            json={
                "rider_id": "ci-test-rider",
                "pickup_lat": 40.7128, "pickup_lng": -74.0060,
                "dropoff_lat": 40.7589, "dropoff_lng": -73.9851,
                "ride_type": "standard",
            },
            name="/api/rides/request",
        )
50 users, 60 seconds, headless:
locust -f locust_ci.py \
  --host=http://localhost:8080 \
  --users 50 \
  --spawn-rate 10 \
  --run-time 60s \
  --headless \
  --csv=results/perf \
  --only-summary
50 users instead of the 10,000 in staging. The goal is not to find the capacity limit. The goal is to detect latency changes. A blocking call that adds 15ms to every request is visible at 50 users: p99 rises to several times the 120ms baseline, because event loop thread contention scales with concurrency, not just with total load.
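A rough estimate shows why 50 users for 60 seconds yields enough samples for a stable p99. This sketch assumes a mean wait of 0.3s (the midpoint of between(0.1, 0.5)), a mean response time of about 100ms, and ignores the 5-second ramp-up; those inputs and the function name are assumptions for illustration, not measurements.

```python
def expected_samples(users: int, mean_wait_s: float, mean_latency_s: float,
                     run_time_s: float, task_weight: int, total_weight: int) -> float:
    """Approximate number of requests one task receives during a Locust run."""
    per_user_rps = 1.0 / (mean_wait_s + mean_latency_s)  # one request per wait + response cycle
    total_requests = users * per_user_rps * run_time_s
    return total_requests * task_weight / total_weight


# fare_estimate has weight 5 out of 5 + 3 + 1 = 9
fare_samples = expected_samples(50, 0.3, 0.1, 60, task_weight=5, total_weight=9)
print(f"fare-estimate samples: ~{fare_samples:.0f}")
print(f"samples above p99:     ~{fare_samples / 100:.0f}")
```

Under these assumptions the fare-estimate endpoint receives roughly 4,000 requests, so the p99 rests on about 40 tail samples. That is coarse, but it is enough to separate a 10% drift from a several-fold regression.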
GitHub Actions workflow
# SCALED: .github/workflows/performance-gate.yml
name: Performance Regression Gate
on:
  pull_request:
    paths:
      - "src/**"
      - "build.gradle"
      - "Dockerfile"
# The PR comment step needs write access to the pull request.
permissions:
  contents: read
  pull-requests: write
jobs:
  performance-test:
    runs-on: ubuntu-latest
    timeout-minutes: 15
    steps:
      - uses: actions/checkout@v4
      - name: Start services
        run: docker compose -f docker-compose.ci.yml up -d --build
      - name: Wait for health
        run: |
          echo "Waiting for rider-api to be healthy..."
          for i in $(seq 1 60); do
            if curl -sf http://localhost:8080/health/ready > /dev/null 2>&1; then
              echo "Service is ready"
              break
            fi
            if [ "$i" -eq 60 ]; then
              echo "Service failed to start"
              docker compose -f docker-compose.ci.yml logs rider-api
              exit 1
            fi
            sleep 2
          done
      - name: Run Locust
        run: |
          pip install locust
          mkdir -p results
          locust -f locust_ci.py \
            --host=http://localhost:8080 \
            --users 50 \
            --spawn-rate 10 \
            --run-time 60s \
            --headless \
            --csv=results/perf \
            --only-summary
      - name: Compare against thresholds
        id: compare
        run: |
          python scripts/compare_perf.py \
            --results results/perf_stats.csv \
            --thresholds perf_thresholds.json \
            --output results/comparison.md
      - name: Post PR comment
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const path = 'results/comparison.md';
            // No report exists if an earlier step failed before the comparison ran.
            if (!fs.existsSync(path)) {
              return;
            }
            const comparison = fs.readFileSync(path, 'utf8');
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: comparison
            });
      - name: Fail if regression detected
        run: |
          if grep -q "FAIL" results/comparison.md; then
            echo "Performance regression detected"
            exit 1
          fi
      - name: Cleanup
        if: always()
        run: docker compose -f docker-compose.ci.yml down -v
Comparison script
# SCALED: scripts/compare_perf.py
import argparse
import csv
import json
import sys


def parse_locust_stats(csv_path):
    """Read Locust's *_stats.csv into a dict keyed by "<Type> <Name>"."""
    results = {}
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        for row in reader:
            name = row.get("Name", "")
            if name == "Aggregated" or not name:
                continue
            requests = float(row.get("Request Count") or 0)
            failures = float(row.get("Failure Count") or 0)
            # Key by method and path, e.g. "GET /api/rides/fare-estimate",
            # so the keys line up with the entries in perf_thresholds.json.
            key = f"{row.get('Type', '')} {name}".strip()
            results[key] = {
                "p50": float(row.get("50%") or 0),
                "p95": float(row.get("95%") or 0),
                "p99": float(row.get("99%") or 0),
                "avg": float(row.get("Average Response Time") or 0),
                "error_rate": failures / max(requests, 1) * 100,
                "rps": float(row.get("Requests/s") or 0),
            }
    return results


def compare(results, thresholds):
    """Build the markdown report; any row over its limit is marked FAIL."""
    output_lines = []
    overall_pass = True
    output_lines.append("## Performance Regression Report\n")
    output_lines.append(
        "| Endpoint | Metric | Threshold | Actual | Status |"
    )
    output_lines.append(
        "|----------|--------|-----------|--------|--------|"
    )
    for endpoint, limits in thresholds.get("endpoints", {}).items():
        # Report the path only, e.g. "/api/rides/fare-estimate".
        label = endpoint.split(" ", 1)[-1]
        actual = results.get(endpoint)
        if actual is None:
            output_lines.append(
                f"| {label} | - | - | NOT FOUND | SKIP |"
            )
            continue
        checks = [
            ("p99", limits.get("p99_ms"), actual["p99"]),
            ("p95", limits.get("p95_ms"), actual["p95"]),
            ("p50", limits.get("p50_ms"), actual["p50"]),
            ("error_rate", limits.get("error_rate_pct"), actual["error_rate"]),
        ]
        for metric, threshold_val, actual_val in checks:
            if threshold_val is None:
                continue
            passed = actual_val <= threshold_val
            status = "PASS" if passed else "FAIL"
            if not passed:
                overall_pass = False
            output_lines.append(
                f"| {label} | {metric} | "
                f"{threshold_val} | {actual_val:.1f} | {status} |"
            )
    verdict = "PASS" if overall_pass else "FAIL"
    output_lines.append(f"\n**Overall: {verdict}**\n")
    return "\n".join(output_lines)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--results", required=True)
    parser.add_argument("--thresholds", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args()

    results = parse_locust_stats(args.results)
    with open(args.thresholds) as f:
        thresholds = json.load(f)

    report = compare(results, thresholds)
    with open(args.output, "w") as f:
        f.write(report)
    print(report)

    # A non-zero exit code fails the CI step.
    if "FAIL" in report:
        sys.exit(1)
The PR that introduced the regression
The audit logging change that would have been caught:
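// BOTTLENECK: Blocking call in reactive chain
import java.time.Instant;

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;
import reactor.core.publisher.Mono;

@Service
public class FareEstimationService {

    private final JdbcTemplate jdbcTemplate;

    public FareEstimationService(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    public Mono<FareEstimate> estimate(FareRequest request) {
        // calculateFare(...) is the existing reactive fare calculation (not shown)
        return calculateFare(request)
            .map(fare -> {
                // This blocks the Netty event loop thread
                jdbcTemplate.update(
                    "INSERT INTO audit_log (endpoint, request_id, timestamp) VALUES (?, ?, ?)",
                    "/api/rides/fare-estimate",
                    request.requestId(),
                    Instant.now()
                );
                return fare;
            });
    }
}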
The CI performance test result for this PR:
| Endpoint | Metric | Threshold | Actual | Status |
|---------------------------|--------|-----------|---------|--------|
| /api/rides/fare-estimate | p99 | 200 | 682.0 | FAIL |
| /api/rides/fare-estimate | p95 | 150 | 510.0 | FAIL |
| /api/rides/fare-estimate | p50 | 80 | 245.0 | FAIL |
**Overall: FAIL**
The PR would be blocked. The fix:
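// SCALED: Non-blocking audit in reactive chain
import java.time.Instant;

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;
import reactor.core.publisher.Mono;
import reactor.core.scheduler.Schedulers;

@Service
public class FareEstimationService {

    private final JdbcTemplate jdbcTemplate;

    public FareEstimationService(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    public Mono<FareEstimate> estimate(FareRequest request) {
        // calculateFare(...) is the existing reactive fare calculation (not shown)
        return calculateFare(request)
            .flatMap(fare ->
                // The blocking write runs on a boundedElastic worker, not the event loop
                Mono.fromCallable(() -> {
                    jdbcTemplate.update(
                        "INSERT INTO audit_log (endpoint, request_id, timestamp) VALUES (?, ?, ?)",
                        "/api/rides/fare-estimate",
                        request.requestId(),
                        Instant.now()
                    );
                    return fare;
                }).subscribeOn(Schedulers.boundedElastic())
            );
    }
}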
The updated PR’s CI result:
| Endpoint | Metric | Threshold | Actual | Status |
|---------------------------|--------|-----------|--------|--------|
| /api/rides/fare-estimate | p99 | 200 | 128.0 | PASS |
| /api/rides/fare-estimate | p95 | 150 | 95.0 | PASS |
| /api/rides/fare-estimate | p50 | 80 | 52.0 | PASS |
**Overall: PASS**
The Proof
After adding the performance gate to CI:
| Metric | Before CI Gate | After CI Gate | Delta |
|--------|----------------|---------------|-------|
| Performance regressions in prod | 2.3/month | 0.1/month | -96% |
| Mean time to detect regression | 40 hours | 3 minutes | -99.9% |
| CI pipeline duration | 6 min | 9 min | +50% |
| PRs blocked by perf gate (6 months) | N/A | 14 | N/A |
| False positives (6 months) | N/A | 2 | N/A |
14 PRs blocked in 6 months. 12 were real regressions (blocking calls, missing indexes, excessive serialization). 2 were false positives caused by CI runner resource contention. The false positive rate of 14% is acceptable because the developer can re-run the pipeline to confirm. A real regression fails consistently; a runner contention issue is intermittent.
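The re-run heuristic can be written down as a small policy. This is one possible sketch, not the pipeline's actual behavior: `run_gate` is a hypothetical callable that executes the load test once and returns True when every threshold passes, and the choice of three attempts is an assumption.

```python
from typing import Callable


def confirm_regression(run_gate: Callable[[], bool], attempts: int = 3) -> bool:
    """Return True only if the gate fails on every attempt.

    A real regression fails consistently, so a single clean pass is treated
    as evidence that the earlier failure came from runner contention.
    """
    for _ in range(attempts):
        if run_gate():
            return False
    return True


# Simulated outcomes: True means the gate passed on that run.
flaky_runner = iter([False, True])
real_regression = iter([False, False, False])

print(confirm_regression(lambda: next(flaky_runner)))     # False
print(confirm_regression(lambda: next(real_regression)))  # True
```

The trade-off is time: confirming a real regression this way costs three load-test runs instead of one, so a policy like this fits better as a manual re-run than as the default path.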
The 3-minute addition to pipeline time is invisible to developers. The 40-hour mean-time-to-detection was not.
CH15-S1 covers the Docker Compose setup, comparison script, and GitHub Actions workflow in detail. CH15-S2 covers trend tracking with SQLite, gradual drift detection, and the GitLab CI equivalent.