Continuous Performance Testing: Locust in CI, Regression Detection, and the Baseline That Drifts
The content platform passes all functional tests. CI is green. A developer merges a PR that adds a new field to the article API response. The field requires a JOIN against the tags table. Nobody notices the latency change because nobody measured it.
Two weeks later, p99 latency on the article endpoint has crept from 180ms to 340ms. The on-call engineer looks at the commit history and sees 47 merged PRs. Which one caused the regression? Maybe it was one PR. Maybe it was three PRs that each added 50ms. The engineer spends two days bisecting commits to find the problem.
This is the cost of not running performance tests in CI. Functional tests answer “does it work?” Performance tests answer “does it work fast enough?” Both questions matter. Most teams only ask the first one.
The Problem with Manual Performance Testing
Manual performance testing follows a predictable pattern:
- A performance engineer runs Locust against a staging environment before a release
- They compare the results against their memory of what the numbers were last time
- They write a report with screenshots of Grafana dashboards
- Nobody reads the report until production is slow
This process has three failure modes.
Inconsistent environments. The staging server that ran last month’s test had 16GB of RAM. The current staging server has 8GB because someone resized it for cost savings. The test results are not comparable, but nobody knows that.
Human baseline comparison. The engineer remembers that p95 was “around 200ms.” It was actually 165ms. They see 210ms and call it acceptable. A 27% regression ships to production.
Infrequent execution. Manual tests run before major releases, maybe quarterly. Three months of merged PRs means hundreds of potential causes for any observed regression.
# SLOW: Manual performance test with no baseline comparison
# Run by a human, compared against memory, results in a PDF
from locust import HttpUser, task, between


class ManualArticleTest(HttpUser):
    wait_time = between(1, 3)

    @task
    def get_article(self):
        self.client.get("/api/articles/distributed-tracing-guide")

    @task
    def search_articles(self):
        self.client.get("/api/search?q=performance+optimization")


# Run: locust -f manual_test.py --headless -u 100 -r 10 --run-time 60s
# Results: printed to terminal, copy-pasted to Slack, lost forever
The fix is to run performance tests on every PR, compare results against a stored baseline, and fail the build when latency exceeds the budget. No human judgment required. No reports to ignore.
Designing the CI Performance Test
A CI performance test is not a production load test. Production load tests simulate thousands of users over extended periods to find capacity limits. CI performance tests simulate a small, consistent load over a short period to detect relative changes.
The design constraints for CI performance tests:
| Constraint | Production Load Test | CI Performance Test |
|---|---|---|
| Duration | 30-60 minutes | 60-120 seconds |
| Users | Hundreds to thousands | 10-50 |
| Environment | Production-like | Minimal but consistent |
| Goal | Find capacity limits | Detect regressions |
| Frequency | Monthly/quarterly | Every PR |
| Pass/fail | Human judgment | Automated threshold |
The CI test does not need to prove the system can handle 10,000 concurrent users. It needs to prove this PR did not make things slower.
# FAST: CI-oriented Locust test with structured output for automated comparison
import json
import time
from pathlib import Path

from locust import HttpUser, task, between, events

# Module-level accumulator filled by the event listeners below
RESULTS = {
    "endpoints": {},
    "start_time": None,
    "end_time": None,
    "total_requests": 0,
    "total_failures": 0,
}


class CIArticlePlatformUser(HttpUser):
    wait_time = between(0.5, 1.5)
    host = "http://localhost:8000"

    @task(5)
    def get_article(self):
        # name= groups every slug under a single stats entry
        self.client.get(
            "/api/articles/distributed-tracing-guide",
            name="/api/articles/[slug]",
        )

    @task(3)
    def search_articles(self):
        self.client.get(
            "/api/search?q=performance",
            name="/api/search",
        )

    @task(2)
    def get_recommendations(self):
        self.client.get(
            "/api/articles/distributed-tracing-guide/recommendations",
            name="/api/articles/[slug]/recommendations",
        )

    @task(1)
    def get_trending(self):
        self.client.get(
            "/api/trending",
            name="/api/trending",
        )


@events.init.add_listener
def on_init(environment, **kwargs):
    RESULTS["start_time"] = time.time()


@events.request.add_listener
def on_request(request_type, name, response_time, response_length, exception, **kwargs):
    if name not in RESULTS["endpoints"]:
        RESULTS["endpoints"][name] = {
            "response_times": [],
            "failures": 0,
        }
    endpoint = RESULTS["endpoints"][name]
    if exception:
        endpoint["failures"] += 1
        RESULTS["total_failures"] += 1
    else:
        endpoint["response_times"].append(response_time)
    # Count every request, successful or not
    RESULTS["total_requests"] += 1


@events.quitting.add_listener
def on_quitting(environment, **kwargs):
    RESULTS["end_time"] = time.time()
    output = compute_summary(RESULTS)
    output_path = Path("perf-results.json")
    output_path.write_text(json.dumps(output, indent=2))
    print(f"\nResults written to {output_path}")


def compute_summary(results):
    summary = {
        "duration_seconds": results["end_time"] - results["start_time"],
        "total_requests": results["total_requests"],
        "total_failures": results["total_failures"],
        "error_rate": results["total_failures"] / max(results["total_requests"], 1),
        "endpoints": {},
    }
    for name, data in results["endpoints"].items():
        times = sorted(data["response_times"])
        if not times:
            # No successful requests, so there are no latency samples to summarize
            continue
        summary["endpoints"][name] = {
            "count": len(times),  # successful requests only
            "failures": data["failures"],
            "p50": times[len(times) // 2],
            "p95": times[int(len(times) * 0.95)],
            "p99": times[int(len(times) * 0.99)],
            "mean": sum(times) / len(times),
            "min": times[0],
            "max": times[-1],
        }
    return summary
This test writes structured JSON output. Every field is machine-readable. No parsing terminal output with regex. No scraping HTML reports.
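The percentile fields are picked by index into the sorted samples, so their stability depends on how many requests each endpoint received during the run. The following is a minimal sketch of that effect. The helpers `index_percentile` and `samples_above` are hypothetical names that mirror the indexing in `compute_summary`, and the sample data is made up.

```python
def index_percentile(sorted_times, fraction):
    """Pick a percentile the same way compute_summary does: by index."""
    if not sorted_times:
        raise ValueError("no samples")
    return sorted_times[int(len(sorted_times) * fraction)]


def samples_above(sorted_times, fraction):
    """Count the samples that rank above the chosen percentile index."""
    return len(sorted_times) - 1 - int(len(sorted_times) * fraction)


# Hypothetical sample: 150 requests with latencies from 1 ms to 150 ms
times = sorted(float(ms) for ms in range(1, 151))
p99 = index_percentile(times, 0.99)
tail = samples_above(times, 0.99)
print(f"p99={p99} ms, samples above it: {tail}")
```

With 150 samples, a single slow request decides whether p99 moves. Low-weight endpoints such as `/api/trending` may therefore need a longer run or a looser p99 threshold than the busier endpoints.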
The Baseline File
The baseline is a JSON file checked into the repository. It contains the expected performance numbers for each endpoint.
{
  "version": 2,
  "environment": {
    "cpu_cores": 2,
    "memory_mb": 4096,
    "container_image": "content-platform:ci"
  },
  "thresholds": {
    "/api/articles/[slug]": {
      "p50_ms": 45,
      "p95_ms": 120,
      "p99_ms": 180,
      "error_rate": 0.001
    },
    "/api/search": {
      "p50_ms": 80,
      "p95_ms": 200,
      "p99_ms": 350,
      "error_rate": 0.005
    },
    "/api/articles/[slug]/recommendations": {
      "p50_ms": 60,
      "p95_ms": 150,
      "p99_ms": 250,
      "error_rate": 0.002
    },
    "/api/trending": {
      "p50_ms": 30,
      "p95_ms": 80,
      "p99_ms": 120,
      "error_rate": 0.001
    }
  },
  "regression_tolerance": 0.10,
  "block_on_regression": true
}
The regression_tolerance field is important. It means a result can be 10% worse than the baseline without failing the build. Performance tests have inherent variance. A test that runs on a shared CI runner will see different numbers depending on what else is running on the host. Without tolerance, the build flaps.
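As a worked example of the tolerance arithmetic, the sketch below computes the highest passing value for the `/api/search` thresholds in the baseline above. The function names `max_allowed_ms` and `is_regression` are illustrative, not part of any library.

```python
def max_allowed_ms(threshold_ms, tolerance):
    """Highest latency that still passes for a given baseline threshold."""
    return threshold_ms * (1 + tolerance)


def is_regression(actual_ms, threshold_ms, tolerance=0.10):
    """True when the measured value exceeds the threshold plus tolerance."""
    return actual_ms > max_allowed_ms(threshold_ms, tolerance)


# /api/search thresholds taken from the baseline file
search_thresholds = {"p50_ms": 80, "p95_ms": 200, "p99_ms": 350}
for key, value in search_thresholds.items():
    print(key, round(max_allowed_ms(value, 0.10), 1))
```

A measured p95 of 210ms passes against a 200ms threshold because the limit is 220ms, while 285ms fails.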
The Comparison Script
The comparison script reads the baseline and the test results, then decides pass or fail.
# compare_perf.py: Automated regression detection
import json
import sys
from pathlib import Path


def load_json(path: str) -> dict:
    return json.loads(Path(path).read_text())


def compare(baseline: dict, results: dict) -> tuple[bool, list[str]]:
    tolerance = baseline.get("regression_tolerance", 0.10)
    passed = True
    messages = []
    for endpoint, thresholds in baseline["thresholds"].items():
        actual = results["endpoints"].get(endpoint)
        if not actual:
            messages.append(f"SKIP {endpoint}: no data in results")
            continue
        for metric in ["p50", "p95", "p99"]:
            threshold_key = f"{metric}_ms"
            if threshold_key not in thresholds:
                continue
            threshold_value = thresholds[threshold_key]
            actual_value = actual.get(metric, 0)
            max_allowed = threshold_value * (1 + tolerance)
            if actual_value > max_allowed:
                passed = False
                pct_over = ((actual_value - threshold_value) / threshold_value) * 100
                messages.append(
                    f"FAIL {endpoint} {metric}: "
                    f"{actual_value:.1f}ms > {max_allowed:.1f}ms "
                    f"(baseline {threshold_value}ms, +{pct_over:.1f}%)"
                )
            else:
                messages.append(
                    f"PASS {endpoint} {metric}: "
                    f"{actual_value:.1f}ms <= {max_allowed:.1f}ms"
                )
        if "error_rate" in thresholds:
            # "count" holds successful requests only, so add failures to get the total
            failures = actual.get("failures", 0)
            total = actual.get("count", 0) + failures
            actual_error_rate = failures / max(total, 1)
            if actual_error_rate > thresholds["error_rate"]:
                passed = False
                messages.append(
                    f"FAIL {endpoint} error_rate: "
                    f"{actual_error_rate:.4f} > {thresholds['error_rate']}"
                )
    return passed, messages


def main():
    baseline = load_json("perf-baseline.json")
    results = load_json("perf-results.json")
    passed, messages = compare(baseline, results)
    print("=" * 60)
    print("PERFORMANCE REGRESSION CHECK")
    print("=" * 60)
    for msg in messages:
        print(f" {msg}")
    print("=" * 60)
    if not passed:
        print("RESULT: FAILED - Performance regression detected")
        if baseline.get("block_on_regression", True):
            sys.exit(1)
        else:
            print("WARNING: block_on_regression is false, not failing build")
            sys.exit(0)
    else:
        print("RESULT: PASSED - No performance regression detected")
        sys.exit(0)


if __name__ == "__main__":
    main()
The output is explicit. Every endpoint and metric gets a PASS or FAIL line. The CI log shows exactly what regressed and by how much. No ambiguity.
============================================================
PERFORMANCE REGRESSION CHECK
============================================================
PASS /api/articles/[slug] p50: 42.0ms <= 49.5ms
PASS /api/articles/[slug] p95: 115.0ms <= 132.0ms
PASS /api/articles/[slug] p99: 172.0ms <= 198.0ms
FAIL /api/search p95: 285.0ms > 220.0ms (baseline 200ms, +42.5%)
PASS /api/search p99: 340.0ms <= 385.0ms
PASS /api/trending p50: 28.0ms <= 33.0ms
============================================================
RESULT: FAILED - Performance regression detected
The GitHub Actions Workflow
The full CI pipeline starts the application in a container, runs the Locust test, compares results against the baseline, and stores results as artifacts.
name: Performance Gate

on:
  pull_request:
    paths:
      - 'src/**'
      - 'requirements.txt'
      - 'Dockerfile'

jobs:
  performance-test:
    runs-on: ubuntu-latest
    timeout-minutes: 15
    # The PR comment step needs write access to pull requests
    permissions:
      contents: read
      pull-requests: write

    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: content_platform
          POSTGRES_USER: app
          POSTGRES_PASSWORD: ci_test_password
        ports:
          - 5432:5432
        options: >-
          --health-cmd="pg_isready"
          --health-interval=5s
          --health-timeout=3s
          --health-retries=5
      redis:
        image: redis:7
        ports:
          - 6379:6379
        options: >-
          --health-cmd="redis-cli ping"
          --health-interval=5s
          --health-timeout=3s
          --health-retries=5

    steps:
      - uses: actions/checkout@v4

      - name: Build application
        run: docker build -t content-platform:ci .

      - name: Start application
        run: |
          docker run -d --name app \
            --network host \
            -e DATABASE_URL=postgresql://app:ci_test_password@localhost:5432/content_platform \
            -e REDIS_URL=redis://localhost:6379 \
            -e ENVIRONMENT=ci \
            content-platform:ci
          # Wait for application to be ready
          for i in $(seq 1 30); do
            if curl -sf http://localhost:8000/health > /dev/null; then
              echo "Application is ready"
              exit 0
            fi
            echo "Waiting for application... ($i/30)"
            sleep 2
          done
          # Fail the step instead of load-testing an application that never started
          echo "Application did not become ready in time"
          docker logs app
          exit 1

      - name: Seed test data
        run: |
          docker exec app python scripts/seed_ci_data.py

      - name: Install Locust
        run: pip install locust

      - name: Run performance test
        run: |
          locust -f tests/perf/ci_locustfile.py \
            --headless \
            --users 20 \
            --spawn-rate 5 \
            --run-time 90s \
            --host http://localhost:8000 \
            --csv perf-results \
            --html perf-report.html

      - name: Compare against baseline
        run: python tests/perf/compare_perf.py

      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: perf-results-${{ github.sha }}
          path: |
            perf-results.json
            perf-report.html
            perf-results_stats.csv
          retention-days: 90

      - name: Comment on PR
        if: failure()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('perf-results.json', 'utf8'));
            const baseline = JSON.parse(fs.readFileSync('perf-baseline.json', 'utf8'));
            let body = '## Performance Regression Detected\n\n';
            body += '| Endpoint | Metric | Baseline | Actual | Status |\n';
            body += '|---|---|---|---|---|\n';
            for (const [endpoint, thresholds] of Object.entries(baseline.thresholds)) {
              const actual = results.endpoints[endpoint];
              if (!actual) continue;
              for (const metric of ['p50', 'p95', 'p99']) {
                const key = `${metric}_ms`;
                if (!thresholds[key]) continue;
                const maxAllowed = thresholds[key] * (1 + baseline.regression_tolerance);
                const status = actual[metric] > maxAllowed ? '**FAIL**' : 'PASS';
                body += `| ${endpoint} | ${metric} | ${thresholds[key]}ms | ${actual[metric].toFixed(1)}ms | ${status} |\n`;
              }
            }
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });
This workflow runs only when source code or dependencies change. Documentation changes do not trigger a performance job that can run for up to 15 minutes. The paths filter keeps the CI bill under control.
Performance Budgets: Block vs Warn
Not every regression warrants a blocked merge. A 5% increase in p50 latency might be acceptable if the PR adds a feature that users have been requesting. A 50% increase in p99 is never acceptable.
The baseline file can assign each metric one of two actions, `warn` or `block`:
{
  "thresholds": {
    "/api/articles/[slug]": {
      "p50_ms": 45,
      "p95_ms": 120,
      "p99_ms": 180,
      "p50_action": "warn",
      "p95_action": "block",
      "p99_action": "block"
    }
  }
}
With this configuration, a p50 regression posts a warning comment on the PR but does not block the merge. A p95 or p99 regression blocks the merge. The team decides which metrics are blockers and which are advisory.
This creates a tiered response:
- p50 regression: The median user experience got slower. Worth investigating but might be an acceptable trade-off for new functionality.
- p95 regression: One in twenty users is hitting a slow path. This is a real problem.
- p99 regression: The tail is growing. Under load, this will become the p95. Block the merge.
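The comparison script shown earlier treats every breach as a failure, so it would need to read the `*_action` keys for this tiering to take effect. The following is a minimal sketch of one way to do that. `evaluate_endpoint` is a hypothetical helper, and treating a metric with no explicit action as `block` is an assumption.

```python
def evaluate_endpoint(thresholds, actual, tolerance=0.10):
    """Return (blocked, warnings, failures) for one endpoint.

    thresholds: baseline entry, e.g. {"p50_ms": 45, "p50_action": "warn", ...}
    actual: measured values, e.g. {"p50": 52.0, "p95": 118.0, "p99": 170.0}
    """
    warnings, failures = [], []
    for metric in ("p50", "p95", "p99"):
        limit = thresholds.get(f"{metric}_ms")
        if limit is None or metric not in actual:
            continue
        max_allowed = limit * (1 + tolerance)
        if actual[metric] <= max_allowed:
            continue
        message = f"{metric}: {actual[metric]:.1f}ms > {max_allowed:.1f}ms"
        # Assumption: a metric without an explicit action blocks the merge
        if thresholds.get(f"{metric}_action", "block") == "warn":
            warnings.append(message)
        else:
            failures.append(message)
    return bool(failures), warnings, failures


article_thresholds = {
    "p50_ms": 45, "p95_ms": 120, "p99_ms": 180,
    "p50_action": "warn", "p95_action": "block", "p99_action": "block",
}
blocked, warnings, failures = evaluate_endpoint(
    article_thresholds, {"p50": 52.0, "p95": 118.0, "p99": 170.0}
)
print(blocked, warnings, failures)
```

In this example only p50 exceeds its limit, so the helper reports a warning and leaves the merge unblocked. The caller would exit non-zero only when `blocked` is true and post the warnings as a PR comment.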
Reducing Variance in CI
CI runners are shared infrastructure. The same test on the same code will produce different numbers depending on CPU contention, disk I/O from other jobs, network latency to the database container, and memory pressure.
Techniques that reduce variance:
Pin CPU and memory for the application container. Docker resource limits create a consistent ceiling.
docker run -d --name app \
  --cpus 2 \
  --memory 4g \
  --network host \
  content-platform:ci
Warm up before measuring. The first 15 seconds of a Locust test hit cold caches, uninitialized connection pools, and JIT compilation. Exclude the warmup period from results.
# In the Locust test: skip the first 15 seconds of data
@events.request.add_listener
def on_request(request_type, name, response_time, exception, **kwargs):
    elapsed = time.time() - RESULTS["start_time"]
    if elapsed < 15:
        return  # skip warmup requests
    # ... record the request
Run multiple iterations and take the median. A single 90-second run can be skewed by a transient anomaly such as a noisy neighbor on the runner. The median of three 90-second runs is more stable, at the cost of tripling the test time.
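One way to combine several runs is to take the per-metric median across their summaries before comparing against the baseline. The sketch below assumes each run produced a summary shaped like `perf-results.json`. `median_summary` is a hypothetical helper and the numbers are invented.

```python
import statistics


def median_summary(summaries):
    """Combine per-run summaries by taking the median of each metric per endpoint."""
    # Only keep endpoints that appear in every run
    common = set(summaries[0]["endpoints"])
    for summary in summaries[1:]:
        common &= set(summary["endpoints"])
    combined = {}
    for endpoint in sorted(common):
        combined[endpoint] = {
            metric: statistics.median(
                s["endpoints"][endpoint][metric] for s in summaries
            )
            for metric in ("p50", "p95", "p99")
        }
    return {"endpoints": combined}


# Three hypothetical runs; the second has an outlier p95
runs = [
    {"endpoints": {"/api/search": {"p50": 78.0, "p95": 195.0, "p99": 340.0}}},
    {"endpoints": {"/api/search": {"p50": 82.0, "p95": 285.0, "p99": 360.0}}},
    {"endpoints": {"/api/search": {"p50": 80.0, "p95": 198.0, "p99": 345.0}}},
]
print(median_summary(runs))
```

The outlier p95 of 285ms in the second run does not reach the combined result, which reports 198ms. This sketch leaves out `count` and `failures`; those would need to be summed across runs for the error-rate check.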
Use dedicated CI runners. If performance testing is critical, use self-hosted runners with consistent hardware. No neighbor noise. No CPU stealing. The cost is worth the signal quality.
Trade-offs
| Decision | Benefit | Cost |
|---|---|---|
| Run perf tests on every PR | Catch regressions immediately | CI time increases 5-10 min per PR |
| Store baseline in repo | Version-controlled, reviewable | Manual updates when intentional changes ship |
| 10% tolerance | Absorbs CI variance | Misses regressions under 10% |
| Block on p95/p99 only | Does not block feature work | p50 regressions accumulate silently |
| Container resource limits | Consistent results | Does not reflect production capacity |
| Warmup exclusion | Removes cold-start noise | Misses cold-start regressions |
The hardest trade-off is tolerance. Set it too low and the build flaps on every PR. Set it too high and real regressions slip through. Start at 10%, track how often the gate flaps (fails then passes on retry without code changes), and adjust. If it flaps more than once a week, increase tolerance. If regressions are landing in production, decrease it.
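Tracking flaps can be automated from the gate's run history. The sketch below counts a flap as a failed run followed by a passing run on the same commit, then nudges the tolerance upward when the weekly count exceeds one. The function names, the `(commit_sha, passed)` history format, and the 0.02 step size are all assumptions made for illustration.

```python
def flap_count(runs):
    """Count flaps: a failed run followed by a passing run on the same commit.

    runs: chronological list of (commit_sha, passed) tuples.
    """
    flaps = 0
    last_result = {}
    for sha, passed in runs:
        if passed and last_result.get(sha) is False:
            flaps += 1
        last_result[sha] = passed
    return flaps


def suggest_tolerance(current, flaps_per_week, step=0.02, max_flaps=1):
    """Raise tolerance by one step when the gate flaps more than max_flaps a week."""
    if flaps_per_week > max_flaps:
        return round(current + step, 2)
    return current


# Hypothetical week of gate runs
history = [
    ("a1", False), ("a1", True),                 # flap: retry passed with no code change
    ("b2", True),
    ("c3", False), ("c3", False), ("c3", True),  # flap after two failures
]
weekly_flaps = flap_count(history)
print(weekly_flaps, suggest_tolerance(0.10, weekly_flaps))
```

Lowering the tolerance is deliberately left out of the sketch, since the signal for that (regressions reaching production) comes from outside the CI history.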
The next two sections cover the full GitHub Actions integration in detail (Section 1) and long-term baseline drift management with Prometheus and Grafana (Section 2).