Grafana Dashboards and Alerting
The Feature
A single Grafana dashboard shows everything the developer needs to assess the health of Marketflow:
- Request rate: How many requests per minute the API is handling
- Error rate: What percentage of requests return 5xx status codes
- Response time: P50, P95, and P99 latency
- System resources: CPU, memory, and disk usage on the VPS
- Business metrics: Active vendors, applications submitted, payments processed
Alerts fire when the error rate exceeds 5%, P95 response time exceeds one second, disk usage exceeds 80%, or memory usage exceeds 90%.
The Decision
One dashboard. Not five. Not one per service. One dashboard with five rows of panels that answers the question “is Marketflow healthy right now?” in under 10 seconds. When something is wrong, the dashboard narrows the scope: is it the API (high error rate), the database (slow response times), or the infrastructure (high CPU or disk)?
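The narrowing step can be sketched as a small triage function. This is a hypothetical illustration, not Marketflow code: the `Snapshot` structure and the `triage` name are assumptions, the error-rate, latency, and disk thresholds mirror the alert conditions above, and the 90% CPU threshold is borrowed from the safe rule in The Trap.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Values read off the dashboard at one moment (hypothetical structure)."""
    error_rate_pct: float  # percentage of 5xx responses
    p95_seconds: float     # P95 response time in seconds
    cpu_pct: float         # CPU usage percentage
    disk_pct: float        # disk usage percentage


def triage(s: Snapshot) -> list[str]:
    """Return the areas worth investigating first."""
    suspects = []
    if s.error_rate_pct > 5:
        suspects.append("api")
    if s.p95_seconds > 1:
        suspects.append("database")
    if s.cpu_pct > 90 or s.disk_pct > 80:
        suspects.append("infrastructure")
    return suspects or ["healthy"]


# Low error rate, slow responses, idle server: look at the database first.
print(triage(Snapshot(error_rate_pct=0.4, p95_seconds=2.3, cpu_pct=35, disk_pct=61)))
```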
The Implementation
Dashboard Layout (PromQL Queries)
Row 1: Traffic Overview
# Request rate (requests per minute)
sum(rate(http_requests_total[5m])) * 60
# Error rate (percentage of 5xx responses)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100
# Success rate (for the stat panel, shows green number)
100 - (
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100
)
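To make the arithmetic behind these three expressions concrete, here is a simplified sketch that works from two counter readings taken five minutes apart. The counter values are invented, and the real `rate()` function also handles counter resets and extrapolation, which this sketch ignores.

```python
WINDOW_SECONDS = 300  # matches the [5m] range used in the queries


def per_second_rate(first: float, last: float, window: float = WINDOW_SECONDS) -> float:
    """Simplified rate(): counter increase divided by the window length."""
    return (last - first) / window


# Invented counter readings at the start and end of the window.
total_first, total_last = 120_000, 121_500  # all requests
errors_first, errors_last = 300, 330        # 5xx responses only

total_rate = per_second_rate(total_first, total_last)
error_rate = per_second_rate(errors_first, errors_last)

requests_per_minute = total_rate * 60        # "* 60" converts per-second to per-minute
error_pct = error_rate / total_rate * 100    # share of requests that returned 5xx
success_pct = 100 - error_pct

print(requests_per_minute, error_pct, success_pct)
```

With these readings the API handled 1,500 requests in the window, 30 of them errors, which works out to about 300 requests per minute and a 2% error rate.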
Row 2: Latency
# P50 response time (buckets are summed across all series before the quantile is taken)
histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# P95 response time
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# P99 response time
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
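The percentiles here are estimates derived from bucket counts, not exact measurements. The following is a simplified sketch of the linear interpolation that `histogram_quantile` performs; the bucket boundaries and counts are invented, and edge cases that Prometheus handles (empty histograms, quantiles outside 0 to 1) are left out.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with +Inf, like the
    `le` series of a Prometheus histogram.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented cumulative counts: 50 requests took <= 0.1s, 90 took <= 0.5s, and so on.
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p50 = histogram_quantile(0.50, buckets)
p95 = histogram_quantile(0.95, buckets)
p99 = histogram_quantile(0.99, buckets)
print(p50, p95, p99)
```

Because the estimate is interpolated inside a bucket, its precision depends on where the bucket boundaries sit. A boundary at or near one second keeps the P95 alert threshold meaningful.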
Row 3: Endpoint Breakdown
# Slowest endpoints (P95 by endpoint)
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m]))
by (le, endpoint)
)
# Most errored endpoints
sum(rate(http_requests_total{status=~"5.."}[5m])) by (endpoint)
Row 4: System Resources (from node_exporter)
# CPU usage percentage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage percentage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Disk usage percentage
(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100
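The memory and disk expressions share one formula: one minus the available fraction, times 100. As a rough cross-check on the VPS itself, the same calculation can be sketched in Python. The function names are hypothetical, and `os.statvfs` is Unix-only; its fields should roughly correspond to what node_exporter reports for the root filesystem.

```python
import os


def usage_percent(avail_bytes: float, size_bytes: float) -> float:
    """Same formula as the PromQL above: (1 - available / total) * 100."""
    return (1 - avail_bytes / size_bytes) * 100


def root_disk_usage_percent(path: str = "/") -> float:
    """Read filesystem stats for `path` (Unix only) and apply the formula."""
    st = os.statvfs(path)
    avail = st.f_bavail * st.f_frsize  # space available to unprivileged users
    size = st.f_blocks * st.f_frsize   # total filesystem size
    return usage_percent(avail, size)


# Invented example: 8 GiB available out of 40 GiB.
example = usage_percent(8 * 2**30, 40 * 2**30)
print(example)
```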
Row 5: Business Metrics
# Active vendors (gauge)
marketflow_active_vendors
# Applications submitted in the last 24 hours (rolling window, not calendar day)
increase(marketflow_applications_total[24h])
# Payments processed in the last 24 hours
increase(marketflow_payments_total{status="succeeded"}[24h])
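These queries assume the API exposes the three business series on its metrics endpoint. In practice the prometheus_client library would do the rendering. The standard-library sketch below only shows what the scraped text might look like; the `render_metrics` function and the sample values are hypothetical.

```python
def render_metrics(active_vendors: int, applications: int, payments_succeeded: int) -> str:
    """Render three business metrics in Prometheus text exposition format."""
    lines = [
        "# TYPE marketflow_active_vendors gauge",
        f"marketflow_active_vendors {active_vendors}",
        "# TYPE marketflow_applications_total counter",
        f"marketflow_applications_total {applications}",
        "# TYPE marketflow_payments_total counter",
        f'marketflow_payments_total{{status="succeeded"}} {payments_succeeded}',
    ]
    return "\n".join(lines) + "\n"


print(render_metrics(active_vendors=42, applications=310, payments_succeeded=128))
```

The gauge can go up or down as vendors come and go, while the two counters only ever increase, which is why the dashboard wraps them in `increase()`.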
System Metrics Collection
# /etc/grafana-agent.yaml (expanded)
server:
  log_level: warn
metrics:
  configs:
    - name: marketflow
      scrape_configs:
        - job_name: marketflow-api
          scrape_interval: 60s
          static_configs:
            - targets: ["localhost:8000"]
        - job_name: node
          scrape_interval: 60s
          static_configs:
            - targets: ["localhost:9100"]
      remote_write:
        - url: https://prometheus-prod-xx.grafana.net/api/prom/push
          basic_auth:
            username: "<GRAFANA_CLOUD_USER_ID>"
            password: "<GRAFANA_CLOUD_API_KEY>"
Install node_exporter on the VPS for system metrics:
# On the Hetzner VPS
sudo apt-get install prometheus-node-exporter
sudo systemctl enable prometheus-node-exporter
sudo systemctl start prometheus-node-exporter
Alert Rules
Configure in Grafana Cloud (Alerting > Alert rules):
# Alert: High Error Rate
- alert: HighErrorRate
  expr: >
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m])) > 0.05
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Error rate above 5% for 2 minutes"
# Alert: Slow Response Times
- alert: SlowResponses
  expr: >
    histogram_quantile(0.95,
    sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "P95 response time above 1 second for 5 minutes"
# Alert: High Disk Usage
- alert: HighDiskUsage
  expr: >
    (1 - node_filesystem_avail_bytes{mountpoint="/"}
    / node_filesystem_size_bytes{mountpoint="/"}) > 0.80
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Disk usage above 80%"
# Alert: High Memory Usage
- alert: HighMemoryUsage
  expr: >
    (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.90
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Memory usage above 90%"
Alert Notification Channel
Configure a notification channel in Grafana Cloud. Options include email (free), Slack webhook, or a webhook to a custom endpoint. For a solo developer, email notifications are sufficient.
# Grafana Cloud > Alerting > Contact points
- name: developer-email
  type: email
  settings:
    addresses: "<DEVELOPER_EMAIL>"
    singleEmail: true
The Trap
# TRAP: Alerting on every metric anomaly
- alert: HighCPU
  expr: (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.50  # 50% CPU
  for: 1m
# Fires during every deployment, every database migration, every
# image optimization. Alert fatigue sets in within a week.
# The developer starts ignoring all alerts.
# SAFE: Alert only on conditions that require action
- alert: HighCPU
  expr: (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.90  # 90% CPU
  for: 10m  # Sustained for 10 minutes
# This means something is genuinely wrong, not a temporary spike
Alert fatigue kills observability. Every false positive trains the developer to ignore alerts. Set thresholds high enough that firing always means a real problem. A 50% CPU alert fires during normal operations. A 90% sustained CPU alert fires when the server is overwhelmed. Only the second one requires action.
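The difference between the two rules can be illustrated with a small simulation. This is a sketch under stated assumptions: the CPU trace is invented (one sample per minute, with a four-minute deployment spike), and `firing_minutes` is a simplified reading of the `for:` clause, not Grafana's actual evaluation loop.

```python
def firing_minutes(samples: list[float], threshold: float, for_minutes: int) -> int:
    """Count the minutes an alert would spend firing.

    `samples` holds one CPU ratio (0.0 to 1.0) per minute. The alert fires
    once the value has stayed above `threshold` for `for_minutes` in a row.
    """
    firing = 0
    streak = 0  # consecutive minutes above the threshold
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        # The first breaching sample starts the pending period; the alert
        # fires once `for_minutes` have elapsed since then.
        if streak > for_minutes:
            firing += 1
    return firing


# Invented trace: idle, a four-minute deployment spike to 70%, idle again.
deploy_trace = [0.20] * 30 + [0.70] * 4 + [0.20] * 26
# Invented trace: the server pinned at 97% for twenty minutes.
overload_trace = [0.97] * 20

print("50% / 1m rule, deploy:", firing_minutes(deploy_trace, 0.50, 1))
print("90% / 10m rule, deploy:", firing_minutes(deploy_trace, 0.90, 10))
print("90% / 10m rule, overload:", firing_minutes(overload_trace, 0.90, 10))
```

On the deployment trace the 50% rule fires while the 90% rule stays silent. On the overload trace the 90% rule fires after its ten-minute wait.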
The Cost
| Component | Cost / Free Tier Limit |
|---|---|
| Grafana Cloud | 10,000 active series, 14 day retention |
| Grafana Agent | $0 (open source) |
| node_exporter | $0 (open source) |
| prometheus_client | $0 (Python library) |
| Email alerts | Included in Grafana Cloud free tier |
The entire observability stack costs $0. Grafana Cloud’s free tier provides 10,000 active metric series. Marketflow generates roughly 500 series for its request counters (20 endpoints x 5 HTTP methods x 5 status codes), plus the latency histogram buckets and the system metrics from node_exporter. The 14-day retention is sufficient for debugging recent issues.
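The series estimate is simple multiplication, sketched below with the figures from the text. It treats every endpoint, method, and status combination as observed, so it is an upper bound for the request counter. Histogram buckets and node_exporter series are not counted because the text gives no figures for them. The `headroom` helper is hypothetical.

```python
FREE_TIER_SERIES = 10_000  # Grafana Cloud free-tier limit quoted in the table

endpoints, methods, status_codes = 20, 5, 5
# Upper bound: assumes every label combination is actually seen.
counter_series = endpoints * methods * status_codes


def headroom(used: int, limit: int = FREE_TIER_SERIES) -> float:
    """Fraction of the free-tier series budget still unused."""
    return 1 - used / limit


print(counter_series, f"{headroom(counter_series):.0%} of the budget left")
```

Labels multiply, so adding a high-cardinality label such as a user ID to the request counter would consume the budget quickly.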