Observability on the Cheap

A production application without observability is a box with no windows. Requests go in, responses come out, and when something breaks, the only signal is a customer complaint. Observability on a budget means three things: knowing when errors happen (Sentry), knowing how the system is performing (Grafana Cloud), and knowing what the application did (structured logs).

The free tiers of Sentry and Grafana Cloud provide more than enough visibility for Marketflow’s first hundred customers. Sentry catches exceptions, groups them, and provides stack traces with local variables. Grafana Cloud collects system and application metrics, displays them on dashboards, and sends alerts.

The Feature

When a vendor’s application submission fails, the developer receives a Sentry alert within 60 seconds. The alert includes the full stack trace, the request payload, the authenticated user’s ID (no email or other PII), and the database query that failed. The Grafana dashboard shows request rate, error rate, response time percentiles, and system resource usage. If the error rate exceeds 5% or response times exceed one second, the developer receives a notification.

The Decision

Sentry for errors, not logs. Sentry excels at error tracking: deduplication, grouping, stack traces with context, release tracking. Using Sentry for general logging wastes the event quota and makes real errors harder to find. Errors go to Sentry. Application logs stay in Docker container logs, queryable with docker compose logs.

Grafana Cloud for metrics, not self-hosted Grafana. Self-hosted Grafana requires a Prometheus instance, persistent storage for metrics data, and ongoing maintenance. Grafana Cloud’s free tier includes 10,000 active metric series, 50 GB of logs, and 50 GB of traces. At Marketflow’s scale, this is more than sufficient.

The Implementation

Sentry Setup

# Install the Sentry SDK (quotes keep the shell from globbing the brackets)
cd backend && uv add "sentry-sdk[fastapi]"

# backend/app/main.py
import sentry_sdk
from sentry_sdk.integrations.fastapi import FastApiIntegration
from sentry_sdk.integrations.sqlalchemy import SqlalchemyIntegration

from app.config import settings

if settings.sentry_dsn:
    sentry_sdk.init(
        dsn=settings.sentry_dsn,
        integrations=[
            FastApiIntegration(transaction_style="endpoint"),
            SqlalchemyIntegration(),
        ],
        traces_sample_rate=0.1,  # Sample 10% of transactions for performance
        profiles_sample_rate=0.1,
        environment=settings.environment,
        release=settings.app_version,
        # Don't send PII (emails, IPs) to Sentry
        send_default_pii=False,
    )

Sentry Context Enrichment

# backend/app/middleware/sentry_context.py
import sentry_sdk
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request


class SentryContextMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
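        # Add request context. Query strings can carry PII (emails, tokens);
        # drop "query_params" if that applies to your routes.
        sentry_sdk.set_context("request_info", {
            "path": request.url.path,
            "method": request.method,
            "query_params": dict(request.query_params),
        })

        # Add user context if authenticated (ID only, no email).
        # This runs before call_next, so it only sees a user that an
        # outer middleware has already placed on request.state.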
        if hasattr(request.state, "user") and request.state.user:
            sentry_sdk.set_user({"id": str(request.state.user.id)})

        response = await call_next(request)
        return response

Structured Logging

# backend/app/logging_config.py
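import json
import logging
from datetime import datetime, timezone


class JSONFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        log_entry = {
            # Use the record's own creation time as timezone-aware UTC
            # (datetime.utcnow() is deprecated since Python 3.12)
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),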
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }

        if record.exc_info and record.exc_info[0]:
            log_entry["exception"] = self.formatException(record.exc_info)

        # Add extra fields if present
        for key in ("request_id", "user_id", "market_id", "duration_ms"):
            if hasattr(record, key):
                log_entry[key] = getattr(record, key)

        # default=str keeps UUIDs and other non-JSON types from raising
        return json.dumps(log_entry, default=str)


def configure_logging() -> None:
    handler = logging.StreamHandler()
    handler.setFormatter(JSONFormatter())

    root_logger = logging.getLogger()
    root_logger.handlers = [handler]
    root_logger.setLevel(logging.INFO)

    # Reduce noise from libraries
    logging.getLogger("uvicorn.access").setLevel(logging.WARNING)
    logging.getLogger("sqlalchemy.engine").setLevel(logging.WARNING)

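JSON-formatted logs are searchable by piping docker compose logs into jq, where plain text logs would need grep and regular expressions. The structured format makes it possible to filter by request ID, user ID, or any other field. Two details matter in practice: --no-log-prefix strips the service-name prefix that Compose puts in front of every line, and jq -R with fromjson? skips lines that are not valid JSON (startup banners, for example) instead of aborting.

# Search for errors in the last hour
docker compose logs --no-log-prefix --since 1h backend | jq -R 'fromjson? | select(.level == "ERROR")'

# Find all requests for a specific user
docker compose logs --no-log-prefix --since 1h backend | jq -R 'fromjson? | select(.user_id == "550e8400-...")'

# Find slow requests
docker compose logs --no-log-prefix --since 1h backend | jq -R 'fromjson? | select(.duration_ms > 200)'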
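
The filters above only work if the application attaches those fields when it logs. With the standard library, that is done through the extra argument, whose keys become attributes on the LogRecord. The following is a small self-contained sketch: MiniJSONFormatter is a trimmed stand-in for the JSONFormatter shown earlier, and the logger name and field values are made up for illustration.

```python
import io
import json
import logging


class MiniJSONFormatter(logging.Formatter):
    """Trimmed stand-in for the chapter's JSONFormatter, so this runs alone."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {"level": record.levelname, "message": record.getMessage()}
        for key in ("request_id", "user_id", "market_id", "duration_ms"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry, default=str)


stream = io.StringIO()  # stands in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(MiniJSONFormatter())

logger = logging.getLogger("marketflow.demo")  # hypothetical logger name
logger.handlers = [handler]
logger.setLevel(logging.INFO)
logger.propagate = False

# Keys passed via `extra` become LogRecord attributes the formatter can copy.
logger.info(
    "application submitted",
    extra={"request_id": "req-123", "market_id": "mkt-9", "duration_ms": 184},
)

logged = json.loads(stream.getvalue())
```

Fields that are not passed (user_id here) are simply absent from the JSON line, which is why the jq filters select on a field's value rather than assuming it exists.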

Grafana Cloud Metrics

# backend/app/metrics.py
from prometheus_client import Counter, Histogram, Gauge

# Request metrics
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"],
)

REQUEST_DURATION = Histogram(
    "http_request_duration_seconds",
    "HTTP request duration in seconds",
    ["method", "endpoint"],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5],
)

# Business metrics
ACTIVE_VENDORS = Gauge(
    "marketflow_active_vendors",
    "Number of active vendors",
)

APPLICATIONS_SUBMITTED = Counter(
    "marketflow_applications_total",
    "Total vendor applications submitted",
    ["market_id"],
)

PAYMENTS_PROCESSED = Counter(
    "marketflow_payments_total",
    "Total payments processed",
    ["status"],
)
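
The request counter and duration histogram are what the alert thresholds from The Feature (error rate above 5%, response time above one second) are computed from. In practice that rule would live in Grafana's alerting, not in application code; the Python sketch below only makes the arithmetic concrete. It assumes the response-time threshold applies to the 95th percentile, which the text does not specify, and it estimates that percentile by linear interpolation inside cumulative buckets, similar in spirit to what PromQL's histogram_quantile does. All function names are illustrative.

```python
import math
from typing import List, Tuple

Bucket = Tuple[float, int]  # (upper bound in seconds, cumulative count)


def estimate_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    `buckets` must be sorted by upper bound and end with (inf, total).
    Interpolates linearly inside the bucket that contains the target rank.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank and count > lower_count:
            if math.isinf(upper_bound):
                # Target lies past the last finite bucket: report its bound.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


def should_alert(total_requests: int, error_requests: int, buckets: List[Bucket]) -> bool:
    """True if the error rate exceeds 5% or the estimated p95 exceeds 1 second."""
    error_rate = error_requests / total_requests if total_requests else 0.0
    p95 = estimate_quantile(0.95, buckets)
    return error_rate > 0.05 or p95 > 1.0


# Made-up window of 100 requests, most of them fast.
fast_window = [(0.1, 50), (0.5, 90), (1.0, 98), (2.5, 100), (math.inf, 100)]
# Made-up window where half the requests took longer than one second.
slow_window = [(0.5, 10), (1.0, 50), (2.5, 100), (math.inf, 100)]
```

Because the estimate can only be as precise as the bucket boundaries, having a bucket edge exactly at the alert threshold (1.0 in REQUEST_DURATION) keeps the comparison meaningful.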

Metrics Middleware

# backend/app/middleware/metrics.py
import re
import time

from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request

from app.metrics import REQUEST_COUNT, REQUEST_DURATION

# Compiled once at import time; IGNORECASE also catches uppercase UUIDs
UUID_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


class MetricsMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        start = time.perf_counter()
        response = await call_next(request)
        duration = time.perf_counter() - start

        # Normalize path to avoid cardinality explosion:
        # /markets/550e8400-... becomes /markets/{id}
        path = UUID_PATTERN.sub("{id}", request.url.path)

        REQUEST_COUNT.labels(
            method=request.method,
            endpoint=path,
            status=response.status_code,
        ).inc()

        REQUEST_DURATION.labels(
            method=request.method,
            endpoint=path,
        ).observe(duration)

        return response
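
The path normalization in the middleware can be pulled into a small function and checked on its own. The sketch below does that. The numeric-segment rule is an assumption added for illustration (the middleware above only handles UUIDs), as are the names normalize_path and NUMERIC_SEGMENT_RE.

```python
import re

UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)
# Assumption: some routes may use integer IDs. Matches a path segment made
# up only of digits, so "v2" in /api/v2 is left alone.
NUMERIC_SEGMENT_RE = re.compile(r"(?<=/)\d+(?=/|$)")


def normalize_path(path: str) -> str:
    """Collapse identifier-like path segments into a fixed placeholder."""
    path = UUID_RE.sub("{id}", path)
    return NUMERIC_SEGMENT_RE.sub("{id}", path)


examples = {
    "/markets/550e8400-e29b-41d4-a716-446655440000/vendors": None,
    "/vendors/1234": None,
    "/api/v2/markets": None,
}
for raw in examples:
    examples[raw] = normalize_path(raw)
```

Pattern-based normalization still lets arbitrary unmatched paths through (for example, bots probing random URLs), so it may be worth mapping 404 responses to a single endpoint label as well.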

Prometheus Metrics Endpoint

# backend/app/routers/metrics.py
from fastapi import APIRouter
from prometheus_client import generate_latest, CONTENT_TYPE_LATEST
from starlette.responses import Response

router = APIRouter()


@router.get("/metrics")
async def metrics():
    """Prometheus metrics endpoint.
    Scraped by Grafana Cloud Agent every 60 seconds."""
    return Response(
        content=generate_latest(),
        media_type=CONTENT_TYPE_LATEST,
    )

Grafana Cloud Agent Configuration

# /etc/grafana-agent.yaml (on the Hetzner VPS)
server:
  log_level: warn

metrics:
  configs:
    - name: marketflow
      scrape_configs:
        - job_name: marketflow-api
          scrape_interval: 60s
          static_configs:
            - targets: ["localhost:8000"]
      remote_write:
        - url: https://prometheus-prod-xx.grafana.net/api/prom/push
          basic_auth:
            username: "<GRAFANA_CLOUD_USER_ID>"
            password: "<GRAFANA_CLOUD_API_KEY>"

Health Check

# backend/app/routers/health.py
from fastapi import APIRouter, Depends
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession

from app.database import get_db
from app.services.cache import cache_get

router = APIRouter()


@router.get("/health")
async def health_check(db: AsyncSession = Depends(get_db)):
    checks = {}

    # Database connectivity
    try:
        await db.execute(text("SELECT 1"))
        checks["database"] = "healthy"
    except Exception as e:
        checks["database"] = f"unhealthy: {e}"

    # Redis connectivity
    try:
        await cache_get("health_check_ping")
        checks["redis"] = "healthy"
    except Exception as e:
        checks["redis"] = f"unhealthy: {e}"

    status = "healthy" if all(
        v == "healthy" for v in checks.values()
    ) else "degraded"

    return {"status": status, "checks": checks}
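
The endpoint above answers with HTTP 200 even when a dependency is down, so a monitor that only inspects the status code will not notice a degraded state. If that matters for the monitoring in use, one option is to derive the status code from the same checks dictionary. The helper below is a hypothetical sketch of that mapping, not part of the original handler.

```python
from typing import Dict, Tuple


def summarize_health(checks: Dict[str, str]) -> Tuple[int, Dict[str, object]]:
    """Return (HTTP status code, response body) for a set of check results."""
    all_healthy = all(value == "healthy" for value in checks.values())
    status_code = 200 if all_healthy else 503
    body = {
        "status": "healthy" if all_healthy else "degraded",
        "checks": checks,
    }
    return status_code, body


ok_code, ok_body = summarize_health({"database": "healthy", "redis": "healthy"})
bad_code, bad_body = summarize_health(
    {"database": "healthy", "redis": "unhealthy: connection refused"}
)
```

In a FastAPI handler the pair could be returned as JSONResponse(status_code=code, content=body). Whether a failing cache should count as a hard failure is a judgment call for the application.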

The Trap

# TRAP: High-cardinality labels in Prometheus metrics
REQUEST_COUNT.labels(
    method=request.method,
    endpoint=request.url.path,  # Includes UUIDs: /markets/550e8400-...
    status=response.status_code,
    user_id=str(user.id),  # Unique per user
).inc()
# 1000 unique paths x 100 users = 100,000 time series
# Grafana Cloud free tier allows 10,000 active series

# SAFE: Normalize paths, avoid per-user labels
REQUEST_COUNT.labels(
    method=request.method,
    endpoint="/markets/{id}",  # Normalized
    status=response.status_code,
).inc()
# ~20 endpoints x 5 methods x 5 status codes = 500 time series

High-cardinality labels generate thousands of unique time series. Each unique combination of label values creates a separate time series in Prometheus. The Grafana Cloud free tier limits you to 10,000 active series. Normalizing paths and avoiding per-user labels keeps the cardinality under control.
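
The series arithmetic is simple enough to check before a metric is deployed. The sketch below multiplies label cardinalities for a counter, and also accounts for histograms, where every label combination produces one series per bucket (the finite buckets plus +Inf) plus a _sum and a _count series. The function names and cardinality figures are illustrative, and the count ignores any additional _created series the client library may export.

```python
from math import prod
from typing import Dict


def counter_series(label_cardinalities: Dict[str, int]) -> int:
    """A counter exposes one series per combination of label values."""
    return prod(label_cardinalities.values())


def histogram_series(label_cardinalities: Dict[str, int], finite_buckets: int) -> int:
    """Per label combination: finite buckets + the +Inf bucket + _sum + _count."""
    per_combination = finite_buckets + 1 + 2
    return prod(label_cardinalities.values()) * per_combination


# The trap: raw paths and a per-user label.
trap = counter_series({"endpoint": 1000, "user_id": 100})

# The safe version: normalized endpoints, no user label.
safe = counter_series({"endpoint": 20, "method": 5, "status": 5})

# REQUEST_DURATION has 8 finite buckets and two labels.
durations = histogram_series({"endpoint": 20, "method": 5}, finite_buckets=8)
```

Under these assumed cardinalities the duration histogram is the larger of the two request metrics, so bucket count deserves the same scrutiny as label choice.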

The Cost

Component	Free tier or cost
Sentry	5,000 errors/month, 10,000 transactions
Grafana Cloud	10,000 active metric series, 50 GB logs, 50 GB traces
Prometheus client library	$0
Grafana Agent	$0

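At 50 customers, Marketflow generates approximately 100-500 errors per month (mostly transient network issues) and 50,000-200,000 requests. Sentry’s 5,000-error quota is ample. The transaction quota is tighter: assuming those request counts are also monthly, traces_sample_rate=0.1 yields roughly 5,000-20,000 sampled transactions, so the upper end of the range exceeds the 10,000-transaction allowance unless the sample rate is lowered. Grafana Cloud’s 10,000 metric series accommodate all of Marketflow’s application and system metrics.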
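
How the sampling rate interacts with the transaction quota can be checked with a few lines of arithmetic. The sketch below assumes that every request is a candidate transaction and that the request figures are monthly; the function names are illustrative.

```python
def sampled_transactions(monthly_requests: int, sample_rate: float) -> int:
    """Expected number of transactions sent to Sentry per month."""
    return round(monthly_requests * sample_rate)


def max_sample_rate(monthly_requests: int, quota: int) -> float:
    """Highest sample rate that keeps expected transactions within the quota."""
    if monthly_requests <= 0:
        return 1.0
    return min(1.0, quota / monthly_requests)


QUOTA = 10_000  # transactions on the free tier, per the table above

low_volume = sampled_transactions(50_000, 0.1)
high_volume = sampled_transactions(200_000, 0.1)
safe_rate_at_high_volume = max_sample_rate(200_000, QUOTA)
```

At the configured 10% rate the low end of the range fits within the quota and the high end does not, which suggests revisiting traces_sample_rate as traffic grows.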