The Observability Trap

Observability has become a religion. The pitch is seductive: instrument everything, collect every metric, trace every request, log every event. With enough data, the system becomes transparent. Bugs reveal themselves. Outages explain themselves. You just need the right dashboard.

This is a comforting story. It is also dangerously incomplete.

The observability industry is built on three pillars: logs, metrics, and traces. Each reveals a specific dimension of system behavior. Each hides everything else.

Logs tell you what the application thinks happened. They’re narratives written by developers who imagined what future-you might want to know. If the developer who wrote the payment processing module didn’t log the connection pool state before making a request, you will not find that information in the logs when the connection pool is the problem. Logs are autobiographies — they contain what the author chose to share, and they omit what the author didn’t think was important.

The other problem with logs is volume. A moderately busy microservices system generates gigabytes of logs per hour. When something goes wrong, you search the logs — but what do you search for? If you know the failing request ID, you can trace it. If you don’t, you’re searching for patterns in a sea of noise, and the signal-to-noise ratio in most log systems is abysmal. Engineers add logging statements during incidents, promising to remove them later. They never do. The log stream becomes a firehose of debug output that drowns the information you need.

Metrics tell you aggregate measurements over time windows. Request rate, error rate, latency percentiles, CPU usage, memory consumption. They’re excellent at answering “is this number going up or down?” and terrible at answering “why?”

A p99 latency spike from 200ms to 3 seconds tells you something got slow. It doesn’t tell you what. Was it a specific endpoint? A specific downstream service? A garbage collection pause? A noisy neighbor on the same host? The metric says “bad.” Your systems knowledge says where to look.

Worse, metrics lie by aggregation. An average latency of about 52ms can hide a bimodal distribution where 95% of requests take 10ms and 5% take 850ms. If you only track averages and medians, you might miss a failure mode that affects a crucial customer segment: say, users with large shopping carts whose requests hit the slow path.
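The effect is easy to reproduce with synthetic data. The sketch below assumes a made-up sample of 1,000 requests with the 95/5 split described above and a simple nearest-rank percentile helper; both are illustrative choices, not measurements from a real system. It shows which summary statistics expose the slow mode and which hide it.

```python
import statistics


def percentile(sorted_values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    rank = max(1, (p * len(sorted_values) + 99) // 100)  # integer ceiling of p% of n
    return sorted_values[rank - 1]


# Synthetic sample (assumption): 950 fast requests at 10 ms, 50 slow ones at 850 ms.
latencies_ms = sorted([10] * 950 + [850] * 50)

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} p50={p50_ms} p95={p95_ms} p99={p99_ms}")
# prints: mean=52 p50=10 p95=10 p99=850
```

In this sample the mean, the median, and even p95 look unremarkable. Only percentiles above the 95th, or a histogram that shows both modes, reveal the slow path.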

Traces tell you how a single request moved through the system. They show which services were called, in what order, and how long each step took. They are the most powerful of the three pillars — and the most expensive.

But traces only capture what’s instrumented. If your trace shows “database query: 800ms,” that’s useful. If the database query was fast but the connection was slow to acquire because the pool was exhausted, and your instrumentation starts the timer after pool acquisition, the trace blames the database for a connection pool problem. Your trace is only as honest as your instrumentation boundaries, and most instrumentation boundaries are wrong.
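One way to keep a trace honest is to give pool acquisition its own span instead of leaving it outside the timer. The following is a minimal sketch, not a real tracing library: `span` and `traced_query` are hypothetical helpers, the pool is a plain `queue.Queue`, and the demo uses a scripted clock to simulate a 750ms wait for a connection followed by a fast query.

```python
import queue
import time
from contextlib import contextmanager


@contextmanager
def span(name, spans, clock=time.perf_counter):
    """Record the elapsed time of the enclosed block under `name`."""
    start = clock()
    try:
        yield
    finally:
        spans[name] = clock() - start


def traced_query(pool, run_query, clock=time.perf_counter):
    """Time pool acquisition and query execution as two separate spans."""
    spans = {}
    with span("pool.acquire", spans, clock):
        conn = pool.get()  # blocks while every connection is checked out
    try:
        with span("db.query", spans, clock):
            result = run_query(conn)
    finally:
        pool.put(conn)  # always return the connection to the pool
    return result, spans


# Demo with a scripted clock (assumption): 0.75 s waiting for a connection,
# then a query that takes 0.0625 s.
pool = queue.Queue()
pool.put("conn-1")
ticks = iter([0.0, 0.75, 0.75, 0.8125])
result, spans = traced_query(pool, lambda conn: "ok", clock=lambda: next(ticks))

print(spans)
# prints: {'pool.acquire': 0.75, 'db.query': 0.0625}
```

With the two spans separated, the same slow request points at the pool rather than the database. If the timer had started after `pool.get()`, the 750ms wait would not appear in the trace at all.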

Dashboard-Driven Debugging

Here’s how incident response works in most organizations:

1. Alert fires.
2. On-call engineer opens the dashboard.
3. They look at the graphs they always look at: request rate, error rate, latency, CPU, memory.
4. If one graph looks anomalous, they investigate that dimension.
5. If no graph looks anomalous, they’re stuck.

Step 5 is where things break. Dashboard-driven debugging only works when the failure manifests in a metric you’re already tracking. If the problem is connection pool exhaustion and you don’t have a connection pool utilization graph, the dashboard shows nothing. CPU is fine. Memory is fine. Error rate might be slightly elevated. The engineer stares at the screen, toggles between time ranges, zooms in, zooms out, and finds nothing.

The dashboard becomes a cage. You can only see what the bars and lines show you. If the answer lives outside the metrics you chose to collect last quarter, you’re blind.

This creates a perverse feedback loop: teams instrument what broke last time. After the connection pool incident, they add connection pool metrics. After the DNS resolution incident, they add DNS latency metrics. After the disk I/O incident, they add disk metrics. Their observability improves — retrospectively. But the next incident is always in a dimension they haven’t instrumented yet, because if they’d anticipated it, they would have prevented it.

Green Dashboards, Broken System

Consider this scenario. It is a composite drawn from multiple real production incidents.

An e-commerce platform processes orders. The order pipeline has seven stages: validate, reserve inventory, charge payment, send confirmation email, update analytics, generate invoice, notify warehouse. Each stage is a separate microservice. The pipeline is orchestrated by an event queue.

One Thursday, the analytics service deploys a new version with a subtle bug: it acknowledges messages from the queue before processing them, then sometimes crashes during processing. Each message it crashes on is lost, because the queue already considers it delivered, so the queue keeps draining. Kubernetes restarts the service, which picks up the next message, crashes again on some messages, and handles others fine. Its error rate is 6%, below the 10% alert threshold.
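The difference between acknowledging before and after processing can be simulated in a few lines. This is a sketch under stated assumptions, not the platform's actual consumer: the broker is an in-memory `deque`, `consume` and `process` are hypothetical names, and the crash is assumed to be transient, so a retry after restart succeeds. A message that crashes every time would need a retry limit or dead-letter queue, which is left out here.

```python
from collections import deque


class CrashDuringProcessing(Exception):
    """Stands in for the service dying partway through a message."""


def process(message, flaky_ids, attempts):
    attempts[message] = attempts.get(message, 0) + 1
    # Assumption: flaky messages crash the first attempt only.
    if message in flaky_ids and attempts[message] == 1:
        raise CrashDuringProcessing(message)


def consume(messages, flaky_ids, ack_first):
    """Return the messages that were successfully stored."""
    pending = deque(messages)  # what the broker still holds
    stored, attempts = [], {}
    while pending:
        message = pending[0]
        if ack_first:
            pending.popleft()  # ack: the broker forgets the message now
        try:
            process(message, flaky_ids, attempts)
        except CrashDuringProcessing:
            continue  # "restart": resume with whatever the broker still holds
        stored.append(message)
        if not ack_first:
            pending.popleft()  # ack only after the work is done
    return stored


messages = list(range(100))
flaky_ids = {3, 20, 41, 57, 78, 99}  # 6% of messages hit the bug

lost_with_early_ack = len(messages) - len(consume(messages, flaky_ids, ack_first=True))
lost_with_late_ack = len(messages) - len(consume(messages, flaky_ids, ack_first=False))

print(lost_with_early_ack, lost_with_late_ack)
# prints: 6 0
```

In both runs the queue ends up empty, which is why queue depth looks healthy either way. Acknowledging late trades silent loss for possible redelivery, so the processing step then has to tolerate duplicates.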

From the dashboards:

Order throughput: Normal. Orders are being placed at the expected rate.
Payment success rate: 99.8%. Normal.
Email delivery rate: 99.2%. Normal.
Analytics service error rate: 6%. Below alert threshold.
Queue depth: Low. Messages are being consumed quickly.
CPU/Memory/Disk: All normal.

Every dashboard is green or yellow. No alert fires.

But 6% of orders are silently losing their analytics events. Revenue attribution is wrong. Marketing campaign ROI calculations are corrupted. The A/B testing framework makes wrong decisions because conversion data is missing for 6% of users. A week later, the marketing team notices their numbers don’t match finance’s numbers. Two weeks later, someone investigates. Three weeks later, the bug is found — by manually comparing message counts between the queue producer and the analytics database.

Three weeks of corrupted data. Because the dashboards all looked fine.

The fundamental problem: data correctness is nearly impossible to observe through metrics. You can measure throughput, latency, and error rates. You cannot easily measure “did this system produce the correct result?” without building a separate system that independently computes the expected result and compares it — which is essentially building the system twice.

Silent data corruption is the nightmare scenario for observability: the system is functioning (responding, processing, within latency bounds) but producing wrong outputs. Your dashboards can’t tell you this. Only understanding how the system is supposed to work — and building verification mechanisms based on that understanding — can catch it.
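A verification mechanism does not have to rebuild the whole system. A reconciliation check that compares what the producer sent with what the consumer stored would have caught the scenario above. The sketch below is one possible shape for such a check; `reconcile`, the ID lists, and the 0.1% tolerance are all illustrative assumptions, and a real check would compare only a closed time window so that in-flight events are not counted as missing.

```python
def reconcile(produced_ids, stored_ids, max_missing_ratio=0.001):
    """Compare event IDs emitted by the producer with IDs found downstream."""
    produced = set(produced_ids)
    missing = produced - set(stored_ids)
    ratio = len(missing) / len(produced) if produced else 0.0
    return {
        "missing": sorted(missing),
        "missing_ratio": ratio,
        "alert": ratio > max_missing_ratio,
    }


# Hypothetical data: 100 order events produced, 6 never reached the analytics store.
produced_ids = list(range(1, 101))
dropped = {7, 19, 33, 52, 68, 91}
stored_ids = [i for i in produced_ids if i not in dropped]

report = reconcile(produced_ids, stored_ids)
print(report["missing_ratio"], report["alert"])
# prints: 0.06 True
```

Run on a schedule, a check like this turns a three-week discovery into one that takes as long as the reconciliation interval. It only exists, though, if someone understood the pipeline well enough to know that producer and consumer counts should match.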

Alert Fatigue: When Everything Screams

The other end of the observability spectrum is equally dysfunctional. A team instruments aggressively, sets alert thresholds conservatively, and creates alerts for everything. Monday morning, the on-call engineer has forty-seven unacknowledged alerts:

Memory usage above 70% on three instances (they always run at 70%)
Latency spike on the search service (it always spikes at 3 AM during the batch import)
Error rate 2% on the recommendation engine (known flaky third-party API)
Disk usage above 80% on the log aggregator (the team that owns it is working on log rotation)
Certificate expiry in 29 days (auto-renewal runs at 7 days)

Forty-seven alerts, zero action required. The engineer auto-acknowledges them. They don’t read the descriptions. They’ve developed the muscle memory of “see alert, click acknowledge, move on.”

On Tuesday, alert forty-eight fires: the primary database’s replication lag exceeds 30 seconds. This is genuinely dangerous — it means failover would lose 30 seconds of writes. But it’s buried among the noise. The engineer acknowledges it with the same muscle memory. Nobody investigates. On Wednesday, the primary database fails over during a routine maintenance window, and 30 seconds of orders are lost.

Alert fatigue is the predictable consequence of treating observability as a substitute for understanding. If you understood the system, you’d know which metrics actually matter and set meaningful thresholds. You’d have five alerts, each tuned so precisely that when one fires, you drop everything. Instead, you have five hundred alerts, each set to the vendor’s recommended defaults, and you’ve trained your team to ignore all of them.

The Unknown Unknowns

The deepest failure of observability-as-substitute is philosophical: you can only instrument what you can imagine failing.

If you’ve never heard of TCP TIME_WAIT socket accumulation, you won’t create a metric for it. If you’ve never encountered a memory-mapped file that pins physical pages, you won’t monitor for it. If you’ve never seen a hash table degrade from O(1) to O(n) due to collision attacks, you won’t measure hash bucket distribution.

Your dashboards are a map of your knowledge. The bugs hide in the territory you haven’t mapped. And the territory you haven’t mapped is exactly the territory defined by the abstractions you never looked beneath.

This is where the observability industry’s messaging is most misleading. “Full-stack observability” implies you can see everything. You can’t. You can see everything you chose to measure, which is limited by everything you know could go wrong, which is limited by everything you understand about how the system works. The unmeasured dimensions are where the interesting failures live.

The Cost of the Trap

Here’s a number that might get your attention: a 2023 survey by the Cloud Native Computing Foundation found that organizations with more than 500 engineers spend an average of $65 million per year on observability tooling. For some companies, the observability bill exceeds the compute bill for the services being observed.

This is a reasonable investment if the tools are being wielded by engineers who understand their systems. An engineer who knows how Linux schedules processes, how the JVM manages memory, how PostgreSQL chooses query plans — that engineer turns observability into a superpower. They build dashboards that reveal real problems. They set alerts at thresholds that mean something. They write queries that answer specific diagnostic questions.

But the same tools in the hands of an engineer who treats the system as a black box become an expensive pacifier. The dashboards are copies of vendor templates. The alerts are copies of blog post recommendations. The queries are copies of Stack Overflow answers. The tools work — they collect data, they draw graphs, they fire alerts. The engineer stares at them during incidents and doesn’t know what they’re looking at.

Observability as Amplifier

The right mental model for observability is amplification. If you understand connection pooling, observability amplifies that understanding: you can see pool utilization across all services in real time, correlate pool exhaustion with latency spikes, set alerts at pool utilization levels that predict failures before they happen. Your knowledge multiplied by the tool produces something powerful.
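As a small illustration of an alert that encodes understanding rather than a vendor default, the sketch below fires only when pool utilization stays high for several consecutive samples. The function name, the 80% threshold, and the three-sample window are assumptions chosen for the example; the right values depend on how quickly a given pool goes from busy to exhausted.

```python
def pool_alert(in_use_samples, pool_size, threshold=0.8, sustained=3):
    """Return True once utilization stays at or above `threshold`
    for `sustained` consecutive samples."""
    streak = 0
    for in_use in in_use_samples:
        if in_use / pool_size >= threshold:
            streak += 1
        else:
            streak = 0  # a single healthy sample resets the count
        if streak >= sustained:
            return True
    return False


# Hypothetical samples of connections in use, taken at a fixed interval.
brief_spikes = pool_alert([4, 9, 5, 9, 4, 9], pool_size=10)
sustained_pressure = pool_alert([4, 9, 5, 8, 9, 10], pool_size=10)

print(brief_spikes, sustained_pressure)
# prints: False True
```

Brief spikes are ignored, while a steady climb toward exhaustion fires before the pool is actually empty. That is the kind of alert worth dropping everything for.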

But an amplifier only multiplies what it is given. If your understanding is zero, the amplifier produces zero. $65 million worth of zero.

The prescription isn’t less observability. It’s more understanding. Learn what your system actually does — at the TCP level, at the memory management level, at the query planner level — and then use observability to watch those mechanisms at scale. The tools become a mirror that reflects your understanding of the system. The deeper the understanding, the more useful the reflection.

Build the knowledge first. Then the tools become worth what you pay for them.