What Good Looks Like

We’ve spent sixteen chapters diagnosing the problem and two chapters prescribing solutions. Now let’s look at the people who are already doing this — engineers who operate with the kind of calibrated understanding this book has been advocating.

These are not prodigies. They are not the mythical “10x developers” that conference speakers invoke to sell productivity tools. They are people who, at some point in their careers, looked at a system they were responsible for, realized they didn’t understand the layer beneath them, and decided to fix that. Their stories follow similar arcs. Their daily habits overlap. And every single one of them will tell you: this is not about being smarter. It’s about being curious in a specific, disciplined way.

Three profiles. Three different specializations. One pattern.

Profile 1: Nneka — The Backend Architect Who Went to the Metal

Nneka spent her first five years as a software engineer inside the Ruby on Rails ecosystem. She was fast. Feature velocity was her calling card. She shipped full CRUD applications with test coverage, background jobs, and caching in weeks. Her Rails code was clean, idiomatic, and well-tested. Her colleagues considered her one of the strongest mid-level engineers on the team.

Then the company hit scale. Not “Twitter scale” — the kind of modest, painful scale where a system that handled 500 requests per second now needed to handle 5,000, and everything started behaving strangely. Response times tripled overnight. Active Record queries that returned in 8ms started taking 3 seconds. Background jobs backed up to a queue depth of 40,000. The Sidekiq dashboard looked like a medical crisis.

Nneka did what she’d always done: she read the application logs, tweaked Active Record queries, added database indexes where the query seemed slow, and increased the Sidekiq concurrency. The response times didn’t change. The queue kept growing.

The on-call SRE — a quiet engineer named James who’d been at the company for a decade — asked Nneka a question she couldn’t answer: “What’s the connection pool doing right now? How many connections are in idle in transaction state?”

Nneka didn’t know how to check. She didn’t know what idle in transaction meant. She knew Active Record had a connection pool with a configurable size, and she knew the size was set to 10 because that’s what the Heroku documentation recommended. She had never asked what happened when eleven concurrent requests needed a database connection, or what the PostgreSQL backend process did while a Rails controller was rendering a view with the transaction still open.

James showed her pg_stat_activity. Thirty-two connections, twenty-six of them in idle in transaction. The connection pool was exhausted. Every new request waited for a connection, and the connections were held by requests that had already finished their database work but hadn’t committed because the Rails controller was still rendering a JSON response inside the transaction block.
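A rough sketch of the check James ran, written so it works without a database: the query is kept as a string, and the tally runs over rows that have already been fetched. The helper names are hypothetical, and the sample rows are illustrative; only the totals (32 connections, 26 idle in transaction) come from the incident, while the split of the remaining six is an assumption.

```python
from collections import Counter

# A query one might run through psql or a driver. `state` and `datname`
# are columns of the pg_stat_activity view.
STATE_QUERY = """
SELECT state, count(*) AS connections
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY state
ORDER BY connections DESC;
"""


def summarize_states(states):
    """Tally connection states, most common first."""
    return Counter(states).most_common()


def stuck_fraction(states, stuck_state="idle in transaction"):
    """Share of connections sitting in an open transaction doing nothing."""
    if not states:
        return 0.0
    return sum(1 for s in states if s == stuck_state) / len(states)


# Illustrative rows: 26 of 32 match the story, the other 6 are assumed.
sample = ["idle in transaction"] * 26 + ["active"] * 4 + ["idle"] * 2

for state, count in summarize_states(sample):
    print(f"{count:3d}  {state}")
print(f"stuck: {stuck_fraction(sample):.0%}")
```

When most of the pool is idle in transaction, the database is not the slow part; the application is holding connections while it does something else.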

The fix took fifteen minutes: move the render outside the transaction, and set idle_in_transaction_session_timeout to 30 seconds as a safety net. The queue drained in six minutes.
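Why such a small change had such a large effect can be seen with back-of-the-envelope arithmetic. The sketch below applies Little's law to a single connection pool: throughput is capped by pool size divided by how long each request holds a connection. The pool size of 10 comes from the story; the 10ms of query time and 40ms of render time are assumed numbers chosen only to make the ratio visible.

```python
def max_requests_per_second(pool_size: int, hold_ms: float) -> float:
    """Upper bound on throughput for one pool when every request holds a
    connection for hold_ms milliseconds (Little's law: L = lambda * W)."""
    if hold_ms <= 0:
        raise ValueError("hold_ms must be positive")
    return pool_size * 1000.0 / hold_ms


POOL_SIZE = 10   # from the story
DB_MS = 10       # assumed time spent running queries
RENDER_MS = 40   # assumed time spent rendering the JSON response

# Render inside the transaction: the connection is held for both phases.
inside = max_requests_per_second(POOL_SIZE, DB_MS + RENDER_MS)

# Render outside the transaction: the connection is held only for queries.
outside = max_requests_per_second(POOL_SIZE, DB_MS)

print(f"render inside transaction:  {inside:.0f} req/s per pool")
print(f"render outside transaction: {outside:.0f} req/s per pool")
```

Under these assumed timings the ceiling rises fivefold without touching the pool size, which is why adding indexes and Sidekiq concurrency never moved the needle.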

Nneka describes that incident as the moment she realized she was building on foundations she’d never inspected. “I could write Rails code all day,” she says. “But I couldn’t tell you what my Rails code was doing to PostgreSQL. I couldn’t tell you what PostgreSQL was doing to the operating system. I was working on the top floor of a building I’d never visited the basement of.”

She spent the next year filling in the gaps. Not all of them — all of them would take a lifetime. She focused on the layers directly beneath her daily work:

PostgreSQL internals: She read the pg_stat_activity documentation end to end. She learned to read EXPLAIN ANALYZE output — not just the top line, but the nested loop joins, the sequential scans, the heap fetches. She practiced on the slowest queries in their monitoring tool until she could predict the query plan before running EXPLAIN.
Operating system basics: She learned what file descriptors are, why a process might run out of them, and how to check with lsof. She learned what the OOM killer does and how to read /proc/meminfo. She didn’t become a kernel developer. She became an application developer who could spot when the OS was the bottleneck.
TCP and HTTP at the socket level: She started reading ss output on production pods during incidents. She learned the TCP state machine: not every edge case, but enough to know that CLOSE-WAIT means the remote side has closed its end and the local application hasn’t yet closed its own socket, which usually means a connection leak.

Eighteen months later, Nneka is the first person called when something goes wrong in production. Not because she’s the most senior — she isn’t — but because she finds root causes. Here’s a recent example of how she works:

A downstream service starts returning intermittent 500 errors. Other engineers focus on the HTTP responses — retries, error rates, circuit breakers. Nneka opens a terminal on the pod and checks TCP state first:

ss -tanp | grep :8443 | awk '{print $1}' | sort | uniq -c | sort -rn

Thirty connections in ESTABLISHED, twelve in TIME-WAIT, and one in SYN-SENT. The SYN-SENT stands out — a connection that started the three-way handshake but never completed it. She checks which IP that connection is targeting. It’s a different IP than the other established connections for the same hostname. DNS just rotated, and one of the new IPs isn’t responding.
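The same tally can be sketched in a few lines of Python, which also makes it easy to pull out the peer address behind an unusual state. This assumes the usual column layout of ss -tan output (State, Recv-Q, Send-Q, local address, peer address); the function name and the sample text are illustrative, and only 10.0.3.47 and port 8443 come from the story.

```python
from collections import Counter, defaultdict


def tally_states(ss_output: str, port: int):
    """Count TCP states for connections whose peer port is `port`,
    and collect the peer IPs seen in each state."""
    counts = Counter()
    peers = defaultdict(set)
    suffix = f":{port}"
    for line in ss_output.splitlines():
        fields = line.split()
        # Assumed columns: State Recv-Q Send-Q Local:Port Peer:Port [Process]
        if len(fields) < 5 or fields[0] == "State":
            continue
        state, peer = fields[0], fields[4]
        if not peer.endswith(suffix):
            continue
        counts[state] += 1
        peers[state].add(peer.rsplit(":", 1)[0])
    return counts, peers


# Illustrative sample; a real capture would be much longer.
SAMPLE = """\
State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
ESTAB      0      0      10.0.1.5:41000      10.0.3.44:8443
ESTAB      0      0      10.0.1.5:41002      10.0.3.45:8443
TIME-WAIT  0      0      10.0.1.5:40990      10.0.3.46:8443
SYN-SENT   0      1      10.0.1.5:41010      10.0.3.47:8443
ESTAB      0      0      10.0.1.5:51000      10.0.9.9:5432
"""

counts, peers = tally_states(SAMPLE, 8443)
for state, n in counts.most_common():
    print(f"{n:3d} {state:10s} {sorted(peers[state])}")
```

A lone SYN-SENT entry pointing at an address no other connection uses is the detail worth chasing.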

She confirms it in thirty seconds:

dig +short downstream-service.internal
curl -o /dev/null -w "%{time_connect}" https://10.0.3.47:8443/ --connect-timeout 5

The first three IPs respond in 2ms. The fourth times out. She raises the issue with the downstream team, who discover a misconfigured security group on a newly provisioned instance. Total diagnostic time: four minutes.
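For readers who prefer a script to a pair of one-liners, here is a minimal sketch of the same probe: resolve every address behind a hostname, then time a bare TCP connect to each. The function names are hypothetical, and the hostname in the usage comment is the one from the story; the sketch measures only the TCP handshake, not TLS or HTTP.

```python
import socket
import time


def resolve_ipv4(hostname: str, port: int):
    """Return the distinct IPv4 addresses DNS currently hands out."""
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def connect_time(ip: str, port: int, timeout: float = 5.0):
    """Seconds taken to complete a TCP handshake, or None if the
    connection is refused, unreachable, or times out."""
    start = time.monotonic()
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


def probe_all(hostname: str, port: int, timeout: float = 5.0):
    """Map each resolved address to its connect time (or None)."""
    return {
        ip: connect_time(ip, port, timeout)
        for ip in resolve_ipv4(hostname, port)
    }


# Usage on a pod that can resolve the name:
#   for ip, t in probe_all("downstream-service.internal", 8443).items():
#       print(ip, "timeout" if t is None else f"{t * 1000:.1f} ms")
```

An address that returns None while its siblings answer in a couple of milliseconds points at the network path or the instance, not at the application.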

That diagnostic process didn’t require genius. It required knowing that HTTP connections are TCP connections, that TCP connections have states, and that DNS can return multiple IPs. Knowledge that lives one or two layers below where Nneka writes code every day.

Profile 2: Daniel — The Frontend Engineer Who Went Deep

Daniel builds user interfaces. React, TypeScript, the standard modern stack. Three years ago, his work was indistinguishable from any competent frontend engineer: components rendered, state managed, data fetched. The applications worked. They were also, by any objective measure, slow.

Not catastrophically slow — nobody filed bug reports. But the Lighthouse scores hovered around 55. The largest contentful paint was 3.8 seconds on mobile. Interactions stuttered on mid-range Android phones. The team’s attitude was standard: “Performance is a backend problem. Our bundle sizes are reasonable. Ship it.”

Daniel disagreed, but he didn’t have the vocabulary to explain why the interface felt heavy. He could say “it re-renders too much” but couldn’t explain what a re-render actually cost the browser. That gap bothered him.

He spent six months learning what happens between ReactDOM.render() and pixels appearing on screen:

The browser rendering pipeline: Parse HTML → build DOM → parse CSS → build CSSOM → construct render tree → layout → paint → composite. He drew this pipeline on a whiteboard and left it there for months. When he wrote a component, he asked himself which stages that component would trigger when it updated.
V8’s JIT compilation: He didn’t learn to read V8 bytecode, but he learned the optimization tiers: Ignition interprets, TurboFan compiles hot functions, and deoptimization happens when assumptions about types are violated. He learned that polymorphic function calls (passing different object shapes to the same function) make that optimization less effective, and that enough distinct shapes can defeat it altogether. This changed how he wrote utility functions.
The compositor thread: He learned that CSS transforms and opacity changes happen on the compositor thread without triggering layout or paint, while changes to width, height, top, left, and dozens of other properties force layout recalculation. This single piece of knowledge transformed his animation approach.

Here’s what Daniel’s mental model looks like when he writes a React component now. He’s building a product list that shows 200 items with live price updates every second:

His first question isn’t “what library should I use?” It’s “what does a re-render cost?” He knows that when React re-renders a component, it runs the component function, diffs the output against the previous virtual DOM, and patches the real DOM where differences exist. The diffing is O(n) in the number of elements. Two hundred items re-rendering every second means two hundred row re-renders per second, or twelve thousand per minute, each one diffing every element inside the row, plus whatever DOM mutations result.

So he virtualizes the list — only render the 15 items visible in the viewport. Now he re-renders 15 items per second instead of 200. But he also knows that the price update shouldn’t re-render the row component at all unless that specific row’s price changed. He memoizes the row component and structures the state so each row subscribes only to its own price. The re-render count drops from 15 per second to 2–3.
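The arithmetic behind those three steps can be sketched as a small counting model. It is written in Python rather than React so it runs anywhere; the strategy names are labels invented for this sketch, and the set of rows whose price changed is an assumed example.

```python
def renders_per_tick(total_rows, visible_rows, changed_rows, strategy):
    """Row renders triggered by one price tick under each approach.

    visible_rows: iterable of row indices currently in the viewport
    changed_rows: set of row indices whose price changed this tick
    """
    if strategy == "naive":
        # Every row re-renders whenever any price changes.
        return total_rows
    if strategy == "virtualized":
        # Only rows in the viewport exist, but all of them re-render.
        return len(list(visible_rows))
    if strategy == "virtualized+memoized":
        # Only visible rows whose own price changed re-render.
        return len(set(changed_rows) & set(visible_rows))
    raise ValueError(f"unknown strategy: {strategy}")


TOTAL = 200
VISIBLE = range(0, 15)          # 15 rows in the viewport
CHANGED = {3, 9, 120, 180}      # assumed: four prices moved this tick

for name in ("naive", "virtualized", "virtualized+memoized"):
    per_tick = renders_per_tick(TOTAL, VISIBLE, CHANGED, name)
    # One tick per second, so per-minute is per-tick times 60.
    print(f"{name:22s} {per_tick:4d}/s  {per_tick * 60:6d}/min")
```

The model ignores the cost of each individual render, which is where the paint and layout questions come in.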

Then he checks the paint cost. He opens Chrome DevTools, enables Paint Flashing, and watches. The price text changes trigger paint on the text node only — no layout, because the text container has a fixed width. Good. But the scrollbar is repainting on every frame because his virtualized container is recalculating its scrollable height. He fixes the container’s height to a static value based on total item count times row height, eliminating the layout-triggered scrollbar repaint.
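The fixed-height trick comes down to two small calculations: the container's height never depends on which rows are mounted, and the visible window is derived from the scroll offset alone. A minimal sketch, assuming a 40px row height and a 600px viewport (neither figure is from Daniel's project; they are chosen so that 15 rows fit on screen):

```python
import math


def container_height(total_items: int, row_height_px: int) -> int:
    """Static scrollable height: item count times row height."""
    return total_items * row_height_px


def visible_range(scroll_top_px, viewport_height_px, row_height_px,
                  total_items, overscan=0):
    """Half-open range [first, last) of rows that intersect the viewport,
    optionally padded by `overscan` rows on each side."""
    first = max(0, scroll_top_px // row_height_px - overscan)
    last_visible = math.ceil(
        (scroll_top_px + viewport_height_px) / row_height_px
    )
    last = min(total_items, last_visible + overscan)
    return first, last


ROW_PX = 40        # assumed row height
VIEWPORT_PX = 600  # assumed viewport height
ITEMS = 200

print(container_height(ITEMS, ROW_PX))            # scrollable height in px
print(visible_range(0, VIEWPORT_PX, ROW_PX, ITEMS))
print(visible_range(1010, VIEWPORT_PX, ROW_PX, ITEMS))
```

Because the height is a constant, scrolling changes only which rows are mounted, never the size of the scrollable area, so the scrollbar has no reason to repaint.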

Total time spent on performance: about forty minutes. The result: a list that updates at 60fps on a mid-range phone, while the team’s previous implementation dropped to 15fps and stuttered visibly. Daniel didn’t use a different framework. He used the same tools as everyone else. He just understood what those tools were asking the browser to do.

His Lighthouse scores now average 92. Not because he’s a performance obsessive — he ships features at the same pace as his teammates — but because every component he writes is informed by a mental model of the rendering pipeline. He makes better default choices. His code doesn’t need performance optimization later because it was written with performance awareness from the start.

Profile 3: Adaeze — The DevOps Engineer Who Reads Kernel Logs

Adaeze runs Kubernetes clusters. Not “manages them through a dashboard” — runs them the way a mechanic runs an engine, with knowledge of what’s inside.

She started like most DevOps engineers: writing YAML, applying manifests, checking pod status. When pods failed to schedule, she’d check kubectl describe pod for the event message, Google the error, and apply the Stack Overflow fix. It worked. Until it didn’t.

The incident that changed her approach: a batch processing job — a pod requesting 4 CPU cores and 8GB of RAM — wouldn’t schedule on a cluster with 200 nodes. kubectl describe showed Insufficient cpu. But the cluster dashboard showed average CPU utilization at 30%. The math didn’t add up. Two hundred nodes, each with 16 cores, 30% utilized — there should be over 2,000 available cores. The pod needed 4.

Adaeze’s initial debugging followed the standard playbook: check resource requests and limits, check node capacity, check taints and tolerations, check pod affinity rules. Everything looked fine. The pod just wouldn’t schedule. Other engineers on the team suggested adding more nodes. The manager approved the capacity increase. Problem solved.

Except Adaeze couldn’t stop thinking about it. The cluster had the capacity. The scheduler said it didn’t. Someone was wrong, and she wanted to know who.

She SSH’d into a node and started reading kubelet logs — not the sanitized events that kubectl describe surfaces, but the actual log output in /var/log/kubelet.log. She found entries about resource allocation that didn’t match what the dashboard displayed. The dashboard showed utilization — how much CPU the running workloads were actually using. The scheduler works on requests — how much CPU the pod specs claim they might use. Every pod on the cluster had resource requests set, and those requests, summed across all pods on each node, left less than 4 cores available on every single node. The cluster was 30% utilized but 95% allocated.

This is a fundamental Kubernetes concept, but it’s one that the dashboard abstracted away. The dashboard showed utilization because that’s what operators want to see. The scheduler uses allocation because that’s what guarantees quality of service. The two numbers diverge, and when they diverge far enough, you get a cluster that looks empty but acts full.

Adaeze didn’t stop there. She wanted to understand why the scheduler makes this decision. She read the Kubernetes scheduler source code — specifically the NodeResourcesFit plugin — and traced the logic: the scheduler sums all resource requests for pods bound to a node, subtracts from the node’s allocatable capacity, and rejects the pod if the remainder is less than the pod’s request. It doesn’t look at actual utilization. It can’t — utilization is retrospective, and a scheduling decision is prospective.
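The logic she traced fits in a few lines. The sketch below is a simplification for illustration, not the plugin's actual code: it checks CPU only, whereas the real scheduler also considers memory and other resources, and it assumes all 16 cores of each node are allocatable, which real nodes never quite are. Figures are in millicores, using the 30% utilized and 95% allocated numbers from the incident.

```python
from dataclasses import dataclass


@dataclass
class Node:
    allocatable_mcpu: int  # CPU the node offers to pods, in millicores
    requested_mcpu: int    # sum of CPU requests of pods already bound
    used_mcpu: int         # measured usage; shown on dashboards only


def fits(node: Node, pod_request_mcpu: int) -> bool:
    """Request-based fit check: measured usage plays no part."""
    return node.allocatable_mcpu - node.requested_mcpu >= pod_request_mcpu


# 200 nodes, 16 cores each, 95% requested, 30% actually used.
cluster = [
    Node(allocatable_mcpu=16_000, requested_mcpu=15_200, used_mcpu=4_800)
    for _ in range(200)
]

POD_REQUEST = 4_000  # the batch job's 4 cores

idle_cores = sum(n.allocatable_mcpu - n.used_mcpu for n in cluster) / 1000
unrequested = max(n.allocatable_mcpu - n.requested_mcpu for n in cluster)

print(f"idle by utilization: {idle_cores:.0f} cores across the cluster")
print(f"largest unrequested slice on any node: {unrequested} millicores")
print(f"pod schedulable: {any(fits(n, POD_REQUEST) for n in cluster)}")
```

The idle total is spread thinly across nodes and belongs to pods that reserved it, so no single node can offer four unreserved cores.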

Then she went deeper. She learned what happens after the scheduler places a pod: the kubelet creates a cgroup for the pod’s containers, and the cgroup’s cpu.cfs_quota_us and cpu.cfs_period_us enforce the CPU limit. She learned that a pod exceeding its CPU limit gets throttled by the kernel’s Completely Fair Scheduler (CFS) — the process is placed in a throttled state and doesn’t run until the next period. She learned to read throttling metrics:

cat /sys/fs/cgroup/cpu/kubepods/pod<uid>/<container-id>/cpu.stat

nr_throttled and throttled_time (throttled_usec on cgroup v2 nodes, where the file lives under a different path) tell you whether a container is being CPU-starved, a situation invisible from kubectl top but immediately apparent from the cgroup interface.
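Because these counters are cumulative, a single reading says little about what is happening right now; the useful figures are the share of periods that were throttled and the change between two readings. A minimal sketch of that reading, assuming the key-value layout of cpu.stat; the helper names are hypothetical and the sample values are invented for illustration.

```python
def parse_cpu_stat(text: str) -> dict:
    """Parse 'key value' lines from a cgroup cpu.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_ratio(stats: dict) -> float:
    """Fraction of enforcement periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


def throttles_between(earlier: dict, later: dict) -> int:
    """Newly throttled periods between two readings of the same file."""
    return later.get("nr_throttled", 0) - earlier.get("nr_throttled", 0)


# Invented readings taken one minute apart (cgroup v1 field names).
FIRST = "nr_periods 90000\nnr_throttled 45000\nthrottled_time 1200000000000\n"
SECOND = "nr_periods 90600\nnr_throttled 45300\nthrottled_time 1208000000000\n"

first, second = parse_cpu_stat(FIRST), parse_cpu_stat(SECOND)
print(f"throttled in {throttled_ratio(second):.0%} of periods overall")
print(f"{throttles_between(first, second)} throttled periods in the last minute")
```

With the default 100ms CFS period there are 600 periods per minute, so the delta between readings can never exceed 600; a container throttled in half of them is spending a large part of its life waiting for the next period.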

Today, when a scheduling issue stumps the team, Adaeze doesn’t guess. She checks the scheduler’s view of the world (allocated resources, not utilized resources), reads the scheduler events and kubelet logs for the actual rejection reason, and when necessary, examines the cgroup configuration on the node. She operates at three layers simultaneously: the Kubernetes API layer, the kubelet process layer, and the Linux kernel layer.

Her most recent save: a pod that was running but performing poorly. Application metrics showed request latency spikes every 100ms. Other engineers suspected garbage collection in the JVM. Adaeze checked cpu.stat in the pod’s cgroup first. nr_throttled: 47382. The container had been throttled in more than forty-seven thousand CFS periods. The CPU limit was set too low for the workload’s burst profile. She raised the limit, the throttling stopped, the latency spikes disappeared. Diagnosis time: three minutes. Previous debugging effort by the team before she was called: two days.

The Common Thread

Nneka, Daniel, and Adaeze work at different layers, in different languages, on different kinds of systems. But their stories share a structure:

A moment of failure: Each encountered a problem they couldn’t solve with their existing knowledge. Not a theoretical gap — a real failure, in production or in performance, that made their ignorance tangible.
A decision to look: Instead of adding more on top (more replicas, a different library, more nodes), they decided to understand the layer beneath. This decision felt uncomfortable. It meant admitting they didn’t know something fundamental about the systems they’d been running for years.
Focused, bounded study: None of them tried to learn everything. Nneka didn’t become a kernel developer. Daniel didn’t learn to write a browser engine. Adaeze didn’t rewrite the Kubernetes scheduler. Each identified the layer directly below their daily work and spent a bounded amount of time — six months to a year — building a working mental model of that layer.
Compound returns: The investment paid off not in one dramatic moment, but in hundreds of small ones. Faster debugging. Better default architecture decisions. The ability to predict failure modes before they happen. The ability to teach others.

Notice what’s missing from this list: exceptional intelligence. Nneka describes herself as “a pretty average programmer who got tired of not understanding.” Daniel says he was “the slowest learner on my bootcamp cohort.” Adaeze’s degree is in business administration.

The common thread is not IQ. It’s the decision to look one layer deeper, combined with the discipline to do it systematically.

This Is a Learnable Skill

The engineers profiled here did not start with deep systems knowledge. They acquired it, deliberately, over months. Their process was remarkably similar:

They identified a specific gap: not “I need to learn more” but “I don’t understand why my database connections are timing out.”
They found one resource: a book, a documentation page, a blog post by someone who’d solved a similar problem.
They spent short, regular sessions: 15–30 minutes a day, not weekend marathons that burn out by Sunday afternoon.
They applied what they learned immediately: the next time a production issue arose, they tried the new tool or checked the new metric, even if they weren’t sure what they’d find.
They wrote down what they learned: not polished blog posts, but one-line notes in a file. “CLOSE-WAIT means the remote side closed but our app hasn’t.” “CFS throttling shows up in cpu.stat, not in kubectl top.”

This is not a talent. Talent is what you fall back on when you don’t have a process. This is a process. It takes time — six months to a year before you notice a significant difference in your debugging speed and architectural judgment. It takes consistency — the fifteen-minute-a-day habit matters more than the occasional deep dive. And it takes honesty — you have to admit, regularly, that there are things beneath your layer that you don’t understand.

But it does not take anything you don’t already have. Every engineer profiled in this chapter started from the same place you’re in now: competent at their layer, blind to the one below. Every one of them will tell you the same thing: the only thing they regret is not starting sooner.

The next two sections give you the specific habits and metrics to make this work: a daily curiosity protocol, and a way to measure your progress over time.