Epilogue: The Machine Is Still There
Returns to Marcus from the Preface, now six months into deliberate systems study. He faces a near-identical production incident — DNS change causing connection failures — and this time diagnoses it in minutes instead of helplessly restarting pods. The epilogue reflects on the enduring presence of the physical machine beneath all our abstractions and closes with a quiet, direct statement of the choice every engineer faces: will you understand the system, or will you ride it and hope?
It’s 3:12 AM on a Thursday, seven months after the worst night of Marcus’s career.
The alert is familiar. Checkout failures climbing. Payment service returning 504s on a subset of requests. The Grafana dashboard turns orange, then red. Slack lights up. The incident commander pings the on-call channel. Marcus is on call.
He opens a terminal on one of the payment service pods. Not the application logs — not yet. The terminal.
ss -tnp | grep :443 | awk '{print $1}' | sort | uniq -c | sort -rn
Forty-one connections in ESTABLISHED. Nine in CLOSE-WAIT. Two in SYN-SENT.
He’s seen this before. The CLOSE-WAIT connections mean the remote side closed, but the local application hasn’t cleaned up. The SYN-SENT connections are the tell — outbound connections that started the three-way handshake but never completed it. Something is wrong with the destination.
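For readers who want to reuse that first command, here is a minimal sketch of the same tally as a shell function. The name `tally_states` is hypothetical, and the sketch assumes the TCP state is the first column, which holds for `ss -tn` output. It reads from stdin, so it can be tried on saved output as well as on a live pod. Note that `ss` abbreviates the established state as `ESTAB`.

```shell
# tally_states: read `ss -tn`-style lines on stdin and count sockets per TCP state.
# Assumption: the state is the first whitespace-separated column.
tally_states() {
    # Skip the header row, count each state, then print "count state" pairs.
    awk '$1 != "State" { count[$1]++ }
         END { for (s in count) printf "%d %s\n", count[s], s }' | sort -rn
}
```

On a live host, something like `ss -tn '( dport = :443 )' | tally_states` should give the same breakdown. Letting `ss` filter on the destination port is a little stricter than `grep :443`, which would also match an ephemeral local port such as `:44312`.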
He checks which IPs the SYN-SENT connections are targeting:
ss -tnp state syn-sent
Two connections to 10.43.8.201:443. He cross-references with DNS:
dig +short payments.provider.com
Four IPs come back. Three of them match the ESTABLISHED connections. 10.43.8.201 is there too — it’s in DNS, but it’s not accepting connections. He confirms in five seconds:
curl -o /dev/null -w "%{time_connect}\n" --connect-timeout 3 https://10.43.8.201:443/
Timeout. The IP is in DNS but the host isn’t responding.
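The same check can be run against every address the name resolves to, rather than one suspect at a time. The sketch below is one possible way to do it, not the tooling Marcus used: `check_tcp` and `probe_ips` are hypothetical helper names, and it assumes `bash` and the coreutils `timeout` command are available in the pod.

```shell
# check_tcp IP PORT SECONDS: succeed only if a TCP handshake completes in time.
# Uses bash's /dev/tcp pseudo-device; `timeout` kills the attempt after SECONDS.
check_tcp() {
    timeout "$3" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" </dev/null 2>/dev/null
}

# probe_ips PORT [SECONDS]: read one address per line on stdin and label each one.
probe_ips() {
    local port="$1" secs="${2:-3}" ip
    while read -r ip; do
        [ -n "$ip" ] || continue          # ignore blank lines
        if check_tcp "$ip" "$port" "$secs"; then
            echo "$ip ok"
        else
            echo "$ip unreachable"
        fi
    done
}
```

A typical invocation would be `dig +short payments.provider.com | probe_ips 443 3`. If the name resolves through a CNAME, `dig +short` also prints the alias target, so the input may need filtering first.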
Marcus checks the HTTP client configuration:
dns_cache_ttl: 60
connection_max_lifetime: 300
Those two lines are there because he put them there, seven months ago, the morning after Dara diagnosed the original incident. The connection pool is cycling. DNS is refreshing. But the client is still sending some requests to the dead IP because the DNS record includes it — the provider hasn’t removed it yet.
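A rough worked bound shows what those two settings buy. It assumes both values are in seconds (the config does not state units) and that the client re-resolves once its cache entry expires. An address dropped from DNS could be handed to a new connection for up to `dns_cache_ttl` seconds more, and that connection could then live for up to `connection_max_lifetime` seconds. The helper name is hypothetical.

```shell
# worst_case_stale_seconds DNS_CACHE_TTL CONNECTION_MAX_LIFETIME
# Upper bound, in seconds, on how long a client could keep using an address
# after it disappears from DNS. Assumption: both inputs are in seconds, and
# any extra caching by upstream resolvers is ignored.
worst_case_stale_seconds() {
    local dns_cache_ttl="$1" connection_max_lifetime="$2"
    echo $(( dns_cache_ttl + connection_max_lifetime ))
}

worst_case_stale_seconds 60 300   # prints 360
```

The clock only starts once the provider removes the address. While the dead IP is still published, every refresh can hand it out again, so the settings alone cannot clear it.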
He posts in the incident channel:
Root cause identified. Payment provider has a dead IP (10.43.8.201) in their DNS rotation. Connections to that IP time out at the TCP level. Workaround: add the IP to our connection-level blocklist. Permanent fix: contact provider to remove the dead IP from DNS.
He applies the workaround. The 504s stop. Total time from alert to resolution: eleven minutes.
Seven months ago, this same class of issue took forty-seven minutes and an SRE named Dara to resolve. Marcus didn’t know what ss was. He didn’t know TCP had states. He didn’t know his HTTP client cached DNS resolutions. He restarted pods. He doubled timeouts. He made it worse.
The difference isn’t that Marcus became a different person. He’s the same engineer — same company, same codebase, same on-call rotation. He didn’t go back to school. He didn’t take a sabbatical. He spent six months asking “but how?” for fifteen minutes a day. He started a TIL file that now has 127 entries. He read one book on TCP/IP — not even the whole thing, just the chapters on connection management and DNS. He learned to use ss, dig, strace, and lsof. He built a mental model of the layer beneath his application, one piece at a time.
It wasn’t dramatic. There was no single breakthrough moment. There was a slow, steady shift from helplessness to competence. The kind of shift that’s invisible week to week but undeniable when you compare month one to month six.
Every era of computing has had two kinds of engineers.
The first kind understands the machine. Not all of it — nobody understands all of it anymore, and that’s fine. But they understand the layers they touch and the layers those layers rest on. When something breaks, they reason about it. They form hypotheses. They test them. They find root causes.
The second kind rides the machine. They know the interfaces. They’re productive, even impressive. They ship features, build products, advance their careers. But when the machine misbehaves — when the abstraction leaks, when the layer below does something unexpected — they are passengers in a car they can’t open the hood of.
The ratio between these two kinds has shifted over the decades. In 1985, most working programmers understood at least something about the hardware they ran on, because the abstractions were thin enough that they had to. By the mid-2010s, cloud computing, containerization, and framework ecosystems had made it possible to build production systems without understanding any of the layers beneath the API you called. By 2026, AI-assisted development has added another layer of indirection: you don’t even write all the code yourself anymore, and the code you don’t write rests on layers you’ve never inspected.
The machine hasn’t changed. Electrons still flow through transistors. CPUs still execute instructions in sequence — cleverly reordered, pipelined, cached, but sequential at their core. Memory still has latency. Networks still drop packets. Disks still seek. The physics hasn’t been abstracted away. It’s been hidden behind enough layers that you can forget it’s there.
Until it reminds you. At 3 AM. When revenue is burning.
The argument of this book is not that every engineer must understand every layer. That’s impossible and unnecessary. The argument is simpler: understand the layer you work on, and the layer beneath it. Have a working theory of the layer below that. Know when you’ve reached the boundary of your knowledge, and know how to ask the right questions when you need to cross it.
This is not an extraordinary ask. It’s what engineering has always required. A structural engineer understands steel properties, not just beam calculations. An electrical engineer understands semiconductor physics, not just circuit diagrams. A mechanical engineer understands material science, not just CAD models. The expectation that a software engineer should understand their computational substrate is not gatekeeping. It’s basic professional competence.
You cannot abstract away physics. You cannot abstract away the network. You cannot abstract away the CPU. You cannot abstract away the operating system, the runtime, the memory model, or the query planner. You can pretend they’re not there. You can work for years without ever looking. But they are there, running beneath every line of code you write, every request you serve, every byte you store. They are the machine.
The machine is still there.
The question is whether you’ll meet it.