the invisible-layer how abstraction is making software engineers dumber

Microservices: Conway's Revenge

11 min read Chapter 25 of 56
Summary

Analyzes microservices as an organizational pattern misapplied as a technical solution, contrasting monolith and microservices debugging experiences through a concrete order failure scenario, and exposing the distributed monolith anti-pattern alongside honest decision criteria for architecture choice.

What Conway Actually Said

In 1967, Melvin Conway submitted a paper to the Harvard Business Review. They rejected it. He published it elsewhere under the title “How Do Committees Invent?” and the central observation became one of the most reliably predictive laws in software engineering:

Organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations.

Notice: this is not advice. Conway wasn’t saying you should align your architecture to your org chart. He was saying you will, whether you intend to or not. If three teams build an email system, you get a three-part email system. If one team builds it, you get one system.

Microservices are Conway’s Law running in reverse — deliberately. Instead of accepting that your architecture will mirror your organization, you restructure your organization into small, autonomous teams and then declare that each team owns an independent service. The architecture follows the org chart, as Conway predicted. The selling point is that each team can deploy independently, choose its own tech stack, and move at its own pace.

This is genuinely valuable when you have 200 engineers. It is genuinely destructive when you have 12.

The Monolith Debugging Advantage

Tuesday, 2:47 PM. A customer reports that their order was confirmed but never charged. Let’s debug this in two different architectures.

Monolith: Twenty Minutes to Root Cause

You open your application log. One file, one process:

14:43:12 [OrderController] POST /orders - user_id=8842, items=[SKU-112, SKU-445]
14:43:12 [InventoryService] Checking stock for SKU-112: 24 available
14:43:12 [InventoryService] Checking stock for SKU-445: 3 available
14:43:13 [PricingService] Calculated total: $147.82 (discount: SUMMER10 applied)
14:43:13 [PaymentService] Initiating charge: $147.82, card_id=card_9f2x
14:43:13 [PaymentService] ERROR: Stripe returned 402 - card_declined
14:43:13 [PaymentService] Card declined. Falling through to order completion.
14:43:13 [OrderController] Order ORD-29481 created, status=CONFIRMED
14:43:13 [NotificationService] Confirmation email sent for ORD-29481

Found it. Line 7. The payment failed, but the error handling fell through instead of aborting the order. The PaymentService.charge() method caught the exception, logged it, and returned None instead of re-raising. The caller didn’t check for None. The order proceeded as if payment succeeded.

You set a breakpoint on the payment method. You reproduce with a test card that declines. You see the bug. You add the missing None check in the caller. Deploy. Done.
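The failure mode above can be sketched in a few lines. This is a minimal illustration, not the actual code from the scenario: `gateway_charge`, `CardDeclined`, `charge_buggy`, `charge_fixed`, and `create_order` are hypothetical names, and the fix shown (re-raising instead of returning None) is one reasonable alternative to adding a None check at every call site.

```python
import logging

logger = logging.getLogger("payments")


class CardDeclined(Exception):
    """Raised by the (hypothetical) gateway client when a charge is refused."""


def gateway_charge(amount_cents: int, card_id: str) -> str:
    # Stand-in for the real payment gateway call; it always declines here.
    raise CardDeclined(f"card {card_id} declined")


def charge_buggy(amount_cents: int, card_id: str):
    # Bug: the exception is swallowed, so the caller sees None, not a failure.
    try:
        return gateway_charge(amount_cents, card_id)
    except CardDeclined as exc:
        logger.error("charge failed: %s", exc)
        return None


def charge_fixed(amount_cents: int, card_id: str) -> str:
    # Fix: log for diagnosis, then re-raise so the order flow has to stop.
    try:
        return gateway_charge(amount_cents, card_id)
    except CardDeclined as exc:
        logger.error("charge failed: %s", exc)
        raise


def create_order(charge, amount_cents: int, card_id: str) -> str:
    # The caller never inspects the return value, which is how the buggy
    # version ends up confirming an unpaid order.
    charge(amount_cents, card_id)
    return "CONFIRMED"
```

With `charge_buggy`, `create_order` returns "CONFIRMED" for a declined card. With `charge_fixed`, the exception propagates and the order is never confirmed.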

One log file. One process. One debugger. Twenty minutes.

Microservices: Three Hours in the Labyrinth

Same bug. Five services.

You open the distributed trace for the failed order. If you have distributed tracing. (If you don’t, you’re about to spend three days instead of three hours.)

[Trace ID: 7fa2b91c]
Order Service    → Inventory Service    ✓  (12ms)
Order Service    → Pricing Service      ✓  (8ms)
Order Service    → Payment Service      ?  (timeout after 30s)
Order Service    → Notification Service ✓  (6ms)

The Payment Service timed out. But the order still completed. Why?

You check the Order Service logs. The Payment Service call timed out, the Order Service caught the timeout exception, logged a warning, and continued to the Notification Service. The engineer who wrote that code assumed a timeout meant “try again later” and marked the order for async payment retry. But the retry queue lives in the Payment Service, and a network-level timeout tells the caller nothing about whether that service ever saw the original request: it might have been processed, or it might not.

Now you need to check: did the Payment Service actually receive the request? You check the Payment Service logs. Nothing for that trace ID. So the request was lost in transit, or the Payment Service crashed before logging it, or the load balancer routed it to an instance that was being drained. You check the load balancer logs. The request was forwarded to payment-service-3, which was in the middle of a rolling deployment. The container received the request, started processing, then was terminated by the orchestrator before completing.

Root cause: the rolling deployment strategy didn’t wait for in-flight requests to complete before killing the old container. The Order Service timeout was set to 30 seconds. The deployment grace period was 10 seconds. The request arrived during the deployment window, was killed after 10 seconds, and the Order Service’s 30-second timeout eventually fired 20 seconds later, triggering the fallthrough logic.
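Two lessons from this diagnosis can be expressed as a small sketch: a timeout is an unknown outcome rather than a success, and a caller timeout longer than the shutdown grace period leaves a window in which in-flight requests can be killed. The status names and helper functions below are hypothetical, chosen only for illustration.

```python
from enum import Enum


class ChargeOutcome(Enum):
    CHARGED = "charged"
    DECLINED = "declined"
    UNKNOWN = "unknown"  # timeout: the request may or may not have run


def next_order_status(outcome: ChargeOutcome) -> str:
    # Only a confirmed charge may confirm the order.
    if outcome is ChargeOutcome.CHARGED:
        return "CONFIRMED"
    if outcome is ChargeOutcome.DECLINED:
        return "CANCELLED"
    # A timeout is not a success; park the order until payment is reconciled.
    return "PENDING_PAYMENT"


def drain_gap_seconds(caller_timeout_s: float, grace_period_s: float) -> float:
    # A positive result means a request can still be in flight when the
    # grace period expires, so it can be killed mid-processing.
    return max(0.0, caller_timeout_s - grace_period_s)
```

For the numbers in this incident, `drain_gap_seconds(30, 10)` gives a 20-second gap, which matches the delay between the container being terminated and the Order Service timeout firing.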

Two services’ logs. One distributed trace. One load balancer log. One Kubernetes deployment configuration. One timeout configuration. One error handling path. Three hours. And you needed expertise in networking, container orchestration, and distributed tracing to diagnose it.

The bug was the same — inadequate error handling on payment failure. The diagnosis was 9x longer because the error handling, the payment processing, and the order logic live in different processes, on different machines, managed by different teams, deployed on different schedules.

[Figure: Monolith vs. microservices debugging flow comparison]

Data Consistency Across Services

In a monolith, an order creation involves three tables in one database. You wrap them in a transaction. Either everything commits or everything rolls back. ACID guarantees handle the complexity.

BEGIN TRANSACTION;
  INSERT INTO orders (id, user_id, total) VALUES (...);
  UPDATE inventory SET quantity = quantity - 1 WHERE sku = 'SKU-112';
  INSERT INTO payments (order_id, amount, status) VALUES (..., 'charged');
COMMIT;

If the payment insert fails, the inventory update rolls back. Atomicity. Simple.
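The rollback behavior can be observed with Python's built-in sqlite3 module. This is a sketch under assumptions: the table schemas and the NOT NULL constraint used to force the payment insert to fail are invented for illustration, since the chapter does not specify them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, user_id INTEGER, total REAL);
    CREATE TABLE inventory (sku TEXT PRIMARY KEY, quantity INTEGER);
    CREATE TABLE payments (order_id TEXT, amount REAL, status TEXT NOT NULL);
    INSERT INTO inventory VALUES ('SKU-112', 24);
""")
conn.commit()

try:
    with conn:  # commits on success, rolls back if the block raises
        conn.execute("INSERT INTO orders VALUES ('ORD-29481', 8842, 147.82)")
        conn.execute(
            "UPDATE inventory SET quantity = quantity - 1 WHERE sku = 'SKU-112'"
        )
        # Simulated failure: NULL violates the NOT NULL constraint on status.
        conn.execute("INSERT INTO payments VALUES ('ORD-29481', 147.82, NULL)")
except sqlite3.IntegrityError:
    pass  # the whole transaction has already been rolled back

quantity = conn.execute(
    "SELECT quantity FROM inventory WHERE sku = 'SKU-112'"
).fetchone()[0]
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

After the failed payment insert, `quantity` is still 24 and `order_count` is 0: the earlier order insert and inventory update were undone along with it.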

In microservices, each service owns its own database. There is no cross-service transaction. You cannot do a distributed BEGIN TRANSACTION across the Order Database, the Inventory Database, and the Payment Database. (Technically you can — it’s called two-phase commit — but we’ll get to why you shouldn’t.)

Two-Phase Commit: Technically Correct, Practically Brutal

Two-phase commit (2PC) works like this: a coordinator asks all participants “can you commit?” (Phase 1). If all say yes, the coordinator says “commit” (Phase 2). If any say no, the coordinator says “abort.”

The problem is what happens when the coordinator crashes between Phase 1 and Phase 2. Every participant has voted “yes” and is holding database locks, waiting for the coordinator to tell them to commit or abort. They can’t release the locks because they don’t know the outcome. If the coordinator stays down for 5 minutes, those locks are held for 5 minutes. Every other transaction touching those rows is blocked, so throughput on the affected rows drops to zero until the coordinator recovers.

2PC also requires all participants to be available simultaneously. If the Inventory Service is down during the commit phase, the entire transaction blocks. You’ve coupled the availability of every participating service together — the exact thing microservices were supposed to prevent.
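The blocking problem can be shown with a toy in-memory simulation. This is not a real 2PC implementation (there is no durable log, no network, no recovery); the `Participant` class and `two_phase_commit` function are hypothetical names that only model who holds locks at which point.

```python
class Participant:
    def __init__(self, name: str) -> None:
        self.name = name
        self.locked = False
        self.state = "idle"

    def prepare(self) -> bool:
        # Phase 1: vote yes and hold locks until the coordinator decides.
        self.locked = True
        self.state = "prepared"
        return True

    def finish(self, decision: str) -> None:
        # Phase 2: apply the coordinator's decision and release locks.
        self.state = decision
        self.locked = False


def two_phase_commit(participants, crash_after_prepare: bool = False) -> str:
    votes = [p.prepare() for p in participants]
    if crash_after_prepare:
        # Coordinator dies here: nobody learns the outcome, locks stay held.
        return "coordinator_crashed"
    decision = "committed" if all(votes) else "aborted"
    for p in participants:
        p.finish(decision)
    return decision
```

In the normal run every participant ends unlocked. With `crash_after_prepare=True`, every participant is left in the "prepared" state with its lock held, which is the blocked window described above.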

Saga Pattern: Eventually Consistent, Eventually Painful

The alternative is the saga pattern. Instead of one atomic transaction, you execute a sequence of local transactions, each in its own service. If any step fails, you execute compensating transactions to undo the previous steps.

Step 1: Order Service    → Create order (status: PENDING)
Step 2: Inventory Service → Reserve stock
Step 3: Payment Service   → Charge card
Step 4: Order Service    → Update order (status: CONFIRMED)

If Step 3 fails:
  Compensate Step 2: Inventory Service → Release reserved stock
  Compensate Step 1: Order Service → Cancel order

This works. It also introduces a new category of bugs that don’t exist in monoliths. What if the compensation for Step 2 fails? Now you have an order that’s cancelled but inventory that’s still reserved. You need compensation for your compensation. What if the failure notification is lost? The Order Service doesn’t know it needs to compensate. What if two sagas run concurrently for the same SKU and both reserve the last item in Step 2 before either reaches Step 3?

Sagas trade atomicity for availability. You get independent service deployability, but you lose the guarantee that your data is consistent at any given point in time. You’re back to eventual consistency, except now “eventually” means “after all the compensating transactions complete, assuming none of them fail, assuming the message queue delivers them all, assuming no concurrent sagas create conflicts.”
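A minimal in-process sketch of the saga flow makes the compensation logic concrete. It assumes each step is a pair of plain callables (an action and its compensation); `run_saga`, `decline`, and `noop` are hypothetical names, and a real saga would persist its progress and deliver steps through a message queue rather than calling functions directly.

```python
from typing import Callable, List, Tuple

# (name, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: List[str] = []
    done: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
            log.append(f"ok:{name}")
            done.append(step)
        except Exception:
            log.append(f"failed:{name}")
            for prev_name, _, compensate in reversed(done):
                try:
                    compensate()
                    log.append(f"compensated:{prev_name}")
                except Exception:
                    # The hard case from the text: the undo itself failed.
                    log.append(f"compensation_failed:{prev_name}")
            break
    return log


def decline() -> None:
    raise RuntimeError("card_declined")


def noop() -> None:
    pass
```

Running the four steps from the listing with `decline` as the charge action records the two successful steps, the failed charge, and then the compensations for stock reservation and order creation, in that order. The `compensation_failed` branch only records the problem; deciding what to do next is exactly the part sagas leave to you.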

Both patterns are valid. Both are painful. The honest assessment: if your business logic requires atomic operations across multiple data stores, you’re paying a massive complexity tax in a microservices architecture. That tax might be worth it for organizational reasons. But you should know you’re paying it.

The Distributed Monolith

There’s an architecture worse than both monoliths and microservices: the distributed monolith. It has all the operational complexity of microservices — separate deployments, network calls, distributed tracing — with none of the benefits. It happens when teams decompose a monolith into services but maintain synchronous, tightly-coupled communication between them.

Symptoms of the distributed monolith:

  • Lockstep deployments. Service A can’t deploy without Service B deploying a compatible version first. You’re still coordinating releases across teams — you’ve just added network calls to the coordination.
  • Synchronous chains. Every request flows through the same five services in the same order. If any one is down, the request fails. Your availability is the product of individual availabilities: 99.9% × 99.9% × 99.9% × 99.9% × 99.9% ≈ 99.5%. By decomposing, you’ve made your expected downtime roughly five times that of any single service.
  • Shared databases. Two services read from and write to the same tables. You got separate deployments without separate data ownership. Every schema change requires coordinating two teams.
  • No independent scaling. Service A can’t handle more load without Service B also scaling, because every request to A generates a request to B.
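The availability arithmetic in the synchronous-chains symptom can be checked directly. The sketch assumes failures are independent across services, which is the same simplification the multiplication itself makes; `chain_availability` is a hypothetical helper name.

```python
import math


def chain_availability(availabilities):
    # A synchronous chain is up only when every hop is up.
    return math.prod(availabilities)


chain = chain_availability([0.999] * 5)

# How much more downtime the chain has than a single 99.9% service.
downtime_ratio = (1 - chain) / (1 - 0.999)
```

Here `chain` comes out at about 0.995 and `downtime_ratio` at just under 5: five services at 99.9% each, called in sequence, are down about five times as often as any one of them.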

The distributed monolith gives you the worst of both worlds: the deployment complexity and network unreliability of microservices, with the coupling and coordination requirements of a monolith. You’ve taken a system that worked as one process and added network latency, partial failure modes, and operational overhead without gaining any independence.

This isn’t a strawman. It’s the most common outcome when teams adopt microservices without restructuring their organization and data ownership simultaneously. You can split code into services, but if the teams still need to coordinate on every change, Conway’s Law will enforce coupling through the back door.

When Microservices Are Right (And When They’re Not)

Microservices are right when:

  • You have multiple teams (15+ engineers) who need to deploy independently. The organizational benefit is real and significant. If Team A’s bug blocks Team B’s feature launch, and this happens weekly, microservices solve a real problem.
  • You have genuinely different scaling requirements. Your image processing service needs GPU instances. Your API gateway needs small, numerous instances. Different services, different infrastructure.
  • You have genuinely different reliability requirements. Your payment service needs 99.99% uptime. Your recommendation service can tolerate 99.9%. Different SLAs justify different operational investments.
  • You can afford the infrastructure. Distributed tracing, centralized logging, service mesh, container orchestration, separate CI/CD pipelines per service — this infrastructure costs six figures annually and requires dedicated platform engineering.

Microservices are wrong when:

  • You have a small team (under 15 engineers). The operational overhead exceeds the organizational benefit. You don’t have the coordination problem that microservices solve, and you’re creating distributed systems problems you don’t have the staff to manage.
  • Your services share the same data model. If every request requires data from three services’ databases, you haven’t decomposed your domain — you’ve fragmented it.
  • You can’t invest in observability infrastructure. Without distributed tracing, centralized logging, and automated deployment pipelines, microservices are a debugging nightmare with no upside.
  • Your team doesn’t have distributed systems expertise. Microservices require engineers who understand network failure modes, eventual consistency, and distributed data patterns. If your team’s mental model is “it’s like a function call but over HTTP,” you’re going to have a very bad time.

The question isn’t “monolith or microservices?” The question is: “What problem am I solving, and does the solution’s cost justify the problem’s severity?” Conway’s Law is an observation, not a mandate. You can work with it or against it, but you cannot ignore it. And no architecture diagram will save you from the distributed systems problems that microservices introduce. You either understand those problems before you decompose, or you learn about them in production incidents after.

Conway’s revenge is this: you thought you were making a technical decision, but you were making an organizational one. And the organization will enforce its structure on your system regardless of what your architecture diagram says. The only question is whether you’ll be honest about the tradeoffs before deployment, or whether the 3 AM pages will teach you after.