System Design From Scratch: The Components That Actually Run Production Systems

A request to a production system such as Amazon.com passes through DNS, a CDN, and a load balancer before it reaches application code, and each layer helps keep latency under a second. Behind a single page load, the read may be served from a Redis cache or a database read replica, which keeps load off the primary database and keeps the site available to millions of users.

Why This Matters

Boxes on a whiteboard often ignore the realities that production systems face, such as network latency and the mechanics of scaling. Horizontal scaling adds redundancy and makes zero-downtime deployments possible, whereas vertical scaling hits a hard physical ceiling and typically requires service restarts during upgrades. During high-stakes events like Black Friday, these architectural choices determine whether a platform survives millions of simultaneous requests or collapses under load.

Key Insights

Horizontal scaling (scaling out) adds identical servers in parallel, which enables zero-downtime rolling deployments and near-linear capacity growth as long as the servers are stateless.
Managed load balancers like AWS Elastic Load Balancing (ELB) handle SSL/TLS termination and connection draining to maintain high availability.
API gateways serve as reverse proxies, routing requests by path (such as /auth or /payments) to the matching microservice while keeping internal IPs private.
Fan-out architecture using AWS SNS and SQS decouples services, allowing one event to trigger multiple independent actions without cascading failures.
Redis caching reduces database load: for a high-traffic product page, it can serve up to 49,999 out of 50,000 requests from memory, so that only the initial cache miss reaches the database.
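
The caching figure above can be checked with a small cache-aside sketch. This is a minimal illustration, not the article's implementation: TTLCache is a hypothetical in-memory stand-in for a Redis client (it only mimics get and set-with-expiry), ProductService and its fake database read are invented for the example, and the 60-second TTL is an assumed value.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Hypothetical in-memory stand-in for a Redis client (get / set with expiry)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._store: Dict[str, Tuple[str, float]] = {}
        self._clock = clock

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired entries count as a miss
            return None
        return value

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._store[key] = (value, self._clock() + ttl_seconds)


class ProductService:
    """Cache-aside: check the cache first, fall back to the database on a miss."""

    def __init__(self, cache: TTLCache, ttl_seconds: float = 60.0) -> None:
        self.cache = cache
        self.ttl_seconds = ttl_seconds  # assumed TTL, chosen for illustration
        self.db_reads = 0

    def _load_from_db(self, product_id: str) -> str:
        # Placeholder for a real (slow) database query.
        self.db_reads += 1
        return f"details for product {product_id}"

    def get_product(self, product_id: str) -> str:
        key = f"product:{product_id}"
        cached = self.cache.get(key)
        if cached is not None:
            return cached  # cache hit: the database is not touched
        value = self._load_from_db(product_id)
        self.cache.set(key, value, self.ttl_seconds)
        return value


def simulate(requests: int = 50_000) -> Tuple[int, float]:
    """Send `requests` reads for one hot product; return (db_reads, hit_ratio)."""
    # A frozen clock keeps the run deterministic: nothing expires mid-simulation.
    service = ProductService(TTLCache(clock=lambda: 0.0))
    for _ in range(requests):
        service.get_product("42")
    return service.db_reads, (requests - service.db_reads) / requests
```

Calling simulate() reports a single database read for 50,000 requests, which matches the 49,999 out of 50,000 figure. In a real deployment the ratio depends on the TTL, on how often the product changes, and on how many distinct keys are hot, so the number is best read as an upper bound for one very popular page.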

Practical Applications

Use Case: Global e-commerce sites use CDNs like Amazon CloudFront to serve static assets from edge nodes in cities like Mumbai or Tokyo to reduce latency. Pitfall: Using vertical scaling for high-traffic events leads to downtime during hardware upgrades.
Use Case: Financial systems route reads that must reflect a user's own recent write (‘read-your-own-write’ operations) to the primary database node, because a replica may not have received the write yet. Pitfall: Ignoring replication lag in read replicas can cause users to see stale data immediately after an update.
Use Case: High-volume notification systems use asynchronous queues like AWS SQS to process millions of emails without blocking the main application server. Pitfall: Synchronous processing of heavy tasks causes cascading failures and server timeouts.
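
The queue-based decoupling described above can be sketched in memory. This is an illustrative model only: Topic stands in for an SNS topic and each deque for an SQS queue, and the names place_order, drain, and demo, along with the three-attempt retry limit, are assumptions made for the example. Real SQS adds visibility timeouts and a configurable dead-letter queue, which this sketch reduces to a simple retry loop.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple


class Topic:
    """In-memory stand-in for an SNS topic that fans out to subscribed queues."""

    def __init__(self) -> None:
        self.queues: Dict[str, Deque[dict]] = {}

    def subscribe(self, name: str) -> Deque[dict]:
        queue: Deque[dict] = deque()
        self.queues[name] = queue
        return queue

    def publish(self, event: dict) -> None:
        for queue in self.queues.values():
            queue.append(dict(event))  # every subscriber gets its own copy


def place_order(topic: Topic, order_id: int) -> str:
    """Request handler: publish the event and return without waiting on consumers."""
    topic.publish({"type": "order_placed", "order_id": order_id})
    return "accepted"


def drain(queue: Deque[dict], handler: Callable[[dict], None],
          max_attempts: int = 3) -> List[dict]:
    """Process each message; return the ones that failed max_attempts times."""
    dead_letters: List[dict] = []
    while queue:
        message = queue.popleft()
        for _ in range(max_attempts):
            try:
                handler(message)
                break  # processed successfully
            except Exception:
                continue  # retry the same message
        else:
            dead_letters.append(message)  # retries exhausted
    return dead_letters


def demo() -> Tuple[str, List[int], List[dict], List[dict]]:
    topic = Topic()
    email_queue = topic.subscribe("email")
    inventory_queue = topic.subscribe("inventory")
    sent: List[int] = []

    def send_email(message: dict) -> None:
        sent.append(message["order_id"])

    def update_inventory(message: dict) -> None:
        raise RuntimeError("inventory service is down")  # simulated outage

    status = place_order(topic, 1001)
    email_dead = drain(email_queue, send_email)
    inventory_dead = drain(inventory_queue, update_inventory)
    return status, sent, email_dead, inventory_dead
```

In demo(), the order is accepted and the email is sent even though the inventory consumer fails on every attempt. The failed message ends up in that consumer's dead-letter list rather than surfacing as an error in the request path, which is the isolation the fan-out pattern is meant to provide.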

References:

https://dev.to/sabitak/system-design-from-scratch-the-components-that-actually-run-production-systems-422l
