Building Systems That Don't Fall Apart: Reliability, Scalability, and Maintainability

TL;DR

Reliable systems continue working when things break. Scalable systems handle 10x load without rewriting everything. Maintainable systems don’t make engineers want to quit. This article breaks down what these terms actually mean in practice, using real-world examples like Twitter’s architecture evolution and AWS VM failures. Key takeaway: there’s no magic scaling sauce, but there are patterns that work.

Understanding Reliability: Faults vs Failures

Reliability means “continuing to work correctly, even when things go wrong.” Simple definition, but the devil is in distinguishing faults from failures.

A fault is when one component deviates from spec (a disk dies, a process crashes). A failure is when the entire system stops providing the required service to users. The goal isn’t zero faults (impossible), it’s preventing faults from cascading into failures.

Counterintuitively, deliberately increasing fault rates can improve reliability. Netflix’s Chaos Monkey randomly terminates instances in production so that fault-tolerance machinery is exercised continually and is known to work when it is needed. Many critical bugs hide in error-handling paths that never get exercised until production breaks at 3 AM.

One exception: security. If an attacker compromises your system and exfiltrates data, that’s not something you can “tolerate” after the fact. Prevention is the only cure.

The Three Classes of Faults

Hardware Faults: Random but Predictable

Hard disks have an MTTF of 10-50 years. On a cluster with 10,000 disks, expect roughly one disk failure per day on average. Standard mitigation: RAID configs, dual power supplies, hot-swappable CPUs, diesel generators.
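The "one failure per day" figure is easy to sanity-check. The sketch below assumes failures are independent and spread evenly over each disk's MTTF, which is a simplification; the function name is only illustrative.

```python
DAYS_PER_YEAR = 365

def expected_failures_per_day(num_disks: int, mttf_years: float) -> float:
    """Fleet-wide average: each disk fails about once per MTTF."""
    return num_disks / (mttf_years * DAYS_PER_YEAR)

# Bracket the estimate with both ends of the 10-50 year MTTF range.
low = expected_failures_per_day(10_000, 50)   # optimistic MTTF
high = expected_failures_per_day(10_000, 10)  # pessimistic MTTF
print(f"{low:.2f} to {high:.2f} failures per day")
```

Under those assumptions the range brackets one failure per day, so the rule of thumb is the right order of magnitude.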

This worked fine when single-machine failure was rare. But modern applications run on hundreds or thousands of machines, and on cloud platforms like AWS it is fairly common for VM instances to become unavailable without warning (the platforms prioritize flexibility and elasticity over single-machine reliability).

The shift: Software fault-tolerance is replacing hardware redundancy. Systems now tolerate entire machine losses, enabling rolling upgrades without downtime. You can patch one node at a time instead of scheduling maintenance windows.
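To make the rolling-upgrade idea concrete, here is a minimal in-memory sketch. The Node class and rolling_upgrade function are hypothetical names for illustration, and a real deployment would also run health checks before returning a node to service.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    version: str
    in_service: bool = True

def rolling_upgrade(nodes: list[Node], new_version: str) -> int:
    """Upgrade one node at a time; return the lowest in-service count seen."""
    min_in_service = len(nodes)
    for node in nodes:
        node.in_service = False          # drain: stop routing traffic here
        in_service = sum(n.in_service for n in nodes)
        min_in_service = min(min_in_service, in_service)
        node.version = new_version       # patch while the others keep serving
        node.in_service = True           # rejoin the pool before moving on
    return min_in_service

cluster = [Node(f"node-{i}", "v1") for i in range(4)]
lowest = rolling_upgrade(cluster, "v2")
print(lowest, [n.version for n in cluster])
```

Because only one node is out of the pool at any moment, capacity never drops below N-1. That only works if the system already tolerates losing a single machine.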

Software Errors: Systematic and Correlated

Hardware faults are mostly independent of one another. Software bugs are not. On June 30, 2012, a leap second triggered a bug in the Linux kernel that caused many applications to hang simultaneously across entire fleets. Other examples:

Runaway processes consuming all CPU/memory/disk
Downstream dependencies becoming unresponsive or returning corrupted data
Cascading failures where one component’s fault triggers another’s

These bugs lie dormant until triggered by unusual circumstances. Then you discover your code was making assumptions about its environment that stopped being true.

No quick fix exists. Mitigation layers include:

Careful reasoning about assumptions and interactions
Thorough testing (unit, integration, manual)
Process isolation with crash-and-restart patterns
Continuous monitoring and invariant checking in production
Deliberate fault injection to find hidden bugs
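As a rough illustration of the last item, the sketch below wraps a dependency so that it fails on a fixed schedule, then checks that the retry path copes. All names here (InjectedFault, inject_faults, call_with_retry, lookup_user) are hypothetical. The failure schedule is deterministic so the experiment is repeatable, whereas real chaos tooling usually injects faults at random.

```python
import itertools

class InjectedFault(Exception):
    """Simulated failure raised by the injector, not by real code."""

def inject_faults(func, fail_every):
    """Wrap func so that every fail_every-th call raises InjectedFault."""
    calls = itertools.count(1)
    def wrapper(*args, **kwargs):
        if next(calls) % fail_every == 0:
            raise InjectedFault(f"injected fault in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper

def call_with_retry(func, attempts, *args):
    """Call func, retrying on injected faults; re-raise after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return func(*args)
        except InjectedFault:
            if attempt == attempts:
                raise

def lookup_user(user_id):
    return f"user-{user_id}"

# Every third call to the dependency fails; one retry is enough to hide it.
flaky_lookup = inject_faults(lookup_user, fail_every=3)
results = [call_with_retry(flaky_lookup, 2, uid) for uid in range(10)]
print(results)
```

If the retry logic were missing or wrong, this experiment would surface it on a developer machine rather than during an incident.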

Human Errors: Configuration Kills

One study of large internet services found that configuration errors by operators were the leading cause of outages, while hardware faults played a role in only 10-25% of them. Even well-intentioned operators make mistakes.

Practical defenses:

Minimize error opportunities: Well-designed APIs make the right thing easy and the wrong thing hard. But don’t make interfaces so restrictive that people work around them.
Decouple experimentation from production: Provide sandbox environments with real data where people can explore without affecting users.
Test everything: Unit tests, integration tests, manual tests. Automated testing excels at covering corner cases.
Enable fast recovery: Fast rollbacks, gradual rollouts (limiting blast radius), and tools to recompute data when old computations were wrong.
Implement telemetry: Detailed monitoring provides early warnings and helps diagnose issues when they occur. Metrics are invaluable.
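Gradual rollouts are often implemented by hashing each user into a stable bucket. The sketch below is one way to do that, not a description of any particular feature-flag product. The in_rollout function and the "new-timeline" feature name are made up for the example.

```python
import hashlib

def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets per feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent

users = [f"user-{i}" for i in range(10_000)]
exposed_5 = {u for u in users if in_rollout(u, "new-timeline", 5)}
exposed_25 = {u for u in users if in_rollout(u, "new-timeline", 25)}
print(len(exposed_5), len(exposed_25))
```

A user's bucket never changes, so raising the percentage only adds users and nobody flips back and forth between versions. Setting the percentage to 0 acts as an instant rollback.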

When Reliability Doesn’t Matter (Rarely)

You might sacrifice reliability to reduce development cost (prototypes for unproven markets) or operational cost (razor-thin margins). But be conscious when cutting corners. That parent storing all their kids’ photos in your app? They probably don’t have backups.

Scalability: There Is No Magic Sauce

“X is scalable” is a meaningless statement. The right questions are: “If load grows in this specific way, what are our options?” and “How do we add resources to handle additional load?”

Defining Load Parameters

Before discussing scaling, quantify current load with a few key numbers:

Requests/sec to a web server
Read/write ratio in a database
Simultaneously active users
Cache hit rate

Choose parameters that reflect your actual bottlenecks. The average case might be what matters, or your bottleneck might be dominated by a small number of extreme cases.

Case Study: Twitter’s Fan-Out Problem

Twitter’s 2012 stats:

Post tweet: 4.6k req/sec average, 12k peak
Home timeline: 300k req/sec

Handling 12k writes/sec is easy. The hard part is fan-out: each user follows many people and is followed by many people.

Approach 1: Read-time fan-out

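-- Home timeline for one user: tweets sent by everyone that user follows.
-- :current_user_id is a bind-parameter placeholder for the viewer's id.
SELECT tweets.*, users.* FROM tweets
JOIN users ON tweets.sender_id = users.id
JOIN follows ON follows.followee_id = users.id
WHERE follows.follower_id = :current_user_id;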

Query-time joins across everyone the user follows to assemble the timeline. Simple, but couldn’t handle 300k reads/sec.
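The same read-time approach, sketched in memory. This is an illustration of the idea rather than Twitter's code: the Tweet class and the tweets and follows collections are stand-ins for the tables in the query.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tweet:
    sender_id: int
    text: str
    timestamp: int

tweets: list[Tweet] = []
follows: set[tuple[int, int]] = set()   # (follower_id, followee_id)

def post_tweet(sender_id: int, text: str, timestamp: int) -> None:
    """Writing is cheap: one append to the global collection."""
    tweets.append(Tweet(sender_id, text, timestamp))

def home_timeline(user_id: int) -> list[Tweet]:
    """Reading is expensive: find the followees, then scan for their tweets."""
    followees = {followee for follower, followee in follows if follower == user_id}
    matching = [t for t in tweets if t.sender_id in followees]
    return sorted(matching, key=lambda t: t.timestamp, reverse=True)

follows.add((1, 2))                      # user 1 follows user 2
post_tweet(2, "hello", timestamp=1)
post_tweet(3, "not followed", timestamp=2)
post_tweet(2, "again", timestamp=3)
timeline = home_timeline(1)
print([t.text for t in timeline])
```

All the work happens on the read path, which is the wrong place for it when reads outnumber writes by a wide margin.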

Approach 2: Write-time fan-out

Maintain a cache (timeline mailbox) for each user. When someone tweets, insert it into all followers’ caches. Reads become cheap because results are pre-computed.

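Math: 4.6k tweets/sec × 75 average followers = 345k cache writes/sec. This works because the rate of published tweets (4.6k/sec) is almost two orders of magnitude lower than the rate of home timeline reads (300k/sec), so it pays to do more work at write time and less at read time.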
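A minimal sketch of write-time fan-out, assuming an in-memory dictionary stands in for the per-user cache. TIMELINE_LIMIT and the function names are hypothetical.

```python
from collections import defaultdict, deque

TIMELINE_LIMIT = 800   # hypothetical cap on cached entries per user

followers = defaultdict(set)                                  # followee -> followers
timelines = defaultdict(lambda: deque(maxlen=TIMELINE_LIMIT)) # user -> cached timeline

def follow(follower_id: int, followee_id: int) -> None:
    followers[followee_id].add(follower_id)

def post_tweet(sender_id: int, text: str) -> int:
    """Push the tweet into every follower's cache; return the number of writes."""
    writes = 0
    for follower_id in followers[sender_id]:
        timelines[follower_id].appendleft((sender_id, text))  # newest first
        writes += 1
    return writes

def home_timeline(user_id: int) -> list:
    """Reading is a single cache lookup with no joins."""
    return list(timelines[user_id])

follow(1, 3)
follow(2, 3)
writes_for_one_tweet = post_tweet(3, "hi")

# The article's estimate: average tweet rate times average follower count.
estimated_cache_writes_per_sec = 4_600 * 75
print(writes_for_one_tweet, estimated_cache_writes_per_sec)
```

The cost of a tweet is now proportional to the author's follower count, which is fine on average and painful at the extremes.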

But the average hides a problem: some celebrities have over 30 million followers. One tweet = over 30 million cache writes. Twitter tries to deliver tweets to followers within five seconds, making this a significant challenge.

Approach 3: Hybrid (current)

Most users get write-time fan-out. Celebrities are excluded from fan-out. Their tweets are fetched separately and merged at read time. This hybrid delivers consistently good performance.
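The hybrid can be sketched as follows. The CELEBRITY_THRESHOLD cutoff and the data structures are assumptions for illustration (the threshold is tiny so the demo stays small), not details of Twitter's implementation.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3   # hypothetical cutoff, kept tiny for the demo

followers = defaultdict(set)          # followee -> followers
following = defaultdict(set)          # follower -> followees
timelines = defaultdict(list)         # user -> fanned-out (timestamp, sender, text)
celebrity_tweets = defaultdict(list)  # celebrity -> their own tweets

def follow(follower_id, followee_id):
    followers[followee_id].add(follower_id)
    following[follower_id].add(followee_id)

def is_celebrity(user_id):
    return len(followers[user_id]) >= CELEBRITY_THRESHOLD

def post_tweet(sender_id, text, timestamp):
    tweet = (timestamp, sender_id, text)
    if is_celebrity(sender_id):
        celebrity_tweets[sender_id].append(tweet)   # one write, no fan-out
    else:
        for follower_id in followers[sender_id]:
            timelines[follower_id].append(tweet)    # normal write-time fan-out

def home_timeline(user_id):
    merged = list(timelines[user_id])
    for followee_id in following[user_id]:
        merged.extend(celebrity_tweets.get(followee_id, []))  # read-time merge
    return sorted(merged, reverse=True)             # newest first

for fan in (1, 2, 3):
    follow(fan, 100)      # user 100 reaches the celebrity threshold
follow(1, 7)              # user 7 is an ordinary account
post_tweet(7, "lunch", timestamp=1)
post_tweet(100, "album out", timestamp=2)
print(home_timeline(1))
```

The celebrity's tweet costs one write regardless of follower count. Each reader pays a small merge cost, but only for the celebrities they follow.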

The key load parameter here is follower distribution, weighted by tweet frequency. Your load parameters will be different.

Measuring Performance

Two questions when load increases:

If resources stay constant, how does performance degrade?
How much must resources increase to maintain performance?

For batch systems such as Hadoop, care about throughput (records processed per second). For online systems, care about the response time distribution. Never think of response time as a single number: even identical requests have variable latency, so you are always looking at a distribution of values.
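A small example shows why a single number misleads. The sample values below are invented, and the percentile function uses the simple nearest-rank method (monitoring systems often use approximations instead).

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Made-up response times in milliseconds, with one slow outlier.
response_times_ms = [12, 14, 15, 15, 16, 18, 20, 22, 30, 950]

mean = sum(response_times_ms) / len(response_times_ms)
median = percentile(response_times_ms, 50)
p99 = percentile(response_times_ms, 99)
print(f"mean={mean:.1f} ms, median={median} ms, p99={p99} ms")
```

The mean lands far above what a typical request experiences and far below the worst case, so it describes nobody. The median and a high percentile together tell you much more.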

Coping with Load

Vertical scaling (bigger machine) vs horizontal scaling (more machines). Reality: good architectures mix both. Several powerful machines can be simpler and cheaper than many tiny VMs.

Elastic systems auto-scale when load increases. Manually scaled systems require humans to add capacity. Elastic is useful for unpredictable load but adds complexity and operational surprises.
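As a rough sketch of what an elastic policy computes, here is a simple proportional rule: scale the instance count by the ratio of observed load to target load, within fixed bounds. The function and its parameters are hypothetical, and production autoscalers add smoothing and cooldowns that are omitted here.

```python
import math

def desired_instances(current: int, observed_per_instance: float,
                      target_per_instance: float,
                      min_instances: int = 1, max_instances: int = 100) -> int:
    """Proportional scaling rule, clamped to a safe range."""
    raw = math.ceil(current * observed_per_instance / target_per_instance)
    return max(min_instances, min(max_instances, raw))

print(desired_instances(4, 150, 100))   # load is 1.5x the target: scale out
print(desired_instances(4, 50, 100))    # load is half the target: scale in
```

The clamp is where many of the operational surprises come from. A ceiling that is too low causes an outage under a spike, and one that is too high can run up a large bill.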

The hard part: Stateless services scale easily across machines. Stateful data systems introduce massive complexity when distributed. Traditional wisdom was to scale up (single node) until cost or availability requirements forced distribution.

This may be changing as distributed system abstractions improve, but we’re not there yet for most use cases.

Critical insight: There is no generic scalable architecture. A system handling 100k req/sec of 1KB payloads looks completely different from one handling 3 req/min of 2GB payloads, even though both move the same volume of data (about 100 MB/sec).
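The arithmetic behind that comparison, using decimal units (1 KB = 1,000 bytes, 1 GB = 1,000,000,000 bytes):

```python
KB = 1_000
GB = 1_000_000_000

# Many small requests: 100,000 per second at 1 KB each.
small_bytes_per_sec = 100_000 * 1 * KB

# A few huge requests: 3 per minute at 2 GB each.
large_bytes_per_sec = 3 * 2 * GB / 60

print(small_bytes_per_sec / 1_000_000, "MB/s")
print(large_bytes_per_sec / 1_000_000, "MB/s")
```

Both workloads move the same number of bytes per second. The first is likely to be limited by per-request overhead, while the second is more likely to be limited by how long a single transfer takes.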

Scalable architectures are built around assumptions about which operations are common vs rare (your load parameters). If assumptions are wrong, scaling effort is wasted or counterproductive. For early-stage products, iterating quickly on features matters more than scaling to hypothetical future load.

Maintainability: Don’t Build Tomorrow’s Legacy System Today

Most software cost is maintenance, not initial development: fixing bugs, investigating failures, adapting to new platforms, repaying technical debt, adding features.

Three design principles:

Operability: Reduce Toil

Good ops can work around bad software, but good software can’t run reliably with bad ops. Operations teams handle:

Monitoring health and restoring service
Diagnosing performance degradation
Applying security patches
Capacity planning
Deployment and configuration management
Complex maintenance (platform migrations)
Preserving institutional knowledge

Make their lives easier by:

Providing visibility into runtime behavior (good monitoring)
Supporting automation and standard tooling
Avoiding single-machine dependencies (enable rolling maintenance)
Including clear documentation and operational models
Setting sensible defaults while allowing overrides
Self-healing where appropriate, manual control where needed
Exhibiting predictable behavior

Simplicity: Fight Accidental Complexity

As projects grow, complexity explodes: state space bloat, tight coupling, tangled dependencies, inconsistent naming, performance hacks, special-casing. This slows everyone down and increases bug risk.

Accidental complexity isn’t inherent to the problem users face. It arises from implementation choices.

The best tool for managing complexity: abstraction. Good abstractions hide implementation details behind clean interfaces and enable reuse across applications. Examples: high-level languages abstract machine code, SQL abstracts on-disk data structures.

Finding good abstractions is hard, especially in distributed systems. But it’s worth the effort because quality improvements to abstracted components benefit all users.

Evolvability: Embrace Change

Requirements never stay constant: new facts emerge, use cases shift, business priorities change, regulations update, growth forces architectural changes.

Agile practices (TDD, refactoring) help at small scale. But how do you “refactor” Twitter’s timeline architecture from approach 1 to approach 2 when you’re already serving 300k req/sec?

Evolvability (also called extensibility or modifiability) is closely linked to simplicity. Simple systems are easier to modify. Use good abstractions to make changes manageable at system scale.

Actionable Takeaways

Test your fault tolerance by deliberately breaking things in production (carefully). If you never exercise your error-handling paths, you have no evidence that they work.
Identify your load parameters before discussing scalability. What actually dominates your bottlenecks? Reads, writes, fan-out, tail latency?
Don’t over-engineer for hypothetical scale. Early-stage products should prioritize iteration speed. Rearchitect at each order-of-magnitude load increase, not before.
Invest in observability from day one. You can’t maintain systems you can’t see. Telemetry is not optional.
Design for operability. Your 3 AM on-call engineer will thank you for predictable behavior, good defaults, and clear docs.
Fight accidental complexity ruthlessly. Every abstraction should justify its existence. If it’s making the system harder to understand, kill it.
Accept that humans make mistakes. Build sandboxes, enable fast rollbacks, and make the right thing easy.

There’s no silver bullet for reliability, scalability, or maintainability. But there are patterns that work, and anti-patterns to avoid. Build systems that tolerate faults, understand their load characteristics, and remain simple enough for the next engineer to maintain.
