Reliable, Scalable, and Maintainable Systems: Notes from DDIA Chapter 1
Why This Chapter Matters
I just finished the first chapter of Designing Data-Intensive Applications by Martin Kleppmann — the book most senior engineers point to when you ask how to think about backend systems at scale. Chapter 1 sets up the entire book by introducing the three properties that any serious system must care about:
- Reliability — the system keeps working correctly even when things go wrong
- Scalability — the system keeps performing well as load grows
- Maintainability — humans can keep working on the system productively over time
These sound obvious. They are not. Most outages, performance fires, and "we have to rewrite this" moments trace back to a team that optimized for one of these and quietly broke the other two. This post is my distilled version of the chapter, written the way I wish someone had explained it to me before I had to learn it from incidents.
1. Reliability: Designing for Things Going Wrong
A reliable system continues to do what the user expects, even in the face of faults. The key distinction Kleppmann draws — and one I had been sloppy about — is between a fault and a failure:
- A fault is one component deviating from its spec (a disk returning bad bytes, a node losing the network).
- A failure is the system as a whole stopping service to the user.
You cannot prevent faults. You can prevent faults from turning into failures. That's what reliability engineering is.
Where faults come from
- Hardware faults. Disks die, RAM bits flip, power supplies fail. Cloud abstracts a lot of this away, but only by giving you many cheaper machines that fail more often. Redundancy moves from the hardware layer into your software.
- Software faults. Far more dangerous than hardware faults because they're correlated — a single bug can take down every node at once. Memory leaks, runaway recursion under a specific input, a cascading retry storm.
- Human errors. The biggest source. Studies cited in the chapter put operator misconfiguration as the leading cause of outages in large internet services. The fix is not "hire better humans." It's designing systems where the easy path is also the safe path: good abstractions, sandboxes, staging environments, fast rollback.
What a reliable system actually does
It is built to detect, isolate, and recover from faults — automatically when possible.
Concretely, that means: health checks that catch a degraded node before traffic notices, bulkheads that keep one slow dependency from saturating the thread pool for everything else, and retries that have backoff and a budget so they don't amplify the original problem.
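The retry piece is the one most often gotten wrong, so here's a minimal sketch of what "backoff and a budget" means in practice. Everything here (the helper name, the dict-based budget) is my own illustration, not code from the book:

```python
import random
import time

def call_with_retries(op, max_attempts=4, base_delay=0.1, retry_budget=None):
    """Retry `op` with exponential backoff, jitter, and an optional budget.

    `retry_budget` is a shared counter capping total retries across callers,
    so a widespread outage can't snowball into a retry storm.
    """
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            is_last = attempt == max_attempts - 1
            out_of_budget = retry_budget is not None and retry_budget["tokens"] <= 0
            if is_last or out_of_budget:
                raise  # give up: surface the fault instead of amplifying it
            if retry_budget is not None:
                retry_budget["tokens"] -= 1
            # Exponential backoff with full jitter, so clients don't retry in sync.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```

The budget is the part that keeps a fault from becoming a failure: once it's exhausted, callers fail fast rather than piling load onto an already-struggling dependency.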
The mindset shift from this section: don't ask "will this fail?" — ask "when it fails, what happens next?"
2. Scalability: A Word That Means Nothing Until You Define It
"Scalable" is one of the most abused words in engineering. Kleppmann's correction is sharp: scalability isn't a property a system has — it's a question you have to ask precisely.
If the system grows in a particular way, what are our options for coping with the growth?
To answer that, you need two things first: a description of load, and a description of performance.
Describe the load with the right parameter
Load isn't one number. Twitter's classic example (in the book) is the difference between posting a tweet and reading the home timeline — those are wildly different load patterns and the right system design depends on which one dominates. Pick the parameters that actually stress your system: requests per second, read/write ratio, fan-out per write, cache hit rate, dataset size.
If you skip this step, every scaling discussion devolves into vibes.
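To make the Twitter example concrete, here's a toy fan-out-on-write sketch (the data structures and names are mine, not the book's): the expensive work happens at post time, so reading a home timeline is just a cache lookup. This trade only makes sense because reads vastly outnumber writes for that workload:

```python
from collections import defaultdict, deque

followers = defaultdict(set)    # user -> set of followers
timelines = defaultdict(deque)  # user -> precomputed home timeline, newest first

def follow(follower, followee):
    followers[followee].add(follower)

def post(author, text):
    # Fan-out-on-write: push into every follower's timeline cache.
    # Write cost is O(follower count) -- the load parameter that matters.
    for f in followers[author]:
        timelines[f].appendleft((author, text))

def home_timeline(user, limit=20):
    # Read is a cheap lookup into the precomputed timeline.
    return list(timelines[user])[:limit]
```

The celebrity-with-millions-of-followers case is exactly where this design strains, which is why the real system ended up as a hybrid of fan-out-on-write and fan-out-on-read.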
Measure performance with percentiles, not averages
This was the most useful single idea in the chapter for me.
Averages hide everything. A service with a 200ms average response time can still be quietly miserable for 5% of users. What matters in practice is the tail:
- p50 (median) — typical user experience
- p95 / p99 — your unhappy users; usually the ones with the most data or the heaviest accounts
- p999 (the 99.9th percentile) — the rare cases that, at scale, still add up to thousands of requests a day
Amazon famously cared about p999 because the customers experiencing it were often the most valuable ones (more data → slower requests → more money spent on the platform). The lesson: optimize the tail, not the mean.
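A quick way to see how the mean hides the tail, using made-up latency numbers and a simple nearest-rank percentile (my own sketch, not from the book):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile for p in (0, 1]: smallest value covering p of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]

# Hypothetical distribution: 1000 requests, mostly fast, a thin slow tail.
latencies_ms = [12] * 900 + [80] * 90 + [950] * 10

mean = sum(latencies_ms) / len(latencies_ms)   # 27.5 ms -- looks healthy
p50 = percentile(latencies_ms, 0.50)           # 12 ms
p99 = percentile(latencies_ms, 0.99)           # 80 ms
p999 = percentile(latencies_ms, 0.999)         # 950 ms -- 1 in 1000 waits ~1s
```

A 27.5 ms average and a 950 ms p999 describe the same service; only the percentiles tell you that one request in a thousand is nearly a second.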
Two related effects make the tail worse than it looks. Head-of-line blocking: a few slow requests at the front of a queue hold up everything behind them, so clients see slow responses even for requests that would have been fast. Tail latency amplification: when serving one user request means making parallel calls to many backends, the slowest call sets the response time — so a downstream service's p99 can end up dominating the typical latency of the service that calls it.
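The amplification effect is easy to quantify. If each backend call independently has a 1% chance of landing in that backend's slowest 1%, then a request that waits on n parallel calls is slow whenever any one of them is (a standard back-of-envelope calculation, not a formula from the book):

```python
def p_slow(n, tail=0.01):
    """Probability that at least one of n parallel calls exceeds the backend's
    tail threshold (e.g. its p99), assuming independent call latencies."""
    return 1 - (1 - tail) ** n

# With ~70 parallel backend calls, roughly half of all user requests
# experience at least one p99-slow call: the backend's p99 has become
# approximately the caller's p50.
```

Independence is an optimistic assumption (slow calls often correlate), so real systems tend to be even worse than this estimate.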
Vertical vs. horizontal scaling
- Scale up (vertical): bigger machine. Simple. Hits a ceiling. Expensive at the top end.
- Scale out (horizontal): more machines. Cheaper per unit of capacity, but you now own a distributed system, with all the consistency and coordination problems that come with it.
Real systems mix both. Stateless services scale out trivially; stateful services (databases especially) are where the hard architectural decisions live, and most of the rest of the book is about those decisions.
3. Maintainability: The Cost Most Teams Underestimate
The majority of the cost of software is not in writing it — it's in keeping it alive afterward. Bug fixes, on-call, adapting to new requirements, paying down decisions made by someone who left two years ago. Kleppmann breaks maintainability into three design principles:
Operability — make life easy for the people running it
Good operability looks like: clear logs, useful metrics, runbooks for predictable failures, automation for repetitive tasks, predictable behavior, and the ability to deploy confidently. A system that needs a specific person's tribal knowledge to keep running is not operable, no matter how clever its architecture is.
Simplicity — manage complexity
Kleppmann uses Moseley and Marks's distinction between essential complexity (inherent to the problem) and accidental complexity (we created it ourselves). Most legacy pain is accidental. The tool for fixing it is abstraction — but only when the abstraction genuinely hides complexity rather than just relocating it. A leaky abstraction is worse than no abstraction.
Evolvability — make it easy to change
Requirements always change. Evolvability is whether your system bends or breaks when they do. This is downstream of simplicity: the simpler the system, the cheaper it is to evolve. Practices like continuous delivery, good test coverage, and small loosely-coupled services exist to keep the cost of change low.
What I'm Taking Away
A few things from this chapter that I want to keep in front of me:
- Faults are inevitable; failures are a design choice. Reliability is the discipline of preventing the first from becoming the second.
- "Scalable" is not a property — it's a question. You cannot answer it without defining load and performance for your system specifically.
- Stop looking at average latency. p95, p99, and p999 are where the real user experience lives.
- Maintainability is the long-term cost center. Operability, simplicity, and evolvability are not nice-to-haves — they're what determines whether the team can still ship in two years.
Chapter 1 is essentially a vocabulary chapter — it gives you the language to argue about trade-offs precisely. The rest of the book is about the actual machinery (data models, storage engines, replication, partitioning, consensus) and now I have the right frame to read it.
If you've been writing backend code for a while but haven't sat down with DDIA yet, this first chapter alone is worth the time. I'll be posting notes on the chapters that follow.