Distributed systems, intro
Where your laptop's mental model breaks. Replication, consensus, the eight fallacies, and why everything is eventually consistent until proven otherwise.
Prerequisites
02.5 (networks) · 02.6 (databases)
Stack
a laptop · Python or Go · Docker (to run multi-node toys)
By the end of this module
- Recite the eight fallacies and name a real failure for each.
- Distinguish CAP from PACELC and pick the right tradeoff for a given workload.
- Reason about replication topologies (leader-follower, leaderless, multi-leader) and their failure modes.
- Describe Raft well enough to implement it; understand why Paxos is harder than it should be.
- Implement a 3-node leader-elected counter and survive a chaos test that kills the leader.
Until now everything in this track has assumed one machine. One process, one disk, one address space. The mental model is reliable: your code runs, your data is where you put it, time moves forward in one direction. Distributed systems is what happens when those assumptions break.
The shortest definition is Lamport's quip: a distributed system is one where a machine you have never heard of, and cannot influence, can fail in a way that prevents you from doing your work. Once you have more than one machine, you have a distributed system, and your single-machine intuitions start lying to you.
The opinionated take of this module: most “distributed systems” content drowns students in formal models before they have any intuition for the actual failures. We’re going to flip that. Read about the eight fallacies. Build a tiny replicated counter. Watch it break in surprising ways. Then read the formal stuff, and the formal stuff will mean something.
Set up
You need Docker to run multi-node toys on a single laptop, and Python or Go for the implementation.
# Linux / macOS
docker --version || echo "install Docker Desktop"
python --version
# Make a project
mkdir distrib && cd distrib
mkdir node tests notes
A Docker Compose file to spin up three nodes is straightforward; we’ll outline it later.
Read these first
In this order:
- Designing Data-Intensive Applications (Kleppmann). book · 8–10 hours · chapters 5 (replication), 7 (transactions), 8 (trouble), 9 (consistency and consensus). Required. The most useful single book on this subject.
- The eight fallacies of distributed computing. page · 10 minutes · Peter Deutsch’s list. Read each one, then read it again every time you propose a network architecture.
- The Raft paper — In Search of an Understandable Consensus Algorithm. PDF · 2 hours · short, clear, designed for understandability. The one consensus paper to read first.
- MIT 6.824 lecture notes. course · the gold standard distributed systems course; Robert Morris’s lectures are on YouTube.
- Jeff Hodges — Notes on distributed systems for young bloods. post · 20 minutes · the field-report wisdom that books skip.
After Kleppmann chapters 5, 8, 9 and the Raft paper, you have the conceptual scaffolding. Stop reading. Start breaking.
The eight fallacies, grounded
Peter Deutsch’s list (James Gosling added the eighth). Restate each, then describe a real failure mode you should expect.
- The network is reliable. Packets get dropped, links flap, NICs misbehave. Plan for retries and idempotency.
- Latency is zero. A round trip across a datacenter is hundreds of microseconds; across an ocean, hundreds of milliseconds. Designs that assume “RPC is just a function call” eat this latency thousands of times per request.
- Bandwidth is infinite. Your “small” payload of 10 MB is a long-tail latency event. Pagination, compression, and streaming exist for this.
- The network is secure. Anything on the wire can be intercepted, modified, or replayed. Encrypt in transit. Authenticate every hop.
- Topology doesn’t change. Routes change, IPs change, services migrate. Service discovery is not optional.
- There is one administrator. Production has many people changing things in parallel. Versioning, auditing, and explicit ownership matter.
- Transport cost is zero. Serializing a request, encrypting it, framing it on the wire — all has cost. JSON over HTTPS is convenient and not free.
- The network is homogeneous. Different links, different MTUs, different congestion control, different middleboxes. Code that works in dev breaks in prod when traffic crosses a firewall that re-fragments differently.
Whenever a system design feels too clean, walk through these eight. The simplification is almost always assuming one of them away.
CAP and PACELC
CAP theorem (Brewer): under a network partition, you must choose between consistency and availability. You cannot have both, full stop.
This is true and famously misunderstood. The corrections most engineers eventually learn:
- CAP is a statement about partitions. Without partitions there is no tradeoff.
- “Consistency” in CAP means linearizability, the strongest model. Most systems live in weaker consistency models where the tradeoff is more nuanced.
- “Availability” in CAP means every non-failing node responds. It is not the marketing definition of uptime.
PACELC (Abadi) extends it: if there is a Partition, choose between Availability and Consistency. Else (no partition), choose between Latency and Consistency. This captures the everyday tradeoff: even when nothing is broken, stronger consistency costs more round trips.
Practical consequence:
- Dynamo-style stores (Cassandra, Riak; DynamoDB with its default eventually consistent reads): AP under partition, low latency in normal operation. Eventually consistent.
- Spanner, FoundationDB, etcd: CP under partition, higher latency for stronger guarantees.
- Most systems lie somewhere in between, with knobs.
Pick the tradeoff for the workload. Banking transactions: strong consistency. Social feed: eventually consistent is fine.
Replication topologies
Three flavors. Know all three.
Leader-follower (single-leader). One node accepts writes, replicates to followers. Simple model, good consistency story. The leader is a single point of failure for writes; election protocols (Raft, ZooKeeper’s ZAB, etc.) handle leader change. PostgreSQL streaming replication, MySQL replication, Redis primary/replica.
Multi-leader. More than one node accepts writes; conflicts must be resolved. Hard to get right. Used in geographically distributed setups (multi-datacenter MySQL, CouchDB-style replication). Conflict resolution is its own subfield (CRDTs, last-write-wins, custom merge functions).
Leaderless. Any node accepts writes; a write quorum (W) defines durability, a read quorum (R) defines read consistency. If W + R > N (the number of replicas), every read quorum overlaps every write quorum, so reads see the latest write (with strict quorums; sloppy quorums break this). Dynamo, Cassandra, Riak. The Dynamo paper is a foundational read once you’ve finished Kleppmann.
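The overlap claim is easy to convince yourself of by brute force. A toy check, not how any real system verifies quorums:

```python
from itertools import combinations

def quorums_overlap(n: int, w: int, r: int) -> bool:
    """Check that every write quorum of size w intersects every read quorum of size r."""
    nodes = range(n)
    return all(set(wq) & set(rq)
               for wq in combinations(nodes, w)
               for rq in combinations(nodes, r))

print(quorums_overlap(3, 2, 2))  # True: 2 + 2 > 3, so every read sees the write
print(quorums_overlap(3, 1, 2))  # False: 1 + 2 <= 3, a read can miss the write entirely
```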
The single most important fact: replication is asynchronous unless you pay extra for synchronous. Async means a follower can be seconds behind. Reads from followers (a common scaling trick) are reads-from-the-past.
Consensus, and why Raft is the one to read
Consensus is the problem of getting a set of nodes to agree on a single value, despite some of them failing. It is the foundation of leader election, distributed locks, replicated state machines.
Paxos was the first widely-known solution. It is famously hard to understand, even harder to implement correctly. Lamport’s original paper is a math paper. The “Paxos Made Simple” follow-up is still not simple.
Raft (Ongaro and Ousterhout, 2014) is designed to be understandable. The decomposition is clean:
- Leader election. A leader is elected by majority vote. Terms are monotonic; older terms are rejected. (The vote-granting rule is sketched after this list.)
- Log replication. The leader appends entries to its log, replicates to followers, commits when a majority acknowledge.
- Safety. Properties (election safety, leader append-only, log matching, leader completeness, state machine safety) ensure no two nodes apply different commands at the same index.
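To make the election rules concrete, here is a condensed sketch of the RequestVote receiver, following Figure 2 of the paper. It elides the parts that make it real: current_term and voted_for must be persisted before the reply is sent, and a leader or candidate that sees a newer term must step down.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Entry:
    term: int
    command: str

@dataclass
class NodeState:
    current_term: int = 0
    voted_for: Optional[int] = None
    log: List[Entry] = field(default_factory=list)

@dataclass
class VoteRequest:
    term: int
    candidate_id: int
    last_log_index: int  # number of entries in the candidate's log
    last_log_term: int   # term of its last entry (0 if the log is empty)

def handle_request_vote(state: NodeState, req: VoteRequest) -> dict:
    if req.term < state.current_term:
        return {"term": state.current_term, "vote_granted": False}  # stale candidate
    if req.term > state.current_term:
        state.current_term = req.term  # newer term: adopt it, forget any earlier vote
        state.voted_for = None
    # The candidate's log must be at least as up-to-date as ours:
    # compare last terms first, then lengths. Tuple comparison does both.
    our_last_term = state.log[-1].term if state.log else 0
    up_to_date = (req.last_log_term, req.last_log_index) >= (our_last_term, len(state.log))
    if state.voted_for in (None, req.candidate_id) and up_to_date:
        state.voted_for = req.candidate_id  # at most one vote per term
        return {"term": state.current_term, "vote_granted": True}
    return {"term": state.current_term, "vote_granted": False}
```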
If you understand Raft, you can implement etcd, Consul, CockroachDB’s replication layer (in spirit), and most modern consensus-based systems. Read the paper. Then watch the visualization until the dance feels obvious.
Time, ordering, and the hard part
There is no global clock. Two events on different machines cannot be perfectly ordered just by comparing wall-clock time. NTP gives you milliseconds of skew if you’re lucky; often more. Atomic clocks (Spanner’s TrueTime) cost real money and only give bounded, not zero, skew.
What you have instead:
- Logical clocks (Lamport). A counter incremented on each event. If A causes B, A’s counter is less than B’s. Cheap; partial order. (Sketched after this list.)
- Vector clocks. A vector with one entry per node. Captures concurrency: two events are concurrent if neither vector dominates. More overhead; richer information.
- Hybrid logical clocks. Combine wall clock with logical counter. Used in CockroachDB and friends.
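A Lamport clock is small enough to write from memory. A minimal sketch:

```python
class LamportClock:
    """If event A causally precedes B, then clock(A) < clock(B).
    The converse does not hold: a smaller timestamp does not imply causality."""

    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        """Local event: increment and return the new time."""
        self.time += 1
        return self.time

    def receive(self, msg_time: int) -> int:
        """On receiving a message, jump past the sender's clock, then tick."""
        self.time = max(self.time, msg_time)
        return self.tick()

a, b = LamportClock(), LamportClock()
t = a.tick()           # a sends a message stamped 1
b.tick(); b.tick()     # b has two local events; its clock reads 2
print(b.receive(t))    # 3: the receive is ordered after the send
```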
The takeaway for application code: do not depend on machine clocks for correctness. If two events must be ordered, use a logical mechanism (a sequence number from a single source, a Raft log index, an explicit causal token) rather than time.now().
Idempotency, at-least-once, exactly-once
In a system with retries, every operation will be delivered at least once and possibly more. “Exactly once” is a marketing claim; the engineering reality is “effectively once via idempotency.”
The discipline:
- Every write request carries a client-generated unique ID (idempotency key).
- The server records the ID and the result.
- A duplicate request returns the cached result.
That’s the entire trick. It works for billing, for messaging, for orders, for anything where double-applying is bad. Stripe’s Idempotency-Key header is the canonical example.
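The server side fits in a toy sketch, assuming an in-memory table; a real implementation persists the key and result atomically with the write itself, and Stripe’s actual semantics are richer than this.

```python
import uuid

_results = {}  # idempotency key -> cached result; a real server uses durable storage

def charge(idempotency_key: str, amount_cents: int) -> dict:
    """Apply a charge at most once per key; duplicates replay the cached result."""
    if idempotency_key in _results:
        return _results[idempotency_key]  # retry: replay, do not re-apply
    result = {"charged": amount_cents, "charge_id": str(uuid.uuid4())}  # the side effect
    _results[idempotency_key] = result
    return result

key = str(uuid.uuid4())    # the client generates the key once, before the first attempt
first = charge(key, 500)
retry = charge(key, 500)   # e.g. the response to the first attempt was lost
assert first == retry      # charged exactly once
```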
Anything claiming exactly-once delivery without idempotency on the receiver is either wrong or relying on a narrow assumption (a Kafka exactly-once-semantics setup with transactions, for instance) that breaks the moment you add a side effect outside the system.
Partial failures and cascading failures
The single failure mode that defines distributed systems: one component degrades, and the rest of the system makes the degradation worse. The classical pattern:
- Service A is slow but not failed.
- Service B keeps calling A and waits longer.
- B’s thread pool fills, and B starts dropping its own callers.
- C (calling B) sees timeouts and retries. Retries amplify load on B and A.
- Everything is on fire.
The mitigations are classic and underused:
- Timeouts on every RPC. Every call gets one, and the default should be tighter than your SLA.
- Circuit breakers. Fail fast when downstream is failing. Stop sending work to a known-bad service for a cooldown.
- Bulkheads. Separate thread pools or resource pools per dependency, so one slow dependency doesn’t drain everything.
- Backoff with jitter. Retries should back off exponentially with random jitter, or you’ll synchronize a thundering herd.
- Load shedding. When overloaded, drop low-priority requests early instead of degrading the whole system.
If you take one habit from this module, take this one: every RPC has a timeout, every retry has jitter, every dependency has a fallback.
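That habit fits in one helper. A sketch of the “full jitter” backoff variant; the parameter values are illustrative, not a recommendation:

```python
import random
import time

def call_with_retries(fn, attempts=5, base=0.1, cap=5.0, timeout=1.0):
    """Call fn with a per-attempt timeout and exponential backoff with full jitter."""
    for attempt in range(attempts):
        try:
            return fn(timeout=timeout)  # every attempt is individually bounded
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retry budget: surface the error to the caller
            # Full jitter: sleep uniformly in [0, min(cap, base * 2^attempt)],
            # so concurrent clients spread out instead of herding.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

# Usage sketch with a hypothetical endpoint:
# call_with_retries(lambda timeout: requests.get("http://node1:8001/count", timeout=timeout))
```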
The build: a 3-node leader-elected counter
Build a counter service. Three nodes. One is the leader; followers replicate. Clients can increment via any node, but writes are forwarded to the leader and acknowledged after a majority commit.
Architectural sketch:
- Each node listens on a TCP port; nodes know each other’s addresses.
- Implement a tiny Raft: terms, RequestVote RPC, AppendEntries RPC, election timeout with jitter (150–300 ms is typical; a timer sketch follows this list).
- The state machine is `int counter = 0`. Increments are appended to the Raft log; they apply when committed.
- Clients can query `count` from any node and get the leader’s most recently applied value.
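One way to wire the randomized timeout, assuming a threaded node; the window is the paper’s suggestion, tune it for your network:

```python
import random
import threading

ELECTION_TIMEOUT = (0.150, 0.300)  # seconds, randomized per the Raft paper
HEARTBEAT_INTERVAL = 0.050         # leader heartbeats must be well under the timeout

class ElectionTimer:
    """Restartable timer: fires an election unless a heartbeat resets it first."""

    def __init__(self, on_timeout):
        self.on_timeout = on_timeout
        self._timer = None

    def reset(self):
        """Call on every valid AppendEntries from the current leader."""
        if self._timer:
            self._timer.cancel()
        self._timer = threading.Timer(random.uniform(*ELECTION_TIMEOUT), self.on_timeout)
        self._timer.daemon = True
        self._timer.start()

# Follower loop sketch: reset() on each heartbeat; if the timer ever fires,
# increment the term, vote for yourself, and send RequestVote to the peers.
```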
# Run three nodes
docker compose up
# In another terminal, increment via node 1
curl -X POST localhost:8001/incr
# Read from node 2 (it forwards or returns its applied value)
curl localhost:8002/count
Then run a chaos test: kill the leader at random, run 1000 concurrent increments, and verify that the final count is exactly 1000 and that the surviving nodes agree. (If clients retry on timeout, they need idempotency keys to keep the count exact; the earlier section was not optional.)
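The harness can be blunt. A sketch, where the container names and ports are assumptions from whatever compose setup you wrote, and /count is assumed to return JSON; note the blind retry in incr() will over-count unless /incr accepts an idempotency key:

```python
import random
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor

import requests

NODES = ["localhost:8001", "localhost:8002", "localhost:8003"]       # assumed ports
CONTAINERS = ["distrib-node-1", "distrib-node-2", "distrib-node-3"]  # assumed names

def incr():
    for _ in range(10):  # blind retry: pair with idempotency keys to keep counts exact
        try:
            requests.post(f"http://{random.choice(NODES)}/incr", timeout=1.0)
            return
        except requests.RequestException:
            time.sleep(random.uniform(0, 0.2))
    raise RuntimeError("increment never succeeded")

def kill_and_restart_random_node():
    victim = random.choice(CONTAINERS)
    subprocess.run(["docker", "kill", victim], check=False)
    time.sleep(2.0)  # long enough for an election to finish
    subprocess.run(["docker", "start", victim], check=False)

with ThreadPoolExecutor(max_workers=50) as pool:
    futures = [pool.submit(incr) for _ in range(1000)]
    kill_and_restart_random_node()  # chaos mid-flight
    for f in futures:
        f.result()

counts = {n: requests.get(f"http://{n}/count", timeout=1.0).json() for n in NODES}
print(counts)  # all three nodes should agree
```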
The first time you do this and it converges correctly, you have understood Raft well enough to implement it. Most graduating CS students cannot do this. This is the module that fixes that.
A few warnings, learned the hard way:
- Election timeouts must be randomized, or split votes deadlock the cluster.
- Heartbeats from the leader must be more frequent than the election timeout, or the cluster will continually re-elect.
- The log must be persisted before the leader acknowledges a write, or a leader crash loses committed entries.
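The third warning in code form, assuming a newline-delimited JSON log file:

```python
import json
import os

def append_before_ack(log_path: str, entry: dict) -> None:
    """Persist a log entry durably; only after this returns may the leader ack."""
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # push the OS buffer to disk; skip this and an
                               # acknowledged write can vanish in a crash
```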
Going deeper
When you have specific questions, in this order:
- Designing Data-Intensive Applications (Kleppmann). book · re-read after a year of practice. The depth on transactions and consistency models pays off.
- MIT 6.824. course · the labs are a more rigorous version of the build above. If you do them all, you’ve earned a real distributed systems credential.
- The Raft paper. PDF · re-read it after implementing Raft. Different things will jump out.
- The Dynamo paper. PDF · the canonical leaderless system; influenced Cassandra, Riak, DynamoDB.
- The Spanner paper. PDF · what TrueTime buys you, and what it costs.
Skip “microservices vs monolith” content. The right answer is almost always “monolith with clean boundaries until proven insufficient.” Distributed problems are not solved by being more distributed.
Checkpoints
If any one wobbles, redo the corresponding section.
- State the eight fallacies. For each, give one production failure (real or plausible) it explains.
- CAP forces you to choose between availability and consistency under partition. PACELC adds a tradeoff in the absence of partition. State the second tradeoff and give an example workload that would push you each way.
- Walk through Raft’s leader election. What invariant prevents two leaders from being elected for the same term? What invariant prevents a stale leader from committing entries after a partition heals?
- A client makes a payment request, gets a timeout, and retries. The first request actually succeeded but its response was lost. How do you ensure the user is not charged twice, and what’s the contract between client and server that makes this work?
- Service A calls service B, B is slow, A’s threads start piling up, A starts failing health checks. Walk through the cascading-failure pattern and name three concrete defenses.
If you can answer all five with confidence and your replicated counter survives the chaos test, you’ve earned 02.8. Move on to 02.9 (Web fundamentals without the cargo cult) — where the distributed systems lens turns toward what users actually see in a browser.