Designing systems on a whiteboard
The interview format — and the actual skill. Scoping, capacity, trade-offs, and naming what you don't know without faking it.
Prerequisites
02.8
Stack
- a whiteboard or Excalidraw
- the ability to estimate
- discipline
By the end of this module
- Walk through a system design problem in a 6-step framework that interviewers actually grade on.
- Estimate capacity (QPS, storage, bandwidth) from first principles in under 5 minutes.
- Pick between the eight standard building blocks based on the problem, not the buzzword.
- Spot the four most common failure modes in interview answers and avoid them.
System design interviews are bad at testing system design and good at testing how you think under uncertainty. That’s actually the more useful skill to learn for real engineering work. The mistake most students make is preparing for the interview as if it’s a memorization exercise — “memorize the URL shortener answer, then the Twitter answer, then the chat answer” — and then performing those scripts on the day. Interviewers see through this in roughly 90 seconds and grade accordingly.
The opinion: the interviewer is grading how you think, not what you’ve memorized. The 6-step framework in this module is not a script — it’s the structure that lets you think clearly when you’ve never seen the problem before. Once you internalize it, you’ll find it works equally well for “design a chat app” and for “design our actual production retry queue at work next Tuesday.” That’s the real point.
Set up
You don’t need a software stack for this module. You need a whiteboard or Excalidraw, 8-12 hours of focused practice, and at least one practice partner who will challenge you. Pair practice is non-negotiable: you cannot evaluate your own clarity.
# Practice cadence
# - 4 problems, 45 min each, with a partner
# - Self-record (audio) and listen back
# - Compare your output against the references in "Going deeper"
Read these first
Three sources, in this order, then stop:
- Donne Martin — System Design Primer. repo · 6 hrs (skim first, deep on relevant sections) · the most-used free reference. Don’t try to read all of it — skim the index, then read sections as questions come up.
- Alex Xu — System Design Interview Vol 1, chapters 1-4. book · 4 hrs · the cleanest framework treatment. Worth buying.
- Hussein Nasser — Database Engines talks. channel · pick 2-3 at 30 min each · for the parts of system design that touch databases. Far better than reading docs.
You will be tempted to do dozens of “Top 50 System Design Questions” videos. Don’t. Memorizing answers is what produces the bad interview performances this module exists to prevent.
The 6-step framework
Every system design problem, real or interview, follows this structure. Walk it linearly. Do not skip steps because you “already know what to build.”
| Step | Time | What you do | What the interviewer learns |
|---|---|---|---|
| 1. Clarify | 5 min | Ask scoping questions. Confirm constraints. | You don’t build the wrong thing |
| 2. Estimate | 5 min | QPS, storage, bandwidth, growth | You can reason about scale |
| 3. API | 5 min | Define the public API contract first | You design from the user inward |
| 4. Data | 5 min | Schema and storage choices | You pick storage based on access pattern |
| 5. High-level | 10 min | Boxes and arrows. Major components only. | You can decompose a problem |
| 6. Deep-dive | 15 min | One or two components in detail | You can go all the way down when needed |
The most common failure mode: jumping to step 5 in minute 2. Don’t. Steps 1-4 take twenty minutes and they are the difference between a good answer and a generic one.
Step 1 — Clarify
Ask, in this order:
- “Who’s using this and what’s the primary action?” (Defines the scope.)
- “How many users? How much traffic?” (Defines the scale.)
- “What’s the read/write ratio?” (Defines the architecture shape.)
- “What constraints — latency, consistency, availability?” (Defines the trade-offs you’ll make.)
- “What’s out of scope?” (Stops you from over-building.)
If the interviewer says “you decide,” that’s a test. Make a defensible choice and announce it: “I’ll assume 100M DAU and a 10:1 read/write ratio. Tell me if you want different.”
Step 2 — Estimate
Memorize these numbers; they unlock everything else.
| Quantity | Number |
|---|---|
| Seconds per day | 86,400 ≈ 10⁵ |
| 100M DAU at 10 actions/day | ~12K QPS average, ~36K QPS peak |
| 1 KB JSON per write | 1 GB / 10⁶ writes |
| 1 photo (compressed) | ~100 KB |
| 1 video minute | ~1-10 MB |
| Cross-region RTT | ~100 ms |
| Same-DC RTT | ~1 ms |
| Disk seek (HDD) | ~10 ms |
| SSD random read | ~100 µs |
| RAM access | ~100 ns |
| L1 cache | ~1 ns |
Now estimate live, on the board:
- “100M DAU, 10 actions/day → 10⁹ events/day → ~12K QPS average → ~36K QPS peak (3x).”
- “Each event is 500 bytes → 500 GB/day → 180 TB/year → 540 TB at 3-year retention.”
Show your math, round to one significant figure, and narrate as you go. The interviewer is checking that you can reason at this level; exact numbers don’t matter.
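The arithmetic in those bullets can be scripted once so it becomes second nature. A minimal sketch in Python (the function names and the 3x peak factor are illustrative conventions, not fixed rules):

```python
# Back-of-envelope helpers for step 2. Everything is order-of-magnitude;
# round to one significant figure when you say it out loud.

SECONDS_PER_DAY = 86_400  # ~1e5

def avg_qps(dau: int, actions_per_day: float) -> float:
    """Average queries per second from daily active users."""
    return dau * actions_per_day / SECONDS_PER_DAY

def peak_qps(dau: int, actions_per_day: float, peak_factor: float = 3.0) -> float:
    """Rule-of-thumb peak: 3x the average unless told otherwise."""
    return avg_qps(dau, actions_per_day) * peak_factor

def storage_bytes(dau: int, actions_per_day: float,
                  bytes_per_event: int, retention_days: int) -> float:
    """Total raw storage over the retention window (no replication)."""
    return dau * actions_per_day * bytes_per_event * retention_days

# 100M DAU, 10 actions/day, 500 bytes/event, 3-year retention:
qps = avg_qps(100_000_000, 10)            # ~11,600, say "~12K QPS"
peak = peak_qps(100_000_000, 10)          # ~35K, board-rounds to "~36K"
tb = storage_bytes(100_000_000, 10, 500, 3 * 365) / 1e12  # ~550 TB
```

Replication and indexes typically multiply the storage figure by 3x or more; say that out loud too.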
Step 3 — API
Sketch the public contract. RPC-style or REST-style — be consistent.
POST /messages
body: { conversationId, senderId, text, mediaIds[] }
returns: { messageId, timestamp }
GET /conversations/:id/messages?before=:cursor&limit=20
returns: { messages[], nextCursor }
WS /conversations/:id/subscribe
→ server pushes new messages
This forces you to commit to the user-facing surface before you optimize storage. Most bad designs start with “let’s use Cassandra” before knowing what the actual reads look like.
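The `before` cursor in the GET endpoint is the part candidates most often hand-wave. A minimal in-memory sketch of its semantics (the store here is a plain list; a real implementation would query an index on `(conversation_id, message_id)`):

```python
# Cursor-based pagination for GET /conversations/:id/messages.
# `messages` is a stand-in for the store: dicts with ascending integer ids.

def page_messages(messages, before=None, limit=20):
    """Return (page, next_cursor): a newest-first page of messages
    strictly older than `before`, plus the cursor for the next
    (older) page, or None when this is the last page."""
    older = [m for m in messages if before is None or m["id"] < before]
    page = sorted(older, key=lambda m: m["id"], reverse=True)[:limit]
    next_cursor = page[-1]["id"] if len(page) == limit else None
    return page, next_cursor
```

Cursors beat offset pagination here because new writes at the head don't shift the page boundaries under the reader.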
Step 4 — Data
For each entity, decide:
- Key/index: how is it queried? (Primary key, secondary indexes.)
- Storage class: relational, KV, document, search, blob, queue, time-series?
- Sharding key: when scale demands it.
- Consistency model: strong, eventual, monotonic.
Most interview answers go to NoSQL too fast. Postgres handles a stunning amount of traffic with one read replica and the right indexes. Default to Postgres unless you have a specific reason — and articulate it.
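When scale does demand a sharding key, the routing function itself is simple; what matters is that it is stable across process restarts. A sketch, assuming plain hash-based sharding (names illustrative):

```python
import hashlib

def shard_for(key: str, n_shards: int) -> int:
    """Stable shard assignment by hashing the shard key.
    Uses hashlib rather than Python's builtin hash(), which is
    randomized per process and would break routing on restart."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_shards
```

The modulo scheme reshuffles almost every key when `n_shards` changes; mention consistent hashing as the fix if resharding is on the table.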
Step 5 — High-level
Boxes and arrows. Major components only. The eight building blocks below cover almost everything.
Step 6 — Deep-dive
The interviewer will pick a component. Be ready to go deep on:
- The chat write path, including delivery semantics.
- The feed generation strategy (push, pull, hybrid).
- The cache invalidation strategy.
- The rate-limiting algorithm and where it lives.
- The failure mode if your primary database goes down.
If the interviewer doesn’t pick, propose: “I’d like to deep-dive on X because that’s where the hard trade-off is. OK to go there?”
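For the rate-limiting deep-dive, the algorithm most worth having in your fingers is the token bucket. A single-process sketch (in production the per-key state usually lives in Redis, keyed by user or IP):

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: refill at `rate` tokens/sec up to
    `capacity`; each allowed request spends one token. Capacity is
    the burst size; rate is the sustained limit."""

    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, now=None) -> bool:
        """True if the request may proceed. `now` is injectable for tests."""
        now = time.monotonic() if now is None else now
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Be ready for the follow-up on where it lives: per-instance buckets undercount behind a load balancer, so shared state (or a sticky routing key) is usually part of the answer.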
The eight building blocks
These eight cover 90% of any system you’ll design. Know each well enough to use without explaining what it is.
| Block | What it solves | Avoid when |
|---|---|---|
| Load balancer | Spread traffic, health checking | Single-instance internal services |
| CDN | Static asset delivery, edge caching | Dynamic per-user content |
| App server | Stateless request handling | Compute-heavy or batch |
| Cache (Redis/Memcached) | Hot read path, sessions, rate limits | Strong-consistency reads |
| Database (Postgres/MySQL) | Source of truth | Time-series, full-text search |
| Queue (Kafka/SQS/NATS) | Async work, decoupling, backpressure | Sub-millisecond round-trip |
| Search (Elasticsearch/OpenSearch) | Full-text, faceted queries | Source of truth |
| Blob store (S3/GCS) | Files, images, video, dataset storage | Sub-millisecond read latency |
A clean answer composes these. A bad answer reaches for “let’s use Kafka” without ever explaining why.
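The most common composition of the cache and database blocks is cache-aside. A sketch with dicts standing in for Redis and Postgres:

```python
# Cache-aside (look-aside) read path. The dicts are stand-ins;
# a real implementation would set a TTL on the cache entry.

cache = {}                           # stand-in for Redis
db = {"user:1": {"name": "Ada"}}     # stand-in for Postgres

def get(key):
    if key in cache:
        return cache[key]            # hot path: cache hit
    value = db.get(key)              # miss: fall through to source of truth
    if value is not None:
        cache[key] = value           # populate for subsequent reads
    return value
```

The classic deep-dive follow-up is invalidation: on write, delete the cache entry rather than updating it, and let the next read repopulate.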
The four classic problems, walked through
You should be able to walk each of these end-to-end in 45 minutes. Practice each at least twice with a partner.
URL shortener
Clarify: 100M URLs, 10:1 read:write, sub-100ms read latency
Estimate: 100M new URLs/year → write QPS ~3, read QPS ~30; storage ~100 GB/year at ~1 KB/row
API: POST /shorten {url} → {short}; GET /:short → 302 redirect
Data: KV store, key=short_id, value=long_url. Postgres or DynamoDB.
High-level: LB → app → cache → DB. Counter for ID generation.
Deep-dive: ID strategy. Counter? Hash? Range allocation?
Most teams: range-allocate IDs from a counter, base62-encode, cache.
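The range-allocate-then-encode strategy boils down to a few lines. A sketch of the base62 step (the counter allocation itself is omitted):

```python
# Base62-encode a counter-allocated integer ID into a short URL slug.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62(n: int) -> str:
    """Repeated divmod by 62; digits come out least-significant first."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))
```

Seven base62 characters cover 62⁷ ≈ 3.5 trillion IDs, which is why short slugs are enough even at large scale.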
Twitter timeline
Clarify: 200M DAU, average 200 followers, celebs at 100M
Estimate: post writes ~10K QPS, timeline reads ~1M QPS
API: POST /tweets, GET /timeline
Data: tweets in KV; followers as adjacency list; timelines as cache
High-level: hybrid push-pull
- Push tweets to follower timelines on write (for non-celebs)
- Pull tweets from celebs at read time (don't fan out 100M times)
Deep-dive: the push/pull threshold and how you decide it.
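The hybrid write and read paths fit in a short sketch. Dicts stand in for the timeline cache and the celeb post store; the threshold is configurable, and choosing its value is exactly the deep-dive question:

```python
def on_tweet(author, tweet_id, followers, timelines, celeb_posts,
             threshold=10_000):
    """Write path of the hybrid model (stores are dicts for the sketch).
    Non-celebs: push the tweet id into each follower's cached timeline.
    Celebs: append to a per-celeb list that is pulled at read time."""
    if len(followers[author]) < threshold:
        for f in followers[author]:
            timelines.setdefault(f, []).append(tweet_id)
    else:
        celeb_posts.setdefault(author, []).append(tweet_id)

def read_timeline(user, following, timelines, celeb_posts, limit=50):
    """Read path: merge the precomputed timeline with celeb posts.
    Assumes tweet ids are time-ordered, so sorting gives newest-first."""
    merged = list(timelines.get(user, []))
    for author in following[user]:
        merged.extend(celeb_posts.get(author, []))
    return sorted(merged, reverse=True)[:limit]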
Ride share
Clarify: drivers' positions update every 4s, riders match within ~5s
Estimate: 10M drivers, ~2.5M position updates/sec
API: POST /location, POST /request_ride, WS /driver
Data: positions in geohash-keyed cache (Redis with sorted sets);
rides in Postgres
High-level: location service + matching service + ride state service
Deep-dive: geohash precision and the search-radius algorithm
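In production you would likely lean on Redis geo commands or a geohash library; the underlying idea is just bucketing positions by grid cell and scanning the neighbors. A pure-Python sketch (cell size and names are illustrative):

```python
import math

def cell(lat, lon, size=0.01):
    """Bucket coordinates into a grid cell (~1 km at size=0.01 degrees).
    Geohash does the same job with a prefix-friendly string key."""
    return (math.floor(lat / size), math.floor(lon / size))

def nearby_drivers(index, lat, lon, size=0.01):
    """index: {cell: {driver_id: (lat, lon)}}. Scans the rider's cell
    plus its 8 neighbors; widen the ring if too few candidates."""
    cx, cy = cell(lat, lon, size)
    found = {}
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            found.update(index.get((cx + dx, cy + dy), {}))
    return found
```

The precision trade-off is visible here: smaller cells mean fewer false candidates per cell but more cells to scan for a given search radius.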
Chat (real-time)
Clarify: 1B users, average 50 messages/day, online status, 1:1 + groups
Estimate: ~580K msgs/sec average write rate (5×10¹⁰ msgs/day), higher at peak; persistent WS connections
API: WS connection, POST /messages, presence
Data: messages in time-series-keyed store, group fanout via queue
High-level: edge WS layer, message bus, persistence, fanout workers
Deep-dive: message delivery semantics (at-least-once + dedup)
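At-least-once delivery plus consumer-side dedup is the standard answer, and it is worth being able to sketch. A minimal version (the `seen` set would be a TTL'd Redis set in production, keyed per conversation):

```python
def make_consumer(deliver):
    """Wrap a delivery callback so redelivered messages are dropped.
    At-least-once transports may redeliver; dedup by message id makes
    delivery effectively exactly-once from the user's point of view."""
    seen = set()

    def handle(msg):
        if msg["id"] in seen:
            return False      # duplicate redelivery: drop silently
        seen.add(msg["id"])
        deliver(msg)
        return True

    return handle
```

The subtle requirement this sketch glosses over is that message ids must be assigned by the sender (client-generated), so retries carry the same id.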
How to spot a bad answer
These four patterns tell you the answer is going to be marked down:
| Anti-pattern | What it sounds like |
|---|---|
| Over-engineering | “I’ll use Kafka, Spark, Flink, Elasticsearch, and Cassandra” — for an MVP |
| No estimates | Skipping step 2; building for unknown scale |
| Jumping to NoSQL | “Definitely DynamoDB” without articulating the access pattern |
| Faking depth | Naming a tech without being able to explain how it solves the problem |
The first three are common. The fourth is the lethal one — it’s the difference between a junior and senior performance.
The thing the framework can’t teach you
The framework gives you structure. The structure gets you to “competent.” The leap to “excellent” is doing this with calibration — naming what you don’t know without panicking.
Bad: "I'll use a circuit breaker." (No idea what that does)
Bad: "Hmm, I'm not sure." (Frozen, no progress)
Good: "I'd want a circuit breaker here. I haven't shipped one in production
myself but the role would be to fail fast when the dependency is down,
with a half-open recovery state. Want me to keep moving and we can
come back if needed?"
That third pattern is the thing senior engineers do that juniors don’t. Practice it. Memorize the meta-pattern: name the gap, propose the role of the missing piece, keep moving.
Going deeper
When you have specific questions, in this order:
- High Scalability — case studies — read Instagram’s, Discord’s, and Stack Overflow’s writeups in particular. Real architectures, real trade-offs.
- Martin Kleppmann — Designing Data-Intensive Applications. book · the bible for the database half of system design. Read after a few interviews.
- ByteByteGo — Alex Xu’s video series. Polished and well-paced.
- Discord engineering blog — How Discord stores trillions of messages — concrete data on the scale problem chat systems actually face.
Skip the YouTube channels that “do system design in 10 minutes.” Real answers take 45.
Checkpoints
If any wobbles, reread the corresponding section.
- Walk through the 6-step framework on a problem you’ve never seen — say “design a feature flag service for 1000 engineers.” Use a timer; aim for 45 minutes.
- From memory: rough QPS for 100M DAU at 10 actions/day. Storage for 1 KB events at that volume over 3 years. Show your math.
- Why is “default to Postgres” usually a better answer than “default to DynamoDB”? Name a specific access pattern that flips the answer.
- Pick the Twitter timeline problem. Why does pure push fan-out break, and why does pure pull break? Describe the hybrid.
- Talk for 60 seconds, out loud, about a system you’ve built or used heavily — APIs, data model, scale, failure mode. If you stumble, that’s where to practice next.
When you can answer all five from memory, move to 05.2 Caching, queues, rate limits. The boxes you drew on the whiteboard are about to become real components in code.