$ yuktics v0.1

T2 — CS Theory You Actually Need · module 02.1 · ~10–14 hrs

Data structures, built not memorized

The DSA module that isn't a LeetCode grind. Build the dozen data structures that matter from scratch — once each — and learn to recognize the situations where each one is the right answer.

Prerequisites

  • 01.1 (Python)
  • comfort with functions, loops, classes

Stack

  • Python 3.12
  • pytest (for verifying your impls)
  • a notebook (paper or digital — for drawing)

By the end of this module

  • Implement each of the dozen core data structures from scratch, in <50 lines, without reference.
  • Know the actual time complexity of every operation — and the constant-factor reality the textbook hides.
  • Recognize the pattern in a problem statement that tells you which DS to reach for.
  • Use the library version correctly in production, instead of re-implementing.

The traditional DSA course has a problem. It teaches you to memorize a hundred algorithms — most of which you will never implement again — without teaching you the small number of habits that actually let you reason about data structures in the wild. This module flips that. You will implement the small set of data structures that matter, by hand, once. Then you will use the library version forever, with confidence.

This is not a LeetCode grind. That comes in module 02.2. This is the foundation under it: knowing what each data structure actually is, what it costs you, and when it is and isn’t the right answer.

Why “build, then never re-implement”

The reason to implement a hash map by hand once is not so you can implement one in production. (You won’t. The standard library’s is faster, more correct, and battle-tested.) The reason is so that when you read code that uses one, you have a mental model of what the machine is actually doing. When dict[key] is slow in your code, you’ll know to look at the hash function, the load factor, or the keys. When the textbook says “amortized O(1),” you’ll know what’s being amortized.

In other words: you are building these once for intuition, not for use. Internalize that frame. It changes how carefully you build them.
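That intuition is easy to check for the amortized-resize half of the story. A quick CPython-specific sketch (`sys.getsizeof` reflects the over-allocated backing array of a `list`): capacity grows in occasional jumps, so most appends never touch the allocator and the rare copy is amortized away.

```python
import sys

# Watch CPython's list over-allocate: the reported size grows in
# occasional jumps, not on every append.
xs = []
sizes = []
for _ in range(64):
    xs.append(0)
    sizes.append(sys.getsizeof(xs))

jumps = len(set(sizes))  # distinct allocation sizes, far fewer than appends
print(jumps, "resizes for", len(xs), "appends")
```

The exact growth schedule is a CPython implementation detail; the point is the shape of the curve, not the numbers.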

The dozen that actually matter

These are the data structures every CS student should be able to draw, implement, and reason about. The order is roughly easiest to hardest.

  1. Dynamic array
     Build it because: the default container, and ~half of all data structures internally
     Use it via: list (Py) / Vec (Rust) / Array (JS)
  2. Linked list
     Build it because: pointer manipulation, and the trade-off vs. arrays
     Use it via: rare in real code, common in interviews
  3. Stack
     Build it because: LIFO; recursion, undo, parsing
     Use it via: list with .append/.pop
  4. Queue
     Build it because: FIFO; BFS, scheduling
     Use it via: collections.deque
  5. Hash map
     Build it because: the single most-used data structure in production
     Use it via: dict (Py) / Map (JS) / HashMap (Rust)
  6. Hash set
     Build it because: hash map’s “I only care if it exists” cousin
     Use it via: set
  7. Binary search tree
     Build it because: ordered data with O(log n) operations
     Use it via: rarely used directly; rebalancing variants are everywhere
  8. Heap (priority queue)
     Build it because: “give me the smallest / largest” in O(log n)
     Use it via: heapq (Py) / BinaryHeap (Rust)
  9. Trie
     Build it because: prefix problems, autocomplete, IP routing
     Use it via: rolled by hand or as a package
  10. Graph (adjacency list)
      Build it because: anything with relationships
      Use it via: rolled by hand using dict[T, list[T]]
  11. Disjoint set / union-find
      Build it because: connectivity, Kruskal’s, online connectivity queries
      Use it via: rolled by hand
  12. Bloom filter
      Build it because: probabilistic membership at huge scale
      Use it via: pybloom_live / DB internals / CDN logic

That’s the whole list. Anything beyond this is a variation, an optimization, or a domain-specific structure you’ll learn in context. If your CS course makes you implement red-black trees from scratch, do it once and then stop — they are an excellent exercise in pointer manipulation and an awful production data structure to roll yourself.
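Number 12 is the least familiar item on the list, so here is the core idea in miniature: k bit positions derived from two digests (the Kirsch–Mitzenmacher trick), no false negatives, a tunable false-positive rate. A sketch only; the sizes and the class name are arbitrary, and real deployments use a library.

```python
import hashlib

class BloomFilter:
    """Probabilistic set: answers 'definitely not present' or 'probably present'."""

    def __init__(self, m_bits: int = 8192, k_hashes: int = 4):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: str):
        # Derive k bit positions from two halves of one SHA-256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        # Every bit set -> probably present; any bit clear -> definitely absent.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bf = BloomFilter()
for url in ("a.com", "b.com", "c.com"):
    bf.add(url)
```

Note what it cannot do: delete, iterate, or report an exact count. That is the price of the memory savings.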

How to actually build them

For each of the twelve, follow the same loop. It takes 30–90 minutes per structure.

1. Draw it on paper. What does it look like in memory?
   What invariants must hold after every operation?
2. Write the operations as method signatures, with type hints.
   Add cost (Big-O) as a comment beside each.
3. Implement, in plain Python, no fancy tricks.
4. Write a pytest file with at least 5 cases per operation,
   including: empty, single element, duplicates, the operation
   that changes the underlying structure (resize, rebalance).
5. Run a tiny benchmark: insert 1M items, time it.
   Compare to the stdlib equivalent. Note the ratio.

Step 5 is the one most students skip and the one that produces the most insight. Your hand-rolled hash map will be 10–30× slower than dict. That gap is the lesson — that is what years of optimization, cache locality, and CPython internals buy you.
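Step 5 in miniature, against a deliberately minimal chaining hash map (class name, load factor, and the item count are all arbitrary choices for illustration):

```python
import time

class ChainedHashMap:
    """Minimal chaining hash map — illustrative only, far slower than dict."""

    def __init__(self, capacity: int = 16):
        self._buckets = [[] for _ in range(capacity)]
        self._size = 0

    def __setitem__(self, key, value) -> None:
        if self._size > 0.75 * len(self._buckets):   # load factor threshold
            self._resize()
        bucket = self._buckets[hash(key) % len(self._buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)             # overwrite duplicate key
                return
        bucket.append((key, value))
        self._size += 1

    def __getitem__(self, key):
        for k, v in self._buckets[hash(key) % len(self._buckets)]:
            if k == key:
                return v
        raise KeyError(key)

    def _resize(self) -> None:
        old = self._buckets
        self._buckets = [[] for _ in range(2 * len(old))]
        for bucket in old:
            for k, v in bucket:
                self._buckets[hash(k) % len(self._buckets)].append((k, v))

def bench(n: int = 100_000) -> float:
    """Ratio of hand-rolled insert time to dict insert time."""
    t0 = time.perf_counter()
    hm = ChainedHashMap()
    for i in range(n):
        hm[i] = i
    mine = time.perf_counter() - t0

    t0 = time.perf_counter()
    d = {}
    for i in range(n):
        d[i] = i
    return mine / (time.perf_counter() - t0)

print(f"hand-rolled / dict insert ratio: {bench():.1f}x")
```

Run it yourself; the exact ratio depends on your machine and Python version, but it will not be close to 1.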

Build prompt: the dozen, in order

Open a fresh project. Build each one in src/<name>.py, with tests/test_<name>.py next to it.

mkdir dsa && cd dsa
uv venv .venv && source .venv/bin/activate
uv pip install pytest
mkdir src tests

Skeletons to start from. Don’t peek at solutions until you’ve struggled for at least 20 minutes per structure.

# src/hash_map.py — implement this from scratch, no `dict`
class HashMap:
    def __init__(self, capacity: int = 16):
        ...
    def __setitem__(self, key, value) -> None: ...   # O(1) amortized
    def __getitem__(self, key):                       # O(1) avg, O(n) worst
        ...
    def __delitem__(self, key) -> None: ...
    def __contains__(self, key) -> bool: ...
    def __len__(self) -> int: ...
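Step 4's cases can be written against the mapping interface and validated against dict first; if a test fails on dict, the test is wrong, not the structure. A sketch (shown as plain functions so it runs anywhere; in the project, make each a pytest test_* function and parametrize over implementations):

```python
# tests for the mapping interface — run against dict (known good) first,
# then swap in your HashMap.
def check_empty(Map):
    m = Map()
    assert len(m) == 0 and "missing" not in m

def check_duplicates(Map):
    m = Map()
    m["a"] = 1
    m["a"] = 2              # duplicate key overwrites; size stays 1
    assert m["a"] == 2 and len(m) == 1

def check_delete(Map):
    m = Map()
    m["a"] = 1
    del m["a"]
    assert "a" not in m

def check_resize(Map):
    m = Map()
    for i in range(1000):   # push well past any plausible initial capacity
        m[i] = i
    assert all(m[i] == i for i in range(1000))

for check in (check_empty, check_duplicates, check_delete, check_resize):
    check(dict)             # replace dict with HashMap once it exists
```

The resize case is the "operation that changes the underlying structure" from step 4; it is the one most likely to expose a rehashing bug.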

A few specific stretch goals when you finish:

  • Hash map: handle collisions with both chaining and open addressing. Benchmark both.
  • Heap: implement it as a flat array using the 2i+1 / 2i+2 indexing trick. Your bubble_up and bubble_down should each fit in 8 lines.
  • Graph: write BFS, DFS, and Dijkstra on top of your adjacency-list class. Run them on a real dataset (random edges, or scrape Wikipedia categories).
  • Union-find: implement with both path compression and union-by-rank, then turn each off in turn and benchmark on n = 10^6 operations.
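The heap stretch goal, sketched as a min-heap (a hypothetical class; the flat-array layout puts the children of index i at 2i+1 and 2i+2, its parent at (i−1)//2):

```python
class MinHeap:
    # Invariant after every operation: a[(i - 1) // 2] <= a[i] for all i > 0.
    def __init__(self):
        self.a = []

    def push(self, x) -> None:                    # O(log n)
        self.a.append(x)
        self._bubble_up(len(self.a) - 1)

    def pop(self):                                # O(log n)
        a = self.a
        a[0], a[-1] = a[-1], a[0]                 # move min to the end, then remove it
        smallest = a.pop()
        if a:
            self._bubble_down(0)
        return smallest

    def _bubble_up(self, i: int) -> None:
        a = self.a
        while i > 0 and a[i] < a[(i - 1) // 2]:
            a[i], a[(i - 1) // 2] = a[(i - 1) // 2], a[i]
            i = (i - 1) // 2

    def _bubble_down(self, i: int) -> None:
        a, n = self.a, len(self.a)
        while True:
            smallest = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < n and a[child] < a[smallest]:
                    smallest = child
            if smallest == i:
                return
            a[i], a[smallest] = a[smallest], a[i]
            i = smallest
```

No nodes, no pointers: siblings sit next to each other in memory, which is exactly the cache-locality point checkpoint 2 below asks about.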

When all twelve are green and benchmarked, you have done more genuine DSA work than 90% of CS undergrads finishing their degree.
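For the graph stretch goal above, BFS on the dict[T, list[T]] representation is only a few lines; a sketch with a made-up toy graph:

```python
from collections import deque

def bfs(graph: dict, start):
    """Return every node reachable from start, in breadth-first order."""
    seen = {start}
    order = []
    q = deque([start])
    while q:
        node = q.popleft()
        order.append(node)
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                q.append(neighbor)
    return order

# Toy graph, invented for illustration:
g = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(bfs(g, "a"))  # breadth-first from "a": ['a', 'b', 'c', 'd']
```

Swapping the deque for a plain list used as a stack turns this into DFS; that one-line difference is worth sitting with.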

The five patterns that recur

Once you’ve built the structures, the next thing to internalize is the patterns that tell you which one to reach for. There are roughly five.

  • “Has this been seen before?” / “Group by X”
    Reach for: hash map / hash set (why: O(1) membership)
  • “Always smallest / largest / next”
    Reach for: heap (why: O(log n) extract-min)
  • “Range / ordered queries”
    Reach for: sorted structure / segment tree (why: log-time queries on intervals)
  • “Path / connectivity / dependency”
    Reach for: graph + BFS/DFS (why: most things in real systems)
  • “Prefix / autocomplete / matching”
    Reach for: trie (why: branching by character)
Drill this. When you read a problem, the first thing you should do is name the pattern. Then pick the structure. Most students do this in the wrong order and end up choosing a structure first because they’ve been over-trained to think in code.
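A worked example of the right order. Problem statement: “return the first URL that appears twice in a stream.” Name the pattern first (“has this been seen before?”), then pick the structure (hash set); the code is almost mechanical after that:

```python
def first_repeat(stream):
    # Pattern: "has this been seen before?" -> hash set, O(1) membership per item.
    seen = set()
    for item in stream:
        if item in seen:
            return item
        seen.add(item)
    return None
```

Notice that nothing about the code was hard; the decision was made before a line was written.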

Going deeper

When you have specific questions, in this order:

  1. Sedgewick & Wayne — Algorithms (4th ed.). book · the cleanest reference there is. The companion site has all the code.
  2. MIT 6.006 — Introduction to Algorithms. course · free, well-paced, with proofs when proofs matter.
  3. Visualgo. site · animations of every operation. Useful when the diagrams in your head aren’t matching the code.
  4. CLRS — only after the above. It’s a reference, not a teacher. Most students pick it up too early and get demoralized.

Skip the “Top 10 LeetCode patterns” videos until module 02.2. Patterns mean nothing without the structures under them.

Checkpoints

If any one wobbles, build that structure again from scratch.

  1. Walk through what happens physically in memory when a Python dict exceeds its load factor. Why is the resize amortized O(1)?
  2. Why is a heap typically stored as a flat array, not as a tree of nodes? What does the array layout buy you?
  3. Give the time complexity of: lookup in a hash map, lookup in a balanced BST, lookup in a trie of strings of length L. Now give the constant-factor reality of each.
  4. You’re given a stream of 10 billion URLs and 16 GB of RAM. You need to answer “have we seen this URL before?” with high probability. Which structure do you reach for, and what’s the trade-off?
  5. Pick any two structures from the dozen and explain when one is the right answer and the other is a tempting wrong answer. Be specific about the problem statement that distinguishes them.

If you can answer all five from memory, you’ve earned 02.1. Move on to 02.2 (Algorithms & complexity, the practical view) — that’s where the patterns above turn into solved problems.