Computer architecture and performance
Cache, memory hierarchy, branch prediction, why your code is slow. Profiling as a first-class engineering skill.
Prerequisites
01.5 (a low-level language: C or Rust)
Stack
- a Linux or Mac machine
- perf (Linux) or Instruments (macOS)
- godbolt.org for asm exploration
- Python (cProfile, py-spy) and your low-level lang from 01.5
By the end of this module
- Recite the latency hierarchy (registers, L1, L2, L3, RAM, SSD) within an order of magnitude.
- Predict whether a loop is cache-friendly by reading the code, before running it.
- Use a profiler before optimizing. Always. Without exception.
- Read a flame graph and identify the hot path within a minute.
- Speed up a slow Python loop by 100x using profiler-driven changes.
There is a category of bug you cannot fix without understanding the machine. The Python loop that’s 100x slower than its NumPy equivalent. The C++ array access pattern that’s 10x slower in column-major than row-major. The database query that’s fast in isolation and dies under cache pressure. None of these submit to careful logic. They submit to knowing what the CPU and memory hierarchy actually do.
This module is the practical version of computer architecture. Not RTL design, not pipeline scheduling — the parts that change how you write code. The opinionated take: you do not need to optimize most code. The code that genuinely matters wears its bottlenecks visibly when you profile it. Always profile first.
The two skills you will leave with: a working mental model of the memory hierarchy, and the reflex to reach for a profiler before changing a line.
Set up
# Linux
sudo apt install linux-tools-common linux-tools-generic
pip install py-spy
# macOS
# Instruments comes with Xcode
xcode-select --install
pip install py-spy
Verify:
perf --version 2>/dev/null || echo "use Instruments on macOS"
py-spy --version
You will use godbolt.org for assembly exploration. No install needed; just bookmark it. Make a project:
mkdir perf && cd perf
mkdir bench notes
Read these first
In this order:
- Brendan Gregg — Systems Performance (2nd ed.). book · 4–6 hours · the definitive applied book. Chapters 1–6 are required. The “USE method” is the only mental framework you need.
- Jeff Dean — Numbers everyone should know. card · 5 minutes · memorize this. It is the most-cited card in CS for a reason.
- Ulrich Drepper — What every programmer should know about memory. paper · 3–4 hours · old but still mostly correct. The cache section is the canonical reference.
- Hennessy & Patterson — Computer Architecture: A Quantitative Approach. book · only after the above; chapter 2 (memory hierarchy) is what you want.
- Algorithmica — Algorithms for Modern Hardware. free site · the best free resource on practical performance engineering, with godbolt-driven examples throughout.
After Brendan Gregg’s first six chapters and Jeff Dean’s card, you can stop and start measuring. Most of the ROI is there.
The latency hierarchy
The single chart that will change how you write code. Approximate, modern numbers:
| Operation | Time | Relative |
|---|---|---|
| L1 cache hit | 1 ns | 1x |
| L2 cache hit | 4 ns | 4x |
| Branch mispredict | 5 ns | 5x |
| L3 cache hit | 12 ns | 12x |
| Main memory access | 100 ns | 100x |
| SSD read (random) | 100 µs | 100,000x |
| Round trip within a datacenter | 500 µs | 500,000x |
| Disk seek (HDD) | 10 ms | 10,000,000x |
| Cross-continental round trip | 150 ms | 150,000,000x |
A few things should jump out. Main memory is 100x slower than L1. SSD is 1000x slower than main memory. A round trip within the datacenter is closer in cost to a disk seek than to a memory read. The whole game of high-performance code is keeping the working set in cache.
When someone says “this loop is slow,” your first question should be: what fraction of its accesses miss L1? Most of the time, that’s the answer.
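A back-of-envelope sketch of why that question matters, using only the table’s approximations (the access count and miss rates below are made up for illustration):

# estimate loop time from the table's numbers (illustrative, not a measurement)
L1_HIT_NS = 1          # approximate L1 hit latency from the table above
RAM_NS = 100           # approximate main-memory latency from the table above
ACCESSES = 10_000_000  # hypothetical loop touching memory ten million times

def estimated_ms(miss_fraction):
    avg_ns = (1 - miss_fraction) * L1_HIT_NS + miss_fraction * RAM_NS
    return ACCESSES * avg_ns / 1e6

print(estimated_ms(0.00))   # ~10 ms: every access hits L1
print(estimated_ms(0.01))   # ~20 ms: a 1% miss rate already doubles the runtime
print(estimated_ms(0.50))   # ~505 ms: the loop is now a memory benchmark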
Cache lines and the row-vs-column lesson
Memory is fetched from RAM in cache lines — typically 64 bytes on x86. When you read a single byte, the CPU fetches the entire 64-byte line containing it into L1. The next 63 bytes are then nearly free.
This is why row-major vs column-major iteration matters:
// Cache-friendly: sequential memory access, one cache miss per 16 ints
for (int r = 0; r < N; r++)
    for (int c = 0; c < N; c++)
        sum += a[r][c];

// Cache-hostile: each access jumps N ints in memory, one miss per access
for (int c = 0; c < N; c++)
    for (int r = 0; r < N; r++)
        sum += a[r][c];
For N = 4096, the cache-hostile version is typically 5–20x slower on a modern CPU. Same operation count. Same algorithm. Different memory access pattern. This is one of the cleanest demonstrations of why architecture matters.
A practical heuristic: touch memory in the order it’s laid out. In C, that’s row-major for 2D arrays. In Fortran or NumPy with order='F', it’s column-major. Python lists of lists have no contiguous layout at all: every element is a separate heap object, which is one reason Python loops over numerical data are slow.
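The same effect is visible from Python with NumPy. A minimal sketch assuming only NumPy is installed; the exact ratio depends on your CPU and is smaller than the raw C demo because NumPy adds per-call overhead:

# row-wise sums over a row-major vs. a column-major copy of the same data (sketch)
import time
import numpy as np

n = 4096
a_c = np.random.rand(n, n)        # C order: each row is contiguous in memory
a_f = np.asfortranarray(a_c)      # F order: each row is strided across the array

def sum_rows(a):
    total = 0.0
    for r in range(a.shape[0]):
        total += a[r, :].sum()    # sequential sweep in C order, strided walk in F order
    return total

for label, a in (("C order", a_c), ("F order", a_f)):
    t = time.perf_counter()
    sum_rows(a)
    print(label, round(time.perf_counter() - t, 3), "s")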
Branch prediction and the famous benchmark
Modern CPUs guess which way a branch will go and execute speculatively. A correct guess is free; a misprediction costs a pipeline flush, roughly 5 ns. One of the most-upvoted Stack Overflow questions of all time is the demonstration that processing a sorted array can be 6x faster than processing the same elements unsorted, because the branch becomes predictable.
for (int i = 0; i < n; i++)
    if (data[i] >= 128)   // predictable when data is sorted
        sum += data[i];
Sort the array first and run the same loop. Over repeated passes, which is how the benchmark runs, it’s faster even after paying for the sort. That is branch prediction laid bare.
Practical implications:
- Branchless code (using bit tricks or cmov) can beat branch-heavy code on unpredictable inputs.
- Vectorized code (SIMD) avoids branches entirely in the inner loop.
- Most of the time, you should not write branchless code by hand. The compiler does it when it can. Profile first.
SIMD at a glance
SSE, AVX, AVX-512, NEON. Single Instruction Multiple Data. One instruction operates on 4, 8, or 16 values at once. The compiler vectorizes loops automatically when it can prove the access pattern is safe and contiguous.
You don’t usually write SIMD by hand. You write code the compiler can vectorize:
- No data dependencies across loop iterations.
- Contiguous memory access.
- No function calls in the inner loop.
- Aligned data when possible.
Tools like -fopt-info-vec (gcc) tell you which loops vectorized. NumPy and SciPy already use vectorized BLAS under the hood: roughly half of NumPy’s speedup over a Python loop comes from SIMD, and the rest comes from pre-allocated contiguous arrays and branch-free C kernels.
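You can feel the combined effect from Python without writing a single intrinsic: compare a pure-Python inner product against np.dot, which dispatches to a vectorized BLAS kernel. A sketch; the numbers vary widely by machine and BLAS build:

# pure-Python loop vs. NumPy's vectorized kernel (illustrative sketch)
import time
import numpy as np

n = 2_000_000
xs = np.random.rand(n)
ys = np.random.rand(n)

t = time.perf_counter()
acc = 0.0
for i in range(n):                # interpreter dispatch and boxing on every element
    acc += xs[i] * ys[i]
t_loop = time.perf_counter() - t

t = time.perf_counter()
acc_np = np.dot(xs, ys)           # one call into a vectorized BLAS kernel
t_dot = time.perf_counter() - t

print(f"loop: {t_loop:.3f}s  np.dot: {t_dot:.5f}s  ratio: {t_loop / t_dot:.0f}x")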
Profilers per language
Always profile before optimizing. The first rule, the only rule.
| Language | Profiler | Notes |
|---|---|---|
| Python | py-spy, cProfile | py-spy attaches to a running process, no code changes |
| C / C++ | perf, callgrind, vtune | perf record -g, then perf report |
| Rust | perf, samply, flamegraph | cargo flamegraph is a one-liner |
| Go | pprof (built in) | go test -bench -cpuprofile |
| Java | async-profiler, JFR | flame graphs come for free |
| JS / Node | Chrome DevTools, clinic.js | the DevTools profiler is genuinely good |
The output you want is a flame graph: stack depth on the y-axis, samples on the x-axis. Width, not horizontal position, is proportional to time, so wider bars are hotter. Read a flame graph by scanning for the widest plateau in user code; that’s your bottleneck.
# Python, real example
py-spy record -o profile.svg -- python slow_thing.py
open profile.svg
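py-spy gives you the flame graph; the standard-library cProfile gives per-function call counts and cumulative times with no extra install. A minimal sketch, where slow_thing is a stand-in workload rather than a real module:

# cProfile + pstats from the standard library: per-function totals
import cProfile
import pstats

def slow_thing():
    return sum(i * i for i in range(2_000_000))   # stand-in workload

profiler = cProfile.Profile()
profiler.enable()
slow_thing()
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)   # top 10 by cumulative time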
Twenty seconds of profiling beats two hours of guessing.
The build: make a slow Python loop 100x faster
A canonical exercise. Start with:
# bench/slow.py
import random, time
def pairwise_distance(xs, ys):
    n = len(xs)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            out[i][j] = (dx * dx + dy * dy) ** 0.5
    return out
n = 1500
xs = [random.random() for _ in range(n)]
ys = [random.random() for _ in range(n)]
t = time.perf_counter(); pairwise_distance(xs, ys); print(time.perf_counter() - t)
On my laptop this takes ~6 seconds for n = 1500. Profile it:
py-spy record -o slow.svg -- python bench/slow.py
The flame graph shows ~95% of the time in the inner loop, dominated by the arithmetic and the list-of-list indexing.
Now optimize. Three moves, each measured:
- NumPy vectorization. Rewrite as a broadcast: xs[:,None] - xs[None,:]. This does the whole computation in one C kernel with SIMD. Expect 30–100x.
- scipy.spatial.distance.cdist. Battle-tested, tuned C under the hood. Often faster than naive NumPy by another 2–3x.
- Numba @jit. If you must keep the loop form, JIT it. Often within a factor of 2 of NumPy.
Move 1 is bench/fast.py below; moves 2 and 3 are sketched after it.
# bench/fast.py
import numpy as np
def pairwise_distance(xs, ys):
    pts = np.stack([xs, ys], axis=1)
    diff = pts[:, None, :] - pts[None, :, :]
    return np.sqrt((diff * diff).sum(axis=2))
Same n = 1500, this runs in ~50 ms. That’s 120x. Same algorithm. Different memory layout, vectorized C kernel, no Python interpreter overhead in the inner loop.
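For comparison, here are moves 2 and 3 sketched side by side: a hypothetical bench/fast_alt.py, assuming SciPy and Numba are installed and that xs and ys are passed as NumPy arrays:

# bench/fast_alt.py: moves 2 and 3, sketched (assumes SciPy and Numba installed)
import numpy as np
from scipy.spatial.distance import cdist
from numba import njit

def pairwise_cdist(xs, ys):
    pts = np.stack([xs, ys], axis=1)
    return cdist(pts, pts)                 # Euclidean by default, tuned C inner loop

@njit(cache=True)
def pairwise_numba(xs, ys):                # same loop shape as slow.py, JIT-compiled
    n = xs.shape[0]
    out = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            out[i, j] = (dx * dx + dy * dy) ** 0.5
    return out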
The lesson is not “always use NumPy.” The lesson is the inner loop is where time lives, and the right representation makes that loop fast. You only learn which representation that is by profiling.
The “use a profiler before optimizing” rule
Knuth’s full quote, often truncated: “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.” The 3% is the hot path. The profiler tells you which 3%.
Common anti-patterns:
- “I think this is slow because of the X.” Profile it. Most guesses are wrong.
- Optimizing the function the IDE happens to be open on. The profiler does not lie about which functions are hot.
- Caching everything “to be safe.” Caches add complexity, bugs, and memory pressure. Cache after measuring, not before.
- Choosing an asymptotically faster algorithm whose constant factors make it slower than the “slow” one at your actual input size. The algorithm chapter tables (02.2) account for asymptotics; the profiler accounts for reality.
The discipline is one line: measure, change one thing, measure again. Hold yourself to it.
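A minimal harness for that discipline, sketched below; bench and the commented-out calls are placeholders for whatever you are actually comparing:

# measure, change one thing, measure again (placeholder harness)
import time
import statistics

def bench(fn, *args, repeats=5):
    """Run fn several times and return the median wall-clock time in seconds."""
    times = []
    for _ in range(repeats):
        t = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - t)
    return statistics.median(times)

# before = bench(pairwise_distance, xs, ys)
# ...change exactly one thing...
# after = bench(pairwise_distance, xs, ys)
# print(f"speedup: {before / after:.1f}x")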
Going deeper
When you have specific questions, in this order:
- Brendan Gregg — Systems Performance. book · the applied book. Re-read after a year of practice; you’ll get more out of it.
- Algorithmica — Algorithms for Modern Hardware. free site · godbolt-driven, ridiculously good.
- Hennessy & Patterson — Computer Architecture. book · the textbook that defines the field.
- Agner Fog’s optimization manuals. free PDFs · the canonical reference for instruction latencies and microarchitecture.
- Daniel Lemire’s blog. link · ongoing field reports on practical performance.
Skip “10 tricks to speed up Python” listicles. They will teach you nothing the profiler wouldn’t.
Checkpoints
If any one wobbles, redo the corresponding section.
- Recite the latency hierarchy from L1 to a cross-continental round trip. Be within one order of magnitude on each.
- You’re iterating over a 2D array of 4-byte ints, 1024 by 1024. Estimate the number of cache lines you touch in row-major vs column-major order, and the speed ratio you’d expect on a typical CPU.
- A coworker insists their inner loop is slow because of “the math.” How do you confirm or refute that hypothesis in under 10 minutes?
- Read a flame graph: a 200 ms request shows an 80 ms plateau in serialize_response and a 60 ms plateau in db.query. Where do you optimize first, and why?
- A Python function that processes a million-row CSV takes 30 seconds. The profile says 95% of time is in the loop body. Name three transformations that could each plausibly buy 10x or more, and the order you’d try them in.
If you can answer all five from memory and have shipped your 100x Python speedup, you’ve earned 02.7. Move on to 02.8 (Distributed systems, intro) — where everything you just learned about a single machine breaks once you have more than one.