Computer architecture and performance
Cache, memory hierarchy, branch prediction, why your code is slow. Profiling as a first-class engineering skill.
Prerequisites
01.5 (a low-level language: C or Rust)
Stack
- a Linux or Mac machine
- perf (Linux) or Instruments (macOS)
- godbolt.org for asm exploration
- Python (cProfile, py-spy) and your low-level lang from 01.5
By the end of this module
- Recite the latency hierarchy (registers, L1, L2, L3, RAM, SSD) within an order of magnitude.
- Predict whether a loop is cache-friendly by reading the code, before running it.
- Use a profiler before optimizing. Always. Without exception.
- Read a flame graph and identify the hot path within a minute.
- Speed up a slow Python loop by 100x using profiler-driven changes.
There is a category of bug you cannot fix without understanding the machine. The Python loop that’s 100x slower than its NumPy equivalent. The C++ array access pattern that’s 10x slower in column-major than row-major. The database query that’s fast in isolation and dies under cache pressure. None of these submit to careful logic. They submit to knowing what the CPU and memory hierarchy actually do.
This module is the practical version of computer architecture. Not RTL design, not pipeline scheduling — the parts that change how you write code. The opinionated take: you do not need to optimize most code. The code that genuinely matters wears its bottlenecks visibly when you profile it. Always profile first.
The two skills you will leave with: a working mental model of the memory hierarchy, and the reflex to reach for a profiler before changing a line.
Set up
# Linux
sudo apt install linux-tools-common linux-tools-generic
pip install py-spy
# macOS
# Instruments comes with Xcode
xcode-select --install
pip install py-spy
Verify:
perf --version 2>/dev/null || echo "use Instruments on macOS"
py-spy --version
You will use godbolt.org for assembly exploration. No install needed; just bookmark it. Make a project:
mkdir perf && cd perf
mkdir bench notes
Read these first
In this order:
- Brendan Gregg — Systems Performance (2nd ed.). book · 4–6 hours · the definitive applied book. Chapters 1–6 are required. The “USE method” is the only mental framework you need.
- Jeff Dean — Numbers everyone should know. card · 5 minutes · memorize this. It is the most-cited card in CS for a reason.
- Ulrich Drepper — What every programmer should know about memory. paper · 3–4 hours · old but still mostly correct. The cache section is the canonical reference.
- Hennessy & Patterson — Computer Architecture: A Quantitative Approach. book · only after the above; chapter 2 (memory hierarchy) is what you want.
- Algorithmica — Algorithms for Modern Hardware. free site · the best free resource on practical performance engineering, with godbolt-driven examples throughout.
After Brendan Gregg’s first six chapters and Jeff Dean’s card, you can stop and start measuring. Most of the ROI is there.
The latency hierarchy
The single chart that will change how you write code. Approximate, modern numbers:
| Operation | Time | Relative |
|---|---|---|
| L1 cache hit | 1 ns | 1x |
| L2 cache hit | 4 ns | 4x |
| Branch mispredict | 5 ns | 5x |
| L3 cache hit | 12 ns | 12x |
| Main memory access | 100 ns | 100x |
| SSD read (random) | 100 µs | 100,000x |
| Round trip within a datacenter | 500 µs | 500,000x |
| Disk seek (HDD) | 10 ms | 10,000,000x |
| Cross-continental round trip | 150 ms | 150,000,000x |
A few things should jump out. Main memory is 100x slower than L1. SSD is 1000x slower than main memory. A round trip within the datacenter is closer in cost to a disk seek than to a memory read. The whole game of high-performance code is keeping the working set in cache.
When someone says “this loop is slow,” your first question should be: what fraction of its accesses miss L1? Most of the time, that’s the answer.
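A back-of-envelope sketch of why that question matters, using only the table’s approximations (the access count and miss rates below are made up for illustration):

# estimate loop time from the table's numbers (illustrative, not a measurement)
L1_HIT_NS = 1          # approximate L1 hit latency from the table above
RAM_NS = 100           # approximate main-memory latency from the table above
ACCESSES = 10_000_000  # hypothetical loop touching memory ten million times

def estimated_ms(miss_fraction):
    avg_ns = (1 - miss_fraction) * L1_HIT_NS + miss_fraction * RAM_NS
    return ACCESSES * avg_ns / 1e6

print(estimated_ms(0.00))   # ~10 ms: every access hits L1
print(estimated_ms(0.01))   # ~20 ms: a 1% miss rate already doubles the runtime
print(estimated_ms(0.50))   # ~505 ms: the loop is now a memory benchmark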
Cache lines and the row-vs-column lesson
Memory is fetched from RAM in cache lines — typically 64 bytes on x86. When you read a single byte, the CPU fetches the entire 64-byte line containing it into L1. The next 63 bytes are then nearly free.
This is why row-major vs column-major iteration matters:
// Cache-friendly: sequential memory access, one cache miss per 16 ints
for (int r = 0; r < N; r++)
    for (int c = 0; c < N; c++)
        sum += a[r][c];

// Cache-hostile: each access jumps N ints in memory, one miss per access
for (int c = 0; c < N; c++)
    for (int r = 0; r < N; r++)
        sum += a[r][c];
For N = 4096, the cache-hostile version is typically 5–20x slower on a modern CPU. Same operation count. Same algorithm. Different memory access pattern. This is one of the cleanest demonstrations of why architecture matters.
A practical heuristic: touch memory in the order it’s laid out. In C, that’s row-major for 2D arrays. In Fortran or NumPy with order='F', it’s column-major. Python lists of lists have no contiguous layout at all: every element is a separate heap object, which is one reason Python loops over numerical data are slow.
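The same effect is visible from Python with NumPy. A minimal sketch assuming only NumPy is installed; the exact ratio depends on your CPU and is smaller than the raw C demo because NumPy adds per-call overhead:

# row-wise sums over a row-major vs. a column-major copy of the same data (sketch)
import time
import numpy as np

n = 4096
a_c = np.random.rand(n, n)        # C order: each row is contiguous in memory
a_f = np.asfortranarray(a_c)      # F order: each row is strided across the array

def sum_rows(a):
    total = 0.0
    for r in range(a.shape[0]):
        total += a[r, :].sum()    # sequential sweep in C order, strided walk in F order
    return total

for label, a in (("C order", a_c), ("F order", a_f)):
    t = time.perf_counter()
    sum_rows(a)
    print(label, round(time.perf_counter() - t, 3), "s")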
Branch prediction and the famous benchmark
Modern CPUs guess which way a branch will go and execute speculatively. A correct guess is free; a misprediction costs a pipeline flush, roughly 5 ns. One of the most-upvoted Stack Overflow questions of all time is the demonstration that processing a sorted array can be 6x faster than processing the same elements unsorted, because the branch becomes predictable.
for (int i = 0; i < n; i++)
    if (data[i] >= 128)   // predictable when data is sorted
        sum += data[i];
Sort the array first and run the same loop. Over repeated passes, which is how the benchmark runs, it’s faster even after paying for the sort. That is branch prediction laid bare.
Practical implications:
- Branchless code (using bit tricks or cmov) can beat branch-heavy code on unpredictable inputs.
- Vectorized code (SIMD) avoids branches entirely in the inner loop.
- Most of the time, you should not write branchless code by hand. The compiler does it when it can. Profile first.
SIMD at a glance
SSE, AVX, AVX-512, NEON. Single Instruction Multiple Data. One instruction operates on 4, 8, or 16 values at once. The compiler vectorizes loops automatically when it can prove the access pattern is safe and contiguous.
You don’t usually write SIMD by hand. You write code the compiler can vectorize:
- No data dependencies across loop iterations.
- Contiguous memory access.
- No function calls in the inner loop.
- Aligned data when possible.
Tools like -fopt-info-vec (gcc) tell you which loops vectorized. NumPy and SciPy already use vectorized BLAS under the hood: roughly half of NumPy’s speedup over a Python loop comes from SIMD, and the rest comes from pre-allocated contiguous arrays and branch-free C kernels.
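You can feel the combined effect from Python without writing a single intrinsic: compare a pure-Python inner product against np.dot, which dispatches to a vectorized BLAS kernel. A sketch; the numbers vary widely by machine and BLAS build:

# pure-Python loop vs. NumPy's vectorized kernel (illustrative sketch)
import time
import numpy as np

n = 2_000_000
xs = np.random.rand(n)
ys = np.random.rand(n)

t = time.perf_counter()
acc = 0.0
for i in range(n):                # interpreter dispatch and boxing on every element
    acc += xs[i] * ys[i]
t_loop = time.perf_counter() - t

t = time.perf_counter()
acc_np = np.dot(xs, ys)           # one call into a vectorized BLAS kernel
t_dot = time.perf_counter() - t

print(f"loop: {t_loop:.3f}s  np.dot: {t_dot:.5f}s  ratio: {t_loop / t_dot:.0f}x")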
Profilers per language
Always profile before optimizing. The first rule, the only rule.
| Language | Profiler | Notes |
|---|---|---|
| Python | py-spy, cProfile | py-spy attaches to a running process, no code changes |
| C / C++ | perf, callgrind, vtune | perf record -g, then perf report |
| Rust | perf, samply, flamegraph | cargo flamegraph is a one-liner |
| Go | pprof (built in) | go test -bench -cpuprofile |
| Java | async-profiler, JFR | flame graphs come for free |
| JS / Node | Chrome DevTools, clinic.js | the DevTools profiler is genuinely good |
The output you want is a flame graph: stack depth on the y-axis, samples on the x-axis. Width, not horizontal position, is proportional to time, so wider bars are hotter. Read a flame graph by scanning for the widest plateau in user code; that’s your bottleneck.
# Python, real example
py-spy record -o profile.svg -- python slow_thing.py
open profile.svg
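py-spy gives you the flame graph; the standard-library cProfile gives per-function call counts and cumulative times with no extra install. A minimal sketch, where slow_thing is a stand-in workload rather than a real module:

# cProfile + pstats from the standard library: per-function totals
import cProfile
import pstats

def slow_thing():
    return sum(i * i for i in range(2_000_000))   # stand-in workload

profiler = cProfile.Profile()
profiler.enable()
slow_thing()
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)   # top 10 by cumulative time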
Twenty seconds of profiling beats two hours of guessing.
The build: make a slow Python loop 100x faster
A canonical exercise. Start with:
# bench/slow.py
import random, time
def pairwise_distance(xs, ys):
    n = len(xs)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            out[i][j] = (dx * dx + dy * dy) ** 0.5
    return out
n = 1500
xs = [random.random() for _ in range(n)]
ys = [random.random() for _ in range(n)]
t = time.perf_counter(); pairwise_distance(xs, ys); print(time.perf_counter() - t)
On my laptop this takes ~6 seconds for n = 1500. Profile it:
py-spy record -o slow.svg -- python bench/slow.py
The flame graph shows ~95% of the time in the inner loop, dominated by the arithmetic and the list-of-list indexing.
Now optimize. Three moves, each measured:
- NumPy vectorization. Rewrite as a broadcast: xs[:,None] - xs[None,:]. This does the whole computation in one C kernel with SIMD. Expect 30–100x.
- scipy.spatial.distance.cdist. Battle-tested, tuned C under the hood. Often faster than naive NumPy by another 2–3x.
- Numba @jit. If you must keep the loop form, JIT it. Often within a factor of 2 of NumPy.
Move 1 is bench/fast.py below; moves 2 and 3 are sketched after it.
# bench/fast.py
import numpy as np
def pairwise_distance(xs, ys):
    pts = np.stack([xs, ys], axis=1)
    diff = pts[:, None, :] - pts[None, :, :]
    return np.sqrt((diff * diff).sum(axis=2))
Same n = 1500, this runs in ~50 ms. That’s 120x. Same algorithm. Different memory layout, vectorized C kernel, no Python interpreter overhead in the inner loop.
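For comparison, here are moves 2 and 3 sketched side by side: a hypothetical bench/fast_alt.py, assuming SciPy and Numba are installed and that xs and ys are passed as NumPy arrays:

# bench/fast_alt.py: moves 2 and 3, sketched (assumes SciPy and Numba installed)
import numpy as np
from scipy.spatial.distance import cdist
from numba import njit

def pairwise_cdist(xs, ys):
    pts = np.stack([xs, ys], axis=1)
    return cdist(pts, pts)                 # Euclidean by default, tuned C inner loop

@njit(cache=True)
def pairwise_numba(xs, ys):                # same loop shape as slow.py, JIT-compiled
    n = xs.shape[0]
    out = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            out[i, j] = (dx * dx + dy * dy) ** 0.5
    return out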
The lesson is not “always use NumPy.” The lesson is the inner loop is where time lives, and the right representation makes that loop fast. You only learn which representation that is by profiling.
The “use a profiler before optimizing” rule
Knuth’s full quote, often truncated: “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.” The 3% is the hot path. The profiler tells you which 3%.
Common anti-patterns:
- “I think this is slow because of the X.” Profile it. Most guesses are wrong.
- Optimizing the function the IDE happens to be open on. The profiler does not lie about which functions are hot.
- Caching everything “to be safe.” Caches add complexity, bugs, and memory pressure. Cache after measuring, not before.
- Choosing an asymptotically faster algorithm whose constant factors make it slower than the “slow” one at your actual input size. The algorithm chapter tables (02.2) account for asymptotics; the profiler accounts for reality.
The discipline is one line: measure, change one thing, measure again. Hold yourself to it.
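A minimal harness for that discipline, sketched below; bench and the commented-out calls are placeholders for whatever you are actually comparing:

# measure, change one thing, measure again (placeholder harness)
import time
import statistics

def bench(fn, *args, repeats=5):
    """Run fn several times and return the median wall-clock time in seconds."""
    times = []
    for _ in range(repeats):
        t = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - t)
    return statistics.median(times)

# before = bench(pairwise_distance, xs, ys)
# ...change exactly one thing...
# after = bench(pairwise_distance, xs, ys)
# print(f"speedup: {before / after:.1f}x")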
Going deeper
When you have specific questions, in this order:
- Brendan Gregg — Systems Performance. book · the applied book. Re-read after a year of practice; you’ll get more out of it.
- Algorithmica — Algorithms for Modern Hardware. free site · godbolt-driven, ridiculously good.
- Hennessy & Patterson — Computer Architecture. book · the textbook that defines the field.
- Agner Fog’s optimization manuals. free PDFs · the canonical reference for instruction latencies and microarchitecture.
- Daniel Lemire’s blog. link · ongoing field reports on practical performance.
Skip “10 tricks to speed up Python” listicles. They will teach you nothing the profiler wouldn’t.
Checkpoints
If any one wobbles, redo the corresponding section.
- Recite the latency hierarchy from L1 to a cross-continental round trip. Be within one order of magnitude on each.
- You’re iterating over a 2D array of 4-byte ints, 1024 by 1024. Estimate the number of cache lines you touch in row-major vs column-major order, and the speed ratio you’d expect on a typical CPU.
- A coworker insists their inner loop is slow because of “the math.” How do you confirm or refute that hypothesis in under 10 minutes?
- Read a flame graph: a 200 ms request shows an 80 ms plateau in serialize_response and a 60 ms plateau in db.query. Where do you optimize first, and why?
- A Python function that processes a million-row CSV takes 30 seconds. The profile says 95% of time is in the loop body. Name three transformations that could each plausibly buy 10x or more, and the order you’d try them in.
If you can answer all five from memory and have shipped your 100x Python speedup, you’ve earned 02.7. Move on to 02.8 (Distributed systems, intro) — where everything you just learned about a single machine breaks once you have more than one.