The math you actually need for ML
Linalg, probability, calculus, optimization — ranked by how often each shows up. Less than the textbook tells you. More than most students can do cold.
Prerequisites
02.3
Stack
Python 3.12 · numpy · matplotlib · a notebook
By the end of this module
- Implement matmul, softmax, and cross-entropy from scratch in numpy and verify against torch.
- Explain why minimizing cross-entropy is identical to maximum likelihood estimation.
- Read a transformer paper and follow every linalg op without looking anything up.
- Know which math you actually need and which the textbook is wasting your time on.
There’s a particular failure mode CS students fall into around ML math. They either skip it entirely and end up copy-pasting PyTorch code they don’t understand, or they sign up for a real-analysis course and disappear into measure theory for a semester. Both are wrong. The math you actually need to read modern ML papers and build models is a small, specific subset — and most of it is mechanical fluency, not deep theory.
This module ranks the four areas by how often each shows up in real work, tells you exactly what to know cold in each, and tells you what to skip without guilt. The opinion this module is built on: you do not need to derive backpropagation more than once in your life. You need to be able to read a forward pass, identify shapes, identify gradients, and trust the autograd. Everything else is academic theater unless you’re writing a new optimizer.
Set up
```bash
mkdir mlmath && cd mlmath
uv venv .venv && source .venv/bin/activate
uv pip install numpy matplotlib torch jupyter
jupyter notebook
```
Open a notebook. Almost everything in this module is “verify by running it.” If you read this module without writing code, you’ll learn nothing.
Read these first
In this order, then stop:
- 3Blue1Brown — Essence of Linear Algebra. playlist · 3 hrs total · only the first 9 episodes. The geometric intuition for matrices is the single most useful prerequisite for everything else.
- 3Blue1Brown — Essence of Calculus, episodes 1–4. playlist · 90 min · derivatives, chain rule, that’s all you need.
- Goodfellow et al. — Deep Learning, chapters 2–4. free online · 4 hrs · the only textbook reference this module endorses. Read chapters 2 (linalg), 3 (probability), 4 (numerical computation). Skip the rest of the book until you have a reason.
- Andrew Ng — Linear Algebra Review. notes · 30 min · use as a cheat sheet during the build, not as a teaching resource.
That is enough. Anyone who tells you to read Strang’s full textbook before doing ML is wasting your weekend. Strang is wonderful and you can read him later.
The four areas, ranked by frequency
| Rank | Area | How often in ML work | What to know cold | What to skip |
|---|---|---|---|---|
| 1 | Linear algebra | Every line of code | Matmul, broadcasting, eigendecomp, SVD, low-rank | Jordan forms, abstract vector spaces |
| 2 | Probability | Every loss function | Likelihood, KL, cross-entropy, expectations | Measure theory, sigma algebras |
| 3 | Calculus | Read once, never derive again | Chain rule, gradients in many dims | Multivariable analysis, real analysis |
| 4 | Optimization | Picking hyperparameters | SGD, momentum, Adam intuition | Convex optimization theory |
This ranking is not academic. It’s how many minutes per week you’ll actually spend with each. Linalg is constantly under your hands. Optimization is mostly “use AdamW, set the LR, move on.”
Linear algebra — the deep one
Linalg is the load-bearing math. Get this right and the rest follows.
What to know cold
```python
import numpy as np

# 1. Matmul rules and shapes
A = np.random.randn(32, 128)  # batch=32, features=128
W = np.random.randn(128, 64)  # weight: in=128, out=64
out = A @ W                   # (32, 64). Internalize the shape grammar.

# 2. Broadcasting
x = np.random.randn(32, 64)
b = np.random.randn(64)       # bias, broadcasts to (32, 64)
y = x + b                     # works without an explicit reshape

# 3. Dot product as projection
u = np.array([1.0, 0.0, 0.0])
v = np.array([0.7, 0.7, 0.0])
np.dot(u, v)                  # 0.7 — projection of v onto u
```
Build mental fluency for these four operations: matmul, transpose, broadcast, dot. If you have to look up the shape rule for any of them, drill until you don’t.
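Transpose doesn't appear in the snippet above, and neither does the failure case for broadcasting. A small sketch of both, with made-up shapes:

```python
import numpy as np

x = np.random.randn(32, 64)
print(x.T.shape)                        # (64, 32): transpose swaps the two axes

# Broadcasting aligns shapes from the right: (64,) lines up with the last axis of (32, 64)
print((x + np.random.randn(64)).shape)  # (32, 64)

# (32,) does not line up with the last axis of (32, 64), so this raises a ValueError
try:
    x + np.random.randn(32)
except ValueError as e:
    print(e)
```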
Eigendecomp and SVD — the geometric intuition
You need eigenvectors and SVD for one reason in modern ML: low-rank approximations. This is the single trick behind LoRA, model compression, and most “this matrix has structure” arguments in papers.
```python
# Low-rank approximation via SVD
A = np.random.randn(100, 100)
U, S, Vt = np.linalg.svd(A)

# Keep only the top-k singular values
k = 10
A_approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
print(np.linalg.norm(A - A_approx) / np.linalg.norm(A))  # relative error
```
That code is a hundred-page LoRA explanation, compressed. A dense 100x100 matrix can be approximated by two thin matrices of shape (100, 10) and (10, 100), reducing parameters from 10000 to 2000. That’s the whole game.
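To see the two-thin-matrices claim concretely, here is a short continuation of the SVD block above (it reuses A, U, S, Vt, k, and A_approx from that block):

```python
# Continues from the SVD block above (A, U, S, Vt, k, A_approx already defined).
B = U[:, :k] * S[:k]            # (100, 10): absorb the singular values into the left factor
C = Vt[:k, :]                   # (10, 100)
assert np.allclose(B @ C, A_approx)
print(A.size, B.size + C.size)  # 10000 vs 2000 parameters
```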
What to skip
Jordan canonical forms. Linear maps over abstract fields. Most of “Linear Algebra Done Right.” If you’re not writing a numerical analysis library, you do not need them.
Probability — the second most important one
Probability is where most students get cocky and where most paper readers get lost. Be ruthless: there are five things you must know cold.
Likelihood and MLE
```python
# Coin flips: probability of seeing this data given parameter p
flips = [1, 1, 0, 1, 0, 1, 1]

def likelihood(p, data):
    return np.prod([p if x == 1 else 1 - p for x in data])

ps = np.linspace(0.01, 0.99, 100)
likes = [likelihood(p, flips) for p in ps]
print(ps[np.argmax(likes)])  # ≈ 5/7 ≈ 0.71
```
Maximum likelihood means: pick the parameters that make the data most probable. That’s the entire frame for “training a model.” Internalize this and 90% of paper notation becomes readable.
Cross-entropy as MLE for classification
This is the connection most students miss. Cross-entropy loss is not arbitrary — it’s the negative log-likelihood of the data under your model.
For classification with true label y and predicted probabilities p:
- log-likelihood of one example: log p[y]
- negative log-likelihood: -log p[y]
- minimize over the dataset: that’s cross-entropy loss
```python
# Cross-entropy from scratch
def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, targets):
    p = softmax(logits)
    return -np.log(p[np.arange(len(targets)), targets]).mean()

# Verify against torch
import torch
import torch.nn.functional as F

logits = np.random.randn(8, 10)
targets = np.random.randint(0, 10, 8)
my_ce = cross_entropy(logits, targets)
torch_ce = F.cross_entropy(torch.tensor(logits), torch.tensor(targets)).item()
assert abs(my_ce - torch_ce) < 1e-5
print(my_ce, torch_ce)
```
The point: when you minimize cross-entropy, you are doing maximum likelihood under a categorical model. They are the same thing.
KL divergence — the asymmetric distance
KL(P || Q) is “how surprised would I be by samples from P if I thought they came from Q?” It is not symmetric. KL(P || Q) is not KL(Q || P). This matters because cross-entropy = entropy of P + KL(P || Q), and that decomposition shows up in VAEs, RLHF, and DPO.
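Both claims are easy to check in the notebook. A minimal sketch with made-up distributions; the torch call at the end is only there to flag the argument-order gotcha in F.kl_div (input is log Q, target is P):

```python
import numpy as np
import torch
import torch.nn.functional as F

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

def kl(a, b):
    return np.sum(a * np.log(a / b))

print(kl(p, q), kl(q, p))  # not symmetric

# Decomposition: cross-entropy(P, Q) = entropy(P) + KL(P || Q)
entropy_p = -np.sum(p * np.log(p))
cross_entropy_pq = -np.sum(p * np.log(q))
assert np.isclose(cross_entropy_pq, entropy_p + kl(p, q))

# torch's argument order: input is log Q, target is P
torch_kl = F.kl_div(torch.tensor(q).log(), torch.tensor(p), reduction='sum').item()
assert np.isclose(kl(p, q), torch_kl)
```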
What to skip
Measure theory. Sigma algebras. Almost-sure convergence. Most of a real probability theory course. If you can compute expectations, marginals, and conditionals on discrete and Gaussian distributions, you are equipped.
Calculus — read once, trust autograd forever
Modern deep learning is one calculus operation: gradient descent on a loss function. The chain rule is the only piece of calculus you actually need to be fluent in.
```python
# Manual chain rule on f(x) = (3x + 2)^2
# df/dx = 2 * (3x + 2) * 3 = 18x + 12
def f(x): return (3 * x + 2) ** 2
def df_manual(x): return 18 * x + 12

# Verify with finite differences
x = 1.5
h = 1e-6
print((f(x + h) - f(x - h)) / (2 * h))  # numerical gradient
print(df_manual(x))                     # analytical
```
Do this once for a 2-layer net by hand. Compute the gradient of the loss with respect to every parameter, by chain rule, on paper. Verify it matches loss.backward() in torch. Then never derive backprop by hand again. Autograd exists. Use it.
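Here is roughly what that exercise looks like once it's done, assuming a two-layer ReLU net (one hidden layer) with a mean-squared-error loss; the shapes and data are made up:

```python
import numpy as np
import torch

np.random.seed(0)
x = np.random.randn(4, 3)   # batch=4, in=3
y = np.random.randn(4, 2)   # targets
W1 = np.random.randn(3, 5)
W2 = np.random.randn(5, 2)

# Forward: h = relu(x @ W1), pred = h @ W2, loss = mean squared error
h_pre = x @ W1
h = np.maximum(h_pre, 0)
pred = h @ W2
loss = ((pred - y) ** 2).mean()

# Backward by chain rule, on paper first, then here
dloss_dpred = 2 * (pred - y) / pred.size  # d(mean sq err)/d(pred)
dW2 = h.T @ dloss_dpred                   # (5, 2)
dh = dloss_dpred @ W2.T                   # (4, 5)
dh_pre = dh * (h_pre > 0)                 # relu gate
dW1 = x.T @ dh_pre                        # (3, 5)

# Verify against torch autograd
tW1 = torch.tensor(W1, requires_grad=True)
tW2 = torch.tensor(W2, requires_grad=True)
tpred = torch.relu(torch.tensor(x) @ tW1) @ tW2
tloss = ((tpred - torch.tensor(y)) ** 2).mean()
tloss.backward()
assert np.allclose(dW1, tW1.grad.numpy(), atol=1e-6)
assert np.allclose(dW2, tW2.grad.numpy(), atol=1e-6)
```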
The opinion: deriving backprop is a rite of passage, not a daily skill. It is exactly as useful as knowing assembly. You do it once for awareness, then trust the abstraction.
Optimization — conceptual only
You do not need optimization theory. You need three intuitions:
- SGD with momentum: gradient descent with a memory of recent gradients. Smooths noise, escapes shallow minima.
- Adam / AdamW: per-parameter adaptive learning rates. The default optimizer for almost everything. Use AdamW, not Adam.
- Learning rate schedules: warmup (start low, ramp up) followed by cosine decay (slowly drop to near-zero). Used everywhere in transformer training. Cuts loss meaningfully versus a flat LR.
LR schedule shape:
```text
lr
↑        __
│      /    ‾‾──__
│     /           ‾‾──___
│    /                    ‾‾───____
│   /
└───────────────────────────────────→ steps
   warmup         cosine decay
```
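The same shape in code, as a rough sketch; the peak LR, warmup length, and total step count below are illustrative, not recommendations:

```python
import numpy as np
import matplotlib.pyplot as plt

def lr_at(step, peak_lr=3e-4, warmup_steps=500, total_steps=10_000):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps               # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + np.cos(np.pi * progress))  # cosine decay to ~0

steps = np.arange(10_000)
plt.plot(steps, [lr_at(s) for s in steps])
plt.xlabel("steps"); plt.ylabel("lr"); plt.show()
```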
Skip: convex analysis, conjugate gradients, second-order methods, Lagrange duality. They will not help you train a transformer.
The build
Implement these from scratch in numpy. Verify each against torch. Save to a single notebook.
| Op | numpy impl | Verify against |
|---|---|---|
| Dot product | sum(a * b) | np.dot(a, b) |
| Matmul | Triple loop, then vectorized | np.matmul(A, B) |
| Softmax | With log-sum-exp trick | torch.softmax(x, dim=-1) |
| Cross-entropy | From logits and integer targets | F.cross_entropy(logits, targets) |
| KL divergence | Discrete case | F.kl_div(q.log(), p, reduction='sum') (torch's argument order: input is log Q, target is P) |
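For flavor, here is what one row might look like, a naive triple-loop matmul checked against numpy. Write your own before reading it:

```python
import numpy as np

def matmul_loops(A, B):
    # Naive triple loop: out[i, j] = sum over k of A[i, k] * B[k, j]
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for kk in range(k):
                out[i, j] += A[i, kk] * B[kk, j]
    return out

A = np.random.randn(4, 5)
B = np.random.randn(5, 3)
assert np.allclose(matmul_loops(A, B), np.matmul(A, B))
```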
When all five pass within 1e-5 of the torch reference, you have done more honest ML math than most CS undergrads.
Going deeper
When you have specific questions, in order:
- Goodfellow, Bengio, Courville — Deep Learning, chapters 5–8. free · 6 hrs · once you’ve shipped a model and want the theory underneath.
- Boyd & Vandenberghe — Convex Optimization. free · reference only. Skim the convexity chapter. Skip the rest unless you write optimizers.
- Strang — Linear Algebra and Its Applications. book · the proper textbook. Read after you’ve used linalg in anger for a few months.
- Bishop — Pattern Recognition and Machine Learning, chapters 1–2. book · the cleanest probability framing for ML.
Skip “Mathematics for Machine Learning” by Deisenroth. It tries to be everything for everyone and ends up teaching nothing well.
Checkpoints
Read these out loud. If any wobbles, the corresponding section is what to reread.
- Why is cross-entropy loss the same thing as maximum likelihood for a classifier? Show the algebra in two lines.
- What does broadcasting `(32, 64) + (64,)` actually do, mechanically? When does broadcasting fail?
- Take a 1000x1000 matrix and approximate it with a rank-20 SVD. How many parameters did you save, and what's the reconstruction error?
- Walk through the gradient of `loss = (W @ x - y).pow(2).sum()` with respect to `W`. Get the shape right.
- AdamW has four hyperparameters. Name them and say what each one does in one sentence.
When you can answer all five from memory, move to 04.2 Transformers from scratch. The math you just internalized is the math that module assumes.