The math you actually need for ML
Linalg, probability, calculus, optimization — ranked by how often each shows up. Less than the textbook tells you. More than most students can do cold.
Prerequisites
02.3
Stack
Python 3.12 · numpy · matplotlib · a notebook
By the end of this module
- Implement matmul, softmax, and cross-entropy from scratch in numpy and verify against torch.
- Explain why minimizing cross-entropy is identical to maximum likelihood estimation.
- Read a transformer paper and follow every linalg op without looking anything up.
- Know which math you actually need and which the textbook is wasting your time on.
There’s a particular failure mode CS students fall into around ML math. They either skip it entirely and end up copy-pasting PyTorch code they don’t understand, or they sign up for a real-analysis course and disappear into measure theory for a semester. Both are wrong. The math you actually need to read modern ML papers and build models is a small, specific subset — and most of it is mechanical fluency, not deep theory.
This module ranks the four areas by how often each shows up in real work, tells you exactly what to know cold in each, and tells you what to skip without guilt. The opinion this module is built on: you do not need to derive backpropagation more than once in your life. You need to be able to read a forward pass, identify shapes, identify gradients, and trust the autograd. Everything else is academic theater unless you’re writing a new optimizer.
Set up
```bash
mkdir mlmath && cd mlmath
uv venv .venv && source .venv/bin/activate
uv pip install numpy matplotlib torch jupyter
jupyter notebook
```
Open a notebook. Almost everything in this module is “verify by running it.” If you read this module without writing code, you’ll learn nothing.
Read these first
In this order, then stop:
- 3Blue1Brown — Essence of Linear Algebra. playlist · 3 hrs total · only the first 9 episodes. The geometric intuition for matrices is the single most useful prerequisite for everything else.
- 3Blue1Brown — Essence of Calculus, episodes 1–4. playlist · 90 min · derivatives, chain rule, that’s all you need.
- Goodfellow et al. — Deep Learning, chapters 2–4. free online · 4 hrs · the only textbook reference this module endorses. Read chapters 2 (linalg), 3 (probability), 4 (numerical computation). Skip the rest of the book until you have a reason.
- Andrew Ng — Linear Algebra Review. notes · 30 min · use as a cheat sheet during the build, not as a teaching resource.
That is enough. Anyone who tells you to read Strang’s full textbook before doing ML is wasting your weekend. Strang is wonderful and you can read him later.
The four areas, ranked by frequency
| Rank | Area | How often in ML work | What to know cold | What to skip |
|---|---|---|---|---|
| 1 | Linear algebra | Every line of code | Matmul, broadcasting, eigendecomp, SVD, low-rank | Jordan forms, abstract vector spaces |
| 2 | Probability | Every loss function | Likelihood, KL, cross-entropy, expectations | Measure theory, sigma algebras |
| 3 | Calculus | Read once, never derive again | Chain rule, gradients in many dims | Multivariable analysis, real analysis |
| 4 | Optimization | Picking hyperparameters | SGD, momentum, Adam intuition | Convex optimization theory |
This ranking is not academic. It’s how many minutes per week you’ll actually spend with each. Linalg is constantly under your hands. Optimization is mostly “use AdamW, set the LR, move on.”
Linear algebra — the deep one
Linalg is the load-bearing math. Get this right and the rest follows.
What to know cold
```python
import numpy as np

# 1. Matmul rules and shapes
A = np.random.randn(32, 128)  # batch=32, features=128
W = np.random.randn(128, 64)  # weight: in=128, out=64
out = A @ W                   # (32, 64). Internalize the shape grammar.

# 2. Broadcasting
x = np.random.randn(32, 64)
b = np.random.randn(64)       # bias, broadcasts to (32, 64)
y = x + b                     # works without an explicit reshape

# 3. Dot product as projection
u = np.array([1.0, 0.0, 0.0])
v = np.array([0.7, 0.7, 0.0])
np.dot(u, v)                  # 0.7 — projection of v onto u
```
Build mental fluency for these four operations: matmul, transpose, broadcast, dot. If you have to look up the shape rule for any of them, drill until you don’t.
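Transpose doesn't appear in the snippet above, and neither does the failure case for broadcasting. A small sketch of both, with made-up shapes:

```python
import numpy as np

x = np.random.randn(32, 64)
print(x.T.shape)                        # (64, 32): transpose swaps the two axes

# Broadcasting aligns shapes from the right: (64,) lines up with the last axis of (32, 64)
print((x + np.random.randn(64)).shape)  # (32, 64)

# (32,) does not line up with the last axis of (32, 64), so this raises a ValueError
try:
    x + np.random.randn(32)
except ValueError as e:
    print(e)
```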
Eigendecomp and SVD — the geometric intuition
You need eigenvectors and SVD for one reason in modern ML: low-rank approximations. This is the single trick behind LoRA, model compression, and most “this matrix has structure” arguments in papers.
```python
# Low-rank approximation via SVD
A = np.random.randn(100, 100)
U, S, Vt = np.linalg.svd(A)

# Keep only the top-k singular values
k = 10
A_approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
print(np.linalg.norm(A - A_approx) / np.linalg.norm(A))  # relative error
```
That code is a hundred-page LoRA explanation, compressed. A dense 100x100 matrix can be approximated by two thin matrices of shape (100, 10) and (10, 100), reducing parameters from 10000 to 2000. That’s the whole game.
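To see the two-thin-matrices claim concretely, here is a short continuation of the SVD block above (it reuses A, U, S, Vt, k, and A_approx from that block):

```python
# Continues from the SVD block above (A, U, S, Vt, k, A_approx already defined).
B = U[:, :k] * S[:k]            # (100, 10): absorb the singular values into the left factor
C = Vt[:k, :]                   # (10, 100)
assert np.allclose(B @ C, A_approx)
print(A.size, B.size + C.size)  # 10000 vs 2000 parameters
```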
What to skip
Jordan canonical forms. Linear maps over abstract fields. Most of “Linear Algebra Done Right.” If you’re not writing a numerical analysis library, you do not need them.
Probability — the second most important one
Probability is where most students get cocky and where most paper readers get lost. Be ruthless: there are five things you must know cold.
Likelihood and MLE
```python
# Coin flips: probability of seeing this data given parameter p
flips = [1, 1, 0, 1, 0, 1, 1]

def likelihood(p, data):
    return np.prod([p if x == 1 else 1 - p for x in data])

ps = np.linspace(0.01, 0.99, 100)
likes = [likelihood(p, flips) for p in ps]
print(ps[np.argmax(likes)])  # ≈ 5/7 ≈ 0.71
```
Maximum likelihood means: pick the parameters that make the data most probable. That’s the entire frame for “training a model.” Internalize this and 90% of paper notation becomes readable.
Cross-entropy as MLE for classification
This is the connection most students miss. Cross-entropy loss is not arbitrary — it’s the negative log-likelihood of the data under your model.
For classification with true label y and predicted probabilities p:
- log-likelihood of one example: log p[y]
- negative log-likelihood: -log p[y]
- minimize over the dataset: that’s cross-entropy loss
```python
# Cross-entropy from scratch
def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, targets):
    p = softmax(logits)
    return -np.log(p[np.arange(len(targets)), targets]).mean()

# Verify against torch
import torch
import torch.nn.functional as F

logits = np.random.randn(8, 10)
targets = np.random.randint(0, 10, 8)
my_ce = cross_entropy(logits, targets)
torch_ce = F.cross_entropy(torch.tensor(logits), torch.tensor(targets)).item()
assert abs(my_ce - torch_ce) < 1e-5
print(my_ce, torch_ce)
```
The point: when you minimize cross-entropy, you are doing maximum likelihood under a categorical model. They are the same thing.
KL divergence — the asymmetric distance
KL(P || Q) is “how surprised would I be by samples from P if I thought they came from Q?” It is not symmetric. KL(P || Q) is not KL(Q || P). This matters because cross-entropy = entropy of P + KL(P || Q), and that decomposition shows up in VAEs, RLHF, and DPO.
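Both claims are easy to check in the notebook. A minimal sketch with made-up distributions; the torch call at the end is only there to flag the argument-order gotcha in F.kl_div (input is log Q, target is P):

```python
import numpy as np
import torch
import torch.nn.functional as F

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

def kl(a, b):
    return np.sum(a * np.log(a / b))

print(kl(p, q), kl(q, p))  # not symmetric

# Decomposition: cross-entropy(P, Q) = entropy(P) + KL(P || Q)
entropy_p = -np.sum(p * np.log(p))
cross_entropy_pq = -np.sum(p * np.log(q))
assert np.isclose(cross_entropy_pq, entropy_p + kl(p, q))

# torch's argument order: input is log Q, target is P
torch_kl = F.kl_div(torch.tensor(q).log(), torch.tensor(p), reduction='sum').item()
assert np.isclose(kl(p, q), torch_kl)
```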
What to skip
Measure theory. Sigma algebras. Almost-sure convergence. Most of a real probability theory course. If you can compute expectations, marginals, and conditionals on discrete and Gaussian distributions, you are equipped.
Calculus — read once, trust autograd forever
Modern deep learning is one calculus operation: gradient descent on a loss function. The chain rule is the only piece of calculus you actually need to be fluent in.
```python
# Manual chain rule on f(x) = (3x + 2)^2
# df/dx = 2 * (3x + 2) * 3 = 18x + 12
def f(x): return (3 * x + 2) ** 2
def df_manual(x): return 18 * x + 12

# Verify with finite differences
x = 1.5
h = 1e-6
print((f(x + h) - f(x - h)) / (2 * h))  # numerical gradient
print(df_manual(x))                     # analytical
```
Do this once for a 2-layer net by hand. Compute the gradient of the loss with respect to every parameter, by chain rule, on paper. Verify it matches loss.backward() in torch. Then never derive backprop by hand again. Autograd exists. Use it.
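Here is roughly what that exercise looks like once it's done, assuming a two-layer ReLU net (one hidden layer) with a mean-squared-error loss; the shapes and data are made up:

```python
import numpy as np
import torch

np.random.seed(0)
x = np.random.randn(4, 3)   # batch=4, in=3
y = np.random.randn(4, 2)   # targets
W1 = np.random.randn(3, 5)
W2 = np.random.randn(5, 2)

# Forward: h = relu(x @ W1), pred = h @ W2, loss = mean squared error
h_pre = x @ W1
h = np.maximum(h_pre, 0)
pred = h @ W2
loss = ((pred - y) ** 2).mean()

# Backward by chain rule, on paper first, then here
dloss_dpred = 2 * (pred - y) / pred.size  # d(mean sq err)/d(pred)
dW2 = h.T @ dloss_dpred                   # (5, 2)
dh = dloss_dpred @ W2.T                   # (4, 5)
dh_pre = dh * (h_pre > 0)                 # relu gate
dW1 = x.T @ dh_pre                        # (3, 5)

# Verify against torch autograd
tW1 = torch.tensor(W1, requires_grad=True)
tW2 = torch.tensor(W2, requires_grad=True)
tpred = torch.relu(torch.tensor(x) @ tW1) @ tW2
tloss = ((tpred - torch.tensor(y)) ** 2).mean()
tloss.backward()
assert np.allclose(dW1, tW1.grad.numpy(), atol=1e-6)
assert np.allclose(dW2, tW2.grad.numpy(), atol=1e-6)
```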
The opinion: deriving backprop is a rite of passage, not a daily skill. It is exactly as useful as knowing assembly. You do it once for awareness, then trust the abstraction.
Optimization — conceptual only
You do not need optimization theory. You need three intuitions:
- SGD with momentum: gradient descent with a memory of recent gradients. Smooths noise, escapes shallow minima.
- Adam / AdamW: per-parameter adaptive learning rates. The default optimizer for almost everything. Use AdamW, not Adam.
- Learning rate schedules: warmup (start low, ramp up) followed by cosine decay (slowly drop to near-zero). Used everywhere in transformer training. Cuts loss meaningfully versus a flat LR.
LR schedule shape:
```text
lr
↑        __
│      /    ‾‾──__
│     /           ‾‾──___
│    /                    ‾‾───____
│   /
└───────────────────────────────────→ steps
   warmup         cosine decay
```
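The same shape in code, as a rough sketch; the peak LR, warmup length, and total step count below are illustrative, not recommendations:

```python
import numpy as np
import matplotlib.pyplot as plt

def lr_at(step, peak_lr=3e-4, warmup_steps=500, total_steps=10_000):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps               # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + np.cos(np.pi * progress))  # cosine decay to ~0

steps = np.arange(10_000)
plt.plot(steps, [lr_at(s) for s in steps])
plt.xlabel("steps"); plt.ylabel("lr"); plt.show()
```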
Skip: convex analysis, conjugate gradients, second-order methods, Lagrange duality. They will not help you train a transformer.
The build
Implement these from scratch in numpy. Verify each against torch. Save to a single notebook.
| Op | numpy impl | Verify against |
|---|---|---|
| Dot product | sum(a * b) | np.dot(a, b) |
| Matmul | Triple loop, then vectorized | np.matmul(A, B) |
| Softmax | With log-sum-exp trick | torch.softmax(x, dim=-1) |
| Cross-entropy | From logits and integer targets | F.cross_entropy(logits, targets) |
| KL divergence | Discrete case | F.kl_div(q.log(), p, reduction='sum') (torch's argument order: input is log Q, target is P) |
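For flavor, here is what one row might look like, a naive triple-loop matmul checked against numpy. Write your own before reading it:

```python
import numpy as np

def matmul_loops(A, B):
    # Naive triple loop: out[i, j] = sum over k of A[i, k] * B[k, j]
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for kk in range(k):
                out[i, j] += A[i, kk] * B[kk, j]
    return out

A = np.random.randn(4, 5)
B = np.random.randn(5, 3)
assert np.allclose(matmul_loops(A, B), np.matmul(A, B))
```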
When all five pass within 1e-5 of the torch reference, you have done more honest ML math than most CS undergrads.
Going deeper
When you have specific questions, in order:
- Goodfellow, Bengio, Courville — Deep Learning, chapters 5–8. free · 6 hrs · once you’ve shipped a model and want the theory underneath.
- Boyd & Vandenberghe — Convex Optimization. free · reference only. Skim the convexity chapter. Skip the rest unless you write optimizers.
- Strang — Linear Algebra and Its Applications. book · the proper textbook. Read after you’ve used linalg in anger for a few months.
- Bishop — Pattern Recognition and Machine Learning, chapters 1–2. book · the cleanest probability framing for ML.
Skip “Mathematics for Machine Learning” by Deisenroth. It tries to be everything for everyone and ends up teaching nothing well.
Checkpoints
Read these out loud. If any wobbles, the corresponding section is what to reread.
- Why is cross-entropy loss the same thing as maximum likelihood for a classifier? Show the algebra in two lines.
- What does broadcasting `(32, 64) + (64,)` actually do, mechanically? When does broadcasting fail?
- Take a 1000x1000 matrix and approximate it with a rank-20 SVD. How many parameters did you save, and what's the reconstruction error?
- Walk through the gradient of `loss = (W @ x - y).pow(2).sum()` with respect to `W`. Get the shape right.
- AdamW has four hyperparameters. Name them and say what each one does in one sentence.
When you can answer all five from memory, move to 04.2 Transformers from scratch. The math you just internalized is the math that module assumes.