$ yuktics v0.1

T4 — AI Literacy & Engineering · module 04.2 · ~8–12 hrs

Transformers from scratch

Build a working GPT-style transformer in plain PyTorch — tokenizer, attention, training loop — and watch the loss come down on Tiny Shakespeare.

Prerequisites

  • Python 3.12
  • PyTorch basics
  • linalg: matmul, broadcasting
  • calculus: chain rule

Stack

  • PyTorch >= 2.4
  • tiktoken
  • wandb (optional)
  • 1× GPU OR Apple Silicon MPS OR CPU (slow)

By the end of this module

  • Implement multi-head causal attention from scratch in under 100 lines.
  • Train a small (3–10M parameter) model on Tiny Shakespeare and see coherent samples.
  • Read any modern transformer paper and identify exactly which block they changed.
  • Profile a forward pass and explain why attention is O(n²) in sequence length.

This module is the single highest-leverage weekend in the entire curriculum. Almost every architecture you’ll encounter in the rest of the field — Llama, Mistral, Qwen, GPT-OSS, the latest Claude/GPT — is structurally a transformer with modifications. Once you’ve written one yourself, the modifications stop being mysterious.

You’ll be following Karpathy’s path closely. There’s no shame in that — it’s the best path that exists. The point of this module is to walk it once with intent, then know what you actually built.

Set up the project

mkdir tinygpt && cd tinygpt
uv venv .venv && source .venv/bin/activate
uv pip install torch tiktoken numpy
git init && echo ".venv/" >> .gitignore

# Get the dataset
mkdir data && cd data
curl -O https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
cd ..

You should now have a data/input.txt with about 1MB of Shakespeare. That’s your training corpus for this module. Small enough to overfit on a laptop, big enough to be interesting.

Read these first, in this order

Don’t read everything. Read these three, in order, and stop:

  1. Karpathy — Let’s build GPT: from scratch, in code, spelled out. video · 2 hrs · the spine of this module
  2. Vaswani et al. — Attention Is All You Need. arxiv · 30 min · skim, then re-read sections 3.1–3.3 after you’ve written attention
  3. Alammar — The Illustrated Transformer. post · 20 min · only if Karpathy’s diagrams aren’t clicking yet

That’s it. Resist the urge to read more before you’ve written a working model. The reading-list rabbit hole is the single biggest reason students never actually finish this module.

The plan

You’ll build it in five steps. Each step ends with code that runs and produces output you can see. Don’t move to step N+1 until step N’s output looks right.

Step   Artifact       Should produce
1      tokenize.py    Encode/decode round-trips perfectly on the dataset
2      data.py        Random batches with input/target pairs, shapes correct
3      attention.py   Single causal attention head, masked correctly
4      model.py       Full GPT class: embed → blocks → head
5      train.py       Loss decreases on Tiny Shakespeare, sample text

Step 1 — Tokenizer

Start dumb. Use a character-level tokenizer for the first pass — it’s two functions and removes a whole layer of confusion.

# tokenize.py
text = open("data/input.txt").read()
chars = sorted(set(text))
vocab_size = len(chars)            # ~65 unique chars
stoi = {c: i for i, c in enumerate(chars)}
itos = {i: c for i, c in enumerate(chars)}

encode = lambda s: [stoi[c] for c in s]
decode = lambda l: "".join(itos[i] for i in l)

assert decode(encode("Hello, world!")) == "Hello, world!"

Once the model trains, swap in tiktoken (BPE) and watch what changes. That’s the lesson — not the BPE itself, but seeing what tokenization actually does to your loss curve.
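
The swap itself is only a few lines. A sketch, using tiktoken's cl100k_base encoding (the one suggested in the extensions section below):

# tiktoken swap, as a sketch; vocab_size changes everything downstream
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
encode = enc.encode
decode = enc.decode
vocab_size = enc.n_vocab           # ~100k instead of ~65, so embedding tables get much larger

assert decode(encode("Hello, world!")) == "Hello, world!"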

Step 2 — Dataset & batching

# data.py
import torch

# `encode` and `text` come from step 1 (tokenize.py)
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

block_size = 128   # context window
batch_size = 32

def get_batch(split):
    d = train_data if split == "train" else val_data
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i:i+block_size] for i in ix])
    y = torch.stack([d[i+1:i+block_size+1] for i in ix])
    return x, y

x, y = get_batch("train")
assert x.shape == (batch_size, block_size)
assert y.shape == (batch_size, block_size)

Internalize this: y is x shifted by one. The model’s job at every position t is to predict the token at position t+1. That single sentence is most of language modeling.
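
If that sentence hasn't clicked yet, here is a toy illustration, reusing the char-level encode from step 1:

# toy illustration of the shift: block_size = 4, text "To be"
chunk = "To be"
x_demo = encode(chunk[:4])    # "To b", what the model sees
y_demo = encode(chunk[1:5])   # "o be", the targets, one position ahead
# at every position t, y_demo[t] is the character that follows x_demo[t]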

Step 3 — Attention, by hand

Write this once, by yourself, before reading any reference implementation. Get it wrong, fix it, get it wrong again. The aha is in the wrong-ness.

# attention.py — one causal head, the long way
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask):
    # q, k, v: (B, T, head_size); mask: (T, T), lower-triangular ones (e.g. torch.tril)
    d_k = k.size(-1)
    scores = q @ k.transpose(-2, -1) / (d_k ** 0.5)   # (B, T, T)
    scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)               # (B, T, T)
    return weights @ v                                # (B, T, head_size)

When that works, fold it into a MultiHeadAttention module and verify: with n_heads = 4 and n_embd = 64, every head should operate on head_size = 16, and concatenated output should be (B, T, 64) again.

Two common bugs to watch for: (1) forgetting to register the causal mask as a buffer, so it doesn’t move to the GPU with the rest of the module; (2) applying softmax over the wrong axis. The axis you softmax over is the keys axis (the last dimension of the (B, T, T) score matrix), not the queries axis. Both are marked in the sketch below.
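
For reference once you've fought with it yourself, here is one shape the multi-head wrapper can take. A sketch, not the only correct layout, using the same hyperparameter names as the rest of this module:

# multihead.py: a minimal sketch of the multi-head wrapper
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, n_embd, n_heads, block_size):
        super().__init__()
        assert n_embd % n_heads == 0
        self.n_heads = n_heads
        self.head_size = n_embd // n_heads
        self.qkv = nn.Linear(n_embd, 3 * n_embd)
        self.proj = nn.Linear(n_embd, n_embd)
        # bug (1): registered as a buffer so it moves with .to(device)
        self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=-1)
        # reshape each to (B, n_heads, T, head_size)
        q = q.view(B, T, self.n_heads, self.head_size).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.head_size).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_size).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / (self.head_size ** 0.5)      # (B, nh, T, T)
        scores = scores.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
        weights = F.softmax(scores, dim=-1)   # bug (2): softmax over the keys axis (last dim)
        out = (weights @ v).transpose(1, 2).reshape(B, T, C)
        return self.proj(out)

# shape check from above: n_heads=4, n_embd=64 → head_size=16, output is (B, T, 64) again
mha = MultiHeadAttention(n_embd=64, n_heads=4, block_size=128)
assert mha(torch.randn(2, 128, 64)).shape == (2, 128, 64)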

Step 4 — Full model

The final assembly is depressingly small. That’s the point.

GPT(
  token_embedding:  Embedding(vocab_size, n_embd)
  position_embedding: Embedding(block_size, n_embd)
  blocks: [
    Block(n_embd, n_heads):
      LayerNorm → MultiHeadAttention → residual
      LayerNorm → FeedForward (4×) → residual
  ] × n_layers
  ln_final: LayerNorm
  lm_head: Linear(n_embd, vocab_size)
)
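
In PyTorch, that outline might look roughly like the sketch below. It assumes the MultiHeadAttention from step 3; your version will differ in the details:

# model.py skeleton (a sketch); assumes the MultiHeadAttention from step 3
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, n_embd, n_heads, block_size):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = MultiHeadAttention(n_embd, n_heads, block_size)
        self.ln2 = nn.LayerNorm(n_embd)
        self.ff = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd)
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))    # residual: add to the input, never replace it
        x = x + self.ff(self.ln2(x))
        return x

class GPT(nn.Module):
    def __init__(self, vocab_size, block_size, n_embd, n_heads, n_layers):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, n_embd)
        self.pos_emb = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(
            *[Block(n_embd, n_heads, block_size) for _ in range(n_layers)]
        )
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
        self.lm_head.weight = self.tok_emb.weight   # weight tying: try with and without

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)   # (B, T, n_embd)
        x = self.blocks(x)
        return self.lm_head(self.ln_f(x))           # (B, T, vocab_size)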

Two things that often surprise people the first time:

  1. It’s residual everywhere. Each block adds to its input, never replaces it. Without this, gradients die.
  2. Weight tying is real. lm_head.weight = token_embedding.weight saves parameters and usually trains better. Try it both ways and look at the loss.

Sanity targets for an n_embd=192, n_heads=6, n_layers=6 model on Shakespeare:

  • Total params: ~2.7M
  • Forward pass on a single batch: under 50ms on M-series MPS / under 10ms on a 4090
  • Untrained loss: ≈ ln(vocab_size) ≈ 4.17 (uniform over 65 chars)

If your untrained loss isn’t near that, you have a bug in your loss computation, not your model.
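
A quick way to check this before any training, as a sketch that reuses model, get_batch, and vocab_size from the earlier steps:

# untrained-loss sanity check; run before any optimizer steps
import math
import torch
import torch.nn.functional as F

x, y = get_batch("train")
with torch.no_grad():
    logits = model(x)                                            # (B, T, vocab_size)
    loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
print(f"untrained {loss.item():.3f}  vs  ln(vocab_size) {math.log(vocab_size):.3f}")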

Step 5 — Train it

# train.py — minimum viable
import torch
import torch.nn.functional as F

# model, get_batch, and vocab_size come from the earlier steps
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(5000):
    x, y = get_batch("train")
    logits = model(x)
    loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
    opt.zero_grad(set_to_none=True)
    loss.backward()
    opt.step()
    if step % 200 == 0:
        print(f"step {step}: train loss {loss.item():.4f}")

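The table below also mentions sample quality, which assumes you can sample from the model at all. A minimal sampling loop, as a sketch that reuses stoi and decode from step 1:

# sample.py: minimal autoregressive sampling, no KV-cache yet
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size=128):
    # idx: (B, T) tensor of token ids; append one sampled token per iteration
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]          # crop to the context window
        logits = model(idx_cond)[:, -1, :]       # logits at the last position only
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return idx

# start from a single newline and print ~300 characters
start = torch.tensor([[stoi["\n"]]], dtype=torch.long)
print(decode(generate(model, start, 300)[0].tolist()))
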
Numbers you should see on Tiny Shakespeare with the model above:

Step    Train loss   Sample quality
0       ~4.17        random characters
500     ~2.4         space-separated nonsense
2000    ~1.6         actual words, broken grammar
5000    ~1.3         Shakespearean rhythm, meaningless content

If your loss is stuck above 2.0 after 2000 steps, something is wrong — usually a tokenization bug, a mask bug, or a missing LayerNorm.
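
One more diagnostic worth wiring in before you start changing things: the loop above only logs training loss, so a periodic validation estimate helps you separate still-learning from memorizing-the-corpus. A sketch, reusing model, get_batch, and vocab_size:

# eval.py: averaged train/val loss, call every few hundred steps
import torch
import torch.nn.functional as F

@torch.no_grad()
def estimate_loss(eval_iters=100):
    model.eval()
    out = {}
    for split in ("train", "val"):
        losses = torch.zeros(eval_iters)
        for i in range(eval_iters):
            x, y = get_batch(split)
            logits = model(x)
            losses[i] = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
        out[split] = losses.mean().item()
    model.train()
    return out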

Make it bigger, then check yourself

Once it works, do these in any order. They’re how you actually internalize the architecture:

  • Replace your hand-rolled attention with F.scaled_dot_product_attention (a sketch follows at the end of this section). Time both. Watch flash-attention kick in on a real GPU.
  • Swap char-level for tiktoken’s cl100k_base. Note what changes: vocab size, loss scale, sample quality.
  • Add a learning-rate warmup + cosine decay. The default flat LR is leaving training quality on the table.
  • Implement a KV-cache for inference. Per-token generation cost should drop from O(n²) (re-running attention over the whole prefix every step) to O(n) (one new query attending over the cached keys).
  • Profile with torch.profiler. Identify which matmuls dominate, and watch the QK^T share grow as you increase block_size; it’s the only term that scales as O(n²) in sequence length.

Each of these takes 30–90 minutes and produces a permanent piece of intuition.
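
For the first of those, the built-in call is nearly a drop-in. A sketch of the replacement inside your attention forward, with q, k, v shaped (B, n_heads, T, head_size) as in the multi-head code:

# inside MultiHeadAttention.forward: replaces the manual scores / mask / softmax lines
import torch.nn.functional as F

# is_causal=True applies the lower-triangular mask for you
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
# on a real GPU this can dispatch to a flash-attention kernel; time it against your version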

Going deeper (resources, ranked)

When you actually have specific questions, in roughly this order:

  1. karpathy/nanoGPT — the reference impl this module is shadowing. Diff your code against it.
  2. karpathy/build-nanogpt — Karpathy’s longer follow-up: GPT-2 reproduction, on real hardware.
  3. Phi/SmolLM technical reports — small model, real training story, more honest than most flagship reports.
  4. Tay et al. — Efficient Transformers: A Survey. arxiv — read once you can describe vanilla attention without notes.
  5. Hoffmann et al. — Chinchilla (scaling laws). arxiv — the paper that explains why your 3.5M model can’t be a good chatbot.

Skip the rest of the survey papers and the “intro to transformers” Medium posts. They’re written for the version of you that hasn’t built this yet, and once you have, they can’t help.

Checkpoints

Answer these out loud, alone, in plain language. If any answer wobbles, the corresponding section above is what to reread.

  1. Why is attention O(n²) in sequence length, and which specific matmul dominates?
  2. What does the causal mask actually mask, and what would the model learn if you removed it?
  3. Why do we use LayerNorm and not BatchNorm? What would break with BatchNorm?
  4. What does weight tying do, and why does it usually help?
  5. What’s the relationship between cross-entropy loss and the perplexity number people report? (Hint: exp(loss).)

If you can answer all five from memory, you’ve earned this module. Move on to 01.2 (Tokenizers, datasets, training) or jump to 02.1 (Build an AI agent) if you’d rather skip ahead to applications.