$ yuktics v0.1

T4 — AI Literacy and Engineering · module 04.3 · ~6–8 hrs

Tokenizers, datasets, training

BPE, packing, masking, mixed precision, gradient accumulation. The unsexy plumbing that turns a model architecture into a model that's actually good.

Prerequisites

  • 04.2

Stack

  • Python 3.12
  • PyTorch
  • tiktoken or huggingface tokenizers
  • datasets (HF) or local data
  • wandb

By the end of this module

  • Train a BPE tokenizer on your own corpus and use it to pack training sequences.
  • Use mixed precision and gradient accumulation to fit a bigger effective batch on a small GPU.
  • Read a training run on wandb and identify the three failure modes that look like 'it's training' but aren't.
  • Take nanoGPT, swap in real BPE tokens, retrain on a larger corpus, and measure improvement.

The architecture is one tenth of the work. Once you have a transformer that runs, the next nine tenths are everything around it: how you tokenize, how you batch, how you schedule the learning rate, how you handle precision, how you log. Almost every “my model isn’t training” question on the forums is a problem in this module — not in the architecture from 04.2.

This module is the unsexy plumbing. Most of it does not have a clean theoretical justification — it’s a pile of empirical tricks that the field figured out and now everyone uses. The opinion: if you skip this module and jump to fine-tuning, you will not be able to debug your own runs. A fine-tuned model that’s worse than the base is almost always a tokenizer or batching mistake, not a hyperparameter one.

Set up

mkdir tokenizers-training && cd tokenizers-training
uv venv .venv && source .venv/bin/activate
uv pip install torch tiktoken tokenizers datasets wandb numpy

# Login to wandb (free tier is fine)
wandb login

git init && echo ".venv/" >> .gitignore
echo "wandb/" >> .gitignore

Bring forward the working nanoGPT-style code from 04.2. You’ll be modifying it, not rewriting from scratch.

Read these first

Three sources, in order, then stop:

  1. Andrej Karpathy — Let’s build the GPT Tokenizer. video · 2 hrs · BPE from first principles. Watch this once and tokenizers stop being magic.
  2. Andrej Karpathy — build-nanogpt. repo · 1 hr to read · the reproducible GPT-2 training run. Real LR schedule, real grad accum, real masking.
  3. Hugging Face — Tokenizers docs. docs · 30 min · only the “training a new tokenizer” page. Reference for the build.

That’s it. The rabbit hole here is “every blog post about tokenizer edge cases ever written.” Resist. You will encounter the edge cases naturally as you train.

What this module covers

Section                     What you’ll know after
BPE in practice             Why subword beats char and word, and how merges work
Sequence packing            Why padding wastes 30%+ of compute on real data
Attention masks             Causal, bidirectional, prefix — when each is right
Mixed precision             Why bf16 mostly replaced fp16 for training
Gradient accumulation       Big effective batch on a small GPU
LR schedules and clipping   Warmup, cosine, why naive constant LR is bad
Logging with wandb          The 4 charts you must look at every run

BPE in practice

Byte Pair Encoding is the dominant tokenizer for modern LLMs. The intuition: start with single bytes, repeatedly merge the most frequent adjacent pair, until you hit a target vocabulary size. The result is a vocabulary where common words are one token, rare words are several, and unknown text is always representable.
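
To make the merge loop concrete, here’s a toy version in pure Python. A sketch for intuition only (real tokenizers operate on bytes, respect word boundaries via pre-tokenization, and are far faster); the helper names are ours:

from collections import Counter

def most_frequent_pair(tokens):
    # Count adjacent pairs and return the most common one
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge(tokens, pair):
    # Replace every occurrence of `pair` with a single fused token
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("low lower lowest")   # start from single characters (real BPE: single bytes)
for _ in range(5):                  # each merge grows the vocabulary by one entry
    pair = most_frequent_pair(tokens)
    tokens = merge(tokens, pair)
    print(pair, "->", tokens)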

Train one on your own corpus:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = ByteLevel()
trainer = BpeTrainer(
    vocab_size=8192,
    special_tokens=["[PAD]", "[UNK]", "[BOS]", "[EOS]"],
)
tokenizer.train(files=["data/corpus.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")

# Sanity
out = tokenizer.encode("Hello, world! Tokens are weird.")
print(out.tokens)   # e.g. ['Hello', ',', 'Ġworld', '!', ...] (Ġ marks a leading space in byte-level BPE)
print(out.ids)

Or use a pre-trained one for speed:

import tiktoken
enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's tokenizer
ids = enc.encode("Hello, world!")
print(ids, [enc.decode([i]) for i in ids])

The lesson: vocabulary size is a tradeoff. Smaller vocab means longer sequences (more compute per example) but better generalization to rare strings. 8K–50K is the modern range for small models. 100K+ is the range for production LLMs.
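
You can feel the tradeoff by encoding the same text with two pre-trained vocabularies. A quick sketch (exact counts depend on the text):

import tiktoken

text = "Antidisestablishmentarianism is surprisingly tokenizable."
for name in ("gpt2", "cl100k_base"):   # ~50K vocab vs ~100K vocab
    enc = tiktoken.get_encoding(name)
    print(f"{name}: vocab={enc.n_vocab}, tokens={len(enc.encode(text))}")
# Expect the larger vocabulary to cover the same text in fewer tokens.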

Sequence packing vs padding

The naive approach: every example becomes a separate sequence, padded to the max length in the batch. Result: a large share of your compute is spent on padding tokens that contribute no training signal.

The better approach: concatenate all training examples (with separator tokens) and slice into fixed-length chunks. No padding. Every token is a real training signal.
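
A quick back-of-envelope shows where the 30%+ waste figure comes from. The lengths here are made up, but the spread is typical of real data:

# Fraction of a padded batch that is padding
lengths = [37, 120, 64, 512, 90, 15]          # hypothetical example lengths
padded = max(lengths) * len(lengths)          # everything padded to the longest
real = sum(lengths)
print(f"padding fraction: {1 - real / padded:.0%}")   # ~73% wasted in this batch

The packing itself: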

import torch

def pack_sequences(token_lists, block_size, eos_id):
    # Flatten with EOS between examples
    all_tokens = []
    for ids in token_lists:
        all_tokens.extend(ids)
        all_tokens.append(eos_id)
    # Slice into blocks
    n = len(all_tokens) // block_size
    packed = torch.tensor(all_tokens[: n * block_size]).view(n, block_size)
    return packed

# Now every batch is real tokens, no waste

For SFT (chat-style fine-tuning) you’ll often pack with a per-sequence attention mask so attention doesn’t cross example boundaries. For pretraining, you usually don’t bother — the model figures it out.
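
A sketch of that per-sequence mask, assuming EOS tokens mark example boundaries inside a packed block (the segment logic here is ours, not a library call):

import torch

def packed_causal_mask(tokens, eos_id):
    # True = may attend. Causal AND same-example, so attention never
    # crosses an example boundary inside the packed block.
    T = tokens.size(0)
    is_eos = (tokens == eos_id).long()
    seg = torch.cumsum(is_eos, dim=0) - is_eos   # EOS stays with its own example
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    same_seg = seg.unsqueeze(0) == seg.unsqueeze(1)
    return causal & same_seg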

Attention masks — the three flavors

Mask            Used in               What it does
Causal          GPT-style decoders    Token at position t sees positions 0..t
Bidirectional   BERT-style encoders   Every token sees every other token
Prefix          Some chat models      Prefix is bidirectional, completion is causal

In your nanoGPT from 04.2, you have a causal mask. For a chat fine-tune, you may want prefix masking on the user turn so the model can attend to the full prompt as a unit. This is a one-line change but matters for some workloads.
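
A sketch of that change, assuming prefix_len is the prompt length in tokens:

import torch

def prefix_lm_mask(T, prefix_len):
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool))  # causal everywhere
    mask[:prefix_len, :prefix_len] = True   # the one line: prefix attends bidirectionally
    return mask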

Mixed precision

Training in fp32 is expensive and unnecessary on modern GPUs. Two formats matter:

  • fp16: half precision. Smaller range (overflow risk). Needs gradient scaling. Older GPUs.
  • bf16: brain float. Same range as fp32, less mantissa precision. No gradient scaling needed. Ampere/Hopper GPUs.

Default to bf16 if your GPU supports it (most cloud GPUs since 2021 do). It “just works” without the loss-scaling dance.
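
The range difference is easy to demonstrate (fp16 tops out around 65,504; bf16 shares fp32’s exponent range):

import torch

x = torch.tensor(70000.0)
print(x.to(torch.float16))    # inf: out of fp16's range
print(x.to(torch.bfloat16))   # finite but rounded: bf16 keeps range, not precision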

# bf16 training loop (model, opt, get_batch, vocab_size carry over from 04.2)
import torch
import torch.nn.functional as F

scaler = None  # no GradScaler needed for bf16
device = "cuda"
model = model.to(device)

for step in range(n_steps):
    x, y = get_batch("train")
    x, y = x.to(device), y.to(device)

    with torch.amp.autocast(device_type=device, dtype=torch.bfloat16):
        logits = model(x)
        loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))

    opt.zero_grad(set_to_none=True)
    loss.backward()
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # returns the pre-clip norm; worth logging
    opt.step()

For fp16 you’d need torch.amp.GradScaler. Skip the headache; use bf16 unless you literally cannot.
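
For reference, if you are stuck on pre-Ampere hardware, the loss-scaling dance looks roughly like this (same model, opt, and get_batch as the bf16 loop above):

scaler = torch.amp.GradScaler("cuda")

for step in range(n_steps):
    x, y = get_batch("train")
    x, y = x.to(device), y.to(device)

    with torch.amp.autocast(device_type=device, dtype=torch.float16):
        logits = model(x)
        loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))

    opt.zero_grad(set_to_none=True)
    scaler.scale(loss).backward()   # scale up so small gradients don't underflow to zero
    scaler.unscale_(opt)            # unscale before clipping so the threshold is meaningful
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(opt)                # skips the step entirely if gradients overflowed
    scaler.update()                 # adjusts the scale factor for the next step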

Gradient accumulation

Big batch sizes train better. Your GPU can’t fit them. Solution: do K forward/backward passes, accumulate gradients, then step.

# One optimizer step = accum_steps micro-batches
accum_steps = 8        # effective batch = micro_batch * 8
opt.zero_grad(set_to_none=True)

for micro_step in range(accum_steps):
    x, y = get_batch("train")
    with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = model(x)
        loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
        loss = loss / accum_steps    # scale so total gradient is the average

    loss.backward()

torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
opt.step()

This trades wall time for memory. A 4080 with 16GB can effectively train at batch sizes meant for an 80GB A100, just slower.

LR schedules and gradient clipping

The default everyone uses for transformer training:

import math

def get_lr(step, warmup_steps, max_steps, max_lr, min_lr):
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    if step > max_steps:
        return min_lr
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)

# Apply at every step
for step in range(max_steps):
    lr = get_lr(step, 100, 5000, 3e-4, 3e-5)
    for g in opt.param_groups:
        g["lr"] = lr
    # ... rest of training step

Warmup prevents early-training instability. Cosine decay smoothly slides the LR to near-zero. Both are empirically validated; the math is hand-wavy.

Gradient clipping (clip_grad_norm_(model.parameters(), 1.0)) catches the occasional huge gradient that would otherwise blow up training. One line, never skip it.

Logging — the four charts

import wandb
wandb.init(project="tinygpt", config={"lr": 3e-4, "batch_size": 32})

# Inside training loop
if step % 50 == 0:
    wandb.log({
        "train/loss": loss.item(),
        "train/lr": lr,
        "train/grad_norm": grad_norm.item(),
        "train/tokens_per_sec": tokens_per_sec,
    }, step=step)

if step % 500 == 0:
    val_loss = evaluate(model)
    wandb.log({"val/loss": val_loss}, step=step)

The four charts to watch:

Chart        What healthy looks like            What broken looks like
Train loss   Monotone decreasing, smooth        Spikes, plateau, NaN
Val loss     Tracks train loss with small gap   Diverges → overfitting
Grad norm    Stable around 0.5–2.0              Spikes to 100+ → about to NaN
Tokens/sec   Stable, near GPU max               Drifting down → memory leak

The build

Start from your nanoGPT from 04.2. Make these changes in order:

  1. Train a BPE tokenizer (vocab_size=8192) on a larger corpus (TinyStories, ~500MB, or similar).
  2. Replace the char-level tokenizer with the BPE.
  3. Switch from random batches to packed sequences.
  4. Add bf16 mixed precision.
  5. Add gradient accumulation (accum_steps=4 or 8).
  6. Add the warmup-cosine LR schedule.
  7. Add gradient clipping.
  8. Wire up wandb logging.

Train for 5000 steps. Compare final val loss to the char-level Shakespeare baseline (per-token loss isn’t directly comparable across different tokenizers, so judge the samples too). You should see a substantially better model — coherent multi-sentence completions, not just rhythmic gibberish.

Going deeper

When you have specific questions, in this order:

  1. karpathy/build-nanogpt — the cleanest reference for everything in this module, in production-quality code.
  2. Hugging Face — NLP Course, chapter 6 — tokenizers in depth, with the failure modes.
  3. Mosaic — Composer training tricks — what professional shops actually do for big training runs.
  4. Liu et al. — RoBERTa. arxiv — the paper that made “training tricks dominate architecture” the consensus view.

Skip the “5 ways to speed up your training” Medium posts. They almost always recommend things you already did or things that don’t work.

Checkpoints

If any of these wobble, reread the corresponding section.

  1. Why is BPE better than word-level tokenization for English text? Give a concrete example where word-level fails and BPE doesn’t.
  2. You have a 16GB GPU and want effective batch size 256. Your micro-batch fits 32. What’s your accum_steps and total step time multiplier?
  3. Walk through what bf16 mixed precision actually does — which tensors are in bf16, which stay in fp32, and why the master weights matter.
  4. Your training loss is decreasing but val loss is going up. Name three concrete things to try, ranked by what you’d try first.
  5. You start a training run, and after 200 steps the loss is NaN. Where do you look first, second, third?

When you can answer all five from memory, move to 04.4 Fine-tuning — LoRA, QLoRA, full FT. The training plumbing you just built is what every fine-tuning recipe assumes you understand.