Fine-tuning — LoRA, QLoRA, full FT
When to use which. Real configs that actually train. How to fine-tune a 7B on a single GPU and get a model that's measurably better at your task.
Prerequisites: 04.3
Stack
Python 3.12 · PyTorch · transformers · peft · bitsandbytes · trl · 1× GPU (rented L4/A10 from Modal/RunPod is fine)
By the end of this module
- Pick the right adaptation strategy (prompt, RAG, fine-tune, pretrain) for a given problem in under 30 seconds.
- LoRA-fine-tune a 7B model on a single 24GB GPU with a working transformers + peft + trl config.
- Evaluate a fine-tuned model and recognize the most common failure mode (better on training set, worse in production).
- Merge LoRA adapters back into a base model for serving.
Fine-tuning is one of the most over-prescribed techniques in applied ML. Everyone wants to fine-tune a model. Most of them shouldn’t — and the ones who should are usually trying to do it the wrong way, with the wrong data, on the wrong base. This module exists to make you the rare engineer who fine-tunes well, sparingly, and only when prompting and retrieval have already failed.
The opinion: fine-tuning is the third thing you try, not the first. Prompt engineering is free and instant. RAG is fast and cheap and adds knowledge. Fine-tuning is slow, expensive, and changes behavior — useful only when you need a specific style, format, or skill that prompting cannot reliably elicit. By the end of this module you’ll know exactly which problems each tool solves, and you’ll have actually shipped a LoRA fine-tune of a 7B model on a single GPU.
Set up
mkdir finetune && cd finetune
uv venv .venv && source .venv/bin/activate
uv pip install torch transformers peft bitsandbytes trl accelerate datasets wandb
# Verify GPU is visible
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
git init && echo ".venv/" >> .gitignore
You need a real GPU for this module. CPU fine-tuning is not a thing. The cheapest legitimate option is renting an L4 (24GB) on Modal or RunPod for a few dollars; an A10 (24GB) or A100 (40GB+) is faster but pricier. A 4090 (24GB) at home works.
Read these first
Three sources, in order, then stop:
- Hu et al. — LoRA: Low-Rank Adaptation of Large Language Models. arxiv · 30 min · the original paper. Section 4 is the math; section 7 is the empirical justification. Read both.
- Dettmers et al. — QLoRA: Efficient Finetuning of Quantized LLMs. arxiv · 40 min · this is what makes single-GPU 7B fine-tuning real.
- Hugging Face — PEFT docs, TRL SFTTrainer page. PEFT · TRL · 30 min · the canonical reference for the libraries you’ll actually use.
Skip the dozens of “fine-tune Llama in 5 minutes” Medium posts. They are usually wrong about something subtle (chat templates, especially) and they all cargo-cult the same buggy hyperparameters from each other.
The decision tree
Before you fine-tune anything, walk this tree. Most of the time you should not fine-tune.
Question: "I want my model to..."
→ "...know about my company's specific docs"
→ use RAG. Do not fine-tune.
→ "...respond in our brand voice or specific format"
→ first try a system prompt + few-shot examples
→ if that's not enough, then SFT fine-tune
→ "...follow a complex multi-step procedure"
→ first try detailed prompting + a small eval set
→ if prompts can't reach quality bar, SFT fine-tune
→ "...be aligned to user preferences (helpfulness, safety)"
→ DPO or RLHF, after SFT
→ "...handle a novel modality, language family, or domain
that's truly underrepresented in pretraining"
→ consider continued pretraining, then SFT
→ this is rare; you usually don't need this
If you’re not sure which branch you’re on, you’re not ready to fine-tune. Re-read 04.6 RAG done right first.
LoRA — what it actually does
The idea is small and beautiful. The weight update learned during fine-tuning, ΔW, is usually low-rank in practice: it can be approximated by a product of two thin matrices. Instead of updating the full weight W (millions of parameters), you train two small matrices A (r × d_in) and B (d_out × r), where r is small (8 to 64), and use W + BA at inference.
Original layer: y = W x (W is d × d, big)
LoRA-adapted: y = W x + B A x (A is r × d, B is d × r)
For a 7B model, full fine-tuning trains 7B parameters. LoRA with r=16 trains roughly 0.1% of that — typically 5-10M parameters. The base model stays frozen. Result: training fits on consumer GPUs, and you ship just the adapter (a few hundred MB) instead of a full model.
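To make the numbers concrete, here's the arithmetic for a single 4096×4096 attention projection at r=16 (plain Python, nothing to install):
# Parameter count for one 4096x4096 projection adapted at r=16
d, r = 4096, 16
full = d * d              # 16,777,216 params in the frozen W
lora = r * d + d * r      # A (r x d) + B (d x r) = 131,072 params
print(f"{lora:,} trainable vs {full:,} frozen = {lora / full:.2%} per layer")
# 131,072 vs 16,777,216 = 0.78% per layer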
# Minimal LoRA setup with PEFT
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

lora_config = LoraConfig(
    r=16,                                   # rank
    lora_alpha=32,                          # scaling, usually 2*r
    target_modules=["q_proj", "v_proj"],    # which linear layers to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# trainable params: ~5M || all params: 7B || trainable%: 0.07
Choosing r and target_modules is the only LoRA hyperparameter art:
| r value | When to use |
|---|---|
| 4–8 | Small style change, narrow task |
| 16–32 | Default. Most use cases land here. |
| 64+ | Closer to full FT; usually unnecessary |
Target modules: minimum is q_proj and v_proj. Adding k_proj, o_proj, and the MLP projections (gate_proj, up_proj, down_proj) gets closer to full FT quality at the cost of more parameters. Start with q+v, only expand if results disappoint.
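If q+v disappoints, the expanded config looks like this (the module names below are the Llama-family ones; other architectures name their projections differently):
# Wider LoRA: adapt all attention and MLP projections (Llama-style names)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention
        "gate_proj", "up_proj", "down_proj",      # MLP
    ],
    bias="none", task_type="CAUSAL_LM",
)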
QLoRA — when you can’t even fit the base
QLoRA is “load the base model in 4-bit precision, attach LoRA adapters in fp16/bf16 on top.” The base never gets gradient updates (it’s frozen and quantized). Only the adapters train.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat 4
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
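To train on that quantized base, attach adapters on top. peft ships prepare_model_for_kbit_training, which handles the fiddly parts (casting norms, enabling input gradients). A minimal sketch, reusing a LoRA config like the one from the previous section:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)   # cast norms, enable input grads
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    bias="none", task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)       # adapters train in bf16; base stays 4-bit
trl's SFTTrainer runs this preparation itself when you hand it a peft_config, which is why the end-to-end config later in this module skips it.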
This is what makes 7B-13B fine-tuning fit on a 16-24GB GPU. The quality cost relative to LoRA on a fp16 base is small (1-2% on most benchmarks); the memory savings are huge.
Use QLoRA when: the base model in fp16 doesn’t fit in your VRAM with room for activations and gradients. Use plain LoRA when it does.
Full fine-tuning — when?
Almost never, for individuals. Full FT means updating all 7B parameters. Cost: 8x the GPU memory of LoRA, much longer training, and a 14GB+ artifact you have to store and serve.
Reasonable use cases: you have a large, high-quality, domain-specific dataset (think: 100K+ examples of legal documents); you’ve validated that LoRA can’t reach quality; you have multi-GPU budget. Otherwise: LoRA.
SFT, DPO, PPO — what trains what
| Method | What it learns from | When to use |
|---|---|---|
| SFT (supervised fine-tuning) | (prompt, completion) pairs | Most fine-tuning is this. Always start here. |
| DPO (Direct Preference Optimization) | (prompt, preferred, rejected) triples | After SFT, when you have preference data |
| PPO (RLHF) | A reward model + RL loop | When DPO isn’t enough; rare in practice |
For 95% of applied work, SFT alone gets you what you need. DPO is worth it when you have explicit preference data. PPO requires a reward model and an RL training loop — most teams don’t have the data or the infra.
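For orientation, here is roughly what DPO looks like in trl. The prompt/chosen/rejected column names are what trl expects; the single example, sft_model, and tok are placeholders, and the exact trainer arguments vary by trl version, so treat this as a sketch:
from datasets import Dataset
from trl import DPOTrainer, DPOConfig

# (prompt, preferred, rejected) triples; one illustrative example
prefs = Dataset.from_list([{
    "prompt": "Explain LoRA in one sentence.",
    "chosen": "LoRA trains two small low-rank matrices on top of frozen weights.",
    "rejected": "LoRA is a long-range radio protocol.",
}])

trainer = DPOTrainer(
    model=sft_model,                                # placeholder: your SFT checkpoint, not the raw base
    args=DPOConfig(output_dir="./dpo", beta=0.1),   # beta: how far you may drift from the SFT policy
    train_dataset=prefs,
    tokenizer=tok,                                  # placeholder: the model's tokenizer
)
trainer.train()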
The chat template trap
This is the #1 invisible mistake in chat-style fine-tuning. Every base instruction model has a specific chat template — special tokens, formatting, role labels — that it was trained on. If your fine-tuning data uses a different template, you are fighting the base model.
# Always use the model's actual chat template
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
messages = [
{"role": "user", "content": "What is photosynthesis?"},
{"role": "assistant", "content": "Photosynthesis is..."},
]
formatted = tok.apply_chat_template(messages, tokenize=False)
print(formatted)
# Will produce something like: <|begin_of_text|><|start_header_id|>user<|end_header_id|>...
Use apply_chat_template from the model’s own tokenizer. Do not roll your own. Do not assume “ChatML” or “Llama format” or whatever you saw in a tutorial.
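The template matters at inference too. Passing add_generation_prompt=True (a real apply_chat_template flag) appends the assistant header, so the model answers instead of continuing the user's turn:
# At inference, end with the assistant header so the model replies
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is photosynthesis?"}],
    tokenize=False,
    add_generation_prompt=True,
)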
A real config, end to end
This is a working config for LoRA SFT on Llama-3.1-8B-Instruct with a small instruction dataset, on a single 24GB GPU. (We use the Instruct variant because it ships the chat template the data below is formatted with; the base model's tokenizer has no chat template, so apply_chat_template would fail.)
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
import torch

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # Instruct: its tokenizer ships the chat template

tok = AutoTokenizer.from_pretrained(MODEL)
tok.pad_token = tok.eos_token

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, quantization_config=bnb_config, device_map="auto"
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none", task_type="CAUSAL_LM",
)

dataset = load_dataset("yahma/alpaca-cleaned", split="train").select(range(2000))

def format_example(ex):
    messages = [
        {"role": "user", "content": ex["instruction"] + ("\n" + ex["input"] if ex["input"] else "")},
        {"role": "assistant", "content": ex["output"]},
    ]
    return {"text": tok.apply_chat_template(messages, tokenize=False)}

dataset = dataset.map(format_example)

training_args = SFTConfig(
    output_dir="./out",
    num_train_epochs=2,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
    max_seq_length=1024,
    packing=True,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tok,
    train_dataset=dataset,
    peft_config=lora_config,   # SFTTrainer applies the adapters (and k-bit prep) itself
    args=training_args,
)
trainer.train()
trainer.save_model("./adapter")
That trains in roughly an hour on an L4 with 2000 examples. For a real project you’d use 5-50K examples and run for longer.
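Before any formal eval, smoke-test the adapter on a prompt you didn't train on (model and tok are the objects from the script above; the prompt is illustrative):
# Quick sanity generation with the freshly trained adapter
ids = tok.apply_chat_template(
    [{"role": "user", "content": "Summarize what LoRA does in two sentences."}],
    add_generation_prompt=True, return_tensors="pt",
).to(model.device)
out = model.generate(ids, max_new_tokens=128)
print(tok.decode(out[0][ids.shape[-1]:], skip_special_tokens=True))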
The most common failure mode
You fine-tune. Eval on the training set looks great. Eval on real data is worse than the base model. You are confused. This is the most common outcome of the first fine-tune everyone runs.
Causes, in order of frequency:
- Chat template mismatch. You used a different format than the base.
- Training data is too narrow. Model now refuses or hallucinates on out-of-distribution prompts.
- You overfit. 5+ epochs on a small dataset memorizes it.
- Base model was the wrong choice. Fine-tuning a chat model on raw instruction data can break alignment.
The fix: always evaluate on a held-out set that looks like real production traffic, not on training-set lookalikes. If real-world quality drops, your fine-tune is failing even if loss looks great.
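A minimal version of that discipline: side-by-side generations on held-out prompts, graded by hand or by an LLM judge. In this sketch the two prompts are placeholders, base_model is assumed to be a separately loaded un-adapted copy, and model/tok come from the training script:
# Side-by-side eval on held-out prompts (grading is up to you)
held_out = [
    "Draft a polite refund-denial email.",       # placeholders: use real
    "Explain our rate limits to a developer.",   # production traffic here
]

def generate(m, prompt):
    ids = tok.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True, return_tensors="pt",
    ).to(m.device)
    out = m.generate(ids, max_new_tokens=256)
    return tok.decode(out[0][ids.shape[-1]:], skip_special_tokens=True)

for p in held_out:
    print(f"PROMPT: {p}\nBASE:  {generate(base_model, p)}\nTUNED: {generate(model, p)}\n")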
Merging adapters
For serving, you usually want one model file, not “base + adapter.”
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Merge in full precision; you cannot merge into a 4-bit base (see note below)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "./adapter")
merged = model.merge_and_unload()   # folds the scaled B·A update into the base weights
merged.save_pretrained("./merged")
Note: you can’t directly merge a LoRA into a 4-bit quantized base. Load the base in fp16/bf16 to merge, then re-quantize for serving if needed.
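If serving memory is tight, re-quantize the merged artifact on load. A sketch reusing the bnb_config from the QLoRA section:
# Load the merged model 4-bit for serving
serving_model = AutoModelForCausalLM.from_pretrained(
    "./merged", quantization_config=bnb_config, device_map="auto"
)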
Going deeper
When you have specific questions, in order:
- axolotl — the most-used framework for serious open-source fine-tuning. Read its example configs once you’ve shipped one fine-tune by hand.
- Rafailov et al. — DPO. arxiv — the paper. Read after you’ve shipped an SFT model and want to align it.
- unsloth — 2x faster fine-tuning on single GPU. Worth the switch once your loop is stable.
- Anthropic — Constitutional AI. paper — alternative to RLHF. Read when you care about alignment more than benchmarks.
Skip “I fine-tuned Llama on my journal” content. It’s almost always overfitting demonstrations.
Checkpoints
If any of these wobble, reread the corresponding section.
- Walk through the LoRA decomposition for a 4096x4096 attention projection at r=16. How many parameters do you train, and what fraction of full FT is that?
- When does QLoRA make sense and when is plain LoRA strictly better? Give the rule based on VRAM.
- You SFT on 1000 examples and eval shows the model is now worse on prompts unrelated to your task. What’s the likely cause and the fix?
- What’s the difference between apply_chat_template and just concatenating “user: … assistant: …” strings? What breaks if you do the latter?
- You have 200 (prompt, preferred, rejected) examples. Should you do SFT, DPO, or both? Order matters — explain why.
When you can answer all five from memory, move to 04.5 Build an AI agent. Most production “AI products” are a fine-tune plus an agent loop on top.