Build an AI agent
Go from a single LLM call to a working tool-using agent. Build a research agent with tracing and evals — small enough to fit in 200 lines, real enough to actually use.
Prerequisites
Python 3.12 · an Anthropic API key · module 01.1 (or equivalent) · comfort with HTTP / JSON schemas
Stack
anthropic >= 0.45 · Claude Sonnet 4.6 · tavily-python (or exa-py) · python-dotenv · rich (for nice traces)
By the end of this module
- Implement the basic agent loop (LLM ↔ tools) by hand, no framework.
- Define tool schemas that the model actually uses correctly.
- Trace every step of an agent run and replay failures.
- Write a small eval set and measure your agent's accuracy on it.
The word agent gets used to mean a lot of different things in 2026. In this module it means exactly one thing: an LLM in a loop, with tools, that decides each step what to do next. Almost every interesting “AI agent” product is some variant of that. You don’t need a framework to build one — you need to understand the loop.
In this module you’ll build a real research agent. By the end it can take a question, search the web, read pages, run small Python snippets to do math or parse data, and produce a sourced answer. Roughly 200 lines of Python, no framework dependencies, fully traced and evaluable.
Set up
mkdir agent && cd agent
uv venv .venv && source .venv/bin/activate
uv pip install anthropic tavily-python python-dotenv rich
cat > .env <<'EOF'
ANTHROPIC_API_KEY=sk-ant-...
TAVILY_API_KEY=tvly-...
EOF
git init && printf ".env\n.venv/\n" >> .gitignore
If you don’t have a Tavily key, use Exa or Brave Search API — same shape, swap the client. The point is some search backend you can call from Python.
Read these first
Three sources, in this order, then stop:
- Anthropic — Building effective agents. post · 20 min · the framing this module follows
- Anthropic — Tool use docs. docs · 15 min · canonical for the API shape
- Schick et al. — Toolformer. arxiv · 30 min · why tool use works at all
You’ll be tempted to read about LangChain, LangGraph, AutoGen, CrewAI, and the rest. Don’t, yet. Build it without a framework first — you’ll evaluate frameworks in 02.5 with a much sharper opinion.
Step 0 — The smallest possible “agent”
Strip out everything you’ve read about agents and write the dumbest version. One LLM call. No tools. No loop.
# v0.py
from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()
client = Anthropic()

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "What's 47 * 89?"}],
)
print(resp.content[0].text)
Run it. The model will probably get the math right, but trust nothing — try a question that needs current information (“What’s the latest stable PyTorch version?”). Watch it confidently make something up. That’s the gap tools close.
Step 1 — Define a tool
A tool, as far as the model is concerned, is a JSON schema and a name. The runtime is yours.
# tools.py
import os
from tavily import TavilyClient

tavily = TavilyClient(api_key=os.getenv("TAVILY_API_KEY"))

WEB_SEARCH = {
    "name": "web_search",
    "description": (
        "Search the web for recent or specific information. "
        "Use when the user's question depends on current facts, "
        "specific URLs, or anything you might be wrong about."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search query, 3-10 words."},
            "max_results": {"type": "integer", "default": 5, "minimum": 1, "maximum": 10},
        },
        "required": ["query"],
    },
}

def run_web_search(query: str, max_results: int = 5) -> str:
    results = tavily.search(query=query, max_results=max_results)
    return "\n\n".join(
        f"# {r['title']}\n{r['url']}\n{r['content']}"
        for r in results["results"]
    )
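If you swapped in Exa back in the setup step, keep the same handler name and return shape so nothing downstream changes. A sketch (it assumes exa-py's search_and_contents method and an EXA_API_KEY env var; check the current SDK docs):

# tools.py (Exa variant; same name, same return shape)
import os
from exa_py import Exa

exa = Exa(api_key=os.getenv("EXA_API_KEY"))

def run_web_search(query: str, max_results: int = 5) -> str:
    # search_and_contents returns search hits with page text included
    results = exa.search_and_contents(query, num_results=max_results, text=True)
    return "\n\n".join(
        f"# {r.title}\n{r.url}\n{r.text}" for r in results.results
    )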
A few things to internalize:
- The description is the prompt for the tool. If the model is calling it wrongly or not calling it when it should, fix the description before fixing anything else.
- The schema constrains the model, but treat it as a strong hint rather than a hard guarantee. The model almost always respects types, bounds, and required fields; still, validate max_results in your handler instead of assuming an out-of-range value can never reach your code (see the sketch below).
- Required vs. optional matters. Be ruthless. Optional fields are where models hallucinate.
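A minimal defensive version of the handler (a sketch; the clamp values just restate the schema's declared bounds):

def run_web_search(query: str, max_results: int = 5) -> str:
    # Mirror the schema bounds (1-10) so a stray value degrades gracefully
    max_results = max(1, min(int(max_results), 10))
    results = tavily.search(query=query, max_results=max_results)
    return "\n\n".join(
        f"# {r['title']}\n{r['url']}\n{r['content']}"
        for r in results["results"]
    )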
Step 2 — The agent loop
This is the whole conceptual content of the module. The rest is plumbing.
# agent.py
from rich import print
from anthropic import Anthropic

from tools import WEB_SEARCH, run_web_search

client = Anthropic()
TOOLS = [WEB_SEARCH]
TOOL_FNS = {"web_search": run_web_search}

def run_agent(user_msg: str, max_turns: int = 8) -> str:
    messages = [{"role": "user", "content": user_msg}]
    for turn in range(max_turns):
        resp = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            tools=TOOLS,
            messages=messages,
        )
        print(f"[bold cyan]turn {turn} — stop_reason:[/] {resp.stop_reason}")

        # Always append the assistant message back into history
        messages.append({"role": "assistant", "content": resp.content})

        if resp.stop_reason == "end_turn":
            return next(
                (b.text for b in resp.content if b.type == "text"),
                "(no text in final response)",
            )

        if resp.stop_reason == "tool_use":
            tool_results = []
            for block in resp.content:
                if block.type != "tool_use":
                    continue
                fn = TOOL_FNS[block.name]
                print(f"[yellow]→ {block.name}({block.input})[/]")
                try:
                    out = fn(**block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": out,
                    })
                except Exception as e:
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": f"ERROR: {e}",
                        "is_error": True,
                    })
            # Tool results go back to the model as a *user* message
            messages.append({"role": "user", "content": tool_results})
            continue

        raise RuntimeError(f"unexpected stop_reason: {resp.stop_reason}")

    return "(hit max_turns)"

if __name__ == "__main__":
    print(run_agent("What's the latest stable PyTorch release and when did it ship?"))
The whole pattern:
loop:
    call LLM with current messages + tool schemas
    if stop_reason == "end_turn": return final text
    if stop_reason == "tool_use": run tools, append results, continue
    else: you have a bug
That’s it. Every “agent framework” you’ve seen is a variation, abstraction, or extension of this loop. Knowing the loop in code is non-negotiable before you can evaluate a framework honestly.
Step 3 — Add a second tool: Python execution
A research agent that can’t do arithmetic or parse data is a clipping service. Add a sandboxed code-exec tool.
# tools.py (continued)
import subprocess, sys, tempfile

PY_EXEC = {
    "name": "python_exec",
    "description": (
        "Execute a short Python snippet in a fresh subprocess. "
        "Use for math, list manipulation, JSON parsing, date math. "
        "DO NOT use for network calls — use web_search instead."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "code": {"type": "string", "description": "Python source. stdout is returned."},
        },
        "required": ["code"],
    },
}

def run_python_exec(code: str) -> str:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        # sys.executable: run the snippet with this interpreter,
        # not whatever "python" happens to be on PATH
        r = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=10,
        )
        return f"stdout:\n{r.stdout}\nstderr:\n{r.stderr}"
    finally:
        os.unlink(path)
Security reality. A subprocess is not a sandbox. For anything beyond local experiments, run untrusted code in a real sandbox: Docker with no network, Modal, E2B, or Pyodide in a worker. Module 02.2 covers this in depth.
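If you have Docker locally, the minimal version of that boundary looks something like this (a sketch, not a hardened sandbox; the image choice and resource limits are assumptions):

# tools.py (optional Docker-backed variant)
import subprocess

def run_python_exec_docker(code: str) -> str:
    # Fresh container per call: no network, capped memory, removed on exit
    r = subprocess.run(
        ["docker", "run", "--rm", "--network=none", "--memory=256m",
         "python:3.12-slim", "python", "-c", code],
        capture_output=True, text=True, timeout=30,
    )
    return f"stdout:\n{r.stdout}\nstderr:\n{r.stderr}"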
Register the tool:
TOOLS = [WEB_SEARCH, PY_EXEC]
TOOL_FNS = {"web_search": run_web_search, "python_exec": run_python_exec}
Now ask: "How many seconds are in a fortnight, and when was the word coined?" Watch the agent split the question across both tools.
Step 4 — Tracing
If you can’t see what your agent did, you can’t debug it. Save every turn to disk.
# tracing.py
import json, time, uuid
from pathlib import Path

class Trace:
    def __init__(self, run_dir="runs"):
        self.run_id = f"{int(time.time())}-{uuid.uuid4().hex[:6]}"
        self.dir = Path(run_dir) / self.run_id
        self.dir.mkdir(parents=True, exist_ok=True)
        self.events = []

    def log(self, kind, payload):
        self.events.append({"ts": time.time(), "kind": kind, **payload})
        # Rewriting the whole file per event is fine at this scale, and the
        # JSONL on disk stays valid even if the run dies mid-turn
        (self.dir / "events.jsonl").write_text(
            "\n".join(json.dumps(e, default=str) for e in self.events)
        )
Wire it into the loop: log user_message, llm_response, every tool_call, every tool_result, and final_answer. Then write a one-page replay viewer (or just cat runs/<id>/events.jsonl | jq .).
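A replay viewer fits in a dozen lines (a sketch; it assumes the event names suggested above):

# replay.py
import json, sys
from pathlib import Path
from rich import print

def replay(run_id: str, run_dir: str = "runs") -> None:
    # Print each logged event in order, truncating long payload values
    for line in (Path(run_dir) / run_id / "events.jsonl").read_text().splitlines():
        e = json.loads(line)
        print(f"[bold]{e['kind']}[/]")
        for k, v in e.items():
            if k not in ("kind", "ts"):
                print(f"  {k}: {str(v)[:300]}")

if __name__ == "__main__":
    replay(sys.argv[1])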
This is the unsexy work. It is also the work that separates an agent that works on a demo from an agent that works on a Wednesday afternoon.
Step 5 — Eval
This is the step most students skip. It’s also the step that separates “vibes-based” agent work from real engineering.
Write 10 questions with known correct answers in a JSON file:
[
  {"q": "What's the latest stable Python 3.12.x patch release?", "expect_contains": ["3.12"]},
  {"q": "Who is the current CTO of Anthropic?", "expect_contains": ["Sam"]},
  {"q": "What is 7! + 3^5?", "expect_contains": ["5283"]},
  ...
]
Run your agent against each, log success rate, log per-question traces. Three numbers to track:
| Metric | Why it matters |
|---|---|
| Accuracy | Does it actually answer correctly? |
| Tool-call count | Cost / latency proxy |
| Failure mode | “Wrong answer” vs “loop hit max_turns” vs “tool error” |
Run the eval after every prompt change. Refuse to ship a change that drops accuracy without a clearly understood reason.
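A runner to match (a sketch; it assumes the file above is saved as eval.json and that run_agent imports from agent.py):

# eval.py
import json
from agent import run_agent

cases = json.loads(open("eval.json").read())
passed = 0
for case in cases:
    answer = run_agent(case["q"])
    # Crude substring check: every expected string must appear in the answer
    ok = all(s.lower() in answer.lower() for s in case["expect_contains"])
    passed += ok
    print(f"{'PASS' if ok else 'FAIL'}  {case['q']}")
print(f"\naccuracy: {passed}/{len(cases)} = {passed / len(cases):.0%}")

Substring matching is blunt, but it catches regressions; the per-question traces from Step 4 tell you why a case failed.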
Going deeper (resources, ranked)
When you have specific questions, in this order:
- anthropic-cookbook/tool_use — official patterns: parallel tools, retry, JSON-only output.
- Model Context Protocol — covered in module 02.3. The standardized way to expose tools to any model client.
- Anthropic — Constitutional AI / agentic harms — what can go wrong, by the people who study it.
- OpenAI Agents SDK — read the docs, not the tutorials. Compare its loop to yours.
- LangGraph — the most-used graph-style agent framework. Worth one project, then make up your own mind.
Skip the YouTube videos that say “Build an AI agent in 5 minutes.” They’re showing you a single LLM call with requests.
Checkpoints
If any one wobbles, reread the corresponding section.
- Walk through what stop_reason means in the agent loop, and what your code does for each value.
- Why does the assistant message need to be appended to history even when it contained tool calls? What breaks if you skip it?
- What’s the difference between a tool description and a system prompt, in terms of when each is most useful for steering behavior?
- Why is “vibes-based” agent eval dangerous? Name two specific failure modes a 10-question eval set would catch.
- Sketch how you’d add a third tool — say, read_file(path) — including its schema, its handler, and what its description should make clear so the model uses it correctly.
Pass all five and you’ve earned 02.1. Next stop: 02.3 MCP to expose your tools as a real server, or 02.4 Agent memory to give your agent something to remember between runs.