$ yuktics v0.1

T4 — AI Literacy & Engineering · module 02.1 · ~6–10 hrs

Build an AI agent

Go from a single LLM call to a working tool-using agent. Build a research agent with tracing and evals — small enough to fit in 200 lines, real enough to actually use.

Prerequisites

  • Python 3.12
  • an Anthropic API key
  • module 01.1 (or equivalent)
  • comfort with HTTP / JSON schemas

Stack

  • anthropic >= 0.45
  • Claude Sonnet 4.6
  • tavily-python (or exa-py)
  • python-dotenv
  • rich (for nice traces)

By the end of this module

  • Implement the basic agent loop (LLM ↔ tools) by hand, no framework.
  • Define tool schemas that the model actually uses correctly.
  • Trace every step of an agent run and replay failures.
  • Write a small eval set and measure your agent's accuracy on it.

The word agent gets used to mean a lot of different things in 2026. In this module it means exactly one thing: an LLM in a loop, with tools, that decides at each step what to do next. Almost every interesting “AI agent” product is some variant of that. You don’t need a framework to build one — you need to understand the loop.

In this module you’ll build a real research agent. By the end it can take a question, search the web, read pages, run small Python snippets to do math or parse data, and produce a sourced answer. Roughly 200 lines of Python, no framework dependencies, fully traced and evaluable.

Set up

mkdir agent && cd agent
uv venv .venv && source .venv/bin/activate
uv pip install anthropic tavily-python python-dotenv rich

cat > .env <<'EOF'
ANTHROPIC_API_KEY=sk-ant-...
TAVILY_API_KEY=tvly-...
EOF

git init && printf ".env\n.venv/\n" >> .gitignore

If you don’t have a Tavily key, use Exa or Brave Search API — same shape, swap the client. The point is some search backend you can call from Python.
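
If you go the Exa route, the swap might look like this. A minimal sketch, assuming exa-py's search_and_contents method and an EXA_API_KEY env var (that variable name is this sketch's choice); the WEB_SEARCH schema from Step 1 stays exactly the same:

# tools.py — Exa variant of the search handler (a sketch, assuming exa-py)
import os
from exa_py import Exa

exa = Exa(api_key=os.getenv("EXA_API_KEY"))

def run_web_search(query: str, max_results: int = 5) -> str:
    # search_and_contents fetches page text along with the hits
    results = exa.search_and_contents(query, num_results=max_results, text=True)
    return "\n\n".join(
        f"# {r.title}\n{r.url}\n{r.text}"
        for r in results.results
    )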

Read these first

Three sources, in this order, then stop:

  1. Anthropic — Building effective agents. post · 20 min · the framing this module follows
  2. Anthropic — Tool use docs. docs · 15 min · canonical for the API shape
  3. Schick et al. — Toolformer. arxiv · 30 min · why tool use works at all

You’ll be tempted to read about LangChain, LangGraph, AutoGen, CrewAI, and the rest. Don’t, yet. Build it without a framework first — you’ll evaluate frameworks in 02.5 with a much sharper opinion.

Step 0 — The smallest possible “agent”

Strip out everything you’ve read about agents and write the dumbest version. One LLM call. No tools. No loop.

# v0.py
import os
from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()
client = Anthropic()

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "What's 47 * 89?"}],
)
print(resp.content[0].text)

Run it. The model will probably get the math right, but trust nothing — try a question that needs current information (“What’s the latest stable PyTorch version?”). Watch it confidently make something up. That’s the gap tools close.

Step 1 — Define a tool

A tool, as far as the model is concerned, is a JSON schema and a name. The runtime is yours.

# tools.py
from tavily import TavilyClient
import os

tavily = TavilyClient(api_key=os.getenv("TAVILY_API_KEY"))

WEB_SEARCH = {
    "name": "web_search",
    "description": (
        "Search the web for recent or specific information. "
        "Use when the user's question depends on current facts, "
        "specific URLs, or anything you might be wrong about."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search query, 3-10 words."},
            "max_results": {"type": "integer", "default": 5, "minimum": 1, "maximum": 10},
        },
        "required": ["query"],
    },
}

def run_web_search(query: str, max_results: int = 5) -> str:
    results = tavily.search(query=query, max_results=max_results)
    return "\n\n".join(
        f"# {r['title']}\n{r['url']}\n{r['content']}"
        for r in results["results"]
    )

A few things to internalize:

  • The description is the prompt for the tool. If the model is calling it wrongly or not calling it when it should, fix the description before fixing anything else.
  • The schema is guidance, not a guarantee. The model usually respects it, but nothing stops an out-of-range max_results from reaching your code. Validate in the handler (see the sketch after this list).
  • Required vs optional matters. Be ruthless. Optional fields are where models hallucinate.
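
Since the schema won't save you, make the handler defensive. A minimal sketch that clamps out-of-range values instead of trusting them:

# tools.py — defensive version of the search handler
def run_web_search(query: str, max_results: int = 5) -> str:
    # Clamp to the schema's bounds; don't trust the model to respect them.
    max_results = max(1, min(int(max_results), 10))
    results = tavily.search(query=query, max_results=max_results)
    return "\n\n".join(
        f"# {r['title']}\n{r['url']}\n{r['content']}"
        for r in results["results"]
    )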

Step 2 — The agent loop

This is the whole conceptual content of the module. The rest is plumbing.

# agent.py
import json
from rich import print
from anthropic import Anthropic
from tools import WEB_SEARCH, run_web_search

client = Anthropic()
TOOLS = [WEB_SEARCH]
TOOL_FNS = {"web_search": run_web_search}

def run_agent(user_msg: str, max_turns: int = 8) -> str:
    messages = [{"role": "user", "content": user_msg}]

    for turn in range(max_turns):
        resp = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            tools=TOOLS,
            messages=messages,
        )
        print(f"[bold cyan]turn {turn} — stop_reason:[/] {resp.stop_reason}")

        # Always append assistant message back into history
        messages.append({"role": "assistant", "content": resp.content})

        if resp.stop_reason == "end_turn":
            return next(
                (b.text for b in resp.content if b.type == "text"),
                "(no text in final response)",
            )

        if resp.stop_reason == "tool_use":
            tool_results = []
            for block in resp.content:
                if block.type != "tool_use":
                    continue
                fn = TOOL_FNS[block.name]
                print(f"[yellow]→ {block.name}({block.input})[/]")
                try:
                    out = fn(**block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": out,
                    })
                except Exception as e:
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": f"ERROR: {e}",
                        "is_error": True,
                    })
            messages.append({"role": "user", "content": tool_results})
            continue

        raise RuntimeError(f"unexpected stop_reason: {resp.stop_reason}")

    return "(hit max_turns)"

if __name__ == "__main__":
    print(run_agent("What's the latest stable PyTorch release and when did it ship?"))

The whole pattern:

loop:
  call LLM with current messages + tool schemas
  if stop_reason == "end_turn":   return final text
  if stop_reason == "tool_use":   run tools, append results, continue
  else:                           you have a bug

That’s it. Every “agent framework” you’ve seen is a variation, abstraction, or extension of this loop. Knowing the loop in code is non-negotiable before you can evaluate a framework honestly.

Step 3 — Add a second tool: Python execution

A research agent that can’t do arithmetic or parse data is a clipping service. Add a sandboxed code-exec tool.

# tools.py (continued)
import subprocess, sys, tempfile, os

PY_EXEC = {
    "name": "python_exec",
    "description": (
        "Execute a short Python snippet in a fresh subprocess. "
        "Use for math, list manipulation, JSON parsing, date math. "
        "DO NOT use for network calls — use web_search instead."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "code": {"type": "string", "description": "Python source. stdout is returned."},
        },
        "required": ["code"],
    },
}

def run_python_exec(code: str) -> str:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        r = subprocess.run(
            ["python", path],
            capture_output=True, text=True, timeout=10,
        )
        return f"stdout:\n{r.stdout}\nstderr:\n{r.stderr}"
    finally:
        os.unlink(path)

Security reality. A subprocess is not a sandbox. For anything beyond local experiments, run untrusted code in a real sandbox: Docker with no network, Modal, E2B, or Pyodide in a worker. Module 02.2 covers this in depth.
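
If you want a taste of the real thing now, here is a minimal sketch of the same handler behind Docker, assuming Docker is installed and the python:3.12-slim image is pulled (both assumptions; module 02.2 does this properly):

# tools.py — sandboxed variant (a sketch, assuming Docker + python:3.12-slim)
import subprocess, tempfile, os

def run_python_exec_sandboxed(code: str) -> str:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        r = subprocess.run(
            ["docker", "run", "--rm",
             "--network", "none",        # no network inside the container
             "--memory", "256m",         # cap memory
             "-v", f"{path}:/snippet.py:ro",
             "python:3.12-slim", "python", "/snippet.py"],
            capture_output=True, text=True, timeout=30,
        )
        return f"stdout:\n{r.stdout}\nstderr:\n{r.stderr}"
    finally:
        os.unlink(path)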

Register the tool:

TOOLS = [WEB_SEARCH, PY_EXEC]
TOOL_FNS = {"web_search": run_web_search, "python_exec": run_python_exec}

Now ask: "How many seconds are in a fortnight, and when was the word coined?" Watch the agent split the question across both tools.

Step 4 — Tracing

If you can’t see what your agent did, you can’t debug it. Save every turn to disk.

# tracing.py
import json, time, uuid
from pathlib import Path

class Trace:
    def __init__(self, run_dir="runs"):
        self.run_id = f"{int(time.time())}-{uuid.uuid4().hex[:6]}"
        self.dir = Path(run_dir) / self.run_id
        self.dir.mkdir(parents=True, exist_ok=True)
        self.events = []

    def log(self, kind, payload):
        event = {"ts": time.time(), "kind": kind, **payload}
        self.events.append(event)
        # Append one JSON line per event instead of rewriting the whole file.
        with (self.dir / "events.jsonl").open("a") as f:
            f.write(json.dumps(event, default=str) + "\n")

Wire it into the loop: log user_message, llm_response, every tool_call, every tool_result, and final_answer. Then write a one-page replay viewer (or just cat runs/<id>/events.jsonl | jq).
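
Concretely, the wiring is a handful of trace.log calls dropped into the Step 2 loop. A sketch (same loop; client, TOOLS, and TOOL_FNS are the objects already defined at the top of agent.py):

# agent.py — the Step 2 loop with tracing wired in
from tracing import Trace

def run_agent(user_msg: str, max_turns: int = 8) -> str:
    trace = Trace()
    trace.log("user_message", {"content": user_msg})
    messages = [{"role": "user", "content": user_msg}]

    for turn in range(max_turns):
        resp = client.messages.create(
            model="claude-sonnet-4-6", max_tokens=4096,
            tools=TOOLS, messages=messages,
        )
        # Log resp.content too if you want full replay; Trace's default=str
        # handles the SDK's block objects.
        trace.log("llm_response", {"turn": turn, "stop_reason": resp.stop_reason})
        messages.append({"role": "assistant", "content": resp.content})

        if resp.stop_reason == "end_turn":
            answer = next(
                (b.text for b in resp.content if b.type == "text"),
                "(no text in final response)",
            )
            trace.log("final_answer", {"text": answer})
            return answer

        if resp.stop_reason == "tool_use":
            tool_results = []
            for block in resp.content:
                if block.type != "tool_use":
                    continue
                trace.log("tool_call", {"name": block.name, "input": block.input})
                try:
                    out = TOOL_FNS[block.name](**block.input)
                    result = {"type": "tool_result", "tool_use_id": block.id,
                              "content": out}
                except Exception as e:
                    result = {"type": "tool_result", "tool_use_id": block.id,
                              "content": f"ERROR: {e}", "is_error": True}
                trace.log("tool_result", result)
                tool_results.append(result)
            messages.append({"role": "user", "content": tool_results})
            continue

        raise RuntimeError(f"unexpected stop_reason: {resp.stop_reason}")

    trace.log("final_answer", {"text": "(hit max_turns)"})
    return "(hit max_turns)"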

This is the unsexy work. It is also the work that separates an agent that works on a demo from an agent that works on a Wednesday afternoon.

Step 5 — Eval

This is the step most students skip. It’s also the step that separates “vibes-based” agent work from real engineering.

Write 10 questions with known correct answers in a JSON file:

[
  {"q": "What's the latest stable Python 3.12.x patch release?", "expect_contains": ["3.12"]},
  {"q": "Who is the current CTO of Anthropic?", "expect_contains": ["Sam"]},
  {"q": "What is 7! + 3^5?", "expect_contains": ["5283"]},
  ...
]

Run your agent against each, log success rate, log per-question traces. Three numbers to track:

  Metric            Why it matters
  Accuracy          Does it actually answer correctly?
  Tool-call count   Cost / latency proxy
  Failure mode      “Wrong answer” vs “loop hit max_turns” vs “tool error”

Run the eval after every prompt change. Refuse to ship a change that drops accuracy without a clearly understood reason.
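
A minimal runner sketch, assuming the questions above live in questions.json and run_agent is importable from agent.py (both file names are this sketch's choice):

# eval.py — run every question, check expected substrings, report accuracy
import json
from agent import run_agent

cases = json.load(open("questions.json"))
passed = 0
for case in cases:
    answer = run_agent(case["q"])
    ok = all(s.lower() in answer.lower() for s in case["expect_contains"])
    passed += ok
    print(f"{'PASS' if ok else 'FAIL'}  {case['q']}")
print(f"\naccuracy: {passed}/{len(cases)}")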

Going deeper (resources, ranked)

When you have specific questions, in this order:

  1. anthropic-cookbook/tool_use — official patterns: parallel tools, retry, JSON-only output.
  2. Model Context Protocol — covered in module 02.3. The standardized way to expose tools to any model client.
  3. Anthropic — Constitutional AI / agentic harms — what can go wrong, by the people who study it.
  4. OpenAI Agents SDK — read the docs, not the tutorials. Compare its loop to yours.
  5. LangGraph — the most-used graph-style agent framework. Worth one project, then make up your own mind.

Skip the YouTube videos that say “Build an AI agent in 5 minutes.” They’re showing you a single LLM call with requests.

Checkpoints

If any one wobbles, reread the corresponding section.

  1. Walk through what stop_reason means in the agent loop, and what your code does for each value.
  2. Why does the assistant message need to be appended to history even when it contained tool calls? What breaks if you skip it?
  3. What’s the difference between a tool description and a system prompt, in terms of when each is most useful for steering behavior?
  4. Why is “vibes-based” agent eval dangerous? Name two specific failure modes a 10-question eval set would catch.
  5. Sketch how you’d add a third tool — say, read_file(path) — including its schema, its handler, and what its description should make clear so the model uses it correctly.

Pass all five and you’ve earned 02.1. Next stop: 02.3 MCP to expose your tools as a real server, or 02.4 Agent memory to give your agent something to remember between runs.