Inference, deployment, costs
vLLM, quantization, KV cache, batching. The economics of serving models — where most AI products' margins are actually decided.
Prerequisites
04.4
Stack
- vLLM or sglang
- Modal or RunPod for GPUs
- AWQ or GPTQ for quantization
- 1×A10 or 1×L4 to run experiments
By the end of this module
- Reason about LLM serving in tokens-per-dollar terms, not vibes.
- Serve a quantized 7B with vLLM and benchmark TTFT and tokens/sec at different batch sizes.
- Decide between calling an API and self-hosting based on real numbers.
- Pick the right quantization (INT8, INT4, AWQ, GPTQ, GGUF) for your serving constraints.
This is the module that decides whether your AI product has margins. Inference is where 80%+ of an AI product’s variable cost lives, and it’s also where most teams have absolutely no idea what they’re doing. They pick a model based on benchmarks, deploy it on whatever GPU was available, and wonder why their unit economics don’t work. Then they ship anyway and bleed money until somebody finally measures.
The opinion: tokens per dollar is the only metric that matters at the serving layer. Latency matters separately. Quality matters separately. But for any given quality bar, you are choosing between configurations that cost different amounts per million tokens, and most teams don’t know what theirs is. By the end of this module you’ll have served a model yourself, measured its real throughput, and built a calculator in your head for “should I run this or call an API?”
Set up
mkdir inference && cd inference
uv venv .venv && source .venv/bin/activate
uv pip install vllm anthropic openai requests numpy
# You need a real GPU. Modal is the lowest-friction path:
pip install modal
modal token new
You can run a small model on a 24GB GPU. For the bigger experiments (Llama-3-70B), you’ll want an A100 80GB. Rent it. Do not buy it.
Read these first
Three sources, in order, then stop:
- Kwon et al. — Efficient Memory Management for Large Language Model Serving with PagedAttention. arxiv · 30 min · the vLLM paper. The KV cache section is essential reading.
- Anyscale — Continuous batching. post · 20 min · why naive batching wastes most of your GPU.
- Maarten Grootendorst — A Visual Guide to Quantization. post · 30 min · the cleanest explanation of INT4/INT8 trade-offs. Often credited to Tim Dettmers, whose quantization work it builds on.
Skip the marketing pages from inference platform vendors. They’re optimized for “look how fast we are at the perfect benchmark” and tell you nothing about your workload.
Tokens per dollar
This is the unit of analysis. Every serving decision should reduce to: “given my quality bar, what’s the lowest cost per million tokens I can achieve?”
A few rough numbers to anchor on, late 2025:
| Source | Approx $ / 1M output tokens |
|---|---|
| Claude Sonnet 4.6 (API) | $15 |
| Claude Haiku 4.5 (API) | $5 |
| GPT-4 class (API) | $10–30 |
| Llama-3.1-8B self-hosted on 1×L4 | $1–3 |
| Llama-3.1-70B self-hosted on 1×A100 | $4–8 |
| GPT-OSS / Qwen2.5-72B self-hosted | similar to Llama-70B |
The gap between “API for top model” and “self-hosted open model” is roughly 5-10x. If your task can be done by an open 8B model, self-hosting may save you significant money. If it can’t, the API is almost always cheaper than trying.
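To turn a GPU rental price into a row you can compare against the API rows above, you need measured throughput and honest utilization. A rough sketch, with every number below a placeholder rather than a benchmark:
# Self-hosted cost per 1M output tokens, back of the envelope
gpu_hourly_usd = 0.80     # e.g. a rented L4; check your provider's current rate
tokens_per_sec = 500      # aggregate output throughput under load; measure yours
utilization = 0.25        # fraction of the day the GPU is actually saturated
usd_per_million = gpu_hourly_usd / (tokens_per_sec * utilization * 3600) * 1e6
print(f"${usd_per_million:.2f} per 1M output tokens")  # ≈ $1.78 with these placeholders
Utilization is the quiet killer: a GPU billed 24/7 but saturated only a quarter of the time costs 4x more per token than the spec-sheet math suggests.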
Latency budgets
Inference latency has three numbers, not one:
| Metric | What it measures | Why it matters |
|---|---|---|
| TTFT (Time to First Token) | Prompt → first output token | Felt as “is the model thinking?” |
| ITL (Inter-Token Latency) | Time between consecutive output tokens | Streaming feel, perceived speed |
| Total latency | TTFT + ITL × output length | Total wall time |
A chatbot with TTFT of 200ms and ITL of 30ms feels instant. The same model with TTFT of 2s and ITL of 30ms feels slow even though the total time is similar. Optimize for the metric your product actually needs.
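Measuring these yourself is a short script against vLLM’s OpenAI-compatible server. A sketch, assuming you’ve already started one locally (vllm serve <model>, default port 8000) and treating each streamed chunk as roughly one token:
# Measure TTFT, ITL and tokens/sec for a single streamed request
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
start = time.perf_counter()
first = None
n = 0
stream = client.chat.completions.create(
    model="TheBloke/Llama-2-7B-Chat-AWQ",  # must match the model the server is running
    messages=[{"role": "user", "content": "Explain quantum tunneling in one paragraph."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first is None:
            first = time.perf_counter()
        n += 1  # each streamed chunk ≈ one token
end = time.perf_counter()
print(f"TTFT {1000*(first-start):.0f} ms | ITL {1000*(end-first)/max(n-1,1):.1f} ms | {n/(end-start):.0f} tok/s")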
Continuous batching
Naive batching: collect N requests, run them together, return results. Problem: requests have different lengths. Short ones finish first and the GPU sits idle waiting for the long ones.
Continuous batching: at every model step, evaluate which sequences are still generating, swap finished ones out, swap new ones in. The GPU never idles waiting for stragglers.
Static batching:
[req1: short ████ idle idle idle]
[req2: long █████████████ ]
[req3: med ███████ idle idle ]
↑ wasted GPU time
Continuous batching:
[req1: short ████] [req4: med ███████]
[req2: long █████████████]
[req3: med ███████] [req5: short ████]
↑ slots reused as soon as freed
vLLM and sglang both implement this. It’s a 3-5x throughput win versus naive serving frameworks.
KV cache — the memory you forgot about
When you generate token N, attention needs the keys and values for tokens 1..N-1. Without caching, you’d recompute them every step (O(n²) compute over the whole sequence). With caching, you store K and V from previous steps and reuse them.
The KV cache size is roughly:
kv_size_bytes = 2 × n_layers × n_kv_heads × head_dim × seq_len × batch_size × dtype_bytes
(n_kv_heads equals n_heads for classic multi-head attention and is smaller for GQA models.) For a Llama-2-7B-class model (MHA, 32 heads) in fp16 that’s roughly 0.5 MB per token per request: a 4K context with 16 concurrent requests is 32 GB of KV cache alone. GQA models like Llama-3-8B (8 KV heads) cut that 4×, to about 8 GB. Either way, this is why long contexts are expensive — once context and batch size grow, it’s KV cache, not weights, that dominates memory at serving time.
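Worth sanity-checking with the formula before you trust a capacity plan. A quick sketch; the head counts are the published configs for these two models, everything else is illustrative:
# KV cache for 16 concurrent requests at 4K context, 2-byte (fp16/bf16) cache entries
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, dtype_bytes=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * dtype_bytes

for name, n_kv_heads in [("Llama-2-7B (MHA)", 32), ("Llama-3-8B (GQA)", 8)]:
    total = kv_cache_bytes(n_layers=32, n_kv_heads=n_kv_heads, head_dim=128,
                           seq_len=4096, batch_size=16)
    print(f"{name}: {total / 2**30:.0f} GiB of KV cache")  # 32 GiB vs 8 GiB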
PagedAttention (vLLM’s contribution) treats KV cache like virtual memory. It allocates in fixed-size pages, allows non-contiguous storage, and shares pages across requests with shared prefixes. Result: dramatically less wasted memory, higher max batch size, more throughput per GPU.
Quantization — the practical view
Quantization replaces fp16 weights with INT8 or INT4 representations. Smaller, faster, with some quality loss.
| Method | Bits | Quality cost | Use when |
|---|---|---|---|
| INT8 (LLM.int8) | 8 | Negligible | Easy memory savings, mild speedup |
| AWQ | 4 | Small (1-2% benchmarks) | Best quality at INT4. GPU serving. |
| GPTQ | 4 | Small | Older, similar to AWQ. Wider tooling support. |
| GGUF (Q4_K_M etc.) | 4–5 | Small to moderate | CPU serving via llama.cpp. |
| INT4 plain | 4 | Bigger | Avoid; AWQ/GPTQ exist for a reason. |
| FP8 | 8 | Tiny | Hopper GPUs (H100), production serving. |
Default: AWQ for GPU serving, GGUF for CPU/Apple Silicon. If you’re on H100s and your inference engine supports it, FP8 is even better.
# Serving an AWQ-quantized model with vLLM
from vllm import LLM, SamplingParams
llm = LLM(
model="TheBloke/Llama-2-7B-Chat-AWQ",
quantization="awq",
dtype="float16",
gpu_memory_utilization=0.9,
max_model_len=4096,
)
prompts = ["Explain quantum tunneling in one paragraph."]
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(prompts, params)
print(outputs[0].outputs[0].text)
Speculative decoding (briefly)
A small “draft” model proposes the next K tokens. The big model verifies them in parallel. If most are accepted, you got K tokens for roughly the cost of one forward pass. Rejected tokens are thrown away and generation continues from the last accepted one. Average speedup: 2-3x for autoregressive workloads.
Useful when latency matters more than throughput. Most managed inference platforms (Together, Fireworks, Anthropic for the closed models) use it transparently. Self-hosting it requires a draft model and some plumbing — worth it for latency-critical apps.
The serve-or-call decision
A simple framework. Estimate four numbers:
| Number | How to get it |
|---|---|
| Daily token volume | Count tokens × users × usage rate |
| API cost at that volume | Multiply by API rates above |
| Self-host GPU cost | $0.50–$3/hr × 24 × your fleet size |
| Engineering cost to run it | At least 1 engineer, 20% of their time |
The rule:
If daily volume × API rate > self-host fixed cost + ops:
self-hosting MIGHT be cheaper, run real numbers
Else:
pay the API. Stop thinking about it.
For a typical SaaS chatbot with 10K queries/day: API. For a high-volume document processing pipeline at 10M tokens/hour: probably self-host. The crossover is real but it’s higher than people think because GPUs require ops.
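Put numbers on it before arguing about it. A sketch with made-up inputs; replace them with yours:
# Serve-or-call, monthly
daily_output_tokens = 10_000_000      # e.g. 50K requests/day × 200 output tokens each
api_usd_per_million = 5.0             # Haiku-class rate from the table above

gpu_hourly_usd = 1.00                 # one L4-class GPU, on demand
gpus = 1
eng_monthly_usd = 0.20 * 15_000       # ~20% of one engineer's loaded monthly cost (assumption)

api_monthly = daily_output_tokens / 1e6 * api_usd_per_million * 30
selfhost_monthly = gpu_hourly_usd * 24 * 30 * gpus + eng_monthly_usd
print(f"API ${api_monthly:,.0f}/mo vs self-host ${selfhost_monthly:,.0f}/mo")  # $1,500 vs $3,720
At these inputs the API wins comfortably, and the crossover doesn’t arrive until daily volume is several times higher, before you even count the ops burden.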
The build
End-to-end exercise on a single L4 (24GB):
- Pull Llama-3.1-8B-AWQ from Hugging Face.
- Serve with vLLM. Set max_model_len=4096, gpu_memory_utilization=0.9.
- Send sequential single-stream requests. Measure TTFT and tokens/sec for output_length=256.
- Send concurrent requests at batch sizes 1, 4, 8, 16, 32. Measure aggregate tokens/sec.
- Plot throughput vs concurrency. Find the knee.
- Switch to a non-quantized 8B base. Compare throughput. Measure quality on a 50-prompt eval. Decide whether quant is “free.”
You should see throughput climb dramatically from concurrency 1 to 8, then plateau. The plateau is where your KV cache or compute budget runs out. Now you know your serving capacity in tokens/sec on this hardware.
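A sketch of the concurrency sweep using the offline vLLM engine (a real benchmark should go through the HTTP server, but this gives you the shape of the curve). The model name is just an example; use whichever AWQ checkpoint you pulled in step 1:
# Aggregate tokens/sec vs concurrency, offline engine
import time
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ",  # swap in the AWQ checkpoint from step 1
          quantization="awq", max_model_len=4096)
params = SamplingParams(temperature=0.7, max_tokens=256)

for concurrency in [1, 4, 8, 16, 32]:
    prompts = [f"Write a short paragraph about topic #{i}." for i in range(concurrency)]
    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    total = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"concurrency {concurrency:>2}: {total/elapsed:7.1f} tok/s aggregate")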
The “rent before buy” rule
Buying a GPU server costs $10K-$200K up front. Renting one costs $0.50-$5/hr. Until you have a year of stable, predictable demand at high utilization, rent. The math almost never favors buying for sub-7-figure-revenue companies, and even then “we have stable demand” is the part teams get wrong.
Specifically:
- Modal, RunPod, Lambda Labs, Together, Fireworks: rent on-demand. Pay by the second.
- Reserved instances (1-year, 3-year): save ~40% but lock you in.
- Buying: only if utilization is ~80%+ for over a year, you have devops, and you have capital sitting idle.
Going deeper
When you have specific questions, in this order:
- vLLM docs — the production reference. The “Production Stack” page is the most useful.
- sglang — vLLM’s main competitor. Faster on some workloads, worse tooling.
- Pope et al. — Efficiently Scaling Transformer Inference. arxiv — Google’s paper on inference at scale. Heavy reading; read after you’ve served something.
- llama.cpp — when CPU/Mac serving matters. The GGUF ecosystem lives here.
Skip the “serverless GPU” pitches that promise zero ops. Cold starts and per-second billing make them great for spiky traffic and bad for steady-state — read their pricing carefully against your workload.
Checkpoints
If any of these wobble, reread the corresponding section.
- For a 7B model serving 4K-context requests, what’s the rough KV cache memory per concurrent request? Why does this cap your max batch size?
- Walk through what continuous batching does at the GPU step level. Why does it beat static batching by 3-5x?
- Your TTFT is fine but your tokens/sec is poor under load. What does that tell you about where the bottleneck is, and what would you change?
- AWQ vs GPTQ vs GGUF — when is each the right choice? Be specific about the deployment target.
- You have 50K daily users sending 500-token prompts and receiving 200-token responses. At $5 per 1M output tokens on the API, what’s the monthly cost? At what user volume does self-hosting Llama-8B start to look reasonable?
When you can answer all five from memory, move to 05.1 Designing systems on a whiteboard. Inference economics are one slice of a larger system; the next track is about the rest of it.