Observability and ops
Logs, metrics, traces. What to instrument, what alerts mean, what '99.9% uptime' actually buys you in minutes per month.
Prerequisites
03.5
Stack
- OpenTelemetry
- Grafana + Loki + Prometheus + Tempo (or Honeycomb / Datadog)
- Sentry
By the end of this module
- Pick the right pillar (logs, metrics, or traces) for a given debugging question.
- Write structured logs with correlation IDs that survive across service boundaries.
- Set SLOs and alerting rules that don't page you for nothing.
- Instrument an app with OpenTelemetry, Sentry, and one Grafana dashboard.
Observability is the difference between “the site is down and we don’t know why” and “page p99 spiked at 14:23 because a deploy increased downstream calls by 3x.” The first is a Tuesday for most teams; the second takes five minutes of work and ten minutes of curiosity. This module is about the work and the curiosity.
The opinion: you do not need Datadog yet. You need the open-source LGTM stack (Loki, Grafana, Tempo, Mimir/Prometheus), or you need Honeycomb if you want managed and have budget. Datadog is a fine product whose pricing makes it the wrong choice for any company that hasn’t crossed a specific revenue line. Most teams pay for it because their CTO used it at a previous job, not because they evaluated it. The same goes for instrumentation: most “we have observability” really means “we have a vendor invoice.”
Set up
mkdir obs && cd obs
uv venv .venv && source .venv/bin/activate
uv pip install fastapi uvicorn opentelemetry-distro \
opentelemetry-exporter-otlp opentelemetry-instrumentation-fastapi \
opentelemetry-instrumentation-requests sentry-sdk
# Local LGTM stack via docker
cat > docker-compose.yml <<'EOF'
services:
  grafana:
    image: grafana/otel-lgtm
    ports: ["3000:3000", "4317:4317", "4318:4318"]
EOF
docker compose up -d
That grafana/otel-lgtm image bundles Grafana + Loki + Prometheus + Tempo with OTLP receivers. One container to start; swap to managed services when you outgrow it.
Read these first
Three sources, in order, then stop:
- Google — Site Reliability Engineering, chapter 6 (Monitoring Distributed Systems). free online · 30 min · the canonical “four golden signals” framework, written by the team that operationalized it.
- Charity Majors — Observability: A 3-Year Retrospective. post · 30 min · the case for traces over metrics for debugging unknown unknowns.
- Honeycomb — Observability Engineering book, sample chapters. book · 1 hr · pragmatic and vendor-honest; the book itself is paywalled, the sample is enough for now.
Skip “monitoring vs observability” Twitter threads. They’re correct in spirit and useless in practice.
The three pillars
| Pillar | What it answers | Cost driver |
|---|---|---|
| Logs | “What happened in this specific request?” | Volume × retention |
| Metrics | “How is the system behaving in aggregate?” | Cardinality |
| Traces | “How did this request flow through the system?” | Sample rate × span count |
Use them together. Logs are where you go to read narrative. Metrics are where you set alerts. Traces are where you debug “why is this one request slow.”
A reasonable starter rule: log every request with a correlation ID, expose Prometheus metrics for the four golden signals, sample traces at 1-10% in production. You can dial up later.
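The sampling knob lives on the OpenTelemetry tracer provider (set up fully in the tracing section below). A minimal sketch at 10%; the exact rate is yours to tune:
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased
# Keep 10% of new traces; honor the caller's decision for propagated ones
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))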
Structured logging
The biggest leverage move in observability: stop using printf-style logs. Use structured logs (JSON) with a stable set of fields.
import logging, json, sys, uuid
from contextvars import ContextVar
trace_id_var: ContextVar[str] = ContextVar("trace_id", default="")
class JSONFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
            "logger": record.name,
            "trace_id": trace_id_var.get(),
        }
        if record.exc_info:
            payload["exc"] = self.formatException(record.exc_info)
        # Merge any extra= fields, skipping the standard LogRecord attributes
        standard = logging.LogRecord("", 0, "", 0, "", (), None).__dict__
        for k, v in record.__dict__.items():
            if k not in standard and k not in payload:
                payload[k] = v
        return json.dumps(payload)
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JSONFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])
# Middleware: assign trace_id per request (app is your FastAPI instance)
@app.middleware("http")
async def add_trace_id(request, call_next):
    tid = request.headers.get("x-trace-id", uuid.uuid4().hex)
    token = trace_id_var.set(tid)
    try:
        response = await call_next(request)
        response.headers["x-trace-id"] = tid
        return response
    finally:
        trace_id_var.reset(token)
Now every log line carries a trace_id, and you can grep across services for a single request.
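Request-specific fields ride along via logging’s extra= and get merged by the formatter above. The field names here are illustrative:
log = logging.getLogger("api")
# extra= attributes land on the LogRecord and survive the merge loop above
log.info("user fetched", extra={"user_id": "u_123", "cache": "miss"})
# -> {"ts": "...", "level": "INFO", "msg": "user fetched", "logger": "api",
#     "trace_id": "...", "user_id": "u_123", "cache": "miss"}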
The four golden signals
From Google’s SRE book, these are the four signals to track, and usually to alert on, for every user-facing service:
| Signal | Definition |
|---|---|
| Latency | Time to serve requests, split by success and failure |
| Traffic | Requests per second |
| Errors | Failed-request rate |
| Saturation | How “full” the service is — CPU, memory, queue depth |
Track p50, p95, p99 separately. Averages lie. The user who sees p99 latency is having a bad day even if p50 looks fine.
from prometheus_client import Counter, Histogram, make_asgi_app
http_requests = Counter(
    "http_requests_total",
    "HTTP requests by route, method, status",
    ["route", "method", "status"],
)
http_duration = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["route", "method"],
    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
)
app.mount("/metrics", make_asgi_app())
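The definitions above export nothing until something records into them. A sketch of a recording middleware, assuming the FastAPI app from earlier; route_template is a helper written here, not a library call:
import time
from starlette.routing import Match
def route_template(request) -> str:
    # Resolve the matched route template ("/users/{id}"), not the raw URL,
    # so the route label stays bounded
    for route in request.app.routes:
        match, _ = route.matches(request.scope)
        if match == Match.FULL:
            return route.path
    return "unmatched"
@app.middleware("http")
async def record_metrics(request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    route = route_template(request)
    http_requests.labels(route, request.method, str(response.status_code)).inc()
    http_duration.labels(route, request.method).observe(time.perf_counter() - start)
    return response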
Cardinality — the number that blows up your bill
Metric labels (route, method, status) explode combinatorially. Each unique label combination is a separate time series. If you label by user_id, you have one time series per user — and you’ve just turned your metrics database into a logs database, badly.
| Label | Cardinality | OK? |
|---|---|---|
| HTTP method | ~5 | Yes |
| Route name | ~50-200 | Yes |
| Status code | ~10 | Yes |
| Region/AZ | ~10 | Yes |
| User ID | millions | NO |
| Request ID | per-request | NO, NEVER |
| URL with query string | unbounded | NO |
The rule: labels are for grouping aggregates, not for identifying individuals. If you need to find “this specific request,” use logs and traces. Metrics are aggregates.
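The payoff of bounded labels is that aggregation stays cheap. Per-route error rate from the counters above, as a PromQL sketch:
sum by (route) (rate(http_requests_total{status=~"5.."}[5m]))
  / sum by (route) (rate(http_requests_total[5m]))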
SLO, SLI, SLA — what they actually mean
Lots of teams use these terms wrong.
| Term | What it is |
|---|---|
| SLI (Indicator) | A measured number — “p99 latency of GET /api” |
| SLO (Objective) | A target on an SLI — “p99 latency under 500ms over a rolling 30 days” |
| SLA (Agreement) | A contractual promise to a customer, with consequences |
Most internal teams need SLOs, not SLAs. SLAs are customer contracts and require legal involvement. SLOs are internal targets that dictate operational behavior — “if we burn through our error budget, the next sprint is reliability work.”
What “99.9% uptime” actually buys you per month:
| SLO | Allowed monthly downtime |
|---|---|
| 99% | ~7.2 hours |
| 99.9% | ~43 minutes |
| 99.95% | ~22 minutes |
| 99.99% | ~4.3 minutes |
| 99.999% | ~26 seconds |
Each extra nine costs roughly 10x more engineering. Pick the SLO that matches what users actually need, not what sounds impressive.
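The table is plain arithmetic; worth deriving once so the numbers stick:
# Allowed downtime for a given SLO over a 30-day month
def downtime_minutes(slo: float, days: float = 30) -> float:
    return (1 - slo) * days * 24 * 60
print(downtime_minutes(0.999))   # 43.2 minutes
print(downtime_minutes(0.9999))  # 4.32 minutes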
Alerting that doesn’t page for nothing
Three rules.
- Alert on user-visible symptoms, not internal causes. “p99 latency over SLO” is a good page. “CPU over 80%” is not (CPU may be high and users may be fine).
- Multi-window, multi-burn-rate. A single threshold means you alert on a brief blip. The Google SRE workbook has the math: combine a fast-burn alert (problem in the last hour) and a slow-burn alert (problem over the last day); a sketch follows the starter rules below.
- Every page must be actionable. If the on-call cannot do anything about it at 3 AM, it is not a page. Move it to a ticket.
A sensible starter alert set for a web service:
# Prometheus-flavored alert rules
- alert: HighLatency
  expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 0.5
  for: 10m
  labels: { severity: page }
- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
  for: 5m
  labels: { severity: page }
- alert: SaturationDiskFilling
  expr: predict_linear(node_filesystem_free_bytes[1h], 4*3600) < 0
  for: 30m
  labels: { severity: ticket }
Notice the disk-fill alert is a ticket, not a page. There’s nothing useful to do at 3 AM about a disk that will fill in 4 hours. Page when human action is needed now.
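For rule 2, here is a sketch of the fast-burn half, assuming a 99.9% SLO over a 30-day window; the SRE Workbook's 14.4 factor is the burn rate that spends 2% of the monthly error budget in one hour. The slow-burn variant uses a 6x factor over 6h/30m windows and files a ticket instead of paging:
- alert: ErrorBudgetFastBurn
  expr: |
    (sum(rate(http_requests_total{status=~"5.."}[1h])) / sum(rate(http_requests_total[1h])) > 14.4 * 0.001)
    and
    (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 14.4 * 0.001)
  for: 2m
  labels: { severity: page }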
Distributed tracing with OpenTelemetry
A trace is a tree of spans. Each span represents a unit of work (HTTP request, DB query, external call). Trace context travels between services in a traceparent header, so spans emitted by different services can be assembled into one tree.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)))
trace.set_tracer_provider(provider)
FastAPIInstrumentor.instrument_app(app)
tracer = trace.get_tracer(__name__)
# db and r are your existing database and Redis clients
@app.get("/users/{id}")
async def get_user(id: str):
    with tracer.start_as_current_span("get_user") as span:
        span.set_attribute("user.id", id)
        with tracer.start_as_current_span("db_query"):
            user = await db.fetch_user(id)
        with tracer.start_as_current_span("cache_write"):
            await r.set(f"user:{id}", json.dumps(user))
        return user
Now in Tempo (or Jaeger, or Honeycomb), you can search for a slow request and see exactly which span took the time. This is a different debugging skill from grepping logs — it teaches you to see request shapes, not just events.
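One gap in the snippet above: spans only line up across services if outgoing calls carry the context. With the requests instrumentation installed in the setup step, one call turns it on:
from opentelemetry.instrumentation.requests import RequestsInstrumentor
# Every outgoing requests call now injects the traceparent header
RequestsInstrumentor().instrument()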
Sentry — the third tool
Logs and metrics tell you what is happening. Traces tell you how things flow. Sentry tells you which exact line of code threw an exception, with the full stack and the local variables.
import sentry_sdk

sentry_sdk.init(
    dsn="https://...@sentry.io/...",
    traces_sample_rate=0.1,
    profiles_sample_rate=0.1,
    environment="prod",
    release="api@1.42.0",
)
Add this. It costs nothing on a hobby project (free tier is generous) and saves hours per bug. Most teams treat Sentry as the “production debugger” and that’s correct — it’s complementary to OpenTelemetry, not competing.
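Unhandled exceptions in instrumented frameworks are reported automatically; for errors you catch and handle, report explicitly. A sketch where charge and PaymentError are placeholders for your own code:
try:
    charge(order)  # placeholder for your own business logic
except PaymentError as e:
    sentry_sdk.capture_exception(e)  # report it even though you handled it
    raise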
The build
Take a project from earlier (your 03.7 backend, say). Add:
- Structured JSON logging with a per-request trace_id middleware.
- Prometheus metrics on the four golden signals.
- A Grafana dashboard with: requests/sec, p50/p95/p99 latency, error rate, key business metrics (sign-ups/min, payments/min).
- OpenTelemetry tracing with at least 3 named spans per request.
- Sentry with environment + release tagging.
- Two alert rules: one for SLO-burn latency, one for error rate. Test both by inducing failure.
When all six are in place, you have observability that is realistically as good as a 10-engineer startup’s.
Going deeper
When you have specific questions, in this order:
- Google — SRE Workbook, chapter on alerting. free — the multi-window multi-burn-rate math. Read once, then copy the formula.
- Charity Majors’s blog and Honeycomb docs. honeycomb.io — high-cardinality observability is not really achievable in Prometheus; her writing is the best argument for why you eventually move.
- OpenTelemetry — semantic conventions. docs — name your span attributes consistently with the rest of the world; do not invent your own.
- Liz Fong-Jones — Observability for emerging infra. Talks at SREcon. The most pragmatic SRE thinking that exists.
Skip vendor blog posts that feel like sales pitches. They mostly are.
Checkpoints
If any answer wobbles, reread the corresponding section.
- The page just lit up: “p99 latency over SLO.” You have logs, metrics, traces, and Sentry. In what order do you check, and why?
- Why is user_id a terrible metric label and a great log field? Explain in cardinality terms.
- Your team wants to claim 99.99% uptime. How many minutes of downtime per month does that allow, and why is going from 99.9% to 99.99% a much bigger jump than it sounds?
- Walk through what a trace looks like for a request that hits your API → calls a database → calls a downstream service → returns. Where would you put spans?
- You’re paged at 2 AM because “queue depth high.” Is that a good page or a bad page? What would make it good?
When you can answer all five from memory, move to 05.4 Security — the parts you can’t skip. Your observed system needs to be a secure system; that’s the next module.