$ yuktics v0.1

T5 — System Design and Scale module 05.3 ~5–8 hrs

Observability and ops

Logs, metrics, traces. What to instrument, what alerts mean, what '99.9% uptime' actually buys you in minutes per month.

Prerequisites

  • 03.5

Stack

  • OpenTelemetry
  • Grafana + Loki + Prometheus + Tempo (or Honeycomb / Datadog)
  • Sentry

By the end of this module

  • Pick the right pillar (logs, metrics, or traces) for a given debugging question.
  • Write structured logs with correlation IDs that survive across service boundaries.
  • Set SLOs and alerting rules that don't page you for nothing.
  • Instrument an app with OpenTelemetry, Sentry, and one Grafana dashboard.

Observability is the difference between “the site is down and we don’t know why” and “page p99 spiked at 14:23 because a deploy increased downstream calls by 3x.” The first is a Tuesday for most teams; the second takes five minutes of work and ten minutes of curiosity. This module is about the work and the curiosity.

The opinion: you do not need Datadog yet. You need the open-source LGTM stack (Loki, Grafana, Tempo, Mimir/Prometheus), or Honeycomb if you want managed and have budget. Datadog is a fine product whose pricing makes it the wrong choice for any company that hasn’t crossed a specific revenue line. Most teams pay for it because their CTO used it at a previous job, not because they evaluated it. The same goes for instrumentation: most “we have observability” really means “we have a vendor invoice.”

Set up

mkdir obs && cd obs
uv venv .venv && source .venv/bin/activate
uv pip install fastapi uvicorn opentelemetry-distro \
  opentelemetry-exporter-otlp opentelemetry-instrumentation-fastapi \
  opentelemetry-instrumentation-requests sentry-sdk

# Local LGTM stack via docker
cat > docker-compose.yml <<'EOF'
services:
  grafana:
    image: grafana/otel-lgtm
    ports: ["3000:3000", "4317:4317", "4318:4318"]
EOF
docker compose up -d

That grafana/otel-lgtm image bundles Grafana + Loki + Prometheus + Tempo with OTLP receivers. One container to start; swap to managed services when you outgrow it.

Read these first

Three sources, in order, then stop:

  1. Google — Site Reliability Engineering, chapter 6 (Monitoring Distributed Systems). free online · 30 min · the canonical “four golden signals” framework, written by the team that operationalized it.
  2. Charity Majors — Observability: A 3-Year Retrospective. post · 30 min · the case for traces over metrics for debugging unknown unknowns.
  3. Honeycomb — Observability Engineering book, sample chapters. book · 1 hr · pragmatic and vendor-honest; the book itself is paywalled, the sample is enough for now.

Skip “monitoring vs observability” Twitter threads. They’re correct in spirit and useless in practice.

The three pillars

  Pillar    What it answers                                   Cost driver
  Logs      "What happened in this specific request?"         Volume × retention
  Metrics   "How is the system behaving in aggregate?"        Cardinality
  Traces    "How did this request flow through the system?"   Sample rate × span count

Use them together. Logs are where you go to read narrative. Metrics are where you set alerts. Traces are where you debug “why is this one request slow.”

A reasonable starter rule: log every request with a correlation ID, expose Prometheus metrics for the four golden signals, sample traces at 1-10% in production. You can dial up later.
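To keep that 1-10% sample consistent across services, the decision has to be a function of the trace ID itself, so every service agrees on which traces to keep. A sketch of the idea (this approximates what OpenTelemetry's TraceIdRatioBased sampler does; `sampled` is an illustrative name, not a real API):

```python
def sampled(trace_id: int, ratio: float) -> bool:
    """Deterministic head sampling: same trace ID, same decision, everywhere."""
    bound = int(ratio * (1 << 64))
    # Compare the low 64 bits of the 128-bit trace ID against the bound.
    return (trace_id & ((1 << 64) - 1)) < bound

print(sampled(0x1, 0.10))                        # True  — low bits fall in the bucket
print(sampled((1 << 128) - 1, 0.10))             # False — low bits are above the bound
```

Because every hop computes the same answer, a sampled trace is complete end to end rather than missing random spans.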

Structured logging

The biggest leverage move in observability: stop using printf-style logs. Use structured logs (JSON) with a stable set of fields.

import logging, json, sys, uuid
from contextvars import ContextVar
from fastapi import FastAPI

app = FastAPI()  # the middleware below attaches to this app
trace_id_var: ContextVar[str] = ContextVar("trace_id", default="")

class JSONFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
            "logger": record.name,
            "trace_id": trace_id_var.get(),
        }
        if record.exc_info:
            payload["exc"] = self.formatException(record.exc_info)
        # Merge extra fields passed via logger.info(..., extra={...}) by
        # skipping every attribute a bare LogRecord already carries
        standard = logging.LogRecord("", 0, "", 0, "", (), None).__dict__
        for k, v in record.__dict__.items():
            if k not in standard and k not in payload:
                payload[k] = v
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JSONFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

# Middleware: assign trace_id per request
@app.middleware("http")
async def add_trace_id(request, call_next):
    tid = request.headers.get("x-trace-id", uuid.uuid4().hex)
    token = trace_id_var.set(tid)
    try:
        response = await call_next(request)
        response.headers["x-trace-id"] = tid
        return response
    finally:
        trace_id_var.reset(token)

Now every log line carries a trace_id, and you can grep across services for a single request. This is the single highest-leverage instrumentation change you can make.
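Why a ContextVar rather than a module-level global? Each asyncio task gets its own copy of the context, so concurrent requests cannot clobber each other's IDs. A minimal stdlib demonstration:

```python
import asyncio
from contextvars import ContextVar

trace_id_var: ContextVar[str] = ContextVar("trace_id", default="")

async def handle(tid: str) -> str:
    trace_id_var.set(tid)        # each task sets "its" trace_id...
    await asyncio.sleep(0)       # ...yields to the other in-flight tasks...
    return trace_id_var.get()    # ...and still reads its own value back

async def main() -> list:
    # Three concurrent "requests"; gather preserves order.
    return await asyncio.gather(*(handle(f"req-{i}") for i in range(3)))

print(asyncio.run(main()))  # ['req-0', 'req-1', 'req-2']
```

With a plain global, the last request to run would overwrite everyone else's ID mid-flight.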

The four golden signals

From Google’s SRE book, these are what you should always have alerts on, for every user-facing service:

  Signal       Definition
  Latency      Time to serve requests, split by success and failure
  Traffic      Requests per second
  Errors       Failed-request rate
  Saturation   How "full" the service is — CPU, memory, queue depth

Track p50, p95, p99 separately. Averages lie. The user who sees p99 latency is having a bad day even if p50 looks fine.
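A toy illustration with invented numbers — 95 fast requests and 5 pathological ones:

```python
latencies = [0.05] * 95 + [5.0] * 5      # seconds; 5% of requests are terrible

mean = sum(latencies) / len(latencies)   # ~0.30 s — looks tolerable on a dashboard
p99 = sorted(latencies)[98]              # nearest-rank p99 for 100 samples → 5.0 s

print(mean, p99)
```

The mean hides a two-orders-of-magnitude problem that p99 surfaces immediately.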

from prometheus_client import Counter, Histogram, make_asgi_app

http_requests = Counter(
    "http_requests_total",
    "HTTP requests by route, method, status",
    ["route", "method", "status"],
)

http_duration = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["route", "method"],
    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
)

app.mount("/metrics", make_asgi_app())

Cardinality — the metric that crashes your bill

Metric labels (route, method, status) explode combinatorially. Each unique label combination is a separate time series. If you label by user_id, you have one time series per user — and you’ve just turned your metrics database into a logs database, badly.

  Label                   Cardinality    OK?
  HTTP method             ~5             Yes
  Route name              ~50-200        Yes
  Status code             ~10            Yes
  Region/AZ               ~10            Yes
  User ID                 millions       NO
  Request ID              per-request    NO, NEVER
  URL with query string   unbounded      NO

The rule: labels are for grouping aggregates, not for identifying individuals. If you need to find “this specific request,” use logs and traces. Metrics are aggregates.
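The arithmetic that makes the table concrete (label sizes assumed from the table above):

```python
methods, routes, statuses = 5, 100, 10

ok_series = methods * routes * statuses   # 5_000 time series: fine
bad_series = ok_series * 1_000_000        # add a user_id label: 5 billion series

print(ok_series, bad_series)  # 5000 5000000000
```

Every label multiplies the series count; one unbounded label multiplies it without limit.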

SLO, SLI, SLA — what they actually mean

Lots of teams use these terms wrong.

  Term              What it is
  SLI (Indicator)   A measured number — "p99 latency of GET /api"
  SLO (Objective)   A target on an SLI — "p99 latency under 500ms over a rolling 30 days"
  SLA (Agreement)   A contractual promise to a customer, with consequences

Most internal teams need SLOs, not SLAs. SLAs are customer contracts and require legal involvement. SLOs are internal targets that dictate operational behavior — “if we burn through our error budget, the next sprint is reliability work.”

What “99.9% uptime” actually buys you per month:

  SLO       Allowed monthly downtime
  99%       ~7.2 hours
  99.9%     ~43 minutes
  99.95%    ~22 minutes
  99.99%    ~4.3 minutes
  99.999%   ~26 seconds

Each extra nine costs roughly 10x more engineering. Pick the SLO that matches what users actually need, not what sounds impressive.
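The table is just (1 − SLO) × minutes in the month; deriving it yourself keeps you honest when someone proposes a nine:

```python
MONTH_MIN = 30 * 24 * 60  # 43_200 minutes in a 30-day month

def allowed_downtime_min(slo: float) -> float:
    """Minutes of downtime per 30-day month that still meet the SLO."""
    return (1 - slo) * MONTH_MIN

for slo in (0.99, 0.999, 0.9995, 0.9999, 0.99999):
    print(f"{slo:.3%}: {allowed_downtime_min(slo):.1f} min")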

Alerting that doesn’t page for nothing

Three rules.

  1. Alert on user-visible symptoms, not internal causes. “p99 latency over SLO” is a good page. “CPU over 80%” is not (CPU may be high and users may be fine).
  2. Multi-window, multi-burn-rate. A single threshold means you alert on a brief blip. The Google SRE workbook has the math: combine a fast-burn alert (problem in the last hour) and a slow-burn alert (problem over the last day).
  3. Every page must be actionable. If the on-call cannot do anything about it at 3 AM, it is not a page. Move it to a ticket.
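The burn-rate idea from rule 2, in numbers. The 14.4 multiplier is the SRE workbook's standard fast-burn choice for a 1-hour window; everything else follows from the SLO:

```python
slo = 0.999
error_budget = 1 - slo               # 0.1% of requests may fail this month

# Burn rate N means the 30-day budget is gone in 30/N days.
fast_burn = 14.4                     # budget gone in ~2 days  → page
slow_burn = 3.0                      # budget gone in ~10 days → ticket

page_threshold = fast_burn * error_budget    # error rate that pages: ~1.44%
ticket_threshold = slow_burn * error_budget  # error rate that tickets: ~0.3%

print(page_threshold, ticket_threshold)
```

A 1.44% error rate for an hour is a real incident; a 0.11% error rate for a day is a trend worth a ticket, not a 3 AM wake-up.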

A sensible starter alert set for a web service:

# Prometheus-flavored alert rules
- alert: HighLatency
  expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 0.5
  for: 10m
  labels: { severity: page }

- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01
  for: 5m
  labels: { severity: page }

- alert: SaturationDiskFilling
  expr: predict_linear(node_filesystem_free_bytes[1h], 4*3600) < 0
  for: 30m
  labels: { severity: ticket }

Notice the disk-fill alert is a ticket, not a page. There’s nothing useful to do at 3 AM about a disk that will fill in 4 hours. Page when human action is needed now.

Distributed tracing with OpenTelemetry

A trace is a tree of spans. Each span represents a unit of work (HTTP request, DB query, external call). Spans propagate a traceparent header so they can be assembled across services.

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)))
trace.set_tracer_provider(provider)

FastAPIInstrumentor.instrument_app(app)

tracer = trace.get_tracer(__name__)

@app.get("/users/{id}")
async def get_user(id: str):
    with tracer.start_as_current_span("get_user") as span:
        span.set_attribute("user.id", id)
        with tracer.start_as_current_span("db_query"):
            user = await db.fetch_user(id)                # db: your async DB layer
        with tracer.start_as_current_span("cache_write"):
            await r.set(f"user:{id}", json.dumps(user))   # r: an async Redis client
        return user

Now in Tempo (or Jaeger, or Honeycomb), you can search for a slow request and see exactly which span took the time. This is a different debugging skill from grepping logs — it teaches you to see request shapes, not just events.
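The traceparent header that stitches these spans together is plain text you can inspect: per the W3C Trace Context format it is version-traceid-spanid-flags in lower-case hex. A quick parse (the IDs below are illustrative):

```python
# Illustrative traceparent value; field layout per W3C Trace Context.
header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"

version, trace_id, span_id, flags = header.split("-")

assert len(trace_id) == 32       # 128-bit trace ID as hex
assert len(span_id) == 16        # 64-bit parent span ID as hex
sampled = flags == "01"          # sampled flag set → downstream keeps sampling
print(version, sampled)
```

When traces refuse to join up across services, dumping this header at each hop is usually the fastest diagnosis.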

Sentry — the third tool

Logs and metrics tell you what is happening. Traces tell you how things flow. Sentry tells you which exact line of code threw an exception, with the full stack and the local variables.

import sentry_sdk
sentry_sdk.init(
    dsn="https://...@sentry.io/...",
    traces_sample_rate=0.1,
    profiles_sample_rate=0.1,
    environment="prod",
    release="api@1.42.0",
)

Add this. It costs nothing on a hobby project (free tier is generous) and saves hours per bug. Most teams treat Sentry as the “production debugger” and that’s correct — it’s complementary to OpenTelemetry, not competing.

The build

Take a project from earlier (your 03.7 backend, say). Add:

  1. Structured JSON logging with a per-request trace_id middleware.
  2. Prometheus metrics on the four golden signals.
  3. A Grafana dashboard with: requests/sec, p50/p95/p99 latency, error rate, key business metrics (sign-ups/min, payments/min).
  4. OpenTelemetry tracing with at least 3 named spans per request.
  5. Sentry with environment + release tagging.
  6. Two alert rules: one for SLO-burn latency, one for error rate. Test both by inducing failure.

When all six are in place, you have observability that is realistically as good as a 10-engineer startup’s.

Going deeper

When you have specific questions, in this order:

  1. Google — SRE Workbook, chapter on alerting. free — the multi-window multi-burn-rate math. Read once, then copy the formula.
  2. Charity Majors’s blog and Honeycomb docs. honeycomb.io — high-cardinality observability is not really achievable in Prometheus; her writing is the best argument for why you eventually move.
  3. OpenTelemetry — semantic conventions. docs — name your span attributes consistently with the rest of the world; do not invent your own.
  4. Liz Fong-Jones — Observability for emerging infra. Talks at SREcon. The most pragmatic SRE thinking that exists.

Skip vendor blog posts that feel like sales pitches. They mostly are.

Checkpoints

If any answer wobbles, reread the corresponding section.

  1. The page just lit up: “p99 latency over SLO.” You have logs, metrics, traces, and Sentry. In what order do you check, and why?
  2. Why is user_id a terrible metric label and a great log field? Explain in cardinality terms.
  3. Your team wants to claim 99.99% uptime. How many minutes of downtime per month does that allow, and why is going from 99.9% to 99.99% a much bigger jump than it sounds?
  4. Walk through what a trace looks like for a request that hits your API → calls a database → calls a downstream service → returns. Where would you put spans?
  5. You’re paged at 2 AM because “queue depth high.” Is that a good page or a bad page? What would make it good?

When you can answer all five from memory, move to 05.4 Security — the parts you can’t skip. Your observed system needs to be a secure system; that’s the next module.