$ yuktics v0.1

T5 — System Design and Scale module 05.2 ~6–10 hrs

Caching, queues, rate limits

Redis, Kafka or NATS, exponential backoff, idempotency keys. The standard kit of any production backend, with the failure modes that bite you when you cargo-cult them.

Prerequisites

  • 03.3

Stack

  • Redis
  • NATS or Kafka or BullMQ (depending on language)
  • Python or TypeScript
  • Docker compose

By the end of this module

  • Pick the right caching pattern (look-aside, write-through, write-behind) for a given access shape.
  • Know when adding a queue helps and when it just adds latency and a new failure mode.
  • Implement idempotency keys, exponential backoff, and a token-bucket rate limiter.
  • Diagnose thundering herd, cache stampede, and the at-least-once / exactly-once confusion.

These three — caching, queues, rate limits — are the standard kit of any production backend. They’re also the components most often added because somebody at a previous job said “we should add Redis here,” not because anyone measured a problem they’d solve. This module is about using them with intent: knowing what each costs, when each helps, and the specific failure modes that turn a healthy system into a 3am page.

The opinion this module is built on: most “we have a scale problem” is actually a slow query problem, and adding a cache before profiling makes it worse. A cache hides a slow query, lets the codebase grow more dependencies on that slow path, and then the cache invalidates and you take the original problem at 10x the load. Cache as a deliberate choice, not as a reflex. Same goes for queues — half the queues in production exist because somebody read a blog post about decoupling, not because the workload demanded it.

Setup

mkdir backend-kit && cd backend-kit
uv venv .venv && source .venv/bin/activate
uv pip install fastapi uvicorn redis httpx tenacity nats-py

# Local services
cat > docker-compose.yml <<'EOF'
services:
  redis:
    image: redis:7
    ports: ["6379:6379"]
  nats:
    image: nats:2
    command: -js
    ports: ["4222:4222"]
  api:
    build: .
    ports: ["8000:8000"]
EOF
docker compose up -d redis nats

You don’t need cloud anything for this module. Docker compose, your laptop, and a load generator are sufficient.

Read these first

Three sources, in order, then stop:

  1. AWS — Caching strategies whitepaper. docs · 30 min · the cleanest treatment of look-aside vs write-through vs write-behind.
  2. Redis — Idempotency keys with Redis (and Stripe’s blog version). Stripe · 30 min · the canonical pattern for exactly-once-feeling APIs.
  3. Marc Brooker — Exponential Backoff and Jitter. post · 20 min · why naive backoff causes thundering herds and the simple fix.

You’ll be tempted to go read about Kafka internals or Redis cluster sharding. Don’t yet — those are interesting but you’ll use Redis and a queue at the API level for years before any of that matters.

Caching — patterns and their failure modes

There are three patterns. Pick deliberately.

Pattern | How it works | Right for
Look-aside (cache-aside) | App reads cache; on miss, reads DB and writes cache | Read-heavy, eventual consistency OK
Write-through | App writes to cache and DB synchronously | Read-heavy with frequent writes that need to invalidate cache cleanly
Write-behind (write-back) | App writes to cache, cache writes to DB async | High write volume, can tolerate data loss on cache failure

Look-aside is the default. Reach for it 90% of the time.

import redis, json, asyncpg

r = redis.from_url("redis://localhost:6379")
# db: an asyncpg connection or pool created at startup, e.g. await asyncpg.create_pool(...)

async def get_user(user_id: str):
    cached = r.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)
    row = await db.fetchrow("SELECT * FROM users WHERE id = $1", user_id)
    user = dict(row)
    r.setex(f"user:{user_id}", 300, json.dumps(user, default=str))   # 5-min TTL
    return user

That’s it. The hard parts are everything around it.

Cache invalidation — the named hard problem

Two strategies and you should know when each fits.

Strategy | When
TTL only | Slightly stale reads are OK; data is mostly read-only
TTL + explicit invalidation | Mutation paths are well-known and writers can call invalidate
Versioning (key includes a version) | Writers can bump a global version on a structural change

The trap: people add explicit invalidation everywhere, miss one mutation path, and then ship a stale-data bug that takes weeks to find. Default to TTLs short enough that staleness is acceptable. Add explicit invalidation only on the specific keys where staleness matters more than the latency win.
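
A minimal sketch of the "TTL + explicit invalidation" row, reusing the key scheme and clients from the look-aside example above (update_user and its SQL are illustrative): the writer deletes exactly the key it knows it dirtied, and the TTL stays in place as the backstop for any mutation path you missed.

async def update_user(user_id: str, name: str):
    await db.execute("UPDATE users SET name = $2 WHERE id = $1", user_id, name)
    r.delete(f"user:{user_id}")   # invalidate the one key this write dirtied; the 5-min TTL backstops the rest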

Thundering herd / cache stampede

A popular cache key expires. 1000 requests miss simultaneously. They all hit the database at once. Database falls over.

The pattern that fixes it is called singleflight — only one of the concurrent requests actually does the work; the rest wait for the result.

import asyncio

class SingleFlight:
    def __init__(self):
        self._inflight: dict[str, asyncio.Future] = {}

    async def do(self, key, fn):
        if key in self._inflight:
            return await self._inflight[key]
        fut = asyncio.get_running_loop().create_future()
        self._inflight[key] = fut
        try:
            result = await fn()
            fut.set_result(result)
            return result
        except Exception as exc:
            fut.set_exception(exc)   # propagate the failure to waiters instead of leaving them hanging
            raise
        finally:
            self._inflight.pop(key, None)

sf = SingleFlight()

async def get_user(user_id):
    cached = r.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)
    return await sf.do(f"user:{user_id}", lambda: load_and_cache(user_id))

In Go this is golang.org/x/sync/singleflight. In Node, libraries like dataloader. In Python, the pattern above works.

Other prevention tricks: stale-while-revalidate (serve stale during refresh), early refresh (refresh proactively before TTL expires), per-key locks in Redis. Pick one. Test it under load.
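
One of those tricks sketched: a per-key lock in Redis, assuming the r, json, and load_and_cache names from the snippets above (the lock key prefix and the 50 ms wait are illustrative). Only the lock winner rebuilds the key; everyone else waits briefly and re-reads the cache.

def try_lock(key: str, ttl: int = 10) -> bool:
    # SET NX: only the first caller creates the lock key, so only one process rebuilds
    return bool(r.set(f"lock:{key}", "1", ex=ttl, nx=True))

async def get_user_with_lock(user_id: str):
    cache_key = f"user:{user_id}"
    cached = r.get(cache_key)
    if cached:
        return json.loads(cached)
    if try_lock(cache_key):
        return await load_and_cache(user_id)       # we won the lock: rebuild the key
    await asyncio.sleep(0.05)                      # someone else is rebuilding: wait, then re-check the cache
    cached = r.get(cache_key)
    return json.loads(cached) if cached else await load_and_cache(user_id)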

Queues — when to add, when not

Most services do not need a queue. Sync request/response is simpler, faster, and easier to reason about. Add a queue only when one of these is true:

Reason to queue | Concrete signal
The work is slow and the user shouldn't wait | Email sending, image processing, video encoding
The downstream is unreliable or rate-limited | Calling third-party APIs that flake
You need to absorb traffic spikes | Black Friday checkouts, login storms
You need to fan out one event to many consumers | Webhook delivery, notification fanout

Otherwise: don’t queue. A queue adds at minimum:

  • Two failure modes (producer fails to enqueue, consumer fails to process).
  • New observability surface (queue depth, age of oldest message, DLQ).
  • Eventual consistency where you used to have synchronous results.

# Minimal NATS JetStream queue with idempotency
import nats, json, asyncio

async def enqueue_email_send(user_id, message_id):
    nc = await nats.connect("nats://localhost:4222")
    js = nc.jetstream()
    await js.publish(
        "emails.send",
        json.dumps({"user_id": user_id, "message_id": message_id}).encode(),
        headers={"Nats-Msg-Id": message_id},   # JetStream dedup
    )
    await nc.close()

The Nats-Msg-Id header is the dedup key — JetStream rejects duplicate publishes within its dedup window. This is the simplest path to “publish once” semantics.

At-least-once vs exactly-once — the illusion

Distributed systems give you at-least-once delivery. Exactly-once is mostly a marketing claim. The way to get exactly-once behavior is to make your consumers idempotent.

async def handle_email_send(msg):
    payload = json.loads(msg.data)
    message_id = payload["message_id"]

    # Idempotency: have we processed this message_id before?
    # SET ... NX succeeds only for the first caller; redis-py returns None when the key already exists.
    if r.set(f"processed:{message_id}", "1", ex=86400, nx=True):
        # We're the first; do the work
        await send_email(payload)
    # Otherwise someone already did it. Just ack.
    await msg.ack()

This is not an exotic technique. It is the technique for handling at-least-once delivery without producing duplicates. Make it a habit.
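
To close the loop, a minimal worker that feeds messages to that handler. This is a sketch assuming a JetStream stream already covers the emails.send subject and that nats-py's pull-subscribe API is available as documented; the durable name is illustrative.

import asyncio
import nats

async def run_worker():
    nc = await nats.connect("nats://localhost:4222")
    js = nc.jetstream()
    # Durable pull consumer: survives worker restarts and redelivers messages that were never acked
    sub = await js.pull_subscribe("emails.send", durable="email-worker")
    while True:
        try:
            msgs = await sub.fetch(10, timeout=5)
        except asyncio.TimeoutError:
            continue                               # nothing pending; poll again
        for msg in msgs:
            await handle_email_send(msg)

if __name__ == "__main__":
    asyncio.run(run_worker())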

Idempotency keys for APIs

Same idea on the producer side. Stripe’s API takes an Idempotency-Key header on every state-changing request. The server stores (key → response) for a window. If the same key arrives again, return the cached response instead of double-processing.

@app.post("/payments")
async def create_payment(request: Request, body: PaymentBody):
    key = request.headers.get("Idempotency-Key")
    if not key:
        raise HTTPException(400, "Idempotency-Key header required")

    cached = r.get(f"idem:{key}")
    if cached:
        return json.loads(cached)

    result = await process_payment(body)

    r.setex(f"idem:{key}", 86400, json.dumps(result))
    return result

This is a 10-line pattern that prevents an entire class of customer-facing bugs (duplicate charges from retries). Add it to any state-changing endpoint, especially payments.

Rate limits — four algorithms

Four algorithms. Token bucket is the right default.

Algorithm | Behavior
Fixed window | “100 requests per minute, reset on the minute boundary.” Simple. Boundary spikes possible.
Sliding window | Smooth across boundaries. Slightly more state.
Token bucket | Burst up to capacity, then fill at rate R. Simple and burst-friendly.
Leaky bucket | Constant outflow rate. For smoothing, not throttling.

Token bucket in Redis with Lua atomicity:

TOKEN_BUCKET_LUA = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local b = redis.call("HMGET", key, "tokens", "ts")
local tokens = tonumber(b[1]) or capacity
local ts = tonumber(b[2]) or now

local delta = (now - ts) * refill_rate
tokens = math.min(capacity, tokens + delta)

if tokens >= 1 then
    tokens = tokens - 1
    redis.call("HMSET", key, "tokens", tokens, "ts", now)
    redis.call("EXPIRE", key, 3600)
    return 1
else
    redis.call("HMSET", key, "tokens", tokens, "ts", now)
    redis.call("EXPIRE", key, 3600)
    return 0
end
"""

import time

def allow(user_id, capacity=100, refill_rate=10):
    # eval runs the script atomically in Redis; returns 1 (allowed) or 0 (throttled)
    return r.eval(TOKEN_BUCKET_LUA, 1, f"rl:{user_id}", capacity, refill_rate, time.time())

Where to enforce: at the edge (CDN/LB) for crude protection; in the application layer for per-user, per-endpoint precision. Both. Don’t trust just the edge; don’t trust just the app.
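
A sketch of the application-layer half, wiring the allow() helper above into a FastAPI dependency and reusing the app object and get_user reader from the earlier snippets. The X-User-Id header is a stand-in for however your auth layer identifies callers.

from fastapi import Depends, HTTPException, Request

async def rate_limit(request: Request):
    # Identify the caller however your auth layer does; the header here is purely illustrative
    user_id = request.headers.get("X-User-Id") or request.client.host
    if not allow(user_id):
        raise HTTPException(429, "rate limit exceeded")

@app.get("/users/{user_id}", dependencies=[Depends(rate_limit)])
async def read_user(user_id: str):
    return await get_user(user_id)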

Exponential backoff with jitter

The pattern when calling an unreliable downstream:

import httpx
from tenacity import retry, stop_after_attempt, wait_exponential_jitter

@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential_jitter(initial=1, max=30, jitter=2),
    reraise=True,
)
async def fetch_external(url):
    async with httpx.AsyncClient() as client:        # httpx.get() is sync; use the async client here
        response = await client.get(url, timeout=5)
    response.raise_for_status()
    return response.json()

Jitter is non-negotiable. Without it, all your clients retry at exactly the same intervals, producing the thundering herd that knocks the downstream over again the moment it recovers.
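
For intuition, here is roughly what the decorator is doing, sketched by hand: capped exponential backoff with full jitter, in the spirit of Brooker's post (an illustration, not tenacity's exact implementation).

import asyncio, random

async def call_with_backoff(fn, attempts=5, base=1.0, cap=30.0):
    for attempt in range(attempts):
        try:
            return await fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            # Full jitter: sleep a random amount between 0 and the capped exponential delay,
            # so retrying clients spread out instead of synchronizing on the same schedule
            await asyncio.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))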

“Your problem isn’t load, it’s a slow query”

Before you add Redis: profile your hot endpoints. Roughly 80% of “we need caching” turns out to be:

  • A query missing an index.
  • An N+1 query inside a loop.
  • A SELECT * pulling 50 columns when you need 3.
  • A page that does 6 sequential queries that could be one JOIN.

Each of these is a 10-line fix that beats adding a whole new dependency. Reach for the database EXPLAIN before reaching for the cache.
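
A hedged illustration of the N+1 item, using the asyncpg-style db handle from earlier; the orders/users tables and columns are made up for the example.

async def user_names_n_plus_1(order_ids):
    # N+1 shape: one query for the orders, then one more query per order for its user
    orders = await db.fetch("SELECT id, user_id FROM orders WHERE id = ANY($1)", order_ids)
    return [
        (o["id"], (await db.fetchrow("SELECT name FROM users WHERE id = $1", o["user_id"]))["name"])
        for o in orders
    ]

async def user_names_joined(order_ids):
    # The fix: one JOIN, one round trip, and the "we need caching" symptom often disappears with it
    rows = await db.fetch(
        "SELECT o.id, u.name FROM orders o JOIN users u ON u.id = o.user_id WHERE o.id = ANY($1)",
        order_ids,
    )
    return [(row["id"], row["name"]) for row in rows]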

The build

Pick an existing API project (yours or a small open-source one). Add these in order; load-test after each:

  1. Add Redis look-aside cache to the slowest GET endpoint. Measure.
  2. Add an idempotency key to the slowest POST endpoint. Measure with retried clients.
  3. Add token-bucket rate limiting per user. Verify it triggers on a hammering client.
  4. Add a NATS queue and worker for one async task (email send, webhook fanout). Verify at-least-once → idempotent dedup.
  5. Add exponential backoff with jitter on every external HTTP call.
  6. Run a 5-minute load test (oha or k6) before and after the full set. Compare p50, p99, error rate.

You should see meaningful p99 improvements and zero increase in error rate. If you broke something, it’s almost always the cache invalidation.

Going deeper

When you have specific questions, in this order:

  1. Stripe — Designing robust and predictable APIs with idempotency. post — the canonical reference. Re-read after shipping a payment system.
  2. Confluent — Kafka in 100 seconds + Kafka under the hood. When you outgrow NATS or SQS.
  3. Marc Brooker’s blog. marcbrooker.xyz — distributed systems thinking from inside AWS. Slow but worth it.
  4. GitHub — How we scaled a critical service with no downtime. post — real production scaling story when you want to read instead of theorize.

Skip “Redis vs Memcached” think pieces. The answer is Redis unless you have a very specific reason.

Checkpoints

If any answer wobbles, reread the corresponding section.

  1. A popular cache key expires. 500 requests miss at once. Walk through what happens with no protection vs with singleflight.
  2. Why do you make consumers idempotent rather than insisting on exactly-once delivery? Give the protocol-level reason.
  3. Token bucket with capacity 100 and refill rate 10/sec. A user sends 200 requests in 1 second. How many succeed? Now 20 requests/sec sustained — what happens?
  4. A teammate proposes adding Kafka because “the API is slow under load.” Name three things you’d measure first to argue against (or for) the change.
  5. Your idempotency key has a 24-hour TTL. A retry comes 25 hours later. What happens, and is that the behavior you want?

When you can answer all five from memory, move to 05.3 Observability and ops. The components you just added need instrumentation; that’s the next module.