Caching, queues, rate limits
Redis, Kafka or NATS, exponential backoff, idempotency keys. The standard kit of any production backend, with the failure modes that bite you when you cargo-cult them.
Prerequisites
03.3
Stack
- Redis
- NATS, Kafka, or BullMQ (depending on language)
- Python or TypeScript
- Docker Compose
By the end of this module
- Pick the right caching pattern (look-aside, write-through, write-behind) for a given access shape.
- Know when adding a queue helps and when it just adds latency and a new failure mode.
- Implement idempotency keys, exponential backoff, and a token-bucket rate limiter.
- Diagnose thundering herd, cache stampede, and the at-least-once / exactly-once confusion.
These three — caching, queues, rate limits — are the standard kit of any production backend. They’re also the components most often added because somebody at a previous job said “we should add Redis here,” not because anyone measured a problem they’d solve. This module is about using them with intent: knowing what each costs, when each helps, and the specific failure modes that turn a healthy system into a 3am page.
The opinion this module is built on: most “we have a scale problem” is actually a slow query problem, and adding a cache before profiling makes it worse. A cache hides a slow query, lets the codebase grow more dependencies on that slow path, and then the cache invalidates and you take the original problem at 10x the load. Cache as a deliberate choice, not as a reflex. Same goes for queues — half the queues in production exist because somebody read a blog post about decoupling, not because the workload demanded it.
Set up
```sh
mkdir backend-kit && cd backend-kit
uv venv .venv && source .venv/bin/activate
uv pip install fastapi uvicorn redis httpx tenacity nats-py

# Local services
cat > docker-compose.yml <<'EOF'
services:
  redis:
    image: redis:7
    ports: ["6379:6379"]
  nats:
    image: nats:2
    command: -js
    ports: ["4222:4222"]
  api:
    build: .
    ports: ["8000:8000"]
EOF
docker compose up -d redis nats
```
You don’t need cloud anything for this module. Docker compose, your laptop, and a load generator are sufficient.
Read these first
Three sources, in order, then stop:
- AWS — Caching strategies whitepaper. docs · 30 min · the cleanest treatment of look-aside vs write-through vs write-behind.
- Redis — Idempotency keys with Redis (and Stripe’s blog version). Stripe · 30 min · the canonical pattern for exactly-once-feeling APIs.
- Marc Brooker — Exponential Backoff and Jitter. post · 20 min · why naive backoff causes thundering herds and the simple fix.
You’ll be tempted to go read about Kafka internals or Redis cluster sharding. Don’t yet — those are interesting but you’ll use Redis and a queue at the API level for years before any of that matters.
Caching — patterns and their failure modes
There are three patterns. Pick deliberately.
| Pattern | How it works | Right for |
|---|---|---|
| Look-aside (cache-aside) | App reads cache; on miss, reads DB and writes cache | Read-heavy, eventual consistency OK |
| Write-through | App writes to cache and DB synchronously | Read-heavy with frequent writes that need to invalidate cache cleanly |
| Write-behind (write-back) | App writes to cache, cache writes to DB async | High write volume, can tolerate data loss on cache failure |
Look-aside is the default. Reach for it 90% of the time.
```python
import redis, json, asyncpg

r = redis.from_url("redis://localhost:6379")

async def get_user(user_id: str):
    cached = r.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)
    # `db` is your asyncpg pool/connection, created at app startup
    row = await db.fetchrow("SELECT * FROM users WHERE id = $1", user_id)
    user = dict(row)
    r.setex(f"user:{user_id}", 300, json.dumps(user, default=str))  # 5-min TTL
    return user
```
That’s it. The hard parts are everything around it.
Cache invalidation — the named hard problem
Three strategies, and you should know when each fits.
| Strategy | When |
|---|---|
| TTL only | Reads are slightly stale OK; data is mostly read-only |
| TTL + explicit invalidation | Mutation paths are well-known and writers can call invalidate |
| Versioning (key includes a version) | Writers can bump a global version on a structural change |
The trap: people add explicit invalidation everywhere, miss one mutation path, and then ship a stale-data bug that takes weeks to find. Default to TTLs short enough that staleness is acceptable. Add explicit invalidation only on the specific keys where staleness matters more than the latency win.
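To make the second and third rows concrete, here is a minimal sketch. The function names and the `catalog:ver` key are illustrative, not a library API; the code works against any client exposing `get`/`delete`/`incr` (a real `redis.Redis`, or a fake in tests):

```python
# Strategy 2: explicit invalidation. Every writer must call this after the
# DB write -- which is exactly why missing one mutation path ships a
# stale-data bug.
def update_user(r, user_id: str) -> None:
    # ... write the row to the database first ...
    r.delete(f"user:{user_id}")  # drop the cached copy

# Strategy 3: versioned keys. The version lives in one counter; bumping it
# retires every old key at once instead of enumerating them. Old versions
# just stop being read and age out via TTL.
def catalog_key(r, item_id: str) -> str:
    ver = r.get("catalog:ver") or 0
    return f"catalog:v{int(ver)}:{item_id}"

def bump_catalog_version(r) -> None:
    r.incr("catalog:ver")
```

Versioning trades memory (stale versions linger until TTL) for the guarantee that you never serve a pre-bump value.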
Thundering herd / cache stampede
A popular cache key expires. 1000 requests miss simultaneously. They all hit the database at once. Database falls over.
The pattern that fixes it is called singleflight — only one of the concurrent requests actually does the work; the rest wait for the result.
```python
import asyncio

class SingleFlight:
    def __init__(self):
        self._inflight: dict[str, asyncio.Future] = {}

    async def do(self, key, fn):
        if key in self._inflight:
            return await self._inflight[key]
        fut = asyncio.get_running_loop().create_future()
        self._inflight[key] = fut
        try:
            result = await fn()
            fut.set_result(result)
            return result
        except Exception as exc:
            fut.set_exception(exc)  # wake waiters; otherwise they hang forever
            raise
        finally:
            self._inflight.pop(key, None)

sf = SingleFlight()

async def get_user(user_id):
    cached = r.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)
    return await sf.do(f"user:{user_id}", lambda: load_and_cache(user_id))
```
In Go this is golang.org/x/sync/singleflight. In Node, libraries like dataloader. In Python, the pattern above works.
Other prevention tricks: stale-while-revalidate (serve stale during refresh), early refresh (refresh proactively before TTL expires), per-key locks in Redis. Pick one. Test it under load.
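Stale-while-revalidate can be sketched with two TTLs: a soft freshness stamp stored alongside the value, checked in the app, and a hard TTL enforced by Redis. This is an illustrative sketch, not a library; it assumes a JSON-serializable value and any Redis-like client with `get`/`setex`, and note the background refresh is fire-and-forget, so a crashing process can drop it:

```python
import asyncio, json, time

SOFT_TTL = 300    # after this many seconds, refresh in the background
HARD_TTL = 3600   # Redis evicts the entry entirely after this

async def get_with_swr(r, key: str, loader):
    raw = r.get(key)
    if raw is not None:
        entry = json.loads(raw)
        if time.time() - entry["ts"] > SOFT_TTL:
            # Stale but present: serve the old value immediately and
            # refresh in the background, so no caller eats the DB latency.
            asyncio.create_task(_refresh(r, key, loader))
        return entry["value"]
    return await _refresh(r, key, loader)  # hard miss: this caller waits

async def _refresh(r, key: str, loader):
    value = await loader()
    r.setex(key, HARD_TTL, json.dumps({"value": value, "ts": time.time()}))
    return value
```

Between the soft and hard TTLs, only one caller's request per refresh window ever touches the database; everyone else gets slightly stale data at cache speed.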
Queues — when to add, when not
Most services do not need a queue. Sync request/response is simpler, faster, and easier to reason about. Add a queue only when one of these is true:
| Reason to queue | Concrete signal |
|---|---|
| The work is slow and the user shouldn’t wait | Email sending, image processing, video encoding |
| The downstream is unreliable or rate-limited | Calling third-party APIs that flake |
| You need to absorb traffic spikes | Black Friday checkouts, login storms |
| You need to fan out one event to many consumers | Webhook delivery, notification fanout |
Otherwise: don’t queue. A queue adds at minimum:
- Two failure modes (producer fails to enqueue, consumer fails to process).
- New observability surface (queue depth, age of oldest message, DLQ).
- Eventual consistency where you used to have synchronous results.
```python
# Minimal NATS JetStream publish with idempotency
import nats, json

async def enqueue_email_send(user_id, message_id):
    nc = await nats.connect("nats://localhost:4222")
    js = nc.jetstream()
    await js.publish(
        "emails.send",
        json.dumps({"user_id": user_id, "message_id": message_id}).encode(),
        headers={"Nats-Msg-Id": message_id},  # JetStream dedup
    )
    await nc.close()
```
The Nats-Msg-Id header is the dedup key — JetStream drops duplicate publishes that arrive within the stream’s dedup window (two minutes by default). This is the simplest path to “publish once” semantics.
At-least-once vs exactly-once — the illusion
Distributed systems give you at-least-once delivery. Exactly-once delivery is mostly a marketing claim. The way to get exactly-once behavior is to make your consumers idempotent.
```python
async def handle_email_send(msg):
    payload = json.loads(msg.data)
    message_id = payload["message_id"]
    # Idempotency: SET NX succeeds only for the first caller.
    if r.set(f"processed:{message_id}", "1", ex=86400, nx=True):
        # We're the first; do the work.
        await send_email(payload)
    # If the SET NX failed, another worker already did it. Just ack.
    await msg.ack()
```
This is not an exotic technique. It is the technique for handling at-least-once delivery without producing duplicates. Make it a habit.
Idempotency keys for APIs
Same idea on the producer side. Stripe’s API takes an Idempotency-Key header on every state-changing request. The server stores (key → response) for a window. If the same key arrives again, return the cached response instead of double-processing.
```python
from fastapi import FastAPI, HTTPException, Request

app = FastAPI()

@app.post("/payments")
async def create_payment(request: Request, body: PaymentBody):
    key = request.headers.get("Idempotency-Key")
    if not key:
        raise HTTPException(400, "Idempotency-Key header required")
    cached = r.get(f"idem:{key}")
    if cached:
        return json.loads(cached)  # replay the stored response
    result = await process_payment(body)
    r.setex(f"idem:{key}", 86400, json.dumps(result))
    return result
```
This is a 10-line pattern that prevents an entire class of customer-facing bugs (duplicate charges from retries). Add it to any state-changing endpoint, especially payments.
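One gap in the pattern above: two concurrent requests with the same key can both miss the lookup and both charge the card. A common fix — sketched here with hypothetical helper names, not Stripe’s exact mechanics — is to reserve the key atomically with SET NX before doing the work:

```python
import json

PENDING = "__pending__"

def reserve_or_replay(r, key: str):
    """Return (claimed, cached). claimed=True means we own the key: do the
    work, then call store_result. r is any SET-NX-capable Redis-like client."""
    # Atomically claim the key; exactly one concurrent request wins.
    if r.set(f"idem:{key}", PENDING, ex=86400, nx=True):
        return True, None
    cached = r.get(f"idem:{key}")
    if cached in (PENDING, PENDING.encode()):
        # A twin request is mid-flight: surface a retryable error rather
        # than processing the payment a second time.
        raise RuntimeError("request with this idempotency key is in progress")
    return False, json.loads(cached)

def store_result(r, key: str, result) -> None:
    r.setex(f"idem:{key}", 86400, json.dumps(result))
```

The in-progress case usually maps to an HTTP 409 telling the client to retry shortly; by then the first request has stored its response and the retry replays it.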
Rate limits — four algorithms
Four algorithms worth knowing. Token bucket is the right default.
| Algorithm | Behavior |
|---|---|
| Fixed window | “100 requests per minute, reset on the minute boundary.” Simple. Boundary spikes possible. |
| Sliding window | Smooth across boundaries. Slightly more state. |
| Token bucket | Burst up to capacity, then fill at rate R. Simple and burst-friendly. |
| Leaky bucket | Constant outflow rate. For smoothing, not throttling. |
Token bucket in Redis with Lua atomicity:
```python
import time

TOKEN_BUCKET_LUA = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local b = redis.call("HMGET", key, "tokens", "ts")
local tokens = tonumber(b[1]) or capacity
local ts = tonumber(b[2]) or now
local delta = (now - ts) * refill_rate
tokens = math.min(capacity, tokens + delta)
if tokens >= 1 then
  tokens = tokens - 1
  redis.call("HMSET", key, "tokens", tokens, "ts", now)
  redis.call("EXPIRE", key, 3600)
  return 1
else
  redis.call("HMSET", key, "tokens", tokens, "ts", now)
  redis.call("EXPIRE", key, 3600)
  return 0
end
"""

def allow(user_id, capacity=100, refill_rate=10):
    return r.eval(TOKEN_BUCKET_LUA, 1, f"rl:{user_id}", capacity, refill_rate, time.time())
```
Where to enforce: at the edge (CDN/LB) for crude protection; in the application layer for per-user, per-endpoint precision. Both. Don’t trust just the edge; don’t trust just the app.
Exponential backoff with jitter
The pattern when calling an unreliable downstream:
```python
import httpx
from tenacity import retry, stop_after_attempt, wait_exponential_jitter

@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential_jitter(initial=1, max=30, jitter=2),
    reraise=True,
)
async def fetch_external(url):
    # httpx.get() is sync-only; async calls go through AsyncClient
    async with httpx.AsyncClient() as client:
        response = await client.get(url, timeout=5)
    response.raise_for_status()
    return response.json()
```
Jitter is non-negotiable. Without it, all your clients retry at exactly the same intervals, producing the thundering herd that knocks the downstream over again the moment it recovers.
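tenacity hides the mechanics. Brooker’s “full jitter” variant is small enough to write by hand, which makes the fix obvious: each attempt sleeps a uniformly random amount within the capped exponential window, so clients decorrelate instead of retrying in lockstep. A sketch of the idea, not tenacity’s internals:

```python
import asyncio, random

async def retry_with_full_jitter(fn, attempts=5, base=1.0, cap=30.0):
    """Full jitter: sleep uniform(0, min(cap, base * 2**attempt)) between tries."""
    for attempt in range(attempts):
        try:
            return await fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            window = min(cap, base * (2 ** attempt))
            await asyncio.sleep(random.uniform(0, window))
```

Compare with "equal jitter" or no jitter at all in Brooker's post — full jitter spreads retries across the whole window, which is why it wins his simulations.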
“Your problem isn’t load, it’s a slow query”
Before you add Redis: profile your hot endpoints. Roughly 80% of “we need caching” turns out to be:
- A query missing an index.
- An N+1 query inside a loop.
- A SELECT * pulling 50 columns when you need 3.
- A page that does 6 sequential queries that could be one JOIN.
Each of these is a 10-line fix that beats adding a whole new dependency. Reach for the database EXPLAIN before reaching for the cache.
The build
Pick an existing API project (yours or a small open-source one). Add these in order; load-test after each:
- Add Redis look-aside cache to the slowest GET endpoint. Measure.
- Add an idempotency key to the slowest POST endpoint. Measure with retried clients.
- Add token-bucket rate limiting per user. Verify it triggers on a hammering client.
- Add a NATS queue and worker for one async task (email send, webhook fanout). Verify at-least-once → idempotent dedup.
- Add exponential backoff with jitter on every external HTTP call.
- Run a 5-minute load test (oha or k6) before and after the full set. Compare p50, p99, error rate.
You should see meaningful p99 improvements and zero increase in error rate. If you broke something, it’s almost always the cache invalidation.
Going deeper
When you have specific questions, in this order:
- Stripe — Designing robust and predictable APIs with idempotency. post — the canonical reference. Re-read after shipping a payment system.
- Confluent — Kafka in 100 seconds + Kafka under the hood. When you outgrow NATS or SQS.
- Marc Brooker’s blog. marcbrooker.xyz — distributed systems thinking from inside AWS. Slow but worth it.
- GitHub — How we scaled a critical service with no downtime. post — real production scaling story when you want to read instead of theorize.
Skip “Redis vs Memcached” think pieces. The answer is Redis unless you have a very specific reason.
Checkpoints
If any answer wobbles, reread the corresponding section.
- A popular cache key expires. 500 requests miss at once. Walk through what happens with no protection vs with singleflight.
- Why do you make consumers idempotent rather than insisting on exactly-once delivery? Give the protocol-level reason.
- Token bucket with capacity 100 and refill rate 10/sec. A user sends 200 requests in 1 second. How many succeed? Now 20 requests/sec sustained — what happens?
- A teammate proposes adding Kafka because “the API is slow under load.” Name three things you’d measure first to argue against (or for) the change.
- Your idempotency key has a 24-hour TTL. A retry comes 25 hours later. What happens, and is that the behavior you want?
When you can answer all five from memory, move to 05.3 Observability and ops. The components you just added need instrumentation; that’s the next module.