$ yuktics v0.1

T2 — CS Theory You Actually Need module 02.6 ~10–14 hrs

Databases and SQL, deep

PostgreSQL as your default. Indexes, transactions, joins that don't melt servers, when to actually reach for NoSQL.

Prerequisites

  • 01.1 (Python)

Stack

  • PostgreSQL 16
  • psql
  • DBeaver or TablePlus
  • pgAdmin (optional)

By the end of this module

  • Design a normalized schema and know which 3NF rules to deliberately break for performance.
  • Write JOINs, CTEs, and window functions without copy-pasting from Stack Overflow.
  • Read an EXPLAIN ANALYZE plan and identify the bottleneck operator.
  • Pick the right index type (btree, hash, gin) and explain when each helps and hurts.
  • Recognize the n+1 query pattern and fix it three different ways.
  • Argue convincingly for or against introducing a NoSQL store in a specific situation.

There is one database every working backend engineer should know cold, and it’s PostgreSQL. Not MongoDB, not DynamoDB, not the database-of-the-month you saw on Hacker News. Postgres is boring, fast, correct, and 95% of applications never need anything else. The opinionated take of this module is: use Postgres until it actively can’t do the job. That bar is much higher than most people think.

This module assumes you have written a few SQL queries. It does not assume you understand transactions, isolation levels, or what the query planner is doing. Those are the gaps. They’re the gaps that turn into incidents.

The remaining 5% — when you genuinely need Redis, ClickHouse, or a vector database — gets covered honestly at the end. There are good reasons to reach for those tools, and the good reasons are different from the reasons most teams cite.

Set up

# Linux
sudo apt install postgresql-16 postgresql-client-16

# macOS
brew install postgresql@16
brew services start postgresql@16

Verify and create a working database:

psql --version
createdb shop
psql shop -c "select version();"

Make a project:

mkdir db && cd db
mkdir schema queries notes

You’ll spend most of this module in psql. Get used to its commands: \d table_name (describe a table), \dt (list tables), \di (list indexes), \timing on (print each query’s duration), and \e (open the last query in your editor).

Read these first

In this order:

  1. PostgreSQL official docs — chapters 5, 7, 11, 13, 14. docs · 4–5 hours · the actual source of truth, not someone’s blog interpretation. Chapter 14 (performance tips) is where most engineers grow up.
  2. Markus Winand — Use The Index, Luke!. free site · 2 hours · the best index tutorial in existence. Vendor-agnostic, practical.
  3. Designing Data-Intensive Applications (Kleppmann). book · chapters 2, 3, and 7 · for the conceptual depth on storage engines, indexes, and transactions.
  4. PostgreSQL Wiki — Don’t Do This. wiki · 30 minutes · curated list of footguns.

After Use The Index, Luke! and the official docs on indexes and explain, you have most of what you need. Stop reading and start writing schemas.

The relational model, the parts you’ll use

A relation is a table. A row is a tuple. A column is an attribute. Keys come in two flavors:

  • Primary key. Uniquely identifies a row. Almost always a single column, almost always a serial integer or UUID.
  • Foreign key. A column that references the primary key of another table. The database enforces referential integrity if you ask it to.

Normalization to 3NF, in three sentences:

  • 1NF. No repeating groups; each cell holds one value.
  • 2NF. Every non-key column depends on the whole primary key (matters for composite keys).
  • 3NF. Every non-key column depends on the primary key directly, not transitively through another non-key column.

The pragmatic rule: design to 3NF, then deliberately denormalize for read-heavy hot paths. A denormalized “current_balance” column that’s kept in sync by triggers or application code is faster than aggregating ten million transactions on every page load. Denormalization without intent, on the other hand, is technical debt.
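
A minimal sketch of the trigger-maintained version, using hypothetical accounts and ledger tables (not part of the build later in this module):

CREATE TABLE accounts (
  id BIGSERIAL PRIMARY KEY,
  current_balance_cents BIGINT NOT NULL DEFAULT 0  -- denormalized aggregate
);

CREATE TABLE ledger (
  id BIGSERIAL PRIMARY KEY,
  account_id BIGINT NOT NULL REFERENCES accounts(id),
  amount_cents BIGINT NOT NULL
);

-- keep the denormalized column in sync on every insert
CREATE FUNCTION apply_ledger_entry() RETURNS trigger AS $$
BEGIN
  UPDATE accounts
  SET current_balance_cents = current_balance_cents + NEW.amount_cents
  WHERE id = NEW.account_id;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER ledger_balance
AFTER INSERT ON ledger
FOR EACH ROW EXECUTE FUNCTION apply_ledger_entry();

Reads become a single-row lookup; the cost moves to write time, which is exactly the trade described above.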

SQL deeply

You are presumed to know basic SELECT, INSERT, UPDATE, DELETE. The four constructs that separate practitioners from beginners:

JOINs. INNER (rows that match in both tables), LEFT (all of the left plus matches from the right), RIGHT (the mirror), FULL (rows from both sides, matched where possible). Most production code uses INNER and LEFT; a CROSS JOIN (Cartesian product) is almost never what you mean:

-- INNER: customers who have placed orders
SELECT c.id, c.name, count(*) AS orders
FROM customers c JOIN orders o ON o.customer_id = c.id
GROUP BY c.id;

-- LEFT: all customers, with order count (zero for those without orders)
SELECT c.id, c.name, count(o.id) AS orders
FROM customers c LEFT JOIN orders o ON o.customer_id = c.id
GROUP BY c.id;

CTEs (WITH). Named subqueries that make complex logic readable. Postgres 12+ inlines them into the surrounding query by default (add MATERIALIZED to force the old fencing behavior); treat them as a structural tool, not a performance lever.

WITH recent AS (
  SELECT customer_id, count(*) AS n
  FROM orders WHERE created_at > now() - interval '30 days'
  GROUP BY customer_id
)
SELECT c.name, r.n FROM customers c JOIN recent r ON r.customer_id = c.id
WHERE r.n >= 5;

Window functions. Aggregate without collapsing rows. Critical for ranking, running totals, and moving averages.

SELECT customer_id, created_at, amount,
       sum(amount) OVER (PARTITION BY customer_id ORDER BY created_at) AS running_total
FROM orders;
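
The paragraph above names ranking and moving averages too; a sketch of both against the same orders table (the 7-row frame is arbitrary):

SELECT customer_id, created_at, amount,
       row_number() OVER (PARTITION BY customer_id ORDER BY created_at DESC) AS recency_rank,
       avg(amount) OVER (PARTITION BY customer_id ORDER BY created_at
                         ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_avg
FROM orders;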

Subqueries vs JOINs. Either can express the same thing; the planner often produces the same plan. Prefer the form that reads more clearly. Use EXISTS / NOT EXISTS instead of IN / NOT IN when the inner result might be huge, and beware the NOT IN trap: if the subquery returns even one NULL, the predicate is never true and the query returns zero rows.
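
Both patterns against the customers/orders tables from the JOIN examples, with the NULL trap spelled out:

-- customers with at least one order; the planner can stop at the first match
SELECT c.id, c.name FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);

-- customers with no orders, written safely
SELECT c.id, c.name FROM customers c
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);

-- the trap: if any orders.customer_id were NULL, this would return zero rows
SELECT c.id, c.name FROM customers c
WHERE c.id NOT IN (SELECT o.customer_id FROM orders o);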

Transactions and isolation

A transaction is a unit of work the database treats atomically. ACID — Atomicity, Consistency, Isolation, Durability — is the contract.

Isolation levels, weakest to strongest:

  Level                                Dirty read   Non-repeatable read   Phantom read
  Read uncommitted                     possible     possible              possible
  Read committed (Postgres default)    prevented    possible              possible
  Repeatable read                      prevented    prevented             possible (Postgres: prevented)
  Serializable                         prevented    prevented             prevented

(Postgres accepts READ UNCOMMITTED syntax but runs it as Read committed; dirty reads are never possible in Postgres.)

A dirty read sees uncommitted data. A non-repeatable read sees a row change between two reads in the same transaction. A phantom read sees a new row appear in a range you queried earlier.

The pragmatic guidance: most applications run on Read committed and are fine. Use Serializable when you genuinely have a constraint that spans rows (transferring money, decrementing inventory under contention). Postgres implements Serializable with SSI (serializable snapshot isolation): no blocking locks, but transactions can abort with a serialization failure (SQLSTATE 40001), so the application must retry. Read the Kleppmann chapter for the mechanism.

Always wrap multi-statement state changes in BEGIN / COMMIT. Never run schema migrations and DML in the same transaction unless you’re sure the migration tool supports it — most don’t.
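
A sketch of the shape, using a hypothetical accounts table; the point is the BEGIN/COMMIT bracket and the explicit isolation level:

-- move 5.00 between accounts: both updates commit, or neither does
BEGIN ISOLATION LEVEL SERIALIZABLE;
UPDATE accounts SET balance_cents = balance_cents - 500 WHERE id = 1;
UPDATE accounts SET balance_cents = balance_cents + 500 WHERE id = 2;
COMMIT;
-- under Serializable, expect occasional SQLSTATE 40001 (serialization_failure);
-- the application retries the whole block from BEGIN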

Indexes, the right one for the job

An index is a data structure the database uses to find rows without scanning the whole table. Postgres has several:

  • B-tree. The default. Use for equality and range. Sorted, supports ORDER BY without a sort. ~99% of indexes you’ll create.
  • Hash. Equality only. Slightly faster than btree for pure equality, but rarely worth using; btree is good enough.
  • GIN. Generalized inverted index. Use for full-text search, JSONB containment, array containment.
  • GiST. Geospatial, range types, similarity. Use when btree can’t express your query.
  • BRIN. Block range. For huge tables where physically adjacent rows have correlated values (timestamps in a log table). Cheap to maintain, less precise.
  • Partial. A regular index with a WHERE clause. Index only the rows you care about.
  • Expression. Index a function of columns. CREATE INDEX ON users (lower(email)) if you query case-insensitively.

The rule of thumb: index columns that appear in WHERE, JOIN, or ORDER BY in queries that run frequently. Don’t index everything — every index slows writes and uses disk. A table with 12 indexes is a sign someone wasn’t thinking.

The composite-index ordering rule: equality-filtered columns first (most selective first), range columns last. An index on (customer_id, created_at) serves where customer_id = ? and where customer_id = ? and created_at > ? but not where created_at > ? alone.
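
A few of these as DDL. The orders index matches the build later in this module; users and events are hypothetical tables for illustration:

-- composite btree: serves customer_id = ?, and customer_id = ? AND created_at > ?
CREATE INDEX ON orders (customer_id, created_at);

-- partial: index only the rows the hot query touches
CREATE INDEX ON orders (created_at) WHERE status = 'pending';

-- expression: matches WHERE lower(email) = ?
CREATE INDEX ON users (lower(email));

-- GIN for JSONB containment queries (payload @> '{"k": "v"}')
CREATE INDEX ON events USING gin (payload jsonb_path_ops);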

EXPLAIN ANALYZE: read the plan

Every query has a plan. The planner picks one. When the query is slow, the plan tells you why.

EXPLAIN (ANALYZE, BUFFERS) SELECT ...;

What to look for:

  • Seq Scan on a large table. Usually wrong; usually means a missing index.
  • Estimated rows wildly different from actual rows. Stale stats. Run ANALYZE table_name;.
  • Nested Loop joining millions of rows. Almost always wrong; the planner should pick Hash Join or Merge Join. Check that the join keys are indexed and the row estimates are sane.
  • Sort step spilling to disk. The sort outgrew work_mem; raise work_mem for the session (snippet below).
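
The two quick fixes from the list, as they look in a psql session (the work_mem value is an arbitrary example, not a recommendation):

-- refresh planner statistics for one table
ANALYZE orders;

-- session-only: raise the threshold before a sort or hash spills to disk
SET work_mem = '256MB';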

The single most useful tool for this is explain.dalibo.com: paste a plan, get a visual tree. Once plans get big, the tree is far faster to scan than the raw text.

The n+1 query problem

The single most common performance bug in application code:

# anti-pattern: 1 query for the customers, then 1 more per customer
for customer in db.execute("SELECT id FROM customers LIMIT 100"):
    # runs 100 times, one round trip each (placeholder syntax varies by driver)
    orders = db.execute("SELECT * FROM orders WHERE customer_id = ?", customer.id)
    # use orders

That’s 1 + 100 = 101 round trips. Each round trip is hundreds of microseconds, often milliseconds. Page render times explode.

Three ways to fix it:

  1. JOIN. One query, one plan, the database handles it.
  2. IN clause. SELECT * FROM orders WHERE customer_id IN (?, ?, ?, ...). Fixes 1 and 2 are sketched after this list.
  3. ORM eager loading. SQLAlchemy selectinload, Django prefetch_related. Same idea, framework-level.
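
Fixes 1 and 2 in SQL, against the same customers/orders pair:

-- fix 1: one JOIN, one round trip
SELECT o.*
FROM (SELECT id FROM customers LIMIT 100) c
JOIN orders o ON o.customer_id = c.id;

-- fix 2: collect the ids in the application, then send one query
SELECT * FROM orders WHERE customer_id = ANY(ARRAY[1, 2, 3]);  -- ...all 100 ids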

If you remember exactly one thing from this module, make it: every loop that does a database query is probably wrong.

The build: an e-commerce schema

Build a schema for a small e-commerce store. Tables: customers, products, orders, order_items, inventory_movements. Relationships, constraints, and indexes are part of the design.

-- schema/001_init.sql
CREATE TABLE customers (
  id BIGSERIAL PRIMARY KEY,
  email TEXT NOT NULL UNIQUE,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE products (
  id BIGSERIAL PRIMARY KEY,
  sku TEXT NOT NULL UNIQUE,
  name TEXT NOT NULL,
  price_cents INTEGER NOT NULL CHECK (price_cents >= 0)
);

CREATE TABLE orders (
  id BIGSERIAL PRIMARY KEY,
  customer_id BIGINT NOT NULL REFERENCES customers(id),
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  status TEXT NOT NULL CHECK (status IN ('pending','paid','shipped','cancelled'))
);

CREATE TABLE order_items (
  order_id BIGINT NOT NULL REFERENCES orders(id) ON DELETE CASCADE,
  product_id BIGINT NOT NULL REFERENCES products(id),
  quantity INTEGER NOT NULL CHECK (quantity > 0),
  unit_price_cents INTEGER NOT NULL,
  PRIMARY KEY (order_id, product_id)
);

CREATE TABLE inventory_movements (
  id BIGSERIAL PRIMARY KEY,
  product_id BIGINT NOT NULL REFERENCES products(id),
  delta INTEGER NOT NULL,  -- positive for restock, negative for sale
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX ON orders (customer_id, created_at DESC);
CREATE INDEX ON order_items (product_id);
CREATE INDEX ON inventory_movements (product_id, created_at);

Seed it with 100k customers, 10k products, 1M orders, 5M order_items (a generator script is fine).
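
One way to generate it, as a generate_series sketch; it assumes fresh tables (serial ids starting at 1), and the distributions are arbitrary:

-- schema/002_seed.sql
INSERT INTO customers (email)
SELECT 'user' || g || '@example.com'
FROM generate_series(1, 100000) g;

INSERT INTO products (sku, name, price_cents)
SELECT 'SKU-' || g, 'Product ' || g, (random() * 10000)::int
FROM generate_series(1, 10000) g;

INSERT INTO orders (customer_id, created_at, status)
SELECT (1 + random() * 99999)::bigint,
       now() - random() * interval '365 days',
       (ARRAY['pending','paid','shipped','cancelled'])[1 + (random() * 3)::int]
FROM generate_series(1, 1000000);

-- five distinct products per order, chosen deterministically = 5M rows
INSERT INTO order_items (order_id, product_id, quantity, unit_price_cents)
SELECT o.id, p.id, 1 + (random() * 4)::int, p.price_cents
FROM orders o
CROSS JOIN generate_series(1, 5) s
JOIN products p ON p.id = 1 + (o.id + s * 1999) % 10000;

Then write 10 hand-picked queries, each with EXPLAIN ANALYZE: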

  1. Top 10 customers by total spend in the last 30 days.
  2. Products that have never been ordered.
  3. Daily revenue for the last 90 days as a window-function running total.
  4. Customers whose most recent order was more than 60 days ago.
  5. Inventory turnover by product (unit_price × quantity sold per month).
  6. The 95th percentile order value, using percentile_cont.
  7. Customer cohort retention: of customers who ordered in month X, how many ordered again in month X+1?
  8. A pivot of order counts by status by week.
  9. Full-text search over product names with to_tsvector and a GIN index.
  10. A query that without an index does a Seq Scan, with an index does an Index Scan; show the plan diff.

If you can write all 10 in under an hour each, with sane plans, you have outpaced 90% of working backend engineers’ SQL skills.

When to actually reach for something else

Honest decision tree:

  • Need single-key reads at sub-millisecond, very high QPS, willing to lose data on crash? Redis. Cache, session store, rate limiter. Not your primary store.
  • Need analytical queries over hundreds of millions of rows? ClickHouse. Columnar, vectorized, fast. Don’t use it for OLTP.
  • Need similarity search over embeddings? pgvector first (Postgres extension). If pgvector hits its limits, then a vector DB (Qdrant, Pinecone). Most teams don’t reach pgvector’s limits.
  • Need a document model with deeply nested optional fields? JSONB columns in Postgres (sketched after this list). MongoDB only when you’ve measured a real reason; the real reason is rarely what people say.
  • Need horizontal scale beyond what one big Postgres can do? That’s a lot of Postgres. If you genuinely need it, look at Citus, CockroachDB, or sharded Postgres before reaching for NoSQL.
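
The JSONB route from the list above, sketched with a hypothetical profiles table:

CREATE TABLE profiles (
  id BIGSERIAL PRIMARY KEY,
  doc JSONB NOT NULL
);

-- GIN index so containment queries don't scan the whole table
CREATE INDEX ON profiles USING gin (doc jsonb_path_ops);

-- top-level containment: every profile whose document includes "plan": "pro"
SELECT id FROM profiles WHERE doc @> '{"plan": "pro"}';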

The pattern in every one of those: Postgres until proven insufficient. The threshold for “proven insufficient” is concrete, measurable evidence — not a hunch.

Going deeper

When you have specific questions, in this order:

  1. PostgreSQL official docs. docs · the source of truth. Read with intent, not cover to cover.
  2. The Internals of PostgreSQL. free site · how the engine actually works. Wonderful.
  3. Designing Data-Intensive Applications (Kleppmann). book · the conceptual breadth.
  4. Markus Winand — SQL Performance Explained. book · the paid follow-up to Use The Index, Luke!.
  5. Brent Ozar’s blog. link · SQL Server-flavored but the principles transfer.

Skip “MongoDB vs Postgres” content marketing. The right answer is almost always Postgres.

Checkpoints

If any one wobbles, redo the corresponding section.

  1. You have a users table with 50M rows and a query “where lower(email) = ?” that’s slow. What index, exactly, do you create, and why won’t a regular btree on email help?
  2. Two transactions both read a row, both write a new value based on the read. Walk through what happens under Read committed, Repeatable read, and Serializable.
  3. A nightly report query takes 45 minutes. EXPLAIN ANALYZE shows a Seq Scan on a 200M-row table feeding a Nested Loop. What two specific things do you check first?
  4. You see a 95th-percentile API latency spike at 3 AM every night. The DB is the bottleneck. Name three plausible causes and how you’d distinguish them.
  5. A coworker proposes moving from Postgres to MongoDB because “Postgres can’t handle our scale.” You currently run on a 4 vCPU instance at 30% CPU. What do you ask, and what do you propose instead?

If you can answer all five with conviction, you’ve earned 02.6. Move on to 02.7 (Computer architecture and performance) — where the cache and memory hierarchy explain why your code is slow regardless of how good your queries are.