RAG done right
Chunking, embeddings, hybrid search, rerankers, query rewriting. Why most RAG demos fall apart in production — and the design that doesn't.
Prerequisites
04.2
Stack
- Python 3.12
- voyage-3 or text-embedding-3-large
- Postgres + pgvector OR Qdrant
- BM25 (rank_bm25 or ts_vector)
- Cohere Rerank or similar reranker
- Anthropic API
By the end of this module
- Build a RAG system that beats pure vector search by a measurable margin on your own corpus.
- Pick chunking, embeddings, and vector store appropriately for the data you actually have.
- Implement hybrid retrieval (BM25 + vector) and a reranker stage.
- Write a small RAG eval set and use it to validate every change.
Most RAG demos look amazing and most production RAG systems disappoint. The gap is not subtle — pure vector search over naively chunked documents simply does not retrieve well enough on real data. This module is about closing that gap. By the end you’ll have built a RAG system that uses hybrid search, reranking, and query rewriting, and you’ll have an eval harness that proves it’s better than the demo version everyone else ships.
The opinion: if you’re using LangChain’s default RAG chain unmodified, you have not built a RAG system. You have built a debugging exercise. The defaults are catastrophically bad on real corpora — pure vector, fixed-size chunks, no reranker, no query rewrite. Every one of those choices loses retrieval quality on real data. Fix them in this order, measure each, and you’ll have something that actually works.
Set up
mkdir rag && cd rag
uv venv .venv && source .venv/bin/activate
uv pip install anthropic voyageai cohere psycopg2-binary pgvector \
rank_bm25 sentence-transformers tiktoken python-dotenv
# Postgres with pgvector via docker
cat > docker-compose.yml <<'EOF'
services:
  db:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_PASSWORD: rag
    ports: ["5432:5432"]
EOF
docker compose up -d
cat > .env <<'EOF'
ANTHROPIC_API_KEY=sk-ant-...
VOYAGE_API_KEY=...
COHERE_API_KEY=...
EOF
git init && printf ".env\n.venv/\n" >> .gitignore
You need a real corpus of your own. The built-in benchmarks are misleading — they’re too clean. Use your notes, your company’s docs, a Wikipedia subset, a codebase. Something with roughly 1000-10000 documents and real messiness.
Read these first
Three sources, in order, then stop:
- Anthropic — Contextual Retrieval. post · 30 min · the cleanest explanation of why naive RAG fails and what fixes it.
- Pinecone — Hybrid Search guide. docs · 20 min · the case for BM25 + vector together.
- Lewis et al. — Retrieval-Augmented Generation. arxiv · 30 min · the original paper. Mostly historical — what’s in production today is much further along.
Skip the LangChain tutorials and the “build RAG in 10 lines” YouTube videos. Those are how everyone produces broken RAG.
The demo-to-production gap
Demo RAG looks great because:
- The demo dataset is small enough that any retrieval works.
- The demo questions match the document phrasing word-for-word.
- The demo is evaluated by the person who built it on the questions they had in mind.
Production RAG looks bad because:
- The corpus is large; retrieval has to discriminate between many similar documents.
- Real users phrase questions in ways totally unlike the source documents.
- The embeddings of “What’s our refund policy?” and “Can I get my money back?” are similar but not identical, and the document containing the answer is in the top-50 but not top-5.
The fix isn’t a magic prompt. It’s plumbing: better chunks, hybrid search, a reranker, query rewriting, and evals to verify each change.
Chunking that actually makes sense
The default everyone uses: split into fixed 500-token chunks. This is wrong.
| Strategy | When it works | When it fails |
|---|---|---|
| Fixed-size with overlap | Plain prose, blog posts | Splits structured docs across chunks |
| Recursive structural | Markdown, code, HTML | When structure is missing or noisy |
| Semantic (cluster on embeddings) | Long meandering docs | Slow, marginal gains |
| Sentence-window | Q+A retrieval | Loses context for narrative |
| Document-as-chunk | Short docs (under 1000 tokens) | Long docs are too coarse |
Two specific things that matter:
- Respect the structure your data already has. Markdown headings. JSON keys. Code function boundaries. PDFs with sections. Don’t shred it; chunk along the boundaries.
- Add context to each chunk. A bare chunk like “…the limit is 30 days…” is useless when retrieved without context. Anthropic’s contextual retrieval idea: prepend each chunk with a 1-2 sentence summary of where it sits in the document. Cheap (one Claude call per chunk) and meaningfully better (a sketch follows the chunking example below).
def chunk_markdown(text, target_tokens=400):
    # Split on H1/H2/H3 headings, fall back to paragraphs
    sections = []
    for line in text.split("\n"):
        if line.startswith("#") or not sections:
            sections.append([line])
        else:
            sections[-1].append(line)
    chunks = []
    for section in sections:
        body = "\n".join(section)
        # Further split if too big (~4 chars per token is a rough estimate)
        if len(body) // 4 > target_tokens:
            # ... recursive splitting
            pass
        chunks.append(body)
    return chunks
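A minimal sketch of the contextual step described above, assuming one claude-haiku-4-5 call per chunk and a document short enough to fit in the prompt. The contextualize helper and the prompt wording are illustrative, not Anthropic's exact recipe:
import anthropic

anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def contextualize(document, chunk):
    # Ask for a 1-2 sentence summary of where the chunk sits in its document
    context = anthropic_client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": f"<document>\n{document}\n</document>\n\n"
                       f"<chunk>\n{chunk}\n</chunk>\n\n"
                       "In 1-2 sentences, say where this chunk sits in the document so it can be "
                       "understood on its own. Reply with the context only.",
        }],
    ).content[0].text
    # Index the contextualized text; keep the raw chunk around for display
    return f"{context}\n\n{chunk}"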
Embeddings — don’t default to OpenAI
The OpenAI embedding models are fine but they’re rarely the best choice. As of late 2025, the leaderboard moves quickly, but a useful rule:
| Embedding model | When to choose |
|---|---|
| voyage-3 / voyage-3-large | Best general-purpose. Default unless you have a reason. |
| text-embedding-3-large | Solid, ubiquitous, slightly behind voyage on quality. |
| Cohere embed-english-v3 | Strong on enterprise text. |
| BGE-large / e5-mistral | Open weight; self-host for cost. Slower than API. |
| nomic-embed-v1.5 | Open weight; smaller and faster. Good for high-volume. |
Run a quick eval on your data before committing. The MTEB leaderboard is a starting point, not an answer.
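The retrieval code later in this module assumes an embed() helper. Here is a minimal sketch using the voyageai client; voyage-3 matches the stack above, but swap in whatever model wins your own eval:
import os
import voyageai

vo = voyageai.Client(api_key=os.environ["VOYAGE_API_KEY"])

def embed(text, input_type="query"):
    # voyage-3 returns 1024-dimensional vectors, matching the pgvector schema below
    return vo.embed([text], model="voyage-3", input_type=input_type).embeddings[0]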
Vector DB choice
Most people pick a vector DB before they have data and live with that choice forever. Don’t.
| Option | Right for |
|---|---|
| pgvector | under 1M vectors, you already have Postgres, sane default |
| Qdrant | 1M+ vectors, want filters and quantization |
| LanceDB | Local-first, embedded, no server |
| Pinecone | You want managed and have budget |
| FAISS | In-process, single-machine, you’ll handle persistence |
| No vector DB | Tiny corpus (under 10K). Just embed and brute force. |
For most projects in this module: pgvector. It scales further than people think, your auth and backups already work, and you avoid running a second database.
-- pgvector schema
CREATE EXTENSION vector;
CREATE TABLE chunks (
id BIGSERIAL PRIMARY KEY,
doc_id TEXT,
text TEXT,
embedding vector(1024),
metadata JSONB
);
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);
CREATE INDEX ON chunks USING gin (to_tsvector('english', text));
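A minimal ingestion sketch against that schema. It assumes psycopg2 with pgvector's register_vector adapter plus the chunk_markdown and embed helpers from earlier; the connection string and metadata are placeholders:
import numpy as np
import psycopg2
from psycopg2.extras import Json
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("postgresql://postgres:rag@localhost:5432/postgres")
register_vector(conn)  # lets psycopg2 adapt numpy arrays to the vector column type

def ingest(doc_id, text):
    with conn.cursor() as cur:
        for chunk in chunk_markdown(text):
            vec = np.array(embed(chunk, input_type="document"))
            cur.execute(
                "INSERT INTO chunks (doc_id, text, embedding, metadata) VALUES (%s, %s, %s, %s)",
                (doc_id, chunk, vec, Json({"source": doc_id})),
            )
    conn.commit()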
Hybrid search — the single biggest win
Pure vector search loses to pure BM25 on certain query types (acronyms, names, exact phrases) and BM25 loses to vector on semantic queries (“explain how X works”). Combining both wins on almost everything.
def search_hybrid(query, k=20):
    # Vector branch
    q_embed = embed(query)
    vector_results = db.execute("""
        SELECT id, text, 1 - (embedding <=> %s) AS score
        FROM chunks ORDER BY embedding <=> %s LIMIT %s
    """, (q_embed, q_embed, k)).fetchall()
    # BM25 branch (Postgres ts_vector here; rank_bm25 works too)
    keyword_results = db.execute("""
        SELECT id, text, ts_rank(to_tsvector('english', text),
               plainto_tsquery('english', %s)) AS score
        FROM chunks WHERE to_tsvector('english', text) @@ plainto_tsquery('english', %s)
        ORDER BY score DESC LIMIT %s
    """, (query, query, k)).fetchall()
    # Fuse the two ranked lists with Reciprocal Rank Fusion
    return rrf(vector_results, keyword_results, k=60)

def rrf(*rankings, k=60):
    # Score each id by summed 1/(k + rank) across lists; keep text for downstream stages
    scores, texts = {}, {}
    for ranking in rankings:
        for rank, (id_, text, _) in enumerate(ranking, start=1):
            scores[id_] = scores.get(id_, 0) + 1 / (k + rank)
            texts[id_] = text
    fused = sorted(scores.items(), key=lambda x: -x[1])
    return [{"id": id_, "text": texts[id_], "score": score} for id_, score in fused]
Reciprocal Rank Fusion is dumb and works. It needs no tuning, doesn’t care about score scales, and beats most weighted-sum schemes. Use RRF unless you have a strong reason not to.
Rerankers — the second biggest win
After hybrid retrieval gives you 50 candidates, run a cross-encoder reranker on (query, candidate) pairs. The reranker scores actual relevance with a model that sees both the query and the chunk together — much more accurate than embedding similarity.
import os
import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])

def rerank(query, candidates, top_k=5):
    docs = [c["text"] for c in candidates]
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=docs,
        top_n=top_k,
    )
    return [candidates[r.index] for r in response.results]
This is one API call per query and it raises retrieval quality more than any other single change after hybrid search. Open-weight alternatives: bge-reranker-v2-m3 (self-host), Jina rerankers.
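If you'd rather self-host, here is a sketch of the same stage using sentence-transformers' CrossEncoder (already in the install list) with the open-weight bge-reranker-v2-m3; expect to want a GPU for real throughput:
from sentence_transformers import CrossEncoder

# Open-weight cross-encoder; the first call downloads the weights from Hugging Face
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank_local(query, candidates, top_k=5):
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)  # one relevance score per (query, chunk) pair
    ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
    return [c for c, _ in ranked[:top_k]]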
Query rewriting
Your retrieval is only as good as the query you give it. Two patterns help:
- HyDE (Hypothetical Document Embeddings). Have the model write a hypothetical answer to the query, then embed that and search. The hypothetical answer often matches the real document better than the question does.
- Sub-query decomposition. For multi-hop questions (“Compare X’s policy to Y’s”), use the model to split into atomic queries, retrieve for each, then synthesize. A sketch follows the HyDE example below.
import anthropic

anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def hyde_search(query):
    hypothetical = anthropic_client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=300,
        messages=[{"role": "user", "content": f"Write a one-paragraph answer to: {query}"}],
    ).content[0].text
    return search_hybrid(hypothetical)
Worth it when queries are short and abstract. Adds latency; skip if your queries are already concrete and verbose.
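A sketch of the second pattern, sub-query decomposition. The prompt and the one-sub-query-per-line convention are assumptions, and the dedup is deliberately naive:
def decompose_search(query, k=20):
    sub_queries = anthropic_client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": "Split this question into independent sub-questions, one per line, "
                       "no numbering. If it is already atomic, return it unchanged:\n\n" + query,
        }],
    ).content[0].text.splitlines()
    # Retrieve for each sub-query, dedupe by chunk id, keep first-seen order
    seen, merged = set(), []
    for sq in (s.strip() for s in sub_queries if s.strip()):
        for chunk in search_hybrid(sq, k=k):
            if chunk["id"] not in seen:
                seen.add(chunk["id"])
                merged.append(chunk)
    return merged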
Evals — the part most teams skip
Without evals you cannot tell if a change made retrieval better or worse. Build a small eval set (50-100 questions) on your real corpus.
# eval.jsonl
{"q": "What's our return policy for digital products?",
"must_retrieve_doc_id": "policy-returns-v3"}
{"q": "How do I configure SSO with Okta?",
"must_retrieve_doc_id": "sso-okta-setup"}
Track three metrics:
| Metric | What it tells you |
|---|---|
| Recall@k | Did we retrieve the right doc in the top k? |
| MRR (mean reciprocal rank) | How high up the ranking is the right doc? |
| End-to-end accuracy | Does the final generated answer match the expected answer? |
Run the eval before any change. Run it after. If a change drops Recall@5 without a clear reason, revert it.
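A minimal harness for Recall@k and MRR. It assumes eval.jsonl as above and a retrieval function whose results carry a doc_id field (add doc_id to the SELECTs if you use the hybrid search code as written). Keep it as a script you run before and after every change:
import json

def run_eval(search_fn, path="eval.jsonl", k=5):
    recall_hits, reciprocal_ranks, n = 0, 0.0, 0
    for line in open(path):
        case = json.loads(line)
        results = search_fn(case["q"])[:k]
        # Rank positions (1-based) at which the required doc appears
        ranks = [i for i, r in enumerate(results, start=1)
                 if r["doc_id"] == case["must_retrieve_doc_id"]]
        recall_hits += bool(ranks)
        reciprocal_ranks += 1 / ranks[0] if ranks else 0.0
        n += 1
    print(f"Recall@{k}: {recall_hits / n:.2f}   MRR: {reciprocal_ranks / n:.2f}")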
The build
Build it in this order. Measure after each step.
1. Ingest your corpus, chunk by structure, store in pgvector.
2. Pure vector search. Run eval. Record baseline.
3. Add BM25. Combine with RRF. Run eval. Record gain.
4. Add Cohere reranker on top-50 → top-5. Run eval. Record gain.
5. Add HyDE for abstract queries. Run eval. Decide if worth the latency.
6. Final stage: Claude generates the answer with retrieved chunks as context (sketched below).
You should see Recall@5 improve at steps 3 through 5 over the step-2 baseline. If a step doesn’t help, that means it doesn’t help on your data — keep the simpler version.
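A sketch of the final generation stage. The model name and prompt wording are assumptions; the important part is passing the retrieved chunks as explicit, citable context:
def answer(query, chunks):
    context = "\n\n".join(f"[{i}] {c['text']}" for i, c in enumerate(chunks, start=1))
    response = anthropic_client.messages.create(
        model="claude-sonnet-4-5",  # assumed choice; any current Claude model works here
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": "Answer using only the context below. Cite chunks as [n]. "
                       "If the context does not contain the answer, say so.\n\n"
                       f"<context>\n{context}\n</context>\n\nQuestion: {query}",
        }],
    )
    return response.content[0].text

# Full pipeline: hybrid retrieval, rerank 50 candidates down to 5, then generate
# print(answer(q, rerank(q, search_hybrid(q, k=50))))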
Going deeper
When you have specific questions, in this order:
- Anthropic Contextual Retrieval cookbook — implementation for the Anthropic post linked above.
- Pinecone — Vector DB benchmarks — when you outgrow pgvector and need to pick.
- Galileo — RAG evaluation framework — for production RAG monitoring beyond static evals.
- Cohere — Rerank docs — including their open-source rerank model options.
Skip “RAG vs fine-tuning” think pieces. They’re almost always written by people who haven’t shipped either to production.
Checkpoints
If any of these wobbles, reread the corresponding section.
- You have a corpus of code documentation. Why would chunking by markdown headers beat fixed 500-token chunks here?
- Walk through Reciprocal Rank Fusion on two ranked lists. Why does it work without tuning?
- Why does a cross-encoder reranker beat embedding similarity for the final ranking? What’s the cost?
- When does HyDE help and when is it pure latency overhead with no benefit?
- You change your chunking strategy and your end-to-end accuracy goes up but Recall@5 goes down. What does this tell you?
When you can answer all five from memory, move to 04.7 Inference, deployment, costs. RAG quality and inference economics are the two biggest determinants of whether your AI product can actually ship.