RAG done right
Chunking, embeddings, hybrid search, rerankers, query rewriting. Why most RAG demos fall apart in production — and the design that doesn't.
Prerequisites
04.2
Stack
- Python 3.12
- voyage-3 or text-embedding-3-large
- Postgres + pgvector OR Qdrant
- BM25 (rank_bm25 or ts_vector)
- Cohere Rerank or similar reranker
- Anthropic API
By the end of this module
- Build a RAG system that beats pure vector search by a measurable margin on your own corpus.
- Pick chunking, embeddings, and vector store appropriately for the data you actually have.
- Implement hybrid retrieval (BM25 + vector) and a reranker stage.
- Write a small RAG eval set and use it to validate every change.
Most RAG demos look amazing and most production RAG systems disappoint. The gap is not subtle — pure vector search over naively chunked documents simply does not retrieve well enough on real data. This module is about closing that gap. By the end you’ll have built a RAG system that uses hybrid search, reranking, and query rewriting, and you’ll have an eval harness that proves it’s better than the demo version everyone else ships.
The opinion: if you’re using LangChain’s default RAG chain unmodified, you have not built a RAG system. You have built a debugging exercise. The defaults are catastrophically bad on real corpora — pure vector, fixed-size chunks, no reranker, no query rewrite. Every one of those choices loses retrieval quality on real data. Fix them in this order, measure each, and you’ll have something that actually works.
Set up
mkdir rag && cd rag
uv venv .venv && source .venv/bin/activate
uv pip install anthropic voyageai cohere psycopg2-binary pgvector \
rank_bm25 sentence-transformers tiktoken python-dotenv
# Postgres with pgvector via docker
cat > docker-compose.yml <<'EOF'
services:
  db:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_PASSWORD: rag
    ports: ["5432:5432"]
EOF
docker compose up -d
cat > .env <<'EOF'
ANTHROPIC_API_KEY=sk-ant-...
VOYAGE_API_KEY=...
COHERE_API_KEY=...
EOF
git init && printf ".env\n.venv/\n" >> .gitignore
You need a real corpus of your own. The built-in benchmarks are misleading — they’re too clean. Use your notes, your company’s docs, a Wikipedia subset, a codebase. Something with roughly 1000-10000 documents and real messiness.
Read these first
Three sources, in order, then stop:
- Anthropic — Contextual Retrieval. post · 30 min · the cleanest explanation of why naive RAG fails and what fixes it.
- Pinecone — Hybrid Search guide. docs · 20 min · the case for BM25 + vector together.
- Lewis et al. — Retrieval-Augmented Generation. arxiv · 30 min · the original paper. Mostly historical — what’s in production today is much further along.
Skip the LangChain tutorials and the “build RAG in 10 lines” YouTube videos. Those are how everyone produces broken RAG.
The demo-to-production gap
Demo RAG looks great because:
- The demo dataset is small enough that any retrieval works.
- The demo questions match the document phrasing word-for-word.
- The demo is evaluated by the person who built it on the questions they had in mind.
Production RAG looks bad because:
- The corpus is large; retrieval has to discriminate between many similar documents.
- Real users phrase questions in ways totally unlike the source documents.
- The embeddings of “What’s our refund policy?” and “Can I get my money back?” are similar but not identical, and the document containing the answer is in the top-50 but not top-5.
The fix isn’t a magic prompt. It’s plumbing: better chunks, hybrid search, a reranker, query rewriting, and evals to verify each change.
Chunking that actually makes sense
The default everyone uses: split into fixed 500-token chunks. This is wrong.
| Strategy | When it works | When it fails |
|---|---|---|
| Fixed-size with overlap | Plain prose, blog posts | Splits structured docs across chunks |
| Recursive structural | Markdown, code, HTML | When structure is missing or noisy |
| Semantic (cluster on embeddings) | Long meandering docs | Slow, marginal gains |
| Sentence-window | Q+A retrieval | Loses context for narrative |
| Document-as-chunk | Short docs (under 1000 tokens) | Long docs are too coarse |
Two specific things that matter:
- Respect the structure your data already has. Markdown headings. JSON keys. Code function boundaries. PDFs with sections. Don’t shred it; chunk along the boundaries.
- Add context to each chunk. A bare chunk like “…the limit is 30 days…” is useless when retrieved without context. Anthropic’s contextual retrieval idea: prepend each chunk with a 1-2 sentence summary of where it sits in the document. Cheap (one Claude call per chunk) and meaningfully better (a sketch follows the chunking example below).
def chunk_markdown(text, target_tokens=400):
    # Split on H1/H2/H3 headings, fall back to paragraphs
    sections = []
    for line in text.split("\n"):
        if line.startswith("#") or not sections:
            sections.append([line])
        else:
            sections[-1].append(line)
    chunks = []
    for section in sections:
        body = "\n".join(section)
        # Further split if too big (~4 chars per token is a rough estimate)
        if len(body) // 4 > target_tokens:
            # ... recursive splitting
            pass
        chunks.append(body)
    return chunks
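A minimal sketch of the contextual step described above, assuming one claude-haiku-4-5 call per chunk and a document short enough to fit in the prompt. The contextualize helper and the prompt wording are illustrative, not Anthropic's exact recipe:
import anthropic

anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def contextualize(document, chunk):
    # Ask for a 1-2 sentence summary of where the chunk sits in its document
    context = anthropic_client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": f"<document>\n{document}\n</document>\n\n"
                       f"<chunk>\n{chunk}\n</chunk>\n\n"
                       "In 1-2 sentences, say where this chunk sits in the document so it can be "
                       "understood on its own. Reply with the context only.",
        }],
    ).content[0].text
    # Index the contextualized text; keep the raw chunk around for display
    return f"{context}\n\n{chunk}"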
Embeddings — don’t default to OpenAI
The OpenAI embedding models are fine but they’re rarely the best choice. As of late 2025, the leaderboard moves quickly, but a useful rule:
| Embedding model | When to choose |
|---|---|
| voyage-3 / voyage-3-large | Best general-purpose. Default unless you have a reason. |
| text-embedding-3-large | Solid, ubiquitous, slightly behind voyage on quality. |
| Cohere embed-english-v3 | Strong on enterprise text. |
| BGE-large / e5-mistral | Open weight; self-host for cost. Slower than API. |
| nomic-embed-v1.5 | Open weight; smaller and faster. Good for high-volume. |
Run a quick eval on your data before committing. The MTEB leaderboard is a starting point, not an answer.
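The retrieval code later in this module assumes an embed() helper. Here is a minimal sketch using the voyageai client; voyage-3 matches the stack above, but swap in whatever model wins your own eval:
import os
import voyageai

vo = voyageai.Client(api_key=os.environ["VOYAGE_API_KEY"])

def embed(text, input_type="query"):
    # voyage-3 returns 1024-dimensional vectors, matching the pgvector schema below
    return vo.embed([text], model="voyage-3", input_type=input_type).embeddings[0]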
Vector DB choice
Most people pick a vector DB before they have data and live with that choice forever. Don’t.
| Option | Right for |
|---|---|
| pgvector | under 1M vectors, you already have Postgres, sane default |
| Qdrant | 1M+ vectors, want filters and quantization |
| LanceDB | Local-first, embedded, no server |
| Pinecone | You want managed and have budget |
| FAISS | In-process, single-machine, you’ll handle persistence |
| No vector DB | Tiny corpus (under 10K). Just embed and brute force. |
For most projects in this module: pgvector. It scales further than people think, your auth and backups already work, and you avoid running a second database.
-- pgvector schema
CREATE EXTENSION vector;
CREATE TABLE chunks (
id BIGSERIAL PRIMARY KEY,
doc_id TEXT,
text TEXT,
embedding vector(1024),
metadata JSONB
);
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);
CREATE INDEX ON chunks USING gin (to_tsvector('english', text));
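A minimal ingestion sketch against that schema. It assumes psycopg2 with pgvector's register_vector adapter plus the chunk_markdown and embed helpers from earlier; the connection string and metadata are placeholders:
import numpy as np
import psycopg2
from psycopg2.extras import Json
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("postgresql://postgres:rag@localhost:5432/postgres")
register_vector(conn)  # lets psycopg2 adapt numpy arrays to the vector column type

def ingest(doc_id, text):
    with conn.cursor() as cur:
        for chunk in chunk_markdown(text):
            vec = np.array(embed(chunk, input_type="document"))
            cur.execute(
                "INSERT INTO chunks (doc_id, text, embedding, metadata) VALUES (%s, %s, %s, %s)",
                (doc_id, chunk, vec, Json({"source": doc_id})),
            )
    conn.commit()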
Hybrid search — the single biggest win
Pure vector search loses to pure BM25 on certain query types (acronyms, names, exact phrases) and BM25 loses to vector on semantic queries (“explain how X works”). Combining both wins on almost everything.
def search_hybrid(query, k=20):
    # Vector branch
    q_embed = embed(query)
    vector_results = db.execute("""
        SELECT id, text, 1 - (embedding <=> %s) AS score
        FROM chunks ORDER BY embedding <=> %s LIMIT %s
    """, (q_embed, q_embed, k)).fetchall()
    # BM25 branch (Postgres ts_vector here; rank_bm25 works too)
    keyword_results = db.execute("""
        SELECT id, text, ts_rank(to_tsvector('english', text),
               plainto_tsquery('english', %s)) AS score
        FROM chunks WHERE to_tsvector('english', text) @@ plainto_tsquery('english', %s)
        ORDER BY score DESC LIMIT %s
    """, (query, query, k)).fetchall()
    # Fuse the two ranked lists with Reciprocal Rank Fusion
    return rrf(vector_results, keyword_results, k=60)

def rrf(*rankings, k=60):
    # Score each id by summed 1/(k + rank) across lists; keep text for downstream stages
    scores, texts = {}, {}
    for ranking in rankings:
        for rank, (id_, text, _) in enumerate(ranking, start=1):
            scores[id_] = scores.get(id_, 0) + 1 / (k + rank)
            texts[id_] = text
    fused = sorted(scores.items(), key=lambda x: -x[1])
    return [{"id": id_, "text": texts[id_], "score": score} for id_, score in fused]
Reciprocal Rank Fusion is dumb and works. It needs no tuning, doesn’t care about score scales, and beats most weighted-sum schemes. Use RRF unless you have a strong reason not to.
Rerankers — the second biggest win
After hybrid retrieval gives you 50 candidates, run a cross-encoder reranker on (query, candidate) pairs. The reranker scores actual relevance with a model that sees both the query and the chunk together — much more accurate than embedding similarity.
import os
import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])

def rerank(query, candidates, top_k=5):
    docs = [c["text"] for c in candidates]
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=docs,
        top_n=top_k,
    )
    return [candidates[r.index] for r in response.results]
This is one API call per query and it raises retrieval quality more than any other single change after hybrid search. Open-weight alternatives: bge-reranker-v2-m3 (self-host), Jina rerankers.
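If you'd rather self-host, here is a sketch of the same stage using sentence-transformers' CrossEncoder (already in the install list) with the open-weight bge-reranker-v2-m3; expect to want a GPU for real throughput:
from sentence_transformers import CrossEncoder

# Open-weight cross-encoder; the first call downloads the weights from Hugging Face
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank_local(query, candidates, top_k=5):
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)  # one relevance score per (query, chunk) pair
    ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
    return [c for c, _ in ranked[:top_k]]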
Query rewriting
Your retrieval is only as good as the query you give it. Two patterns help:
- HyDE (Hypothetical Document Embeddings). Have the model write a hypothetical answer to the query, then embed that and search. The hypothetical answer often matches the real document better than the question does.
- Sub-query decomposition. For multi-hop questions (“Compare X’s policy to Y’s”), use the model to split into atomic queries, retrieve for each, then synthesize. A sketch follows the HyDE example below.
import anthropic

anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def hyde_search(query):
    hypothetical = anthropic_client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=300,
        messages=[{"role": "user", "content": f"Write a one-paragraph answer to: {query}"}],
    ).content[0].text
    return search_hybrid(hypothetical)
Worth it when queries are short and abstract. Adds latency; skip if your queries are already concrete and verbose.
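A sketch of the second pattern, sub-query decomposition. The prompt and the one-sub-query-per-line convention are assumptions, and the dedup is deliberately naive:
def decompose_search(query, k=20):
    sub_queries = anthropic_client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": "Split this question into independent sub-questions, one per line, "
                       "no numbering. If it is already atomic, return it unchanged:\n\n" + query,
        }],
    ).content[0].text.splitlines()
    # Retrieve for each sub-query, dedupe by chunk id, keep first-seen order
    seen, merged = set(), []
    for sq in (s.strip() for s in sub_queries if s.strip()):
        for chunk in search_hybrid(sq, k=k):
            if chunk["id"] not in seen:
                seen.add(chunk["id"])
                merged.append(chunk)
    return merged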
Evals — the part most teams skip
Without evals you cannot tell if a change made retrieval better or worse. Build a small eval set (50-100 questions) on your real corpus.
# eval.jsonl
{"q": "What's our return policy for digital products?",
"must_retrieve_doc_id": "policy-returns-v3"}
{"q": "How do I configure SSO with Okta?",
"must_retrieve_doc_id": "sso-okta-setup"}
Track three metrics:
| Metric | What it tells you |
|---|---|
| Recall@k | Did we retrieve the right doc in the top k? |
| MRR (mean reciprocal rank) | How high up the ranking is the right doc? |
| End-to-end accuracy | Does the final generated answer match the expected answer? |
Run the eval before any change. Run it after. If a change drops Recall@5 without a clear reason, revert it.
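A minimal harness for Recall@k and MRR. It assumes eval.jsonl as above and a retrieval function whose results carry a doc_id field (add doc_id to the SELECTs if you use the hybrid search code as written). Keep it as a script you run before and after every change:
import json

def run_eval(search_fn, path="eval.jsonl", k=5):
    recall_hits, reciprocal_ranks, n = 0, 0.0, 0
    for line in open(path):
        case = json.loads(line)
        results = search_fn(case["q"])[:k]
        # Rank positions (1-based) at which the required doc appears
        ranks = [i for i, r in enumerate(results, start=1)
                 if r["doc_id"] == case["must_retrieve_doc_id"]]
        recall_hits += bool(ranks)
        reciprocal_ranks += 1 / ranks[0] if ranks else 0.0
        n += 1
    print(f"Recall@{k}: {recall_hits / n:.2f}   MRR: {reciprocal_ranks / n:.2f}")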
The build
Build it in this order. Measure after each step.
1. Ingest your corpus, chunk by structure, store in pgvector.
2. Pure vector search. Run eval. Record baseline.
3. Add BM25. Combine with RRF. Run eval. Record gain.
4. Add Cohere reranker on top-50 → top-5. Run eval. Record gain.
5. Add HyDE for abstract queries. Run eval. Decide if worth the latency.
6. Final stage: Claude generates the answer with retrieved chunks as context (sketched below).
You should see Recall@5 improve at steps 3 through 5 over the step-2 baseline. If a step doesn’t help, that means it doesn’t help on your data — keep the simpler version.
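A sketch of the final generation stage. The model name and prompt wording are assumptions; the important part is passing the retrieved chunks as explicit, citable context:
def answer(query, chunks):
    context = "\n\n".join(f"[{i}] {c['text']}" for i, c in enumerate(chunks, start=1))
    response = anthropic_client.messages.create(
        model="claude-sonnet-4-5",  # assumed choice; any current Claude model works here
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": "Answer using only the context below. Cite chunks as [n]. "
                       "If the context does not contain the answer, say so.\n\n"
                       f"<context>\n{context}\n</context>\n\nQuestion: {query}",
        }],
    )
    return response.content[0].text

# Full pipeline: hybrid retrieval, rerank 50 candidates down to 5, then generate
# print(answer(q, rerank(q, search_hybrid(q, k=50))))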
Going deeper
When you have specific questions, in this order:
- Anthropic Contextual Retrieval cookbook — implementation for the Anthropic post linked above.
- Pinecone — Vector DB benchmarks — when you outgrow pgvector and need to pick.
- Galileo — RAG evaluation framework — for production RAG monitoring beyond static evals.
- Cohere — Rerank docs — including their open-source rerank model options.
Skip “RAG vs fine-tuning” think pieces. They’re almost always written by people who haven’t shipped either to production.
Checkpoints
If any of these wobbles, reread the corresponding section.
- You have a corpus of code documentation. Why would chunking by markdown headers beat fixed 500-token chunks here?
- Walk through Reciprocal Rank Fusion on two ranked lists. Why does it work without tuning?
- Why does a cross-encoder reranker beat embedding similarity for the final ranking? What’s the cost?
- When does HyDE help and when is it pure latency overhead with no benefit?
- You change your chunking strategy and your end-to-end accuracy goes up but Recall@5 goes down. What does this tell you?
When you can answer all five from memory, move to 04.7 Inference, deployment, costs. RAG quality and inference economics are the two biggest determinants of whether your AI product can actually ship.