Retrieval that actually works: lessons from 50 deployments
Chunking, reranking, and evaluation tactics we learned the hard way while grounding agents on messy enterprise knowledge bases.
Almost every failed agent we have been called in to rescue had the same root cause, and it was almost never the model. The agent was fluent, confident, and wrong — because the three paragraphs we handed it were the wrong three paragraphs. Retrieval is the quiet half of RAG that nobody demos, and it is where the deployments live or die. After roughly fifty of them on enterprise knowledge bases that no one had cleaned in a decade, the patterns have stopped surprising us.
This is the field guide we wish we had at deployment number one. It is not about which vector database to buy or which embedding model topped a leaderboard last month. It is about the unglamorous decisions — how you cut documents, what you put in the index, how you reorder results, and crucially how you know any of it is working — that separate a grounded agent from a confident liar.
Key takeaways
- Chunking is a content decision. Splitting on structure (headings, tables, code blocks) and enriching each chunk with its breadcrumb context before embedding consistently outperforms character-count splits.
- Hybrid search is not optional. BM25 catches exact tokens — error codes, SKUs, version strings — that dense embeddings silently round away. Run both and fuse them with Reciprocal Rank Fusion.
- Reranking is the highest-leverage upgrade. A cross-encoder reranker reading query and chunk together — after a wide first-stage recall pass — is the single biggest quality jump most pipelines can make.
- Evaluate retrieval independently from generation. Recall@k and Mean Reciprocal Rank (MRR), measured on a labeled question-to-chunk set, reveal which layer is failing. End-to-end accuracy alone cannot.
- Freshness is a signal, not a nicety. Without explicit metadata filtering, a hybrid pipeline has no way to prefer the 2024 policy over the 2019 one that uses nearly identical language.
- Context window ≠ a recall buffer. Stuffing twenty loosely relevant chunks into a prompt offloads a precision problem onto the model. Retrieve wide; rerank narrow.
The mess is the job
Tutorials run on clean corpora: a tidy folder of markdown, a Wikipedia dump, the same PDF everyone benchmarks on. Real enterprise knowledge looks nothing like that. It is a SharePoint with eleven copies of the same policy, half of them superseded and none of them marked as such. It is PDFs that are scanned images of tables, Confluence pages last touched in 2019, and a "source of truth" spreadsheet whose truth lives in cell comments. It is Slack threads where the actual answer is the fourth reply, not the question.
If you treat retrieval as a solved library problem and just point an embedding model at this pile, you get exactly what you would expect: plausible chunks that are out of date, out of context, or out of scope. The work — the real work — is engineering around the mess. Everything below is a tactic for doing that, roughly in the order the pipeline runs.
The model was never the problem. We handed it the wrong three paragraphs and acted surprised when it believed them.
Chunking is a content decision, not a config value
The most common mistake we inherit is a flat "split every 1,000 characters" rule applied to everything. It is fast to write and it quietly destroys meaning. It cuts a procedure in half, it severs a heading from the paragraph it governs, it splits a table from the row that explains its units. The retriever then returns a fragment that is technically relevant and practically useless.
Chunk on the document's own structure first. Markdown and HTML give you headings and lists for free; respect them. Tables should stay whole or be serialized row-by-row with their header repeated, never sliced down the middle. Code and config blocks are atomic — splitting a YAML block is just generating garbage tokens. Only when a structural unit is genuinely too large do you fall back to a sliding window, and even then you overlap by a sentence or two so a fact that straddles the boundary survives in at least one chunk.
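The two fallbacks described above can be sketched in a few lines. This is a minimal illustration, not a production splitter: the function names, the naive regex sentence splitter, and the 1,000-character default are all assumptions, and `max_chars` is a soft limit because the carried-over sentence can push a chunk slightly past it.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sliding_window(text: str, max_chars: int = 1000, overlap_sentences: int = 1) -> list[str]:
    """Fallback for an oversized unit: pack whole sentences and carry the
    last sentence(s) of each chunk into the next one."""
    chunks: list[str] = []
    current: list[str] = []
    for sentence in split_sentences(text):
        if current and len(" ".join(current + [sentence])) > max_chars:
            chunks.append(" ".join(current))
            # Seed the next chunk with the tail of this one (the overlap).
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


def serialize_table_rows(header: list[str], rows: list[list[str]]) -> list[str]:
    """One chunk per row, with the header repeated so each row is self-describing."""
    return ["; ".join(f"{h}: {v}" for h, v in zip(header, row)) for row in rows]


windows = sliding_window("A one. B two. C three. D four.", max_chars=16)
# windows == ["A one. B two.", "B two. C three.", "C three. D four."]

table_chunks = serialize_table_rows(["Plan", "Limit"], [["Free", "500"], ["Pro", "5000"]])
# table_chunks == ["Plan: Free; Limit: 500", "Plan: Pro; Limit: 5000"]
```

Every sentence in the example appears in at least one chunk alongside its neighbor, which is the property the overlap is there to protect.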
The second move matters as much as the first: enrich every chunk before you embed it. A bare paragraph that says "the maximum is 500 per account" is nearly unretrievable and dangerously ambiguous. Prepend the document title and the heading breadcrumb — "Billing Policy > Rate Limits > Free Tier" — so both the embedding and the model that reads it later know exactly what 500 refers to. Adding that breadcrumb context did more for our recall than any reranker or model upgrade we tried in the same period.
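As a rough sketch of the breadcrumb idea, assuming the source is markdown with `#`-style headings: the function below splits on headings, tracks the heading trail, and prefixes every chunk with it. The name `chunk_with_breadcrumbs` and the `Title > Heading` prefix format are illustrative choices, not a fixed convention.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def chunk_with_breadcrumbs(title: str, markdown: str) -> list[str]:
    """Split on markdown headings and prefix each chunk with its breadcrumb."""
    chunks: list[str] = []
    trail: list[tuple[int, str]] = []  # (heading level, heading text)
    body: list[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            crumb = " > ".join([title] + [name for _, name in trail])
            chunks.append(f"{crumb}\n{text}")
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # a '#' inside code is not a heading
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()  # the body collected so far belongs to the previous heading
            level = len(match.group(1))
            # Drop headings at the same or deeper level before pushing this one.
            while trail and trail[-1][0] >= level:
                trail.pop()
            trail.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Rate Limits\n## Free Tier\nThe maximum is 500 per account.\n## Pro Tier\nThe maximum is 5000 per account."
for chunk in chunk_with_breadcrumbs("Billing Policy", doc):
    print(chunk, end="\n\n")
```

The first chunk reads "Billing Policy > Rate Limits > Free Tier" followed by the paragraph, so the bare "500" now carries its own scope into both the embedding and the prompt.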
Anthropic formalized this pattern in their Contextual Retrieval research, where an LLM generates a short situating sentence for each chunk before it is embedded. Their internal experiments found that contextual embeddings combined with contextual BM25 reduced the top-20-chunk retrieval failure rate by 49% (from 5.7% to 2.9%); adding a reranker on top pushed that reduction to 67% (down to 1.9%). The gains are not marginal.
Real enterprise knowledge is messy and sprawling — how you chunk and index it decides what an agent can ever retrieve.
Embeddings are necessary and not sufficient
Dense embeddings are extraordinary at capturing meaning. They are also lossy in exactly the places enterprise users care about. Ask for "error 0x80070057" and a pure vector search will happily hand you semantically adjacent errors, because the embedding compresses that precise string into a fuzzy neighborhood. The same goes for SKUs, contract numbers, statute references, and the literal phrase a frustrated user copy-pasted out of a log.
So we stopped treating the embedding model as the whole retrieval system. The MTEB leaderboard (Massive Text Embedding Benchmark) is the standard way to compare models across retrieval, classification, clustering, and other tasks, so use it to short-list candidates. Then pick the one that fits your domain and latency budget, normalize your vectors, and move on: the marginal points between this quarter's top two models almost never survive contact with a messy corpus. Where you spend your effort is everything around the embedding: what you index, how you combine signals, and how you reorder the result. Which brings us to the two changes that consistently move the needle.
Hybrid search: stop choosing between meaning and exactness
For a long time the field argued vectors versus keywords as if you had to pick. You do not, and you should not. Run both. A classic BM25 keyword index catches the exact tokens — codes, names, rare jargon — that embeddings round away, while dense retrieval catches the paraphrases and the conceptual matches that keywords miss entirely. Fuse the two ranked lists, and the combined recall beats either leg alone on basically every corpus we have measured.
Benchmark data backs this up. A 2026 arXiv study on financial and table-heavy documents (From BM25 to Corrective RAG, arXiv 2604.01733) found that hybrid retrieval fused with RRF achieved a Recall@5 of 0.695 on T2-RAGBench, outperforming both BM25 alone (0.644) and dense-only retrieval (0.587). Adding a neural reranker on top pushed that further to 0.816, a 39% relative improvement over dense retrieval alone (0.816 / 0.587 ≈ 1.39). Across other corpora, hybrid systems routinely deliver 15–30% better recall than either method in isolation.
The practical detail people get wrong is the fusion. Do not try to compare a cosine similarity against a BM25 score directly — they are not on the same scale and tuning a weight between them is a losing game across heterogeneous queries. Reciprocal Rank Fusion (RRF) sidesteps the whole problem: it scores each document by its rank position in each list, not its raw score, so the two systems can disagree on magnitude and still combine cleanly. RRF was introduced by Cormack, Clarke, and Buettcher at SIGIR 2009 (ACM DL 10.1145/1571941.1572114), where it outperformed Condorcet Fuse, CombMNZ, and every individual learning-to-rank method they benchmarked. It is a dozen lines of code and it has outperformed every hand-tuned weighting we attempted.
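For reference, the rank-only fusion described above fits in a short function. This is a sketch under stated assumptions: the document ids are made up, each input list is assumed to be ordered best-first, and `k = 60` is the constant used in the original RRF paper rather than a value tuned for any particular corpus.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document ids using rank positions only.

    Each list contributes 1 / (k + rank) for every document it contains,
    with rank starting at 1; raw retrieval scores are never compared.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["err-0x80070057", "billing-faq", "install-guide"]    # keyword leg
dense_hits = ["billing-faq", "install-guide", "err-0x80070057"]   # vector leg
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
# fused == ["billing-faq", "err-0x80070057", "install-guide"]
```

A document that only one leg returns still gets a score, so an exact-match hit from BM25 is never silently dropped just because the dense leg missed it.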
| Retrieval method | Recall@5 (T2-RAGBench) |
|---|---|
| Dense only | 0.587 |
| BM25 only | 0.644 |
| Hybrid RRF | 0.695 |
| Hybrid RRF + neural reranker | 0.816 |
Source: arXiv 2604.01733, financial and table-heavy document benchmark.
A mistake we keep seeing: teams crank first-stage retrieval up to top-20, stuff all twenty chunks into the prompt, and call it grounding. You have not grounded the agent; you have buried the answer in noise and handed the model the job of finding it. Retrieve wide for recall, then narrow hard with a reranker. The context window is a precision instrument, not a dumpster.
Reranking is the highest-leverage stage
First-stage retrieval — vector, keyword, or both — is optimized for recall. Its job is to make sure the right chunk is somewhere in the top fifty, not to put it at position one. That is a different and harder task, and it is what a reranker is for. A cross-encoder reads the query and each candidate together rather than comparing precomputed vectors, so it judges true relevance instead of geometric proximity. You take the fifty candidates from stage one, rerank them, and keep the best handful.
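The shape of that second stage is simple enough to sketch. In the code below the scoring function is pluggable; `token_overlap` is only a toy stand-in so the example runs without a model, and in a real pipeline you would pass a function that calls a cross-encoder on each query-chunk pair. All names and the default `keep=8` are assumptions for illustration.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    keep: int = 8,
) -> list[tuple[str, float]]:
    """Score every (query, chunk) pair and keep the best `keep`, highest first."""
    scored = [(chunk, score_fn(query, chunk)) for chunk in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:keep]


def token_overlap(query: str, chunk: str) -> float:
    """Toy scorer (NOT a cross-encoder): fraction of query tokens found in the chunk."""
    query_tokens = set(query.lower().split())
    chunk_tokens = set(chunk.lower().split())
    return len(query_tokens & chunk_tokens) / len(query_tokens) if query_tokens else 0.0


first_stage = [
    "holiday schedule for 2024",
    "password policy overview",
    "how to reset your password",
]
top = rerank("reset password", first_stage, token_overlap, keep=2)
# top == [("how to reset your password", 1.0), ("password policy overview", 0.5)]
```

Keeping the scorer as a parameter also makes it easy to compare rerankers on the same candidate pool.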
This was, without much competition, the single biggest quality jump we have shipped. Answers got more grounded, citations got more accurate, and hallucinations dropped, not because the model changed, but because it finally received eight tightly relevant chunks instead of twenty loosely relevant ones. The cost is real but bounded: reranking adds a step between retrieval and generation, but scoring fifty short candidates is cheap next to a generation call, and because each query-chunk pair is scored independently the candidates can be batched or scored in parallel. If you adopt one tactic from this entire piece, adopt this one.
The T2-RAGBench data above makes the reranking gain concrete: adding Cohere Rerank to a hybrid RRF pipeline boosted MRR@3 from 0.433 to 0.605 — a 39.7% relative improvement (arXiv 2604.01733). That is the reranker's job in one number.
Retrieve wide for recall, then rerank down to a precise few — recall first, precision last.
Grounding and citations: make the agent show its work
Retrieving the right chunk is necessary but not the finish line. The agent still has to use it, and only it. We require every claim in an answer to carry a citation back to a specific chunk, and we render those citations in the UI so a user can click straight to the source. That single requirement does two jobs: it gives the human a fast way to verify, and it gives us a precise place to look when an answer goes wrong — was the bad citation a retrieval failure, or did the model ignore good context?
Prompting matters here too. We tell the model explicitly that if the retrieved context does not contain the answer, the correct response is to say so and offer to escalate — not to fill the gap from its parameters. An agent that confidently invents a refund policy is far more dangerous than one that says "I do not have that documented." Pair that instruction with a retrieval confidence threshold: if even the top reranked chunk scores poorly, treat it as a no-answer and route to a human rather than forcing a grounded-looking guess.
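A confidence gate of that kind might look like the sketch below. It assumes the reranker returns `(chunk, score)` pairs sorted best-first with scores on a roughly 0 to 1 scale; the `Decision` type, the action names, and the 0.5 threshold are placeholders, and the threshold in particular needs calibrating against labeled data for whichever reranker you use.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    action: str        # "answer" or "escalate"
    chunks: list[str]  # context to pass to the model when answering


def gate(reranked: list[tuple[str, float]], min_top_score: float = 0.5) -> Decision:
    """Route to a human when even the best reranked chunk scores poorly.

    `reranked` is assumed to be sorted best-first.
    """
    if not reranked or reranked[0][1] < min_top_score:
        return Decision(action="escalate", chunks=[])
    return Decision(action="answer", chunks=[chunk for chunk, _ in reranked])


print(gate([("refund policy, section 4", 0.91), ("returns FAQ", 0.62)]).action)  # answer
print(gate([("holiday schedule", 0.12)]).action)                                  # escalate
```

The gate runs before generation, so a weak retrieval never reaches the model dressed up as context.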
How we actually evaluate retrieval
Here is the discipline that took us the longest to internalize: evaluate retrieval separately from generation. When the final answer is wrong, two very different failures look identical from the outside — the retriever missed, or the model fumbled good context. If you only ever score the end-to-end answer, you cannot tell which knob to turn, and you will waste weeks tuning prompts when the real problem was a chunk that never got retrieved.
So we build a labeled retrieval set. For a few hundred real questions, pulled from actual usage and support logs rather than invented at a desk, we annotate which chunks genuinely contain the answer. Then we score the retriever in isolation: Recall@k tells us whether the right chunk made it into the candidate pool at all, and Mean Reciprocal Rank (MRR) tells us how close to the top it landed after reranking. MRR is the average, across queries, of the reciprocal rank of the first relevant result: if the right chunk ranks third, that query contributes 1/3, and if it is never retrieved, it contributes 0. Those two numbers, tracked over time, are our retrieval dashboard. We run them as a regression gate, so any change to chunking, the embedding model, or the fusion step has to prove it did not quietly tank recall before it ships.
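Both metrics and the regression gate are small enough to write by hand. The sketch below assumes each labeled example is a pair of (ranked chunk ids returned by the retriever, set of chunk ids annotated as relevant). Recall@k here is the fraction of relevant chunks found in the top k, which reduces to a simple hit-or-miss when each question has one labeled chunk. The function names and the 0.01 tolerance are illustrative.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: list[tuple[list[str], set[str]]], k: int) -> dict[str, float]:
    """Average both metrics over a labeled set of (retrieved, relevant) pairs."""
    if not runs:
        raise ValueError("evaluation set is empty")
    n = len(runs)
    return {
        "recall@k": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }


def passes_gate(baseline: dict[str, float], candidate: dict[str, float],
                tolerance: float = 0.01) -> bool:
    """True if no tracked metric dropped by more than `tolerance` versus baseline."""
    return all(candidate[name] >= baseline[name] - tolerance for name in baseline)


runs = [
    (["a", "b", "c"], {"c"}),  # right chunk ranked third: RR = 1/3
    (["x", "y", "z"], {"x"}),  # right chunk ranked first: RR = 1
    (["p", "q", "r"], {"s"}),  # right chunk never retrieved: RR = 0
]
metrics = evaluate(runs, k=3)
# metrics["recall@k"] is 2/3 and metrics["mrr"] is (1/3 + 1 + 0) / 3 = 4/9
```

Running `passes_gate` against the previous release's numbers in CI turns the dashboard into a hard stop rather than something people remember to check.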
For the generation half we layer on a separate check using RAGAS (Retrieval-Augmented Generation Assessment), an open-source framework introduced by Shahul Es et al. (2023) that measures faithfulness (does every claim trace to a retrieved chunk) and answer relevance, without requiring hand-written ground-truth answers. We sample those scores and have humans spot-audit them so the LLM-as-judge itself stays honest. Keeping the two scores apart is what lets us debug. High recall with low faithfulness points squarely at the prompt; low recall points squarely at the index. The hardest-won lesson across fifty deployments is almost embarrassingly simple: you cannot improve what you have not separated, and you cannot trust a retrieval system you have never measured on its own.
The tactics, in one place
If you are starting fresh or auditing a pipeline that is underperforming, this is the checklist we run through, in roughly the order the data flows.
Chunk on structure, not character counts. Split on headings, list boundaries, and table rows before you ever reach for a fixed token window. A chunk that ends mid-sentence or mid-table is a retrieval miss waiting to happen. When you must fall back to size, overlap by a sentence or two so meaning never falls into the gap.
Carry context into the chunk. A standalone paragraph rarely says what document, section, or product it belongs to. Prepend the title and breadcrumb path to each chunk before embedding it, so "the limit is 500" still knows which limit it means. Anthropic's contextual retrieval research (anthropic.com/news/contextual-retrieval), which has an LLM write the situating context, reports that contextual embeddings alone cut top-20 retrieval failures by 35%; adding contextual BM25 takes that to 49%, and adding reranking takes it to 67%. This one change moved more recall for us than any model swap.
Run hybrid search, always. Vectors are great at meaning and terrible at exact tokens — error codes, SKUs, version numbers, the literal phrase a user pasted. Run BM25 and dense retrieval in parallel and fuse them with Reciprocal Rank Fusion (the original SIGIR 2009 paper showed RRF outperforming every alternative they benchmarked). The keyword leg catches everything the embedding quietly rounds off.
Rerank the top-k before you trust it. First-stage retrieval is a recall tool, not a precision tool. Pull 50 candidates, then run a cross-encoder reranker to reorder them and keep the best 5 to 8. It is cheap relative to generation, and it is the single biggest lever on answer quality we have found.
Make freshness a first-class signal. Enterprise corpora rot. The same policy exists in four documents across three years, and the embedding has no idea which one is current. Push effective dates and supersession into your metadata and filter or boost on them — relevance without recency just confidently returns the wrong year.
Evaluate retrieval on its own. Do not judge retrieval through the lens of the final answer — a good model papers over a bad context, and a bad model buries a good one. Build a labeled set of question-to-chunk pairs and track Recall@k and MRR on retrieval in isolation, every time you change anything. Use a framework like RAGAS for the generation side.
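The freshness item in the checklist above is the one most often left as an intention, so here is one way it could look in code. Everything about the schema is hypothetical: `policy_key` stands for whatever field groups versions of the same document, and `superseded_by` for however your corpus records supersession. This sketch filters hard; boosting by recency instead is an equally valid design.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Chunk:
    chunk_id: str
    policy_key: str                      # hypothetical: groups versions of one policy
    effective: date                      # date this version took effect
    superseded_by: Optional[str] = None  # id of the replacing version, if any


def current_only(chunks: list[Chunk], as_of: date) -> list[Chunk]:
    """Keep, per policy_key, the newest version already in effect and not superseded."""
    newest: dict[str, Chunk] = {}
    for chunk in chunks:
        if chunk.superseded_by is not None or chunk.effective > as_of:
            continue  # explicitly replaced, or not yet in effect
        best = newest.get(chunk.policy_key)
        if best is None or chunk.effective > best.effective:
            newest[chunk.policy_key] = chunk
    # Preserve the retriever's original ordering among the survivors.
    return [chunk for chunk in chunks if newest.get(chunk.policy_key) is chunk]


hits = [
    Chunk("travel-2019", "travel", date(2019, 1, 1)),
    Chunk("travel-2024", "travel", date(2024, 1, 1)),
    Chunk("travel-2030", "travel", date(2030, 1, 1)),
]
fresh = current_only(hits, as_of=date(2025, 1, 1))
# [c.chunk_id for c in fresh] == ["travel-2024"]
```

Applying the filter to first-stage candidates, before reranking, keeps stale versions from competing for the few slots that reach the prompt.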
What we would tell our past selves
None of this is exotic. There is no secret model, no proprietary index, no clever prompt that makes messy data clean. The teams that get grounded agents into production are the ones that treat retrieval as its own engineering surface — with its own structure decisions, its own signals, its own evaluation harness — instead of a setup step they did once and never looked at again.
Start with chunking, because everything downstream inherits its mistakes. Add hybrid search and a reranker before you touch the model, because they will out-earn any model swap. And build the retrieval eval set early, because it is the only thing that tells you, honestly, whether the next change helped or just felt like it did. Fifty deployments in, that is the whole secret — and it is far less about intelligence than about being willing to measure the unglamorous middle.
Frequently asked questions
What is the difference between recall@k and MRR in RAG evaluation? Recall@k measures whether the correct chunk appears anywhere in the top-k results — it answers "did we retrieve it at all?" Mean Reciprocal Rank (MRR) measures how high the first correct chunk ranks — it answers "did we retrieve it near the top?" For reranking evaluation, MRR is typically more informative because you care about rank, not just membership in the candidate set. Track both: a high Recall@10 with a low MRR often signals that reranking is underperforming.
Why use Reciprocal Rank Fusion instead of a weighted combination of BM25 and vector scores? BM25 scores and cosine similarity scores are on incompatible scales. A BM25 score of 18 and a cosine score of 0.72 cannot be meaningfully compared or summed without careful per-query normalization, and any fixed weight will be wrong for some queries. RRF sidesteps the problem entirely by working only on rank positions: each document scores 1/(k + rank) in each retrieval leg, where k is a small smoothing constant (60 in the original paper, unrelated to the k in top-k), and the per-leg scores are summed. No score calibration is required, and the original Cormack et al. (2009) paper showed it outperforming every tuned alternative they tested.
When should I add contextual chunk enrichment versus just improving my embedding model? Add contextual enrichment first — it is cheaper and more durable. Swapping embedding models requires re-embedding your entire corpus and rarely survives contact with messy enterprise data. Adding a breadcrumb prefix or LLM-generated context sentence to each chunk before embedding is a one-time indexing cost that compounds across every query. Anthropic's research shows it reduces retrieval failures by 35% on its own, before any model change.
How do I know if my retrieval problem is a chunking issue or an embedding issue? Run your labeled retrieval eval set and look at the failure mode. If the correct chunk exists but scores low despite good wording, suspect the embedding model or its query-document alignment. If the correct chunk is never in the top-k at all, suspect chunking: a chunk that cuts across a structural boundary may never surface regardless of how good the embedding model is. Failures on exact tokens (codes, names) are almost always a missing BM25 leg, not an embedding problem.
Sources
- Contextual Retrieval — Anthropic (2024) — 49% retrieval failure reduction with contextual embeddings + BM25; 67% with reranking added.
- From BM25 to Corrective RAG: Benchmarking Retrieval Strategies for Text-and-Table Documents — arXiv 2604.01733 (2026) — Recall@5 benchmarks: dense 0.587, BM25 0.644, Hybrid RRF 0.695, Hybrid + reranker 0.816; MRR@3 gain from reranking (+39.7%).
- Reciprocal Rank Fusion outperforms Condorcet and Individual Rank Learning Methods — Cormack, Clarke & Buettcher, SIGIR 2009 (ACM DL 10.1145/1571941.1572114) — Original RRF paper; RRF outperforms all alternatives on LETOR 3 benchmark.
- RAGAS: Automated Evaluation of Retrieval Augmented Generation — Shahul Es et al., arXiv 2309.15217 (2023) — Framework for reference-free RAG evaluation using faithfulness, answer relevancy, context precision, and context recall metrics.
- RAG-Fusion: a New Take on Retrieval-Augmented Generation — Rackauckas, arXiv 2402.03367 (2024) — Applies RRF to multiple generated queries for improved retrieval diversity.
- MTEB Leaderboard (Massive Text Embedding Benchmark) — Hugging Face — Standard benchmark for comparing embedding models across retrieval and other tasks.
- Retrieval-Augmented Generation for Large Language Models: A Survey — Gao et al., arXiv 2312.10997 (2023) — Comprehensive survey of RAG paradigms covering Naive, Advanced, and Modular RAG.