Why most RAG systems fail before retrieval
The retrieval algorithm is rarely the problem. Most RAG failures happen earlier, at stages the team isn't looking at. Here's the failure shape we keep seeing and the order we'd actually debug it in.
If you have built a RAG system and watched it ship to production, you have probably had this conversation. The system answers well in the demo. It answers worse with real users. Somebody asks "is it the model" and the team says no, the model is fine. So you start tuning retrieval. You swap dense for hybrid. You add a reranker. You bump top-k. You try Cohere, then Voyage, then Jina. Each change moves the needle a percent or two. None of them fix the failures your users keep flagging.
That is because the user-reported failures almost never start at retrieval. They start before retrieval. The chunks being indexed are wrong. The metadata is missing. The query the user typed never gets translated into something the index can answer. And nobody has built the evaluation surface to tell the difference between a retrieval failure and a generation failure, so the team is optimising blind.
Source ingestion ← ① data hygiene
↓
Chunking ← ② no inherent context
↓
Embedding ← ④ wrong model for corpus
↓
Indexing ← ③ empty metadata
↓
Query translation
↓
─── Retrieval ─── where most teams look
↓
Generation
↓
Validation ← ⑤ no eval harness
↓
Response
The retrieval-and-generation middle gets all the attention. The numbered failures cluster outside it, and they map to the five sections below.
Here is what that failure shape actually looks like, in roughly the order it goes wrong.
The data being indexed is already broken
Most teams import their knowledge base into the vector store using whatever loader the framework happens to ship with. The default Notion loader strips formatting. The default Confluence loader flattens tables into runs of spaces. The default Salesforce loader pulls one record at a time and creates one chunk per record, regardless of whether the record has three fields or thirty.
By the time the content reaches the embedding model, headings are gone, code blocks have lost their delimiters, tables have lost their column structure, and the document's hierarchy has been compressed into a single flat string. The embedding model is now embedding a degraded version of your source material. The user is searching against that degraded version.
The fix is unglamorous. For sources that matter, teams usually end up writing at least part of the ingestion layer themselves rather than trusting the default loader. Read the source format. Preserve the structure that humans use to find information in the original document. If a heading in the original means "this section is about refunds," that heading needs to live somewhere queryable, either inline in the chunk or in the metadata. When you audit a struggling RAG system, this is almost always where the worst damage has been done. It happens before anyone has run a single query, and the team that built it usually has not gone back to look.
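As a rough illustration of "preserve the structure," here is a minimal sketch of a hand-written ingestion step. It assumes the source can be exported as Markdown with `#`-style headings, which will not be true of every system. The names `Section` and `split_by_headings` are ours, not from any framework. The idea is only that each block of text keeps the trail of headings it sat under, and that code fences survive intact.

```python
import re
from dataclasses import dataclass, field

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # Markdown code-fence marker


@dataclass
class Section:
    heading_path: list[str]                      # e.g. ["Refunds", "Issuing a refund"]
    lines: list[str] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "\n".join(self.lines).strip()


def split_by_headings(markdown: str) -> list[Section]:
    """Split a Markdown export into sections, keeping each section's heading trail."""
    sections: list[Section] = []
    path: list[str] = []
    current = Section(heading_path=[])
    in_fence = False
    for line in markdown.splitlines():
        if line.strip().startswith(FENCE):
            in_fence = not in_fence              # do not treat '#' inside code as a heading
        match = None if in_fence else HEADING.match(line)
        if match:
            if current.text:
                sections.append(current)
            level = len(match.group(1))
            path = path[: level - 1] + [match.group(2).strip()]
            current = Section(heading_path=list(path))
        else:
            current.lines.append(line)           # code fences and table rows stay as written
    if current.text:
        sections.append(current)
    return sections
```

The heading trail can then be stored in metadata, prepended to the chunk, or both. Tables and other structures would need their own handling, which this sketch does not attempt.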
The chunks have no idea what they are about
A second class of failure shows up around the chunking strategy. Most systems chunk by fixed token count, sometimes with a fifty or hundred token overlap. This is fine for novels. It is bad for documentation.
A chunk in the middle of a how-to article does not say "this is part of the refund workflow." It says "click the green button and confirm the dialog." The embedding for that chunk is closer to other instructional text than it is to anything about refunds. When the user asks "how do I refund a customer," the system fails to retrieve this chunk not because the retrieval algorithm is bad but because the chunk genuinely does not look like an answer to that question on its own.
There are a few real fixes. The simplest is to prepend the document's hierarchical context to every chunk before embedding, so the chunk above becomes "Refunds > Issuing a refund > Step 3: click the green button and confirm the dialog." Now the chunk looks like what the user is asking about, because it carries its own context. Other approaches also work well in practice: header-aware chunking, semantic chunking that uses a small model to find natural boundaries, and sliding-window chunking that respects sentence boundaries. All of them outperform fixed-size token chunking on most documentation-heavy corpora.
We default to header-aware chunking with context-prepending unless there is a reason not to. It is cheap. It is interpretable. It is one of the highest-leverage retrieval fixes available on documentation-heavy systems, and it consistently outperforms the more glamorous alternatives.
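A minimal sketch of context-prepending, under a few assumptions: the heading trail for the section is already known, paragraphs are separated by blank lines, and a word count stands in for a real token count. The function name `chunk_with_context` and the `max_words` default are illustrative, not taken from any library.

```python
def chunk_with_context(heading_path, body, max_words=120):
    """Split body into paragraph-aligned chunks, each prefixed with its heading trail."""
    prefix = " > ".join(heading_path)
    paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        words = len(para.split())                # crude proxy for tokens
        if current and count + words > max_words:
            chunks.append(prefix + "\n" + "\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append(prefix + "\n" + "\n\n".join(current))
    return chunks


example = chunk_with_context(
    ["Refunds", "Issuing a refund"],
    "Open the order page.\n\nClick the green button and confirm the dialog.",
    max_words=5,
)
```

Every chunk in `example` starts with "Refunds > Issuing a refund", so the instruction about the green button is embedded together with the topic it belongs to. A paragraph longer than `max_words` is kept whole here rather than split, which is a simplification.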
The metadata is empty when it should not be
Vector stores have a metadata field per record. It is routinely treated as optional when it should not be.
You should use it. Every chunk should carry, at minimum: source document URL, last-updated date, author or owner, scope (which product, team, customer segment), and an authority signal (is this the canonical doc, a personal note, a deprecated page).
The metadata earns its place for two reasons. The first is filtering. A support agent asking about the current refund policy should not get a chunk from a deprecated 2022 page that says "we no longer offer refunds on subscriptions." The filter at retrieval time is is_canonical = true AND last_updated > '2025-01-01', and now the deprecated chunk simply cannot be returned, no matter how semantically close its embedding is to the query.
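To make the filter concrete, here is a small in-memory sketch. In a real deployment the condition would normally be passed to the vector store as part of the query, and the syntax for that differs by store, so none of this is any store's actual API. The field names `is_canonical` and `last_updated` follow the filter above. The tuple layout and function names are assumptions for illustration.

```python
from datetime import date


def passes_filter(meta: dict, updated_after: date) -> bool:
    """True only for canonical chunks updated after the cutoff; missing fields fail closed."""
    last_updated = meta.get("last_updated")
    return (
        meta.get("is_canonical") is True
        and last_updated is not None
        and last_updated > updated_after
    )


def filtered_top_k(scored_chunks, updated_after, k=5):
    """scored_chunks: iterable of (similarity, metadata, text) tuples."""
    eligible = [c for c in scored_chunks if passes_filter(c[1], updated_after)]
    return sorted(eligible, key=lambda c: c[0], reverse=True)[:k]


candidates = [
    (0.95, {"is_canonical": False, "last_updated": date(2022, 3, 1)},
     "We no longer offer refunds on subscriptions."),
    (0.80, {"is_canonical": True, "last_updated": date(2025, 6, 1)},
     "Current refund policy text."),
    (0.70, {"is_canonical": True}, "Canonical page with no date recorded."),
]
results = filtered_top_k(candidates, updated_after=date(2025, 1, 1))
```

The deprecated chunk has the highest similarity score and is still excluded. The third candidate is excluded too, because a chunk with no recorded date cannot prove it is current. That fail-closed choice is ours, and some teams may prefer the opposite.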
The second reason is grounding the generation. When you pass the retrieved chunks to the model for synthesis, you also pass the metadata. The model can now write "According to the Refunds Policy (last updated April 2026), the answer is X" instead of asserting X as a flat fact. This is one of the biggest reductions in user-perceived hallucination we consistently see when rebuilding struggling RAG systems. The model stops sounding confident about answers it does not actually have current sources for, because the metadata makes it clear which sources do and do not exist.
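One way to hand that metadata to the model is to render it into the prompt next to each chunk. The sketch below assumes each retrieved chunk is a dict with `text` and `metadata` keys and that the metadata includes a `title` field. Those names and the instruction wording are illustrative assumptions, not a prescribed format.

```python
def format_sources(chunks):
    """Render retrieved chunks as numbered, attributed sources for the prompt."""
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        meta = chunk["metadata"]
        header = (
            f"[{i}] {meta.get('title', 'Untitled')} "
            f"(last updated: {meta.get('last_updated', 'unknown')}, "
            f"authority: {meta.get('authority', 'unknown')})"
        )
        blocks.append(header + "\n" + chunk["text"])
    return "\n\n".join(blocks)


def build_prompt(question, chunks):
    """Combine instructions, attributed sources, and the user question."""
    return (
        "Answer using only the sources below. Cite the source title and "
        "last-updated date for each claim. If the sources do not cover "
        "the question, say so.\n\n"
        f"Sources:\n{format_sources(chunks)}\n\nQuestion: {question}"
    )


prompt = build_prompt(
    "Can I refund a subscription?",
    [{"text": "Subscriptions are refundable within 14 days.",
      "metadata": {"title": "Refunds Policy", "last_updated": "April 2026",
                   "authority": "canonical"}}],
)
```

The example chunk text is made up. The point is that the title, date, and authority signal travel with the text, so the model has something to attribute its answer to.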
Metadata is not an enrichment. It is part of the substrate.
The embedding model is wrong for the corpus
The default embedding model in many frameworks and tutorials is a general-purpose one, such as OpenAI's text-embedding-3-large. It is good. It is also generic. It is built to perform reasonably across many domains rather than to excel in any one of them.
Your corpus is not evenly distributed across domains. Your corpus is, for example, ninety percent medical claims documents. The terminology in that corpus is dense and specialised, and the words that distinguish between two adjacent concepts are sometimes single tokens whose meaning the generic embedding model has averaged across hundreds of other usages it has seen in news, novels, and Wikipedia.
The recall gain from running a domain-tuned embedding model on heavily specialised content is real and measurable. Healthcare, legal, code, and even financial embedding models now exist, and they are often available from the same providers as the generic ones. Or you can fine-tune your own on a few thousand positive pairs from your own corpus, which is not expensive, though few teams actually do it.
On specialised corpora, changing the embedding model often moves the needle more than retrieval tuning does. We have watched it produce double-digit recall improvements on the queries that mattered most, without touching the retrieval algorithm or the chunking. It is rarely the first thing teams try.
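A comparison like this can be run with very little code. The sketch below assumes each candidate model is wrapped in a function that maps a string to a vector. Those wrappers are hypothetical and are not shown, since they depend on the provider. It ranks chunks by brute-force cosine similarity and reports recall@k per model.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(embed, corpus, eval_set, k=5):
    """Mean fraction of expected chunk ids found in the top-k, for one embedder.

    embed:    callable mapping a string to a vector (hypothetical model wrapper)
    corpus:   dict of chunk_id -> chunk text
    eval_set: list of (query, set of expected chunk ids)
    """
    vectors = {cid: embed(text) for cid, text in corpus.items()}
    total = 0.0
    for query, expected in eval_set:
        q = embed(query)
        ranked = sorted(vectors, key=lambda cid: cosine(q, vectors[cid]), reverse=True)
        total += len(expected & set(ranked[:k])) / len(expected)
    return total / len(eval_set)


def compare_embedders(embedders, corpus, eval_set, k=5):
    """embedders: dict of name -> embed callable. Returns name -> recall@k."""
    return {name: recall_at_k(fn, corpus, eval_set, k) for name, fn in embedders.items()}
```

Calling `compare_embedders({"generic": embed_generic, "domain": embed_domain}, corpus, eval_set)` with two such wrappers gives a side-by-side number on the same queries, with chunking and retrieval held constant.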
If you are not measuring retrieval recall as a separate metric from generation accuracy, you will never know to try this. Which brings us to the last thing.
You cannot diagnose what you cannot measure
The single most common operational gap in a RAG system in production is the absence of a real evaluation harness.
When the support team says "the answers are wrong," nobody on the engineering side knows whether the wrong answers are because the right chunks are not being retrieved, or because the right chunks are retrieved but the model synthesises them poorly. These are very different failure modes. They have very different fixes. Conflating them sends engineering teams down weeks of rabbit holes optimising the wrong stage.
The evaluation harness has to measure retrieval and generation as separate metrics, against a shared eval set of real user queries paired with the chunks that should have been retrieved.
Retrieval recall: for a given user query, did the top-k retrieved chunks include the chunks your subject matter expert flagged as correct sources?
Generation accuracy: given those correct chunks, does the model's answer agree with what the subject matter expert wrote as the ground-truth answer?
You need both. Recall in the high nineties and accuracy in the low seventies tells you the generation step is failing. Recall in the seventies and accuracy in the high nineties tells you to fix the retrieval step. Without splitting the two, you are optimising blind.
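The split can be sketched in a few lines. This assumes each eval query has already been run and graded: `retrieved_ids` is what the system returned, and `answer_correct` is a judgment (human or automated) of the answer produced when the correct chunks were supplied. The `EvalResult` structure, the function names, and the 0.9 threshold are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    query: str
    expected_ids: set      # chunks the subject matter expert flagged as correct sources
    retrieved_ids: list    # top-k chunk ids the system actually returned
    answer_correct: bool   # graded against the expert's answer, with correct chunks supplied


def summarize(results):
    """Return (retrieval_recall, generation_accuracy) over the eval set."""
    recall = sum(
        len(r.expected_ids & set(r.retrieved_ids)) / len(r.expected_ids) for r in results
    ) / len(results)
    accuracy = sum(r.answer_correct for r in results) / len(results)
    return recall, accuracy


def diagnose(recall, accuracy, threshold=0.9):
    """Point at the stage that is failing; the threshold is an arbitrary placeholder."""
    if recall < threshold and accuracy < threshold:
        return "both stages need work"
    if recall < threshold:
        return "fix retrieval"
    if accuracy < threshold:
        return "fix generation"
    return "both stages pass on this eval set"
```

For example, `diagnose(0.97, 0.72)` returns "fix generation" and `diagnose(0.75, 0.98)` returns "fix retrieval", matching the two cases described above.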
Building the eval set is the work most teams skip because it is unsexy. It involves sitting with the people who actually do the work, walking through fifty real queries with them, asking them to flag the documents that should have answered each one, and writing down the ground-truth answer in their own words. It takes a week. It is the single highest-leverage week of work in the entire RAG lifecycle.
Half the value of building it is forcing your subject-matter experts to disagree out loud about what "correct" means for a given query. The disagreements expose decisions that were never made, never written down, and never communicated to the people writing the prompts. The eval set ends up encoding those decisions, which is the second-order value: it becomes the canonical artefact of what the system is supposed to do.
When we scope a RAG engagement, building the eval set is week one, not week six. Without it, every subsequent decision is a guess. With it, the production failures stop being mysteries.
Where this leaves you
If you are debugging a RAG system that works in the demo and disappoints in production, here is the order to actually try things:
- Look at the chunks. Pull twenty random chunks from your index. Read them as a human. If you cannot tell what they are about without seeing the surrounding document, your chunking is the problem.
- Look at the metadata. Open one chunk in the index. If the metadata is source: null, last_updated: null, authority: null, that is the problem regardless of what the retrieval algorithm is doing.
- Look at the embedding model. If your corpus is domain-specific and you are using a generic embedding model, this is probably the single largest available win that nobody is talking about.
- Build the eval set. Until you can separate retrieval recall from generation accuracy, you are guessing.
- Only then, look at the retrieval algorithm itself.
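The first two checks in that list take a few lines of code once the index has been exported. The sketch below assumes the export is a list of dicts with `text` and `metadata` keys. The field names match the ones mentioned above, and everything else is a placeholder for whatever your store actually returns.

```python
import random

REQUIRED_FIELDS = ("source", "last_updated", "authority")


def sample_chunks(records, n=20, seed=0):
    """Pick up to n random records to read by hand."""
    rng = random.Random(seed)
    return rng.sample(records, min(n, len(records)))


def missing_metadata_rates(records, fields=REQUIRED_FIELDS):
    """Fraction of records where each metadata field is absent or None."""
    return {
        f: sum(1 for r in records if (r.get("metadata") or {}).get(f) is None) / len(records)
        for f in fields
    }
```

Print the sampled chunks and read them without the source document open. If `missing_metadata_rates` comes back near 1.0 for any field, that is worth fixing before touching anything further down the pipeline.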
The retrieval algorithm is rarely the problem. Most of the time the problem is earlier than that, and the fix takes engineering taste rather than a new library. Most framework discussions start further down the pipeline than the actual failures do.
It is also why most of our first week on a RAG engagement is spent reading data and writing the eval harness, not picking a vector store. By the time we are tuning retrieval, we already know which queries are failing and why. Most teams never get to that level of clarity, and then they wonder why every retrieval change feels like throwing darts.
