How is this different from hiring full-time engineers?

Full-time engineers are usually the right answer eventually. We're the bridge. You hire us when you need someone shipping by Monday and a job posting won't close for months. When you've hired the right team, we hand off everything. The code is already yours, the infra is already in your cloud, the runbook is already written. Working ourselves out of the job is the goal, and we don't mind when it happens.

What if our codebase is a mess?

Most are. Our engineers each have a decade in production, and between them they have worked on most stack vintages still in use: Rails monoliths, 4-year-old Next.js apps with three router migrations, greenfield TypeScript. We adapt to your conventions, your CI, your branching model. We won't try to rewrite your stack to use our preferred one. That's a tell of an agency that's actually selling templates.

Do you really have no contracts longer than a month?

Real answer: our MSA is signed once, and engagements run on monthly purchase orders. You can pause or end at any time on 30 days' notice. That's not a fine-print clause, it's the operating model. We'd rather stay because we're earning it than because you signed a year of it.

What does 'senior engineer' actually mean here?

Engineers with a decade each in production across AI, infrastructure, and platform work. Founder-led, with no junior bench to hide behind. You get the people who'd actually be writing the architecture doc anywhere else. We don't post bios on the site because LinkedIn isn't the right hiring surface; we'd rather you meet whoever is staffed on the first call and decide from there.

How fast can you actually start?

Sprint engagements: typically 7 days from signature. Pod engagements: 14 days. We won't lie about availability. If taking your engagement would mean starting late or staffing it thin, we'll say 'we can't take you for six weeks' instead, and you can hold us to the date we do give.

We've been burned by agencies before. Why is this different?

Mostly because we don't try to be everything. Six services. Senior engineers only. Code in your repo. Monthly contracts. Weekly demos. Founder-led. No PMs in the loop, no offshore handoff, no proprietary platform. If you've been burned, you know what bit you. We've tried to make ourselves the opposite of that.

What does week one actually look like?

Day 1: kickoff call, Slack channel created, repo access exchanged, problem statement written and pinned. Day 2: scoping doc with the smallest shippable thing identified. Days 3–4: working prototype in a sandbox. Day 5: Loom walkthrough, demo on your calendar for next Friday. Week one is choreographed. The improvising starts in week two.

Do you sign NDAs? BAAs? SOC 2 vendor questionnaires?

Yes to all three. Vendor security questionnaires are turned around quickly because there's no committee to route around. BAAs are ready for legal review on day one for healthcare engagements. The goal is to be the easiest vendor your procurement team deals with this quarter.

May 11, 2026 · 9 min read · rag · retrieval · evaluation · production

Why most RAG systems fail before retrieval

The retrieval algorithm is rarely the problem. Most RAG failures happen earlier, at stages the team isn't looking at. Here's the failure shape we keep seeing and the order we'd actually debug it in.

Written by the milebits founders.

If you have built a RAG system and watched it ship to production, you have probably had this conversation. The system answers well in the demo. It answers worse with real users. Somebody asks "is it the model" and the team says no, the model is fine. So you start tuning retrieval. You swap dense for hybrid. You add a reranker. You bump top-k. You try Cohere, then Voyage, then Jina. Each change moves the needle a percent or two. None of them fix the failures your users keep flagging.

That is because the user-reported failures almost never start at retrieval. They start before retrieval. The chunks being indexed are wrong. The metadata is missing. The query the user typed never gets translated into something the index can answer. And nobody has built the evaluation surface to tell the difference between a retrieval failure and a generation failure, so the team is optimising blind.

Figure: a RAG pipeline drawn vertically: source ingestion, chunking, metadata, embedding, query translation, retrieval, generation, validation, response. The retrieval step is highlighted as where teams usually start tuning. Five numbered failures sit outside it: broken source data, chunks that forget their purpose, empty metadata, generic embeddings, and no eval harness. The failures cluster before and after retrieval, not at it. Debug order: chunks, then metadata, then recall versus answer quality, then the retrieval algorithm last.

The retrieval-and-generation middle gets all the attention. The numbered failures cluster outside it, and they map to the five sections below.

This is the failure shape we keep seeing. Here is what it actually looks like, in roughly the order it goes wrong.

The data being indexed is already broken

Most teams import their knowledge base into the vector store using whatever loader the framework happens to ship with. The default Notion loader strips formatting. The default Confluence loader flattens tables into runs of spaces. The default Salesforce loader pulls one record at a time and creates one chunk per record, regardless of whether the record has three fields or thirty.

By the time the content reaches the embedding model, headings are gone, code blocks have lost their delimiters, tables have lost their column structure, and the document's hierarchy has been compressed into a single flat string. The embedding model is now embedding a degraded version of your source material. The user is searching against that degraded version.

The fix is unglamorous. For sources that matter, teams usually end up writing at least part of the ingestion layer themselves rather than trusting the default loader. Read the source format. Preserve the structure that humans use to find information in the original document. If a heading in the original means "this section is about refunds," that heading needs to live somewhere queryable, either inline in the chunk or in the metadata. When you audit a struggling RAG system, this is almost always where the worst damage has been done. It happens before anyone has run a single query, and the team that built it usually has not gone back to look.
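As a rough illustration of preserving structure at ingestion, the sketch below renders a table row by row with its column names attached, rather than letting a loader flatten it into runs of spaces. The function name, the bracketed section prefix, and the sample table are all hypothetical; a real ingestion layer would read these from the source format.

```python
def table_to_lines(headers: list[str], rows: list[list[str]], section: str) -> list[str]:
    """Render each table row as one self-describing line of text.

    Every value stays paired with its column name, and the section
    heading travels with the row so it remains queryable.
    """
    lines = []
    for row in rows:
        cells = "; ".join(f"{name}: {value}" for name, value in zip(headers, row))
        lines.append(f"[{section}] {cells}")
    return lines


# Made-up sample table, for illustration only.
lines = table_to_lines(
    headers=["Plan", "Refund window"],
    rows=[["Monthly", "14 days"], ["Annual", "30 days"]],
    section="Refunds",
)
for line in lines:
    print(line)
```

The same idea applies to code blocks and headings: keep whatever delimiter or label a human would rely on in the original, and carry it into the text that gets embedded.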

The chunks have no idea what they are about

A second class of failure shows up around the chunking strategy. Most systems chunk by fixed token count, sometimes with a fifty- or hundred-token overlap. This is fine for novels. It is bad for documentation.

A chunk in the middle of a how-to article does not say "this is part of the refund workflow." It says "click the green button and confirm the dialog." The embedding for that chunk is closer to other instructional text than it is to anything about refunds. When the user asks "how do I refund a customer," the system fails to retrieve this chunk not because the retrieval algorithm is bad but because the chunk genuinely does not look like an answer to that question on its own.

There are a few real fixes. The simplest is to prepend the document's hierarchical context to every chunk before embedding, so the chunk becomes "Refunds > Issuing a refund > Step 3: click the green button and confirm the dialog." Now the chunk looks like what the user is asking about, because it carries its own context. Several other approaches also work well in practice: header-aware chunking, semantic chunking that uses a small model to find natural boundaries, and sliding-window chunking that respects sentence boundaries. All of them outperform fixed-size token chunking on most documentation-heavy corpora.

We default to header-aware chunking with context-prepending unless there is a reason not to. It is cheap. It is interpretable. It is one of the highest-leverage retrieval fixes available on documentation-heavy systems, and it consistently outperforms the more glamorous alternatives.
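A minimal sketch of header-aware chunking with context-prepending, assuming the source has already been converted to Markdown-style `#` headings. The function name and the " > " separator are illustrative choices, not a fixed convention, and the second section of the sample document is made up.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_with_context(markdown: str) -> list[str]:
    """Split text at headings and prefix each chunk with its heading path."""
    path: list[str] = []    # current heading trail, e.g. ["Refunds", "Issuing a refund"]
    body: list[str] = []    # lines collected under the current heading
    chunks: list[str] = []

    def flush() -> None:
        text = " ".join(line.strip() for line in body if line.strip())
        if text:
            chunks.append(" > ".join(path + [text]))
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]    # drop headings at this depth or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks


doc = "\n".join([
    "# Refunds",
    "## Issuing a refund",
    "Step 3: click the green button and confirm the dialog.",
    "# Billing",
    "Invoices are sent monthly.",
])
for chunk in chunk_with_context(doc):
    print(chunk)
```

This version produces one chunk per heading section. Long sections would still need a size cap, which can be layered on by splitting the body while repeating the same heading path on each piece.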

The metadata is empty when it should not be

Vector stores have a metadata field per record. It is routinely treated as optional when it should not be.

You should use it. Every chunk should carry, at minimum: source document URL, last-updated date, author or owner, scope (which product, team, customer segment), and an authority signal (is this the canonical doc, a personal note, a deprecated page).

The metadata earns its place for two reasons. The first is filtering. A support agent asking about the current refund policy should not get a chunk from a deprecated 2022 page that says "we no longer offer refunds on subscriptions." The filter at retrieval time is is_canonical = true AND last_updated > '2025-01-01', and now the deprecated chunk simply cannot be returned, no matter how semantically close its embedding is to the query.
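The filtering idea can be sketched in plain Python: apply the metadata predicate first, then rank only the surviving chunks by similarity. The `Chunk` fields mirror the filter above, while the two-dimensional embeddings and the canonical chunk's text are toy values; a real vector store would apply the filter inside its own query API.

```python
import math
from dataclasses import dataclass
from datetime import date


@dataclass
class Chunk:
    text: str
    embedding: list[float]
    is_canonical: bool
    last_updated: date


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query: list[float], chunks: list[Chunk], k: int, updated_after: date) -> list[Chunk]:
    """Filter on metadata first, then rank the eligible chunks by similarity."""
    eligible = [c for c in chunks if c.is_canonical and c.last_updated > updated_after]
    eligible.sort(key=lambda c: cosine(query, c.embedding), reverse=True)
    return eligible[:k]


chunks = [
    # Deprecated page: closest embedding to the query, but it must never be returned.
    Chunk("We no longer offer refunds on subscriptions.", [1.0, 0.0], False, date(2022, 6, 1)),
    # Hypothetical canonical page.
    Chunk("Subscription refunds are available within 30 days.", [0.8, 0.6], True, date(2026, 4, 1)),
]
results = retrieve([1.0, 0.0], chunks, k=3, updated_after=date(2025, 1, 1))
print([c.text for c in results])
```

Filtering before ranking is the point of the design: the deprecated chunk is excluded regardless of its similarity score, instead of relying on a reranker to push it down.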

The second reason is grounding the generation. When you pass the retrieved chunks to the model for synthesis, you also pass the metadata. The model can now write "According to the Refunds Policy (last updated April 2026), the answer is X" instead of asserting X as a flat fact. This is one of the biggest reductions in user-perceived hallucination we consistently see when rebuilding struggling RAG systems. The model stops sounding confident about answers it does not actually have current sources for, because the metadata makes it clear which sources do and do not exist.
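One way to pass metadata through to generation is to label every retrieved chunk with its title, date, and authority before it enters the prompt. The sketch below assumes each chunk is a dict with hypothetical keys `title`, `last_updated`, `authority`, and `text`; the prompt wording is illustrative rather than a tested template.

```python
def format_context(chunks: list[dict]) -> str:
    """Render retrieved chunks as numbered sources with their metadata attached."""
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        header = f"[{i}] {chunk['title']} (last updated {chunk['last_updated']}, {chunk['authority']})"
        blocks.append(f"{header}\n{chunk['text']}")
    return "\n\n".join(blocks)


def build_prompt(question: str, chunks: list[dict]) -> str:
    return (
        "Answer using only the sources below. Name the source and its "
        "last-updated date for each claim. If no source covers the question, say so.\n\n"
        f"{format_context(chunks)}\n\n"
        f"Question: {question}"
    )


prompt = build_prompt(
    "How do I refund a customer?",
    [{
        "title": "Refunds Policy",
        "last_updated": "April 2026",
        "authority": "canonical",
        "text": "Refund policy text goes here.",  # placeholder chunk text
    }],
)
print(prompt)
```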

Metadata is not an enrichment. It is part of the substrate.

The embedding model is wrong for the corpus

Most teams reach for a generic embedding model, usually whichever one the framework's quickstart used. The popular defaults are good. They are also generic, trained on a broad slice of English text with a relatively even distribution across domains.

Your corpus is not evenly distributed across domains. Your corpus is, for example, ninety percent medical claims documents. The terminology in that corpus is dense and specialised, and the words that distinguish between two adjacent concepts are sometimes single tokens whose meaning the generic embedding model has averaged across hundreds of other usages it has seen in news, novels, and Wikipedia.

The recall gain from running a domain-tuned embedding model on heavily specialised content is real and measurable. Healthcare embeddings, legal embeddings, code embeddings, and even financial embeddings now exist and are usually available from the same providers as the generic ones. Or you can fine-tune your own on a few thousand positive pairs from your own corpus, which is not expensive if you actually do it.

On specialised corpora, changing the embedding model often moves the needle more than retrieval tuning does. We have watched it produce double-digit recall improvements on the queries that mattered most, without touching the retrieval algorithm or the chunking. It is rarely the first thing teams try.

If you are not measuring retrieval recall as a separate metric from generation accuracy, you would never have known to do this. Which brings us to the last thing.

You cannot diagnose what you cannot measure

The single most common operational gap in a RAG system in production is the absence of a real evaluation harness.

When the support team says "the answers are wrong," nobody on the engineering side knows whether the wrong answers are because the right chunks are not being retrieved, or because the right chunks are retrieved but the model synthesises them poorly. These are very different failure modes. They have very different fixes. Conflating them sends engineering teams down weeks of rabbit holes optimising the wrong stage.

The evaluation harness has to measure retrieval and generation as separate metrics, against a shared eval set of real user queries paired with the chunks that should have been retrieved.

Retrieval recall: for a given user query, did the top-k retrieved chunks include the chunks your subject matter expert flagged as correct sources?

Generation accuracy: given those correct chunks, does the model's answer agree with what the subject matter expert wrote as the ground-truth answer?

You need both. Recall in the high nineties and accuracy in the low seventies tells you the generation step is failing. Recall in the seventies and accuracy in the high nineties tells you to fix the retrieval step. Without splitting the two, you are optimising blind.
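A minimal sketch of the retrieval half of such a harness, plus the triage rule described above. The chunk IDs are made up, and the 0.9 threshold in `diagnose` is an assumption to tune per system; generation accuracy is taken as an input here because it needs a grader (human or model) that this sketch does not cover.

```python
def recall_at_k(retrieved_ids: list[str], expected_ids: set[str], k: int) -> float:
    """Fraction of the expert-flagged chunks that appear in the top-k results."""
    if not expected_ids:
        raise ValueError("every eval query needs at least one expected chunk")
    hits = expected_ids & set(retrieved_ids[:k])
    return len(hits) / len(expected_ids)


def diagnose(mean_recall: float, mean_accuracy: float, threshold: float = 0.9) -> str:
    """Point at the stage to fix first. The threshold is an assumed cut-off."""
    if mean_recall < threshold:
        return "retrieval"
    if mean_accuracy < threshold:
        return "generation"
    return "ok"


# One eval query: the expert flagged c2 and c4, but only c2 made the top 3.
print(recall_at_k(["c7", "c2", "c9", "c4"], {"c2", "c4"}, k=3))
print(diagnose(mean_recall=0.97, mean_accuracy=0.72))
print(diagnose(mean_recall=0.75, mean_accuracy=0.98))
```

Checking recall before accuracy reflects the dependency between the two: generation accuracy is only meaningful for queries where the correct chunks were actually available.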

Building the eval set is the work most teams skip because it is unsexy. It involves sitting with the people who actually do the work, walking through fifty real queries with them, asking them to flag the documents that should have answered each one, and writing down the ground-truth answer in their own words. It takes a week. It is the single highest-leverage week of work in the entire RAG lifecycle.

Half the value of building it is forcing your subject-matter experts to disagree out loud about what "correct" means for a given query. The disagreements expose decisions that were never made, never written down, and never communicated to the people writing the prompts. The eval set ends up encoding those decisions, which is the second-order value: it becomes the canonical artefact of what the system is supposed to do.

When we scope a RAG engagement, building the eval set is week one, not week six. Without it, every subsequent decision is a guess. With it, the production failures stop being mysteries.

Where this leaves you

If you are debugging a RAG system that works in the demo and disappoints in production, here is the order to try things in:

1. Look at the chunks. Pull twenty random chunks from your index. Read them as a human. If you cannot tell what they are about without seeing the surrounding document, your chunking is the problem.
2. Look at the metadata. Open one chunk in the index. If the metadata is source: null, last_updated: null, authority: null, that is the problem regardless of what the retrieval algorithm is doing.
3. Look at the embedding model. If your corpus is domain-specific and you are using a generic embedding model, this is probably the single largest available win that nobody is talking about.
4. Build the eval set. Until you can separate retrieval recall from generation accuracy, you are guessing.
5. Only then, look at the retrieval algorithm itself.
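The first two checks in the list above take a few lines to script. This sketch assumes index records can be exported as dicts with a `text` key and a `metadata` dict; that record shape and the helper names are hypothetical, and the field names follow the null example in the list.

```python
import random

REQUIRED_FIELDS = ("source", "last_updated", "authority")


def sample_chunks(records: list[dict], n: int = 20, seed: int = 0) -> list[dict]:
    """Pull up to n random records to read by hand."""
    rng = random.Random(seed)
    return rng.sample(records, min(n, len(records)))


def missing_metadata_rate(records: list[dict]) -> dict[str, float]:
    """For each required field, the fraction of records where it is absent or empty."""
    if not records:
        return {field: 0.0 for field in REQUIRED_FIELDS}
    return {
        field: sum(1 for r in records if not (r.get("metadata") or {}).get(field)) / len(records)
        for field in REQUIRED_FIELDS
    }


# Two made-up records: one with empty metadata, one fully populated.
records = [
    {"text": "click the green button and confirm the dialog.",
     "metadata": {"source": None, "last_updated": None, "authority": None}},
    {"text": "Refund policy text goes here.",
     "metadata": {"source": "https://example.com/refunds", "last_updated": "2026-04-01", "authority": "canonical"}},
]
for record in sample_chunks(records):
    print(record["text"])
print(missing_metadata_rate(records))
```

The sampling step is deliberately manual: the script only picks the chunks, and a person still has to read them and judge whether each one makes sense on its own.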

The retrieval algorithm is rarely the problem. Most of the time the problem is earlier than that, and the fix takes engineering taste rather than a new library. Most framework discussions start further down the pipeline than the actual failures do.

It is also why most of our first week on a RAG engagement is spent reading data and writing the eval harness, not picking a vector store. By the time we are tuning retrieval, we already know which queries are failing and why. Most teams never get to that level of clarity, and then they wonder why every retrieval change feels like throwing darts.
