How do I make RAG work in production over real documents?

Not with embed-and-retrieve alone. Production RAG over a real corpus is a layered pipeline: hybrid search (dense vectors plus BM25 keyword matching), a reranking pass, hierarchical summaries so retrieval works at the right altitude, and graph context to connect related material. The failure that bites hardest is silent: an embedding-dimension mismatch, where the index and the queries are embedded with different models or settings, returns confident, plausible, wrong results with no error. Pin the embedding model as a contract and test retrieval against a fixed question set.

Updated 19 June 2026

Why naive RAG fails on real documents

Pure vector search optimizes for 'sounds similar,' which is the wrong objective for technical or regulatory text full of exact terms, codes, and cross-references. You need keyword search alongside vectors, then a reranker to order the merged candidates by how well each one actually answers the query.
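One common way to merge a dense ranking and a keyword ranking before reranking is reciprocal rank fusion. The sketch below is a minimal illustration under stated assumptions, not a description of any particular system: the document ids and the two result lists are hypothetical, and `k=60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first lists of doc ids into one ranked list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results, ordered best-first by each retriever.
dense_hits = ["sec-4.2", "sec-9.1", "sec-1.3"]    # vector search
keyword_hits = ["sec-9.1", "sec-7.5", "sec-4.2"]  # BM25 keyword search

fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
print(fused)
```

Documents that both retrievers rank well rise to the top, and the fusion needs only ranks, so the dense and keyword scores never have to share a scale.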

The pipeline that holds

The pipeline has four layers: hybrid retrieval for both meaning and exactness; a cross-encoder rerank, because order is most of the answer; hierarchical summaries, so a broad question is answered from a summary rather than a brittle stitch of fragments; and graph context, so a rule connects to the rules it references.
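Two of these layers can be sketched in a few lines: reranking candidates with a pluggable scoring function, then pulling in the documents those hits reference (one hop of graph context). This is a rough sketch under assumptions. The rule ids, texts, and reference map are hypothetical, and `overlap_score` is a toy stand-in for a real cross-encoder, used only so the example runs without a model.

```python
from typing import Callable, Dict, List


def rerank(query: str, candidates: Dict[str, str],
           score_fn: Callable[[str, str], float], top_n: int = 3) -> List[str]:
    """Order candidate doc ids by score_fn(query, text), best first."""
    ranked = sorted(candidates, key=lambda d: (-score_fn(query, candidates[d]), d))
    return ranked[:top_n]


def expand_with_references(doc_ids: List[str],
                           references: Dict[str, List[str]]) -> List[str]:
    """Append documents referenced by the hits (one hop), preserving order."""
    expanded = list(doc_ids)
    for doc_id in doc_ids:
        for ref in references.get(doc_id, []):
            if ref not in expanded:
                expanded.append(ref)
    return expanded


def overlap_score(query: str, text: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query terms found in text."""
    q_terms = set(query.lower().split())
    t_terms = set(text.lower().split())
    return len(q_terms & t_terms) / len(q_terms) if q_terms else 0.0


# Hypothetical candidates from hybrid retrieval, and a hypothetical reference graph.
candidates = {
    "rule-12": "retention period for customer records is seven years",
    "rule-30": "customer records must be encrypted at rest",
    "rule-44": "marketing emails require prior consent",
}
references = {"rule-12": ["rule-30", "rule-51"]}

top = rerank("retention period for customer records", candidates, overlap_score, top_n=2)
context = expand_with_references(top, references)
print(top, context)
```

Passing the scorer in as a function means a real cross-encoder can replace the toy one without touching the rest of the pipeline.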

The silent failure to design out

If the embeddings in your store were made with a different model or config than your queries, each query vector is compared against vectors from a different embedding space, and you get confident garbage with no exception. Pin the model and dimensions, and test retrieval quality against known questions.
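Both safeguards can be sketched briefly: a pinned embedding contract that fails loudly on any mismatch, and a recall check against a fixed question set. Everything named here is an assumption for illustration: the `EmbeddingContract` fields, the model names, the question set, and the stub retriever are hypothetical, and a real setup would store the index contract alongside the index and call the actual pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingContract:
    model_name: str
    dimensions: int
    normalized: bool


def check_contract(index_contract, query_contract):
    """Raise instead of silently searching across mismatched embedding spaces."""
    if index_contract != query_contract:
        raise ValueError(
            f"embedding contract mismatch: index={index_contract}, query={query_contract}"
        )


def recall_at_k(retrieve, golden_set, k=5):
    """Fraction of questions whose expected doc id appears in the top k results."""
    hits = sum(1 for question, expected in golden_set
               if expected in retrieve(question)[:k])
    return hits / len(golden_set)


# Hypothetical contracts: same dimensions, different model, so a
# dimension-only check would pass while the spaces still differ.
index_contract = EmbeddingContract("example-embed-v1", 768, True)
query_contract = EmbeddingContract("example-embed-v2", 768, True)

try:
    check_contract(index_contract, query_contract)
    mismatch_caught = False
except ValueError:
    mismatch_caught = True

# Hypothetical fixed question set, with a stub retriever standing in for the pipeline.
golden_set = [
    ("How long are records retained?", "rule-12"),
    ("Is consent needed for marketing email?", "rule-44"),
]
stub_results = {
    "How long are records retained?": ["rule-12", "rule-30"],
    "Is consent needed for marketing email?": ["rule-30", "rule-12"],
}
recall = recall_at_k(lambda q: stub_results[q], golden_set, k=2)
print(mismatch_caught, recall)
```

Running the recall check in CI with a minimum threshold turns a silent retrieval regression into a failing build.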

Straight answers.

Is RAG just embeddings and a vector DB? That's the demo. Production needs hybrid search, reranking, hierarchy, and a relevance test that would catch a silent embedding mismatch before users do.
Why hybrid search instead of pure vectors? Vectors blur exact terms, codes, and references that matter. Keyword search catches those; vectors catch meaning when wording differs. Real corpora need both.
What's the embedding-dimension mismatch? When stored and query embeddings come from different model configs, they don't line up. Nothing errors; retrieval just returns wrong, confident results. It's a classic silent failure.
