Data · May 14, 2026 · 9 min read

How to build hybrid search for RAG

Hybrid search for RAG combines BM25, dense embeddings, reciprocal rank fusion, reranking, and retrieval evaluation so your system retrieves better evidence.

Dave Ebbelaar, Senior AI Engineer

Most RAG systems fail before the answer is generated.

The model gets the wrong context. Then everyone blames the prompt, the model, or the UI, while the real problem sits one layer earlier, in retrieval.

Hybrid search for RAG is a practical way to fix that. You combine a keyword-based retriever such as BM25 with dense embeddings, fuse the result lists, rerank the candidates, and then measure whether the system actually retrieves the documents it should retrieve.

Evaluation carries the most weight here. Without it, you are just looking at a few examples and deciding if they feel right.

This is the retrieval stack I would reach for when a basic vector search starts to feel shaky and the system needs to work on real company data, client data, or a serious product workflow.

Why vector search alone is not enough

A lot of RAG tutorials start with embeddings and stop there. Take documents, chunk them, create vectors, store them in a vector database, retrieve the nearest chunks, send them to an LLM.

Fine for a first demo. Not where I would stop for production.

Dense embeddings are good when the user asks the same thing in different words. The document might talk about "emergency funds" and the user might ask about "rainy day money". A dense retriever can still connect those ideas because it searches by meaning, not only by exact overlap.

But dense search can miss exact terms that matter. Product codes. Acronyms. Legal phrases. Financial terms. Internal project names. Support ticket IDs. These are often the terms users care about most, and a pure semantic search can treat them as less important than they really are.

BM25 has the opposite profile. It is old, simple, and still useful. It looks for keyword overlap with a weighting mechanism, so it is strong when the exact words matter. It is weaker when the user paraphrases the source text.

That is why the combination works. Dense search wins where BM25 loses. BM25 catches things dense search can miss.

Start with data, not tools

Before you build the retrievers, define the retrieval task.

In the hybrid search walkthrough, the dataset is FiQA, the financial question-answering set from the BEIR benchmark. The nice thing about that setup is that it gives you the three pieces you need for evaluation:

A corpus of documents.
A list of user queries.
A relevance mapping between query IDs and document IDs.

The corpus is the material you search over. In a company project, that could be policies, customer support docs, product specs, sales playbooks, contracts, filings, or internal notes.

The queries are the questions users ask.

The relevance mapping is the part most teams do not have. It tells you which documents should be returned for each query. Without that, you cannot say much about whether a retrieval change improved the system.

For the tutorial dataset, I start with about 1,700 queries and filter down to the 648 that actually have linked relevant documents. That filter matters. Do not evaluate on questions where you do not know the answer source.
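
A minimal sketch of that filtering step, assuming the three pieces are held as plain dictionaries. The IDs, texts, and the `filter_labelled` helper are made up for illustration and are not the tutorial's code.

```python
# Toy stand-ins for the three pieces; all IDs and texts are invented.
corpus = {
    "d1": "Keep an emergency fund in a high-yield savings account.",
    "d2": "Index funds spread risk across many companies.",
}
queries = {
    "q1": "where should I park my rainy day funds",
    "q2": "how do I file taxes as a freelancer",  # no labelled answer source
}
# Relevance mapping: query ID -> {document ID: relevance grade}
qrels = {"q1": {"d1": 1}}


def filter_labelled(queries, qrels, corpus):
    """Keep only queries with at least one relevant document present in the corpus."""
    kept = {}
    for qid, text in queries.items():
        relevant = {
            doc_id
            for doc_id, grade in qrels.get(qid, {}).items()
            if grade > 0 and doc_id in corpus
        }
        if relevant:
            kept[qid] = text
    return kept


eval_queries = filter_labelled(queries, qrels, corpus)
print(len(queries), "->", len(eval_queries))
```

Checking that the labelled document actually exists in the corpus also catches stale IDs after you re-chunk or re-import documents.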

If you are building this for your own project, you probably will not have a BEIR-style dataset sitting there. Create one. Start with your documents, generate or collect realistic questions, and map each question to one or more documents that answer it.

It does not need to be huge at first. It needs to be repeatable.

Build the BM25 retriever

BM25 is the sparse retrieval side of the system.

In practice, the setup is straightforward. Load the corpus, keep the document IDs and text, tokenize the documents, build the index, and save it. In the tutorial, the BM25 index for roughly 57,000 short documents is about 33 MB on disk.

People often overcomplicate this part. For many RAG applications, especially internal tools and client projects, the corpus is not massive. You may not need a separate search service just to get started. A local index loaded in memory can be enough for a serious first version.

BM25 gives you a ranked list of document IDs for a query. If the query is "where should I park my rainy day funds", the retriever tokenizes the query, removes stop words, searches the index, and returns the top documents with their BM25 scores.
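
To make the mechanics concrete, here is a small dependency-free BM25 sketch. It is not the library used in the tutorial. The stop-word list, the `BM25Index` class, the `k1` and `b` defaults, and the sample documents are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be longer.
STOP_WORDS = {"a", "an", "the", "i", "my", "in", "of", "to", "should", "where", "is"}


def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


class BM25Index:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.doc_ids = list(docs)
        self.term_freqs = [Counter(tokenize(docs[d])) for d in self.doc_ids]
        self.doc_lens = [sum(tf.values()) for tf in self.term_freqs]
        self.avg_len = sum(self.doc_lens) / len(self.doc_lens)
        # Document frequency: in how many documents each term appears.
        doc_freq = Counter(term for tf in self.term_freqs for term in tf)
        n = len(self.doc_ids)
        # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
        self.idf = {t: math.log(1 + (n - df + 0.5) / (df + 0.5)) for t, df in doc_freq.items()}

    def search(self, query, top_k=10):
        """Return up to top_k (doc_id, score) pairs, best first."""
        terms = tokenize(query)
        scored = []
        for doc_id, tf, length in zip(self.doc_ids, self.term_freqs, self.doc_lens):
            # Length normalisation: longer-than-average documents are penalised.
            norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
            score = sum(
                self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
                for t in terms
                if t in tf
            )
            if score > 0:
                scored.append((doc_id, score))
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


docs = {
    "d1": "Park your rainy day funds in a savings account.",
    "d2": "Index funds spread risk across many companies.",
    "d3": "Ticket SUP-4412 covers the billing export bug.",
}
index = BM25Index(docs)
hits = index.search("where should I park my rainy day funds", top_k=2)
id_hits = index.search("SUP-4412")
```

The second query shows the behaviour BM25 is there for: an identifier like a ticket number matches only the document that contains it.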

The library matters less than understanding what the method is good at.

Use BM25 when exact words matter. Use it to catch identifiers, rare terms, abbreviations, or terms that should not be smoothed away by semantic similarity.

Build the dense retriever

The dense retriever is the embedding side.

You take the same corpus, create embeddings for each document, and store the resulting vectors. In the tutorial, I use text-embedding-3-small, which produces 1,536-dimensional vectors. The stored embedding file for the corpus is about 350 MB on disk.

Again, that does not automatically require a vector database. For a lot of corpora, you can store the embeddings in a NumPy file, load them in memory, and run similarity search yourself. Once the corpus, latency, write pattern, or deployment setup demands more, move to Postgres with pgvector or another database layer.

For the query path, you embed the user query with the same model, normalize the vector, and compute similarity against the document embeddings. When both the query and the document vectors are normalized to unit length, the dot product equals cosine similarity, so sorting by dot product gives you the ranking directly.
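
A sketch of that query path, written with plain Python lists so it runs without dependencies. The three-dimensional vectors are invented stand-ins for real embeddings, and `dense_search` is a hypothetical helper name. With NumPy, the scoring line would be a single matrix-vector product.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dense_search(query_vec, doc_ids, doc_vecs, top_k=10):
    """doc_vecs must already be unit length, so the dot product is cosine similarity."""
    q = normalize(query_vec)
    scores = [sum(a * b for a, b in zip(q, d)) for d in doc_vecs]
    order = sorted(range(len(doc_ids)), key=lambda i: scores[i], reverse=True)[:top_k]
    return [(doc_ids[i], scores[i]) for i in order]


# Made-up 3-dimensional vectors standing in for real 1,536-dimensional embeddings.
doc_ids = ["d1", "d2", "d3"]
doc_vecs = [normalize(v) for v in ([0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.2, 0.9])]

# In a real system this would be the embedding of the user query,
# produced by the same model that embedded the documents.
query_vec = [0.8, 0.3, 0.0]
results = dense_search(query_vec, doc_ids, doc_vecs, top_k=2)
```

Normalizing the document vectors once at index time means the query path only has to normalize a single vector per request.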

This is the part most people already know from basic RAG. The difference is that in a hybrid setup, dense search is not the whole retrieval layer. It is one candidate generator.

Fuse the rankings with RRF

Now you have two ranked lists:

BM25 returns documents ranked by keyword relevance.
Dense retrieval returns documents ranked by semantic similarity.

Do not compare the raw scores directly. BM25 scores and cosine similarity scores are not on the same scale. If you try to sort both lists by raw score, you are pretending two different measurement systems mean the same thing.

Reciprocal Rank Fusion solves that by using rank instead of raw score.

The idea is simple. Each document earns 1 / (k + rank) from every list it appears in, and those contributions are summed. If a document ranks highly in one list, it gets points. If it ranks highly in both lists, it gets more points. The common choice for the constant k is 60, which is also what I use in the walkthrough.

The practical pattern is:

Retrieve a larger candidate set from BM25.
Retrieve a larger candidate set from dense search.
Score candidates by rank with RRF.
Sort the fused list.

This gives you a single candidate list that benefits from both retrieval methods without forcing their scores into the same scale.

Use a larger top_k for the candidate stage than you plan to send into the final answer step. For example, retrieve 50 candidates from each method, fuse them, then keep the best 10 or 20 for reranking.
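
The fusion step fits in a few lines. This is a minimal sketch assuming each retriever hands back a list of document IDs, best first. The function name `rrf_fuse` and the example IDs are illustrative.

```python
def rrf_fuse(ranked_lists, k=60, top_n=None):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) per document, with ranks starting at 1.
    Raw retriever scores are never used.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return fused[:top_n] if top_n else fused


bm25_ids = ["d3", "d1", "d7"]   # best first, from the keyword retriever
dense_ids = ["d1", "d9", "d3"]  # best first, from the embedding retriever
fused = rrf_fuse([bm25_ids, dense_ids], top_n=3)
```

Here `d1` ends up first because it sits near the top of both lists, even though it is not first in the BM25 list. Documents found by only one retriever still make the fused list, just with a lower score.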

Add a reranker

RRF gives you a better combined list, but it is still based on two retrieval systems that looked at the query and documents in limited ways.

A reranker takes the next step. It looks at the query and the candidate documents together, then orders the candidates again. In the tutorial, I use a Cohere rerank model, but the same stage could use another hosted reranker or an open-source cross-encoder.

This extra step costs more than the first retrieval stage, so you do not run it across the whole corpus. You run it on the fused candidates.

The full pipeline looks like this:

Use BM25 and dense search to cast a wider net.
Use RRF to combine the lists.
Use the reranker to order the best candidates.
Send only the final context to the answer model.
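
The reranking stage can be sketched independently of any provider. In the code below, `score_fn` is a hypothetical callable standing in for whatever reranker you use, hosted or open source. The word-overlap scorer only exists so the example runs offline and is not a real reranker.

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Re-order fused candidates by a query-document relevance score.

    candidates: list of (doc_id, text) pairs from the fusion step.
    score_fn:   callable (query, text) -> float, higher means more relevant.
    """
    scored = [(doc_id, score_fn(query, text)) for doc_id, text in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


def toy_overlap_score(query, text):
    """Placeholder scorer: fraction of query words found in the text."""
    q_words = set(query.lower().split())
    return len(q_words & set(text.lower().split())) / max(len(q_words), 1)


candidates = [
    ("d2", "index funds spread risk"),
    ("d1", "park rainy day funds in a savings account"),
]
final = rerank("rainy day funds", candidates, toy_overlap_score, top_n=1)
```

Because the scorer is called once per candidate, the cost of this stage grows with the size of the fused list, which is why it runs on tens of candidates and not on the corpus.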

The gains can look subtle when you inspect one query. A document moves from rank five to rank one. Another drops out of the top five. It is easy to shrug at that.

But across many questions, those small rank changes matter. RAG quality is often the difference between the right document landing in the context window and the right document sitting just outside it.

Evaluate retrieval quality

This is where the work becomes real engineering.

You need to run the same queries through each retrieval variant and compare the results against ground truth. In the walkthrough, I use NDCG@10, normalized discounted cumulative gain at 10. It looks at the first ten results and rewards relevant documents more the closer they sit to the top. The score runs between 0 and 1, and the tutorial multiplies it by 100 so it reads like a percentage-style score.
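
For reference, a compact NDCG@k sketch for a single query. It assumes linear gain (the relevance grade itself, not 2^grade - 1) and a ranked list without duplicate IDs. The function names and example IDs are illustrative.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: position i (0-based) is discounted by log2(i + 2)."""
    return sum(gain / math.log2(i + 2) for i, gain in enumerate(gains))


def ndcg_at_k(ranked_ids, relevant, k=10):
    """NDCG@k for one query.

    ranked_ids: document IDs returned by the retriever, best first.
    relevant:   {doc_id: grade} from the relevance mapping.
    """
    gains = [relevant.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal_gains = sorted(relevant.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0


relevant = {"d1": 1, "d4": 1}
score = ndcg_at_k(["d1", "d2", "d4"], relevant)   # one relevant doc pushed to rank 3
perfect = ndcg_at_k(["d4", "d1", "d2"], relevant)  # both relevant docs on top
```

Averaging this value over all evaluation queries and multiplying by 100 gives the kind of score quoted in this post.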

The exact metric is less important than the discipline. Use one metric, run it consistently, and compare changes on the same sample.

In the tutorial, I take a random sample of 50 queries with seed 42 so every run uses the same slice. On that setup, BM25 scores around 28, dense retrieval performs better, RRF lands between the two, and hybrid plus reranking jumps to about 47.
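
A sketch of the fixed-sample comparison, assuming each retrieval variant is wrapped as a function from query text to ranked document IDs. The helper names, the toy metric, and the two fake variants are made up for the demo. In practice you would pass your real retrievers and your chosen metric.

```python
import random


def sample_queries(query_ids, n=50, seed=42):
    """Draw a reproducible sample: same seed and same IDs give the same slice."""
    rng = random.Random(seed)
    ids = sorted(query_ids)  # sort first so input ordering cannot change the sample
    return rng.sample(ids, min(n, len(ids)))


def mean_score(retrieve, query_ids, queries, qrels, metric):
    """Average a per-query metric over the sample and scale it to 0-100.

    retrieve: callable query_text -> ranked list of doc IDs.
    metric:   callable (ranked_ids, relevant) -> float between 0 and 1.
    """
    scores = [metric(retrieve(queries[qid]), qrels[qid]) for qid in query_ids]
    return 100 * sum(scores) / len(scores)


def hit_at_k(ranked_ids, relevant, k=10):
    """Toy metric for the demo: 1.0 if any relevant doc is in the top k."""
    return 1.0 if any(doc_id in relevant for doc_id in ranked_ids[:k]) else 0.0


queries = {f"q{i}": f"question {i}" for i in range(200)}
qrels = {qid: {"d1": 1} for qid in queries}
sample_a = sample_queries(queries, n=50)
sample_b = sample_queries(queries, n=50)

# Two fake variants so the comparison loop has something to compare.
variants = {"always_d1": lambda text: ["d1"], "always_d2": lambda text: ["d2"]}
report = {
    name: mean_score(fn, sample_a, queries, qrels, hit_at_k)
    for name, fn in variants.items()
}
```

Every variant is scored on `sample_a`, so a difference between two numbers in `report` comes from the retriever and not from a different draw of queries.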

That result is also a good warning. Hybrid search does not automatically beat every individual method before reranking. On this dataset, dense retrieval alone is stronger than BM25, and RRF by itself sits between them. The reranker is what creates the biggest lift.

So do not blindly copy the stack and assume it wins. Test it on your corpus.

Try dense plus reranking. Try BM25 plus dense plus reranking. Try different chunk sizes. Try a different embedding model. Try a larger candidate count before reranking. Keep the query sample fixed while you compare. Then you are improving retrieval instead of guessing.

How this maps to Postgres

You do not have to run the whole system in local files forever.

For a production application, Postgres is often a practical place to put the retrieval layer. My Postgres RAG tutorials show the same basic pattern with pgvector, Postgres full-text search, and reranking.

The architecture looks familiar:

Dense search can run through pgvector or pgvectorscale similarity search.
Keyword search can run through Postgres full-text search with tsvector and ts_rank_cd.
Hybrid retrieval combines semantic and keyword candidates before the answer step.
Reranking orders the combined candidates.
Answer generation receives only the selected context.

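Postgres full-text search is not exactly BM25 out of the box. Its built-in ranking functions, ts_rank and ts_rank_cd, score each document from its own term matches and do not use corpus-wide statistics the way BM25's inverse document frequency does. There are extensions and other search systems if you need BM25 specifically. But for many applications, Postgres keyword search plus vector search plus reranking gets you a long way, and you keep the documents, metadata, embeddings, chat history, and application records in one database.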

The appeal for many AI applications is exactly that. You stay close to tools you already understand.

Build your own evaluation set

The biggest gap in real projects is the evaluation data.

Benchmarks give you corpus, queries, and relevance labels. Your internal docs do not. You have to create that structure yourself.

A practical first pass is to start from your documents and generate user-style questions that each document can answer. The prompt can be simple: "Act as a user, read this document, and write one realistic question that is answerable from it." Store the question, the document ID, and the relationship between them.
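
A rough sketch of that first pass. The `ask_llm` callable is a placeholder for whichever model client you use, and `fake_llm` only exists so the example runs offline. The prompt wording, the `gen-` ID scheme, and the sample documents are assumptions.

```python
PROMPT_TEMPLATE = (
    "Act as a user of this knowledge base. Read the document below and write "
    "one realistic question that can be answered from it.\n\nDocument:\n{document}"
)


def build_eval_set(corpus, ask_llm):
    """Generate one question per document and record which document answers it.

    corpus:  {doc_id: text}
    ask_llm: callable prompt -> str (hypothetical; plug in your own model call)
    Returns (queries, qrels) in the same shape as a benchmark dataset.
    """
    queries, qrels = {}, {}
    for n, (doc_id, text) in enumerate(corpus.items(), start=1):
        question = ask_llm(PROMPT_TEMPLATE.format(document=text)).strip()
        if not question:
            continue  # skip documents the model returned nothing for
        qid = f"gen-{n}"
        queries[qid] = question
        qrels[qid] = {doc_id: 1}  # the source document is the known relevant one
    return queries, qrels


def fake_llm(prompt):
    """Offline stand-in: echoes part of the document as a question."""
    last_line = prompt.rsplit("\n", 1)[-1]
    return "What does the documentation say about " + last_line[:20] + "?"


corpus = {"d1": "Refund policy: 30 days.", "d2": "Shipping takes 5 days."}
queries, qrels = build_eval_set(corpus, fake_llm)
```

The output has the same corpus, queries, and relevance-mapping shape as a benchmark dataset, so the evaluation code you wrote for the benchmark can run on your own data unchanged.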

Then improve it manually. Remove weak questions. Add real user questions. Add multiple relevant documents where needed. Keep the set small enough that you can inspect it, but large enough that one lucky query does not fool you.

Once you have that dataset, every retrieval experiment becomes cleaner. You can change chunking, embeddings, BM25 settings, candidate counts, fusion, or reranking and see if the score moves.

That is how you get beyond vibes.

A practical build order

If I were building hybrid search for a client project, I would not start with every component at once.

Start with the corpus and a small evaluation set. Build dense retrieval first, because it is the baseline most teams already expect. Add BM25 or keyword search next so exact terms do not get missed. Fuse the rankings with RRF. Add reranking once you have enough candidate volume to make it useful.

Then evaluate.

Only after that would I wire the retrieval layer into the answer generation path. If retrieval is bad, the answer model cannot save you. It can only write a more convincing answer from weak evidence.

This is also why hybrid search belongs in the broader skill set for production AI systems. The model call is only one layer. The data and retrieval layer decides what the model is allowed to know.

If you want to see the full implementation behind this post, watch the hybrid search walkthrough. If you want the database version, the Postgres hybrid search tutorial follows the same approach with Postgres, pgvector, full-text search, and reranking.