Chunking, reranking, and hybrid search: make RAG actually work
Most RAG implementations work poorly because they get three things wrong. A practical guide to chunking documents, reranking results, and combining keyword with semantic search — without becoming a search engineer.
Most RAG implementations work poorly. The model is fine — Claude or GPT-5 are happy to answer based on retrieved documents. The retrieval is what fails. You ask a question, the system returns the wrong chunks, and the model produces a confident answer based on the wrong information.
This article is the practical guide to the three things that fix bad retrieval: chunking strategy, reranking, and hybrid search. Get these right and most "RAG doesn't work for our use case" complaints evaporate.
We'll skip the deep technical math and focus on what to actually do.
Why retrieval fails
The default RAG pipeline:
- Split documents into chunks.
- Embed each chunk.
- When a question comes in, embed it and find the most similar chunks (cosine similarity).
- Pass those chunks to the LLM.
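The four steps above can be sketched in a few lines of Python. This is an illustration only: `toy_embed` is a hypothetical stand-in that counts words instead of calling a real embedding model, so the example runs without any dependencies.

```python
import math
import re
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Embed the query, score every chunk, and return the top_k most similar."""
    q = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:top_k]

chunks = [
    "To cancel your subscription, open Settings and choose Billing.",
    "Our subscription plans give growing teams more value.",
    "Reset your password from the login page.",
]
top = retrieve("how do I cancel my subscription", chunks, top_k=2)
```

With this toy data the cancellation chunk ranks first, but the marketing sentence still ranks second purely because it shares the word "subscription" with the query.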
Each step has failure modes:
Bad chunking produces chunks that are too short to be useful, too long to be coherent, or split in the middle of a logical idea so that neither half retrieves well.
Naive similarity finds chunks that share vocabulary with the query but are not actually relevant. The query "how do I cancel my subscription" might retrieve any chunk that mentions "subscription," including unrelated marketing copy.
No reranking means whatever similarity returns is what the LLM sees. The most-similar chunk is not always the most-relevant chunk.
Pure semantic search misses exact-keyword matches that matter. A query for "error code 503" might miss a chunk that has the exact text "error code 503" because the semantic vector for the query doesn't match the chunk's overall topic.
The three fixes — better chunking, reranking, hybrid search — address each of these.
Fix 1: Better chunking
Chunking is the most impactful single decision in your RAG pipeline and the one most people don't think about until results are bad.
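The naive default in most no-code tools is to chunk by fixed token count (e.g., 500 tokens with a 50-token overlap). This works acceptably on uniform prose but often cuts through sentences, lists, and tables, because the split points ignore the content.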
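For reference, fixed-size chunking with overlap looks roughly like the sketch below. It assumes whitespace-separated words as "tokens" for simplicity; a real pipeline would count tokens with the embedding model's tokenizer. The function name and the tiny sizes in the demo are illustrative.

```python
def fixed_size_chunks(tokens: list[str], size: int = 500, overlap: int = 50) -> list[list[str]]:
    """Slide a window of `size` tokens forward by `size - overlap` each step."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the window already reached the end of the document
    return chunks

# Tiny sizes so the effect is visible; words stand in for tokens here.
tokens = "the refund policy applies only if the request is made within 30 days".split()
chunks = fixed_size_chunks(tokens, size=6, overlap=2)
```

Here the first chunk ends at "only if" and the second begins with the same two words, so the condition is separated from the rule it qualifies. That is the kind of split the strategies below try to avoid.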
Better strategies:
Semantic chunking
Split documents at semantic boundaries — places where the topic changes — rather than at arbitrary token counts. Most modern frameworks (LangChain, LlamaIndex) have semantic chunkers that detect topic shifts and split there.
The win: chunks contain complete ideas. The model gets coherent context instead of half-arguments.
Structure-aware chunking
If your documents have natural structure (Markdown headers, HTML sections, PDF chapters, code function boundaries), use it. Chunk by section, not by token.
For Markdown:
- Each H2 section is a chunk (with H1 context prepended).
- Long H2 sections are sub-chunked, with the H2 heading repeated as context.
For code:
- Each function or class is a chunk.
- The chunk includes the file path and any imports.
This is dramatically better than blind fixed-size chunking for any structured content.
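A minimal sketch of the Markdown case, assuming H1 and H2 headings written with `#` and `##`. It does not handle `#` characters inside fenced code blocks or the sub-chunking of long sections, and the `chunk_markdown` name and output shape are illustrative choices.

```python
import re

def chunk_markdown(text: str) -> list[dict]:
    """One chunk per H2 section, each prefixed with 'H1 title > H2 heading'."""
    title, heading, body = "", "", []
    chunks: list[dict] = []

    def flush() -> None:
        # Emit the section collected so far, if it has any content.
        content = "\n".join(body).strip()
        if content:
            context = " > ".join(part for part in (title, heading) if part)
            chunks.append({"heading": heading, "text": f"{context}\n\n{content}".strip()})

    for line in text.splitlines():
        h1 = re.match(r"#\s+(.+)", line)
        h2 = re.match(r"##\s+(.+)", line)
        if h1:
            flush()
            title, heading, body = h1.group(1).strip(), "", []
        elif h2:
            flush()
            heading, body = h2.group(1).strip(), []
        else:
            body.append(line)  # includes H3+ lines, which stay inside the section
    flush()
    return chunks

doc = """# Billing Guide
Intro text.

## Cancelling
Open Settings, then Billing.

## Refunds
Refunds take 5 days.
"""
sections = chunk_markdown(doc)
```

Each chunk's text starts with its heading path (for example "Billing Guide > Cancelling"), so the embedding sees which document and section the passage came from.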
Hierarchical chunking
A pattern from recent RAG research. Create chunks at multiple levels:
- Small chunks (200-500 tokens) for precise retrieval.
- Medium chunks (1000-2000 tokens) for context.
- Document summaries (50-100 tokens) for high-level matching.
When retrieving, match against small chunks but return the surrounding medium chunk. The LLM gets precise relevance plus enough context to make sense of it.
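The "match small, return medium" idea can be sketched as follows. The assumptions are loud ones: sentences play the role of small chunks, and `toy_score` is a hypothetical word-overlap scorer standing in for vector similarity.

```python
import re
from dataclasses import dataclass

@dataclass
class SmallChunk:
    text: str
    parent_id: int  # index of the medium chunk this piece was cut from

def build_small_chunks(medium_chunks: list[str]) -> list[SmallChunk]:
    """Cut each medium chunk into sentences, remembering the parent."""
    small = []
    for parent_id, medium in enumerate(medium_chunks):
        for sentence in re.split(r"(?<=[.!?])\s+", medium.strip()):
            if sentence:
                small.append(SmallChunk(sentence, parent_id))
    return small

def toy_score(query: str, text: str) -> float:
    """Hypothetical stand-in for vector similarity: share of query words present."""
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    t = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(q & t) / len(q) if q else 0.0

def retrieve_parents(query: str, medium_chunks: list[str],
                     small_chunks: list[SmallChunk], top_k: int = 1) -> list[str]:
    """Rank small chunks, then return their distinct parent chunks in that order."""
    ranked = sorted(small_chunks, key=lambda s: toy_score(query, s.text), reverse=True)
    parent_ids: list[int] = []
    for s in ranked:
        if s.parent_id not in parent_ids:
            parent_ids.append(s.parent_id)
        if len(parent_ids) == top_k:
            break
    return [medium_chunks[i] for i in parent_ids]

medium = [
    "Invoices are issued monthly. Payment is due within 30 days. Late payments incur a 2% fee.",
    "The API uses token authentication. Tokens expire after 24 hours. Refresh them with the refresh endpoint.",
]
small = build_small_chunks(medium)
result = retrieve_parents("when do tokens expire", medium, small)
```

The query matches the single sentence about expiry, but the model receives the whole authentication passage, including how to refresh a token.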
Choosing chunk size
A rough heuristic by content type:
| Content type | Chunk size | Rationale |
| --- | --- | --- |
| Technical docs / manuals | 500-1000 tokens | Concepts are self-contained, moderate density |
| Code | One function/class per chunk | Logical units, not arbitrary slices |
| Long-form articles / books | 1000-2000 tokens | Ideas develop over several paragraphs |
| Customer support tickets | One ticket = one chunk | Don't split within a ticket |
| Legal / contracts | Section-based, often 500-1500 tokens | Logical units; preserve clause boundaries |
| Spreadsheet data | Row + headers | Each row as a chunk, with column headers |
If you're using a no-code tool that hides chunking, run a test: ask 10 questions whose answers you know are in the corpus. If retrieval is missing the right chunk regularly, chunking is your problem.
A pattern that works well: "page-or-section, then question-grounded"
For most personal RAG use cases, the pattern that works:
- Chunk by document section (Markdown H2, PDF chapter, etc.).
- For chunks over ~2000 tokens, sub-chunk by paragraph.
- Prepend each chunk with the document title and section heading as context.
- Append a brief description of what kinds of questions this chunk answers (auto-generated by a fast LLM on indexing).
The auto-generated "what questions this chunk answers" trick is surprisingly underused and surprisingly powerful. It works because user queries are often phrased as questions, and matching them against question-form metadata is more precise than matching against the raw document text.
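One way the indexed text for a chunk might be assembled is sketched below. The question generator is passed in as a function, and `fake_question_generator` is a placeholder that returns canned output where a real setup would call a fast LLM. The labels ("Document:", "Section:") are arbitrary choices, not a standard.

```python
from typing import Callable

def build_index_text(title: str, heading: str, body: str,
                     generate_questions: Callable[[str], list[str]]) -> str:
    """Combine title, section heading, body, and generated questions into one string to embed."""
    parts = [f"Document: {title}", f"Section: {heading}", body.strip()]
    questions = generate_questions(body)
    if questions:
        listed = "\n".join(f"- {q}" for q in questions)
        parts.append(f"Questions this section answers:\n{listed}")
    return "\n\n".join(parts)

def fake_question_generator(body: str) -> list[str]:
    """Placeholder for an LLM call made at indexing time (assumption)."""
    return ["How do I cancel my subscription?"]

index_text = build_index_text(
    title="Billing Guide",
    heading="Cancelling",
    body="Open Settings, choose Billing, then select Cancel plan.",
    generate_questions=fake_question_generator,
)
```

The enriched string is what gets embedded. A common design choice is to keep the original body separately, so the LLM is shown the clean text rather than the metadata.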
Fix 2: Reranking
After the initial retrieval (typically returning 20-50 chunks based on vector similarity), use a reranker to reorder them by actual relevance to the query.
The reranker is a separate model — usually a smaller, specialised one — that takes (query, chunk) pairs and outputs a relevance score. You apply it to your top 20-50 results, sort by the new score, and pass the top 3-5 to the LLM.
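The win: dramatically better precision. Vector similarity is fast but approximate, because the query and each chunk are embedded separately and only compared afterwards. A reranker typically reads the query and the chunk together, which is slower but much more accurate. Combining them gives you fast retrieval with accurate ordering.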
Reranker options in 2026:
- Cohere Rerank — the established API-based reranker. ~$1/1000 queries.
- Voyage AI rerank-2 — strong commercial option, often better than Cohere on niche domains.
- bge-reranker-v2-m3 — open-source, runs locally or on cheap hosting.
- Jina Reranker — another strong open-source option.
In a typical pipeline:
- Vector search returns top 50 chunks (cheap, fast).
- Reranker scores all 50 against the query (more expensive, slower).
- Top 5 by rerank score get passed to the LLM.
Total latency added: ~200-500ms. Quality improvement: often 20-40% on retrieval accuracy benchmarks.
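The reranking step itself is small. In the sketch below the scorer is pluggable: `toy_relevance` is a hypothetical word-overlap function used only so the example runs offline, and in practice it would be replaced by a call to a cross-encoder model or a hosted rerank API.

```python
import re
from typing import Callable

Scorer = Callable[[str, str], float]  # (query, chunk) -> relevance score

def rerank(query: str, candidates: list[str], score: Scorer, top_n: int = 5) -> list[str]:
    """Score every (query, chunk) pair and keep the top_n by score."""
    scored = [(score(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]

def toy_relevance(query: str, chunk: str) -> float:
    """Hypothetical stand-in for a reranker model: share of query words in the chunk."""
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    c = set(re.findall(r"[a-z0-9]+", chunk.lower()))
    return len(q & c) / len(q) if q else 0.0

# Candidates in the order a vector search might have returned them.
candidates = [
    "Our subscription plans give growing teams more value.",
    "To cancel your subscription, open Settings and choose Billing.",
    "Subscription gift cards make a great present.",
]
best = rerank("how do I cancel my subscription", candidates, toy_relevance, top_n=2)
```

The cancellation chunk moves from second place to first, and only the top two chunks go on to the LLM.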
For no-code tools that don't include reranking by default (NotebookLM, most basic n8n setups), this is the single highest-impact upgrade you can make. n8n has a Cohere Rerank node; LangChain and LlamaIndex include reranker integrations natively.
Fix 3: Hybrid search
Vector similarity catches semantic matches. Keyword search (BM25 or similar) catches exact matches. Each misses things the other catches.
A query for "how to fix HTTP 503 errors on our gateway":
- Vector search finds chunks about HTTP errors, gateway issues, troubleshooting.
- Keyword search finds chunks that specifically mention "503" — which might be the actual answer.
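To make the keyword side concrete, here is a compact BM25 scorer. It is a sketch using one common variant of the formula, with the usual default parameters `k1=1.5` and `b=0.75` and a simple regex tokenizer, all of which are assumptions. A real system would use the search engine's built-in BM25.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n if n else 0.0
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "Gateway troubleshooting: check upstream health and retry policies.",
    "Error code 503 means the gateway has no healthy upstream.",
    "Overview of status pages for the public API.",
]
scores = bm25_scores("how to fix HTTP 503 errors on our gateway", docs)
```

The second document scores highest because it contains the rare exact term "503" as well as "gateway". A document with no query terms scores zero, however topically similar it is.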
Hybrid search runs both and combines the results. A common way to combine them is Reciprocal Rank Fusion (RRF): given the rankings from each method, RRF produces a combined ranking that takes both signals into account without needing the two score scales to be comparable.
The implementation in most tools is simple:
- Run vector search → get ranked list A.
- Run BM25 / keyword search → get ranked list B.
- For each chunk, compute its RRF score = 1/(k + rank_in_A) + 1/(k + rank_in_B), with ranks starting at 1 and k typically 60. A chunk that appears in only one list gets only that list's term.
- Sort by combined score, return top results.
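The steps above translate almost directly into code. This sketch assumes each search returns a list of chunk IDs ordered best-first. The IDs are made up for the demo.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several best-first rankings into one list of (chunk_id, score), best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):  # ranks start at 1
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

vector_ranked = ["c7", "c2", "c9"]   # list A: vector search, best first
keyword_ranked = ["c2", "c4", "c7"]  # list B: BM25 / keyword search, best first
fused = reciprocal_rank_fusion([vector_ranked, keyword_ranked])
fused_ids = [chunk_id for chunk_id, _ in fused]
```

Chunk `c2` comes out on top with 1/62 + 1/61: it is second in one list and first in the other, which beats `c7` at first and third. Chunks found by only one method (`c4`, `c9`) stay in the results but rank lower.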
In 2026, hybrid search is supported natively by:
- Weaviate (vector store) — hybrid search out of the box.
- Qdrant: hybrid search by combining dense and sparse vectors and fusing the results.
- Pinecone — hybrid search via sparse vectors.
- Elastic / OpenSearch — combined keyword + vector.
- Most n8n RAG templates — hybrid is the default in modern templates.
The win: dramatically better recall on queries that contain specific identifiers, codes, names, or jargon. For technical content (code, error codes, product names, regulatory references), hybrid search is essentially required.
For your use case: if your queries often contain specific terms that should match exactly (numbers, names, codes, exact phrases), turn on hybrid search. The cost is low and the gain is large.
Putting them together
The state-of-the-art RAG pipeline in 2026:
Query
↓
Query rewriter (optional — clean up the query, expand abbreviations)
↓
Hybrid retrieval: vector + keyword search
↓
Top 30-50 results
↓
Reranker
↓
Top 5 by reranker score
↓
LLM with retrieved chunks + query
↓
Cited answer
Each step is cheap individually. Combined, they produce retrieval quality that is qualitatively different from "vector similarity → top 5 → LLM."
A few less common but powerful additions:
Query expansion. Rewrite the user query into multiple variations and search each. Catches different phrasings.
Multi-step retrieval. For complex questions, do multiple retrievals. First retrieval identifies sub-questions; second retrieval fetches answers to each sub-question.
Conversational retrieval. In a multi-turn conversation, use the conversation history to inform retrieval ("they asked about X earlier, so for this question, prioritise content related to X").
Source filtering. Use metadata filters to scope retrieval. "Only search documents tagged 'EU regulations' and dated after 2023."
These are increasingly available in no-code RAG tools but always check whether you're getting the basic three (good chunking, reranker, hybrid search) before reaching for the advanced ones.
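Of the additions above, source filtering is the simplest to sketch. The example assumes each chunk carries a set of tags and a publication date. The `Chunk` shape and field names are illustrative, and most vector stores expose an equivalent filter in their own query syntax.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Chunk:
    text: str
    tags: set[str] = field(default_factory=set)
    published: date = date.min

def filter_chunks(chunks: list[Chunk], required_tag: str, after: date) -> list[Chunk]:
    """Keep only chunks with the tag that were published strictly after the date."""
    return [c for c in chunks if required_tag in c.tags and c.published > after]

corpus = [
    Chunk("GDPR data retention rules.", {"EU regulations"}, date(2024, 3, 1)),
    Chunk("Older GDPR guidance.", {"EU regulations"}, date(2021, 6, 15)),
    Chunk("US sales tax overview.", {"US tax"}, date(2024, 5, 20)),
]
# "Only search documents tagged 'EU regulations' and dated after 2023."
scoped = filter_chunks(corpus, "EU regulations", after=date(2023, 12, 31))
```

Filtering before the similarity search means irrelevant sources cannot take up any of the top-K slots.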
How to measure RAG quality
You cannot improve what you don't measure. A few practical evaluation strategies:
The "golden questions" test. Pick 20 questions whose correct answers you know. Run them through your RAG. Score: did the right chunks get retrieved? Did the model produce the right answer? Do this monthly.
Retrieval recall at K. For each golden question, identify which chunks contain the answer. Then check: did the retrieval system return any of those chunks in its top K (5, 10, 20)? Measure the fraction.
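The measurement described above fits in a few lines. This sketch follows the definition in the text: a question counts as a hit if any of its answer chunks appears in the top K. The question and chunk IDs are invented for the demo.

```python
def recall_at_k(retrieved: dict[str, list[str]], relevant: dict[str, set[str]], k: int) -> float:
    """Fraction of golden questions with at least one answer chunk in the top k results."""
    if not relevant:
        return 0.0
    hits = 0
    for question, answer_chunks in relevant.items():
        top_k = set(retrieved.get(question, [])[:k])
        if top_k & answer_chunks:
            hits += 1
    return hits / len(relevant)

# What the retriever returned for each golden question, best first.
retrieved = {
    "q1": ["c1", "c5", "c9"],
    "q2": ["c4", "c2", "c8"],
    "q3": ["c7", "c6", "c3"],
}
# Which chunks actually contain each answer.
relevant = {"q1": {"c1"}, "q2": {"c2"}, "q3": {"c0"}}

at_1 = recall_at_k(retrieved, relevant, k=1)
at_3 = recall_at_k(retrieved, relevant, k=3)
```

On this data the score rises from one in three at K=1 to two in three at K=3. A gap like that suggests the right chunks are being found but ranked too low, which is the situation a reranker addresses.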
LLM-as-judge evaluation. A more advanced version: have a strong model (Claude Opus, GPT-5) score whether the answer is correct, citation-accurate, complete, and well-grounded. Run on a batch of representative questions. We have a whole article on evals.
User-perceived quality. For team or production RAG, add a thumbs-up / thumbs-down to every answer. Look at the patterns of thumbs-down. They cluster around specific question types — fix those.
The biggest mistake is to skip measurement entirely. "It feels okay" is not measurement. You will not know if your improvements are working without it.
A worked example: improving a flagging RAG
Suppose you've built a personal RAG for your company's internal documentation. Quality is mediocre — about 60% of questions get a useful answer. The improvements you'd apply, in order:
Audit retrieval first. For ten of the bad-quality answers, look at which chunks were retrieved. Did the right chunk get retrieved? If yes, the problem is with the model or prompt. If no, the problem is retrieval.
If retrieval is the problem:
- Check chunking. Are chunks coherent? Is your chunker splitting in the middle of important ideas? Switch to semantic or structure-aware chunking.
- Add a reranker. If you're using basic vector search and top-5 results, add Cohere Rerank or bge-reranker-v2-m3 between retrieval and the LLM. Often a 20-30% quality lift.
- Add hybrid search. Especially if your queries contain specific terms (product names, error codes, jargon).
- Examine your indexing. Are chunks tagged with metadata (document type, section, date)? Use metadata filters in retrieval.
If the model is the problem (right chunks retrieved, wrong answer):
- Tighten the prompt. Tell the model explicitly: "answer based only on the provided context. If the context doesn't cover the question, say so."
- Add citations. Require the model to cite the specific chunk used. This both helps you debug and reduces hallucination.
- Use a stronger model. If you're using a small fast model, try Claude Sonnet 4.5 or GPT-5.
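The first two fixes can be combined in the prompt itself. The sketch below numbers the retrieved chunks so the model can cite them. The exact wording and the `[n]` citation format are one reasonable choice, not a standard.

```python
def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    """Build a prompt that restricts the model to the context and asks for citations."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer based only on the provided context. "
        "If the context doesn't cover the question, say so.\n"
        "Cite the chunk you used by its number, for example [1].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_grounded_prompt(
    "How long do refunds take?",
    ["Refunds take 5 business days.", "Cancel from Settings > Billing."],
)
```

Because every chunk has a number, a wrong answer can be traced back to the chunk the model claims to have used.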
After two or three rounds of this kind of investigation-and-fix, most RAG implementations reach 85-90% useful-answer rate, which is the threshold where users actually adopt and trust the system.
Common mistakes
A few specific things that consistently catch people:
Indexing the wrong content. Marketing pages, outdated docs, low-quality blog content. The model can't distinguish; everything in the index is treated as canon. Curate ruthlessly.
Forgetting to update the index when docs change. A RAG with stale content gives confidently wrong answers. Either re-index regularly or use a system that auto-syncs.
Ignoring evaluation. Most RAGs are deployed without measurement and then never improved. The "feels okay" phase lasts forever. Build evaluation in from day one.
Treating retrieval like a black box. "It just doesn't work" is not a diagnosis. Open up the retrieval and look at what's coming back. Almost always, the problem becomes obvious once you see it.
Over-engineering early. You don't need to start with the state-of-the-art pipeline. Start with NotebookLM or basic vector search. Only add complexity when you've identified a specific bottleneck.
When this matters
The three fixes — better chunking, reranking, hybrid search — matter most when:
- The corpus is technical, with specific terminology.
- Queries contain exact-match requirements (codes, names, IDs).
- Users care about precision (legal, compliance, customer support).
- The system is used at scale (a small quality gain matters a lot when it's hit 10,000 times a day).
They matter less when:
- The corpus is small (under 100 documents) and well-organised.
- Queries are open-ended ("what's our take on X").
- Users are forgiving and will iterate to find the answer.
- The use case is exploratory rather than precise.
For most personal and small-team RAG, simply moving from basic vector similarity to vector + reranker is the highest-leverage upgrade. Hybrid search is the next addition. Chunking matters at every scale.
The takeaway
Bad RAG is almost always bad retrieval, and bad retrieval almost always comes down to three things: chunking, reranking, and search method. Fix the three and most of your RAG quality complaints disappear.
You don't need to be a search engineer to apply these. The tools — Weaviate, Pinecone, Qdrant, n8n templates, LangChain integrations — have made the techniques accessible. The bottleneck now is mostly knowing that they exist and being deliberate about applying them.
If your RAG isn't working, don't blame the model. Look at the chunks coming back, fix the retrieval, and then re-evaluate. Most "RAG isn't working" complaints become "RAG works well now" within a week of doing this work seriously.