Building a production RAG: ingestion, embedding, retrieval, reranking, eval
A production RAG pipeline is six stages, each with specific patterns that determine quality. This piece covers the architecture, the choices at each stage, and the iterative evaluation discipline that distinguishes RAG that works from RAG that disappoints.
If you've shipped a simple RAG system, you know the pattern: chunk documents, embed them, store in a vector DB, retrieve top-k on a query, stuff into the prompt. It works for demos. In production, it often disappoints.
The gap between demo RAG and production RAG is large. Real-world documents are messy. Real-world queries are ambiguous. Quality varies wildly across query types. Cost and latency are real constraints. Updates and versioning matter.
This article covers what a production-grade RAG pipeline looks like — the six stages, the patterns at each, and the eval discipline that distinguishes systems that work from systems that disappoint.
The pipeline
A production RAG system has six logical stages:
[Documents]
↓
1. Ingestion (parsing, cleaning, metadata extraction)
↓
2. Chunking (splitting into retrieval units)
↓
3. Embedding (vectors + storage)
↓
[Query]
↓
4. Retrieval (semantic + lexical + filters)
↓
5. Reranking (top-k → top-N)
↓
6. Generation (LLM + context)
↓
[Response]
Each stage has its own quality concerns. Improving any one improves the whole.
We'll go through each.
Stage 1: Ingestion
Garbage in, garbage out. The quality of your ingestion determines the ceiling on everything downstream.
Source types you'll encounter:
- PDFs (usually the hardest to parse).
- Word docs.
- HTML pages.
- Markdown.
- Spreadsheets.
- Slides (PowerPoint, Google Slides).
- Emails.
- Code.
- Structured data (CSVs, JSON).
Each has its own parsing challenges.
PDF parsing. PDFs are a presentation format, not a data format. They're notoriously hard to parse. Strategies:
- Text-based PDFs: pdfplumber, pymupdf, unstructured. Work for clean text.
- Scanned PDFs: OCR with Tesseract, Google Cloud Vision, or AWS Textract.
- Layout-aware: LayoutLM, Mathpix, or LLM-based parsing (GPT-4 vision) for complex layouts (tables, multi-column).
Modern approach for hard PDFs: use a vision-capable LLM to OCR and structure the content. Cost is higher but quality is dramatically better than traditional OCR.
Handling tables. Tables in unstructured text are a pain. Either:
- Convert to markdown tables (preserves structure).
- Linearize into prose ("Row 1: customer A had 50 orders...").
- Treat as separate retrievable units with a structured schema.
The right choice depends on what queries you'll get.
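As a rough illustration of the first two options, here is a minimal sketch that renders the same table as a markdown table and as linearized row strings. The function names and the sample data are hypothetical; a real pipeline would take the header and rows from whatever your parser emits.

```python
def to_markdown_table(header, rows):
    """Render a table as markdown, keeping the column structure visible."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


def linearize_rows(header, rows):
    """Turn each row into a standalone 'column: value' string."""
    return [
        "; ".join(f"{name}: {value}" for name, value in zip(header, row))
        for row in rows
    ]


# Hypothetical sample table
header = ["customer", "orders"]
rows = [["A", 50], ["B", 12]]
markdown = to_markdown_table(header, rows)
sentences = linearize_rows(header, rows)
```

Linearized rows can be retrieved on their own, while the markdown form keeps neighbouring rows together for comparison-style questions.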
Metadata extraction. Each document has metadata that matters:
- Title, author, date, version.
- Topic, category, tags.
- Source URL or location.
- Permissions / visibility.
Capture this at ingestion. It becomes filter parameters at retrieval time.
Cleaning. Strip boilerplate:
- Headers/footers that repeat on every page.
- Navigation, ads, cookie banners.
- "Table of contents" pages.
- Empty or duplicate paragraphs.
A clean corpus retrieves better. Boilerplate causes false matches.
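One simple way to catch repeating headers and footers is to count how many pages each line appears on and drop the lines that recur on most of them. The sketch below assumes you already have one text string per page; the function name and the `min_fraction` threshold are illustrative choices, not a standard.

```python
from collections import Counter


def strip_repeated_lines(pages, min_fraction=0.6):
    """Drop lines that recur on at least min_fraction of pages."""
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page
        counts.update({line.strip() for line in page.splitlines() if line.strip()})

    # Require at least two occurrences so short documents are not over-cleaned
    threshold = max(2, int(len(pages) * min_fraction))
    boilerplate = {line for line, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = [line for line in page.splitlines() if line.strip() not in boilerplate]
        cleaned.append("\n".join(kept))
    return cleaned
```

This is a heuristic: a line that legitimately repeats across pages (a recurring section title, for instance) would also be removed, so it is worth sampling the output.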
Quality checks.
- Did the parser actually extract text? (Some PDFs return empty strings.)
- Are there encoding issues?
- Are tables preserved?
- Are figures captioned or skipped?
Build sanity checks into the ingestion pipeline. Catch broken parses before they pollute the index.
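A minimal sketch of such a sanity check, covering the empty-parse and encoding questions above. The function name, the problem labels, and the thresholds (`min_chars`, the 95% printable ratio) are assumptions to tune for your corpus.

```python
def check_parse(text, min_chars=200):
    """Return a list of problems found in parsed text; an empty list means it passed."""
    stripped = text.strip()
    if not stripped:
        return ["empty"]

    problems = []
    if len(stripped) < min_chars:
        problems.append("too_short")
    # U+FFFD is the replacement character decoders emit for undecodable bytes
    if "\ufffd" in text:
        problems.append("encoding_errors")
    # A low share of printable characters usually means binary junk leaked in
    printable = sum(ch.isprintable() or ch.isspace() for ch in text)
    if printable / len(text) < 0.95:
        problems.append("non_printable_content")
    return problems
```

Documents that return a non-empty list can be routed to a fallback parser or a review queue instead of being indexed.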
Stage 2: Chunking
You have clean documents. Now you split them into retrieval units (chunks).
The fundamental trade-off:
- Small chunks: precise retrieval (focused on the question), but missing context.
- Large chunks: more context, but less precise (the relevant bit is buried).
Both extremes lose quality. The sweet spot varies by content type.
Common strategies:
Fixed-size chunks with overlap. Split into N-token chunks (300-800 tokens) with 50-100 token overlap. Simple, works as a baseline.
Sentence-aware chunking. Split on sentence boundaries (using nltk, spaCy, or similar). Avoids mid-sentence splits.
Paragraph-based. Each paragraph is a chunk. Works for documents with well-structured paragraphs.
Hierarchical (small + large). Two indexes:
- Small chunks (300 tokens) for precise retrieval.
- Large chunks (1500 tokens) or full sections for context.
At retrieval time, search the small chunks, then pass the matching large parent to the LLM.
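The small-to-parent lookup can be sketched as follows. The mapping names (`parent_of`, `parents`) and the `max_parents` cap are hypothetical; in practice the parent ID would usually be stored in each small chunk's metadata.

```python
def expand_to_parents(small_hits, parent_of, parents, max_parents=3):
    """Map ranked small-chunk IDs to unique parent texts, keeping rank order."""
    seen = set()
    result = []
    for chunk_id in small_hits:
        parent_id = parent_of[chunk_id]
        # Several small chunks often share one parent; include it only once
        if parent_id not in seen:
            seen.add(parent_id)
            result.append(parents[parent_id])
        if len(result) == max_parents:
            break
    return result
```

Deduplicating parents matters because the top small hits frequently come from the same section, and sending that section twice wastes context.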
Document-structure-aware. Use the document's structure (headings, sections) to inform chunks. Each section becomes a chunk, with hierarchy preserved.
Semantic chunking. Use embeddings to find natural breakpoints (places where topic shifts). More expensive but produces better chunks for some content.
LLM-based summarization chunking. Long documents are summarized into hierarchical chunks at multiple levels (paragraph summary, section summary, document summary). LLM generates these once at ingestion.
The right strategy depends on content. Articles and wikis: paragraph or hierarchical. Code: function-level. Conversations: turn-based. Technical docs: structure-aware.
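For reference, the fixed-size baseline is only a few lines. This sketch works on a list of tokens and is tokenizer-agnostic; the function name and defaults are illustrative, and splitting on whitespace (as in the usage note) is only a stand-in for your embedding model's tokenizer.

```python
def chunk_tokens(tokens, size=500, overlap=50):
    """Split a token list into chunks of `size` tokens sharing `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")

    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once the current chunk reaches the end of the document
        if start + size >= len(tokens):
            break
    return chunks
```

Usage: `chunk_tokens(text.split(), size=500, overlap=50)` gives word-based chunks; join each chunk with spaces before embedding.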
Chunk metadata. Each chunk should carry:
- Source document ID and URL.
- Section path (chapter > section > subsection).
- Page number (for citation).
- Headings above this chunk (for context).
- Document-level metadata (date, author, type).
This metadata enables filtering and citation in the final response.
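One way to carry this metadata is a small record type per chunk. The field names below are assumptions, not a standard schema; most vector stores accept an equivalent dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Chunk:
    text: str
    doc_id: str
    url: str
    # Headings above the chunk, outermost first (chapter > section > subsection)
    section_path: List[str]
    page: Optional[int] = None
    # Document-level metadata such as date, author, type
    doc_meta: Dict[str, str] = field(default_factory=dict)

    def citation(self) -> str:
        """Human-readable source reference for the final response."""
        where = " > ".join(self.section_path)
        page = f", p. {self.page}" if self.page is not None else ""
        return f"{where}{page} ({self.url})"
```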
Stage 3: Embedding
Convert chunks to vectors.
Embedding model choice.
In 2026:
- OpenAI text-embedding-3-large: strong all-rounder, expensive.
- Voyage AI voyage-3: competitive, often better on technical content.
- Cohere embed-v4: strong multilingual.
- BAAI bge-large: strong open-source.
- Nomic embed-text: good open-source, free to self-host.
- Domain-specialised models: for code (Voyage Code; OpenAI text-embedding-3-large also handles code), legal, medical, etc.
Picking a model: test against your domain. Don't assume the leaderboard winner is best for your data.
Embedding dimension. Models offer 256-3072 dimensions. Higher dimensions = better quality, higher cost/storage. For most use cases, 768-1536 is the sweet spot.
Versioning. Embedding models update; you may want to switch. Plan for this:
- Track which embedding model version produced each vector.
- Re-embed everything when migrating (or use a dual-index transition).
- Don't mix vectors from different models in the same search.
Cost. Embedding millions of chunks costs money. €0.10-€0.50 per million tokens (varies by provider). For large corpora, do the math upfront.
Storage. Vectors are large. 1M chunks × 1536 dimensions × 4 bytes = 6GB. Plan storage accordingly.
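The cost and storage arithmetic is easy to script before committing to a model. This is a back-of-the-envelope sketch: the function names are made up, the storage figure covers raw float32 vectors only (index overhead and metadata come on top), and the price is whatever your provider charges.

```python
def index_size_gb(n_chunks, dims, bytes_per_value=4):
    """Raw vector storage in GB (float32 by default), excluding index overhead."""
    return n_chunks * dims * bytes_per_value / 1e9


def embedding_cost(n_chunks, avg_tokens_per_chunk, price_per_million_tokens):
    """One-off cost of embedding the whole corpus, in the price's currency."""
    total_tokens = n_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_million_tokens


# The example from the text: 1M chunks at 1536 dimensions, about 6 GB
size = index_size_gb(1_000_000, 1536)
# 1M chunks of ~500 tokens at 0.10 per million tokens
cost = embedding_cost(1_000_000, 500, 0.10)
```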
Stage 4: Retrieval
The query comes in. You need to find relevant chunks.
Vector search (dense retrieval). Embed the query, find nearest neighbors. Captures semantic similarity. Standard.
Keyword search (BM25 / lexical). Traditional text search. Captures exact matches, rare terms, specific names. Often complements vector search.
Hybrid search. Run both, fuse results. Reciprocal Rank Fusion (RRF) is the standard fusion approach.
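    def reciprocal_rank_fusion(rankings, k=60):
        """Fuse several ranked lists of doc IDs into (doc_id, score) pairs."""
        scores = {}
        for ranking in rankings:
            # enumerate() is 0-based, so rank + 1 is the 1-based rank RRF uses
            for rank, doc_id in enumerate(ranking):
                scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
        # Highest fused score first
        return sorted(scores.items(), key=lambda x: -x[1])

In practice, hybrid beats vector-only or keyword-only on most benchmarks. Use it by default.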
Metadata filtering. Filter retrieval by metadata before scoring. "Only documents from 2025+". "Only documents in the user's accessible set." "Only documents of type policy."
Critical for multi-tenant systems: every query is scoped to the tenant's accessible documents.
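The key property is that filtering happens before scoring, so out-of-scope chunks can never be ranked. Below is a small in-memory sketch of that ordering; the chunk fields (`tenant`, `year`, `vector`) and function names are hypothetical, and a real vector DB would apply the filter inside the index query rather than in Python.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def filtered_search(query_vec, chunks, tenant, min_year=None, k=5):
    """Apply metadata filters first, then rank only the surviving chunks."""
    allowed = [
        c for c in chunks
        if c["tenant"] == tenant and (min_year is None or c["year"] >= min_year)
    ]
    ranked = sorted(allowed, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return [c["id"] for c in ranked[:k]]


# Hypothetical index contents: b1 is the closest match but belongs to another tenant
chunks = [
    {"id": "a1", "tenant": "acme", "year": 2025, "vector": [0.9, 0.1]},
    {"id": "a2", "tenant": "acme", "year": 2023, "vector": [0.8, 0.2]},
    {"id": "b1", "tenant": "globex", "year": 2025, "vector": [1.0, 0.0]},
]
```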
Query understanding. Pre-process the query:
- Spell correction. "engineerin" → "engineering."
- Acronym expansion. "MFA" → "Multi-Factor Authentication MFA."
- Synonyms / query expansion. Generate alternative phrasings; search with each.
- Decomposition. Multi-part questions split into sub-queries, each retrieved separately.
These pre-processing steps significantly improve retrieval on real-world queries.
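As one concrete case, acronym expansion can be a glossary lookup that keeps the original acronym next to its expansion, so both forms can match. The glossary contents and function names here are hypothetical; spell correction and decomposition typically need a dedicated library or an LLM call and are not shown.

```python
import re

# Hypothetical glossary; in practice this comes from your own domain vocabulary
GLOSSARY = {"MFA": "Multi-Factor Authentication", "SSO": "Single Sign-On"}


def expand_acronyms(query, glossary=GLOSSARY):
    """Insert the expansion before each known all-caps acronym."""
    def replace(match):
        acronym = match.group(0)
        expansion = glossary.get(acronym)
        return f"{expansion} {acronym}" if expansion else acronym

    return re.sub(r"\b[A-Z]{2,}\b", replace, query)


def query_variants(query):
    """Original query plus its expanded form, without duplicates."""
    variants = [query, expand_acronyms(query)]
    return list(dict.fromkeys(variants))
```

Each variant can be retrieved separately and the result lists fused, for example with reciprocal rank fusion.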
HyDE (Hypothetical Document Embeddings). Generate a fake "ideal answer" with an LLM. Embed that. Use the result as the retrieval query. Often works better than embedding the question directly because the fake answer is more similar to actual answer documents.
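    def hyde_retrieve(query, k=20):
        # llm, embed, and vector_search are placeholders for your own LLM client,
        # embedding model, and vector index.
        hypothetical = llm("Write a passage answering: " + query)
        # Embed the hypothetical passage instead of the raw question
        embedding = embed(hypothetical)
        return vector_search(embedding, k=k)

Trade-off: extra LLM call per query (latency, cost). Worth it for hard queries; overkill for simple ones.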
Multi-vector retrieval. Instead of one vector per chunk, multiple (e.g., one for the chunk, one for a summary, one for hypothetical questions). Adds storage and complexity but improves retrieval for some content.
Stage 5: Reranking
After initial retrieval (top-k, k=20-50), use a reranker to score the candidates more carefully.
Why rerank?
Initial retrieval is fast but imprecise. Vector similarity is a rough proxy for relevance. A reranker (typically a cross-encoder model) reads the query and each candidate together, producing a more accurate score.
Reranker options:
- Cohere Rerank. Industry standard. Good quality, hosted.
- Voyage Rerank. Strong competitor.
- BGE Rerank-large. Open-source, free to self-host.
- MiniLM cross-encoders. Smaller, faster, lower quality.
- LLM-based reranking. Use a small LLM to score query-candidate pairs. Highest quality, highest cost.
The impact.
In our experience, adding reranking improves end-task quality by 10-20% on most RAG systems. It's nearly always worth the latency and cost.
Typical setup:
- Retrieve 30-50 candidates with hybrid search.
- Rerank to top 5-10.
- Pass top results to LLM.
Adaptive reranking.
For queries where the top initial result has a high vector similarity (clear match), skip reranking. For queries with close scores (multiple candidates tied), rerank aggressively.
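A minimal sketch of that decision, assuming the initial retrieval returns similarity scores sorted in descending order. The function name and both thresholds (`high`, `margin`) are placeholders to calibrate on your own eval set; this version only skips reranking when the top hit is both strong and clearly ahead of the runner-up.

```python
def should_rerank(scores, high=0.85, margin=0.05):
    """Decide whether to rerank, given descending initial similarity scores."""
    if len(scores) < 2:
        # Nothing to reorder with zero or one candidate
        return False
    top, runner_up = scores[0], scores[1]
    clear_match = top >= high and (top - runner_up) >= margin
    return not clear_match
```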
Stage 6: Generation
The LLM produces the answer using retrieved context.
Prompt structure:
You are an assistant answering questions based on the provided context.
Context:
[Document 1 - title, URL]
[chunk 1 content]
[Document 2 - title, URL]
[chunk 2 content]
...
User question: {query}
Instructions:
- Answer based only on the provided context.
- If the context doesn't contain the answer, say so. Don't make up information.
- Cite sources using [Source N] notation.
- Be concise but complete.
The prompt patterns that matter:
- Source tagging. Each context chunk is tagged with its source for citation.
- Grounding instructions. Explicitly tell the model to use only the context.
- Citation requirement. Force citation. Catches hallucinations.
- Fallback instruction. "If the context doesn't have the answer, say so." Prevents confabulation.
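Assembling that prompt from retrieved chunks might look like the sketch below. The chunk keys (`title`, `url`, `text`) and the function name are assumptions; the instruction wording follows the prompt structure shown earlier, with sources numbered so that [Source N] citations resolve to a specific chunk.

```python
def build_prompt(query, chunks):
    """Build a grounded prompt with numbered, source-tagged context blocks."""
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        # Tag every chunk with its number, title and URL for citation
        blocks.append(f"[Source {i} - {chunk['title']}, {chunk['url']}]\n{chunk['text']}")
    context = "\n\n".join(blocks)

    return (
        "You are an assistant answering questions based on the provided context.\n\n"
        f"Context:\n{context}\n\n"
        f"User question: {query}\n\n"
        "Instructions:\n"
        "- Answer based only on the provided context.\n"
        "- If the context doesn't contain the answer, say so. Don't make up information.\n"
        "- Cite sources using [Source N] notation.\n"
        "- Be concise but complete."
    )
```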
Citation handling.
A common pattern: include source links in the response. UI renders them as clickable.
"The company's remote work policy allows up to 4 days/week from home [1].
[1]: https://wiki.company.com/policies/remote-work"
This makes the response verifiable. Users trust grounded answers more.
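Citations can also be checked mechanically before the response is shown. The sketch below extracts `[N]` or `[Source N]` markers and flags any that point outside the set of chunks actually provided; the function name and return shape are illustrative. It does not verify that the cited chunk supports the claim, only that the reference exists.

```python
import re


def check_citations(answer, n_sources):
    """Report which source numbers an answer cites and which are out of range."""
    cited = {int(n) for n in re.findall(r"\[(?:Source )?(\d+)\]", answer)}
    invalid = sorted(n for n in cited if not 1 <= n <= n_sources)
    return {"cited": sorted(cited), "invalid": invalid, "uncited": not cited}
```

An answer with `uncited` set or a non-empty `invalid` list is a reasonable candidate for regeneration or for a fallback message.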
Context size management.
Answer quality degrades as context grows. Most models work best with focused contexts (5-10 highly relevant chunks) rather than everything that was retrieved.
If you must include lots of context, summarize less-relevant items rather than including them in full.
Eval at every stage
You can't optimize what you don't measure. Each stage needs its own evals.
Ingestion eval. Did the documents parse correctly? Sample documents; check that key content is preserved.
Chunking eval. Are chunks at the right size? Do they preserve context? Do they break at sensible boundaries?
Embedding eval. Are similar concepts embedded similarly? On a test set of known-similar and known-different pairs, does cosine similarity match expectations?
Retrieval eval. Given a query, are the relevant chunks in the top-K? Common metric: recall@K, used here to mean the share of queries with at least one relevant chunk in the top K (also called hit rate@K).
Reranking eval. Given retrieved candidates, does the reranker put the most relevant first? Metric: NDCG (normalized discounted cumulative gain) or MRR (mean reciprocal rank).
Generation eval. Given context and query, does the LLM produce a correct answer? Metrics: faithfulness (does the answer use the context?), correctness (is the answer right?), helpfulness (does it address the user's intent?).
End-to-end eval. Given a real query, does the system produce the right answer? Most important; depends on all stages.
You need test sets for each. A common bootstrap:
- Build 50-100 queries with known-good answers.
- For each query, identify which document(s) contain the answer.
- Use this to evaluate retrieval (do we retrieve the right docs?) and generation (is the answer right?).
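With such a test set, the retrieval metrics are short to compute. This sketch assumes `results` maps each query to its ranked list of retrieved document IDs and `relevant` maps each query to the set of documents known to contain the answer; both names are hypothetical. The first function implements the "at least one relevant result in the top K" definition, and the second is mean reciprocal rank.

```python
def hit_rate_at_k(results, relevant, k):
    """Share of queries with at least one relevant document in the top k."""
    hits = sum(1 for q, ranked in results.items() if set(ranked[:k]) & relevant[q])
    return hits / len(results)


def mean_reciprocal_rank(results, relevant):
    """Average of 1/rank of the first relevant document (0 if none is retrieved)."""
    total = 0.0
    for q, ranked in results.items():
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant[q]:
                total += 1.0 / rank
                break
    return total / len(results)
```

Running both on every pipeline change gives a quick signal on whether a chunking or retrieval tweak helped before any generation is evaluated.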
Tools: Ragas, TruLens, custom suites. The exact tool matters less than running the evals.
Operational concerns
A few production realities:
Indexing pipelines. New documents arrive. Re-embed. Update indexes. Handle deletes and updates. This pipeline runs continuously, not just at setup. Build it as a system, not a one-time script.
Latency.
Typical breakdown:
- Embedding (query): 50-200ms.
- Vector search: 50-200ms.
- Reranking: 200-500ms (depending on K).
- Generation: 1-5s (depending on context and model).
- Total: 1.5-6s.
Optimization:
- Cache embeddings of frequent queries.
- Cache reranking for repeated query-candidate pairs.
- Stream generation.
- Parallelize where possible.
For sub-2s latency, you typically need a fast embedding model, a fast vector DB, and either a fast reranker or skip reranking for cached/simple queries.
Cost.
Per-query costs:
- Embedding: ~€0.0001.
- Vector search: ~€0.0001 (infrastructure-dependent).
- Reranking: €0.001-0.01 (model-dependent).
- Generation: €0.005-0.05 (model + context-dependent).
- Total: ~€0.01-0.05 per query.
At scale, this adds up. A million queries = €10K-50K.
Cost optimization:
- Cache.
- Use cheaper models where quality allows.
- Compress context (summaries vs full chunks).
Updates. Documents change. Patterns:
- Versioning: every document has a version; old versions are kept or deleted based on policy.
- Incremental: new versions are re-chunked and re-embedded; old chunks deleted.
- Diff-aware: only changed sections are re-processed.
For high-update scenarios, this matters significantly.
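The diff-aware pattern can be sketched with content hashes per section: store a hash for every section at indexing time, then re-process only the sections whose hash changed. The function names and the section-ID scheme are assumptions.

```python
import hashlib


def section_hashes(sections):
    """Map each section ID to a SHA-256 hash of its text."""
    return {
        sid: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for sid, text in sections.items()
    }


def plan_update(old_hashes, new_sections):
    """Return (sections to re-embed, sections to delete, new hash map)."""
    new_hashes = section_hashes(new_sections)
    # New or modified sections need re-chunking and re-embedding
    changed = sorted(s for s, h in new_hashes.items() if old_hashes.get(s) != h)
    # Sections that disappeared must have their chunks removed from the index
    deleted = sorted(s for s in old_hashes if s not in new_hashes)
    return changed, deleted, new_hashes
```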
Permissions. RAG over private data: documents have access controls; retrieval must respect them.
- Metadata-based filtering (each chunk has accessible-by metadata).
- Tenant-isolated indexes for hard isolation.
- Audit logging of who accessed what.
Don't trust the LLM to enforce permissions. Enforce at retrieval.
Common failure modes
A few patterns we see in failing RAG systems:
Failure 1: Bad chunking. Chunks split mid-thought, table broken across chunks, headers separated from their content. Fix: better chunking strategy.
Failure 2: Retrieval missing the answer. The relevant chunk exists but isn't retrieved. Often a query/document mismatch. Fix: HyDE, query expansion, better embeddings.
Failure 3: Right chunks, wrong order. Relevant chunk is retrieved but ranked low. LLM uses higher-ranked irrelevant chunks. Fix: reranking.
Failure 4: Right chunks, hallucination. Chunks are right but LLM makes up additional info. Fix: stricter grounding prompt, citation requirements, lower temperature.
Failure 5: Right chunks, wrong answer. Chunks contain the answer but LLM extracts the wrong part. Fix: better generation prompt, possibly larger model.
Failure 6: Stale data. Index hasn't been updated. Fix: continuous ingestion pipeline.
Failure 7: Permission leaks. Users see chunks they shouldn't. Fix: enforce filtering at retrieval; audit.
Failure 8: Cost and latency spirals. Latency and cost grow as corpus and traffic grow. Fix: caching, model routing, possibly architecture changes.
A worked example: company knowledge base RAG
To make it concrete: a typical company-knowledge-base RAG system.
Corpus: ~50,000 documents (wiki pages, policies, runbooks, meeting notes, slack threads).
Pipeline:
- Ingestion:
  - Notion: API export.
  - Slack: archive export, filtered to relevant channels.
  - Google Drive: API export of approved folders.
  - Cleaning: remove emojis-only messages, boilerplate signatures.
- Chunking:
  - Hierarchical: paragraph-level small chunks, section-level parent chunks.
  - ~250K total chunks at small level.
  - Metadata: source, section path, last_updated, accessible_to.
- Embedding:
  - Voyage-3 embeddings, 1024 dimensions.
  - Stored in Pinecone (managed).
  - Re-embedded weekly for changed documents.
- Retrieval:
  - Hybrid: vector search + BM25 (via Pinecone's hybrid).
  - HyDE for the query.
  - Metadata filter for accessibility.
  - Top 30 retrieved.
- Reranking:
  - Cohere Rerank.
  - Top 30 → top 8.
- Generation:
  - Claude 4 Sonnet.
  - Top 8 parent chunks (not small chunks) provided as context.
  - Citation required in response.
Eval:
- 100 hand-crafted query/answer pairs.
- Retrieval recall@30: ~92%.
- Final answer correctness: ~85%.
- Faithfulness (no hallucination): ~95%.
Operations:
- Daily incremental ingestion.
- Weekly full re-embedding for changed docs.
- Per-query observability.
- Monthly eval re-run; trend monitoring.
- Quarterly query log review to spot new failure patterns.
Costs:
- Embedding: ~€500/month (steady-state).
- Vector DB hosting: ~€800/month (Pinecone).
- Retrieval + generation per query: ~€0.02.
- Total: ~€2,500/month + €0.02 × query volume.
For most companies, this is fully justified by the productivity gains.
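The total above is a fixed monthly amount plus a per-query term, so projecting it for a given traffic level is one line. The function name is made up; the defaults are the figures from this example and the query volumes are hypothetical.

```python
def monthly_cost(queries_per_month, fixed=2500.0, per_query=0.02):
    """Fixed monthly cost plus per-query cost, in euros."""
    return fixed + per_query * queries_per_month


# Hypothetical traffic levels
light = monthly_cost(10_000)
heavy = monthly_cost(100_000)
```

At 100,000 queries a month the per-query term is already close to the fixed cost, which is where caching and cheaper models start to pay off.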
A 90-day RAG buildout plan
If you're starting from zero:
Days 1-30: MVP.
- Identify corpus and primary use case.
- Build basic ingestion (one source).
- Basic chunking (fixed-size with overlap).
- Standard embedding model.
- Vector search only (no rerank).
- Simple generation prompt.
Days 31-60: Quality.
- Hand-craft an eval set (50-100 queries).
- Identify failure modes.
- Add reranking.
- Add hybrid search.
- Improve chunking.
Days 61-90: Operations.
- Continuous ingestion pipeline.
- Observability.
- Eval automation.
- Permission/auth.
- Cost monitoring.
After 90 days, you have a system that's not just demoable, but actually usable. Real users can rely on it.
The takeaway
Production RAG is a six-stage pipeline with quality concerns at every stage. The patterns that distinguish systems that work from systems that disappoint:
- Ingestion: thorough parsing, structure preservation, metadata capture.
- Chunking: appropriate strategy for content type, often hierarchical.
- Embedding: quality model, versioning, planning for migration.
- Retrieval: hybrid by default, query understanding, metadata filtering.
- Reranking: virtually always worth it.
- Generation: grounding, citations, fallback for unknowns.
- Eval: at every stage, continuously.
These aren't optional polish. They're the difference between RAG that hits 60% answer quality and RAG that hits 90%.
The investment is real — a production RAG buildout is months, not weeks. The payoff is a system your users can actually trust. That's the threshold that matters.