Secure document ingestion for RAG: PDFs, OCR, metadata, and retention
RAG quality starts before retrieval. A secure ingestion guide for PDFs, OCR, metadata, permissions, source freshness, deletion, malware risk, and operational ownership.
Outcome: Design a secure document-ingestion pipeline for RAG with permission metadata, OCR quality checks, source freshness, retention rules, deletion behavior, and ingestion tests.
Most RAG failures start before retrieval. The document was stale. The OCR missed a table. The source had no owner. The permission metadata was lost. A deleted contract remained in the vector store. A scanned PDF had hidden text that nobody reviewed. The system answered confidently because the ingestion pipeline treated "text exists" as "knowledge is safe to use."
Secure document ingestion is source control for company knowledge. It decides what enters the retrieval system, who may see it, how freshness is tracked, how deletion works, and how bad inputs are caught.
This article covers the ingestion layer: PDFs, OCR, metadata, permissions, retention, and operational checks.
If a user should not access a document in the source system, they should not access its chunks, embeddings, summaries, or cached answers in the RAG system. Permission metadata is not optional.
The ingestion pipeline
A production pipeline should have explicit stages:
- Source registration.
- Data classification.
- File safety checks.
- Text extraction and OCR.
- Structure preservation.
- Metadata attachment.
- Permission mapping.
- Chunking and embedding.
- Quality checks.
- Index publication.
- Retention and deletion handling.
The exact tools can vary. The control points should not.
Stage 1: source registration
Do not ingest a folder just because it is easy to connect.
For each source, record:
- source name,
- system of record,
- source owner,
- data owner,
- allowed users or roles,
- document types,
- sensitivity level,
- retention rule,
- update frequency,
- deletion behavior,
- review schedule.
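One way to make this checklist enforceable is to treat it as a typed record and refuse any source whose record is incomplete. The sketch below is a minimal illustration; the class name, field names, and sample values are assumptions, not part of any particular tool.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class SourceRegistration:
    """One record per source, mirroring the checklist above."""
    source_name: str
    system_of_record: str
    source_owner: str
    data_owner: str
    allowed_roles: tuple
    document_types: tuple
    sensitivity: str
    retention_rule: str
    update_frequency: str
    deletion_behavior: str
    review_schedule: str


def missing_fields(reg: SourceRegistration) -> list:
    """Return names of empty fields so an incomplete source can be refused."""
    return [f.name for f in fields(reg) if not getattr(reg, f.name)]


# Illustrative values only; none of these come from a real system.
playbook = SourceRegistration(
    source_name="internal support playbook",
    system_of_record="wiki",
    source_owner="Support Operations",
    data_owner="Head of Support",
    allowed_roles=("support",),
    document_types=("html",),
    sensitivity="internal",
    retention_rule="keep while published",
    update_frequency="weekly",
    deletion_behavior="remove chunks when the page is deleted",
    review_schedule="quarterly",
)

print(missing_fields(playbook))                            # []
print(missing_fields(replace(playbook, source_owner="")))  # ['source_owner']
```

A connector can call `missing_fields` before its first sync and stop if the list is not empty.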
Example sources:
- public help center,
- internal support playbook,
- sales collateral,
- customer contracts,
- HR policies,
- engineering runbooks,
- product documentation,
- meeting transcripts.
These sources should not all land in the same index with the same permissions.
Stage 2: classify data before extraction
Classify the source before the model or embedding provider sees the content.
Useful classes:
| Class | Example | Default posture |
| --- | --- | --- |
| Public | published docs, marketing pages | Allowed for broad retrieval |
| Internal | playbooks, process docs | Company-only, role filtered |
| Confidential | contracts, customer details, finance | Restricted roles, stronger logging |
| Regulated/sensitive | health, legal, HR, payroll, security incidents | Avoid unless explicitly approved |
Classification is not just compliance paperwork. It decides whether content can be sent to a hosted embedding API, stored in a shared vector database, included in logs, or used for eval examples.
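Those downstream decisions can be encoded as a small deny-by-default lookup that every pipeline stage consults. This is a sketch only: the action names and the True/False values are hypothetical placeholders, and the real values should come from your own security and legal review.

```python
from enum import Enum


class DataClass(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"
    REGULATED = "regulated"


# Hypothetical policy table. The True/False values are placeholders to be
# replaced by whatever your reviewers actually approve.
HANDLING = {
    DataClass.PUBLIC: {
        "hosted_embedding_api": True, "shared_vector_db": True,
        "content_in_logs": True, "eval_examples": True,
    },
    DataClass.INTERNAL: {
        "hosted_embedding_api": True, "shared_vector_db": True,
        "content_in_logs": False, "eval_examples": True,
    },
    DataClass.CONFIDENTIAL: {
        "hosted_embedding_api": False, "shared_vector_db": False,
        "content_in_logs": False, "eval_examples": False,
    },
    DataClass.REGULATED: {},  # nothing is allowed without explicit approval
}


def is_allowed(data_class: DataClass, action: str) -> bool:
    """Deny by default: an unknown class or action is never allowed."""
    return HANDLING.get(data_class, {}).get(action, False)
```

The useful property is the default: a new action or a new class is blocked until someone adds it to the table on purpose.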
Stage 3: file safety checks
Documents can be hostile or simply broken.
Before extraction:
- check file type against an allowlist,
- enforce file size limits,
- scan for malware where your environment requires it,
- reject encrypted files unless there is an approved decrypt path,
- reject files with unsupported embedded objects,
- normalize filenames,
- store original file hash,
- record who uploaded or connected the file.
This is especially important if non-admin users can upload documents. Admin-only ingestion lowers risk, but it does not remove it.
Antivirus and content-disarm controls are not automatic in most RAG stacks. If untrusted users can upload files, add a real file-safety layer before parsing.
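The cheaper checks in the list above can run before any parser touches the file. The following sketch uses only the Python standard library; the allowlist, the size limit, and the function names are assumptions, and it does not replace malware scanning or content disarm.

```python
import hashlib
import re
import unicodedata
from pathlib import PurePosixPath

ALLOWED_EXTENSIONS = {".pdf", ".docx", ".txt", ".md"}  # assumed allowlist
MAX_BYTES = 50 * 1024 * 1024                           # assumed size limit


def normalize_filename(name: str) -> str:
    """Drop directory parts, normalize Unicode, replace unusual characters."""
    base = PurePosixPath(name.replace("\\", "/")).name
    base = unicodedata.normalize("NFKC", base)
    return re.sub(r"[^A-Za-z0-9._-]+", "_", base).strip("._") or "unnamed"


def precheck(name: str, data: bytes, uploaded_by: str) -> dict:
    """Run basic checks and return a record to store next to the file."""
    safe_name = normalize_filename(name)
    ext = PurePosixPath(safe_name).suffix.lower()
    reasons = []
    if ext not in ALLOWED_EXTENSIONS:
        reasons.append(f"extension not allowed: {ext or '(none)'}")
    if len(data) > MAX_BYTES:
        reasons.append("file too large")
    # Basic magic-byte check only; it says nothing about what is inside the PDF.
    if ext == ".pdf" and not data.startswith(b"%PDF-"):
        reasons.append("content does not start with a PDF header")
    return {
        "filename": safe_name,
        "sha256": hashlib.sha256(data).hexdigest(),  # hash of the original bytes
        "uploadedBy": uploaded_by,
        "accepted": not reasons,
        "reasons": reasons,
    }
```

Rejected files still get a hash and an uploader recorded, which helps when the same file is submitted again.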
Stage 4: text extraction and OCR
PDFs are not one format in practice. Some contain selectable text. Some are scans. Some have columns, tables, footnotes, forms, comments, stamps, or hidden text layers.
Track extraction quality:
- extraction method,
- OCR confidence,
- page count,
- extracted character count,
- table extraction status,
- language detected,
- pages with no text,
- parser warnings.
Low-quality extraction should not quietly enter the index. Route it to review or mark it as low confidence.
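A routing rule of this kind might look like the sketch below. The report fields follow the tracking list above; the thresholds are assumed values for illustration and should be tuned against documents you have reviewed by hand.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed thresholds, not recommendations.
MIN_OCR_CONFIDENCE = 0.85
MIN_CHARS_PER_PAGE = 200
MAX_EMPTY_PAGE_RATIO = 0.10


@dataclass
class ExtractionReport:
    method: str                             # e.g. "text-layer" or "ocr"
    page_count: int
    char_count: int
    pages_without_text: int
    ocr_confidence: Optional[float] = None  # mean confidence in [0, 1] if OCR ran
    parser_warnings: List[str] = field(default_factory=list)


def review_reasons(r: ExtractionReport) -> List[str]:
    """Return the reasons a document should go to review; empty means none."""
    if r.page_count <= 0:
        return ["document has no pages"]
    reasons = []
    if r.ocr_confidence is not None and r.ocr_confidence < MIN_OCR_CONFIDENCE:
        reasons.append("low OCR confidence")
    if r.char_count / r.page_count < MIN_CHARS_PER_PAGE:
        reasons.append("very little text per page")
    if r.pages_without_text / r.page_count > MAX_EMPTY_PAGE_RATIO:
        reasons.append("too many pages with no text")
    if r.parser_warnings:
        reasons.append("parser warnings present")
    return reasons


def route(r: ExtractionReport) -> str:
    """Send clean extractions to the index and everything else to review."""
    return "review" if review_reasons(r) else "index"
```

Keeping the reasons as a list, rather than a single pass/fail flag, gives the reviewer a starting point.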
Common problems:
- columns read in the wrong order,
- table rows merged incorrectly,
- headers repeated in every chunk,
- scanned pages missing entirely,
- handwritten notes ignored,
- hidden text layer contradicting the visible scan,
- OCR converting account numbers incorrectly.
For high-value documents, spot-check the rendered page against extracted text.
Stage 5: preserve structure
RAG systems need more than text. They need enough structure to produce useful, source-grounded answers.
Preserve:
- title,
- heading path,
- section number,
- page number,
- table captions,
- list boundaries,
- document version,
- effective date,
- source URL or storage path.
Chunk text with headings and page references. A chunk that says "The following applies" without the preceding heading is weak evidence.
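A minimal sketch of that idea: prefix each chunk with its title, heading path, and page before it is embedded or shown as evidence. The bracketed format and the function name are assumptions; any consistent format works.

```python
def render_chunk(title: str, heading_path: list, page: int, body: str) -> str:
    """Prefix chunk text with its location so it can stand alone as evidence."""
    breadcrumb = " > ".join([title, *heading_path])
    return f"[{breadcrumb} | p. {page}]\n{body.strip()}"


print(render_chunk(
    "Expense Policy",
    ["Travel", "Hotel limits"],
    7,
    "The following applies to all bookings.",
))
# [Expense Policy > Travel > Hotel limits | p. 7]
# The following applies to all bookings.
```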
For tables, decide whether to:
- keep the table as Markdown,
- convert it to structured JSON,
- store both text and structured rows,
- exclude it until a better parser is available.
Do not pretend table extraction is solved if your use case depends on exact prices, dates, limits, or thresholds.
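If you choose to store both text and structured rows, a strict converter can at least refuse rows whose cell count does not match the header, which is one visible symptom of merged rows. The sketch below assumes the parser already yields a header and a list of rows; the sample values are made up.

```python
def table_to_records(header: list, rows: list) -> list:
    """Convert rows to dicts, failing loudly on rows that do not fit the header."""
    for i, row in enumerate(rows, start=1):
        if len(row) != len(header):
            raise ValueError(f"row {i} has {len(row)} cells, expected {len(header)}")
    return [dict(zip(header, row)) for row in rows]


def table_to_markdown(header: list, rows: list) -> str:
    """Render the same table as Markdown text for embedding and display."""
    def line(cells):
        return "| " + " | ".join(cells) + " |"
    return "\n".join([line(header), line(["---"] * len(header)), *map(line, rows)])


# Made-up example values.
header = ["City tier", "Nightly limit"]
rows = [["Tier 1", "250"], ["Tier 2", "180"]]
records = table_to_records(header, rows)
markdown = table_to_markdown(header, rows)
```

A row count check will not catch every extraction error, so exact figures still deserve a spot-check against the rendered page.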
Stage 6: attach metadata
Every chunk should carry metadata that can survive retrieval:
```json
{
  "sourceId": "policy-2026-expenses",
  "documentId": "doc_123",
  "tenantId": "tenant_a",
  "visibility": "internal",
  "allowedRoles": ["finance", "leadership"],
  "sensitivity": "confidential",
  "sourceOwner": "Finance",
  "version": "2026-02",
  "lastReviewedAt": "2026-02-10",
  "effectiveFrom": "2026-03-01",
  "page": 7,
  "headingPath": ["Travel", "Hotel limits"],
  "contentHash": "sha256:..."
}
```

Metadata is how the application enforces policy after retrieval. Without it, the model receives text detached from the rules that make it safe to use.
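A small validator can stop chunks with incomplete metadata from reaching the index. The sketch below reuses the key names from the example; which keys count as required, and the rule about empty role lists, are assumptions to adapt.

```python
# Assumed set of required keys, taken from the example metadata.
REQUIRED_KEYS = {
    "sourceId", "documentId", "tenantId", "visibility", "allowedRoles",
    "sensitivity", "sourceOwner", "version", "contentHash",
}


def metadata_problems(meta: dict) -> list:
    """Return a list of problems; an empty list means the chunk may be indexed."""
    problems = [f"missing: {key}" for key in sorted(REQUIRED_KEYS - set(meta))]
    if ("allowedRoles" in meta and not meta["allowedRoles"]
            and meta.get("visibility") != "public"):
        problems.append("allowedRoles is empty for non-public content")
    if "contentHash" in meta and not str(meta["contentHash"]).startswith("sha256:"):
        problems.append("contentHash is not a sha256 digest")
    return problems
```

Running this at write time is cheaper than discovering at query time that a chunk cannot be filtered.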
Stage 7: permission mapping
Permission mapping must happen before retrieval results reach the model.
Good pattern:
- User asks a question.
- Application derives tenant, user, roles, groups, and data permissions from auth.
- Retriever filters candidate chunks by permissions.
- Ranking happens within allowed chunks.
- Model receives only allowed chunks.
Bad pattern:
- Retrieve broadly.
- Send all likely chunks to the model.
- Prompt says "only answer using chunks the user may access."
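The bad pattern has already exposed restricted data to the model context by the time the prompt instruction applies, and a prompt instruction is not an access control.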
If source permissions are complex, start narrower. It is better to miss an answer than leak a confidential document.
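The good pattern can be sketched in a few lines: filter by tenant and role first, then rank only what remains. The `Chunk` and `Principal` types and the `score` field are hypothetical stand-ins for whatever your retriever and auth layer provide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    tenant_id: str
    allowed_roles: frozenset
    score: float  # similarity score from the retriever (hypothetical)


@dataclass(frozen=True)
class Principal:
    tenant_id: str
    roles: frozenset  # derived from auth, never from the prompt


def allowed(chunk: Chunk, user: Principal) -> bool:
    """A chunk is visible only within the same tenant and with a shared role."""
    return chunk.tenant_id == user.tenant_id and bool(chunk.allowed_roles & user.roles)


def retrieve_for_model(candidates: list, user: Principal, k: int = 3) -> list:
    """Filter by permission first, then rank within the allowed chunks."""
    permitted = [c for c in candidates if allowed(c, user)]
    return sorted(permitted, key=lambda c: c.score, reverse=True)[:k]
```

In a real deployment the same filter is better applied inside the vector store query, where the store supports metadata filters, so that restricted chunks never leave the store for an unauthorized request.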
Stage 8: chunking and embedding
Chunking is a security and quality decision, not only a search tuning decision.
Guidelines:
- Keep chunks inside permission boundaries.
- Do not merge public and confidential text into one chunk.
- Include headings and source references.
- Avoid giant chunks that contain unrelated sections.
- Avoid tiny chunks that lose context.
- Re-embed when source text or metadata changes.
- Store embedding model and version.
For sensitive sources, confirm whether your embedding provider, vector store, and logs are approved for that data class.
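As a rough illustration of the first two guidelines, the sketch below merges consecutive sections into a chunk only when they share the same visibility and roles, and stamps each chunk with the embedding model name. The section dictionary shape, the size limit, and the model name in the usage example are assumptions.

```python
def chunk_sections(sections: list, embedding_model: str, max_chars: int = 1200) -> list:
    """Group consecutive sections without crossing a permission boundary."""
    chunks = []
    for s in sections:
        boundary = (s["visibility"], tuple(sorted(s["allowedRoles"])))
        last = chunks[-1] if chunks else None
        fits = last is not None and len(last["text"]) + 2 + len(s["text"]) <= max_chars
        if fits and last["boundary"] == boundary:
            last["text"] += "\n\n" + s["text"]
        else:
            chunks.append({
                "boundary": boundary,
                "text": s["text"],
                "visibility": s["visibility"],
                "allowedRoles": sorted(s["allowedRoles"]),
                "embeddingModel": embedding_model,  # recorded for later re-embedding
            })
    for c in chunks:
        del c["boundary"]  # internal bookkeeping, not stored
    return chunks


sections = [
    {"text": "Intro", "visibility": "public", "allowedRoles": []},
    {"text": "Overview", "visibility": "public", "allowedRoles": []},
    {"text": "Pricing terms", "visibility": "confidential", "allowedRoles": ["finance"]},
]
chunks = chunk_sections(sections, embedding_model="example-model-v1")
```

This sketch does not split a single oversized section; a production chunker would also need that step, applied inside each boundary.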
Stage 9: quality gates
Before publishing a source into production retrieval, run checks:
- all documents have owners,
- all chunks have permission metadata,
- stale documents are flagged,
- pages with failed extraction are excluded or reviewed,
- sample questions retrieve expected sources,
- unauthorized users retrieve zero restricted chunks,
- deleted documents disappear from search,
- citations point to valid source locations,
- suspicious instructions in documents are isolated as content, not followed.
The last point matters. Documents can contain prompt injection. The ingestion pipeline should not remove all such text, because sometimes users need to know what a document says. But the runtime must treat it as untrusted document content.
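Some of these gates are easy to automate. The sketch below checks ownership and permission metadata, then replays probe queries as a user with no access and counts any restricted chunks that come back. The `search_as_outsider` callable is a hypothetical hook into your own retriever.

```python
from typing import Callable, Iterable


def publication_blockers(
    chunks: list,
    search_as_outsider: Callable[[str], list],
    probe_queries: Iterable[str],
) -> list:
    """Return reasons the source must not be published; empty means it passed."""
    blockers = []
    for c in chunks:
        doc = c.get("documentId", "<unknown>")
        if not c.get("sourceOwner"):
            blockers.append(f"{doc}: no owner")
        if c.get("visibility") != "public" and not c.get("allowedRoles"):
            blockers.append(f"{doc}: no permission metadata")
    for query in probe_queries:
        # An outsider should only ever see public chunks.
        leaked = [c for c in search_as_outsider(query) if c.get("visibility") != "public"]
        if leaked:
            blockers.append(f"probe {query!r} leaked {len(leaked)} restricted chunk(s)")
    return blockers
```

Passing these checks does not prove the source is safe; it only catches the failures that are cheap to detect before publication.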
Stage 10: retention and deletion
RAG systems often accidentally keep data longer than the source system.
Deletion must cover:
- original file cache,
- extracted text,
- chunks,
- embeddings,
- summaries,
- thumbnails or rendered pages,
- eval samples,
- logs, where deletion is legally required,
- backups according to policy.
When a document is deleted or access is revoked, retrieval should stop returning its chunks. Ideally the system should support hard deletion for sensitive sources and documented retention for backups.
Track:
- deletedAt,
- deletedBy or source event,
- deletion reason,
- downstream cleanup status,
- verification result.
Do not rely on "we removed it from the UI." Vector stores and caches are easy to forget.
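A deletion routine can walk every derived store and record the tracking fields listed above. The sketch below uses in-memory dictionaries as stand-ins for real stores; the store names and the shape of the returned record are assumptions. It fails closed: a store that is not wired up counts as unverified.

```python
from datetime import datetime, timezone

# Assumed store names, following the deletion list above.
DERIVED_STORES = (
    "file_cache", "extracted_text", "chunks", "embeddings",
    "summaries", "rendered_pages", "eval_samples",
)


def delete_document(doc_id: str, stores: dict, deleted_by: str, reason: str) -> dict:
    """Remove a document from every derived store and verify the result."""
    cleanup = {}
    for name in DERIVED_STORES:
        store = stores.get(name)
        if store is None:
            cleanup[name] = "store not configured"
            continue
        store.pop(doc_id, None)
        cleanup[name] = "deleted"
    # Verification re-checks every store instead of trusting the loop above.
    verified = all(
        name in stores and doc_id not in stores[name] for name in DERIVED_STORES
    )
    return {
        "documentId": doc_id,
        "deletedAt": datetime.now(timezone.utc).isoformat(),
        "deletedBy": deleted_by,
        "deletionReason": reason,
        "cleanupStatus": cleanup,
        "verified": verified,
    }
```

Logs and backups are left out on purpose, because they usually follow a separate retention policy rather than immediate removal.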
Stage 11: operational ownership
Every source needs an owner. Every owner needs a review cadence.
For each source, define:
- who approves ingestion,
- who approves permission changes,
- who reviews stale documents,
- who handles extraction failures,
- who responds to data deletion requests,
- who investigates retrieval mistakes.
If nobody owns a source, it should not be in a production RAG system.
The takeaway
Secure RAG ingestion is boring in the best way. It makes retrieval predictable.
The core controls:
- register sources,
- classify data,
- check files before parsing,
- measure extraction quality,
- preserve structure,
- attach metadata,
- enforce permissions before retrieval,
- test unauthorized access,
- support deletion,
- assign owners.
Good answers come from good sources. Safe answers come from good source controls.