Secure document ingestion for RAG: PDFs, OCR, metadata, and retention

RAG quality starts before retrieval. A secure ingestion guide for PDFs, OCR, metadata, permissions, source freshness, deletion, malware risk, and operational ownership.

What you should be able to do

Design a secure document-ingestion pipeline for RAG with permission metadata, OCR quality checks, source freshness, retention rules, deletion behavior, and ingestion tests.

May 17, 2026

Most RAG failures start before retrieval. The document was stale. The OCR missed a table. The source had no owner. The permission metadata was lost. A deleted contract remained in the vector store. A scanned PDF had hidden text that nobody reviewed. The system answered confidently because the ingestion pipeline treated "text exists" as "knowledge is safe to use."

Secure document ingestion is source control for company knowledge. It decides what enters the retrieval system, who may see it, how freshness is tracked, how deletion works, and how bad inputs are caught.

This article covers the ingestion layer: PDFs, OCR, metadata, permissions, retention, and operational checks.

If a user should not access a document in the source system, they should not access its chunks, embeddings, summaries, or cached answers in the RAG system. Permission metadata is not optional.

The ingestion pipeline

A production pipeline should have explicit stages:

Source registration.
Data classification.
File safety checks.
Text extraction and OCR.
Structure preservation.
Metadata attachment.
Permission mapping.
Chunking and embedding.
Quality checks.
Index publication.
Retention and deletion handling.

The exact tools can vary. The control points should not.

Stage 1: source registration

Do not ingest a folder just because it is easy to connect.

For each source, record:

source name,
system of record,
source owner,
data owner,
allowed users or roles,
document types,
sensitivity level,
retention rule,
update frequency,
deletion behavior,
review schedule.
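A registry entry like this can be kept as a small typed record so that incomplete registrations are refused before any connector runs. The following is a minimal sketch, not a prescribed schema: the class name, field names, and example values are all illustrative assumptions.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SourceRegistration:
    """One registry entry per connected source (field names are illustrative)."""
    source_name: str
    system_of_record: str
    source_owner: str
    data_owner: str
    allowed_roles: tuple[str, ...]
    document_types: tuple[str, ...]
    sensitivity: str
    retention_rule: str
    update_frequency: str
    deletion_behavior: str
    review_schedule: str


def missing_fields(reg: SourceRegistration) -> list[str]:
    """Return the names of empty fields so registration can be refused."""
    return [f.name for f in fields(reg) if not getattr(reg, f.name)]


# Hypothetical example values; data_owner is left blank to show the check.
playbook = SourceRegistration(
    source_name="internal support playbook",
    system_of_record="wiki",
    source_owner="Support Ops",
    data_owner="",
    allowed_roles=("support",),
    document_types=("pdf", "html"),
    sensitivity="internal",
    retention_rule="follow source system",
    update_frequency="weekly",
    deletion_behavior="hard delete chunks and embeddings",
    review_schedule="quarterly",
)
print(missing_fields(playbook))  # ['data_owner']
```

A connector could call `missing_fields` at setup time and decline to sync until the list is empty.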

Example sources:

public help center,
internal support playbook,
sales collateral,
customer contracts,
HR policies,
engineering runbooks,
product documentation,
meeting transcripts.

These sources should not all land in the same index with the same permissions.

Stage 2: classify data before extraction

Classify the source before the model or embedding provider sees the content.

Useful classes:

Class	Example	Default posture
Public	published docs, marketing pages	Allowed for broad retrieval
Internal	playbooks, process docs	Company-only, role filtered
Confidential	contracts, customer details, finance	Restricted roles, stronger logging
Regulated/sensitive	health, legal, HR, payroll, security incidents	Avoid unless explicitly approved

Classification is not just compliance paperwork. It decides whether content can be sent to a hosted embedding API, stored in a shared vector database, included in logs, or used for eval examples.
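One way to make classification actionable is a deny-by-default lookup that every downstream step consults. The sketch below assumes four classes matching the table above; the use names and the True/False values are placeholders, not recommendations.

```python
from enum import Enum


class DataClass(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"
    REGULATED = "regulated"


# Hypothetical policy table. The values are placeholders; replace them
# with your organization's actual decisions.
POLICY = {
    DataClass.PUBLIC: {
        "hosted_embedding_api": True,
        "shared_vector_db": True,
        "content_in_logs": True,
        "eval_examples": True,
    },
    DataClass.INTERNAL: {
        "hosted_embedding_api": True,
        "shared_vector_db": True,
        "content_in_logs": False,
        "eval_examples": False,
    },
    DataClass.CONFIDENTIAL: {
        "hosted_embedding_api": False,
        "shared_vector_db": False,
        "content_in_logs": False,
        "eval_examples": False,
    },
    # Nothing is allowed for regulated data without explicit approval.
    DataClass.REGULATED: {},
}


def is_allowed(data_class: DataClass, use: str) -> bool:
    """Deny by default: unknown classes or uses are never allowed."""
    return POLICY.get(data_class, {}).get(use, False)
```

The design choice worth keeping is the default: a use that nobody has decided on is treated as forbidden rather than permitted.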

Stage 3: file safety checks

Documents can be hostile or simply broken.

Before extraction:

check file type against an allowlist,
enforce file size limits,
scan for malware where your environment requires it,
reject encrypted files unless there is an approved decrypt path,
reject files with unsupported embedded objects,
normalize filenames,
store original file hash,
record who uploaded or connected the file.

This is especially important if non-admin users can upload documents. Admin-only ingestion lowers risk, but it does not remove it.

Antivirus and content-disarm controls are not automatic in most RAG stacks. If untrusted users can upload files, add a real file-safety layer before parsing.
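The cheap pre-extraction checks can be sketched with the standard library alone. This example covers only the allowlist, size limit, filename normalization, hash, and uploader record; it does not scan for malware, detect encryption, or inspect embedded objects. The limit, the allowlist contents, and the function names are assumptions.

```python
import hashlib
import re
import unicodedata
from pathlib import PurePosixPath

# Assumed limit and allowlist; tune these for your environment.
MAX_BYTES = 25 * 1024 * 1024
ALLOWED = {".pdf": b"%PDF-"}  # extension -> expected leading bytes


def normalize_filename(name: str) -> str:
    """Drop directory parts, fold Unicode, and replace unusual characters."""
    base = PurePosixPath(name.replace("\\", "/")).name
    base = unicodedata.normalize("NFKC", base)
    return re.sub(r"[^A-Za-z0-9._-]", "_", base)


def precheck(name: str, data: bytes, uploaded_by: str) -> dict:
    """Run cheap checks before any parser touches the bytes."""
    safe_name = normalize_filename(name)
    ext = PurePosixPath(safe_name).suffix.lower()
    reasons = []
    if ext not in ALLOWED:
        reasons.append("file type not on allowlist")
    elif not data.startswith(ALLOWED[ext]):
        reasons.append("content does not match extension")
    if len(data) > MAX_BYTES:
        reasons.append("file too large")
    return {
        "filename": safe_name,
        "sha256": hashlib.sha256(data).hexdigest(),  # hash of the original bytes
        "uploadedBy": uploaded_by,
        "accepted": not reasons,
        "reasons": reasons,
    }
```

A rejected file still gets a hash and an uploader record, which helps when the same file is retried or investigated later.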

Stage 4: text extraction and OCR

PDFs are not one format in practice. Some contain selectable text. Some are scans. Some have columns, tables, footnotes, forms, comments, stamps, or hidden text layers.

Track extraction quality:

extraction method,
OCR confidence,
page count,
extracted character count,
table extraction status,
language detected,
pages with no text,
parser warnings.

Low-quality extraction should not quietly enter the index. Route it to review or mark it as low confidence.
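A routing rule for extraction results might look like the sketch below. The report fields mirror the tracking list above, but the thresholds are assumed values that would need calibration against your own documents and OCR engine.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractionReport:
    method: str                      # e.g. "text-layer" or "ocr"
    page_count: int
    char_count: int
    empty_pages: int
    ocr_confidence: Optional[float]  # 0..1, None when OCR was not used
    parser_warnings: int


# Assumed thresholds, for illustration only.
MIN_OCR_CONFIDENCE = 0.85
MIN_CHARS_PER_PAGE = 200
MAX_EMPTY_PAGE_RATIO = 0.1


def route(report: ExtractionReport) -> tuple[str, list[str]]:
    """Return ("index" | "review" | "reject", reasons)."""
    if report.page_count == 0:
        return "reject", ["document has no pages"]
    reasons = []
    if report.ocr_confidence is not None and report.ocr_confidence < MIN_OCR_CONFIDENCE:
        reasons.append("low OCR confidence")
    if report.char_count / report.page_count < MIN_CHARS_PER_PAGE:
        reasons.append("very little text per page")
    if report.empty_pages / report.page_count > MAX_EMPTY_PAGE_RATIO:
        reasons.append("too many pages with no text")
    if report.parser_warnings > 0:
        reasons.append("parser reported warnings")
    return ("review" if reasons else "index"), reasons
```

Anything routed to "review" stays out of the index until a person or a second parser clears it.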

Common problems:

columns read in the wrong order,
table rows merged incorrectly,
headers repeated in every chunk,
scanned pages missing entirely,
handwritten notes ignored,
hidden text layer contradicting the visible scan,
OCR converting account numbers incorrectly.

For high-value documents, spot-check the rendered page against extracted text.

Stage 5: preserve structure

RAG systems need more than text. They need enough structure to produce useful, source-grounded answers.

Preserve:

title,
heading path,
section number,
page number,
table captions,
list boundaries,
document version,
effective date,
source URL or storage path.

Chunk text with headings and page references. A chunk that says "The following applies" without the preceding heading is weak evidence.
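As a small illustration, the text that gets embedded can carry its heading path and page number as a prefix. The bracketed prefix format and the function name here are arbitrary choices for the sketch, not a standard.

```python
def render_chunk(text: str, title: str, heading_path: list[str], page: int) -> str:
    """Prefix chunk text with its location so it stands alone as evidence."""
    header = " > ".join([title, *heading_path])
    return f"[{header} | p. {page}]\n{text.strip()}"


print(render_chunk(
    "The following applies to all bookings.",
    "Expense Policy",
    ["Travel", "Hotel limits"],
    7,
))
# [Expense Policy > Travel > Hotel limits | p. 7]
# The following applies to all bookings.
```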

For tables, decide whether to:

keep the table as Markdown,
convert it to structured JSON,
store both text and structured rows,
exclude it until a better parser is available.

Do not pretend table extraction is solved if your use case depends on exact prices, dates, limits, or thresholds.
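If you choose to store both text and structured rows, the two forms can be produced from the same parsed table. This sketch assumes the parser already yields a header and rows as lists of strings; it refuses ragged rows rather than guessing at alignment.

```python
def check_shape(header: list[str], rows: list[list[str]]) -> None:
    """Raise if any row has a different number of cells than the header."""
    for i, row in enumerate(rows):
        if len(row) != len(header):
            raise ValueError(f"row {i} has {len(row)} cells, expected {len(header)}")


def table_to_markdown(header: list[str], rows: list[list[str]]) -> str:
    """Text form, suitable for embedding alongside surrounding prose."""
    check_shape(header, rows)
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


def table_to_records(header: list[str], rows: list[list[str]]) -> list[dict]:
    """Structured form, suitable for exact lookups of limits or prices."""
    check_shape(header, rows)
    return [dict(zip(header, row)) for row in rows]
```

Failing loudly on a malformed row is deliberate: a silently shifted column is exactly how a wrong limit or date ends up in an answer.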

Stage 6: attach metadata

Every chunk should carry metadata that can survive retrieval:

{
  "sourceId": "policy-2026-expenses",
  "documentId": "doc_123",
  "tenantId": "tenant_a",
  "visibility": "internal",
  "allowedRoles": ["finance", "leadership"],
  "sensitivity": "confidential",
  "sourceOwner": "Finance",
  "version": "2026-02",
  "lastReviewedAt": "2026-02-10",
  "effectiveFrom": "2026-03-01",
  "page": 7,
  "headingPath": ["Travel", "Hotel limits"],
  "contentHash": "sha256:..."
}

Metadata is how the application enforces policy during and after retrieval. Without it, the model receives text detached from the rules that make it safe to use.
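A simple validator can reject chunks whose metadata is incomplete before they are written to the index. Which keys count as required is an assumption here, based on the example record above; the rule that a non-public chunk must list at least one role is also an assumption.

```python
import hashlib

# Assumed minimum set of keys, taken from the example record above.
REQUIRED_KEYS = {
    "sourceId", "documentId", "tenantId", "visibility",
    "allowedRoles", "sensitivity", "sourceOwner", "version", "contentHash",
}


def content_hash(text: str) -> str:
    """Hash of the chunk text, in the "sha256:<hex>" form used above."""
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


def metadata_problems(meta: dict) -> list[str]:
    """Return a list of problems; an empty list means the chunk may be indexed."""
    problems = [f"missing {key}" for key in sorted(REQUIRED_KEYS - set(meta))]
    if meta.get("visibility") != "public" and not meta.get("allowedRoles"):
        problems.append("non-public chunk has no allowedRoles")
    return problems
```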

Stage 7: permission mapping

Permission mapping must happen before retrieval results reach the model.

Good pattern:

User asks a question.
Application derives tenant, user, roles, groups, and data permissions from auth.
Retriever filters candidate chunks by permissions.
Ranking happens within allowed chunks.
Model receives only allowed chunks.

Bad pattern:

Retrieve broadly.
Send all likely chunks to the model.
Prompt says "only answer using chunks the user may access."

The bad pattern has already exposed data to the model context.

If source permissions are complex, start narrower. It is better to miss an answer than to leak a confidential document.
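The good pattern can be sketched in memory as filter first, then rank. Real vector stores usually push the filter into the query itself; this example only illustrates the ordering. The `Chunk` and `Principal` types, and the assumption that similarity scores are already computed, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    tenant_id: str
    allowed_roles: frozenset
    score: float  # similarity score, assumed to come from the vector search


@dataclass(frozen=True)
class Principal:
    tenant_id: str
    roles: frozenset  # derived from auth, never from the prompt


def allowed(chunk: Chunk, user: Principal) -> bool:
    """Same tenant and at least one shared role."""
    return chunk.tenant_id == user.tenant_id and bool(chunk.allowed_roles & user.roles)


def retrieve(candidates: list[Chunk], user: Principal, k: int = 3) -> list[Chunk]:
    """Filter by permission first, then rank only what is allowed."""
    permitted = [c for c in candidates if allowed(c, user)]
    return sorted(permitted, key=lambda c: c.score, reverse=True)[:k]
```

Because ranking only ever sees permitted chunks, a highly relevant but restricted chunk cannot displace an allowed one or reach the model context.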

Stage 8: chunking and embedding

Chunking is a security and quality decision, not only a search tuning decision.

Guidelines:

Keep chunks inside permission boundaries.
Do not merge public and confidential text into one chunk.
Include headings and source references.
Avoid giant chunks that contain unrelated sections.
Avoid tiny chunks that lose context.
Re-embed when source text or metadata changes.
Store embedding model and version.

For sensitive sources, confirm whether your embedding provider, vector store, and logs are approved for that data class.
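The first two guidelines can be enforced mechanically: merge adjacent sections only when they share the same permissions. The sketch below assumes each section is a dict with `text`, `visibility`, and `allowedRoles`. It does not split sections that are already longer than the limit, and the size limit is an arbitrary placeholder.

```python
def permission_key(section: dict) -> tuple:
    """Sections may share a chunk only if this key is identical."""
    return (section["visibility"], tuple(sorted(section["allowedRoles"])))


def chunk_sections(sections: list[dict], max_chars: int = 1200) -> list[dict]:
    """Merge adjacent sections without ever crossing a permission boundary."""
    chunks = []
    current = None
    for section in sections:
        key = permission_key(section)
        fits = (
            current is not None
            and current["key"] == key
            and len(current["text"]) + 1 + len(section["text"]) <= max_chars
        )
        if fits:
            current["text"] += "\n" + section["text"]
        else:
            current = {"key": key, "text": section["text"]}
            chunks.append(current)
    return [
        {"visibility": c["key"][0], "allowedRoles": list(c["key"][1]), "text": c["text"]}
        for c in chunks
    ]
```

Each output chunk inherits exactly one permission set, so the retrieval filter never has to reason about mixed content.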

Stage 9: quality gates

Before publishing a source into production retrieval, run checks:

all documents have owners,
all chunks have permission metadata,
stale documents are flagged,
pages with failed extraction are excluded or reviewed,
sample questions retrieve expected sources,
unauthorized users retrieve zero restricted chunks,
deleted documents disappear from search,
citations point to valid source locations,
suspicious instructions in documents are isolated as content, not followed.

The last point matters. Documents can contain prompt injection. The ingestion pipeline should not remove all such text, because sometimes users need to know what a document says. But the runtime must treat it as untrusted document content.
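Several of these checks are mechanical and can run as a publication gate. The sketch below covers only owners, permission metadata, freshness, and deleted documents; retrieval, citation, and prompt-injection checks need a running retriever and are not shown. The one-year freshness window and the metadata key names are assumptions.

```python
from datetime import date, timedelta

MAX_REVIEW_AGE = timedelta(days=365)  # assumed freshness window


def gate(chunks: list[dict], deleted_document_ids: set, today: date) -> list[str]:
    """Return failures; publish the source only when the list is empty."""
    failures = []
    for chunk in chunks:
        doc_id = chunk.get("documentId", "<unknown>")
        if not chunk.get("sourceOwner"):
            failures.append(f"{doc_id}: no owner")
        if not chunk.get("tenantId") or not chunk.get("allowedRoles"):
            failures.append(f"{doc_id}: missing permission metadata")
        reviewed = chunk.get("lastReviewedAt")
        if reviewed is None or today - date.fromisoformat(reviewed) > MAX_REVIEW_AGE:
            failures.append(f"{doc_id}: stale or never reviewed")
        if chunk.get("documentId") in deleted_document_ids:
            failures.append(f"{doc_id}: deleted document still indexed")
    return failures
```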

Stage 10: retention and deletion

RAG systems often accidentally keep data longer than the source system does.

Deletion must cover:

original file cache,
extracted text,
chunks,
embeddings,
summaries,
thumbnails or rendered pages,
eval samples,
logs, where deletion is legally required,
backups according to policy.

When a document is deleted or access is revoked, retrieval should stop returning its chunks. Ideally, the system supports hard deletion for sensitive sources and a documented retention period for backups.

Track:

deletedAt,
deletedBy or source event,
deletion reason,
downstream cleanup status,
verification result.

Do not rely on "we removed it from the UI." Vector stores and caches are easy to forget.
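A deletion routine can fan out to every store and record whether each one actually came back clean. In this sketch, `InMemoryStore` is a stand-in for real systems such as a file cache, a vector database, or a summary cache; its two methods are a hypothetical interface, not any particular product's API.

```python
from datetime import datetime, timezone


class InMemoryStore:
    """Stand-in for a real store; maps item id -> owning document id."""

    def __init__(self, name: str, items: dict):
        self.name = name
        self.items = dict(items)

    def delete_document(self, document_id: str) -> int:
        """Remove every item belonging to the document; return how many."""
        before = len(self.items)
        self.items = {k: v for k, v in self.items.items() if v != document_id}
        return before - len(self.items)

    def contains_document(self, document_id: str) -> bool:
        return document_id in self.items.values()


def delete_everywhere(document_id: str, stores: list, deleted_by: str, reason: str) -> dict:
    """Delete from every store, then verify and record the outcome."""
    record = {
        "documentId": document_id,
        "deletedAt": datetime.now(timezone.utc).isoformat(),
        "deletedBy": deleted_by,
        "reason": reason,
        "cleanup": {},
    }
    for store in stores:
        removed = store.delete_document(document_id)
        record["cleanup"][store.name] = {
            "removed": removed,
            "verifiedGone": not store.contains_document(document_id),
        }
    record["verified"] = all(s["verifiedGone"] for s in record["cleanup"].values())
    return record
```

The returned record maps onto the tracking fields above, and a `verified` value of False gives an operator something concrete to follow up on.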

Stage 11: operational ownership

Every source needs an owner. Every owner needs a review cadence.

For each source, define:

who approves ingestion,
who approves permission changes,
who reviews stale documents,
who handles extraction failures,
who responds to data deletion requests,
who investigates retrieval mistakes.

If nobody owns a source, it should not be in a production RAG system.

The takeaway

Secure RAG ingestion is boring in the best way. It makes retrieval predictable.

The core controls:

register sources,
classify data,
check files before parsing,
measure extraction quality,
preserve structure,
attach metadata,
enforce permissions before retrieved chunks reach the model,
test unauthorized access,
support deletion,
assign owners.

Good answers come from good sources. Safe answers come from good source controls.
