Secure document ingestion for RAG: PDF, OCR, metadata, and retention

RAG quality starts before retrieval. A guide to secure ingestion covering PDFs, OCR, metadata, permissions, source freshness, deletion, malware risk, and operational ownership.

What you will be able to do

Design a secure document-ingestion pipeline for RAG with permission metadata, OCR quality checks, source freshness, retention rules, deletion behavior, and ingestion tests.

May 17, 2026

Most RAG failures start before retrieval. The document was stale. OCR missed a table. The source had no owner. Permission metadata was lost. A deleted contract stayed in the vector store. A scanned PDF had a hidden text layer that nobody checked. The system answered confidently because the ingestion pipeline treated "text exists" as equivalent to "knowledge is safe to use."

Secure document ingestion is source control for company knowledge. It decides what enters the retrieval system, who can see it, how freshness is tracked, how deletion works, and how bad inputs are caught.

This article covers the ingestion layer: PDFs, OCR, metadata, permissions, retention, and operational checks.

If a user should not have access to a document in the source system, they should not have access to its chunks, embeddings, summaries, or cached answers in the RAG system. Permission metadata is not optional.

Ingestion pipeline

A production pipeline should have explicit stages:

Source registration.
Data classification.
File safety checks.
Text extraction and OCR.
Structure preservation.
Metadata attachment.
Permission mapping.
Chunking and embedding.
Quality checks.
Index publication.
Retention and deletion handling.

The specific tools may differ. The control points should not.

Stage 1: source registration

Do not ingest random folders just because they are easy to connect.

For each source, record:

source name,
system of record,
source owner,
data owner,
allowed users or roles,
document types,
sensitivity level,
retention rule,
update frequency,
deletion behavior,
review schedule.

Example sources:

public help center,
internal support playbook,
sales collateral,
customer contracts,
HR policies,
engineering runbooks,
product documentation,
meeting transcripts.

These sources should not all land in one index with identical permissions.
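The registration checklist above can be sketched as a typed record plus a completeness check. This is a minimal illustration, not a prescribed schema: the field names simply mirror the checklist, and the sample values (including the system name and retention rule) are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SourceRegistration:
    """One registry entry per source; fields mirror the checklist above."""
    source_name: str
    system_of_record: str
    source_owner: str
    data_owner: str
    allowed_roles: tuple[str, ...]
    document_types: tuple[str, ...]
    sensitivity_level: str
    retention_rule: str
    update_frequency: str
    deletion_behavior: str
    review_schedule: str


def missing_fields(reg: SourceRegistration) -> list[str]:
    """Return names of empty fields so incomplete sources can be blocked."""
    return [f.name for f in fields(reg) if not getattr(reg, f.name)]


def can_ingest(reg: SourceRegistration) -> bool:
    """A source is eligible for ingestion only when every field is filled in."""
    return not missing_fields(reg)


# Hypothetical entry with one gap: nobody has been named as data owner yet.
contracts = SourceRegistration(
    source_name="customer contracts",
    system_of_record="contract-management",
    source_owner="Legal Ops",
    data_owner="",
    allowed_roles=("legal",),
    document_types=("pdf",),
    sensitivity_level="confidential",
    retention_rule="7y-after-expiry",
    update_frequency="weekly",
    deletion_behavior="hard-delete",
    review_schedule="quarterly",
)
```

Here `missing_fields(contracts)` reports the empty `data_owner`, so `can_ingest(contracts)` stays false until someone is assigned.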

Stage 2: classify data before extraction

Classify the source before a model or embedding provider sees the content.

Useful classes:

Class	Example	Default posture
Public	published docs, marketing pages	Allowed for broad retrieval
Internal	playbooks, process docs	Company-only, role filtered
Confidential	contracts, customer details, finance	Restricted roles, stronger logging
Regulated/sensitive	health, legal, HR, payroll, security incidents	Avoid unless explicitly approved

Classification is not just compliance paperwork. It determines whether content may be sent to a hosted embedding API, stored in a shared vector database, included in logs, or used for eval examples.
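One way to make classification drive those four decisions is a lookup from data class to handling policy. The sketch below assumes the four classes from the table; the True/False values are illustrative defaults, not recommendations, and the real values should come from your own approvals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HandlingPolicy:
    hosted_embedding_api: bool
    shared_vector_db: bool
    content_in_logs: bool
    eval_examples: bool


# Illustrative defaults only (assumption); replace with approved values.
POLICY_BY_CLASS = {
    "public": HandlingPolicy(True, True, True, True),
    "internal": HandlingPolicy(True, True, False, False),
    "confidential": HandlingPolicy(False, False, False, False),
    "regulated/sensitive": HandlingPolicy(False, False, False, False),
}

# Unknown or missing classes fall back to the most restrictive policy.
DENY_ALL = HandlingPolicy(False, False, False, False)


def policy_for(data_class: str) -> HandlingPolicy:
    """Look up the handling policy for a class label, denying by default."""
    return POLICY_BY_CLASS.get(data_class.strip().lower(), DENY_ALL)
```

The deny-by-default fallback is the design choice that matters: an unlabeled source is treated as the most sensitive one rather than the least.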

Stage 3: file safety checks

Documents can be hostile or simply broken.

Before extraction:

check the file type against an allowlist,
enforce file size limits,
scan for malware where the environment requires it,
reject encrypted files unless there is an approved decrypt path,
reject files with unsupported embedded objects,
normalize filenames,
store original file hash,
record who uploaded or connected the file.

This matters most when non-admin users can upload documents. Admin-only ingestion reduces the risk but does not eliminate it.

Antivirus and content-disarm controls are not built into most RAG stacks. If untrusted users can upload files, add a real file-safety layer before parsing.
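The cheap pre-parse checks from the list above (allowlist, size limit, filename normalization, original hash, uploader) might look like the sketch below. The allowlist, size limit, and record keys are assumptions for illustration. This is not a malware scanner and does not replace one; it only rejects obviously unacceptable files before a parser touches them.

```python
import hashlib
import re
import unicodedata
from pathlib import PurePosixPath

ALLOWED_EXTENSIONS = {".pdf", ".docx", ".txt", ".md"}  # assumed allowlist
MAX_BYTES = 25 * 1024 * 1024  # assumed size limit
# Leading bytes for formats where a cheap signature check is possible.
MAGIC_PREFIXES = {".pdf": b"%PDF-", ".docx": b"PK\x03\x04"}


def normalize_filename(name: str) -> str:
    """Drop directory parts and unusual characters from an uploaded filename."""
    base = PurePosixPath(name.replace("\\", "/")).name
    base = unicodedata.normalize("NFKC", base)
    return re.sub(r"[^\w.-]+", "_", base).strip("._") or "unnamed"


def precheck(name: str, data: bytes, uploaded_by: str) -> dict:
    """Run cheap checks before parsing and return an audit record."""
    safe_name = normalize_filename(name)
    ext = PurePosixPath(safe_name).suffix.lower()
    reasons = []
    if ext not in ALLOWED_EXTENSIONS:
        reasons.append("extension not on allowlist")
    if len(data) > MAX_BYTES:
        reasons.append("file too large")
    magic = MAGIC_PREFIXES.get(ext)
    if magic and not data.startswith(magic):
        reasons.append("content does not match extension")
    return {
        "filename": safe_name,
        "sha256": hashlib.sha256(data).hexdigest(),  # hash of the original bytes
        "uploadedBy": uploaded_by,
        "accepted": not reasons,
        "reasons": reasons,
    }


ok = precheck("../../etc/Q1 report (final).PDF", b"%PDF-1.7 sample", "alice")
bad = precheck("invoice.pdf", b"MZ\x90\x00", "bob")
```

In this example `ok` is accepted under a sanitized filename, while `bad` is rejected because its bytes do not start with the PDF signature. Encrypted-file and embedded-object checks need a real parser or scanner and are deliberately left out.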

Stage 4: text extraction and OCR

In practice, PDF is not one format. Some PDFs have selectable text. Some are scans. Some contain columns, tables, footnotes, forms, comments, stamps, or hidden text layers.

Track extraction quality:

extraction method,
OCR confidence,
page count,
extracted character count,
table extraction status,
language detected,
pages with no text,
parser warnings.

Low-quality extraction should not slip silently into the index. Route it to review or mark it as low confidence.
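A simple gate over the tracked signals can decide whether a document goes to the index or to review. The sketch below is illustrative: the thresholds are assumptions that should be tuned against a labeled sample of your own documents, and the report fields are a subset of the signals listed above.

```python
from dataclasses import dataclass, field
from typing import Optional

# Assumed thresholds; tune them on your own documents.
MIN_OCR_CONFIDENCE = 0.85
MIN_CHARS_PER_PAGE = 200
MAX_EMPTY_PAGE_RATIO = 0.10


@dataclass
class ExtractionReport:
    method: str  # e.g. "text-layer" or "ocr"
    page_count: int
    char_count: int
    empty_pages: int
    ocr_confidence: Optional[float] = None  # None when no OCR was run
    parser_warnings: list = field(default_factory=list)


def review_reasons(r: ExtractionReport) -> list:
    """Return reasons to hold a document for review; empty means publishable."""
    if r.page_count == 0:
        return ["no pages extracted"]
    reasons = []
    if r.ocr_confidence is not None and r.ocr_confidence < MIN_OCR_CONFIDENCE:
        reasons.append("low OCR confidence")
    if r.char_count / r.page_count < MIN_CHARS_PER_PAGE:
        reasons.append("too little text per page")
    if r.empty_pages / r.page_count > MAX_EMPTY_PAGE_RATIO:
        reasons.append("too many pages with no text")
    if r.parser_warnings:
        reasons.append("parser warnings present")
    return reasons


good = ExtractionReport("text-layer", page_count=10, char_count=18000, empty_pages=0)
scan = ExtractionReport("ocr", page_count=10, char_count=900, empty_pages=3,
                        ocr_confidence=0.62)
```

Returning reasons instead of a bare boolean gives the reviewer something to act on and leaves an audit trail for why a document was held back.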

Common problems:

columns read in the wrong order,
table rows merged incorrectly,
headers repeated in every chunk,
scanned pages missing entirely,
handwritten notes ignored,
hidden text layer contradicting the visible scan,
OCR converting account numbers incorrectly.

For high-value documents, spot-check the rendered page against the extracted text.

Stage 5: preserve structure

RAG systems need more than text. They need enough structure to give useful, source-grounded answers.

Preserve:

title,
heading path,
section number,
page number,
table captions,
list boundaries,
document version,
effective date,
source URL or storage path.

Chunk text together with its headings and page references. A chunk that says "The following applies" without the preceding heading is weak evidence.

For tables, decide whether to:
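A minimal sketch of carrying that context inside the chunk text itself: prefix each chunk with the document title, heading path, and page. The bracketed breadcrumb format and the sample sentence are assumptions chosen for illustration.

```python
def render_chunk(text: str, title: str, heading_path: list, page: int) -> str:
    """Prefix chunk text with its document title, heading path, and page."""
    breadcrumb = " > ".join([title, *heading_path])
    return f"[{breadcrumb} | p. {page}]\n{text.strip()}"


# Hypothetical policy sentence that is ambiguous without its heading.
chunk = render_chunk(
    "The following applies to stays longer than three nights.",
    title="Expense Policy",
    heading_path=["Travel", "Hotel limits"],
    page=7,
)
```

With the breadcrumb in place, the same sentence is tied to "Travel > Hotel limits" on page 7, which is what makes it citable.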

keep the table as Markdown,
convert it to structured JSON,
store both text and structured rows,
exclude it until a better parser is available.

Do not pretend table extraction is solved if your use case depends on exact prices, dates, limits, or thresholds.

Stage 6: attach metadata

Every chunk should carry metadata that survives retrieval:

{
  "sourceId": "policy-2026-expenses",
  "documentId": "doc_123",
  "tenantId": "tenant_a",
  "visibility": "internal",
  "allowedRoles": ["finance", "leadership"],
  "sensitivity": "confidential",
  "sourceOwner": "Finance",
  "version": "2026-02",
  "lastReviewedAt": "2026-02-10",
  "effectiveFrom": "2026-03-01",
  "page": 7,
  "headingPath": ["Travel", "Hotel limits"],
  "contentHash": "sha256:..."
}

Metadata is how the application applies policy during and after retrieval. Without it, the model receives text detached from the rules that make it safe to use.
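Because policy depends on this metadata, it helps to reject chunks whose metadata is incomplete before they are indexed. The sketch below assumes a required-key set drawn from the example above; which keys are mandatory is a choice for your own pipeline. It also assumes that an empty `allowedRoles` list is a mistake rather than a way of saying "everyone".

```python
# Assumed set of mandatory keys, taken from the example metadata above.
REQUIRED_KEYS = {
    "sourceId", "documentId", "tenantId", "visibility", "allowedRoles",
    "sensitivity", "sourceOwner", "version", "contentHash",
}


def metadata_problems(meta: dict) -> list:
    """List problems that should block a chunk from being indexed."""
    problems = [f"missing {key}" for key in sorted(REQUIRED_KEYS - set(meta))]
    roles = meta.get("allowedRoles")
    if "allowedRoles" in meta and not (isinstance(roles, list) and roles):
        problems.append("allowedRoles must be a non-empty list")
    return problems


sample = {
    "sourceId": "policy-2026-expenses",
    "documentId": "doc_123",
    "tenantId": "tenant_a",
    "visibility": "internal",
    "allowedRoles": ["finance", "leadership"],
    "sensitivity": "confidential",
    "sourceOwner": "Finance",
    "version": "2026-02",
    "contentHash": "sha256:...",
}
```

Treating an empty role list as an error closes a common gap where "no roles" is silently interpreted as "no restriction".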

Stage 7: permission mapping

Permission mapping must happen before retrieval results reach the model.

A good pattern:

User asks a question.
Application derives tenant, user, roles, groups, and data permissions from auth.
Retriever filters candidate chunks by permissions.
Ranking happens within allowed chunks.
Model receives only allowed chunks.

A bad pattern:

Retrieve broadly.
Send all likely chunks to the model.
Prompt says "only answer using chunks the user may access."

The bad pattern has already exposed data to the model context.

If source permissions are complex, start with narrower access. It is better to miss an answer than to leak a confidential document.
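The good pattern can be sketched as "filter, then rank". The example below works on an in-memory list with precomputed similarity scores, which is an assumption made to keep it self-contained; in a real system the same filter would usually be pushed into the vector store query so that disallowed chunks are never fetched. The role-overlap rule is also an assumed permission model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    tenant_id: str
    allowed_roles: frozenset
    score: float  # similarity score, assumed precomputed by the retriever


@dataclass(frozen=True)
class Principal:
    tenant_id: str
    roles: frozenset


def is_allowed(chunk: Chunk, user: Principal) -> bool:
    """Same tenant and at least one shared role; anything else is denied."""
    return chunk.tenant_id == user.tenant_id and bool(chunk.allowed_roles & user.roles)


def retrieve_for(user: Principal, candidates: list, k: int = 3) -> list:
    """Filter by permission first, then rank only what the user may see."""
    allowed = [c for c in candidates if is_allowed(c, user)]
    return sorted(allowed, key=lambda c: c.score, reverse=True)[:k]


index = [
    Chunk("hotel limits", "tenant_a", frozenset({"finance"}), 0.91),
    Chunk("payroll bands", "tenant_a", frozenset({"hr"}), 0.97),
    Chunk("hotel limits (other tenant)", "tenant_b", frozenset({"finance"}), 0.99),
]
finance_user = Principal("tenant_a", frozenset({"finance"}))
```

Note that the two higher-scoring chunks are never ranked for `finance_user`: they are removed before scoring order matters, so nothing about them can reach the model context.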

Stage 8: chunking and embedding

Chunking is a security and quality decision, not only a search-tuning decision.

Guidelines:

Keep chunks inside permission boundaries.
Do not merge public and confidential text into one chunk.
Include headings and source references.
Avoid giant chunks that contain unrelated sections.
Avoid tiny chunks that lose context.
Re-embed when source text or metadata changes.
Store embedding model and version.

For sensitive sources, confirm that the embedding provider, vector store, and logs are approved for that data class.
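The first two guidelines (stay inside permission boundaries, never merge public and confidential text) can be sketched by grouping consecutive sections by their permission key before merging. The `Section` shape and the character budget are assumptions; a single section larger than the budget is kept whole here and would need further splitting in practice.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Section:
    heading: str
    text: str
    visibility: str
    allowed_roles: frozenset


def permission_key(s: Section) -> tuple:
    return (s.visibility, s.allowed_roles)


def chunk_sections(sections: list, max_chars: int = 1200) -> list:
    """Merge consecutive sections only when they share the same permissions."""
    chunks = []
    for (visibility, roles), group in groupby(sections, key=permission_key):
        buffer = ""
        for s in group:
            piece = f"{s.heading}\n{s.text}"  # keep the heading with its text
            if buffer and len(buffer) + len(piece) + 2 > max_chars:
                chunks.append({"text": buffer, "visibility": visibility,
                               "allowedRoles": sorted(roles)})
                buffer = ""
            buffer = f"{buffer}\n\n{piece}" if buffer else piece
        if buffer:
            chunks.append({"text": buffer, "visibility": visibility,
                           "allowedRoles": sorted(roles)})
    return chunks


doc = [
    Section("Overview", "Public summary.", "public", frozenset({"everyone"})),
    Section("FAQ", "Public answers.", "public", frozenset({"everyone"})),
    Section("Pricing exceptions", "Confidential terms.", "confidential",
            frozenset({"sales", "legal"})),
]
out = chunk_sections(doc)
```

The two public sections merge into one chunk, while the confidential section always starts a new chunk with its own permission metadata.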

Stage 9: quality gates

Before publishing a source to production retrieval, run these checks:

all documents have owners,
all chunks have permission metadata,
stale documents are flagged,
pages with failed extraction are excluded or reviewed,
sample questions retrieve expected sources,
unauthorized users retrieve zero restricted chunks,
deleted documents disappear from search,
citations point to valid source locations,
suspicious instructions in documents are isolated as content, not followed.

The last point matters. Documents can contain prompt injection. The ingestion pipeline should not strip all such text, because users sometimes need to know what a document says. But the runtime must treat it as untrusted document content.
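Three of the checks above (unauthorized access, deleted documents, staleness) lend themselves to an automated gate over a retrieval result set. The sketch below is one possible shape: the one-year freshness window is an assumption, and the result dictionaries reuse the metadata keys from Stage 6.

```python
from datetime import date

MAX_REVIEW_AGE_DAYS = 365  # assumed freshness window


def is_stale(chunk: dict, today: date) -> bool:
    """A chunk is stale if it was never reviewed or the review is too old."""
    reviewed = chunk.get("lastReviewedAt")
    if not reviewed:
        return True
    return (today - date.fromisoformat(reviewed)).days > MAX_REVIEW_AGE_DAYS


def gate_failures(results: list, user_roles: set, deleted_ids: set,
                  today: date) -> list:
    """Check one retrieval result set; publish only when this returns []."""
    failures = []
    for c in results:
        doc = c.get("documentId", "<unknown>")
        if not user_roles & set(c.get("allowedRoles", [])):
            failures.append(f"{doc}: returned to a user without a permitted role")
        if doc in deleted_ids:
            failures.append(f"{doc}: deleted document still retrievable")
        if is_stale(c, today):
            failures.append(f"{doc}: stale or never reviewed")
    return failures


results = [
    {"documentId": "doc_123", "allowedRoles": ["finance"],
     "lastReviewedAt": "2026-02-10"},
    {"documentId": "doc_456", "allowedRoles": ["hr"],
     "lastReviewedAt": "2024-01-01"},
]
```

Running the gate with a deliberately under-privileged test user and a set of known-deleted document IDs turns "unauthorized users retrieve zero restricted chunks" and "deleted documents disappear from search" into repeatable tests.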

Stage 10: retention and deletion

RAG systems often accidentally keep data longer than the source system does.

Deletion must cover:

original file cache,
extracted text,
chunks,
embeddings,
summaries,
thumbnails or rendered pages,
eval samples,
logs where legally required,
backups according to policy.

When a document is deleted or access is revoked, retrieval must stop returning its chunks. Ideally, the system supports hard deletion for sensitive sources and documented retention for backups.

Track:

deletedAt,
deletedBy or source event,
deletion reason,
downstream cleanup status,
verification result.

Do not rely on "we removed it from the UI." Vector stores and caches are easy to forget.
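The tracked fields can be sketched as a deletion record that is only verified once every downstream store has confirmed cleanup. The store names below are an assumed subset of the list above (logs and backups follow their own retention policy and are left out), and the sample values are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Downstream stores that may hold copies of a document (assumed names).
STORES = ("file_cache", "extracted_text", "chunks", "embeddings",
          "summaries", "rendered_pages", "eval_samples")


@dataclass
class DeletionRecord:
    document_id: str
    deleted_by: str  # a user or a source event
    reason: str
    deleted_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    cleanup: dict = field(default_factory=lambda: dict.fromkeys(STORES, False))

    def mark_cleaned(self, store: str) -> None:
        """Record that one store has confirmed removal of the document."""
        if store not in self.cleanup:
            raise KeyError(f"unknown store: {store}")
        self.cleanup[store] = True

    def pending(self) -> list:
        """Stores that have not yet confirmed cleanup."""
        return [s for s, done in self.cleanup.items() if not done]

    def verified(self) -> bool:
        """True only when every store has confirmed cleanup."""
        return not self.pending()


rec = DeletionRecord("doc_123", "source-event:contract-expired", "contract terminated")
for store in STORES[:-1]:
    rec.mark_cleaned(store)
```

At this point `rec.pending()` still lists `eval_samples`, so the deletion is not verified: exactly the kind of forgotten copy that "we removed it from the UI" hides.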

Stage 11: operational ownership

Every source needs an owner. Every owner needs a review cadence.

For each source, define:

who approves ingestion,
who approves permission changes,
who reviews stale documents,
who handles extraction failures,
who responds to data deletion requests,
who investigates retrieval mistakes.

If a source has no owner, it should not be in a production RAG system.

Summary

Secure RAG ingestion is boring in the best sense. It makes retrieval predictable.

Core controls:

register sources,
classify data,
check files before parsing,
measure extraction quality,
preserve structure,
attach metadata,
enforce permissions before retrieved chunks reach the model,
test unauthorized access,
support deletion,
assign owners.

Good answers come from good sources. Safe answers come from good source controls.
