Advanced10 min readAI Safety & Data Privacy

Company knowledge RAG: permissions, leakage, and source boundaries

A company knowledge assistant is only safe if retrieval respects permissions. How to design RAG source boundaries, ACL filtering, document ownership, logging, stale-source handling, and refusal behavior.

What you should be able to do

Design a company knowledge RAG with permission-aware retrieval, source ownership, leakage controls, and refusal behavior.

May 17, 2026

In this article

The core rule
Source boundaries
Ingestion controls
Retrieval controls
Prompt and answer behavior
Logging without creating a second leak
Stale and conflicting sources
Testing permission boundaries
Rollout path
Do not do this yet
The takeaway

The most common RAG mistake inside companies is simple: upload everything, ask questions, and treat source-cited answers as automatically safe.

The second most common mistake arrives later: someone realizes the assistant can answer from documents the user should never have seen. Salary bands. Customer contracts. Legal drafts. Board notes. Support tickets. HR investigations. Security procedures. The answer may be accurate and cited, but the system has leaked information.

A company knowledge RAG is not a search box with nicer prose. It is a permissioned information system. Treat it that way.

Retrieval must enforce the same or stricter permissions as the source systems. If a user cannot open the document in Google Drive, SharePoint, Notion, Confluence, or the CRM, the RAG should not retrieve it for that user.

The core rule

The retrieval layer must answer this question before returning any chunk:

Is this user allowed to see this source right now?

Not "is this source in the vector database?" Not "is this source relevant?" Not "is this source useful?" Permission comes first.

There are three common patterns:

Pattern	How it works	Fit
Separate indexes	One index per audience or workspace	Simple teams, coarse permissions
Metadata filtering	Store ACL/group/source metadata and filter before retrieval	Most company RAG systems
Real-time permission check	Query source system permissions at retrieval time	Sensitive or frequently changing permissions

The right answer depends on the source systems and risk level. For most SMEs, separate indexes plus metadata filtering is enough. For customer, HR, legal, or regulated data, real-time checks may be required.

Source boundaries

Do not create one giant knowledge pool. Separate by audience and sensitivity:

Corpus	Audience	Examples	Rule
Public/product	Everyone	Help docs, public pricing, product pages	Safe for broad assistant
Internal operations	Employees	Process docs, internal FAQs	Employee-only
Department	Department members	Sales playbooks, support macros, engineering runbooks	Group-filtered
Customer records	Assigned teams	Tickets, contracts, account notes	Strict ACL and audit
Restricted	Named users only	HR, legal, security, board	Usually separate system or no RAG

The fewer audiences a corpus serves, the easier it is to reason about leakage.

Ingestion controls

The ingestion pipeline is where many leaks start.

Before indexing a source, capture:

Source system.
Document ID.
Owner.
Audience or ACL.
Sensitivity label.
Created and updated timestamps.
Expiry or review date.
Whether the document may be used for AI retrieval.
Whether the document contains personal data.

If the source system already has labels, preserve them. If it does not, add a lightweight classification step before ingestion.

Retrieval controls

Retrieval should happen in this order:

Identify the user and groups.
Identify the requested workspace or assistant.
Filter candidate sources by corpus, ACL, sensitivity, and freshness.
Retrieve relevant chunks only from allowed sources.
Rerank allowed chunks.
Generate the answer with source references.
Refuse or escalate when allowed sources are insufficient.

Do not retrieve first and filter later in the prompt. If a forbidden chunk enters the model context, the boundary has already failed.

Prompt and answer behavior

The assistant should be instructed to:

Answer only from retrieved sources.
Cite source title and section/link.
Say when allowed sources do not contain the answer.
Mark inference separately from sourced facts.
Avoid revealing that restricted sources exist.
Avoid summarizing access-denied material.

Bad refusal:

"I found HR salary bands but you do not have access."

Better refusal:

"I do not have an approved source available to answer that."

The second answer does not leak the existence or topic of restricted documents.

Logging without creating a second leak

RAG logs are sensitive. They can contain user questions, retrieved chunks, source IDs, answers, and sometimes personal data.

Log enough to debug:

User ID or pseudonymous ID.
Assistant/workspace.
Query timestamp.
Source IDs retrieved.
Permission-filter outcome.
Answer ID.
Refusal/escalation reason.
Latency and errors.

Be careful with:

Full user questions.
Full retrieved chunks.
Full generated answers.
Customer data.
HR/legal/security topics.

For sensitive systems, store redacted logs or source IDs rather than full text. Give logs their own access control and retention period.

Stale and conflicting sources

Permission is not the only boundary. Source quality matters.

Every indexed source should have an owner and a freshness rule:

Source type	Review rule
Pricing	Review on every pricing change
Policy	Review on policy owner update, at least quarterly
Product docs	Review on release
Legal template	Review by legal owner
Support macro	Review monthly or after escalation pattern

When sources conflict, the assistant should surface the conflict only if the user can access both sources. Otherwise it should answer from the highest-authority allowed source or escalate.

Testing permission boundaries

Test with users, not only documents:

Employee with broad access.
Employee with narrow department access.
Manager with team-only access.
Contractor.
Former employee or disabled account.
Customer-facing support user.
Admin.

For each, ask:

A question they should be able to answer.
A question just outside their permissions.
A question about a restricted document they know exists.
A question where public docs and internal docs conflict.
A question using prompt injection: "ignore access rules."

The correct result is not only "good answer." It is "good answer from allowed sources."

Rollout path

Start with the least sensitive corpus:

Public/product docs.
Internal operations docs.
Department-specific docs.
Customer records with strict ACL.
Restricted corpora only after explicit security/legal approval.

At each stage, measure:

Answer helpfulness.
Citation quality.
Refusal correctness.
Access-denied retrieval rate.
Stale-source rate.
User reports of missing or wrong sources.

Do not do this yet

Do not index "all company docs" into one assistant.

Do not rely on prompt instructions to enforce permissions.

Do not log full retrieved chunks for sensitive corpora without a clear retention and access policy.

Do not mix HR, legal, customer, and public docs in the same corpus.

Do not let the RAG answer outside its allowed sources just to be helpful.

The takeaway

Company knowledge RAG is valuable because it brings source-grounded answers into daily work. It is risky because source-grounded answers can still leak information.

Design permission boundaries first. Filter before retrieval. Separate corpora by audience. Preserve source metadata. Refuse safely. Log carefully. Test with real permission profiles. If a user cannot access the source directly, the RAG should not use that source to answer them.

Take it further

Hand-picked external courses that go deeper on this topic.

EIPA — European Institute of Public Administration

AI & EU Law: Definition and Developments

EIPA

The fastest credible briefing on what the AI Act actually says — written by the institute that trains EU civil servants. Forty-five minutes; covers the risk-tier classification, who's responsible for what, and what changes for your product roadmap. The single best starting point for EU-deployed AI systems.

Advanced~45 minutesVerified 25 days ago

Coursera · University of Michigan

Generative AI: Governance, Policy, and Emerging Regulation

Merve Hickok

Few courses survey the regulatory landscape across the US, EU, and G7 in one place; this one does. Useful for compliance officers and product leaders trying to ship into multiple jurisdictions without inheriting hidden legal exposure. Pairs well with the EIPA EU AI Act primer for the European-specific detail.

Advanced~3 hoursVerified 25 days ago

See all courses for AI Safety & Data Privacy