Private AI deployment patterns: local, VPC, self-hosted, and hybrid
Private AI is not one architecture. A practical comparison of local models, enterprise SaaS, VPC deployments, self-hosted inference, and hybrid patterns for SMEs that care about privacy and control.
Outcome: Choose a private AI deployment pattern based on data sensitivity, capability needs, cost, latency, and operational capacity.
"Private AI" gets used to mean everything from "we turned off training on our SaaS account" to "we run open models on our own GPUs in a locked-down network." Those are not the same thing.
For SMEs, the right private AI architecture depends on the data, the task, the quality requirement, and the team's ability to operate infrastructure. The most private option is not always the best option. The most capable option is not always acceptable for the data. The cheapest option can become expensive if it needs constant engineering attention.
This article gives a practical map.
Start with data classification, not model preference. A weak model in the right privacy boundary is better than a frontier model fed data it should not receive.
The five deployment patterns
| Pattern | What it is | Best for | Main limitation |
| --- | --- | --- | --- |
| Consumer SaaS | Personal ChatGPT/Claude/Gemini accounts | Public or personal low-risk tasks | Weak enterprise controls |
| Enterprise SaaS | Business tier with admin, SSO, retention, training opt-out | Most normal company work | Data still leaves your environment |
| VPC or private cloud | Managed model endpoint inside controlled cloud boundary | Confidential workloads needing stronger isolation | Higher cost and setup |
| Self-hosted inference | You run open models on your own infrastructure | Restricted data, custom models, scale economics | Operations burden |
| Local-device models | Model runs on laptop, workstation, or edge device | Offline, sensitive, low-latency narrow tasks | Smaller models and device limits |
Most companies need more than one pattern. The goal is not to pick one forever. The goal is to route each use case into the right boundary.
Classify the data first
Use four buckets:
| Data | Examples | Default AI boundary |
| --- | --- | --- |
| Public | Website copy, published docs, public research | Any approved tool |
| Internal | Process notes, anonymized examples, non-sensitive drafts | Enterprise SaaS |
| Confidential | Customer data, contracts, source code, financials, strategy | Enterprise SaaS with controls, VPC, or self-hosted |
| Restricted | Health data, legal privilege, HR investigations, regulated records, credentials | Legal/security review; often local, VPC, or no AI |
This classification prevents a common mistake: using the same assistant for public blog drafts and confidential customer records because it is convenient.
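One way to make the classification enforceable is to encode the table as a lookup that tools consult before sending data anywhere. The sketch below is illustrative only: the boundary names, the `is_allowed` helper, and the choice to let internal data use stricter boundaries than its default are assumptions, not part of any product.

```python
from enum import Enum


class DataClass(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"
    RESTRICTED = "restricted"


# Hypothetical boundary names. The sets mirror the table above, with the
# assumption that a stricter boundary than the default is also acceptable.
ALLOWED = {
    DataClass.PUBLIC: {"consumer_saas", "enterprise_saas", "vpc", "self_hosted", "local"},
    DataClass.INTERNAL: {"enterprise_saas", "vpc", "self_hosted", "local"},
    DataClass.CONFIDENTIAL: {"enterprise_saas", "vpc", "self_hosted"},
    DataClass.RESTRICTED: {"vpc", "local"},
}

# Classes that need legal/security sign-off before any AI use.
NEEDS_REVIEW = {DataClass.RESTRICTED}


def is_allowed(data_class: DataClass, boundary: str, reviewed: bool = False) -> bool:
    """Return True if data of this class may be sent to the given boundary."""
    if data_class in NEEDS_REVIEW and not reviewed:
        return False  # "no AI" is the default until a review says otherwise
    return boundary in ALLOWED[data_class]
```

A gateway or browser extension could call `is_allowed(DataClass.RESTRICTED, "local", reviewed=True)` before forwarding a request. The point is that the rule lives in one place instead of in each employee's memory.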
Pattern 1: Enterprise SaaS as the default
For many SMEs, enterprise SaaS is the right default. ChatGPT Enterprise/Business, Claude for Work, Microsoft Copilot, Gemini for Workspace, and similar tools usually provide:
- Training opt-out by contract.
- Admin controls.
- SSO and access management.
- Retention controls.
- Audit logs.
- Security documentation.
- Vendor support.
This is enough for a large share of work: writing, summarization, research, meeting notes, internal analysis, and approved use of customer context.
The key is configuration. Buying the team plan is not enough. Set retention, sharing, connector access, approved workspaces, and data rules.
Pattern 2: VPC or private cloud
VPC/private cloud patterns are useful when data may leave your application but must stay inside a controlled cloud boundary. Examples:
- Customer support assistant over confidential tickets.
- Internal knowledge assistant over sensitive docs.
- Document extraction for contracts or invoices.
- Domain-specific assistant where you need stronger data isolation than SaaS.
Advantages:
- Better isolation.
- More control over networking and logs.
- Easier procurement story for sensitive customers.
- Less operational burden than full self-hosting.
Limitations:
- More expensive than SaaS.
- More integration work.
- Model choice can be narrower.
- You still depend on provider infrastructure.
For many serious SME systems, this is the practical middle ground.
Pattern 3: Self-hosted inference
Self-hosting means you run the model runtime: vLLM, TGI, SGLang, llama.cpp, Ollama, or another serving stack. It makes sense when:
- Data cannot leave your environment.
- You need a custom or fine-tuned open model.
- Inference volume is high enough to justify infrastructure.
- Latency or availability needs require direct control.
- You have people who can operate it.
Do not self-host only because it feels pure. The operational cost is real: GPU capacity, monitoring, upgrades, security patches, model evaluation, scaling, and incident response.
Self-hosting is a strong choice for the right organization. For a small team without ML infrastructure experience, it can become a fragile side project.
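Whether inference volume is "high enough" can be estimated with simple break-even arithmetic before buying any hardware. The function below is a rough sketch. Its name and parameters are hypothetical, and the numbers in the usage note are placeholders, not real prices.

```python
from typing import Optional


def breakeven_tokens_per_month(
    fixed_monthly_cost: float,
    api_price_per_million: float,
    self_host_price_per_million: float = 0.0,
) -> Optional[float]:
    """Monthly token volume above which self-hosting costs less than a hosted API.

    fixed_monthly_cost: GPUs, hosting, and the engineering time to run them.
    api_price_per_million: hosted API price per million tokens.
    self_host_price_per_million: marginal self-hosted cost per million tokens.
    Returns None if self-hosting never breaks even on a per-token basis.
    """
    saving_per_million = api_price_per_million - self_host_price_per_million
    if saving_per_million <= 0:
        return None
    return fixed_monthly_cost / saving_per_million * 1_000_000
```

With placeholder inputs of 6,000 per month in fixed cost, 3.0 per million tokens for the API, and 0.5 per million marginal cost when self-hosting, the break-even is 2.4 billion tokens per month. If measured volume is far below that, cost alone does not justify self-hosting, and the decision has to rest on privacy or customization.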
Pattern 4: Local-device models
Local models are underrated for privacy-sensitive individual work:
- Summarizing local notes.
- Drafting from private documents.
- Classifying internal snippets.
- Offline field work.
- Edge workflows where latency matters.
The tradeoff is quality. A small local model can be good enough for summarization, classification, extraction, and first drafts. It will generally not match frontier hosted models on hard reasoning, complex writing, or broad tool use.
Use local models when the task is narrow and the privacy boundary is more important than frontier quality.
Pattern 5: Hybrid routing
The mature pattern is hybrid:
- Public and low-risk tasks go to enterprise SaaS.
- Confidential retrieval happens inside a private RAG system.
- Restricted extraction runs locally or in a VPC.
- Final drafting may use a frontier model after sensitive fields are removed.
- Logs and evals decide whether each route is working.
Hybrid routing lets you use strong models without treating every record the same. It requires discipline:
- Data classification before routing.
- Redaction where possible.
- Clear model/tool allowlist.
- Logs that record which boundary was used.
- Fallback when the private model cannot do the task.
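The discipline items above can be sketched as a small router. This is a minimal illustration under stated assumptions: the route table, boundary names, handler signature, and email-only redaction are all hypothetical. Real redaction needs far more than one regex, and restricted data would reach this code only after review.

```python
import re
from typing import Callable, Dict, List, Optional

# Hypothetical allowlist: boundaries to try, in order, for each data class.
ROUTES: Dict[str, List[str]] = {
    "public": ["enterprise_saas"],
    "internal": ["enterprise_saas"],
    "confidential": ["private_rag", "vpc"],
    "restricted": ["local", "vpc"],
}

# Boundaries that only ever receive redacted text.
REDACT_BEFORE = {"enterprise_saas"}

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# A handler returns an answer, or None when its model cannot do the task.
Handler = Callable[[str], Optional[str]]


def redact(text: str) -> str:
    """Replace email addresses with a placeholder (illustrative only)."""
    return EMAIL_RE.sub("[EMAIL]", text)


def route(
    text: str,
    data_class: str,
    handlers: Dict[str, Handler],
    log: List[dict],
) -> Optional[str]:
    """Try each allowed boundary in order and record which one was used."""
    # An unknown data class raises KeyError: unclassified data goes nowhere.
    for boundary in ROUTES[data_class]:
        handler = handlers.get(boundary)
        if handler is None:
            continue  # boundary not deployed in this environment
        payload = redact(text) if boundary in REDACT_BEFORE else text
        answer = handler(payload)
        log.append({"data_class": data_class, "boundary": boundary,
                    "ok": answer is not None})
        if answer is not None:
            return answer
    return None  # no allowed boundary succeeded; hand off to a person
```

The fallback never leaves the allowlist for the data class. If the local model fails on a restricted record, the router tries the VPC route and then gives up, instead of quietly escalating to a more capable but less private model.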
Decision framework
Ask six questions:
- What data enters the model? Public, internal, confidential, restricted.
- What output impact exists? Draft, recommendation, decision, customer-facing action.
- What quality is required? Good enough, expert-level, frontier reasoning.
- What latency is required? Interactive, batch, real-time, offline.
- What operating capacity exists? No infra team, app team, platform team, ML ops.
- What proof do customers or regulators need? Vendor docs, logs, data residency, audit trail, isolation.
Then choose the lowest-complexity pattern that satisfies the data and quality needs.
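The selection rule can be written down as a filter followed by a minimum. The sketch below covers only three of the six questions (data, quality, and operating capacity). The complexity ranks and capability ratings are illustrative assumptions, not measurements, so adjust them to your own evaluations. Restricted data still needs legal/security review whatever the function returns.

```python
from dataclasses import dataclass
from typing import Optional

DATA_LEVELS = ["public", "internal", "confidential", "restricted"]
QUALITY_LEVELS = ["good_enough", "expert", "frontier"]
OPS_LEVELS = ["none", "app_team", "platform_team", "ml_ops"]


@dataclass(frozen=True)
class Pattern:
    name: str
    complexity: int   # lower is simpler to run
    max_data: str     # most sensitive data class the pattern may receive
    max_quality: str  # best quality the pattern is assumed to deliver
    min_ops: str      # least operating capacity needed to run it


# Illustrative ratings only; every value here is an assumption.
PATTERNS = [
    Pattern("enterprise_saas", 1, "confidential", "frontier", "none"),
    Pattern("local_device", 2, "restricted", "good_enough", "none"),
    Pattern("vpc", 3, "restricted", "frontier", "app_team"),
    Pattern("self_hosted", 4, "restricted", "expert", "ml_ops"),
]


def choose(data: str, quality: str, ops: str) -> Optional[str]:
    """Return the lowest-complexity pattern that satisfies all three inputs."""
    candidates = [
        p for p in PATTERNS
        if DATA_LEVELS.index(data) <= DATA_LEVELS.index(p.max_data)
        and QUALITY_LEVELS.index(quality) <= QUALITY_LEVELS.index(p.max_quality)
        and OPS_LEVELS.index(ops) >= OPS_LEVELS.index(p.min_ops)
    ]
    if not candidates:
        return None  # nothing fits: relax a requirement or do not use AI
    return min(candidates, key=lambda p: p.complexity).name
```

A `None` result is useful information. It means the stated requirements cannot all be met with the current team, which is a better thing to learn from a table than from a failed deployment.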
Do not do this yet
Do not self-host before measuring the workload and quality requirement.
Do not send restricted data to consumer tools.
Do not assume "open source" means private. It is private only if deployment, logs, access, and data flow are private.
Do not build one giant AI gateway without data classification. It will route sensitive data incorrectly.
Do not ignore evals. Private but wrong is still wrong.
A practical SME starting point
For most SMEs:
- Approve one enterprise SaaS assistant for general work.
- Write a data classification rule.
- Block restricted data unless reviewed.
- Build one private RAG or VPC workflow for the most valuable confidential use case.
- Use local models for narrow sensitive tasks where quality is acceptable.
- Revisit self-hosting only when privacy, customization, or cost clearly justifies it.
This gives the organization a private-by-design path without pretending every AI use case needs a GPU cluster.
The takeaway
Private AI is architecture matched to data. The right answer is rarely "all SaaS" or "all self-hosted." It is usually a portfolio: enterprise SaaS for normal work, private/VPC systems for confidential workflows, local models for narrow sensitive tasks, and self-hosting when scale or control genuinely demands it.
Choose based on data, impact, quality, latency, operations, and proof. That is the boring version. It is also the version that survives production.