Self-hosted vs hosted inference: vLLM, TGI, and the break-even math
At what scale does self-hosting beat API calls? The actual math, the operational realities, and the patterns that distinguish teams who should self-host from teams who should keep paying for managed inference.
The pitch is compelling. Open-source models are competitive. GPUs are available. Inference servers like vLLM, TGI, and SGLang are mature. Why pay 5-10x markup to OpenAI or Anthropic when you could host the equivalent yourself?
The reality is more complex. Self-hosting genuinely wins at certain scales. At others, the operational cost dwarfs the inference savings. The break-even point varies by workload, model size, latency requirements, and team capability.
This article goes deep on the math, the operational realities, and the patterns that distinguish teams who should self-host from teams who shouldn't. We assume you're considering this seriously and want honest numbers.
When self-hosting makes sense
Some characteristics that favor self-hosting:
Scale. High inference volume. Specifically, monthly inference spend on APIs exceeding €5K-10K typically justifies considering self-hosting.
Predictable workload. Steady, predictable usage. Self-hosting requires capacity planning; spiky workloads waste capacity (under-utilization) or fail (over-saturation).
Privacy / compliance requirements. Data that can't be sent to cloud providers (regulated industries, certain government contracts, internal-only data).
Custom models. Fine-tunes, custom architectures, or specialized variants that managed providers don't offer.
Latency control. Sub-100ms first-token latency for some applications requires running models on infrastructure you control.
Cost per call below break-even. When you're doing the math and self-hosting genuinely wins.
When most of these are true, self-hosting is worth serious consideration.
When self-hosting doesn't make sense
The other side. Characteristics that favor managed APIs:
Low or variable scale. Inference spend under €5K/month. The savings don't justify the operational cost.
Spiky workloads. Usage that varies 10x between peak and quiet times. Self-hosting wastes capacity in valleys.
Need frontier capabilities. GPT-5, Claude 4 Opus, the latest reasoning models — these are closed and only available via APIs. If your workload genuinely needs frontier quality, you're paying for APIs.
Small team. Self-hosted inference requires operational expertise. Without dedicated capacity, things break.
Rapid iteration. Trying many different models, configurations, providers. APIs make this easy; self-hosting makes each change a deployment.
Multi-region / global users. Self-hosting requires you to operate in each region. Managed APIs handle this.
For these cases, managed APIs are the right answer even at significant cost.
The cost math (carefully)
Let's do the actual numbers for a representative case. Assumptions:
- Workload: 100 million tokens/month input, 30 million tokens/month output.
- Quality target: comparable to Claude 4 Sonnet or GPT-5.
- Available open model: Llama 4 70B (quality is close to flagship closed for many tasks).
Option A: API to closed model.
- Flagship closed API: roughly €3/M input × 100M = €300. €15/M output × 30M = €450. Total: ~€750/month.
This level of spend doesn't justify self-hosting.
Let's scale up the workload by 10x:
- 1 billion input tokens, 300 million output tokens.
- Closed API: €7,500/month.
Now self-hosting becomes interesting.
Option B: API to open model on managed open-source provider.
- Llama 4 70B on Together AI: ~€0.50/M input, €0.80/M output.
- 1B input × €0.50/M = €500. 300M output × €0.80/M = €240. Total: €740/month.
A 90% savings vs closed. Significant.
Option C: Self-hosted on rented GPUs.
- Llama 4 70B requires ~2 H100s for reasonable throughput (with quantization).
- Rented H100s: €2-3/hour each.
- 2 H100s × €2.50/hour × 730 hours/month = €3,650/month for compute alone.
- Add: storage, networking, ops time.
For this workload, the open-source-on-managed-provider beats self-hosting on raw cost. Self-hosting only wins if you also need control (privacy, custom model) or your throughput is much higher.
Option D: Self-hosted on owned/long-term-reserved GPUs.
- 2 H100s purchased or reserved long-term: €1-2/hour effective.
- 2 H100s × €1.50/hour × 730 hours = €2,190/month.
- Higher utilization can spread cost: if these GPUs handle multiple workloads, the per-workload cost is lower.
Now we're competitive with managed open-source providers. But the operational overhead is real.
The key insight: at this workload scale (~10B tokens/month), the savings of self-hosting over managed open-source providers are marginal. The savings vs closed APIs are dramatic, but managed open-source captures most of those.
At 10x this workload (100B+ tokens/month), self-hosting starts to clearly win. At 10x less, hosted is the answer.
The operational cost
Beyond raw inference cost, the operational cost of self-hosting.
Initial setup:
- Picking the right inference server (vLLM, TGI, SGLang).
- Configuring for your model and hardware.
- Setting up GPU infrastructure (cloud or owned).
- Networking, security, observability.
- Quantization and optimization.
Typical: 1-4 engineer-weeks for first deployment.
Ongoing operations:
- Monitoring (latency, throughput, errors, GPU utilization).
- Capacity planning.
- Upgrades (new model versions, inference server updates, security patches).
- Incident response (GPU failures, OOM crashes, software bugs).
- Scaling (more GPUs as load grows).
Typical: 0.25-1 engineer FTE ongoing, depending on scale.
Hidden costs:
- GPU price volatility.
- Cloud egress costs if hybrid.
- Specialty expertise (CUDA, quantization, optimization).
- Replacement / failure costs for owned hardware.
At €100K-200K per engineer/year fully loaded, even part-time engineering attention is significant. A €5K/month inference cost saving disappears under €15K/month in engineering cost.
This is where teams underestimate the cost of self-hosting. The inference math looks great in isolation; the total cost of ownership is much higher.
The inference servers
If you're going to self-host, the main options:
vLLM. Open-source. Probably the most popular for serving open-source LLMs. PagedAttention, continuous batching, broad model support. The default choice.
TGI (Text Generation Inference). Hugging Face's server. Mature, broad model support, good performance. Less rapid feature development than vLLM lately.
SGLang. Newer, very high performance. Strong for structured generation. Active development.
LMDeploy. From InternLM team. Strong quantization, fast.
llama.cpp / Ollama. For smaller models, lower-throughput. CPU-friendly. Production-grade for some use cases.
Hugging Face TGI Inference Endpoints. Managed self-hosting. Pay-per-hour for instances; HF operates them. Middle ground between fully self-hosted and managed.
Modal, RunPod, Replicate. Function-as-a-service for inference. Lower commitment than full self-hosting; higher cost than DIY.
For most teams: vLLM or SGLang for production self-hosting. Both are mature, fast, well-documented.
Hardware choices
The GPU question:
NVIDIA H100. Current state-of-the-art for inference. ~€2-3/hour rented. Buys you 80GB VRAM, fast inference. 70B models run well in single-H100 with quantization or 2x without.
NVIDIA H200. Successor to H100, more VRAM (141GB). For very large models.
NVIDIA L40S. More accessible, ~€1-2/hour. Good for moderate-size models (up to ~30B with quantization).
NVIDIA A100. Previous gen, still widely available. ~€1-2/hour. Workhorse for many production deployments.
AMD MI300X. Competitive with H100 for some workloads. Increasingly available. Some software immaturity vs NVIDIA stack.
Apple M-series. For very small models (under 8B), Mac Studio or Mac Pro with unified memory works. Niche use case.
For most production self-hosting in 2026: H100 or H200 if you need large models; L40S or A100 for moderate.
Rental sources: AWS, GCP, Azure (mainstream), Lambda Labs, Runpod, Together, Vast.ai (specialty). Pricing varies. Spot/preemptible instances can save 50-70% if you tolerate the disruption.
Quantization
Most production self-hosted deployments use quantized models. The trade-offs:
FP16 (16-bit). Default precision. Full quality. Most memory-hungry.
INT8 / FP8 (8-bit). Halves memory, slight quality loss. Common production choice.
INT4 (4-bit). Quarter memory, more noticeable quality loss but still useful. Aggressive choice.
AWQ, GPTQ, GGUF. Different quantization formats with different trade-offs.
For a 70B model:
- FP16: 140GB VRAM.
- INT8: 70GB VRAM.
- INT4: 35GB VRAM.
The H100 has 80GB VRAM. INT8 fits comfortably; FP16 requires 2 GPUs.
Quality impact:
- INT8: usually <1% degradation on benchmarks.
- INT4: 1-5% degradation, varies by task.
Test on your workload before deploying. Some tasks (especially structured/code) are more quantization-sensitive than others.
Throughput and capacity planning
A key planning question: how many tokens/second do you need?
Single-request throughput.
- 70B model on H100, INT8: ~50-80 tokens/second for single user.
Batched throughput.
- Multiple concurrent requests: 1000-3000 tokens/second total across requests (vLLM with good batching).
Latency considerations.
- First-token latency: 100-500ms typically.
- Per-token latency: 10-30ms.
For capacity planning:
- Estimate peak concurrent requests.
- Estimate average request length.
- Calculate total tokens/second needed.
- Add 50% headroom.
A team handling 1M tokens/hour with 50 concurrent peak users typically needs 2-4 H100s in good utilization.
Reliability and fallback
Self-hosting means you own reliability.
Health checks. Continuous health monitoring. Restart unhealthy instances.
Graceful degradation. When capacity is saturated, prefer slow responses over failures.
Fallback to APIs. Many teams self-host primary traffic and fall back to managed APIs when overloaded. Best of both worlds; complexity is real.
Backup hardware. GPUs fail. Spare capacity ready.
Multi-region. For global users, replicate. Or use managed APIs for distant regions.
Update strategy. New model versions, server upgrades. Blue-green deployments to avoid downtime.
Each of these is engineering work that managed APIs absorb for you.
A worked example: a team's self-hosting decision
A real example. SaaS team, AI features, monthly inference cost on managed APIs: €18,000.
The math:
- 80% of inference is classification and extraction (could run on smaller open model).
- 20% is complex generation (needs frontier closed).
Plan:
- Self-host Llama 4 70B for 80% of workload.
- Keep Claude/GPT API for the 20%.
- 3 H100s on Lambda Labs reserved: ~€4,500/month.
- Engineering setup: 4 weeks, €25K one-time.
- Ongoing ops: 0.25 FTE engineer, ~€30K/year.
Result after 6 months:
- Inference cost dropped from €18K/month to €6K/month (€4.5K self-host + €1.5K closed API for hard tasks).
- Net savings vs old: €12K/month = €144K/year.
- Less engineering investment: €25K + €30K = €55K/year.
- Net financial benefit: ~€89K/year.
Hidden complexities:
- One outage when a deployment had a config bug. 2-hour partial degradation.
- Multiple weeks of ongoing tuning to get throughput optimal.
- Engineer doing self-hosting wished they were doing other things.
Outcome: financially positive but operationally heavier than expected. Team continues self-hosting; if the volume dropped 50%, they'd switch back to managed.
This is what a real successful self-hosting decision looks like. Not magic — engineering work with measurable ROI.
A worked example: a team's "back to APIs" decision
A different team, similar starting point.
Original setup: Self-hosted Llama 3 70B on rented GPUs. Inference cost: €3K/month rent. Plus engineering ~€20K/year ongoing.
The change:
- Open-source-on-managed-provider pricing dropped 50% over 18 months.
- Their team grew but didn't hire dedicated MLOps.
- Self-hosting setup needed major work to keep up with new models.
The decision:
- Stop self-hosting.
- Move to Together AI hosting open models.
- Cost: €2.5K/month for managed open. Slight savings, lower complexity.
- Free up the engineer.
Result:
- Modest financial savings.
- Engineer time freed for product work.
- Less operational stress.
Outcome: the right call for them. Self-hosting wins for some teams; not others.
When to revisit the decision
The decision isn't permanent. Periodically revisit:
Volume changes. Up significantly: self-hosting more attractive. Down significantly: less attractive.
Pricing changes. Closed APIs getting cheaper or more expensive. Managed open getting cheaper. Hardware getting cheaper.
Model improvements. New open-source models that match closed quality. New closed models that pull away.
Operational capacity. Team grew or shrunk in ML/ops capability.
Privacy / compliance changes. New requirements that mandate self-hosting.
A quarterly check-in is reasonable. Not constant re-evaluation, but not "decided once" either.
Common mistakes
Patterns we see in self-hosting decisions:
Mistake 1: Cost math without operational cost. "Self-hosting saves €10K/month" — but ignores €15K/month in engineering. Negative ROI.
Mistake 2: Self-hosting too early. Spending engineering effort on self-hosting when the workload is small. Optimization premature.
Mistake 3: Self-hosting frontier-quality with small open models. "We can save money by using a smaller model" — but quality drops, users complain. Falls back to APIs.
Mistake 4: No fallback. Self-hosted infrastructure goes down; no graceful degradation. Outage when API customers wouldn't have one.
Mistake 5: Under-investing in optimization. Running a 70B model on a single GPU at 5 tokens/sec when proper setup gives 50. Throwing away most of the value.
Mistake 6: Ignoring quality drift. Self-hosted model has degraded vs current closed. Customers notice; team doesn't.
Mistake 7: Not reconsidering. Once self-hosting, never re-evaluating. The decision might have been right two years ago and wrong now.
Mistake 8: Spot/preemptible without graceful handling. Saved 60% on compute; outages every few hours when instances get reclaimed.
A decision checklist
To make the decision deliberately:
- [ ] Inference spend on APIs is at least €5-10K/month?
- [ ] Workload is steady and predictable?
- [ ] Team has or can hire MLOps/inference expertise?
- [ ] Open-source model exists at adequate quality?
- [ ] Latency requirements compatible with self-hosting?
- [ ] Have done detailed cost math including operational costs?
- [ ] Have a fallback plan?
- [ ] Compliance/privacy requirements don't mandate one path?
- [ ] Will you revisit quarterly?
If most are yes, self-hosting is worth considering seriously.
Hybrid patterns
It's not all-or-nothing. Many teams run hybrid:
Self-hosted for the bulk; APIs for the hard cases. Classification, simple generation on self-hosted; complex reasoning on closed APIs.
Self-hosted for steady; APIs for spikes. Self-hosted handles base load; APIs absorb peaks.
Self-hosted for sensitive; APIs for general. Sensitive data through self-hosted; general queries through APIs.
Self-hosted for fine-tunes; APIs for base. Custom models run yourself; off-the-shelf models from APIs.
Hybrid adds complexity but often captures the best of both. For teams at scale, hybrid is often the right answer.
The takeaway
Self-hosting LLM inference is genuinely viable in 2026. Open-source models are competitive. Inference servers are mature. Hardware is available.
But the operational cost is real and easy to underestimate. The break-even point against managed APIs is roughly €5-10K/month in inference spend; below that, the engineering investment doesn't pay back.
The teams that self-host successfully:
- Have done the math honestly, including operational costs.
- Have or can build MLOps capability.
- Run at sufficient scale to justify the investment.
- Have steady workloads.
- Don't need frontier-only capabilities.
- Plan for reliability, monitoring, and updates.
The teams that should stay on APIs:
- Lower scale.
- Spiky workloads.
- Need rapid iteration.
- Small teams without ops capacity.
- Need frontier closed capabilities.
The right answer is specific to your situation. Run the numbers carefully. Honestly assess your operational capacity. Default to APIs unless self-hosting clearly wins.
When self-hosting does win, it wins big — economically and architecturally. When it doesn't, it's an expensive way to discover that managed APIs were the right call all along.