Cost-optimizing inference: prompt caching, routing, and output control
LLM inference costs are 60-90% reducible with the right techniques. Prompt caching, model routing, output control, batching, and a few less-known patterns. The numbers, the patterns, and the production discipline that distinguishes well-run inference from a runaway bill.
By mid-2026, the most common AI architectural mistake we see in production is over-paying for inference. Teams ship LLM features, watch them work, then get hit with five-figure monthly bills that grow with usage. Some features become unprofitable. Some companies cut features that would have been viable with better cost discipline.
The numbers are clear: most LLM bills could be cut by 60-90%. The savings aren't from one magic technique. They come from a stack of compounding optimizations, each individually modest but cumulatively transformative.
This article covers the techniques, the numbers, and the production discipline. We assume you've already done basic model routing (covered in another article); we're going deeper.
The cost stack
LLM costs come from:
- Input tokens. What you send to the model. Includes system prompt, context, user query.
- Output tokens. What the model returns. Typically 3-10x more expensive than input.
- Reasoning tokens. For reasoning models, internal "thinking" tokens. Typically billed at the output rate.
- Tool calls. With tool-calling, every tool definition is sent as input tokens on each request.
- Retries. Failed calls still cost.
Optimization works at each layer.
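As a rough sketch of how these layers add up, per-call cost can be modeled as a sum over token types. The `Usage` structure and the prices passed in are illustrative assumptions, not any provider's actual schema or rates.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Token counts for one call (hypothetical structure for illustration)."""
    input_tokens: int
    output_tokens: int
    reasoning_tokens: int = 0


def call_cost(usage: Usage, input_price: float, output_price: float) -> float:
    """Cost in euros for one call; prices are per million tokens.

    Assumes reasoning tokens are billed at the output rate.
    """
    billed_output = usage.output_tokens + usage.reasoning_tokens
    return (usage.input_tokens * input_price + billed_output * output_price) / 1_000_000


def total_cost(attempts: list[Usage], input_price: float, output_price: float) -> float:
    """Sum over all attempts: retried calls are billed too."""
    return sum(call_cost(u, input_price, output_price) for u in attempts)
```

Logging a record like this per call, tagged with feature and user, is enough to see which layer dominates before choosing an optimization.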
Technique 1: Prompt caching
The single biggest lever. Most modern providers cache repeated input prefixes — you pay full price the first time, drastically less for subsequent calls with the same prefix.
Pricing (typical):
- Anthropic: cached reads ~10% of normal input cost. Cache writes carry a surcharge (about 25% over normal input for the default 5-minute cache).
- OpenAI: automatic for prefix-matching, ~50% of normal cost (varies by model).
- Google: explicit cached content, varies.
How it works: the first call to a model with a specific input prefix is normal price. Subsequent calls within the cache window (typically 5-60 minutes, provider-dependent) reuse the cached representation.
Practical implementation:
Structure your prompts so static content comes first, dynamic content last:
[CACHED: 10K tokens]
- System prompt
- Tool descriptions
- User's static profile
- Knowledge base snippets unlikely to change per call
[NOT CACHED: 1K tokens]
- Conversation history (changes each turn)
- Current user query
The first 10K tokens are cached after the first call. Subsequent calls pay ~10% on them and full price on the 1K.
Savings example:
Without caching:
- 11K input tokens × €3/million = €0.033 per call.
- 100K calls/day = €3,300/day.
With caching (90% of input is cached):
- 1K full-price + 10K cached at 10%:
- 1K × €3/million + 10K × €0.30/million = €0.003 + €0.003 = €0.006 per call.
- 100K calls/day = €600/day.
82% savings. Real numbers, real systems.
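The arithmetic above can be reproduced in a few lines. This is a simplified sketch: it assumes every call hits the cache and ignores the first uncached call and any cache-write surcharge, so real savings will be somewhat lower.

```python
def cost_per_call(full_tokens: int, cached_tokens: int,
                  price_per_million: float, cached_discount: float = 0.10) -> float:
    """Cost of one call in euros; cached tokens are billed at a fraction of the price."""
    full = full_tokens * price_per_million
    cached = cached_tokens * price_per_million * cached_discount
    return (full + cached) / 1_000_000


CALLS_PER_DAY = 100_000

without_cache = cost_per_call(11_000, 0, 3.0)
with_cache = cost_per_call(1_000, 10_000, 3.0)
savings = 1 - with_cache / without_cache

print(f"without caching: €{without_cache * CALLS_PER_DAY:,.0f}/day")
print(f"with caching:    €{with_cache * CALLS_PER_DAY:,.0f}/day")
print(f"savings:         {savings:.0%}")
```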
Implementation discipline:
- Identify static vs dynamic parts of prompts.
- Place static parts first.
- Use cache markers where the provider supports them (Anthropic) for explicit control.
- Test cache hits — your observability should show cache hit rate. If it's low, your prompt structure isn't right.
This is the highest-ROI optimization. Implement it before anything else.
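A minimal sketch of both halves of that discipline, assuming Anthropic's Messages API: the static prefix is marked with `cache_control`, and the hit rate is computed from the `usage` fields of logged responses. The field names reflect the API as we understand it; verify them against the current documentation. The helper names `build_request` and `cache_hit_rate` are ours.

```python
def build_request(model: str, static_system: str, history: list, user_query: str) -> dict:
    """Build keyword arguments for a Messages API call with a cacheable prefix.

    Everything up to and including the block marked with cache_control
    is eligible for caching; the messages after it are billed normally.
    """
    return {
        "model": model,
        "max_tokens": 300,
        "system": [
            {
                "type": "text",
                "text": static_system,  # system prompt, tool docs, static profile
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [*history, {"role": "user", "content": user_query}],
    }


def cache_hit_rate(usages: list) -> float:
    """Share of input tokens served from cache across logged usage records.

    Assumes input_tokens counts only the tokens that were neither read
    from nor written to the cache.
    """
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    written = sum(u.get("cache_creation_input_tokens", 0) for u in usages)
    uncached = sum(u.get("input_tokens", 0) for u in usages)
    total = read + written + uncached
    return read / total if total else 0.0
```

The request dict would be passed as `client.messages.create(**build_request(...))`. A hit rate well below the static share of your prompt usually means something dynamic has crept into the prefix.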
Technique 2: Model routing
Covered in detail elsewhere. Briefly: send different requests to different models based on complexity.
- 60% of requests to small models.
- 30% to mid-tier.
- 10% to flagship.
Typical savings: 60-80% vs using flagship for everything.
Combined with caching, you're at 90%+ savings vs the naive baseline.
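To see where a figure like 60-80% comes from, here is a small sketch of the blended price for the 60/30/10 split above. The per-tier prices are hypothetical placeholders; substitute your providers' actual rates.

```python
# Hypothetical blended prices in euros per million tokens (assumptions).
PRICES = {"small": 0.25, "mid": 1.50, "flagship": 5.00}

# Traffic split from the list above.
TRAFFIC_SHARE = {"small": 0.60, "mid": 0.30, "flagship": 0.10}


def blended_price(prices: dict, share: dict) -> float:
    """Average price per million tokens, weighted by traffic share."""
    return sum(prices[tier] * share[tier] for tier in share)


blended = blended_price(PRICES, TRAFFIC_SHARE)
savings = 1 - blended / PRICES["flagship"]
print(f"blended: €{blended:.2f}/M, savings vs flagship-only: {savings:.0%}")
```

With these placeholder prices, most of the remaining spend is the 10% of traffic still going to the flagship, which is why misrouting even a few percent of requests upward matters.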
Technique 3: Output length control
Output tokens dominate cost for most use cases. They're 3-10x input cost; they're determined by the model and prompt; they're often longer than needed.
Strategies:
Explicit length instructions.
Respond in at most 100 words.
Models follow this reasonably well. Cuts output costs significantly.
Structured output.
When the user-visible response is short structured data (JSON with specific fields), the output is bounded. No risk of unnecessary verbosity.
`max_tokens` parameter.
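Set it. Don't leave it at the default. If 200 tokens is enough, set the max to 250 (small buffer). The model can't exceed it. Note that hitting the limit truncates the output rather than making it more concise, so pair it with an explicit length instruction.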
Format constraints.
"Bullet points only" or "single paragraph" produces shorter outputs than free-form.
Bullet over prose.
Bullets are typically half the tokens of prose conveying the same info.
No preamble.
"Skip introductory phrases. Get straight to the answer." Models often start with "Great question..." or "Let me explain..." — wasted tokens.
Savings example:
A summarization workflow. Default output: 500 tokens. Constrained: 200 tokens.
- 500 tokens × €10/million = €0.005 per call.
- 200 tokens × €10/million = €0.002 per call.
60% savings on output. Less impressive than the 82% from the caching example, but on what is often the biggest cost line.
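The strategies above combine naturally into two small helpers: one that derives a `max_tokens` value with a buffer, and one that assembles the length and format instructions. Both function names are ours, for illustration only.

```python
import math


def max_tokens_with_buffer(expected_tokens: int, buffer: float = 0.25) -> int:
    """Hard cap slightly above the expected length, so normal answers are not truncated."""
    return math.ceil(expected_tokens * (1 + buffer))


def output_instructions(max_words: int, bullets: bool = True) -> str:
    """Instruction text to append to the system prompt."""
    parts = [
        f"Respond in at most {max_words} words.",
        "Skip introductory phrases. Get straight to the answer.",
    ]
    if bullets:
        parts.append("Bullet points only.")
    return " ".join(parts)


print(max_tokens_with_buffer(200))
print(output_instructions(100))
```

The instruction shapes the answer and the cap is the safety net. Relying on the cap alone tends to produce answers cut off mid-sentence.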
Technique 4: Output sampling and early stop
For some use cases, you don't need full LLM output — you need a decision or a classification.
Logprobs for classification.
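from openai import OpenAI
client = OpenAI()  # reads OPENAI_API_KEY from the environment
# prompt: your classification prompt, defined elsewhere.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # a non-reasoning chat model; reasoning models may not expose logprobs
    messages=[{"role": "user", "content": prompt}],
    logprobs=True,
    top_logprobs=5,
    max_tokens=1,
)
# Read the logprobs of the first token to determine the likely category.
first_token = response.choices[0].logprobs.content[0].top_logprobs
You're asking the model to emit one token (the category). Cost is one input pass + 1 output token. Faster, cheaper, often as good as longer responses.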
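Turning those first-token logprobs into a usable decision takes a small amount of post-processing. The sketch below assumes you have already extracted `(token, logprob)` pairs from the response and that each label is a single token. The function name and the normalization step are our choices, not part of any API.

```python
import math


def category_probs(top_logprobs: list, labels: list) -> dict:
    """Map first-token logprobs to normalized probabilities over known labels.

    top_logprobs: (token, logprob) pairs for the first output token.
    Tokens are matched case-insensitively, ignoring surrounding whitespace;
    tokens that match no label are dropped before normalizing.
    """
    probs = {label: 0.0 for label in labels}
    for token, logprob in top_logprobs:
        key = token.strip().lower()
        if key in probs:
            probs[key] += math.exp(logprob)
    total = sum(probs.values())
    if total == 0:
        return probs  # no label appeared among the top tokens
    return {label: p / total for label, p in probs.items()}


pairs = [("Yes", -0.2), (" yes", -2.5), ("No", -2.0), ("The", -4.0)]
print(category_probs(pairs, ["yes", "no"]))
```

Having a probability rather than a bare label also gives you a confidence threshold: low-confidence items can be escalated to a larger model.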
Logit bias.
For known-set outputs, bias the logits toward valid options.
import tiktoken
from openai import OpenAI
client = OpenAI()  # reads OPENAI_API_KEY from the environment
enc = tiktoken.encoding_for_model("gpt-4o-mini")
# logit_bias keys are token IDs (as strings), not words.
# Only the first token of each label is biased, so labels should be single tokens.
bias = {str(enc.encode(w)[0]): 100 for w in (" yes", " no", " maybe")}
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[...],
    logit_bias=bias,
    max_tokens=1,
)
Drives the model to emit the right kind of output. Cheap and reliable for classification.
Technique 5: Batching
When you're processing many items, batch them.
Async batching at the API level.
Most providers support async or batch APIs that process multiple requests at lower cost.
- OpenAI Batch API: 50% off, 24-hour completion window.
- Anthropic Message Batches: 50% off, results within 24 hours (often sooner).
If you have backlog work that doesn't need real-time response, run it through batch. Half the cost.
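As an illustration, OpenAI's Batch API takes a JSONL file with one request per line. The sketch below only builds that file content, and the helper name `to_batch_jsonl` is ours. The line format follows the Batch API documentation as we understand it, so check it against the current reference before relying on it.

```python
import json


def to_batch_jsonl(prompts: dict, model: str, max_tokens: int = 50) -> str:
    """Build Batch API input: one JSON request per line, keyed by custom_id."""
    lines = []
    for custom_id, prompt in prompts.items():
        request = {
            "custom_id": custom_id,  # used to match results back to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
            },
        }
        lines.append(json.dumps(request))
    return "\n".join(lines)


backlog = {
    "ticket-1": "Classify this ticket: refund request",
    "ticket-2": "Classify this ticket: login bug",
}
print(to_batch_jsonl(backlog, model="gpt-4o-mini"))
```

The resulting file is uploaded with `purpose="batch"` and submitted via `client.batches.create(...)` with `completion_window="24h"`. Results arrive as another JSONL file keyed by `custom_id`.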
In-prompt batching.
Process multiple items in one LLM call when possible.
Instead of:
[10 separate calls, each classifying one ticket]
Do:
[1 call, classifying 10 tickets in one prompt]
The single call has more input (10 items) but only one set of fixed overhead (system prompt, tool descriptions). Total tokens are less than 10 separate calls.
Caveat: quality can drop with too many items per prompt. Test the sweet spot for your use case. Usually 5-20 items per prompt is fine.
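A minimal sketch of in-prompt batching for ticket classification: chunk the items, number them in the prompt, and parse a numbered answer defensively. The prompt wording and answer format are assumptions to adapt, and the parser deliberately leaves unparseable items as `None` so they can be retried individually.

```python
def chunked(items: list, size: int) -> list:
    """Split items into consecutive chunks of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_batch_prompt(tickets: list, labels: list) -> str:
    """One prompt that asks for a label per numbered ticket."""
    numbered = "\n".join(f"{i}. {t}" for i, t in enumerate(tickets, start=1))
    return (
        f"Classify each ticket as one of: {', '.join(labels)}.\n"
        "Answer with one line per ticket in the form '<number>: <label>'.\n\n"
        f"{numbered}"
    )


def parse_batch_answer(answer: str, n_items: int, labels: list) -> list:
    """Return one label per ticket; None where the answer is missing or invalid."""
    result = [None] * n_items
    for line in answer.splitlines():
        number, sep, label = line.partition(":")
        number, label = number.strip(), label.strip().lower()
        if sep and number.isdigit() and 1 <= int(number) <= n_items and label in labels:
            result[int(number) - 1] = label
    return result
```

For example, 25 tickets with `chunked(tickets, 10)` become three calls instead of 25, each paying the system prompt once.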
Technique 6: Smaller models for narrow tasks
Beyond standard routing — consider whether a task really needs a big model.
Classification: GPT-5 Nano (€0.05/M tokens) is often as good as GPT-5 (€2/M) for simple classification. 40x savings.
Extraction: Mid-tier models work for structured extraction. Save flagship for the cases that fail.
Translation: Specialized translation models or smaller LLMs handle most cases.
Embedding: Use embedding-specialized models, not general-purpose LLMs for embedding.
The pattern: identify your "simple, narrow" workloads. Route them to the smallest model that does the job adequately. Save flagship for the complex, judgment-heavy work.
Technique 7: Fine-tuned small models
For very high-volume narrow tasks, fine-tune a small model.
Example: 100K classification requests/day.
- GPT-5 unmodified: €30/day in API costs.
- Fine-tuned 8B model on dedicated inference: €5-10/day in inference, plus one-time fine-tuning cost.
At sufficient volume, fine-tuned small models pay back quickly. The math depends on your volume.
We covered this in the fine-tuning article. The principle: when scale and narrowness align, fine-tuning is a cost lever.
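The break-even point is simple to estimate. In the sketch below, the daily costs come from the example above, while the €2,000 one-time fine-tuning cost is a made-up placeholder. Engineering time and ongoing maintenance are not included.

```python
import math


def break_even_days(one_time_cost: float, api_cost_per_day: float,
                    hosted_cost_per_day: float):
    """Days until the one-time cost is recovered; None if hosting is not cheaper."""
    daily_saving = api_cost_per_day - hosted_cost_per_day
    if daily_saving <= 0:
        return None
    return math.ceil(one_time_cost / daily_saving)


ONE_TIME_COST = 2_000  # hypothetical fine-tuning cost in euros (assumption)

for hosted in (5, 10):
    days = break_even_days(ONE_TIME_COST, api_cost_per_day=30, hosted_cost_per_day=hosted)
    print(f"hosted at €{hosted}/day: break-even after {days} days")
```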
Technique 8: Pre-filtering
For multi-step LLM workflows, cheap filtering catches obvious cases before expensive processing.
Example: customer support classification + response.
Cheap pre-filter:
- "Is this an actual support question or spam/noise?" (1-token classification on a small model.)
- "Is this a known FAQ?" (Embedding search; cheap.)
Only requests passing the filter reach the expensive response generation.
Savings: if 30% of incoming requests are noise or FAQ-able, that's 30% of your expensive calls eliminated.
The pre-filter is cheap (€0.0001 per call) compared to the response generation (€0.05 per call). Easy ROI.
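The expected cost per incoming request makes the trade-off concrete. This sketch uses the figures above and assumes that filtered-out requests (noise, or FAQs answered from a cache) cost nothing beyond the filter itself.

```python
def expected_cost(filter_cost: float, expensive_cost: float, pass_rate: float) -> float:
    """Average cost per incoming request when a pre-filter runs on everything."""
    return filter_cost + pass_rate * expensive_cost


baseline = 0.05  # every request goes straight to response generation
with_filter = expected_cost(filter_cost=0.0001, expensive_cost=0.05, pass_rate=0.70)
savings = 1 - with_filter / baseline

print(f"baseline:    €{baseline:.4f} per request")
print(f"with filter: €{with_filter:.4f} per request")
print(f"savings:     {savings:.1%}")
```

The filter pays for itself as long as the share of requests it removes exceeds the ratio of filter cost to generation cost, here 0.2%.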
Technique 9: Caching beyond prompt caching
Beyond the model provider's prompt caching, application-level caching:
Response caching. Same query, same context, same response. Cache and return without calling the model.
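import hashlib
import redis
r = redis.Redis(decode_responses=True)  # assumes a reachable Redis instance
def cached_call(prompt, model, ttl=3600):
    # Built-in hash() is salted per process, so use a stable digest as the key.
    cache_key = "llm:" + hashlib.sha256(f"{model}\n{prompt}".encode()).hexdigest()
    cached = r.get(cache_key)
    if cached is not None:
        return cached
    response = call_llm(prompt, model)  # call_llm: your own wrapper, returning a string
    r.set(cache_key, response, ex=ttl)  # ex= sets the expiry in seconds
    return response
For idempotent queries, this eliminates duplicate calls entirely.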
Embedding caching. Computed embeddings cached.
Retrieval result caching. Search results for a query cached for short periods.
Tool result caching. Tool call results cached if the underlying data doesn't change often.
Caching levels stack. At each layer, you save calls.
Technique 10: Speculative execution
For latency-sensitive flows where you can predict next steps, speculatively pre-call.
Example: customer support agent. You know the next step is usually "summarize the issue" after the customer describes it. Start that summarization in parallel with showing acknowledgment to the user.
If the prediction is right, the response is ready when needed. If wrong, you wasted one call.
This is a latency optimization more than cost, but for some flows it improves UX significantly.
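A minimal sketch of the pattern with `asyncio`: the predicted step starts as a background task while the real decision is still being made. `decide_next_step` and `run_step` are hypothetical coroutines standing in for your own logic.

```python
import asyncio


async def run_with_speculation(decide_next_step, run_step, predicted: str = "summarize"):
    """Start the predicted step early; fall back to the real one if the guess was wrong."""
    speculative = asyncio.create_task(run_step(predicted))
    actual = await decide_next_step()
    if actual == predicted:
        return await speculative  # already in flight, possibly already finished
    # Wrong guess: stop waiting for it. An LLM call that was already sent
    # may still be billed even though its result is discarded.
    speculative.cancel()
    return await run_step(actual)


async def _demo():
    async def decide():
        await asyncio.sleep(0.01)  # e.g. waiting for the customer to finish typing
        return "summarize"

    async def step(name):
        await asyncio.sleep(0.01)  # stands in for an LLM call
        return f"done:{name}"

    return await run_with_speculation(decide, step)


print(asyncio.run(_demo()))
```

The cost side is the comment in the middle: every wrong prediction is a call you paid for and threw away, so this only makes sense where the prediction is right most of the time.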
Technique 11: Provider arbitrage
Different providers charge differently for similar models. Take advantage.
Open-source models on cheap inference providers.
Llama 4 70B on Together AI: ~€0.30/M input, ~€0.50/M output. Anthropic Claude 4 Sonnet, for comparison: ~€2/M input, €15/M output.
For tasks where the open model is sufficient, you save roughly 5-30x, depending on your input/output mix.
Same model on different providers.
Some open models hosted by multiple providers with different pricing. Shop around.
Self-hosting at scale.
At sufficient volume (say €10K+/month on a specific model), self-hosting becomes cheaper than API calls. Requires operational capacity.
Provider arbitrage requires complexity. Multi-provider routing with fallback. Quality testing on each provider's variant. Worth it at scale.
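The routing-with-fallback part can be sketched independently of any SDK. Here each provider is just a named callable that takes a prompt and returns text, ordered cheapest first. The names and the exception type are ours.

```python
from typing import Callable, Sequence, Tuple


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain raised an error."""


def call_with_fallback(prompt: str,
                       providers: Sequence[Tuple[str, Callable[[str], str]]]) -> Tuple[str, str]:
    """Try providers in order (cheapest first); return (provider_name, response)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # in production, catch the SDK's specific error types
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

Returning the provider name alongside the response matters for the quality-testing point above: you need to know which variant produced each answer to evaluate them separately.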
Technique 12: Inference acceleration
For self-hosted: optimization of the inference layer itself.
vLLM, TGI, SGLang. Optimized inference servers. 2-10x throughput vs naive implementations.
Quantization. Run models at lower precision (4-bit, 8-bit). 2-4x throughput, mild quality cost.
Flash Attention, paged attention. Architectural optimizations enabled in modern servers.
Continuous batching. Servers that batch in-flight requests for better GPU utilization.
For teams self-hosting at scale, this matters. For teams using APIs, the provider handles it.
Technique 13: Streaming
Streaming doesn't reduce token count but improves UX, which matters for cost-effectiveness perception.
For long outputs, users see content appearing immediately. They can read along while generation completes. Feels much faster than waiting for full response.
For agents, streaming intermediate steps gives users visibility into progress.
Implementation: every modern API supports streaming. Use it for user-facing flows.
Technique 14: Budget guards
Beyond optimization, enforce hard budgets to prevent runaway costs.
Per-request budget. Maximum tokens per request. Stop if exceeded.
Per-user budget. Daily or monthly cost cap per user. Throttle when approaching.
Per-feature budget. Each feature has a budget. Auto-shutoff at 10x daily average.
Global budget. Total daily/monthly limit. Pause non-essential work near limits.
These don't directly save money but prevent disasters. A single bug or attack can balloon costs quickly without guards.
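A minimal in-memory sketch of such guards, keyed by scope (a user, a feature, or a global bucket). The class and method names are ours. A production version would persist counters in shared storage and reset them on a schedule.

```python
from collections import defaultdict


class BudgetExceeded(Exception):
    """Raised when a charge would push a scope over its limit."""


class BudgetGuard:
    def __init__(self, limits: dict, warn_ratio: float = 0.8):
        self.limits = limits            # scope -> limit in euros per period
        self.warn_ratio = warn_ratio    # fraction of the limit that triggers throttling
        self.spent = defaultdict(float)

    def charge(self, scope: str, cost: float) -> None:
        """Record a cost, or refuse it if the scope's limit would be exceeded."""
        if self.spent[scope] + cost > self.limits[scope]:
            raise BudgetExceeded(f"{scope} would exceed €{self.limits[scope]:.2f}")
        self.spent[scope] += cost

    def near_limit(self, scope: str) -> bool:
        """True once spending reaches the warning threshold."""
        return self.spent[scope] >= self.warn_ratio * self.limits[scope]

    def reset(self) -> None:
        """Call at the start of each budget period."""
        self.spent.clear()
```

Calling `charge` with an estimated cost before each LLM call turns a runaway loop into a handled exception instead of a surprise on the invoice.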
A worked example: a real cost reduction
A team running a customer support AI had a €12,000/month bill. Six months later, with the techniques applied, it was €1,800/month — an 85% reduction.
The changes:
- Prompt caching. Restructured prompts to maximize static prefix. ~70% of input now cached. Saved ~30%.
- Model routing. Classification and ticket triage moved from Claude Sonnet to Claude Haiku. Saved ~15%.
- Output length control. Responses constrained to 250 words from previous 800-1500. Saved ~25%.
- Pre-filtering. Cheap classification catches FAQ-able tickets, served from cache. ~20% of tickets eliminated from expensive flow. Saved ~10%.
- Response caching for FAQ. Identical questions return cached responses. Saved ~5%.
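Each percentage above is a share of the original €12,000 bill, which is why they add up to the 85% total. Measured against the cost remaining at each step, the individual cuts were larger than these shares suggest, because every later optimization was working on an already-reduced bill.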
Quality: by every metric measured (customer satisfaction, response correctness, resolution rate), quality was unchanged or slightly improved.
Operational cost: ~80 hours of engineering work over 3 months. ROI: paid back in 2 weeks.
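A quick sanity check of the figures in this example: reading the listed percentages as shares of the original bill reproduces the €1,800 result, while reading them as successive cuts to the remaining cost would give a noticeably smaller total, about 62%.

```python
ORIGINAL_BILL = 12_000  # euros per month

# Savings per change, as listed above.
savings_by_change = {
    "prompt caching": 0.30,
    "model routing": 0.15,
    "output length control": 0.25,
    "pre-filtering": 0.10,
    "response caching": 0.05,
}

# Reading 1: each figure is a share of the original bill, so they add.
additive_total = sum(savings_by_change.values())
additive_bill = ORIGINAL_BILL * (1 - additive_total)

# Reading 2: each figure applies to whatever cost remains, so they multiply.
remaining = 1.0
for share in savings_by_change.values():
    remaining *= 1 - share
compounded_total = 1 - remaining

print(f"additive:   {additive_total:.0%} saved, €{additive_bill:,.0f}/month")
print(f"compounded: {compounded_total:.0%} saved, €{ORIGINAL_BILL * remaining:,.0f}/month")
```

When reporting savings from stacked optimizations, it helps to state which of the two bases the percentages use.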
Common mistakes
A few patterns we see:
Mistake 1: No cost tracking. Team has no visibility into what each feature, user, or call costs. Optimization is impossible without measurement.
Mistake 2: Optimizing the wrong thing. Spent weeks reducing input tokens by 5% when output tokens were 80% of the bill. Measure first; optimize biggest contributors.
Mistake 3: Quality regressions. Cost cuts shipped without quality monitoring. Saved money, lost users. Always pair cost work with eval suites.
Mistake 4: Over-routing. Aggressive routing to small models for tasks they can't really handle. False savings.
Mistake 5: Cache pollution. Cache filling with rare queries. Most cache entries used once. Cache misses dominate. Better caching strategy needed.
Mistake 6: Skipping batch API. Real-time when batch would do. Half-price was sitting on the table.
Mistake 7: Over-engineering. Building elaborate cost optimization on top of features that aren't profitable anyway. Sometimes the right answer is "kill the feature."
Mistake 8: No budget guards. A single bug produces a runaway. Catastrophe rather than minor inconvenience.
The cultural part
Cost discipline is partly cultural. Teams that succeed:
- Treat cost as a metric, not an afterthought.
- Have someone owning it (often someone on the eng/finance interface).
- Review costs in weekly metrics.
- Investigate spikes immediately.
- Set budgets per feature; alert on threshold breaches.
- Make trade-offs explicitly (cost vs quality vs latency).
Teams that don't:
- Treat cost as someone else's problem.
- Discover the bill at the end of the month.
- React to spikes after the fact.
- Have no budget concept.
- Skip the trade-off conversation; optimize one dimension at a time.
Cultural change is harder than technical change. But it's what makes the technical changes stick.
Pricing trajectory
A note on the broader trend.
LLM inference costs are dropping 5-10x annually. A model that costs €5/M tokens today will likely be €0.50-€1/M in a year.
This means:
- Some optimizations matter less over time (the absolute cost drops anyway).
- Some workloads currently unprofitable will become profitable.
- Build for the long run: clean architecture > squeezing every cent now.
That said: even with falling prices, optimization matters. Inefficient systems waste money at every price point. And competitive advantage often goes to teams running efficient operations at lower cost.
A 90-day cost optimization plan
For a team starting from "we have an AI feature, costs are higher than expected":
Weeks 1-2: Measure.
- Instrument per-call costs.
- Build per-feature, per-user dashboards.
- Identify the biggest cost contributors.
Weeks 3-4: Quick wins.
- Enable prompt caching where supported.
- Restructure top 3 prompts to maximize cache hit rate.
- Set max_tokens on all calls.
- Implement budget alerts.
Weeks 5-6: Routing.
- Identify simple tasks currently on flagship.
- Build router for the 3-5 most-called endpoints.
- Test for quality regression.
Weeks 7-8: Output and caching.
- Constrain output lengths where not user-visible.
- Add application-level response cache for common queries.
- Add pre-filters for the highest-volume flows.
Weeks 9-10: Advanced.
- Batch API for non-real-time work.
- Provider alternatives evaluated.
- Embedding cache, retrieval cache.
Weeks 11-12: Hardening.
- Budget guards on every feature.
- Cost dashboards in regular team review.
- Documentation of patterns for future features.
By end of 90 days: 50-80% cost reduction realistic. Quality monitored. Discipline embedded.
The takeaway
LLM costs are reducible — usually by 60-90% — without quality loss. The techniques are well-known: caching, routing, output control, batching, pre-filtering, response caching, model selection, budget guards.
Done in isolation, each saves modestly. Done together, they compound to dramatic savings.
The teams that get this right turn unprofitable AI features into profitable ones. The teams that don't, eventually have to cut features that should have been viable.
Measure first. Optimize the biggest contributors. Maintain quality monitoring. Build cost discipline into the team's regular work.
The result: AI features that scale economically, not just technically. That's what makes AI a sustainable part of a product, not just a launch headline.