Multi-model orchestration: routing by cost, latency, and quality
Using one model for everything is the rookie move. Production AI systems route different requests to different models — and save 60-90% on cost while improving quality. The patterns, the routing logic, and the trade-offs.
The single most expensive mistake we see in production AI systems is using one model for everything.
A team picks GPT-5 (or Claude 4 Opus, or another flagship model) and builds its app around it. The monthly bills run to five figures. The team could cut them by 60-90% by routing different requests to different models, and often improve quality at the same time.
This is multi-model orchestration: using the right model for each task in a system. It's the difference between hobby AI and production AI.
This article covers the patterns, the routing logic, the trade-offs, and a step-by-step guide to implementing it.
Why one model isn't optimal
The models on offer in 2026 cluster into rough tiers:
Flagship reasoning models (GPT-5, Claude 4 Opus, Gemini 3 Pro, o3): excellent at complex reasoning, expensive (€2-€15 per million input tokens, €10-€60 per million output tokens), slower (5-30 seconds typical).
Flagship general models (GPT-5 turbo, Claude 4 Sonnet, Gemini 3 Pro Flash): excellent at most knowledge work, moderately expensive (€0.50-€3 per million input, €5-€15 per million output), reasonable speed (2-5 seconds).
Mid-tier models (GPT-5 Mini, Claude 4 Haiku, Gemini 3 Flash): good at simple-to-moderate tasks, cheap (€0.10-€0.50 per million input, €0.50-€2 per million output), fast (1-2 seconds).
Small models (GPT-5 Nano, Claude Haiku Lite, Gemini Flash Lite): good at simple structured tasks, very cheap (€0.02-€0.10 per million tokens), very fast (<1 second).
Specialised models (embedding models, reranking models, vision models, voice models): optimised for specific tasks, often very cheap because they're focused.
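To compare tiers on a concrete call, a per-call cost estimate is useful. The sketch below is illustrative only: the `TIER_PRICES` figures are assumed midpoints of the price ranges quoted above, not real price-list values, and `TierPrice` and `estimate_call_cost` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierPrice:
    input_per_million: float   # EUR per million input tokens
    output_per_million: float  # EUR per million output tokens


# Assumed midpoints of the ranges quoted in the text, for illustration only.
TIER_PRICES = {
    "flagship_reasoning": TierPrice(8.50, 35.00),
    "flagship_general": TierPrice(1.75, 10.00),
    "mid": TierPrice(0.30, 1.25),
    "small": TierPrice(0.06, 0.06),
}


def estimate_call_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in EUR of one call on the given tier."""
    price = TIER_PRICES[tier]
    return (input_tokens * price.input_per_million
            + output_tokens * price.output_per_million) / 1_000_000


if __name__ == "__main__":
    # Same call shape (2,000 tokens in, 500 tokens out) priced on every tier.
    for tier in TIER_PRICES:
        print(f"{tier}: EUR {estimate_call_cost(tier, 2_000, 500):.6f}")
```

Swapping in your providers' actual prices and your own token counts per task type gives a first estimate of what each call costs on each tier.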
A typical AI app makes many different types of LLM calls. Each call has its own requirements:
- Classifying user intent: needs simple classification, fast response. A small or mid-tier model is enough.
- Extracting structured data from documents: needs reliability on structured output, moderate complexity. Mid-tier or flagship general.
- Producing the actual response to a user query: needs quality, context handling. Flagship general or flagship reasoning.
- Generating summaries of past conversation: simple summarisation. Mid-tier or small model.
- Background batch processing: not latency-sensitive, but volume matters. Small or mid-tier.
Using flagship for all of these is wasteful. The classification doesn't need it. The summarisation doesn't need it. The structured extraction often doesn't need it. Only the user-facing response truly benefits.
The cost savings are real
A typical knowledge-work AI app might have a request distribution like:
- 60% of LLM calls: simple classification, extraction, summarisation. Best served by mid-tier or small models.
- 30% of calls: moderate complexity. Mid-tier or flagship general.
- 10% of calls: complex reasoning or final user response. Flagship.
If you use a flagship for everything, cost is 100% of flagship pricing. If you route appropriately:
- 60% at small-model cost (1/30th of flagship): 2% of original cost.
- 30% at mid-tier cost (1/5th of flagship): 6% of original cost.
- 10% at flagship cost: 10% of original cost.
Total: 18% of original cost. An 82% reduction. On a €10,000/month bill, that's €8,200/month saved.
These numbers depend on your traffic shape, but the pattern is consistent: most apps have a request mix where the average call is much cheaper than the worst call. Routing captures that.
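The blended-cost arithmetic above can be reproduced for your own traffic mix. This is a minimal sketch that assumes every call uses a similar number of tokens, so that cost scales with the share of calls. The `ROUTES` table and `blended_cost_fraction` are hypothetical names; the shares and relative costs are the ones from the example.

```python
# route -> (share of calls, cost relative to flagship); figures from the example above
ROUTES = {
    "small": (0.60, 1 / 30),
    "mid": (0.30, 1 / 5),
    "flagship": (0.10, 1.0),
}


def blended_cost_fraction(routes):
    """Cost of the routed system as a fraction of the all-flagship cost."""
    total_share = sum(share for share, _ in routes.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * relative_cost for share, relative_cost in routes.values())


fraction = blended_cost_fraction(ROUTES)
monthly_bill = 10_000  # EUR, the all-flagship bill from the example
print(f"{fraction:.0%} of flagship cost, "
      f"saving EUR {monthly_bill * (1 - fraction):,.0f}/month")
```

If your expensive calls also carry far more tokens than your cheap ones, weight the shares by token volume instead of call count; the saving will be smaller than the call-count view suggests.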
The basic orchestration patterns
A few patterns recur in production multi-model systems:
Pattern 1: Task-based routing
Different types of tasks go to different models. This is the simplest pattern.

    def route_request(task_type):
        # Map each known task type to the cheapest model that handles it well.
        if task_type == "classification":
            return "gpt-5-nano"
        elif task_type == "extraction":
            return "gpt-5-mini"
        elif task_type == "summarization":
            return "claude-4-haiku"
        elif task_type == "user-facing-response":
            return "claude-4-sonnet"
        elif task_type == "complex-reasoning":
            return "claude-4-opus"
        # Fail loudly on unknown task types instead of silently returning None.
        raise ValueError(f"Unknown task type: {task_type}")

Tasks are classified by the calling code (it knows what it's asking for). The routing is deterministic and easy to debug.
Pattern 2: Complexity-based routing
The system estimates the complexity of each request and routes accordingly.

    def route_by_complexity(request):
        complexity = estimate_complexity(request)
        if complexity < 3:
            return "small"
        elif complexity < 7:
            return "mid"
        else:
            return "flagship"

The complexity estimate can be heuristic (request length, keyword detection) or model-based (a cheap classifier scores the request). This pattern handles cases where the same task type varies in difficulty.
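One possible heuristic estimator, as a rough sketch: it scores a request from its length and a few reasoning cues. The keyword list, the weights, and the 0-10 scale are all assumptions made for illustration and would need tuning against real traffic.

```python
import re

# Hypothetical cue list; tune it against your own requests.
REASONING_HINTS = ("why", "compare", "explain", "trade-off", "prove", "step by step")


def estimate_complexity(request: str) -> int:
    """Return a rough 0-10 complexity score from cheap text features."""
    text = request.lower()
    words = re.findall(r"\w+", text)
    score = min(len(words) // 50, 5)  # up to 5 points for length
    # 2 points per reasoning cue found, capped at 4
    score += min(sum(2 for hint in REASONING_HINTS if hint in text), 4)
    if request.count("?") > 1:  # several questions in one request
        score += 1
    return min(score, 10)


def route_by_complexity(request: str) -> str:
    complexity = estimate_complexity(request)
    if complexity < 3:
        return "small"
    if complexity < 7:
        return "mid"
    return "flagship"


print(route_by_complexity("Reset my password"))
print(route_by_complexity("Explain why the refund failed"))
```

A heuristic like this is cheap and transparent but easy to fool (a short request can still be hard), which is the usual reason to move to a model-based scorer later.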
Pattern 3: Cascade routing
Try a cheap model first. If the output is good, use it. If not, escalate to a more expensive model.

    def cascade(request):
        cheap_response = call_model("small", request)
        if is_acceptable(cheap_response):
            return cheap_response
        return call_model("flagship", request)

This works when "acceptable" is detectable: by confidence scores, validators, or a separate quality-check LLM. It's powerful: most simple requests get answered by the cheap model; only the hard ones reach the expensive one.
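For structured outputs, the acceptability check can be a plain validator. The sketch below assumes the cheap model is asked to return JSON with a `category` and a self-reported `confidence`; the schema, the 0.8 threshold, and the `call_model` callable passed in are all hypothetical.

```python
import json

# Hypothetical output schema and threshold.
REQUIRED_FIELDS = {"category": str, "confidence": (int, float)}
MIN_CONFIDENCE = 0.8


def is_acceptable(raw_response: str) -> bool:
    """Accept only well-formed JSON with the expected fields and enough confidence."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            return False
    return data["confidence"] >= MIN_CONFIDENCE


def cascade(request, call_model):
    """Try the small model first; escalate when its output fails validation."""
    cheap_response = call_model("small", request)
    if is_acceptable(cheap_response):
        return cheap_response
    return call_model("flagship", request)
```

Self-reported confidence is a weak signal on its own, so it is worth checking against labelled examples how often a "confident" cheap answer is actually correct before relying on it.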
Pattern 4: Specialty routing
Use specialised models for specialised tasks:
- Embeddings: use a dedicated embedding model (much cheaper than a chat model used to embed).
- Reranking: use a dedicated reranker.
- Vision: use a vision-specialised model for image analysis.
- Voice: use a voice model for transcription/synthesis.
- Code: use a code-specialised model for code tasks.
Specialised models are usually faster, cheaper, and better at their specific task than a general model trying to do the same thing.
Pattern 5: Provider routing
Use models from multiple providers for redundancy and pricing leverage.

    providers = ["openai", "anthropic", "google"]
    preferred = "anthropic"  # primary
    fallback = "openai"      # fallback

    # call(), RateLimit and ProviderError are placeholders for your own
    # client wrapper and its exception types.
    def call_with_failover(request):
        try:
            return call(preferred, request)
        except (RateLimit, ProviderError):
            return call(fallback, request)

This gives resilience against single-provider outages and rate limits. It also lets you take advantage of pricing changes: when a provider cuts prices, shift more traffic there.
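A slightly fuller failover sketch, assuming an ordered list of providers and a short exponential backoff on rate limits. `RateLimit` and `ProviderError` are placeholder exception classes defined here, not types from any real SDK, and `call` is whatever wrapper you pass in.

```python
import time


class RateLimit(Exception):
    """Placeholder for a provider's rate-limit error."""


class ProviderError(Exception):
    """Placeholder for a provider outage or server error."""


def call_with_failover(request, providers, call,
                       retries_per_provider=2, base_delay=0.5, sleep=time.sleep):
    """Try providers in order; retry rate limits with backoff, skip failed providers."""
    last_error = None
    for provider in providers:
        for attempt in range(retries_per_provider):
            try:
                return call(provider, request)
            except RateLimit as exc:
                last_error = exc
                sleep(base_delay * 2 ** attempt)  # back off before the next attempt
            except ProviderError as exc:
                last_error = exc
                break  # provider looks down; move on to the next one
    raise RuntimeError("all providers failed") from last_error
```

Passing `sleep` as a parameter keeps the function testable without real delays. In production you would map each SDK's own exceptions onto these two cases.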
A realistic example: customer support AI
To make it concrete, here's how a customer support AI might use multi-model orchestration.
The system has these steps per ticket:
Step 1: Classify intent. What is the customer asking about? (5-10 categories.)
Routing: GPT-5 Nano. It's a simple classification task. Cost: ~€0.0001 per ticket.
Step 2: Determine urgency and sentiment. Is the customer frustrated? Is this urgent?
Routing: GPT-5 Nano. Another simple classification. Cost: ~€0.0001 per ticket.
Step 3: Retrieve relevant knowledge.
Routing: embedding model + reranking model. Specialty tools for a specialty job. Cost: ~€0.0002 per ticket.
Step 4: Determine if the AI can answer this or needs human escalation.
Routing: Claude 4 Haiku. Slightly more sophisticated classification given the retrieved context. Cost: ~€0.001 per ticket.
Step 5 (if AI can answer): Generate the customer-facing response.
Routing: Claude 4 Sonnet. Quality matters here — this is what the customer reads. Cost: ~€0.02 per ticket.
Step 6 (if AI cannot answer): Generate a summary for the human agent.
Routing: GPT-5 Mini. Useful summary, not customer-facing. Cost: ~€0.005 per ticket.
Step 7: Quality check. Did the response meet our standards?
Routing: Claude 4 Haiku as a fast judge. Cost: ~€0.001 per ticket.
For tickets the AI answers (say 70%), steps 1-5 and 7 apply:
- Total cost per ticket: ~€0.022
For tickets escalated to humans (30%), steps 1-4 and 6 apply:
- Total cost per ticket: ~€0.006
Weighted average: ~€0.018 per ticket.
If the team had used a flagship reasoning model for everything: ~€0.30 per ticket. The multi-model approach saves 94%.
At 1,000 tickets/day, that's roughly €280/day, or about €100,000/year in savings.
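The per-ticket arithmetic can be checked by summing the per-step figures listed above. In this sketch the step names are hypothetical labels, and it assumes escalated tickets skip both the customer-facing response and the quality check.

```python
# EUR per ticket for each step, taken from the walkthrough above.
STEP_COST = {
    "classify_intent": 0.0001,
    "urgency_sentiment": 0.0001,
    "retrieval": 0.0002,
    "escalation_decision": 0.001,
    "customer_response": 0.02,
    "agent_summary": 0.005,
    "quality_check": 0.001,
}

SHARED = ["classify_intent", "urgency_sentiment", "retrieval", "escalation_decision"]
AI_PATH = SHARED + ["customer_response", "quality_check"]
HUMAN_PATH = SHARED + ["agent_summary"]  # assumption: no quality check when escalated


def path_cost(steps):
    return sum(STEP_COST[step] for step in steps)


def weighted_cost(ai_share=0.70):
    return ai_share * path_cost(AI_PATH) + (1 - ai_share) * path_cost(HUMAN_PATH)


FLAGSHIP_COST = 0.30   # EUR per ticket if everything ran on a flagship reasoning model
TICKETS_PER_DAY = 1_000

average = weighted_cost()
print(f"AI-answered: EUR {path_cost(AI_PATH):.4f}, escalated: EUR {path_cost(HUMAN_PATH):.4f}")
print(f"Weighted average: EUR {average:.4f} per ticket")
print(f"Saving vs flagship: {1 - average / FLAGSHIP_COST:.0%}, "
      f"EUR {(FLAGSHIP_COST - average) * TICKETS_PER_DAY:,.0f}/day")
```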
The routing logic
A few approaches to implementing routing:
Approach 1: Hard-coded by task type
Simplest. You know what task you're calling, you pick the model.

    # openai_client and claude_client are assumed to be already-initialised SDK clients.
    def classify(text):
        return openai_client.chat.completions.create(
            model="gpt-5-nano",
            messages=[{"role": "user", "content": f"Classify: {text}"}],
        )

    def respond(context, query):
        return claude_client.messages.create(
            model="claude-4-sonnet",
            max_tokens=1024,  # required by the Anthropic Messages API
            messages=[{"role": "user", "content": f"Context: {context}\n\nQuery: {query}"}],
        )

Pros: Transparent, easy to debug, easy to change. Cons: Doesn't adapt to request complexity within a task type.
Approach 2: Router model
A small model classifies each request and routes it.

    ROUTER_PROMPT = """
    Classify this request as: trivial, moderate, or complex.
    Output one word.
    Request: {request}
    """

    def route(request):
        classification = small_model_call(ROUTER_PROMPT.format(request=request))
        # Normalise the reply; treat anything unexpected as "complex" so a
        # malformed router answer never raises a KeyError.
        label = classification.strip().lower()
        return MODEL_BY_COMPLEXITY.get(label, MODEL_BY_COMPLEXITY["complex"])

Pros: Adapts to complexity within a category. Cons: Adds latency (the router call), adds a failure point, requires tuning.
Approach 3: Embedding-based router
For requests that fall into known patterns, use embedding similarity to past examples.

    def route(request):
        embedding = embed(request)
        closest = find_nearest_example(embedding)
        return closest.suggested_model

Pros: Fast (just a vector lookup), gets smarter with more data. Cons: Requires building a labeled set of examples.
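To show the mechanics end to end, here is a self-contained sketch of nearest-example routing. `toy_embed` is a stand-in bag-of-words vector, not a real embedding model, and the labelled examples, similarity threshold, and default tier are hypothetical. In practice you would call an embedding model and query a vector index instead.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Hypothetical labelled examples: (past request, tier that handled it well).
EXAMPLES = [
    ("reset my password", "small"),
    ("where is my invoice", "small"),
    ("compare the pricing plans and recommend one for a team of fifty", "flagship"),
]
INDEX = [(toy_embed(text), tier) for text, tier in EXAMPLES]


def route(request: str, min_similarity: float = 0.3, default: str = "flagship") -> str:
    """Route to the tier of the most similar past example, or to the default."""
    query = toy_embed(request)
    score, tier = max(((cosine(query, vec), tier) for vec, tier in INDEX),
                      key=lambda pair: pair[0])
    return tier if score >= min_similarity else default


print(route("please reset my password"))
print(route("draft a migration plan for our data warehouse"))
```

Falling back to the most capable tier when nothing similar is found trades cost for safety; a cheaper default is reasonable if wrong answers are low-stakes.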
Approach 4: Cascade
Try cheap first; escalate if needed.

    def cascade(request):
        cheap = small_model_call(request)
        if validates(cheap):
            return cheap
        return flagship_call(request)

Pros: Adaptive, low average cost. Cons: Slow for cases that need escalation (two calls), requires reliable validation.
In practice, many production systems use a hybrid: hard-coded routing for the main task types, with cascades for specific high-variance subtypes.
The pitfalls
A few mistakes to avoid:
Pitfall 1: Optimising for cost while degrading quality
It's easy to route everything to small models and watch the cost drop. It's harder to notice that quality also dropped. Always pair routing changes with quality monitoring.
A useful discipline: when you move a task to a cheaper model, A/B test it for a week with quality metrics. Don't ship the change without evidence quality held up.
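A minimal sketch of the A/B mechanics, assuming each request has a stable ID and some numeric quality score per response. The function names and the 0.02 tolerated drop are hypothetical. The comparison is a plain difference of means, so with small samples you would want a proper significance test on top.

```python
import hashlib
from statistics import mean


def assign_arm(request_id: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign a request to 'control' or 'treatment'."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


def quality_held_up(control_scores, treatment_scores, max_drop=0.02):
    """True if the cheaper model's mean score is within max_drop of the control."""
    if not control_scores or not treatment_scores:
        raise ValueError("need scores for both arms")
    return mean(control_scores) - mean(treatment_scores) <= max_drop
```

Hashing the request ID (or user ID) keeps the assignment stable across retries, so the same request never flips between models mid-experiment.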
Pitfall 2: Over-engineering the router
A router that handles 100 task types with sophisticated logic is harder to maintain than the routing it replaces. Start simple. If the simple version is 80% as good as the sophisticated version, ship the simple version.
A common pattern: a 50-line router handling 5-10 task types covers 90% of the benefit. Beyond that, returns diminish.
Pitfall 3: Ignoring latency
Cheaper models are also usually faster — which is good. But if you cascade (try cheap, then flagship), you might double the latency for hard cases. For user-facing flows, this matters.
A useful pattern: for user-facing latency-sensitive responses, default to flagship and accept the cost. Save the cascading for batch or async work.
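The latency penalty can be estimated before committing to a cascade. The sketch below assumes the cheap model always runs and the flagship runs only on escalation. The function name and the example figures (cost units, seconds, a 20% escalation rate) are made up for illustration.

```python
def cascade_expectations(escalation_rate, cheap_cost, flagship_cost,
                         cheap_latency, flagship_latency):
    """Expected cost, expected latency, and worst-case latency of a two-step cascade."""
    expected_cost = cheap_cost + escalation_rate * flagship_cost
    expected_latency = cheap_latency + escalation_rate * flagship_latency
    worst_case_latency = cheap_latency + flagship_latency  # escalated requests pay both
    return expected_cost, expected_latency, worst_case_latency


# Illustrative figures: cheap call = 1 cost unit / 1 s, flagship = 30 units / 5 s.
cost, latency, worst = cascade_expectations(0.2, 1, 30, 1.0, 5.0)
print(f"expected cost {cost:.1f} units, expected latency {latency:.1f} s, "
      f"worst case {worst:.1f} s")
```

On cost, the cascade beats going straight to flagship whenever the escalation rate is below 1 - cheap_cost / flagship_cost. On latency, the worst case is always slower than the flagship alone, which is why it suits batch work better than interactive flows.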
Pitfall 4: Not handling provider failures
When you depend on multiple models, you have multiple ways to fail. A flagship model goes down, a rate limit kicks in, an API key expires. Your routing logic needs fallbacks.
Minimum: every "primary" model should have a "fallback" model from a different provider. Even if quality drops in fallback, the system stays up.
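This rule is easy to enforce with a small check at startup or in CI. The sketch assumes a hypothetical routing table keyed by task, with model IDs written as "provider/model"; that ID format and the pairings are illustrative, not a real convention of any SDK.

```python
# Hypothetical routing table: task -> (primary, fallback), each "provider/model".
ROUTES = {
    "classification": ("openai/gpt-5-nano", "anthropic/claude-4-haiku"),
    "user-facing-response": ("anthropic/claude-4-sonnet", "openai/gpt-5-mini"),
    "summarization": ("anthropic/claude-4-haiku", "anthropic/claude-4-sonnet"),
}


def provider_of(model_id: str) -> str:
    return model_id.split("/", 1)[0]


def same_provider_routes(routes):
    """Tasks whose fallback would go down together with the primary."""
    return sorted(task for task, (primary, fallback) in routes.items()
                  if provider_of(primary) == provider_of(fallback))


print(same_provider_routes(ROUTES))
```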
Pitfall 5: Not measuring per-route quality
You need to know which route is doing well and which isn't. This means evaluation, ideally automated.
A useful setup: for every production call, log the model used, the request, the response, and (where possible) some quality signal (user feedback, downstream metrics, automated eval). Roll up per-route metrics. Catch quality drift before users complain.
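A minimal in-memory sketch of that setup, assuming a quality signal normalised to 0.0-1.0 and a hypothetical alert floor of 0.8. A real system would write the records to a log store and run the rollup on a schedule.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class CallRecord:
    route: str                       # e.g. "classification"
    model: str
    request: str
    response: str
    quality: Optional[float] = None  # 0.0-1.0 signal; None when nothing was collected


def rollup(records):
    """Aggregate call counts and mean quality per route."""
    totals = defaultdict(lambda: {"calls": 0, "scored": 0, "quality_sum": 0.0})
    for record in records:
        bucket = totals[record.route]
        bucket["calls"] += 1
        if record.quality is not None:
            bucket["scored"] += 1
            bucket["quality_sum"] += record.quality
    return {
        route: {
            "calls": b["calls"],
            "mean_quality": b["quality_sum"] / b["scored"] if b["scored"] else None,
        }
        for route, b in totals.items()
    }


def routes_below(summary, floor=0.8):
    """Routes whose mean quality is known and under the floor."""
    return sorted(route for route, stats in summary.items()
                  if stats["mean_quality"] is not None and stats["mean_quality"] < floor)
```

Routes with no quality signal at all report `None` rather than a score, which makes gaps in evaluation coverage visible instead of hiding them behind a default.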
Where this is going
A few trends to expect:
Auto-routing as a service. Tools like OpenRouter, Helicone, Portkey, and others increasingly offer "smart routing" — they choose the model for you based on configurable rules. Expect this to mature significantly.
Per-model specialisation increasing. Models specialised for code, for math, for specific domains. Routing will increasingly include specialty models.
Cost continuing to drop. Models in 2026 are roughly 10x cheaper than models of equivalent quality were in 2024. By 2028, expect another 10x. The economics of multi-model orchestration will keep improving.
On-device tiers. Phones and laptops with local AI capabilities will offer a "free" tier for some requests. Routing logic will increasingly include "stay on device if possible."
Standardised APIs across providers. OpenAI-compatible APIs (already widely adopted) mean swapping providers is increasingly trivial. Expect more standardisation, which makes multi-provider strategies easier.
A starter checklist
If you're building a multi-model system from scratch, or migrating from single-model:
- Map your tasks. What types of LLM calls does your app make? Roughly how often? Roughly how expensive?
- Categorise by complexity. For each task type, decide: trivial, moderate, or complex. Match to model tier.
- Build a router. Start with hard-coded task-based routing. Don't over-engineer.
- Add fallbacks. Every primary model should have a fallback (different provider).
- Measure quality per route. Set up logging and basic evaluation. You need to know if quality holds.
- Iterate. Move tasks to cheaper models where quality holds. Move tasks back to expensive models where quality breaks. Adjust over time.
- Don't stop tuning. Models change. New ones launch. Prices shift. A routing setup that's optimal in May 2026 may be suboptimal in November 2026.
The takeaway
Multi-model orchestration is one of the highest-ROI changes you can make to a production AI system. Done well, it cuts costs by 60-90% while often improving quality (because each task uses a model suited to it).
The technical bar is low — basic routing logic is a few dozen lines of code. The discipline bar is higher: you need to measure quality continuously to make sure the routing decisions hold up.
Stop using one model for everything. Map your tasks. Pick the right model for each. Measure the result. Iterate. The savings are real and the quality improvements are usually a bonus.