The 2026 LLM stack: models, inference, tooling, and trade-offs
A working architect's view of the 2026 LLM stack — the model tiers, inference providers, orchestration layers, evaluation tooling, and the trade-offs that actually matter when shipping production AI. Everything you wish someone had laid out before you started.
If you're building real AI products in 2026, you no longer ask "should I use OpenAI or Anthropic?" That framing is two years stale. You're making dozens of decisions across a layered stack, and most of them matter.
This article is a working architect's view of the 2026 LLM stack — what's at each layer, what trade-offs you're making, and where the field is heading. It's the article we wish someone had written for us before we made every avoidable mistake.
The layers
The stack, roughly:
┌────────────────────────────────────┐
│ Application Layer │ Your product / agent / workflow
├────────────────────────────────────┤
│ Orchestration / Frameworks │ LangGraph, CrewAI, custom, direct
├────────────────────────────────────┤
│ Prompt + Context Management │ Prompt templates, context engineering
├────────────────────────────────────┤
│ Retrieval / Memory │ RAG, vector stores, structured memory
├────────────────────────────────────┤
│ Tool / MCP Layer │ Tool calling, MCP servers, function APIs
├────────────────────────────────────┤
│ Model Layer │ Specific model selection, routing
├────────────────────────────────────┤
│ Inference Layer │ Hosted APIs, self-hosted, edge
├────────────────────────────────────┤
│ Observability / Evals │ Logging, tracing, eval suites
└────────────────────────────────────┘Each layer has multiple viable options. The choice at one layer constrains options at others. Decisions early on are sticky — model choice influences inference choice influences orchestration choice.
We'll go through each.
Layer 1: The model layer
Models in 2026 cluster into rough tiers, and selecting from each is the most consequential per-call decision in your system.
Flagship reasoning models. GPT-5 reasoning variants, Claude 4 Opus with extended thinking, o3, Gemini 3 Pro with thinking, DeepSeek R2. Excellent at multi-step problems, math, code, complex analysis. Expensive (€2-15 per million input tokens, more for output), slower (5-60 seconds). Use them when reasoning quality is the bottleneck.
Flagship general models. GPT-5, Claude 4 Sonnet, Gemini 3 Pro, Grok 4. Excellent at most knowledge work, fast (2-5 seconds), moderately expensive (€0.5-3 per million input tokens). The default for high-quality user-facing responses.
Mid-tier models. GPT-5 Mini, Claude 4 Haiku, Gemini 3 Flash, Mistral Medium. Good at simple-to-moderate tasks, fast (1-2 seconds), cheap (€0.10-0.50 per million tokens). Used heavily in production for classification, extraction, simple generation.
Small / nano models. GPT-5 Nano, Claude Haiku Lite, Gemini Flash Lite, smaller open-source models. Sufficient for narrow, structured tasks. Very cheap (€0.02-0.10 per million tokens), very fast (<1 second). Use for routing, scoring, batch processing.
Specialised models. Embedding models, reranking models, vision models, voice models, code-specialised models. Cheaper than general models for their specific tasks and usually better. Always consider for the relevant task.
Open-source frontier. Llama 4, DeepSeek V3 / R2, Qwen 3, Mistral Large. Hosted on inference providers (Groq, Together, Fireworks) or self-hosted. Pricing and quality competitive with closed frontier models for many tasks; lag for some (especially long-horizon reasoning).
The implications:
- One model does not fit all calls in your system. Routing is mandatory for cost (see Layer 5).
- The frontier is moving every quarter. Build to swap models, not lock in.
- Open-source is now genuinely viable for many production use cases, not just experiments.
Layer 2: The inference layer
Where does your model actually run?
Closed API providers — OpenAI, Anthropic, Google. Fastest path to running, best models, most reliable. You pay a premium and accept the data/security model.
Open-source inference providers — Groq, Together AI, Fireworks, Anyscale, Replicate, OctoAI. Run open models with various tuning. Often much faster than self-hosting; competitive prices.
Cloud-native — AWS Bedrock, Azure OpenAI, Google Vertex. Wraps closed and open models with your cloud's auth, billing, compliance. Necessary for many enterprise contexts.
Self-hosted — vLLM, TGI, SGLang, LMDeploy on your own GPUs. Lowest per-token cost at scale. Highest operational complexity. Usually only worth considering once your inference spend is in the high single-digit €K/month range (see the self-hosted vs hosted article for the break-even math).
Edge / on-device — Apple Intelligence, MediaPipe, ONNX, GGUF models via Ollama or llama.cpp. Free per call, but constrained model capability. Increasingly viable for narrow use cases.
The trade-offs:
- Latency matters: voice agents, conversational UX need fast first-token. Groq, Cerebras, and on-device dominate here.
- Throughput matters for batch: if you process millions of records, you want high-throughput, not low-latency.
- Compliance matters: GDPR, HIPAA, SOC 2 often dictate which providers and regions you can use.
- Vendor risk matters: depending solely on one provider is a single point of failure. Multi-provider is good hygiene.
A common 2026 pattern: hosted closed models for the highest-quality user-facing requests, hosted open-source for high-volume cheaper work, on-device for narrow latency-sensitive features. Self-hosting only when scale and economics justify the operational burden.
Layer 3: Tools and MCP
LLMs alone can't do much. They become useful when they can call tools — functions you define that give them access to data, APIs, and actions.
Native function calling. Every major model supports a structured function-calling API. You define functions with JSON schemas; the model decides when to call them; you execute the call; you return results.
MCP (Model Context Protocol). A standardized protocol (introduced by Anthropic, now widely adopted including by OpenAI, Cursor, and others) for tool servers. An MCP server exposes tools; an MCP client (an LLM agent) connects and uses them. Decouples tool implementation from any specific model.
Direct integrations. For high-volume specific use cases (e.g., specific CRM, specific database), often easier to write a direct adapter than a generic MCP server.
The 2026 trend is clear: MCP is winning as the standard. Most new tool development should target MCP. Direct integrations remain useful for performance-sensitive paths.
A few implementation realities:
- Tool descriptions matter enormously. A poorly described tool will not be used correctly. Tool docstrings should be written like prompts.
- Tool count matters. Models with 50+ tools available perform worse than ones with 5-10 relevant tools. Curate aggressively.
- Error handling matters. Tool errors need to be communicated to the model in a structured way so it can adapt.
- Authorization is hard. A multi-user system where the LLM has different permissions for different users is non-trivial. Don't let the LLM make authorization decisions; do it in the tool wrapper.
Layer 4: Retrieval and memory
LLMs need data they weren't trained on. This is the retrieval layer.
Vector databases. Pinecone, Weaviate, Qdrant, Chroma, PostgreSQL with pgvector, Turbopuffer. Stores embeddings; serves nearest-neighbor queries. Mature, well-understood. The default for semantic retrieval.
Hybrid search. Combines vector search with traditional BM25 keyword search. Catches both semantic and lexical matches. Use Reciprocal Rank Fusion to combine. Tools: Elasticsearch, OpenSearch, Vespa.
Knowledge graphs. Neo4j, Memgraph, custom triple stores. For data with rich relationships. Used in graph RAG architectures. More work to build, often higher quality for relationship-heavy domains.
Specialised RAG platforms. LlamaIndex (now mature), LangChain RAG abstractions, Haystack. Higher-level frameworks for common patterns.
Reranking. Cohere Rerank, Voyage, custom cross-encoders. After initial retrieval, rerank the top candidates with a more expensive model for accuracy. Usually 2-3x improves retrieval quality.
Memory. For agents and conversations, structured memory layers — Mem0, Letta (formerly MemGPT), or custom. Distinguish short-term (current conversation), medium-term (recent topics), long-term (durable facts about the user/account).
The architectural question: where does this layer live?
- In-app: the LLM call is wrapped in retrieval logic written by your team.
- At the MCP layer: retrieval exposed as tools.
- As a service: a dedicated retrieval service your apps call.
For monolithic single-product systems, in-app is fine. For multi-product organizations, treating retrieval as a service (with consistent quality and policy) pays off.
Layer 5: Prompt and context engineering
In 2026, "prompt engineering" is mostly synonymous with "context engineering" — managing what goes into the context window for each call.
The components:
Prompts. Often templated with variables. Stored in version control. Tested with eval suites. Treated like code.
Prompt management. Tools like Promptfoo, Langfuse, PromptLayer, or in-house systems. Versioning, A/B testing, rollback. (Helicone and similar LLM proxies belong in the observability layer below, not here — easy to conflate the two categories.)
Context strategy. Decisions about what to include in each call:
- System prompt (stable, defines behavior).
- Retrieved knowledge (dynamic, from RAG).
- Conversation history (managed, often summarized at length).
- Few-shot examples (chosen dynamically based on the query).
- Tool descriptions (filtered to relevant tools only).
- The user's current query.
Context compression. As context grows long, the model degrades. Strategies: summarize old turns, extract key facts to a structured memory, prune irrelevant content. Active research area.
Long-context use. 1M-token windows are available in 2026 (Gemini, GPT-5). They work, but "context rot" is real — quality degrades on long inputs even when the model technically supports them. Use long context carefully; don't dump everything just because you can.
Layer 6: Orchestration
How do you coordinate multi-step LLM workflows and agents?
Direct API. Just write the loop yourself in Python or TypeScript. Best for simple cases and for understanding what's actually happening.
LangChain / LangGraph. Widely used. LangGraph (state machine for agents) has matured significantly. Heavy abstractions, learning curve, but powerful.
CrewAI. Multi-agent framework focused on role-based agents. Easier to start than LangGraph; less flexible.
LlamaIndex agents. Especially strong for RAG-heavy workflows.
OpenAI Agents SDK. Simpler, more opinionated, optimized for OpenAI models.
Anthropic Claude SDK. Similar; optimized for Claude.
Custom. For mature teams shipping production agents, custom orchestration is common — frameworks impose costs (abstraction tax, debugging complexity, version churn) that outweigh benefits.
A 2026 pattern: prototype in a framework; rewrite in custom code for production. The frameworks help you discover the patterns; once you know them, direct code is simpler and more reliable.
Layer 7: Observability
You cannot ship serious LLM applications without observability. Every production system needs:
Tracing. Every LLM call captured: timestamp, model, input, output, latency, cost, success/failure. Trees for multi-step traces.
Cost tracking. Per-call, per-feature, per-user. Costs are large and unbounded; without tracking you find out at the end of the month.
Quality monitoring. Automated quality checks on a sample of production traffic. Alerts on quality drops.
User feedback capture. Thumbs up/down, explicit feedback, implicit signals (retry rate, abandonment).
Debugging. When something breaks, you need to see the full call chain. A failed agent run has many possible failure points.
Tools: LangSmith, Helicone, Arize, Phoenix, Braintrust, Weights & Biases, Datadog LLM Observability. Each has different strengths; pick one early and stick with it.
For small teams: even a simple Postgres table with one row per LLM call gets you 80% of what you need. Move to a tool when scale or feature needs justify it.
Layer 8: Evals
The single most important layer for serious production work.
Offline evals. A defined dataset; expected outputs; scoring. Run before deploying changes. Catches regressions. (We covered this in detail in the intermediate level.)
Online evals. Sample of production traffic scored automatically (LLM-as-judge) or via user signals. Catches drift.
Pre-deployment evals. Before any prompt or model change goes live, eval suite runs and is reviewed. Becomes part of CI.
Eval taxonomy. Different evals for different concerns:
- Behavioral: does it do what we expect?
- Safety: does it refuse what we want refused?
- Quality: how good is the output?
- Robustness: how does it handle adversarial inputs?
- Cost/latency: are we within budget?
Tools: Promptfoo, Braintrust, LangSmith, custom suites. All have place; Promptfoo is the easiest start.
Layer 9: The application layer
This is where your specific product lives. The decisions here:
Agent vs workflow. Agents (LLM in a loop with tools) are powerful but harder to make reliable. Workflows (fixed sequence of LLM calls) are easier and often sufficient. Default to workflows; reach for agents when truly needed.
Synchronous vs asynchronous. User-facing real-time? Batch background? Streamed? Affects model choice, infrastructure choice, UX design.
Single-tenant vs multi-tenant. Customer-specific data isolation requirements drive significant architecture decisions.
On-prem vs cloud. Compliance, security, or cost may push you on-prem. Operational complexity is much higher.
Edge cases. Hallucinations, prompt injections, abuse. Production systems need guardrails. Don't ship without them.
Trade-offs that matter
A few trade-offs worth being explicit about:
Quality vs cost vs latency
The fundamental triangle. You can usually optimize two; the third gets worse.
- High quality + low latency = expensive.
- Low cost + low latency = lower quality.
- High quality + low cost = high latency (batch processing, or reasoning models).
Pick your priorities per task. Don't optimize all three; that path leads to mediocrity in all.
Build vs buy
For each layer, you can build or buy.
- Build: more control, more maintenance, more cost (engineer time), differentiating capabilities.
- Buy: faster start, less control, ongoing vendor risk, undifferentiating capabilities offloaded.
A good heuristic: buy the commodity layers (vector storage, basic observability), build the differentiating layers (your specific orchestration, your prompts, your evals). Reversing this — buying your differentiation and building your commodity infrastructure — is a common mistake.
Open-source vs closed
A 2026 reality: open-source models are competitive for many tasks. For some tasks they're better (faster, cheaper). For others (long-horizon reasoning), closed frontier models still lead.
The decision factors:
- Quality requirements. For the bleeding edge of any task, closed models still win.
- Cost at scale. Open-source self-hosted gets cheap at high volume.
- Privacy/compliance. Self-hosted on your infra often required for sensitive data.
- Customization. Fine-tuning, custom training requires open-source.
- Operational capacity. Closed APIs are operationally trivial; self-hosted is significant work.
Most production systems in 2026 are hybrid — closed for some calls, open for others, based on the calculus per call.
Latency vs reasoning depth
Reasoning models (o3, Claude with extended thinking) trade latency for quality on hard problems. Sometimes that's worth it; sometimes the user can't wait 30 seconds.
A pattern: route simple queries to fast models, hard queries to reasoning models. Use a router (small model or heuristic) to decide.
Long context vs RAG
You can stuff context into the model (using a million-token window) or you can retrieve relevant chunks (using RAG).
- Long context: simpler, no retrieval infrastructure, but expensive per call and "context rot" is real.
- RAG: cheaper per call, more setup, retrieval quality is its own engineering problem.
The mature 2026 answer: usually RAG for production; long context for prototyping, special one-off tasks, or where retrieval quality is bad enough that retrieval ruins the result.
Agents vs workflows
Discussed above. Default to workflows; use agents when you genuinely need the flexibility. Many "agent" systems we see should be workflows.
A 2026 reference architecture
To make all this concrete, here's what a typical production system looks like for a mid-sized SaaS product with AI features:
User → Application (React/Next.js)
↓
API gateway / auth
↓
LLM Service (your wrapper)
↓
Router (small model or heuristic)
├→ Simple tasks: GPT-5 Mini or Claude Haiku
├→ Standard tasks: Claude 4 Sonnet or GPT-5
├→ Hard tasks: Claude 4 Opus or o3
└→ Special: vision/voice/embedding specialists
↓
Tool layer (MCP servers + direct integrations)
↓
Retrieval layer (Pinecone + hybrid + reranker)
↓
Observability (Helicone or LangSmith)
↓
Eval suite (Promptfoo, runs in CI)Cost per active user/month: typically €1-10 depending on usage intensity. Engineering effort to build: 6-12 weeks for an experienced team. Operational cost: low to moderate depending on traffic.
What I've seen go wrong
Patterns of failure we see repeatedly:
Pattern 1: Single model for everything. Cost overruns, quality issues. Fix: routing.
Pattern 2: No observability. Can't debug, can't measure, can't improve. Fix: instrument early.
Pattern 3: No evals. Quality drifts unnoticed. Fix: evals from day one.
Pattern 4: Framework lock-in. LangChain or CrewAI debugging becomes a full-time job. Fix: don't use frameworks unless they save more than they cost. Rewrite to direct code when patterns are clear.
Pattern 5: Building infrastructure that should be bought. Custom vector DB? Probably wasted time. Custom observability? Probably wasted time. Buy the commodity layers.
Pattern 6: Buying infrastructure that should be built. Outsourcing your prompts to a third party. Outsourcing your evals. These are your competitive moat; own them.
Pattern 7: Ignoring prompt injection. Production system without input sanitization for user-provided content. Big risk; mitigate early.
Pattern 8: Trusting agents in high-stakes flows. A LangGraph agent that authorizes refunds with no human review. This will eventually go wrong. Add human-in-loop for consequential actions.
Pattern 9: Optimizing for the wrong thing. Optimizing inference cost when total cost is dominated by engineering time. Or optimizing latency when users don't notice. Measure what actually matters.
Pattern 10: No multi-provider plan. When (not if) your primary provider has an outage, you're down. Have a fallback configured.
What I expect to change
Looking forward 12-18 months:
- Open-source closes more gaps. Expect open-source frontier to be within 10-20% of closed-source on most tasks, dramatically cheaper.
- Inference costs continue to drop. Per-token costs falling 5-10x per year. Architectures that are cost-prohibitive today become viable.
- Agents become more reliable. Better long-context handling, better tool use, better self-correction. Production agent use cases expand.
- MCP becomes ubiquitous. Every tool in 2027 is reachable by every AI agent via MCP. Walled gardens lose.
- On-device improves. Phone and laptop AI hits good-enough quality for many tasks. Hybrid on-device/cloud architectures become common.
- Standardisation increases. Today's bespoke architectures become standardised. Less custom plumbing, more focus on differentiation.
The takeaway
The 2026 LLM stack is real, layered, and the choices matter. The teams winning are those who:
- Understand the full stack, not just the parts they touch.
- Make explicit trade-offs (quality, cost, latency) per call.
- Build the parts that differentiate; buy the parts that don't.
- Instrument from day one (observability, evals).
- Stay nimble (model-portable, multi-provider).
The teams losing are those who picked one vendor, hard-coded its API, never instrumented, never measured, and now find themselves with a system that's expensive, brittle, and impossible to improve.
Get the architecture right. Everything else gets easier.