Choosing between prompting, RAG, and fine-tuning (and when to combine)
Prompting, RAG, and fine-tuning are the three big levers for adapting LLMs to your problem. Each is right for some problems and wrong for others. Here is a framework for choosing among them, the realistic costs of each, and the production patterns where combining them pays off.
A team gets the brief: "Make our AI better at handling our specific use case." They have a few choices on how to do it. They can write better prompts. They can build a RAG system to feed the model relevant data. They can fine-tune the model on examples of their domain.
These are not interchangeable. They solve different problems. Choose wrong, and you spend three months and a budget fine-tuning when the answer was a better prompt. Or you build elaborate RAG infrastructure when fine-tuning would have been simpler. Or you stay stuck on prompts when the model fundamentally can't do what you need.
This article is a framework for choosing — when each is the right tool, when to combine them, the realistic costs of each, and what we've seen go right and wrong in production.
What each actually does
A clean distinction:
Prompting changes what the model is asked. You give the model better instructions, examples, format requirements, context. The model itself doesn't change; the input does.
RAG changes what data the model sees. You retrieve relevant information at query time and include it in the prompt. The model has fresh, specific, dynamic data without being trained on it.
Fine-tuning changes what the model knows or how it behaves. You train the model on examples, modifying its weights. The model itself is updated.
These solve different problems:
- Instructional gap: the model could do the task if asked correctly. → Prompting.
- Knowledge gap: the model needs information it doesn't have. → RAG.
- Capability gap: the model can't reliably do the task even with good prompts and context. → Fine-tuning.
Knowing which gap you have is half the battle.
Diagnosing the gap
When the AI isn't doing what you need, ask:
Could a smart person, given just the prompt, do this task?
If yes → instructional gap. Better prompt should fix it.
If no, could they do it if you gave them relevant reference material?
If yes → knowledge gap. RAG can fix it.
If no, could they do it after extensive practice and feedback?
If yes → capability gap. Fine-tuning might fix it.
If no → maybe the task isn't solvable by an LLM. Reconsider the problem.
Most "the AI doesn't work" problems are instructional gaps — better prompting solves them. The next most common are knowledge gaps. Genuine capability gaps are the smallest category but the hardest to address.
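The three diagnostic questions can be written down as a small decision function. This is only a sketch of the flow described above; the names `Gap` and `diagnose_gap` are illustrative, and the boolean answers still have to come from a human judgment about the task.

```python
from enum import Enum


class Gap(Enum):
    INSTRUCTIONAL = "instructional gap: improve the prompt"
    KNOWLEDGE = "knowledge gap: add retrieval (RAG)"
    CAPABILITY = "capability gap: consider fine-tuning"
    UNSOLVABLE = "possibly not an LLM problem: reconsider the task"


def diagnose_gap(
    solvable_from_prompt: bool,
    solvable_with_references: bool,
    solvable_with_practice: bool,
) -> Gap:
    """Walk the three questions in order and stop at the first 'yes'."""
    if solvable_from_prompt:
        return Gap.INSTRUCTIONAL
    if solvable_with_references:
        return Gap.KNOWLEDGE
    if solvable_with_practice:
        return Gap.CAPABILITY
    return Gap.UNSOLVABLE


if __name__ == "__main__":
    # A smart person could not do the task from the prompt alone,
    # but could with the reference material in hand.
    print(diagnose_gap(False, True, True).value)
```

The order of the checks matters: a task that could be fixed by a better prompt is reported as an instructional gap even if references or practice would also help, which mirrors the "cheapest lever first" advice.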
Prompting: the underrated lever
Prompting is the cheapest and fastest lever, and most often the right answer. Yet teams routinely skip past it to RAG or fine-tuning.
A few things you can do with prompting alone:
- Change tone, format, length.
- Apply reasoning patterns (chain-of-thought, self-critique).
- Encode constraints (do X, don't do Y).
- Encode policies and guardrails.
- Adapt to specific use cases (different prompts for different features).
- Improve consistency through few-shot examples.
What you can't do with prompting alone:
- Get the model to know facts it doesn't.
- Make a small model behave like a large model.
- Fundamentally change the model's voice or style on a deep level.
- Speed up the model's inference.
A reasonable rule: try prompting first and iterate for at least a week before reaching for RAG or fine-tuning. Most of the time, you'll find prompting solves the problem.
Prompt engineering effort
A week of intensive prompt iteration can produce dramatic improvements. The typical curve:
- Day 1: baseline. Mediocre results.
- Day 2-3: structural changes. Better format, clearer instructions. Big improvements.
- Day 4-5: examples and edge cases. Catches failure modes.
- Day 6-7: tone, constraints, polish. Final 10% improvement.
After a week, you've extracted most of what prompting can give. If you're still not happy, the gap is probably knowledge or capability.
What good prompts look like
For reference, a strong prompt typically has:
- Clear role and task.
- Specific format requirements.
- 1-5 representative examples (if needed).
- Explicit constraints (what to do, what not to do).
- Edge case handling.
- Output schema.
Strong prompts typically run 5-10 paragraphs: not too short (under-specified), not too long (the model loses focus).
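One way to keep those elements from getting lost during iteration is to assemble the prompt from named parts. The sketch below is a minimal illustration; `PromptSpec` and `build_prompt` are hypothetical names, and the section order is a choice, not a rule.

```python
from dataclasses import dataclass, field


@dataclass
class PromptSpec:
    role: str
    task: str
    format_rules: list[str]
    constraints: list[str]
    edge_cases: list[str]
    output_schema: str
    # Optional few-shot examples as (input, output) pairs.
    examples: list[tuple[str, str]] = field(default_factory=list)


def _bullets(items: list[str]) -> str:
    return "\n".join(f"- {item}" for item in items)


def build_prompt(spec: PromptSpec) -> str:
    """Join the named sections into one prompt string, in a fixed order."""
    sections = [
        f"Role: {spec.role}",
        f"Task: {spec.task}",
        "Format:\n" + _bullets(spec.format_rules),
        "Constraints:\n" + _bullets(spec.constraints),
        "Edge cases:\n" + _bullets(spec.edge_cases),
        "Output schema:\n" + spec.output_schema,
    ]
    if spec.examples:
        shots = "\n\n".join(
            f"Input: {inp}\nOutput: {out}" for inp, out in spec.examples
        )
        sections.append("Examples:\n" + shots)
    return "\n\n".join(sections)
```

Keeping each element as a separate field makes it easier to change one thing at a time and see which change moved your evals.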
RAG: the knowledge fix
RAG is the right tool when:
- The model needs factual information it doesn't have.
- The information changes (live data, recent events, account-specific data).
- The information is specific to your domain or organization.
- You need citations / provable grounding.
It's the wrong tool when:
- The problem is instructional, not knowledge.
- The data is small enough to fit in a prompt directly.
- You need the model to do something differently, not just know something different.
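To make the mechanism concrete, here is a toy retrieve-then-prompt sketch. It uses word-overlap cosine similarity as a stand-in for a real embedding model and vector database, and the documents, function names, and prompt wording are all illustrative assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase bag of words; a real system would use an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = tokenize(query)
    ranked = sorted(docs, key=lambda d: cosine(q, tokenize(d)), reverse=True)
    return ranked[:k]


def build_rag_prompt(query: str, docs: list[str], k: int = 2) -> str:
    """Put the retrieved passages into the prompt, numbered for citation."""
    hits = retrieve(query, docs, k)
    context = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(hits))
    return (
        "Answer using only the numbered context. Cite sources like [1].\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical knowledge base.
DOCS = [
    "Refund requests are accepted within 30 days of purchase.",
    "Our office is closed on public holidays.",
    "Enterprise plans include single sign-on and audit logs.",
]

if __name__ == "__main__":
    print(build_rag_prompt("How many days do I have to request a refund?", DOCS, k=1))
```

The model is never retrained here: the only thing that changes between queries is which passages land in the prompt.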
The realistic cost
A RAG system is real engineering:
- Build: 4-12 weeks for a serious one (ingestion, chunking, embedding, retrieval, reranking, eval).
- Operate: ongoing — keeping the index updated, monitoring quality, fixing issues.
- Infrastructure: vector DB, embedding API costs, reranking costs. Typically €200-2000/month for moderate-scale systems.
- Per-query cost: higher than prompting alone (additional embedding and retrieval calls plus a larger context). Usually 2-5x the cost of a plain API call.
This is well worth it for the right problems. But it's a significant investment compared to prompting.
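A quick back-of-the-envelope calculation shows how the per-query multiplier and the fixed infrastructure cost add up. The function name and every number in the usage example are hypothetical placeholders; substitute your own volumes and prices.

```python
def monthly_costs(
    queries_per_month: int,
    base_cost_per_query: float,
    rag_multiplier: float,
    infra_per_month: float,
) -> dict[str, float]:
    """Compare monthly spend for prompting alone versus prompting plus RAG."""
    prompting = queries_per_month * base_cost_per_query
    # RAG pays the multiplied per-query cost plus fixed infrastructure.
    rag = prompting * rag_multiplier + infra_per_month
    return {"prompting": prompting, "rag": rag, "difference": rag - prompting}


if __name__ == "__main__":
    # Assumed inputs: 100,000 queries, EUR 0.002 per plain call,
    # a 3x multiplier, and EUR 500 per month of infrastructure.
    for name, value in monthly_costs(100_000, 0.002, 3.0, 500.0).items():
        print(f"{name}: EUR {value:,.2f}")
```

At low volume the fixed infrastructure dominates; at high volume the per-query multiplier does, which is why both numbers belong in the estimate.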
RAG quality is a journey
A working RAG system in week 1 is usually 60-70% quality. Getting to production-grade quality (85%+) takes another 1-2 months of work: improving chunking, adding reranking, fixing failure modes, building evals.
Plan for this. Don't ship at week 1; you'll have angry users.
Fine-tuning: when prompting and RAG aren't enough
Fine-tuning is the right tool when:
- You have a clear capability gap — the model can't reliably do the task even with good prompts and context.
- You have a good batch of high-quality training examples — at minimum a few hundred for a very narrow LoRA, more typically 1,000+ for reliable general behavior. See the fine-tuning article for the exact thresholds by technique.
- You need consistent, narrow behavior (a specific style, a specific output format, a specific domain).
- Inference cost / latency matters (a fine-tuned smaller model can be cheaper than a generic larger one).
It's the wrong tool when:
- Your data is in flux (the fine-tuned model will be stale fast).
- You haven't first exhausted prompting and RAG.
- You don't have good evals (you can't tell if fine-tuning helped).
- The task needs very current information (fine-tuning is a snapshot).
- You're trying to teach facts (RAG does this better, more reliably, with citation).
Types of fine-tuning
Full fine-tuning: all model weights updated. Most powerful, most expensive. Requires significant compute. Usually reserved for foundation model labs.
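LoRA (Low-Rank Adaptation): the base weights stay frozen and only small low-rank adapter matrices, added alongside selected layers, are trained. Much cheaper. Often produces results competitive with full fine-tuning for narrow tasks.
QLoRA: LoRA adapters trained on top of a quantized (typically 4-bit) frozen base model. Even cheaper, mainly in memory. Can lose some quality at scale but is reasonable for many tasks.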
Prompt tuning / prefix tuning: even smaller; only soft prompts trained. Cheapest. Limited capability.
Instruction tuning: training the model to follow instructions. Usually done at the foundation level; rarely useful for end users.
RLHF / DPO / KTO: training the model to align with preference data (typically response A vs B comparisons; KTO works from individual good/bad labels). Powerful for behavioral changes; complex to do well.
In 2026, most teams doing fine-tuning use LoRA on top of a strong base model. It's the right balance of cost and capability for most use cases.
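The cost advantage of LoRA comes from simple arithmetic. For one frozen weight matrix of shape d_out x d_in, LoRA trains two small matrices of rank r, so the trainable parameter count is r * (d_in + d_out) instead of d_out * d_in. The sketch below counts parameters for a single layer; the layer size and rank are illustrative assumptions.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters updated when the whole weight matrix is trained."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors: (d_out x rank) and (rank x d_in)."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    d_out, d_in, rank = 4096, 4096, 8  # assumed layer size and rank
    full = full_params(d_out, d_in)
    lora = lora_trainable_params(d_out, d_in, rank)
    print(f"full: {full:,}  lora: {lora:,}  fraction: {lora / full:.4%}")
```

For a 4096 x 4096 layer at rank 8, that is 65,536 trainable parameters against 16,777,216, or 1/256 of the layer.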
The realistic cost
Fine-tuning costs depend on approach and scale, but typical for a LoRA fine-tune of a mid-size open model on 5K-10K examples:
- Data prep: 1-4 weeks. Often the bulk of the work. Curating, cleaning, formatting examples.
- Training: hours to days, depending on dataset size and infrastructure. €100-2000 in compute.
- Eval: 1-2 weeks. Building eval suites, comparing fine-tuned vs base.
- Iteration: 1-3 cycles before something is production-ready.
- Deployment: if using a managed API (OpenAI fine-tuning, Anthropic, Vertex), straightforward. If self-hosting, more work.
- Maintenance: retraining when data updates, when the base model updates, when the use case shifts.
Total: 6-12 weeks of work, €1K-€20K in compute (depending on scale), ongoing maintenance.
Significant investment. Make sure it's worth it.
When fine-tuning shines
Specific scenarios where fine-tuning clearly wins:
Strict format requirements. Outputs must follow a specific schema or style consistently. Prompting can get you 95% there; fine-tuning gets you 99%.
Specialized domains. Medical terminology, legal phrasing, code in an internal DSL. Fine-tuning teaches the model your specific dialect.
Personality/voice. A consistent voice across thousands of interactions. Prompts can drift; fine-tuning bakes it in.
Latency/cost optimization. A fine-tuned 7B model that handles your specific task can be cheaper and faster than a generic 70B model. At high volume, this pays off.
Behavioral safety. Fine-tuning the model to refuse certain things, or to add specific safeguards, can be more robust than prompt-based guardrails.
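For the latency/cost scenario, the question is how many queries it takes for the per-query saving to repay the one-time fine-tuning effort. A minimal sketch, with a hypothetical function name and placeholder prices:

```python
import math
from typing import Optional


def break_even_queries(
    one_time_cost: float,
    large_cost_per_1k: float,
    small_cost_per_1k: float,
) -> Optional[int]:
    """Queries needed before the smaller fine-tuned model pays for itself.

    Costs are per 1,000 queries. Returns None if the small model is not cheaper.
    """
    saving_per_1k = large_cost_per_1k - small_cost_per_1k
    if saving_per_1k <= 0:
        return None
    return math.ceil(one_time_cost / saving_per_1k * 1000)


if __name__ == "__main__":
    # Assumed: EUR 10,000 all-in fine-tuning cost, EUR 10 per 1k queries on the
    # large model, EUR 2 per 1k queries on the fine-tuned small model.
    print(break_even_queries(10_000, 10.0, 2.0))
```

Remember to fold people time and ongoing retraining into `one_time_cost`, not just training compute, or the break-even point will look better than it is.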
When fine-tuning fails
Common ways fine-tuning disappoints:
Insufficient data. Fine-tuning on 100 examples usually doesn't help much. A very narrow LoRA can sometimes work with a few hundred examples; for reliable general behavior, plan for 1,000+ high-quality examples.
Bad data. Garbage in, garbage out. Inconsistent, low-quality examples produce inconsistent, low-quality models.
Catastrophic forgetting. Heavy fine-tuning on narrow tasks can hurt general capabilities. The model gets good at your task but worse at everything else.
Stale knowledge. A fine-tuned model is a snapshot. New information requires retraining. For dynamic domains, this is a perpetual cost.
Base model improvements outpace fine-tuning. Newer base models improve enough that your fine-tune is no longer better, and you end up maintaining a fine-tune of an outdated base.
Evaluation problems. Without solid evals, you don't know if fine-tuning helped, hurt, or had no effect. Many "successful" fine-tunes are placebo wins.
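The first two failure modes, insufficient data and bad data, can be caught before any training run with a simple audit. The sketch below assumes examples are (prompt, completion) string pairs; `audit_examples` and the 1,000-example default are illustrative, with the threshold taken from the guidance above.

```python
from collections import defaultdict


def audit_examples(
    examples: list[tuple[str, str]], min_examples: int = 1000
) -> dict[str, object]:
    """Count empty, duplicated, and contradictory training examples."""
    seen: set[tuple[str, str]] = set()
    outputs_by_prompt: dict[str, set[str]] = defaultdict(set)
    empty = 0
    duplicates = 0
    for prompt, completion in examples:
        p, c = prompt.strip(), completion.strip()
        if not p or not c:
            empty += 1
            continue
        if (p, c) in seen:
            duplicates += 1
        seen.add((p, c))
        outputs_by_prompt[p].add(c)
    # The same prompt mapped to different completions is an inconsistency signal.
    conflicting = sum(1 for outs in outputs_by_prompt.values() if len(outs) > 1)
    return {
        "usable": len(seen),
        "empty": empty,
        "duplicates": duplicates,
        "conflicting_prompts": conflicting,
        "enough_data": len(seen) >= min_examples,
    }
```

This only catches mechanical problems. Whether the completions are actually good still needs human review of a sample.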
The combination patterns
In production, the best systems usually combine two or all three of these levers.
Combination 1: Prompted RAG (most common)
The default for knowledge-heavy applications.
- Carefully designed prompts encode instructions, format, constraints.
- RAG provides current, specific information.
- No fine-tuning; rely on strong base model.
This is the most common production pattern in 2026. It works for most use cases.
Combination 2: Fine-tuned model + RAG
When you need both behavioral specialization and dynamic knowledge.
- Fine-tune for voice, format, domain.
- RAG for current information.
- Prompts orchestrate.
Example: a fine-tuned model for a specific company's customer support voice, with RAG over current policies and documentation. The fine-tune handles the consistent voice; RAG handles the changing knowledge.
Combination 3: Specialized fine-tunes for specific tasks
Different fine-tunes for different parts of a system.
- Classification fine-tune for routing.
- Summarization fine-tune for digests.
- Generation fine-tune for customer responses.
- Each smaller, faster, specialized.
Used when scale and cost optimization matter. Each fine-tune does its narrow job well; orchestration calls them.
Combination 4: Fine-tuned router + general models
The router is fine-tuned to classify queries reliably. Once classified, queries go to general models for the actual work.
The fine-tune is small, fast, narrow. The expensive general work is done by general models, kept current.
This combines economy (fine-tune is small) with capability (general models for the hard work).
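The router pattern can be sketched as a classifier plus a dispatch table. In this illustration the keyword-based `classify` is only a stand-in for the small fine-tuned router, and the handlers are placeholders for calls to general models; all names and labels are hypothetical.

```python
from typing import Callable


def classify(query: str) -> str:
    """Stand-in for the fine-tuned router; a real one would be a model call."""
    q = query.lower()
    if "invoice" in q or "refund" in q:
        return "billing"
    if "error" in q or "crash" in q:
        return "technical"
    return "general"


# Placeholder handlers; in production each would call a general model
# with a prompt suited to that category.
HANDLERS: dict[str, Callable[[str], str]] = {
    "billing": lambda q: f"[billing model] {q}",
    "technical": lambda q: f"[technical model] {q}",
    "general": lambda q: f"[general model] {q}",
}


def route(query: str) -> tuple[str, str]:
    """Classify the query, then hand it to the matching handler."""
    label = classify(query)
    handler = HANDLERS.get(label, HANDLERS["general"])
    return label, handler(query)


if __name__ == "__main__":
    print(route("The app shows an error when I log in"))
```

Falling back to the general handler for unknown labels keeps the system working if the router ever emits a category the dispatch table does not know.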
The decision framework
A practical decision flow:
Question 1: Is the problem solvable with the current model and a good prompt?
If yes: write the prompt. Iterate for a week. Ship.
If no, go to Question 2.
Question 2: Does the problem involve knowledge the model doesn't have?
If yes: build RAG. Spend the months. Get it to production quality. Pair with strong prompts.
If no, go to Question 3.
Question 3: Is the problem about consistent format, narrow domain, or specific behavior?
If yes, AND you have at least several hundred (ideally 1,000+) high-quality examples: fine-tune. Combine with prompting and possibly RAG.
If you don't have the examples: invest in collecting them, OR try better prompting / RAG further before fine-tuning.
Question 4: Have you done the eval work to know which approach actually helps?
This question applies at every step. Without evals, you're guessing.
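In practice, "doing the eval work" means running every candidate approach against the same held-out examples and comparing scores. The sketch below uses exact-match accuracy, which is an assumption that suits classification or structured output; free-form text usually needs a different metric. The stub systems and eval set are made up for illustration.

```python
from typing import Callable

EvalSet = list[tuple[str, str]]  # (input, expected output)


def accuracy(system: Callable[[str], str], eval_set: EvalSet) -> float:
    """Fraction of eval inputs where the system output matches exactly."""
    hits = sum(1 for inp, expected in eval_set if system(inp).strip() == expected)
    return hits / len(eval_set)


def compare(
    systems: dict[str, Callable[[str], str]], eval_set: EvalSet
) -> dict[str, float]:
    """Score every candidate on the same eval set."""
    return {name: accuracy(fn, eval_set) for name, fn in systems.items()}


# Hypothetical eval set and stub systems standing in for real model calls.
EVAL_SET: EvalSet = [
    ("2+2", "4"),
    ("capital of France", "Paris"),
    ("3*3", "9"),
    ("opposite of hot", "cold"),
]
BASELINE = {"2+2": "4"}
CANDIDATE = {"2+2": "4", "capital of France": "Paris", "3*3": "9"}

if __name__ == "__main__":
    scores = compare(
        {
            "baseline prompt": lambda q: BASELINE.get(q, "unknown"),
            "candidate": lambda q: CANDIDATE.get(q, "unknown"),
        },
        EVAL_SET,
    )
    print(scores)
```

The important property is that the eval set stays fixed while the approach changes, so a score difference can be attributed to the change you made.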
Production examples
A few real-world combinations:
Example 1: AI customer support
Setup: A SaaS company's customer support AI handles tier 1 inquiries.
Components:
- Strong prompts for tone, format, escalation policies.
- RAG over current docs, policies, ticket history.
- Lightweight fine-tune on the company's specific voice and escalation patterns (1,500 examples curated from past tickets).
Outcome: Handles 65% of tickets autonomously. The fine-tune accounts for the consistent voice; RAG keeps it accurate; prompts handle the policies.
Example 2: Legal document review
Setup: A legal-tech product reviews contracts for risks.
Components:
- Detailed prompts encoding what to look for (legal categories, severity rubric).
- RAG over relevant case law and precedent.
- No fine-tuning; reasoning models handle the heavy lifting.
Outcome: Pure prompt + RAG works well because the model already has legal training. Fine-tuning would help only marginally; the investment wasn't justified.
Example 3: Code completion in a custom DSL
Setup: A specialized data tool with its own DSL.
Components:
- Prompts with examples.
- No RAG (the DSL is small enough to fit in context).
- LoRA fine-tune on 10K examples of the DSL.
Outcome: Fine-tuning was essential. Without it, the model couldn't produce valid DSL reliably. Prompts and context alone weren't enough.
Example 4: Internal company assistant
Setup: A general assistant for company employees.
Components:
- Strong system prompts (voice, behavior, refusals).
- RAG over company wiki, Slack, docs.
- No fine-tuning; the company's "voice" is captured in prompts.
Outcome: RAG + prompts handle most use cases. The company isn't quirky enough to need fine-tuning for voice.
Mistakes we see
A few patterns of misallocation:
Mistake 1: Reaching for fine-tuning first. Teams hear "we should fine-tune our own model" and start there. 90% of the time, prompting + RAG would have been faster, cheaper, and as good.
Mistake 2: Skipping RAG when it's the answer. Teams build elaborate prompts to "remind" the model of company info that should obviously be retrieved at query time. Better to just retrieve.
Mistake 3: Fine-tuning without evals. "We fine-tuned and it's better now." No metrics. Often the fine-tune did nothing or even hurt. Without evals, you don't know.
Mistake 4: Stale fine-tunes. A fine-tune from 6 months ago, when GPT-4 was best. Today's frontier models without the fine-tune outperform the fine-tuned older model. Fine-tunes need re-evaluation as the field moves.
Mistake 5: Trying to fine-tune in facts. Teams try to fine-tune the model to "know about our company." Doesn't work well — the model memorizes some facts, hallucinates others. RAG handles facts; fine-tuning handles behavior.
Mistake 6: Not iterating on prompts long enough. Two days of prompt iteration is a starting point. Two weeks gets you the real answer.
Mistake 7: Over-engineering RAG when prompting could do it. A 50K-token company doc dumped in the prompt is sometimes simpler than RAG. Especially for small corpora.
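Before building retrieval, it is worth checking whether the whole corpus simply fits in the prompt. The sketch below uses a rough four-characters-per-token estimate, which is a heuristic assumption and varies by tokenizer and language; the function names and reserved-token default are hypothetical.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; use the model's own tokenizer for real numbers."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_prompt(
    docs: list[str], context_window: int, reserved_tokens: int = 3000
) -> bool:
    """True if all docs fit after reserving room for instructions and output."""
    budget = context_window - reserved_tokens
    needed = sum(estimate_tokens(doc) for doc in docs)
    return needed <= budget
```

Fitting in the window is necessary but not sufficient: per-query cost and latency grow with prompt size, so compare against the RAG estimate before committing either way.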
Cost and effort comparison
A rough comparison for a typical mid-sized project:
| Approach | Effort | Cost (one-time) | Cost (per query) | Maintenance |
|----------|--------|-----------------|------------------|-------------|
| Prompting | 1-2 weeks | minimal | base API cost | low |
| RAG | 6-12 weeks | infrastructure setup (~€1K-5K) | 2-5x base | moderate (ingestion, eval) |
| Fine-tuning (LoRA) | 6-12 weeks | training compute (~€500-5K) | base (often cheaper if smaller model) | high (data, retrain, eval) |
| Prompting + RAG | 8-14 weeks | infrastructure | 2-5x base | moderate |
| All three | 12-20 weeks | combined | varies | high |
The right choice depends on your problem and resources. For most teams, prompting + RAG is the sweet spot — meaningful capability gain without the full fine-tuning investment.
The takeaway
Prompting, RAG, and fine-tuning solve different problems. Picking right requires honest diagnosis: is this an instructional gap, a knowledge gap, or a capability gap?
The honest order to try them:
- Prompting (1-2 weeks of iteration). Cheapest, fastest, most often sufficient.
- RAG if there's a clear knowledge gap. Significant investment but well-bounded.
- Fine-tuning if there's a clear capability gap that prompting + RAG can't close. Most expensive; do it last.
- Combinations for mature production systems.
The teams that succeed are honest about which gap they have and disciplined about evals. Without evals, you can't tell which approach helped. With them, the path is usually clear.
Most "we need to fine-tune our own model" projects, on closer inspection, should be "we need to write better prompts and add RAG." Save fine-tuning for the cases that truly need it.
The result: better systems, faster, at lower cost. Which is what shipping production AI is supposed to be about.