Prompt engineering for reasoning models (o3, R1, Claude extended thinking)
Reasoning models are not fast models with extra steps. They reward different prompting, ignore some conventional patterns, and have their own pitfalls. A practical guide to working with them well.
Reasoning models are the most important class of model to appear since the original ChatGPT. o1, o3, GPT-5 Thinking, Claude Opus / Sonnet with extended thinking, DeepSeek R1, Gemini 2.5 Thinking, Grok-4 Heavy — by 2026 every major lab has shipped a reasoning model, and they have changed what is possible on hard analytical work.
They also reward a different prompting style from fast models. Many of the techniques that work brilliantly with GPT-4-class models (heavy scaffolding, "think step by step," elaborate role prompts) are at best neutral and at worst harmful when applied to reasoning models. The right style is closer to "describe the problem clearly and trust the model" than to anything you have read in prompt engineering guides.
This article is what reasoning models are, when to use them, how to prompt them well, and the pitfalls that catch even experienced AI users.
What reasoning models actually are
A reasoning model is one that internally generates an extended chain of reasoning before producing its final answer. The model thinks for tens of seconds, often minutes, before responding. The "thinking" tokens are often hidden from you (you see a "thinking..." indicator) or summarised.
This is a real shift in capability. On hard, multi-step benchmarks, reasoning models routinely outperform fast models by a wide margin — the exact gap depends heavily on the task and the providers compared, but it is substantial enough that the choice between "fast" and "thinking" is now a real architectural decision. They also cost more, take longer, and require different prompting.
The major reasoning models in 2026:
- OpenAI o3 and its variants (o3-mini, o3-pro). Available through the API and ChatGPT (Plus and Pro). GPT-5 in "Thinking" mode is the consumer-facing version.
- Claude 4.5 Opus / Sonnet with extended thinking. Available in claude.ai (Pro and above) and the API. Set the
thinkingparameter to enable reasoning mode. - DeepSeek R1 (and successors). Open-source weights; available through DeepSeek's app, OpenRouter, and other providers.
- Gemini 2.5 Thinking. Inside Gemini Advanced and the API.
- Grok 4 Heavy. Available to X Premium+ users.
All of them work on the same basic principle. The differences are in cost, latency, the visibility of the thinking trace, and the specific problems each is strongest at.
When to reach for a reasoning model
A reasoning model is the right tool when:
- The problem has multiple steps that build on each other. Math, logic, multi-step planning, code that requires tracing state.
- The cost of being wrong is real. Financial analysis, legal interpretation, medical reasoning, debugging production issues.
- Conventional models keep getting it wrong. If you have tried a fast model and the answer is consistently off, a reasoning model usually solves it.
- The task involves careful comparison or trade-off analysis. Multi-criteria decisions, architectural choices, vendor evaluations.
- You need the model to actually reason about edge cases, not just produce plausible-sounding text.
When not to use a reasoning model:
- Conversational chat. The latency makes the back-and-forth painful.
- Generation and drafting. Reasoning models produce worse creative writing than fast ones in many users' experience.
- Simple recall. Asking a reasoning model "what is the capital of Estonia" wastes its compute and your time.
- Iterative refinement loops. When you want to send 10 quick messages, the fast model is the right tool.
- Tasks where you need to control intermediate steps. Reasoning models hide the reasoning; if you want to inspect every step, use a fast model with explicit CoT.
A useful rule: if you wouldn't pay an analyst to spend 20 minutes on the task, don't use a reasoning model. If you would, do.
The prompting shift
The biggest mistake people make with reasoning models is applying fast-model prompt engineering to them. The five things to drop:
1. Stop adding "think step by step"
Reasoning models already do this. Adding the phrase is redundant at best. Worse, on some reasoning models, the phrase can interfere with the internal reasoning process — the model dedicates compute to performing visible step-by-step reasoning instead of using its more capable internal reasoning.
Bad: Think step by step. Solve this carefully. Show your work. [problem]
>
Good: [problem]
Just state the problem clearly. Trust the model.
2. Stop over-scaffolding the structure
A pattern that works well with fast models is heavy scaffolding: "First do A, then do B, then do C, here is the format..." With reasoning models, the model often figures out the right structure for the answer on its own — and dictating it can produce worse outputs than letting the model decide.
Fast-model style: First, list the key constraints. Then enumerate the options. Then evaluate each option against each constraint. Then pick. Then justify. Output format: ...
>
Reasoning-model style: Help me decide between option A and option B. Context: [...]
The reasoning model will usually internally produce a more sophisticated analysis than the structure you would have imposed.
3. Don't stack reasoning techniques
CoT + self-critique + tree-of-thoughts works on fast models. With reasoning models, the model is already doing the equivalent of all three internally. Stacking external versions on top is redundant and degrades quality.
If your prompt for a reasoning model includes "think step by step, then critique your own answer, then revise," reduce it to just the question. The model knows.
4. Don't over-specify the role
A pattern that works well with fast models is heavy role specification: "You are a senior engineer with 20 years of experience in distributed systems who has built large-scale applications and knows the trade-offs of..." Reasoning models do not benefit from this scaffolding as much. They already access the right kind of expertise based on the problem.
A short, direct role prompt is still useful for setting tone and register, but the long elaborate persona is overkill.
Bad: You are a world-class senior backend engineer with 20+ years experience...
>
Good: Help me reason through this distributed-systems issue. [problem]
5. Don't ask for the "thinking"
On o3 and some others, the thinking trace is hidden by design. Asking the model to "show your reasoning" can produce a different (often shallower) output than letting the model think privately and give you the conclusion.
If you want to see the reasoning, that is a fair preference — and on Claude with extended thinking, the trace is often visible. But asking for it explicitly on a model where it is normally hidden can degrade quality.
What reasoning models DO want
A few things they reward:
Specifics. Numbers, dates, exact constraints, specific files and people. Reasoning models can do real arithmetic on real numbers — give them the numbers.
Open framing. "Here is the situation. Here is what I want to figure out. What do you think?" produces better output than rigid templates.
Honest uncertainty. Tell the model what you don't know. "I'm not sure if X or Y; help me figure that out." Reasoning models handle ambiguity well and use it productively.
Permission to disagree. "Push back if my framing is wrong" or "tell me what I'm not considering" produces noticeably better output than asking for support of your existing position.
Concrete data. Spreadsheets, code, documents — paste them in. Reasoning models do their best work when there are real artefacts to reason about, not abstract questions.
Worked examples
Example 1: A debugging task
Suppose you have a tricky bug.
Fast model + CoT:
>
You are a senior software engineer specialising in TypeScript. Think step by step about this bug.
>
First, identify the relevant pieces of code. Second, trace through the data flow. Third, identify likely causes. Fourth, recommend a fix.
>
Here is the bug: [description] Here is the code: [code]
Reasoning model:
>
Help me find this bug.
>
Symptoms: [description] Relevant code: [code] What I've already tried: [list]
The reasoning model will work through the bug systematically without needing the scaffolding. It will often catch the issue faster than the fast-model + CoT combination, because its internal reasoning is genuinely deeper.
Example 2: A strategic decision
Fast model:
>
You are a senior strategy consultant. I'm trying to decide whether to launch product X. Apply the [framework name] framework. First, ... [long structured prompt]
Reasoning model:
>
I'm trying to decide whether to launch product X. Context: - We're a 50-person company at $5M ARR. - The product would take 2 quarters to build. - It's adjacent to but not directly competitive with our main product. - Two of our top 10 customers have asked for it. - Our team capacity is already strained.
>
Help me think through this. Push back on weak reasoning. Tell me what I'm not considering.
The reasoning model will produce a deeper, more nuanced analysis from the minimalist prompt than from the over-scaffolded one. It will likely surface considerations you didn't think to mention and notice tensions in what you said.
Example 3: Complex code analysis
Fast model:
>
Analyse this code for performance issues. Think step by step. First identify the data structures, then trace through the algorithm complexity, then point out specific bottlenecks. [code]
Reasoning model:
>
What's slow about this code? It currently takes ~3 seconds on a typical input; I'd like it under 500ms.
>
[code]
The reasoning model will analyse complexity, identify bottlenecks, propose fixes, and often suggest measurement strategies — all without explicit scaffolding.
Pitfalls specific to reasoning models
A short list of things that catch even experienced users:
The latency. Reasoning models can take 30 seconds to several minutes to produce an answer. This is genuinely disruptive to flow if you're not expecting it. Plan for it; do not use them for conversational tasks.
The cost. Reasoning models typically cost several times more per query than fast models — sometimes an order of magnitude more, depending on tier, provider, and how many thinking tokens the model burns. On API pricing, a single complex query can cost a meaningful amount. Use them deliberately.
The "thinking gets cut off" problem. Reasoning models have token budgets for their internal thinking. On extremely hard problems, the model may exhaust its thinking budget before reaching a confident conclusion. The output is then shaky. The fix: give a thinking-friendly prompt (clear, well-bounded problem) and on tools that allow it, increase the thinking budget.
Reasoning loops. Occasionally, a reasoning model gets stuck — its internal thinking goes in circles, or it goes down a wrong path and cannot recover. Symptoms: very long thinking time, then a hedged or weird answer. Solution: restart with a slightly different framing.
Over-confidence on the wrong things. Reasoning models can be more confident than they should be on questions where their internal reasoning didn't actually verify the answer. Always ask, on critical outputs, "what is your confidence in this and what would change your answer?"
Cost asymmetry between sub-problems. A reasoning model spends compute approximately proportional to problem hardness. Easy sub-questions are cheap; hard ones are expensive. Be aware that asking the model to do five hard things in one prompt may quietly use far more compute than you expect.
The hybrid pattern that often works best
For many real workflows, the right pattern is fast model + reasoning model in sequence:
- Fast model to scope, explore, brainstorm. Quick back-and-forth. Refine the question.
- Reasoning model to crunch the hardest 1-3 sub-questions that came out of the exploration.
- Fast model to translate the reasoning model's output into the form you want (slides, email, doc).
This pattern keeps latency manageable, costs predictable, and uses each tool for its strengths.
A worked example: a market analysis task.
- Fast model (Claude / GPT): "I want to understand the market for X. Help me scope the analysis: what should I look at, what data do I need, what questions matter?"
- Reasoning model (o3 / Claude Thinking): "Given the data I've gathered, what does it imply for [specific strategic question]? Push hard on the reasoning."
- Fast model: "Now help me turn this into a one-page brief for our leadership team."
Three tools, each used for what it's best at. The total cost and time are lower than asking the reasoning model to do all three steps; the quality is higher than asking the fast model to do all three.
A few practical habits
Always explicitly choose when a task is reasoning-model-worthy. Default to fast; upgrade only when the task earns it.
Keep two tabs open. ChatGPT or Claude with the fast model in one; the same product with the reasoning model in the other. Easy switching without confusion.
Track your reasoning-model costs. Whether through subscription tier monitoring or API billing, get a feel for what your monthly reasoning-model bill looks like. Tune your usage accordingly.
Notice when you would not have used one before. As you get more comfortable, you will catch yourself reaching for the fast model on a problem the reasoning model would have solved better. Build the muscle of pausing.
Re-prompt minimally. The instinct on a reasoning model that gave a weak answer is to elaborate the prompt. Try the opposite first: a shorter, simpler version of the same prompt. Reasoning models are sometimes overhelped by complex prompts.
The takeaway
Reasoning models are not fast models with extra steps. They reward minimal, direct prompts; they punish heavy scaffolding; they take time; they cost more; and on hard problems, they produce dramatically better answers.
Use them deliberately, prompt them simply, and stop applying the fast-model patterns to them. The combination of fast and reasoning models — used for what each is best at — is the most powerful AI workflow available in 2026, and the gap between people who have learned the difference and people who haven't keeps widening.