Chain-of-thought, self-critique, tree-of-thoughts — when to use each
Three reasoning techniques that genuinely improve AI output on hard problems — and the cost-benefit math of using them. With concrete prompts, side-by-side comparisons, and the gotchas modern reasoning models introduce.
There are three techniques in prompt engineering that produce measurable, repeatable improvements on hard analytical tasks: chain-of-thought, self-critique, and tree-of-thoughts. They have been studied extensively since the 2022-2023 wave of reasoning research, and despite the rise of dedicated reasoning models (o3, Claude Extended Thinking, DeepSeek R1), the techniques still matter — both because they remain useful with fast models, and because they shape how you should prompt the reasoning models themselves.
This article covers what each technique is, when to use which, and the cost-benefit math.
What problem these solve
All three techniques address the same root issue: by default, a language model starts answering immediately, committing to a conclusion within the first few tokens without writing out any intermediate reasoning. For simple tasks, this is fine and efficient. For multi-step reasoning, complex analysis, or anything where the answer depends on getting several intermediate things right, this default produces confidently wrong outputs.
The techniques force the model to spend more compute on intermediate reasoning before committing.
Chain-of-thought (CoT)
The original and simplest. Add a phrase like "think step by step before giving your final answer" to your prompt and the model produces its reasoning before its conclusion.
A worked example. Compare:
Vanilla: A train leaves Tallinn at 9:00 a.m. traveling 80 km/h. Another train leaves Tartu at 9:30 a.m. traveling toward Tallinn at 100 km/h. The distance between the cities is 190 km. At what time do they meet?
versus:
With CoT: Same question. Think step by step. First, calculate the distance the first train covers before the second starts. Then set up the equation for when they meet. Show your work, then give your final answer.
On hard arithmetic-style problems, vanilla GPT-3.5-class models had error rates around 30-50%; CoT versions had error rates closer to 5-15%. The numbers have shifted as models have improved, but the direction — adding CoT lifts accuracy on multi-step tasks — has stayed consistent across generations.
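For reference, here is the arithmetic a correct chain of reasoning should reproduce for the train problem. This is a small sketch that takes the first train to be heading toward Tartu; the variable names are illustrative.

```python
from datetime import datetime, timedelta
from fractions import Fraction

# Figures from the train prompt. Assumption: the first train is heading
# toward Tartu, so the two trains approach each other.
DISTANCE_KM = 190
SPEED_FIRST = 80                 # km/h, leaves Tallinn at 9:00
SPEED_SECOND = 100               # km/h, leaves Tartu at 9:30
HEAD_START_H = Fraction(1, 2)    # the first train runs alone for 30 minutes

# Step 1: distance the first train covers before the second departs.
head_start_km = SPEED_FIRST * HEAD_START_H                # 40 km

# Step 2: the remaining gap closes at the sum of the two speeds.
gap_km = DISTANCE_KM - head_start_km                      # 150 km
hours_to_meet = gap_km / (SPEED_FIRST + SPEED_SECOND)     # 5/6 hour

# Step 3: convert to a clock time, counted from the second departure.
second_departure = datetime(2000, 1, 1, 9, 30)            # the date is arbitrary
meeting_time = second_departure + timedelta(minutes=int(hours_to_meet * 60))

print(meeting_time.strftime("%H:%M"))  # prints 10:20
```

A model that skips the head start and divides 190 km by 180 km/h lands on a different time, which is the kind of slip the step-by-step version makes visible.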
When to use chain-of-thought:
- Multi-step arithmetic, especially with units, dates, or precise rounding. Even strong models slip on these.
- Logic puzzles and similar problems where the answer is the conclusion of a chain.
- Code debugging, where the answer depends on tracing through state.
- Strategy analysis, where the conclusion depends on weighing multiple factors.
When not to bother:
- Simple factual recall. "What is the capital of Estonia?" needs no CoT.
- Generation tasks. Writing, summarising, drafting. CoT just adds tokens without quality gain.
- Reasoning models. o3, Claude Extended Thinking, DeepSeek R1 already do CoT internally — adding "think step by step" to your prompt is at best redundant, at worst counterproductive.
That last point is critical and we'll return to it.
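If you add the CoT phrasing often, a small helper keeps it consistent. The sketch below is one possible shape; `with_cot` and its `hints` parameter are hypothetical names, not part of any library.

```python
from typing import Optional, Sequence


def with_cot(question: str, hints: Optional[Sequence[str]] = None) -> str:
    """Wrap a question in a chain-of-thought instruction.

    `hints` are optional task-specific steps, like the "First, calculate..."
    sentences in the train example.
    """
    parts = [question.strip(), "Think step by step."]
    if hints:
        parts.extend(h.strip() for h in hints)
    parts.append("Show your work, then give your final answer.")
    return " ".join(parts)


prompt = with_cot(
    "At what time do the two trains meet?",
    hints=[
        "First, calculate the distance the first train covers before the second starts.",
        "Then set up the equation for when they meet.",
    ],
)
print(prompt)
```

Keeping the hints optional means the same helper covers both the bare "think step by step" form and the more guided form.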
Self-critique
A two-pass technique. First, ask the model to produce an answer. Then, ask the model to critique its own answer and produce a revised version.
The prompt structure:
> Step 1: [Your original question]
>
> Step 2: Review your answer above. Find any mistakes, weaknesses, or places where you made assumptions that might not hold. Be a hard critic of your own work.
>
> Step 3: Based on your critique, produce a revised answer.
The improvement comes from the model being released from the commitment-to-conclusion that the first pass made. Forced to look at its own work as a critic, it catches things it would not have caught in the original pass.
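The same structure can also be run as three separate calls instead of one long prompt. The sketch below assumes a hypothetical `ask` function that sends a prompt to your model and returns its reply; `fake_ask` is a stub so the example runs without any API.

```python
from typing import Callable, Dict, List

# Hypothetical: any function that sends a prompt and returns the model's reply.
Ask = Callable[[str], str]

CRITIQUE = (
    "Review the answer above. Find any mistakes, weaknesses, or places where "
    "assumptions might not hold. Be a hard critic."
)
REVISE = "Based on the critique, produce a revised answer."


def self_critique(question: str, ask: Ask) -> Dict[str, str]:
    """Run the draft -> critique -> revise passes and keep all three outputs."""
    draft = ask(question)
    critique = ask(f"Question:\n{question}\n\nAnswer:\n{draft}\n\n{CRITIQUE}")
    revised = ask(
        f"Question:\n{question}\n\nAnswer:\n{draft}\n\n"
        f"Critique:\n{critique}\n\n{REVISE}"
    )
    return {"draft": draft, "critique": critique, "revised": revised}


# Stub model so the sketch is runnable; replace with a real client call.
calls: List[str] = []


def fake_ask(prompt: str) -> str:
    calls.append(prompt)
    return f"reply {len(calls)}"


result = self_critique("Is 91 prime?", fake_ask)
print(result["revised"])  # prints: reply 3
```

Returning the draft and the critique alongside the revision lets you see what the second pass actually changed.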
A more sophisticated variant is constitutional / principle-based self-critique. You define a set of principles the answer should satisfy, then ask the model to evaluate against each.
> Principles for a good answer to this kind of question:
>
> 1. It addresses the actual question, not a generalisation of it.
> 2. It quotes specific evidence rather than gesturing at sources.
> 3. It acknowledges uncertainty explicitly where present.
> 4. It is calibrated: confident on strong points, hedged on weak ones.
>
> Produce an answer. Then evaluate it against each principle. Then revise.
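A related critique-and-revise loop against written principles underlies Anthropic's Constitutional AI work, where it is applied during training rather than at prompt time, and similar approaches in modern alignment research.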
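A principle-based prompt is easy to assemble from a list, which makes the principles reusable across questions. A minimal sketch, with `principle_prompt` as a hypothetical helper name:

```python
from typing import Sequence

# The four principles from the example; swap in your own per task type.
PRINCIPLES = (
    "It addresses the actual question, not a generalisation of it.",
    "It quotes specific evidence rather than gesturing at sources.",
    "It acknowledges uncertainty explicitly where present.",
    "It is calibrated: confident on strong points, hedged on weak ones.",
)


def principle_prompt(question: str, principles: Sequence[str] = PRINCIPLES) -> str:
    """Build a single prompt: question, numbered principles, then the instruction."""
    numbered = "\n".join(f"{i}. {p}" for i, p in enumerate(principles, start=1))
    return (
        f"{question.strip()}\n\n"
        "Principles for a good answer to this kind of question:\n"
        f"{numbered}\n\n"
        "Produce an answer. Then evaluate it against each principle. Then revise."
    )


print(principle_prompt("Which of these two studies better supports the claim?"))
```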
When to use self-critique:
- Writing tasks where you want a second pass without leaving the conversation.
- Analytical work where the model is likely to be overconfident.
- Decision support where you want the model to find the holes in its own argument.
- Code where you want a review pass after the generation pass.
When not to bother:
- Tasks where there is no "correct" answer to revise toward (creative brainstorming, idea generation).
- Tasks where you would prefer to do the critique yourself (anything where your judgement is the value).
- Quick conversational responses where the latency cost outweighs the quality gain.
Tree-of-thoughts (ToT)
The most expensive technique. Instead of producing a single chain of reasoning, the model explicitly considers multiple paths, evaluates each, and selects the most promising.
A worked example structure:
> Step 1: Generate three different approaches to this problem.
>
> Step 2: For each approach, work through the first few steps without committing to a final answer.
>
> Step 3: Evaluate which approach is most likely to succeed and why. Be specific about strengths and weaknesses.
>
> Step 4: Commit to the best approach and complete the solution.
Tree-of-thoughts works because some problems have multiple plausible attacks, and the first one you try is not always the best. By forcing parallel exploration, you avoid getting locked into a suboptimal path.
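The four steps can also be driven from code, one call per branch, rather than packed into a single prompt. The sketch below is a simplified multi-call variant, not a full tree search; `ask` is a hypothetical prompt-in, reply-out function and `fake_ask` is a stub that only counts calls.

```python
from typing import Callable, List

# Hypothetical: any function that sends a prompt and returns the model's reply.
Ask = Callable[[str], str]


def tree_of_thoughts(problem: str, ask: Ask, n_paths: int = 3) -> str:
    # Step 1: propose distinct approaches, showing earlier ones to avoid repeats.
    approaches: List[str] = []
    for _ in range(n_paths):
        seen = "\n".join(f"- {a}" for a in approaches) or "(none yet)"
        approaches.append(ask(
            f"Problem: {problem}\nApproaches already proposed:\n{seen}\n"
            "Propose one different approach in a sentence or two."
        ))

    # Step 2: explore each approach a few steps deep, without finishing it.
    explorations = [
        ask(f"Problem: {problem}\nApproach: {a}\n"
            "Work through the first few steps without committing to a final answer.")
        for a in approaches
    ]

    # Step 3: evaluate the partial explorations side by side.
    summary = "\n\n".join(
        f"Approach {i}: {a}\nFirst steps: {e}"
        for i, (a, e) in enumerate(zip(approaches, explorations), start=1)
    )
    verdict = ask(
        f"Problem: {problem}\n\n{summary}\n\n"
        "Evaluate which approach is most likely to succeed and why. "
        "Be specific about strengths and weaknesses."
    )

    # Step 4: commit to the best approach and finish.
    return ask(
        f"Problem: {problem}\n\n{summary}\n\nEvaluation:\n{verdict}\n\n"
        "Commit to the best approach and complete the solution."
    )


calls: List[str] = []


def fake_ask(prompt: str) -> str:
    calls.append(prompt)
    return f"reply {len(calls)}"


final = tree_of_thoughts("Speed up a slow SQL query", fake_ask)
print(len(calls), final)  # prints: 8 reply 8
```

With three paths this costs eight calls (three proposals, three explorations, one evaluation, one final answer), which is why this is the most expensive of the three techniques.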
Practical example — a hard prompt:
> I have a complex SQL query that runs too slowly. Help me optimise it.
>
> Step 1: Generate three different optimisation strategies.
> Step 2: For each, identify the specific bottleneck it would address and the cost.
> Step 3: Evaluate which is most likely to give us the biggest gain for the smallest risk.
> Step 4: Implement the chosen approach.
You get back something noticeably more thoughtful than "here is one rewrite." Three approaches, comparison, recommendation, implementation.
When to use tree-of-thoughts:
- Problems with multiple credible solutions. Architecture decisions, algorithm choices, strategic choices.
- Optimisation problems. Where the first attempt is rarely the best.
- Creative tasks where exploration is the point. Naming, framing, positioning.
- Anything where you suspect the obvious answer is wrong.
When not to bother:
- Tasks with a single correct approach. Don't ask for three SQL queries when one works.
- Simple factual questions. Overkill.
- Most reasoning-model tasks — the models do this kind of exploration internally now.
A practical decision tree
When you have a hard problem in front of you, the question is not "should I use CoT, self-critique, or ToT." It is "what is the shape of the problem?"
- Linear multi-step problem (arithmetic, logic puzzle, strict reasoning) → chain-of-thought.
- Problem where overconfidence is the risk (analysis, recommendation, code that should be reviewed) → self-critique.
- Problem with multiple plausible approaches (optimisation, strategic choice, creative exploration) → tree-of-thoughts.
- Conversational, simple, or generative → none. Skip the overhead.
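The mapping above is small enough to write down as a lookup, which can be handy if you route prompts programmatically. The enum and function names here are illustrative only.

```python
from enum import Enum
from typing import Optional


class Shape(Enum):
    LINEAR_MULTI_STEP = "linear multi-step"
    OVERCONFIDENCE_RISK = "overconfidence is the risk"
    MULTIPLE_APPROACHES = "multiple plausible approaches"
    SIMPLE_OR_GENERATIVE = "conversational, simple, or generative"


# None means: send the plain prompt and skip the overhead.
TECHNIQUE_FOR_SHAPE = {
    Shape.LINEAR_MULTI_STEP: "chain-of-thought",
    Shape.OVERCONFIDENCE_RISK: "self-critique",
    Shape.MULTIPLE_APPROACHES: "tree-of-thoughts",
    Shape.SIMPLE_OR_GENERATIVE: None,
}


def pick_technique(shape: Shape) -> Optional[str]:
    return TECHNIQUE_FOR_SHAPE[shape]


print(pick_technique(Shape.MULTIPLE_APPROACHES))  # prints: tree-of-thoughts
```

Classifying the problem's shape is still a judgement call; the table only records what follows once you have made it.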
There is also a useful meta-pattern: use CoT inside ToT inside self-critique. Explore three paths (ToT), reason through each step by step (CoT), then critique the chosen path (self-critique). This sounds like overkill but is genuinely useful for the hardest analytical work. The cost is latency and tokens; the benefit is materially better answers.
How reasoning models change the math
The biggest shift since 2024 has been the rise of dedicated reasoning models — o1, o3, Claude Extended Thinking, DeepSeek R1, Gemini 2.5 Thinking. These models do chain-of-thought internally before producing an answer, often "thinking" for tens of seconds or minutes.
This changes how to prompt them in three important ways:
1. Stop adding "think step by step." Reasoning models already do this. Adding the phrase explicitly can confuse them or produce redundant output. Just ask the question directly.
2. Trust the model's reasoning length. If you ask a complex question, the model will internally generate a long chain of reasoning. You don't see all of it (some is hidden in "thinking" mode). The trade-off is latency. Be patient.
3. Use plain, direct prompts. Reasoning models are less prompt-sensitive than fast models — they reason through ambiguity rather than getting stuck on it. The over-engineered prompts that work well with fast models (heavy framing, multiple constraints, structured templates) sometimes degrade reasoning-model output. Try the simpler version first.
A worked example. Compare these two prompts to a reasoning model:
Prompt A: "Think step by step about the following question. First, identify the key constraints. Then list the options. Then evaluate each option against the constraints. Then choose. Show your reasoning at each step. Question: should we adopt a four-day work week?"
Prompt B: "Should we adopt a four-day work week? Context: 80-person B2B SaaS, customer support team operates Mon-Fri."
For most reasoning models, Prompt B will produce a better answer. The reasoning model already knows how to think through the question; explicit scaffolding can constrain it in ways that hurt.
For fast models, the opposite is true. They need the scaffolding to produce comparable quality.
This is the most important new fact about prompt engineering since 2023: the same prompt that works best on a fast model may be worse on a reasoning model, and vice versa.
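In practice this means the prompt should depend on which kind of model receives it. A minimal sketch, assuming you know up front whether the target is a reasoning model; `build_prompt` and its `reasoning_model` flag are hypothetical, and the scaffold text is taken from Prompt A.

```python
SCAFFOLD = (
    "Think step by step about the following question. First, identify the key "
    "constraints. Then list the options. Then evaluate each option against the "
    "constraints. Then choose. Show your reasoning at each step. Question: "
)


def build_prompt(question: str, context: str = "", reasoning_model: bool = False) -> str:
    """Plain question plus context for reasoning models; scaffolded for fast models."""
    base = question.strip()
    if context:
        base += f" Context: {context.strip()}"
    # Reasoning models get the direct version; fast models get the scaffolding.
    return base if reasoning_model else SCAFFOLD + base


question = "Should we adopt a four-day work week?"
context = "80-person B2B SaaS, customer support team operates Mon-Fri."

print(build_prompt(question, context, reasoning_model=True))
print(build_prompt(question, context, reasoning_model=False))
```

Note that both versions carry the context: concrete facts about the situation help either kind of model, while the procedural scaffolding is what differs.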
Cost-benefit math
The techniques all have costs. The honest accounting:
| Technique | Token cost | Latency cost | Quality gain | When worth it |
| --- | --- | --- | --- | --- |
| Chain-of-thought | ~2-3x | ~1.5-2x | 10-40% on hard problems | Multi-step problems with fast models |
| Self-critique | ~2x | ~2x | 5-20% across the board | When overconfidence is a real risk |
| Tree-of-thoughts | ~3-5x | ~2-3x | 10-30% on multi-approach problems | Hard problems with multiple paths |
| Reasoning model (built-in) | ~3-10x | ~5-30x | 30-100% on hard problems | Anything genuinely hard |
For most casual use, the fast model with no techniques is fine. For hard problems, the right technique (or reasoning model) is worth the extra cost. For trivial tasks, all of these techniques waste money and time.
A practical rule: before you reach for a technique, ask whether the cost of being wrong on this task is meaningful enough to justify the extra cost of the technique. If yes, use the right one. If no, just send the prompt.
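That rule can be made concrete as an expected-value check. The sketch below is a back-of-the-envelope model only: every number in the usage example is made up for illustration, and the function name is hypothetical.

```python
def technique_pays_off(
    base_cost: float,          # cost of one plain prompt (money or time)
    cost_multiplier: float,    # e.g. 3.0 for a technique that triples token cost
    p_wrong_plain: float,      # your estimate of the error rate without the technique
    p_wrong_technique: float,  # your estimate of the error rate with it
    cost_of_error: float,      # what a wrong answer costs you, in the same units
) -> bool:
    """True if the expected saving from fewer errors exceeds the extra spend."""
    extra_cost = base_cost * (cost_multiplier - 1)
    expected_saving = (p_wrong_plain - p_wrong_technique) * cost_of_error
    return expected_saving > extra_cost


# Illustrative numbers: a 0.01 prompt, a 3x technique, error rate 30% -> 10%.
print(technique_pays_off(0.01, 3.0, 0.30, 0.10, cost_of_error=50.0))  # True
print(technique_pays_off(0.01, 3.0, 0.30, 0.10, cost_of_error=0.05))  # False
```

The error rates are the hard part to estimate; the point of the exercise is that when a wrong answer is cheap, even a large accuracy gain does not justify the overhead.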
Worked example: a real hard task
Suppose you are evaluating two vendor proposals and you want a calibrated comparison.
Without any technique (vanilla prompt):
> Compare these two vendor proposals [paste]. Which should we pick?
You get a hedged, both-sides answer. Useful starting point; not enough.
With CoT:
> Compare these two vendor proposals. Think step by step:
>
> 1. List the criteria that matter for our decision.
> 2. Score each vendor on each criterion.
> 3. Identify the criteria where the scores diverge most.
> 4. Then give your recommendation.
You get a much more structured analysis. Each step is visible; you can verify or correct.
With self-critique on top:
> [same as above]
>
> After your recommendation, critique your own analysis:
>
> 1. Which criteria might I have weighted wrong?
> 2. What did I assume that I shouldn't have?
> 3. What's the strongest credible case for the other vendor?
>
> Then produce a revised recommendation if needed.
The critique catches blind spots in the first analysis.
With ToT:
> Compare these two vendor proposals.
>
> Step 1: Generate three different decision-making frameworks for this kind of choice (e.g., risk-minimising, value-maximising, capability-aligned).
> Step 2: Apply each framework. Get three recommendations.
> Step 3: Where do the frameworks agree? Where do they diverge?
> Step 4: Given our actual constraints, which framework is most appropriate? Final recommendation.
You get three different angles on the choice; the differences are where the interesting thinking happens.
With a reasoning model:
> Compare these two vendor proposals. Which should we pick, and why? Include the things that would change your answer.
The reasoning model does all the above internally. The output is often comparable to or better than the heavily-prompted fast-model output, in roughly the same wall time.
In 2026, for genuinely hard analytical work, a reasoning model with a clean prompt is usually the right move. CoT and ToT remain useful with fast models, and self-critique remains useful as an additional layer regardless of the underlying model.
A few practical habits
Make the technique visible to yourself. Note which technique you used in your prompt — it helps you build intuition about what works.
Compare outputs. Once a week, run the same hard prompt with and without a technique, and see how different they are. You will calibrate fast on when the technique earns its cost.
Don't stack techniques without thought. Stacking CoT + self-critique + ToT + reasoning model is rarely better than picking the right one. Each layer adds cost; only add layers that genuinely improve the answer for your specific task.
Keep the techniques in your library. Snippets for "with CoT," "with self-critique," "with ToT" — applied to whatever the current task is — save real time over re-typing the scaffolding.
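A snippet library can be as simple as a dictionary of templates. The sketch below is one way to set it up; the template wording is adapted from the prompts in this article and `apply_snippet` is a hypothetical helper.

```python
# Each template takes the current task via the {task} placeholder.
SNIPPETS = {
    "cot": (
        "{task}\n\n"
        "Think step by step. Show your work, then give your final answer."
    ),
    "self_critique": (
        "{task}\n\n"
        "After answering, review your answer. Find any mistakes, weaknesses, or "
        "places where you made assumptions that might not hold. "
        "Then produce a revised answer."
    ),
    "tot": (
        "{task}\n\n"
        "Step 1: Generate three different approaches to this problem.\n"
        "Step 2: For each approach, work through the first few steps without "
        "committing to a final answer.\n"
        "Step 3: Evaluate which approach is most likely to succeed and why.\n"
        "Step 4: Commit to the best approach and complete the solution."
    ),
}


def apply_snippet(name: str, task: str) -> str:
    """Fill the named template with the task; raises KeyError for unknown names."""
    return SNIPPETS[name].format(task=task.strip())


print(apply_snippet("cot", "Compare these two vendor proposals."))
```

The same templates work equally well as text-expander shortcuts; the dictionary form is only useful if you send prompts from code.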
The takeaway
Three techniques. Each has a sweet spot. Chain-of-thought for linear multi-step problems. Self-critique for catching overconfidence. Tree-of-thoughts for multi-approach problems. Reasoning models change the math by doing the step-by-step reasoning and path exploration internally, but the techniques still matter, both for fast models and as patterns you can apply on top.
Use the right one for the problem. Skip them when they don't earn their cost. Internalise the difference between "harder problem" and "different problem" — that is the entire game.