Reasoning Models
A class of language models that generate an internal chain-of-thought trace before producing a final answer. The scratchpad reasoning is either hidden from the user or exposed in tagged blocks; either way, it drives significantly better performance on problems requiring multi-step logic, math, or planning.
What Distinguishes Them
Standard chat models produce tokens left-to-right without a dedicated deliberation phase. Reasoning models allocate a separate budget of "thinking tokens": a scratchpad where the model plans, reconsiders, and self-corrects before committing to a response. This behavior is trained via reinforcement learning on verifiable outcomes (correct math answers, passing tests) rather than on human preference labels alone.
The practical effect: on hard problems, the model can explore multiple solution paths and back out of dead ends before the final output, rather than being committed to the first approach it starts writing.
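To make "verifiable outcomes" concrete, here is a minimal sketch of a binary reward for a math problem. It assumes, purely for illustration, that the model ends its response with a line of the form `Answer: <value>`; the function name `verifiable_reward` and that convention are hypothetical and not taken from any particular training pipeline.

```python
import re

# Assumed convention (hypothetical): the final answer appears on a line
# starting with "Answer:". Real pipelines use their own formats.
ANSWER_RE = re.compile(r"Answer:[ \t]*(\S.*?)[ \t]*$", re.MULTILINE)


def verifiable_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the last 'Answer:' line matches the reference, else 0.0."""
    matches = ANSWER_RE.findall(completion)
    if not matches:
        return 0.0  # no parseable answer, so no reward
    # Only the last stated answer counts, so earlier self-corrections are ignored.
    return 1.0 if matches[-1] == reference.strip() else 0.0
```

Because the reward depends only on the checkable final answer, the scratchpad is free to contain abandoned attempts, which is what lets the model learn to back out of dead ends.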
Thinking Tokens
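Thinking tokens often outnumber output tokens by 5–20× on complex problems. Because they are generated before the answer and are billed, cost and latency scale accordingly. How the trace is exposed varies by provider: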
- OpenAI o-series: thinking tokens are hidden; users see only the final response.
- DeepSeek R1: thinking exposed in `<think>...</think>` tags in the raw output.
- extended-thinking (Claude): thinking exposed as `thinking` content blocks in the API response; billed even when not surfaced to the end user.
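When the trace arrives inline, as with the `<think>...</think>` tags in DeepSeek R1's raw output, an application usually has to separate it from the answer before display. A minimal sketch, assuming well-formed, non-nested tags; the helper name `split_thinking` is hypothetical:

```python
import re

# Non-greedy match across newlines; assumes tags are well formed and not nested.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(raw: str) -> tuple[str, str]:
    """Split raw model output into (thinking, answer)."""
    thinking = "\n".join(block.strip() for block in THINK_RE.findall(raw))
    answer = THINK_RE.sub("", raw).strip()
    return thinking, answer


raw = "<think>2 + 2 is 4; double-check: yes.</think>\nThe answer is 4."
thinking, answer = split_thinking(raw)
print(answer)  # The answer is 4.
```

This sketch does not handle a truncated response in which the closing `</think>` tag is missing; in that case the whole string is returned as the answer.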
Research (Liu et al., "Think Deep, Not Just Long", 2025) reports that the proportion of deep-thinking tokens in a generation, rather than total token count alone, is a reliable predictor of accuracy on the AIME and GPQA-Diamond benchmarks.
Key Examples
- OpenAI o1 (Sep 2024) — first widely deployed reasoning model; strong on AIME/MATH.
- OpenAI o3 (2025) — ~87.7% on GPQA-Diamond; pushed the frontier on formal reasoning tasks.
- DeepSeek R1 (Jan 2025) — open-weight MIT-licensed model matching o1 performance at a fraction of the cost; 97.3% MATH-500, 79.8% AIME 2024 pass@1.
- extended-thinking / Claude Sonnet 4.6 and Opus — Anthropic's implementation, interleaved with tool use in agentic-workflows.
Trade-offs
| Dimension | Reasoning model | Fast model |
|---|---|---|
| Latency | High (seconds to minutes) | Low |
| Cost | High (thinking tokens billed) | Low |
| Hard math / logic | Strong | Weak |
| Simple tasks | Overkill | Sufficient |
| Streaming UX | Poor (long wait before the first answer token) | Good |
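The cost row can be made concrete with some rough arithmetic. The sketch below assumes thinking tokens are billed at the same rate as output tokens and ignores input tokens; the price of 10 USD per million output tokens is a made-up figure, and the multipliers 5 and 20 are the ends of the 5–20× range quoted under Thinking Tokens.

```python
def estimate_cost(output_tokens: int,
                  thinking_multiplier: float,
                  price_per_million_output: float) -> float:
    """Estimated USD cost of one response.

    Assumption: thinking tokens are billed at the output-token rate.
    Input tokens are ignored to keep the example small.
    """
    thinking_tokens = output_tokens * thinking_multiplier
    billed_tokens = output_tokens + thinking_tokens
    return billed_tokens * price_per_million_output / 1_000_000


PRICE = 10.0  # hypothetical USD per million output tokens

fast = estimate_cost(500, 0, PRICE)    # 500 billed tokens
low = estimate_cost(500, 5, PRICE)     # 3,000 billed tokens
high = estimate_cost(500, 20, PRICE)   # 10,500 billed tokens
print(f"{fast:.3f} {low:.3f} {high:.3f}")  # 0.005 0.030 0.105
```

Under these assumptions, a 500-token answer costs 6× to 21× as much once thinking tokens are included, before accounting for any difference in per-token price between the two model classes.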
Evaluation Benchmarks
- AIME — American Invitational Mathematics Examination problems; hard high-school math with exact numeric answers. Commonly reported as pass@1 or consensus accuracy.
- MATH-500 — 500-problem subset of the MATH benchmark; tests algebra, number theory, geometry, etc.
- GPQA-Diamond — graduate-level science questions requiring expert knowledge; harder to memorize than AIME.
- SWE-bench — real-world software engineering issues; used to evaluate agentic-workflows with tool use rather than pure reasoning.
See evals for full benchmark context.
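The two AIME reporting styles mentioned above differ in how they aggregate repeated samples for one problem. A minimal sketch, assuming answers have already been normalized to strings and that pass@1 is estimated by averaging correctness over several independent samples; the function names and the sample data are illustrative only:

```python
from collections import Counter


def pass_at_1(samples: list[str], reference: str) -> float:
    """Estimate pass@1 as the fraction of sampled answers that are correct."""
    return sum(s == reference for s in samples) / len(samples)


def consensus_correct(samples: list[str], reference: str) -> bool:
    """Majority vote: is the most frequent sampled answer the correct one?"""
    top_answer, _count = Counter(samples).most_common(1)[0]
    return top_answer == reference


# Five hypothetical samples for a single problem whose answer is "204".
samples = ["204", "204", "210", "204", "198"]
print(pass_at_1(samples, "204"))          # 0.6
print(consensus_correct(samples, "204"))  # True
```

Consensus accuracy can exceed pass@1 because wrong answers tend to scatter while correct ones agree. Ties in the vote are resolved here by first occurrence, which is an arbitrary choice.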
When to Use
Use reasoning models for: multi-step math, formal proofs, complex debugging, planning under constraints, and agentic-workflows where wrong early moves are costly.
Skip reasoning models for: simple completions, formatting, summarization, quick lookups, or any latency-sensitive application where the thinking overhead is not justified by the task difficulty.
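The guidance above can be expressed as a simple routing rule. This is a deliberately crude sketch: the task labels, the `choose_model` function, and the choice to let latency sensitivity always win are assumptions made for illustration, not a recommended policy.

```python
# Hypothetical task labels mirroring the "use reasoning models for" list.
REASONING_TASKS = {
    "multi_step_math",
    "formal_proof",
    "complex_debugging",
    "constrained_planning",
    "agentic_workflow",
}


def choose_model(task_kind: str, latency_sensitive: bool) -> str:
    """Return "reasoning" or "fast" for a task.

    Simplification: a latency-sensitive request always goes to the fast
    model, even though a hard enough task might justify the overhead.
    """
    if latency_sensitive:
        return "fast"
    return "reasoning" if task_kind in REASONING_TASKS else "fast"


print(choose_model("formal_proof", latency_sensitive=False))   # reasoning
print(choose_model("summarization", latency_sensitive=False))  # fast
```

A production router would more likely weigh estimated task difficulty against a latency budget than use a fixed label set.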
Related
extended-thinking · evals · agentic-workflows · deepseek-r1