Reasoning Models

A class of language models that generate an internal chain-of-thought trace before producing a final answer. The scratchpad reasoning is either hidden from the user or exposed in tagged blocks; either way, it drives significantly better performance on problems requiring multi-step logic, math, or planning.

What Distinguishes Them

Standard chat models produce tokens left-to-right without a dedicated deliberation phase. Reasoning models allocate a separate budget of "thinking tokens": a scratchpad where the model plans, reconsiders, and self-corrects before committing to a response. This behavior is typically trained via reinforcement learning on verifiable outcomes (correct math answers, passing tests) rather than on human preference labels alone.

The practical effect: on hard problems, the model can explore multiple solution paths and back out of dead ends before the final output, rather than being committed to the first approach it starts writing.
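To make "verifiable outcomes" concrete, here is a minimal sketch of the kind of reward function such training could use for math problems. The "Answer:" marker and the function names are assumptions made for illustration; they are not taken from any specific training recipe.

```python
import re
from fractions import Fraction
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker (hypothetical format), or None."""
    matches = re.findall(r"Answer:\s*([^\n]+)", completion)
    return matches[-1].strip() if matches else None


def math_reward(completion: str, reference: str) -> float:
    """Binary reward: 1.0 if the final answer equals the reference exactly, else 0.0.

    Answers are compared as exact rationals, so '3/6' and '0.5' both match '1/2'.
    """
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    try:
        return 1.0 if Fraction(answer) == Fraction(reference) else 0.0
    except (ValueError, ZeroDivisionError):
        # Unparseable answers earn no reward.
        return 0.0


if __name__ == "__main__":
    trace = "Try x = 2... that fails. Back up and factor instead.\nAnswer: 3/6"
    print(math_reward(trace, "1/2"))
```

The reward depends only on the final answer, not on the scratchpad, so the model is free to explore and backtrack as long as it ends up correct.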

Thinking Tokens

Thinking tokens often outnumber output tokens by 5–20× on complex problems. Costs and latency scale accordingly.
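A rough sketch of how the 5–20× multiplier translates into billed tokens and cost. It assumes thinking tokens are billed at the same rate as output tokens, and the price used below is a made-up placeholder, not any provider's actual rate.

```python
def estimate_cost(output_tokens: int,
                  thinking_multiplier: float,
                  usd_per_million_output: float) -> dict:
    """Estimate billed tokens and cost when thinking tokens are billed like output tokens."""
    thinking_tokens = int(output_tokens * thinking_multiplier)
    billed_tokens = output_tokens + thinking_tokens
    cost_usd = billed_tokens * usd_per_million_output / 1_000_000
    return {
        "thinking_tokens": thinking_tokens,
        "billed_tokens": billed_tokens,
        "cost_usd": cost_usd,
    }


if __name__ == "__main__":
    PRICE = 10.0  # hypothetical USD per million output tokens
    for multiplier in (5, 20):
        print(multiplier, estimate_cost(500, multiplier, PRICE))
```

For a 500-token answer, the low end of the range bills 3,000 tokens and the high end 10,500, so the visible answer is a small fraction of what is paid for.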

OpenAI o-series: thinking tokens are hidden; users see only the final response.
DeepSeek R1: thinking exposed in <think>...</think> tags in the raw output.
extended-thinking (Claude): thinking exposed as thinking content blocks in the API response; billed even when not surfaced to the end user.
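The two exposed formats above can be separated from the final answer with a few lines of code. This is a sketch only: the first helper handles <think>...</think> tags in raw text, and the second assumes content blocks arrive as plain dicts with "type", "thinking", and "text" keys, which mirrors the block shape described above but is not a client-library API.

```python
import re

# Non-greedy match so multiple <think> blocks are captured separately.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think_tags(raw: str) -> tuple[str, str]:
    """Split raw text with <think>...</think> tags into (thinking, answer)."""
    thinking = "\n".join(part.strip() for part in THINK_BLOCK.findall(raw))
    answer = THINK_BLOCK.sub("", raw).strip()
    return thinking, answer


def split_content_blocks(blocks: list[dict]) -> tuple[str, str]:
    """Split a list of content-block dicts (assumed shape) into (thinking, answer)."""
    thinking = "\n".join(
        b.get("thinking", "") for b in blocks if b.get("type") == "thinking"
    )
    answer = "".join(b.get("text", "") for b in blocks if b.get("type") == "text")
    return thinking, answer


if __name__ == "__main__":
    raw = "<think>2+2 is 4.\nDouble-check: yes.</think>\nThe answer is 4."
    print(split_think_tags(raw))

    blocks = [
        {"type": "thinking", "thinking": "Plan the steps."},
        {"type": "text", "text": "Done."},
    ]
    print(split_content_blocks(blocks))
```

Either way, an application would normally show only the answer part to end users and log the thinking part separately.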

Research (Liu et al., "Think Deep, Not Just Long", 2025) reports that the proportion of deep-thinking tokens in a generation, not just the total token count, is a reliable predictor of accuracy on the AIME and GPQA-Diamond benchmarks.

Key Examples

OpenAI o1 (Sep 2024): first widely deployed reasoning model; strong on AIME/MATH.
OpenAI o3 (2025): reported 87.7% on GPQA-Diamond; pushed the frontier on formal reasoning tasks.
DeepSeek R1 (Jan 2025): open-weight, MIT-licensed model with performance comparable to o1 at a fraction of the cost; 97.3% on MATH-500 and 79.8% pass@1 on AIME 2024.
extended-thinking (Claude Sonnet 4.6 and Opus): Anthropic's implementation; thinking can be interleaved with tool use in agentic-workflows.

Trade-offs

Dimension	Reasoning model	Fast model
Latency	High (seconds to minutes)	Low
Cost	High (thinking tokens billed)	Low
Hard math / logic	Strong	Weak
Simple tasks	Overkill	Sufficient
Streaming UX	Poor	Good

Evaluation Benchmarks

AIME: problems from the American Invitational Mathematics Examination; hard high-school competition math where every answer is an integer from 0 to 999. Commonly reported as pass@1 or consensus (majority-vote) accuracy.
MATH-500: 500-problem subset of the MATH benchmark; tests algebra, number theory, geometry, etc.
GPQA-Diamond: graduate-level science questions requiring expert knowledge; designed to be "Google-proof", so hard to answer by lookup or memorization.
SWE-bench: real-world software engineering issues drawn from GitHub repositories; used to evaluate agentic-workflows with tool use rather than pure reasoning.
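The two AIME reporting styles mentioned above differ in how multiple samples per problem are aggregated. A minimal sketch, assuming each problem has several sampled final answers as strings; the answer values below are invented for illustration.

```python
from collections import Counter


def pass_at_1(samples_per_problem: list[list[str]], references: list[str]) -> float:
    """Estimate pass@1 as the mean per-sample correctness, averaged over problems."""
    per_problem = [
        sum(sample == ref for sample in samples) / len(samples)
        for samples, ref in zip(samples_per_problem, references)
    ]
    return sum(per_problem) / len(per_problem)


def consensus_accuracy(samples_per_problem: list[list[str]], references: list[str]) -> float:
    """Fraction of problems where the majority-vote answer matches the reference.

    Ties are broken in favor of the answer that was sampled first.
    """
    correct = 0
    for samples, ref in zip(samples_per_problem, references):
        majority, _count = Counter(samples).most_common(1)[0]
        correct += majority == ref
    return correct / len(references)


if __name__ == "__main__":
    samples = [
        ["204", "204", "113", "204"],  # 3 of 4 samples correct
        ["25", "70", "70", "33"],      # 2 of 4 correct, but "70" wins the vote
    ]
    refs = ["204", "70"]
    print(pass_at_1(samples, refs))           # 0.625
    print(consensus_accuracy(samples, refs))  # 1.0
```

Consensus accuracy is usually higher than pass@1 because occasional wrong samples are outvoted, which is why the two numbers should not be compared directly.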

See evals for full benchmark context.

When to Use

Use reasoning models for: multi-step math, formal proofs, complex debugging, planning under constraints, and agentic-workflows where wrong early moves are costly.

Skip reasoning models for: simple completions, formatting, summarization, quick lookups, or any latency-sensitive application where the thinking overhead is not justified by the task difficulty.
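The guidance above can be expressed as a simple routing rule. This is a sketch under stated assumptions: the task labels, the Request type, and the 10-second latency threshold are all hypothetical and would need tuning for a real application.

```python
from dataclasses import dataclass

# Hypothetical task labels for work that tends to benefit from a thinking phase.
REASONING_TASKS = {"math", "proof", "debugging", "planning", "agentic"}


@dataclass
class Request:
    task_type: str           # e.g. "math", "summarization", "lookup"
    latency_budget_s: float  # how long the caller can wait for a response


def choose_model(req: Request, min_reasoning_latency_s: float = 10.0) -> str:
    """Return "reasoning" only for hard task types with enough latency budget."""
    is_hard = req.task_type in REASONING_TASKS
    has_time = req.latency_budget_s >= min_reasoning_latency_s
    return "reasoning" if is_hard and has_time else "fast"


if __name__ == "__main__":
    print(choose_model(Request("proof", 60.0)))          # reasoning
    print(choose_model(Request("summarization", 60.0)))  # fast
    print(choose_model(Request("math", 1.0)))            # fast
```

Defaulting unknown task types to the fast model keeps cost and latency low unless there is a clear reason to pay for thinking tokens.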

Related

extended-thinking · evals · agentic-workflows · deepseek-r1
