OpenAI o1

OpenAI's first dedicated reasoning model. Uses chain-of-thought "thinking tokens" — an internal scratchpad that the model reasons through before producing a final response — rather than answering immediately. Marks a shift from single-pass generation toward deliberate multi-step reasoning as a first-class capability.

Release Timeline

  • o1-preview — September 12, 2024; early access for ChatGPT Plus and Team users
  • o1-mini — September 12, 2024; faster, cheaper variant optimized for coding and math; 80% cheaper than o1-preview
  • o1 (full) — December 5, 2024 in ChatGPT; API access from December 17, 2024; 200K context and improved performance across all domains

Architecture

Internal chain-of-thought: the model generates reasoning tokens (not shown to the user by default) before producing its output. More reasoning tokens generally improve performance on hard problems at the cost of higher latency and token cost. On o1, reasoning tokens are billed as output tokens even though they are not returned in the response.
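
The split between hidden reasoning and the visible answer can be read off a usage record. The following is a minimal sketch, assuming a usage payload shaped like the one the Chat Completions API returns for o1 (a `completion_tokens` total with a nested `completion_tokens_details.reasoning_tokens` count). The helper name `split_completion_tokens` and the sample numbers are hypothetical.

```python
from typing import Any, Mapping


def split_completion_tokens(usage: Mapping[str, Any]) -> dict:
    """Split billed completion tokens into hidden reasoning and visible output.

    Assumes `usage` looks like:
        {"prompt_tokens": ..., "completion_tokens": ...,
         "completion_tokens_details": {"reasoning_tokens": ...}}
    """
    completion = int(usage["completion_tokens"])
    details = usage.get("completion_tokens_details") or {}
    reasoning = int(details.get("reasoning_tokens", 0))
    visible = completion - reasoning
    if visible < 0:
        raise ValueError("reasoning_tokens exceeds completion_tokens")
    return {
        "reasoning_tokens": reasoning,
        "visible_tokens": visible,
        # Fraction of the billed completion spent on hidden reasoning.
        "reasoning_share": reasoning / completion if completion else 0.0,
    }


# Hypothetical usage record, for illustration only.
sample_usage = {
    "prompt_tokens": 120,
    "completion_tokens": 2000,
    "completion_tokens_details": {"reasoning_tokens": 1600},
}
print(split_completion_tokens(sample_usage))
```

In this made-up record, 1,600 of the 2,000 completion tokens are reasoning, so only 400 tokens reach the user.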

Context Window

  • o1: 200K tokens (100K max output)
  • o1-mini and o1-preview: 128K tokens
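
A request only fits if the prompt and the completion budget together stay inside the window. The following is a rough sketch of that check, assuming "K" means 1,000 tokens and that the completion budget covers both hidden reasoning tokens and the visible answer. The function name `fits_context` is hypothetical.

```python
# Context window sizes in tokens, taken from the list above (K read as 1,000).
CONTEXT_WINDOW = {
    "o1": 200_000,
    "o1-mini": 128_000,
    "o1-preview": 128_000,
}


def fits_context(model: str, prompt_tokens: int, max_completion_tokens: int) -> bool:
    """Return True if prompt plus completion budget fits the model's window.

    `max_completion_tokens` is treated as the combined budget for hidden
    reasoning tokens and the visible answer (an assumption of this sketch).
    """
    if prompt_tokens < 0 or max_completion_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return prompt_tokens + max_completion_tokens <= CONTEXT_WINDOW[model]


print(fits_context("o1", 150_000, 40_000))       # True: 190,000 <= 200,000
print(fits_context("o1-mini", 100_000, 40_000))  # False: 140,000 > 128,000
```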

Benchmarks

Benchmark                            Score
AIME 2024 (single sample)            74% (11.1/15)
AIME 2024 (consensus, 64 samples)    83% (12.5/15)
MATH-500                             97%
GPQA Diamond (PhD science)           77.3%

GPQA Diamond context: PhD-level human experts score ~69.7%. o1 was the first model to surpass expert-level performance on this benchmark. AIME is the American Invitational Mathematics Examination; a score of 13.9/15 (best-of-1000 sample re-ranking) places the model in the top 500 US high school students nationally.
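
The "consensus" rows above refer to sampling many answers and keeping the most common one. The sketch below illustrates that idea with a plain majority vote, and also shows how a percentage maps to problems solved on the 15-question AIME. It is an illustration only, not OpenAI's evaluation code; the function names and the sample answers are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def consensus_answer(samples: Sequence[Hashable]) -> Hashable:
    """Return the most frequent answer among independently sampled answers.

    On a tie, the answer seen first wins, because Counter orders elements
    with equal counts by first appearance.
    """
    if not samples:
        raise ValueError("need at least one sample")
    return Counter(samples).most_common(1)[0][0]


def problems_solved(accuracy: float, total: int = 15) -> float:
    """Convert an accuracy fraction into an average number of problems solved."""
    return accuracy * total


# Hypothetical final answers from five samples on one AIME-style problem.
print(consensus_answer([204, 113, 204, 204, 97]))  # 204
print(round(problems_solved(0.74), 1))             # 11.1
print(round(13.9 / 15 * 100))                      # 93
```

The last line shows that the 13.9/15 re-ranking result corresponds to roughly 93% accuracy.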

Pricing

  • Input: $15.00 per million tokens
  • Output: $60.00 per million tokens

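Substantially more expensive than gpt-4o: roughly six times its per-token price at the time of the full o1 release ($2.50 input and $10.00 output per million tokens). The gap is wider in practice because hidden reasoning tokens are billed as output tokens. o1-mini launched at 80% below o1-preview's price.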
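
A back-of-the-envelope cost estimate makes the effect of thinking tokens concrete. This is a minimal sketch using the prices listed above and assuming hidden reasoning tokens are billed at the output rate. The function name `estimate_cost` and the token counts are hypothetical.

```python
# Prices in US dollars per million tokens, from the list above.
INPUT_PRICE_PER_M = 15.00
OUTPUT_PRICE_PER_M = 60.00


def estimate_cost(prompt_tokens: int, reasoning_tokens: int, visible_tokens: int) -> float:
    """Estimate the dollar cost of one request.

    Assumes hidden reasoning tokens are billed at the output rate.
    """
    input_cost = prompt_tokens * INPUT_PRICE_PER_M / 1_000_000
    output_cost = (reasoning_tokens + visible_tokens) * OUTPUT_PRICE_PER_M / 1_000_000
    return input_cost + output_cost


# Hypothetical request: 1,000 prompt tokens, 4,000 reasoning, 500 visible.
cost = estimate_cost(1_000, 4_000, 500)
print(f"${cost:.3f}")  # $0.285
```

In this example the 500 visible tokens account for only $0.03 of the total; the hidden reasoning accounts for $0.24.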

Strengths

  • Hard math and science: best-in-class at launch on competition math, physics, chemistry, and biology at PhD level
  • Complex multi-step reasoning: problems that require planning, backtracking, or sequential deduction
  • Code correctness: produces fewer logical errors on algorithmic problems than non-reasoning models

Weaknesses

  • Latency: thinking tokens mean noticeably slower responses than standard models
  • Cost: $60/M output tokens is expensive; impractical for high-volume use
  • No tool use at launch: o1-preview and o1-mini shipped without web search or function calling; function calling arrived with the full o1 API release in December 2024
  • Hidden reasoning: raw thinking tokens are not returned by the API, so intermediate steps cannot be inspected or streamed

Successors

o1 is superseded by the o3 family: o3 (benchmarks announced December 2024, released April 2025), o3-mini (released January 2025), and o4-mini (released April 2025). The reasoning model line continues as OpenAI's dedicated compute-intensive tier alongside the gpt-4o and GPT-5 families.

Related

gpt-4o · deepseek-r1 · extended-thinking · evals
