Evals (Benchmarks)

Standardized tests for measuring model capabilities. Every model release cites benchmark scores — understanding what each benchmark actually measures is necessary to interpret the numbers correctly.

Key Benchmarks

SWE-bench Verified

Measures ability to resolve real GitHub issues. Tasks are drawn from open-source repos, each with a failing test. The model must write code that makes the test pass. "Verified" means human validators confirmed the issues and solutions are legitimate.

What it tests: Autonomous coding on realistic software engineering tasks.

Scores to know: claude-sonnet-4-6 79.6%. Current top scores on the leaderboard reach 93.9%.

Note: A "human developer baseline ~50%" figure has circulated informally but does not appear on the official benchmark site and should not be treated as authoritative.
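A headline score on this kind of benchmark is the share of tasks counted as resolved. The sketch below illustrates that arithmetic. The `TaskResult` record and its field names are hypothetical, not the official harness schema, and it assumes a task only counts as resolved if the previously failing tests pass and no previously passing test breaks.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one task (hypothetical record, not the official schema)."""
    task_id: str
    target_tests_pass: bool     # the tests that failed before the patch now pass
    existing_tests_pass: bool   # assumption: tests that passed before still pass


def resolve_rate(results: list[TaskResult]) -> float:
    """Return the percentage of tasks counted as resolved."""
    if not results:
        raise ValueError("no results to score")
    resolved = sum(r.target_tests_pass and r.existing_tests_pass for r in results)
    return 100.0 * resolved / len(results)


results = [
    TaskResult("task-1", True, True),
    TaskResult("task-2", True, False),   # fixed the issue but broke another test
    TaskResult("task-3", False, True),   # patch did not fix the issue
    TaskResult("task-4", True, True),
]
print(f"{resolve_rate(results):.1f}%")  # 50.0%
```

Each task is pass/fail, so a partially correct patch earns no credit under this scoring.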

OSWorld

Measures computer-use — controlling a desktop GUI to complete tasks. Agent sees screenshots, uses mouse and keyboard.

What it tests: Real computer automation across GUI applications.

Scores to know: claude-opus-4-6 72.7%, claude-sonnet-4-6 72.5%, human baseline 72.36%.

AIME (American Invitational Mathematics Examination)

Standardized high school math competition problems. Hard for most students, used as a proxy for frontier-level mathematical reasoning.

Important: Scores are often reported across different test years — AIME 2024 and AIME 2025 are distinct tests and scores are not directly comparable.

Scores to know: deepseek-r1 79.8% on AIME 2024; gemma-4 89.2% on AIME 2025.

MATH-500

500 competition mathematics problems across difficulty levels. Measures mathematical reasoning and calculation.

Scores to know: deepseek-r1 97.3% — near ceiling.

LiveCodeBench

Coding benchmark using problems released after model training cutoffs. Harder to game than static benchmarks because models can't have seen the problems during training.

What it tests: Genuine coding ability, not memorization.

Scores to know: gemini-2-5-pro and gemma-4 (80%) lead.
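The post-cutoff idea amounts to a date filter: score a model only on problems published after its training data ends. The sketch below illustrates it. The problem records, the `released` field, and the cutoff date are invented for the example and do not reflect the benchmark's actual data format.

```python
from datetime import date


def post_cutoff_problems(problems, training_cutoff):
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p["released"] > training_cutoff]


# Hypothetical problem set; ids and dates are made up for illustration.
problems = [
    {"id": "p1", "released": date(2024, 3, 1)},
    {"id": "p2", "released": date(2024, 9, 15)},
    {"id": "p3", "released": date(2025, 1, 10)},
]
cutoff = date(2024, 6, 30)  # assumed training cutoff for some model

eligible = post_cutoff_problems(problems, cutoff)
print([p["id"] for p in eligible])  # ['p2', 'p3']
```

Because the eligible set depends on each model's cutoff, two models may be scored on different problem windows, which is worth checking before comparing their numbers.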

Chatbot Arena / LMSYS Arena

Human preference benchmark. Real users have conversations with two anonymous models and pick the better response. Aggregated into Elo-style rankings.

What it tests: General preference across real user tasks, not a single domain.

Why it matters: It tracks what people actually prefer rather than capability on specific test types.
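To make "Elo-style" concrete, here is a minimal sketch of the classic Elo update applied to a single pairwise vote. The starting rating of 1000 and the K-factor of 32 are conventional illustrative choices, not values taken from the Arena, whose actual aggregation method may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return both ratings after one head-to-head vote."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two models start level; a user prefers model A's response.
new_a, new_b = update(1000.0, 1000.0, a_won=True)
print(new_a, new_b)  # 1016.0 984.0
```

An upset (a lower-rated model winning) moves ratings more than an expected result does, which is how many individual votes converge into a ranking.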

Interpreting Scores

  • Single benchmark = narrow signal: A model that leads SWE-bench might trail on Arena. Use multiple benchmarks.
  • Training contamination: If problems appeared in training data, scores inflate. LiveCodeBench addresses this with post-cutoff problems.
  • Benchmark saturation: When MATH-500 scores approach 97%, the benchmark stops differentiating between top models. New, harder benchmarks then take its place.
  • Task-type specificity: OSWorld tells you nothing about math. AIME tells you nothing about coding. Match the benchmark to your actual use case.
  • Test-year specificity: AIME scores must specify the year (2024, 2025, etc.) — different years are different exams.
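The comparability rules above can be enforced mechanically when scores are tabulated. The sketch below is one hypothetical way to do it: the `Score` record and `compare` helper are illustrative names, and the figures are the ones quoted earlier in this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Score:
    model: str
    benchmark: str
    variant: Optional[str]  # e.g. the test year for AIME; None if not applicable
    value: float            # percentage


def compare(a: Score, b: Score) -> float:
    """Return a.value - b.value, refusing to compare across different tests."""
    if (a.benchmark, a.variant) != (b.benchmark, b.variant):
        raise ValueError(
            f"not comparable: {a.benchmark} {a.variant} vs {b.benchmark} {b.variant}"
        )
    return a.value - b.value


opus = Score("claude-opus-4-6", "OSWorld", None, 72.7)
sonnet = Score("claude-sonnet-4-6", "OSWorld", None, 72.5)
print(f"{compare(opus, sonnet):+.1f}")  # +0.2

aime_2024 = Score("deepseek-r1", "AIME", "2024", 79.8)
aime_2025 = Score("gemma-4", "AIME", "2025", 89.2)
try:
    compare(aime_2024, aime_2025)
except ValueError as err:
    print(err)  # different test years are rejected
```

Keying on both benchmark and variant makes the year an explicit part of the record, so a cross-year comparison fails loudly instead of producing a misleading difference.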

Related

claude-sonnet-4-6 · gemini-2-5-pro · deepseek-r1 · gemma-4 · computer-use
