Evals (Benchmarks)
Benchmarks are standardized tests for measuring model capabilities. Every model release cites benchmark scores, so interpreting the numbers correctly requires knowing what each benchmark actually measures.
Key Benchmarks
SWE-bench Verified
Measures ability to resolve real GitHub issues. Tasks are drawn from open-source repos, each with a failing test. The model must write code that makes the test pass. "Verified" means human validators confirmed the issues and solutions are legitimate.
What it tests: Autonomous coding on realistic software engineering tasks.
Scores to know: claude-sonnet-4-6 79.6%. Current top scores on the leaderboard reach 93.9%.
Note: A "human developer baseline ~50%" figure has circulated informally but does not appear on the official benchmark site and should not be treated as authoritative.
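As a rough illustration of how a percentage like this comes about, the sketch below computes a resolved rate from per-task outcomes. The `TaskResult` type and its field names are hypothetical. The rule that previously passing tests must keep passing is an assumption about the harness rather than something stated above.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one benchmark task (hypothetical structure for illustration)."""
    task_id: str
    target_tests_pass: bool      # the originally failing tests now pass
    regression_tests_pass: bool  # ASSUMPTION: tests that passed before still pass


def is_resolved(result: TaskResult) -> bool:
    # A task counts only if the fix works and nothing else broke.
    return result.target_tests_pass and result.regression_tests_pass


def resolved_rate(results: Sequence[TaskResult]) -> float:
    """Return the percentage of tasks resolved."""
    if not results:
        raise ValueError("no results to score")
    resolved = sum(is_resolved(r) for r in results)
    return 100.0 * resolved / len(results)
```

Under this scoring a partially correct patch earns nothing, so each task is all-or-nothing.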
OSWorld
Measures computer use: controlling a desktop GUI to complete tasks. The agent sees screenshots and acts through the mouse and keyboard.
What it tests: Real computer automation across GUI applications.
Scores to know: claude-opus-4-6 72.7%, claude-sonnet-4-6 72.5%, human baseline 72.36%.
AIME (American Invitational Mathematics Examination)
Standardized high school math competition problems. Hard for most students, used as a proxy for frontier-level mathematical reasoning.
Important: Scores are often reported across different test years — AIME 2024 and AIME 2025 are distinct tests and scores are not directly comparable.
Scores to know: deepseek-r1 79.8% on AIME 2024; gemma-4 89.2% on AIME 2025.
MATH-500
500 competition mathematics problems across difficulty levels. Measures mathematical reasoning and calculation.
Scores to know: deepseek-r1 97.3% — near ceiling.
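To make "near ceiling" concrete, the arithmetic below converts a score into remaining headroom. It is a minimal sketch that assumes the reported score is the plain fraction of the 500 problems solved; the function name is illustrative.

```python
def headroom(percent: float, n_problems: int) -> tuple[float, float]:
    """Return (percentage points left, equivalent number of problems)."""
    remaining_points = 100.0 - percent
    # ASSUMPTION: every problem carries equal weight in the score.
    remaining_problems = n_problems * remaining_points / 100.0
    return remaining_points, remaining_problems


points, problems = headroom(97.3, 500)
print(f"{points:.1f} points left, about {problems:.1f} problems")
```

With so few problems left unsolved, a difference of a point or two between models rests on only a handful of problems.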
LiveCodeBench
Coding benchmark using problems released after model training cutoffs. Harder to game than static benchmarks because models can't have seen the problems during training.
What it tests: Genuine coding ability, not memorization.
Scores to know: gemini-2-5-pro and gemma-4 (80%) lead.
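The post-cutoff idea can be sketched as a date filter: a model is scored only on problems released after its training cutoff. This is an illustration of the principle, not the benchmark's actual tooling. The `Problem` type, the problem IDs, and the dates are all made up for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable


@dataclass(frozen=True)
class Problem:
    """A benchmark problem with its public release date (hypothetical structure)."""
    problem_id: str
    released: date


def post_cutoff(problems: Iterable[Problem], training_cutoff: date) -> list[Problem]:
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Made-up problems and cutoff, for illustration only.
pool = [
    Problem("p1", date(2024, 3, 1)),
    Problem("p2", date(2024, 9, 15)),
    Problem("p3", date(2025, 1, 10)),
]
eligible = post_cutoff(pool, training_cutoff=date(2024, 6, 30))
print([p.problem_id for p in eligible])
```

Because each model has its own cutoff, two models may be scored on different problem subsets, which is worth checking before comparing their numbers.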
Chatbot Arena / LMSYS Arena
Human preference benchmark. Real users have conversations with two anonymous models and pick the better response. Aggregated into Elo-style rankings.
What it tests: General preference across real user tasks, not a single domain.
Why it matters: It tracks what people actually prefer rather than capability on a specific type of test.
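To show how pairwise votes can turn into a ranking, here is a minimal sketch of the classic online Elo update. It is not a description of the Arena's actual pipeline, which is not specified here. The K-factor of 32, the starting rating of 1000, and the model names are assumed values chosen for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def record_vote(
    ratings: dict[str, float],
    winner: str,
    loser: str,
    k: float = 32.0,        # ASSUMPTION: illustrative K-factor
    initial: float = 1000.0,  # ASSUMPTION: illustrative starting rating
) -> None:
    """Update ratings in place after one user picks `winner` over `loser`."""
    r_w = ratings.get(winner, initial)
    r_l = ratings.get(loser, initial)
    # The winner gains more when the win was unexpected.
    delta = k * (1.0 - expected_score(r_w, r_l))
    ratings[winner] = r_w + delta
    ratings[loser] = r_l - delta


ratings: dict[str, float] = {}
for winner, loser in [("model-a", "model-b"), ("model-a", "model-c"), ("model-c", "model-b")]:
    record_vote(ratings, winner, loser)
print(sorted(ratings, key=ratings.get, reverse=True))
```

Each update is zero-sum, so a rating only has meaning relative to the other models in the same pool.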
Interpreting Scores
- Single benchmark = narrow signal: A model that leads SWE-bench might trail on Arena. Use multiple benchmarks.
- Training contamination: If problems appeared in training data, scores inflate. LiveCodeBench addresses this with post-cutoff problems.
- Benchmark saturation: Once top scores approach the ceiling, as with MATH-500 at about 97%, a benchmark stops differentiating between models, and newer, harder benchmarks take its place.
- Task-type specificity: OSWorld tells you nothing about math. AIME tells you nothing about coding. Match the benchmark to your actual use case.
- Test-year specificity: AIME scores must specify the year (2024, 2025, etc.) — different years are different exams.
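One way to enforce the task-type and test-year rules above is to tag every score with its benchmark and variant, and refuse to compare scores whose tags differ. The sketch below is illustrative: the `Score` type and function names are hypothetical, and the numbers are the ones quoted in this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Score:
    """A reported benchmark score (hypothetical structure for illustration)."""
    model: str
    benchmark: str
    variant: Optional[str]  # e.g. the test year for AIME; None if not applicable
    percent: float


def comparable(a: Score, b: Score) -> bool:
    """Scores are comparable only on the same benchmark and the same variant."""
    return a.benchmark == b.benchmark and a.variant == b.variant


def margin(a: Score, b: Score) -> float:
    """Return a's lead over b in percentage points, or raise if not comparable."""
    if not comparable(a, b):
        raise ValueError(
            f"cannot compare {a.benchmark} {a.variant} with {b.benchmark} {b.variant}"
        )
    return a.percent - b.percent


r1 = Score("deepseek-r1", "AIME", "2024", 79.8)
g4 = Score("gemma-4", "AIME", "2025", 89.2)
print(comparable(r1, g4))  # different exam years
```

Raising an error instead of returning a number makes a mismatched comparison fail visibly rather than produce a misleading margin.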
Related
claude-sonnet-4-6 · gemini-2-5-pro · deepseek-r1 · gemma-4 · computer-use