MATH

Benchmark

Competition-level mathematics (AMC, AIME, ARML) across 5 difficulty levels and 7 subject areas. Far harder than GSM8K, and still discriminating at the frontier.

MATH (Hendrycks et al., 2021) contains 12,500 competition mathematics problems sourced from AMC, AIME, and ARML competitions. Topics span seven subjects, including algebra, counting and probability, geometry, number theory, and precalculus, at difficulty levels 1–5. Level 5 problems are challenging even for professional mathematicians working without a calculator.
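
Each record pairs a problem statement with a full LaTeX solution whose final answer is wrapped in a \boxed{} macro. A minimal loading sketch, assuming the Hugging Face datasets library and the community hendrycks/competition_math mirror (the dataset ID and field names are assumptions, not part of the original release):

```python
# Sketch: inspecting MATH via Hugging Face datasets.
# The dataset ID "hendrycks/competition_math" and the field names
# ("problem", "level", "type", "solution") are assumptions; mirrors vary.
from datasets import load_dataset

math_test = load_dataset("hendrycks/competition_math", split="test")

ex = math_test[0]
print(ex["level"])     # e.g. "Level 5"
print(ex["type"])      # subject area, e.g. "Number Theory"
print(ex["problem"])   # LaTeX problem statement
print(ex["solution"])  # worked solution ending in \boxed{...}
```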

Structure

| Property | Detail |
| --- | --- |
| Problems | 7,500 train / 5,000 test |
| Difficulty | 1 (easiest) – 5 (hardest) |
| Topics | Prealgebra, Algebra, Intermediate Algebra, Counting & Probability, Geometry, Number Theory, Precalculus |
| Metric | Exact match, with equivalent expressions accepted |
| Answer format | LaTeX \boxed{} answer |
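
Because equivalent expressions are accepted, graders typically extract the contents of the final \boxed{} and compare answers symbolically rather than as raw strings. A minimal sketch of that idea using sympy; the extraction logic and normalization here are illustrative assumptions, not the official MATH grader:

```python
# Sketch of boxed-answer extraction and equivalence checking.
# Illustrative only, not the official MATH grader; symbolic comparison
# uses sympy's LaTeX parser (requires antlr4-python3-runtime).
from sympy import simplify
from sympy.parsing.latex import parse_latex

def extract_boxed(solution: str):
    """Return the contents of the last \\boxed{...}, handling nested braces."""
    start = solution.rfind("\\boxed{")
    if start == -1:
        return None
    i, depth, out = start + len("\\boxed{"), 1, []
    while i < len(solution) and depth:
        ch = solution[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if depth:
            out.append(ch)
        i += 1
    return "".join(out)

def is_equivalent(pred: str, ref: str) -> bool:
    """Strict string match, falling back to symbolic equality."""
    if pred.strip() == ref.strip():
        return True
    try:
        return simplify(parse_latex(pred) - parse_latex(ref)) == 0
    except Exception:  # unparseable LaTeX: keep only the strict match
        return False

ref = extract_boxed(r"The answer is $\boxed{\frac{1}{2}}$.")
print(ref)                        # \frac{1}{2}
print(is_equivalent("0.5", ref))  # True: symbolically equal
```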

Scores

| Model | Accuracy (overall) |
| --- | --- |
| GPT-3 (2021 baseline) | 6.9% |
| GPT-4 (2023) | 52.9% |
| Claude 3.5 Sonnet | 71.1% |
| o1 (2024) | 83.3% |
| o3 (2025) | 96.7% |
| Llama 3.1 70B | 68.0% |

Why MATH Matters

MATH remains discriminating because it requires symbolic manipulation, proof-like reasoning, and exact answer formatting — not just numerical approximation. The "reasoning model" era (o1, o3, DeepSeek-R1) was largely defined by jumps on MATH Level 5 and AIME rather than on saturated benchmarks like GSM8K.

AIME Extension

The AIME (American Invitational Mathematics Examination) subset has become a standalone benchmark for evaluating top-tier reasoning models. Its problems demand long chains of exact symbolic reasoning, with each answer an integer from 0 to 999. Only o3-class and DeepSeek-R1-class models reliably score above 50% on AIME 2024.