MATH (Hendrycks et al., 2021) contains 12,500 competition mathematics problems drawn from contests such as AMC 10, AMC 12, and AIME. Topics span algebra, counting and probability, geometry, number theory, and precalculus across difficulty levels 1–5. All problems are solvable without a calculator, but Level 5 problems challenge even strong human solvers.
Structure
| Property | Detail |
|---|---|
| Problems | 7,500 train / 5,000 test |
| Difficulty | 1 (easiest) – 5 (hardest) |
| Topics | Prealgebra, Algebra, Intermediate Algebra, Counting & Probability, Geometry, Number Theory, Precalculus |
| Metric | Exact match on the final answer, with normalization so equivalent expressions count as correct |
| Answer format | Final answer wrapped in a LaTeX \boxed{} expression |
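Scoring therefore hinges on extracting the \boxed{} answer and normalizing it before comparison. Below is a minimal sketch of such a grader; it is illustrative only, not the official evaluation code from the hendrycks/math repository, and its normalization is far lighter than what real harnesses apply:

```python
import re

def extract_boxed(solution: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a solution string.

    Walks braces manually because \\boxed{\\frac{1}{2}} nests, which a
    plain regex would truncate at the first closing brace.
    """
    start = solution.rfind(r"\boxed{")
    if start == -1:
        return None
    i = start + len(r"\boxed{")
    depth = 1
    out = []
    while i < len(solution):
        c = solution[i]
        if c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:
                break
        out.append(c)
        i += 1
    return "".join(out)

def normalize(ans: str) -> str:
    """Light normalization: drop whitespace, \\left/\\right, trailing units."""
    ans = ans.strip().replace(r"\left", "").replace(r"\right", "")
    ans = ans.replace(" ", "")
    ans = re.sub(r"\\text\{.*?\}$", "", ans)  # e.g. trailing "\text{ cm}"
    return ans

def is_correct(model_solution: str, reference_answer: str) -> bool:
    pred = extract_boxed(model_solution)
    return pred is not None and normalize(pred) == normalize(reference_answer)

# Nested braces survive extraction and normalization:
assert is_correct(r"... so the answer is \boxed{\frac{1}{2}}.", r"\frac{1}{2}")
```

Full harnesses go further (fraction canonicalization, symbolic equivalence via a CAS), which is why reported scores can vary a point or two between evaluation setups.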
Scores
| Model | Reported accuracy |
|---|---|
| GPT-3 (2021 baseline) | 6.9% |
| GPT-4 (2023) | 52.9% |
| Llama 3.1 70B | 68.0% |
| Claude 3.5 Sonnet | 71.1% |
| o1 (2024) | 83.3%* |
| o3 (2025) | 96.7%* |

*The o1 and o3 figures are OpenAI's reported AIME 2024 scores (o1 with 64-sample consensus), not overall MATH accuracy; the other rows are MATH test-set accuracy.
Why MATH Matters
MATH remains discriminating because it requires symbolic manipulation, proof-like reasoning, and exact answer formatting — not just numerical approximation. The "reasoning model" era (o1, o3, DeepSeek-R1) was largely defined by jumps on MATH Level 5 and AIME rather than on saturated benchmarks like GSM8K.
AIME Extension
The AIME (American Invitational Mathematics Examination) subset has become a standalone benchmark for top-tier reasoning models. Each AIME problem has a single integer answer between 0 and 999, so grading is exact even though solutions demand long multi-step derivations. Only reasoning-focused models such as o1, o3, and DeepSeek-R1 reliably score above 50% on AIME 2024.
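Because every answer is an integer in [0, 999], AIME grading reduces to exact integer comparison. A hedged sketch follows; the grade_aime helper and its last-number heuristic are illustrative assumptions, not any official harness, which would typically parse an explicit \boxed{} or "Answer:" marker instead:

```python
import re

def grade_aime(model_output: str, reference: int) -> bool:
    """Grade an AIME response by exact integer match.

    Heuristic: take the last standalone 1-3 digit number in the output
    as the model's final answer (assumes the model states it last).
    """
    matches = re.findall(r"(?<!\d)\d{1,3}(?!\d)", model_output)
    if not matches:
        return False
    return 0 <= reference <= 999 and int(matches[-1]) == reference

# Example: the model ends its chain of thought with the final answer.
print(grade_aime("... summing over all cases gives 204.", 204))  # True
```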