MATH

Benchmark

Competition-level mathematics (AMC, AIME, ARML) across 5 difficulty levels and 7 subject areas. Far harder than GSM8K, and still discriminating at the frontier.

MATH (Hendrycks et al., 2021) contains 12,500 competition mathematics problems sourced from AMC, AIME, and ARML competitions. Topics span seven subjects, including algebra, counting and probability, geometry, number theory, and precalculus, at difficulty levels 1–5. Level 5 problems are challenging even for professional mathematicians working without a calculator.
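
Each record pairs a problem statement with a full LaTeX solution whose final answer is wrapped in a \boxed{} macro. A minimal loading sketch, assuming the Hugging Face datasets library and the community hendrycks/competition_math mirror (the dataset ID and field names are assumptions, not part of the original release):

```python
# Sketch: inspecting MATH via Hugging Face datasets.
# The dataset ID "hendrycks/competition_math" and the field names
# ("problem", "level", "type", "solution") are assumptions; mirrors vary.
from datasets import load_dataset

math_test = load_dataset("hendrycks/competition_math", split="test")

ex = math_test[0]
print(ex["level"])     # e.g. "Level 5"
print(ex["type"])      # subject area, e.g. "Number Theory"
print(ex["problem"])   # LaTeX problem statement
print(ex["solution"])  # worked solution ending in \boxed{...}
```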

Structure

| Property | Detail |
| --- | --- |
| Problems | 7,500 train / 5,000 test |
| Difficulty | 1 (easiest) – 5 (hardest) |
| Topics | Prealgebra, Algebra, Intermediate Algebra, Counting & Probability, Geometry, Number Theory, Precalculus |
| Metric | Exact match, with equivalent expressions accepted |
| Answer format | LaTeX \boxed{} answer |
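
Because equivalent expressions are accepted, graders typically extract the contents of the final \boxed{} and compare answers symbolically rather than as raw strings. A minimal sketch of that idea using sympy; the extraction logic and normalization here are illustrative assumptions, not the official MATH grader:

```python
# Sketch of boxed-answer extraction and equivalence checking.
# Illustrative only, not the official MATH grader; symbolic comparison
# uses sympy's LaTeX parser (requires antlr4-python3-runtime).
from sympy import simplify
from sympy.parsing.latex import parse_latex

def extract_boxed(solution: str):
    """Return the contents of the last \\boxed{...}, handling nested braces."""
    start = solution.rfind("\\boxed{")
    if start == -1:
        return None
    i, depth, out = start + len("\\boxed{"), 1, []
    while i < len(solution) and depth:
        ch = solution[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if depth:
            out.append(ch)
        i += 1
    return "".join(out)

def is_equivalent(pred: str, ref: str) -> bool:
    """Strict string match, falling back to symbolic equality."""
    if pred.strip() == ref.strip():
        return True
    try:
        return simplify(parse_latex(pred) - parse_latex(ref)) == 0
    except Exception:  # unparseable LaTeX: keep only the strict match
        return False

ref = extract_boxed(r"The answer is $\boxed{\frac{1}{2}}$.")
print(ref)                        # \frac{1}{2}
print(is_equivalent("0.5", ref))  # True: symbolically equal
```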

Scores

| Model | Accuracy (overall) |
| --- | --- |
| GPT-3 (2021 baseline) | 6.9% |
| GPT-4 (2023) | 52.9% |
| Claude 3.5 Sonnet | 71.1% |
| o1 (2024) | 83.3% |
| o3 (2025) | 96.7% |
| Llama 3.1 70B | 68.0% |

Why MATH Matters

MATH remains discriminating because it requires symbolic manipulation, proof-like reasoning, and exact answer formatting — not just numerical approximation. The "reasoning model" era (o1, o3, DeepSeek-R1) was largely defined by jumps on MATH Level 5 and AIME rather than on saturated benchmarks like GSM8K.

AIME Extension

The AIME (American Invitational Mathematics Examination) subset has become a standalone benchmark for evaluating top-tier reasoning models. Its problems demand long chains of exact symbolic reasoning, with each answer an integer from 0 to 999. Only o3-class and DeepSeek-R1-class models reliably score above 50% on AIME 2024.