GSM8K

Benchmark

Grade School Math 8K: 8,500 linguistically diverse elementary math word problems requiring multi-step arithmetic. For years the standard benchmark for evaluating LLM mathematical reasoning.

GSM8K (Cobbe et al., OpenAI 2021) contains 8,500 grade-school math word problems, each requiring 2–8 arithmetic steps and no advanced mathematical knowledge. Its importance comes from measuring multi-step reasoning rather than memorisation — solutions require chaining logical steps, not just recalling a formula.

Structure

Property         Detail
---------------  -------------------------------------
Problems         7,473 train / 1,319 test
Difficulty       Elementary (grades 3–7)
Solution length  2–8 arithmetic steps
Metric           Accuracy on the final numeric answer
Prompt style     8-shot chain-of-thought
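
A concrete view of the metric: every GSM8K reference solution ends with a "#### <number>" line carrying the final answer, so scoring reduces to extracting that number from the reference and from the model's output, then comparing. A minimal Python sketch; the fallback heuristic for free-form model output (take the last number) is a common convention, not part of the dataset:

import re

def extract_final_answer(text: str) -> str | None:
    """Pull the final numeric answer from a GSM8K-style solution.

    Reference solutions end with '#### <answer>'; for free-form model
    output we fall back to the last number in the text.
    """
    marker = re.search(r"####\s*([-+]?[\d,]*\.?\d+)", text)
    if marker:
        return marker.group(1).replace(",", "")
    numbers = re.findall(r"[-+]?[\d,]*\.?\d+", text)
    return numbers[-1].replace(",", "") if numbers else None

def is_correct(model_output: str, reference_solution: str) -> bool:
    pred = extract_final_answer(model_output)
    gold = extract_final_answer(reference_solution)
    return pred is not None and gold is not None and float(pred) == float(gold)

Evaluation harnesses differ mainly in this extraction step, which is one reason published GSM8K scores for the same model can disagree by a point or two.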

Chain-of-Thought Impact

GSM8K was the key benchmark Wei et al. (2022) used to demonstrate that chain-of-thought prompting dramatically boosts reasoning. With standard few-shot prompting, PaLM 540B scored 17.9% on GSM8K; given the same exemplars with worked reasoning steps (8-shot CoT), the same model jumped to 56.9%. That single result reshaped prompting practice across the field.
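
To make the contrast concrete, here is how the two prompt styles differ on the first GSM8K training problem; the single exemplar below is illustrative, not one of the eight used by Wei et al.:

# First problem from the GSM8K training split (answer: 72).
QUESTION = (
    "Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did Natalia sell altogether "
    "in April and May?"
)

# Standard few-shot prompting: exemplars map questions straight to answers.
standard_prompt = (
    "Q: A farmer has 3 baskets with 12 apples each. How many apples in total?\n"
    "A: The answer is 36.\n\n"
    f"Q: {QUESTION}\nA:"
)

# Chain-of-thought prompting: the same exemplar, but its answer walks
# through the intermediate steps before stating the final number.
cot_prompt = (
    "Q: A farmer has 3 baskets with 12 apples each. How many apples in total?\n"
    "A: Each basket holds 12 apples and there are 3 baskets, so the total "
    "is 3 * 12 = 36. The answer is 36.\n\n"
    f"Q: {QUESTION}\nA:"
)

With the standard prompt the model tends to emit a bare number; with the CoT prompt it imitates the worked reasoning, which is what lifts accuracy on multi-step problems.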

Score Progression

Model                        Accuracy  Method
---------------------------  --------  ------------------
PaLM 540B (Wei et al. 2022)  17.9%     8-shot, no CoT
PaLM 540B + CoT              56.9%     8-shot CoT
GPT-4 (2023)                 92.0%     5-shot CoT
Claude 3.5 Sonnet            96.4%     0-shot CoT
o3 (2025)                    97.9%     reasoning model
Llama 3.1 8B                 84.5%     8-shot CoT
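
Scores like these come from a loop over the 1,319-problem test split. A sketch using the Hugging Face datasets copy of GSM8K (openai/gsm8k); generate is a placeholder for whatever model call is being benchmarked, and is_correct is the scorer sketched above:

from datasets import load_dataset

def evaluate(generate, prompt_template: str) -> float:
    """Final-answer accuracy on the GSM8K test split (1,319 problems)."""
    test = load_dataset("openai/gsm8k", "main", split="test")
    correct = 0
    for example in test:
        prompt = prompt_template.format(question=example["question"])
        correct += is_correct(generate(prompt), example["answer"])
    return correct / len(test)

# Usage (my_model is any hypothetical prompt -> text callable):
# accuracy = evaluate(my_model, "Q: {question}\nA:")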

Saturation

Frontier models now score 96–98% on GSM8K, which is effectively saturation. For meaningful discrimination at the frontier, evaluation has moved to harder suites: MATH (competition-level problems), AIME (olympiad-qualifier exams), and Omni-MATH.