GSM8K (Cobbe et al., OpenAI 2021) contains 8,500 grade-school math word problems, each requiring 2–8 arithmetic steps and no advanced mathematical knowledge. It matters because it measures multi-step reasoning rather than memorisation: solutions require chaining arithmetic steps, not just recalling a formula.
## Structure
| Property | Detail |
|---|---|
| Problems | 7,473 train / 1,319 test |
| Difficulty | Elementary (grade 3–7) |
| Solution length | 2–8 steps |
| Metric | Accuracy on final numeric answer |
| Prompt style | 8-shot chain-of-thought |
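Scoring is mechanical: each GSM8K reference solution ends with a line of the form `#### <answer>`, and a completion counts as correct if its final numeric answer matches. A minimal sketch of that extraction-and-compare loop (the "last number in the completion" heuristic is a common convention, not part of the dataset spec; exact parsing varies by evaluation harness):

```python
import re

# GSM8K reference solutions end with a line "#### <answer>";
# only the final numeric answer is compared.
ANSWER_RE = re.compile(r"####\s*([-+]?[\d,]*\.?\d+)")
NUMBER_RE = re.compile(r"[-+]?[\d,]*\.?\d+")

def extract_gold(solution: str):
    """Pull the gold answer from a GSM8K reference solution."""
    m = ANSWER_RE.search(solution)
    return m.group(1).replace(",", "") if m else None

def extract_pred(completion: str):
    """Heuristic: take the last number in the model's completion."""
    nums = NUMBER_RE.findall(completion)
    return nums[-1].replace(",", "") if nums else None

def is_correct(completion: str, solution: str) -> bool:
    gold = extract_gold(solution)
    return gold is not None and extract_pred(completion) == gold

# Example in the dataset's format:
sol = "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n#### 72"
print(is_correct("So she sold 48 + 24 = 72 clips in total.", sol))  # True
```

Commas are stripped so that `1,000` and `1000` compare equal; the `<<...>>` calculator annotations in reference solutions are ignored because only the `####` line is read.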
## Chain-of-Thought Impact
GSM8K was the key benchmark used by Wei et al. (2022) to demonstrate that chain-of-thought prompting dramatically boosts reasoning. With standard few-shot prompting, PaLM 540B scored ~18%; with 8-shot CoT exemplars, the same model jumped to ~57% (GPT-3 175B went from ~16% to ~47%). This single observation reshaped how LLMs are prompted across the field.
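The technique itself is just prompt construction: each exemplar shows worked reasoning before the final answer, so the model imitates the step-by-step format. A minimal sketch, using the canonical "Roger's tennis balls" exemplar from Wei et al. (2022); in practice GSM8K evaluation uses 8 such exemplars:

```python
# Few-shot chain-of-thought prompting: exemplars include intermediate
# reasoning, so the model continues in the same step-by-step style.
EXEMPLARS = [
    {
        "question": "Roger has 5 tennis balls. He buys 2 more cans of "
                    "tennis balls. Each can has 3 tennis balls. How many "
                    "tennis balls does he have now?",
        "reasoning": "Roger started with 5 balls. 2 cans of 3 tennis balls "
                     "each is 6 tennis balls. 5 + 6 = 11.",
        "answer": "11",
    },
    # ... 7 more exemplars in the standard 8-shot setup
]

def build_cot_prompt(question: str) -> str:
    """Assemble Q/A exemplars with worked reasoning, then the target question."""
    parts = [
        f"Q: {ex['question']}\nA: {ex['reasoning']} The answer is {ex['answer']}."
        for ex in EXEMPLARS
    ]
    parts.append(f"Q: {question}\nA:")  # the model continues from here
    return "\n\n".join(parts)

print(build_cot_prompt("A robe takes 2 bolts of blue fiber and half that "
                       "much white fiber. How many bolts in total?"))
```

Standard prompting differs only in the exemplars: they show `Q: ... A: 11.` with no reasoning, which is what produced the far lower pre-CoT scores.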
## Score Progression
| Model | Accuracy | Method |
|---|---|---|
| GPT-3 (175B) | 15.6% | standard few-shot |
| GPT-3 + CoT | 46.9% | 8-shot CoT |
| PaLM (540B) + CoT | 56.9% | 8-shot CoT |
| GPT-4 (2023) | 92.0% | 5-shot CoT |
| Claude 3.5 Sonnet | 96.4% | 0-shot CoT |
| o3 (2025) | 97.9% | reasoning model |
| Llama 3.1 8B | 84.5% | 8-shot CoT |
## Saturation
Frontier models score 96–98% on GSM8K, leaving the benchmark effectively saturated. For meaningful discrimination at the frontier, the field has moved to MATH (competition-level), AIME (olympiad-qualifier), and Omni-MATH.