GSM8K (Cobbe et al., OpenAI 2021) contains 8,500 grade-school math word problems, each requiring 2–8 arithmetic steps and no advanced mathematical knowledge. It matters because it measures multi-step reasoning rather than memorisation: solutions require chaining arithmetic steps, not just recalling a formula.
## Structure
| Property | Detail |
|---|---|
| Problems | 7,473 train / 1,319 test |
| Difficulty | Elementary (grade 3–7) |
| Solution length | 2–8 steps |
| Metric | Accuracy on final numeric answer |
| Prompt style | 8-shot chain-of-thought |
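Scoring is mechanical: each GSM8K reference solution ends with a line of the form `#### <answer>`, and a completion counts as correct if its final numeric answer matches. A minimal sketch of that extraction-and-compare loop (the "last number in the completion" heuristic is a common convention, not part of the dataset spec; exact parsing varies by evaluation harness):

```python
import re

# GSM8K reference solutions end with a line "#### <answer>";
# only the final numeric answer is compared.
ANSWER_RE = re.compile(r"####\s*([-+]?[\d,]*\.?\d+)")
NUMBER_RE = re.compile(r"[-+]?[\d,]*\.?\d+")

def extract_gold(solution: str):
    """Pull the gold answer from a GSM8K reference solution."""
    m = ANSWER_RE.search(solution)
    return m.group(1).replace(",", "") if m else None

def extract_pred(completion: str):
    """Heuristic: take the last number in the model's completion."""
    nums = NUMBER_RE.findall(completion)
    return nums[-1].replace(",", "") if nums else None

def is_correct(completion: str, solution: str) -> bool:
    gold = extract_gold(solution)
    return gold is not None and extract_pred(completion) == gold

# Example in the dataset's format:
sol = "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n#### 72"
print(is_correct("So she sold 48 + 24 = 72 clips in total.", sol))  # True
```

Commas are stripped so that `1,000` and `1000` compare equal; the `<<...>>` calculator annotations in reference solutions are ignored because only the `####` line is read.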
## Chain-of-Thought Impact
GSM8K was the key benchmark used by Wei et al. (2022) to demonstrate that chain-of-thought prompting dramatically boosts reasoning. With standard few-shot prompting, PaLM 540B scored ~18%; with 8-shot CoT exemplars, the same model jumped to ~57% (GPT-3 175B went from ~16% to ~47%). This single observation reshaped how LLMs are prompted across the field.
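The technique itself is just prompt construction: each exemplar shows worked reasoning before the final answer, so the model imitates the step-by-step format. A minimal sketch, using the canonical "Roger's tennis balls" exemplar from Wei et al. (2022); in practice GSM8K evaluation uses 8 such exemplars:

```python
# Few-shot chain-of-thought prompting: exemplars include intermediate
# reasoning, so the model continues in the same step-by-step style.
EXEMPLARS = [
    {
        "question": "Roger has 5 tennis balls. He buys 2 more cans of "
                    "tennis balls. Each can has 3 tennis balls. How many "
                    "tennis balls does he have now?",
        "reasoning": "Roger started with 5 balls. 2 cans of 3 tennis balls "
                     "each is 6 tennis balls. 5 + 6 = 11.",
        "answer": "11",
    },
    # ... 7 more exemplars in the standard 8-shot setup
]

def build_cot_prompt(question: str) -> str:
    """Assemble Q/A exemplars with worked reasoning, then the target question."""
    parts = [
        f"Q: {ex['question']}\nA: {ex['reasoning']} The answer is {ex['answer']}."
        for ex in EXEMPLARS
    ]
    parts.append(f"Q: {question}\nA:")  # the model continues from here
    return "\n\n".join(parts)

print(build_cot_prompt("A robe takes 2 bolts of blue fiber and half that "
                       "much white fiber. How many bolts in total?"))
```

Standard prompting differs only in the exemplars: they show `Q: ... A: 11.` with no reasoning, which is what produced the far lower pre-CoT scores.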
## Score Progression
| Model | Accuracy | Method |
|---|---|---|
| GPT-3 (175B) | 15.6% | standard few-shot |
| GPT-3 + CoT | 46.9% | 8-shot CoT |
| PaLM (540B) + CoT | 56.9% | 8-shot CoT |
| GPT-4 (2023) | 92.0% | 5-shot CoT |
| Claude 3.5 Sonnet | 96.4% | 0-shot CoT |
| o3 (2025) | 97.9% | reasoning model |
| Llama 3.1 8B | 84.5% | 8-shot CoT |
## Saturation
Frontier models score 96–98% on GSM8K, leaving the benchmark effectively saturated. For meaningful discrimination at the frontier, the field has moved to MATH (competition-level), AIME (olympiad-qualifier), and Omni-MATH.