ARC (Clark et al., AI2 2018) splits 7,787 science exam questions (US grades 3–9) into an Easy set and a Challenge set: the Challenge set contains only the questions that both of the paper's baseline solvers (a retrieval-based algorithm and a word co-occurrence algorithm) answered incorrectly, while the Easy set holds the remainder. The Challenge set (ARC-C) became a standard reasoning benchmark.
Structure
| Property | Detail |
|---|---|
| Task type | 4-way multiple choice |
| Challenge set size | 1,172 questions (test split; 2,590 incl. train/dev) |
| Domain | US grade 3–9 science exams |
| Metric | Accuracy (length-normalised, `acc_norm`; see the sketch after the table) |
| Prompt style | 25-shot (Open LLM Leaderboard convention) |
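For concreteness, below is a minimal sketch of how an evaluation in this style can be assembled, assuming the Hugging Face `datasets` library and the `allenai/ai2_arc` dataset (config `ARC-Challenge`). The `completion_logprob` callable is a hypothetical stand-in for a model's log-likelihood API, and details such as exemplar sampling and byte- vs. character-length normalisation differ between harnesses.

```python
import random
from datasets import load_dataset

arc = load_dataset("allenai/ai2_arc", "ARC-Challenge")  # splits: train / validation / test

def as_qa_text(ex, include_answer=True):
    """Render one ARC example as a 'Question: ... / Answer: ...' block."""
    text = f"Question: {ex['question']}\nAnswer:"
    if not include_answer:
        return text
    gold = dict(zip(ex["choices"]["label"], ex["choices"]["text"]))[ex["answerKey"]]
    return f"{text} {gold}\n\n"

# 25 exemplars drawn from the train split (how exemplars are chosen differs
# between harnesses; the fixed seed only keeps this sketch deterministic).
rng = random.Random(0)
few_shot_prefix = "".join(as_qa_text(ex) for ex in rng.sample(list(arc["train"]), k=25))

def predict(example, completion_logprob):
    """Pick the answer choice with the highest length-normalised log-probability.

    `completion_logprob(prompt, completion)` is a caller-supplied, hypothetical
    stand-in for a model's log-likelihood API, not a real library call.
    Dividing by completion length is the idea behind 'accuracy (normalised)';
    real harnesses typically normalise by byte length.
    """
    prompt = few_shot_prefix + as_qa_text(example, include_answer=False)
    scores = {}
    for label, choice in zip(example["choices"]["label"], example["choices"]["text"]):
        completion = " " + choice
        scores[label] = completion_logprob(prompt, completion) / len(completion)
    return max(scores, key=scores.get)

# Accuracy over the 1,172-question test split would then be:
# sum(predict(ex, completion_logprob) == ex["answerKey"] for ex in arc["test"]) / len(arc["test"])
```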
Why "Challenge" Matters
ARC-E (Easy) contains the questions that the simple retrieval and word co-occurrence baselines can already answer. ARC-C forces models to apply scientific concepts, chain facts together, and eliminate plausible distractors, making it a better measure of knowledge application than of retrieval. It was included in the original (v1) Open LLM Leaderboard average.
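As a rough illustration of the kind of overlap heuristic the Easy set falls to, the sketch below scores each choice by token overlap between question-plus-choice and a toy corpus. The corpus, tokeniser, and scoring are all invented for this example; the actual AI2 baselines are an IR solver over a large science corpus and a PMI-based word co-occurrence solver.

```python
import re

# A tiny stand-in corpus; the real IR baseline retrieves from a large science corpus.
CORPUS = [
    "the moon revolves around the earth about once every month",
    "plants use sunlight water and carbon dioxide to make food by photosynthesis",
    "a thermometer is used to measure temperature",
]

def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def overlap_pick(question, choices):
    """Pick the choice whose 'question + choice' tokens best overlap a corpus sentence."""
    scores = [
        max(len(tokens(question + " " + c) & tokens(sent)) for sent in CORPUS)
        for c in choices
    ]
    return scores.index(max(scores))

# An Easy-style question falls to this kind of matching:
idx = overlap_pick(
    "What instrument is used to measure temperature?",
    ["barometer", "thermometer", "ruler", "graduated cylinder"],
)
print(idx)  # 1 -> "thermometer"
# Challenge questions are, by construction, those where both official baselines fail.
```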
Scores (ARC-C, 25-shot)
| Model | Accuracy |
|---|---|
| GPT-4o | 96.4% |
| Llama 3.1 70B | 93.4% |
| Llama 3.1 8B | 83.4% |
| Mistral 7B v0.3 | 60.0% |