ARC Challenge

Benchmark

AI2 Reasoning Challenge — grade-school science multiple-choice questions. The 'Challenge' set contains questions that retrieval-based and word co-occurrence systems fail on.

ARC (Clark et al., AI2, 2018) splits 7,787 science exam questions (US grades 3–9) into an Easy set and a Challenge set; the Challenge set contains the questions that both a retrieval-based baseline and a word co-occurrence baseline answered incorrectly at release. The Challenge set became a standard reasoning benchmark.
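Both sets are available on the Hugging Face Hub. A minimal sketch of loading and inspecting the Challenge split (the `allenai/ai2_arc` dataset id and field names reflect the public release; the printed sizes are illustrative):

```python
from datasets import load_dataset

# The two ARC configurations: "ARC-Easy" and "ARC-Challenge"
easy = load_dataset("allenai/ai2_arc", "ARC-Easy")
challenge = load_dataset("allenai/ai2_arc", "ARC-Challenge")

print({split: len(ds) for split, ds in challenge.items()})
# e.g. {'train': 1119, 'validation': 299, 'test': 1172}

sample = challenge["test"][0]
print(sample["question"])         # question stem
print(sample["choices"]["text"])  # answer options
print(sample["answerKey"])        # gold label, e.g. "B"
```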

Structure

| Property | Detail |
| --- | --- |
| Task type | 4-way multiple choice |
| Challenge set size | 1,172 questions |
| Domain | US grade 3–9 science exams |
| Metric | Accuracy (normalised) |
| Prompt style | 25-shot |
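Normalised accuracy scores each answer option by the model's log-likelihood of the option text given the prompt, normalised by the option's length so longer answers are not penalised, and picks the highest-scoring option. A minimal sketch with `transformers` (the model id, prompt template, and byte-length normalisation are illustrative assumptions, not the exact leaderboard harness):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of log-probabilities the model assigns to the choice tokens after the prompt."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # log-prob of each token conditioned on the previous ones
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    target = full_ids[:, 1:]
    token_lp = logprobs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    # keep only the continuation tokens (approximate at the tokenisation boundary)
    n_prompt = prompt_ids.shape[1]
    return token_lp[:, n_prompt - 1:].sum().item()

def predict(question: str, choices: list[str]) -> int:
    prompt = f"Question: {question}\nAnswer:"
    # normalise by byte length so longer options are not penalised
    scores = [choice_logprob(prompt, " " + c) / len(c.encode("utf-8")) for c in choices]
    return max(range(len(choices)), key=scores.__getitem__)
```

In the 25-shot setting, the prompt is prefixed with 25 solved example questions before the target question.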

Why "Challenge" Matters

ARC-E (Easy) is largely solvable by simple word overlap and retrieval. ARC-C forces models to apply scientific concepts, chain facts, and eliminate plausible distractors, making it a better measure of knowledge application than of retrieval. It was included in the original Hugging Face Open LLM Leaderboard average.

Scores (ARC-C, 25-shot)

| Model | Accuracy |
| --- | --- |
| GPT-4o | 96.4% |
| Llama 3.1 70B | 93.4% |
| Llama 3.1 8B | 83.4% |
| Mistral 7B v0.3 | 60.0% |
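Scores in this setting can be reproduced with EleutherAI's lm-evaluation-harness, the backend used by the Open LLM Leaderboard. A hedged sketch of its Python entry point (the exact API and flags can vary across harness versions, and the model id is only an example):

```python
import lm_eval

# 25-shot ARC-Challenge evaluation of a Hugging Face causal LM
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B,dtype=bfloat16",
    tasks=["arc_challenge"],
    num_fewshot=25,
)
print(results["results"]["arc_challenge"])  # reports acc and acc_norm
```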