ARC (Clark et al., AI2 2018) splits 7,787 science exam questions (US grades 3–9) into an Easy set and a Challenge set: the Challenge set contains only the questions that both of the paper's baseline solvers (a retrieval-based algorithm and a word co-occurrence algorithm) answered incorrectly, while the Easy set holds the remainder. The Challenge set (ARC-C) became a standard reasoning benchmark.
Structure
| Property | Detail |
|---|---|
| Task type | 4-way multiple choice |
| Challenge set size | 1,172 questions (test split; 2,590 incl. train/dev) |
| Domain | US grade 3–9 science exams |
| Metric | Accuracy (length-normalised, `acc_norm`; see the sketch after the table) |
| Prompt style | 25-shot (Open LLM Leaderboard convention) |
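For concreteness, below is a minimal sketch of how an evaluation in this style can be assembled, assuming the Hugging Face `datasets` library and the `allenai/ai2_arc` dataset (config `ARC-Challenge`). The `completion_logprob` callable is a hypothetical stand-in for a model's log-likelihood API, and details such as exemplar sampling and byte- vs. character-length normalisation differ between harnesses.

```python
import random
from datasets import load_dataset

arc = load_dataset("allenai/ai2_arc", "ARC-Challenge")  # splits: train / validation / test

def as_qa_text(ex, include_answer=True):
    """Render one ARC example as a 'Question: ... / Answer: ...' block."""
    text = f"Question: {ex['question']}\nAnswer:"
    if not include_answer:
        return text
    gold = dict(zip(ex["choices"]["label"], ex["choices"]["text"]))[ex["answerKey"]]
    return f"{text} {gold}\n\n"

# 25 exemplars drawn from the train split (how exemplars are chosen differs
# between harnesses; the fixed seed only keeps this sketch deterministic).
rng = random.Random(0)
few_shot_prefix = "".join(as_qa_text(ex) for ex in rng.sample(list(arc["train"]), k=25))

def predict(example, completion_logprob):
    """Pick the answer choice with the highest length-normalised log-probability.

    `completion_logprob(prompt, completion)` is a caller-supplied, hypothetical
    stand-in for a model's log-likelihood API, not a real library call.
    Dividing by completion length is the idea behind 'accuracy (normalised)';
    real harnesses typically normalise by byte length.
    """
    prompt = few_shot_prefix + as_qa_text(example, include_answer=False)
    scores = {}
    for label, choice in zip(example["choices"]["label"], example["choices"]["text"]):
        completion = " " + choice
        scores[label] = completion_logprob(prompt, completion) / len(completion)
    return max(scores, key=scores.get)

# Accuracy over the 1,172-question test split would then be:
# sum(predict(ex, completion_logprob) == ex["answerKey"] for ex in arc["test"]) / len(arc["test"])
```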
Why "Challenge" Matters
ARC-E (Easy) contains the questions that the simple retrieval and word co-occurrence baselines can already answer. ARC-C forces models to apply scientific concepts, chain facts together, and eliminate plausible distractors, making it a better measure of knowledge application than of retrieval. It was included in the original (v1) Open LLM Leaderboard average.
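As a rough illustration of the kind of overlap heuristic the Easy set falls to, the sketch below scores each choice by token overlap between question-plus-choice and a toy corpus. The corpus, tokeniser, and scoring are all invented for this example; the actual AI2 baselines are an IR solver over a large science corpus and a PMI-based word co-occurrence solver.

```python
import re

# A tiny stand-in corpus; the real IR baseline retrieves from a large science corpus.
CORPUS = [
    "the moon revolves around the earth about once every month",
    "plants use sunlight water and carbon dioxide to make food by photosynthesis",
    "a thermometer is used to measure temperature",
]

def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def overlap_pick(question, choices):
    """Pick the choice whose 'question + choice' tokens best overlap a corpus sentence."""
    scores = [
        max(len(tokens(question + " " + c) & tokens(sent)) for sent in CORPUS)
        for c in choices
    ]
    return scores.index(max(scores))

# An Easy-style question falls to this kind of matching:
idx = overlap_pick(
    "What instrument is used to measure temperature?",
    ["barometer", "thermometer", "ruler", "graduated cylinder"],
)
print(idx)  # 1 -> "thermometer"
# Challenge questions are, by construction, those where both official baselines fail.
```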
Scores (ARC-C, 25-shot)
| Model | Accuracy |
|---|---|
| GPT-4o | 96.4% |
| Llama 3.1 70B | 93.4% |
| Llama 3.1 8B | 83.4% |
| Mistral 7B v0.3 | 60.0% |