HellaSwag (Zellers et al., 2019) tests grounded commonsense inference: given a short description of a physical activity, choose the most plausible of four sentence continuations. The wrong endings are generated by a language model and then adversarially filtered: only endings that fool discriminator models yet are easy for humans to reject are kept, which makes the benchmark hard for models while leaving human accuracy high.
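As a rough illustration of how a benchmark in this format is commonly scored, the sketch below ranks the four endings by their length-normalized log-likelihood under a causal language model and picks the highest. The model name and the token-level normalization here are illustrative assumptions, not the paper's official evaluation protocol.

```python
# Minimal sketch: score each candidate ending by length-normalized log-likelihood
# under a causal LM and pick the highest-scoring one. "gpt2" is a placeholder model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def ending_logprob(context: str, ending: str) -> float:
    """Mean log-prob of the ending's tokens given the context."""
    # Assumes the context tokenizes identically with and without the ending appended
    # (true for typical BPE tokenizers when the ending starts after a space).
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + " " + ending, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-prob of each token given the tokens before it.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the positions that belong to the ending.
    n_ctx = ctx_ids.shape[1]
    ending_lp = token_lp[0, n_ctx - 1:]
    return (ending_lp.sum() / ending_lp.numel()).item()

def predict(context: str, endings: list[str]) -> int:
    """Return the index of the most plausible ending."""
    scores = [ending_logprob(context, e) for e in endings]
    return max(range(len(endings)), key=lambda i: scores[i])
```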
## Structure
| Property | Detail |
|---|---|
| Task type | 4-way multiple choice |
| Examples | ~70,000 (validation split: 10,042) |
| Domain | WikiHow activities + ActivityNet captions |
| Human accuracy | 95.6% |
| Metric | Accuracy |
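To see what an item looks like in practice, the snippet below loads one validation example via the Hugging Face `datasets` hub copy. The dataset id `hellaswag` and the field names (`ctx`, `endings`, `label`, `activity_label`) reflect the public release and should be checked against the version you use.

```python
# Sketch: peek at one validation item from the public "hellaswag" dataset.
from datasets import load_dataset

ds = load_dataset("hellaswag", split="validation")
ex = ds[0]
print(ex["activity_label"])                 # source activity (WikiHow / ActivityNet)
print(ex["ctx"])                            # context describing the activity so far
for i, ending in enumerate(ex["endings"]):  # the four candidate continuations
    print(i, ending)
print("gold label:", ex["label"])           # index of the correct ending
```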
## Scores
| Model | Accuracy |
|---|---|
| BERT (2019, baseline) | 47.3% |
| GPT-2 | 70.8% |
| GPT-4 | 95.3% |
| Llama 3.1 8B | 82.1% |
| Human | 95.6% |
## Saturation Note
HellaSwag is largely saturated for frontier models, which score 95%+ and match or exceed the human baseline. It remains useful for evaluating small open-source models (roughly the 7B–13B range), where commonsense completion is still imperfect, but it no longer discriminates between frontier models.