HellaSwag

Commonsense natural language inference: pick the most plausible continuation of a short activity description. Built with adversarial filtering, so humans score ~95% while the strongest models at release scored under 48%.

HellaSwag (Zellers et al., 2019) tests grounded commonsense inference: given a short description of a physical activity, the model must choose the most plausible of four sentence continuations. The distractors are generated by a language model and then adversarially filtered: wrong endings that discriminator models classify easily are discarded, keeping only those that fool models while remaining easy for humans to reject.
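To make the format concrete, here is a minimal sketch of what one item looks like, assuming the copy of the dataset hosted on the Hugging Face Hub (the `ctx`, `endings`, and `label` field names come from that distribution):

```python
from datasets import load_dataset

# Pull the HellaSwag validation split from the Hugging Face Hub.
ds = load_dataset("hellaswag", split="validation")

item = ds[0]
print("Context:", item["ctx"])               # activity description to continue
for i, ending in enumerate(item["endings"]):
    print(f"  ({i}) {ending}")               # the four candidate continuations
print("Gold label:", item["label"])          # index of the human-written ending
```

Only one of the four endings is the original human-written continuation; the other three are the machine-generated, adversarially filtered distractors.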

Structure

| Property | Detail |
|---|---|
| Task type | 4-way multiple choice |
| Examples | ~70,000 (validation split: 10,042) |
| Domain | WikiHow activities + ActivityNet captions |
| Human accuracy | 95.6% |
| Metric | Accuracy |
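
Accuracy is computed by letting the model score all four endings and checking whether it ranks the gold ending highest. For causal LMs the usual approach (as in EleutherAI's lm-evaluation-harness) is to compare length-normalized log-likelihoods of each continuation given the context. A minimal sketch with `transformers`, assuming a GPT-style tokenizer where the context's tokens form a prefix of the full sequence's tokens:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# gpt2 is illustrative; any causal LM scores items the same way.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def ending_logprob(ctx: str, ending: str) -> float:
    """Total log-probability the model assigns to `ending` given `ctx`."""
    ctx_len = tok(ctx, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(ctx + " " + ending, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Logits at position t predict token t+1, so drop the final position.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    ending_ids = full_ids[0, ctx_len:]
    # Gather the log-prob of each actual ending token at the position
    # where the model predicted it.
    token_lps = log_probs[ctx_len - 1 :].gather(1, ending_ids.unsqueeze(1))
    return token_lps.sum().item()

def predict(ctx: str, endings: list[str]) -> int:
    # Normalize by continuation length so longer endings are not penalized;
    # lm-evaluation-harness's acc_norm uses a similar byte-length normalization.
    scores = [ending_logprob(ctx, e) / len(e) for e in endings]
    return max(range(len(endings)), key=lambda i: scores[i])
```

Benchmark accuracy is then the fraction of validation items where `predict` returns the gold label.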

Scores

| Model | Accuracy |
|---|---|
| BERT (2019, baseline) | 47.3% |
| GPT-2 | 70.8% |
| GPT-4 | 95.3% |
| Llama 3.1 8B | 82.1% |
| Human | 95.6% |

Saturation Note

HellaSwag is largely saturated for frontier models (95%+). It remains useful for evaluating small open-source models (7B–13B range) where commonsense reasoning is still imperfect, but it no longer discriminates between frontier models.