HellaSwag

Commonsense natural language inference: pick the most plausible continuation of a short activity description. Built with adversarial filtering, so humans score ~95% while the strongest models at release scored under 48%.

HellaSwag (Zellers et al., 2019) tests grounded commonsense inference: given a short description of a physical activity, the model must choose the most plausible of four sentence continuations. The distractors are generated by a language model and then adversarially filtered: wrong endings that discriminator models classify easily are discarded, keeping only those that fool models while remaining easy for humans to reject.
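To make the format concrete, here is a minimal sketch of what one item looks like, assuming the copy of the dataset hosted on the Hugging Face Hub (the `ctx`, `endings`, and `label` field names come from that distribution):

```python
from datasets import load_dataset

# Pull the HellaSwag validation split from the Hugging Face Hub.
ds = load_dataset("hellaswag", split="validation")

item = ds[0]
print("Context:", item["ctx"])               # activity description to continue
for i, ending in enumerate(item["endings"]):
    print(f"  ({i}) {ending}")               # the four candidate continuations
print("Gold label:", item["label"])          # index of the human-written ending
```

Only one of the four endings is the original human-written continuation; the other three are the machine-generated, adversarially filtered distractors.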

Structure

| Property | Detail |
|---|---|
| Task type | 4-way multiple choice |
| Examples | ~70,000 (validation split: 10,042) |
| Domain | WikiHow activities + ActivityNet captions |
| Human accuracy | 95.6% |
| Metric | Accuracy |
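
Accuracy is computed by letting the model score all four endings and checking whether it ranks the gold ending highest. For causal LMs the usual approach (as in EleutherAI's lm-evaluation-harness) is to compare length-normalized log-likelihoods of each continuation given the context. A minimal sketch with `transformers`, assuming a GPT-style tokenizer where the context's tokens form a prefix of the full sequence's tokens:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# gpt2 is illustrative; any causal LM scores items the same way.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def ending_logprob(ctx: str, ending: str) -> float:
    """Total log-probability the model assigns to `ending` given `ctx`."""
    ctx_len = tok(ctx, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(ctx + " " + ending, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Logits at position t predict token t+1, so drop the final position.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    ending_ids = full_ids[0, ctx_len:]
    # Gather the log-prob of each actual ending token at the position
    # where the model predicted it.
    token_lps = log_probs[ctx_len - 1 :].gather(1, ending_ids.unsqueeze(1))
    return token_lps.sum().item()

def predict(ctx: str, endings: list[str]) -> int:
    # Normalize by continuation length so longer endings are not penalized;
    # lm-evaluation-harness's acc_norm uses a similar byte-length normalization.
    scores = [ending_logprob(ctx, e) / len(e) for e in endings]
    return max(range(len(endings)), key=lambda i: scores[i])
```

Benchmark accuracy is then the fraction of validation items where `predict` returns the gold label.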

Scores

| Model | Accuracy |
|---|---|
| BERT (2019, baseline) | 47.3% |
| GPT-2 | 70.8% |
| GPT-4 | 95.3% |
| Llama 3.1 8B | 82.1% |
| Human | 95.6% |

Saturation Note

HellaSwag is largely saturated for frontier models (95%+). It remains useful for evaluating small open-source models (7B–13B range) where commonsense reasoning is still imperfect, but it no longer discriminates between frontier models.