WinoGrande

Benchmark

Large-scale commonsense reasoning via pronoun disambiguation — 44,000 adversarially filtered sentence pairs. Tests whether models resolve ambiguous pronouns using world knowledge.

WinoGrande (Sakaguchi et al., 2019) is an adversarially filtered version of the Winograd Schema Challenge, comprising 44,000 problems. Each problem presents a sentence with a blank where an ambiguous reference was removed, plus two candidate fillers; models must use world knowledge and reasoning to pick the correct one.

Example

"The trophy doesn't fit in the brown suitcase because it is too [big/small]."

Resolving whether it refers to the trophy or the suitcase requires understanding physical size relationships — something pure surface statistics cannot handle.
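The standard way to evaluate a language model on this binary-choice format is to fill each candidate into the blank, score both complete sentences, and pick the higher-scoring one. A minimal sketch, where `score` is a hypothetical stand-in for a real LM log-likelihood function:

```python
# Minimal sketch of binary-choice evaluation: fill each candidate into
# the blank, score both complete sentences, keep the preferred one.
# `score` is a hypothetical stand-in for an LM log-likelihood function.

def solve_item(sentence: str, option1: str, option2: str, score) -> str:
    """Return the option whose filled-in sentence the model prefers."""
    filled1 = sentence.replace("_", option1)
    filled2 = sentence.replace("_", option2)
    return option1 if score(filled1) >= score(filled2) else option2

if __name__ == "__main__":
    item = "The trophy doesn't fit in the brown suitcase because the _ is too large."

    # Toy scorer standing in for an LM: prefers the physically plausible filling.
    def toy_score(s: str) -> float:
        return 1.0 if "trophy is too large" in s else 0.0

    print(solve_item(item, "trophy", "suitcase", toy_score))  # trophy
```

In practice the two filled sentences are scored by summed token log-probabilities (often length-normalized), but the comparison logic is exactly this.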

Why "WinoGrande" vs "Winograd"

The original Winograd Schema Challenge has only 273 handcrafted problems, too few for reliable measurement. WinoGrande scales this up by more than two orders of magnitude using crowdsourced generation followed by AFLITE adversarial filtering, which removes examples solvable through word-association artifacts such as n-gram co-occurrence, so that genuine commonsense reasoning is required.
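The core idea of AFLITE can be sketched in a few lines. This is a deliberately simplified, pure-Python version: the real algorithm trains linear probes on precomputed embeddings over many random partitions, whereas here `fit_probe` is a hypothetical pluggable weak classifier and the partitioning is deterministic folds. Examples the weak probe answers correctly are treated as carrying surface artifacts and discarded.

```python
# Simplified sketch of the AFLITE filtering idea (assumption: the real
# algorithm uses linear probes on embeddings and many random partitions;
# here `fit_probe` is a stand-in and folds are deterministic).

def aflite(examples, fit_probe, rounds=3, folds=4, tau=0.75, min_keep=4):
    """Iteratively remove examples a weak probe finds too predictable."""
    kept = list(examples)
    for _ in range(rounds):
        scores = [0.0] * len(kept)
        for f in range(folds):
            # Hold out one fold, fit the weak probe on the rest.
            train = [kept[i] for i in range(len(kept)) if i % folds != f]
            probe = fit_probe(train)
            for i in range(len(kept)):
                if i % folds == f:
                    scores[i] = 1.0 if probe(kept[i]) == kept[i]["label"] else 0.0
        # Keep only examples the probe failed on (predictability below tau).
        survivors = [ex for ex, s in zip(kept, scores) if s < tau]
        if len(survivors) < min_keep or len(survivors) == len(kept):
            break
        kept = survivors
    return kept

if __name__ == "__main__":
    # Toy data: "easy" items leak their label through a surface cue,
    # "hard" items carry no cue at all.
    easy = [{"text": f"easy-{i}", "cue": i % 2, "label": i % 2} for i in range(6)]
    hard = [{"text": f"hard-{i}", "cue": None, "label": i % 2} for i in range(4)]

    def fit_probe(train):
        # Toy probe: answers with the surface cue when present, else abstains.
        return lambda ex: ex["cue"] if ex["cue"] is not None else -1

    filtered = aflite(easy + hard, fit_probe, min_keep=2)
    print([ex["text"] for ex in filtered])  # only the cue-free "hard" items survive
```

The design choice worth noting is that filtering is iterative: removing one batch of artifact-bearing examples can expose new artifacts in what remains, so the probe is refit each round until the kept set stabilizes.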

Scores

| Model         | Accuracy (5-shot) |
|---------------|-------------------|
| Human         | ~94%              |
| GPT-4o        | 88.3%             |
| Llama 3.1 70B | 85.7%             |
| Llama 3.1 8B  | 77.4%             |