WinoGrande (Sakaguchi et al., 2019) is a large-scale, adversarially filtered successor to the Winograd Schema Challenge, comprising 44,000 problems. Each problem presents a sentence with a blank where a pronoun was removed, plus two candidate fillers; models must use world knowledge and logic to pick the correct one.
Example
"The trophy doesn't fit in the brown suitcase because it is too [big/small]."
Resolving whether "it" refers to the trophy or the suitcase requires understanding physical size relationships: with "big" the answer is the trophy, with "small" it flips to the suitcase, even though the rest of the sentence is identical. Pure surface statistics cannot break that tie.
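A common way to evaluate models on this format is to substitute each candidate into the blank and compare the likelihood of the resulting sentences. A minimal sketch, where `sentence_logprob` is a hypothetical stand-in for a real language-model scorer:

```python
def sentence_logprob(sentence: str) -> float:
    # Toy placeholder: a real evaluation would sum token log-probabilities
    # from a language model over the filled-in sentence.
    return 0.0 if "the trophy is too big" in sentence else -1.0

def solve(template: str, options: list[str]) -> str:
    # Fill the blank with each option and pick the higher-likelihood sentence.
    scores = {opt: sentence_logprob(template.replace("_", opt)) for opt in options}
    return max(scores, key=scores.get)

item = "The trophy doesn't fit in the brown suitcase because the _ is too big."
print(solve(item, ["trophy", "suitcase"]))  # -> trophy
```

With the "small" twin, a correct scorer would flip to "suitcase" with no other change to the pipeline, which is what makes the paired format a clean probe.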
Why "WinoGrande" vs "Winograd"
The original Winograd Schema Challenge has only 273 hand-crafted sentences — too small for reliable measurement. WinoGrande scales this up roughly 160× through crowdsourced generation plus AFLITE adversarial filtering, which removes examples that lightweight classifiers can solve from spurious lexical correlations alone, ensuring genuine commonsense is required.
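The filtering idea can be sketched in miniature. This is not the paper's algorithm (real AFLITE trains ensembles of linear classifiers over pretrained-LM embeddings on random data partitions); here a single-feature threshold classifier and a leave-one-out loop stand in, on synthetic data where a "bias feature" predicts the label for easy items but misleads on hard ones:

```python
def best_threshold(train: list[tuple[float, bool]]) -> float:
    # "Train" a one-feature classifier: pick the threshold minimizing
    # training error; break ties toward the middle of the range.
    def err(t):
        return sum((feat > t) != label for feat, label in train)
    grid = [x / 10 for x in range(11)]
    return min(grid, key=lambda t: (err(t), abs(t - 0.5)))

def aflite_sketch(data: list[tuple[float, bool]]) -> list[tuple[float, bool]]:
    # Leave-one-out filtering: drop every example that a classifier
    # trained on the rest already gets right (i.e. it is solvable from
    # the bias feature alone), keeping only the genuinely hard ones.
    kept = []
    for i, (feat, label) in enumerate(data):
        t = best_threshold(data[:i] + data[i + 1:])
        if (feat > t) != label:  # classifier fails -> example survives
            kept.append((feat, label))
    return kept

# Easy items: the feature tracks the label. Hard items: it misleads.
easy = [(0.9, True), (0.8, True), (0.1, False), (0.2, False)]
hard = [(0.95, False), (0.05, True)]
print(aflite_sketch(easy + hard))  # -> only the two hard items remain
```

The design point is the same as in the real pipeline: examples are filtered by whether a weak model can solve them, not by human judgment of difficulty.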
Scores
| Model | Accuracy (5-shot) |
|---|---|
| Human | ~94% (not few-shot) |
| GPT-4o | 88.3% |
| Llama 3.1 70B | 85.7% |
| Llama 3.1 8B | 77.4% |