WinoGrande

Benchmark

Large-scale commonsense reasoning via pronoun disambiguation — 44,000 adversarially filtered sentence pairs. Tests whether models resolve ambiguous pronouns using world knowledge.

WinoGrande (Sakaguchi et al., 2019) is an adversarially filtered version of the Winograd Schema Challenge, comprising 44,000 problems. Each problem presents a sentence with a blank where an ambiguous reference was removed, plus two candidate fillers; models must use world knowledge and reasoning to pick the correct one.

Example

"The trophy doesn't fit in the brown suitcase because it is too [big/small]."

Resolving whether it refers to the trophy or the suitcase requires understanding physical size relationships — something pure surface statistics cannot handle.
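The standard way to evaluate a language model on this binary-choice format is to fill each candidate into the blank, score both complete sentences, and pick the higher-scoring one. A minimal sketch, where `score` is a hypothetical stand-in for a real LM log-likelihood function:

```python
# Minimal sketch of binary-choice evaluation: fill each candidate into
# the blank, score both complete sentences, keep the preferred one.
# `score` is a hypothetical stand-in for an LM log-likelihood function.

def solve_item(sentence: str, option1: str, option2: str, score) -> str:
    """Return the option whose filled-in sentence the model prefers."""
    filled1 = sentence.replace("_", option1)
    filled2 = sentence.replace("_", option2)
    return option1 if score(filled1) >= score(filled2) else option2

if __name__ == "__main__":
    item = "The trophy doesn't fit in the brown suitcase because the _ is too large."

    # Toy scorer standing in for an LM: prefers the physically plausible filling.
    def toy_score(s: str) -> float:
        return 1.0 if "trophy is too large" in s else 0.0

    print(solve_item(item, "trophy", "suitcase", toy_score))  # trophy
```

In practice the two filled sentences are scored by summed token log-probabilities (often length-normalized), but the comparison logic is exactly this.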

Why "WinoGrande" vs "Winograd"

The original Winograd Schema Challenge has only 273 handcrafted problems, too few for reliable measurement. WinoGrande scales this up by more than two orders of magnitude using crowdsourced generation followed by AFLITE adversarial filtering, which removes examples solvable through word-association artifacts such as n-gram co-occurrence, so that genuine commonsense reasoning is required.
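The core idea of AFLITE can be sketched in a few lines. This is a deliberately simplified, pure-Python version: the real algorithm trains linear probes on precomputed embeddings over many random partitions, whereas here `fit_probe` is a hypothetical pluggable weak classifier and the partitioning is deterministic folds. Examples the weak probe answers correctly are treated as carrying surface artifacts and discarded.

```python
# Simplified sketch of the AFLITE filtering idea (assumption: the real
# algorithm uses linear probes on embeddings and many random partitions;
# here `fit_probe` is a stand-in and folds are deterministic).

def aflite(examples, fit_probe, rounds=3, folds=4, tau=0.75, min_keep=4):
    """Iteratively remove examples a weak probe finds too predictable."""
    kept = list(examples)
    for _ in range(rounds):
        scores = [0.0] * len(kept)
        for f in range(folds):
            # Hold out one fold, fit the weak probe on the rest.
            train = [kept[i] for i in range(len(kept)) if i % folds != f]
            probe = fit_probe(train)
            for i in range(len(kept)):
                if i % folds == f:
                    scores[i] = 1.0 if probe(kept[i]) == kept[i]["label"] else 0.0
        # Keep only examples the probe failed on (predictability below tau).
        survivors = [ex for ex, s in zip(kept, scores) if s < tau]
        if len(survivors) < min_keep or len(survivors) == len(kept):
            break
        kept = survivors
    return kept

if __name__ == "__main__":
    # Toy data: "easy" items leak their label through a surface cue,
    # "hard" items carry no cue at all.
    easy = [{"text": f"easy-{i}", "cue": i % 2, "label": i % 2} for i in range(6)]
    hard = [{"text": f"hard-{i}", "cue": None, "label": i % 2} for i in range(4)]

    def fit_probe(train):
        # Toy probe: answers with the surface cue when present, else abstains.
        return lambda ex: ex["cue"] if ex["cue"] is not None else -1

    filtered = aflite(easy + hard, fit_probe, min_keep=2)
    print([ex["text"] for ex in filtered])  # only the cue-free "hard" items survive
```

The design choice worth noting is that filtering is iterative: removing one batch of artifact-bearing examples can expose new artifacts in what remains, so the probe is refit each round until the kept set stabilizes.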

Scores

| Model         | Accuracy (5-shot) |
|---------------|-------------------|
| Human         | ~94%              |
| GPT-4o        | 88.3%             |
| Llama 3.1 70B | 85.7%             |
| Llama 3.1 8B  | 77.4%             |