BIG-Bench Hard (BBH; Suzgun et al., 2022, Google Research) is a curated subset of 23 tasks from the broader BIG-bench suite, selected specifically because state-of-the-art models at the time failed to beat average human-rater performance on them. Its accompanying paper became one of the clearest demonstrations that chain-of-thought prompting helps well beyond math problems.
Representative Tasks
Logical Deduction
Determine the ordering of objects from a sequence of clues. Requires tracking multi-variable constraints across three, five, or seven objects (the benchmark includes a variant at each size).
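Each puzzle reduces to finding the one permutation consistent with all clues. A tiny brute-force sketch (the three-object instance below is invented, not drawn from the benchmark):

```python
from itertools import permutations

# Invented three-object instance in the style of the BBH task:
#   "The falcon is to the right of the quail.
#    The blue jay is to the left of the quail."
objects = ["blue jay", "quail", "falcon"]

# Each clue becomes a predicate over left-to-right positions (0 = leftmost).
constraints = [
    lambda pos: pos["falcon"] > pos["quail"],
    lambda pos: pos["blue jay"] < pos["quail"],
]

for order in permutations(objects):
    pos = {name: i for i, name in enumerate(order)}
    if all(c(pos) for c in constraints):
        print(order)  # ('blue jay', 'quail', 'falcon')
```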
Multi-Step Arithmetic
Evaluate nested arithmetic expressions over multi-digit and negative integers, with no calculator or tool use.
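An invented instance in the same style, with the step-by-step evaluation a model is expected to carry out without dropping an intermediate value:

```python
# Invented instance in the style of the task (not from the benchmark):
expr = "((-9 * 7 - 3) + (4 - -6 * 2)) * 2"
#   -9 * 7 = -63;  -63 - 3 = -66
#   -6 * 2 = -12;  4 - (-12) = 16
#   (-66 + 16) * 2 = -100
print(eval(expr))  # -100
```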
Causal Judgement
Given a short narrative, decide whether a typical person would judge an agent to have caused the outcome. Requires counterfactual reasoning.
Date Understanding
Infer dates from relative descriptions ("3 weeks before the first Monday after…"). Tests calendar arithmetic.
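A minimal sketch of the calendar arithmetic involved, on an invented instance:

```python
from datetime import date, timedelta

# Invented instance: "What is the date 3 weeks before the first Monday
# after 2022-06-15?"
d = date(2022, 6, 15) + timedelta(days=1)  # start strictly after the date
while d.weekday() != 0:                    # Monday == 0
    d += timedelta(days=1)                 # d is now 2022-06-20

print(d - timedelta(weeks=3))  # 2022-05-30
```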
Formal Fallacies
Identify whether an English-language syllogism is logically valid. Maps natural language to formal logic.
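A minimal sketch of what the validity check amounts to once the sentences are formalized (the encoding below is illustrative, not the paper's method): for single-variable statements over three predicates, a world is fully described by which of the 8 Venn regions are non-empty, so entailment can be brute-forced over all 2^8 patterns.

```python
from itertools import product

# Each region is a triple (in_A, in_B, in_C); a world is the set of
# non-empty regions.
regions = list(product([False, True], repeat=3))

def is_valid(premises, conclusion):
    for occupied in product([False, True], repeat=len(regions)):
        world = [r for r, occ in zip(regions, occupied) if occ]
        if all(p(world) for p in premises) and not conclusion(world):
            return False  # countermodel: premises hold, conclusion fails
    return True

# "All A are B. Some C are not B. Therefore, some C are not A."
all_A_are_B  = lambda w: all(b for a, b, c in w if a)
some_C_not_B = lambda w: any(not b for a, b, c in w if c)
some_C_not_A = lambda w: any(not a for a, b, c in w if c)

print(is_valid([all_A_are_B, some_C_not_B], some_C_not_A))  # True (valid)
```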
Tracking Shuffled Objects
Follow a sequence of pairwise swaps to determine the final assignment of objects to owners. High cognitive load; humans struggle too.
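Mechanically, each instance reduces to applying a swap sequence to a state map (the instance below is invented):

```python
# Invented instance: three players start with one ball each, then swap.
holders = {"Alice": "red", "Bob": "green", "Claire": "blue"}
swaps = [("Alice", "Bob"), ("Bob", "Claire"), ("Alice", "Bob")]

for x, y in swaps:
    holders[x], holders[y] = holders[y], holders[x]

print(holders)  # {'Alice': 'blue', 'Bob': 'green', 'Claire': 'red'}
```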
Key Finding: CoT Unlocks BBH
Without chain-of-thought prompting, even the strongest models of the time scored near or below 50% on BBH. Adding CoT lifted performance by 10–20+ percentage points and pushed models past average human-rater performance on many individual tasks, helping establish CoT as a near-universal improvement strategy for reasoning benchmarks. The paper is now a standard citation in work on reasoning models.
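The scores below use the paper's 3-shot CoT setup: three worked exemplars per task, each reasoning step by step before committing to an answer. A rough sketch of the prompt format (the exemplar text here is invented for illustration; the paper uses hand-written exemplars for each of the 23 tasks):

```python
# Build a few-shot CoT prompt: worked exemplars, then the target question.
def build_cot_prompt(exemplars, question):
    parts = [
        f"Q: {q}\nA: Let's think step by step.\n{steps}\nSo the answer is {ans}."
        for q, steps, ans in exemplars
    ]
    parts.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(parts)

exemplars = [
    ("What is 7 * 8 - 5?", "7 * 8 = 56. 56 - 5 = 51.", "51"),
    # ...two more worked exemplars would complete the 3-shot prompt
]
print(build_cot_prompt(exemplars, "What is 12 * 11 + 4?"))
```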
Current Scores
| Model | BBH (3-shot CoT) |
|---|---|
| GPT-3.5-turbo | 70.1% |
| GPT-4o | 83.4% |
| Claude 3.5 Sonnet | 88.0% |
| Llama 3.1 70B | 81.6% |
| Llama 3.1 8B | 61.2% |