BIG-Bench Hard (BBH)

Benchmark

23 challenging tasks from the BIG-Bench suite where LLMs historically underperformed humans — covering logical reasoning, multi-step arithmetic, causal reasoning, and formal logic.

BIG-Bench Hard (Suzgun et al., Google, 2022) is a curated subset of 23 tasks from the broader BIG-Bench suite, selected specifically because state-of-the-art models at the time scored below the average human-rater score on them. It was among the first benchmarks to show that chain-of-thought prompting helps substantially even on non-math tasks.

Task Categories

Logical Deduction

Determine ordering of objects from a sequence of clues. Requires tracking multi-variable constraints across 3–7 objects.
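A minimal sketch of how such an item can be checked mechanically: enumerate all orderings and keep those that satisfy every clue. The objects and clues below are hypothetical, written in the style of the task rather than taken from it.

```python
from itertools import permutations

objects = ["red book", "green book", "blue book"]

# Hypothetical clues: "the red book is to the left of the green book",
# "the blue book is rightmost".
def satisfies(order):
    return (order.index("red book") < order.index("green book")
            and order[-1] == "blue book")

solutions = [order for order in permutations(objects) if satisfies(order)]
print(solutions)  # [('red book', 'green book', 'blue book')]
```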

Multi-Step Arithmetic

Evaluate expressions with large integers and multi-digit operations without a calculator.
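As a hypothetical illustration of what "multi-step" means here (not an actual BBH item), the expression below decomposes into intermediate sub-results that a model must track mentally.

```python
# Evaluate ((-5 + 9 * -4 - 0) * (4 + -7 + 0 * -5)) by breaking it into parts.
left = (-5 + 9 * -4 - 0)          # -5 + (-36) - 0 = -41
right = (4 + -7 + 0 * -5)         # 4 - 7 + 0 = -3
print(left, right, left * right)  # -41 -3 123
```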

Causal Judgement

Determine whether an agent caused an outcome based on a narrative. Requires counterfactual reasoning.

Date Understanding

Infer dates from relative descriptions ("3 weeks before the first Monday after…"). Tests calendar arithmetic.
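A minimal sketch of the calendar arithmetic such items require, using only the Python standard library; the anchor date is a hypothetical example, not an actual BBH question.

```python
from datetime import date, timedelta

anchor = date(2022, 10, 17)  # hypothetical anchor date (a Monday)

# First Monday strictly after the anchor: advance 1-7 days until weekday() == 0.
days_ahead = (7 - anchor.weekday()) % 7 or 7
first_monday_after = anchor + timedelta(days=days_ahead)

# "3 weeks before the first Monday after" the anchor.
answer = first_monday_after - timedelta(weeks=3)
print(first_monday_after, answer)  # 2022-10-24 2022-10-03
```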

Formal Fallacies

Identify whether an English-language syllogism is logically valid. Maps natural language to formal logic.
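A minimal sketch of what mapping a syllogism to formal logic involves: search for a countermodel over a small domain. The argument form below ("All A are B; some B are C; therefore some A are C") is a hypothetical example of an invalid form; exhaustive search over a tiny domain suffices for simple syllogisms but is not a general first-order validity check.

```python
from itertools import product

domain = range(3)

def valid(premises, conclusion):
    # Try every assignment of predicates A, B, C over the domain; the argument
    # is valid iff no assignment makes all premises true and the conclusion false.
    for bits in product([False, True], repeat=3 * len(domain)):
        A = {x for x in domain if bits[x]}
        B = {x for x in domain if bits[len(domain) + x]}
        C = {x for x in domain if bits[2 * len(domain) + x]}
        if all(p(A, B, C) for p in premises) and not conclusion(A, B, C):
            return False  # found a countermodel
    return True

premises = [lambda A, B, C: A <= B,            # all A are B
            lambda A, B, C: len(B & C) > 0]    # some B are C
conclusion = lambda A, B, C: len(A & C) > 0    # some A are C
print(valid(premises, conclusion))  # False: this form is a fallacy
```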

Tracking Shuffled Objects

Follow swaps and permutations to determine final state. High cognitive load; humans also struggle.
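A minimal sketch of the bookkeeping the task demands: start from an initial assignment and apply each swap in order. The names and swaps below are hypothetical.

```python
holders = {"Alice": "red ball", "Bob": "green ball", "Claire": "blue ball"}

swaps = [("Alice", "Bob"), ("Bob", "Claire"), ("Alice", "Bob")]
for a, b in swaps:
    holders[a], holders[b] = holders[b], holders[a]

print(holders)
# {'Alice': 'blue ball', 'Bob': 'green ball', 'Claire': 'red ball'}
```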

Key Finding: CoT Unlocks BBH

Without chain-of-thought, the models evaluated in the BBH paper scored near or below 50% on average across BBH tasks. With CoT, performance jumped by 10–20+ percentage points, establishing CoT as a near-universal improvement strategy for reasoning tasks. The BBH paper is a standard citation in work on reasoning models.
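A minimal sketch of what 3-shot CoT evaluation (the setting reported in the table below) typically looks like: prepend three worked examples that reason step by step, then parse the model's final answer. The exemplars, the query_model call, and the answer-extraction regex here are placeholders, not the official BBH prompts or harness.

```python
import re

EXEMPLARS = [
    # (question, chain-of-thought ending in the answer) -- hypothetical
    ("Q: ...", "A: Let's think step by step. ... So the answer is (B)."),
    ("Q: ...", "A: Let's think step by step. ... So the answer is (A)."),
    ("Q: ...", "A: Let's think step by step. ... So the answer is (C)."),
]

def build_prompt(question):
    # Three worked examples, then the new question with a CoT trigger.
    shots = "\n\n".join(f"{q}\n{a}" for q, a in EXEMPLARS)
    return f"{shots}\n\nQ: {question}\nA: Let's think step by step."

def extract_answer(completion):
    # Pull out whatever follows "the answer is" in the model's reply.
    match = re.search(r"the answer is\s*(.+?)\.", completion, re.IGNORECASE)
    return match.group(1).strip() if match else None

# completion = query_model(build_prompt("Which object is leftmost? ..."))
# print(extract_answer(completion))
```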

Current Scores

Model                 BBH (3-shot CoT)
GPT-3.5-turbo         70.1%
GPT-4o                83.4%
Claude 3.5 Sonnet     88.0%
Llama 3.1 70B         81.6%
Llama 3.1 8B          61.2%