GPQA (Graduate-Level Google-Proof Q&A; Rein et al., 2023) contains 448 four-option multiple-choice questions in its main set, written by PhD experts in biology, chemistry, and physics and then validated by other PhD-level reviewers. The "Google-Proof" label means that even with internet access and time to search, skilled non-experts (holders of PhDs in other fields) score only ~34%, barely above the 25% random-guessing baseline.
Structure
| Property | Detail |
|---|---|
| Questions | 448 in the main set; 198 in the Diamond subset (hardest, most heavily vetted) |
| Fields | Biology, Chemistry, Physics |
| Expert accuracy | ~65% (PhD in the field) |
| Non-expert + Google | ~34% |
| Subsets | GPQA-Extended (546 questions), GPQA-Main (448), GPQA-Diamond (198) |
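For evaluation harnesses, the Diamond subset can be pulled from the Hugging Face Hub. The sketch below assumes the community upload `Idavidrein/gpqa` with a `gpqa_diamond` config and the column names `Question`, `Correct Answer`, and `Incorrect Answer 1–3`; these identifiers are assumptions, and the repo is gated, so a Hub token may be required.

```python
from datasets import load_dataset

# Assumed repo/config names; the GPQA upload on the Hub is gated, so you may
# need to accept its terms and authenticate (e.g. via `huggingface-cli login`).
diamond = load_dataset("Idavidrein/gpqa", "gpqa_diamond", split="train")
print(len(diamond))  # expect 198 Diamond questions

row = diamond[0]
# Assumed column names: one correct answer plus three expert-written distractors.
correct = row["Correct Answer"]
distractors = [row[f"Incorrect Answer {i}"] for i in (1, 2, 3)]
print(row["Question"], correct, distractors)
```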
Difficulty Design
Writing each question took its author at least 30 minutes, and answering one correctly typically requires combining multiple non-obvious facts from different subfields. Distractors are carefully chosen to be plausible to non-experts yet clearly wrong to a specialist, which prevents shortcut heuristics and surface-level pattern matching.
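Because the distractors are individually plausible, evaluation harnesses typically shuffle the answer order per question so a model cannot exploit positional or formatting regularities. The sketch below is an illustrative prompt builder, not the paper's exact harness; the template, function name, and toy question are all assumptions.

```python
import random

def format_item(question: str, correct: str, distractors: list[str],
                seed: int = 0) -> tuple[str, str]:
    """Return a shuffled four-option prompt and the gold answer letter."""
    rng = random.Random(seed)               # per-question seed keeps runs reproducible
    options = [correct] + list(distractors)
    rng.shuffle(options)
    letters = "ABCD"
    gold = letters[options.index(correct)]
    lines = [f"Question: {question}"]
    lines += [f"{letter}. {text}" for letter, text in zip(letters, options)]
    lines.append("Answer with a single letter (A, B, C, or D).")
    return "\n".join(lines), gold

# Toy example (far easier than a real GPQA item), just to show the format.
prompt, gold = format_item(
    "Which quantity is conserved in an elastic collision but not in a perfectly inelastic one?",
    "Kinetic energy",
    ["Linear momentum", "Total energy", "Angular momentum"],
    seed=42,
)
print(prompt)
print("gold:", gold)
```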
Scores (Diamond subset)
| Model | Accuracy |
|---|---|
| GPT-4o | 53.6% |
| Claude 3 Opus | 50.4% |
| o1 (2024) | 77.3% |
| o3 (2025) | 87.7% |
| DeepSeek-R1 | 71.5% |
| Llama 3.1 70B | 46.7% |
| PhD expert (field) | 65% |
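One caveat when reading this table: with only 198 questions, single-run Diamond scores carry meaningful sampling noise. The back-of-the-envelope normal-approximation confidence interval below (illustrative only, not any lab's reporting protocol) shows why small gaps between models should not be over-interpreted.

```python
import math

def accuracy_ci(acc: float, n: int = 198, z: float = 1.96) -> tuple[float, float]:
    """Rough 95% confidence interval for an accuracy measured on n questions."""
    half_width = z * math.sqrt(acc * (1 - acc) / n)
    return acc - half_width, acc + half_width

for name, acc in [("GPT-4o", 0.536), ("o3", 0.877)]:
    lo, hi = accuracy_ci(acc)
    print(f"{name}: {acc:.1%}  (95% CI ≈ {lo:.1%} – {hi:.1%})")
# Mid-range scores come with roughly ±7-point intervals, so differences of a few
# points on Diamond are within noise; the ~34-point GPT-4o → o3 gap is not.
```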
Why GPQA Defines the Reasoning Model Era
GPQA Diamond became the key benchmark showing that reasoning models (o1, o3, DeepSeek-R1) can exceed the accuracy of PhD experts answering questions in their own fields. A frontier model going from 53.6% (GPT-4o) to 87.7% (o3) on GPQA Diamond represents a qualitative capability jump rather than an incremental improvement. With MMLU largely saturated, GPQA Diamond is now treated as the primary knowledge-and-reasoning discriminator in frontier model evaluations.