GPQA

Graduate-Level Google-Proof Q&A: 448 expert-written multiple-choice questions in biology, chemistry, and physics, hard enough that even PhD researchers in the relevant field score only about 65%. Designed to challenge frontier models.

GPQA (Rein et al., 2023) contains 448 multiple-choice questions written by PhD experts in biology, chemistry and physics, then verified by other PhD-level reviewers. The "Google-Proof" label means that even with internet access and time to search, non-expert humans (with PhDs in adjacent fields) score only ~34% — barely above the 25% random baseline.

Structure

| Property | Detail |
| --- | --- |
| Questions | 448 total; 198 in the Diamond subset (hardest, most vetted) |
| Fields | Biology, Chemistry, Physics |
| Expert accuracy | ~65% (PhD in the field) |
| Non-expert + Google | ~34% |
| Subsets | GPQA-Main, GPQA-Extended, GPQA-Diamond |
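
The subsets above are commonly loaded from the Hugging Face Hub. The sketch below assumes the community dataset id `Idavidrein/gpqa`, the `gpqa_diamond` config, and its column names; none of these are specified in this article, and the dataset is gated, so adjust to whatever copy you actually use.

```python
# Minimal sketch: loading the GPQA Diamond subset from the Hugging Face Hub.
# The dataset id, config name, and column names below are assumptions from
# common usage, not stated in this article; the dataset is gated, so you may
# need to accept its terms and authenticate before downloading.
from datasets import load_dataset

ds = load_dataset("Idavidrein/gpqa", "gpqa_diamond", split="train")
print(len(ds))  # expected: 198 questions in the Diamond subset

example = ds[0]
question = example["Question"]
choices = [
    example["Correct Answer"],
    example["Incorrect Answer 1"],
    example["Incorrect Answer 2"],
    example["Incorrect Answer 3"],
]
print(question)
print(choices)
```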

Difficulty Design

Each question took its author at least 30 minutes to write. Questions require combining multiple non-obvious facts from different subfields, and the distractors are carefully chosen to be plausible to non-experts yet clearly wrong to a specialist, which blocks shortcut heuristics.
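
As an illustration of how such an item is typically posed to a model, the sketch below shuffles the correct answer among the three distractors and records which letter counts as correct. The prompt wording and shuffling scheme are assumptions for illustration, not the authors' official evaluation harness; the 25% random baseline cited above follows directly from the four options per question.

```python
# Illustrative sketch of posing a GPQA item as a 4-option multiple-choice prompt.
import random

def build_prompt(question: str, correct: str, distractors: list[str], rng: random.Random):
    """Shuffle the correct answer among the distractors and return the prompt
    plus the letter (A-D) a model must produce to be scored correct."""
    options = [correct] + distractors
    rng.shuffle(options)
    letters = "ABCD"
    lines = [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    answer_letter = letters[options.index(correct)]
    prompt = (
        f"{question}\n\n"
        + "\n".join(lines)
        + "\n\nAnswer with a single letter (A, B, C, or D)."
    )
    return prompt, answer_letter

rng = random.Random(0)
# Toy placeholder question, not an actual GPQA item.
prompt, gold = build_prompt(
    "Which quantum number determines the shape of an atomic orbital?",
    "The azimuthal quantum number l",
    ["The principal quantum number n",
     "The magnetic quantum number m_l",
     "The spin quantum number m_s"],
    rng,
)
print(prompt)
print("gold:", gold)
```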

Scores (Diamond subset)

| Model | Accuracy |
| --- | --- |
| GPT-4o | 53.6% |
| Claude 3 Opus | 50.4% |
| o1 (2024) | 77.3% |
| o3 (2025) | 87.7% |
| DeepSeek-R1 | 71.5% |
| Llama 3.1 70B | 46.7% |
| PhD expert (in field) | ~65% |
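
Diamond accuracy is simply the fraction of the 198 questions answered correctly, so small gaps between models can sit within sampling noise. A purely illustrative sketch using a standard Wilson score interval, which is not part of the official benchmark:

```python
# Illustrative only: accuracy on GPQA Diamond is the fraction of 198 questions
# answered correctly. With so few items, a Wilson score interval gives a rough
# sense of how much a score can move from sampling noise alone.
import math

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half

# Hypothetical example: 174 of 198 Diamond questions correct (~87.9%).
lo, hi = wilson_interval(174, 198)
print(f"accuracy = {174/198:.1%}, 95% CI ≈ [{lo:.1%}, {hi:.1%}]")
```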

Why GPQA Defines the Reasoning Model Era

GPQA Diamond was the key benchmark showing that reasoning models (o1, o3, DeepSeek-R1) can exceed domain-expert accuracy on hard science questions. A frontier model moving from 53.6% (GPT-4o) to 87.7% (o3) on GPQA Diamond, well past the ~65% scored by PhD experts in the field, represents a qualitative capability jump rather than an incremental improvement. GPQA has largely displaced MMLU as the primary discriminator in frontier model evaluation.