GPQA (Graduate-Level Google-Proof Q&A; Rein et al., 2023) contains 448 four-option multiple-choice questions in its main set, written by PhD experts in biology, chemistry, and physics and then validated by other PhD-level reviewers. The "Google-Proof" label means that even with internet access and time to search, skilled non-experts (holders of PhDs in other fields) score only ~34%, barely above the 25% random-guessing baseline.
Structure
| Property | Detail |
|---|---|
| Questions | 448 in the main set; 198 in the Diamond subset (hardest, most heavily vetted) |
| Fields | Biology, Chemistry, Physics |
| Expert accuracy | ~65% (PhD in the field) |
| Non-expert + Google | ~34% |
| Subsets | GPQA-Extended (546 questions), GPQA-Main (448), GPQA-Diamond (198) |
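For evaluation harnesses, the Diamond subset can be pulled from the Hugging Face Hub. The sketch below assumes the community upload `Idavidrein/gpqa` with a `gpqa_diamond` config and the column names `Question`, `Correct Answer`, and `Incorrect Answer 1–3`; these identifiers are assumptions, and the repo is gated, so a Hub token may be required.

```python
from datasets import load_dataset

# Assumed repo/config names; the GPQA upload on the Hub is gated, so you may
# need to accept its terms and authenticate (e.g. via `huggingface-cli login`).
diamond = load_dataset("Idavidrein/gpqa", "gpqa_diamond", split="train")
print(len(diamond))  # expect 198 Diamond questions

row = diamond[0]
# Assumed column names: one correct answer plus three expert-written distractors.
correct = row["Correct Answer"]
distractors = [row[f"Incorrect Answer {i}"] for i in (1, 2, 3)]
print(row["Question"], correct, distractors)
```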
Difficulty Design
Writing each question took its author at least 30 minutes, and answering one correctly typically requires combining multiple non-obvious facts from different subfields. Distractors are carefully chosen to be plausible to non-experts yet clearly wrong to a specialist, which prevents shortcut heuristics and surface-level pattern matching.
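Because the distractors are individually plausible, evaluation harnesses typically shuffle the answer order per question so a model cannot exploit positional or formatting regularities. The sketch below is an illustrative prompt builder, not the paper's exact harness; the template, function name, and toy question are all assumptions.

```python
import random

def format_item(question: str, correct: str, distractors: list[str],
                seed: int = 0) -> tuple[str, str]:
    """Return a shuffled four-option prompt and the gold answer letter."""
    rng = random.Random(seed)               # per-question seed keeps runs reproducible
    options = [correct] + list(distractors)
    rng.shuffle(options)
    letters = "ABCD"
    gold = letters[options.index(correct)]
    lines = [f"Question: {question}"]
    lines += [f"{letter}. {text}" for letter, text in zip(letters, options)]
    lines.append("Answer with a single letter (A, B, C, or D).")
    return "\n".join(lines), gold

# Toy example (far easier than a real GPQA item), just to show the format.
prompt, gold = format_item(
    "Which quantity is conserved in an elastic collision but not in a perfectly inelastic one?",
    "Kinetic energy",
    ["Linear momentum", "Total energy", "Angular momentum"],
    seed=42,
)
print(prompt)
print("gold:", gold)
```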
Scores (Diamond subset)
| Model | Accuracy |
|---|---|
| GPT-4o | 53.6% |
| Claude 3 Opus | 50.4% |
| o1 (2024) | 77.3% |
| o3 (2025) | 87.7% |
| DeepSeek-R1 | 71.5% |
| Llama 3.1 70B | 46.7% |
| PhD expert (field) | 65% |
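One caveat when reading this table: with only 198 questions, single-run Diamond scores carry meaningful sampling noise. The back-of-the-envelope normal-approximation confidence interval below (illustrative only, not any lab's reporting protocol) shows why small gaps between models should not be over-interpreted.

```python
import math

def accuracy_ci(acc: float, n: int = 198, z: float = 1.96) -> tuple[float, float]:
    """Rough 95% confidence interval for an accuracy measured on n questions."""
    half_width = z * math.sqrt(acc * (1 - acc) / n)
    return acc - half_width, acc + half_width

for name, acc in [("GPT-4o", 0.536), ("o3", 0.877)]:
    lo, hi = accuracy_ci(acc)
    print(f"{name}: {acc:.1%}  (95% CI ≈ {lo:.1%} – {hi:.1%})")
# Mid-range scores come with roughly ±7-point intervals, so differences of a few
# points on Diamond are within noise; the ~34-point GPT-4o → o3 gap is not.
```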
Why GPQA Defines the Reasoning Model Era
GPQA Diamond became the key benchmark showing that reasoning models (o1, o3, DeepSeek-R1) can exceed the accuracy of PhD experts answering questions in their own fields. A frontier model going from 53.6% (GPT-4o) to 87.7% (o3) on GPQA Diamond represents a qualitative capability jump rather than an incremental improvement. With MMLU largely saturated, GPQA Diamond is now treated as the primary knowledge-and-reasoning discriminator in frontier model evaluations.