MMLU

Benchmark

Massive Multitask Language Understanding — 57-subject multiple-choice exam covering STEM, humanities, law and more. The standard academic knowledge benchmark since 2021.

MMLU (Hendrycks et al., 2021) tests a model's academic knowledge across 57 subjects at difficulty levels ranging from elementary school to professional — including law, medicine, physics, history, ethics and computer science. A score of 89%+ (GPT-4 class) is considered frontier performance.

Structure

  • Task type: 4-way multiple choice
  • Number of questions: ~14,000 (test set)
  • Number of subjects: 57
  • Evaluation metric: Accuracy (%)
  • Prompt style: 5-shot (standard)
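
These properties map directly onto how MMLU is usually run: each test question is prepended with five solved examples from the same subject, the model picks A–D, and accuracy is averaged over all questions. Below is a minimal sketch of that loop; the `dev_examples`, `test_questions`, and `answer_question` names are hypothetical stand-ins, not part of MMLU itself.

```python
# Minimal sketch of standard 5-shot MMLU evaluation.
# Assumptions (not defined by the benchmark): `dev_examples` maps each
# subject to >=5 solved questions, each question dict has "subject",
# "question", "choices", "answer" (index 0-3), and `answer_question`
# is any callable that takes a prompt and returns "A", "B", "C", or "D".

LETTERS = "ABCD"

def format_question(q: dict, include_answer: bool = False) -> str:
    lines = [q["question"]]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(q["choices"])]
    lines.append("Answer:" + (f" {LETTERS[q['answer']]}" if include_answer else ""))
    return "\n".join(lines)

def build_prompt(subject: str, q: dict, dev_examples: dict) -> str:
    header = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    shots = "\n\n".join(format_question(ex, include_answer=True)
                        for ex in dev_examples[subject][:5])  # the 5 shots
    return header + shots + "\n\n" + format_question(q)

def evaluate(test_questions: list, dev_examples: dict, answer_question) -> float:
    correct = 0
    for q in test_questions:
        prediction = answer_question(build_prompt(q["subject"], q, dev_examples))
        correct += prediction.strip().upper().startswith(LETTERS[q["answer"]])
    return correct / len(test_questions)  # accuracy in [0, 1]
```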

Score Tiers

Random Baseline

25% (4 choices). A model that knows nothing scores here.

Human (non-expert)

~56%. Comparable to an informed adult without specialist training.

Human Expert

~89.8%. Requires deep expertise across all 57 domains simultaneously.

GPT-4o / Claude 3.5+

~88–93%. Frontier models match or exceed estimated expert-level accuracy on most subjects.

Llama 3 8B (Q4)

~65%. A capable small model; sufficient for many knowledge-retrieval chains when paired with RAG.

Limitations

MMLU has faced criticism for data contamination (test questions appeared in training corpora), annotation errors (some questions have ambiguous or wrong official answers), and the fact that it measures breadth of recall rather than reasoning depth. MMLU-Pro and GPQA were designed to address these gaps.
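
The contamination concern is commonly probed by checking whether long n-grams from test questions appear verbatim in the training corpus. The sketch below is illustrative only, not the procedure from the MMLU paper; the 13-gram window is a common choice in the contamination literature rather than anything MMLU prescribes.

```python
# Illustrative n-gram overlap check for test-set contamination.
def ngrams(text: str, n: int = 13) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(question: str, training_corpus_ngrams: set, n: int = 13) -> bool:
    # Flag the question if any of its n-grams also occurs in the training data.
    return not ngrams(question, n).isdisjoint(training_corpus_ngrams)
```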

Variants

  • MMLU-Pro: Harder questions with 10 answer choices, which lowers the guessing baseline from 25% to 10% (see the chance-correction sketch after this list) and makes it more discriminating at the frontier.
  • CMMLU / C-Eval: Chinese-language counterparts for evaluating knowledge and capability in Chinese.
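
Part of why MMLU-Pro's 10 answer choices matter is the lower guessing floor. A chance-corrected score makes the difference explicit; the formula below is the standard guessing correction, used here purely for illustration and not taken from either benchmark's paper.

```python
# Chance-corrected accuracy: 0.0 at the random-guess floor, 1.0 at perfect.
def chance_corrected(accuracy: float, num_choices: int) -> float:
    floor = 1.0 / num_choices          # 0.25 for MMLU, 0.10 for MMLU-Pro
    return (accuracy - floor) / (1.0 - floor)

print(chance_corrected(0.70, 4))    # MMLU:     (0.70 - 0.25) / 0.75 = 0.60
print(chance_corrected(0.70, 10))   # MMLU-Pro: (0.70 - 0.10) / 0.90 ≈ 0.667
```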