MMLU (Hendrycks et al., 2021) tests a model's academic knowledge across 57 subjects at difficulty levels ranging from elementary school to professional, including law, medicine, physics, history, ethics, and computer science. Scores of roughly 88% and above (GPT-4o class) are considered frontier performance.
Structure
| Property | Detail |
|---|---|
| Task type | 4-way multiple choice |
| Number of questions | 14,042 (test split) |
| Number of subjects | 57 |
| Evaluation metric | Accuracy (%) |
| Prompt style | 5-shot (standard setting) |
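To make the protocol concrete, here is a minimal sketch of a 5-shot evaluation loop in the MMLU style. The prompt layout (question, lettered options, a trailing `Answer:`) follows the paper's convention; the field names (`question`, `options`, `answer`) and the random-guessing stand-in model are illustrative assumptions, not the official harness.

```python
# Minimal sketch of a 5-shot, 4-way multiple-choice evaluation loop.
# Assumed for illustration (not the official MMLU harness): the item
# fields `question`, `options`, `answer`, and the `model` callable.
import random

CHOICES = "ABCD"

def format_item(item: dict, include_answer: bool) -> str:
    """Render one question in the standard MMLU prompt layout."""
    lines = [item["question"]]
    lines += [f"{letter}. {text}" for letter, text in zip(CHOICES, item["options"])]
    lines.append(f"Answer: {item['answer']}" if include_answer else "Answer:")
    return "\n".join(lines)

def build_prompt(shots: list[dict], item: dict) -> str:
    """Five solved examples followed by the unsolved test question."""
    blocks = [format_item(s, include_answer=True) for s in shots]
    blocks.append(format_item(item, include_answer=False))
    return "\n\n".join(blocks)

def evaluate(dataset: list[dict], shots: list[dict], model) -> float:
    """Accuracy: fraction of items where the model's letter matches the key."""
    correct = sum(model(build_prompt(shots, item)) == item["answer"] for item in dataset)
    return correct / len(dataset)

if __name__ == "__main__":
    item = {
        "question": "Which gas makes up most of Earth's atmosphere?",
        "options": ["Oxygen", "Nitrogen", "Carbon dioxide", "Argon"],
        "answer": "B",
    }
    shots = [item] * 5  # placeholder; real runs use 5 dev-set items per subject
    guesser = lambda prompt: random.choice(CHOICES)  # knows-nothing model
    print(f"accuracy ≈ {evaluate([item] * 1000, shots, guesser):.1%}")
```

Running this with the random guesser lands near 25%, which is exactly the chance baseline listed in the score tiers below.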
Score Tiers
| Tier | Score | Notes |
|---|---|---|
| Random baseline | 25% | Four choices; a model that knows nothing scores here. |
| Human (non-expert) | ~34.5% | Unspecialized Mechanical Turk workers in the original paper. |
| Human expert | ~89.8% | Estimated; requires deep expertise across all 57 domains simultaneously. |
| GPT-4o / Claude 3.5+ | ~88–93% | Frontier models match or exceed the estimated expert level on most subjects. |
| Llama 3 8B (Q4) | ~65% | Capable small model; sufficient for many RAG-backed knowledge-retrieval chains. |
Limitations
MMLU has faced criticism for data contamination (test questions appeared in training corpora), annotation errors (some questions have ambiguous or wrong official answers), and the fact that it measures breadth of recall rather than reasoning depth. MMLU-Pro and GPQA were designed to address these gaps.
Variants
- MMLU-Pro: Harder questions with 10 answer choices, which cuts the random-guess baseline from 25% to 10% and makes the benchmark more discriminating at the frontier.
- CMMLU / C-Eval: Chinese-language counterparts of MMLU for evaluating knowledge in Chinese.