MMLU (Hendrycks et al., 2021) tests a model's academic knowledge across 57 subjects at difficulty levels ranging from elementary school to professional, including law, medicine, physics, history, ethics, and computer science. Scores of roughly 88% and above (GPT-4o class) are considered frontier performance.
Structure
| Property | Detail |
|---|---|
| Task type | 4-way multiple choice |
| Number of questions | 14,042 (test split) |
| Number of subjects | 57 |
| Evaluation metric | Accuracy (%) |
| Prompt style | 5-shot (standard setting) |
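To make the protocol concrete, here is a minimal sketch of a 5-shot evaluation loop in the MMLU style. The prompt layout (question, lettered options, a trailing `Answer:`) follows the paper's convention; the field names (`question`, `options`, `answer`) and the random-guessing stand-in model are illustrative assumptions, not the official harness.

```python
# Minimal sketch of a 5-shot, 4-way multiple-choice evaluation loop.
# Assumed for illustration (not the official MMLU harness): the item
# fields `question`, `options`, `answer`, and the `model` callable.
import random

CHOICES = "ABCD"

def format_item(item: dict, include_answer: bool) -> str:
    """Render one question in the standard MMLU prompt layout."""
    lines = [item["question"]]
    lines += [f"{letter}. {text}" for letter, text in zip(CHOICES, item["options"])]
    lines.append(f"Answer: {item['answer']}" if include_answer else "Answer:")
    return "\n".join(lines)

def build_prompt(shots: list[dict], item: dict) -> str:
    """Five solved examples followed by the unsolved test question."""
    blocks = [format_item(s, include_answer=True) for s in shots]
    blocks.append(format_item(item, include_answer=False))
    return "\n\n".join(blocks)

def evaluate(dataset: list[dict], shots: list[dict], model) -> float:
    """Accuracy: fraction of items where the model's letter matches the key."""
    correct = sum(model(build_prompt(shots, item)) == item["answer"] for item in dataset)
    return correct / len(dataset)

if __name__ == "__main__":
    item = {
        "question": "Which gas makes up most of Earth's atmosphere?",
        "options": ["Oxygen", "Nitrogen", "Carbon dioxide", "Argon"],
        "answer": "B",
    }
    shots = [item] * 5  # placeholder; real runs use 5 dev-set items per subject
    guesser = lambda prompt: random.choice(CHOICES)  # knows-nothing model
    print(f"accuracy ≈ {evaluate([item] * 1000, shots, guesser):.1%}")
```

Running this with the random guesser lands near 25%, which is exactly the chance baseline listed in the score tiers below.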
Score Tiers
| Tier | Score | Notes |
|---|---|---|
| Random baseline | 25% | Four choices; a model that knows nothing scores here. |
| Human (non-expert) | ~34.5% | Unspecialized Mechanical Turk workers in the original paper. |
| Human expert | ~89.8% | Estimated; requires deep expertise across all 57 domains simultaneously. |
| GPT-4o / Claude 3.5+ | ~88–93% | Frontier models match or exceed the estimated expert level on most subjects. |
| Llama 3 8B (Q4) | ~65% | Capable small model; sufficient for many RAG-backed knowledge-retrieval chains. |
Limitations
MMLU has faced criticism for data contamination (test questions appeared in training corpora), annotation errors (some questions have ambiguous or wrong official answers), and the fact that it measures breadth of recall rather than reasoning depth. MMLU-Pro and GPQA were designed to address these gaps.
Variants
- MMLU-Pro: Harder questions with 10 answer choices, which cuts the random-guess baseline from 25% to 10% and makes the benchmark more discriminating at the frontier.
- CMMLU / C-Eval: Chinese-language counterparts of MMLU for evaluating knowledge in Chinese.