MMMU (Yue et al., 2023) evaluates multimodal large language models on tasks from college-level exams and textbooks across 30 subjects (183 subfields) in 6 disciplines. Unlike image captioning benchmarks, MMMU requires joint visual and textual reasoning: reading a circuit diagram, interpreting a medical image, or parsing a mathematical graph while answering a discipline-specific question.
Structure
| Property | Detail |
|---|---|
| Questions | 11,500 (multiple-choice and open-ended) |
| Subjects | 30 subjects (183 subfields) across Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, Tech & Engineering |
| Image types | Charts, diagrams, tables, photographs, equations, maps, schematics |
| Metric | Accuracy (%) |
| Scope | College entrance to postgraduate level |
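The dataset is distributed on the Hugging Face Hub as MMMU/MMMU, with one configuration per subject. The sketch below shows how per-subject accuracy could be scored against the validation split; the subject name, the `ask_model` callable, and the exact field layout are assumptions based on the dataset card rather than part of the table above.

```python
# Minimal scoring sketch for one MMMU subject split (assumptions noted above).
import ast
from datasets import load_dataset

subset = load_dataset("MMMU/MMMU", "Electronics", split="validation")  # assumed subject config

correct = 0
for row in subset:
    options = ast.literal_eval(row["options"])          # options are stored as a stringified list
    prediction = ask_model(row["question"], options, row["image_1"])  # hypothetical model call returning an option letter
    correct += int(prediction == row["answer"])

print(f"Accuracy: {correct / len(subset):.1%}")
```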
Why Multimodal Benchmarks Matter
Text-only benchmarks like MMLU cannot evaluate a capability that production deployments increasingly depend on: understanding images, diagrams, tables, and documents. MMMU has become a standard benchmark for vision-language models (Gemini, GPT-4V, Claude, Qwen-VL, LLaVA), directly measuring whether a model can reason about visual information in professional contexts.
Scores
| Model | Accuracy |
|---|---|
| Human expert | 88.6% |
| Gemini 2.5 Pro | 81.7% |
| Claude 3.5 Sonnet | 70.4% |
| GPT-4o | 69.1% |
| LLaVA-1.6 34B | 51.1% |
| Random baseline | 25% |
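One way to read this table is to normalize each model's accuracy against the span between the random baseline and the human-expert ceiling; a minimal sketch using only the figures above:

```python
# Sketch: re-expressing the accuracy table as the fraction of the
# expert-over-chance gap closed; figures copied from the table above.
HUMAN_EXPERT, RANDOM = 88.6, 25.0

scores = {
    "Gemini 2.5 Pro": 81.7,
    "Claude 3.5 Sonnet": 70.4,
    "GPT-4o": 69.1,
    "LLaVA-1.6 34B": 51.1,
}

for model, acc in scores.items():
    normalized = (acc - RANDOM) / (HUMAN_EXPERT - RANDOM)
    print(f"{model}: {acc:.1f}% raw, {normalized:.0%} of the gap to human experts closed")
```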
Relation to On-Premise
Multimodal models (LLaVA, InternVL, Qwen-VL) can be run on-premise for document understanding, visual QA, and OCR workflows. MMMU scores help select the right model for your image-heavy on-premise use case — particularly for technical document processing where diagram comprehension is required.
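As a starting point for such a workflow, a local vision-language model can be served with the Hugging Face transformers stack. The sketch below assumes the llava-hf/llava-1.5-7b-hf checkpoint, a CUDA-capable GPU, and a hypothetical local image file; any of the models above with a transformers integration would substitute.

```python
# Minimal on-premise visual-QA sketch (assumptions noted above).
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("circuit_diagram.png")  # hypothetical local document page
prompt = "USER: <image>\nWhat is the total resistance of this circuit? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0], skip_special_tokens=True))
```

Running the model locally keeps the documents on your own hardware, which is usually the point of an on-premise deployment; larger checkpoints trade more VRAM for the higher MMMU scores shown above.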