MMMU (Yue et al., 2023) evaluates multimodal large language models on tasks from college-level exams and textbooks across 30 subjects (183 subfields) in 6 disciplines. Unlike image captioning benchmarks, MMMU requires joint visual and textual reasoning: reading a circuit diagram, interpreting a medical image, or parsing a mathematical graph while answering a discipline-specific question.
Structure
| Property | Detail |
|---|---|
| Questions | 11,500 (multiple-choice and open-ended) |
| Subjects | 30 subjects (183 subfields) across Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, Tech & Engineering |
| Image types | Charts, diagrams, tables, photographs, equations, maps, schematics |
| Metric | Accuracy (%) |
| Scope | College entrance to postgraduate level |
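The dataset is distributed on the Hugging Face Hub as MMMU/MMMU, with one configuration per subject. The sketch below shows how per-subject accuracy could be scored against the validation split; the subject name, the `ask_model` callable, and the exact field layout are assumptions based on the dataset card rather than part of the table above.

```python
# Minimal scoring sketch for one MMMU subject split (assumptions noted above).
import ast
from datasets import load_dataset

subset = load_dataset("MMMU/MMMU", "Electronics", split="validation")  # assumed subject config

correct = 0
for row in subset:
    options = ast.literal_eval(row["options"])          # options are stored as a stringified list
    prediction = ask_model(row["question"], options, row["image_1"])  # hypothetical model call returning an option letter
    correct += int(prediction == row["answer"])

print(f"Accuracy: {correct / len(subset):.1%}")
```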
Why Multimodal Benchmarks Matter
Text-only benchmarks like MMLU cannot evaluate a capability that production deployments increasingly depend on: understanding images, diagrams, tables, and documents. MMMU has become a standard benchmark for vision-language models (Gemini, GPT-4V, Claude, Qwen-VL, LLaVA), directly measuring whether a model can reason about visual information in professional contexts.
Scores
| Model | Accuracy |
|---|---|
| Human expert | 88.6% |
| Gemini 2.5 Pro | 81.7% |
| Claude 3.5 Sonnet | 70.4% |
| GPT-4o | 69.1% |
| LLaVA-1.6 34B | 51.1% |
| Random baseline | 25% |
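One way to read this table is to normalize each model's accuracy against the span between the random baseline and the human-expert ceiling; a minimal sketch using only the figures above:

```python
# Sketch: re-expressing the accuracy table as the fraction of the
# expert-over-chance gap closed; figures copied from the table above.
HUMAN_EXPERT, RANDOM = 88.6, 25.0

scores = {
    "Gemini 2.5 Pro": 81.7,
    "Claude 3.5 Sonnet": 70.4,
    "GPT-4o": 69.1,
    "LLaVA-1.6 34B": 51.1,
}

for model, acc in scores.items():
    normalized = (acc - RANDOM) / (HUMAN_EXPERT - RANDOM)
    print(f"{model}: {acc:.1f}% raw, {normalized:.0%} of the gap to human experts closed")
```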
Relation to On-Premise
Multimodal models (LLaVA, InternVL, Qwen-VL) can be run on-premise for document understanding, visual QA, and OCR workflows. MMMU scores help select the right model for your image-heavy on-premise use case — particularly for technical document processing where diagram comprehension is required.
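As a starting point for such a workflow, a local vision-language model can be served with the Hugging Face transformers stack. The sketch below assumes the llava-hf/llava-1.5-7b-hf checkpoint, a CUDA-capable GPU, and a hypothetical local image file; any of the models above with a transformers integration would substitute.

```python
# Minimal on-premise visual-QA sketch (assumptions noted above).
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("circuit_diagram.png")  # hypothetical local document page
prompt = "USER: <image>\nWhat is the total resistance of this circuit? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0], skip_special_tokens=True))
```

Running the model locally keeps the documents on your own hardware, which is usually the point of an on-premise deployment; larger checkpoints trade more VRAM for the higher MMMU scores shown above.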