MMMU

Benchmark

Massive Multi-discipline Multimodal Understanding: 11,500 questions spanning 30 subjects (183 subfields) that require joint image and text reasoning. The MMLU equivalent for multimodal language models.

MMMU (Yue et al., 2023) evaluates multimodal large language models on questions drawn from university-level exams, quizzes, and textbooks, covering 30 subjects (183 subfields) across 6 disciplines. Unlike image-captioning benchmarks, MMMU requires joint visual and textual reasoning: reading a circuit diagram, interpreting a medical image, or parsing a mathematical graph while answering a discipline-specific question.

Structure

| Property | Detail |
| --- | --- |
| Questions | 11,500 (multiple-choice and open-ended) |
| Subjects | 30 subjects (183 subfields) across Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering |
| Image types | Charts, diagrams, tables, photographs, equations, maps, schematics |
| Metric | Accuracy (%) |
| Scope | College entrance to postgraduate level |
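
To make the evaluation concrete, the sketch below loads one subject from the public MMMU release and scores multiple-choice answers by exact match against the reference letter. The Hugging Face dataset name ("MMMU/MMMU"), the field names, and the my_model_answer stand-in are assumptions based on the public release; verify them against the version you download.

```python
from datasets import load_dataset


def my_model_answer(question, options, image):
    """Hypothetical stand-in for a vision-language model; returns a letter such as 'A'."""
    return "A"


subject = "Accounting"  # one of the 30 MMMU subjects
ds = load_dataset("MMMU/MMMU", subject, split="validation")

correct = 0
scored = 0
for row in ds:
    # Open-ended questions need a separate grading step; score only multiple-choice here.
    if row.get("question_type") != "multiple-choice":
        continue
    pred = my_model_answer(row["question"], row["options"], row["image_1"])
    correct += int(pred == row["answer"])
    scored += 1

print(f"{subject}: accuracy = {correct / scored:.1%} over {scored} multiple-choice questions")
```

Because answers are only published for the dev and validation splits, local scoring like this is typically run on validation; test-set scores come from the official leaderboard submission.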

Why Multimodal Benchmarks Matter

Text-only benchmarks like MMLU cannot evaluate one of the fastest-growing LLM capabilities: understanding images, diagrams, tables, and documents. MMMU is the primary benchmark for vision-language models (Gemini, GPT-4V, Claude 3.5 Sonnet, Qwen-VL, LLaVA), directly measuring whether a model can reason about visual information in professional contexts.

Scores

| Model | Accuracy |
| --- | --- |
| Human expert | 88.6% |
| Gemini 2.5 Pro | 81.7% |
| Claude 3.5 Sonnet | 70.4% |
| GPT-4o | 69.1% |
| LLaVA-1.6 34B | 51.1% |
| Random baseline | 25% |

Relation to On-Premise

Multimodal models (LLaVA, InternVL, Qwen-VL) can be run on-premise for document understanding, visual QA, and OCR workflows. MMMU scores help select the right model for your image-heavy on-premise use case — particularly for technical document processing where diagram comprehension is required.
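
As a rough illustration of the on-premise path, the sketch below runs a LLaVA-style model locally with Hugging Face transformers and asks a question about a diagram. The checkpoint id, prompt template, file name, and question are illustrative assumptions rather than a prescribed setup, and a GPU with enough memory for the chosen checkpoint is assumed.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Illustrative checkpoint; any locally hosted vision-language model works similarly.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # keeps the weights on local hardware
)

# Hypothetical input: a scanned wiring diagram from an internal document.
image = Image.open("wiring_diagram.png")
prompt = "USER: <image>\nWhich breaker feeds the pump circuit?\nASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

The same loop used for MMMU scoring above can wrap a call like this, which is how published MMMU numbers for open-weight models are typically reproduced on local hardware.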