MMLU (Massive Multitask Language Understanding)
Overview
MMLU is a large-scale multiple-choice benchmark, introduced by Hendrycks et al. (2021), that evaluates the breadth of factual knowledge and reasoning in large language models across academic, professional, and everyday subject areas. It extends earlier multitask benchmarks with substantially broader subject coverage and greater difficulty, and is a standard evaluation reported for frontier LLMs.
Contents
- Modality: Text
- Annotations / Labels: Four-option multiple-choice questions with ground-truth answer keys
- Size: ~15,900 questions across all splits (14,042 in the test split)
- Format: JSON / CSV
- Structure:
  - 57 subject areas spanning STEM, humanities, social sciences, and professional fields
  - Dev (few-shot exemplar), validation, and test splits, with answer labels public for all splits
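The per-question structure described above can be sketched as a plain record. This is a minimal illustration, not the canonical schema: the field names `subject`, `question`, `choices`, and `answer` follow the common Hugging Face layout, and the question text is invented for the example.

```python
# A hypothetical MMLU-style record: one question, four answer
# choices, and the index of the correct choice (0 = "A", ..., 3 = "D").
record = {
    "subject": "astronomy",  # one of the 57 subject areas
    "question": "Which planet is known as the Red Planet?",
    "choices": ["Venus", "Mars", "Jupiter", "Saturn"],
    "answer": 1,  # index into `choices`, i.e. "Mars"
}

def format_prompt(rec):
    """Render a record as a lettered multiple-choice prompt string."""
    letters = "ABCD"
    lines = [rec["question"]]
    lines += [f"{letters[i]}. {c}" for i, c in enumerate(rec["choices"])]
    lines.append("Answer:")
    return "\n".join(lines)

print(format_prompt(record))
```

Prompts of roughly this shape (question, lettered options, trailing "Answer:") are the usual way MMLU records are presented to a model for evaluation.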
Typical Uses
- General knowledge evaluation
- LLM benchmarking
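Benchmarking on MMLU typically reduces to exact-match accuracy over answer indices, usually reported both overall and per subject. A minimal scoring sketch (the function and variable names here are illustrative, not part of any official harness):

```python
# Minimal MMLU-style scoring sketch: compare predicted answer
# indices against gold labels and report overall and per-subject accuracy.
from collections import defaultdict

def mmlu_accuracy(predictions, golds, subjects):
    """predictions/golds: answer indices (0-3); subjects: subject name per question."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, gold, subj in zip(predictions, golds, subjects):
        total[subj] += 1
        correct[subj] += int(pred == gold)
    per_subject = {s: correct[s] / total[s] for s in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_subject

overall, per_subject = mmlu_accuracy(
    [1, 2, 0, 3], [1, 2, 1, 3], ["law", "law", "physics", "physics"]
)
print(overall)      # 0.75
print(per_subject)  # {'law': 1.0, 'physics': 0.5}
```

Reporting per-subject scores alongside the overall number is common practice, since aggregate accuracy can mask large differences between, say, formal logic and marketing.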
Notable Features
- Broad subject coverage (law, medicine, engineering, ethics, etc.)
- Difficulty ranges from high-school level up to professional and expert-level questions
- Widely reported in modern LLM system cards and papers
Limitations
- English-only
- Multiple-choice format limits evaluation of open-ended reasoning
Access
- Available on the Hugging Face Hub (`cais/mmlu`) and in the original GitHub repository (`hendrycks/test`)
License / Source Information
- Dataset Owner: Center for AI Safety (CAIS)
- MIT License