MMLU (Massive Multitask Language Understanding)
Overview
MMLU is a large-scale multiple-choice benchmark, introduced by Hendrycks et al. (2021), that evaluates the breadth of factual knowledge and reasoning in large language models across academic, professional, and everyday subject areas. It extends earlier multitask benchmarks with substantially broader subject coverage and greater difficulty, and is a standard evaluation reported for frontier LLMs.
Contents
- Modality: Text
- Annotations / Labels: Four-option multiple-choice questions with ground-truth answer keys
- Size: ~15,900 questions across all splits (14,042 in the test split)
- Format: JSON / CSV
- Structure:
  - 57 subject areas spanning STEM, humanities, social sciences, and professional fields
  - Dev (few-shot exemplar), validation, and test splits, with answer labels public for all splits
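The per-question structure described above can be sketched as a plain record. This is a minimal illustration, not the canonical schema: the field names `subject`, `question`, `choices`, and `answer` follow the common Hugging Face layout, and the question text is invented for the example.

```python
# A hypothetical MMLU-style record: one question, four answer
# choices, and the index of the correct choice (0 = "A", ..., 3 = "D").
record = {
    "subject": "astronomy",  # one of the 57 subject areas
    "question": "Which planet is known as the Red Planet?",
    "choices": ["Venus", "Mars", "Jupiter", "Saturn"],
    "answer": 1,  # index into `choices`, i.e. "Mars"
}

def format_prompt(rec):
    """Render a record as a lettered multiple-choice prompt string."""
    letters = "ABCD"
    lines = [rec["question"]]
    lines += [f"{letters[i]}. {c}" for i, c in enumerate(rec["choices"])]
    lines.append("Answer:")
    return "\n".join(lines)

print(format_prompt(record))
```

Prompts of roughly this shape (question, lettered options, trailing "Answer:") are the usual way MMLU records are presented to a model for evaluation.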
Typical Uses
- General knowledge evaluation
- LLM benchmarking
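Benchmarking on MMLU typically reduces to exact-match accuracy over answer indices, usually reported both overall and per subject. A minimal scoring sketch (the function and variable names here are illustrative, not part of any official harness):

```python
# Minimal MMLU-style scoring sketch: compare predicted answer
# indices against gold labels and report overall and per-subject accuracy.
from collections import defaultdict

def mmlu_accuracy(predictions, golds, subjects):
    """predictions/golds: answer indices (0-3); subjects: subject name per question."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, gold, subj in zip(predictions, golds, subjects):
        total[subj] += 1
        correct[subj] += int(pred == gold)
    per_subject = {s: correct[s] / total[s] for s in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_subject

overall, per_subject = mmlu_accuracy(
    [1, 2, 0, 3], [1, 2, 1, 3], ["law", "law", "physics", "physics"]
)
print(overall)      # 0.75
print(per_subject)  # {'law': 1.0, 'physics': 0.5}
```

Reporting per-subject scores alongside the overall number is common practice, since aggregate accuracy can mask large differences between, say, formal logic and marketing.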
Notable Features
- Broad subject coverage (law, medicine, engineering, ethics, etc.)
- Difficulty ranges from high-school level up to professional and expert-level questions
- Widely reported in modern LLM system cards and papers
Limitations
- English-only
- Multiple-choice format limits evaluation of open-ended reasoning
Access
- Available on the Hugging Face Hub (`cais/mmlu`) and in the original GitHub repository (`hendrycks/test`)
License / Source Information
- Dataset Owner: Center for AI Safety (CAIS)
- MIT License