MMLU (Massive Multitask Language Understanding)

Overview

MMLU is a large-scale benchmark designed to evaluate the broad factual knowledge and reasoning capabilities of large language models across academic, professional, and real-world subject areas, typically in zero-shot or few-shot settings. It extends earlier multitask benchmarks by significantly increasing subject coverage and difficulty, and is commonly used to assess frontier LLMs.

Contents

  • Modality: Text
  • Annotations | Labels: Multiple-choice answers with ground-truth labels
  • Size: ~15,000 questions across all splits (test split ≈14,000)
  • Format: JSON / CSV
  • Structure:
    • 57 subject areas spanning STEM, humanities, social sciences, and professional fields
    • Few-shot development, validation, and test splits, all released with answer labels
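
The CSV distribution stores each question as a headerless row: the question text, four answer options, and the ground-truth letter. A minimal parsing sketch (the example row is illustrative, not an actual dataset entry):

```python
import csv
import io

# One illustrative MMLU-style CSV row (not a real dataset entry):
# question, option A, option B, option C, option D, answer letter.
raw = io.StringIO('"What is 2 + 2?","3","4","5","6","B"\n')

def parse_row(row):
    """Turn one raw CSV row into a structured multiple-choice record."""
    question, *choices, answer = row
    return {
        "question": question,
        "choices": dict(zip("ABCD", choices)),
        "answer": answer,  # ground-truth label, one of A-D
    }

records = [parse_row(r) for r in csv.reader(raw)]
print(records[0]["choices"][records[0]["answer"]])  # → 4
```

The JSON distributions carry the same fields, so the same record structure applies.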

Typical Uses

  • General knowledge evaluation
  • LLM benchmarking
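
Reported MMLU scores are usually the macro average of per-subject accuracy, so every subject counts equally regardless of its question count. A minimal scoring sketch, assuming hypothetical prediction records with `subject`, `pred`, and `gold` fields:

```python
from collections import defaultdict

def mmlu_score(examples):
    """Per-subject accuracy plus the macro average across subjects,
    the aggregation commonly reported for MMLU."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for ex in examples:
        totals[ex["subject"]] += 1
        hits[ex["subject"]] += ex["pred"] == ex["gold"]
    per_subject = {s: hits[s] / totals[s] for s in totals}
    macro = sum(per_subject.values()) / len(per_subject)
    return per_subject, macro

# Toy predictions (hypothetical, for illustration only).
examples = [
    {"subject": "anatomy", "pred": "A", "gold": "A"},
    {"subject": "anatomy", "pred": "C", "gold": "B"},
    {"subject": "philosophy", "pred": "D", "gold": "D"},
]
per_subject, macro = mmlu_score(examples)
print(per_subject["anatomy"], macro)  # → 0.5 0.75
```

Macro averaging means a model strong on a few large subjects but weak elsewhere scores lower than its pooled accuracy would suggest.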

Notable Features

  • Broad subject coverage (law, medicine, engineering, ethics, etc.)
  • Difficulty ranging from elementary-school level to advanced professional, expert-level questions
  • Widely reported in modern LLM system cards and papers

Limitations

  • English-only
  • Multiple-choice format limits evaluation of open-ended reasoning

Access

  • Publicly available; commonly obtained from the authors' GitHub repository (hendrycks/test) or the Hugging Face Hub (cais/mmlu)

License / Source Information

  • Dataset Owner: Center for AI Safety (CAIS)
  • MIT License