MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Wang, Yubo; Ma, Xueguang; Zhang, Ge; Ni, Yuansheng; Chandra, Abhranil; Guo, Shiguang; Ren, Weiming; Arulraj, Aaran; He, Xuan; Jiang, Ziyan; Li, Tianle; Ku, Max; Wang, Kai; Zhuang, Alex; Fan, Rongqi; Yue, Xiang; Chen, Wenhu

arXiv:2406.01574 (cs)

[Submitted on 3 Jun 2024 (v1), last revised 6 Nov 2024 (this version, v6)]

Abstract: In the age of large-scale language models, benchmarks like the Massive Multitask Language Understanding (MMLU) have been pivotal in pushing the boundaries of what AI can achieve in language comprehension and reasoning across diverse domains. However, as models continue to improve, their performance on these benchmarks has begun to plateau, making it increasingly difficult to discern differences in model capabilities. This paper introduces MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. Additionally, MMLU-Pro eliminates the trivial and noisy questions in MMLU. Our experimental results show that MMLU-Pro not only raises the challenge, causing accuracy to drop by 16% to 33% compared to MMLU, but also demonstrates greater stability under varying prompts. With 24 different prompt styles tested, the sensitivity of model scores to prompt variations decreased from 4-5% in MMLU to just 2% in MMLU-Pro. Additionally, we found that models utilizing Chain of Thought (CoT) reasoning achieved better performance on MMLU-Pro than with direct answering, in stark contrast to the findings on the original MMLU, indicating that MMLU-Pro includes more complex reasoning questions. Our assessments confirm that MMLU-Pro is a more discriminative benchmark that better tracks progress in the field.
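
One way to see why expanding the choice set from four to ten options makes guessing less rewarding is to compute the expected accuracy of uniform random guessing. The sketch below is illustrative only and is not taken from the paper: the function name `chance_accuracy` is hypothetical, and it assumes every question has exactly the stated number of options with a single correct answer.

```python
from fractions import Fraction


def chance_accuracy(num_options: int) -> Fraction:
    """Expected accuracy of uniform random guessing.

    Assumption (not from the paper): each question has exactly
    `num_options` choices and exactly one of them is correct.
    """
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return Fraction(1, num_options)


if __name__ == "__main__":
    # Compare a four-option format with a ten-option format.
    for k in (4, 10):
        print(f"{k} options: chance accuracy = {float(chance_accuracy(k)):.0%}")
```

Under these assumptions the guessing baseline falls from 25% with four options to 10% with ten, so a score near the floor carries less credit for lucky guesses.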

Comments:	This version has been accepted and published at NeurIPS 2024 Track Datasets and Benchmarks (Spotlight)
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2406.01574 [cs.CL]
	(or arXiv:2406.01574v6 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2406.01574

Submission history

From: Yubo Wang
[v1] Mon, 3 Jun 2024 17:53:00 UTC (1,928 KB)
[v2] Tue, 4 Jun 2024 10:36:49 UTC (727 KB)
[v3] Wed, 5 Jun 2024 04:03:36 UTC (727 KB)
[v4] Sun, 23 Jun 2024 15:57:16 UTC (727 KB)
[v5] Mon, 7 Oct 2024 17:46:08 UTC (728 KB)
[v6] Wed, 6 Nov 2024 02:54:00 UTC (728 KB)
