The Qwen team has officially confirmed serious data quality issues in the GPQA (Graduate-Level Google-Proof Q&A) and HLE (Humanity's Last Exam) test sets. The news, which first surfaced on Reddit and was picked up by the LocalLLaMA community, highlights that several answers marked as correct in the datasets were in fact incorrect.
Analysis and Validation
An independent researcher had previously conducted a forensic analysis of the datasets, dubbed "DeepSeek-Overclock", finding that the DeepSeek model, when pushed to its limits, produced technically correct answers that contradicted the provided "gold standard" labels. Further verification with Python scripts confirmed the errors in the datasets.
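The article does not show the verification scripts themselves, but the kind of cross-check described could look something like the following minimal sketch: compare each item's gold label against an independently verified answer and flag disagreements. The field names (`gold`, `verified`) and the toy data are illustrative assumptions, not real GPQA/HLE entries.

```python
def find_label_errors(items):
    """Return items whose gold label disagrees with the verified answer."""
    return [item for item in items if item["gold"] != item["verified"]]


# Toy data standing in for benchmark entries (hypothetical, not real items).
sample = [
    {"id": "q1", "gold": "A", "verified": "A"},
    {"id": "q2", "gold": "B", "verified": "C"},  # a mislabeled item
]

suspect = find_label_errors(sample)
print([item["id"] for item in suspect])  # -> ['q2']
```

In practice such a script would load the actual benchmark files and use a vetted answer key, but the core step, diffing the published labels against independently checked answers, is the same.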
Implications
The confirmation by the Qwen team, in a paper published on arXiv, underscores the importance of carefully validating the datasets used to evaluate the reasoning capabilities of language models. The paper notes that many questions in the HLE test set are "fundamentally broken" and that, in some cases, the reference answers are simply wrong. This raises questions about the reliability of existing benchmarks and the need for more robust evaluation methodologies.