The Qwen team has officially confirmed serious data quality issues in the GPQA (Graduate-Level Google-Proof Q&A) and HLE (Humanity's Last Exam) test sets. The news, which first surfaced on Reddit and was picked up by the LocalLLaMA community, highlights that several answers marked as correct in the datasets were in fact incorrect.
Analysis and Validation
An independent researcher had previously conducted a forensic analysis of the datasets, dubbed "DeepSeek-Overclock", finding that the DeepSeek model, when pushed to its limits, produced technically correct answers that contradicted the provided "gold standard" labels. Further verification with Python scripts confirmed the errors in the datasets.
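The article does not show the verification scripts themselves, but the kind of cross-check described could look something like the following minimal sketch: compare each item's gold label against an independently verified answer and flag disagreements. The field names (`gold`, `verified`) and the toy data are illustrative assumptions, not real GPQA/HLE entries.

```python
def find_label_errors(items):
    """Return items whose gold label disagrees with the verified answer."""
    return [item for item in items if item["gold"] != item["verified"]]


# Toy data standing in for benchmark entries (hypothetical, not real items).
sample = [
    {"id": "q1", "gold": "A", "verified": "A"},
    {"id": "q2", "gold": "B", "verified": "C"},  # a mislabeled item
]

suspect = find_label_errors(sample)
print([item["id"] for item in suspect])  # -> ['q2']
```

In practice such a script would load the actual benchmark files and use a vetted answer key, but the core step, diffing the published labels against independently checked answers, is the same.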
Implications
The confirmation by the Qwen team, in a paper published on arXiv, underscores the importance of carefully validating the datasets used to evaluate the reasoning capabilities of language models. The paper notes that many questions in the HLE test set are "fundamentally broken" and that, in some cases, the reference answers are simply wrong. This raises questions about the reliability of existing benchmarks and the need for more robust evaluation methodologies.