Integrity Issues in SWE-bench Verified
SWE-bench Verified, a widely used benchmark for measuring the code generation capabilities of language models, has come under increasing scrutiny over its integrity. Recent analyses have identified flawed tests and potential training-data leakage, both of which undermine the benchmark's accuracy and reliability.
Training-data leakage means that models may have been exposed, directly or indirectly, to the benchmark's test instances during training, which effectively invalidates the scores they achieve. This casts serious doubt on whether SWE-bench Verified can accurately measure real progress in code generation models.
Recommendation: SWE-bench Pro
In light of these issues, the decision has been made to stop using SWE-bench Verified to evaluate model submissions. As an alternative, the adoption of SWE-bench Pro, a presumably improved and more reliable version of the benchmark, is recommended. Further details on the differences between the two versions and the advantages of SWE-bench Pro have not been specified.