Integrity Issues in SWE-bench Verified
SWE-bench Verified, a widely used benchmark for measuring the code generation capabilities of language models, has come under increasing scrutiny over its integrity. Recent analyses have identified flawed tests and potential training-data leakage, both of which undermine the benchmark's accuracy and reliability.
Training-data leakage means that models may have been exposed, directly or indirectly, to the benchmark's test instances during training, which effectively invalidates the scores they achieve. This casts serious doubt on whether SWE-bench Verified can accurately measure real progress in code generation models.
Recommendation: SWE-bench Pro
In light of these issues, the decision has been made to stop using SWE-bench Verified to evaluate model submissions. As an alternative, the adoption of SWE-bench Pro, a presumably improved and more reliable version of the benchmark, is recommended. Further details on the differences between the two versions and the advantages of SWE-bench Pro have not been specified.