SWE-rebench: February Results

The SWE-rebench benchmark has been updated with February results, covering 57 fresh GitHub PR tasks. For each task, a model had to read the real issue behind the PR, edit the code, and run the tests until the full suite passed.
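As a rough illustration of how a resolution rate like the ones below could be computed, here is a minimal sketch. It assumes a task counts as resolved only when the full test suite passes, and that per-run rates are averaged across several runs. The post does not state the aggregation method, so the averaging step, the function names, and the sample outcomes are all hypothetical.

```python
from statistics import mean


def resolution_rate(outcomes: list[bool]) -> float:
    """Percentage of tasks resolved in a single run.

    Assumption: an outcome is True only if the full test suite passed.
    """
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return 100.0 * sum(outcomes) / len(outcomes)


def mean_resolution_rate(runs: list[list[bool]]) -> float:
    """Average the per-run resolution rates (assumed aggregation)."""
    return mean(resolution_rate(run) for run in runs)


# Hypothetical outcomes: two runs over the same four tasks.
runs = [
    [True, True, False, True],
    [True, False, False, True],
]
print(f"{mean_resolution_rate(runs):.1f}%")
```

With 57 tasks, a single run can only produce rates in steps of 1/57 (about 1.75 percentage points), so averaging over runs is one way finer-grained percentages could arise.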

Claude Opus 4.6 remains the leader with a resolution rate of 65.3%. gpt-5.2-medium (64.4%), GLM-5 (62.8%), and gpt-5.4-medium (62.8%) follow closely.

Gemini 3.1 Pro Preview (62.3%) and DeepSeek-V3.2 (60.9%) complete the top 6.

Open-weight and hybrid models continue to improve. Qwen3.5-397B (59.9%), Step-3.5-Flash (59.6%), and Qwen3-Coder-Next (54.4%) are closing the gap with the leaders, helped by improvements in long-context use and scaling.

MiniMax M2.5 (54.6%) stands out as a cost-effective option with competitive performance.