📁 LLM AI generated

SWE-rebench Leaderboard: GPT-5.4, Qwen3.5, Gemini 3.1 Pro, and More

Published on 2026-03-23 14:17. Source: LocalLLaMA

SWE-rebench: GPT-5.4, Qwen3.5, and Gemini 3.1 Pro Compared

SWE-rebench: February Results

The SWE-rebench benchmark has been updated with February results, evaluating models on 57 fresh GitHub PR tasks. For each task, a model had to read a real PR issue, edit the code, and run the tests, with the goal of passing the full suite.

Claude Opus 4.6 remains the leader with a resolution rate of 65.3%. gpt-5.2-medium (64.4%), GLM-5 (62.8%), and gpt-5.4-medium (62.8%) follow closely.

Gemini 3.1 Pro Preview (62.3%) and DeepSeek-V3.2 (60.9%) complete the top 6.

Open-weight and hybrid models continue to improve. Qwen3.5-397B (59.9%), Step-3.5-Flash (59.6%), and Qwen3-Coder-Next (54.4%) are closing the gap, thanks to improvements in long-context use and scaling.

MiniMax M2.5 (54.6%) stands out as a cost-effective option with competitive performance.
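To make the spacing between models easier to see, the figures quoted above can be ranked and compared against the leader. The following is a minimal sketch, not an official SWE-rebench tool: the scores are copied from this summary, and the helper name `gaps_to_leader` is hypothetical.

```python
# Resolved rates (%) exactly as quoted in the February summary above.
scores = {
    "Claude Opus 4.6": 65.3,
    "gpt-5.2-medium": 64.4,
    "GLM-5": 62.8,
    "gpt-5.4-medium": 62.8,
    "Gemini 3.1 Pro Preview": 62.3,
    "DeepSeek-V3.2": 60.9,
    "Qwen3.5-397B": 59.9,
    "Step-3.5-Flash": 59.6,
    "MiniMax M2.5": 54.6,
    "Qwen3-Coder-Next": 54.4,
}


def gaps_to_leader(scores):
    """Return (model, score, gap) tuples sorted by score, highest first.

    `gap` is the difference in percentage points to the top score.
    Hypothetical helper for illustration only.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    top = ranked[0][1]
    return [(name, score, round(top - score, 1)) for name, score in ranked]


for name, score, gap in gaps_to_leader(scores):
    print(f"{name:<24} {score:5.1f}%  gap {gap:.1f} pts")
```

Read this way, the top six models fall within 4.4 percentage points of each other, while the two models at the bottom of this list trail the leader by more than 10 points.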

AI-Radar Takeaway

The SWE-rebench leaderboard has been updated with February results on 57 fresh GitHub PR tasks. Claude Opus 4.6 remains at the top with 65.3%, but gpt-5.2-medium, GLM-5, and gpt-5.4-medium all sit within 2.5 percentage points of it. Open-weight models such as Qwen3.5-397B and Step-3.5-Flash continue to close the gap.


