Claude Code (Opus 4.6) leads the latest SWE-rebench leaderboard, achieving a 52.9% resolved rate on 48 new tasks extracted from GitHub pull requests (PRs) opened in the previous month. The SWE-rebench benchmark evaluates a model's ability to read a real issue, edit the code, and run the project's tests, counting a task as resolved only when the full suite passes.
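The resolution check described above can be sketched as a simple apply-and-test loop. This is an illustrative sketch only, not the actual SWE-rebench harness: the function name, parameters, and flow are assumptions, and the real pipeline (environment setup, fail-to-pass test selection, timeouts) is more involved.

```python
import subprocess

def evaluate_task(repo_dir: str, model_patch: str, test_cmd: list[str]) -> bool:
    """Hypothetical sketch of a SWE-bench-style resolution check:
    apply the model's patch to the repo at the PR's base commit,
    then run the test suite. Resolved only if the suite passes."""
    # Apply the model-generated unified diff to the checked-out repo.
    apply = subprocess.run(
        ["git", "apply", "-"], input=model_patch, text=True,
        cwd=repo_dir, capture_output=True,
    )
    if apply.returncode != 0:
        return False  # patch did not apply cleanly
    # Run the full test suite; exit code 0 means all tests passed.
    tests = subprocess.run(test_cmd, cwd=repo_dir)
    return tests.returncode == 0
```

A model's resolved rate is then simply the fraction of tasks for which this check returns true.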

Model Performance

  • Claude Code (Opus 4.6) also excels in pass@5, reaching 70.8%.
  • Claude Opus 4.6 and gpt-5.2-xhigh follow closely, each at a 51.7% resolution rate.
  • gpt-5.2-medium (51.0%) shows similar performance to top configurations.
  • Among open-source models, Kimi K2 Thinking (43.8%), GLM-5 (42.1%), and Qwen3-Coder-Next (40.0%) lead the pack.
  • MiniMax M2.5 (39.6%) continues to show strong performance while remaining one of the cheapest options.
  • A notable gap separates the Kimi variants: K2 Thinking (43.8%) vs. K2.5 (37.9%).
  • Newer smaller variants (GLM-4.7 Flash, gpt-5-mini-medium) trade performance for efficiency, landing in the 25-31% range.
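The pass@5 figure above is the probability that at least one of five attempts resolves a task. SWE-rebench's exact estimator is not specified here; a common choice is the unbiased pass@k estimator computed from n attempts with c successes, sketched below.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    attempts drawn from n total attempts (c of them successful) passes.
    Computed as 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: a task solved in 2 of 5 attempts is certain to count for pass@5,
print(pass_at_k(5, 2, 5))  # 1.0
# while its pass@1 is just the per-attempt success rate.
print(pass_at_k(5, 2, 1))  # 0.4
```

Averaging this quantity over all tasks yields the leaderboard's pass@5 column; with k equal to the number of attempts, any task solved at least once contributes 1.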

For teams evaluating on-premise deployments, these results come with trade-offs between performance, cost, and resource requirements. AI-RADAR offers analytical frameworks at /llm-onpremise for comparing these alternatives.