Claude Code (Opus 4.6) leads the latest SWE-rebench leaderboard, achieving a 52.9% resolved rate on 48 new tasks extracted from GitHub pull requests (PRs) created in the previous month. The SWE-rebench benchmark evaluates a model's ability to read real issues, edit code, and run tests, with success defined as passing the full test suite.
Model Performance
- Claude Code (Opus 4.6) also excels in pass@5, reaching 70.8%.
- Claude Opus 4.6 and gpt-5.2-xhigh follow closely, with a 51.7% resolution rate.
- gpt-5.2-medium (51.0%) shows similar performance to top configurations.
- Among open-source models, Kimi K2 Thinking (43.8%), GLM-5 (42.1%), and Qwen3-Coder-Next (40.0%) lead the pack.
- MiniMax M2.5 (39.6%) continues to show strong performance while remaining one of the cheapest options.
- Within the Kimi family, K2 Thinking (43.8%) outperforms K2.5 (37.9%), a notable gap between variants.
- Newer smaller variants (GLM-4.7 Flash, gpt-5-mini-medium) trade performance for efficiency, landing in the 25-31% range.
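The pass@5 figure cited above measures the probability that at least one of 5 attempts resolves a task. The leaderboard's exact methodology is not specified here, but a common way to compute this is the unbiased pass@k estimator from the Codex paper (Chen et al., 2021), sketched below; the function name and variable choices are illustrative:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total attempts sampled per task
    c: number of attempts that resolved the task
    k: evaluation budget (e.g. 5 for pass@5)
    """
    if n - c < k:
        # Too few failures to fill a k-sized sample without a success.
        return 1.0
    # 1 minus the probability that all k sampled attempts fail.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: a task solved in 1 of 2 attempts gives pass@1 = 0.5.
print(pass_at_k(2, 1, 1))
```

Averaging this estimate over all 48 tasks yields the benchmark-level pass@5 score.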
For those evaluating on-premise deployments, there are trade-offs between performance, cost, and resource requirements. AI-RADAR offers analytical frameworks on /llm-onpremise to evaluate these alternatives.