Claude Code (Opus 4.6) leads the latest SWE-rebench leaderboard, achieving a 52.9% resolved rate on 48 new tasks extracted from GitHub pull requests (PRs) opened in the previous month. The SWE-rebench benchmark evaluates a model's ability to read a real issue, edit the code, and run the project's tests, counting a task as resolved only when the full suite passes.
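The resolution check described above can be sketched as a simple apply-and-test loop. This is an illustrative sketch only, not the actual SWE-rebench harness: the function name, parameters, and flow are assumptions, and the real pipeline (environment setup, fail-to-pass test selection, timeouts) is more involved.

```python
import subprocess

def evaluate_task(repo_dir: str, model_patch: str, test_cmd: list[str]) -> bool:
    """Hypothetical sketch of a SWE-bench-style resolution check:
    apply the model's patch to the repo at the PR's base commit,
    then run the test suite. Resolved only if the suite passes."""
    # Apply the model-generated unified diff to the checked-out repo.
    apply = subprocess.run(
        ["git", "apply", "-"], input=model_patch, text=True,
        cwd=repo_dir, capture_output=True,
    )
    if apply.returncode != 0:
        return False  # patch did not apply cleanly
    # Run the full test suite; exit code 0 means all tests passed.
    tests = subprocess.run(test_cmd, cwd=repo_dir)
    return tests.returncode == 0
```

A model's resolved rate is then simply the fraction of tasks for which this check returns true.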

Model Performance

  • Claude Code (Opus 4.6) also excels in pass@5, reaching 70.8%.
  • Claude Opus 4.6 and gpt-5.2-xhigh follow closely, each at a 51.7% resolution rate.
  • gpt-5.2-medium (51.0%) shows similar performance to top configurations.
  • Among open-source models, Kimi K2 Thinking (43.8%), GLM-5 (42.1%), and Qwen3-Coder-Next (40.0%) lead the pack.
  • MiniMax M2.5 (39.6%) continues to show strong performance while remaining one of the cheapest options.
  • A notable gap separates the Kimi variants: K2 Thinking (43.8%) vs. K2.5 (37.9%).
  • Newer smaller variants (GLM-4.7 Flash, gpt-5-mini-medium) trade performance for efficiency, landing in the 25-31% range.
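The pass@5 figure above is the probability that at least one of five attempts resolves a task. SWE-rebench's exact estimator is not specified here; a common choice is the unbiased pass@k estimator computed from n attempts with c successes, sketched below.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    attempts drawn from n total attempts (c of them successful) passes.
    Computed as 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: a task solved in 2 of 5 attempts is certain to count for pass@5,
print(pass_at_k(5, 2, 5))  # 1.0
# while its pass@1 is just the per-attempt success rate.
print(pass_at_k(5, 2, 1))  # 0.4
```

Averaging this quantity over all tasks yields the leaderboard's pass@5 column; with k equal to the number of attempts, any task solved at least once contributes 1.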

For teams evaluating on-premise deployments, these results come with trade-offs between performance, cost, and resource requirements. AI-RADAR offers analytical frameworks at /llm-onpremise for comparing these alternatives.