Claude Code (Opus 4.6) leads the latest SWE-rebench leaderboard, achieving a 52.9% resolved rate on 48 new tasks extracted from GitHub pull requests (PRs) created in the previous month. The SWE-rebench benchmark evaluates a model's ability to read real issues, edit code, and run tests, with success defined as passing the full test suite.
Model Performance
- Claude Code (Opus 4.6) also excels in pass@5, reaching 70.8%.
- Claude Opus 4.6 and gpt-5.2-xhigh follow closely, with a 51.7% resolution rate.
- gpt-5.2-medium (51.0%) shows similar performance to top configurations.
- Among open-source models, Kimi K2 Thinking (43.8%), GLM-5 (42.1%), and Qwen3-Coder-Next (40.0%) lead the pack.
- MiniMax M2.5 (39.6%) continues to show strong performance while remaining one of the cheapest options.
- Within the Kimi family, K2 Thinking (43.8%) outperforms K2.5 (37.9%), a notable gap between variants.
- Newer smaller variants (GLM-4.7 Flash, gpt-5-mini-medium) trade performance for efficiency, landing in the 25-31% range.
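The pass@5 figure cited above measures the probability that at least one of 5 attempts resolves a task. The leaderboard's exact methodology is not specified here, but a common way to compute this is the unbiased pass@k estimator from the Codex paper (Chen et al., 2021), sketched below; the function name and variable choices are illustrative:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total attempts sampled per task
    c: number of attempts that resolved the task
    k: evaluation budget (e.g. 5 for pass@5)
    """
    if n - c < k:
        # Too few failures to fill a k-sized sample without a success.
        return 1.0
    # 1 minus the probability that all k sampled attempts fail.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: a task solved in 1 of 2 attempts gives pass@1 = 0.5.
print(pass_at_k(2, 1, 1))
```

Averaging this estimate over all 48 tasks yields the benchmark-level pass@5 score.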
For those evaluating on-premise deployments, there are trade-offs between performance, cost, and resource requirements. AI-RADAR offers analytical frameworks on /llm-onpremise to evaluate these alternatives.