SWE-rebench update: Qwen3.6-27B, Gemma 4 31B and other local models join the leaderboard

The latest SWE-rebench leaderboard update introduces a trio of models designed for self-hosting: Qwen3.6-27B, Qwen3.6-35B-A3B, and Gemma 4 31B. They don’t yet top the charts – Claude Opus 4.8 leads with 56.5% and GLM-5.2 follows at 51.1% – but their presence signals a meaningful shift in how coding AI is being evaluated.

A closer look at the numbers

In a benchmark that mimics real-world software engineering tasks, Qwen3.6-27B scored 36.5% while consuming an average of 1.88 million tokens per task. Its MoE sibling, Qwen3.6-35B-A3B (whose naming indicates only 3 billion active parameters per token), reached 33.8% with 2.23 million tokens. Gemma 4 31B landed at 16.5% using 2.24 million tokens. The gap with the leaders is still clear, and the Gemma result shows that a moderate token count does not by itself imply a strong score. Even so, all three stay under 2.3 million tokens per task, which makes them, and the two Qwen models in particular, interesting for anyone running agents on local hardware.
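One way to put score and token count on a single axis is to divide the average tokens per task by the fraction of tasks solved. The sketch below does that with the figures quoted above; the "tokens per resolved task" metric and the function name are illustrative assumptions, not something the leaderboard itself reports.

```python
# Figures quoted in the article: (score in %, average tokens per task in millions).
RESULTS = {
    "Qwen3.6-27B": (36.5, 1.88),
    "Qwen3.6-35B-A3B": (33.8, 2.23),
    "Gemma 4 31B": (16.5, 2.24),
}


def tokens_per_resolved_task(score_pct, mtokens_per_task):
    """Average tokens (in millions) spent for each task that ends up resolved.

    Hypothetical helper: assumes failed attempts cost as many tokens
    as successful ones, which the article does not confirm.
    """
    return mtokens_per_task / (score_pct / 100.0)


for name, (score, mtok) in RESULTS.items():
    cost = tokens_per_resolved_task(score, mtok)
    print(f"{name}: {cost:.2f}M tokens per resolved task")
```

Under that assumption, a model with a similar token count but half the score costs roughly twice as much per solved task, which is why raw token counts are best read alongside the score.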

Token consumption as a deployment metric

In on-premise or self-hosted scenarios, token count isn’t just an abstract efficiency metric: it directly affects compute time, energy draw, and hardware requirements. Qwen3.6-27B’s 1.88M tokens look particularly frugal compared to, say, MiniMax M3’s 6.89M tokens (for a 45.6% score) or even GLM-5.2’s 2.62M. On the same hardware, fewer tokens per task generally mean less wall-clock time per task and less pressure on context length and KV-cache memory, which can ease the need for high-end GPUs and lower the Total Cost of Ownership. For teams weighing on-prem vs cloud options, these numbers translate into concrete CapEx and OpEx considerations.
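To make the link between tokens and compute time concrete, here is a back-of-envelope sketch that converts tokens per task into minutes per task. The throughput value is a placeholder assumption, not a measured number, and the calculation treats input and output tokens alike, even though prompt processing and generation run at very different speeds in practice.

```python
# Average tokens per task (millions), as quoted in the article.
MTOKENS_PER_TASK = {
    "Qwen3.6-27B": 1.88,
    "GLM-5.2": 2.62,
    "MiniMax M3": 6.89,
}

# ASSUMPTION: aggregate throughput of the serving setup, in tokens per second.
# Replace with a figure measured on your own hardware.
ASSUMED_TOKENS_PER_SECOND = 1000.0


def minutes_per_task(mtokens_per_task, tokens_per_second):
    """Rough wall-clock minutes to process one task at a fixed throughput."""
    seconds = mtokens_per_task * 1_000_000 / tokens_per_second
    return seconds / 60.0


for name, mtok in MTOKENS_PER_TASK.items():
    minutes = minutes_per_task(mtok, ASSUMED_TOKENS_PER_SECOND)
    print(f"{name}: ~{minutes:.0f} min per task at {ASSUMED_TOKENS_PER_SECOND:.0f} tok/s")
```

The absolute numbers depend entirely on the assumed throughput, and different models will not reach the same throughput on the same machine. The ratio between token counts is the part that carries over to GPU-hours and energy.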

MoE architecture: a self-hosting advantage

The “35B-A3B” designation reveals a mixture-of-experts design that activates only a fraction of the total parameters for each token, roughly 3 billion of 35 billion. This cuts the compute needed per token and so speeds up inference, but it does not shrink the weights themselves: all 35 billion parameters still have to be held in memory, or partly offloaded to system RAM, where the small active set can help keep inference usable. That trade-off matters for local deployments with modest VRAM budgets. Meanwhile, the dense Qwen3.6-27B shows that a 27-billion-parameter model can already handle complex coding challenges at a level that may be adequate for many internal development tasks.
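The two effects of a mixture-of-experts design, weight storage and per-token compute, can be separated with a rough estimate. The sketch below assumes the common rule of thumb of about 2 FLOPs per active parameter per generated token (ignoring attention and KV-cache costs) and takes the parameter counts from the model names; the helper names and the choice of precisions are illustrative.

```python
def weight_memory_gb(total_params_billion, bytes_per_param):
    """Approximate size of the weights in GB (1 GB = 1e9 bytes).

    Every parameter has to be stored somewhere, whether or not
    it is active for a given token.
    """
    return total_params_billion * bytes_per_param


def flops_per_token(active_params_billion):
    """Rule-of-thumb forward-pass cost: ~2 FLOPs per active parameter."""
    return 2 * active_params_billion * 1e9


# (total parameters, active parameters per token), in billions, read off the names.
MODELS = {
    "Qwen3.6-27B (dense)": (27, 27),
    "Qwen3.6-35B-A3B (MoE)": (35, 3),
}

for name, (total, active) in MODELS.items():
    bf16 = weight_memory_gb(total, 2.0)   # 16-bit weights
    int4 = weight_memory_gb(total, 0.5)   # ASSUMPTION: 4-bit quantization
    gflops = flops_per_token(active) / 1e9
    print(f"{name}: ~{bf16:.0f} GB at 16-bit, ~{int4:.1f} GB at 4-bit, "
          f"~{gflops:.0f} GFLOPs per token")
```

On these assumptions the MoE model needs more memory for its weights than the dense 27B model, but roughly a ninth of the compute per token, which is where its speed advantage comes from.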

Enabling autonomy with Harbor

It’s no coincidence that the update references Harbor, a framework that lets you run coding agents on your own infrastructure. Tools like Harbor are turning the idea of fully local development pipelines into a practical reality, where code never leaves the company network. SWE-rebench, by spotlighting self-hosted models and publishing detailed token consumption data, becomes a valuable ally not just for researchers but also for IT decision-makers who need objective data to guide their AI infrastructure choices.

A growing space for local coding AI

The call for community suggestions on which local models to test next suggests the leaderboard will keep expanding. While frontier models push the upper bound, the rapidly maturing mid-tier is poised to deliver production-grade coding agents on owned hardware. For organizations that prioritize data sovereignty and TCO, the message is clear: the metrics that matter are evolving, and token efficiency is one of them.
