Qwen Models Challenge Giants on Terminal-Bench 2.0
The Qwen3.6-35B-A3B and Qwen3.5-9B models have officially entered the public Terminal-Bench 2.0 leaderboard, a recognized benchmark for evaluating LLM capabilities. Specifically, the little-coder × Qwen3.6-35B-A3B combination achieved a score of 24.6% (±3.2), placing it above prominent entries such as Gemini 2.5 Pro on Gemini CLI, which scored 19.6%, and Qwen3-Coder-480B on Terminus 2, at 23.9%.
This result is significant because it shows that smaller models can compete effectively in complex evaluation contexts. The performance of Qwen3.5-9B, which reached 9.2%, is more modest but reinforces the same point: Large Language Models (LLMs) with fewer than 10 billion parameters can no longer be dismissed as unsuitable for challenging benchmarks; they are measurable, viable options.
Technical Details and Infrastructure Implications
The Terminal-Bench 2.0 benchmark is designed to test the "agentic" capabilities of LLMs, meaning their ability to reason, plan, and interact with complex environments to solve problems. The fact that a model like Qwen3.6-35B-A3B can outperform larger or proprietary competitors in this type of test has direct implications for deployment strategies.
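The benchmark's exact harness is not reproduced here, but the core pattern it exercises — a model that proposes shell commands, observes their output, and iterates toward a goal — can be sketched in a few lines of Python. Everything below is an illustrative assumption: the prompt format, the `DONE` stop token, and the injected `query_llm` callable are placeholders, not the actual Terminal-Bench protocol.

```python
import subprocess
from typing import Callable

def agent_loop(task: str, query_llm: Callable[[str], str], max_steps: int = 10) -> list[str]:
    """Minimal reason-act loop: the model proposes a shell command,
    we execute it, and the output is fed back as context."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        # Ask the model for the next command, given the transcript so far.
        action = query_llm("\n".join(history) + "\nNext shell command (or DONE):")
        if action.strip() == "DONE":
            break
        # Run the proposed command and append stdout/stderr to the transcript.
        result = subprocess.run(action, shell=True, capture_output=True,
                                text=True, timeout=60)
        history.append(f"$ {action}\n{result.stdout}{result.stderr}")
    return history
```

A harness of this shape is where planning quality shows up directly: a model that wastes steps on redundant commands exhausts its budget before solving the task, regardless of parameter count.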
For CTOs, DevOps leads, and infrastructure architects, the availability of performant LLMs with lower computational requirements is crucial. Smaller models demand less VRAM and can run on less expensive or existing hardware, reducing the Total Cost of Ownership (TCO) for on-premise deployments. This paves the way for more accessible self-hosted solutions, where data sovereignty and control over the infrastructure are priorities.
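To make the VRAM point concrete, a common back-of-envelope rule is that weight memory is roughly parameter count times bytes per parameter, before KV cache and runtime overhead. The sketch below uses that approximation; the figures are illustrative estimates, not measured requirements for any specific Qwen checkpoint.

```python
def weight_memory_gib(params_billions: float, bytes_per_param: float) -> float:
    """Lower-bound estimate: weights only, excluding KV cache,
    activations, and framework overhead (often another 20-50%)."""
    return params_billions * 1e9 * bytes_per_param / 1024**3

# Approximate weight footprints at common precisions.
for name, params in [("9B-class model", 9.0), ("35B-class model", 35.0)]:
    for precision, nbytes in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
        print(f"{name} @ {precision}: ~{weight_memory_gib(params, nbytes):.0f} GiB")
```

Note that for a mixture-of-experts model such as the 35B-A3B (roughly 3B parameters active per token), the small active set lowers per-token compute, but all expert weights generally still need to reside in memory, so the weight estimate above remains the relevant figure for VRAM sizing.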
The Context of On-Premise Deployments and Data Sovereignty
The increasing effectiveness of smaller LLMs is an enabling factor for organizations choosing to keep AI workloads within their own data centers. On-premise deployments offer advantages in terms of security, regulatory compliance (such as GDPR), and the ability to operate in air-gapped environments, which are essential for highly regulated sectors.
While larger models often require scalable and costly cloud infrastructure, optimizing LLMs to run on less compute makes it practical to use bare-metal servers or local GPU clusters. This approach lets companies retain full control over their data and inference processes, avoiding the dependencies and variable costs associated with cloud services. For those evaluating the trade-offs between on-premise and cloud deployments, AI-RADAR offers analytical frameworks and insights on /llm-onpremise to support informed decisions.
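As a concrete illustration of what a self-hosted setup can look like, here is a minimal inference sketch using the Hugging Face transformers library. The model identifier is a placeholder assumption — substitute whichever checkpoint you actually deploy — and production deployments would typically sit behind a dedicated serving stack (for example vLLM) rather than a bare script.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"  # placeholder; use your deployed checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" spreads the weights across available local GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "List the files in /tmp sorted by size."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Running inference this way keeps prompts and outputs entirely on local hardware, which is precisely the property that matters for air-gapped or compliance-sensitive environments.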
Future Prospects and the Impact of Open Source
Innovation in the LLM field is increasingly driven by the open-source community, which constantly pushes boundaries to make these technologies more efficient and accessible. The success of Qwen models on Terminal-Bench 2.0 is a clear example of how collaboration and open research can lead to significant progress, especially in optimization for resource-constrained environments.
The stated ambition to reach the top of the leaderboard, together with the emphasis on open source, underscores a clear trend: the future of LLMs lies not only in model size, but also in efficiency and the ability to run locally. This direction is fundamental for democratizing access to advanced artificial intelligence and enabling more companies to implement customized AI solutions with granular control and predictable costs.