Qwen3.5-35B-A3B: performance close to Claude Opus with continuous verification

A recent experiment has demonstrated that the Qwen3.5-35B-A3B model, a Mixture of Experts model with only 3 billion active parameters, can achieve remarkable performance on the SWE-bench Verified Hard benchmark by adopting a continuous verification strategy.

Experiment Details

The experiment used a minimal agent harness with tools such as file_read, file_edit, bash, grep, and glob. Several verification strategies were tested, including:

Baseline (no self-verification): 22.2% success rate
Verify-at-last (test before declaring done): 33.3% success rate
Verify-on-edit (test after every file_edit): 37.8% success rate

The "verify-on-edit" strategy consists of injecting a message to the agent after each file modification, asking it to verify the correctness of the modification via a short inline python script or a test script.

Results

The "verify-on-edit" strategy allowed the model to achieve a 37.8% success rate on the SWE-bench Verified Hard benchmark, approaching Claude Opus 4.6's 40%. On the full benchmark (500 tasks), the model achieved 67.0%, comparable to much larger systems.

Considerations

These results highlight the importance of effective verification strategies to improve the performance of language models, even smaller ones. For those evaluating on-premise deployments, there are trade-offs to consider, as highlighted by AI-RADAR's analytical frameworks on /llm-onpremise.

🔍 Continue Exploring

Qwen3.5-35B-A3B: performance close to Claude Opus with continuous verification

Experiment Details

Results

Considerations

💻 Need GPU Cloud Infrastructure?

💬 Comments (0)

🔍 Continue Exploring

Explore LLM On-Premise

Fine-tuning Qwen 14B for Discord Autocomplete

Context Management for DeepAgents

Qwen 3 Max-Thinking: Superior Performance in Spatial Reasoning

👥 Join 160+ AI explorers