A recent experiment demonstrated that Qwen3.5-35B-A3B, a Mixture-of-Experts model with only 3 billion active parameters, can achieve remarkable performance on the SWE-bench Verified Hard benchmark when paired with a continuous verification strategy.

Experiment Details

The experiment used a minimal agent harness exposing tools such as file_read, file_edit, bash, grep, and glob. Three verification strategies were compared:

  • Baseline (no self-verification): 22.2% success rate
  • Verify-at-last (test before declaring done): 33.3% success rate
  • Verify-on-edit (test after every file_edit): 37.8% success rate
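The harness itself can be pictured as a thin tool layer over the filesystem and shell. The sketch below is an assumption about its shape, since the experiment's actual code is not published; the tool names match those listed above, but the signatures are hypothetical.

```python
import subprocess
from pathlib import Path

# Minimal sketch of the agent harness's tool layer. Tool names follow the
# article (file_read, file_edit, bash); signatures are assumptions.

def file_read(path: str) -> str:
    """Return the contents of a file."""
    return Path(path).read_text()

def file_edit(path: str, old: str, new: str) -> None:
    """Replace the first occurrence of an exact snippet in a file."""
    text = Path(path).read_text()
    if old not in text:
        raise ValueError(f"snippet not found in {path}")
    Path(path).write_text(text.replace(old, new, 1))

def bash(command: str, timeout: int = 60) -> str:
    """Run a shell command and return its combined stdout and stderr."""
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=timeout)
    return result.stdout + result.stderr

# Dispatch table the agent loop would use to route model tool calls.
TOOLS = {"file_read": file_read, "file_edit": file_edit, "bash": bash}
```

A real harness would add grep and glob wrappers and stream tool output back into the model's context, but the dispatch pattern stays the same.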

The "verify-on-edit" strategy injects a message into the agent's context after each file modification, prompting it to verify the correctness of the change with a short inline Python script or a test script.
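The injection step can be sketched as a small wrapper around tool execution. This is a hypothetical reconstruction, assuming a chat-style message list; the actual prompt wording and message format used in the experiment are not published.

```python
# Sketch of the verify-on-edit mechanism: after every file_edit tool call,
# a verification request is appended to the conversation. The prompt text
# and message schema below are assumptions.

VERIFY_PROMPT = (
    "You just edited a file. Before continuing, verify that the change is "
    "correct by running a short inline Python script or a test script."
)

def run_tool(messages: list, tool_name: str, tool_fn, *args, **kwargs):
    """Execute a tool call; for file edits, inject a verification request."""
    result = tool_fn(*args, **kwargs)
    messages.append({"role": "tool", "name": tool_name, "content": str(result)})
    if tool_name == "file_edit":
        # Verify-on-edit: nudge the model to test immediately after each edit,
        # rather than only once before declaring the task done.
        messages.append({"role": "user", "content": VERIFY_PROMPT})
    return result
```

Verify-at-last would instead inject a single such prompt only when the agent signals completion, which is what the 33.3% row above corresponds to.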

Results

With the "verify-on-edit" strategy, the model reached a 37.8% success rate on the SWE-bench Verified Hard benchmark, approaching the 40% scored by Claude Opus 4.6. On the full benchmark (500 tasks), it achieved 67.0%, comparable to much larger systems.

Considerations

These results underscore how much effective verification strategies can improve language-model performance, even for smaller models. For teams evaluating on-premise deployments, there are trade-offs to weigh, as discussed in AI-RADAR's analytical frameworks on /llm-onpremise.