A recent experiment demonstrated that Qwen3.5-35B-A3B, a Mixture-of-Experts model with only 3 billion active parameters, can achieve remarkable performance on the SWE-bench Verified Hard benchmark when paired with a continuous verification strategy.

Experiment Details

The experiment used a minimal agent harness exposing tools such as file_read, file_edit, bash, grep, and glob. Three verification strategies were compared:

  • Baseline (no self-verification): 22.2% success rate
  • Verify-at-last (test before declaring done): 33.3% success rate
  • Verify-on-edit (test after every file_edit): 37.8% success rate
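The harness itself can be pictured as a thin tool layer over the filesystem and shell. The sketch below is an assumption about its shape, since the experiment's actual code is not published; the tool names match those listed above, but the signatures are hypothetical.

```python
import subprocess
from pathlib import Path

# Minimal sketch of the agent harness's tool layer. Tool names follow the
# article (file_read, file_edit, bash); signatures are assumptions.

def file_read(path: str) -> str:
    """Return the contents of a file."""
    return Path(path).read_text()

def file_edit(path: str, old: str, new: str) -> None:
    """Replace the first occurrence of an exact snippet in a file."""
    text = Path(path).read_text()
    if old not in text:
        raise ValueError(f"snippet not found in {path}")
    Path(path).write_text(text.replace(old, new, 1))

def bash(command: str, timeout: int = 60) -> str:
    """Run a shell command and return its combined stdout and stderr."""
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=timeout)
    return result.stdout + result.stderr

# Dispatch table the agent loop would use to route model tool calls.
TOOLS = {"file_read": file_read, "file_edit": file_edit, "bash": bash}
```

A real harness would add grep and glob wrappers and stream tool output back into the model's context, but the dispatch pattern stays the same.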

The "verify-on-edit" strategy injects a message into the agent's context after each file modification, prompting it to verify the correctness of the change with a short inline Python script or a test script.
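The injection step can be sketched as a small wrapper around tool execution. This is a hypothetical reconstruction, assuming a chat-style message list; the actual prompt wording and message format used in the experiment are not published.

```python
# Sketch of the verify-on-edit mechanism: after every file_edit tool call,
# a verification request is appended to the conversation. The prompt text
# and message schema below are assumptions.

VERIFY_PROMPT = (
    "You just edited a file. Before continuing, verify that the change is "
    "correct by running a short inline Python script or a test script."
)

def run_tool(messages: list, tool_name: str, tool_fn, *args, **kwargs):
    """Execute a tool call; for file edits, inject a verification request."""
    result = tool_fn(*args, **kwargs)
    messages.append({"role": "tool", "name": tool_name, "content": str(result)})
    if tool_name == "file_edit":
        # Verify-on-edit: nudge the model to test immediately after each edit,
        # rather than only once before declaring the task done.
        messages.append({"role": "user", "content": VERIFY_PROMPT})
    return result
```

Verify-at-last would instead inject a single such prompt only when the agent signals completion, which is what the 33.3% row above corresponds to.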

Results

With the "verify-on-edit" strategy, the model reached a 37.8% success rate on the SWE-bench Verified Hard benchmark, approaching the 40% scored by Claude Opus 4.6. On the full benchmark (500 tasks), it achieved 67.0%, comparable to much larger systems.

Considerations

These results underscore how much effective verification strategies can improve language-model performance, even for smaller models. For teams evaluating on-premise deployments, there are trade-offs to weigh, as discussed in AI-RADAR's analytical frameworks on /llm-onpremise.