Qwen3.5-397B: Q2 Quantization Proves Surprisingly Effective on Local Hardware

A recent test on a local hardware setup has revealed unexpected capabilities in the Qwen3.5-397B model, specifically in its Q2 quantized version. The result is significant because low-bit quantizations like Q2 have historically been associated with a drastic loss of quality, rendering many models almost unusable for practical applications. This test suggests that, with specific LLMs, it is possible to achieve surprisingly robust results even with aggressive compression.

This scenario is of particular interest to organizations evaluating the deployment of Large Language Models in self-hosted or air-gapped environments, where hardware resource management, especially VRAM, represents a critical constraint. The ability to run large models with a reduced memory footprint while maintaining high output quality can profoundly influence decisions regarding Total Cost of Ownership (TCO) and data sovereignty.

Technical Details and Real-World Performance

The test configuration was a workstation with an AMD Ryzen 9 3950X processor and 96GB of DDR4 RAM running at 3000MHz. The GPU subsystem combined an AMD Radeon Pro W6800 and a Radeon RX 6800, which together provided 48GB of VRAM, with a memory bandwidth of approximately 512GB/s. The specific model employed was Unsloth's UD_IQ2_M, which occupies about 122GB on disk in its Q2 quantized form.
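The figures above allow a quick back-of-the-envelope check of what this quantization means in practice. The following is a minimal sketch, assuming decimal gigabytes, a parameter count of 397 billion as implied by the model name, and a simple split in which VRAM is filled with weights first; the function names are illustrative and not part of any tool used in the test. In a real run, part of the VRAM is also taken by the KV-cache and compute buffers, so the share of weights held in system RAM would be somewhat larger.

```python
# Figures quoted in the article; decimal gigabytes are an assumption.
PARAMS = 397e9    # parameter count implied by the model name
FILE_GB = 122.0   # approximate on-disk size of the Q2 quantization
VRAM_GB = 48.0    # combined VRAM of the two GPUs
RAM_GB = 96.0     # system DDR4


def bits_per_weight(file_gb: float, params: float) -> float:
    """Average bits stored per parameter, ignoring file metadata."""
    return file_gb * 1e9 * 8 / params


def memory_split(file_gb: float, vram_gb: float) -> tuple[float, float]:
    """Return (GB of weights in VRAM, GB of weights left for system RAM)."""
    in_vram = min(file_gb, vram_gb)
    return in_vram, file_gb - in_vram


bpw = bits_per_weight(FILE_GB, PARAMS)
in_vram, in_ram = memory_split(FILE_GB, VRAM_GB)
print(f"average bits per weight: {bpw:.2f}")
print(f"weights in VRAM: {in_vram:.0f} GB, weights in RAM: {in_ram:.0f} GB")
print(f"RAM left after weights: {RAM_GB - in_ram:.0f} GB")
```

Under these assumptions the file averages roughly 2.5 bits per weight, and well over half of the weights have to live in system RAM rather than VRAM, which helps explain why generation speed is bounded by DDR4 bandwidth rather than by the GPUs alone.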

Regarding performance, after a "warm-up" period of about 2-3 minutes, the system generated approximately 11 tokens per second. Prompt processing on shorter prompts reached about 43 tokens per second. While this prompt processing speed may be slightly below ideal for interactive coding sessions, it proved largely sufficient for 24/7 autonomous agent loops. The context window was kept at approximately 20,000 tokens, with the KV-cache set to q8_0, using the main branch of llama.cpp with ROCm support.
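To see why these speeds suit unattended agent loops better than interactive work, the reported throughput can be turned into rough latency estimates. This is a sketch under stated assumptions: the prompt is processed from scratch with no cache reuse, the 43 tokens per second measured on shorter prompts also holds for long ones (which is likely optimistic), and the 500-token reply length is a hypothetical value chosen for illustration.

```python
PROMPT_TPS = 43.0  # prompt processing speed reported for shorter prompts
GEN_TPS = 11.0     # generation speed reported after warm-up


def turn_seconds(prompt_tokens: int, output_tokens: int,
                 prompt_tps: float = PROMPT_TPS,
                 gen_tps: float = GEN_TPS) -> float:
    """Estimated wall-clock seconds for one request with no prompt caching."""
    return prompt_tokens / prompt_tps + output_tokens / gen_tps


def tokens_per_day(gen_tps: float = GEN_TPS) -> int:
    """Upper bound on tokens generated in 24 hours of continuous output."""
    return int(gen_tps * 24 * 60 * 60)


# Hypothetical workloads: a full 20,000-token context and a short prompt.
for prompt_tokens in (20_000, 2_000):
    seconds = turn_seconds(prompt_tokens, 500)
    print(f"{prompt_tokens} prompt tokens + 500 output tokens: "
          f"{seconds / 60:.1f} min")

print(f"generation ceiling: {tokens_per_day()} tokens per day")
```

With these inputs, a request that fills the whole 20,000-token context takes roughly eight and a half minutes, which is slow for a person waiting at a keyboard but acceptable for an agent that runs around the clock.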

Output Quality and Implications for Local Deployments

The output quality of Qwen3.5-397B quantized at Q2 proved remarkable, particularly for coding and general knowledge tasks. In the author's tests, the model outperformed Qwen3.5 27B (full version), Qwen3.5 122B (Q4), MiniMax M2.5 (Q4), GPT-OSS-120B (full version), and Gemma 4 31B (full version) in these areas. Hallucinations were observed in the reasoning output, a common phenomenon for Q2 quantizations, but the model corrected itself quickly. For that reason, the test author advises against running the model without "reasoning tokens": with reasoning disabled, the model loses the opportunity to catch and correct its own hallucinations.

This unexpected performance opens new perspectives for companies that need to run powerful LLMs in resource-limited environments or under stringent data sovereignty requirements. If high-quality results can be obtained from a Q2 quantized model, investment in high-end hardware could be significantly reduced, making on-premise deployments more accessible and sustainable.

Outlook for the On-Premise LLM Ecosystem

The finding that Qwen3.5-397B can be "shockingly useful" even with Q2 quantization may mark a turning point for the local LLM community and for enterprises exploring cloud alternatives. Quantizations at such low levels have long been considered ineffective, but this test indicates that output quality is not determined by quantization level alone: a very large model at aggressive compression can outperform smaller models run at higher precision. This prompts a reconsideration of the trade-offs between compression, hardware requirements, and actual performance.

For those evaluating on-premise deployments, analytical frameworks such as those offered by AI-RADAR on /llm-onpremise can help assess the trade-offs between different models, quantization levels, and hardware configurations. The experience with Qwen3.5-397B suggests that model optimization and the choice of the right quantization can unlock new possibilities for running advanced LLMs on local infrastructure, ensuring control, security, and optimized TCO.