The Efficiency of LLMs on Local Hardware: The Qwen 35B-A3B Case

The adoption of Large Language Models (LLMs) in enterprise environments often raises critical questions regarding hardware requirements, operational costs, and data sovereignty. While cloud solutions offer immediate scalability, on-premise or self-hosted deployments are gaining traction for organizations that require total control over their data and long-term costs. In this context, optimizing local hardware becomes paramount.

A detailed analysis examined the capabilities of the Qwen 35B-A3B model, a 35-billion-parameter Mixture-of-Experts (MoE) LLM whose "A3B" suffix indicates that only about 3 billion parameters are active per token, when run on a common hardware configuration. The results indicate that a GPU with 12GB of VRAM, such as an NVIDIA RTX 3060, can be a practical "sweet spot" for running this model efficiently, balancing performance against memory requirements.

Technical Details and Performance Optimization

The test setup consisted of an NVIDIA RTX 3060 GPU with 12GB of VRAM and 32GB of DDR4-3200 RAM, on a Windows system with CUDA 13.x. The model used was a quantized build, Qwen3.6-35B-A3B-MTP-IQ4_XS.gguf. Because this is an MoE model, the placement of its expert blocks is crucial: the -ncmoe parameter in llama.cpp (short for --n-cpu-moe) sets how many layers' expert weights are kept in system RAM instead of on the GPU, which directly impacts performance. A lower -ncmoe value therefore keeps more of the MoE blocks on the GPU, improving decoding speed at the cost of VRAM.
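
As a rough illustration of how this offloading is wired up in practice, a minimal llama-cli invocation along the following lines loads the model with every layer nominally offloaded to the GPU while the expert weights of the first 18 MoE layers stay in system RAM. The paths and prompt are placeholders, and exact flag spellings can differ slightly between llama.cpp builds.

    # Minimal sketch (illustrative paths and values): -ngl 99 offloads all
    # layers to the GPU, while -ncmoe 18 (--n-cpu-moe) keeps the expert
    # weights of the first 18 MoE layers in system RAM. Lowering -ncmoe puts
    # more experts in VRAM and speeds up decoding, until VRAM runs out.
    llama-cli \
      -m ./models/Qwen3.6-35B-A3B-MTP-IQ4_XS.gguf \
      -ngl 99 \
      -ncmoe 18 \
      -p "Write a short Python function that reverses a string."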

Initial benchmarks with llama-bench showed promising prefill (initial prompt processing) results, reaching approximately 914 tokens per second (t/s) on a 512-token prompt (pp512) and about 46.8 t/s for 128-token generation (tg128), using an optimized configuration with -ncmoe 18 and a q8_0-quantized KV cache. Notably, quantizing the Key-Value (KV) cache to q8_0 introduced no significant performance degradation on this GPU, making it a preferable choice: it roughly halves the cache's memory footprint compared to the default f16 cache without a measurable speed penalty.
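
For readers who want to reproduce numbers of this kind, an invocation along these lines is a reasonable starting point; it assumes a recent llama.cpp build in which llama-bench accepts the MoE-offload and KV-cache-type flags, and the model path is a placeholder.

    # Hypothetical llama-bench run mirroring the configuration described above:
    # all layers on the GPU, experts of the first 18 MoE layers in RAM, q8_0 KV
    # cache, flash attention enabled (required for a quantized V cache),
    # 512-token prefill (pp512) and 128-token generation (tg128).
    llama-bench \
      -m ./models/Qwen3.6-35B-A3B-MTP-IQ4_XS.gguf \
      -ngl 99 \
      -ncmoe 18 \
      -ctk q8_0 -ctv q8_0 \
      -fa 1 \
      -p 512 -n 128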

Balancing Context and Generation Speed

For practical use cases, such as code development, the analysis explored several configurations. A "practical" coding profile with a 32,768-token context window sustained generation at approximately 43.4 t/s while keeping about 273 MiB of VRAM free. This shows that the configuration can handle the large contexts complex programming tasks require without exhausting available VRAM.
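
A server launch for this profile might look roughly like the following. It is a sketch that assumes the -ncmoe 18 tuning from the benchmarks above carries over (the analysis does not state the exact value used for this profile), and the model path and port are placeholders; KV-cache and flash-attention flag spellings also vary a little between llama.cpp versions (older builds take a bare -fa).

    # Hypothetical "practical coding" profile: 32,768-token context, q8_0 KV
    # cache, flash attention on, experts of the first 18 MoE layers in RAM.
    llama-server \
      -m ./models/Qwen3.6-35B-A3B-MTP-IQ4_XS.gguf \
      -ngl 99 \
      -ncmoe 18 \
      -c 32768 \
      -ctk q8_0 -ctv q8_0 \
      -fa on \
      --port 8080

For the faster 16,384-token profile discussed next, -c drops to 16384; any other differences in that profile are not spelled out in the analysis.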

A slightly faster profile with a reduced context of 16,384 tokens pushed generation speed to about 44.5 t/s but left only 37 MiB of free VRAM, indicating near-limit memory usage. The test of speculative decoding via multi-token prediction (MTP) was also interesting: it offered a modest speed increase of roughly 2% over well-tuned plain decoding, reaching approximately 47.7 t/s. This suggests that, for this model and hardware, tuning the basic parameters has more impact than the more advanced techniques.

Implications for On-Premise Deployments and TCO

The main conclusion of this analysis is clear: 12GB of VRAM is a real strength for running the Qwen 35B-A3B model. The full IQ4_XS weights of a 35-billion-parameter model are considerably larger than 12GB, so the model cannot live entirely on the GPU; what the 12GB buys is enough room to keep a sufficient number of the MoE experts on the GPU for fast decoding, to use a q8_0 KV cache efficiently, and to handle large contexts of up to 32k tokens.

For CTOs, DevOps leads, and infrastructure architects, these results are particularly relevant. They demonstrate that solid performance with LLMs of this size can be achieved on relatively accessible hardware, reducing the Total Cost of Ownership (TCO) compared to cloud-based solutions and strengthening data sovereignty. The ability to deploy LLMs on-premise with manageable VRAM requirements opens new opportunities for applications that demand low latency, data security, and complete control over the infrastructure. AI-RADAR continues to explore these trade-offs, providing analysis and frameworks for evaluating deployment strategies at /llm-onpremise.