Qwen3.6-35B-A3B Optimization with MTP: A Performance Analysis on Local Hardware

The landscape of Large Language Models (LLMs) continues to evolve rapidly, with growing interest in solutions that enable on-premise or self-hosted deployment. This approach offers significant advantages in data sovereignty and control, and potentially in long-term total cost of ownership (TCO). In this context, squeezing performance out of dedicated hardware is crucial. Recently, a user shared test results for a version of the Qwen3.6-35B-A3B model enhanced with Multi-Token Prediction (MTP) and published in GGUF format on HuggingFace. The objective was to evaluate the impact of this technique on inference speed across different GPU configurations.
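For readers unfamiliar with the technique: MTP-style decoding lets the model propose several future tokens per step and verify them in a single batched pass, so each expensive forward pass can commit more than one token. The toy sketch below illustrates only this accept/verify loop and the resulting tokens-per-step; the functions and the 0.7 acceptance rate are illustrative stand-ins, not the llama.cpp implementation:

```python
import random

random.seed(0)

# Toy stand-ins: in a real system these would be forward passes of the
# model and of its MTP head(s). Here they just emit random token ids.
def draft_tokens(context: list[int], k: int) -> list[int]:
    """MTP head proposes k candidate future tokens in one cheap pass."""
    return [random.randrange(100) for _ in range(k)]

def verify(context: list[int], candidates: list[int]) -> int:
    """Main model scores all candidates in one batched pass and returns
    how many leading candidates it agrees with (toy: coin flips with an
    assumed 0.7 per-token acceptance rate)."""
    accepted = 0
    for _ in candidates:
        if random.random() < 0.7:
            accepted += 1
        else:
            break
    return accepted

context, k, steps = [1, 2, 3], 4, 1000
committed = 0
for _ in range(steps):
    cand = draft_tokens(context, k)
    n_ok = verify(context, cand)
    # Keep the accepted prefix plus one token produced by the verify
    # pass itself, so every iteration makes progress.
    context += cand[:n_ok] + [random.randrange(100)]
    committed += n_ok + 1

print(f"avg tokens committed per model step: {committed / steps:.2f}")
```

With these toy parameters the loop commits roughly 2.8 tokens per expensive pass, which is why MTP can approach the 2-2.5x gains mentioned below; a lower acceptance rate, or per-step overhead in a given implementation, erodes that multiple quickly.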

Technical Details and Initial Results

Initial tests, conducted by the model's author, revealed smaller performance gains than expected based on previous models. On an NVIDIA GeForce RTX 5090 Founders Edition, the Q4 quantization showed a roughly 6% speed increase, going from 215 tokens/second (t/s) to 228.83 t/s. For Q8 quantization, on a setup pairing the NVIDIA GeForce RTX 5090 FE with an NVIDIA GeForce RTX 3090, the gain was even more modest at about 2.5%, moving from 148.20 t/s to 152.02 t/s. These results contrast with the 2-2.5x gains observed on 27B models, suggesting that MTP's effectiveness may depend on the specific llama.cpp implementation or on the model's qwen35moe architecture.
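Numbers like these are typically collected with llama.cpp's llama-bench utility. A minimal Python wrapper for a side-by-side run is sketched below; the model paths are placeholders, and the exact flags and JSON field names may vary between llama.cpp releases:

```python
import json
import subprocess

# Hypothetical local paths; substitute the actual GGUF files under test.
MODELS = {
    "baseline-q4": "models/qwen-35b-a3b-q4.gguf",
    "mtp-q4":      "models/qwen-35b-a3b-mtp-q4.gguf",
}

def bench(model_path: str) -> float:
    """Run llama-bench and return generation throughput in tokens/second.

    Uses llama-bench's JSON output mode (-o json); field names are
    assumed from recent llama.cpp builds and may differ in yours.
    """
    out = subprocess.run(
        ["llama-bench", "-m", model_path, "-p", "512", "-n", "128", "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    runs = json.loads(out)
    # Pick the text-generation test (n_gen > 0) and read its average t/s.
    gen = [r for r in runs if r.get("n_gen", 0) > 0]
    return gen[0]["avg_ts"]

for name, path in MODELS.items():
    print(f"{name}: {bench(path):.2f} t/s")
```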

Performance Variability and Deployment Context

Performance variability is a critical consideration for anyone designing AI infrastructure. While the initial tests indicated limited gains, another user reported significantly better results: on a configuration combining two NVIDIA GeForce RTX 5070 Ti cards and one NVIDIA GeForce RTX 3090, the Q8 model achieved a 50% speed increase, going from 110 t/s to 165 t/s. This discrepancy highlights how heavily the effectiveness of an optimization can depend on the specific combination of hardware, drivers, and software configuration. For CTOs and infrastructure architects evaluating on-premise LLM deployments, it underscores the importance of benchmarking on their own technology stack. The choice of hardware, its VRAM capacity, and its compute throughput directly influence throughput and latency, and with them the overall TCO.
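To make the spread concrete, the percentage gains follow directly from the reported before/after throughputs; the snippet below is plain arithmetic on the figures quoted in this article:

```python
def speedup_pct(baseline_tps: float, mtp_tps: float) -> float:
    """Percentage throughput gain of the MTP build over the baseline."""
    return (mtp_tps / baseline_tps - 1.0) * 100.0

# (baseline t/s, MTP t/s) as reported in the tests above.
reported = {
    "RTX 5090 FE, Q4":               (215.00, 228.83),
    "RTX 5090 FE + RTX 3090, Q8":    (148.20, 152.02),
    "2x RTX 5070 Ti + RTX 3090, Q8": (110.00, 165.00),
}

for setup, (base, mtp) in reported.items():
    print(f"{setup}: +{speedup_pct(base, mtp):.1f}%")
# -> +6.4%, +2.6%, +50.0%
```

The same model with the same optimization yields anywhere from a roughly 2.6% to a 50% gain depending on the hardware mix, which is precisely why stack-specific benchmarks matter.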

Final Considerations for AI Infrastructure

Adopting techniques like MTP to improve LLM performance on local hardware is an important step towards optimizing AI workloads. However, the results presented here demonstrate that there is no universal solution. Companies aiming to run LLMs in self-hosted or air-gapped environments must carefully consider how different model architectures and optimizations interact with their specific hardware. The promise of large gains, such as those seen on smaller models, must be validated through rigorous testing. For those evaluating on-premise deployments, AI-RADAR offers analytical frameworks on /llm-onpremise to weigh the trade-offs between performance, cost, and data sovereignty requirements, providing tools for informed decision-making rather than direct recommendations.