Optimizing On-Premise LLMs: The Qwen3.6 Case on RTX 5080 16GB
In the rapidly evolving landscape of artificial intelligence, running Large Language Models (LLMs) locally on on-premise infrastructure has become a priority for many companies. This choice is often driven by data sovereignty requirements, regulatory compliance, and control over Total Cost of Ownership (TCO). However, optimizing performance on dedicated hardware is difficult, especially for intensive workloads like coding agents that demand large context windows.
A recent benchmark examined the performance of various Qwen3.6 model configurations on an NVIDIA RTX 5080 16GB GPU, utilizing the popular llama.cpp framework. The goal was to identify the most efficient configuration for scenarios involving extended contexts, typical for coding agents. The results revealed an interesting picture, particularly regarding the effectiveness of the Multi-Token Prediction (MTP) feature.
Methodology and Configurations Under Review
The test platform paired an RTX 5080 GPU (16GB of VRAM) with a Ryzen 9 9950X CPU and 128GB of RAM, running llama.cpp build b9204. Three Qwen3.6 configurations were evaluated, differing in model size and quantization:
- Qwen3.6-27B MTP-UD-IQ3_XXS: A 27-billion parameter model, approximately 12.45 GB in size, which fits entirely on the 16GB GPU.
- Qwen3.6-35B-A3B MTP-UD-Q4_K_XL: A 35-billion parameter MoE (Mixture of Experts) model, approximately 22 GB in size, requiring partial offload to the CPU.
- Qwen3.6-35B-A3B MTP-Q8_0: The same 35-billion parameter MoE model at Q8_0 quantization, approximately 36 GB in size, requiring more substantial offload to the CPU.
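The fit-or-offload split in the list above can be checked with simple arithmetic. The following is a minimal sketch using the approximate sizes quoted in the article; the function names are illustrative, and the result is only a lower bound because it ignores the KV cache and compute buffers, which also consume VRAM.

```python
VRAM_GB = 16.0  # RTX 5080, per the benchmark setup

# Approximate model sizes as reported in the article (GB).
MODEL_SIZES_GB = {
    "Qwen3.6-27B MTP-UD-IQ3_XXS": 12.45,
    "Qwen3.6-35B-A3B MTP-UD-Q4_K_XL": 22.0,
    "Qwen3.6-35B-A3B MTP-Q8_0": 36.0,
}


def min_offload_gb(model_gb, vram_gb=VRAM_GB):
    """Lower bound on the weights that cannot stay in VRAM.

    Ignores KV cache and compute buffers, so real offload is larger.
    """
    return max(0.0, model_gb - vram_gb)


def offload_report():
    """Map each configuration to its minimum offload in GB."""
    return {name: round(min_offload_gb(size), 2)
            for name, size in MODEL_SIZES_GB.items()}


if __name__ == "__main__":
    for name, gb in offload_report().items():
        status = "fits on GPU" if gb == 0 else f"at least {gb} GB offloaded"
        print(f"{name}: {status}")
```

Only the 27B configuration leaves headroom on the card; the two 35B configurations must place at least 6 GB and 20 GB of weights elsewhere before any cache is allocated.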
The Multi-Token Prediction (MTP) feature, recently merged into llama.cpp, was tested to assess its impact on generation and prompt processing speed. MTP aims to improve inference speed by speculating on multiple tokens simultaneously, but its effectiveness can vary depending on hardware and model configuration, especially under VRAM constraints.
Performance Analysis: Speed, Context, and Quality
The tests showed that for the Qwen3.6-35B Q4_K_XL model, the best configuration at a 128k-token context (MTP disabled, with the --fit-target 1536 flag) reached a generation speed of 56 tokens/second and a prompt processing speed of 1,584 tokens/second. At that rate, a 128k-token prompt is processed in approximately 81 seconds.
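The 81-second figure follows directly from the reported throughput. A small sketch of that arithmetic is shown here; the helper names are illustrative, and the 1,000-token reply length is an assumed example value, not a number from the benchmark.

```python
def prompt_seconds(prompt_tokens, pp_tokens_per_s):
    """Time to ingest a prompt at a given prompt-processing rate."""
    return prompt_tokens / pp_tokens_per_s


def generation_seconds(output_tokens, tg_tokens_per_s):
    """Time to generate a reply at a given generation rate."""
    return output_tokens / tg_tokens_per_s


if __name__ == "__main__":
    pp = prompt_seconds(128_000, 1584)    # figures from the article
    tg = generation_seconds(1_000, 56)    # 1,000-token reply is an assumption
    print(f"prompt: {pp:.1f} s, generation: {tg:.1f} s")
```

For a coding agent that resends a long context, prompt processing dominates the wait, so the prompt-processing rate matters at least as much as the generation rate.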
The most surprising finding concerns MTP itself: for the 35B MoE model on 16GB of VRAM, enabling MTP made generation 23% slower in short contexts. MTP reserves VRAM for its compute buffer (approximately 1.5 GB), which pushes additional model layers from the GPU to the CPU and creates a bottleneck. In extended contexts, such as 128k tokens, generation speed with and without MTP converges to 56 tokens/second: the KV (Key-Value) cache saturates VRAM regardless of MTP, so enabling it neither helps nor hurts.
Conversely, for the 27B IQ3 model, which fits entirely on the GPU (12.45 GB), MTP provided a benefit, increasing speed from approximately 56 to 73 tokens/second. This suggests a general rule of thumb: MTP is beneficial when the model resides entirely on the GPU, but it can be counterproductive when its compute buffer forces additional layers to offload to the CPU.
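The rule of thumb above can be written as a rough pre-flight check. This is a sketch under stated assumptions, not a llama.cpp API: the 1.5 GB buffer comes from the article, while the function names and the 1.0 GB KV-cache figure in the usage example are hypothetical, and other runtime overheads are ignored.

```python
def mtp_likely_helps(model_gb, kv_cache_gb, vram_gb=16.0, mtp_buffer_gb=1.5):
    """Return True if weights, KV cache, and the MTP buffer all fit in VRAM.

    Heuristic only: if the MTP compute buffer would push layers to the
    CPU, the article's results suggest MTP is more likely to hurt.
    """
    return model_gb + kv_cache_gb + mtp_buffer_gb <= vram_gb


def relative_speedup(baseline_tps, new_tps):
    """Fractional change in tokens/second relative to the baseline."""
    return new_tps / baseline_tps - 1.0


if __name__ == "__main__":
    # 27B IQ3 with a hypothetical 1.0 GB KV cache: everything fits.
    print(mtp_likely_helps(12.45, 1.0))
    # 35B Q4_K_XL: weights alone exceed 16 GB, so MTP adds offload.
    print(mtp_likely_helps(22.0, 0.0))
    # Reported gain on the 27B model (56 to 73 tokens/second).
    print(f"{relative_speedup(56, 73):.0%}")
```

The reported 56 to 73 tokens/second improvement on the 27B model works out to roughly a 30% gain, compared with the 23% slowdown seen on the partially offloaded 35B model.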
Regarding context management, the 35B MoE model excelled, easily handling over 131k tokens. This is attributed to its hybrid architecture (Gated DeltaNet + Attention) which requires a KV cache for only a limited number of attention layers, while SSM (State Space Model) layers use a minimal recurrent state. The 27B model, on the other hand, topped out at 56k tokens (or 110k with q4_0 quantization for the KV cache).
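To see why caching keys and values for only a subset of layers matters, consider the standard KV cache size estimate. The sketch below uses hypothetical dimensions (64 layers, 4 KV heads, head size 128, 16-bit cache) chosen for round numbers; they are not the real Qwen3.6 dimensions, which the article does not give.

```python
def kv_cache_gib(n_attn_layers, n_kv_heads, head_dim, n_tokens,
                 bytes_per_elem=2):
    """Estimate KV cache size in GiB.

    Two tensors (K and V) per attention layer, each holding
    n_kv_heads * head_dim values per token.
    """
    total_bytes = (2 * n_attn_layers * n_kv_heads * head_dim
                   * n_tokens * bytes_per_elem)
    return total_bytes / 2**30


if __name__ == "__main__":
    ctx = 131_072  # 128k tokens
    # Hypothetical dense model: all 64 layers keep a KV cache.
    dense = kv_cache_gib(64, 4, 128, ctx)
    # Hypothetical hybrid: only 1 in 4 layers (16) uses attention.
    hybrid = kv_cache_gib(16, 4, 128, ctx)
    print(f"dense: {dense:.1f} GiB, hybrid: {hybrid:.1f} GiB")
```

Because the cache grows linearly with the number of attention layers, a hybrid design that keeps a KV cache in a quarter of its layers needs a quarter of the cache memory at the same context length, with the remaining layers holding only a fixed-size recurrent state.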
In terms of quality, the 27B IQ3 achieved a perfect score in the CodeNeedle (positional recall) benchmark, while the 35B models were slightly lower. In the GSM8K (math) benchmark, the 35B Q4_K_XL showed 91% accuracy and was 37% faster in evaluation compared to the 27B.
Implications for CTOs and Future Prospects
For CTOs, DevOps leads, and infrastructure architects evaluating on-premise LLM deployment, these results offer practical guidance. The recommended configuration for coding agents requiring large context windows on an RTX 5080 16GB is the Qwen3.6-35B Q4_K_XL model without MTP, using --fit on and --fit-target 1536. In these tests, this configuration delivered 56 tokens/second at 128k context with a maximum capacity of 131k tokens, balancing speed and context handling capability.
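As an illustration, the recommended settings could be assembled into a launch command along these lines. This is a sketch, not the benchmark's exact invocation: the model filename is hypothetical, the --fit and --fit-target flags are taken from the article as written, and how MTP is toggled is not shown in the source, so no MTP option is passed here.

```python
import shlex


def build_llama_server_args(model_path, ctx_tokens=131_072, fit_target=1536):
    """Build an argument list for llama-server with the article's settings.

    -m and -c are the usual model-path and context-size options;
    --fit / --fit-target are copied from the article.
    """
    return [
        "llama-server",
        "-m", model_path,
        "-c", str(ctx_tokens),
        "--fit", "on",
        "--fit-target", str(fit_target),
    ]


if __name__ == "__main__":
    # Hypothetical filename; substitute the actual GGUF file.
    args = build_llama_server_args("Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf")
    print(shlex.join(args))
```

Building the command as a list keeps it safe to pass to subprocess.run without shell quoting issues; check the flags against the help output of the installed llama.cpp build before relying on them.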
The choice between smaller models that fit entirely on the GPU (potentially benefiting from MTP) and larger models requiring partial offload (where MTP can be disadvantageous) is a critical trade-off. The amount of available VRAM is the determining factor. For those evaluating on-premise deployments, AI-RADAR offers analytical frameworks on /llm-onpremise to assess these trade-offs in terms of performance, TCO, and data sovereignty.
Looking ahead, the native integration of MTP into frameworks like vLLM (versions >= 0.19.0) with PagedAttention could change the scenario. PagedAttention's dynamic VRAM management might eliminate the overhead of the fixed compute buffer, making MTP more effective even for models requiring partial offload. This development could unlock new opportunities to further optimize LLM performance on local hardware.