On-Premise Optimization: The New APEX-MTP Quantization of Qwen 3.6 35B-A3B

Large Language Models (LLMs) are increasingly being optimized for on-premise deployment. In this context, a new APEX-MTP quantization of the Qwen 3.6 35B-A3B-Claude-4.7-Opus-Reasoning-Distilled model has been announced. The release, available in GGUF format, stands out for integrating the multi-token prediction (MTP) head directly into the model file, which makes self-speculative decoding considerably simpler to set up.

The release comes out of independent research by mudler, who hosts over 30 free APEX MoE quantizations. The local hardware used for this work is an NVIDIA DGX Spark with 122 GB of unified memory, a configuration sufficient to handle MoE models in the 30-50 billion parameter class. Larger models, at 200 billion parameters and beyond, require external computing resources, typically H100, H200, or Blackwell GPUs, at a cost of $20 to $100 per quantization. This illustrates the trade-off between local hardware capacity and the need to scale out for more demanding workloads.
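The capacity limit above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is only an illustration: it assumes the model being processed is held in memory as 16-bit weights, and the 10 GB headroom figure and both function names are hypothetical. The source does not state exactly why 200-billion-parameter models exceed the local machine.

```python
def weights_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight-only size in GB (10^9 bytes); ignores KV cache and file metadata."""
    return n_params * bits_per_weight / 8 / 1e9


def fits(n_params: float, bits_per_weight: float,
         memory_gb: float = 122.0, headroom_gb: float = 10.0) -> bool:
    """True if the weights plus a hypothetical safety margin fit in memory."""
    return weights_size_gb(n_params, bits_per_weight) + headroom_gb <= memory_gb


if __name__ == "__main__":
    # 16 bits per weight is an assumption about the unquantized source model.
    for n_params in (35e9, 50e9, 200e9):
        size = weights_size_gb(n_params, 16)
        print(f"{n_params / 1e9:.0f}B params: {size:.0f} GB, fits: {fits(n_params, 16)}")
```

Under these assumptions, models in the 30-50 billion parameter class fit within 122 GB, while a 200-billion-parameter model needs several times that amount.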

Technical Details: APEX, MTP, and llama.cpp

The APEX (Adaptive Precision for EXpert Models) quantization strategy is specifically designed for Mixture-of-Experts (MoE) models. It is a mixed-precision approach that chooses the compression level according to each tensor's role: "routed" experts are compressed more heavily, while "shared" experts keep higher precision because they are always active. Combined with imatrix calibration on a diverse dataset (chat, code, reasoning, and tool-calling), this approach aims to preserve model accuracy while significantly reducing memory and computational requirements.
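The role-based idea can be sketched as a lookup from tensor name to quantization type. This is not the APEX implementation: the name patterns (_exps, _shexp, .attn_) are assumed to follow the usual GGUF tensor naming, the Q3_K, Q4_K, and Q8_0 entries echo tiers mentioned in this article, and the Q6_K entries and all function names are purely hypothetical.

```python
# Illustrative role -> quantization type table (not the actual APEX recipe).
ROLE_TO_QTYPE = {
    "mtp": "Q8_0",            # MTP head kept near-lossless
    "shared_expert": "Q6_K",  # hypothetical: always active, so higher precision
    "routed_expert": "Q3_K",  # compressed more heavily
    "attention": "Q4_K",
    "other": "Q6_K",          # hypothetical default for everything else
}


def tensor_role(name: str, mtp_block: int = 40) -> str:
    """Classify a tensor by name; the patterns are assumptions about GGUF naming."""
    if name.startswith(f"blk.{mtp_block}.") or ".nextn." in name:
        return "mtp"
    if "_shexp" in name:
        return "shared_expert"
    if "_exps" in name:
        return "routed_expert"
    if ".attn_" in name:
        # Simplification: a real quantizer would treat norm tensors separately.
        return "attention"
    return "other"


def pick_qtype(name: str) -> str:
    """Return the quantization type this sketch would assign to a tensor."""
    return ROLE_TO_QTYPE[tensor_role(name)]


if __name__ == "__main__":
    for name in ("blk.3.ffn_gate_exps.weight", "blk.3.ffn_gate_shexp.weight",
                 "blk.3.attn_q.weight", "blk.40.nextn.eh_proj.weight"):
        print(name, "->", pick_qtype(name))
```

The MTP check comes first so that every tensor in the MTP block receives the MTP tier, even when its name also matches an expert or attention pattern.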

The main novelty of this release is the inclusion of the MTP (multi-token prediction) head within the GGUF file, made possible by a recent update to llama.cpp (PR #22673). Self-speculative decoding can therefore be enabled from a single file, with no separate "draft" model. The MTP head, which includes the blk.40.* tensors and the nextn.* projection, is quantized to Q8_0 to keep it near-lossless, which is crucial because speculative decoding only pays off when the acceptance rate is high. In "I-Nano" variants, the MTP head maintains trunk-tier precision (Q3_K routed experts, Q4_K attention) but pins blk.40.nextn.eh_proj to Q4_K, at an additional cost of approximately 1 GB per file compared to non-MTP versions.
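To see why the acceptance rate matters, consider a toy version of greedy speculative decoding. This is a minimal sketch, not llama.cpp's implementation: the draft and target are stand-in Python callables with hypothetical names, whereas in the MTP setup the draft is a head that shares the main model's trunk. The expected-tokens formula additionally assumes each drafted token is accepted independently with the same probability.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a context to its greedy next token


def speculative_step(context: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """One greedy speculative step; returns between 1 and k + 1 new tokens."""
    # 1) The draft proposes k tokens autoregressively.
    ctx = list(context)
    proposed = []
    for _ in range(k):
        token = draft_next(ctx)
        proposed.append(token)
        ctx.append(token)

    # 2) The target verifies them (a real engine does this in one batched pass).
    ctx = list(context)
    accepted = []
    for token in proposed:
        expected = target_next(ctx)
        if expected != token:
            accepted.append(expected)  # first mismatch: keep the target's token
            return accepted
        accepted.append(token)
        ctx.append(token)

    # 3) Every draft token matched, so the target contributes one bonus token.
    accepted.append(target_next(ctx))
    return accepted


def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens per step for a per-token acceptance probability alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    target = lambda ctx: (ctx[-1] + 1) % 10                   # toy target model
    draft = lambda ctx: 0 if ctx[-1] == 4 else target(ctx)    # wrong after a 4
    print(speculative_step([1], draft, target, k=4))
    print(expected_tokens_per_step(0.8, 4))
```

The output always matches what the target alone would have generated greedily, so a weaker draft costs speed rather than quality. That is the reason for keeping the MTP head at high precision.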

Implications for On-Premise Deployments and Data Sovereignty

Optimizing LLMs for inference on local hardware, as this APEX-MTP quantization does, matters most to organizations that prioritize control, data sovereignty, and regulatory compliance. The GGUF format and the llama.cpp framework allow CTOs, DevOps leads, and infrastructure architects to run these models directly on their own servers, reducing reliance on external cloud services and the associated privacy and security risks.

The ability to run 30-50 billion parameter models on a single DGX Spark with 122 GB of unified memory offers a concrete alternative to cloud-based deployments, especially for workloads that require low latency and keep data in-house. Although larger models still need more powerful computing resources, often obtained through high-end GPU rental, the APEX quantization approach can reduce the total cost of ownership (TCO) for many applications. For those evaluating on-premise deployments, AI-RADAR offers analytical frameworks on /llm-onpremise to assess the trade-offs between CapEx and OpEx, performance, and security requirements.

Future Prospects and Developments in Quantization

The continued development of quantization techniques like APEX, together with features such as the embedded MTP head for self-speculative decoding, makes AI accessible to a wider range of users. These advances allow increasingly complex LLMs to run on less exotic hardware, so more companies and researchers can use them without incurring prohibitive costs or compromising data sovereignty.

Work to improve efficiency further is ongoing. One example is a patch to llama-imatrix that will allow MTP activations to be recorded during calibration, so that the "draft" head can be pushed to lower bit-widths more cleanly. This kind of hardware-software optimization opens up new applications and use cases for Large Language Models in on-premise and hybrid environments, while preserving high performance and full control over the infrastructure.