Introduction

Optimizing LLM performance is a constant priority for those managing AI infrastructure, especially in on-premise contexts. Among the most promising techniques for accelerating inference, speculative inference, also referred to as MTP (Multi-Token Prediction), has gained attention. However, recent in-depth tests have revealed that its benefits are not universal and depend critically on the nature of the generative task.

An analysis conducted on a Qwen 3.6 27B LLM, run on Apple Silicon M2 Max hardware, brought surprising results to light. Initially, some users reported worse performance with MTP enabled, contrary to expectations. A subsequent systematic investigation, based on over 300 tests, confirmed these observations, outlining a more complex picture of this technique's effectiveness.

Technical Details and Key Findings

Speculative inference works by generating a "draft" of several tokens with a lightweight predictor, which the main model then verifies. When the drafted tokens match what the main model would have produced, multiple tokens are emitted per expensive forward pass, accelerating output. Benchmarks showed that for coding tasks, MTP can nearly triple generation speed with F16 models and deliver a 123% increase with Q8_0 quantization. Conversely, for creative writing, the same feature can cause a slowdown, roughly 9% with Q4_K_M models.
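To make the mechanism concrete, the sketch below implements the draft-then-verify loop in plain Python with greedy decoding. The `draft_model` and `main_model` callables are hypothetical stand-ins rather than any real API; a production implementation would score all drafted positions in a single batched forward pass and typically uses rejection sampling over token distributions rather than exact matching.

```python
# Illustrative sketch of the draft-then-verify loop behind speculative inference.
# `draft_model` and `main_model` are hypothetical callables that take a token
# sequence and return the next token (greedy decoding, for simplicity).

def speculative_step(main_model, draft_model, tokens, n_draft=4):
    """Propose up to n_draft tokens cheaply, then verify them with the main model."""
    # 1. Cheap draft pass: propose n_draft candidate tokens.
    draft = []
    ctx = list(tokens)
    for _ in range(n_draft):
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. Verification: the main model checks the drafted tokens.
    #    (A real implementation verifies all positions in one batched forward pass.)
    accepted = []
    ctx = list(tokens)
    for t in draft:
        expected = main_model(ctx)
        if expected != t:
            # First mismatch: keep the main model's own token and stop accepting.
            accepted.append(expected)
            break
        accepted.append(t)
        ctx.append(t)

    return accepted  # between 1 and n_draft tokens per main-model round
```

Even in this toy version the key property is visible: the more often the draft agrees with the main model, the more tokens each expensive verification round yields.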

This dichotomy is closely linked to the draft-token acceptance rate. For coding, the acceptance rate ranges between 79% and 89%, reflecting the high predictability of the generated text. For creative tasks, however, the rate drops to 39-48%, so the cost of drafting and verification is amortized over far fewer accepted tokens. Memory bandwidth proved to be a decisive factor: larger, less quantized models (such as F16 at 51 GB) benefit more from MTP, because each generated token requires significant memory access. Smaller, more heavily quantized models (such as Q4_K_M at 16 GB) are already fast on their own, and for less predictable tasks the MTP overhead can outweigh the benefit. Tests on sampling temperature and on quantizing the MTP layers showed minimal impact on the overall results, confirming that the nature of the task is the primary driver.
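A rough back-of-the-envelope model shows why the acceptance rate dominates. The sketch below assumes independent per-token acceptance and expresses the draft cost as a fraction of a full forward pass; both are simplifications, and the estimate ignores fixed per-round overheads and memory effects, which is why small, heavily quantized models can still see a net slowdown in practice even when this formula stays above 1x. The acceptance rates plugged in are the ones reported above, not new measurements.

```python
# Simplified model of how acceptance rate drives speculative-decoding speedup.

def expected_speedup(acceptance_rate, draft_len=4, draft_cost=0.1):
    """Rough speedup vs. plain autoregressive decoding under i.i.d. acceptance."""
    a = acceptance_rate
    # Expected tokens emitted per verification round (geometric-series form).
    expected_tokens = (1 - a ** (draft_len + 1)) / (1 - a)
    # Cost per round: draft_len cheap draft passes plus one full verification pass,
    # measured in units of one main-model forward pass.
    round_cost = draft_len * draft_cost + 1.0
    return expected_tokens / round_cost

for label, rate in [("coding, ~85% acceptance", 0.85),
                    ("creative writing, ~44% acceptance", 0.44)]:
    print(f"{label}: ~{expected_speedup(rate):.2f}x")
# coding, ~85% acceptance: ~2.65x
# creative writing, ~44% acceptance: ~1.25x
```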

Context and Implications for On-Premise Deployment

For CTOs, DevOps leads, and infrastructure architects evaluating on-premise LLM deployments, these results are crucial. The decision to enable or disable speculative inference cannot be generalized but must be guided by the specific workload. An organization primarily using LLMs for code generation or factual responses will see a clear advantage in adopting MTP, optimizing hardware utilization and reducing latency. Conversely, for applications requiring creativity or brainstorming, MTP might not only fail to provide benefits but could even worsen the user experience.
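In practice, the only reliable way to make that call is to benchmark both configurations on prompts representative of the real workload. The sketch below assumes an OpenAI-compatible local server (for example llama.cpp's server, vLLM, or LM Studio) exposed on two ports, one with drafting/MTP enabled and one without, and that the response populates `usage.completion_tokens`; the endpoint URLs and prompts are illustrative only.

```python
# Minimal A/B harness sketch: run the same prompts against two local endpoints
# (same model, speculative decoding / MTP on vs. off) and compare tokens/sec.

import time
import requests

ENDPOINTS = {
    "mtp_on":  "http://localhost:8080/v1/chat/completions",   # drafting enabled
    "mtp_off": "http://localhost:8081/v1/chat/completions",   # drafting disabled
}

PROMPTS = {
    "coding":   "Write a Python function that parses an ISO 8601 timestamp.",
    "creative": "Write the opening paragraph of a short story set on a night train.",
}

def tokens_per_second(url, prompt, max_tokens=256):
    payload = {"messages": [{"role": "user", "content": prompt}],
               "max_tokens": max_tokens, "temperature": 0.7}
    start = time.perf_counter()
    resp = requests.post(url, json=payload, timeout=600).json()
    elapsed = time.perf_counter() - start
    return resp["usage"]["completion_tokens"] / elapsed

for task, prompt in PROMPTS.items():
    for label, url in ENDPOINTS.items():
        print(f"{task:9s} {label:8s} {tokens_per_second(url, prompt):6.1f} tok/s")
```

Averaging several runs per cell and keeping prompts fixed across configurations is enough to reveal whether a given workload sits on the "coding-like" or "creative-like" side of the acceptance-rate divide.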

This underscores the importance of careful TCO analysis and of hardware specifications, such as VRAM and memory bandwidth, in relation to the anticipated models and tasks. A system's ability to exploit techniques like speculative inference depends directly on its architecture and on the requirements of the models it runs. For those evaluating on-premise deployments, AI-RADAR offers analytical frameworks on /llm-onpremise to assess trade-offs between performance, cost, and data sovereignty, supporting informed decisions rather than prescriptive recommendations.

Final Perspective

Ultimately, the research confirms that there is no universal solution for LLM performance optimization. The intrinsic nature of the generative task, from the predictability of code to the fluidity of creative narrative, is the most influential factor in determining the effectiveness of advanced techniques like speculative inference. This compels technical decision-makers to adopt an approach based on benchmarks specific to their workloads, rather than relying on generic metrics. Understanding these trade-offs is fundamental to maximizing the return on investment in AI infrastructure and ensuring that on-premise deployments are not only performant but also economically sustainable and aligned with operational needs.