Advancing LLM Inference on Local Hardware

The landscape of Large Language Models (LLMs) continues to evolve rapidly, with increasing attention on on-premise and self-hosted deployment solutions. This approach addresses the need for greater data control, sovereignty, and Total Cost of Ownership (TCO) optimization. In this context, inference efficiency becomes a critical factor, pushing the developer community to explore new techniques and software optimizations that can best leverage available hardware.

A recent experiment conducted on a system equipped with AMD Strix Halo (specifically a Ryzen AI Max+ 395) has highlighted the potential of Multi-Token Prediction (MTP) as implemented in llama.cpp. This integration, still at the pull request stage, promises to significantly improve token generation speed, a fundamental aspect for applications requiring fast and fluid responses from LLMs.
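To make the idea concrete, here is a minimal, self-contained sketch of the draft-and-verify loop that MTP-style decoding relies on: a cheap predictor proposes several future tokens and the main model validates them, accepting the matching prefix. The two toy functions are stand-ins invented purely for illustration and do not reflect llama.cpp's actual implementation in the pull request.

```python
# Toy sketch of the draft-and-verify loop behind MTP-style decoding.
# ASSUMPTION: target_next_token and draft_tokens are invented stand-ins;
# llama.cpp's real MTP path works on model logits inside the C++ decoder.

def target_next_token(context):
    # Stand-in for a full forward pass of the main model (greedy decoding).
    return (sum(context) * 31 + 7) % 1000

def draft_tokens(context, n_draft):
    # Stand-in for the cheap MTP head(s): guess the next n_draft tokens.
    # Deliberately imperfect so that some guesses get rejected.
    guesses, ctx = [], list(context)
    for i in range(n_draft):
        guess = (sum(ctx) * 31 + 7 + (i % 2)) % 1000
        guesses.append(guess)
        ctx.append(guess)
    return guesses

def generate(context, n_new, n_draft=3):
    out = list(context)
    while len(out) - len(context) < n_new:
        drafts = draft_tokens(out, n_draft)
        # Verification: the main model checks the drafts; in the real
        # implementation this is a single batched forward pass, which is
        # where the speed-up over one-token-at-a-time decoding comes from.
        accepted = 0
        for d in drafts:
            if d == target_next_token(out):
                out.append(d)
                accepted += 1
            else:
                break
        if accepted < len(drafts):
            # First mismatch: take the main model's own token instead.
            out.append(target_next_token(out))
    return out[len(context):len(context) + n_new]

print(generate([1, 2, 3], n_new=10))
```

When the drafts are frequently correct, each verification step yields several tokens for roughly the cost of one, which is the mechanism behind the gains reported below.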

Technical Details and Performance Gains

The experiment utilized a robust hardware configuration, featuring 128 GB of DDR5 memory at 8000 MT/s, essential for handling large models locally. The software stack consisted of a RADV (Mesa Radeon Vulkan driver) container rebuilt from a dedicated AMD Strix Halo repository, integrating llama.cpp Pull Request #22673, which introduces MTP support. The test used a Qwen3.6-35BA3B-MTP-GGUF model, split into two GGUF files of approximately 36 GB each, launched with the parameters --spec-type mtp --spec-draft-n-max 3.
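As a point of reference, a minimal launcher sketch follows. The binary, model path, host, and port are placeholders; the --spec-type and --spec-draft-n-max flags are quoted from the article and are assumed to exist only in a build that includes the MTP pull request, exposed by the server binary.

```python
# Minimal sketch of how such a run could be launched from Python.
# ASSUMPTIONS: binary and model paths, host, and port are placeholders;
# --spec-type / --spec-draft-n-max come from the article's quoted parameters
# and require a llama.cpp build that includes the MTP pull request.
import subprocess

cmd = [
    "./llama-server",                               # built from the MTP PR branch
    "-m", "/models/qwen-mtp-00001-of-00002.gguf",   # hypothetical first shard of the split GGUF
    "--spec-type", "mtp",                           # enable multi-token prediction
    "--spec-draft-n-max", "3",                      # at most 3 draft tokens per step
    "--host", "0.0.0.0",
    "--port", "8080",
]

server = subprocess.Popen(cmd)
print("llama-server started with PID", server.pid)
```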

The results were remarkable: generation speed reached between 60 and 80 tokens/s with MTP active, compared to approximately 40 tokens/s without MTP (using Vulkan). This is an improvement of roughly 1.5 to 2 times, demonstrating MTP's effectiveness in accelerating output generation. It is important to note that prompt processing (PP) speed did not change substantially, suggesting that MTP's benefits are concentrated in the subsequent token generation (decode) phase.
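A quick way to reproduce this kind of throughput figure is to time a generation request against the running server. The snippet below is a rough wall-clock measurement via llama-server's OpenAI-compatible endpoint; the host, port, prompt, and token budget are arbitrary assumptions, and a short prompt keeps the prompt-processing contribution negligible.

```python
# Rough wall-clock throughput check against a local llama-server instance,
# via its OpenAI-compatible endpoint. ASSUMPTIONS: host, port, prompt, and
# token budget are arbitrary and only serve as an example.
import time
import requests

payload = {
    "messages": [{"role": "user", "content": "Explain speculative decoding briefly."}],
    "max_tokens": 256,
    "temperature": 0,
}

start = time.time()
resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=600)
elapsed = time.time() - start

data = resp.json()
# Prefer the server-reported token count; fall back to the budget if absent.
generated = (data.get("usage") or {}).get("completion_tokens") or payload["max_tokens"]
print(f"{generated} tokens in {elapsed:.1f} s -> {generated / elapsed:.1f} tokens/s")
```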

Implications for On-Premise Deployments

These results have direct implications for CTOs, DevOps leads, and infrastructure architects evaluating self-hosted LLM solutions. Software optimization like MTP, combined with performant hardware such as AMD Strix Halo, can make on-premise deployments more competitive in terms of speed and responsiveness. The ability to run 36GB models with such high performance on a single local machine is a strong signal for those seeking to maintain full control over their data and infrastructure, avoiding the dependencies and operational costs associated with cloud services.

For those considering on-premise deployments, TCO analysis becomes crucial. Investing in powerful hardware and optimized inference frameworks can lead to long-term savings, in addition to ensuring compliance with data sovereignty regulations and the ability to operate in air-gapped environments. These trade-offs between initial CapEx and ongoing OpEx, coupled with benefits in security and customization, are central to strategic decisions for enterprise LLM adoption.
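As a purely illustrative aid to that kind of analysis, the sketch below compares a one-off hardware purchase plus electricity against a hypothetical per-token API price. Every figure is an assumption and should be replaced with real quotes and a measured workload.

```python
# Back-of-the-envelope break-even between a one-off hardware purchase and a
# pay-per-token hosted API. ASSUMPTION: every figure below is illustrative.
HARDWARE_COST_EUR = 2500.0      # e.g. a Strix Halo mini-PC with 128 GB of RAM
POWER_DRAW_W = 120.0            # average draw while serving
ELECTRICITY_EUR_PER_KWH = 0.30
CLOUD_EUR_PER_MTOKEN = 2.0      # blended price per million generated tokens
MONTHLY_MTOKENS = 200.0         # millions of tokens generated per month

monthly_power_eur = POWER_DRAW_W / 1000 * 24 * 30 * ELECTRICITY_EUR_PER_KWH
monthly_cloud_eur = CLOUD_EUR_PER_MTOKEN * MONTHLY_MTOKENS
monthly_saving_eur = monthly_cloud_eur - monthly_power_eur

if monthly_saving_eur > 0:
    print(f"Break-even after {HARDWARE_COST_EUR / monthly_saving_eur:.1f} months")
else:
    print("At this volume the hosted API remains cheaper.")
```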

Future Prospects and Technological Trade-offs

The integration of advanced features like MTP into frameworks such as llama.cpp is an example of how software innovation is unlocking new possibilities for LLM inference on consumer and professional hardware. Although the experiment has already shown impressive results, the author mentioned the need for further optimization of launch parameters and the intention to test even larger models, such as Qwen 3.5 122B. This indicates a continuous path of improvement and refinement.

The choice between different hardware architectures and software stacks for on-premise LLM inference always involves trade-offs. Factors such as available VRAM, memory bandwidth, compatibility with existing frameworks, and community support are all elements to consider. The emergence of integrated solutions like AMD Strix Halo, which combine powerful CPUs and GPUs with high-speed shared memory, offers an interesting alternative to discrete multi-GPU configurations, especially for scenarios requiring a balance between performance, power efficiency, and physical footprint.