Introduction: Optimizing On-Premise LLMs with Strix Halo
The adoption of Large Language Models (LLMs) in self-hosted environments is a growing priority for companies aiming to maintain data sovereignty and optimize Total Cost of Ownership (TCO). In this context, the efficiency of local inference hardware and software is crucial. Recent benchmarks conducted on an AMD Strix Halo platform with the popular llama.cpp framework offer useful insight into the performance of Qwen3.6 models and the impact of the Multi-Token Prediction (MTP) feature.
These tests focus on the ability to process complex LLM workloads, such as long-context conversations, directly on local hardware. The results provide a concrete basis for CTOs and infrastructure architects evaluating self-hosted solutions for their AI workloads, highlighting the trade-offs between different configurations and optimizations.
Technical Details and Benchmark Methodology
The hardware configuration used for the benchmarks was an AMD Ryzen AI Max+ 395 (16 cores / 32 threads) with its integrated Radeon 8060S GPU (RADV GFX1151) and 30 GiB of RAM, running Ubuntu 24.04 with kernel 6.17. Inference was performed with llama.cpp build 9187 using Vulkan API 1.4.305 and Mesa RADV 25.0.7. This configuration represents a typical edge computing environment or a mid-range on-premise server.
Qwen3.6 models in 27B and 35B parameter variants were tested in both base and MTP-enabled versions. The methodology covered two main scenarios: a "single-turn" test with a synthetic prompt of approximately 15,000 tokens, and a "multi-turn" test of 5 turns in which the context grew to about 28,500 tokens. Key metrics were "wall time" (total end-to-end time), "prompt processing throughput" (tokens/sec), and "generation throughput" (tokens/sec).
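To make the methodology concrete, here is a minimal sketch of how such a multi-turn measurement can be driven from a client. It assumes a local llama-server (llama.cpp) instance exposing its OpenAI-compatible endpoint on the default port; the URL, prompts, and max_tokens value are illustrative and not the benchmark's actual harness.

```python
import time
import requests  # third-party: pip install requests

# Assumption: a local llama-server (llama.cpp) instance exposes its
# OpenAI-compatible chat endpoint on the default port; adjust as needed.
SERVER_URL = "http://localhost:8080/v1/chat/completions"

def timed_turn(messages, max_tokens=512):
    """Send one chat turn; return the reply, wall time, and generation throughput."""
    start = time.perf_counter()
    resp = requests.post(SERVER_URL, json={
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }, timeout=600)
    resp.raise_for_status()
    wall = time.perf_counter() - start
    data = resp.json()
    reply = data["choices"][0]["message"]["content"]
    gen_tokens = data["usage"]["completion_tokens"]
    return reply, wall, gen_tokens / wall

# Multi-turn scenario: the context grows turn by turn, as in the benchmark.
messages = [{"role": "system", "content": "You are a helpful assistant."}]
for turn, question in enumerate(
    ["Summarize the attached report.", "List the open risks.", "Draft next steps."], start=1
):
    messages.append({"role": "user", "content": question})
    reply, wall, tg_tps = timed_turn(messages)
    messages.append({"role": "assistant", "content": reply})  # keep history for the next turn
    print(f"turn {turn}: wall={wall:.2f}s, generation={tg_tps:.2f} tok/s")
```

Measuring wall time on the client side keeps the comparison end-to-end, which is how the figures below should be read.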
Analysis of Results: Acceleration and Its Trade-offs
The benchmark results varied markedly by model and workload type. For the Qwen3.6-27B model, enabling MTP brought a significant overall improvement. In the 15,000-token single-turn test, total time decreased by 11.50% (from 87.44s to 77.39s) and generation throughput rose by 111.77% (from 7.63 to 16.15 tokens/sec), despite a slight slowdown in prompt processing (-12.46%). The benefit was even more pronounced in the long-context multi-turn test, where total time fell by 22.46% (from 258.65s to 200.55s) and average generation throughput increased by 136.41% (from 7.61 to 17.98 tokens/sec).
The Qwen3.6-35B model showed more mixed results. In the single-turn test, total time increased by 11.17% (from 20.83s to 23.16s), even though generation throughput improved by 16.47%. In the multi-turn test, total time was essentially unchanged, rising slightly by 2.34% (from 58.86s to 60.24s), while still registering a 24.80% gain in generation throughput. These data suggest that MTP consistently speeds up token generation but can slow down prompt processing, so the net effect depends on whether the decode (generation) or prefill (prompt processing) phase dominates the workload.
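For readers who want to sanity-check the percentages, the deltas follow directly from the reported wall times and throughputs. The short sketch below recomputes a few of them; small differences against the quoted figures come from rounding in the source numbers.

```python
# (base, MTP) pairs as reported above; the helper reproduces the percentage deltas.
def pct_change(base, mtp):
    return (mtp - base) / base * 100.0

results = {
    "27B single-turn wall (s)":   (87.44, 77.39),
    "27B multi-turn wall (s)":    (258.65, 200.55),
    "27B multi-turn gen (tok/s)": (7.61, 17.98),
    "35B single-turn wall (s)":   (20.83, 23.16),
    "35B multi-turn wall (s)":    (58.86, 60.24),
}

for label, (base, mtp) in results.items():
    print(f"{label}: {pct_change(base, mtp):+.2f}%")
```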
Implications for On-Premise Deployments
These benchmark results underscore the importance of carefully evaluating workload shape when considering optimizations like MTP for on-premise LLM deployments. For scenarios dominated by token generation, especially long-running conversational contexts, MTP can offer substantial advantages, as the 27B model demonstrates. For workloads where prompt processing dominates, however, the overall benefit may be limited or even negative, as observed with the 35B model.
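A simple first-order model makes this trade-off tangible: total time is roughly prompt tokens divided by prefill throughput plus output tokens divided by decode throughput. The sketch below uses hypothetical throughput multipliers (decode roughly doubled, prefill slowed by about 12%, loosely in line with the 27B single-turn observations) to show how the net effect can flip as the workload becomes prefill-dominated; none of these figures are measurements.

```python
# First-order model: total time ~= prompt_tokens / prefill_tps + output_tokens / decode_tps.
# All throughput values and multipliers below are illustrative, not measured.
def total_time(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

# Generation-heavy workload: long prompt, but a substantial answer.
base = total_time(15_000, 600, prefill_tps=900, decode_tps=8)
mtp  = total_time(15_000, 600, prefill_tps=900 * 0.88, decode_tps=8 * 2.1)
print(f"generation-heavy: {base:.1f}s -> {mtp:.1f}s")   # MTP-like trade-off wins

# Prefill-dominated workload: huge prompt, very short answer.
base2 = total_time(28_000, 50, prefill_tps=900, decode_tps=8)
mtp2  = total_time(28_000, 50, prefill_tps=900 * 0.88, decode_tps=8 * 2.1)
print(f"prefill-heavy:    {base2:.1f}s -> {mtp2:.1f}s")  # the benefit shrinks or reverses
```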
For CTOs and DevOps leads exploring self-hosted solutions, these data highlight the need to test specific configurations with their own models and real-world workloads. Hardware selection, the inference stack (here, llama.cpp with its Vulkan backend), and the adoption of features such as MTP all directly influence performance, TCO, and the ability to maintain data sovereignty. AI-RADAR offers analytical frameworks on /llm-onpremise to help evaluate these trade-offs and support informed AI deployment decisions.