Optimizing On-Premise LLMs: The StepFun 3.7 Flash Case on AMD Strix Halo
On-premise deployment of Large Language Models (LLMs) is drawing growing interest because it offers greater control, data sovereignty, and a potentially more favorable Total Cost of Ownership (TCO). That makes performance tuning on local hardware crucial. A recent benchmark on a system built around an AMD Strix Halo APU quantifies the benefits of Multi-Token Prediction (MTP) applied to the StepFun 3.7 Flash model.
The analysis is relevant to CTOs, DevOps leads, and infrastructure architects evaluating self-hosted alternatives for AI workloads. The results show measurable gains in inference speed and energy efficiency, both of which bear directly on the sustainability and scalability of local deployments.
Technical Details and Benchmark Architecture
The benchmark system is based on an AMD Ryzen AI Max+ 395 APU with an integrated Radeon 8060S GPU (gfx1151 architecture). It has 128 GB of unified LPDDR5X memory, of which 4 GB is set aside as dedicated UMA VRAM; a "GTT ceiling" of 112 GiB lets the GPU map most of the remaining system memory, which is what allows a model of this size to be loaded. The operating system is Ubuntu 25.04, running Linux kernel 6.18.1 and Mesa/RADV 25.2.8 drivers.
The primary model tested is StepFun Step-3.7-Flash UD-IQ4_XS, a Mixture-of-Experts (MoE) model with approximately 200 billion total parameters and about 11 billion active parameters per token. Its disk size is approximately 88.79 GiB. For the MTP implementation, a Step-3.7-Flash-MTP-Q8_0.gguf "draft" model of about 3.5 GiB was used. The inference backend is a patched llama.cpp build with Vulkan/RADV support, configured with a 12,288-token context and specific MTP parameters (DRAFT_N=2, PMIN=0.60, UBATCH=512).
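The sizes quoted above allow a rough memory budget. The sketch below is only back-of-the-envelope arithmetic: it assumes both sets of weights live in GTT-mapped memory and treats everything left over as headroom for the KV cache and compute buffers, none of which the benchmark quantifies. The constant and function names are illustrative.

```python
# Rough memory budget from the sizes quoted in the article (all values in GiB).
# Assumption: both weight files are held in GTT-mapped system memory.
GTT_CEILING_GIB = 112.0   # GPU-mappable system memory reported for the test system
MAIN_MODEL_GIB = 88.79    # Step-3.7-Flash UD-IQ4_XS weights
DRAFT_MODEL_GIB = 3.5     # MTP draft weights (approximate figure)


def headroom_gib(ceiling: float, *allocations: float) -> float:
    """Return what is left under the ceiling after the given allocations."""
    return ceiling - sum(allocations)


weights = MAIN_MODEL_GIB + DRAFT_MODEL_GIB
remaining = headroom_gib(GTT_CEILING_GIB, MAIN_MODEL_GIB, DRAFT_MODEL_GIB)
print(f"weights: {weights:.2f} GiB, headroom: {remaining:.2f} GiB")
```

Under these assumptions the weights take a little over 92 GiB, leaving just under 20 GiB below the GTT ceiling for the 12,288-token context and runtime buffers.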
Performance Analysis and MTP Advantages
The benchmark shows a measurable gain from MTP in the decode phase, i.e., token generation. Decode speed rose from 20.4 tokens/second without MTP to 26.0 tokens/second with MTP, a 27.5% improvement. Prefill speed (initial prompt processing) stayed essentially unchanged at approximately 211-212 tokens/second, so MTP accelerates generation without slowing prompt processing.
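The decode figures can be checked with a few lines of arithmetic. This is a minimal sketch using only the two throughput numbers reported above; the helper names are illustrative, and the per-token latency is simply the reciprocal of throughput, not a separately measured value.

```python
def relative_speedup(baseline_tps: float, new_tps: float) -> float:
    """Fractional throughput gain of new_tps over baseline_tps."""
    return new_tps / baseline_tps - 1.0


def ms_per_token(tps: float) -> float:
    """Average time per generated token, in milliseconds."""
    return 1000.0 / tps


BASELINE_DECODE_TPS = 20.4  # decode without MTP
MTP_DECODE_TPS = 26.0       # decode with MTP

decode_gain = relative_speedup(BASELINE_DECODE_TPS, MTP_DECODE_TPS)
print(f"decode gain: {decode_gain:.1%}")
print(f"per-token latency: {ms_per_token(BASELINE_DECODE_TPS):.1f} ms -> "
      f"{ms_per_token(MTP_DECODE_TPS):.1f} ms")
```

Expressed as latency, the same gain means each generated token arrives roughly 10 ms sooner on average.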
End to end, the normalized wall time for a sequence of 1150 input tokens and 2000 output tokens fell by about 20%, from 103.4 seconds to 82.4 seconds. Power consumption also dropped: during the decode phase, socket power decreased by approximately 14%, from 85 W to 73 W. This matters for on-premise deployments, where TCO includes operational energy costs. The 84.7% acceptance rate of "drafted" tokens indicates that most drafted tokens are kept rather than discarded.
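The wall-time figures appear consistent with a simple model: prompt tokens divided by prefill speed, plus output tokens divided by decode speed. The sketch below applies that model and also derives decode energy per token from the socket power readings. The prefill value of 211.5 tokens/second is an assumed midpoint of the reported range, and the energy estimate assumes the quoted socket power is a steady average over the decode phase; neither is stated in the benchmark.

```python
def wall_time_s(prompt_tokens: int, output_tokens: int,
                prefill_tps: float, decode_tps: float) -> float:
    """Prefill time plus decode time, in seconds."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


def decode_energy_j_per_token(socket_power_w: float, decode_tps: float) -> float:
    """Joules spent per generated token at a constant average power."""
    return socket_power_w / decode_tps


PREFILL_TPS = 211.5  # assumption: midpoint of the reported 211-212 range

baseline_s = wall_time_s(1150, 2000, PREFILL_TPS, 20.4)
mtp_s = wall_time_s(1150, 2000, PREFILL_TPS, 26.0)
time_saving = 1.0 - mtp_s / baseline_s

baseline_j = decode_energy_j_per_token(85.0, 20.4)
mtp_j = decode_energy_j_per_token(73.0, 26.0)
energy_saving = 1.0 - mtp_j / baseline_j

print(f"wall time: {baseline_s:.1f} s -> {mtp_s:.1f} s ({time_saving:.1%} less)")
print(f"decode energy: {baseline_j:.2f} J/token -> {mtp_j:.2f} J/token "
      f"({energy_saving:.1%} less)")
```

Because decode is both faster and lower-power with MTP, the estimated energy per generated token falls by roughly a third, which is more than either the speed gain or the power drop alone would suggest.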
Implications for On-Premise Deployments and Strategic Choices
For organizations considering on-premise LLM deployments, the benchmark is a concrete data point. Reaching a decode speed of 26 tokens/second on a 200-billion-parameter MoE model with a high-end APU such as the AMD Strix Halo shows that complex workloads can run locally. This is particularly useful for scenarios that require strict data sovereignty, for air-gapped environments, or where network latency to the cloud is a critical factor.
A comparison with Qwen 3.5 122B MTP, which achieves 26.7 tokens/second, suggests that StepFun with MTP sits in a similar performance tier while running a model with a significantly larger total parameter count. This raises the question of the trade-off between model quality and inference speed, which decision-makers must weigh against their specific needs. For those evaluating on-premise deployments, AI-RADAR offers analytical frameworks on /llm-onpremise to assess the trade-offs between performance, TCO, and compliance requirements, providing tools for informed decisions without direct recommendations.