A user reported a significant increase in prompt processing speed using llama.cpp with ROCm on a Strix Halo platform (Ryzen AI Max). The tests, performed with a modified llama.cpp ROCm build, show improvements that vary by model.

Performance Increases

The results indicate speedups ranging from 7% to 132% depending on the model. GPT-OSS-120B-MXFP4 showed the largest gain at 132%, while GLM4.7-Flash-UD-Q4_K_XL saw a more modest 7% improvement. Nemotron-3-Nano-30B-A3B-Q8_0 and Qwen3-Coder-Next-MXFP4-MOE recorded +98% and +77% respectively.
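For context, these percentages are relative speedups computed from prompt-processing throughput (tokens per second) before and after the change. A minimal sketch of the arithmetic, using hypothetical tok/s values that are not from the report:

```python
def pct_increase(before_tps: float, after_tps: float) -> float:
    """Relative speedup in percent from before/after throughput (tok/s)."""
    return (after_tps / before_tps - 1) * 100

# Hypothetical illustration: 100 tok/s before, 232 tok/s after -> +132%
print(f"{pct_increase(100.0, 232.0):.0f}%")  # prints "132%"
```

So a "+132%" figure means prompt processing ran at 2.32 times its previous rate.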

Details and Warnings

The user who performed the tests used an AMD Ryzen AI Max system with Radeon 8060S graphics. It is important to note that, as highlighted in the comments on the original report, the performance increase may stem from a temporary bug. The original poster later updated the discussion to say that performance had returned to previous levels.