A user reported a significant increase in prompt processing speed using llama.cpp with ROCm on a Strix Halo platform (Ryzen AI Max). The tests, performed with a modified llama.cpp ROCm build, show improvements that vary by model.

Performance Increases

The results indicate speedups ranging from 7% to 132% depending on the model. GPT-OSS-120B-MXFP4 showed the largest gain at 132%, while GLM4.7-Flash-UD-Q4_K_XL saw a more modest 7% improvement. Nemotron-3-Nano-30B-A3B-Q8_0 and Qwen3-Coder-Next-MXFP4-MOE recorded +98% and +77% respectively.
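For context, these percentages are relative speedups computed from prompt-processing throughput (tokens per second) before and after the change. A minimal sketch of the arithmetic, using hypothetical tok/s values that are not from the report:

```python
def pct_increase(before_tps: float, after_tps: float) -> float:
    """Relative speedup in percent from before/after throughput (tok/s)."""
    return (after_tps / before_tps - 1) * 100

# Hypothetical illustration: 100 tok/s before, 232 tok/s after -> +132%
print(f"{pct_increase(100.0, 232.0):.0f}%")  # prints "132%"
```

So a "+132%" figure means prompt processing ran at 2.32 times its previous rate.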

Details and Warnings

The user who performed the tests used an AMD Ryzen AI Max system with Radeon 8060S graphics. It is important to note that, as highlighted in the comments on the original report, the performance increase may stem from a temporary bug. The original poster later updated the discussion to say that performance had returned to previous levels.