Gemma 4 QAT on Strix Halo: On-Premise Performance for Quantized LLMs

Optimizing LLMs for Edge Computing

Running Large Language Models (LLMs) on local hardware, particularly on edge devices or on-premise systems with limited memory and power budgets, is difficult. The need to balance performance, energy efficiency, and data control is pushing the industry toward more aggressive optimization techniques. In this context, Google's Gemma 4 models trained with Quantization-Aware Training (QAT) emerge as a promising option, especially when deployed on integrated hardware platforms such as AMD Strix Halo APUs.

Recent evaluations on a Strix Halo APU highlight the capabilities of these quantized models, served locally via llama.cpp using its Vulkan backend on the RADV driver. The results offer useful insights for CTOs, infrastructure architects, and DevOps leads considering self-hosted alternatives to cloud services for AI/LLM workloads, with an emphasis on data sovereignty and Total Cost of Ownership (TCO) optimization.

Technical Details and Performance on Strix Halo

The core of this experiment is the QAT approach. Unlike post-training quantization, which reduces the precision of an already trained model, QAT simulates quantization during the training or adaptation phase. This lets the model learn to compensate for precision loss, so it stays closer to the original model's behavior even in a low-precision format like Q4_0. The host system used for the benchmarks was an AMD Ryzen AI Max+ 395 with a Radeon 8060S integrated GPU (gfx1151) and 128 GB of unified LPDDR5X memory, running Linux Mint 22.3.
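To make the precision loss concrete, here is a minimal sketch of block-wise 4-bit "fake quantization": weights are rounded to 16 levels per block of 32 and then converted back to floats. This is a simplified symmetric scheme written for illustration, not llama.cpp's exact Q4_0 kernel and not Google's QAT recipe, which the benchmark source does not describe. The function names and the sample weight distribution are assumptions.

```python
import random

BLOCK_SIZE = 32  # Q4_0 groups weights into blocks of 32 that share one scale


def fake_quantize_block(block):
    """Round one block to 4-bit integer levels, then map back to floats."""
    max_abs = max(abs(w) for w in block)
    if max_abs == 0.0:
        return list(block)
    # One scale per block: the largest magnitude maps to integer level 7.
    scale = max_abs / 7.0
    # Signed 4-bit range is [-8, 7]; with this scale the clamp is a safety net.
    return [max(-8, min(7, round(w / scale))) * scale for w in block]


def fake_quantize(weights, block_size=BLOCK_SIZE):
    """Apply block-wise fake quantization to a flat list of weights."""
    out = []
    for start in range(0, len(weights), block_size):
        out.extend(fake_quantize_block(weights[start:start + block_size]))
    return out


def mean_abs_error(original, restored):
    """Average absolute rounding error introduced by quantization."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical weights; real layers have their own distributions.
    weights = [rng.gauss(0.0, 0.02) for _ in range(4 * BLOCK_SIZE)]
    restored = fake_quantize(weights)
    print(f"mean abs rounding error: {mean_abs_error(weights, restored):.6f}")
```

Post-training quantization applies a rounding step like this once, after training. In a typical QAT setup, the forward pass runs through the rounded weights during training, so the optimizer can move the weights toward values that survive rounding.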

Tests covered several Gemma 4 QAT Q4_0 GGUF variants, including the 12B, 26B-A4B, and 31B versions. The 26B-A4B QAT Q4_0 model stood out. Served via llama.cpp on Vulkan/RADV, it reached approximately 59 tokens/second in the decode phase, with a prefill throughput of 1194.4 tokens/second. Adding QAT-specific assistant heads in an MTP (Multi-Token Prediction) setup, with Q8 quantization for the KV cache, raised single-stream decode to approximately 71 tokens/second, a gain of roughly 20%, with a significantly higher draft acceptance rate than with non-QAT assistant heads.
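To translate these throughput figures into request latency, the following sketch estimates end-to-end time as prompt tokens divided by prefill speed plus output tokens divided by decode speed. It is a rough model under stated assumptions: the request size (4096 prompt tokens, 512 output tokens) is hypothetical, the prefill figure is assumed unchanged in the MTP configuration because the source does not report it separately, and real throughput varies with context depth.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Throughput:
    """Measured single-stream speeds, in tokens per second."""
    prefill_tps: float
    decode_tps: float

    def request_seconds(self, prompt_tokens: int, output_tokens: int) -> float:
        """Estimate latency as prefill time plus decode time."""
        return prompt_tokens / self.prefill_tps + output_tokens / self.decode_tps


def decode_speedup(before: Throughput, after: Throughput) -> float:
    """Ratio of decode speeds between two configurations."""
    return after.decode_tps / before.decode_tps


# Figures reported for the 26B-A4B QAT Q4_0 model.
BASELINE = Throughput(prefill_tps=1194.4, decode_tps=59.0)
# Assumption: prefill speed is unchanged when the MTP assistant heads are enabled.
WITH_MTP = Throughput(prefill_tps=1194.4, decode_tps=71.0)

if __name__ == "__main__":
    prompt_tokens, output_tokens = 4096, 512  # hypothetical request size
    for name, cfg in (("baseline", BASELINE), ("with MTP", WITH_MTP)):
        seconds = cfg.request_seconds(prompt_tokens, output_tokens)
        print(f"{name}: {seconds:.2f} s")
    print(f"decode speedup: {decode_speedup(BASELINE, WITH_MTP):.2f}x")
```

Because decode dominates latency for long generations, the gain from MTP matters most for requests with many output tokens. For prompt-heavy requests, prefill throughput is the more relevant figure.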

Context and Implications for On-Premise Deployments

These results are particularly relevant for organizations that need to run LLMs in on-premise or air-gapped environments. The ability to achieve this level of performance on an APU, a class of hardware that is typically more accessible and draws less power than high-end discrete GPUs, opens new possibilities for distributed AI and edge computing. Combining QAT with an optimized runtime such as llama.cpp shows how memory and compute constraints can be mitigated, making larger models usable on less powerful hardware.
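The memory side of this trade-off can be approximated with simple arithmetic. Q4_0 stores each block of 32 weights as 32 four-bit values plus one 16-bit scale, which works out to 4.5 bits per weight. The sketch below compares that with a 16-bit baseline. The parameter counts are taken nominally from the model names (an assumption), and the estimate ignores the KV cache, activations, and any tensors a given GGUF file keeps at higher precision, so actual file sizes will differ.

```python
# 32 weights x 4 bits + one 16-bit scale, spread over 32 weights = 4.5 bits.
Q4_0_BITS_PER_WEIGHT = (32 * 4 + 16) / 32
BF16_BITS_PER_WEIGHT = 16.0

UNIFIED_MEMORY_GB = 128  # memory of the benchmark host

# Nominal sizes in billions of parameters, read off the model names (assumption).
NOMINAL_PARAMS_BILLIONS = {"12B": 12, "26B-A4B": 26, "31B": 31}


def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    for name, params in NOMINAL_PARAMS_BILLIONS.items():
        q4 = weight_gb(params, Q4_0_BITS_PER_WEIGHT)
        bf16 = weight_gb(params, BF16_BITS_PER_WEIGHT)
        share = 100 * q4 / UNIFIED_MEMORY_GB
        print(f"{name}: ~{q4:.1f} GB at Q4_0 vs ~{bf16:.1f} GB at 16-bit "
              f"({share:.0f}% of {UNIFIED_MEMORY_GB} GB)")
```

Under these assumptions, even the largest variant leaves most of the 128 GB of unified memory free for the KV cache and other workloads, which is part of what makes an APU with a large shared memory pool attractive for this class of model.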

The importance of QAT-specific assistant heads underscores that optimization is not just a matter of quantizing the main model but requires a holistic approach that includes all components of the inference pipeline. For those evaluating on-premise deployments, these trade-offs between model size, quantization level, hardware architecture, and software optimization are fundamental for defining TCO and ensuring compliance with data sovereignty regulations. AI-RADAR offers analytical frameworks on /llm-onpremise to evaluate these trade-offs in detail.
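Why acceptance matters can be illustrated with the standard idealized model of speculative decoding. If each drafted token is accepted independently with probability a, and k tokens are drafted per step, the main model emits on average (1 - a^(k+1)) / (1 - a) tokens per forward pass. The sketch below evaluates that formula. The acceptance rates and draft length are hypothetical, since the source does not report them, and the model ignores the cost of running the assistant heads, so it is an upper bound on the achievable speedup rather than a prediction.

```python
def expected_tokens_per_step(acceptance: float, draft_len: int) -> float:
    """Expected tokens emitted per main-model pass in idealized speculative decoding.

    Assumes each drafted token is accepted independently with the same probability.
    """
    if not 0.0 <= acceptance <= 1.0:
        raise ValueError("acceptance must be within [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance == 1.0:
        # Every drafted token is accepted, plus the one token the main model adds.
        return float(draft_len + 1)
    # Geometric series 1 + a + a^2 + ... + a^k.
    return (1.0 - acceptance ** (draft_len + 1)) / (1.0 - acceptance)


if __name__ == "__main__":
    draft_len = 3  # hypothetical number of drafted tokens per step
    for acceptance in (0.4, 0.6, 0.8):  # hypothetical acceptance rates
        tokens = expected_tokens_per_step(acceptance, draft_len)
        print(f"acceptance={acceptance:.1f}: {tokens:.2f} tokens per step")
```

In this model, assistant heads that agree more often with the quantized main model directly raise the number of tokens produced per pass, which is consistent with the higher decode speed reported for the QAT-specific heads.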

Future Prospects and Final Considerations

The emergence of models like Gemma 4, optimized with QAT and capable of operating efficiently on APUs, marks a significant step toward making capable LLMs broadly accessible. Running complex LLMs locally reduces cloud dependence, offers greater control over data security and privacy, and can lower TCO in the long run. Organizations should still evaluate their specific needs carefully, considering factors such as target latency, required throughput, and model complexity.

These benchmarks, although specific to a llama.cpp and Vulkan/RADV configuration on a Strix Halo APU, highlight a clear trend: innovation in model optimization and runtime efficiency is crucial for unlocking the full potential of LLMs in on-premise and edge deployment scenarios. Continuous research and development in these areas will be decisive in defining future AI architectures, balancing performance and operational sustainability.