Dual Radeon R9700 GPUs power a 27B LLM: on-prem benchmarks with llama.cpp

Those who set up multi-GPU configurations with the new Radeon professional cards know they are moving in territory still sparsely documented. A technician recently shared detailed results of his setup: a ThinkStation P7 powered by a Xeon w7-3455 and two Gigabyte Radeon AI PRO R9700 cards with 32 GB each, for a total of 64 GB of VRAM. The workload is a 27-billion-parameter LLM, Qwen 3.6, served with llama.cpp and ROCm 7.2.1 on Ubuntu 24.04. The goal was not an academic exercise but to understand whether the hardware could sustain real workloads — code generation from Markdown specs, processing of long texts (Cisco manuals, medical literature), and summarization of complex sessions — and at what performance.

The hardware and software configuration

The core of the system consists of the two Radeon AI PRO R9700 GPUs, targeted by the gfx1201 compilation target and managed in parallel via ROCm. Llama.cpp was containerized with Docker, exposing the necessary devices (/dev/kfd, /dev/dri) and granting access to both cards through HIP_VISIBLE_DEVICES. The chosen model is a Q8 quantization of Qwen 3.6 27B with Multi-Token Prediction (MTP), a mechanism that generates multiple tokens in a single pass to speed up the decode phase. The context window was pushed to 131,072 tokens, using tensor splitting (--split-mode tensor) and a unified KV cache (--kv-unified) to distribute the load across the two GPUs without obvious bottlenecks.

On the inference side, the configuration enables flash attention, sets a batch size of 2048 and a micro-batch of 1024, with minimal parallelism (--parallel 1) and continuous batching. The MTP draft accepts up to 5 speculative tokens (--spec-draft-n-max 5), a value the tester is still refining: lowering it seems to improve generation speed at high contexts, suggesting that the cost of draft rejection can become noticeable under memory pressure.

Performance: prefill, decode, and draft acceptance

The numbers from the server logs show a machine that, while far from datacenter accelerator peaks, offers more than decent responsiveness for professional on-premise use. During the prefill phase, i.e., when the model must ingest large prompts, throughput stays above 400 t/s even with 100,000 input tokens and reaches nearly 1,500 t/s under 10,000 tokens. The most interesting figure for the end user, however, is the generation speed (decode): with an already filled context of around 3,000–6,000 tokens, the machine produces between 46 and 61 tokens per second; with contexts of 10,000–13,000 tokens it rises to 64–67 t/s. Even at nearly full window (102,000 tokens), 44 t/s remain acceptable, and at 125,000 tokens it still records 45 t/s. MTP draft acceptance oscillates between 0.33 and 0.61, a normal range for this type of acceleration, confirming that the mechanism brings tangible benefit without prohibitive overhead.

A further architectural advantage is prompt caching: the server maintains up to 32 KV cache checkpoints (each between 150 and 580 MiB) and restores them in 60–300 milliseconds, avoiding full prompt reprocessing when a new request shares its prefix with a previous session. This optimization is crucial in real application scenarios, where users ask successive questions about the same document.

Why the on-premise choice matters

The R9700 test is not just a hardware curiosity: for organizations evaluating a local LLM deployment, every piece of data on consumer/professional multi-GPU configurations is a tile in the TCO puzzle. The R9700, with 32 GB of VRAM per card, makes it possible to serve 27B models at high precision (Q8) without resorting to cloud services, while ensuring full data control and regulatory compliance — a factor AI-RADAR constantly monitors in digital sovereignty scenarios. Energy consumption was not measured, but the recorded PCIe bandwidth (below 200 MB/s during decode, peaks of 5–7 GB/s during prefill) indicates that the system does not saturate the bus, suggesting that the main bottleneck remains GPU compute capacity rather than transfer.

It is worth remembering that the ROCm ecosystem on Radeon professional cards is still maturing, and not all optimizations available on CUDA are immediately replicable. Nonetheless, the llama.cpp + ROCm combination is making rapid progress, and a setup like this can serve as a starting point for small development teams, research labs, or IT departments that want to experiment with local inference without six-figure investments. The choice of Q8 quantization, rather than Q4 or Q6, prioritizes output quality over raw speed, and aligns with a usage profile where reliability matters more than tokens per second.

What these tests signal

The published documentation — container, parameters, metrics — is an important signal: the community is beginning to produce concrete references for non-mainstream hardware, narrowing the gap between manufacturer announcements and operational reality. For those following on-premise deployment, having reproducible tests on GPUs like the R9700 means being able to compare cost scenarios, evaluate the sustainability of an all-local approach, and reduce purchase uncertainty. AI-RADAR will keep tracking these experiences, offering analytical tools to decide if and when to bring LLMs inside corporate boundaries.