Local LLMs and agentic workloads: prefill is everything, KV heads beat parameters

When running LLMs locally for agentic workflows (tool calling, coding agents, RAG), attention almost always goes to token generation speed, the so-called tg128. But a benchmark of 13 models at context sizes up to 131K tokens, run on a consumer AMD RX 7900 XT with the Vulkan backend, shows that the key metric lies elsewhere: prefill, the time spent processing the entire context before the first token is emitted.

In a typical scenario with 65K input tokens and a 300-token response, prefill consumes between 94% and 99% of the total wall-clock time. For agentic workloads, where outputs are often short strings (instructions, tool calls), context processing speed becomes the real bottleneck, not generation.
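The arithmetic behind that share is simple enough to sketch. The throughput values below are hypothetical placeholders, not measurements from the benchmark; substitute the prefill and decode speeds you observe on your own hardware.

```python
def prefill_share(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Fraction of total wall-clock time spent in prefill for one request."""
    prefill_s = prompt_tokens / prefill_tps   # time to process the context
    decode_s = output_tokens / decode_tps     # time to generate the reply
    return prefill_s / (prefill_s + decode_s)

# Hypothetical speeds (assumptions, not benchmark results):
# 500 t/s prefill at long context, 50 t/s decode.
share = prefill_share(prompt_tokens=65_536, output_tokens=300,
                      prefill_tps=500.0, decode_tps=50.0)
print(f"prefill share of wall-clock time: {share:.1%}")
```

With these assumed numbers, even a tenfold improvement in decode speed would change the total time by only a few percent, which is why the prefill figure deserves the attention.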

The real game is played on KV heads, not parameters

The data debunks a common myth: within the same class, the number of KV heads in the architecture influences prefill scalability far more than the parameter count. Two dense models, Ornith-9B (4 KV heads, 9B parameters) and Apriel-15B (8 KV heads, 15B parameters), illustrate the point: at 128K context, the former is 4.4× faster, a gap far larger than the roughly 1.7× difference in parameter count would explain. Every attention pass must scan the full KV cache, and the 8-head model stores 2.5× more data per token (160 KB versus 64 KB). Practical rule: when evaluating a model for long contexts, check n_kv_heads and head_dim first, not the parameter count.
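A rough way to estimate the per-token cache size from a model's configuration is sketched below. The layer counts and head_dim are illustrative assumptions chosen so that the results match the 64 KB and 160 KB per-token figures quoted in this article; they are not confirmed specifications of Ornith-9B or Apriel-15B. The formula also assumes every layer is a standard attention layer with an F16 cache.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache added per token (F16 by default).

    Each attention layer stores one K and one V vector per KV head,
    hence the leading factor of 2.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

# Hypothetical configurations (assumptions, see the note above).
small = kv_bytes_per_token(n_layers=32, n_kv_heads=4, head_dim=128)
large = kv_bytes_per_token(n_layers=40, n_kv_heads=8, head_dim=128)

n_ctx = 131_072
for name, per_token in (("4 KV heads", small), ("8 KV heads", large)):
    total_gib = per_token * n_ctx / 2**30
    print(f"{name}: {per_token // 1024} KB/token, {total_gib:.0f} GiB at {n_ctx} tokens")
print(f"ratio: {large / small:.1f}x")
```

Under these assumptions the 8-head configuration needs a cache on the order of the card's entire VRAM at full context, before any weights are loaded.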

The quantization paradox: F16 sometimes wins

Quantizing the KV cache (Q8_0 for keys, Q4_0 for values) is widely considered a best practice to reduce VRAM footprint and speed up inference. But the benchmark reveals a counterintuitive twist: at 65K context, for MoE models and compact dense models, keeping the cache in F16 (no quantization) runs 20–53% faster than the Q8_0/Q4_0 combination. The reason? On-the-fly dequantization imposes a compute cost that, across thousands of tokens, outweighs the saved memory bandwidth. On GPUs with high compute throughput, the dequantization overhead becomes dominant. Conversely, dense models with 8 KV heads suffer badly with F16 because the larger cache no longer fits in VRAM next to the model, forcing weights to spill over PCIe. The takeaway: test F16 on your hardware at real working context lengths; don’t assume quantization is always faster.
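Before benchmarking, it helps to estimate whether the F16 cache fits in VRAM at all. The sketch below assumes the Q8_0 and Q4_0 formats follow the ggml block layout used by llama.cpp (32 values per block plus a 2-byte scale, so 34 and 18 bytes per block); the model configuration is a hypothetical 8-KV-head example, not a measured one.

```python
# Bytes per stored value. The quantized figures assume ggml-style blocks
# of 32 values with one 2-byte scale: 34 bytes (Q8_0) and 18 bytes (Q4_0).
BYTES_PER_VALUE = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, k_type, v_type):
    """Estimated KV cache size in GiB for the given K and V cache types."""
    values_per_token = n_layers * n_kv_heads * head_dim  # for K, and again for V
    bytes_total = n_ctx * values_per_token * (
        BYTES_PER_VALUE[k_type] + BYTES_PER_VALUE[v_type]
    )
    return bytes_total / 2**30

# Hypothetical dense model with 8 KV heads (assumed: 40 layers, head_dim 128).
cfg = dict(n_ctx=65_536, n_layers=40, n_kv_heads=8, head_dim=128)
f16 = kv_cache_gib(**cfg, k_type="f16", v_type="f16")
quant = kv_cache_gib(**cfg, k_type="q8_0", v_type="q4_0")
print(f"F16 cache:       {f16:.2f} GiB")
print(f"Q8_0/Q4_0 cache: {quant:.2f} GiB")
```

If the F16 figure plus the model weights exceeds available VRAM, quantizing the cache is likely the better choice regardless of dequantization cost; if both fit comfortably, F16 is worth measuring.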

Mamba2 and MLA: promises and early cracks

The only Mamba2 hybrid tested, Granite-4.0-H-Small, retains 69% of its prefill speed at 131K context compared to 4K, while every pure transformer drops below 42%. The advantage comes from the Mamba2 layers, whose fixed-size recurrent state does not grow with context the way a KV cache does. The catch: reasoning quality is still low and decode is slow (71 t/s). A well-trained model on this architecture could be a serious contender for agentic workloads.

At the opposite end, GLM-4.7-Flash, which uses MLA (multi-head latent attention), loses 80% of its speed going from 512 to 16K context and crashes above 65K on Vulkan. The MLA compression/decompression kernel scales poorly, at least on Vulkan. CUDA or Metal might handle it better, but the message is clear: don’t extrapolate short-context MLA benchmarks to long-context use on Vulkan.
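The retention figures quoted in this section are plain ratios of prefill throughput at a long context to throughput at the shortest one measured. A minimal sketch, using made-up throughput numbers rather than benchmark data:

```python
def retention_curve(prefill_tps_by_ctx):
    """Map each context size to the fraction of shortest-context speed retained."""
    base_ctx = min(prefill_tps_by_ctx)
    base_tps = prefill_tps_by_ctx[base_ctx]
    return {ctx: tps / base_tps for ctx, tps in sorted(prefill_tps_by_ctx.items())}

# Hypothetical prefill throughput in tokens/s (assumptions, not measurements).
measured = {4_096: 1000.0, 16_384: 800.0, 65_536: 550.0, 131_072: 400.0}

curve = retention_curve(measured)
for ctx, kept in curve.items():
    print(f"{ctx:>7} tokens: {kept:.0%} retained, {1 - kept:.0%} lost")
```

Comparing models on this curve rather than on a single short-context number makes scaling problems like the one above visible early.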

What it means for self-hosting agentic workloads

For those designing on-premise deployments, this benchmark rewrites the priorities. Instead of obsessing over tg128, evaluate models on prefill at 65K or 131K tokens. KV head architecture becomes a critical selection criterion: a model with 4 heads and a 64 KB per-token cache scales dramatically better than one with 8 heads and 160 KB, regardless of total parameters. This directly affects hardware choices: GPUs with high memory bandwidth are always valuable, but for prefill-dominated loads the ratio of compute to bandwidth can invert the effectiveness of quantization tricks. MoE models also tolerate VRAM spill well: only the active parameters cross the PCIe bus, which minimizes the penalty and delivers a combination of fast prefill and competitive output quality. From a TCO perspective, favoring models with these traits can extract more effective throughput from existing infrastructure and defer investment in pricier hardware.

The full dataset and benchmark script are available in the source post.