A long thread on r/LocalLLaMA has sparked debate about what it really takes to do agentic work on the latest workstation GPUs. The author tested multiple models on a system with two RTX Pro 6000 cards, searching for a kind of “local Sonnet” that could handle contexts up to 150k tokens without grinding to a halt. The verdict? Software and models are still out of sync, and the winners are architectures that don’t rely on data-center-optimized kernels.
Attention makes the difference, not raw power
The core of the issue is how each model handles attention when the context window expands. Mimo 2.5 uses the same 5:1 sliding-window hybrid attention seen in Gemma 3: most layers look only at recent tokens, while a few still read the full context. That keeps speed from collapsing. Step 3.7 Flash uses a 3:1 variant and reaches roughly 40 tokens per second at 178k tokens.
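To make the effect of the hybrid layout concrete, here is a minimal sketch that counts how many token positions each layout has to keep in its KV cache. The layer count, window size, and the placement of the full-context layer at the end of each group are illustrative assumptions, not published specifications of Mimo 2.5 or Step 3.7 Flash.

```python
def layer_pattern(num_layers: int, local_per_global: int) -> list[str]:
    """Label each layer 'local' (sliding window) or 'global' (full context).

    Assumption: the global layer is the last one in each group of
    local_per_global + 1 layers. Real models may order them differently.
    """
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local" for i in range(num_layers)]


def kv_cache_positions(pattern: list[str], context_len: int, window: int) -> int:
    """Sum, over all layers, the number of token positions kept in the KV cache."""
    return sum(
        context_len if kind == "global" else min(context_len, window)
        for kind in pattern
    )


# Hypothetical values chosen only for illustration.
NUM_LAYERS = 48
WINDOW = 4096
CONTEXT = 150_000

hybrid_5_to_1 = layer_pattern(NUM_LAYERS, local_per_global=5)
hybrid_3_to_1 = layer_pattern(NUM_LAYERS, local_per_global=3)
dense = ["global"] * NUM_LAYERS

dense_positions = kv_cache_positions(dense, CONTEXT, WINDOW)
for name, pattern in [("5:1", hybrid_5_to_1), ("3:1", hybrid_3_to_1)]:
    share = kv_cache_positions(pattern, CONTEXT, WINDOW) / dense_positions
    print(f"{name} hybrid keeps {share:.0%} of the dense KV cache positions")
```

Under these assumptions, most of the saving comes from the local layers never growing beyond the window, which is why the per-token cost stays manageable as the context fills up.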
On the other side, MiniMax M3 and DeepSeek V4 depend on CUDA kernels written for data-center Blackwell (SM100, B200 class). The RTX Pro 6000 is a workstation card built on the SM120 variant of Blackwell, the same one used in consumer GeForce cards, and those kernels aren’t available for it. MiniMax M3 silently falls back to dense attention and slows to a crawl; DeepSeek V4 offloads operations to the CPU and struggles at 14 t/s.
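Because the fallback can be silent, it may help to check which Blackwell variant a machine actually reports before blaming the model. The sketch below maps a CUDA compute capability to the two families discussed here. The mapping (major version 10 for the SM100 class, 12 for the SM120 class) is an assumption to verify against NVIDIA's compute-capability table, and both function names are hypothetical. The optional detection helper relies on PyTorch's `torch.cuda.get_device_capability()` and returns `None` when PyTorch or a CUDA device is missing.

```python
from typing import Optional, Tuple


def blackwell_family(major: int, minor: int) -> str:
    """Classify a CUDA compute capability into the families discussed in the article.

    Assumed mapping: 10.x = data-center Blackwell (SM100 class, e.g. B200),
    12.x = workstation/consumer Blackwell (SM120 class, e.g. RTX Pro 6000, RTX 5090).
    """
    if major == 10:
        return "data-center Blackwell (SM100 class)"
    if major == 12:
        return "workstation/consumer Blackwell (SM120 class)"
    return f"other architecture (compute capability {major}.{minor})"


def detect_local_capability() -> Optional[Tuple[int, int]]:
    """Return (major, minor) for GPU 0, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


# Example with a hard-coded capability, so the sketch runs without a GPU.
print(blackwell_family(12, 0))
```

A result in the SM120 class does not prove that a given model will fall back to a slow path; it only signals that kernels written specifically for SM100 deserve a closer look in the runtime's logs.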
The software gap that hobbles new GPUs
The root cause is not theoretical. The llama.cpp repository openly discusses the difficulty of shipping a GGUF with flash attention for DeepSeek V4, and an SGLang issue flags bugs with NVFP4 on SM120. In practice, anyone buying an RTX 5090 or a Pro 6000 today for large-scale local inference ends up with powerful hardware but no software to fully exploit the latest models.
For anyone evaluating on-premise deployment, the message is clear: it’s not enough to compare official benchmarks; you have to verify whether the chosen model can run at full speed without kernels that exist only for data-center GPUs. Architectures that use standard attention mechanisms already supported in mature runtimes, such as sliding window and grouped query attention, are currently the most pragmatic choice to maintain speed on workstation GPUs.
Coding quality matches Sonnet, but time changes everything
A surprising finding is the quality of the generated code. In the author’s private benchmark, Mimo 2.5, MiniMax 2.7, MiniMax M3, and Step 3.7 Flash all landed at Sonnet’s level (Qwen 3.5 122B excluded). The difference lies in the minutes needed to complete the task: Mimo 2.5 takes about 4 minutes, like Opus and Sonnet, while MiniMax M3 takes roughly 40 minutes. That roughly tenfold gap turns an interactive workflow into an unbearable bottleneck.
This means that, in an agentic scenario where the context window fills quickly, the choice of model can’t ignore the trade-off between quality and latency at large contexts. Even a larger model (427B vs. 229B) brings no tangible improvement if VRAM limits force it to the same quantization level and if the lack of optimized kernels slows it down.
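The latency side of that trade-off is simple arithmetic. As a rough sketch, the snippet below converts a decode rate into wall-clock minutes for a fixed amount of generated output. The two rates are the figures quoted in the thread (40 t/s and 14 t/s); the output budget per task is a hypothetical number, and prompt processing time is ignored, so this does not attempt to reproduce the author's measured 4-minute and 40-minute results.

```python
def decode_minutes(output_tokens: int, tokens_per_second: float) -> float:
    """Minutes spent generating output_tokens at a steady decode rate.

    Ignores prompt processing (prefill) and tool-call overhead.
    """
    return output_tokens / tokens_per_second / 60


# Hypothetical output budget for one agentic coding task.
TASK_OUTPUT_TOKENS = 10_000

# Decode rates quoted in the thread.
rates = {
    "Step 3.7 Flash at 178k context": 40.0,
    "DeepSeek V4 with CPU offload": 14.0,
}

for name, tps in rates.items():
    minutes = decode_minutes(TASK_OUTPUT_TOKENS, tps)
    print(f"{name}: {minutes:.1f} min for {TASK_OUTPUT_TOKENS} output tokens")
```

In a real agent loop the prompt is re-read at every step, so slow prompt processing at large contexts can widen the gap well beyond what decode speed alone suggests.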
What to expect (and what not to) from the local ecosystem
The current situation tells an uncomfortable truth for those who see self-hosting LLMs as an immediate alternative to the cloud: serving software still lags behind the newest GPUs and custom architectures. Projects like Unsloth and llama.cpp are working to bridge the gap, but with no definite timeline. In the meantime, models like Mimo 2.5 and Step 3.7 Flash show that “old-school” attention approaches can handle contexts beyond 150k tokens on relatively accessible hardware.
For AI-RADAR readers, this is a useful reminder: when assessing an on-premise deployment on consumer or prosumer GPUs, the analysis must include not only the model’s specs but also the maturity of software support on the chosen platform. Otherwise, you risk investing in expensive machines only to discover that the most promising models run far slower than hoped.