Local LLMs have just passed a pragmatic test: completing real coding tasks faster than their cloud counterparts without dramatic quality loss. This is the finding of an independent benchmark published by Reddit user /u/xquarx, who tested DeepSeek V4 Flash on two RTX PRO 6000 GPUs using the vLLM serving framework, comparing it against Anthropic’s Sonnet and Opus APIs.
The headline result is speed. DeepSeek V4 Flash took about 2 minutes per task on average, while Sonnet 5, the slowest of the group, required roughly 6 minutes. That’s a 3x gap, which adds up to tangible time savings over long development sessions. The quality of the produced solutions, measured by the ability to generate correct and useful diffs, landed around Sonnet’s level. The Opus and Fable models (via API) still hold a clear lead: for the single best answer, they remain the benchmark.
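To make the 3x gap concrete, here is a minimal sketch of the arithmetic. The per-task averages (about 2 and 6 minutes) are the rounded figures quoted above; the session size of 20 tasks and the assumption that tasks run strictly one after another are illustrative choices, not part of the benchmark.

```python
# Rounded per-task averages quoted in the article (minutes).
LOCAL_MIN_PER_TASK = 2.0   # DeepSeek V4 Flash, self-hosted
API_MIN_PER_TASK = 6.0     # slowest API model in the comparison


def compare_session(tasks: int) -> dict:
    """Estimate waiting time for a session of `tasks` sequential tasks.

    Assumes every task takes exactly the average time, which real
    sessions will not; treat the result as a rough order of magnitude.
    """
    local_total = tasks * LOCAL_MIN_PER_TASK
    api_total = tasks * API_MIN_PER_TASK
    return {
        "local_min": local_total,
        "api_min": api_total,
        "saved_min": api_total - local_total,
        "speedup": api_total / local_total,
    }


if __name__ == "__main__":
    # Hypothetical session size, chosen only for illustration.
    print(compare_session(20))
```

Under these assumptions, a 20-task session would spend roughly 40 minutes waiting on the local setup versus 120 minutes on the slowest API, a difference of about 80 minutes.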
The test was not run in sterile conditions. The author deliberately mirrored real-world usage: local models ran inside OpenCode, while the APIs ran in Claude Code. One important caveat follows from this: part of the performance gap comes from the execution harness rather than from the models themselves. However, the question posed was not which model wins in a vacuum, but what you actually get when each system is set up the way a developer would use it. The answer: if you avoid dense attention, a common Achilles’ heel for many LLMs on long contexts, today’s local models are surprisingly fast and, for the first time, genuinely competitive in coding.
The chosen hardware is noteworthy: two RTX PRO 6000 GPUs, at 96 GB of VRAM each, provide 192 GB in total, enough to host mid-sized models without resorting to extreme quantization compromises. vLLM, for its part, is one of the most widely used serving frameworks for high-performance inference; its continuous batching and paged KV-cache management (PagedAttention) avoid the bottlenecks that plague more naive approaches. The entire setup is self-hosted and fully under the user’s control, with the ensuing benefits of privacy and data sovereignty, an increasingly critical aspect when working on proprietary code.
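For readers who want to picture a two-GPU vLLM deployment, the following is a minimal sketch that assembles a `vllm serve` command line. It is not the benchmark author’s configuration, which the article does not describe: the use of tensor parallelism across both cards, the memory-utilization value, the port, and the model identifier `example-org/example-model` are all assumptions made for illustration.

```python
import shlex


def build_vllm_serve_command(
    model: str,
    num_gpus: int = 2,
    gpu_memory_utilization: float = 0.90,
    max_model_len=None,
    port: int = 8000,
) -> list:
    """Build an argv list for vLLM's OpenAI-compatible server.

    Every value here is an illustrative assumption, not a setting
    taken from the benchmark.
    """
    argv = [
        "vllm", "serve", model,
        # Shard the model across the GPUs (assumed, not confirmed by the source).
        "--tensor-parallel-size", str(num_gpus),
        # Fraction of each GPU's memory vLLM may claim for weights and KV cache.
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--port", str(port),
    ]
    if max_model_len is not None:
        # Cap the context length to bound KV-cache memory.
        argv += ["--max-model-len", str(max_model_len)]
    return argv


if __name__ == "__main__":
    # Placeholder model name; substitute the checkpoint you actually serve.
    cmd = build_vllm_serve_command("example-org/example-model")
    print(shlex.join(cmd))
```

The script only prints the command, so it can be inspected and adapted before anything is launched. Once such a server is running, a coding agent can be pointed at its OpenAI-compatible endpoint instead of a cloud API.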
For those evaluating on-premise deployment of LLM-based coding assistants, this test signals a possible turning point. It’s no longer about accepting slow, approximate answers in exchange for independence from the cloud. With the right hardware and serving optimizations, it’s now possible to iterate faster than with the APIs while maintaining sufficient quality for many daily tasks. The trade-off remains: Opus and Fable still deliver the best precision, and for activities where the cost of an error is high, proprietary models may still be the better choice. But for most development sessions, where waiting time matters and rapid feedback is sought, the local configuration offers a smoother experience.
For those navigating similar choices, AI-RADAR provides analytical frameworks at /llm-onpremise to weigh the trade-offs between cost, performance, and control. The benchmark author has published the full dataset, charts, and spreadsheets on a dedicated website, and promises to repeat the tests with future models. It is a sign that local inference overtaking the APIs is no longer science fiction, but a measurable reality.