Tokens per Second Isn't Everything
A Reddit user shared their experience using different large language models (LLMs) for agentic development tasks. Initially, the user opted for Qwen3 Coder Next, drawn by its high token throughput (roughly 1000 tokens/s prompt processing and 37 tokens/s generation) on an RTX 5070 Ti with 96 GB of DDR4.
Stability Beats Speed
Despite the promising speeds, the setup proved unstable, with frequent backend crashes and slow overall progress: only about 15 of 110 tasks completed in a day. Frustrated, the user switched to Qwen3.5 122B, a larger model with lower raw throughput (700 tokens/s prefill and 17 tokens/s generation).
Surprisingly, Qwen3.5 122B completed roughly twice the work in the same amount of time, with fewer errors, greater stability, and better code quality. The experience shows that token throughput is not the only factor driving real-world productivity: a larger, more stable model can be more efficient for complex tasks.
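The intuition behind this result can be sketched with a back-of-envelope model: if crashes force tasks to be redone, effective throughput depends on stability as much as on raw generation speed. The sketch below uses illustrative figures only (tokens per task, crash rates, and retry costs are assumptions, not numbers from the post); the generation speeds match those reported.

```python
def effective_tasks_per_hour(gen_tps, tokens_per_task, crash_rate, retry_cost_s):
    """Estimate completed tasks per hour, discounting time lost to crashes.

    gen_tps         -- raw generation speed in tokens/s
    tokens_per_task -- assumed tokens generated per agentic task (illustrative)
    crash_rate      -- assumed probability a task crashes and must be redone
    retry_cost_s    -- assumed seconds lost per crash (restart + rework)
    """
    base_s = tokens_per_task / gen_tps                      # raw generation time
    avg_s = base_s + crash_rate * (retry_cost_s + base_s)   # expected time with retries
    return 3600.0 / avg_s

# Fast-but-unstable vs slower-but-stable (crash rates and costs are hypothetical)
fast = effective_tasks_per_hour(gen_tps=37, tokens_per_task=4000,
                                crash_rate=0.5, retry_cost_s=300)
stable = effective_tasks_per_hour(gen_tps=17, tokens_per_task=4000,
                                  crash_rate=0.05, retry_cost_s=300)
print(f"fast-but-unstable: {fast:.1f} tasks/h, slower-but-stable: {stable:.1f} tasks/h")
```

Under these assumed crash rates, the slower model comes out ahead, mirroring the Reddit user's experience that end-to-end completion rate, not tokens per second, is the metric that matters.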
For those evaluating on-premise deployments, the trade-offs between inference speed and model stability are discussed further in AI-RADAR /llm-onpremise.