Local LLM Execution: A Year of Progress

Just over a year after the DeepSeek moment, running large language models (LLMs) locally has made great strides. A tweet from a Hugging Face engineer highlighted that it was possible to run DeepSeek R1 @ Q8 at roughly 5 tokens per second (tps) on hardware costing around $6,000.

More Efficient Hardware

Today, a mini PC costing around $600 runs Qwen3-27B @ Q4, a more capable model, at that same speed. If you need more throughput, Qwen3.5-35B-A3B @ Q4/Q5 reaches 17-20 tps.
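A back-of-the-envelope calculation helps explain both figures. Token generation on consumer hardware is largely memory-bandwidth-bound: each generated token requires reading every active weight once, so tps is roughly bandwidth divided by active-weight bytes. The sketch below illustrates this; the estimate_tps helper, the 120 GB/s bandwidth, the 0.6 efficiency factor, and the ~4.5 bits/weight for Q4 are illustrative assumptions, not measurements of any specific machine.

    # Rough decode-throughput estimate for a memory-bandwidth-bound LLM.
    # All figures are illustrative assumptions, not measured values.

    def estimate_tps(active_params_b: float, bits_per_weight: float,
                     bandwidth_gbs: float, efficiency: float = 0.6) -> float:
        """Approximate tokens/s: each token reads all active weights once."""
        bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
        return efficiency * bandwidth_gbs * 1e9 / bytes_per_token

    BANDWIDTH = 120  # GB/s, assumed mini-PC LPDDR5 memory bandwidth

    # Dense 27B model at Q4 (~4.5 bits/weight incl. metadata): every weight
    # is read for every token.
    print(f"dense 27B @ Q4: {estimate_tps(27, 4.5, BANDWIDTH):.1f} tps")

    # Sparse MoE with ~3B active parameters per token (the "A3B" suffix):
    # far less data is read per token, so decoding is much faster.
    print(f"MoE A3B @ Q4:  {estimate_tps(3, 4.5, BANDWIDTH):.1f} tps")

Under these assumptions the dense model lands near 5 tps, while the MoE model's ceiling is several times higher. Real-world throughput comes in below that ceiling because attention compute and KV-cache reads add per-token traffic, which is consistent with the 17-20 tps observed for the A3B model.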

Future Prospects

The rapid improvement of smaller models suggests that, in the near future, it will be possible to run 4B models with performance superior to Kimi 2.5. For those weighing on-premise deployments, there are trade-offs to consider; AI-RADAR offers analytical frameworks at /llm-onpremise for comparing the options.