A user shared their experience running a 16-billion-parameter LLM on dated hardware: a 2018 HP ProBook 650 G5 laptop with an 8th-generation Intel i3-8145U processor and 16 GB of RAM in a dual-channel configuration.
The goal was to demonstrate that, contrary to what some proprietary AI vendors suggest, complex models can run even on limited resources. The user, writing from Burma, emphasized that access to latest-generation hardware such as NVIDIA RTX 4090s or high-end MacBooks is not always possible.
CPU vs iGPU: The Challenge
After a month of optimization, the user reached a speed of 10 tokens per second (TPS) with the DeepSeek-Coder-V2-Lite model (a 16B Mixture of Experts). In a head-to-head test between the CPU and the integrated GPU (Intel UHD 620), the iGPU came out ahead thanks to OpenVINO integration: it averaged 8.99 tokens/s with peaks of 9.73 tokens/s, against the CPU's 8.59 tokens/s average and 9.26 tokens/s peak.
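As a rough illustration, the reported figures can be checked with a simple throughput calculation. The numbers below are those from the article; the helper itself is a generic sketch, not the user's actual benchmarking code.

```python
def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Throughput of a generation run in tokens per second."""
    return n_tokens / elapsed_s

# Average speeds reported in the article:
igpu_avg = 8.99  # Intel UHD 620 via OpenVINO
cpu_avg = 8.59   # i3-8145U, CPU-only

# e.g. a hypothetical run of 256 tokens in 28.5 s
print(f"{tokens_per_second(256, 28.5):.2f} tokens/s")
print(f"iGPU advantage over CPU: {igpu_avg / cpu_avg - 1:.1%}")
```

The comparison shows the iGPU's lead is modest (under 5 percent on average), which matches the article's framing: both paths are memory-bandwidth-bound on this hardware, so the gap stays small.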
Optimization Strategies
The main strategies adopted include:
- Use of MoE (Mixture of Experts) models: although the model has 16 billion parameters in total, only about 2.4 billion are active per token, giving it better throughput than a dense model of comparable capability.
- Dual-channel RAM configuration: essential for sufficient memory bandwidth, since CPU and iGPU token generation are bandwidth-bound.
- Linux operating system: Ubuntu was chosen to minimize background processes.
- OpenVINO integration via llama-cpp-python: to simplify dependency management.
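Putting those strategies together, a minimal llama-cpp-python setup on such a machine might look like the sketch below. The model filename and parameter values are assumptions for illustration, not the user's exact configuration; the `Llama` call is left commented out because it requires the library and a downloaded GGUF file.

```python
# Hypothetical CPU/iGPU-friendly settings for a 16B MoE GGUF model
# on a dual-channel 16 GB laptop (values are illustrative guesses).
settings = {
    "model_path": "deepseek-coder-v2-lite-instruct-Q4_K_M.gguf",  # assumed quant/filename
    "n_ctx": 2048,    # modest context window to stay within 16 GB of RAM
    "n_threads": 4,   # i3-8145U exposes 2 cores / 4 threads
    "n_batch": 256,   # smaller batches ease memory pressure
}

# With llama-cpp-python installed, loading and generating would look like:
# from llama_cpp import Llama
# llm = Llama(**settings)
# out = llm("Write a binary search in Python.", max_tokens=256)

print(sorted(settings))
```

The key trade-off here is threads versus bandwidth: on a two-core part, adding threads beyond the hardware count buys nothing, because generation speed is limited by how fast weights stream from RAM rather than by compute.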
Final Considerations
The user warns that the iGPU needs time for an initial compilation step and that occasional language errors (stray Chinese tokens) may appear in the output, although the model's logic remains intact. The experience demonstrates that access to AI should not be gated by expensive hardware. For those evaluating on-premise deployments, there are trade-offs to consider, and AI-RADAR offers analytical frameworks at /llm-onpremise to support these evaluations.