LLM Comparison on Outdated Hardware

A user tested the performance of two LLMs, DeepSeek-V2-Lite and GPT-OSS-20B, on a 2018 HP ProBook laptop with an Intel Core i3-8145U processor and integrated UHD 620 graphics, with no dedicated GPU. The goal was to evaluate how usable the models are on limited hardware, using the OpenVINO backend for inference.

Testing Methodology

Both models were asked the same 10 questions, covering logic, health, history, programming, creative writing, biographies, mathematics, technical explanations, ethics, and food science. Each test was repeated three times, running the questions first on the CPU and then on the iGPU with one layer offloaded. Identical settings were used throughout: context window 4096, maximum output 256 tokens, temperature 0.2, and top_p 0.9; the sketch below shows how such settings map onto a typical OpenVINO generation config.
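
The post does not show the exact runner invocation, so the following is only a minimal sketch, assuming the OpenVINO GenAI Python API; the model directory and prompt are placeholders, and do_sample is an added assumption so that temperature and top_p take effect. (The per-layer iGPU offload described in the test is not expressed in this API and would depend on the actual runner used.)

    import openvino_genai as ov_genai

    # Placeholder directory for a model already converted to OpenVINO IR.
    MODEL_DIR = "DeepSeek-V2-Lite-ov"

    # "CPU" for the pure-CPU runs; "GPU" targets the integrated UHD 620.
    pipe = ov_genai.LLMPipeline(MODEL_DIR, "CPU")

    cfg = ov_genai.GenerationConfig()
    cfg.max_new_tokens = 256   # maximum output used in the tests
    cfg.temperature = 0.2
    cfg.top_p = 0.9
    cfg.do_sample = True       # assumption: enable sampling so temperature/top_p apply

    print(pipe.generate("Explain why the sky is blue in two sentences.", cfg))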

Performance Results

DeepSeek-V2-Lite was clearly faster, delivering almost double the throughput of GPT-OSS-20B. A sketch of how throughput and time to first token (TTFT) can be measured follows the list.

  • DeepSeek-V2-Lite on CPU: 7.93 tok/s average, TTFT 2.36s
  • DeepSeek-V2-Lite on iGPU: 8.08 tok/s average, TTFT 1.86s
  • GPT-OSS-20B on CPU: 4.20 tok/s average, TTFT 3.13s
  • GPT-OSS-20B on iGPU: 4.36 tok/s average, TTFT 3.07s
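
For context on the numbers above, figures like these can be collected with a simple streaming callback. The following is a minimal sketch, assuming the OpenVINO GenAI streamer interface; the model path and prompt are placeholders, and streamed chunks are used as a rough proxy for tokens.

    import time
    import openvino_genai as ov_genai

    pipe = ov_genai.LLMPipeline("DeepSeek-V2-Lite-ov", "CPU")  # placeholder path

    cfg = ov_genai.GenerationConfig()
    cfg.max_new_tokens = 256

    arrivals = []  # wall-clock arrival time of each streamed chunk

    def streamer(chunk: str) -> bool:
        arrivals.append(time.perf_counter())
        return False  # False means "keep generating"

    start = time.perf_counter()
    pipe.generate("Briefly explain the Maillard reaction.", cfg, streamer)

    ttft = arrivals[0] - start               # time to first token
    decode = arrivals[-1] - arrivals[0]      # decode-phase duration
    tps = (len(arrivals) - 1) / decode if decode > 0 else float("nan")
    print(f"TTFT {ttft:.2f}s, ~{tps:.2f} tok/s")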

Offloading to the iGPU brought only marginal throughput gains for both models, but it cut DeepSeek-V2-Lite's TTFT noticeably (2.36s to 1.86s), while GPT-OSS-20B's barely moved.

Response Quality

DeepSeek-V2-Lite scored 7.5 out of 10, giving consistent, well-structured answers across most areas, though it failed a logic test and did not complete the requested code implementation. GPT-OSS-20B scored 2 out of 10: it showed flashes of capability but suffered frequent errors, repetitions, and hallucinations, and in many cases it failed to produce a complete answer within the 256-token limit, exhausting the budget during its reasoning phase.

Conclusions

DeepSeek-V2-Lite proved more suitable for running on resource-constrained hardware, offering a better balance of speed, coherence, and reliability. GPT-OSS-20B, while showing potential, was hard to use under these test settings because of its tendency toward errors and repetition; it might benefit from a larger output-token budget and a higher-precision quantization.