Running MoE LLMs on Legacy Hardware: A New Perspective

The advancement of Large Language Models (LLMs) has often been accompanied by the assumption that inference requires cutting-edge, expensive hardware. However, a recent experiment has shown that large Mixture of Experts (MoE) models, such as Qwen 3.6 35B-A3B and Gemma 4 26B-A4B, can deliver usable inference performance even on older, more accessible configurations. The test was conducted on a secondhand machine equipped with an i7-6700 CPU, 32 GB of RAM, and, crucially, an NVIDIA GTX 1080 GPU with 8 GB of VRAM, a card readily available on the market at low cost.

This result is particularly relevant for organizations considering the deployment of LLMs in on-premise environments, where cost control and data sovereignty are priorities. The ability to leverage existing hardware or acquire low-cost components can drastically reduce the Total Cost of Ownership (TCO) compared to cloud-based solutions, which often entail high recurring operational costs for complex model inference.

Technical Details and Key Optimizations

The success of this implementation relies on the llama.cpp framework, known for its efficiency in running LLMs on a wide range of hardware. The cornerstone was Key-Value (KV) cache quantization via TurboQuant/RotorQuant, which made it possible to hold a 128k-token context window within the 8 GB of VRAM of the GTX 1080. This optimization is particularly important for MoE models: their total parameter count, and hence their memory footprint, far exceeds the parameters active for any single token, leaving little VRAM to spare for the context cache.
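To see why the cache alone can break an 8 GB budget, a back-of-the-envelope estimate helps. The sketch below uses assumed model geometry (layer count, KV heads, head dimension); these are illustrative values, not the published dimensions of the models tested, and the point is only the rough 4x saving that a low-bit KV cache yields at a 128k-token context.

```python
# Back-of-the-envelope KV cache sizing.
# LAYERS / KV_HEADS / HEAD_DIM are assumed, illustrative values,
# not the actual dimensions of any model mentioned in the article.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem):
    # Two tensors per layer (K and V), each n_kv_heads * head_dim wide, per token.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

CTX = 128 * 1024                         # 128k-token context window
LAYERS, KV_HEADS, HEAD_DIM = 48, 4, 128  # assumed GQA-style geometry

fp16_cache = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CTX, 2.0)   # 16-bit cache
q4_cache   = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CTX, 0.5)   # ~4-bit cache

print(f"FP16 KV cache : {fp16_cache / 2**30:.1f} GiB")   # ~12 GiB
print(f"4-bit KV cache: {q4_cache / 2**30:.1f} GiB")     # ~3 GiB
```

With these assumptions an unquantized cache would not even fit in 8 GB on its own, while a 4-bit cache leaves room for the hot layers that must stay on the GPU.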

Another fundamental aspect is the offloading of MoE expert weights. llama.cpp can keep less frequently used weights (cold expert weights) in system RAM and stream them to the GPU over PCIe only when needed, while hot layers and the KV cache stay resident on the GPU. Under this scheme the GPU sat at roughly 40-50% utilization: the real bottleneck was the PCIe 3.0 x16 link, which was running at close to its maximum capacity.

Recorded performance was around 24 tokens per second for Qwen 3.6 35B-A3B and up to 24.5 tokens per second for Gemma 4 26B-A4B. The latter figure came after an optimization to speculative decoding (MTP) that moved the token embedding table from the CPU to the GPU, improving efficiency by 22% and raising the draft acceptance rate to 79%.
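A rough model of the bottleneck makes the "GPU half idle, bus saturated" picture concrete: every decoded token needs the weights of the experts it activates, and whatever share of those weights is not resident in VRAM has to cross the PCIe link. The sketch below uses assumed figures (active parameter count, quantization width, resident fraction, link speed), not measurements from the experiment, to show how quickly that traffic caps throughput.

```python
# Upper bound on decode speed when cold expert weights are streamed over PCIe.
# All inputs are illustrative assumptions, not figures measured in the test.

PCIE3_X16_GBPS = 16.0  # theoretical peak; effective bandwidth is lower in practice

def tok_per_sec_ceiling(active_params_b, bytes_per_param, gpu_resident_frac,
                        link_gbps=PCIE3_X16_GBPS):
    """Ceiling on tokens/s if the only cost were moving the non-resident share
    of the active expert weights across the link once per token."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param * (1.0 - gpu_resident_frac)
    return (link_gbps * 1e9) / bytes_per_token

# Example: ~3B active parameters per token at ~4.5 bits/weight, with roughly
# three quarters of the hot expert weights already resident in VRAM.
print(f"{tok_per_sec_ceiling(3.0, 0.5625, 0.75):.0f} tok/s ceiling")  # ~38 tok/s
```

Because the traffic scales with the active rather than the total parameter count, MoE models are a natural fit for this offloading scheme; once real-world link efficiency and compute time are factored in, a ceiling of this order is consistent with the roughly 24 tokens per second observed.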

Implications for On-Premise Deployment

These results have significant implications for companies evaluating on-premise deployment strategies for LLM workloads. The ability to reuse existing hardware or invest in less expensive solutions opens new avenues for managing data sovereignty and regulatory compliance, critical aspects in sectors such as finance or healthcare. Local execution of models ensures that sensitive data does not leave the corporate infrastructure, directly addressing privacy and security concerns.

On-premise deployment still involves trade-offs, such as configuration complexity and the need for specific optimization expertise. However, the benefits in terms of control, security, and TCO can outweigh these challenges, especially for AI workloads that require deep customization or air-gapped environments. AI-RADAR offers analytical frameworks on /llm-onpremise to evaluate these trade-offs and support informed decisions.

Final Outlook

The experiment demonstrates that software innovation and optimization techniques can significantly extend the useful life of existing hardware for the latest AI applications. It is not always necessary to invest in state-of-the-art GPUs to start experimenting with or deploying LLM solutions in production, especially in scenarios where budget is a significant constraint. This approach democratizes access to the computational power required for LLMs, making them more accessible to a wider audience of developers and businesses.

The continuous development of frameworks like llama.cpp, together with new quantization and offloading techniques, promises to unlock further potential, pushing the boundaries of what can be achieved with limited hardware resources. For CTOs, DevOps leads, and infrastructure architects, understanding these capabilities is crucial for designing AI deployment strategies that are resilient, efficient, and compliant with business requirements.