The Evolution of Large Language Models on Local Hardware

The landscape of Large Language Models (LLMs) is constantly evolving, with an increasing focus on optimization for execution on local hardware. This trend addresses the need of many companies to maintain control over their data, ensure regulatory compliance, and optimize the Total Cost of Ownership (TCO) of AI workloads. The recent introduction of models such as Gemma 4 and Qwen 3.6, combined with advanced optimization techniques, is redefining what can be expected from mid-range GPUs, particularly those with 24 GB of VRAM or less.

These developments represent a tipping point for those considering on-premise deployments, offering the possibility of running complex LLMs without the need for expensive cloud infrastructure or high-end GPUs. The ability to manage AI workloads locally is crucial for sectors requiring high security standards and data sovereignty, such as finance, healthcare, and public administration.

Technical Details and Performance Increase

Recent tests have shown a significant leap forward in inference performance. On a system with an NVIDIA GeForce RTX 3090 with 24 GiB of VRAM, an Intel Core i9-13900H processor, and 62 GiB of system RAM, a speed increase of between 1.2 and 1.8 times was reported. Specifically, the Gemma 4 31B model, which previously achieved around 40 tokens/s, reached 70-80 tokens/s. That is a ratio of roughly 1.75 to 2, at or slightly above the top of the reported range.
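To make the arithmetic behind these figures explicit, here is a minimal sketch that turns the reported tokens/s values into a speedup ratio and into wall-clock time for a response. The function names and the 1,000-token response length are illustrative assumptions; only the 40, 70, and 80 tokens/s figures come from the tests described above.

```python
def speedup(baseline_tps: float, optimized_tps: float) -> float:
    """Return the throughput ratio optimized / baseline."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return optimized_tps / baseline_tps


def seconds_for(tokens: int, tps: float) -> float:
    """Time needed to generate `tokens` tokens at a steady `tps` rate."""
    return tokens / tps


# Figures reported for Gemma 4 31B on the RTX 3090 test system.
baseline_tps = 40.0
optimized_low_tps, optimized_high_tps = 70.0, 80.0

# Hypothetical response length, chosen only for illustration.
response_tokens = 1000

print("speedup range:",
      speedup(baseline_tps, optimized_low_tps),
      "to",
      speedup(baseline_tps, optimized_high_tps))
print("seconds before:", seconds_for(response_tokens, baseline_tps))
print("seconds after (slow end):", seconds_for(response_tokens, optimized_low_tps))
print("seconds after (fast end):", seconds_for(response_tokens, optimized_high_tps))
```

The calculation ignores prompt processing time, so it only describes the generation phase of a request.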

This improvement was made possible by the application of techniques such as Quantization-Aware Training (QAT) and the use of a draft model for speculative decoding based on Multi-Token Prediction (MTP), served with llama-server. The inference context was set to 40960 tokens, with a Q8_0 KV cache, demonstrating the effectiveness of these optimizations even with large context windows. The Gemma 4 12B model, tested in both text-only and multimodal (mmproj) modes, also benefited from a similar speed increase, with near-instantaneous responses for multimodal interactions.
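The choice of a Q8_0 KV cache matters because, at a 40960-token context, the cache competes with the model weights for the 24 GiB of VRAM. The sketch below estimates that footprint. The layer count, KV head count, and head dimension are hypothetical placeholders, not the real Gemma 4 architecture, and the estimate assumes every layer caches the full context, which overstates the cost for models that use sliding-window attention on some layers. The Q8_0 size follows from its block layout: 32 eight-bit values plus one 16-bit scale, or 34 bytes per 32 elements.

```python
from dataclasses import dataclass

# Bytes per cached element for two llama.cpp cache types.
# Q8_0 packs 32 int8 values with one fp16 scale: 34 bytes per 32 elements.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32}

GIB = 1024 ** 3


@dataclass(frozen=True)
class ModelShape:
    """Attention shape of a decoder-only model (values below are hypothetical)."""
    n_layers: int
    n_kv_heads: int
    head_dim: int


def kv_cache_bytes(shape: ModelShape, n_ctx: int, cache_type: str) -> float:
    """Estimate KV cache size, assuming full attention in every layer."""
    # One key vector and one value vector per layer, per KV head, per token.
    elements_per_token = 2 * shape.n_layers * shape.n_kv_heads * shape.head_dim
    return elements_per_token * n_ctx * BYTES_PER_ELEMENT[cache_type]


# Hypothetical architecture, for illustration only.
example_shape = ModelShape(n_layers=48, n_kv_heads=8, head_dim=128)
n_ctx = 40960  # context size used in the tests described above

for cache_type in ("f16", "q8_0"):
    size_gib = kv_cache_bytes(example_shape, n_ctx, cache_type) / GIB
    print(f"{cache_type}: {size_gib:.2f} GiB")
```

Under these assumed dimensions, the Q8_0 cache needs a little over half the memory of an F16 cache, which is the headroom that lets a long context fit next to a quantized 31B model on a single 24 GiB card.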

Implications for On-Premise Deployment

These results have profound implications for LLM deployment strategies. For CTOs, DevOps leads, and infrastructure architects, the ability to achieve such high performance on hardware like the RTX 3090 means they can implement advanced AI solutions directly in their data centers or edge environments. This strengthens the ability to maintain data sovereignty, a fundamental aspect for many organizations operating in regulated contexts or handling sensitive information.

Direct control over the infrastructure also allows for more precise TCO management, transforming variable cloud operational costs into more predictable capital investments. Reduced reliance on external cloud services for LLM inference opens new opportunities for creating air-gapped or self-hosted environments where security and privacy are prioritized. For those evaluating on-premise deployments, AI-RADAR offers analytical frameworks on /llm-onpremise to assess the trade-offs between performance, costs, and compliance requirements.
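The trade-off between variable cloud costs and an upfront hardware purchase can be framed as a simple break-even calculation. The sketch below is a deliberately simplified model: every monetary figure in it is a hypothetical placeholder rather than a real price, and it ignores depreciation, staffing, and utilization.

```python
from typing import Optional


def breakeven_months(capex: float,
                     onprem_monthly_opex: float,
                     cloud_monthly_cost: float) -> Optional[float]:
    """Months until cumulative cloud spend exceeds on-premise spend.

    Returns None when on-premise never pays off, i.e. when the monthly
    on-premise running cost is not lower than the monthly cloud cost.
    """
    monthly_saving = cloud_monthly_cost - onprem_monthly_opex
    if monthly_saving <= 0:
        return None
    return capex / monthly_saving


# Hypothetical figures, for illustration only.
workstation_capex = 3000.0   # one-off hardware purchase
power_and_upkeep = 50.0      # on-premise running cost per month
cloud_inference_bill = 300.0 # equivalent cloud usage per month

print(breakeven_months(workstation_capex, power_and_upkeep, cloud_inference_bill))
```

With these placeholder values, the hardware pays for itself after 12 months. With a low enough cloud bill, the function returns None, which is the case where staying in the cloud remains cheaper.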

Future Prospects and Trade-offs

The continuous optimization of models and inference frameworks suggests that existing hardware still has headroom to exploit. The increasing availability of open-source models, together with innovation in quantization techniques and inference methods such as MTP, is democratizing access to advanced AI capabilities. This trend could lead to greater adoption of hybrid solutions, where more intensive training workloads remain in the cloud while inference is handled locally.

However, it is crucial to consider the trade-offs. While GPUs with 24 GB of VRAM are now more capable, hardware selection must always balance VRAM requirements, throughput, latency, and power consumption against the available budget. AI-RADAR is committed to presenting these constraints and various options without direct recommendations, providing decision-makers with the necessary information to choose the most suitable solution for their specific needs.