Optimizing Large Language Models on Local Hardware

Running Large Language Models (LLMs) on-premise is a growing challenge for companies that need to balance performance, cost, and data sovereignty. Consumer GPUs with large VRAM, such as the NVIDIA RTX 3090 with its 24GB, have made it possible to deploy sizable models like Qwen 3.6 27B locally. However, achieving good performance requires careful selection of the backend, the quantization technique, and the hardware configuration.

This article explores the results of a series of benchmarks conducted to identify the most efficient setup for the Qwen 3.6 27B model on a single 24GB GPU. The goal is to provide a practical guide for CTOs, DevOps leads, and infrastructure architects evaluating self-hosted solutions for their AI/LLM workloads, highlighting the trade-offs and specific constraints of this type of deployment.

Backend Comparison and Quantization Choices

The benchmarks compared several inference backends. llama.cpp served as the baseline, a well-established starting point for running LLMs in GGUF format. BeeLlama looked promising on paper but did not reach the expected speeds in the test configuration. ik_llama.cpp delivered the best overall performance, excelling in both decode and prefill speed and using the available VRAM efficiently. vLLM was excluded from the final comparison because of stability issues with long contexts on a single GPU.

On hardware with limited VRAM, the choice of quantization is crucial. The Qwen3.6-27B-MTP-IQ4_KS.gguf model proved particularly effective: IQ4_KS quantization preserved high model quality with a significantly smaller memory footprint than alternatives such as Unsloth's UD-Q4_K_XL, which required approximately 2.8 GiB of additional VRAM. This saving is critical when handling large contexts, such as the 156,000 tokens tested, and reasonable batch sizes.
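To illustrate why a few GiB matter at long context, the following is a minimal sketch of the usual KV-cache size estimate for a standard attention stack. The layer count, KV head count, and head dimension are hypothetical placeholders, not the real Qwen 3.6 27B values, and the function name is illustrative. The q8_0 cost of 34 bytes per 32 values reflects the GGML block layout (32 int8 values plus one fp16 scale).

```python
# Rough KV-cache size estimate for a standard transformer attention stack.
# All architecture numbers used in the demo are HYPOTHETICAL placeholders,
# not the real Qwen 3.6 27B values; read the real ones from the model metadata.

GIB = 1024 ** 3

# Approximate storage cost per cached element for two cache types.
# q8_0 packs 32 int8 values plus one fp16 scale into 34 bytes.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,
}


def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx_tokens, cache_type):
    """Return the estimated size in GiB of keys plus values for ctx_tokens."""
    elements_per_token = 2 * n_layers * n_kv_heads * head_dim  # 2 = keys and values
    return elements_per_token * ctx_tokens * BYTES_PER_ELEMENT[cache_type] / GIB


if __name__ == "__main__":
    # Hypothetical architecture, chosen only to make the arithmetic concrete.
    hypothetical = dict(n_layers=48, n_kv_heads=4, head_dim=128, ctx_tokens=156_000)
    for cache_type in ("f16", "q8_0"):
        size = kv_cache_gib(cache_type=cache_type, **hypothetical)
        print(f"{cache_type}: {size:.2f} GiB")
```

Under this formula a q8_0 cache takes roughly 53% of the space of an f16 cache for the same context. For models that mix attention types, only the layers that actually keep a KV cache should be counted in n_layers.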

Optimal Configuration Details and Performance

The best-performing configuration used ik_llama.cpp with the Qwen3.6-27B-MTP-IQ4_KS.gguf model. Key settings included a ctx-size of 156,000, cache-type-k and cache-type-v set to q8_0, flash-attn on, and the built-in multi-token prediction (MTP). For vision capabilities, the projector was kept on the CPU (--no-mmproj-offload) to save approximately 1.5 GiB of VRAM; it can be moved to the GPU for faster image processing if VRAM allows.
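The settings listed above can be collected into a launch command. The sketch below only assembles an argument list and prints it; it does not start anything. The binary name llama-server and the helper build_server_args are assumptions, the flag spellings are taken from the settings named in the text and may differ between builds, and the multi-token-prediction and projector-file options are left out because the article does not name their flags.

```python
import shlex


def build_server_args(model_path, ctx_size=156_000, kv_cache_type="q8_0",
                      projector_on_cpu=True):
    """Assemble a server command line from the settings described in the text.

    Assumptions: the binary is called "llama-server" and accepts these flags
    in this form; check the --help output of your build before relying on it.
    """
    args = [
        "llama-server",                    # assumed binary name
        "--model", model_path,
        "--ctx-size", str(ctx_size),
        "--cache-type-k", kv_cache_type,   # quantized key cache
        "--cache-type-v", kv_cache_type,   # quantized value cache
        "--flash-attn", "on",
    ]
    if projector_on_cpu:
        # Keeps the vision projector off the GPU to save VRAM.
        args.append("--no-mmproj-offload")
    return args


if __name__ == "__main__":
    print(shlex.join(build_server_args("Qwen3.6-27B-MTP-IQ4_KS.gguf")))
```

Keeping the command as a list rather than a single string avoids quoting problems if it is later passed to subprocess.run.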

Benchmarks were performed on a chat-completion task with a prompt of approximately 5,900 tokens and an output of 1,024 tokens, simulating a code-review activity. Results showed a prefill speed of about 1,261 tokens/s and a decode speed of 72.9 tokens/s, for a total wall time of 18.79 seconds. These numbers do not represent the theoretical best case, but they offer a realistic estimate of performance in a practical workload. For those evaluating on-premise deployments, understanding the trade-offs between configuration, VRAM, and performance is crucial.
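As a sanity check, the reported wall time can be roughly reconstructed from the two throughput figures. This is a minimal sketch that assumes wall time is simply prefill time plus decode time, ignoring request overhead; the function name is illustrative.

```python
def estimate_wall_time(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Estimate request wall time in seconds as prefill time plus decode time."""
    prefill_seconds = prompt_tokens / prefill_tps
    decode_seconds = output_tokens / decode_tps
    return prefill_seconds + decode_seconds


if __name__ == "__main__":
    # Figures reported in the benchmark above (the prompt length is approximate).
    total = estimate_wall_time(
        prompt_tokens=5_900,
        output_tokens=1_024,
        prefill_tps=1_261,
        decode_tps=72.9,
    )
    print(f"estimated wall time: {total:.2f} s")
```

With these inputs the estimate comes to about 18.7 seconds, close to the measured 18.79 seconds. Roughly three quarters of that time is decode, so decode speed dominates for this prompt-to-output ratio.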

Implications for On-Premise Deployments

The results of this analysis underscore the importance of judicious software and quantization choices to maximize the efficiency of LLM deployments on local hardware. VRAM optimization is a critical factor, especially for consumer cards, where every gigabyte counts for extending the context window or increasing batch size. The ability to run 27-billion-parameter models with extended contexts on a single RTX 3090 demonstrates the growing maturity of tools for local inference.

For organizations prioritizing data sovereignty, compliance, and control over their technology stacks, the self-hosted approach offers significant advantages. However, it requires deep technical expertise in configuration and optimization. The continuous evolution of backends like ik_llama.cpp and of advanced quantization techniques makes on-premise deployments increasingly competitive with cloud-based solutions, especially in terms of Total Cost of Ownership (TCO) for specific workloads.