Introduction: On-Premise LLMs and Hardware Challenges

The deployment of Large Language Models (LLMs) in on-premise environments is a strategic priority for many companies, driven by the need to ensure data sovereignty, comply with stringent regulatory requirements, and optimize Total Cost of Ownership (TCO). However, running these models locally, especially the larger ones, poses significant hardware challenges, particularly around GPU video memory (VRAM).

In this context, analyzing non-standard hardware configurations and their actual performance becomes crucial. A recent test explored the capabilities of the Qwen3.6 27B model on a local setup, providing valuable data for decision-makers evaluating self-hosted alternatives to cloud solutions. The objective is to understand whether mid-range hardware, such as NVIDIA RTX 5060 Ti 16GB cards, can support complex LLM workloads.

Configuration Details and Performance

The test configuration was based on two NVIDIA RTX 5060 Ti GPUs, each with 16GB of VRAM, for a total of 32GB of video memory. The execution environment was a Proxmox LXC container backed by 16 vCPUs and approximately 60GB of system RAM. On the software side, the system ran CUDA 13 and PyTorch 2.11 (nightly build), with vLLM (nightly build 0.19.2rc1.dev) serving the Qwen3.6 27B model, specifically the sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP variant, a checkpoint quantized to NVFP4 with Multi-Token Prediction (MTP) support.
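From these details, the serving setup can be sketched. The command below is a plausible reconstruction, not the tester's actual invocation: the flag values mirror the figures reported in this article, and the exact speculative-config method string for MTP depends on the vLLM build in use.

```python
# Plausible reconstruction of the serving command (assumptions, not the
# tester's exact invocation). Flag values mirror the figures in this article;
# the speculative-config "method" string for MTP varies across vLLM builds.
import subprocess

subprocess.run([
    "vllm", "serve", "sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP",
    "--tensor-parallel-size", "2",       # split the model across both RTX 5060 Ti cards
    "--gpu-memory-utilization", "0.95",  # 0.94 failed KV cache allocation (see below)
    "--max-model-len", "204800",         # the extended context verified in the test
    "--max-num-seqs", "1",               # single-sequence serving only
    "--speculative-config", '{"method": "mtp", "num_speculative_tokens": 3}',
])
```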

The recorded throughput was notable for this hardware class. With an 8,000-token context and speculative-config MTP n=1, the system achieved approximately 50-52 tokens per second. Enabling MTP n=3 raised throughput to 62-66 tokens per second. Even with a larger 32,000-token context, performance remained stable at 59-66 tokens per second. Operation with an extended context of up to 204,800 tokens was also verified, albeit with extremely tight VRAM margins. A 168,000-token needle/retrieval test completed in approximately 256 seconds, demonstrating that very large context windows are workable. VRAM usage was around 14.45 GiB per GPU at idle with the 204k context configured, rising to about 15.65 GiB per GPU after a 168k-token prefill.
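Throughput figures like these are straightforward to reproduce with a simple probe against vLLM's OpenAI-compatible endpoint. The sketch below assumes the default localhost:8000 address and is an illustrative reconstruction, not the tester's benchmark harness.

```python
# Minimal throughput probe against a running vLLM server. Assumes the
# default OpenAI-compatible endpoint on localhost:8000; prompt and token
# budget are arbitrary choices for illustration.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
resp = client.completions.create(
    model="sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP",
    prompt="Summarize the trade-offs of on-premise LLM deployment.",
    max_tokens=512,
)
elapsed = time.perf_counter() - start
generated = resp.usage.completion_tokens
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")
```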

Operational Constraints and Deployment Considerations

Despite the promising results, deploying the Qwen3.6 27B model on this configuration comes with real operational constraints. Fitting a 204,800-token context into the available VRAM proved extremely tight: a gpu_memory_utilization setting of 0.94 caused a KV cache allocation failure, while 0.95 allowed correct operation. That single-percentage-point margin indicates there is almost no memory headroom and that careful tuning is required.
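A back-of-the-envelope calculation shows why one percentage point matters: KV cache memory grows linearly with context length, so at 204,800 tokens the cache alone consumes a large share of the 32GB budget. The architecture numbers in the sketch below are illustrative placeholders, not the actual Qwen3.6 27B configuration.

```python
# Back-of-the-envelope KV cache sizing. The architecture numbers are
# illustrative placeholders, NOT the real Qwen3.6 27B configuration.
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context: int, bytes_per_elem: int) -> float:
    # 2x accounts for both the key and the value tensor, per layer, per token.
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1024**3

# Placeholders: 48 layers, 8 KV heads (GQA), head_dim 128, FP8 cache (1 byte).
total = kv_cache_gib(48, 8, 128, 204_800, 1)
print(f"total: {total:.2f} GiB, per GPU (TP=2): {total / 2:.2f} GiB")
# -> total: 18.75 GiB, per GPU (TP=2): 9.38 GiB
```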

The vLLM server startup took several minutes due to compilation and autotuning phases. Logs showed occasional FlashInfer autotuner out-of-memory (OOM) fallbacks during startup, although the server eventually became healthy. The tester also noted that using the TRITON_ATTN backend for the text path yielded better results. It is important to stress that this configuration is not designed for concurrency: max_num_seqs was set to 1, so the server handles a single sequence at a time. For those evaluating on-premise deployments, these trade-offs between hardware cost, performance, and memory headroom are crucial aspects to consider. AI-RADAR offers analytical frameworks on /llm-onpremise to evaluate these trade-offs in detail.
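Operationally, a deployment script should tolerate that slow startup rather than assume the server is immediately available. A minimal readiness probe is sketched below, assuming vLLM's standard /health endpoint and the default port; exporting VLLM_ATTENTION_BACKEND=TRITON_ATTN before launch is one way to pin the attention backend, though the accepted value can vary between vLLM builds.

```python
# Readiness probe for the slow startup described above. vLLM exposes a
# /health liveness endpoint; the 15-minute budget and localhost port are
# assumptions. Export VLLM_ATTENTION_BACKEND=TRITON_ATTN before launching
# to pin the attention backend (accepted values vary by vLLM build).
import time
import urllib.request

def wait_for_vllm(url: str = "http://localhost:8000/health",
                  timeout_s: int = 900) -> bool:
    """Poll until the vLLM server answers 200 OK or the budget expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # server still compiling/autotuning; keep waiting
        time.sleep(10)
    return False

if __name__ == "__main__":
    print("healthy" if wait_for_vllm() else "timed out")
```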

Future Prospects for Local Inference

In summary, the pairing of two NVIDIA RTX 5060 Ti 16GB GPUs proved surprisingly usable for inference with the Qwen3.6 27B model, provided the right checkpoint and runtime are chosen. Although the configuration offers little headroom, it demonstrates the feasibility of running sizeable LLMs on local, mid-range hardware, opening new possibilities for companies that want to retain control over their data and infrastructure.

These results are particularly relevant for organizations requiring air-gapped solutions or needing to comply with stringent data residency regulations. The ability to achieve acceptable performance with a more modest hardware investment than high-end GPUs require can significantly influence TCO decisions and LLM adoption strategy, balancing performance needs with control and cost considerations.