A Quad-GPU System for On-Premise LLM Inference

Running Large Language Models (LLMs) locally, with full control over data and infrastructure, is becoming a priority for many organizations. A recent project demonstrated that a high-performance quad-GPU system can be assembled from NVIDIA RTX 5060 Ti 16GB cards and tuned specifically for on-premise LLM inference. The build reflects growing interest in self-hosted solutions that offer data sovereignty and operational flexibility.

Building dedicated AI infrastructure locally is a strategic choice for organizations that must handle sensitive data or want to optimize their long-term Total Cost of Ownership (TCO). The configuration presented here, although an individual effort, offers useful insights for system architects and DevOps leads evaluating alternatives to cloud services for running complex models. The choice of consumer-grade yet capable components reflects a pragmatic approach to building AI compute capacity.

Technical Details and Hardware Optimizations

The core of the system consists of four NVIDIA RTX 5060 Ti GPUs, each with 16GB of VRAM, for 64GB in total, enough to host sizable LLMs. The chosen motherboard, an MSI MEG Z890 Unify-X, plays a fundamental role thanks to its PCIe 5.0 support. The board provides two M.2 ports with PCIe 5.0 x4 connectivity directly from the CPU lanes, plus two PCIe slots operating at x8 and x4 respectively, also connected directly to the CPU. Note that a PCIe 5.0 x4 link offers the same bandwidth as a PCIe 4.0 x8 link, because PCIe 5.0 doubles the per-lane transfer rate of the previous generation.
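The bandwidth equivalence can be checked with a few lines of arithmetic. The sketch below assumes the nominal per-lane transfer rates (16 GT/s for PCIe 4.0, 32 GT/s for PCIe 5.0) and 128b/130b line encoding, and it ignores protocol overhead, so real-world throughput will be somewhat lower. The function name `pcie_bandwidth_gb_s` is illustrative only.

```python
# Approximate one-direction PCIe link bandwidth from the per-lane transfer rate.
# PCIe 3.0 and later use 128b/130b line encoding; protocol overhead is ignored.

GT_PER_S = {"3.0": 8.0, "4.0": 16.0, "5.0": 32.0}  # giga-transfers/s per lane


def pcie_bandwidth_gb_s(generation: str, lanes: int) -> float:
    """Return the theoretical one-direction bandwidth of a link in GB/s."""
    raw_gbit = GT_PER_S[generation] * lanes   # raw gigabits per second
    usable_gbit = raw_gbit * 128 / 130        # remove 128b/130b encoding overhead
    return usable_gbit / 8                    # bits -> bytes


if __name__ == "__main__":
    for gen, lanes in [("4.0", 8), ("5.0", 4), ("5.0", 8)]:
        print(f"PCIe {gen} x{lanes}: {pcie_bandwidth_gb_s(gen, lanes):.2f} GB/s")
```

Under these assumptions, the PCIe 4.0 x8 and PCIe 5.0 x4 rows print the same value, which is the equivalence described above.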

To integrate the four GPUs, the builder used two M.2 adapters, which allowed two additional cards to be connected through the M.2 ports. Power is supplied by two separate Power Supply Units (PSUs): one dedicated to the main system and a second one, shared via a Y-splitter, that powers the two GPUs attached through the adapters. A further optimization is memory overclocking: most of the RTX 5060 Ti cards used accepted an overclock of +6000 MT/s (+3000 MHz), which raises memory bandwidth, a critical factor for LLM inference performance.
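To get a feel for what such an overclock could mean, peak memory bandwidth can be estimated as data rate times bus width. The sketch below is only an estimate under stated assumptions: a stock effective data rate of 28,000 MT/s and a 128-bit memory bus (neither figure comes from the article), and the further assumption that the reported +6000 MT/s offset adds directly to the effective data rate.

```python
def memory_bandwidth_gb_s(data_rate_mt_s: float, bus_width_bits: int) -> float:
    """Peak memory bandwidth in GB/s: transfers per second times bytes per transfer."""
    bytes_per_transfer = bus_width_bits / 8
    return data_rate_mt_s * 1e6 * bytes_per_transfer / 1e9


# Assumed values, not taken from the article:
STOCK_RATE_MT_S = 28_000   # assumed stock effective GDDR7 data rate
BUS_WIDTH_BITS = 128       # assumed memory bus width
# Offset reported in the article, assumed to add to the effective data rate:
OFFSET_MT_S = 6_000

stock = memory_bandwidth_gb_s(STOCK_RATE_MT_S, BUS_WIDTH_BITS)
overclocked = memory_bandwidth_gb_s(STOCK_RATE_MT_S + OFFSET_MT_S, BUS_WIDTH_BITS)
gain_percent = (overclocked / stock - 1) * 100

print(f"stock: {stock:.0f} GB/s, overclocked: {overclocked:.0f} GB/s "
      f"(+{gain_percent:.1f}%)")
```

If the assumptions hold, the gain is proportional to the data-rate increase; measured gains in token generation may differ and should be confirmed by benchmarks.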

Inference Goals and Next Steps

The primary objective of this hardware configuration is to run specific Large Language Models efficiently. In particular, the system is designed to handle the Qwen 3.6 27B model, which the builder intends to test in Q8 quantization and, potentially, in INT8, using frameworks such as vLLM or recent versions of llama.cpp. Running models of this scale with good throughput and low latency is essential for enterprise applications that require fast and reliable responses.
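A rough sizing check shows why 8-bit weights for a 27B-parameter model fit on four 16GB cards. The sketch below assumes the weights are split evenly across the GPUs and that each card exposes a full 16 GiB, both of which are simplifications; it also ignores KV cache, activations, and framework overhead, which must fit in the remaining headroom. The 8.5 bits-per-weight figure reflects the llama.cpp Q8_0 block layout (32 int8 values plus one 16-bit scale); the helper names are illustrative.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_footprint_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GiB."""
    return num_params * bits_per_weight / 8 / GIB


def per_gpu_headroom_gib(num_params: float, bits_per_weight: float,
                         num_gpus: int, vram_per_gpu_gib: float) -> float:
    """VRAM left on each GPU after an even split of the weights (assumption)."""
    weights_per_gpu = weight_footprint_gib(num_params, bits_per_weight) / num_gpus
    return vram_per_gpu_gib - weights_per_gpu


if __name__ == "__main__":
    params = 27e9  # parameter count implied by the "27B" model name
    for label, bits in [("plain 8-bit", 8.0), ("Q8_0 (8.5 bpw)", 8.5)]:
        total = weight_footprint_gib(params, bits)
        headroom = per_gpu_headroom_gib(params, bits, num_gpus=4, vram_per_gpu_gib=16)
        print(f"{label}: weights {total:.1f} GiB, headroom per GPU {headroom:.1f} GiB")
```

The headroom figure is an upper bound on what is available for KV cache and activations; actual usable memory depends on the framework and context length.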

The builder has already installed NVIDIA drivers compatible with open-source kernel modules that support Peer-to-Peer (P2P) communication between GPUs. This is a crucial step for multi-GPU performance: with P2P, cards exchange data directly over PCIe instead of staging it through host memory, which reduces transfer latency. The next steps are detailed benchmarks, both with and without P2P, to quantify the performance gain and validate the chosen architecture. These tests will provide valuable data for anyone considering a similar deployment.
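Before benchmarking, it can be useful to confirm which GPU pairs actually report P2P access. The sketch below is one possible check, not the builder's method: it builds an access matrix from any pairwise query function and, when PyTorch happens to be installed, uses `torch.cuda.can_device_access_peer` for the live query. The helpers `p2p_matrix` and `format_matrix` are hypothetical names introduced here.

```python
from typing import Callable, List


def p2p_matrix(num_gpus: int,
               can_access: Callable[[int, int], bool]) -> List[List[bool]]:
    """Return matrix[i][j] == True when GPU i reports direct access to GPU j."""
    return [
        [i != j and can_access(i, j) for j in range(num_gpus)]
        for i in range(num_gpus)
    ]


def format_matrix(matrix: List[List[bool]]) -> str:
    """Render the matrix as text: '-' on the diagonal, 'Y'/'N' elsewhere."""
    lines = []
    for i, row in enumerate(matrix):
        cells = " ".join(
            "-" if i == j else ("Y" if ok else "N") for j, ok in enumerate(row)
        )
        lines.append(f"GPU{i}: {cells}")
    return "\n".join(lines)


if __name__ == "__main__":
    try:
        import torch  # optional dependency, assumed to be a CUDA-enabled build
    except ImportError:
        print("PyTorch not installed; skipping the live P2P query")
    else:
        count = torch.cuda.device_count()
        print(format_matrix(p2p_matrix(count, torch.cuda.can_device_access_peer)))
```

Keeping the query function as a parameter means the same matrix logic can be reused with another backend or with recorded results when comparing the P2P and non-P2P runs.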

The On-Premise Perspective for AI

This project highlights a key trend in the tech sector: the increasing adoption of on-premise solutions for AI workloads, particularly LLM inference. For CTOs, DevOps leads, and infrastructure architects, the ability to build customized systems offers tangible advantages. Chief among them is data sovereignty: organizations can keep data within their own operational boundaries, which helps them meet privacy regulations and other compliance requirements.

Furthermore, a self-hosted deployment can lead to a more favorable TCO in the long run compared to the operational costs (OpEx) of cloud services, especially for predictable and constant workloads. While the initial investment (CapEx) may be higher, direct control over hardware and the ability to optimize each component for specific application needs can result in significant efficiencies. AI-RADAR continues to explore these trade-offs, providing analytical frameworks on /llm-onpremise to help companies evaluate the best deployment strategies for their Large Language Models.
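The CapEx versus OpEx trade-off can be framed as a simple break-even calculation. The sketch below is a deliberately simplified model: it assumes constant monthly costs and ignores hardware depreciation, resale value, and staff time. All figures in the usage example are placeholders chosen for illustration, not data from the project.

```python
import math
from typing import Optional


def break_even_months(capex: float,
                      onprem_monthly_opex: float,
                      cloud_monthly_cost: float) -> Optional[int]:
    """Months until cumulative cloud spend exceeds on-premise spend.

    Returns None when the cloud option is not more expensive per month,
    in which case the upfront investment is never recovered.
    """
    monthly_saving = cloud_monthly_cost - onprem_monthly_opex
    if monthly_saving <= 0:
        return None
    return math.ceil(capex / monthly_saving)


if __name__ == "__main__":
    # Placeholder figures for illustration only.
    months = break_even_months(capex=4000, onprem_monthly_opex=100,
                               cloud_monthly_cost=600)
    print(f"break-even after {months} months")
```

The model favors on-premise hardware most clearly for the predictable, constant workloads mentioned above, since idle hardware still incurs the full upfront cost.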