An On-Premise DGX Spark Cluster for LLMs
The generative AI landscape is pushing companies to evaluate increasingly powerful and specialized infrastructure. A recent community update showcased the completion of an on-premise cluster of 16 Nvidia DGX Spark units. The project reflects a commitment to self-hosted architectures in which direct control over hardware and data is paramount, addressing both data sovereignty requirements and total cost of ownership (TCO).
Configuring a system of this magnitude requires careful planning and meticulous execution. Although the deployment was described as time-consuming, the process proved smoother than anticipated. Each DGX Spark unit ships with Nvidia's customized Ubuntu-based operating system pre-installed and ready to use, which simplified the initial phases of the deployment.
Technical Details and High-Speed Connectivity
The cluster's architecture relies on high-speed network connectivity. Each DGX Spark connects to an FS N8510 switch via a single QSFP56 cable. The two NIC interfaces of each DGX Spark are bonded into a single port, creating a "dual rail" that, despite using one cable, delivers an effective bandwidth of 200 Gbps. Throughput measurements showed 100-111 Gbps per rail, in line with the advertised aggregate.
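To reproduce this kind of measurement, a minimal Python sketch along these lines could drive iperf3 against each rail and sum the results. This is only a sketch: it assumes iperf3 servers are already running on the peer node and that the two rails are reachable at the hypothetical addresses shown below.

```python
import json
import subprocess

# Hypothetical per-rail peer addresses on the bonded link; substitute the
# cluster's actual fabric addresses.
RAIL_PEERS = ["192.168.100.2", "192.168.101.2"]

def measure_rail_gbps(peer: str, seconds: int = 10) -> float:
    """Run an iperf3 client against `peer` and return the measured Gbps."""
    out = subprocess.run(
        ["iperf3", "-c", peer, "-t", str(seconds), "-J"],
        capture_output=True, text=True, check=True,
    )
    report = json.loads(out.stdout)
    return report["end"]["sum_received"]["bits_per_second"] / 1e9

if __name__ == "__main__":
    total = 0.0
    for peer in RAIL_PEERS:
        gbps = measure_rail_gbps(peer)
        total += gbps
        print(f"rail via {peer}: {gbps:.1f} Gbps")
    print(f"aggregate: {total:.1f} Gbps (link rated at 200 Gbps)")
```

Per-rail figures around 100 Gbps with an aggregate near 200 Gbps would match the numbers reported for this cluster.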
The choice of this configuration, as an alternative to solutions such as H100s or the GB300, was driven by the need to maximize unified memory capacity within the Nvidia ecosystem, a key factor when serving large LLMs. For instance, with eight of the cluster's nodes it was possible to serve the GLM-5.1-NVFP4 model, which requires 434 GB of memory, using eight-way tensor parallelism (TP=8). The team is currently running tests with models such as DeepSeek and Kimi to evaluate performance further.
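A quick back-of-envelope check shows why TP=8 makes this feasible. This is a sketch that assumes the commonly quoted 128 GB of unified memory per DGX Spark; the headroom figure is only indicative, since KV cache and activation sizes depend on context length and batch size.

```python
# Back-of-envelope memory check for eight-way tensor parallelism.
MODEL_WEIGHTS_GB = 434          # footprint quoted for the GLM-5.1-NVFP4 model
TP_DEGREE = 8                   # eight-way tensor parallelism across eight nodes
NODE_UNIFIED_MEMORY_GB = 128    # nominal DGX Spark unified memory (assumption)

weights_per_node = MODEL_WEIGHTS_GB / TP_DEGREE          # ~54 GB per node
headroom_per_node = NODE_UNIFIED_MEMORY_GB - weights_per_node

print(f"weights per node:  {weights_per_node:.1f} GB")
print(f"headroom per node: {headroom_per_node:.1f} GB for KV cache, activations, OS")
```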
Deployment Strategies and Rack Architecture
The long-term vision for this cluster includes a prefill/decode workload split. The Spark cluster is intended to handle the prefill phase, which demands massive parallel throughput. For the decode phase, which is dominated by memory bandwidth and benefits from low latency during token-by-token generation, the plan is to integrate two to four Mac Studio units with M5 Ultra chips once they become available. This hybrid on-premise strategy aims to optimize resource utilization across the different phases of LLM inference.
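In its simplest form, such a split amounts to routing each phase of a request to a different backend. The sketch below only illustrates the idea; the endpoints, names, and single-router design are hypothetical, not a description of the actual deployment.

```python
from dataclasses import dataclass

# Placeholder endpoints: the DGX Spark cluster would handle prefill, while the
# planned Mac Studio nodes would handle decode. URLs are hypothetical.
PREFILL_ENDPOINT = "http://spark-cluster.local:8000"
DECODE_ENDPOINT = "http://mac-decode.local:8000"

@dataclass
class InferenceRequest:
    prompt: str
    max_new_tokens: int

def backend_for(phase: str) -> str:
    """Pick a backend based on the inference phase being executed."""
    if phase == "prefill":
        return PREFILL_ENDPOINT   # compute-bound: wide parallel throughput
    if phase == "decode":
        return DECODE_ENDPOINT    # memory-bandwidth-bound: sequential generation
    raise ValueError(f"unknown phase: {phase!r}")

if __name__ == "__main__":
    req = InferenceRequest(prompt="Explain QSFP56 bonding.", max_new_tokens=256)
    print("prefill ->", backend_for("prefill"))
    print("decode  ->", backend_for("decode"))
```

A production setup would also have to hand the prefilled KV cache over from the prefill backend to the decode backend, which is the hard part of disaggregated serving and is glossed over here.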
For CTOs and infrastructure architects evaluating on-premise solutions, projects like this highlight the trade-offs between initial (CapEx) and operational (OpEx) costs, data sovereignty, and customization flexibility. The ability to keep data and models within one's own infrastructure offers significant advantages in terms of compliance and security, aspects often prioritized over the immediate scalability offered by the cloud.
Infrastructure Components and Future Outlook
The complete rack infrastructure was detailed, providing insight into the complexity of a deployment at this scale. In addition to the 16 DGX Sparks, the rack includes an OPNSense firewall, Mikrotik 10 Gb and 100 Gb switches for the internet uplink and HPC-NAS connectivity, a 374 TB QNAP NAS with U.2 drives, a management server, and two workstations each with dual Nvidia GeForce RTX 4090 GPUs. A SuperMicro station with four H100 NVL GPUs and a GH200 unit round out an unusually varied and powerful development and inference environment.
This type of architecture, integrating different generations and types of Nvidia hardware and beyond, reflects the trend towards building highly specialized AI infrastructures optimized for specific workloads. The ability to orchestrate such an on-premise ecosystem offers granular control over performance and costs, an increasingly relevant factor for companies investing in Large Language Models.