An On-Premise DGX Spark Cluster for LLMs
The generative AI landscape is pushing companies to evaluate increasingly powerful and specialized infrastructure. A recent community update showcased the completion of an on-premise cluster of 16 Nvidia DGX Spark units. The project reflects a commitment to self-hosted architectures in which direct control over hardware and data is paramount, addressing both data sovereignty requirements and total cost of ownership (TCO).
Configuring a system of this magnitude requires careful planning and meticulous execution. Although the deployment was described as time-consuming, the process proved smoother than anticipated. Each DGX Spark unit ships with Nvidia's customized Ubuntu-based operating system pre-installed and ready to use, which simplified the initial phases of the deployment.
Technical Details and High-Speed Connectivity
The cluster's architecture relies on high-speed network connectivity. Each DGX Spark connects to an FS N8510 switch via a single QSFP56 cable. The two NIC interfaces of each DGX Spark are bonded into a single port, creating a "dual rail" that, despite using one cable, delivers an effective bandwidth of 200 Gbps. Throughput measurements showed 100-111 Gbps per rail, in line with the advertised aggregate.
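To reproduce this kind of measurement, a minimal Python sketch along these lines could drive iperf3 against each rail and sum the results. This is only a sketch: it assumes iperf3 servers are already running on the peer node and that the two rails are reachable at the hypothetical addresses shown below.

```python
import json
import subprocess

# Hypothetical per-rail peer addresses on the bonded link; substitute the
# cluster's actual fabric addresses.
RAIL_PEERS = ["192.168.100.2", "192.168.101.2"]

def measure_rail_gbps(peer: str, seconds: int = 10) -> float:
    """Run an iperf3 client against `peer` and return the measured Gbps."""
    out = subprocess.run(
        ["iperf3", "-c", peer, "-t", str(seconds), "-J"],
        capture_output=True, text=True, check=True,
    )
    report = json.loads(out.stdout)
    return report["end"]["sum_received"]["bits_per_second"] / 1e9

if __name__ == "__main__":
    total = 0.0
    for peer in RAIL_PEERS:
        gbps = measure_rail_gbps(peer)
        total += gbps
        print(f"rail via {peer}: {gbps:.1f} Gbps")
    print(f"aggregate: {total:.1f} Gbps (link rated at 200 Gbps)")
```

Per-rail figures around 100 Gbps with an aggregate near 200 Gbps would match the numbers reported for this cluster.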
The choice of this configuration, as an alternative to solutions such as H100s or the GB300, was driven by the need to maximize unified memory capacity within the Nvidia ecosystem, a key factor when serving large LLMs. For instance, with eight of the cluster's nodes it was possible to serve the GLM-5.1-NVFP4 model, which requires 434 GB of memory, using eight-way tensor parallelism (TP=8). The team is currently running tests with models such as DeepSeek and Kimi to evaluate performance further.
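A quick back-of-envelope check shows why TP=8 makes this feasible. This is a sketch that assumes the commonly quoted 128 GB of unified memory per DGX Spark; the headroom figure is only indicative, since KV cache and activation sizes depend on context length and batch size.

```python
# Back-of-envelope memory check for eight-way tensor parallelism.
MODEL_WEIGHTS_GB = 434          # footprint quoted for the GLM-5.1-NVFP4 model
TP_DEGREE = 8                   # eight-way tensor parallelism across eight nodes
NODE_UNIFIED_MEMORY_GB = 128    # nominal DGX Spark unified memory (assumption)

weights_per_node = MODEL_WEIGHTS_GB / TP_DEGREE          # ~54 GB per node
headroom_per_node = NODE_UNIFIED_MEMORY_GB - weights_per_node

print(f"weights per node:  {weights_per_node:.1f} GB")
print(f"headroom per node: {headroom_per_node:.1f} GB for KV cache, activations, OS")
```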
Deployment Strategies and Rack Architecture
The long-term vision for this cluster includes a prefill/decode workload split. The Spark cluster is intended to handle the prefill phase, which demands massive parallel throughput. For the decode phase, which is dominated by memory bandwidth and benefits from low latency during token-by-token generation, the plan is to integrate two to four Mac Studio units with M5 Ultra chips once they become available. This hybrid on-premise strategy aims to optimize resource utilization across the different phases of LLM inference.
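In its simplest form, such a split amounts to routing each phase of a request to a different backend. The sketch below only illustrates the idea; the endpoints, names, and single-router design are hypothetical, not a description of the actual deployment.

```python
from dataclasses import dataclass

# Placeholder endpoints: the DGX Spark cluster would handle prefill, while the
# planned Mac Studio nodes would handle decode. URLs are hypothetical.
PREFILL_ENDPOINT = "http://spark-cluster.local:8000"
DECODE_ENDPOINT = "http://mac-decode.local:8000"

@dataclass
class InferenceRequest:
    prompt: str
    max_new_tokens: int

def backend_for(phase: str) -> str:
    """Pick a backend based on the inference phase being executed."""
    if phase == "prefill":
        return PREFILL_ENDPOINT   # compute-bound: wide parallel throughput
    if phase == "decode":
        return DECODE_ENDPOINT    # memory-bandwidth-bound: sequential generation
    raise ValueError(f"unknown phase: {phase!r}")

if __name__ == "__main__":
    req = InferenceRequest(prompt="Explain QSFP56 bonding.", max_new_tokens=256)
    print("prefill ->", backend_for("prefill"))
    print("decode  ->", backend_for("decode"))
```

A production setup would also have to hand the prefilled KV cache over from the prefill backend to the decode backend, which is the hard part of disaggregated serving and is glossed over here.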
For CTOs and infrastructure architects evaluating on-premise solutions, projects like this highlight the trade-offs between initial (CapEx) and operational (OpEx) costs, data sovereignty, and customization flexibility. The ability to keep data and models within one's own infrastructure offers significant advantages in terms of compliance and security, aspects often prioritized over the immediate scalability offered by the cloud.
Infrastructure Components and Future Outlook
The complete rack infrastructure was detailed, providing insight into the complexity of a deployment at this scale. In addition to the 16 DGX Sparks, the rack includes an OPNSense firewall, Mikrotik 10 Gb and 100 Gb switches for the internet uplink and HPC-NAS connectivity, a 374 TB QNAP NAS with U.2 drives, a management server, and two workstations each with dual Nvidia GeForce RTX 4090 GPUs. A SuperMicro station with four H100 NVL GPUs and a GH200 unit round out an unusually varied and powerful development and inference environment.
This type of architecture, integrating different generations and types of Nvidia hardware and beyond, reflects the trend towards building highly specialized AI infrastructures optimized for specific workloads. The ability to orchestrate such an on-premise ecosystem offers granular control over performance and costs, an increasingly relevant factor for companies investing in Large Language Models.