Zai Overhauls Network Architecture for LLM Inference: Optimized Performance and Costs

Optimizing network infrastructure is a crucial challenge for companies running large-scale artificial intelligence workloads, especially in on-premise deployments. Zai recently demonstrated how a targeted change in network architecture can improve performance while reducing network hardware costs. The company replaced the standard network configuration on a thousand-GPU cluster, used for GLM-5.1 model inference, with a proprietary solution called ZCube.

The initiative was developed in collaboration with Tsinghua University and HarnetsAI. Production data indicates a 15% increase in GPU inference throughput and a 40.6% drop in P99 time-to-first-token latency. Zai also reported a 33% reduction in the cost of switches and optical modules, a rare case in which a performance improvement comes with lower hardware spending.

The Technical Detail Behind ZCube

The problem Zai addressed lies in the traffic generated by Prefill-Decode disaggregated inference, which runs the prefill and decode phases on separate nodes. While efficient for serving Large Language Models, this approach creates highly asymmetric traffic patterns between cluster nodes, particularly for KV Cache transfers. Traditional network topologies, such as the ROFT (Rail-Optimized Fat-Tree) configuration, are often optimized for training workloads, which exhibit more balanced traffic patterns. With disaggregated inference, traffic does not match the static rail mapping, so hotspots form on specific Leaf switches and PFC (Priority Flow Control) backpressure accumulates.
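
The hotspot effect can be illustrated with a toy model. The sketch below is not Zai's measurement method: it assumes a rail-style mapping in which each flow lands on the Leaf switch matching the destination GPU's rail index, and the flow lists, rail count, and function names are all hypothetical. It compares evenly spread traffic with a skewed pattern in which every sender targets GPUs on only two rails.

```python
from collections import Counter

def leaf_load(flows):
    """Sum bytes arriving at each leaf switch.

    Assumption: the leaf is identified by the destination GPU's rail index.
    Each flow is a (src_rail, dst_rail, size) tuple.
    """
    load = Counter()
    for _src_rail, dst_rail, size in flows:
        load[dst_rail] += size
    return load

def imbalance(load, rails):
    """Ratio of the busiest leaf to the mean over all rails (1.0 means even)."""
    mean = sum(load.values()) / rails
    return max(load.values()) / mean

RAILS = 8  # hypothetical number of rails (leaf switches)

# Balanced, training-style traffic: every rail receives the same volume.
balanced = [(r, r, 100) for r in range(RAILS)]

# Hypothetical skewed transfers: senders on all rails target rails 0 and 1 only.
skewed = [(r, r % 2, 100) for r in range(RAILS)]

balanced_ratio = imbalance(leaf_load(balanced), RAILS)
skewed_ratio = imbalance(leaf_load(skewed), RAILS)
print(balanced_ratio)  # 1.0
print(skewed_ratio)    # 4.0
```

In this made-up example the busiest Leaf switch carries four times the average load, which is the kind of concentration that can trigger flow-control backpressure.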

ZCube addresses this problem with a fully flattened architecture that removes the network's Spine layer entirely. In its place, it uses a complete bipartite interconnect between two switch groups. This configuration eliminates an entire category of congestion that ROFT architectures cannot avoid by design. The improvements were achieved with the GPUs, the software stack, and the GLM-5.1 model left unchanged. The only variable modified was the underlying network architecture, which points to the untapped potential of infrastructure optimization.
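
A complete bipartite interconnect can be sketched as a small graph. The code below is an illustration only, not ZCube's actual wiring: the group sizes and switch names are hypothetical. It links every switch in one group to every switch in the other, then uses breadth-first search to count switch-to-switch hops.

```python
from collections import deque
from itertools import product

def build_complete_bipartite(group_a, group_b):
    """Link every switch in group_a to every switch in group_b."""
    adjacency = {switch: set() for switch in (*group_a, *group_b)}
    for a, b in product(group_a, group_b):
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency

def hop_count(adjacency, src, dst):
    """Breadth-first search for the minimum number of links between two switches."""
    distance = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return distance[node]
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    raise ValueError(f"{dst} is unreachable from {src}")

# Hypothetical group sizes, chosen only to keep the example small.
group_a = [f"A{i}" for i in range(4)]
group_b = [f"B{i}" for i in range(4)]
fabric = build_complete_bipartite(group_a, group_b)

# Each link is stored twice in the adjacency map, once per endpoint.
link_count = sum(len(peers) for peers in fabric.values()) // 2
print(link_count)                     # 16
print(hop_count(fabric, "A0", "B3"))  # 1
print(hop_count(fabric, "A0", "A1"))  # 2
```

In this toy graph, switches in opposite groups are one link apart and switches in the same group are two links apart, with no separate Spine tier in between.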

Implications for On-Premise Deployments

Zai's results offer valuable insights for CTOs, DevOps leads, and infrastructure architects evaluating on-premise LLM deployments. Raising the throughput of an existing GPU cluster while reducing network hardware costs is a significant competitive advantage. It also runs against the common perception that higher performance requires a proportionally larger hardware investment. Network optimization can therefore be a key factor in improving the Total Cost of Ownership (TCO) of AI systems.

For organizations that prioritize data sovereignty, regulatory compliance, or air-gapped environments, the efficiency of self-hosted infrastructure becomes even more critical. Extracting more value from existing hardware through architectural changes strengthens the argument for on-premise deployments. AI-RADAR, in its section dedicated to /llm-onpremise, offers analytical frameworks to help decision-makers evaluate these trade-offs, in which network efficiency is a fundamental component.

Future Prospects for AI Infrastructure

Zai's case underscores an emerging trend in the artificial intelligence landscape: the growing importance of infrastructure engineering. While much attention focuses on developing ever larger and more capable models, how efficiently these models run in production depends largely on the robustness and optimization of the underlying infrastructure. Innovation in networking, computing, and storage is essential to unlock the full potential of LLMs, especially where control, security, and cost efficiency are priorities.

Zai's experience demonstrates that not every performance challenge requires a massive GPU upgrade or a shift to new-generation hardware. Sometimes the most effective solution is a careful redesign of existing components, such as the network architecture. This approach maximizes the return on hardware investment and makes LLM adoption more practical in enterprise environments with specific constraints, reinforcing network engineering as a fundamental pillar of on-premise AI.