Huawei: DeepSeek-V2 1.6T Post-Training with 1,000 Ascend 910C Chips

Huawei and DeepSeek-V2 Post-Training: Unprecedented Scale

A team led by Huawei recently announced a milestone in the Large Language Model (LLM) landscape: the completion of post-training for the DeepSeek-V2 model, reported at 1.6 trillion parameters. The work ran on a hardware infrastructure of 1,000 Huawei Ascend 910C chips. The announcement highlights both the growing computational capabilities required to develop next-generation LLMs and Huawei's commitment to positioning itself as a key player in AI software and hardware alike.

Post-training is a crucial phase in an LLM's lifecycle, in which a model pre-trained on a general data corpus is further refined on more specific datasets or for particular tasks. This process is fundamental for improving the model's performance and its adherence to targeted application requirements. At this scale it still demands substantial computational resources, although typically fewer than the initial pre-training phase. The choice of DeepSeek-V2, a model already known for its innovative architecture and scalability, underscores the ambition to push the boundaries of current processing capabilities.

The Strategic Role of Ascend 910C Chips

At the heart of this endeavor are Huawei Ascend 910C chips, AI accelerators designed for intensive training and inference workloads. The use of 1,000 units of these processors is no small detail: it implies managing a massive computing cluster, with stringent requirements in terms of power, cooling, and high-speed network interconnection. The Ascend 910C chips represent Huawei's answer to the demand for specialized silicon for AI, offering an alternative to dominant market solutions and reinforcing the company's strategy of technological self-sufficiency.

The ability to orchestrate such a large number of accelerators for a single post-training project demonstrates considerable infrastructural and software maturity. This type of large-scale deployment is typically associated with proprietary data centers or self-hosted infrastructures, where direct control over hardware and the operating environment is paramount. Managing a 1,000-chip cluster requires advanced expertise in areas such as training parallelism (e.g., tensor parallelism and pipeline parallelism), on-device memory (HBM) management, and data throughput optimization.
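To make the parallelism and memory point concrete, here is a minimal back-of-the-envelope sketch in Python. It is not Huawei's actual configuration: the tensor-parallel and pipeline-parallel degrees and the 2-bytes-per-parameter (bf16) weight format are assumptions chosen for illustration, and the estimate covers weights only, ignoring optimizer state, activations, and any expert parallelism.

```python
NUM_PARAMS = 1_600_000_000_000  # 1.6 trillion parameters, as reported in the article
NUM_CHIPS = 1_000               # cluster size, as reported in the article


def data_parallel_degree(num_chips, tensor_parallel, pipeline_parallel):
    """Number of full model replicas once each replica spans TP x PP chips."""
    chips_per_replica = tensor_parallel * pipeline_parallel
    if num_chips % chips_per_replica != 0:
        raise ValueError("TP x PP must divide the number of chips evenly")
    return num_chips // chips_per_replica


def weight_memory_per_chip_gb(num_params, bytes_per_param,
                              tensor_parallel, pipeline_parallel):
    """Weights-only memory per chip, assuming an even split across TP x PP."""
    total_bytes = num_params * bytes_per_param
    return total_bytes / (tensor_parallel * pipeline_parallel) / 1e9


# Hypothetical layout: 8-way tensor parallelism, 25 pipeline stages.
tp, pp = 8, 25
dp = data_parallel_degree(NUM_CHIPS, tp, pp)
per_chip_gb = weight_memory_per_chip_gb(NUM_PARAMS, 2, tp, pp)
print(f"replicas={dp}, weights per chip={per_chip_gb:.1f} GB")
```

Under these assumed degrees, each model replica spans 200 chips, leaving 5 data-parallel replicas, and each chip holds about 16 GB of weights before any training state is counted. Changing the degrees shows how quickly per-chip memory pressure shifts, which is why layout choices matter at this scale.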

Implications for On-Premise Deployments and Data Sovereignty

Huawei's announcement has significant implications for organizations evaluating on-premise or hybrid LLM deployment strategies. The ability to perform post-training of 1.6-trillion-parameter models on proprietary infrastructures, rather than relying exclusively on cloud services, offers substantial advantages in terms of data sovereignty, regulatory compliance, and control over long-term total cost of ownership (TCO). For companies with stringent security requirements or those operating in regulated sectors, keeping data and models within their own infrastructural boundaries is often an absolute priority.

A deployment of this magnitude requires considerable initial investment (CapEx) but can result in a lower TCO compared to the recurring costs (OpEx) of cloud services, especially for constant and predictable workloads. However, it also entails managing the entire pipeline internally, from hardware procurement to maintenance and software optimization. For those evaluating on-premise deployments, AI-RADAR offers analytical frameworks on /llm-onpremise to assess the trade-offs between control, performance, and costs, helping readers make informed decisions without issuing direct recommendations.
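The CapEx versus OpEx trade-off described above can be sketched as a simple break-even calculation. The Python below is an illustrative model only: every monetary figure is a hypothetical placeholder rather than data from the announcement, and it assumes flat monthly costs with no discounting, hardware depreciation, or utilization effects.

```python
import math


def cumulative_on_prem_cost(capex, monthly_opex, months):
    """Up-front investment plus recurring on-premise running costs."""
    return capex + monthly_opex * months


def cumulative_cloud_cost(monthly_cloud_cost, months):
    """Recurring cloud spend with no up-front investment."""
    return monthly_cloud_cost * months


def break_even_month(capex, monthly_on_prem_opex, monthly_cloud_cost):
    """First whole month at which on-premise is no more expensive than cloud.

    Returns None when on-premise running costs are not lower than cloud,
    in which case the up-front investment is never recovered.
    """
    monthly_saving = monthly_cloud_cost - monthly_on_prem_opex
    if monthly_saving <= 0:
        return None
    return math.ceil(capex / monthly_saving)


# Hypothetical figures, chosen only to illustrate the arithmetic.
capex = 12_000_000            # hardware and facility build-out
on_prem_opex = 200_000        # power, cooling, staff per month
cloud_cost = 700_000          # equivalent rented capacity per month

month = break_even_month(capex, on_prem_opex, cloud_cost)
print(f"break-even after {month} months")
```

With these placeholder inputs the on-premise option catches up with cloud spend at month 24. The model is most meaningful for the constant, predictable workloads mentioned above; bursty usage would shorten the months in which cloud capacity is actually billed and push the break-even point further out.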

Future Prospects and the Race for AI Silicon

This Huawei milestone is set against a global backdrop of intense competition to develop ever larger and more powerful LLMs, and the silicon needed to run them. Dependence on a limited number of AI hardware providers is a growing concern for many nations and companies, driving diversification and the development of proprietary solutions. The Ascend 910C is a prime example of this trend, demonstrating that viable alternatives exist to address AI's computational challenges.

Future challenges include not only the continuous pursuit of more efficient model architectures and more powerful hardware but also the management of the enormous energy consumption and operational complexity associated with clusters of this scale. A company's ability to control the entire technology stack, from chip to model, can represent a strategic competitive advantage, ensuring greater agility and security. The DeepSeek-V2 post-training operation with 1,000 Ascend 910C chips is a clear indicator of the direction in which the industry is heading: towards greater autonomy and distributed computing capacity for AI.