BitCPM-CANN: Native 1.58-bit LLM Training on Ascend NPU

As Large Language Models (LLMs) evolve, efficiency and accessibility are getting more attention. The BitCPM-CANN project fits this trend: it proposes a systematic approach to training 1.58-bit (ternary) LLMs directly on the Huawei Ascend NPU platform. The research addresses two open questions for extreme low-bit LLMs: whether they can maintain high performance on complex reasoning tasks at on-device scales, and whether end-to-end 1.58-bit training is possible outside the CUDA ecosystem.

To answer these questions, the team adapted its GPU-based training pipeline to work with CANN, MindSpeed, and Megatron-LM. This allowed it to train four model variants (BitCPM-CANN-0.5B, 1B, 3B, and 8B) that are strictly aligned with their full-precision MiniCPM4 counterparts in both architecture and pre-training data. The goal is to demonstrate that low-bit LLMs can be trained and deployed efficiently on this hardware, which opens new options for self-hosted and data sovereignty-driven scenarios.

Technical Details and Performance

The models are trained with 1.58-bit Quantization-Aware Training (QAT): each weight is constrained to one of three values, -1, 0, or +1 (log2 3 ≈ 1.58 bits), and quantization is simulated during training so that the model learns to compensate for the precision loss. Across a set of 11 benchmarks covering commonsense reasoning, domain knowledge, and mathematics, the 1B, 3B, and 8B variants of BitCPM-CANN retained between 95.7% and 97.2% of the full-precision models' performance. Notably, the 3B variant achieved performance parity on the BBH benchmark, while the 3B and 8B variants recovered almost all performance on GSM8K. The 0.5B variant retained 90.1% of performance, suggesting that for models below one billion parameters, the model's capacity itself, rather than the quantizer, is the bottleneck.
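
As a rough illustration of what a QAT forward pass does to a group of weights, the sketch below assumes an absmean ternary quantizer of the kind used in BitNet b1.58. The article does not say which quantizer BitCPM-CANN uses, and the function names `ternary_quantize` and `fake_quantize` are hypothetical. In real QAT the rounding step is typically bypassed in the backward pass with a straight-through estimator, which this plain-Python sketch does not show.

```python
import math


def ternary_quantize(weights, eps=1e-8):
    """Map floats to codes in {-1, 0, +1} with one shared scale.

    Assumption: absmean scaling (scale = mean of |w|), not confirmed
    by the article as the BitCPM-CANN quantizer.
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clip into the ternary range.
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def fake_quantize(weights):
    """QAT-style forward pass: quantize, then dequantize back to floats."""
    codes, scale = ternary_quantize(weights)
    return [c * scale for c in codes]


if __name__ == "__main__":
    # Three states per weight carry log2(3) bits of information.
    print(f"bits per ternary weight: {math.log2(3):.3f}")
    codes, scale = ternary_quantize([0.9, -0.05, -1.2, 0.4])
    print(codes, round(scale, 4))
```

The forward pass sees only the dequantized values, so the loss reflects the quantization error while the underlying full-precision weights continue to receive updates.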

Adding QAT reduced training throughput by only about 4.5% (148 TFLOP/s versus 155 TFLOP/s per NPU). With overhead this small, ternary training becomes a plausible default configuration when efficiency matters. At inference time, 1.58-bit quantization allows up to an 8x reduction in weight memory, which translates to roughly a 6x end-to-end reduction once scaling factors are included. This is a crucial advantage for deployments on hardware with limited device memory.
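
The throughput and weight-memory figures above can be checked with a few lines of arithmetic. The sketch below assumes 16-bit full-precision weights, ternary codes stored in 2 bits each, and a hypothetical group size of 128 weights per 16-bit scale. None of these storage details come from the article, and the sketch does not try to reproduce the reported 6x end-to-end figure, whose exact accounting is not given.

```python
def qat_overhead(baseline_tflops, qat_tflops):
    """Relative throughput loss of QAT versus the baseline."""
    return (baseline_tflops - qat_tflops) / baseline_tflops


def weight_compression(full_bits=16, packed_bits=2,
                       scale_bits=0, group_size=1):
    """Weight-memory ratio, optionally amortizing one scale per group.

    Assumption: 16-bit baseline and 2-bit packed codes; the group
    size and scale width are illustrative, not taken from the article.
    """
    effective_bits = packed_bits + scale_bits / group_size
    return full_bits / effective_bits


def pack_ternary(codes):
    """Pack codes from {-1, 0, +1} into bytes, four 2-bit fields each."""
    out = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for j, c in enumerate(codes[i:i + 4]):
            byte |= (c + 1) << (2 * j)  # store -1, 0, +1 as 0, 1, 2
        out.append(byte)
    return bytes(out)


if __name__ == "__main__":
    print(f"QAT overhead: {qat_overhead(155, 148):.1%}")
    print(f"codes only: {weight_compression():.2f}x")
    with_scales = weight_compression(scale_bits=16, group_size=128)
    print(f"with per-group scales: {with_scales:.2f}x")
    packed = pack_ternary([1, 0, -1, 1] * 4)
    print(f"16 weights: 32 bytes at 16-bit, {len(packed)} bytes packed")
```

Under these assumptions the codes alone give the 8x ratio, and any per-group scale pushes the effective ratio somewhat below it.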

Implications for On-Premise Deployment

The ability to train and deploy low-bit LLMs on Ascend NPUs matters for organizations considering self-hosted solutions. Independence from NVIDIA's CUDA ecosystem allows greater hardware diversification and reduces reliance on a single vendor. For CTOs, DevOps leads, and infrastructure architects, this means being able to explore alternatives that could offer a more advantageous Total Cost of Ownership (TCO), especially where hardware procurement and management constraints drive the decision.

The reduction in memory required for inference (roughly 6x end-to-end) is a decisive factor for deploying LLMs on edge devices or on servers with less expensive hardware configurations. It improves operational efficiency and also strengthens data sovereignty: companies can keep their models and data within their own infrastructure boundaries, which helps with compliance under regulations such as GDPR and makes air-gapped environments practical. For those evaluating on-premise deployments, AI-RADAR offers analytical frameworks on /llm-onpremise to assess the trade-offs between different architectures and hardware solutions.

Future Prospects and Concluding Remarks

The BitCPM-CANN project represents a significant step toward a reusable low-bit training infrastructure for the Ascend ecosystem. It demonstrates that highly quantized LLMs can achieve competitive performance, even on complex reasoning tasks, without sacrificing much accuracy. The work validates the effectiveness of ternary quantization and expands the options available to companies that want to implement AI solutions more efficiently and with more control.

The availability of an end-to-end 1.58-bit training system on a "domestic NPU" scaled up to 8 billion parameters is a milestone. It underscores the growing maturity of alternative hardware and software in the AI field, offering technology decision-makers more flexible tools to build and deploy LLMs that meet specific performance, cost, and security requirements, particularly for on-premise workloads.