BitMamba-2 is a newly introduced model that combines the Mamba-2 State Space Model (SSM) architecture with BitNet's 1.58-bit ternary quantization.

The primary goals are to demonstrate that ternary scaling laws hold even for SSMs, and to enable efficient inference on modest hardware, from legacy machines to edge devices, without requiring high-end GPUs.
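
Part of what makes an SSM a good fit for that goal is that it carries a fixed-size recurrent state rather than a KV cache that grows with context length. A minimal sketch of the linear recurrence that Mamba-2 builds on (scalar-state form; the names here are illustrative assumptions, and the real model uses multi-head, data-dependent parameters):

```cpp
#include <vector>

// Scalar-state linear SSM recurrence (illustrative only):
//   h_t = a_t * h_{t-1} + b_t * x_t   (decay old state, inject input)
//   y_t = c_t * h_t                   (readout)
std::vector<float> ssm_scan(const std::vector<float>& a,
                            const std::vector<float>& b,
                            const std::vector<float>& c,
                            const std::vector<float>& x) {
    std::vector<float> y(x.size());
    float h = 0.0f; // fixed-size state: memory does not grow with sequence length
    for (size_t t = 0; t < x.size(); ++t) {
        h = a[t] * h + b[t] * x[t];
        y[t] = c[t] * h;
    }
    return y;
}
```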

Key Specs

  • Architecture: Mamba-2 + BitNet b1.58 (ternary weights {-1, 0, +1}; a quantization sketch follows this list)
  • Training: Trained from scratch on 150B tokens (FineWeb-Edu, Cosmopedia, Stack-Dedup) on a Google TPU v6e-8.
  • Performance: The 1B-parameter model significantly outperforms the 255M baseline, validating the ternary scaling laws.
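
BitNet b1.58 quantizes each weight tensor with an absmean rule: divide every weight by the tensor's mean absolute value, then round and clip to {-1, 0, +1}. A minimal C++ sketch of that rule (a hypothetical standalone helper, not the project's actual training code):

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// BitNet b1.58 absmean ternary quantization (illustrative sketch):
//   q_i = clip(round(w_i / (mean|w| + eps)), -1, +1)
std::vector<int8_t> quantize_ternary(const std::vector<float>& w,
                                     float& scale_out) {
    float absmean = 0.0f;
    for (float v : w) absmean += std::fabs(v);
    absmean = absmean / static_cast<float>(w.size()) + 1e-8f; // avoid div by 0

    std::vector<int8_t> q(w.size());
    for (size_t i = 0; i < w.size(); ++i) {
        float s = std::round(w[i] / absmean); // nearest integer
        q[i] = static_cast<int8_t>(std::fmax(-1.0f, std::fmin(1.0f, s)));
    }
    scale_out = absmean; // dequantize as w ≈ scale_out * q
    return q;
}
```

Each ternary weight can then be packed into 2 bits of storage, with one float scale per tensor.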

A custom C++ inference engine was developed for these models. On a consumer Intel Core i3-12100F CPU, it achieves the following (a simplified kernel sketch follows the list):

  • BitMamba-2-1B: ~53 tokens/sec (621 MB RAM)
  • BitMamba-2-255M: ~146 tokens/sec (252 MB RAM)
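
Much of that CPU speed comes from the arithmetic that ternary weights allow: a matrix-vector product over {-1, 0, +1} needs no multiplications, only additions and subtractions. A minimal, unpacked C++ sketch of such a kernel (illustrative only; the actual engine presumably packs weights at 2 bits each and vectorizes the loop):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Matrix-vector product with ternary weights (illustrative sketch).
// Weights in {-1, 0, +1} reduce each multiply-accumulate to an add,
// a subtract, or a skip; a per-tensor scale restores magnitude.
void ternary_matvec(const std::vector<int8_t>& w, // rows*cols, row-major
                    const std::vector<float>& x,  // input, size cols
                    std::vector<float>& y,        // output, size rows
                    std::size_t rows, std::size_t cols, float scale) {
    for (std::size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        const int8_t* row = &w[r * cols];
        for (std::size_t c = 0; c < cols; ++c) {
            if (row[c] == 1)       acc += x[c];  // +1: add input
            else if (row[c] == -1) acc -= x[c];  // -1: subtract; 0: skip
        }
        y[r] = acc * scale;
    }
}
```

Packed at 2 bits per weight, roughly 1B ternary parameters fit in about 250 MB, which is consistent with the reported footprints once activations and the engine's runtime state are added.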

The code is fully open-source (Apache/MIT).