BitMamba-2 is a newly introduced model that combines the Mamba-2 State Space Model (SSM) architecture with BitNet's 1.58-bit ternary quantization.

The primary goals are to demonstrate that ternary scaling laws hold even for SSMs, and to enable efficient inference on modest hardware, from legacy machines to edge devices, without requiring high-end GPUs.
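
Part of what makes an SSM a good fit for that goal is that it carries a fixed-size recurrent state rather than a KV cache that grows with context length. A minimal sketch of the linear recurrence that Mamba-2 builds on (scalar-state form; the names here are illustrative assumptions, and the real model uses multi-head, data-dependent parameters):

```cpp
#include <vector>

// Scalar-state linear SSM recurrence (illustrative only):
//   h_t = a_t * h_{t-1} + b_t * x_t   (decay old state, inject input)
//   y_t = c_t * h_t                   (readout)
std::vector<float> ssm_scan(const std::vector<float>& a,
                            const std::vector<float>& b,
                            const std::vector<float>& c,
                            const std::vector<float>& x) {
    std::vector<float> y(x.size());
    float h = 0.0f; // fixed-size state: memory does not grow with sequence length
    for (size_t t = 0; t < x.size(); ++t) {
        h = a[t] * h + b[t] * x[t];
        y[t] = c[t] * h;
    }
    return y;
}
```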

Key Specs

  • Architecture: Mamba-2 + BitNet b1.58 (ternary weights {-1, 0, +1}; a quantization sketch follows this list)
  • Training: Trained from scratch on 150B tokens (FineWeb-Edu, Cosmopedia, Stack-Dedup) on a Google TPU v6e-8.
  • Performance: The 1B-parameter model significantly outperforms the 255M baseline, validating the ternary scaling laws.
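
BitNet b1.58 quantizes each weight tensor with an absmean rule: divide every weight by the tensor's mean absolute value, then round and clip to {-1, 0, +1}. A minimal C++ sketch of that rule (a hypothetical standalone helper, not the project's actual training code):

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// BitNet b1.58 absmean ternary quantization (illustrative sketch):
//   q_i = clip(round(w_i / (mean|w| + eps)), -1, +1)
std::vector<int8_t> quantize_ternary(const std::vector<float>& w,
                                     float& scale_out) {
    float absmean = 0.0f;
    for (float v : w) absmean += std::fabs(v);
    absmean = absmean / static_cast<float>(w.size()) + 1e-8f; // avoid div by 0

    std::vector<int8_t> q(w.size());
    for (size_t i = 0; i < w.size(); ++i) {
        float s = std::round(w[i] / absmean); // nearest integer
        q[i] = static_cast<int8_t>(std::fmax(-1.0f, std::fmin(1.0f, s)));
    }
    scale_out = absmean; // dequantize as w ≈ scale_out * q
    return q;
}
```

Each ternary weight can then be packed into 2 bits of storage, with one float scale per tensor.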

A custom C++ inference engine was developed for these models. On a consumer Intel Core i3-12100F CPU, it achieves the following (a simplified kernel sketch follows the list):

  • BitMamba-2-1B: ~53 tokens/sec (621 MB RAM)
  • BitMamba-2-255M: ~146 tokens/sec (252 MB RAM)
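
Much of that CPU speed comes from the arithmetic that ternary weights allow: a matrix-vector product over {-1, 0, +1} needs no multiplications, only additions and subtractions. A minimal, unpacked C++ sketch of such a kernel (illustrative only; the actual engine presumably packs weights at 2 bits each and vectorizes the loop):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Matrix-vector product with ternary weights (illustrative sketch).
// Weights in {-1, 0, +1} reduce each multiply-accumulate to an add,
// a subtract, or a skip; a per-tensor scale restores magnitude.
void ternary_matvec(const std::vector<int8_t>& w, // rows*cols, row-major
                    const std::vector<float>& x,  // input, size cols
                    std::vector<float>& y,        // output, size rows
                    std::size_t rows, std::size_t cols, float scale) {
    for (std::size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        const int8_t* row = &w[r * cols];
        for (std::size_t c = 0; c < cols; ++c) {
            if (row[c] == 1)       acc += x[c];  // +1: add input
            else if (row[c] == -1) acc -= x[c];  // -1: subtract; 0: skip
        }
        y[r] = acc * scale;
    }
}
```

Packed at 2 bits per weight, roughly 1B ternary parameters fit in about 250 MB, which is consistent with the reported footprints once activations and the engine's runtime state are added.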

The code is fully open-source (Apache/MIT).