AI-RADAR.IT · AI-RADAR.NET · AI-RADAR.TECH

News & analysis on local LLMs, stack & on-prem hardware.

📁 LLM AI generated

FlashLM v5: Language Model Trained on CPU Beats GPU Baseline

Published on 2026-02-22 06:56 ℹ️ LocalLLaMA



FlashLM v5 "Thunderbolt": CPU Training Beats GPU

FlashLM v5 "Thunderbolt" is a substantial step forward in the FlashLM series, showing that a language model trained on a CPU can achieve competitive results.

Results

The model reached a final perplexity (PPL) of 1.36 and 0.44 bits per character (BPC). Training took approximately 40 hours on an AMD Ryzen 9 7950X3D CPU. The model has 29.7 million parameters, of which 26.5 million (about 89%) are ternary.
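The two quality figures can be cross-checked with a few lines of arithmetic. The sketch below assumes that perplexity and BPC were measured over the same prediction unit, which the source does not state; under that assumption the bits per unit equal log2(PPL), and the reported numbers agree. The function and variable names are illustrative only.

```python
import math


def bits_from_perplexity(ppl: float) -> float:
    """Bits per prediction unit implied by a perplexity: log2(PPL)."""
    return math.log2(ppl)


def perplexity_from_bits(bits: float) -> float:
    """Inverse conversion: PPL = 2 ** bits."""
    return 2.0 ** bits


# Figures reported in the article.
reported_ppl = 1.36
reported_bpc = 0.44
total_params_m = 29.7
ternary_params_m = 26.5

# Assumption: PPL and BPC refer to the same unit (not confirmed by the source).
implied_bits = bits_from_perplexity(reported_ppl)
ternary_share = ternary_params_m / total_params_m

print(f"log2({reported_ppl}) = {implied_bits:.3f} bits (reported BPC: {reported_bpc})")
print(f"ternary share of parameters = {ternary_share:.1%}")
```

If the two metrics were instead measured over different units (for example, perplexity per token and BPC per character), this simple identity would not apply and a tokens-per-character factor would be needed.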

Architecture

FlashLM v5 uses the ParallelGatedRecurrence architecture, characterized by:

BitLinear with ternary weights {-1, 0, +1}
Parallel gated recurrence with learned decay gates
No matrix multiplications in the forward pass
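To make the three properties above concrete, here is a minimal pure-Python sketch. It is not FlashLM's code. It assumes absmean ternary quantization (the scheme commonly associated with BitLinear layers) and a simple elementwise recurrence h_t = a_t * h_{t-1} + (1 - a_t) * x_t, with the decay gates a_t supplied as plain numbers rather than learned. All function names are hypothetical.

```python
def ternarize(weights, eps=1e-8):
    """Assumed absmean scheme: divide by mean |w|, round, clip to {-1, 0, +1}."""
    flat = [abs(w) for row in weights for w in row]
    scale = sum(flat) / len(flat) + eps
    q = [[max(-1, min(1, round(w / scale))) for w in row] for row in weights]
    return q, scale


def ternary_matvec(q, scale, x):
    """Compute scale * (Q @ x) with additions and subtractions only.

    Each ternary weight either adds the input, subtracts it, or skips it,
    so no per-weight multiplication is needed; one rescale per output remains.
    """
    out = []
    for row in q:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi
            elif w == -1:
                acc -= xi
        out.append(acc * scale)
    return out


def gated_recurrence(xs, decays):
    """Elementwise gated recurrence, starting from a zero state.

    Written sequentially for clarity; a linear recurrence of this form
    can also be evaluated with a parallel scan.
    """
    h = [0.0] * len(xs[0])
    states = []
    for x_t, a_t in zip(xs, decays):
        h = [a * hp + (1.0 - a) * x for a, hp, x in zip(a_t, h, x_t)]
        states.append(h)
    return states


# Toy example with made-up numbers.
weights = [[0.9, -0.05, -1.2], [0.1, 0.8, -0.7]]
q, scale = ternarize(weights)
y = ternary_matvec(q, scale, [2.0, 3.0, 1.0])
states = gated_recurrence([[1.0], [1.0]], [[0.5], [0.5]])

print(q)       # ternary weight matrix
print(y)       # matmul-free projection of the input
print(states)  # hidden state after each step
```

A decay gate close to 1 keeps most of the previous state, while a gate close to 0 overwrites it with the current input; in a trained model these gates would be produced by learned parameters rather than fixed by hand.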

Comparison with previous versions

v5 "Thunderbolt" improves markedly on the previous versions (v4 "Bolt" and v5.2 "Nova-Ignition") in perplexity, BPC, and the quality of generated output. In particular, v5 produces more coherent narratives, a more diverse vocabulary, and more accurate grammar.

Future directions

The FlashLM project will continue with the v6 series, focusing on validating the ParallelGatedRecurrence architecture. In addition, a new project (Nano-Coder) will be launched to apply FlashLM techniques to code generation.

AI-Radar Takeaway

FlashLM v5, a 29.7-million-parameter language model, was trained on an AMD Ryzen 9 7950X3D CPU in approximately 40 hours. It reached a perplexity of 1.36, beating the TinyStories-1M baseline (PPL 1.59; lower is better). The ParallelGatedRecurrence architecture uses ternary weights and requires no matrix multiplications in the forward pass.


