A developer recently shared results from FlashLM, an experiment with tiny language models designed to be trained and run entirely on CPU.
Model Details
The FlashLM v3-13m model has the following characteristics:
- 13.6M parameters, with a d_model of 256.
- Ternary weights ({-1, 0, +1}), meaning inference requires only additions and subtractions, no multiplications.
- Trained on a 2-thread CPU, with no GPU, in 1.2 hours.
- Trained on 32M tokens from FineWeb-Edu.
- Validation loss: 6.80.
- Uses frozen GPT-2 embeddings (SVD projected) so it doesn't waste training time learning an embedding table.
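The ternary-weight claim above can be made concrete with a small sketch. This is an illustrative NumPy implementation, not FlashLM's actual code: with weights restricted to {-1, 0, +1}, a matrix-vector product reduces to summing the inputs where the weight is +1 and subtracting those where it is -1.

```python
import numpy as np

def ternary_matvec(W, x):
    """Matrix-vector product with ternary weights {-1, 0, +1}.

    W: (out, in) int8 array with entries in {-1, 0, +1}
    x: (in,) float array
    Only additions and subtractions are performed, no multiplications.
    """
    out = np.zeros(W.shape[0], dtype=x.dtype)
    for i in range(W.shape[0]):
        row = W[i]
        # Add inputs where the weight is +1, subtract where it is -1;
        # zero weights contribute nothing.
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8)).astype(np.int8)
x = rng.standard_normal(8)
# Matches an ordinary float matmul with the same weights.
assert np.allclose(ternary_matvec(W, x), W.astype(np.float64) @ x)
```

A real implementation would vectorize this over rows, but the point stands: the inner loop needs no multiply units, which is why this design suits CPUs.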
Performance and Bottlenecks
The model produces grammatical English but lacks semantic coherence. The biggest surprise was that 86% of training time was spent on the output layer, which projects 256 dimensions up to the 50,257-token vocabulary. This bottleneck limited how effectively the model core could be trained.
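A quick back-of-envelope calculation shows why the output layer dominates at this scale (the specific parameter split is my arithmetic from the figures in the post, not something the developer stated):

```python
d_model, vocab = 256, 50_257

# A dense output projection is a d_model x vocab matrix.
head_params = d_model * vocab
print(f"output head parameters: {head_params:,}")  # 12,865,792

# Per token, the head also costs d_model * vocab multiply-accumulates
# (or add/sub ops in the ternary case). With only 13.6M parameters
# total, a ~12.9M-parameter head can easily dominate the forward
# and backward pass, consistent with the reported 86% figure.
```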
The developer is working on a next version (v4) that replaces the flat softmax output layer with a hierarchical tree structure to remove this bottleneck. If successful, this could allow 5-10x more effective training in the same wall-clock time.
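The post does not describe v4's tree in detail, so as a hedged illustration, here is the cost argument for one common scheme, a two-level (class-based) hierarchical softmax: group the vocabulary into roughly sqrt(V) clusters, predict the cluster first, then the token within it. Per-token cost drops from O(d_model * V) to O(d_model * sqrt(V)).

```python
import math

def softmax_costs(d_model, vocab):
    """Compare per-token projection cost of a flat softmax vs. a
    hypothetical two-level hierarchical softmax over ~sqrt(V) clusters."""
    # Flat softmax: one d_model x vocab projection per token.
    flat = d_model * vocab
    # Two-level: one projection onto ~sqrt(V) cluster logits, then one
    # onto ~sqrt(V) within-cluster logits for the chosen cluster.
    clusters = math.isqrt(vocab) + 1
    two_level = d_model * clusters * 2
    return flat, two_level

flat, hier = softmax_costs(256, 50_257)
print(flat, hier, flat // hier)  # the speedup is roughly sqrt(V) / 2
```

With V = 50,257 this is a two-orders-of-magnitude reduction in output-layer work, which is the kind of headroom that could plausibly support the 5-10x training-efficiency claim even after tree-construction overhead.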