A developer recently shared results from FlashLM, an experiment with tiny language models designed to be trained and run entirely on CPU.

Model Details

The FlashLM v3-13m model has the following characteristics:

  • 13.6M parameters, with a hidden dimension (d_model) of 256.
  • Ternary weights ({-1, 0, +1}), so inference requires only additions and subtractions, with no multiplications.
  • Trained on a 2-thread CPU, with no GPU, in 1.2 hours.
  • Trained on 32M tokens from FineWeb-Edu.
  • Validation loss: 6.80.
  • Uses frozen GPT-2 embeddings (projected down via SVD), so no training time is spent learning an embedding table.
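The ternary-weight claim can be illustrated with a minimal sketch: when every weight is -1, 0, or +1, a matrix-vector product reduces to summing and subtracting selected activations. The function and shapes below are illustrative assumptions, not FlashLM's actual code.

```python
import numpy as np

def ternary_matvec(W, x):
    """Multiply-free matvec for W with entries in {-1, 0, +1}.

    W: (out_dim, in_dim) ternary matrix; x: (in_dim,) activations.
    Each output is a sum of positively-selected inputs minus a sum
    of negatively-selected inputs -- additions and subtractions only.
    """
    out = np.zeros(W.shape[0], dtype=x.dtype)
    for i in range(W.shape[0]):
        row = W[i]
        out[i] = x[row == 1].sum() - x[row == -1].sum()  # no multiplies
    return out

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))   # ternary weights
x = rng.standard_normal(8)

# Matches an ordinary dense matmul on the same ternary matrix.
assert np.allclose(ternary_matvec(W, x), W @ x)
```

In a real implementation the rows would be stored as bitmasks or sparse index lists rather than dense int arrays, but the arithmetic identity is the same.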
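The frozen-embedding trick can also be sketched. GPT-2's embedding table is 768-dimensional; a truncated SVD gives a 256-dimensional table that preserves most of its structure. The dimensions mirror the article, but the code below is a hypothetical illustration (with a small random stand-in for the real table), not the developer's pipeline.

```python
import numpy as np

vocab, d_src, d_model = 1_000, 768, 256  # tiny vocab here, just for speed
rng = np.random.default_rng(1)
E = rng.standard_normal((vocab, d_src))  # stand-in for GPT-2's embedding table

# Truncated SVD: keep the top d_model singular directions.
U, S, Vt = np.linalg.svd(E, full_matrices=False)
E_proj = U[:, :d_model] * S[:d_model]    # (vocab, 256) frozen embedding table

assert E_proj.shape == (vocab, d_model)
```

The projected table is then frozen, so gradient updates (and optimizer state) are needed only for the transformer core.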

Performance and Bottlenecks

The model produces grammatical English but lacks semantic coherence. The biggest surprise was that 86% of training time was spent on the output layer, which projects the 256-dimensional hidden state onto the 50,257-token vocabulary. This bottleneck left comparatively little compute for training the model core.

The developer is working on a next version (v4) that replaces the flat softmax with a hierarchical tree structure over the vocabulary to remove this bottleneck. If successful, this could allow 5-10x more effective training in the same wall-clock time.
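The cost argument behind a hierarchical (tree-structured) softmax can be sketched as follows: instead of scoring all V vocabulary entries per token, the model makes about log2(V) binary decisions along a root-to-leaf path. The specific tree FlashLM v4 will use is not described, so the numbers below only illustrate the general technique.

```python
import math

vocab, d_model = 50_257, 256

flat_cost = vocab * d_model                         # score every token
depth = math.ceil(math.log2(vocab))                 # ~16 binary decisions
tree_cost = depth * d_model                         # one d_model-dot per node

print(depth)                    # 16
print(flat_cost // tree_cost)   # ~3141x fewer weight ops per token
```

The theoretical speedup of the output layer is large; the overall training speedup is smaller because the rest of the forward/backward pass is unchanged, which is consistent with the 5-10x estimate.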

For teams evaluating on-premise deployments, AI-RADAR analyzes the trade-offs of optimizing models for CPUs versus GPUs in detail in its /llm-onpremise section.