An engineer has created Mini-LLM, a complete transformer language model implemented from scratch.

Key Features

Mini-LLM implements the same components as Llama 3:

  • RoPE (Rotary Position Embeddings), which encode relative positions and extend more gracefully to longer sequences than learned absolute embeddings.
  • RMSNorm, which is cheaper than LayerNorm (no mean subtraction, no bias) while keeping training stable.
  • SwiGLU, the gated feed-forward activation used in modern LLMs.
  • Grouped Query Attention (GQA), which shares key/value heads across query heads to shrink the KV cache at inference time.
  • SentencePiece BPE tokenization with a 32K vocabulary.
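The core idea behind RoPE is that each pair of channels in a query or key vector is rotated by an angle proportional to its position, so dot products between queries and keys depend only on relative distance. A minimal numpy sketch of the rotate-half formulation (function name and the single-head `(seq_len, dim)` layout are illustrative, not Mini-LLM's actual API):

```python
import numpy as np

def rope(x, base=10000.0):
    # x: (seq_len, dim), dim even. Channel i is paired with channel i + dim/2,
    # and each pair is rotated by position * base**(-i / (dim/2)).
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)            # per-pair frequencies
    angles = np.arange(seq_len)[:, None] * freqs[None]   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # 2-D rotation applied to every (x1_i, x2_i) pair
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

Because position 0 gets a zero rotation, the first token's vector passes through unchanged, and rotations preserve vector norms.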
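RMSNorm drops LayerNorm's mean subtraction and bias: it only rescales by the root-mean-square of the activations, then applies a learned gain. A minimal sketch (names are illustrative):

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # Normalize by root-mean-square over the last axis; unlike LayerNorm,
    # no mean is subtracted and no bias is added.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * gain
```

After normalization the mean squared activation is approximately 1, regardless of the input scale.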
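SwiGLU replaces the plain two-layer MLP with a gated variant: one projection is passed through SiLU and multiplied elementwise with a second projection before the down-projection. A sketch with plain weight matrices (the `W`, `V`, `W2` names are illustrative placeholders, not Mini-LLM's parameter names):

```python
import numpy as np

def silu(x):
    # SiLU (a.k.a. swish): x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu(x, W, V, W2):
    # Gated feed-forward: (SiLU(x @ W) ⊙ (x @ V)) @ W2
    return (silu(x @ W) * (x @ V)) @ W2
```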

Complete Pipeline

The project covers every stage from raw text to generated output:

  • Custom tokenization, data processing, training, and inference.
  • Memory-mapped data loading that scales to terabyte-sized corpora.
  • Mixed precision training with gradient accumulation.
  • KV caching for fast generation.
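Memory-mapped loading means the tokenized corpus lives on disk as a flat array of token IDs and only the pages a batch actually touches are read, so the dataset never has to fit in RAM. A sketch of the usual pattern, assuming (hypothetically) a `train.bin` file of uint16 IDs, since a 32K vocabulary fits in 16 bits:

```python
import numpy as np

# Write a small demo token file; in practice this would be the
# pre-tokenized corpus, potentially far larger than RAM.
np.arange(1000, dtype=np.uint16).tofile("train.bin")

# np.memmap maps the file into virtual memory: indexing reads only
# the touched pages from disk.
tokens = np.memmap("train.bin", dtype=np.uint16, mode="r")

def get_batch(batch_size, seq_len, rng):
    # Sample random windows; targets are the inputs shifted by one token.
    ix = rng.integers(0, len(tokens) - seq_len - 1, size=batch_size)
    x = np.stack([tokens[i : i + seq_len] for i in ix]).astype(np.int64)
    y = np.stack([tokens[i + 1 : i + 1 + seq_len] for i in ix]).astype(np.int64)
    return x, y
```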
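Gradient accumulation sums gradients over several micro-batches before one optimizer step, emulating a batch larger than fits in memory. A toy sketch on linear regression (all names and hyperparameters are illustrative; mixed precision would additionally compute the forward/backward pass in fp16 while keeping weights and accumulated gradients in fp32):

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(64, 3))
y = X @ true_w

w = np.zeros(3)
accum_steps, micro, lr = 4, 16, 0.1
for epoch in range(200):
    grad = np.zeros_like(w)
    for s in range(accum_steps):
        xb = X[s * micro : (s + 1) * micro]
        yb = y[s * micro : (s + 1) * micro]
        err = xb @ w - yb
        # Scale by the FULL batch size so the summed micro-gradients
        # equal the gradient of one large batch.
        grad += xb.T @ err / len(y)
    w -= lr * grad  # one optimizer step per accumulation cycle
```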
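KV caching exploits the fact that during autoregressive decoding the keys and values of past tokens never change: each step computes K and V only for the new token and appends them to a cache, turning per-step attention cost from quadratic to linear. A single-head numpy sketch (class and method names are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    # Single-head cache: past K/V rows are reused, only the new
    # token's key and value are appended each decode step.
    def __init__(self, dim):
        self.K = np.empty((0, dim))
        self.V = np.empty((0, dim))

    def step(self, q, k, v):
        self.K = np.vstack([self.K, k])
        self.V = np.vstack([self.V, v])
        scores = q @ self.K.T / np.sqrt(q.shape[-1])  # attend over all cached keys
        return softmax(scores) @ self.V
```

At the first step the cache holds a single key, so the output is exactly the first value vector; later steps match attention recomputed from scratch over the full prefix.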

Results

  • 80 million parameters trained on 361 million tokens.
  • 5 hours on a single A100, final loss of approximately 3.25.
  • Generates coherent text with correct grammar.
  • Inference speed between 200 and 500 tokens per second.

The code is clean, well-documented, and designed for learning. Each component has detailed explanations of the "why" and not just the "how".