Dhi-5B: A 5 Billion Parameter LLM Developed with Limited Resources

An undergraduate student has announced the release of Dhi-5B, a 5-billion-parameter multimodal large language model (LLM). The most distinctive aspect of the project is its extremely low training budget: approximately $1,200.

The model was developed using a custom codebase and state-of-the-art training methodologies. The training process was divided into five main stages (a rough code sketch follows the list):

  1. Pre-Training: The most computationally intensive phase, dedicated to building the core capabilities of the model.
  2. Context-Length Extension: The model is extended from the 4,096-token context used in pre-training to contexts of roughly 16,000 tokens.
  3. Mid-Training: Annealing on very high-quality datasets.
  4. Supervised Fine-Tuning: The model is fine-tuned to handle conversations.
  5. Vision Extension: The model acquires the ability to process visual information.
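
As a rough illustration of how such a staged pipeline fits together, here is a minimal Python sketch. Everything in it (function names, dataset labels, the exact 16,384-token figure standing in for "roughly 16,000") is an assumption made for illustration; the announcement describes the stages but does not publish the code.

```python
# Illustrative only: the five training stages as a sequential pipeline.
# Every function name, dataset label, and token count below is hypothetical;
# the announcement describes the stages but does not publish the codebase.

from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    context_len: int  # maximum sequence length used during the stage
    dataset: str      # placeholder dataset identifier

PIPELINE = [
    Stage("pre_training", 4096, "fineweb-edu"),              # core capabilities
    Stage("context_length_extension", 16_384, "long-docs"),  # ~4k -> ~16k tokens
    Stage("mid_training", 16_384, "high-quality-anneal"),    # anneal on curated data
    Stage("supervised_fine_tuning", 16_384, "chat-sft"),     # conversational tuning
    Stage("vision_extension", 16_384, "image-text-pairs"),   # add visual inputs
]

def run_pipeline(checkpoint: str) -> str:
    """Each stage resumes from the checkpoint produced by the previous one."""
    for stage in PIPELINE:
        print(f"stage={stage.name} ctx={stage.context_len} data={stage.dataset}")
        checkpoint = f"{checkpoint}+{stage.name}"  # stand-in for a training run
    return checkpoint

if __name__ == "__main__":
    print(run_pipeline("dhi-5b-init"))
```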

The model will be released in three phases: Dhi-5B-Base (already available), Dhi-5B-Instruct (coming soon), and the full Dhi-5B version (coming soon).

The base version of the model has 4 billion parameters and was trained on 40 billion natural-language tokens, mainly in English, from the FineWeb-Edu dataset. The recently introduced Muon optimizer was used for the matrix (2-D weight) layers, while the remaining parameters were optimized with AdamW. The architecture comprises 32 layers, a width of 3072, SwiGLU MLPs, full multi-head attention (MHA) using FlashAttention-3 kernels, a context length of 4096, a 64,000-token vocabulary, and a training batch size of 2 million tokens.
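
For concreteness, here is a minimal sketch of the reported configuration and of a Muon/AdamW parameter split, assuming a standard PyTorch setup. Routing 2-D weight matrices to Muon and everything else (embeddings, biases, norms) to AdamW is the usual convention for Muon; whether Dhi-5B follows exactly this partition, and which learning rates it uses, is not stated, so treat those details as assumptions.

```python
# Minimal sketch of the reported Dhi-5B-Base configuration and a
# Muon/AdamW parameter split, assuming PyTorch. The partition rule and
# all learning rates are assumptions, not confirmed Dhi-5B details.

from dataclasses import dataclass
import torch
import torch.nn as nn

@dataclass
class DhiBaseConfig:
    n_layers: int = 32        # transformer blocks
    d_model: int = 3072       # model width
    vocab_size: int = 64_000
    max_seq_len: int = 4096   # pre-training context length
    # SwiGLU MLPs and full multi-head attention (via FlashAttention-3)
    # are reported in the article but not modeled in this sketch.

def split_param_groups(model: nn.Module):
    """Send matrix (ndim >= 2) weights to Muon, everything else to AdamW."""
    matrix, other = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # Embedding tables are 2-D but are commonly kept on AdamW;
        # excluding them here is an assumption, not a Dhi-5B detail.
        if p.ndim >= 2 and "embed" not in name:
            matrix.append(p)
        else:
            other.append(p)
    return matrix, other

# Usage with a tiny stand-in module (the real dims live in DhiBaseConfig):
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(100, 32)
        self.proj = nn.Linear(32, 32)

matrix_params, other_params = split_param_groups(Toy())
adamw = torch.optim.AdamW(other_params, lr=3e-4)  # learning rate illustrative
# muon = Muon(matrix_params, ...)  # would require an external Muon
# implementation; Muon is not bundled with torch.
```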
