Taalas is developing an unconventional approach to large language model (LLM) inference: baking the model architecture and its weights directly into the hardware.
Technology Details
Instead of relying on external HBM memory and complex packaging, Taalas etches the complete model onto a single silicon chip. According to the company, this allows them to achieve:
- Latency of less than 1 millisecond
- Over 17,000 tokens per second per user
- 20x lower production costs
- 10x higher energy efficiency
- Development time from software model to ASIC chip of only 60 days
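The throughput claim can be sanity-checked with simple arithmetic. The sketch below derives the implied per-token latency from the article's 17,000 tokens/s figure and contrasts it with a memory-bandwidth floor for a conventional HBM-based accelerator; the GPU-side numbers (16-bit weights, ~3.35 TB/s bandwidth, H100-class) are illustrative assumptions, not figures from the article.

```python
# Back-of-envelope check of the claimed per-user throughput.

tokens_per_second = 17_000  # figure claimed in the article
per_token_latency_us = 1e6 / tokens_per_second
print(f"Implied per-token latency: {per_token_latency_us:.0f} us")  # ~59 us

# For contrast: a memory-bound GPU must stream every weight once per token.
# Assumed (illustrative): Llama 3.1 8B at 16 bits/param over ~3.35 TB/s HBM3.
weights_bytes = 8e9 * 2          # 8B params x 2 bytes
hbm_bandwidth_bytes_s = 3.35e12  # assumed H100-class bandwidth
floor_us = weights_bytes / hbm_bandwidth_bytes_s * 1e6
print(f"HBM memory-bound floor: ~{floor_us:.0f} us per token")
```

Under these assumptions the memory-bound floor is roughly two orders of magnitude above the implied per-token latency, which is consistent with the company's argument for moving weights on-die.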
The company reports reaching these results with a team of only 24 engineers and $30 million in funding. Its demonstrator chip runs Llama 3.1 8B and supports LoRA fine-tuning.
Implications
This approach could be particularly interesting for applications where latency is critical, such as real-time speech models, real-time avatar generation, and computer vision.