Taalas is developing an innovative approach to large language model (LLM) inference: integrating the model architecture and its weights directly into the hardware.

Technology Details

Instead of relying on external HBM memory and complex packaging, Taalas etches the complete model onto a single silicon chip. According to the company, this approach achieves:

  • Latency of less than 1 millisecond
  • Over 17,000 tokens per second per user
  • 20x lower production costs
  • 10x higher energy efficiency
  • Development time from software model to ASIC chip of only 60 days
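To put the throughput claim in perspective, a quick back-of-the-envelope calculation (taking the company's 17,000 tokens/s/user figure at face value) shows how little time each token takes and how fast a typical response would stream:

```python
tokens_per_second = 17_000  # claimed per-user throughput

# Time budget per token, in microseconds
us_per_token = 1_000_000 / tokens_per_second  # ~58.8 us/token

# A 500-token response (a few paragraphs of text) would stream in:
response_seconds = 500 / tokens_per_second  # ~0.029 s, i.e. well under a blink

print(f"{us_per_token:.1f} us/token, {response_seconds * 1000:.0f} ms for 500 tokens")
```

At that rate, generation is effectively instantaneous from a human perspective, which is what makes the real-time use cases mentioned below plausible.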

The company claims to have achieved these results with a team of only 24 engineers and an investment of $30 million. Their demonstrator uses Llama 3.1 8B and supports LoRA fine-tuning.
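The source does not explain how LoRA works on a hardwired model, but the general LoRA idea is a natural fit for fixed silicon: the base weights stay frozen (here, literally etched into the chip) while a small, trainable low-rank update is added on top. A minimal sketch of a LoRA forward pass, with illustrative names and NumPy standing in for the actual hardware:

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 8, 2                              # hidden size, LoRA rank (r << d)
W = rng.standard_normal((d, d))          # frozen base weight (hardwired, in Taalas's case)
A = rng.standard_normal((d, r)) * 0.01   # trainable low-rank factor
B = np.zeros((r, d))                     # B starts at zero: adapter is a no-op initially
scale = 1.0                              # alpha / r scaling factor

x = rng.standard_normal((1, d))

# LoRA forward pass: frozen base output plus a low-rank correction
y = x @ W + scale * (x @ A @ B)

# With B = 0, the adapter contributes nothing: output equals the base model
assert np.allclose(y, x @ W)
```

Only A and B (2·d·r parameters instead of d²) would need to live in updatable memory, which is presumably what makes fine-tuning viable even when the base model itself cannot change.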

Implications

This approach could be particularly interesting for applications where latency is critical, such as real-time speech models, real-time avatar generation, and computer vision.