Taalas is developing an unconventional approach to large language model (LLM) inference: baking the model architecture and its weights directly into the hardware.
Technology Details
Instead of relying on external HBM memory and complex packaging, Taalas etches the complete model onto a single silicon chip. According to the company, this allows them to achieve:
- Latency of less than 1 millisecond
- Over 17,000 tokens per second per user
- 20x lower production costs
- 10x higher energy efficiency
- Development time from software model to ASIC chip of only 60 days
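The throughput claim can be sanity-checked with simple arithmetic. The sketch below derives the implied per-token latency from the article's 17,000 tokens/s figure and contrasts it with a memory-bandwidth floor for a conventional HBM-based accelerator; the GPU-side numbers (16-bit weights, ~3.35 TB/s bandwidth, H100-class) are illustrative assumptions, not figures from the article.

```python
# Back-of-envelope check of the claimed per-user throughput.

tokens_per_second = 17_000  # figure claimed in the article
per_token_latency_us = 1e6 / tokens_per_second
print(f"Implied per-token latency: {per_token_latency_us:.0f} us")  # ~59 us

# For contrast: a memory-bound GPU must stream every weight once per token.
# Assumed (illustrative): Llama 3.1 8B at 16 bits/param over ~3.35 TB/s HBM3.
weights_bytes = 8e9 * 2          # 8B params x 2 bytes
hbm_bandwidth_bytes_s = 3.35e12  # assumed H100-class bandwidth
floor_us = weights_bytes / hbm_bandwidth_bytes_s * 1e6
print(f"HBM memory-bound floor: ~{floor_us:.0f} us per token")
```

Under these assumptions the memory-bound floor is roughly two orders of magnitude above the implied per-token latency, which is consistent with the company's argument for moving weights on-die.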
The company reports reaching these results with a team of only 24 engineers and $30 million in funding. Its demonstrator chip runs Llama 3.1 8B and supports LoRA fine-tuning.
Implications
This approach could be particularly interesting for applications where latency is critical, such as real-time speech models, real-time avatar generation, and computer vision.