Taalas, a startup specializing in inference hardware, has released a demo chatbot and an API, both powered by an ASIC chip developed in-house.
High-Speed Inference
The platform achieves an inference speed of 16,000 tokens per second running the Llama 3.1 8B model. The small model was chosen deliberately, to validate the concept of accelerating inference with dedicated hardware. Taalas is now focusing its efforts on more complex models.
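To put that throughput in perspective, a quick back-of-the-envelope calculation shows how fast a typical chat reply would arrive at 16,000 tokens per second. Everything except the headline figure is an illustrative assumption:

```python
# Back-of-the-envelope latency at the reported throughput.
# Only the 16,000 tok/s figure comes from the article; the rest is illustrative.
TOKENS_PER_SECOND = 16_000  # reported by Taalas for Llama 3.1 8B

def generation_time_ms(num_tokens: int, tps: int = TOKENS_PER_SECOND) -> float:
    """Time to generate num_tokens at a steady rate of tps, in milliseconds."""
    return num_tokens / tps * 1000

# A typical 500-token chat reply would stream in roughly 31 ms.
print(f"{generation_time_ms(500):.2f} ms")  # → 31.25 ms
```

At that rate, generation is effectively instantaneous from the user's perspective; perceived latency would be dominated by network round trips rather than the model itself.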
Free Access
While it develops more advanced solutions, Taalas offers free access to its demo, letting users experience the chip's capabilities directly. Both a demo chatbot and an inference API are available.
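A minimal sketch of what calling such an inference API might look like. The endpoint URL, model name, and payload shape below are assumptions modeled on common OpenAI-compatible chat APIs, not Taalas's documented interface:

```python
import json

# Hypothetical request builder: API_URL and the field names are assumptions,
# not Taalas's actual API. Check the official docs before use.
API_URL = "https://api.example.com/v1/chat/completions"  # placeholder URL

def build_request(prompt: str, model: str = "llama-3.1-8b") -> str:
    """Serialize a chat-completion request body as JSON."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,  # streaming makes the high token throughput visible
    })

body = build_request("Explain what an inference ASIC is.")
# The body would then be POSTed to API_URL with an Authorization header.
```

The request is only constructed here, not sent; wiring it to an HTTP client and the real endpoint is left to the reader once the actual API details are known.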
For those evaluating on-premise deployments, there are trade-offs to consider carefully. AI-RADAR offers analytical frameworks on /llm-onpremise to support these evaluations.