Liquid AI has demonstrated its LFM2-24B-A2B language model running entirely in a web browser, using WebGPU for hardware acceleration.

Performance

The model, a Mixture of Experts (MoE) architecture with 24 billion total parameters of which roughly 2 billion are active per token, generates approximately 50 tokens per second on a machine with an Apple M4 Max chip. The smaller variant, LFM2-8B-A1B, exceeds 100 tokens per second on the same hardware.
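The gap between total and active parameters comes from MoE routing: a router scores all experts for each token, but only the top-k experts actually run. The sketch below is purely illustrative (not Liquid AI's implementation) and shows why only a small fraction of the parameters is exercised per token:

```javascript
// Illustrative Mixture-of-Experts routing sketch (not Liquid AI's actual
// implementation). The router scores every expert, but only the top-k
// experts execute -- the idea behind "24B total / ~2B active" parameters.

function softmax(logits) {
  const m = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Indices of the k highest router scores.
function topK(scores, k) {
  return scores
    .map((s, i) => [s, i])
    .sort((a, b) => b[0] - a[0])
    .slice(0, k)
    .map(([, i]) => i);
}

// Route one token: run only the selected experts and mix their outputs
// by the renormalized router weights.
function moeForward(token, experts, routerLogits, k) {
  const probs = softmax(routerLogits);
  const chosen = topK(probs, k);
  const mass = chosen.reduce((a, i) => a + probs[i], 0);
  let out = 0;
  for (const i of chosen) {
    out += (probs[i] / mass) * experts[i](token); // only k experts execute
  }
  return out;
}

// Toy example: 8 scalar "experts", top-2 routing.
const experts = Array.from({ length: 8 }, (_, i) => (x) => (i + 1) * x);
const logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0];
const y = moeForward(1.0, experts, logits, 2);
console.log(y); // weighted mix of the two highest-scoring experts only
```

Here only experts 4 and 1 (the two highest router scores) run; the other six contribute no compute at all, which is what keeps per-token cost proportional to the active, not the total, parameter count.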

Resources

Liquid AI has published a demo and its source code on Hugging Face: https://huggingface.co/spaces/LiquidAI/LFM2-MoE-WebGPU. Optimized ONNX models are also available:
* LFM2-8B-A1B-ONNX: https://huggingface.co/LiquidAI/LFM2-8B-A1B-ONNX
* LFM2-24B-A2B-ONNX: https://huggingface.co/LiquidAI/LFM2-24B-A2B-ONNX
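ONNX models like these are typically loaded in the browser through a runtime such as Transformers.js with its WebGPU backend. A minimal, hypothetical page sketch (the model ID matches the repository above, but the exact loading options are assumptions and may differ from the official demo):

```html
<!-- Hypothetical minimal page: loads the model with Transformers.js and
     generates text on the WebGPU backend. Options are assumptions. -->
<script type="module">
  import { pipeline } from "https://cdn.jsdelivr.net/npm/@huggingface/transformers";

  // device: "webgpu" selects the WebGPU execution backend in the browser.
  const generator = await pipeline(
    "text-generation",
    "LiquidAI/LFM2-8B-A1B-ONNX",
    { device: "webgpu" }
  );

  const output = await generator("Explain WebGPU in one sentence.", {
    max_new_tokens: 64,
  });
  console.log(output[0].generated_text);
</script>
```

On first load the browser downloads and caches the model weights, so the initial startup is far slower than subsequent runs; all inference then happens locally on the user's GPU.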

Running large models directly in the browser opens up new possibilities for low-latency AI applications with strict privacy requirements, since data never leaves the user's device. For those evaluating on-premise deployments, there are trade-offs between performance, cost, and data control; AI-RADAR offers analytical frameworks at /llm-onpremise for evaluating these trade-offs.