Liquid AI has demonstrated its LFM2-24B-A2B language model running entirely in a web browser, using WebGPU for hardware acceleration.

Performance

The model, a Mixture of Experts (MoE) architecture with 24 billion total parameters of which roughly 2 billion are active per token, generates approximately 50 tokens per second on a machine with an Apple M4 Max chip. The smaller variant, LFM2-8B-A1B, exceeds 100 tokens per second on the same hardware.
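The gap between total and active parameters comes from MoE routing: a router scores all experts for each token, but only the top-k experts actually run. The sketch below is purely illustrative (not Liquid AI's implementation) and shows why only a small fraction of the parameters is exercised per token:

```javascript
// Illustrative Mixture-of-Experts routing sketch (not Liquid AI's actual
// implementation). The router scores every expert, but only the top-k
// experts execute -- the idea behind "24B total / ~2B active" parameters.

function softmax(logits) {
  const m = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Indices of the k highest router scores.
function topK(scores, k) {
  return scores
    .map((s, i) => [s, i])
    .sort((a, b) => b[0] - a[0])
    .slice(0, k)
    .map(([, i]) => i);
}

// Route one token: run only the selected experts and mix their outputs
// by the renormalized router weights.
function moeForward(token, experts, routerLogits, k) {
  const probs = softmax(routerLogits);
  const chosen = topK(probs, k);
  const mass = chosen.reduce((a, i) => a + probs[i], 0);
  let out = 0;
  for (const i of chosen) {
    out += (probs[i] / mass) * experts[i](token); // only k experts execute
  }
  return out;
}

// Toy example: 8 scalar "experts", top-2 routing.
const experts = Array.from({ length: 8 }, (_, i) => (x) => (i + 1) * x);
const logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0];
const y = moeForward(1.0, experts, logits, 2);
console.log(y); // weighted mix of the two highest-scoring experts only
```

Here only experts 4 and 1 (the two highest router scores) run; the other six contribute no compute at all, which is what keeps per-token cost proportional to the active, not the total, parameter count.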

Resources

Liquid AI has published a demo and its source code on Hugging Face: https://huggingface.co/spaces/LiquidAI/LFM2-MoE-WebGPU. Optimized ONNX models are also available:
* LFM2-8B-A1B-ONNX: https://huggingface.co/LiquidAI/LFM2-8B-A1B-ONNX
* LFM2-24B-A2B-ONNX: https://huggingface.co/LiquidAI/LFM2-24B-A2B-ONNX
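ONNX models like these are typically loaded in the browser through a runtime such as Transformers.js with its WebGPU backend. A minimal, hypothetical page sketch (the model ID matches the repository above, but the exact loading options are assumptions and may differ from the official demo):

```html
<!-- Hypothetical minimal page: loads the model with Transformers.js and
     generates text on the WebGPU backend. Options are assumptions. -->
<script type="module">
  import { pipeline } from "https://cdn.jsdelivr.net/npm/@huggingface/transformers";

  // device: "webgpu" selects the WebGPU execution backend in the browser.
  const generator = await pipeline(
    "text-generation",
    "LiquidAI/LFM2-8B-A1B-ONNX",
    { device: "webgpu" }
  );

  const output = await generator("Explain WebGPU in one sentence.", {
    max_new_tokens: 64,
  });
  console.log(output[0].generated_text);
</script>
```

On first load the browser downloads and caches the model weights, so the initial startup is far slower than subsequent runs; all inference then happens locally on the user's GPU.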

Running large models directly in the browser opens up new possibilities for low-latency AI applications with strict privacy requirements, since data never leaves the user's device. For those evaluating on-premise deployments, there are trade-offs between performance, cost, and data control; AI-RADAR offers analytical frameworks at /llm-onpremise for evaluating these trade-offs.