Faster Inference with MLX and Qwen-3.5
The mlx-lm framework is about to receive a significant update: multi-token prediction (MTP) support for the Qwen-3.5 model series. This feature lets the model emit several tokens per forward pass instead of one, significantly increasing throughput.
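To make the idea concrete, here is a toy draft-and-verify loop in plain Python (an illustrative sketch only; the `base_model_next` and `mtp_draft` functions are hypothetical stand-ins, not the mlx-lm API). Cheap MTP heads propose several tokens ahead, a single base-model pass verifies them, and the longest correct prefix is kept along with the verifier's own prediction, so each pass yields at least one token and the output matches ordinary greedy decoding:

```python
import random

random.seed(0)
VOCAB_SIZE = 100

def base_model_next(ctx):
    # Stand-in for one full forward pass of the base model (deterministic toy).
    return (sum(ctx) * 31 + len(ctx)) % VOCAB_SIZE

def mtp_draft(ctx, k):
    # Stand-in for the cheap MTP heads: propose k tokens ahead.
    # Mostly agrees with the base model, mimicking a high acceptance rate.
    draft, c = [], list(ctx)
    for _ in range(k):
        tok = base_model_next(c)
        if random.random() < 0.2:            # ~20% of draft tokens are wrong
            tok = (tok + 1) % VOCAB_SIZE
        draft.append(tok)
        c.append(tok)
    return draft

def generate(prompt, n_new, k=3):
    # Draft-and-verify: each base-model pass checks the k drafted tokens and
    # always contributes at least one correct token of its own.
    ctx, produced, passes = list(prompt), 0, 0
    while produced < n_new:
        draft = mtp_draft(ctx, k)
        passes += 1
        verify, accepted = list(ctx), []
        for tok in draft:
            if tok == base_model_next(verify):   # keep the longest correct prefix
                accepted.append(tok)
                verify.append(tok)
            else:
                break
        bonus = base_model_next(verify)          # the verification pass's own token
        ctx += accepted + [bonus]
        produced += len(accepted) + 1
    return ctx, passes

out, passes = generate([1, 2, 3], 20)
print(f"{len(out) - 3} tokens in {passes} forward passes")
```

Because rejected drafts are discarded before anything is emitted, the result is identical to greedy decoding with the base model alone; the gain is purely in how few full forward passes are needed.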
Performance Increase
Early tests, performed on an M4 Pro with a Qwen3.5-27B model quantized to 4-bit, show throughput rising from 15.3 to 23.3 tokens/s, an improvement of roughly 52%. The reported acceptance rate is around 80.6%.
This improvement is particularly relevant for anyone running large language model (LLM) inference locally, as it makes the most of the available hardware resources.
Implementation Details
The PR introducing this feature is available on GitHub at https://github.com/ml-explore/mlx-lm/pull/990.