Faster Inference with MLX and Qwen-3.5

The mlx-lm framework is about to receive a significant update: the introduction of multi-token prediction (MTP) for the Qwen-3.5 series models. This feature allows generating multiple tokens per forward pass, significantly increasing throughput.
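To make the mechanism concrete, here is a minimal sketch of the accept/verify logic that schemes like multi-token prediction rely on. This is an illustration only: the function name and the greedy accept-or-correct rule are assumptions for clarity, not code from mlx-lm (the actual implementation is in the PR linked below). The idea is that drafted tokens are kept while they match what the main model would have produced, so each forward pass can yield several tokens instead of one.

```python
def accept_draft(draft_tokens, target_tokens):
    """Greedy acceptance sketch (hypothetical helper, not mlx-lm API).

    Keep drafted tokens while they match the targets recomputed by the
    main model; at the first mismatch, emit the corrected target token
    and stop. The result is always at least one valid token per pass.
    """
    accepted = []
    for draft, target in zip(draft_tokens, target_tokens):
        if draft == target:
            accepted.append(draft)   # draft confirmed, keep it
        else:
            accepted.append(target)  # first mismatch: take the verified token
            break
    return accepted
```

With a high acceptance rate, most passes keep the full draft, which is where the throughput gain comes from.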

Performance Increase

Early tests, performed on an M4 Pro with a Qwen3.5-27B model quantized to 4-bit, show throughput rising from 15.3 to 23.3 tokens/s, an improvement of roughly 52%. The acceptance rate, i.e. the fraction of drafted tokens confirmed by the verification step, is around 80.6%.
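The relative improvement follows directly from the two benchmark figures quoted above:

```python
# Relative speedup implied by the reported benchmark numbers.
baseline = 15.3  # tokens/s without MTP
with_mtp = 23.3  # tokens/s with MTP

speedup = (with_mtp - baseline) / baseline
print(f"{speedup:.1%}")  # → 52.3%
```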

This improvement is particularly relevant for anyone running large language model (LLM) inference locally, since it makes better use of the available hardware.

Implementation Details

The PR introducing this feature is available on GitHub at https://github.com/ml-explore/mlx-lm/pull/990.