vLLM-MLX: Superior Performance on Apple Silicon

A new study compared the performance of vLLM-MLX on Apple Silicon with that of llama.cpp, finding a throughput increase ranging from 21% to 87% in favor of vLLM-MLX.

These results suggest that Apple Silicon chips, combined with vLLM-MLX, could be an efficient way to run large language models (LLMs) directly on local devices. For those evaluating on-premise deployments, AI-RADAR analyzes the trade-offs in detail in the /llm-onpremise section.
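
As a rough illustration of what local serving could look like, the sketch below assumes a vLLM-style server (such as vLLM-MLX) exposing the OpenAI-compatible endpoint that vLLM provides by default on localhost. The port, the model identifier, and the availability of this endpoint in vLLM-MLX are assumptions for illustration, not details reported in the study.

    # Minimal sketch: querying a locally served model through an
    # OpenAI-compatible endpoint, as vLLM exposes by default.
    # Assumptions: the server listens on localhost:8000 and the model id
    # "mlx-community/Llama-3.2-3B-Instruct" is purely illustrative.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",  # local server, no cloud round-trip
        api_key="not-needed",                 # local deployments typically ignore the key
    )

    response = client.chat.completions.create(
        model="mlx-community/Llama-3.2-3B-Instruct",  # illustrative model id
        messages=[{"role": "user", "content": "Summarize the benefits of on-device LLMs."}],
        max_tokens=128,
    )
    print(response.choices[0].message.content)

Because the API surface is the same one used for hosted models, switching an application from a cloud endpoint to an on-device deployment would mostly be a matter of changing the base URL.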

Further details are available in the research paper published on arXiv.