Personal Computing Meets Large Language Models: An In-Depth Benchmark

Interest in running Large Language Models (LLMs) directly on local hardware, rather than relying on cloud services, is steadily growing. The trend is driven by the pursuit of greater data control, the need to operate in air-gapped environments, and, not least, the potential to reduce Total Cost of Ownership (TCO). In this context, a recent study explored how well a MacBook Air M5, configured with 32GB of RAM, a 10-core CPU, and a 10-core GPU, handles a wide range of LLMs.

The analysis benchmarked 37 different models from 10 distinct families, using the llama-bench tool with Q4_K_M quantization. The primary goal of this initiative extends beyond measuring performance on a single device: it aims to build a community benchmark database covering the entire range of Apple Silicon chips, from M1 to M5, including base, Pro, Max, and Ultra variants. Such an archive of empirical data would prove invaluable for anyone who wants to evaluate LLM performance on their own Apple hardware.
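The article does not give the exact command line used, so the following is only a sketch of how a comparable run might be launched from Python. It assumes llama-bench from llama.cpp is on the PATH and uses its -m (model), -p (prompt tokens), -n (generated tokens), and -o (output format) options to match the pp256 and tg128 metrics reported here. The model path is a hypothetical placeholder.

```python
import shlex


def llama_bench_command(model_path: str,
                        prompt_tokens: int = 256,
                        gen_tokens: int = 128) -> list[str]:
    """Build an argument list for a llama-bench run (assumed invocation)."""
    return [
        "llama-bench",
        "-m", model_path,          # GGUF file, already quantized (e.g. Q4_K_M)
        "-p", str(prompt_tokens),  # prompt-processing test size -> pp256
        "-n", str(gen_tokens),     # token-generation test size  -> tg128
        "-o", "json",              # machine-readable output
    ]


# Hypothetical model path, used for illustration only.
cmd = llama_bench_command("models/example-7b-q4_k_m.gguf")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, capture_output=True, text=True) on a machine where llama-bench and the model file are actually present.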

Key Findings: The MoE Model Advantage and the 32GB "Wall"

The benchmark results offer useful insights for anyone deploying LLMs in self-hosted environments. The main metric is token generation speed (tg128, in tok/s), reported alongside prompt processing speed (pp256, in tok/s) and RAM usage. Among the models tested, the Qwen 3.5 35B-A3B MoE stood out as a game-changer for local inference. This Mixture-of-Experts (MoE) model reached 31 tokens per second, compared with approximately 2.5 tokens per second for dense 32B models at similar memory consumption. That is a speedup of roughly 12x: because only about 3B parameters are active for each generated token, the model offers quality in the range of a 35B model at close to the speed of a 3B model.
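The speedup can be checked with a few lines of arithmetic. The sketch below uses the two throughput figures from the text and compares the measured ratio with a rough expectation based on active parameters per token. That expectation rests on an assumption not stated in the source: that token generation is limited mainly by how many weights must be read per token, so speed scales roughly inversely with active parameter count.

```python
# Figures reported in the article (tg128, tokens per second).
moe_tps = 31.0    # Qwen 3.5 35B-A3B MoE
dense_tps = 2.5   # dense 32B models, approximate

# Parameters touched per generated token, in billions.
dense_active_b = 32.0  # a dense model uses all of its weights
moe_active_b = 3.0     # "A3B": about 3B active parameters

measured_speedup = moe_tps / dense_tps
# Assumption: decode speed is roughly inversely proportional to active parameters.
expected_speedup = dense_active_b / moe_active_b

print(f"measured speedup: {measured_speedup:.1f}x")
print(f"active-parameter ratio: {expected_speedup:.1f}x")
```

The two ratios land in the same range, which is consistent with the description of the MoE model running at roughly the speed of a 3B model.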

The analysis also highlighted a critical "wall" for systems with 32GB of RAM. All dense 32B models settled around 2.5 tokens per second, occupying approximately 18.6 GB of RAM. While this performance is acceptable for batch workloads or offline use, it is not ideal for real-time chat interactions. The MoE architecture thus emerges as an effective solution to overcome these limitations, enabling superior performance without requiring a disproportionate increase in available memory. "Sweet spots" were also identified for various applications: the Qwen 3.5 35B-A3B MoE as the best overall choice, the Qwen 2.5 Coder 7B or 14B for coding tasks, and the DeepSeek R1 Distill 7B or 32B for reasoning.
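To make the difference between batch and chat use concrete, the sketch below converts the reported generation speeds into wall-clock time for a single reply. The 300-token reply length is an assumption chosen for illustration, and prompt processing time is ignored.

```python
def generation_seconds(reply_tokens: int, tokens_per_second: float) -> float:
    """Time to generate a reply, ignoring prompt processing."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return reply_tokens / tokens_per_second


REPLY_TOKENS = 300  # assumed reply length, not taken from the benchmark

# Throughput figures (tok/s) as reported in the article.
for label, tps in [("dense 32B", 2.5), ("Qwen 3.5 35B-A3B MoE", 31.0)]:
    seconds = generation_seconds(REPLY_TOKENS, tps)
    print(f"{label}: {seconds:.1f} s for {REPLY_TOKENS} tokens")
```

Under these assumptions a dense 32B model needs about two minutes per reply, while the MoE model finishes in roughly ten seconds.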

Context and Implications for On-Premise Deployment

These results have direct implications for CTOs, DevOps leads, and infrastructure architects evaluating deployment strategies for AI workloads. The ability to run complex LLMs efficiently on consumer hardware such as a MacBook Air demonstrates the potential of on-premise inference and edge computing. The choice between dense and MoE models, in particular, becomes a crucial trade-off between memory requirements, generation speed, and model complexity.

For organizations prioritizing data sovereignty, regulatory compliance (such as GDPR), or the need for air-gapped environments, the possibility of deploying LLMs locally is fundamental. Benchmarks like the one presented offer concrete data for making informed decisions on hardware and model selection, balancing performance and cost constraints. AI-RADAR, for example, offers analytical frameworks on /llm-onpremise to evaluate the trade-offs between self-hosted and cloud solutions, considering factors such as TCO and concrete hardware specifications like VRAM and throughput.

Future Outlook: A Community Benchmark Ecosystem

The project behind this benchmark, mac-llm-bench, is entirely open source and aims to expand its coverage. Its developers are actively seeking contributions from owners of other Apple Silicon chips, including M4 Pro, M4 Max, M3 Max, M2 Ultra, and M1. Expanding this community database is essential to provide a comprehensive overview of LLM performance across different Apple hardware configurations.

A robust and transparent benchmark ecosystem is crucial for the evolution of LLM deployment. It allows developers and businesses to optimize their pipelines, choose the most suitable models for their needs, and maximize resource efficiency. The availability of standardized and reproducible comparative data, free from custom prompts or subjectivity, represents a significant step forward towards greater clarity and predictability in implementing local AI solutions.