Inference Benchmarks: M5 Max vs M3 Max
A recent benchmark compared the inference performance of Apple's M5 Max and M3 Max chips, running Qwen 3.5 models on 16-inch MacBook Pros. Both systems were equipped with 40 GPU cores and 128GB of unified memory. The tests were conducted with oMLX v0.2.23 across three model sizes: 122B-A10B MoE, 35B-A3B MoE, and 27B dense.
Key Findings
- The M5 Max chip demonstrated significantly higher throughput. For example, with the 35B-A3B model, it reached 134.5 tokens/s compared to the M3 Max's 80.3 tokens/s.
- The gap between the two chips widens with longer contexts. At a 65K-token context, the 27B dense model drops to 6.8 tokens/s on the M3 Max, versus 19.6 tokens/s on the M5 Max.
- Batching is crucial for agentic workloads. The M5 Max scales to 2.54x aggregate throughput at a batch size of 4 on the 35B-A3B model, while the M3 Max actually loses throughput when batching dense models.
- The efficiency of MoE models is remarkable. The 122B model (only 10B active parameters) generates tokens faster than the 27B dense model on both machines.
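The MoE result follows from how much memory traffic each generated token requires. A minimal sketch of this reasoning, assuming 4-bit quantized weights (0.5 bytes per parameter) and that every active parameter is read once per token; these assumptions and the helper function are illustrative, not from the benchmark:

```python
# Rough ceiling for decode speed when generation is memory-bandwidth-bound:
# tokens/s <= bandwidth / bytes moved per token.
# Assumptions (not from the article): 4-bit weights, one full read of the
# active parameters per generated token, no KV-cache or activation traffic.

def est_tokens_per_s(bandwidth_gb_s: float, active_params_b: float,
                     bytes_per_param: float = 0.5) -> float:
    """Upper-bound tokens/s from bandwidth and active parameter count."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

# On the M5 Max (614 GB/s): 10B active (MoE) vs 27B active (dense).
moe_ceiling = est_tokens_per_s(614, 10)    # reads 5 GB per token
dense_ceiling = est_tokens_per_s(614, 27)  # reads 13.5 GB per token
print(f"MoE ceiling:   {moe_ceiling:.1f} tok/s")
print(f"dense ceiling: {dense_ceiling:.1f} tok/s")
```

The ceiling scales with active parameters, not total parameters, which is why a 122B-A10B MoE can out-generate a 27B dense model despite being far larger on disk.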
Memory bandwidth (614 GB/s on the M5 Max vs 400 GB/s on the M3 Max) appears to be the limiting factor for decode speed, especially in multi-step agent loops or parallel tool calls.
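If decoding is bandwidth-bound, the throughput ratio between the two chips should roughly track their bandwidth ratio. A quick back-of-the-envelope check against the 35B-A3B numbers quoted above (a sketch, not part of the original benchmark):

```python
# Compare the raw memory-bandwidth ratio with the measured speedup
# on the 35B-A3B model (figures taken from the article text).
bw_ratio = 614 / 400      # M5 Max vs M3 Max bandwidth -> ~1.54x
obs_ratio = 134.5 / 80.3  # measured tokens/s ratio    -> ~1.67x
print(f"bandwidth ratio:  {bw_ratio:.2f}x")
print(f"observed speedup: {obs_ratio:.2f}x")
```

The observed speedup slightly exceeds the raw bandwidth ratio, which suggests bandwidth explains most, but not all, of the gap; other architectural differences between the generations presumably contribute the rest.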