Inference Benchmarks: M5 Max vs M3 Max
A recent benchmark compared the inference performance of Apple's M5 Max and M3 Max chips, running Qwen 3.5 models on 16-inch MacBook Pros. Both systems were equipped with 40 GPU cores and 128GB of unified memory. The tests were conducted with oMLX v0.2.23 across three model sizes: 122B-A10B MoE, 35B-A3B MoE, and 27B dense.
Key Findings
- The M5 Max chip demonstrated significantly higher throughput. For example, with the 35B-A3B model, it reached 134.5 tokens/s compared to the M3 Max's 80.3 tokens/s.
- The gap between the two chips widens with longer contexts. At a 65K-token context, the 27B dense model drops to 6.8 tokens/s on the M3 Max, versus 19.6 tokens/s on the M5 Max.
- Batching is crucial for agentic workloads. The M5 Max scales to 2.54x aggregate throughput at a batch size of 4 on the 35B-A3B model, while the M3 Max actually loses throughput when batching dense models.
- The efficiency of MoE models is remarkable. The 122B model (only 10B active parameters) generates tokens faster than the 27B dense model on both machines.
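The MoE result follows from how much memory traffic each generated token requires. A minimal sketch of this reasoning, assuming 4-bit quantized weights (0.5 bytes per parameter) and that every active parameter is read once per token; these assumptions and the helper function are illustrative, not from the benchmark:

```python
# Rough ceiling for decode speed when generation is memory-bandwidth-bound:
# tokens/s <= bandwidth / bytes moved per token.
# Assumptions (not from the article): 4-bit weights, one full read of the
# active parameters per generated token, no KV-cache or activation traffic.

def est_tokens_per_s(bandwidth_gb_s: float, active_params_b: float,
                     bytes_per_param: float = 0.5) -> float:
    """Upper-bound tokens/s from bandwidth and active parameter count."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

# On the M5 Max (614 GB/s): 10B active (MoE) vs 27B active (dense).
moe_ceiling = est_tokens_per_s(614, 10)    # reads 5 GB per token
dense_ceiling = est_tokens_per_s(614, 27)  # reads 13.5 GB per token
print(f"MoE ceiling:   {moe_ceiling:.1f} tok/s")
print(f"dense ceiling: {dense_ceiling:.1f} tok/s")
```

The ceiling scales with active parameters, not total parameters, which is why a 122B-A10B MoE can out-generate a 27B dense model despite being far larger on disk.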
Memory bandwidth (614 GB/s on the M5 Max vs 400 GB/s on the M3 Max) appears to be the limiting factor for decode speed, especially in multi-step agent loops or parallel tool calls.
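If decoding is bandwidth-bound, the throughput ratio between the two chips should roughly track their bandwidth ratio. A quick back-of-the-envelope check against the 35B-A3B numbers quoted above (a sketch, not part of the original benchmark):

```python
# Compare the raw memory-bandwidth ratio with the measured speedup
# on the 35B-A3B model (figures taken from the article text).
bw_ratio = 614 / 400      # M5 Max vs M3 Max bandwidth -> ~1.54x
obs_ratio = 134.5 / 80.3  # measured tokens/s ratio    -> ~1.67x
print(f"bandwidth ratio:  {bw_ratio:.2f}x")
print(f"observed speedup: {obs_ratio:.2f}x")
```

The observed speedup slightly exceeds the raw bandwidth ratio, which suggests bandwidth explains most, but not all, of the gap; other architectural differences between the generations presumably contribute the rest.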