GPU Acceleration for PyTorch on Apple Silicon
Apple Silicon has solidified its position as a popular platform for running Large Language Models (LLMs) locally. Until recently, ExecuTorch users on macOS were often limited to CPU-based backends, such as XNNPACK, or the AOTI Metal backend, which did not always guarantee optimal performance for intensive workloads.
In this context, the PyTorch team has released the MLX delegate, which brings GPU-accelerated inference to Macs equipped with Apple Silicon. The delegate builds on MLX, Apple's machine learning framework designed specifically for its own chips, and opens new possibilities for developers and system architects seeking on-premise or edge deployment solutions for their AI models.
The MLX Delegate: Architecture and Workflow Integration
The MLX delegate functions as a new backend for ExecuTorch, tasked with compiling and running PyTorch models directly on Apple Silicon GPUs. The process is designed to be transparent to the user: after exporting the model through the standard ExecuTorch pipeline, the delegate handles graph partitioning, serialization into an optimized format, and dispatching operations to MLX's Metal GPU kernels at runtime.
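To make the partitioning step concrete, here is a minimal, purely illustrative sketch. It is not the delegate's actual code: it assumes a linear list of operator names instead of a real graph, and the supported-op set below is hypothetical. It only shows the general idea of grouping consecutive supported operations into segments handed to the backend, while everything else stays on a fallback path.

```python
from dataclasses import dataclass

# Hypothetical supported-op set, chosen for illustration only.
SUPPORTED_OPS = {"aten.mm", "aten.add", "aten.softmax", "aten.mul"}


@dataclass
class Segment:
    delegated: bool  # True if every op in the segment is supported by the backend
    ops: list


def partition(ops, supported=SUPPORTED_OPS):
    """Group consecutive ops into delegated and fallback segments."""
    segments = []
    for op in ops:
        delegated = op in supported
        if segments and segments[-1].delegated == delegated:
            # Same kind as the previous op: extend the current segment.
            segments[-1].ops.append(op)
        else:
            # Support status changed: start a new segment.
            segments.append(Segment(delegated, [op]))
    return segments


graph = ["aten.mm", "aten.add", "aten.topk", "aten.softmax", "aten.mul"]
for seg in partition(graph):
    target = "backend" if seg.delegated else "fallback"
    print(f"{target}: {seg.ops}")
```

A real partitioner works on a dataflow graph and has to respect dependencies between nodes, so treat this only as a mental model of why broad operator coverage matters: fewer unsupported operations mean fewer, larger delegated segments.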
From a workflow perspective, the approach remains consistent with other ExecuTorch backends. Developers export the model with torch.export, lower it with to_edge_transform_and_lower using MLXPartitioner, and then run the resulting .pte file with the ExecuTorch runtime. The delegate currently supports approximately 90 ATen operations, covering the full range required for transformer inference, including quantized matmul, multi-head attention, rotary position embeddings, mixture-of-experts routing, and recurrent state-space operations.
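The export steps named above can be sketched roughly as follows. This is a hedged outline, not the official recipe: the helper name export_to_pte is made up for illustration, and because the article does not give the import path or constructor arguments of MLXPartitioner, the partitioner is passed in by the caller instead of being imported here. The sketch assumes torch and executorch are installed when the function is called.

```python
def export_to_pte(model, example_inputs, partitioner, out_path):
    """Export a PyTorch module and lower it to a .pte file (illustrative helper).

    `partitioner` is expected to be a backend partitioner instance such as
    MLXPartitioner; its import path is not stated in the article, so it is
    left to the caller. `example_inputs` must be a tuple of example tensors.
    """
    # Deferred imports: the heavy dependencies are only needed when exporting.
    from torch.export import export
    from executorch.exir import to_edge_transform_and_lower

    # 1. Capture the model graph with torch.export.
    exported = export(model.eval(), example_inputs)

    # 2. Lower to the edge dialect and let the partitioner claim supported ops.
    edge = to_edge_transform_and_lower(exported, partitioner=[partitioner])

    # 3. Serialize the lowered program into the .pte format.
    program = edge.to_executorch()
    with open(out_path, "wb") as f:
        f.write(program.buffer)
    return out_path
```

Since the same three calls are used for other ExecuTorch backends, switching targets should mostly come down to passing a different partitioner.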
A crucial aspect for efficiency and flexibility is the support for various quantization options and data types. The delegate handles BF16, FP16, and FP32 for weights and activations, in addition to 2-, 4-, and 8-bit affine quantization via TorchAO's quantize_ API. The affine path adopts the same quantization scheme as the XNNPACK and Vulkan backends, so a single quantized model definition is compatible with multiple backends. Also supported are NVFP4 quantization, which uses NVIDIA's FP4 data type, and tied quantized embeddings for models that share weights between the embedding layer and the language model head.
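For readers unfamiliar with affine quantization, the following minimal sketch shows the arithmetic in plain Python. It is a toy under stated assumptions, not TorchAO's or MLX's implementation: it uses one common formulation in which each group of weights stores integer codes together with a floating-point scale and offset, and the function names and the group size of 4 are chosen only for illustration.

```python
def quantize_group(values, bits):
    """Quantize one group of floats to integer codes in [0, 2**bits - 1]."""
    qmax = (1 << bits) - 1
    lo, hi = min(values), max(values)
    # A constant group has no range; any non-zero scale works in that case.
    scale = (hi - lo) / qmax if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, offset):
    """Reconstruct approximate floats: value = code * scale + offset."""
    return [c * scale + offset for c in codes]


def quantize_row(row, bits=4, group_size=4):
    """Split a row of weights into groups, each with its own scale and offset."""
    return [
        quantize_group(row[i:i + group_size], bits)
        for i in range(0, len(row), group_size)
    ]


row = [0.1, -0.4, 0.25, 0.8, 2.0, 1.5, 1.75, 1.0]
for codes, scale, offset in quantize_row(row, bits=4, group_size=4):
    restored = dequantize_group(codes, scale, offset)
    print(codes, [round(x, 3) for x in restored])
```

Lower bit widths shrink the code range (4 levels at 2 bits, 16 at 4 bits, 256 at 8 bits), which reduces memory traffic at the cost of a larger rounding error per weight. Per-group scales keep that error proportional to the local range of the weights.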
Key Advantages for Local Deployment
The MLX delegate brings significant advantages, particularly for those evaluating on-premise or edge AI solutions. The first and most evident is performance: the MLX delegate achieves 3 to 6 times higher throughput in generative AI workloads compared to existing ExecuTorch delegates on macOS. This increase matters for applications that require rapid responses, such as LLM-based chat or real-time audio transcription.
Another strength is the native integration with the PyTorch 2 export stack. The delegate connects directly to torch.export for graph capture and to TorchAO for quantization, the same tools used by all other ExecuTorch backends. New models or quantization techniques landing in PyTorch therefore become available to the MLX delegate without additional work, which makes development and deployment more agile.
Finally, ExecuTorch provides a single runtime API that works across all backends. An application developed with the ExecuTorch C++ or Python runtime can run models exported for MLX, XNNPACK, CoreML, Vulkan, or CUDA without changing application code. This portability enables hybrid architectures and helps keep total cost of ownership (TCO) under control by reducing complexity and maintenance costs in heterogeneous environments.
Supported Models and Future Outlook
The MLX delegate has demonstrated its effectiveness across a wide range of model architectures. For large language models, it supports dense transformers such as Llama 3.2 1B, Qwen 3 (0.6B, 1.7B, 4B), Phi-4 mini (3.8B), and Gemma 3 (1B, 4B), including those with sliding window attention. It is also compatible with sparse mixture-of-experts models, such as Qwen 3.5 35B-A3B, thanks to custom gather operations that efficiently route tokens to the correct experts on the GPU.
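The routing idea behind sparse mixture-of-experts models can be illustrated with a small plain-Python sketch. It does not reproduce the delegate's custom gather operations, which the article does not describe in detail. It assumes a generic top-k router that renormalizes the selected experts' weights (one common convention), and all function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_tokens(router_logits, top_k=2):
    """For each token, pick the top_k experts and renormalize their weights."""
    routes = []
    for logits in router_logits:
        probs = softmax(logits)
        top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:top_k]
        norm = sum(probs[e] for e in top)
        routes.append([(e, probs[e] / norm) for e in top])
    return routes


def gather_by_expert(routes, num_experts):
    """Invert the routing: list, per expert, the token indices it must process."""
    buckets = [[] for _ in range(num_experts)]
    for token, picks in enumerate(routes):
        for expert, _weight in picks:
            buckets[expert].append(token)
    return buckets


# Three tokens, four experts: one row of router scores per token.
logits = [[2.0, 0.1, 1.0, -1.0], [0.0, 3.0, 0.5, 2.0], [1.0, 1.5, 4.0, 0.0]]
routes = route_tokens(logits, top_k=2)
print(gather_by_expert(routes, num_experts=4))
```

Only the experts selected for a token run for that token, which is why a model with many total parameters can activate a small fraction of them per step. On a GPU, the per-expert token lists are typically turned into gather indices so that each expert processes its tokens as one batch.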
For speech-to-text, the delegate runs offline transcription models, including OpenAI Whisper (from tiny to large-v3-turbo), NVIDIA Parakeet TDT (0.6B) with word-level timestamps, and Mistral Voxtral (3B). For real-time use cases, it supports Mistral Voxtral Realtime (4B), handling live microphone input, ring buffer KV caches, and sliding window attention. Beyond these flagship models, over 30 additional models have been validated through the backend's test suites, covering dense transformers, encoder-decoder architectures, and vision models.
Note that the MLX delegate is currently experimental and under active development, so APIs and supported features may change. For those evaluating on-premise AI solutions, the emergence of tools like the MLX delegate underscores the importance of analyzing the trade-offs between performance, flexibility, and data control, aspects that AI-RADAR explores in detail in its analytical frameworks on /llm-onpremise.