PyTorch 2.11 is now available, with a series of updates aimed at improving the framework's performance and usability, especially for distributed training and inference across a range of hardware platforms.
Main Features
- Differentiable Collectives for Distributed Training: Introduced differentiability support for collective communications, allowing backpropagation through collective operations. This simplifies the implementation of advanced distributed training techniques.
- FlexAttention with FlashAttention-4: The FlashAttention-4 backend for FlexAttention, now available on NVIDIA Hopper and Blackwell GPUs, promises speed increases from 1.2x to 3.2x compared to the existing Triton implementation for compute-bound workloads. This feature is still under development.
- MPS Expansion (Apple Silicon): Expanded support for Apple Silicon devices, with new distribution functions and the migration of additional existing operators to the MPS backend.
- GPU Export Support for RNN/LSTM: RNN modules (LSTM, GRU, etc.) can now be exported on GPUs, with support for tracing LSTMs with dynamic shapes. This expands the types of models that can be deployed with torch.export for production inference.
- XPUGraph for Intel GPUs: Introduced XPUGraph support to optimize execution on Intel GPUs, reducing CPU overhead.
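As a rough sketch of what differentiable collectives enable: gradients can now flow straight through an operation such as torch.distributed.nn.all_reduce. The single-process gloo group below exists only so the demo runs on one machine; in real training each rank would join the same group.

```python
import tempfile

import torch
import torch.distributed as dist
from torch.distributed.nn import all_reduce  # differentiable collective

# Single-process "gloo" group so the sketch runs on one machine.
dist.init_process_group(
    "gloo", init_method=f"file://{tempfile.mktemp()}", rank=0, world_size=1
)

x = torch.ones(4, requires_grad=True)
y = all_reduce(x)        # forward: sum across ranks
loss = (y * y).sum()
loss.backward()          # backward propagates through the collective

print(x.grad)            # with world_size=1, this is 2 * x
dist.destroy_process_group()
```

The backward pass itself performs a collective on the incoming gradients, which is what previously had to be hand-wired for techniques like tensor parallelism.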
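FlexAttention user code is backend-agnostic, so the FlashAttention-4 speedups require no API changes. A minimal example with a causal score_mod (shapes and sizes here are purely illustrative):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# score_mod edits each attention score before softmax; here: causal masking.
def causal(score, b, h, q_idx, kv_idx):
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

# (batch, heads, seq_len, head_dim) -- illustrative sizes, runs in eager mode
q, k, v = (torch.randn(1, 2, 8, 16) for _ in range(3))
out = flex_attention(q, k, v, score_mod=causal)

# On supported GPUs, compiling selects the fastest available backend:
# out = torch.compile(flex_attention)(q, k, v, score_mod=causal)
```

Eager mode uses a reference implementation; the Triton and FlashAttention backends are chosen under torch.compile on suitable hardware.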
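A minimal sketch of exporting an LSTM with a dynamic sequence dimension. It is shown on CPU for portability (the module sizes are arbitrary); per the note above, the same path now also works for GPU-resident modules:

```python
import torch
from torch.export import Dim, export

lstm = torch.nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

# Mark the sequence dimension (dim 1 with batch_first=True) as dynamic.
seq = Dim("seq", min=2, max=128)
example = torch.randn(2, 6, 8)
ep = export(lstm, (example,), dynamic_shapes={"input": {1: seq}})

# The exported program accepts other sequence lengths within the range.
out, (h, c) = ep.module()(torch.randn(2, 10, 8))
print(out.shape)
```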
Other News
- Support for device-side assertions on ROCm (AMD) and optimizations for the TopK operator.
- Added support for FP16 half-precision GEMM via OpenBLAS on CPU, useful for inference scenarios on edge devices.
- CUDA 13 is now the default version.
- TorchScript has been deprecated.
For those evaluating on-premise deployments, there are trade-offs to consider carefully. AI-RADAR offers analytical frameworks on /llm-onpremise to support these evaluations.