PyTorch 2.11 is now available, with a series of updates aimed at improving the framework's performance and usability, especially for distributed training and inference across a range of hardware platforms.
Main Features
- Differentiable Collectives for Distributed Training: Introduced differentiability support for collective communications, allowing backpropagation through collective operations. This simplifies the implementation of advanced distributed training techniques.
- FlexAttention with FlashAttention-4: The FlashAttention-4 backend for FlexAttention, now available on NVIDIA Hopper and Blackwell GPUs, promises speed increases from 1.2x to 3.2x compared to the existing Triton implementation for compute-bound workloads. This feature is still under development.
- MPS Expansion (Apple Silicon): Expanded support for Apple Silicon devices, with new distribution functions and the migration of additional existing operators to the MPS backend.
- GPU Export Support for RNN/LSTM: RNN modules (LSTM, GRU, etc.) can now be exported on GPUs, with support for tracing LSTMs with dynamic shapes. This expands the types of models that can be deployed with torch.export for production inference.
- XPUGraph for Intel GPUs: Introduced XPUGraph support to optimize execution on Intel GPUs, reducing CPU overhead.
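As a rough sketch of what differentiable collectives enable: gradients can now flow straight through an operation such as torch.distributed.nn.all_reduce. The single-process gloo group below exists only so the demo runs on one machine; in real training each rank would join the same group.

```python
import tempfile

import torch
import torch.distributed as dist
from torch.distributed.nn import all_reduce  # differentiable collective

# Single-process "gloo" group so the sketch runs on one machine.
dist.init_process_group(
    "gloo", init_method=f"file://{tempfile.mktemp()}", rank=0, world_size=1
)

x = torch.ones(4, requires_grad=True)
y = all_reduce(x)        # forward: sum across ranks
loss = (y * y).sum()
loss.backward()          # backward propagates through the collective

print(x.grad)            # with world_size=1, this is 2 * x
dist.destroy_process_group()
```

The backward pass itself performs a collective on the incoming gradients, which is what previously had to be hand-wired for techniques like tensor parallelism.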
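FlexAttention user code is backend-agnostic, so the FlashAttention-4 speedups require no API changes. A minimal example with a causal score_mod (shapes and sizes here are purely illustrative):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# score_mod edits each attention score before softmax; here: causal masking.
def causal(score, b, h, q_idx, kv_idx):
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

# (batch, heads, seq_len, head_dim) -- illustrative sizes, runs in eager mode
q, k, v = (torch.randn(1, 2, 8, 16) for _ in range(3))
out = flex_attention(q, k, v, score_mod=causal)

# On supported GPUs, compiling selects the fastest available backend:
# out = torch.compile(flex_attention)(q, k, v, score_mod=causal)
```

Eager mode uses a reference implementation; the Triton and FlashAttention backends are chosen under torch.compile on suitable hardware.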
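A minimal sketch of exporting an LSTM with a dynamic sequence dimension. It is shown on CPU for portability (the module sizes are arbitrary); per the note above, the same path now also works for GPU-resident modules:

```python
import torch
from torch.export import Dim, export

lstm = torch.nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

# Mark the sequence dimension (dim 1 with batch_first=True) as dynamic.
seq = Dim("seq", min=2, max=128)
example = torch.randn(2, 6, 8)
ep = export(lstm, (example,), dynamic_shapes={"input": {1: seq}})

# The exported program accepts other sequence lengths within the range.
out, (h, c) = ep.module()(torch.randn(2, 10, 8))
print(out.shape)
```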
Other News
- Support for device-side assertions on ROCm (AMD) and optimizations for the TopK operator.
- Added support for FP16 half-precision GEMM via OpenBLAS on CPU, useful for inference scenarios on edge devices.
- CUDA 13 is now the default version.
- TorchScript has been deprecated.
For those evaluating on-premise deployments, there are trade-offs to consider carefully. AI-RADAR offers analytical frameworks on /llm-onpremise to support these evaluations.