AI-RADAR.IT · AI-RADAR.NET · AI-RADAR.TECH

News & analysis on local LLMs, stack & on-prem hardware.

📁 Frameworks AI generated

DeepSpeed: Enhancing Multimodal Training and Memory Efficiency

Published on 2026-02-25 00:49 · Source: PyTorch Blog

🏷️ Hardware 🏷️ LLM On-Premise 🏷️ Fine-Tuning 🏷️ DevOps

DeepSpeed: multimodal training and memory optimization

DeepSpeed, Microsoft's deep learning library, introduces two significant updates focused on optimizing the training of multimodal models and reducing memory consumption.

PyTorch-Identical Backward API

The first update is a backward API identical to PyTorch's. It simplifies training loops for multimodal models, which often combine vision encoders and LLMs, and makes sophisticated parallelism schemes possible with cleaner code, while DeepSpeed handles the performance optimizations transparently. For example, it enables disaggregated hybrid parallelism, which reportedly improved training speed for multimodal AI models by 30%.

DeepSpeed's original backward API required calling model_engine.backward(loss) instead of PyTorch's usual loss.backward(), which limited flexibility. The new API allows combining multiple models, defining separate loss functions, and calling backward on non-scalar tensors with custom gradients, while keeping DeepSpeed optimizations such as ZeRO and offloading.
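The patterns listed above can be sketched in plain PyTorch. This is a minimal illustration, not DeepSpeed code: the two nn.Linear modules are hypothetical stand-ins for a vision encoder and an LLM, and the function names are invented for the example. Under DeepSpeed the models would be wrapped by engines (for instance via deepspeed.initialize), and according to the article the same loss.backward() style of call can then replace model_engine.backward(loss).

```python
import torch
from torch import nn


def build_models(seed=0):
    """Create two tiny modules standing in for a vision encoder and an LLM."""
    torch.manual_seed(seed)
    vision_encoder = nn.Linear(8, 4)   # hypothetical stand-in
    language_model = nn.Linear(4, 2)   # hypothetical stand-in
    return vision_encoder, language_model


def train_step(vision_encoder, language_model, optimizer, images, targets):
    """One step that chains two models and combines two separately defined losses."""
    features = vision_encoder(images)
    outputs = language_model(features)

    task_loss = nn.functional.mse_loss(outputs, targets)
    aux_loss = 0.01 * features.pow(2).mean()  # illustrative regularizer

    optimizer.zero_grad()
    (task_loss + aux_loss).backward()  # the usual PyTorch call
    optimizer.step()
    return task_loss.item(), aux_loss.item()


def backward_with_custom_grad(vision_encoder, images):
    """Call backward on a non-scalar tensor by supplying the upstream gradient."""
    vision_encoder.zero_grad()
    features = vision_encoder(images)
    # A non-scalar tensor needs an explicit gradient of the same shape.
    features.backward(gradient=torch.ones_like(features))
    return vision_encoder.bias.grad


if __name__ == "__main__":
    encoder, llm = build_models()
    params = list(encoder.parameters()) + list(llm.parameters())
    opt = torch.optim.SGD(params, lr=0.1)
    x, y = torch.randn(3, 8), torch.randn(3, 2)
    print(train_step(encoder, llm, opt, x, y))
    print(backward_with_custom_grad(encoder, x))
```

With an all-ones upstream gradient, each bias gradient entry equals the batch size, which makes the custom-gradient path easy to check by hand.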

Low-Precision Training to Reduce Memory Usage

The second update introduces an option to keep all model states (parameters, gradients, and optimizer states) in low precision, such as BF16 or FP16. This substantially reduces the memory footprint, allowing researchers to fine-tune larger models on hardware with limited resources. Integration with torch.autocast is meant to preserve numerical stability during training.

In one test, peak memory fell by 40% while training remained numerically stable: low-precision training with BF16 converged comparably to FP32 training, with significant memory savings.
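To make the memory argument concrete, here is a back-of-the-envelope sketch of per-parameter model-state memory. It assumes an Adam-style optimizer with two states per parameter and the common mixed-precision layout (16-bit weights and gradients plus FP32 master weights and FP32 optimizer states). The article does not give this breakdown, so the byte counts, the function names, and the 7-billion-parameter model size are assumptions for illustration only.

```python
GIB = 2**30


def bytes_per_param(weights, grads, optimizer_states, master_weights=0):
    """Sum the bytes each kind of model state holds per parameter."""
    return weights + grads + optimizer_states + master_weights


def model_state_gib(num_params, per_param_bytes):
    """Total model-state memory in GiB for a given parameter count."""
    return num_params * per_param_bytes / GIB


# Assumed conventional mixed precision with Adam:
# BF16 weights (2) + BF16 grads (2) + FP32 momentum and variance (4 + 4)
# + FP32 master weights (4).
MIXED_PRECISION = bytes_per_param(2, 2, 4 + 4, master_weights=4)

# Assumed all-low-precision layout: every state kept in BF16, no FP32 copy.
ALL_LOW_PRECISION = bytes_per_param(2, 2, 2 + 2)


if __name__ == "__main__":
    num_params = 7_000_000_000  # hypothetical model size
    for label, per_param in [
        ("mixed precision", MIXED_PRECISION),
        ("all low precision", ALL_LOW_PRECISION),
    ]:
        total = model_state_gib(num_params, per_param)
        print(f"{label}: {per_param} bytes/param, {total:.1f} GiB")
```

Under these assumptions the model states shrink from 16 to 8 bytes per parameter. Peak memory also includes activations and temporary buffers, which this count ignores, so a measured peak reduction below 50%, such as the 40% reported, would not be surprising.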

AI-Radar Takeaway

DeepSpeed introduces a PyTorch-identical backward API that simplifies training complex multimodal models and enables advanced parallelism schemes. A new option to keep all model states in lower precision (BF16/FP16) cuts memory usage, allowing fine-tuning of larger models on constrained hardware; one test showed a 40% reduction in peak memory.
