TorchInductor Integrates CuteDSL: Advanced GEMM Optimization for LLMs on NVIDIA GPUs

TorchInductor Welcomes CuteDSL to Optimize Critical LLM Operations

TorchInductor, PyTorch's Just-In-Time (JIT) compiler, has recently gained CuteDSL as its fourth backend for General Matrix Multiplications (GEMMs). The new backend complements the existing Triton, CUTLASS (C++), and cuBLAS backends and aims to further improve the performance of Large Language Model (LLM) workloads on NVIDIA hardware.

GEMM optimization is crucial for LLM efficiency, because these operations account for the majority of the compute during the forward pass of Transformer-based models. CuteDSL addresses the growing need for finer-grained control over latest-generation hardware while keeping compilation fast and the maintenance burden on development teams low. This combination positions it as a long-term strategic investment for the PyTorch ecosystem.

The Strategy Behind GEMM Optimization and CuteDSL's Advantages

Not all operations benefit equally from a new backend. For memory-bound operations, such as elementwise math, activations, and reductions, Triton already generates high-quality code thanks to its block-level programming model. GEMMs, however, present a different challenge. They require extremely precise control over the hardware features introduced by each new GPU generation, including tile sizes, explicit shared memory management, warp-level scheduling, and, on newer GPUs like the B200, thread block clusters and distributed shared memory.
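The difference between the two classes of operations can be made concrete with a back-of-the-envelope arithmetic-intensity calculation (FLOPs per byte moved). The sketch below is illustrative only: the function names are hypothetical, and it assumes BF16 data (2 bytes per element) and that every operand is read or written exactly once.

```python
def elementwise_intensity(bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for y = a + b: 1 FLOP per element, 2 reads and 1 write."""
    return 1 / (3 * bytes_per_elem)


def gemm_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for C[m, n] = A[m, k] @ B[k, n] under ideal data reuse."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


if __name__ == "__main__":
    print("elementwise add:", elementwise_intensity())
    print("4096^3 GEMM:    ", gemm_intensity(4096, 4096, 4096))
```

A large square GEMM performs orders of magnitude more work per byte than an elementwise operation, but only if the kernel actually achieves that data reuse through tiling and shared memory, which is where hardware-specific control matters. The ratio also depends strongly on the shape, so small-batch GEMMs behave differently from large square ones.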

CuteDSL addresses these complexities through a custom Python-to-MLIR compiler. Although it is built on the same abstractions as CUTLASS C++ (the same tile algebra, memory hierarchy primitives, and epilogue fusion model), CuteDSL compiles at speeds comparable to TorchInductor's other backends. This resolves the compilation overhead of the CUTLASS C++ backend, which requires a full nvcc invocation for each kernel variant and therefore makes it impractical to evaluate many candidates during autotuning. Furthermore, NVIDIA actively contributes optimized kernel templates, giving CuteDSL an early advantage in adopting hardware-specific optimizations for new GPU generations.

Architecture and Performance Results on NVIDIA B200

The CuteDSL backend, referred to as NVGEMM, integrates into TorchInductor's autotuning pipeline in an additive manner. When the compiler encounters a matrix multiplication, it queries cutlass_api, an NVIDIA-maintained Python library containing the full space of CuteDSL GEMM kernel configurations. To manage the hundreds of compatible configurations, the backend uses nvMatmulHeuristics, an NVIDIA analytical performance model, to select the most promising candidates (typically five) for compilation and benchmarking on the target hardware. Because these candidates are benchmarked alongside those of the other backends and the system automatically selects the fastest one, enabling NVGEMM cannot cause performance regressions.
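The selection flow described above can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not TorchInductor's implementation: `Candidate`, `shortlist`, and `autotune` are hypothetical names, `predicted_us` stands in for whatever estimate an analytical model would provide, and `benchmark` is any caller-supplied function that returns a measured time.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Candidate:
    backend: str         # e.g. "triton", "cublas", "cutedsl" (labels only)
    name: str            # hypothetical kernel configuration id
    predicted_us: float  # stand-in for an analytical model's estimate


def shortlist(configs: Iterable[Candidate], k: int = 5) -> List[Candidate]:
    """Keep the k configurations the heuristic predicts to be fastest."""
    return sorted(configs, key=lambda c: c.predicted_us)[:k]


def autotune(
    existing: Iterable[Candidate],
    new_configs: Iterable[Candidate],
    benchmark: Callable[[Candidate], float],
    k: int = 5,
) -> Candidate:
    """Benchmark the existing candidates plus the shortlist; return the fastest."""
    pool = list(existing) + shortlist(new_configs, k)
    timings = {c: benchmark(c) for c in pool}  # each candidate is measured once
    return min(pool, key=timings.__getitem__)
```

Because the new candidates are only ever added to the pool, the winner is never slower (as measured) than the best candidate from the existing backends. This is the additive property the paragraph above relies on.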

Benchmarks, conducted on a single NVIDIA B200 GPU at 850W with PyTorch nightly and CUDA 13.1, showed significant improvements. At the kernel level, CuteDSL demonstrated speedups of up to 1.73x for BF16 operations in decode scenarios, up to 1.78x for MXFP8 on medium-sized shapes, and up to 1.6x for NVFP4 on smaller shapes. For end-to-end LLM inference with vLLM (on models like Llama 3.1 8B, Qwen3 32B, and Llama 3.3 70B), the NVGEMM integration reduced latency by up to 6.5% for BF16 and up to 4.2% for NVFP4, with more consistent gains at batch sizes between 16 and 64. For organizations evaluating on-premise LLM deployments, low-level optimization like that offered by CuteDSL helps maximize hardware ROI while keeping data under local control. AI-RADAR offers analytical frameworks on /llm-onpremise to evaluate these trade-offs.
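The gap between the kernel-level speedups and the smaller end-to-end gains reported above is what Amdahl's law predicts when only part of the runtime is accelerated. The helper below is a minimal sketch with a hypothetical function name; `gemm_fraction` is an input the reader must supply from their own profiling, not a figure taken from these benchmarks.

```python
def latency_reduction(gemm_fraction: float, kernel_speedup: float) -> float:
    """Fractional end-to-end latency reduction under Amdahl's law.

    gemm_fraction:  share of total runtime spent in the accelerated kernels (0..1)
    kernel_speedup: speedup of those kernels (e.g. 1.5 means 1.5x faster)
    """
    if not 0.0 <= gemm_fraction <= 1.0:
        raise ValueError("gemm_fraction must be between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    # New runtime is (1 - f) + f / s, so the reduction is f * (1 - 1 / s).
    return gemm_fraction * (1.0 - 1.0 / kernel_speedup)
```

For instance, `latency_reduction(0.5, 2.0)` evaluates to 0.25: doubling the speed of kernels that take half the runtime removes a quarter of the total latency. The "up to" kernel speedups also apply only to specific shapes, which further dilutes the end-to-end effect.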

Future Prospects and Impact on On-Premise Deployments

The development roadmap for the CuteDSL backend is ambitious. Future priorities include benchmarking epilogue fusion, which will allow evaluating the profitability of fusing downstream operations directly into the GEMM kernel, and implementing asynchronous parallel precompilation with persistent caching. This latter feature will further reduce autotuning times by eliminating the need to recompile already optimized kernels.
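The idea behind parallel precompilation with a persistent cache can be illustrated with a standard-library sketch. Everything here is hypothetical: `compile_kernel` is a placeholder for an expensive compile step, the JSON file format is invented for the example, and the function blocks until all compilations finish, whereas a truly asynchronous design would not.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Dict, Iterable, List, Tuple


def compile_kernel(config: str) -> str:
    """Placeholder for an expensive compile step; returns a fake artifact id."""
    return f"artifact-for-{config}"


def precompile(
    configs: Iterable[str], cache_path: Path, max_workers: int = 4
) -> Tuple[Dict[str, str], List[str]]:
    """Compile only configurations missing from the on-disk cache, in parallel.

    Returns the updated cache and the list of configurations compiled this call.
    """
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    # Deduplicate while preserving order, then skip anything already cached.
    missing = [c for c in dict.fromkeys(configs) if c not in cache]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for config, artifact in zip(missing, pool.map(compile_kernel, missing)):
            cache[config] = artifact
    cache_path.write_text(json.dumps(cache, indent=2, sort_keys=True))
    return cache, missing
```

On a second call with the same configurations and cache file, nothing is recompiled. That is the saving a persistent cache is meant to deliver across autotuning runs.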

Plans also include the development of exportable configuration caches for portability across environments and support for Ahead-Of-Time (AOT) compilation for inference deployments, eliminating autotuning overhead at runtime. In the long term, the goal is for CuteDSL to achieve full performance parity with the CUTLASS C++ backend on new hardware generations, enabling its replacement and simplifying the TorchInductor codebase. These developments are particularly relevant for infrastructure architects and DevOps leads managing AI/LLM workloads in self-hosted environments, where efficiency, control, and TCO are determining factors.