Helion, a high-level DSL (Domain Specific Language) designed to simplify writing high-performance machine learning kernels, has introduced a new approach to accelerate its autotuning process. Autotuning is essential for optimizing kernel performance on specific hardware, but it can be time-consuming and often becomes a bottleneck in development.
Bayesian Optimization for Autotuning
The new algorithm, called LFBO (Likelihood-Free Bayesian Optimization) Pattern Search, is based on Bayesian optimization, a machine-learning technique that uses probabilistic models to intelligently select which points to evaluate. Instead of exhaustively exploring all possible configurations, LFBO trains a classification model (a RandomForest) on latency data collected during the search. The model predicts whether a configuration falls within the top 10% by latency, allowing the search to filter out less promising candidates before spending time compiling and benchmarking them.
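To make the idea concrete, here is a minimal sketch of an LFBO-style surrogate using scikit-learn. The configuration encoding, parameter names, and synthetic latency data are illustrative assumptions, not Helion's actual implementation or API; the point is only the pattern of training a top-10% classifier and using its predicted probability as an acquisition score.

```python
# Sketch of a likelihood-free BO surrogate: a RandomForest classifier
# trained to predict whether a candidate configuration lands in the top
# 10% of latencies observed so far. All names and data are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic history: each row encodes a config (e.g. block size, num warps);
# each entry of `latency` is its measured latency in ms (made-up model).
configs = rng.integers(1, 9, size=(200, 2)).astype(float)
latency = configs[:, 0] * 0.5 + configs[:, 1] * 0.2 + rng.normal(0, 0.1, 200)

# Label the lowest-latency 10% of configs as positive examples.
threshold = np.quantile(latency, 0.10)
labels = (latency <= threshold).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(configs, labels)

# Score new candidates: the predicted probability of being top-10% acts as
# the acquisition value, so low-scoring candidates can be skipped without
# ever compiling or benchmarking them.
candidates = rng.integers(1, 9, size=(50, 2)).astype(float)
scores = model.predict_proba(candidates)[:, 1]
best = candidates[np.argmax(scores)]
```

The classifier replaces the explicit probabilistic model of classical Bayesian optimization, which is what makes the approach "likelihood-free": only a binary good/bad signal is learned from the observed latencies.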
Results and Benefits
The implementation of LFBO has led to significant improvements:
- 36.5% reduction in autotuning time and a 2.6% improvement in kernel latency on average on NVIDIA B200 GPUs.
- 25.9% reduction in autotuning time and a 1.7% improvement in kernel latency on AMD MI350 GPUs.
In some cases the gains were larger: for layer-norm kernels on B200, autotuning time was cut by roughly 50%, while Helion FlashAttention kernels saw latency improvements of over 15%.
Challenges of Kernel Autotuning
Kernel autotuning is a complex process due to several factors:
- High-Dimensional Search Space: The number of possible combinations of parameters (block sizes, unroll factors, etc.) is enormous.
- Long Compile Times: Individual configurations can take a long time to compile, so every candidate evaluated carries a real cost.
- Configuration Errors and Timeouts: The search space includes configurations that fail to compile, time out, or produce inaccurate results.
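The high-dimensional search space is easy to underestimate because the parameter choices multiply combinatorially. The knobs and value ranges below are assumed for illustration, not Helion's actual tuning parameters:

```python
# Illustrative only: a handful of hypothetical tuning knobs already yields
# hundreds of combinations, before counting interactions with other options.
import math

search_space = {
    "block_m": [16, 32, 64, 128],
    "block_n": [16, 32, 64, 128],
    "num_warps": [1, 2, 4, 8],
    "num_stages": [1, 2, 3, 4, 5],
    "unroll": [1, 2, 4],
}
total = math.prod(len(v) for v in search_space.values())
print(total)  # 4 * 4 * 4 * 5 * 3 = 960 combinations from just five knobs
```

With compile times measured in seconds or minutes per candidate, benchmarking even a fraction of such a space exhaustively quickly becomes impractical.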
LFBO Pattern Search addresses these challenges by exploring the search space more broadly and focusing on the most promising configurations.
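The pattern-search half of the algorithm can be sketched as a neighborhood walk filtered by the surrogate. The helper names, parameter choices, and scoring cutoff below are hypothetical; the toy `score` function stands in for the trained top-10% classifier's predicted probability:

```python
# Minimal sketch of surrogate-filtered pattern search: perturb one
# parameter at a time around the current best config, then keep only the
# neighbors the surrogate rates as promising. Names are illustrative.

def neighbors(config, choices):
    """Yield configs differing from `config` in exactly one parameter."""
    for key, options in choices.items():
        for value in options:
            if value != config[key]:
                yield {**config, key: value}

def pattern_search_step(config, choices, score, keep=0.5):
    """Return the neighbor configs whose surrogate score passes a cutoff."""
    return [c for c in neighbors(config, choices) if score(c) >= keep]

choices = {"block": [16, 32, 64], "warps": [2, 4, 8]}
current = {"block": 32, "warps": 4}
# Toy surrogate that favors small blocks (stand-in for predict_proba).
promising = pattern_search_step(
    current, choices, lambda c: 1.0 if c["block"] <= 32 else 0.0
)
```

Only the surviving neighbors would then be compiled and benchmarked, which is how the surrogate turns a broad exploration of the space into a cheap one.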
Conclusions
Machine learning, and Bayesian optimization in particular, proves effective at accelerating autotuning in Helion and improves the kernel development experience. The LFBO approach both saves time and discovers faster configurations, paving the way for further improvements through reinforcement-learning techniques and LLMs.