KernelAgent: Hardware-Guided GPU Kernel Optimization

The PyTorch team has introduced KernelAgent, an open-source agent-based system for optimizing GPU kernels. This system integrates hardware performance signals into a closed-loop multi-agent workflow to guide the optimization of Triton kernels.

In evaluation, KernelAgent delivered a 2.02x speedup over kernels generated by earlier versions of the system and an average 1.56x speedup over torch.compile. It outperformed torch.compile on 65% of KernelBench's L1 tasks and reached 89% of roofline efficiency on an NVIDIA H100.
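As a rough illustration of what "roofline efficiency" means here, the sketch below computes the fraction of roofline-attainable throughput a kernel reaches. The peak numbers are commonly cited H100 SXM figures used only for illustration, not values from the article; substitute your own hardware's peaks.

```python
def roofline_efficiency(achieved_tflops, flops, bytes_moved,
                        peak_tflops=989.0, peak_bw_tbs=3.35):
    """Fraction of the roofline-attainable throughput a kernel achieves.

    Illustrative peaks: ~989 TFLOP/s (FP16 dense) and ~3.35 TB/s HBM
    bandwidth are commonly quoted H100 SXM numbers (assumptions here).
    """
    intensity = flops / bytes_moved                          # FLOPs per byte
    attainable = min(peak_tflops, peak_bw_tbs * intensity)   # roofline ceiling
    return achieved_tflops / attainable

# A memory-bound kernel: 1 FLOP/byte, so bandwidth sets the ceiling (3.35 TFLOP/s).
eff = roofline_efficiency(achieved_tflops=2.8, flops=1e12, bytes_moved=1e12)  # ≈ 0.84
```

At low arithmetic intensity the ceiling is `peak_bw * intensity` (memory-bound); past the ridge point it is the compute peak, which is why the same achieved TFLOP/s can mean very different efficiencies for different kernels.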

Optimizing GPU kernels is critical for modern AI workloads, where performance is often limited by kernel efficiency. KernelAgent automates the traditionally skill-intensive optimization process by profiling kernels, diagnosing bottlenecks, and proposing optimizations.

Optimization Workflow

KernelAgent automates the workflow followed by experienced engineers, breaking it down into a set of cooperating agents. Each agent is responsible for a well-defined stage of the optimization loop, forming a hardware-guided feedback system.

The workflow includes the following stages:

Profile → Diagnose → Prescribe → Orchestrate → Explore → Measure

Each stage produces structured outputs that directly feed into the next, enabling fast, data-driven iteration.

How Data Flows Through the System

  • Profiling: The Profiling Agent uses NVIDIA Nsight Compute (NCU) to capture hardware-level performance metrics.
  • Diagnosis: The Diagnose Agent interprets profiling metrics to classify the kernel's dominant performance bottleneck.
  • Prescribing Fixes: The Analyzer Agent generates concrete optimization recommendations, taking into account GPU specifications.
  • Orchestration: The Orchestrator Agent synthesizes current diagnostics with historical optimization data to formulate a search strategy.
  • Exploration: The Optimization Manager executes the exploration phase, maintaining top-performing kernels and exploring different fixes in parallel.
  • Measurement: The Benchmarking Agent validates correctness and measures real performance for each kernel variant.
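To make the Profile-to-Diagnose handoff concrete, here is a minimal classifier over NCU-style "Speed of Light" utilization percentages. The metric names resemble NCU's (verify against your NCU version), and the thresholds are illustrative assumptions, not KernelAgent's actual rules.

```python
def classify_bottleneck(metrics):
    """Classify a kernel's dominant limiter from NCU-style metrics.

    `metrics` maps metric names to percent-of-peak values. Names follow
    NCU's conventions; the 80/40 thresholds are illustrative only.
    """
    sm = metrics.get("sm__throughput.avg.pct_of_peak_sustained_elapsed", 0.0)
    dram = metrics.get("dram__throughput.avg.pct_of_peak_sustained_elapsed", 0.0)
    occupancy = metrics.get("sm__warps_active.avg.pct_of_peak_sustained_active", 0.0)
    if dram > 80 and dram > sm:
        return "memory-bound"          # DRAM near peak, compute units underused
    if sm > 80:
        return "compute-bound"         # SM pipelines near peak
    if occupancy < 40:
        return "latency-bound (low occupancy)"  # neither peak reached, few warps
    return "mixed / undetermined"
```

A real Diagnose stage would weigh many more counters (cache hit rates, stall reasons, issue efficiency), but the shape is the same: numeric hardware signals in, a structured bottleneck label out for the prescribing stage.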

Conclusion

KernelAgent demonstrates that deep-agent principles extend naturally to performance optimization. By adding hardware profiling and working memory to the loop, and letting multiple agents develop and explore different optimization paths, the system pushes verified kernels from "correct" to "correct and fast."
