TokenSpeed-kernel: Portable APIs and High-Performance Kernels for Multi-Silicon LLM Inference

LLM inference stacks have long relied on an unspoken pact: each serving engine accumulates fast paths for silicon, quantization, and model architecture until the runtime turns into a maze of special cases. TokenSpeed-kernel upends that logic. Rather than scattering that complexity across the runtime, it seals it inside a subsystem with a stable public API and a selection mechanism that picks a suitable kernel at run time.

The missing layer between runtime and silicon

The design bet is straightforward: the runtime should know nothing about the backend. When a model needs an attention or Mixture-of-Experts operation, it calls functions like mha_prefill, mha_decode_with_kvcache, or moe_apply. A selector then checks the platform’s capabilities and the traits declared by every registered kernel, discards incompatible ones, and ranks the candidates by priority. The result is a stable callable for the model-format-hardware combination. Under the hood, specialized implementations (Gluon for AMD, CuteDSL or TensorRT-LLM for NVIDIA) stay completely isolated, each with its own architecture and tensor shape constraints.
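The filter-then-rank selection described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the TokenSpeed-kernel API: the names Platform, KernelEntry, register, and select, the trait fields, and the priority values are all assumptions made for the example, and the kernel bodies are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class Platform:
    """Hypothetical description of the hardware the runtime is on."""
    vendor: str  # e.g. "amd" or "nvidia"
    arch: str    # e.g. "cdna4"


@dataclass(frozen=True)
class KernelEntry:
    """Hypothetical registry record: a kernel plus the traits it declares."""
    op: str
    name: str
    fn: Callable[..., str]
    vendors: FrozenSet[str]  # empty set means "any vendor"
    archs: FrozenSet[str]    # empty set means "any architecture"
    dtypes: FrozenSet[str]   # empty set means "any dtype"
    priority: int


REGISTRY: List[KernelEntry] = []


def register(op, name, *, vendors=(), archs=(), dtypes=(), priority=0):
    """Decorator that records a kernel and its declared traits."""
    def decorator(fn):
        REGISTRY.append(KernelEntry(op, name, fn, frozenset(vendors),
                                    frozenset(archs), frozenset(dtypes),
                                    priority))
        return fn
    return decorator


def _accepts(declared, actual):
    # An empty trait set places no constraint on that dimension.
    return not declared or actual in declared


def select(op, platform, dtype):
    """Discard incompatible kernels, then return the highest-priority one."""
    candidates = [
        k for k in REGISTRY
        if k.op == op
        and _accepts(k.vendors, platform.vendor)
        and _accepts(k.archs, platform.arch)
        and _accepts(k.dtypes, dtype)
    ]
    if not candidates:
        raise LookupError(f"no kernel registered for {op!r} on {platform}")
    return max(candidates, key=lambda k: k.priority)


@register("mha_prefill", "portable_reference", priority=0)
def _portable_prefill(*args):
    return "portable"  # placeholder body


@register("mha_prefill", "amd_cdna4_specialized", vendors={"amd"},
          archs={"cdna4"}, dtypes={"bf16", "fp8"}, priority=10)
def _amd_prefill(*args):
    return "amd"  # placeholder body


chosen = select("mha_prefill", Platform("amd", "cdna4"), "bf16")
print(chosen.name)
```

In this sketch the caller only ever names the operation; the specialized entry wins on a matching platform and dtype, and the portable entry is the fallback everywhere else.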

The contrast with monolithic approaches is concrete. Adding support for a new silicon target doesn’t require threading conditionals into model code. Writing a faster kernel for a specific attention shape won’t force you to touch the serving stack. TokenSpeed-kernel even ships standalone benchmarking and numerics tools, so kernel development can happen outside the full server with tight iteration cycles.
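To give a sense of what a numerics check outside the server can look like, here is a minimal sketch that compares a candidate implementation against a reference on the same inputs. It does not reproduce the interface of TokenSpeed-kernel's own tools, which is not described here. The function names and the tolerance are assumptions, and softmax stands in for a real attention kernel: the candidate uses the single-pass "online" formulation with a running maximum and a rescaled running sum.

```python
import math


def softmax_reference(xs):
    """Two-pass softmax: subtract the global max, then normalize."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_online(xs):
    """Single-pass accumulation with a running max and rescaled sum."""
    running_max = float("-inf")
    running_sum = 0.0
    for x in xs:
        new_max = max(running_max, x)
        # Rescale the sum accumulated so far to the new maximum.
        running_sum = (running_sum * math.exp(running_max - new_max)
                       + math.exp(x - new_max))
        running_max = new_max
    return [math.exp(x - running_max) / running_sum for x in xs]


def check_numerics(candidate, reference, inputs, atol=1e-12):
    """Return (passed, worst_abs_error) for candidate vs. reference."""
    got = candidate(inputs)
    want = reference(inputs)
    worst = max(abs(g - w) for g, w in zip(got, want))
    return worst <= atol, worst


# Hypothetical test vector with a wide dynamic range.
scores = [0.5, -3.0, 12.0, 7.25, -40.0, 11.9]
passed, worst = check_numerics(softmax_online, softmax_reference, scores)
print(passed, worst)
```

The same pattern extends to real kernels by swapping in the kernel under test and a trusted reference, and by choosing a tolerance appropriate to the data type.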

AMD MI355X: the proving ground

The testbed is GPT-OSS 120B, a modern model that combines sliding-window attention, attention sinks, and MoE layers with MXFP4 weights and FP8 activations. On a single AMD MI355X GPU (CDNA4 architecture), Gluon attention kernels deliver prefill throughput 1.4–2.3× higher than the portable Triton baseline and 10–30% above the AITER solution with a CK backend. For MoE, where the decode bottleneck often lies in routing and kernel launch overhead, the Gluon modules use two distinct paths: a warp-decode path for tiny batches that fuses routing and GEMM, and a grouped GEMM for medium batches with a reduced pipeline. At the smallest batch sizes the gap is stark (1.7–2.1× over Triton, 1.1–1.6× over AITER); at medium loads AITER edges ahead, and Gluon lands between 0.9× and 1.4× the Triton baseline's performance, so it is not always faster there. End-to-end throughput improves by 1.6–3.6× over the original Triton path.
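The split between the two MoE paths amounts to a dispatch on batch size. The sketch below shows the shape of that decision. The function name, the path labels, and both thresholds are hypothetical: the article does not say where "tiny" ends or what happens beyond "medium", so the fallback branch is an assumption as well.

```python
def choose_moe_path(num_tokens, tiny_max=8, medium_max=256):
    """Pick an MoE decode path from the batch size.

    tiny_max and medium_max are hypothetical cut-offs chosen for
    illustration; real values would come from benchmarking.
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    if num_tokens <= tiny_max:
        # Tiny batches: fused routing + GEMM to limit launch overhead.
        return "warp_decode"
    if num_tokens <= medium_max:
        # Medium batches: grouped GEMM across experts.
        return "grouped_gemm"
    # Beyond medium: not covered by the article; assumed fallback.
    return "fallback"


for n in (1, 8, 64, 1024):
    print(n, choose_moe_path(n))
```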

What it means for on-premise deployments

The open plugin architecture of TokenSpeed-kernel has implications beyond software engineering. For organizations weighing on-premise deployment, where hardware choice is driven by TCO, availability, and data sovereignty, a single API layer that abstracts both AMD and NVIDIA reduces vendor lock-in. The CDNA4-optimized kernels are already released as a separate package (tokenspeed-kernel-amd), usable by engines like vLLM without pulling in the full TokenSpeed runtime. The community can therefore bring hardware-specific acceleration to different serving stacks, which lowers the barrier for teams that want to build or manage a mixed GPU fleet on-premise. That fits TCO evaluations in which AMD often competes on price/performance for inference.

It’s not just flexibility. The clean runtime-kernel separation, paired with profiling and numerics tools, lets teams validate performance on real workload shapes before committing to a vendor or a GPU generation. In a landscape where new models and quantization formats emerge every few months, the ability to specialize kernels without rewriting the serving stack is an agility multiplier.

Kernels as products, not shortcuts

TokenSpeed-kernel normalizes a mindset shift: kernels are no longer internal artifacts to be guarded, but public, installable, and profilable components. The central registry, operator APIs, and selection mechanisms turn hardware optimization work into a disciplined, verifiable, and reusable process. In an ecosystem heading toward ever more heterogeneous inference—across silicon, format, and model architecture—a hard boundary between what the runtime asks and how the silicon answers is no longer a luxury: it’s the prerequisite for scaling LLM serving without accumulating technical debt.