AI Inference Redefines the Chip Market: New Opportunities for Startups

The artificial intelligence landscape has reached a significant turning point. The focus is increasingly shifting from training new models to serving them, a phase known as inference. This change in emphasis represents a crucial opportunity for AI chip startups eager to carve out market share in a field traditionally dominated by giants like Nvidia. Unlike training, inference presents far more heterogeneous workloads, requiring a variable mix of compute capacity, memory, and bandwidth. This diversity paves the way for specialized hardware solutions that address specific needs with greater efficiency.

Heterogeneous Architectures for Diverse Workloads

The increasing heterogeneity of inference has led to the development of disaggregated architectures, in which different hardware components handle specific phases of the process. A striking example is the approach Nvidia adopted with its acquihire of Groq. Groq's LPUs (Language Processing Units), built around an SRAM-heavy architecture, excel at token generation (the decode phase), outperforming GPUs in speed. However, their limited compute capacity and older technology compromised their scalability. Nvidia resolved this constraint by moving the more computationally intensive prefill phase to its own GPUs, while keeping the bandwidth-constrained decode operations on its new LPUs.
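
To make the disaggregated pattern concrete, here is a minimal sketch of a scheduler that routes the two phases to different hardware pools. It assumes nothing about Nvidia's actual software stack; all class and method names (GPUPool, LPUPool, run_prefill, run_decode) are hypothetical stand-ins.

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int   # processed in parallel during prefill
    max_new_tokens: int  # generated one at a time during decode

class GPUPool:
    """Stand-in for compute-rich accelerators handling prefill."""
    def run_prefill(self, n_tokens: int) -> dict:
        # Prefill is one large parallel matmul over the whole prompt;
        # it produces the KV cache that decode will consume.
        return {"kv_cache_tokens": n_tokens}

class LPUPool:
    """Stand-in for SRAM-heavy accelerators handling decode."""
    def run_decode(self, kv_cache: dict, n_new: int) -> int:
        # Decode emits one token per step and re-reads the weights each
        # time, so it rewards memory bandwidth over raw compute.
        return kv_cache["kv_cache_tokens"] + n_new

def serve(req: Request, prefill: GPUPool, decode: LPUPool) -> int:
    kv_cache = prefill.run_prefill(req.prompt_tokens)       # compute-bound phase
    return decode.run_decode(kv_cache, req.max_new_tokens)  # bandwidth-bound phase

print(serve(Request(prompt_tokens=512, max_new_tokens=128), GPUPool(), LPUPool()))
# -> 640 (total tokens in the simulated sequence)
```

The real engineering cost hides in the hand-off: the KV cache produced by prefill must move between the two pools fast enough not to erase the gains from specialization.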

This combination is not unique to Nvidia. AWS has also announced its own disaggregated compute platform, which uses its custom Trainium accelerators for prefill and Cerebras Systems' wafer-scale accelerators for decode. Intel, too, has explored this path, proposing a reference design that pairs GPUs for prefill with SambaNova's new RDUs (Reconfigurable Dataflow Units) for decode. So far, most of the AI chip startups' successes have come on the decode side, where the speed of SRAM, though limited in capacity, proves to be a decisive advantage.
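
A rough roofline-style estimate shows why decode is where SRAM shines. At batch size 1, every generated token requires streaming essentially all model weights for only about two FLOPs per parameter, so token rate is capped by memory bandwidth rather than compute. All figures below are illustrative assumptions, not vendor specifications.

```python
# Why single-stream decode is bandwidth-bound (illustrative numbers only).
params = 70e9                 # 70B-parameter model
bytes_per_param = 2           # FP16/BF16 weights
weight_bytes = params * bytes_per_param   # streamed once per decode step
flops_per_token = 2 * params  # ~2 FLOPs per parameter per token

hbm_bandwidth = 3.0e12        # ~3 TB/s off-chip HBM (assumed)
sram_bandwidth = 80e12        # tens of TB/s aggregate on-chip SRAM (assumed)

print(f"HBM-bound decode:  {hbm_bandwidth / weight_bytes:6.1f} tok/s")
print(f"SRAM-bound decode: {sram_bandwidth / weight_bytes:6.1f} tok/s")
print(f"Arithmetic intensity: {flops_per_token / weight_bytes:.1f} FLOPs/byte")
# ~1 FLOP/byte is far below any accelerator's compute-to-bandwidth
# ratio, so extra FLOPs are wasted; only bandwidth raises tok/s.
```

The flip side is the capacity problem noted above: holding some 140 GB of weights in SRAM means sharding the model across many chips, which is precisely what limits scalability.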

Innovation Beyond Silicon: Optical Accelerators

However, startups are not limited to optimizing silicon-based architectures. Lumai, for example, unveiled its optical inference accelerator, which uses light rather than electrons to perform the matrix multiplication operations fundamental to most machine learning workloads. This hybrid electro-optical architecture promises significantly lower power consumption than purely digital solutions. Lumai expects its next-generation Iris Tetra systems to deliver one exaOPS of AI performance within a 10 kW power budget by 2029.
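
Taking the company's stated target at face value, the implied efficiency is straightforward to compute; the digital-accelerator comparison in the closing comment is an order-of-magnitude assumption, not a measured benchmark.

```python
# Implied efficiency of Lumai's stated target: 1 exaOPS within 10 kW.
target_ops = 1e18          # 1 exaOPS (company claim)
power_budget_w = 10e3      # 10 kW (company claim)

efficiency = target_ops / power_budget_w
print(f"Implied efficiency: {efficiency / 1e12:.0f} TOPS/W")  # -> 100 TOPS/W

# For scale (assumed): today's digital accelerators deliver on the
# order of single-digit TOPS/W at datacenter power levels, so the
# target implies a one-to-two order-of-magnitude efficiency gain.
```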

Initially, the company is positioning the chip as a standalone alternative to GPUs for compute-bound inference workloads, such as batch processing. Longer term, Lumai also plans to use its optical accelerators as prefill processors. Although the architecture is still in its early stages of development, it can already run billion-parameter models such as Llama 3.1 8B and 70B. The UK-based startup has already opened its chips to neoclouds and hyperscalers for evaluation, signaling potential interest in large-scale deployments.
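
The "compute-bound, such as batch processing" framing can be made concrete: batching lets a single read of the weights be reused across many requests, raising arithmetic intensity until compute, rather than bandwidth, becomes the ceiling. The accelerator figures below are assumptions chosen only to illustrate the crossover.

```python
# How batching shifts decode from bandwidth-bound to compute-bound.
params = 8e9               # e.g. an 8B-parameter model
bytes_per_param = 2        # FP16 weights
flops_per_token = 2 * params

peak_flops = 1e15          # 1 PFLOPS (assumed accelerator)
bandwidth = 2e12           # 2 TB/s   (assumed accelerator)
ridge = peak_flops / bandwidth   # intensity needed to saturate compute

for batch in (1, 8, 64, 512):
    # Weights are read once per step and reused across the whole batch:
    intensity = (batch * flops_per_token) / (params * bytes_per_param)
    regime = "compute-bound" if intensity >= ridge else "bandwidth-bound"
    print(f"batch {batch:4d}: {intensity:7.1f} FLOPs/byte -> {regime}")
```

In this toy model the intensity equals the batch size, so only large batches cross the assumed ridge point of 500 FLOPs/byte, which is exactly the regime where raw matrix-multiply throughput, optical or otherwise, pays off.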

An Alternative Approach and Deployment Implications

Not all AI chip startups share the enthusiasm for disaggregated architectures. Tenstorrent, for example, unveiled its RISC-V-based Galaxy Blackhole compute platforms, and CEO Jim Keller expressed skepticism about the disaggregated inference approach. "Every company in the industry is pairing up to build the accelerator accelerator accelerator. CPUs run code. GPUs accelerate CPUs. TPUs accelerate GPUs. LPUs accelerate TPUs. And so on. This leads to complex solutions which are unlikely to be compatible with changes in AI models and uses. At Tenstorrent, we thought something more general and simpler would work," he stated.

This perspective highlights a fundamental debate in the industry: the pursuit of maximum efficiency through extreme specialization versus the need for generality and simplicity to ensure compatibility and longevity. For CTOs, DevOps leads, and infrastructure architects evaluating deployment options, especially in self-hosted or on-premise contexts, these trade-offs are crucial to weigh. The choice between highly specialized architectures and more versatile solutions affects not only Total Cost of Ownership (TCO) and current performance but also the ability to adapt to future AI models and workloads, along with data sovereignty and air-gapped environment requirements. AI-RADAR offers analytical frameworks at /llm-onpremise for a deeper evaluation of these complex scenarios.
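
As one illustration of the kind of trade-off analysis involved, the sketch below reduces the comparison to a naive cost per million tokens built from capex amortization and power draw. Every input is a placeholder to be replaced with real quotes and measured throughput; none of the numbers describes an actual vendor.

```python
# Naive $/million-tokens comparison. All inputs are placeholders.
def cost_per_mtok(capex_usd, lifetime_yr, power_kw, tok_per_s,
                  usd_per_kwh=0.10, utilization=0.7):
    """Amortized hardware + energy cost per million generated tokens."""
    seconds = lifetime_yr * 365 * 24 * 3600 * utilization
    tokens = tok_per_s * seconds
    energy_cost = power_kw * (seconds / 3600) * usd_per_kwh
    return (capex_usd + energy_cost) / (tokens / 1e6)

# Hypothetical systems -- illustrative numbers, not vendor data:
general = cost_per_mtok(capex_usd=250_000, lifetime_yr=4,
                        power_kw=10, tok_per_s=5_000)
special = cost_per_mtok(capex_usd=400_000, lifetime_yr=4,
                        power_kw=6, tok_per_s=12_000)
print(f"general-purpose: ${general:.3f} / Mtok")
print(f"specialized:     ${special:.3f} / Mtok")
# A specialized part can win on $/token today yet lose overall if a
# model-architecture shift strands it before amortization completes.
```

The missing variable, and the one Keller's argument turns on, is the probability that the workload still matches the hardware at the end of the amortization period.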