LLMs and Code: Why Can't ROCm and Intel Close the Gap with CUDA?

The community experimenting with language models on local hardware has posed a question as simple as it is uncomfortable: if LLMs are so good at writing code, why can’t entire software stacks like AMD’s ROCm or Intel’s solutions close the gap with CUDA? The answer, beyond the headlines, reveals the dynamics of a market where artificial intelligence struggles to accelerate its own hardware independence.

The real moat: beyond code generation

LLM-assisted programming is a powerful accelerator for micro-tasks, refactoring, and boilerplate, but an accelerated computing stack is not merely a collection of well-written functions. CUDA, first released in 2007, carries nearly two decades of optimization that starts at the silicon level, runs through libraries like cuBLAS and cuDNN and through its profiling tools, and culminates in a third-party ecosystem that has shaped the entire deep learning industry. Generating portions of kernels or library wrappers is useful, but recreating the deep integration between drivers, compilers, and hardware demands a systems-level effort that goes beyond the current capabilities of language models.

AMD has made significant strides with ROCm, especially on server-grade GPUs like the Instinct line, but consumer support remains fragmented, and software compatibility is often pursued through translation layers such as HIP, which mirrors the CUDA API at the source level. These layers ease porting, but they add maintenance complexity and do not carry over the performance tuning done for NVIDIA hardware. Intel has embarked on the oneAPI path, an open framework aimed at unification, yet its optimizations for LLM workloads are still maturing. In both cases, what’s missing is the “it just works” quality that NVIDIA sells at a premium.
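To make the compatibility point concrete, here is a toy sketch of what source-level translation looks like. AMD’s HIP deliberately mirrors many CUDA runtime names (cudaMalloc becomes hipMalloc, and so on), so a first porting pass is largely mechanical renaming. The function toy_hipify and its six-entry table are hypothetical and written only for this illustration; the real hipify tools cover far more of the API and handle many edge cases this sketch ignores.

```python
import re

# A tiny, hand-picked subset of CUDA-to-HIP renames (illustrative, not exhaustive).
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
    "cudaFree": "hipFree",
}

# Longest names first, so the alternation never settles for a shorter prefix.
_NAMES = sorted(CUDA_TO_HIP, key=len, reverse=True)
_PATTERN = re.compile(r"\b(" + "|".join(re.escape(n) for n in _NAMES) + r")\b")


def toy_hipify(source: str) -> str:
    """Rename whole-word CUDA identifiers to their HIP counterparts."""
    return _PATTERN.sub(lambda m: CUDA_TO_HIP[m.group(1)], source)


cuda_source = """#include <cuda_runtime.h>
cudaMalloc(&d_buf, n);
cudaMemcpy(d_buf, h_buf, n, cudaMemcpyHostToDevice);
cudaDeviceSynchronize();
cudaFree(d_buf);
"""

print(toy_hipify(cuda_source))
```

Renaming calls is the easy part, and it is exactly the kind of work an LLM or a script does well. What it does not transfer is the tuning behind those calls: block sizes, memory layouts, and library kernels chosen for a different architecture.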

The weight of legacy in the silicon

Another often overlooked factor is the hardware itself. NVIDIA GPUs were designed with general-purpose compute features that co-evolved with CUDA, so architectural choices became interdependent with software abstractions. Porting the same optimizations to another architecture, such as AMD’s CDNA or Intel’s Xe, requires rethinking kernels from the ground up, and that cannot be improvised with automatic code generation. Even with dedicated fine-tuning or quantization pipelines, the inference performance gap on non-NVIDIA hardware is not merely a matter of lines of code, but of joint software-hardware design.
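One small, well-known instance of this interdependence is execution width: NVIDIA GPUs schedule threads in warps of 32, while AMD’s CDNA GPUs use 64-wide wavefronts. The sketch below is a pure-Python simulation, not GPU code, of a shuffle-based tree reduction; the helper names shuffle_down and lane0_sum are hypothetical. It shows how a kernel that hard-codes a starting offset of 16 (half of 32) silently sums only half of a 64-lane wavefront.

```python
def shuffle_down(values, offset):
    """Simulate a shuffle-down: lane i reads the value held by lane i + offset.

    Lanes that would read past the end of the group keep their own value.
    """
    width = len(values)
    return [
        values[i + offset] if i + offset < width else values[i]
        for i in range(width)
    ]


def lane0_sum(values, start_offset):
    """Run a tree reduction and return what lane 0 holds at the end."""
    vals = list(values)
    offset = start_offset
    while offset > 0:
        shifted = shuffle_down(vals, offset)
        vals = [own + other for own, other in zip(vals, shifted)]
        offset //= 2
    return vals[0]


warp = list(range(32))        # one value per lane, 32-wide group
wavefront = list(range(64))   # one value per lane, 64-wide group

print(lane0_sum(warp, 16))       # 496: the full sum of 0..31
print(lane0_sum(wavefront, 16))  # 496: lanes 32..63 never reach lane 0
print(lane0_sum(wavefront, 32))  # 2016: the full sum of 0..63
```

The fix here is a single constant, but the same assumption tends to be spread across tiling, shared-memory layout, and occupancy choices, which is why a port usually means re-deriving the kernel rather than editing it.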

Implications for the on-premise world

Anyone evaluating self-hosted or on-premise deployments collides with this imbalance every time they look at the Total Cost of Ownership (TCO). The premium paid for cards like the A100 or H100 stems not only from silicon scarcity, but also from the confidence that training and inference pipelines will work from day one, backed by a mature community, guides, and tooling. In contexts where data sovereignty demands local and air-gapped nodes, the temptation to reduce dependence on a single vendor exists, but it remains blocked by the cost of migration and debugging. Tools like vLLM or Ollama simplify inference, but most of the value still lies at the lower layers, where the CUDA ecosystem rules.
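The trade-off between cheaper hardware and migration effort can be written down as simple arithmetic. The sketch below is a minimal model, assuming TCO is just hardware cost plus one-off migration labor plus yearly operating cost. Every figure in it is an invented placeholder, not a real price or measurement, and the names StackOption and break_even_migration_hours are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StackOption:
    name: str
    hardware_cost: float    # purchase price for the whole deployment
    yearly_opex: float      # power, cooling, support contracts
    migration_hours: float  # one-off engineering hours to port and debug
    hourly_rate: float      # loaded cost of one engineering hour

    def tco(self, years: int) -> float:
        """Hardware + one-off migration labor + operating cost over `years`."""
        migration = self.migration_hours * self.hourly_rate
        return self.hardware_cost + migration + self.yearly_opex * years


def break_even_migration_hours(incumbent, challenger, years):
    """Most engineering hours the challenger can absorb before costing more."""
    without_migration = challenger.hardware_cost + challenger.yearly_opex * years
    return (incumbent.tco(years) - without_migration) / challenger.hourly_rate


# Placeholder numbers chosen only to make the arithmetic easy to follow.
incumbent = StackOption("incumbent stack", 400_000, 50_000, 0, 120)
challenger = StackOption("alternative stack", 250_000, 50_000, 2_000, 120)

print(incumbent.tco(3))    # 550000
print(challenger.tco(3))   # 640000
print(break_even_migration_hours(incumbent, challenger, 3))  # 1250.0
```

With these made-up inputs, the cheaper hardware only pays off if porting and debugging stay under 1,250 engineering hours. The point is the shape of the calculation, in which migration labor sits on the same line as the hardware discount, rather than any particular result.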

An open but slow game

The arrival of solutions like Triton, which lets developers write GPU kernels in a Python-based language and leaves hardware-specific code generation to the compiler, and the growing investment of hyperscalers in custom silicon are gradually shifting the terms of competition. However, the speed at which LLMs advance does not automatically translate into instant maturity for alternative stacks. The community that currently relies on NVIDIA for its AI adventures dreams of more affordable prices, but the road to genuine competition runs through years of systems development, not through clever prompts. For those who govern on-premise architectures, the era of NVIDIA lock-in is still far from over, and that must be factored into any long-term TCO calculation.