Miami-based startup Subquadratic has made a claim that sounds like a breakthrough: it says it has solved the mathematical problem that has been the main performance bottleneck for Large Language Models (LLMs) for nearly a decade. The story recalls overblown promises of the past, but this time the company brings something concrete: independent tests that, at least in part, support its approach. The news, reported by The Next Web, signals a possible turning point in the evolution of transformers.

The quadratic barrier

To grasp the significance of the claim, we need to look at how attention works. Attention is the mechanism that allows models to weigh relationships between tokens in a sequence. In traditional transformers, this process has O(n²) complexity in the sequence length n: every token is compared with every other token, so as the context grows, computational cost and, in a straightforward implementation, memory requirements increase quadratically. In practice, doubling the context window roughly quadruples the attention load on GPU compute and VRAM, limiting the context lengths and model sizes that can be served for inference without resorting to expensive cloud clusters.
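The quadratic growth can be seen in a minimal sketch of standard scaled dot-product attention for a single head, written in plain Python. This is a textbook illustration, not Subquadratic's code; the helper `toy_inputs` and the returned score counter are assumptions added only to make the scaling visible.

```python
import math


def naive_attention(q, k, v):
    """Single-head scaled dot-product attention over plain Python lists.

    q, k, v: lists of n vectors, each of dimension d.
    Returns (outputs, scores_computed); the counter is illustrative only.
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    outputs = []
    scores_computed = 0
    for qi in q:
        # One score per (query, key) pair: n queries times n keys.
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        scores_computed += len(scores)
        # Numerically stable softmax over the scores of this query.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Output is the attention-weighted sum of the value vectors.
        out = [sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(len(v[0]))]
        outputs.append(out)
    return outputs, scores_computed


def toy_inputs(n, d=4):
    """Hypothetical helper: deterministic dummy vectors for the demo."""
    return [[((i + 1) * (c + 1)) % 7 / 7.0 for c in range(d)] for i in range(n)]


for n in (8, 16, 32):
    x = toy_inputs(n)
    _, count = naive_attention(x, x, x)
    print(n, count)  # the count is n * n, so it quadruples when n doubles
```

Each doubling of the sequence length multiplies the number of query-key scores by four, which is the cost the article refers to.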

Subquadratic promises to circumvent this bottleneck with an algorithm that reduces complexity to sub-quadratic levels – hence the name – without sacrificing prediction quality. Technical details are still scarce, but the existence of independent benchmarks that support the claims shifts the story from speculation to verifiable engineering.

Why it matters for on-premise deployments

The stakes are high for organizations evaluating on-premise deployment of LLMs for reasons of data sovereignty, cost control, or latency. Bringing attention below quadratic complexity would mean handling longer contexts with the same hardware setup, or achieving higher throughput for a given GPU footprint. For workloads that currently require cards with tens of gigabytes of VRAM, an efficiency gain could make more modest accelerators viable, lowering Total Cost of Ownership and overall energy consumption.
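A back-of-the-envelope calculation shows why context length translates so directly into VRAM. The sketch below assumes a naive implementation that stores the full score matrix for one layer, with 32 attention heads and 2-byte (16-bit) values; these figures are illustrative assumptions, not numbers from Subquadratic or The Next Web, and optimized kernels can avoid storing the full matrix.

```python
GIB = 1024 ** 3


def score_matrix_bytes(context_len, num_heads=32, bytes_per_value=2):
    """Bytes to hold one layer's full attention score matrix.

    Assumes every (query, key) score is stored for every head, which is
    what a straightforward implementation does. Defaults are assumptions.
    """
    return context_len * context_len * num_heads * bytes_per_value


for n in (8192, 16384, 32768):
    print(f"{n:>6} tokens: {score_matrix_bytes(n) / GIB:.0f} GiB per layer")
# prints:
#   8192 tokens: 4 GiB per layer
#  16384 tokens: 16 GiB per layer
#  32768 tokens: 64 GiB per layer
```

Under these assumptions, each doubling of the context quadruples the footprint, which is why long contexts quickly outgrow a single card.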

It’s not just about raw power. Cutting the computational footprint also favors scenarios where inference runs on edge devices or in environments with thermal constraints. Of course, every new technique introduces trade-offs: sub-quadratic attention variants often need to be validated on multiple architectures and datasets, and moving from research to production deployment requires mature frameworks and community support.
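To make the trade-off concrete, here is a sketch of one well-known sub-quadratic pattern, causal sliding-window attention, in which each token attends only to a fixed number of recent tokens. It is shown purely as an example of the general family; Subquadratic has not published its method, and nothing here should be read as describing it. The function names and the window size are assumptions.

```python
def full_causal_scores(n):
    """Scores computed when token i attends to all tokens 0..i."""
    return n * (n + 1) // 2


def sliding_window_scores(n, window):
    """Scores computed when token i attends to at most `window` recent tokens."""
    return sum(min(i + 1, window) for i in range(n))


for n in (16, 32, 64):
    print(n, full_causal_scores(n), sliding_window_scores(n, window=4))
# prints:
# 16 136 58
# 32 528 122
# 64 2080 250
```

The windowed count grows roughly linearly with the sequence length, but tokens outside the window are never compared directly. That loss of long-range comparisons is the kind of quality trade-off that has to be validated across architectures and datasets.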

Caution and outlook

The comparison with Theranos, noted by the publication itself, calls for prudence. AI history is littered with bold announcements followed by disappointments. Subquadratic will have to demonstrate not only that its algorithm works at scale, but also that it can be integrated without upending existing training and serving pipelines. For now, the startup has produced “receipts” that, for the first time, give numerical substance to the proposal. If confirmations come from multiple independent labs, we will be looking at a decisive piece of the puzzle to make self-hosted AI more accessible and sustainable.