Baseten and the inference gold rush: a $1.5 billion round in the making

The news lands like a thunderclap in a Silicon Valley already buzzing with AI: Baseten, a California startup specializing in LLM inference services, is reportedly on the verge of closing a $1.5 billion funding round that would lift its valuation to $13 billion. The deal, reported by sources close to the matter, comes just months after its last mega-round and suggests that the “inference gold rush” is very much alive.

A heavyweight investment

Baseten is no stranger to those tracking AI infrastructure. The company offers a cloud platform that lets organizations run large models without managing the underlying hardware. In an era of GPU shortages and exploding training costs, this business model is attracting massive capital. The new round, if confirmed, would place the company in the top tier of unicorns by valuation, signaling investor confidence in surging demand for inference as a service.

But what’s behind these figures? The logic is straightforward: while model training remains the domain of a few large players, inference (the stage where models generate responses) is becoming the real bottleneck for mass adoption. Serving large request volumes with acceptable latency requires optimized infrastructure, and Baseten positions itself as an intermediate layer between models and enterprise applications.

The control dilemma: on-premise vs. cloud

For organizations evaluating LLM deployment, the news raises concrete questions. Relying on a service like Baseten means outsourcing inference infrastructure to a third party, with benefits in scalability and predictable OpEx. But for organizations with stringent data sovereignty requirements, GDPR obligations, or operations in regulated industries, public cloud isn’t always viable.

This brings the classic trade-off into focus. On one hand, on-premise (or, more broadly, self-hosted) deployment ensures full control over models, data, and pipelines, avoiding contractual lock-in and unpredictable variable costs. On the other, it demands significant upfront capital expenditure (CapEx) on specialized hardware, in-house expertise for orchestration (from Kubernetes to serving frameworks like vLLM), and careful management of compute capacity, especially when long-context models push GPUs to their last gigabyte of VRAM and quantization (for example, from FP16 down to INT8) is needed to make them fit.
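To make the VRAM pressure concrete, here is a rough back-of-the-envelope sketch in Python. It estimates memory for model weights and for the KV cache, the two terms that usually dominate inference memory. The model shape used (7 billion parameters, 32 layers, 32 KV heads, head dimension 128) is a hypothetical example chosen for illustration, not a figure from any vendor, and the estimate ignores activations, framework overhead, and fragmentation.

```python
GIB = 2**30  # bytes per gibibyte


def weights_gib(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights at a given precision."""
    return n_params * bytes_per_param / GIB


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, batch_size: int,
                 bytes_per_value: float = 2.0) -> float:
    """Approximate KV-cache memory: one key and one value vector
    per layer, per KV head, per token, per sequence in the batch."""
    values = 2 * n_layers * n_kv_heads * head_dim * context_len * batch_size
    return values * bytes_per_value / GIB


if __name__ == "__main__":
    # Hypothetical 7B-parameter model shape (illustrative assumption).
    n_params, n_layers, n_kv_heads, head_dim = 7e9, 32, 32, 128

    for label, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0)]:
        print(f"{label} weights: {weights_gib(n_params, bytes_per_param):.2f} GiB")

    for context_len in (4096, 32768):
        cache = kv_cache_gib(n_layers, n_kv_heads, head_dim, context_len, batch_size=1)
        print(f"KV cache at {context_len} tokens (FP16, batch 1): {cache:.2f} GiB")
```

Under these assumptions, halving the bytes per weight halves the weight footprint, while the KV cache grows linearly with context length and batch size, which is why long-context workloads can exhaust VRAM even after the weights have been quantized.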

The surge of inference services like Baseten is driving market maturity, which can indirectly benefit on-premise deployments too. Competition is lowering per-token costs and improving runtime efficiency, and those advances can often be adopted in-house. Companies choosing cloud today may later migrate to hybrid or fully self-hosted architectures if the numbers justify it.
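What “the numbers” might look like can be sketched with a simple break-even calculation. The Python snippet below compares an amortized self-hosted cost against a per-token cloud price. Every figure in it (hardware cost, amortization period, monthly operating cost, price per million tokens) is a made-up placeholder, and the model deliberately ignores utilization limits, staffing, and price changes over time.

```python
def self_hosted_monthly_cost(capex: float, amortization_months: int,
                             monthly_opex: float) -> float:
    """Hardware cost spread evenly over its amortization period,
    plus recurring costs such as power, hosting, and maintenance."""
    return capex / amortization_months + monthly_opex


def break_even_tokens_per_month(capex: float, amortization_months: int,
                                monthly_opex: float,
                                price_per_million_tokens: float) -> float:
    """Monthly token volume at which the cloud bill equals the
    amortized self-hosted cost. Above it, self-hosting is cheaper
    on paper, provided the hardware can actually serve that volume."""
    monthly = self_hosted_monthly_cost(capex, amortization_months, monthly_opex)
    return monthly / price_per_million_tokens * 1_000_000


if __name__ == "__main__":
    # All values below are hypothetical placeholders, not market data.
    tokens = break_even_tokens_per_month(
        capex=360_000,                 # GPU servers, one-off purchase
        amortization_months=36,        # three-year write-off
        monthly_opex=2_000,            # power, hosting, maintenance
        price_per_million_tokens=2.0,  # assumed cloud inference price
    )
    print(f"Break-even volume: {tokens:,.0f} tokens per month")
```

With these placeholder inputs the break-even point is 6 billion tokens per month. The useful part is the shape of the calculation rather than the result: as competition pushes per-token prices down, the break-even volume rises, and the case for self-hosting rests more on control and compliance than on cost alone.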

Looking to 2025 and beyond

Beyond the record-breaking number, Baseten’s round is a thermometer for a sector that is quickly taking shape. Inference is becoming a utility-like commodity, akin to electricity, but with critical differences: models evolve, workloads are heterogeneous, and latency is a competitive differentiator. For IT decision-makers, the challenge is to build an architecture that can absorb innovation without becoming locked in to a single vendor.

AI-RADAR will continue tracking these developments, offering analytical frameworks on /llm-onpremise to navigate such choices. The real question isn’t whether inference as a service will grow, but how enterprises will balance flexibility and control in a landscape where capital moves faster than technology.