The starting point

In many African cities, the shift to electric mobility isn’t blocked by a lack of motorcycles but by a charging infrastructure that can’t keep up with riders’ urgency. Spiro built a network of stations where a depleted battery is swapped out in minutes, eliminating hours of downtime. The company has now closed a $55 million round led by Chinese fund NewTrails and is nearing a $1 billion valuation. Behind the capital lies an operational logic that speaks directly to those running AI workloads on-premise.

Swapping as a paradigm

Spiro’s model is simple: you don’t charge the battery, you exchange it. The analogy with on-prem LLM workloads is immediate. In a local datacenter or on an edge server, compute resources – GPUs, TPUs, dedicated chips – can’t afford to sit idle for hours while a model loads or while sequential inference queues build up. The operator aims to minimize the time hardware stands still, exactly as a rider can’t afford to miss trips. The swapping parallel explains why architectures built on model pre-loading, aggressive quantization, and task distribution have become pillars of local AI.

Implications for on-prem deployment

Anyone managing LLMs in-house must contend with constrained hardware budgets and the need to keep energy consumption in check. The total cost of ownership (TCO) of an on-prem setup isn’t measured only by GPU acquisition cost but by how long those GPUs remain genuinely productive. The "charging" phase (setup times, data transfers, model reloading) is a hidden cost that can devour returns. Techniques like model caching, multi-model serving with dynamic VRAM allocation, and lightweight containers bring the setup closer to the swapping paradigm: run a workload, then rapidly swap in another, keeping hardware almost always busy. This approach reduces idle time and improves the economic sustainability of a self-hosted environment.
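To make the "hidden cost" of idle time concrete, here is a minimal sketch of how utilization changes the effective price of a GPU hour. The function name, the parameters, and every figure are illustrative assumptions, not data from Spiro or from any real deployment. The model is deliberately simplified: it assumes the machine draws the same power whether busy or idle.

```python
def cost_per_productive_hour(
    hardware_cost: float,
    lifetime_hours: float,
    power_kw: float,
    energy_price_per_kwh: float,
    utilization: float,
) -> float:
    """Total cost divided by the hours the GPU actually spends serving work.

    Simplifying assumption: power draw is constant, so energy is paid for
    every powered-on hour, productive or not.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    energy_cost = lifetime_hours * power_kw * energy_price_per_kwh
    total_cost = hardware_cost + energy_cost
    productive_hours = lifetime_hours * utilization
    return total_cost / productive_hours


# Hypothetical figures: same hardware, different share of productive time.
half_busy = cost_per_productive_hour(10_000, 20_000, 0.5, 0.25, utilization=0.5)
always_busy = cost_per_productive_hour(10_000, 20_000, 0.5, 0.25, utilization=1.0)
```

Under these assumptions, halving utilization doubles the cost of each productive hour, which is why shaving reload and setup time matters as much as the purchase price.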

The sovereignty and control factor

Spiro operates where the power grid is unreliable and dependence on outside providers can be a risk. A similar concern drives organizations that handle sensitive data and choose not to rely on external clouds: data sovereignty demands direct infrastructure control, but it also requires the ability to handle demand spikes without downtime. In an on-prem scenario, model swapping can become as critical as battery swapping: having several optimized models ready to go allows an organization to serve diverse requests without running out of GPU memory. Inference engines and local runtimes such as vLLM or Ollama help build inference pipelines where models are loaded into memory efficiently, reducing wait times.
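The idea of keeping several models "ready to go" can be sketched as a least-recently-used cache constrained by a VRAM budget. This is a toy illustration, not how vLLM or Ollama work internally: the `ModelSwapper` class, the `loader` callback, and the model sizes are all hypothetical, and the loader here only returns a placeholder object rather than touching a GPU.

```python
from collections import OrderedDict
from typing import Callable, Dict, List


class ModelSwapper:
    """Keeps models resident up to a VRAM budget, evicting the least recently used."""

    def __init__(
        self,
        vram_budget_gb: float,
        loader: Callable[[str], object],
        sizes_gb: Dict[str, float],
    ) -> None:
        self.vram_budget_gb = vram_budget_gb
        self.loader = loader          # hypothetical: loads a model by name
        self.sizes_gb = sizes_gb      # hypothetical: VRAM footprint per model
        self.resident: "OrderedDict[str, object]" = OrderedDict()
        self.evicted: List[str] = []  # eviction log, oldest first

    def used_gb(self) -> float:
        return sum(self.sizes_gb[name] for name in self.resident)

    def get(self, name: str) -> object:
        # Cache hit: mark the model as most recently used and return it.
        if name in self.resident:
            self.resident.move_to_end(name)
            return self.resident[name]
        size = self.sizes_gb[name]
        if size > self.vram_budget_gb:
            raise ValueError(f"{name} does not fit in the VRAM budget")
        # Cache miss: evict least recently used models until the new one fits.
        while self.resident and self.used_gb() + size > self.vram_budget_gb:
            victim, _ = self.resident.popitem(last=False)
            self.evicted.append(victim)
        self.resident[name] = self.loader(name)
        return self.resident[name]


# Illustrative run: three 10 GB models competing for a 24 GB card.
swapper = ModelSwapper(
    vram_budget_gb=24,
    loader=lambda name: f"<model {name}>",
    sizes_gb={"chat": 10, "code": 10, "embed": 10},
)
for request in ["chat", "code", "chat", "embed"]:
    swapper.get(request)
```

In this run the "code" model is the one swapped out when "embed" arrives, because "chat" was used more recently. A real deployment would also have to account for the time each reload takes, which is exactly the "charging" cost the swap is meant to minimize.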

A perspective for the Italian market

Spiro’s lesson isn’t limited to emerging economies. Italian SMEs that consider bringing AI in-house also face hardware scarcity and the need to extract value from every euro spent on graphics cards. The swapping concept, applied to software and model management, can inspire architectures where a single workstation with a mid-range GPU serves multiple departments by elastically alternating workloads. Total cost of ownership (TCO) evaluations and analytical frameworks for on-prem, such as those explored on AI-RADAR, help determine whether investing in local infrastructure makes more sense than consuming tokens in the cloud.
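The on-prem versus cloud question can be framed as a break-even volume. The sketch below is a back-of-the-envelope calculation under stated assumptions: hardware is amortized linearly, the cloud is billed per million tokens at a flat rate, and the local machine can actually sustain the required throughput. The function name and all prices are hypothetical and do not come from AI-RADAR or any provider's price list.

```python
def break_even_tokens_per_month(
    hardware_cost: float,
    amortization_months: int,
    monthly_running_cost: float,
    cloud_price_per_million_tokens: float,
) -> float:
    """Monthly token volume above which the local setup is cheaper than the cloud.

    Assumes linear amortization and a flat per-token cloud price.
    """
    if amortization_months <= 0 or cloud_price_per_million_tokens <= 0:
        raise ValueError("months and cloud price must be positive")
    monthly_on_prem = hardware_cost / amortization_months + monthly_running_cost
    return monthly_on_prem / cloud_price_per_million_tokens * 1_000_000


# Hypothetical workstation: 3,600 amortized over 36 months, 50 per month to run,
# compared against a cloud price of 3.00 per million tokens (same currency).
threshold = break_even_tokens_per_month(3_600, 36, 50.0, 3.0)
```

With these invented figures the threshold comes out at roughly 50 million tokens per month. Below that volume the cloud would be cheaper on cost alone, though sovereignty requirements may still justify the local option.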

The African startup, with its unconventional approach, reminds us that operational efficiency is born from constraints. And for those running LLMs on physical hardware, those constraints are daily bread.