Is it cheaper to run an LLM locally or in the cloud?

Local is cheaper only at high, sustained utilization. The GPU is a fixed upfront cost, so cost-per-token falls the more you keep it busy. Below roughly 30-40% utilization, per-second cloud GPUs almost always win.

How do I calculate cost per million tokens locally?

Take total monthly cost (amortized hardware + electricity + maintenance) and divide by the tokens you actually generate that month. The same hardware can cost 10x more per token at 5% utilization than at 60%.

What are the hidden costs of local LLMs?

Electricity and cooling, hardware depreciation, downtime, spare parts, and engineer time for drivers/updates/monitoring. These often exceed the GPU sticker price over a 3-year life.

Does electricity cost matter for a single GPU?

Yes, especially in Europe. A 350W GPU running 24/7 draws ~250 kWh/month; at EU prices that is a meaningful recurring cost, and it compounds with cooling and PSU inefficiency. Power-limiting the GPU recovers much of it for a small speed loss.

Why does batching change LLM serving cost so much?

A GPU serving many concurrent requests with continuous batching (vLLM) can produce 10-20x the total tokens of a single chat stream on the same hardware — dividing cost per token by the same factor. Multi-user workloads make local hardware dramatically more economical.

What is a hybrid local-cloud LLM strategy?

Own hardware sized for your steady base load and rent cloud GPUs or API capacity for bursts, experiments and oversized models. You get owned-hardware economics where utilization is high and cloud flexibility where it is not.

Cost of Running LLMs Locally (2026): Local vs Cloud TCO

"How much does it cost to run an LLM locally?" has no single answer, because the dominant variable is not the hardware — it's how busy you keep it. The most common and most expensive mistake is comparing a GPU's purchase price to a cloud hourly rate. They are not comparable: one is sunk capital, the other is metered consumption. The only comparison that tells the truth is total cost of ownership (TCO) per unit of output — euros per million tokens — over the realistic life of the hardware. This guide builds that number from the ground up at three scales, then shows the levers that swing the verdict.

The full cost stack

A local deployment has five cost layers. Most teams budget only the first and are surprised by the rest:

Layer	What it includes	Often forgotten?
Hardware (CapEx)	GPU, CPU, RAM, PSU, storage, chassis	No
Electricity	GPU + system draw under load and idle	Usually
Cooling & power	Cooling, PSU losses, room/rack, UPS	Yes
Maintenance	Engineer time, drivers, monitoring, spares	Yes
Depreciation	Value lost over 2–4 year useful life (minus resale)	Yes

Worked examples at three scales

Illustrative figures to show the method — plug in your own prices. All assume 3-year straight-line depreciation and ~€0.25/kWh.

Scale 1 — Individual: one used RTX 3090 (24GB)

Hardware: ~€900 rig share → ~€25/month amortized.
Power: ~400W system under load at 15% duty ≈ 45 kWh → ~€11/month.
Maintenance: your own time (unpaid, but real).
Total ≈ €36/month fixed. At hobbyist volumes (say 2–3M tokens/month of actual use) that's ~€12–18 per million tokens — far above a budget cloud API. The honest conclusion: individuals rarely run local to save money; they run local for privacy, offline capability, unlimited experimentation and learning — which are real returns, just not billable ones.
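The Scale 1 arithmetic can be reproduced with a few lines of Python. This is a minimal sketch, not a full TCO model: the function names are illustrative, a month is assumed to be 730 hours, and idle draw outside the duty cycle is ignored.

```python
HOURS_PER_MONTH = 730  # assumption: average month (8,760 h / 12)


def monthly_cost_eur(hardware_eur, system_watts, duty, eur_per_kwh=0.25,
                     life_months=36, maintenance_eur=0.0):
    """Amortized hardware + electricity under load + maintenance, per month."""
    amortized = hardware_eur / life_months
    kwh = system_watts / 1000 * HOURS_PER_MONTH * duty
    return amortized + kwh * eur_per_kwh + maintenance_eur


def eur_per_million_tokens(monthly_eur, tokens_per_month):
    """Monthly bill divided by output, expressed per million tokens."""
    return monthly_eur / (tokens_per_month / 1_000_000)


# Scale 1 inputs from the text: ~EUR 900 rig share, ~400 W at 15% duty.
cost = monthly_cost_eur(hardware_eur=900, system_watts=400, duty=0.15)
print(f"monthly cost: EUR {cost:.2f}")
for tokens in (2_000_000, 3_000_000):
    per_million = eur_per_million_tokens(cost, tokens)
    print(f"{tokens:,} tokens/month: EUR {per_million:.2f} per million")
```

Swapping in the Scale 2 or Scale 3 inputs (and a non-zero maintenance_eur) gives the other two totals.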

Scale 2 — Team: dual-3090 rig serving a 70B at 4-bit

Hardware: ~€2,400 → ~€67/month.
Power: ~250 kWh at 50% duty → ~€63/month.
Maintenance/overhead: ~€40/month.
Total ≈ €170/month. Only the power line moves with load; hardware and maintenance cost the same whether you generate 1 million tokens or 1 billion.
Serving a team through vLLM with batching (see below), sustained output in the tens of millions of tokens/month is realistic → ~€2–8 per million tokens, competitive with or beating equivalent-quality API pricing, with data never leaving the building.

Scale 3 — Company: a 4×48GB inference server

Hardware: ~€25–35k server → ~€700–950/month.
Power: ~1.5–2kW at 60% duty ≈ 650–900 kWh → ~€160–225/month.
Maintenance: real engineer hours now; budget ~€300–500/month.
Total ≈ €1,200–1,700/month. Batched across departments, hundreds of millions of tokens/month are realistic; at 500 million that is about €2.40–3.40 per million tokens at equivalent-model quality, plus compliance/residency benefits that often motivated the project in the first place. This is the scale where on-prem economics genuinely shine, provided the load exists.

The break-even: it's all about utilization

Cloud GPUs and token APIs charge only for what you use, so their cost scales linearly. Owned hardware costs the same idle or maxed. Plot both and they cross at a utilization threshold — empirically often around 30–40% sustained. Below it, you're paying for an expensive idle asset. Above it, the fixed cost is "filled" and every extra token is nearly free — the gap then widens with scale.

Run the 5% scenario honestly before buying: the team rig above at 5% utilization produces a tenth of the tokens, but the bill barely shrinks. Hardware and maintenance are unchanged, and only the power line falls. Cost per token ends up roughly 7–10× worse, losing to any API. The hardware didn't change; the utilization did. Most local-AI cost disappointments are utilization disappointments.
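The crossing point can be estimated with a simple linear model. The sketch below assumes owned cost is a fixed part plus power that scales with utilization, and that rented cost scales purely with utilization. The €0.60/hour rental rate is a hypothetical placeholder rather than a quoted price, and the function name is illustrative.

```python
HOURS_PER_MONTH = 730  # assumption: average month


def break_even_utilization(fixed_eur_month, full_load_power_eur_month,
                           cloud_eur_per_hour):
    """Utilization at which owned and rented monthly costs are equal.

    owned(u)  = fixed + full_load_power * u
    rented(u) = cloud_rate * HOURS_PER_MONTH * u
    Returns None when renting stays cheaper even at 100% utilization.
    """
    rented_full = cloud_eur_per_hour * HOURS_PER_MONTH
    margin = rented_full - full_load_power_eur_month
    if margin <= 0:
        return None
    u = fixed_eur_month / margin
    return u if u <= 1 else None


# Team rig from the text: EUR 67 hardware + EUR 40 maintenance are fixed;
# EUR 63 of power at 50% duty implies about EUR 126 at full load.
# The EUR 0.60/h rental rate is a hypothetical placeholder, not a quote.
u = break_even_utilization(107, 126, cloud_eur_per_hour=0.60)
print(f"break-even utilization: {u:.0%}")
```

The result is very sensitive to the rental rate: at a low enough rate the function returns None, meaning the owned rig never catches up.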

The multiplier everyone misses: batching

Single-stream chat leaves a GPU mostly idle — token generation is memory-bandwidth bound and one user's request uses a sliver of compute. A serving engine with continuous batching (vLLM, TGI) interleaves many concurrent requests on the same card, multiplying total token throughput 10–20× with modest latency impact. Since your costs are fixed, cost-per-token divides by the same factor.

Practical consequence: the local-vs-cloud verdict flips with concurrency. A single-user assistant almost never pays for its GPU; a 20-user internal service on the same GPU often beats every alternative. When estimating, count concurrent demand, not just total volume — and if you have it, serve through vLLM rather than a single-stream runtime, because the economics are night and day.
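To see how the multiplier feeds through to price, here is a rough sketch with the team rig's €170 monthly bill held constant. The single-stream speed (15 tokens/s) and the busy hours (8 hours on 22 working days) are hypothetical inputs chosen for illustration; the 10× factor is the low end of the range quoted above.

```python
def eur_per_million(monthly_eur, tokens_per_second, busy_hours):
    """Cost per million tokens for a given monthly bill and busy time."""
    tokens = tokens_per_second * 3600 * busy_hours
    return monthly_eur / (tokens / 1_000_000)


BUSY_HOURS = 8 * 22      # hypothetical: working hours only
SINGLE_STREAM_TPS = 15   # hypothetical single-user generation speed
BATCH_MULTIPLIER = 10    # low end of the 10-20x range in the text

single = eur_per_million(170, SINGLE_STREAM_TPS, BUSY_HOURS)
batched = eur_per_million(170, SINGLE_STREAM_TPS * BATCH_MULTIPLIER, BUSY_HOURS)
print(f"single stream: EUR {single:.2f} per million tokens")
print(f"batched:       EUR {batched:.2f} per million tokens")
```

The GPU is busy for the same hours in both cases; only the number of tokens produced in those hours changes.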

Electricity, honestly

The 24/7 baseline: a 350W GPU at full tilt around the clock is ~250 kWh/month — €50–90/month at typical EU prices (Italy often sits at the high end). Idle draw (10–30W GPU + 40–80W system) runs €10–25/month just for being on.
Power-limiting is nearly free money: capping a 3090 at ~250–280W instead of its stock 350W (nvidia-smi -pl <watts>, which needs administrator rights) costs ~5–10% inference speed and cuts draw under load by roughly 20–30%. For bandwidth-bound inference, high power states are mostly wasted anyway.
Count the whole chain: PSU losses (~10%), cooling, and in an office, the air-conditioning that removes the heat you just paid to make. In summer you pay twice.
Schedule what can be scheduled: batch jobs (indexing, evals, fine-tunes) moved to night tariffs cut the bill where time-of-use pricing exists.
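The baseline, PSU-loss and idle figures above follow from watts × hours. A small sketch, assuming a 730-hour month and treating the ~10% PSU loss as 90% efficiency; the helper name and the two prices in the loop are illustrative.

```python
HOURS_PER_MONTH = 730  # assumption: average month


def monthly_kwh(watts, duty=1.0, psu_efficiency=1.0):
    """Energy drawn at the wall for a component load of `watts`."""
    return watts / 1000 * HOURS_PER_MONTH * duty / psu_efficiency


baseline = monthly_kwh(350)                       # 350 W GPU, 24/7
with_psu = monthly_kwh(350, psu_efficiency=0.90)  # same load, ~10% PSU loss
idle_low, idle_high = monthly_kwh(50), monthly_kwh(110)  # idle GPU + system

for price in (0.20, 0.30):  # illustrative EUR/kWh values
    print(f"at EUR {price:.2f}/kWh:")
    print(f"  24/7 GPU load: EUR {baseline * price:.0f}/month")
    print(f"  incl. PSU loss: EUR {with_psu * price:.0f}/month")
    print(f"  idle only: EUR {idle_low * price:.0f} to {idle_high * price:.0f}/month")
```

Cooling is left out because it depends on the room rather than the machine.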

Depreciation and resale: gentler than it looks

Straight-line depreciation to zero is the conservative accounting fiction. Reality is kinder: high-VRAM GPUs hold value unusually well because AI demand keeps absorbing used supply — 3090s still sell today for a substantial fraction of their old price years after launch. A realistic model: assume you recover 30–50% of GPU cost at resale after 3 years, which materially lowers true TCO. Two caveats: datacenter cards without display outputs resell into a thinner market, and a genuine architecture break (a new generation with step-change efficiency) can reprice the used market overnight — that's the risk premium of owning.
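The effect of resale on the monthly hardware line is easy to check. In this sketch the resale fraction is applied to the whole ~€2,400 team rig for simplicity, although the paragraph above applies it to the GPU cost only; the function name is illustrative.

```python
def monthly_depreciation(price_eur, resale_fraction=0.0, life_months=36):
    """Straight-line depreciation down to the expected resale value."""
    return price_eur * (1 - resale_fraction) / life_months


# Team rig hardware from the text (~EUR 2,400), with 0%, 30% and 50% recovered.
for fraction in (0.0, 0.3, 0.5):
    monthly = monthly_depreciation(2400, resale_fraction=fraction)
    print(f"resale {fraction:.0%}: EUR {monthly:.2f}/month")
```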

Hidden costs people forget

Engineer time — drivers, CUDA updates, monitoring, the occasional 2am failure. At team scale this is often the largest line after hardware.
Redundancy — one GPU is one point of failure; production needs a spare, a failover plan, or an accepted downtime story.
The experimentation tax — model churn is real: re-downloading, re-quantizing, re-evaluating each new release costs storage and hours.
Opportunity cost — capital tied up in depreciating hardware; and the calendar risk that next year's GPU does the job at half the watts.

What actually swings the verdict — sensitivity

Utilization (dominant): 5% → cloud wins big; 60% + batching → local wins big. Everything else is second-order.
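Electricity price: at €0.10/kWh (some regions/solar), a power-heavy setup like the team rig above costs roughly 25–30% less per month than at €0.30/kWh; the effect is smaller where hardware and maintenance dominate the bill.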
Model size vs need: if an 8–14B fine-tune covers your task, the hardware bar (and bill) collapses versus insisting on 70B for everything. Right-sizing the model is often the biggest single hardware-cost optimization.
Concurrency: as above — multi-user workloads are where owned hardware earns its keep.
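A quick way to compare these levers is to vary one input at a time. The sketch below uses the team rig's figures (about €107 of fixed cost and about 500 kWh/month at full load, both derived from the Scale 2 example) and ignores idle draw; the function name is illustrative.

```python
def relative_cost_per_token(utilization, eur_per_kwh,
                            fixed_eur=107.0, full_load_kwh=500.0):
    """Monthly cost divided by utilization: proportional to cost per token."""
    monthly = fixed_eur + full_load_kwh * utilization * eur_per_kwh
    return monthly / utilization


# Lever 1: utilization, with the price held at EUR 0.25/kWh.
util_swing = (relative_cost_per_token(0.05, 0.25)
              / relative_cost_per_token(0.60, 0.25))
# Lever 2: electricity price, with utilization held at 50%.
price_swing = (relative_cost_per_token(0.50, 0.30)
               / relative_cost_per_token(0.50, 0.10))

print(f"5% vs 60% utilization: {util_swing:.1f}x cost per token")
print(f"EUR 0.30 vs EUR 0.10 per kWh: {price_swing:.2f}x cost per token")
```

Under these assumptions the utilization swing is several times larger than the electricity swing, which is the ordering the list describes.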

The hybrid strategy (what most teams should actually do)

Own hardware sized for your steady base load; rent for everything spiky: cloud GPUs by the second for experiments, oversized one-off jobs and traffic bursts, and API calls for models you can't justify hosting. You get owned-hardware economics where utilization is provably high, cloud flexibility where it isn't, and a natural migration path — when the rented burst becomes constant, that's your signal to buy the next card. Route requests by difficulty (small local model first, big model on escalation) and cache aggressively: both cut tokens before they're ever generated.
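Routing by difficulty with a cache in front can be sketched as follows. Everything here is hypothetical scaffolding: the small and large models are passed in as plain callables, the small model is assumed to return a confidence score alongside its answer, and the cache key is a naive normalization of the prompt.

```python
from typing import Callable, Dict, Tuple

Answer = Tuple[str, float]  # (text, confidence in [0, 1])


def make_router(small: Callable[[str], Answer],
                large: Callable[[str], str],
                min_confidence: float = 0.7) -> Callable[[str], str]:
    """Cache first, then the small local model, then escalate to the large one."""
    cache: Dict[str, str] = {}

    def route(prompt: str) -> str:
        key = " ".join(prompt.lower().split())  # naive normalization
        if key in cache:
            return cache[key]            # no tokens generated at all
        text, confidence = small(prompt)
        if confidence < min_confidence:
            text = large(prompt)         # escalate only when needed
        cache[key] = text
        return text

    return route


# Stub models for demonstration only; real ones would call local/cloud endpoints.
calls = {"small": 0, "large": 0}


def small_stub(prompt: str) -> Answer:
    calls["small"] += 1
    return ("short answer", 0.9 if len(prompt) < 40 else 0.2)


def large_stub(prompt: str) -> str:
    calls["large"] += 1
    return "detailed answer"


route = make_router(small_stub, large_stub)
print(route("What is TCO?"))
print(route("what is  TCO?"))  # same normalized key, served from cache
print(calls)
```

How confidence is obtained in practice (a classifier, a self-check prompt, token log-probabilities) is a design choice this sketch leaves open.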

A quick decision checklist

Estimate your real monthly token volume and concurrency (be honest — most overestimate volume and underestimate idle time).
Compute local TCO/month: (hardware − expected resale) ÷ 36 + power + maintenance.
Divide by realistic token output (with batching if multi-user) → local €/M tokens.
Compare to a cloud token API and a per-second GPU rental at the same volume.
Add a value for privacy/residency/control if it applies — then decide, and revisit yearly as prices move.
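Steps 2 to 4 of the checklist can be put into one function. This is a sketch under stated assumptions: the API price (€5 per million tokens), the rental rate (€0.60/hour) and the rental hours are hypothetical placeholders, the 30% resale value sits at the low end of the range discussed earlier, and the non-monetary value from the last step is left out.

```python
def cheapest_option(tokens_per_month, hardware_eur, resale_eur, power_eur,
                    maintenance_eur, api_eur_per_million,
                    rental_eur_per_hour, rental_hours, life_months=36):
    """Return the cheapest option and EUR per million tokens for each one."""
    millions = tokens_per_month / 1_000_000
    local_monthly = ((hardware_eur - resale_eur) / life_months
                     + power_eur + maintenance_eur)
    options = {
        "local": local_monthly / millions,
        "api": api_eur_per_million,
        "rental": rental_eur_per_hour * rental_hours / millions,
    }
    return min(options, key=options.get), options


# Team rig from the text at 50M tokens/month; the API and rental prices are
# hypothetical placeholders to be replaced with real quotes.
best, options = cheapest_option(
    tokens_per_month=50_000_000,
    hardware_eur=2400, resale_eur=0.3 * 2400,
    power_eur=63, maintenance_eur=40,
    api_eur_per_million=5.0,
    rental_eur_per_hour=0.60, rental_hours=365,
)
for name, price in options.items():
    print(f"{name}: EUR {price:.2f} per million tokens")
print("cheapest:", best)
```

Rerunning it with a tenth of the token volume is a quick way to see how fast the ranking changes.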