Inference

Core

The process of running a trained model to generate output. Inference cost is hardware electricity plus depreciation per generated token, which is the ongoing operational cost of on-premise LLMs.

In ML, inference means using a trained model to produce predictions — as opposed to training, which creates the model. For LLMs, inference is the generation of tokens in response to a prompt.
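A minimal inference sketch using the Hugging Face transformers library; gpt2 is used only because it is small, and any causal LM checkpoint works the same way:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a trained model and run it forward to produce tokens (no gradients needed).
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():                      # inference only: no backprop, no weight updates
    out = model.generate(**inputs, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
```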

Inference Pipeline Stages

Prefill (Prompt Processing)

All input tokens are processed in parallel in a single forward pass. This stage is fast, typically ~0.1–0.5 s on a GPU even for long prompts, and it computes and stores the KV cache.
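A sketch of prefill with transformers (the model is a small stand-in, an assumption for illustration): one forward pass over the whole prompt returns the logits plus the KV cache in `past_key_values`.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "Summarize the following document: ..." * 20     # a longer prompt
input_ids = tok(prompt, return_tensors="pt").input_ids

t0 = time.time()
with torch.no_grad():
    out = model(input_ids, use_cache=True)   # all prompt tokens processed in one parallel pass
past_key_values = out.past_key_values        # the KV cache: one (K, V) pair per layer
print(f"prefill of {input_ids.shape[1]} tokens took {time.time() - t0:.3f}s")
```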

Decode (Token Generation)

Tokens are generated one at a time (autoregressive decoding). Each step attends over the full KV cache, emits one new token, and appends that token's keys/values to the cache. This stage is the latency bottleneck for long responses.
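A minimal greedy decode loop, continuing the prefill sketch above (transformers, small stand-in model): each iteration feeds only the newest token id plus the cached keys/values, and the cache grows by one entry per step.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tok("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    out = model(input_ids, use_cache=True)                 # prefill: builds the KV cache
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    new_ids = [next_id]
    for _ in range(20):                                    # decode: one token per iteration
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values                         # cache grows by one entry per step
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        new_ids.append(next_id)

print(tok.decode(torch.cat(new_ids, dim=-1)[0]))
```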

Key Inference Metrics

| Metric | Definition | Target |
|---|---|---|
| TTFT (ms) | Time To First Token | < 200 ms for chat UX |
| Tokens/s (decode) | Generation throughput | ≥ 30 tokens/s for reading speed |
| Batch throughput | Tokens/s across all users | Depends on concurrent users |
| Cost / 1M tokens | Electricity + depreciation | $0.01–$0.50 on-prem vs $1–15 cloud |
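A rough way to measure TTFT and decode throughput against a local OpenAI-compatible server, plus a back-of-the-envelope cost estimate. The URL, model name, one-token-per-chunk counting, and all cost numbers are illustrative assumptions, not exact values.

```python
import time
import requests

# Assumed local OpenAI-compatible endpoint (e.g. a vLLM server); placeholders only.
URL = "http://localhost:8000/v1/completions"
payload = {"model": "llama-3-8b", "prompt": "Explain KV caching in two sentences.",
           "max_tokens": 256, "stream": True}

start = time.time()
first_token_at = None
chunks = 0
with requests.post(URL, json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if not line or line == b"data: [DONE]":
            continue
        if first_token_at is None:
            first_token_at = time.time()       # first streamed chunk -> TTFT
        chunks += 1                            # roughly one token per SSE chunk

ttft_ms = (first_token_at - start) * 1000
decode_s = time.time() - first_token_at
print(f"TTFT ~{ttft_ms:.0f} ms, decode ~{chunks / decode_s:.1f} tokens/s")

# Cost per 1M tokens: (electricity + depreciation per hour) / tokens per hour.
# Assumed: 700 W draw, $0.12/kWh, $0.50/h depreciation, 2000 tokens/s batched throughput.
watts, price_per_kwh, hourly_depreciation, tokens_per_s = 700, 0.12, 0.50, 2000
cost_per_hour = watts / 1000 * price_per_kwh + hourly_depreciation
print(f"~${cost_per_hour / (tokens_per_s * 3600) * 1e6:.2f} per 1M tokens")
```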

Inference Engines for On-Premise

  • Ollama — Easiest deployment. REST API (see the request sketch after this list). Uses llama.cpp backend. Great for development and single-user production.
  • vLLM — Best throughput for multi-user. PagedAttention for KV cache. OpenAI-compatible API. Requires CUDA GPU.
  • llama.cpp — Raw C++ engine. Runs on CPU, GPU, Apple Metal. GGUF format only. Maximum hardware compatibility.
  • TGI (Hugging Face) — Supports GPTQ, AWQ. Docker-native. Used in many enterprise deployments.
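As a concrete example of the first engine above, a minimal non-streaming call to Ollama's REST API; the model name assumes `ollama pull llama3` has already been run.

```python
import requests

# Ollama listens on port 11434 by default; /api/generate returns the full completion
# as JSON when streaming is disabled.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Why is the sky blue?", "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```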