In ML, inference means using a trained model to produce predictions — as opposed to training, which creates the model. For LLMs, inference is the generation of tokens in response to a prompt.
Inference Pipeline Stages
Prefill (Prompt Processing)
All input tokens are processed in a single parallel forward pass, which computes and stores the KV cache. This stage is compute-bound and fast: typically ~0.1–0.5 s on a GPU, even for long prompts.
Decode (Token Generation)
Tokens are generated one at a time (autoregressive decoding). Each step reads the full KV cache and appends one new entry to it. Because each step is limited by memory bandwidth rather than compute, decode is the bottleneck for long responses.
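A minimal sketch of both stages using Hugging Face transformers (assuming `torch` and `transformers` are installed; `gpt2` is used only as a small placeholder model). The single forward pass over the prompt is the prefill; the loop that feeds one token plus the cache back in is the decode.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tok("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: all prompt tokens in one forward pass; the KV cache
    # comes back as past_key_values.
    out = model(prompt_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    generated = [next_id]
    # Decode: one token per step, reusing and extending the KV cache.
    for _ in range(20):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_id)

print(tok.decode(torch.cat(generated, dim=-1)[0]))
```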
Key Inference Metrics
| Metric | Definition | Target |
|---|---|---|
| TTFT (ms) | Time To First Token | <200ms for chat UX |
| Tokens/s (decode) | Generation throughput | ≥30 t/s for reading speed |
| Batch throughput | Tokens/s across all users | Depends on concurrent users |
| Cost / 1M tokens | On-prem: electricity + hardware depreciation; cloud: API price | ~$0.01–$0.50 on-prem vs ~$1–$15 cloud |
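A rough sketch for measuring TTFT and decode throughput against any OpenAI-compatible streaming endpoint (vLLM, Ollama, TGI all expose one). The `base_url`, API key, and model name below are placeholders, and counting one streamed chunk as one token is only an approximation.

```python
import time
from openai import OpenAI

# Placeholder endpoint and model; adjust for your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Explain KV caching in two sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first visible token -> TTFT
        chunks += 1
end = time.perf_counter()

ttft_ms = (first_token_at - start) * 1000
# Decode rate measured from the first token onward; one chunk ~ one token.
decode_tps = (chunks - 1) / max(end - first_token_at, 1e-9)
print(f"TTFT: {ttft_ms:.0f} ms, decode: {decode_tps:.1f} tok/s")
```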
Inference Engines for On-Premise
- Ollama — Easiest deployment. REST API. Uses llama.cpp backend. Great for development and single-user production.
- vLLM — Best throughput for multi-user serving, via continuous batching and PagedAttention KV-cache management. OpenAI-compatible API. Requires a CUDA GPU. See the sketch after this list.
- llama.cpp — Raw C++ engine. Runs on CPU, GPU, Apple Metal. GGUF format only. Maximum hardware compatibility.
- TGI (Hugging Face) — Supports GPTQ, AWQ. Docker-native. Used in many enterprise deployments.
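As an illustration of the vLLM bullet above, a minimal offline-batching sketch with vLLM's Python API (the model name is a placeholder; assumes `pip install vllm` and a CUDA GPU with enough VRAM for the chosen model):

```python
from vllm import LLM, SamplingParams

# Placeholder model; any Hugging Face causal LM that fits in VRAM works.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

# vLLM batches these prompts internally (continuous batching + PagedAttention),
# which is where its multi-user throughput advantage comes from.
prompts = [
    "Explain the difference between prefill and decode in one sentence.",
    "What does a KV cache store?",
]
for out in llm.generate(prompts, params):
    print(out.prompt, "->", out.outputs[0].text.strip())
```

The same engine can also be run as an HTTP server exposing the OpenAI-compatible API mentioned above, which is the usual setup for multi-user production.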