In ML, inference means using a trained model to produce predictions — as opposed to training, which creates the model. For LLMs, inference is the generation of tokens in response to a prompt.
Inference Pipeline Stages
Prefill (Prompt Processing)
All input tokens are processed in a single parallel forward pass, which computes and stores the KV cache. This stage is compute-bound and fast: typically ~0.1–0.5 s on a GPU, even for long prompts.
Decode (Token Generation)
Tokens are generated one at a time (autoregressive decoding). Each step reads the full KV cache and appends one new entry to it. Because each step is limited by memory bandwidth rather than compute, decode is the bottleneck for long responses.
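A minimal sketch of both stages using Hugging Face transformers (assuming `torch` and `transformers` are installed; `gpt2` is used only as a small placeholder model). The single forward pass over the prompt is the prefill; the loop that feeds one token plus the cache back in is the decode.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tok("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: all prompt tokens in one forward pass; the KV cache
    # comes back as past_key_values.
    out = model(prompt_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    generated = [next_id]
    # Decode: one token per step, reusing and extending the KV cache.
    for _ in range(20):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_id)

print(tok.decode(torch.cat(generated, dim=-1)[0]))
```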
Key Inference Metrics
| Metric | Definition | Target |
|---|---|---|
| TTFT (ms) | Time To First Token | <200ms for chat UX |
| Tokens/s (decode) | Generation throughput | ≥30 t/s for reading speed |
| Batch throughput | Tokens/s across all users | Depends on concurrent users |
| Cost / 1M tokens | On-prem: electricity + hardware depreciation; cloud: API price | ~$0.01–$0.50 on-prem vs ~$1–$15 cloud |
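A rough sketch for measuring TTFT and decode throughput against any OpenAI-compatible streaming endpoint (vLLM, Ollama, TGI all expose one). The `base_url`, API key, and model name below are placeholders, and counting one streamed chunk as one token is only an approximation.

```python
import time
from openai import OpenAI

# Placeholder endpoint and model; adjust for your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Explain KV caching in two sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first visible token -> TTFT
        chunks += 1
end = time.perf_counter()

ttft_ms = (first_token_at - start) * 1000
# Decode rate measured from the first token onward; one chunk ~ one token.
decode_tps = (chunks - 1) / max(end - first_token_at, 1e-9)
print(f"TTFT: {ttft_ms:.0f} ms, decode: {decode_tps:.1f} tok/s")
```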
Inference Engines for On-Premise
- Ollama — Easiest deployment. REST API. Uses llama.cpp backend. Great for development and single-user production.
- vLLM — Best throughput for multi-user serving, via continuous batching and PagedAttention KV-cache management. OpenAI-compatible API. Requires a CUDA GPU. See the sketch after this list.
- llama.cpp — Raw C++ engine. Runs on CPU, GPU, Apple Metal. GGUF format only. Maximum hardware compatibility.
- TGI (Hugging Face) — Supports GPTQ, AWQ. Docker-native. Used in many enterprise deployments.
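As an illustration of the vLLM bullet above, a minimal offline-batching sketch with vLLM's Python API (the model name is a placeholder; assumes `pip install vllm` and a CUDA GPU with enough VRAM for the chosen model):

```python
from vllm import LLM, SamplingParams

# Placeholder model; any Hugging Face causal LM that fits in VRAM works.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

# vLLM batches these prompts internally (continuous batching + PagedAttention),
# which is where its multi-user throughput advantage comes from.
prompts = [
    "Explain the difference between prefill and decode in one sentence.",
    "What does a KV cache store?",
]
for out in llm.generate(prompts, params):
    print(out.prompt, "->", out.outputs[0].text.strip())
```

The same engine can also be run as an HTTP server exposing the OpenAI-compatible API mentioned above, which is the usual setup for multi-user production.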