vLLM (Kwon et al., 2023) is an open-source LLM inference server that achieves up to 24× higher throughput than naive HuggingFace Transformers. The gain comes from eliminating KV cache memory fragmentation with PagedAttention, a technique borrowed from virtual memory paging in operating systems.
PagedAttention: The Core Innovation
In standard inference, each request reserves a contiguous block of VRAM for its KV cache, sized for the maximum sequence length. If a request generates fewer tokens than that, the remainder is wasted; in conventional serving systems, 60–80% of KV cache memory can be lost to this kind of fragmentation. PagedAttention instead manages the KV cache in fixed-size pages (like OS memory pages), allocating pages on demand and allowing them to be non-contiguous; see the sketch after this list. This enables:
- Higher batch sizes: more requests fit simultaneously in VRAM
- Prefix caching: a shared system prompt's KV cache is computed once and reused across requests
- Beam search memory efficiency via copy-on-write page sharing
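To make the paging idea concrete, here is a minimal sketch in plain Python. All names (PAGE_SIZE, PagePool, BlockTable) are invented for illustration; vLLM's real implementation lives in its scheduler and CUDA kernels, but the bookkeeping is the same: each sequence maps logical pages to physical pages that are allocated on demand and need not be contiguous.

```python
PAGE_SIZE = 16  # tokens per KV-cache page (illustrative value)

class PagePool:
    """Shared pool of physical page slots in the KV cache."""
    def __init__(self, num_pages: int):
        self.free = list(range(num_pages))

    def allocate_page(self) -> int:
        return self.free.pop()  # any free physical page will do

class BlockTable:
    """Maps one sequence's logical pages to physical pages, filled on demand."""
    def __init__(self, pool: PagePool):
        self.pool = pool
        self.physical_pages: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> tuple[int, int]:
        # A new page is allocated only when the current one fills up, so memory
        # grows with the actual sequence length, not the reserved maximum.
        if self.num_tokens % PAGE_SIZE == 0:
            self.physical_pages.append(self.pool.allocate_page())
        slot = (self.physical_pages[-1], self.num_tokens % PAGE_SIZE)
        self.num_tokens += 1
        return slot  # (physical page, offset within page)

pool = PagePool(num_pages=1024)
seq = BlockTable(pool)
for _ in range(40):
    seq.append_token()
print(len(seq.physical_pages))  # 3 pages for 40 tokens, possibly non-contiguous
```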
Key vLLM Features
Continuous Batching
Scheduling happens at the granularity of a single decode step: finished sequences leave the batch immediately and waiting requests join right away, instead of the GPU idling until an entire batch drains. Critical for production serving.
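A toy sketch of this scheduling policy (the names serve, running, and waiting are invented for illustration; this is not vLLM's scheduler):

```python
from collections import deque

def serve(waiting: deque, max_batch: int = 8):
    """Iteration-level batching: admit and evict requests at every decode step."""
    running = []
    while waiting or running:
        # Admit new requests whenever a slot is free, not only between batches.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())

        for req in running:
            req["generated"] += 1  # stand-in for one decode step

        # Evict finished sequences immediately so their slots free up.
        running = [r for r in running if r["generated"] < r["max_tokens"]]

requests = deque({"generated": 0, "max_tokens": n} for n in (4, 32, 8))
serve(requests)
```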
OpenAI-Compatible API
Drop-in replacement for the OpenAI API. The official openai client (legacy openai.ChatCompletion or the current chat.completions interface) works unchanged once its base URL points at http://localhost:8000/v1.
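For example, with the current openai Python SDK (v1+), only the base URL and a placeholder API key change; the model name must match whatever the server was launched with:

```python
from openai import OpenAI

# vLLM accepts any API key unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```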
Tensor Parallelism
Split the model across multiple GPUs: --tensor-parallel-size 4. Throughput scales close to linearly with GPU count; a fast interconnect such as NVLink is needed for the best results.
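The same setting is exposed in the offline Python API as tensor_parallel_size; the model below is only an example and needs four GPUs to load this way:

```python
from vllm import LLM

# Shard the model's weights across 4 GPUs (offline equivalent of the CLI flag).
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", tensor_parallel_size=4)
```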
Quantization Support
GPTQ, AWQ, FP8, and INT8 are supported; pass e.g. --quantization awq. FP8 KV cache: --kv-cache-dtype fp8.
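A sketch of the offline equivalent, assuming a hypothetical AWQ export of the model (the repository name below is a placeholder, not a real checkpoint):

```python
from vllm import LLM

llm = LLM(
    model="<org>/Llama-3.1-8B-Instruct-AWQ",  # placeholder: any AWQ checkpoint
    quantization="awq",    # must match how the checkpoint was quantized
    kv_cache_dtype="fp8",  # store the KV cache in FP8 to save VRAM
)
```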
vLLM vs Ollama
|  | vLLM | Ollama |
|---|---|---|
| Best for | Production, multi-user, high throughput | Development, single-user, ease of use |
| Backend | CUDA-first (ROCm and CPU backends also available) | llama.cpp (any HW) |
| Formats | HuggingFace, GPTQ, AWQ | GGUF only |
| Setup complexity | Medium | Minimal (one binary) |
| Concurrent users | Excellent (PagedAttention) | Limited |
Quick Start
pip install vllm
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct --gpu-memory-utilization 0.90
Note: --quantization awq only applies to AWQ-quantized checkpoints, so it is omitted for this FP16 model.
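An offline smoke test with vLLM's Python API, using the same model and memory setting, is a quick way to confirm the weights load before standing up the server:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", gpu_memory_utilization=0.90)
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["The key idea behind PagedAttention is"], params)
print(outputs[0].outputs[0].text)
```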