vLLM (Kwon et al., 2023) is an open-source LLM inference server that achieves up to 24× higher throughput than naive HuggingFace Transformers. The gain comes from eliminating KV cache memory fragmentation with PagedAttention, a technique borrowed from virtual memory paging in operating systems.
PagedAttention: The Core Innovation
In standard inference, each request reserves a contiguous block of VRAM for its KV cache, sized for the maximum sequence length. If a request generates fewer tokens than that, the remainder is wasted; in conventional serving systems, 60–80% of KV cache memory can be lost to this kind of fragmentation. PagedAttention instead manages the KV cache in fixed-size pages (like OS memory pages), allocating pages on demand and allowing them to be non-contiguous; see the sketch after this list. This enables:
- Higher batch sizes: more requests fit simultaneously in VRAM
- Prefix caching: a shared system prompt's KV cache is computed once and reused across requests
- Beam search memory efficiency via copy-on-write page sharing
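To make the paging idea concrete, here is a minimal sketch in plain Python. All names (PAGE_SIZE, PagePool, BlockTable) are invented for illustration; vLLM's real implementation lives in its scheduler and CUDA kernels, but the bookkeeping is the same: each sequence maps logical pages to physical pages that are allocated on demand and need not be contiguous.

```python
PAGE_SIZE = 16  # tokens per KV-cache page (illustrative value)

class PagePool:
    """Shared pool of physical page slots in the KV cache."""
    def __init__(self, num_pages: int):
        self.free = list(range(num_pages))

    def allocate_page(self) -> int:
        return self.free.pop()  # any free physical page will do

class BlockTable:
    """Maps one sequence's logical pages to physical pages, filled on demand."""
    def __init__(self, pool: PagePool):
        self.pool = pool
        self.physical_pages: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> tuple[int, int]:
        # A new page is allocated only when the current one fills up, so memory
        # grows with the actual sequence length, not the reserved maximum.
        if self.num_tokens % PAGE_SIZE == 0:
            self.physical_pages.append(self.pool.allocate_page())
        slot = (self.physical_pages[-1], self.num_tokens % PAGE_SIZE)
        self.num_tokens += 1
        return slot  # (physical page, offset within page)

pool = PagePool(num_pages=1024)
seq = BlockTable(pool)
for _ in range(40):
    seq.append_token()
print(len(seq.physical_pages))  # 3 pages for 40 tokens, possibly non-contiguous
```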
Key vLLM Features
Continuous Batching
Scheduling happens at the granularity of a single decode step: finished sequences leave the batch immediately and waiting requests join right away, instead of the GPU idling until an entire batch drains. Critical for production serving.
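A toy sketch of this scheduling policy (the names serve, running, and waiting are invented for illustration; this is not vLLM's scheduler):

```python
from collections import deque

def serve(waiting: deque, max_batch: int = 8):
    """Iteration-level batching: admit and evict requests at every decode step."""
    running = []
    while waiting or running:
        # Admit new requests whenever a slot is free, not only between batches.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())

        for req in running:
            req["generated"] += 1  # stand-in for one decode step

        # Evict finished sequences immediately so their slots free up.
        running = [r for r in running if r["generated"] < r["max_tokens"]]

requests = deque({"generated": 0, "max_tokens": n} for n in (4, 32, 8))
serve(requests)
```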
OpenAI-Compatible API
Drop-in replacement for the OpenAI API. The official openai client (legacy openai.ChatCompletion or the current chat.completions interface) works unchanged once its base URL points at http://localhost:8000/v1.
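For example, with the current openai Python SDK (v1+), only the base URL and a placeholder API key change; the model name must match whatever the server was launched with:

```python
from openai import OpenAI

# vLLM accepts any API key unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```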
Tensor Parallelism
Split the model across multiple GPUs: --tensor-parallel-size 4. Throughput scales close to linearly with GPU count; a fast interconnect such as NVLink is needed for the best results.
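The same setting is exposed in the offline Python API as tensor_parallel_size; the model below is only an example and needs four GPUs to load this way:

```python
from vllm import LLM

# Shard the model's weights across 4 GPUs (offline equivalent of the CLI flag).
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", tensor_parallel_size=4)
```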
Quantization Support
GPTQ, AWQ, FP8, and INT8 are supported; pass e.g. --quantization awq. FP8 KV cache: --kv-cache-dtype fp8.
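A sketch of the offline equivalent, assuming a hypothetical AWQ export of the model (the repository name below is a placeholder, not a real checkpoint):

```python
from vllm import LLM

llm = LLM(
    model="<org>/Llama-3.1-8B-Instruct-AWQ",  # placeholder: any AWQ checkpoint
    quantization="awq",    # must match how the checkpoint was quantized
    kv_cache_dtype="fp8",  # store the KV cache in FP8 to save VRAM
)
```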
vLLM vs Ollama
|  | vLLM | Ollama |
|---|---|---|
| Best for | Production, multi-user, high throughput | Development, single-user, ease of use |
| Backend | CUDA-first (ROCm and CPU backends also available) | llama.cpp (any HW) |
| Formats | HuggingFace, GPTQ, AWQ | GGUF only |
| Setup complexity | Medium | Minimal (one binary) |
| Concurrent users | Excellent (PagedAttention) | Limited |
Quick Start
pip install vllm
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct --gpu-memory-utilization 0.90
Note: --quantization awq only applies to AWQ-quantized checkpoints, so it is omitted for this FP16 model.
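An offline smoke test with vLLM's Python API, using the same model and memory setting, is a quick way to confirm the weights load before standing up the server:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", gpu_memory_utilization=0.90)
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["The key idea behind PagedAttention is"], params)
print(outputs[0].outputs[0].text)
```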