vLLM

Infrastructure

A high-throughput LLM inference engine featuring PagedAttention for efficient KV cache utilization. A de facto standard for production serving of LLMs to many concurrent users.

vLLM (Kwon et al., 2023) is an open-source LLM inference and serving engine that achieves up to 24× higher throughput than naive HuggingFace Transformers by solving KV cache memory fragmentation with PagedAttention, a technique borrowed from operating-system virtual memory.

PagedAttention: The Core Innovation

In standard inference, each request reserves a contiguous block of VRAM for its KV cache, sized for the maximum sequence length. Whatever a request does not use of that reservation is wasted; in practice, 60–80% of KV cache memory can be lost to this fragmentation. PagedAttention instead manages the KV cache in fixed-size pages (like OS memory pages), allocating pages on demand and allowing them to be non-contiguous (see the sketch after this list). This enables:

  • Higher batch sizes: more requests fit simultaneously in VRAM
  • Prefix caching: the KV cache of a shared prefix (e.g. a common system prompt) is reused across requests
  • Beam search memory efficiency via copy-on-write page sharing
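
Below is a minimal, illustrative sketch of the idea, not vLLM's actual internals; the block size, pool size, and names are invented. A per-request block table maps logical KV-cache positions to whichever physical pages happen to be free, so allocations grow on demand and never need to be contiguous.

BLOCK_SIZE = 16                          # tokens per KV-cache page (made-up value)
free_blocks = list(range(64))            # pool of physical page indices on the "GPU"

block_tables = {}                        # request id -> list of physical pages (any order)
token_counts = {}                        # request id -> tokens written so far

def append_token(request_id):
    """Allocate a new physical page only when the current one is full."""
    n = token_counts.get(request_id, 0)
    if n % BLOCK_SIZE == 0:                          # page boundary reached
        block_tables.setdefault(request_id, []).append(free_blocks.pop())
    token_counts[request_id] = n + 1

for _ in range(40):                      # request A writes 40 tokens -> 3 pages
    append_token("req-A")
for _ in range(5):                       # request B writes 5 tokens -> 1 page
    append_token("req-B")

print(block_tables)                      # a request's pages need not be contiguous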

Key vLLM Features

Continuous Batching

New requests are dynamically added to the running batch as soon as previous requests finish. Eliminates idle GPU time between batches. Critical for production serving.
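
A toy sketch of the scheduling idea (not vLLM's scheduler; MAX_BATCH, the request fields, and the loop are invented for illustration): finished requests free their slots immediately, and waiting requests are admitted on the very next decode step.

from collections import deque

MAX_BATCH = 4                            # made-up batch capacity

def decode_step(request):
    """Pretend to generate one token; return True when the request is finished."""
    request["generated"] += 1
    return request["generated"] >= request["target_len"]

waiting = deque({"id": i, "generated": 0, "target_len": 3 + i % 5} for i in range(10))
running = []

steps = 0
while waiting or running:
    # Admit new requests the moment a slot frees up (continuous batching),
    # instead of waiting for the whole batch to drain (static batching).
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.popleft())
    running = [r for r in running if not decode_step(r)]   # one decode step for the batch
    steps += 1

print(f"all requests finished after {steps} decode steps")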

OpenAI-Compatible API

Drop-in replacement for the OpenAI API. Any client built on the openai SDK works unchanged by pointing its base URL at http://localhost:8000/v1.
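
A minimal client sketch using the official openai Python package, assuming the Quick Start server below is running on port 8000 and serving meta-llama/Llama-3.1-8B-Instruct; the api_key value is a placeholder, since vLLM does not require one by default.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",        # point the SDK at the local vLLM server
    api_key="not-needed",                       # placeholder; no key is checked by default
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",   # must match the served model name
    messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)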

Tensor Parallelism

Split the model across multiple GPUs: --tensor-parallel-size 4. Throughput scales near-linearly in practice. NVLink interconnect recommended for best results.
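
The same setting is available from the offline Python API; a sketch assuming four local GPUs and the model name used elsewhere on this page:

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=4,                     # shard weights and attention heads across 4 GPUs
)

outputs = llm.generate(
    ["Summarise why paged KV caches allow larger batch sizes."],
    SamplingParams(max_tokens=64, temperature=0.7),
)
print(outputs[0].outputs[0].text)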

Quantization Support

GPTQ, AWQ, FP8, and INT8 are supported; pass, e.g., --quantization awq (the checkpoint must already be quantized in that format). FP8 KV cache: --kv-cache-dtype fp8.
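
A sketch of the equivalent Python engine arguments, assuming they mirror the CLI flags above; the repository name is a placeholder for any AWQ-quantized checkpoint.

from vllm import LLM

llm = LLM(
    model="some-org/Llama-3.1-8B-Instruct-AWQ",   # placeholder AWQ-quantized checkpoint
    quantization="awq",                           # must match how the weights were quantized
    kv_cache_dtype="fp8",                         # store the KV cache in FP8 to save VRAM
)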

vLLM vs Ollama

|                  | vLLM                                    | Ollama                                |
|------------------|-----------------------------------------|---------------------------------------|
| Best for         | Production, multi-user, high throughput | Development, single-user, ease of use |
| Backend          | CUDA (NVIDIA only)                      | llama.cpp (any hardware)              |
| Formats          | HuggingFace, GPTQ, AWQ                  | GGUF only                             |
| Setup complexity | Medium                                  | Minimal (one binary)                  |
| Concurrent users | Excellent (PagedAttention)              | Limited                               |

Quick Start

pip install vllm
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --gpu-memory-utilization 0.90

Add --quantization awq only when --model points at an AWQ-quantized checkpoint; the flag tells vLLM how the weights were quantized rather than quantizing them on the fly.
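
Once the server is up, a quick sanity check (assuming the default port) is to list the served models via the OpenAI-compatible /v1/models endpoint, sketched here with only the standard library:

import json
import urllib.request

with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
    models = json.load(resp)

for entry in models["data"]:
    print(entry["id"])   # e.g. meta-llama/Llama-3.1-8B-Instruct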