During token generation, the model computes K and V tensors for each token at every layer. Without caching, generating the 100th token would require recomputing K and V for all 99 previous tokens at every layer. The KV cache stores these tensors, so each decode step processes only the newly generated token.
KV Cache Size Calculation
For a model with N_layers layers, N_kv_heads KV heads (equal to the attention heads unless GQA/MQA reduces them), head dimension D_head, and sequence length S, in FP16:
KV cache bytes = 2 (K and V) × N_layers × N_kv_heads × D_head × S × 2 bytes per FP16 value
Example: Llama 3 8B (32 layers, GQA 8 KV heads, head_dim=128) at 8K context:
2 × 32 × 8 × 128 × 8192 × 2 bytes ≈ 1 GiB, which is manageable. At 128K context the cache grows 16×, to ~16 GiB, and dominates VRAM.
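The formula above can be checked numerically with a short helper (the function name and defaults are illustrative):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache size for one sequence: K and V stored per layer, per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Llama 3 8B: 32 layers, 8 KV heads (GQA), head_dim 128, FP16 (2 bytes)
size_8k = kv_cache_bytes(32, 8, 128, 8192)      # exactly 2**30 bytes = 1 GiB
size_128k = kv_cache_bytes(32, 8, 128, 131072)  # 16 GiB
print(size_8k / 2**30, size_128k / 2**30)       # 1.0 16.0
```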
KV Cache Management Strategies
PagedAttention (vLLM)
Manages KV cache pages like virtual memory — non-contiguous, shareable across requests (prefix caching). Eliminates memory fragmentation and enables high batch sizes.
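The core idea can be sketched as a toy block allocator with reference counts for shared prefix blocks. This is a minimal illustration in the spirit of PagedAttention, not vLLM's actual API; the class and block size are assumptions (vLLM's default block holds 16 tokens):

```python
BLOCK_TOKENS = 16  # tokens per KV block (assumed, matching vLLM's default)

class BlockAllocator:
    """Toy page-table allocator: fixed-size KV blocks, refcounted for sharing."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def alloc(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        # Prefix caching: a second sequence maps the same physical block.
        self.refcount[block] += 1

    def release(self, block):
        # A block returns to the free list only when no sequence maps it.
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)

def blocks_needed(seq_len):
    return -(-seq_len // BLOCK_TOKENS)  # ceiling division
```

Because blocks are fixed-size and need not be contiguous, the only waste is at most one partially filled block per sequence, which is why fragmentation disappears.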
Quantised KV Cache
Store K and V in INT8 or INT4 instead of FP16. Cuts KV cache size by 2–4×. Available in vLLM and SGLang. Minimal quality impact for most tasks.
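A minimal sketch of per-channel symmetric INT8 quantisation of a K or V tensor, using NumPy; this illustrates the 2× size reduction and small reconstruction error, not the exact scheme vLLM or SGLang implement:

```python
import numpy as np

def quantize_int8(kv):
    """Symmetric INT8 quantisation with one scale per row (per head)."""
    scale = np.abs(kv).max(axis=-1, keepdims=True).astype(np.float32) / 127.0
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return (q.astype(np.float32) * scale).astype(np.float16)

k = np.random.randn(8, 128).astype(np.float16)   # one token's K, 8 heads
q, s = quantize_int8(k)
err = np.abs(dequantize_int8(q, s).astype(np.float32) - k.astype(np.float32)).max()
# q uses 1 byte per element vs 2 for FP16: half the cache footprint.
```

INT4 halves the footprint again but typically needs finer-grained (e.g. group-wise) scales to keep the error acceptable.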
Prefix Caching (Radix Cache)
Reuse the KV cache for identical system prompts across requests. A 2K system prompt cached once saves 2K tokens of prefill for every subsequent request. SGLang is the leader here.
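The matching step can be illustrated with a linear scan over cached token sequences; SGLang uses a radix tree for this, so the function below is an illustrative stand-in, not its API:

```python
def longest_cached_prefix(cached, request):
    """Length of the longest cached token prefix matching the request.
    Linear-scan stand-in for a radix-tree lookup (illustrative only)."""
    best = 0
    for seq in cached:
        match = 0
        for a, b in zip(seq, request):
            if a != b:
                break
            match += 1
        best = max(best, match)
    return best

cache = [[1, 2, 3, 4], [1, 2, 9]]
# A request sharing the [1, 2, 3] prefix skips 3 tokens of prefill.
hit = longest_cached_prefix(cache, [1, 2, 3, 7])  # 3
```

Every matched token is a token of prefill skipped, which is why a long shared system prompt pays off on every request.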
Why It Matters for On-Premise
KV cache, not model weights, is often the binding VRAM constraint in multi-user deployments. A 7B model's weights take ~4 GB at Q4, but serving 16 simultaneous users at 8K context each adds roughly 1 GB of FP16 KV cache per user, 16 GB in total, for about 20 GB overall. Quantised KV cache and PagedAttention are essential for production on-premise serving.
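The budget arithmetic can be made explicit; the figures below use the 1 GiB-per-user cache computed earlier and an assumed ~4 GiB for Q4 weights:

```python
def serving_vram_gib(weights_gib, users, kv_bytes_per_user):
    """Total VRAM: static weights plus one KV cache allocation per user."""
    return weights_gib + users * kv_bytes_per_user / 2**30

# Per-user FP16 KV cache at 8K context (32 layers, 8 KV heads, head_dim 128):
per_user_fp16 = 2 * 32 * 8 * 128 * 8192 * 2   # 1 GiB

total_fp16 = serving_vram_gib(4, 16, per_user_fp16)       # 20.0 GiB
total_int8 = serving_vram_gib(4, 16, per_user_fp16 // 2)  # 12.0 GiB
```

Halving the cache with INT8 drops the deployment from 20 GiB to 12 GiB, the difference between needing a 24 GB card and fitting on a 16 GB one.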