KV Cache

Optimization

A cache of the Key and Value attention tensors for all tokens already processed — avoiding redundant recomputation and making autoregressive generation efficient.

During token generation, the model computes K and V tensors for each token at each layer. Without caching, generating the 100th token would require reprocessing all 99 previous tokens from scratch. The KV cache stores these tensors so each decode step only processes the newly generated token.
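
To make the access pattern concrete, here is a minimal, runnable sketch of cached decoding: one layer, one attention head, random weights, shown only to illustrate that each step projects K and V for the new token alone and attends over the stored history. All names and shapes are illustrative, not any particular framework's API.

```python
import numpy as np

# One layer, one attention head, random weights: each decode step projects K and V
# only for the NEW token, appends them to the cache, and attends over the stored
# history. Without the cache, step t would recompute K and V for all t prior tokens.

d = 64
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
k_cache, v_cache = [], []          # grows by one row per generated token

def decode_step(x):
    """x: hidden state of the newly generated token, shape (d,)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    k_cache.append(k)
    v_cache.append(v)
    K, V = np.stack(k_cache), np.stack(v_cache)   # (t, d) cached history
    scores = K @ q / np.sqrt(d)                   # attend over all cached tokens
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

for _ in range(5):
    out = decode_step(rng.standard_normal(d))
```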

KV Cache Size Calculation

For a model with N_layers layers, N_kv_heads key/value heads (the KV-head count, which under GQA is smaller than the query-head count), head dimension D_head, stored in FP16 (2 bytes per element):

KV cache VRAM per sequence = 2 (K and V) × N_layers × N_kv_heads × D_head × seq_length × 2 bytes

Example: Llama 3 8B (32 layers, GQA 8 KV heads, head_dim=128) at 8K context:
2 × 32 × 8 × 128 × 8192 × 2 bytes ≈ 1 GB, which is manageable. At 128K context it grows to ~16 GB and dominates VRAM.
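
A small helper makes the arithmetic easy to rerun for other models and context lengths; the function name and defaults below are illustrative.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache for ONE sequence: 2 (K and V) x layers x KV heads x head dim x tokens x dtype size."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Llama 3 8B (32 layers, 8 KV heads via GQA, head_dim=128), FP16 cache:
print(kv_cache_bytes(32, 8, 128, 8_192) / 2**30)     # ~1 GiB at 8K context
print(kv_cache_bytes(32, 8, 128, 131_072) / 2**30)   # ~16 GiB at 128K context
```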

KV Cache Management Strategies

PagedAttention (vLLM)

Manages the KV cache in fixed-size blocks, like virtual-memory pages: non-contiguous, shareable across requests (prefix caching). Nearly eliminates memory fragmentation and enables much larger batch sizes.
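
The sketch below shows the core idea only: a reference-counted pool of fixed-size blocks addressed through a per-request block table. vLLM's actual implementation differs, and all names here are invented for illustration.

```python
# Toy sketch of the PagedAttention idea: KV tensors live in fixed-size blocks,
# and each request holds a block table (a list of block IDs), so cache memory
# need not be contiguous and identical prefixes can share blocks via reference counts.

BLOCK_SIZE = 16  # tokens per KV block

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def allocate(self):
        block_id = self.free.pop()
        self.refcount[block_id] = 1
        return block_id

    def share(self, block_id):
        # Prefix caching: a second request reuses an existing block.
        self.refcount[block_id] += 1

    def release(self, block_id):
        self.refcount[block_id] -= 1
        if self.refcount[block_id] == 0:
            self.free.append(block_id)   # returned to the pool, no fragmentation

# A request covering 40 tokens needs ceil(40 / BLOCK_SIZE) = 3 blocks in its table.
allocator = BlockAllocator(num_blocks=1024)
block_table = [allocator.allocate() for _ in range(3)]
```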

Quantised KV Cache

Store K and V in INT8 or INT4 instead of FP16. Cuts KV cache size by 2–4×. Available in vLLM and SGLang. Minimal quality impact for most tasks.
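
As a rough illustration of why the quality impact is small, here is a per-tensor INT8 round trip in NumPy: store an int8 payload plus one scale, dequantise on read. Real engines quantise inside the attention kernels and use finer-grained schemes; this is a conceptual sketch only.

```python
import numpy as np

def quantise_int8(x):
    """Per-tensor INT8 quantisation: int8 payload (half the size of FP16) plus one scale."""
    scale = max(float(np.abs(x).max()) / 127.0, 1e-8)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantise_int8(q, scale):
    return q.astype(np.float16) * np.float16(scale)

k = np.random.randn(8, 128).astype(np.float16)   # one token's K across 8 KV heads
k_q, k_scale = quantise_int8(k)
k_hat = dequantise_int8(k_q, k_scale)             # small reconstruction error
print(float(np.abs(k - k_hat).max()))
```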

Prefix Caching (Radix Cache)

Reuse the KV cache for identical system prompts across requests. A 2K system prompt cached once saves 2K tokens of prefill for every subsequent request. SGLang is the leader here.
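
Conceptually, the engine looks up the longest already-cached token prefix and prefills only the remainder. SGLang organises this lookup as a radix tree over token sequences; the flat-dictionary sketch below is a simplification for illustration, with invented names.

```python
# Toy lookup: find the longest already-cached token prefix so prefill only runs
# on the remainder. `cached_prefixes` and its contents are invented for illustration.

cached_prefixes = {}   # tuple(token_ids) -> handle to that prefix's KV blocks

def longest_cached_prefix(token_ids):
    for end in range(len(token_ids), 0, -1):
        prefix = tuple(token_ids[:end])
        if prefix in cached_prefixes:
            return end, cached_prefixes[prefix]   # this many tokens of KV are reusable
    return 0, None

# Cache a shared system prompt once; later requests skip its prefill entirely.
system_prompt = (101, 7592, 2088, 102)            # placeholder token IDs
cached_prefixes[system_prompt] = "kv-handle-0"
print(longest_cached_prefix(list(system_prompt) + [55, 66]))   # (4, 'kv-handle-0')
```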

Why It Matters for On-Premise

KV cache, not the model weights, is often the binding VRAM constraint in multi-user deployments. A 7B model quantised to Q4 takes ~4 GB of weights, but serving 16 simultaneous users at 8K context each adds roughly 16 GB of KV cache, about 20 GB in total. Quantised KV cache and PagedAttention are essential for production on-premise serving.
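
A back-of-the-envelope capacity check ties these numbers together; the 24 GB GPU and per-user figures below are illustrative assumptions, not benchmarks.

```python
def max_concurrent_users(vram_gb, weights_gb, kv_gb_per_user):
    """How many users fit once the weights are resident."""
    return int((vram_gb - weights_gb) // kv_gb_per_user)

# Hypothetical 24 GB GPU, ~4 GB Q4 weights, ~1 GB of FP16 KV per user at 8K context:
print(max_concurrent_users(24, 4, 1.0))   # 20
# Halving the KV footprint with an INT8 cache roughly doubles concurrency:
print(max_concurrent_users(24, 4, 0.5))   # 40
```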