Context Window

Core

The maximum number of tokens (input + output combined) a model can process in a single call. Directly tied to VRAM usage via the KV cache.

The context window is the model's working memory for a single inference call. Everything — the system prompt, conversation history, retrieved documents, and the generated response — must fit within this limit.
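As a minimal sketch of what "everything must fit" means in practice, the check below adds up a prompt and a reserved generation budget against the window size; all token counts here are illustrative assumptions, not measured values.

```python
def fits_in_context(prompt_tokens: int, max_new_tokens: int, context_window: int) -> bool:
    """True if the prompt plus the reserved generation budget fits in one call."""
    return prompt_tokens + max_new_tokens <= context_window

# Hypothetical budget: 3K-token system prompt, 20K tokens of retrieved
# documents, and room for a 4K-token answer inside a 32K window.
prompt_tokens = 3_000 + 20_000
print(fits_in_context(prompt_tokens, max_new_tokens=4_000, context_window=32_768))  # True
```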

Why Context Length Is Bounded

Standard attention has O(n²) time and memory complexity with respect to sequence length: a 128K-token context produces roughly 17 billion attention scores per head in every layer. This is tractable only with Flash Attention, which computes attention in tiles instead of materialising the full score matrix, together with careful KV cache management. The KV cache itself grows linearly with context length, so longer contexts still mean proportionally more VRAM.
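To see where the quadratic term comes from, the sketch below simply counts attention scores per head as the sequence grows; the sequence lengths are arbitrary examples.

```python
def attention_scores_per_head(n_tokens: int) -> int:
    """Full self-attention compares every token with every other token."""
    return n_tokens * n_tokens

for n in (4_096, 32_768, 131_072):
    # Doubling the context quadruples the score matrix; at 128K tokens the
    # count reaches ~17 billion scores per head in every layer.
    print(f"{n:>7} tokens -> {attention_scores_per_head(n) / 1e9:7.2f} billion scores per head")
```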

Context Sizes by Model Generation

| Era | Common limit | Examples |
| --- | --- | --- |
| 2020–2022 | 2K – 4K tokens | GPT-3 (2K), GPT-3.5 / ChatGPT (4K) |
| 2023 | 4K – 32K tokens | Llama 2 (4K), Mistral 7B (8K), GPT-4 (8K/32K) |
| 2024 | 128K tokens | Llama 3.1, GPT-4o, Gemini 1.5 Pro |
| 2025–2026 | 1M – 10M tokens | Gemini 2.5 Pro (1M), Gemini 3.0 (10M) |

KV Cache VRAM Cost

For Llama 3 8B (32 layers, 8 KV heads via GQA, head dimension 128) at 128K context in FP16, the KV cache alone consumes roughly 17 GB of VRAM. At 1M tokens it would grow to roughly 130 GB — this is why ultra-long context needs more than GQA alone: quantised KV caches, cache offloading, or specialised attention architectures.
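These figures follow directly from the per-token KV cache formula. The sketch below reproduces them using the published Llama 3 8B configuration (32 layers, 8 KV heads, head dimension 128) and FP16 storage; treat it as a back-of-the-envelope calculator, not a profiler.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int) -> int:
    # One key vector and one value vector are cached per token,
    # per KV head, per layer.
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per_value

# Llama 3 8B with GQA: 32 layers, 8 KV heads, head_dim 128, FP16 = 2 bytes.
for n_tokens in (131_072, 1_000_000):
    gb = kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8,
                        head_dim=128, bytes_per_value=2) / 1e9
    print(f"{n_tokens:>9} tokens -> {gb:6.1f} GB of KV cache")
```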

Why It Matters for On-Premise

If you're using RAG, a 32K context can hold ≈50 retrieved chunks, which is usually sufficient. If you need to process entire codebases or legal contracts in one shot, you need a 128K+ model running with Flash Attention and on the order of 80 GB of VRAM or more. Plan your hardware sizing around realistic context requirements before purchasing GPUs.
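As a rough capacity check for the RAG case, the sketch below counts how many chunks fit once a system prompt and an answer budget are reserved; the chunk size and budgets are illustrative assumptions.

```python
def max_chunks(context_window: int, system_prompt_tokens: int,
               answer_budget_tokens: int, chunk_tokens: int) -> int:
    """How many retrieved chunks fit alongside the system prompt and the answer."""
    available = context_window - system_prompt_tokens - answer_budget_tokens
    return max(available, 0) // chunk_tokens

# Illustrative sizing for a 32K model with ~512-token chunks.
print(max_chunks(context_window=32_768, system_prompt_tokens=1_000,
                 answer_budget_tokens=2_000, chunk_tokens=512))  # 58
```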