The context window is the model's working memory for a single inference call. Everything — the system prompt, conversation history, retrieved documents, and the generated response — must fit within this limit.
Why Context Length Is Bounded
Standard attention has O(n²) memory complexity with respect to sequence length. At a 128K-token context, each n × n attention score matrix (one per head, per layer) contains roughly 17 billion entries. This is tractable only because Flash Attention never materialises the full score matrix, reducing memory to O(n) while compute remains O(n²), and because the KV cache is managed carefully; the KV cache itself grows linearly with context length in VRAM.
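The quadratic-versus-linear trade-off is easy to sanity-check numerically. The snippet below is a minimal back-of-the-envelope sketch, not tied to any particular model, showing how fast a single n × n score matrix grows:

```python
# Quadratic growth of the attention score matrix with sequence length.
# One n x n score matrix exists per head, per layer; fused kernels such as
# Flash Attention avoid ever materialising it in full.

def score_matrix_entries(seq_len: int) -> int:
    """Entries in a single n x n attention score matrix."""
    return seq_len * seq_len

for n in (4_096, 32_768, 131_072):
    print(f"{n:>7} tokens -> {score_matrix_entries(n) / 1e9:6.2f} billion scores per matrix")
```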
Context Sizes by Model Generation
| Era | Common Limit | Examples |
|---|---|---|
| 2020–2022 | 2K – 4K tokens | GPT-3 (2K), GPT-J (2K), GPT-3.5 (4K) |
| 2023 | 4K – 32K tokens | Llama 2 (4K), Mistral 7B (8K), GPT-4 (8K/32K) |
| 2024 | 128K – 1M tokens | Llama 3.1 (128K), GPT-4o (128K), Gemini 1.5 Pro (1M+) |
| 2025–2026 | 1M – 10M tokens | Gemini 2.5 Pro (1M), Llama 4 Scout (10M claimed) |
KV Cache VRAM Cost
For Llama 3 8B (GQA with 8 KV heads, 32 layers, head dimension 128) at 128K context in FP16, the KV cache alone consumes ≈ 16 GB of VRAM. At 1M tokens it would require ≈ 128 GB — this is why ultra-long context requires GQA, a quantised KV cache, or specialised architectures.
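These figures follow directly from the cache shape. The sketch below assumes the published Llama 3 8B configuration (32 layers, 8 KV heads under GQA, head dimension 128) and FP16 by default:

```python
# KV cache size: 2 tensors (K and V) per layer, per token.
# Shape assumptions follow the published Llama 3 8B config:
# 32 layers, 8 KV heads (GQA), head_dim = 128.

def kv_cache_gib(seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """KV cache size in GiB for a single sequence (FP16 by default)."""
    bytes_total = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len
    return bytes_total / 2**30

print(f"128K context, FP16: {kv_cache_gib(131_072):.0f} GiB")                        # ~16 GiB
print(f"  1M context, FP16: {kv_cache_gib(1_048_576):.0f} GiB")                      # ~128 GiB
print(f"  1M context, FP8 : {kv_cache_gib(1_048_576, bytes_per_elem=1):.0f} GiB")    # ~64 GiB
```

The FP8 line illustrates why KV-cache quantisation is one of the standard levers for making million-token contexts fit on a single node.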
Why It Matters for On-Premise
If you're using RAG, a 32K context holds roughly 50–60 retrieved chunks of ~500 tokens each once you reserve room for the prompt and response (see the sketch below), which is usually sufficient. If you need to process entire codebases or legal contracts in one shot, you need 128K+ models and ≥80 GB of VRAM with Flash Attention enabled. Plan your hardware sizing around realistic context requirements before purchasing GPUs.
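A rough token-budget calculation makes the chunk estimate concrete. This is a sketch under illustrative assumptions: 512-token chunks and fixed reservations for the system prompt and the response, all of which vary by deployment:

```python
# Rough RAG context budgeting: how many retrieved chunks fit after
# reserving room for the system prompt and the model's response.
# The 512-token chunk size and both reservations are illustrative assumptions.

def max_chunks(context_limit: int, chunk_tokens: int = 512,
               system_prompt_tokens: int = 1_000, response_tokens: int = 2_000) -> int:
    available = context_limit - system_prompt_tokens - response_tokens
    return max(available // chunk_tokens, 0)

print(max_chunks(32_768))    # ~58 chunks at 512 tokens each
print(max_chunks(131_072))   # ~250 chunks
```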