The context window is the model's working memory for a single inference call. Everything — the system prompt, conversation history, retrieved documents, and the generated response — must fit within this limit.
Why Context Length Is Bounded
Standard attention has O(n²) memory complexity with respect to sequence length. At a 128K-token context, each n × n attention score matrix (one per head, per layer) contains roughly 17 billion entries. This is tractable only because Flash Attention never materialises the full score matrix, reducing memory to O(n) while compute remains O(n²), and because the KV cache is managed carefully; the KV cache itself grows linearly with context length in VRAM.
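The quadratic-versus-linear trade-off is easy to sanity-check numerically. The snippet below is a minimal back-of-the-envelope sketch, not tied to any particular model, showing how fast a single n × n score matrix grows:

```python
# Quadratic growth of the attention score matrix with sequence length.
# One n x n score matrix exists per head, per layer; fused kernels such as
# Flash Attention avoid ever materialising it in full.

def score_matrix_entries(seq_len: int) -> int:
    """Entries in a single n x n attention score matrix."""
    return seq_len * seq_len

for n in (4_096, 32_768, 131_072):
    print(f"{n:>7} tokens -> {score_matrix_entries(n) / 1e9:6.2f} billion scores per matrix")
```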
Context Sizes by Model Generation
| Era | Common Limit | Examples |
|---|---|---|
| 2020–2022 | 2K – 4K tokens | GPT-3 (2K), GPT-J (2K), GPT-3.5 (4K) |
| 2023 | 4K – 32K tokens | Llama 2 (4K), Mistral 7B (8K), GPT-4 (8K/32K) |
| 2024 | 128K – 1M tokens | Llama 3.1 (128K), GPT-4o (128K), Gemini 1.5 Pro (1M+) |
| 2025–2026 | 1M – 10M tokens | Gemini 2.5 Pro (1M), Llama 4 Scout (10M claimed) |
KV Cache VRAM Cost
For Llama 3 8B (GQA with 8 KV heads, 32 layers, head dimension 128) at 128K context in FP16, the KV cache alone consumes ≈ 16 GB of VRAM. At 1M tokens it would require ≈ 128 GB — this is why ultra-long context requires GQA, a quantised KV cache, or specialised architectures.
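These figures follow directly from the cache shape. The sketch below assumes the published Llama 3 8B configuration (32 layers, 8 KV heads under GQA, head dimension 128) and FP16 by default:

```python
# KV cache size: 2 tensors (K and V) per layer, per token.
# Shape assumptions follow the published Llama 3 8B config:
# 32 layers, 8 KV heads (GQA), head_dim = 128.

def kv_cache_gib(seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """KV cache size in GiB for a single sequence (FP16 by default)."""
    bytes_total = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len
    return bytes_total / 2**30

print(f"128K context, FP16: {kv_cache_gib(131_072):.0f} GiB")                        # ~16 GiB
print(f"  1M context, FP16: {kv_cache_gib(1_048_576):.0f} GiB")                      # ~128 GiB
print(f"  1M context, FP8 : {kv_cache_gib(1_048_576, bytes_per_elem=1):.0f} GiB")    # ~64 GiB
```

The FP8 line illustrates why KV-cache quantisation is one of the standard levers for making million-token contexts fit on a single node.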
Why It Matters for On-Premise
If you're using RAG, a 32K context holds roughly 50–60 retrieved chunks of ~500 tokens each once you reserve room for the prompt and response (see the sketch below), which is usually sufficient. If you need to process entire codebases or legal contracts in one shot, you need 128K+ models and ≥80 GB of VRAM with Flash Attention enabled. Plan your hardware sizing around realistic context requirements before purchasing GPUs.
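A rough token-budget calculation makes the chunk estimate concrete. This is a sketch under illustrative assumptions: 512-token chunks and fixed reservations for the system prompt and the response, all of which vary by deployment:

```python
# Rough RAG context budgeting: how many retrieved chunks fit after
# reserving room for the system prompt and the model's response.
# The 512-token chunk size and both reservations are illustrative assumptions.

def max_chunks(context_limit: int, chunk_tokens: int = 512,
               system_prompt_tokens: int = 1_000, response_tokens: int = 2_000) -> int:
    available = context_limit - system_prompt_tokens - response_tokens
    return max(available // chunk_tokens, 0)

print(max_chunks(32_768))    # ~58 chunks at 512 tokens each
print(max_chunks(131_072))   # ~250 chunks
```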