Attention is the mechanism that gives transformers their power. For each token being generated, it computes a weighted sum over all previous tokens, telling the model which parts of the input to "pay attention to".
How It Works (Simplified)
Each token is projected into three vectors: Query (Q), Key (K), and Value (V). A token's attention scores are the dot products of its Q vector with the K vectors of the tokens it can see, divided by √d_k (the head dimension) and passed through a softmax to produce weights. The output is the weighted sum of the corresponding V vectors. This runs in parallel across multiple "heads", hence Multi-Head Attention.
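As a concrete reference, here is a minimal sketch of causal scaled dot-product attention with multiple heads in PyTorch. The dimensions and the `split_heads` helper are illustrative, not taken from any particular model.

```python
# Minimal sketch of multi-head scaled dot-product attention (illustrative sizes).
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, head_dim)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)     # (batch, heads, seq, seq)
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))  # hide masked positions
    weights = torch.softmax(scores, dim=-1)               # weights sum to 1 per query
    return weights @ v                                    # weighted sum of value vectors

batch, seq_len, n_heads, head_dim = 1, 8, 4, 16
d_model = n_heads * head_dim

x = torch.randn(batch, seq_len, d_model)
# One linear projection per role; heads are just a reshape of the projected tensor.
w_q = torch.nn.Linear(d_model, d_model, bias=False)
w_k = torch.nn.Linear(d_model, d_model, bias=False)
w_v = torch.nn.Linear(d_model, d_model, bias=False)

def split_heads(t):
    return t.view(batch, seq_len, n_heads, head_dim).transpose(1, 2)

q, k, v = split_heads(w_q(x)), split_heads(w_k(x)), split_heads(w_v(x))
# Causal mask: True above the diagonal, so each token only attends to itself and earlier tokens.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
out = scaled_dot_product_attention(q, k, v, causal_mask)  # (batch, heads, seq, head_dim)
print(out.shape)
```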
Variants Used in Modern LLMs
Multi-Head Attention (MHA)
The classic formulation. Each of the N heads has its own Q, K, and V projections. Used in the original GPT and BERT. Memory-heavy for long contexts, since every head's K and V tensors must be cached.
Multi-Query Attention (MQA)
All heads share a single K and V projection. Massively reduces KV cache size. Used in Falcon, PaLM. Slight quality drop vs MHA.
Grouped-Query Attention (GQA)
Query heads are split into groups, and each group shares one set of K and V projections. The best balance of quality and memory efficiency; used in Llama 3, Mistral, and Gemma, making it the current standard.
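The three variants above differ only in how many K/V heads are projected, and therefore cached; the sketch below contrasts them. The dimensions (d_model of 4096, 32 query heads, head dimension 128) are illustrative assumptions, not a specific model's config.

```python
# Sketch contrasting MHA, MQA, and GQA: only the number of K/V heads changes.
import torch

d_model, n_heads, head_dim = 4096, 32, 128
x = torch.randn(1, d_model)  # one token's hidden state

for name, n_kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    w_k = torch.nn.Linear(d_model, n_kv_heads * head_dim, bias=False)
    w_v = torch.nn.Linear(d_model, n_kv_heads * head_dim, bias=False)
    k, v = w_k(x), w_v(x)  # this is what gets appended to the KV cache at each layer
    print(f"{name}: {n_kv_heads:>2} KV heads -> "
          f"{k.numel() + v.numel()} cached values per token per layer")
```

At attention time, each shared K/V head is simply broadcast (for example with torch.repeat_interleave) across its group of query heads, so the attention math itself is unchanged.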
Sliding Window Attention
Each token attends only to a fixed-size window of recent tokens, which makes very long contexts cheap. Used in Mistral 7B; because the windows of successive layers overlap, information can still propagate across the full context, and some models (e.g. Gemma 2) interleave sliding-window layers with full global-attention layers to retain long-range reasoning.
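A sliding-window mask is easy to picture in code. The sketch below is a generic local causal mask with an illustrative window size, not any model's exact implementation; the resulting boolean mask can be dropped into the attention sketch above in place of the plain causal mask.

```python
# Sliding-window causal mask: token i may attend only to tokens in [i - window + 1, i].
import torch

def sliding_window_mask(seq_len, window):
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    # True = masked out: either a future token or one older than the window.
    return (j > i) | (j <= i - window)

print(sliding_window_mask(seq_len=6, window=3).int())
# Each row allows at most 3 positions: the token itself plus its two nearest predecessors.
```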
Why It Matters for On-Premise
The KV cache (which stores the K and V tensors for every token in the context) grows linearly with context length and can consume a large share of VRAM. The saving from GQA equals the ratio of query heads to KV heads: choosing a model with GQA (like Llama 3) over an equivalent MHA model reduces KV cache memory by roughly 4x for the 8B model (32 query heads, 8 KV heads) and 8x for the 70B model (64 query heads, 8 KV heads), letting you serve longer contexts, or more concurrent requests, on the same hardware.
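A rough back-of-envelope calculation makes the saving concrete. The sketch below assumes an fp16 cache and a Llama-3-8B-like configuration (32 layers, head dimension 128, 8 KV heads under GQA versus 32 under hypothetical MHA); it is an estimate of cache size, not measured VRAM usage.

```python
# Back-of-envelope KV cache size in fp16 (2 bytes per element).
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    # 2x because both K and V are cached at every layer for every token.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

ctx = 8192
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8,  head_dim=128, context_len=ctx)
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, context_len=ctx)
print(f"GQA: {gqa / 2**30:.1f} GiB   MHA: {mha / 2**30:.1f} GiB   ratio: {mha / gqa:.0f}x")
# GQA: 1.0 GiB   MHA: 4.0 GiB   ratio: 4x
```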