Attention is the mechanism that gives transformers their power. For each token being generated, it computes a weighted sum over all previous tokens, telling the model which parts of the input to "pay attention to".
How It Works (Simplified)
Each token is projected into three vectors: Query (Q), Key (K), and Value (V). A token's attention scores are the dot products of its Q vector with the K vectors of the tokens it can see, divided by √d_k (the head dimension) and passed through a softmax to produce weights. The output is the weighted sum of the corresponding V vectors. This runs in parallel across multiple "heads", hence Multi-Head Attention.
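As a concrete reference, here is a minimal sketch of causal scaled dot-product attention with multiple heads in PyTorch. The dimensions and the `split_heads` helper are illustrative, not taken from any particular model.

```python
# Minimal sketch of multi-head scaled dot-product attention (illustrative sizes).
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, head_dim)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)     # (batch, heads, seq, seq)
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))  # hide masked positions
    weights = torch.softmax(scores, dim=-1)               # weights sum to 1 per query
    return weights @ v                                    # weighted sum of value vectors

batch, seq_len, n_heads, head_dim = 1, 8, 4, 16
d_model = n_heads * head_dim

x = torch.randn(batch, seq_len, d_model)
# One linear projection per role; heads are just a reshape of the projected tensor.
w_q = torch.nn.Linear(d_model, d_model, bias=False)
w_k = torch.nn.Linear(d_model, d_model, bias=False)
w_v = torch.nn.Linear(d_model, d_model, bias=False)

def split_heads(t):
    return t.view(batch, seq_len, n_heads, head_dim).transpose(1, 2)

q, k, v = split_heads(w_q(x)), split_heads(w_k(x)), split_heads(w_v(x))
# Causal mask: True above the diagonal, so each token only attends to itself and earlier tokens.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
out = scaled_dot_product_attention(q, k, v, causal_mask)  # (batch, heads, seq, head_dim)
print(out.shape)
```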
Variants Used in Modern LLMs
Multi-Head Attention (MHA)
The classic formulation. Each of the N heads has its own Q, K, and V projections. Used in the original GPT and BERT. Memory-heavy for long contexts, since every head's K and V tensors must be cached.
Multi-Query Attention (MQA)
All heads share a single K and V projection. Massively reduces KV cache size. Used in Falcon, PaLM. Slight quality drop vs MHA.
Grouped-Query Attention (GQA)
Query heads are split into groups, and each group shares one set of K and V projections. The best balance of quality and memory efficiency; used in Llama 3, Mistral, and Gemma, making it the current standard.
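The three variants above differ only in how many K/V heads are projected, and therefore cached; the sketch below contrasts them. The dimensions (d_model of 4096, 32 query heads, head dimension 128) are illustrative assumptions, not a specific model's config.

```python
# Sketch contrasting MHA, MQA, and GQA: only the number of K/V heads changes.
import torch

d_model, n_heads, head_dim = 4096, 32, 128
x = torch.randn(1, d_model)  # one token's hidden state

for name, n_kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    w_k = torch.nn.Linear(d_model, n_kv_heads * head_dim, bias=False)
    w_v = torch.nn.Linear(d_model, n_kv_heads * head_dim, bias=False)
    k, v = w_k(x), w_v(x)  # this is what gets appended to the KV cache at each layer
    print(f"{name}: {n_kv_heads:>2} KV heads -> "
          f"{k.numel() + v.numel()} cached values per token per layer")
```

At attention time, each shared K/V head is simply broadcast (for example with torch.repeat_interleave) across its group of query heads, so the attention math itself is unchanged.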
Sliding Window Attention
Each token attends only to a fixed-size window of recent tokens, which makes very long contexts cheap. Used in Mistral 7B; because the windows of successive layers overlap, information can still propagate across the full context, and some models (e.g. Gemma 2) interleave sliding-window layers with full global-attention layers to retain long-range reasoning.
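A sliding-window mask is easy to picture in code. The sketch below is a generic local causal mask with an illustrative window size, not any model's exact implementation; the resulting boolean mask can be dropped into the attention sketch above in place of the plain causal mask.

```python
# Sliding-window causal mask: token i may attend only to tokens in [i - window + 1, i].
import torch

def sliding_window_mask(seq_len, window):
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    # True = masked out: either a future token or one older than the window.
    return (j > i) | (j <= i - window)

print(sliding_window_mask(seq_len=6, window=3).int())
# Each row allows at most 3 positions: the token itself plus its two nearest predecessors.
```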
Why It Matters for On-Premise
The KV cache (which stores the K and V tensors for every token in the context) grows linearly with context length and can consume a large share of VRAM. The saving from GQA equals the ratio of query heads to KV heads: choosing a model with GQA (like Llama 3) over an equivalent MHA model reduces KV cache memory by roughly 4x for the 8B model (32 query heads, 8 KV heads) and 8x for the 70B model (64 query heads, 8 KV heads), letting you serve longer contexts, or more concurrent requests, on the same hardware.
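A rough back-of-envelope calculation makes the saving concrete. The sketch below assumes an fp16 cache and a Llama-3-8B-like configuration (32 layers, head dimension 128, 8 KV heads under GQA versus 32 under hypothetical MHA); it is an estimate of cache size, not measured VRAM usage.

```python
# Back-of-envelope KV cache size in fp16 (2 bytes per element).
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    # 2x because both K and V are cached at every layer for every token.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

ctx = 8192
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8,  head_dim=128, context_len=ctx)
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, context_len=ctx)
print(f"GQA: {gqa / 2**30:.1f} GiB   MHA: {mha / 2**30:.1f} GiB   ratio: {mha / gqa:.0f}x")
# GQA: 1.0 GiB   MHA: 4.0 GiB   ratio: 4x
```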