Attention Mechanism

Architecture

The mathematical core of every transformer model: it lets the model weigh which tokens in the context are most relevant when generating each new token.

Attention is the mechanism that gives transformers their power. For each token being generated, it computes a weighted sum over all previous tokens, telling the model which parts of the input to "pay attention to".

How It Works (Simplified)

Each token is projected into three vectors: Query (Q), Key (K), and Value (V). The attention score between two tokens is the dot product of one token's Q with the other's K, divided by √d_k (the head dimension) and passed through a softmax, so each output is a weighted sum of V vectors: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. This runs in parallel across multiple independent "heads", hence Multi-Head Attention.
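
As a concrete sketch, here is single-head causal scaled dot-product attention in NumPy. The learned projection matrices that produce Q, K, and V from the token embeddings are omitted for brevity:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head causal attention over one sequence.

    Q, K, V: arrays of shape (seq_len, d_k), assumed to come from
    the model's learned projections (not shown here).
    """
    d_k = Q.shape[-1]
    # Pairwise similarity between every query and every key,
    # scaled by 1/sqrt(d_k) to keep softmax gradients well-behaved.
    scores = Q @ K.T / np.sqrt(d_k)
    # Causal mask: token i may only attend to tokens 0..i.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    # Softmax over keys turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Output: per-token weighted sum of value vectors.
    return weights @ V

# Toy usage: 4 tokens, head dimension 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)  # shape (4, 8)
```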

Variants Used in Modern LLMs

Multi-Head Attention (MHA)

The classic formulation: each of the N heads has its own Q, K, and V projections. Used in the original GPT and BERT. Memory-heavy for long contexts, since the KV cache must store separate K and V tensors for every head.

Multi-Query Attention (MQA)

All heads share a single K and V projection, which shrinks the KV cache by a factor of N. Used in Falcon and PaLM, with a slight quality drop versus MHA.

Grouped-Query Attention (GQA)

Groups of query heads share one K and V projection, sitting between MHA and MQA (see the sketch below). The best balance of quality and memory efficiency, used in Llama 3, Mistral, and Gemma; it is the current standard.
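
The three variants above differ only in how many K/V projections exist. A minimal NumPy sketch of the idea (the function name and shapes are illustrative, and causal masking is omitted to keep the grouping visible):

```python
import numpy as np

def grouped_query_attention(x, Wq, Wk, Wv, n_heads, n_kv_heads):
    """GQA sketch.

    x:  (seq_len, d_model) token embeddings
    Wq: (d_model, n_heads * d_head)     per-head query projections
    Wk: (d_model, n_kv_heads * d_head)  shared key projections
    Wv: (d_model, n_kv_heads * d_head)  shared value projections
    """
    seq_len = x.shape[0]
    d_head = Wq.shape[1] // n_heads
    group = n_heads // n_kv_heads  # query heads per shared KV head

    Q = (x @ Wq).reshape(seq_len, n_heads, d_head)
    K = (x @ Wk).reshape(seq_len, n_kv_heads, d_head)
    V = (x @ Wv).reshape(seq_len, n_kv_heads, d_head)

    outputs = []
    for h in range(n_heads):
        kv = h // group  # which shared K/V this query head reads from
        scores = Q[:, h] @ K[:, kv].T / np.sqrt(d_head)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        outputs.append(w @ V[:, kv])
    return np.concatenate(outputs, axis=-1)  # (seq_len, n_heads * d_head)
```

Setting n_kv_heads equal to n_heads recovers MHA, and setting it to 1 recovers MQA; in all cases the KV cache only ever stores the n_kv_heads K and V tensors.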

Sliding Window Attention

Each token attends only to a local window of recent tokens, which makes very long contexts cheap. Used in Mistral with mixed global layers for long-range reasoning.
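
A sketch of the attention mask this produces, assuming a causal window (Mistral 7B uses a window of 4096 tokens):

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Boolean mask: True where attention is allowed.

    Token i attends to tokens max(0, i - window + 1) .. i, i.e. a
    causal window of at most `window` tokens.
    """
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    return (j <= i) & (j > i - window)

print(sliding_window_mask(5, 3).astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [0 1 1 1 0]
#  [0 0 1 1 1]]
```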

Why It Matters for On-Premise

The KV cache (which stores the K and V tensors for every token in the context) grows linearly with context length and can consume a significant share of VRAM. Choosing a model with GQA (like Llama 3) over an MHA equivalent shrinks the KV cache by the ratio of query heads to KV heads (4x for Llama 3 8B, 8x for Llama 3 70B), letting you serve longer contexts or more concurrent requests on the same hardware.
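
A back-of-the-envelope calculation using Llama 3 8B's published configuration (32 layers, 8 KV heads, head dimension 128) with fp16 cache entries:

```python
# KV cache size per sequence at a given context length.
n_layers, head_dim, ctx, bytes_fp16 = 32, 128, 8192, 2

def kv_cache_bytes(n_kv_heads):
    # 2x for the K and V tensors stored per layer.
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_fp16

gqa = kv_cache_bytes(8)    # GQA: 8 shared KV heads (Llama 3 8B)
mha = kv_cache_bytes(32)   # hypothetical MHA: one KV pair per query head
print(f"GQA: {gqa / 2**30:.1f} GiB, MHA-equivalent: {mha / 2**30:.1f} GiB")
# GQA: 1.0 GiB, MHA-equivalent: 4.0 GiB  (per sequence at 8k context)
```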