The attention mechanism is the key innovation that makes Transformers work. Understanding it means understanding modern AI at its core.
**The problem attention solves:**
In a sentence like 'The trophy didn't fit in the suitcase because it was too big' — what does 'it' refer to? The trophy or the suitcase? For humans, obvious. For sequential models that squeeze the sentence through a single hidden state, the long-range link between 'it' and 'trophy' is easily lost; attention instead lets every token consult every other token directly.
**How attention works:**
For each token (word), the model computes three vectors:
- **Query (Q)**: 'What am I looking for?'
- **Key (K)**: 'What do I contain?'
- **Value (V)**: 'What information do I carry?'
To compute attention:
1. Compute similarity scores: Q · Kᵀ (dot product of each query with every key)
2. Scale by √d_k (the key dimension) and apply softmax to get attention weights (each row sums to 1)
3. Weighted sum of Values using those weights
Result: each token's new representation is a blend of all other tokens' values, weighted by relevance. 'Trophy' and 'it' end up correlated because they share high attention weights in that sentence.
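The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration with random toy vectors, not a trained model — the shapes and function names are chosen here for clarity:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q Kᵀ / √d_k) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # step 1 + 2a: similarity scores, scaled
    weights = softmax(scores, axis=-1)  # step 2b: each row sums to 1
    return weights @ V, weights         # step 3: weighted sum of values

# Toy example: 4 tokens, 8-dimensional Q/K/V (random, for illustration only).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = attention(Q, K, V)
print(out.shape)       # (4, 8): one blended representation per token
print(w.sum(axis=-1))  # each row of attention weights sums to 1
```

In a real Transformer, Q, K, and V are produced by learned linear projections of the token embeddings rather than sampled randomly.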
**Multi-head attention:**
Instead of one attention calculation, run H parallel attention computations (heads) with different learned weights. Each head captures different relationships — one head might focus on syntactic structure, another on semantic similarity, another on pronoun resolution. Results are concatenated and passed through a final learned output projection.
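A rough sketch of the multi-head wiring, again with random toy weights rather than trained parameters (the per-head projections `Wq`, `Wk`, `Wv` and the output projection `Wo` are stand-ins for learned matrices):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    """H parallel heads, each with its own projections; concat, then project."""
    H = Wq.shape[0]
    heads = []
    for h in range(H):
        Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]  # per-head Q/K/V
        d_k = Q.shape[-1]
        w = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
        heads.append(w @ V)                        # one head's output
    return np.concatenate(heads, axis=-1) @ Wo    # concat heads, project back

# Toy sizes: 5 tokens, model dim 16, 4 heads of dim 4 each.
rng = np.random.default_rng(1)
T, d_model, H = 5, 16, 4
d_head = d_model // H
X = rng.normal(size=(T, d_model))
Wq = rng.normal(size=(H, d_model, d_head))
Wk = rng.normal(size=(H, d_model, d_head))
Wv = rng.normal(size=(H, d_model, d_head))
Wo = rng.normal(size=(d_model, d_model))
out = multi_head_attention(X, Wq, Wk, Wv, Wo)
print(out.shape)  # (5, 16): same shape as the input
```

Note the design constraint this illustrates: with H heads of dimension d_model/H each, the concatenated output has the same width as the input, so layers can be stacked.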
**Self-attention vs. cross-attention:**
- Self-attention: tokens attend to other tokens in the same sequence (most common)
- Cross-attention: queries from one sequence attend to keys/values from another (used in encoder-decoder models for translation)
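The only difference between the two is where K and V come from; reusing a plain attention function makes that concrete. A small shape-only sketch (random toy data, hypothetical sequence lengths):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V

rng = np.random.default_rng(2)
d = 8
decoder = rng.normal(size=(3, d))  # 3 target-side tokens (queries)
encoder = rng.normal(size=(7, d))  # 7 source-side tokens (keys/values)

self_attn  = attention(decoder, decoder, decoder)  # Q, K, V from one sequence
cross_attn = attention(decoder, encoder, encoder)  # Q from decoder, K/V from encoder
print(self_attn.shape)   # (3, 8)
print(cross_attn.shape)  # (3, 8): one output per *query* token
```

The output length always follows the queries — in cross-attention, each target token produces one blended summary of the source sequence.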
GPT-4's architecture has not been published; GPT-3 has 96 attention heads per layer and 96 layers — 9,216 attention computations per forward pass.
**Key takeaway:** Attention lets every token look at every other token simultaneously — the mechanism that makes AI context-aware across entire documents.