The attention mechanism is the key innovation that makes Transformers work. Understanding it means understanding modern AI at its core.
**The problem attention solves:**
In a sentence like 'The trophy didn't fit in the suitcase because it was too big' — what does 'it' refer to? The trophy or the suitcase? For humans, obvious. For sequential models that squeeze the sentence through a single hidden state, the long-range link between 'it' and 'trophy' is easily lost; attention instead lets every token consult every other token directly.
**How attention works:**
For each token (word), the model computes three vectors:
- **Query (Q)**: 'What am I looking for?'
- **Key (K)**: 'What do I contain?'
- **Value (V)**: 'What information do I carry?'
To compute attention:
1. Compute similarity scores: Q · Kᵀ (dot product of each query with every key)
2. Scale by √d_k (the key dimension) and apply softmax to get attention weights (each row sums to 1)
3. Weighted sum of Values using those weights
Result: each token's new representation is a blend of all other tokens' values, weighted by relevance. 'Trophy' and 'it' end up correlated because they share high attention weights in that sentence.
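The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration with random toy vectors, not a trained model — the shapes and function names are chosen here for clarity:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q Kᵀ / √d_k) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # step 1 + 2a: similarity scores, scaled
    weights = softmax(scores, axis=-1)  # step 2b: each row sums to 1
    return weights @ V, weights         # step 3: weighted sum of values

# Toy example: 4 tokens, 8-dimensional Q/K/V (random, for illustration only).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = attention(Q, K, V)
print(out.shape)       # (4, 8): one blended representation per token
print(w.sum(axis=-1))  # each row of attention weights sums to 1
```

In a real Transformer, Q, K, and V are produced by learned linear projections of the token embeddings rather than sampled randomly.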
**Multi-head attention:**
Instead of one attention calculation, run H parallel attention computations (heads) with different learned weights. Each head captures different relationships — one head might focus on syntactic structure, another on semantic similarity, another on pronoun resolution. Results are concatenated and passed through a final learned output projection.
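A rough sketch of the multi-head wiring, again with random toy weights rather than trained parameters (the per-head projections `Wq`, `Wk`, `Wv` and the output projection `Wo` are stand-ins for learned matrices):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    """H parallel heads, each with its own projections; concat, then project."""
    H = Wq.shape[0]
    heads = []
    for h in range(H):
        Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]  # per-head Q/K/V
        d_k = Q.shape[-1]
        w = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
        heads.append(w @ V)                        # one head's output
    return np.concatenate(heads, axis=-1) @ Wo    # concat heads, project back

# Toy sizes: 5 tokens, model dim 16, 4 heads of dim 4 each.
rng = np.random.default_rng(1)
T, d_model, H = 5, 16, 4
d_head = d_model // H
X = rng.normal(size=(T, d_model))
Wq = rng.normal(size=(H, d_model, d_head))
Wk = rng.normal(size=(H, d_model, d_head))
Wv = rng.normal(size=(H, d_model, d_head))
Wo = rng.normal(size=(d_model, d_model))
out = multi_head_attention(X, Wq, Wk, Wv, Wo)
print(out.shape)  # (5, 16): same shape as the input
```

Note the design constraint this illustrates: with H heads of dimension d_model/H each, the concatenated output has the same width as the input, so layers can be stacked.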
**Self-attention vs. cross-attention:**
- Self-attention: tokens attend to other tokens in the same sequence (most common)
- Cross-attention: queries from one sequence attend to keys/values from another (used in encoder-decoder models for translation)
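The only difference between the two is where K and V come from; reusing a plain attention function makes that concrete. A small shape-only sketch (random toy data, hypothetical sequence lengths):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V

rng = np.random.default_rng(2)
d = 8
decoder = rng.normal(size=(3, d))  # 3 target-side tokens (queries)
encoder = rng.normal(size=(7, d))  # 7 source-side tokens (keys/values)

self_attn  = attention(decoder, decoder, decoder)  # Q, K, V from one sequence
cross_attn = attention(decoder, encoder, encoder)  # Q from decoder, K/V from encoder
print(self_attn.shape)   # (3, 8)
print(cross_attn.shape)  # (3, 8): one output per *query* token
```

The output length always follows the queries — in cross-attention, each target token produces one blended summary of the source sequence.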
GPT-4's architecture has not been published; GPT-3 has 96 attention heads per layer and 96 layers — 9,216 attention computations per forward pass.
**Key takeaway:** Attention lets every token look at every other token simultaneously — the mechanism that makes AI context-aware across entire documents.