Before attention, sequence models like RNNs processed text word by word in order, compressing everything seen so far into a single fixed-size hidden state. By the end of a long sentence, early words were poorly represented. Attention solved this by allowing the model to look back directly at any earlier position when processing the current one, weighting how relevant each past position is.

The mechanics: for each output position, the model computes a query vector (what am I looking for?), compares it against key vectors (what does each input position offer?), and uses the resulting similarity scores as weights over value vectors (what information does each position carry?). This query-key-value structure is the mathematical heart of attention.

Self-attention applies this mechanism within a single sequence, allowing every word to attend to every other word simultaneously, which is what enables transformers to capture long-range dependencies that RNNs couldn't. Multi-head attention runs several attention operations in parallel, each potentially capturing a different type of relationship (syntactic, semantic, coreference).

The attention map, the matrix of weights showing which positions attend to which, has become a tool for interpretability research, revealing which parts of the input a model focuses on when generating each output token. Attention is also why transformers scale so effectively: unlike recurrent architectures, self-attention parallelizes completely over sequence positions.
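The query-key-value mechanics above can be sketched in a few lines of NumPy. This is a minimal single-head version of scaled dot-product self-attention; the weight names `Wq`/`Wk`/`Wv` and the division by the square root of the key dimension follow the standard transformer formulation rather than anything stated on this card.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over one sequence.
    X: (seq_len, d_model); Wq, Wk, Wv: (d_model, d_head)."""
    Q = X @ Wq  # queries: what each position is looking for
    K = X @ Wk  # keys: what each position offers
    V = X @ Wv  # values: the information each position carries
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of every query to every key
    weights = softmax(scores, axis=-1)  # the "attention map": each row sums to 1
    return weights @ V, weights         # per-position weighted mix of values

# Toy example with random inputs and weights
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 8, 4
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
# out has shape (5, 4); attn has shape (5, 5) with rows summing to 1
```

Multi-head attention simply runs several such heads with independent weight matrices and concatenates their outputs; the `attn` matrix returned here is exactly the attention map that interpretability work visualises.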
What is the Attention Mechanism in Deep Learning?
The attention mechanism lets neural networks focus on the most relevant parts of their input when producing each output — similar to how you re-read a specific paragraph of a contract before answering a question about it. It's the core innovation that made transformers, and therefore modern LLMs, possible.
attention-mechanism · transformers · self-attention