Before Transformers, language models such as RNNs and LSTMs processed text sequentially, one word at a time, left to right. This was slow, and it made it hard to connect ideas that were far apart in a sentence.
Transformers solved this with a mechanism called **self-attention**. Instead of reading word by word, the model looks at the entire input at once and calculates how much each word should 'attend to' every other word. For the sentence 'The animal didn't cross the street because it was too tired,' the model learns that 'it' refers to 'animal,' not 'street,' because the attention weight linking 'it' to 'animal' comes out high.
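Here's a minimal sketch of that idea in NumPy. It's deliberately simplified: real Transformers first project each token into separate query, key, and value vectors with learned weight matrices, while this toy version attends over the raw embeddings directly (and the three 4-dimensional "token embeddings" are made-up numbers for illustration):

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over token embeddings X (T x d).

    Simplified: no learned Q/K/V projections -- each token attends
    using its raw embedding as query, key, and value.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)            # similarity of every token pair
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ X, weights               # weighted mix of all tokens

# Three toy token embeddings; token 2 is deliberately similar to token 0.
X = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.9, 0.1, 0.0, 0.0]])
out, w = self_attention(X)
print(np.round(w, 2))  # row 2 puts more weight on token 0 than on token 1
```

Each output row is a blend of *all* input tokens, weighted by relevance — that blending is what lets 'it' pick up information from 'animal' regardless of how far apart they sit.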
The architecture has two main parts:
- **Encoder**: Reads the input and builds a rich internal representation
- **Decoder**: Generates output token by token, attending to both the encoder output and its own previous outputs
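The decoder's link to the encoder is called cross-attention: queries come from the decoder's tokens, while keys and values come from the encoder's output. A rough sketch, again skipping the learned projection matrices and using random vectors as stand-ins for real representations:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(dec, enc):
    """Decoder tokens (queries) attend to encoder output (keys/values).

    Simplified: no learned projections, single head.
    """
    d = dec.shape[-1]
    weights = softmax(dec @ enc.T / np.sqrt(d))  # one row per decoder token
    return weights @ enc                          # context vector per token

enc_out = np.random.randn(5, 8)  # 5 source tokens, dimension 8
dec_in  = np.random.randn(2, 8)  # 2 target tokens generated so far
ctx = cross_attention(dec_in, enc_out)
print(ctx.shape)  # (2, 8): one context vector per decoder token
```

This is how each generated token gets to "look back" at the full input representation at every step.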
Modern LLMs like GPT use decoder-only Transformers (no encoder). BERT uses encoder-only. T5 and early translation models use both.
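What makes decoder-only models work for generation is a **causal mask**: each token may attend only to itself and earlier tokens, never to the future. A minimal sketch (same simplification as before — no learned projections, random embeddings as placeholders):

```python
import numpy as np

def causal_self_attention(X):
    """GPT-style attention: mask out all future positions before softmax."""
    T, d = X.shape
    scores = X @ X.T / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores[mask] = -np.inf                            # future tokens: weight 0
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ X, weights

X = np.random.randn(4, 8)  # 4 toy tokens, dimension 8
_, w = causal_self_attention(X)
print(np.round(w, 2))  # upper triangle is all zeros: no peeking ahead
```

Encoder-only models like BERT drop the mask, which is why they see context in both directions but can't generate text left to right the way GPT does.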
What made Transformers revolutionary wasn't just accuracy — it was **parallelizability**. Unlike sequential models, Transformers can process all tokens simultaneously during training, making them dramatically faster to train on GPUs. This is why we could scale to models with hundreds of billions of parameters.
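The contrast is easy to see in code. In an RNN, step *t* needs the hidden state from step *t−1*, so the loop below is inherently serial; in a Transformer, the attention output for every token falls out of a single batched matrix multiply. This is a toy illustration with random data, not a benchmark:

```python
import numpy as np

T, d = 6, 8
X = np.random.randn(T, d)   # toy sequence: T tokens, dimension d
W = np.random.randn(d, d)   # toy recurrent weight matrix

# RNN-style: T dependent steps; each iteration needs the previous h.
h = np.zeros(d)
rnn_states = []
for x in X:
    h = np.tanh(x + h @ W)  # cannot compute step t before step t-1
    rnn_states.append(h)

# Transformer-style: all T tokens processed in one shot.
scores = X @ X.T / np.sqrt(d)
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn_out = (e / e.sum(axis=-1, keepdims=True)) @ X  # every row at once
print(attn_out.shape)  # (6, 8)
```

On a GPU, that single matmul maps onto thousands of parallel cores, which is exactly the property that made scaling to enormous models practical.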
**Key takeaway:** 'Attention Is All You Need,' the 2017 paper by Vaswani et al., introduced the architecture that sits at the heart of virtually every major AI model today.