The history of embeddings maps directly onto the history of deep learning progress.

Generation one was one-hot encoding: a vocabulary of 50,000 words became a 50,000-dimensional sparse vector with a single 1. Memory-hungry, capturing no semantic relationships, and failing catastrophically on unseen words.

Generation two was Word2Vec (2013) and GloVe (2014): shallow neural networks trained to predict a word's context produced dense ~300-dimensional vectors in which semantic similarity emerged as a geometric property. Revolutionary, but static: every occurrence of 'jaguar' gets the same vector, whether you mean the car or the animal.

Generation three was ELMo (2018): the first widely adopted contextual embeddings, using bidirectional LSTMs to produce word representations that change with the surrounding sentence. Computationally expensive, but a proof of concept for contextual representation.

Generation four is the transformer era: BERT (2018), RoBERTa, and their descendants produce rich contextual embeddings using attention mechanisms that weigh every token against every other token in the input. Today's sentence transformers (SBERT, E5, BGE) are optimized to represent a full sentence, not just its individual words, as a single vector suitable for similarity search and retrieval tasks.
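The core limitation of generation one, and the leap to generation two, can be sketched in a few lines of numpy. The dense vectors below are invented toy values for illustration, not output from any real model:

```python
import numpy as np

# Generation 1: one-hot. Each word is a V-dimensional vector with a single 1.
vocab = ["cat", "dog", "car"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Distinct one-hot vectors are always orthogonal: their dot product is 0,
# so "cat" is no closer to "dog" than it is to "car".
assert np.dot(one_hot["cat"], one_hot["dog"]) == 0.0

# Generation 2+: dense embeddings (toy 4-dimensional vectors, made up here;
# real Word2Vec/GloVe vectors are learned and typically ~300-dimensional).
dense = {
    "cat": np.array([0.9, 0.8, 0.1, 0.0]),
    "dog": np.array([0.8, 0.9, 0.2, 0.1]),
    "car": np.array([0.1, 0.0, 0.9, 0.8]),
}

def cosine(a, b):
    """Cosine similarity: the geometric notion of 'semantic closeness'."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(dense["cat"], dense["dog"]))  # high: related animals
print(cosine(dense["cat"], dense["car"]))  # low: unrelated concepts
```

Note that a dictionary of static vectors like this still has the 'jaguar' problem: one word, one vector, regardless of context. Resolving that requires the contextual models of generations three and four.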
From Static to Contextual: The Evolution of Embedding Fundamentals
The progression from one-hot encoding to Word2Vec to transformer-based contextual embeddings represents one of the most consequential arcs in modern AI. Each generation solved real limitations of the previous one, and understanding why reveals the design principles behind today's most powerful models.