From One-Hot to Contextual: Embeddings Evolution Explained
Intermediate · AI & ML · Fundamentals · Knowledge

The progression from one-hot encoding to Word2Vec to transformer-based contextual embeddings represents the most consequential arc in modern NLP. Each generation solved specific limitations of the previous one, and the design principles behind this evolution explain why today's LLMs work the way they do.

The history of text embeddings is a story of successive limitations solved by increasingly sophisticated ideas.

Generation one was one-hot encoding: each word in a 50,000-word vocabulary became a 50,000-dimensional sparse vector with a single 1. Because every pair of one-hot vectors is orthogonal, every word is equally distant from every other word: the encoding captures no semantic relationships, and it is computationally wasteful besides.

Generation two arrived in 2013 with Word2Vec and in 2014 with GloVe. Shallow neural networks trained to predict a word's context (or, in GloVe's case, a factorization of global co-occurrence counts) produced dense 300-dimensional vectors in which semantic similarity emerged as geometric proximity. Revolutionary, but the vectors were static: every occurrence of 'bank' got the same vector, whether it meant the river bank or the financial bank.

Generation three was ELMo in 2018: the first widely adopted contextual embeddings, built on bidirectional LSTMs. A word's vector now varied with its context. Computationally expensive, but a crucial proof of concept.

Generation four is the transformer era, launched by BERT in 2018 and continuing today. Self-attention replaced recurrence, and the pretrain-then-finetune paradigm took over NLP. Modern sentence transformers like SBERT, E5, and BGE are optimized specifically to produce embeddings suited to similarity search and retrieval, the foundation of the RAG systems powering many of today's LLM applications.

Each generation didn't just make embeddings better; it changed what was possible. The shift from static to contextual embeddings enabled modern chatbots. The shift to sentence embeddings enabled semantic search at scale. The next frontier is multimodal embeddings that encode text, images, and audio in a shared space, enabling cross-modal search and generation.
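To make the first two generations concrete, here is a minimal sketch in plain NumPy. The dense vectors below are invented illustrative values, standing in for the roughly 300-dimensional vectors Word2Vec or GloVe would actually learn.

```python
import numpy as np

# Toy vocabulary; a real system might have ~50,000 entries.
vocab = ["river", "bank", "money", "water"]

def one_hot(word: str) -> np.ndarray:
    """Sparse vector: all zeros except a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Every pair of distinct one-hot vectors is orthogonal, so similarity
# is always 0: the encoding carries no notion of relatedness.
print(cosine(one_hot("river"), one_hot("water")))  # 0.0
print(cosine(one_hot("river"), one_hot("money")))  # 0.0

# Dense embeddings (made-up 3-d values for illustration) let related
# words sit closer together in the vector space.
dense = {
    "river": np.array([0.9, 0.1, 0.0]),
    "water": np.array([0.8, 0.2, 0.1]),
    "money": np.array([0.0, 0.9, 0.8]),
}
print(cosine(dense["river"], dense["water"]))  # high (~0.98)
print(cosine(dense["river"], dense["money"]))  # low (~0.08)
```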

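The static-versus-contextual difference can be checked directly. Below is a sketch assuming the Hugging Face transformers package and the public bert-base-uncased checkpoint: the same surface word 'bank' receives two different vectors in two different sentences, something no static lookup table can do.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    """Return the contextual vector BERT assigns to the token 'bank'."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    bank_id = tokenizer.convert_tokens_to_ids("bank")
    position = (inputs["input_ids"][0] == bank_id).nonzero()[0].item()
    return hidden[position]

v_river = bank_vector("She sat on the bank of the river.")
v_money = bank_vector("He deposited cash at the bank.")

# A static model (Word2Vec/GloVe) would score exactly 1.0 here, since
# 'bank' maps to one fixed vector. BERT's two vectors differ; the gap
# below 1.0 is the surrounding context doing its work.
sim = torch.cosine_similarity(v_river, v_money, dim=0)
print(f"cosine('bank'@river, 'bank'@money) = {sim.item():.3f}")
```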
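Finally, the retrieval workload that sentence transformers target. A minimal semantic-search sketch, assuming the sentence-transformers package and its public all-MiniLM-L6-v2 model; a production RAG system would embed its corpus once and keep the vectors in a vector index rather than comparing them in memory.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# A toy "document store": three sentences standing in for a corpus.
docs = [
    "Transformers use self-attention instead of recurrence.",
    "GloVe learns word vectors from global co-occurrence statistics.",
    "Our bakery opens at 7am on weekdays.",
]
doc_vecs = model.encode(docs, convert_to_tensor=True)

# Retrieval: embed the query in the same space, rank by cosine similarity.
query_vec = model.encode("How does attention work in transformers?",
                         convert_to_tensor=True)
scores = util.cos_sim(query_vec, doc_vecs)[0]
best = scores.argmax().item()
print(docs[best])  # the self-attention sentence should rank first
```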
