**The problem:** LLMs have a knowledge cutoff and can hallucinate. You can't fit your entire company's docs in the context window.
**RAG architecture:**
1. **Embed** all your documents into a vector database
2. **Query**: user's question → embed → find top-K similar document chunks
3. **Augment**: inject retrieved chunks into the prompt: "Based on these docs: [...] answer: [question]"
4. **Generate**: LLM answers with grounded context
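The retrieve-and-augment steps above can be sketched end to end. This is a toy: the bag-of-words "embedding", the sample chunks, and the prompt template are all stand-ins (real systems use a trained embedding model and a vector DB), and the final LLM call is omitted.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding" — a stand-in for a real embedding
    # model (e.g. sentence-transformers). Tokens are lowercase words.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, chunks, k=2):
    # Step 2: embed the question, rank chunks by similarity, keep top-K.
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(question, chunks):
    # Step 3: inject the retrieved chunks into the prompt.
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Based on these docs:\n{context}\nAnswer: {question}"

# Hypothetical document chunks, already "embedded" on the fly above.
chunks = [
    "Refunds are processed within 5 business days.",
    "Our office is closed on public holidays.",
    "To request a refund, email support@example.com.",
]

question = "How do I get a refund?"
top = retrieve(question, chunks, k=2)
prompt = build_prompt(question, top)
print(prompt)  # Step 4 would send this prompt to the LLM.
```

Swapping `embed` for a real embedding model and storing vectors in a vector DB turns this sketch into the production shape without changing the control flow.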
**Result:** The model answers from real documents instead of relying solely on its parametric memory, and citations back to source chunks are possible.
**Key components:** vector DB (Pinecone, Weaviate, pgvector), embedding model, chunking strategy, reranker.
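Of these components, the chunking strategy is the easiest to sketch. A minimal version is a fixed-size sliding window with overlap; the sizes here are arbitrary assumptions, and real pipelines often split on sentence or paragraph boundaries instead.

```python
def chunk_text(text, size=200, overlap=50):
    # Fixed-size sliding window over characters with overlap, so that
    # a fact straddling a boundary still appears whole in some chunk.
    # Boundary-aware (sentence/paragraph) splitting is deliberately
    # left out of this sketch.
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        piece = text[start:start + size]
        if piece:
            chunks.append(piece)
        if start + size >= len(text):
            break
    return chunks

doc = "x" * 500
parts = chunk_text(doc, size=200, overlap=50)
print(len(parts))
```

The overlap guarantees each chunk shares its last `overlap` characters with the start of the next, which costs some index storage but improves recall at chunk boundaries.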