Every response you get from ChatGPT is generated one token at a time. A token is roughly a word (or part of one). The model looks at everything written so far and assigns a probability to every possible next token. Then it samples one — usually, but not always, the most likely — adds it to the text, and repeats until the answer is complete.
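That loop is easy to sketch. Here's a minimal toy version in Python: the "model" is just a hand-made probability table (all tokens and numbers are invented for illustration — a real model computes these probabilities with a neural network over a vocabulary of tens of thousands of tokens):

```python
import random

# Toy "language model": maps the previous token to a probability
# distribution over possible next tokens. Entirely made up for illustration.
NEXT_TOKEN_PROBS = {
    "<start>": {"The": 0.6, "A": 0.4},
    "The":     {"cat": 0.5, "dog": 0.5},
    "A":       {"cat": 0.5, "dog": 0.5},
    "cat":     {"sat": 0.7, "ran": 0.3},
    "dog":     {"sat": 0.4, "ran": 0.6},
    "sat":     {"down.": 1.0},
    "ran":     {"away.": 1.0},
}

def generate(seed=0, max_tokens=10):
    """Autoregressive loop: predict one token, append it, repeat."""
    rng = random.Random(seed)
    context = "<start>"
    output = []
    while len(output) < max_tokens:
        dist = NEXT_TOKEN_PROBS.get(context)
        if dist is None:  # no known continuation: generation stops
            break
        tokens, weights = zip(*dist.items())
        next_token = rng.choices(tokens, weights=weights)[0]  # sample, don't always take the max
        output.append(next_token)
        context = next_token  # the newly chosen token becomes the context for the next step
    return " ".join(output)

print(generate())
```

Note that generation is random but reproducible given a seed — which is why the same prompt can yield different answers on different runs.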
This process is called autoregressive generation. The model was trained on a massive chunk of the internet — books, code, articles, forums — and learned the statistical patterns of language. It never memorized facts like a database; it learned *relationships between concepts*.
Behind this is a Transformer architecture. The key innovation is 'attention' — the model can look at every word in the input simultaneously and weigh which ones matter most for predicting the next token. This is what makes it context-aware.
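The attention step can be shown in a few lines. This is a minimal sketch of scaled dot-product attention (the core Transformer operation) in plain Python; the word vectors are toy numbers, and a real model uses learned, high-dimensional vectors and many attention heads in parallel:

```python
import math

def softmax(xs):
    """Turn raw scores into a probability distribution."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Scaled dot-product attention: each query attends to every key at once."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Score this word against every word in the input simultaneously.
        scores = [dot(q, k) / math.sqrt(d) for k in keys]
        weights = softmax(scores)  # how much each word "matters" here
        # Output = weighted average of the value vectors.
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Three "words", each represented by a toy 2-d vector:
Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in attention(Q, K, V):
    print([round(x, 2) for x in row])
```

The key point: the scores are computed for all positions at the same time, which is what lets the model weigh the whole context when predicting each token.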
When you ask 'What is the capital of France?' — the model doesn't look it up. It saw 'capital of France' followed by 'Paris' countless times during training, and learned that pattern deeply enough to reproduce it reliably.
The training process has two stages: (1) Pre-training on raw text to learn language patterns, and (2) RLHF (Reinforcement Learning from Human Feedback) to make responses helpful and safe. Human raters score outputs, and the model learns to produce higher-rated responses.
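The reward-driven part of stage (2) can be caricatured in a few lines. In this toy sketch the "model" is just a probability over two canned responses, and the update is a simple reward-weighted bump — real RLHF trains a separate reward model on rater preferences and updates the network with policy-gradient methods such as PPO, but the direction of the effect is the same: higher-rated responses become more likely.

```python
# Toy illustration of reward-based updating (not real RLHF).
probs = {"helpful answer": 0.5, "unhelpful answer": 0.5}
ratings = {"helpful answer": 1.0, "unhelpful answer": 0.2}  # made-up rater scores

learning_rate = 0.5
for _ in range(5):
    # Expected rating under the current "model".
    avg = sum(probs[r] * ratings[r] for r in probs)
    # Boost responses rated above average, shrink those below it.
    for r in probs:
        probs[r] *= 1 + learning_rate * (ratings[r] - avg)
    # Renormalize so the probabilities still sum to 1.
    total = sum(probs.values())
    probs = {r: p / total for r, p in probs.items()}

print(probs)
```

After a few rounds, nearly all the probability mass sits on the higher-rated response — which is the whole point of the feedback stage.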
**Key takeaway:** ChatGPT predicts the next word based on patterns learned from billions of texts — it doesn't 'know' things the way humans do.