LLMs are next-token predictors.
Given the sequence 'The capital of France is', the model outputs a probability distribution over all tokens in its vocabulary. 'Paris' gets high probability. That token is sampled and appended. Repeat.
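The predict-sample-append loop can be sketched with a toy vocabulary. The vocabulary and logit values below are invented for illustration; a real model has tens of thousands of tokens and produces logits from a neural network, but the softmax-then-sample step works the same way.

```python
import math
import random

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary and hand-picked logits for the prompt
# 'The capital of France is' (illustrative numbers, not a real model).
vocab = ["Paris", "London", "Berlin", "banana"]
logits = [6.0, 2.0, 2.0, -3.0]

probs = softmax(logits)
random.seed(0)
# 'Paris' dominates the distribution, so it is almost always sampled.
next_token = random.choices(vocab, weights=probs)[0]
print(next_token)
```

In a real model, the chosen token's id is appended to the input and the whole forward pass runs again to produce the next distribution.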
**Temperature** controls randomness:
- Low (e.g. 0.1) → near-deterministic; the most likely token almost always wins, which can turn repetitive
- High (e.g. 1.5) → more varied and creative, but increasingly incoherent
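Mechanically, temperature divides the logits before the softmax. A minimal sketch, using made-up logit values:

```python
import math

def softmax(logits):
    # Standard stable softmax.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def apply_temperature(logits, temperature):
    # Dividing by T < 1 sharpens the distribution (top token dominates);
    # T > 1 flattens it toward uniform, admitting unlikely tokens.
    return softmax([x / temperature for x in logits])

logits = [6.0, 2.0, 2.0, -3.0]  # illustrative, not from a real model
for t in (0.1, 1.0, 1.5):
    probs = apply_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At T = 0.1 the top token's probability is essentially 1, which is why low temperature reads as near-deterministic; at T = 1.5 the tail tokens get real probability mass.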
**Context window** is how many tokens the model 'sees' at once. GPT-4 Turbo's window is about 128k tokens, on the order of 100,000 words of English text.
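One practical consequence: when a conversation grows past the window, the oldest tokens fall out of view. A simple sliding-window truncation sketch (one of several strategies; the window size here is tiny for illustration):

```python
def fit_to_context(token_ids, max_tokens=8):
    # Keep only the most recent tokens when the history exceeds the
    # window; anything earlier is invisible to the model.
    if len(token_ids) <= max_tokens:
        return token_ids
    return token_ids[-max_tokens:]

history = list(range(12))  # pretend token ids
print(fit_to_context(history))  # only the 8 most recent ids survive
```

Real chat applications often use smarter strategies, such as keeping the system prompt pinned and summarizing dropped turns, but the hard limit on visible tokens is the same.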
This simple mechanism, scaled to hundreds of billions of parameters, produces remarkably coherent and useful text.