**The Chinchilla insight (2022):** Hoffmann et al. showed that most large models were under-trained relative to their size: for compute-optimal training, the ratio is roughly ~20 tokens of training data per model parameter.
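The ratio above can be turned into a back-of-the-envelope sizing rule. A minimal sketch, assuming the common approximation that training costs about 6 FLOPs per parameter per token (`flops_per_param_token` is an assumption, not from the text):

```python
import math

def chinchilla_optimal(flops_budget, tokens_per_param=20, flops_per_param_token=6):
    """Compute-optimal parameter and token counts for a FLOPs budget.

    Solves C = 6 * N * D with the Chinchilla heuristic D = 20 * N,
    giving N = sqrt(C / (6 * 20)).
    """
    n_params = math.sqrt(flops_budget / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a budget of ~5.76e23 FLOPs yields roughly 69B parameters
# and ~1.4T tokens, in the ballpark of the Chinchilla model itself.
n, d = chinchilla_optimal(5.76e23)
```

The same function also shows why doubling the budget should go partly into a bigger model and partly into more data rather than all into one.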
**What scales with power laws:**
- Loss decreases predictably, following a power law, as parameters, data, or compute increase
- For a fixed compute budget, it is best to grow parameters AND data together (not just one)
- Performance on downstream tasks follows similar trends
**The implication:** You can *predict* how a model will perform before training it, if you know the compute budget. This made AI research more systematic.
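The prediction works because loss follows a smooth power-law curve that can be extrapolated. A minimal sketch using the Kaplan-style form `L(N) = (Nc / N)^alpha`; the constants here are illustrative placeholders, not fitted values:

```python
def power_law_loss(n_params, n_c=8.8e13, alpha=0.076):
    """Predicted loss as a power law in parameter count.

    Illustrative constants: n_c and alpha stand in for values
    fitted to small-scale training runs.
    """
    return (n_c / n_params) ** alpha

# Doubling the parameter count shrinks predicted loss by a constant
# factor of 2 ** -alpha, regardless of the starting size.
small = power_law_loss(1e9)
large = power_law_loss(2e9)
```

In practice, a lab fits `n_c` and `alpha` on cheap small models, then reads the curve off at the target scale before committing the full budget.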
**Caveats:** Scaling laws have limits. Beyond a certain scale, emergent abilities appear unpredictably, and some tasks don't improve with scale at all.