**The problem:** A pre-trained LLM predicts text. It will complete 'How do I make a bomb?' as enthusiastically as 'Explain photosynthesis'. Not ideal.
**RLHF solution (3 steps):**
1. **SFT (Supervised Fine-Tuning)**: Fine-tune on human-written ideal responses
2. **Reward Model**: Train a model to score response quality from human preferences (A vs B)
3. **PPO training**: Use reinforcement learning to optimize the LLM to maximize reward-model scores
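Step 2 is usually trained with a Bradley-Terry pairwise loss: given a preferred and a rejected response, push the reward model to score the preferred one higher. A minimal numerical sketch (the function name and scalar scores are illustrative, not from any specific library):

```python
import math

def reward_pair_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    Small when the reward model already scores the human-preferred
    response higher than the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Scoring the preferred response higher yields a lower loss:
assert reward_pair_loss(2.0, 0.0) < reward_pair_loss(0.0, 2.0)
```

In practice the scalar scores come from a reward head on top of the LLM, and the loss is averaged over a batch of preference pairs.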
The result: a model that prefers helpful, safe responses — because those get high reward.
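One detail behind "maximise reward model scores": the PPO reward is usually shaped with a KL penalty so the policy can't drift arbitrarily far from the SFT reference model just to exploit the reward model. A minimal sketch, assuming scalar per-response log-probabilities (the function name and `beta` value are illustrative):

```python
import math

def shaped_reward(rm_score: float, logp_policy: float, logp_ref: float,
                  beta: float = 0.1) -> float:
    """RLHF PPO reward: reward-model score minus a KL penalty that keeps
    the policy close to the SFT reference model."""
    kl = logp_policy - logp_ref  # in practice, summed over response tokens
    return rm_score - beta * kl

# Drifting far from the reference model erodes the effective reward:
assert shaped_reward(1.0, -3.0, -5.0) < shaped_reward(1.0, -5.0, -5.0)
```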
Current frontier: RLAIF (RL from AI feedback), Constitutional AI, and DPO (Direct Preference Optimization) — which trains on preference pairs directly, skipping the separate reward model and PPO loop.
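DPO's simplification is a single classification-style loss on (chosen, rejected) pairs, using log-probabilities from the policy and a frozen reference model in place of an explicit reward. A minimal numerical sketch (the scalar log-probs and `beta` are illustrative):

```python
import math

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO loss: -log sigmoid(beta * (policy log-ratio - reference log-ratio))
    for one (chosen=w, rejected=l) pair. No reward model, no PPO loop."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Loss falls as the policy raises the chosen response's likelihood
# relative to the rejected one (reference held fixed):
assert dpo_loss(-1.0, -5.0, -3.0, -3.0) < dpo_loss(-3.0, -3.0, -3.0, -3.0)
```

The implicit reward here is `beta` times the log-ratio between policy and reference, which is why no separately trained reward model is needed.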