Mechanistic interpretability (mech interp) tries to reverse-engineer what a neural network is actually computing.
**Anthropic's findings:** Features don't map one-to-one onto neurons: a single neuron often activates for multiple unrelated concepts (polysemanticity). Features are better described as directions in activation space, with the model packing more features than it has dimensions (the superposition hypothesis).
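A toy sketch of the superposition picture (the feature count, dimension, and active set here are arbitrary illustrative choices, not values from any real model): more features than dimensions, each feature a direction, and a single "neuron" (basis coordinate) receiving contributions from several unrelated features.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, d_model = 20, 8          # more features than dimensions
# Each feature is a random unit-norm direction in activation space
directions = rng.normal(size=(n_features, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Activate a sparse subset of features; the activation vector is their sum
active = [3, 7, 12]
activation = directions[active].sum(axis=0)

# Read a single "neuron" (one basis coordinate): several unrelated
# features contribute to it -- polysemanticity
neuron_0 = activation[0]
contributions = directions[active, 0]

# Features remain approximately recoverable as directions: projecting the
# activation onto a feature's direction gives ~1 (plus small interference)
# for active features, and only interference noise for inactive ones
scores = directions @ activation
```

Because the random directions are only near-orthogonal, the projections carry interference noise; that interference is exactly the cost superposition trades for capacity.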
**Circuits work:** Researchers have identified specific algorithmic circuits in transformers, such as induction heads (which drive much of in-context learning) and attention heads that track syntactic structure.
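The induction-head attention pattern itself is simple to state: from the current token, attend to the position just after the most recent earlier occurrence of that same token, then copy what followed it. A minimal sketch of that pattern (the function name and sequence are illustrative, not from any published implementation):

```python
def induction_targets(tokens):
    """For each position t, return where an induction head would attend:
    the position right after the most recent earlier occurrence of
    tokens[t], or None if the token has not appeared before."""
    last_seen = {}
    targets = []
    for t, tok in enumerate(tokens):
        prev = last_seen.get(tok)
        targets.append(prev + 1 if prev is not None else None)
        last_seen[tok] = t
    return targets

# On "the cat sat on the", the second "the" attends to position 1 ("cat"),
# boosting the prediction that "cat" comes next -- in-context copying.
seq = ["the", "cat", "sat", "on", "the"]
print(induction_targets(seq))  # [None, None, None, None, 1]
```

Checking how closely a real head's attention matrix matches this predicted pattern is one common way such heads are located in trained models.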
**The goal:** If we understand *what* a model 'knows' and *how* it uses that knowledge, we can:
- Detect deceptive alignment
- Remove dangerous capabilities surgically
- Verify safety properties formally
**Current state:** Early days. These techniques work well on small models; scaling them to frontier models is the key open challenge.