Mechanistic interpretability (mech interp) tries to reverse-engineer what a neural network is actually computing.
**Anthropic's findings:** Features don't map one-to-one onto neurons: a single neuron often activates for multiple unrelated concepts (polysemanticity). Features are better described as directions in activation space, with the model packing more features than it has dimensions (the superposition hypothesis).
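A toy sketch of the superposition picture (the feature count, dimension, and active set here are arbitrary illustrative choices, not values from any real model): more features than dimensions, each feature a direction, and a single "neuron" (basis coordinate) receiving contributions from several unrelated features.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, d_model = 20, 8          # more features than dimensions
# Each feature is a random unit-norm direction in activation space
directions = rng.normal(size=(n_features, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Activate a sparse subset of features; the activation vector is their sum
active = [3, 7, 12]
activation = directions[active].sum(axis=0)

# Read a single "neuron" (one basis coordinate): several unrelated
# features contribute to it -- polysemanticity
neuron_0 = activation[0]
contributions = directions[active, 0]

# Features remain approximately recoverable as directions: projecting the
# activation onto a feature's direction gives ~1 (plus small interference)
# for active features, and only interference noise for inactive ones
scores = directions @ activation
```

Because the random directions are only near-orthogonal, the projections carry interference noise; that interference is exactly the cost superposition trades for capacity.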
**Circuits work:** Researchers have identified specific algorithmic circuits in transformers, such as induction heads (which drive much of in-context learning) and attention heads that track syntactic structure.
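The induction-head attention pattern itself is simple to state: from the current token, attend to the position just after the most recent earlier occurrence of that same token, then copy what followed it. A minimal sketch of that pattern (the function name and sequence are illustrative, not from any published implementation):

```python
def induction_targets(tokens):
    """For each position t, return where an induction head would attend:
    the position right after the most recent earlier occurrence of
    tokens[t], or None if the token has not appeared before."""
    last_seen = {}
    targets = []
    for t, tok in enumerate(tokens):
        prev = last_seen.get(tok)
        targets.append(prev + 1 if prev is not None else None)
        last_seen[tok] = t
    return targets

# On "the cat sat on the", the second "the" attends to position 1 ("cat"),
# boosting the prediction that "cat" comes next -- in-context copying.
seq = ["the", "cat", "sat", "on", "the"]
print(induction_targets(seq))  # [None, None, None, None, 1]
```

Checking how closely a real head's attention matrix matches this predicted pattern is one common way such heads are located in trained models.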
**The goal:** If we understand *what* a model 'knows' and *how* it uses that knowledge, we can:
- Detect deceptive alignment
- Remove dangerous capabilities surgically
- Verify safety properties formally
**Current state:** Early days. These techniques work well on small models; scaling them to frontier models is the key open challenge.