List of general criticisms of interpretability for alignment
It’s too dual-use: once you know where the useful modules are, you can drop the rest and distill, making training cheaper and saving up to 99% on inference cost. Induction heads were in fact used for capabilities work (see).
Intelligence is way too messy and heuristic; it'll never be neatly interpretable.
We want to predict capabilities and dynamics, but interpretability is mostly done ex post, after a phenomenon has already been discovered, rather than ex ante (see).
Almost all of this work is post-hoc interpretation of already-trained models; what about intrinsic interpretability techniques (training models to be easier to study in the first place)?
Any interpretability-based modification to a model applies optimization pressure toward anti-interpretable scheming and steganography.
Deception detection is too hard via this route (see). (Deep) deception is a property of the AI-world interaction, not the AI alone (see).
Circuits/features are context-dependent: even if you find a way to describe them well within your train/test distributions, that doesn't mean your interpretation will generalize arbitrarily far beyond them.
Current methods require that you already know and understand the algorithm you’re looking for inside the model weights. This obviously doesn’t scale to superintelligence.