Further Mechanistic Interpretability of Vision Models
Berkan Ottlik
“We would like to see research building towards the ability to "reverse engineer" trained neural networks into human-understandable algorithms, enabling auditors to catch unanticipated safety problems in these models.” - Chris Olah
Methods should…
My Goals
Trying to improve on naive assumptions and intuitions in explainability work!
Prior Research: Vision Transformer (ViT)
Split input into 16x16 patches (+ a class patch) and feed them into a regular transformer
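A minimal sketch of the patch-embedding step, with dimensions matching DeiT-S/16 (the strided-conv formulation is the standard trick, not necessarily the exact library code):

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=384):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2    # 14 * 14 = 196
        # A conv with kernel = stride = patch size is a linear map on each flattened patch.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, x):                                   # x: (B, 3, 224, 224)
        x = self.proj(x)                                    # (B, 384, 14, 14)
        x = x.flatten(2).transpose(1, 2)                    # (B, 196, 384)
        cls = self.cls_token.expand(x.shape[0], -1, -1)     # the "class patch"
        return torch.cat([cls, x], dim=1)                   # (B, 197, 384) -> into the transformer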
Prior Research: Data Efficient Image Transformer (DeiT)
Same architecture as ViTs, except training procedure + distillation token.
The distillation token lets a teacher model supervise DeiT training for better performance, but it is not used in my model (so for me all that changes is the training)!
Prior Research: MLP-Mixer
Basically a ViT, except it replaces attention with a patch-mixing MLP.
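For contrast, a minimal sketch of one Mixer block (hidden sizes here are illustrative, not the paper's exact config):

import torch.nn as nn

class MixerBlock(nn.Module):
    def __init__(self, num_patches=196, dim=384, token_hidden=192, channel_hidden=1536):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(                     # replaces attention: mixes across patches
            nn.Linear(num_patches, token_hidden), nn.GELU(),
            nn.Linear(token_hidden, num_patches))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(                   # per-patch MLP, like the ViT MLP
            nn.Linear(dim, channel_hidden), nn.GELU(),
            nn.Linear(channel_hidden, dim))

    def forward(self, x):                                   # x: (B, patches, dim)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x))
        return x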
Prior Research
Reverse engineered InceptionNetV1
Three Speculative Claims about Neural Networks:
Prior Research
Aims to apply things from the circuits thread to transformer language models.
Discovered induction heads, an algorithm implemented by some attention heads that enables in-context learning.
Built up lots of theory about how to understand attention heads and transformers.
Residual Stream Terminology!!
Prior Research
ViTs have more uniform representations, with greater similarity between lower and higher layers
ViT incorporates more global information than ResNet at lower layers, leading to quantitatively different features.
Prior Research: Adversarial training→better interpretability
Adversarial training improves the interpretability of local feature visualizations!
Intuition → adversarial training adds more relevant priors into the model that help it learn more human interpretable features.
Prior Research: Apples-to-apples comparisons
Shows that fair, apples-to-apples comparisons of CNNs and DeiTs (matching model size, training procedure, and evaluation process) yield better insights: the models are more similar than previously believed (although DeiTs have better OOD performance)
Possible problems:
Models to compare (all have ~76% accuracy on ImageNet 1k):
Current Progress on DeiT-S/16
Patch Embeddings
I analyzed and clustered all 384 patch embeddings of DeiT-S/16 (see the sketch after the category list)
Categories:
(Very dark might not matter because there is a layernorm right after)
Regular
Regular more black
Almost a color
Solid color
Very dark
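A sketch of how the filters can be pulled out and grouped, assuming the timm layout for DeiT-S/16 (the KMeans step and cluster count are illustrative, not necessarily what I used):

import timm
from sklearn.cluster import KMeans

model = timm.create_model("deit_small_patch16_224", pretrained=True)
w = model.patch_embed.proj.weight.detach()       # (384, 3, 16, 16): one filter per embedding dim
flat = w.reshape(w.shape[0], -1).numpy()         # (384, 768)

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(flat)
for c in range(5):
    print(f"cluster {c}: {(labels == c).sum()} filters")
# Each filter can then be rendered as a 16x16 RGB image (normalized to [0, 1]) and
# eyeballed per cluster, which is roughly how categories like the ones above get assigned.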
InceptionNetV1 (CNN) Patch Embeddings
Mostly interpretable, with lots of color contrast or gabor filters!
Privileged/Non-Privileged Basis
There is sometimes an incentive for a feature to be aligned with basis dimensions (e.g., when an elementwise nonlinearity acts on each dimension separately); representing the feature in a different way would introduce a lot of noise into the signal.
A neuron → an activation with a privileged basis
(Note: this has not yet been rigorously defined, the best explanations are this YouTube video and the SoLU paper)
LayerNorm
Really annoying to deal with because mu and sigma aren’t constant, but without it model performance is terrible!
Normalizes each patch (row); the weight and bias apply column-wise.
Redwood Research has been trying to model LN with a linear transformation (or something similar) for interpretability → I will try too!
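A minimal sketch of the idea: freeze σ from a reference batch, after which LN becomes an affine map (centering is exactly linear; only the scale is an approximation):

import torch
import torch.nn as nn

@torch.no_grad()
def linearize_layernorm(ln: nn.LayerNorm, reference_acts: torch.Tensor):
    # reference_acts: (N, d) activations entering this LN.
    # Returns (A, b) such that LN(x) ~= x @ A.T + b when x's scale matches the reference batch.
    d = reference_acts.shape[-1]
    sigma = reference_acts.std(dim=-1).mean()          # frozen scale, averaged over the batch
    center = torch.eye(d) - torch.ones(d, d) / d       # x - mean(x) is exactly linear: x @ center
    A = torch.diag(ln.weight) @ center / sigma
    b = ln.bias.clone()
    return A, b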
Multi Head Self Attention
Understanding Attention Heads (QK Side)
QK side → determines attention pattern
VO side → determines outputs given some attention pattern
Pre-softmax vector should have a privileged basis, so I’ve tried doing activation maximization of it!
Asks “what input image causes the ith attention head to favor attending to the jth key patch given the kth query patch?”
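A sketch of the objective, assuming the timm DeiT-S layout (blocks[i].attn.qkv, 6 heads of dimension 64); the image parameterization and optimizer loop are only outlined in comments:

import torch

def qk_logit(model, image, layer, head, q_idx, k_idx, num_heads=6, head_dim=64):
    # Grab the qkv projection of one block and recompute a single pre-softmax logit.
    acts = {}
    hook = model.blocks[layer].attn.qkv.register_forward_hook(
        lambda m, inp, out: acts.update(qkv=out))
    _ = model(image)
    hook.remove()
    qkv = acts["qkv"].reshape(1, -1, 3, num_heads, head_dim)   # (1, 197, 3, heads, head_dim)
    q = qkv[0, q_idx, 0, head]                                 # query vector at patch q_idx
    k = qkv[0, k_idx, 1, head]                                 # key vector at patch k_idx
    return (q @ k) / head_dim ** 0.5                           # pre-softmax attention logit

# Optimization loop (image parameterization omitted): repeatedly maximize
#   qk_logit(model, image, layer=2, head=0, q_idx=20, k_idx=20)
# with respect to the image.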
Understanding Attention Heads (QK Side)
Activation maximization of pre-softmax vector → weird results
Hard to tell what is coming from the query side and what is coming from the key side.
There is hope though!!
Layer 2, Head 0
Layer 3, Head 2
Layer 4, Head 0
General Trends
Patches don’t have to correspond to actual image patches, but they mostly do in early layers (because attention patterns there are mostly local).
In early layers, either a local patch or the class patch is attended to
Understanding Attention Heads
Looking at places where the query and key patch match up might reveal more interesting information though!
Layer 2, Head 0 →
Consistent feature when equal (or close) and when not equal?
Q patch = K patch panels: (Q=20, K=20), (Q=30, K=30), (Q=10, K=10)
Q patch != K patch panels: (Q=10, K=30), (Q=10, K=20), (Q=20, K=10), (Q=20, K=30), (Q=30, K=10), (Q=30, K=20)
Understanding Attention Heads
The “consistent feature when equal (or close) vs. when not equal” behavior applies to other heads in layer 2 too!
Layer 2, Head 2 →
Understanding Attention Heads
When Q=K, cross patch optim sometimes reveals the same pattern! (we expect more global similarity early in the model)
Layer 1, Head 2 →
Cross Patch Optim (when Q=K)
Q patch = K patch
Understanding Attention Heads
But we can’t always trust cross patch optim!
Layer 1, Head 1 →
Q patch = K patch
Cross Patch Optim (when Q=K)
Demo: Understanding layer 1 head 2 with dataset examples
Multi Layer Perceptrons
Current Progress: MLP Feature Visualizations
Feature visualizations are created by activation maximization, optimizing the input image using gradient descent to highly activate one specific part of a model.
Naive optimization gives poor results.
Optimization preconditioned in a Fourier basis gives better results (see the sketch below)
Hyperparams are slightly different for ViTs than CNNs:
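A sketch of the Fourier parameterization in the spirit of lucid/lucent (the exact hyperparameters for the DeiT runs differ from these defaults):

import torch

def fourier_image(h=224, w=224, sd=0.01):
    # Parameterize the image by its 2D Fourier spectrum, scaled by 1/frequency,
    # so gradient steps favor smooth, low-frequency structure.
    freqs_y = torch.fft.fftfreq(h)[:, None]
    freqs_x = torch.fft.rfftfreq(w)[None, :]
    freq = (freqs_y ** 2 + freqs_x ** 2).sqrt().clamp(min=1.0 / max(h, w))
    spectrum = (torch.randn(3, h, freq.shape[1], 2) * sd).requires_grad_(True)

    def to_image():
        scaled = torch.view_as_complex(spectrum) / freq       # boost low frequencies
        img = torch.fft.irfft2(scaled, s=(h, w))
        return torch.sigmoid(img).unsqueeze(0)                # (1, 3, 224, 224), values in [0, 1]

    return spectrum, to_image

# spectrum, to_image = fourier_image()
# opt = torch.optim.Adam([spectrum], lr=0.05)
# for _ in range(512):
#     opt.zero_grad()
#     loss = -objective(model, to_image())   # objective = the activation being maximized
#     loss.backward()
#     opt.step()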
Current Progress: MLPs
I’ve generated several thousand neuron-optimized images of the first 5 MLP layers and built a frontend UI to visualize them all.
There is far too much to actually analyze the entire network, but so far I have tackled the first 30 neurons from the first 23 patches of the 0th layer and the first 200 patches of layers 0, 1, and 2.
It only makes sense to visualize features with a privileged basis (such as the fc outputs in the MLP).
MLP: Patch 0, Layer 0
The 0th patch corresponds to the class patch.
Examples from InceptionNetV1
Examples from InceptionNetV1
MLP: Patch 1, Layer 0
The 1st patch corresponds to the top-left corner.
Solid Color
Beehive/Hatch Texture?
Bright or dark with color?
MLP: Layer 0
In the 0th layer, neuron activations sometimes stay consistent across patches, sometimes not.
Backgrounds stay more consistent than “balls”
Corners sometimes different, namely dark or bright
Neuron 10
Neuron 1
Neuron 25
Patch 0 ^
Patch 1 ^
MLP: Layer 0, Cross Patch Optim
We actually care about neuron optim, but cross patch optim often (not always!) gives us a good summary! (A sketch of both objectives follows the examples below.)
Neuron 1
Patch Optim
Neuron Optim
Neuron 10
Neuron 25
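A sketch of the two objectives, assuming timm module names and taking blocks[i].mlp.fc1 outputs as the neuron basis (this hook point is an assumption):

import torch

def mlp_neuron_objective(model, image, layer, neuron, patch=None):
    # Returns the activation to maximize: one (patch, neuron) entry for "neuron optim",
    # or the same neuron summed over every patch for "cross patch optim".
    acts = {}
    hook = model.blocks[layer].mlp.fc1.register_forward_hook(
        lambda m, inp, out: acts.update(mlp=out))              # out: (1, 197, hidden)
    _ = model(image)
    hook.remove()
    if patch is None:
        return acts["mlp"][0, :, neuron].sum()                 # cross patch optim
    return acts["mlp"][0, patch, neuron]                       # neuron optim at one patch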
MLP: Neuron 0, Layer 0
Analyzing a specific neuron across patches shows the same feature simply sliding across the entire image
Diversity for Patch 1 (I think this is just bad)
Optimization across all patches
Patches 0-7
Patches 19 and 23
MLP: Neuron 0, Layer 0
Naive tests seem to show that the neuron cares more about blue than about the brownish-yellow background pattern (more rigorous tests still need to be done; a probe sketch follows the numbers below).
Testing layer 0, patch 7, neuron 0 activations:
Optimization across all patches
Activations for the test images shown on the slide: 0.8999, -2.4609, -1.5889, 1.9746, 2.4727
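A sketch of this kind of naive probe test, with hypothetical solid-color probe images and the same fc1 hook point as in the earlier sketch (not the exact test images from the slide; ImageNet normalization omitted for brevity):

import timm
import torch

model = timm.create_model("deit_small_patch16_224", pretrained=True).eval()

def probe_activation(model, image, layer=0, patch=7, neuron=0):
    acts = {}
    hook = model.blocks[layer].mlp.fc1.register_forward_hook(
        lambda m, inp, out: acts.update(mlp=out))
    with torch.no_grad():
        _ = model(image)
    hook.remove()
    return acts["mlp"][0, patch, neuron].item()

# Hypothetical probes: a solid blue image vs. a solid brownish-yellow one.
blue = torch.zeros(1, 3, 224, 224); blue[:, 2] = 1.0
beige = torch.tensor([0.76, 0.66, 0.45]).view(1, 3, 1, 1).expand(1, 3, 224, 224)
print(probe_activation(model, blue), probe_activation(model, beige))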
MLP: Layer 0, Cross Patch Optim
Strange features in general, diversity often introduces white or black noise.
Lots of textures.
Again, corners often white or black
Diversity doesn’t affect some neurons much, but messes up others
Textures/directed textures
Neon stuff?
Blue/beige Color Invariant ^? (no diversity)
No diversity →
MLP: Layer 0, Cross Patch Optim
For neurons where the pattern varies more across the image (e.g., the bottom-right pattern differs from the top-left), this might show that the attention heads do a bit more global mixing of features
Mostly Solid Color
Weird Lines/contour?
Mostly solid color with diff corners
MLP: Layer 1, Cross Patch Optim
Much more coherent looking textures:
Mostly Solid Color
Oriented texture
(lots of these)
Dotted Textures
~Hatch
MLP: Layer 2, Cross Patch Optim
Even more complex features and textures:
Emergence of faces or other pattern breakers
Oriented lines/textures
Other textures
MLP: Layer 3, Cross Patch Optim
Far more complex features:
Dogs??
Scale textures?
Other textures
Neuron 1, 7.5 diversity: Chefs and other people?
Neuron 134, 7.5 diversity: Eye + people from behind?
Web texture
Next Steps
Understanding DeiT Better:
Cross Architecture Comparisons:
Thank You 🙂