1 of 45

Further Mechanistic Interpretability of Vision Models

Berkan Ottlik

2 of 45

“We would like to see research building towards the ability to "reverse engineer" trained neural networks into human-understandable algorithms, enabling auditors to catch unanticipated safety problems in these models.” - Chris Olah

Methods should…

  • Map neural network parameters to human understandable algorithms.
  • Rigorous understanding of narrow aspects of models > less rigorous understanding of entire models.
  • Discover unknown and unanticipated algorithms. Not based on a priori assumptions.
  • Apply to popular models.
  • Have a path to giving a full mechanistic understanding of models.
  • Reasonably scale to complete understanding with enough human effort (not scaling exponentially w.r.t. model size).

3 of 45

My Goals

  1. Adapting/creating mechanistic interpretability tools for ViTs.
  2. Doing apples-to-apples comparisons of CNNs and ViTs.
    • Comparing features, advantages of models, and more

Trying to improve naive assumptions and intuitions in explainability work!

4 of 45

Prior Research: Vision Transformer (ViT)

Split input into 16x16 patches (+ a class patch) and feed them into a regular transformer
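As a rough sketch of that patchify step (hedged: variable names are made up, and the real DeiT code uses an equivalent 16x16 stride-16 convolution rather than an explicit unfold):

```python
import torch

# Hypothetical sketch: turn images into a sequence of patch tokens plus a class token,
# as in ViT/DeiT-S/16. Real implementations use a 16x16 stride-16 conv, which is equivalent.
B, C, H, W, P, D = 2, 3, 224, 224, 16, 384            # batch, channels, image size, patch size, embed dim

images = torch.randn(B, C, H, W)
proj = torch.nn.Linear(C * P * P, D)                  # per-patch linear embedding
cls_token = torch.nn.Parameter(torch.zeros(1, 1, D))  # learnable class token

patches = images.unfold(2, P, P).unfold(3, P, P)      # (B, C, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, (H // P) * (W // P), C * P * P)

tokens = proj(patches)                                            # (B, 196, 384)
tokens = torch.cat([cls_token.expand(B, -1, -1), tokens], dim=1)  # prepend class token -> (B, 197, 384)
print(tokens.shape)
```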

5 of 45

Prior Research: Data Efficient Image Transformer (DeiT)

Same architecture as ViTs, except training procedure + distillation token.

The distillation token allows a teacher model to supervise the DeiT for better performance, but it is not used in my model (so for me, all that changes is the training procedure)!

6 of 45

Prior Research: MLP-Mixer

Basically a ViT, except it replaces attention with a patch-mixing MLP.
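A minimal sketch of the idea (hedged: dimensions and names are placeholders, and this is only the token-mixing half of a Mixer block, not the full architecture):

```python
import torch
import torch.nn as nn

# Token mixing: an MLP applied across the patch (token) axis replaces attention.
num_patches, dim, hidden = 196, 384, 392

token_mix = nn.Sequential(
    nn.Linear(num_patches, hidden),
    nn.GELU(),
    nn.Linear(hidden, num_patches),
)
norm = nn.LayerNorm(dim)

x = torch.randn(8, num_patches, dim)                        # (batch, patches, channels)
y = x + token_mix(norm(x).transpose(1, 2)).transpose(1, 2)  # mix information across patches
print(y.shape)                                              # (8, 196, 384)
```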

7 of 45

Prior Research

Circuits Thread:

Reverse engineered InceptionNetV1

Three Speculative Claims about Neural Networks:

  1. Features: Features are the fundamental unit of neural networks. They correspond to linear combinations of neurons in a layer. These features can be rigorously studied and understood.
  2. Circuits: Features are connected by weights, forming circuits. These circuits can also be rigorously studied and understood.
  3. Universality: Analogous features and circuits form across models and tasks.

8 of 45

Prior Research

Transformer Interpretability:

Aims to apply things from the circuits thread to transformer language models.

Discovered induction heads, an algorithm implemented by some attention heads that enables in-context learning.

Built up lots of theory about how to understand attention heads and transformers.

Residual Stream Terminology!!

9 of 45

Prior Research

ViTs vs CNNs - macro level

ViTs have more uniform representations, with greater similarity between lower and higher layers

ViT incorporates more global information than ResNet at lower layers, leading to quantitatively different features.

10 of 45

Adversarial training improves the interpretability of local feature visualizations!

Intuition → adversarial training adds more relevant priors into the model that help it learn more human-interpretable features.

11 of 45

Shows that doing fair, apples-to-apples comparisons (model size, training procedure, evaluation process) of CNNs and DeiTs yields better insights, namely that the models are more similar than previously believed (although DeiTs perform better OOD)

Possible problems:

  • These models are a lot smaller than the SOTA models (20-25M params vs. ~400M params)

Models to compare (all have ~76% accuracy on ImageNet 1k):

  • ResNet50 with GeLU
  • DeiT-S/16
  • MLP-Mixer-“M”/16
    • I came up with the “medium” size; it has about 22.5M params but hasn’t been trained yet

12 of 45

Current Progress on DeiT-S/16

13 of 45

Patch Embeddings

14 of 45

Patch Embeddings

I analyzed and clustered all 384 patch embeddings of DeiT-S/16

Categories:

  • Regular (311)
  • Regular more black (36)
  • Almost a color (19)
  • Solid color (14)
  • Very dark (4)

(Very dark might not matter because there is a layernorm right after)
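A hedged sketch of one way to extract and roughly cluster these filters (not necessarily my exact pipeline), assuming the timm DeiT-S checkpoint, where the patch embedding is a 16x16 stride-16 convolution at model.patch_embed.proj:

```python
import timm
from sklearn.cluster import KMeans

# Pull out the 384 patch-embedding filters of DeiT-S/16 and cluster them (k=5 chosen to
# mirror the five manual categories above; automatic clusters won't match them exactly).
model = timm.create_model("deit_small_patch16_224", pretrained=True)
filters = model.patch_embed.proj.weight.detach()      # (384, 3, 16, 16)

flat = filters.reshape(filters.shape[0], -1).numpy()  # one row per embedding filter
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(flat)

for k in range(5):
    print(f"cluster {k}: {(labels == k).sum()} filters")
```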

[Figure: example embedding filters for each category: Regular, Regular more black, Almost a color, Solid color, Very dark]

15 of 45

InceptionNetV1 (CNN) Patch Embeddings

Mostly interpretable, with lots of color contrast or Gabor filters!

16 of 45

Privileged/Non-Privileged Basis

There is sometimes an incentive for a feature to be aligned with basis dimensions. Representing a feature in a different way would introduce a lot of noise into the signal.

A neuron → an activation with a privileged basis

(Note: this has not yet been rigorously defined, the best explanations are this YouTube video and the SoLU paper)

17 of 45

LayerNorm

18 of 45

LayerNorm

Really annoying to deal with because mu and sigma aren’t constant, but without it model performance is terrible!

Normalizes across patches (rows); the weight and bias are applied column-wise.

Redwood Research has been trying to model LayerNorm with a linear transformation (or something similar) for interpretability → I will try this too!
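As a hedged sketch of what such a replacement could look like (this is an assumption about the approach, not Redwood’s or my final method): freeze the normalization statistics at values estimated on calibration data, so LayerNorm becomes an affine map that is easy to fold into surrounding weights.

```python
import torch
import torch.nn as nn

# Crude linearization of LayerNorm: replace the per-input mu/sigma with constants
# estimated from calibration data, keeping the learned column-wise weight and bias.
class FrozenLayerNorm(nn.Module):
    def __init__(self, ln: nn.LayerNorm, calib_x: torch.Tensor):
        super().__init__()
        with torch.no_grad():
            self.mu = calib_x.mean(dim=-1).mean()     # one scalar mean (a deliberately crude choice)
            self.sigma = calib_x.std(dim=-1).mean()   # one scalar std
        self.weight, self.bias = ln.weight, ln.bias

    def forward(self, x):
        # Affine in x, since mu and sigma no longer depend on the input.
        return (x - self.mu) / (self.sigma + 1e-6) * self.weight + self.bias

ln = nn.LayerNorm(384)
calib = torch.randn(64, 197, 384)
approx = FrozenLayerNorm(ln, calib)
print((approx(calib) - ln(calib)).abs().mean())       # rough check of the approximation error
```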

19 of 45

Multi Head Self Attention

20 of 45

Understanding Attention Heads (QK Side)

QK side → determines attention pattern

VO side → determines outputs given some attention pattern

Pre-softmax vector should have a privileged basis, so I’ve tried doing activation maximization of it!

Asks “what input image causes the ith attention head to favor attending to the jth key patch given the kth query patch?”

  • For the matrix on the right corresponding to the ith attention head, maximize the entry at the kth row and jth column (but without the softmax and HW_vo)
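A hedged sketch of that optimization (layer, head, and patch indices below are placeholder choices, the parameterization is naive pixels rather than my actual setup, and it assumes the timm DeiT-S model, whose fused qkv projection lives at blocks[l].attn.qkv):

```python
import torch
import timm

model = timm.create_model("deit_small_patch16_224", pretrained=True).eval()
for p in model.parameters():
    p.requires_grad_(False)                           # only the input image is optimized
layer, head, q_idx, k_idx = 2, 0, 21, 22              # hypothetical head and (query, key) patches

qkv_out = {}
model.blocks[layer].attn.qkv.register_forward_hook(
    lambda m, i, o: qkv_out.update(out=o)             # grab the fused qkv projection output
)

img = torch.randn(1, 3, 224, 224, requires_grad=True) # naive pixel parameterization
opt = torch.optim.Adam([img], lr=0.05)

for step in range(256):
    opt.zero_grad()
    model(img)
    B, N, _ = qkv_out["out"].shape
    nh = model.blocks[layer].attn.num_heads
    q, k, _ = qkv_out["out"].reshape(B, N, 3, nh, -1).permute(2, 0, 3, 1, 4)  # (B, heads, N, d_head)
    logits = q @ k.transpose(-2, -1)                  # pre-softmax attention scores
    (-logits[0, head, q_idx, k_idx]).backward()       # maximize one (query patch, key patch) entry
    opt.step()
```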

21 of 45

Understanding Attention Heads (QK Side)

Activation maximization of pre-softmax vector → weird results

Hard to tell where the query side and where the key side are coming from.

There is hope though!!

[Figure: activation maximization results for Layer 2 Head 0, Layer 3 Head 2, and Layer 4 Head 0]

22 of 45

General Trends

Patches don’t have to correspond to actual image patches, but they mostly do in early layers (because attention patterns are mostly local there).

In early layers, either a local patch or the class patch are attended to

  • Class patch seems to get high weight by default (so high that some attention heads only attend to the class patch for most inputs), and if a local feature is found the head might attend locally.
  • Even optimizing for a key in a different patch than the query often doesn’t exceed the class patch or a local patch (this suggests that looking only at local visualizations in early layers (< 3 of 12) might make sense)

23 of 45

Understanding Attention Heads

Looking at places where the query and key patch match up might reveal more interesting information though!

Layer 2, Head 0 →

Consistent feature when equal (or close) and when not equal?

Q patch = K patch

Q patch != K patch

Q = 20

K = 20

Q = 30

K = 30

Q = 10

K = 10

Q = 10

K = 30

Q = 10

K = 20

Q = 20

K = 10

Q = 20

K = 30

Q = 30

K = 10

Q = 30

K = 20

24 of 45

Understanding Attention Heads

The pattern of a consistent feature when Q and K are equal (or close) and a different one when they are not equal also applies to other heads in layer 2!

Layer 2, Head 2 →

25 of 45

Understanding Attention Heads

When Q=K, cross patch optim sometimes reveals the same pattern! (we expect more global similarity early in the model)

Layer 1, Head 2 →

Cross Patch Optim (when Q=K)

Q patch = K patch

26 of 45

Understanding Attention Heads

But we can’t always trust cross patch optim!

Layer 1, Head 1 →

Q patch = K patch

Cross Patch Optim (when Q=K)

27 of 45

Demo: Understanding layer 1 head 2 with dataset examples

http://localhost:8080/index.html

28 of 45

Multi Layer Perceptrons

29 of 45

Current Progress: MLP Feature Visualizations

Feature visualizations are created by activation maximization, optimizing the input image using gradient descent to highly activate one specific part of a model.

Naive optimization gives poor results.

Optimization is preconditioned in a Fourier basis (a rough sketch follows the list below).

Hyperparams are slightly different for ViTs than for CNNs:

  • Fewer transformations and more L2 regularization for earlier layers
  • Fewer transformations in general
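Here is the promised sketch (hedged: this follows the general lucid/lucent recipe rather than my exact transforms and hyperparameters; the layer/patch/neuron indices are placeholders, and it assumes the timm DeiT-S model with MLP hidden activations at blocks[l].mlp.fc1):

```python
import torch
import timm

model = timm.create_model("deit_small_patch16_224", pretrained=True).eval()
for p in model.parameters():
    p.requires_grad_(False)
layer, patch, neuron = 0, 7, 0                                 # placeholder target

acts = {}
model.blocks[layer].mlp.fc1.register_forward_hook(lambda m, i, o: acts.update(out=o))

# Parameterize the image by its (scaled) Fourier spectrum instead of raw pixels.
H = W = 224
fy = torch.fft.fftfreq(H)[:, None]
fx = torch.fft.rfftfreq(W)[None, :]
scale = 1.0 / torch.maximum((fy ** 2 + fx ** 2).sqrt(), torch.tensor(1.0 / max(H, W)))

spec = (torch.randn(3, H, W // 2 + 1, 2) * 0.01).requires_grad_(True)
opt = torch.optim.Adam([spec], lr=0.05)

for step in range(256):
    opt.zero_grad()
    img = torch.fft.irfft2(torch.view_as_complex(spec) * scale, s=(H, W))
    img = torch.sigmoid(img).unsqueeze(0)                      # keep pixels in [0, 1]
    model(img)
    loss = -acts["out"][0, patch + 1, neuron]                  # +1 skips the class token
    loss = loss + 1e-3 * img.pow(2).mean()                     # illustrative L2 regularization
    loss.backward()
    opt.step()
```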

30 of 45

Current Progress: MLPs

I’ve generated several thousand neuron-optimized images of the first 5 MLP layers and built a frontend UI to visualize them all.

There is far too much to actually analyze the entire network, but so far I have tackled the first 30 neurons from the first 23 patches of the 0th layer and the first 200 patches of layers 0, 1, and 2.

It only makes sense to visualize features with a privileged basis (such as the fc outputs in the MLP).

31 of 45

MLP: Patch 0, Layer 0

The 0th patch corresponds to the class patch.

  • General textures
  • Weird Neon Patterns
  • Often corners are white or black
  • Diversity gives similar patterns across neurons

32 of 45

Examples from InceptionNetV1

33 of 45

Examples from InceptionNetV1

34 of 45

MLP: Patch 1, Layer 0

The 1st patch corresponds to the top-left corner.

  • Small “ball” corresponding to patch and rest has general textures
  • Ball can be solid color
  • Ball can be texture
  • Diversity leads to vast changes in the ball
  • High or low frequency content in the ball?

Solid Color

Beehive/Hatch Texture?

Bright or dark with color?

35 of 45

MLP: Layer 0

In the 0th layer, neuron activations sometimes stay consistent across patches, sometimes not.

Backgrounds stay more consistent than “balls”

Corners sometimes different, namely dark or bright

[Figure: Neurons 1, 10, and 25 visualized for Patch 0 and Patch 1]

36 of 45

MLP: Layer 0, Cross Patch Optim

We actually care about neuron optim, but cross patch optim often (not always!) gives us a good summary!
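A tiny hedged sketch of the difference between the two objectives, using a stand-in activation tensor (the shape and index names are placeholders):

```python
import torch

acts = torch.randn(1, 197, 1536)        # e.g. an MLP fc1 output: class token + 196 patch tokens
patch, neuron = 7, 0

neuron_objective = acts[0, patch + 1, neuron]       # "neuron optim": one neuron at one specific patch
cross_patch_objective = acts[0, 1:, neuron].mean()  # "cross patch optim": same neuron averaged over all patches
```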

[Figure: patch optim vs. neuron optim for Neurons 1, 10, and 25]

37 of 45

MLP: Neuron 0, Layer 0

Analyzing a specific neuron across patches, the feature simply slides across the entire image.

Diversity for Patch 1 (I think this is just bad)

Optimization across all patches

Patches 0-7

Patches 19 and 23

38 of 45

MLP: Neuron 0, Layer 0

Naive tests seem to show that the neuron cares more about blue than about the brownish-yellow background pattern (more rigorous tests still need to be done; a rough sketch of this kind of probe follows the figure).

Testing layer 0, patch 7, neuron 0 activations:

[Figure: test images, including the optimization across all patches, with layer 0, patch 7, neuron 0 activations of 0.8999, -2.4609, -1.5889, 1.9746, and 2.4727]
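The promised sketch of such a naive probe (hedged: preprocessing is simplified and a real test should apply the model’s ImageNet normalization; it assumes the timm DeiT-S model, and the probe colors below are just illustrative):

```python
import torch
import timm

model = timm.create_model("deit_small_patch16_224", pretrained=True).eval()

acts = {}
model.blocks[0].mlp.fc1.register_forward_hook(lambda m, i, o: acts.update(out=o))

def neuron_activation(img: torch.Tensor, patch: int = 7, neuron: int = 0) -> float:
    """img: (1, 3, 224, 224); returns the layer 0 fc1 activation (+1 skips the class token)."""
    with torch.no_grad():
        model(img)
    return acts["out"][0, patch + 1, neuron].item()

# e.g. compare a solid blue probe against a solid brownish-yellow one
blue = torch.tensor([0.0, 0.0, 1.0]).view(1, 3, 1, 1).expand(1, 3, 224, 224)
beige = torch.tensor([0.8, 0.7, 0.4]).view(1, 3, 1, 1).expand(1, 3, 224, 224)
print(neuron_activation(blue), neuron_activation(beige))
```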

39 of 45

MLP: Layer 0, Cross Patch Optim

Strange features in general, diversity often introduces white or black noise.

Lots of textures.

Again, corners often white or black

Diversity doesn’t affect some neurons much, but messes up others

Textures/directed textures

Neon stuff?

Blue/beige Color Invariant ^? (no diversity)

No diversity →

40 of 45

MLP: Layer 0, Cross Patch Optim

For the neurons where there is more variance of the pattern across the image (e.g., the pattern in the bottom right differs from the top left), this might show that the attention heads do a bit more global mixing of features.

Mostly Solid Color

Weird Lines/contour?

Mostly solid color with diff corners

41 of 45

MLP: Layer 1, Cross Patch Optim

Much more coherent looking textures:

  • Textures
  • Dotted Patterns
  • Solid Colors
  • Mostly global

Mostly Solid Color

Oriented texture

(lots of these)

Dotted Textures

~Hatch

42 of 45

MLP: Layer 2, Cross Patch Optim

Even more complex features and textures:

  • Complex textures
  • Emergence of faces
  • Clear hatches
  • More primitive features too like oriented textures
  • Still mostly spatially consistent, with more interruptions though

Emergence of faces or other pattern breakers

Oriented lines/textures

Other textures

43 of 45

MLP: Layer 3, Cross Patch Optim

Far more complex features:

  • Clearer faces + many eyes
  • Diversity seems to work better
  • Different kinds of people?
  • Maybe basic text?

Dogs??

Scale textures?

Other textures

Neuron 1, 7.5 diversity: Chefs and other people?

Neuron 134, 7.5 diversity: eye + people from behind?

Web texture

44 of 45

Next Steps

Understanding DeiT Better:

  1. Get interpretability tools working better for DeiT
    1. Improve attention head results (for instance, given a query patch, optimize over the key patch)
    2. Survey MLPs better
    3. Test modeling LayerNorm
    4. Find circuits
    5. Make more quantitative evaluations (running on the ImageNet validation set, etc.)

Cross Architecture Comparisons:

  1. Train MLP-Mixer the same way as the other models
  2. Get interpretability tools to work for MLP-Mixers
  3. Use interpretability tools for ResNet
  4. Do quantitative and qualitative comparisons of all 3 models

45 of 45

Thank You 🙂