Mechanistic Interpretability of Vision Language Action Models
Bear Häon, Amil Dravid, Shoumik Roychowdhury
Neural networks are everywhere…but can we trust them?
What is Mechanistic Interpretability?
Input x
“Giraffe”
Output y
Backprop
Still a Black Box
Goal: Identify internal “mechanisms” of the network
Transformer Architecture (simplified)
Tokens: “I really like going to”
Feed-Forward + Nonlinearity
Project
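The feed-forward step above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the sizes `d_model` and `d_hidden`, the random weights standing in for learned ones, and the choice of ReLU as the nonlinearity are not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

tokens = ["I", "really", "like", "going", "to"]
d_model, d_hidden = 8, 32  # hypothetical sizes, chosen only for illustration

# One embedding vector per token (random stand-ins for learned embeddings).
x = rng.normal(size=(len(tokens), d_model))

# Feed-forward weights, shared across all token positions.
W1 = rng.normal(size=(d_model, d_hidden)) * 0.1
b1 = np.zeros(d_hidden)
W2 = rng.normal(size=(d_hidden, d_model)) * 0.1
b2 = np.zeros(d_model)


def feed_forward(x):
    """Apply linear -> ReLU -> linear to each token independently."""
    hidden = np.maximum(0.0, x @ W1 + b1)  # nonlinearity (ReLU is an assumption)
    return hidden @ W2 + b2


out = feed_forward(x)
print(out.shape)  # (5, 8)
```

Because the same weights are applied to every row, running the block on one token alone gives the same result as running it on the whole sequence. Tokens only exchange information in the attention operation.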
Attention Operation
[Figure: 5×5 grid of attention weights, with the tokens “I really like going to” on both axes, filled in one entry at a time]
0.3 0.1 0.2 0.3 0.1
0.3 0.2 0.3 0.2 0.0
0.3 0.1 0.1 0.4 0.1
0.4 0.0 0.3 0.1 0.3
0.2 0.0 0.3 0.5 0.0
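A grid of weights like the one above could be produced by a single attention operation along these lines. This is a minimal sketch, assuming scaled dot-product attention with no causal mask; the projection matrices `W_q`, `W_k`, `W_v`, the sizes, and the random inputs are hypothetical stand-ins, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

tokens = ["I", "really", "like", "going", "to"]
d_model, d_head = 8, 4  # hypothetical sizes

x = rng.normal(size=(len(tokens), d_model))  # stand-in token vectors

# Projections into the query, key, and value subspaces (random stand-ins).
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention(x):
    """Return the attended values and the token-by-token weight grid."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(d_head)   # similarity of every token pair
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v, weights


out, weights = attention(x)
print(weights.shape)  # (5, 5)
```

Row `i` of `weights` says how much token `i` draws from every other token, which is the quantity the grid on the slide visualizes.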
Repeat for Each Subspace = An “Attention Head”
[Figure: several stacked attention-weight grids over the tokens “I really like going to”, one grid per head]
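Repeating the attention operation once per subspace might look like the following sketch. It assumes each head has its own query, key, and value projections and that head outputs are concatenated and mixed by an output projection `W_o`; the head count, sizes, and random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_tokens, d_model, n_heads = 5, 8, 2  # hypothetical sizes
d_head = d_model // n_heads

x = rng.normal(size=(n_tokens, d_model))  # stand-in token vectors

# One (W_q, W_k, W_v) triple per head: each head works in its own subspace.
heads = [
    tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
    for _ in range(n_heads)
]
W_o = rng.normal(size=(n_heads * d_head, d_model))  # mixes the heads back together


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x):
    outputs, patterns = [], []
    for W_q, W_k, W_v in heads:
        q, k, v = x @ W_q, x @ W_k, x @ W_v
        weights = softmax(q @ k.T / np.sqrt(d_head))
        outputs.append(weights @ v)
        patterns.append(weights)
    concat = np.concatenate(outputs, axis=-1)  # (n_tokens, n_heads * d_head)
    return concat @ W_o, np.stack(patterns)


out, patterns = multi_head_attention(x)
print(out.shape, patterns.shape)  # (5, 8) (2, 5, 5)
```

Each slice `patterns[h]` is one head's weight grid, so different heads can attend to different token relationships over the same input.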
Repeat Feed Forward and Attention Operations
[Figure: the tokens “I really like going to” pass through alternating Feed-Forward + Nonlinearity and Attention layers]
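The repetition of attention and feed-forward operations can be sketched as a simple loop over layers. This is an illustration only: the residual additions (`x = x + ...`) are an assumption based on common transformer practice rather than something shown on the slides, layer normalization is left out for brevity, and all sizes and weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

n_tokens, d_model, d_hidden, n_layers = 5, 8, 32, 3  # hypothetical sizes


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def make_layer():
    """Random stand-ins for one layer's learned weights."""
    return {
        "W_q": rng.normal(size=(d_model, d_model)) * 0.1,
        "W_k": rng.normal(size=(d_model, d_model)) * 0.1,
        "W_v": rng.normal(size=(d_model, d_model)) * 0.1,
        "W1": rng.normal(size=(d_model, d_hidden)) * 0.1,
        "W2": rng.normal(size=(d_hidden, d_model)) * 0.1,
    }


def attention(x, p):
    q, k, v = x @ p["W_q"], x @ p["W_k"], x @ p["W_v"]
    weights = softmax(q @ k.T / np.sqrt(d_model))
    return weights @ v


def feed_forward(x, p):
    return np.maximum(0.0, x @ p["W1"]) @ p["W2"]


def transformer(x, layers):
    for p in layers:
        x = x + attention(x, p)     # tokens exchange information
        x = x + feed_forward(x, p)  # each token is processed on its own
    return x


layers = [make_layer() for _ in range(n_layers)]
x = rng.normal(size=(n_tokens, d_model))
out = transformer(x, layers)
print(out.shape)  # (5, 8)
```

The output keeps one vector per token, so the same pair of operations can be stacked as many times as desired.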
Either Classification or Next Token Prediction
Classification: “I really like going to” → Feed Forward → Class: “Happy”
Next token prediction: “I really like going to” → Feed Forward → “EE229”
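The two output options differ only in what the final layer maps to. The sketch below assumes a linear output layer in both cases, mean pooling for classification, and the last token's state for next-token prediction; the label set, the toy vocabulary, and the random weights are hypothetical, so the printed predictions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

tokens = ["I", "really", "like", "going", "to"]
d_model = 8  # hypothetical size

# Stand-in for the final-layer vectors, one per token.
hidden = rng.normal(size=(len(tokens), d_model))

classes = ["Happy", "Sad"]                               # hypothetical label set
vocab = ["I", "really", "like", "going", "to", "EE229"]  # hypothetical toy vocabulary

W_cls = rng.normal(size=(d_model, len(classes)))
W_vocab = rng.normal(size=(d_model, len(vocab)))


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def classify(hidden):
    """One label for the whole sequence (mean pooling is an assumption)."""
    pooled = hidden.mean(axis=0)
    return softmax(pooled @ W_cls)


def next_token_distribution(hidden):
    """A distribution over the vocabulary, read from the last token's state."""
    return softmax(hidden[-1] @ W_vocab)


class_probs = classify(hidden)
token_probs = next_token_distribution(hidden)
print(classes[int(class_probs.argmax())])
print(vocab[int(token_probs.argmax())])
```

With trained weights, the first call would ideally favor “Happy” and the second “EE229” for this input.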
What about Images?
Patchify!
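Patchifying can be sketched as cutting the image into a grid of non-overlapping squares and flattening each one into a vector, so that every patch plays the role of a token. The image size, patch size, and function name `patchify` below are illustrative assumptions.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    # (gh, patch, gw, patch, C) -> (gh, gw, patch, patch, C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch, in row-major order over the patch grid.
    return patches.reshape(gh * gw, patch_size * patch_size * c)


image = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
patches = patchify(image, patch_size=8)
print(patches.shape)  # (16, 192)
```

Each of the 16 rows can then be linearly projected to the model dimension and treated like a token embedding.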
Similar Transformer Idea