1 of 27

Mechanistic Interpretability of Vision-Language-Action Models

Bear Häon, Amil Dravid, Shoumik Roychowdhury

2 of 27

Neural networks are everywhere… but can we trust them?

3 of 27

What is Mechanistic Interpretability?

[Diagram: a network maps input x to output y ("Giraffe"); training updates it via backprop.]

4 of 27

What is Mechanistic Interpretability?

[Diagram: the same input x → output y ("Giraffe") mapping; the network interior is still a black box.]

5 of 27

What is Mechanistic Interpretability?

[Diagram: input x → output y ("Giraffe")]

Goal: Identify internal “mechanisms” of the network

6 of 27

Transformer Architecture (simplified)

[Diagram: the input sentence split into tokens: I · really · like · going · to]

7 of 27

[Diagram: the tokens "I really like going to" pass through Feed-Forward + Nonlinearity.]

12 of 27

[Diagram: the tokens pass through Feed-Forward + Nonlinearity, then a Project step back to the model dimension.]

13 of 27

[Diagram: each of the tokens "I really like going to" is updated independently by Feed-Forward + Nonlinearity, giving a new representation per token.]
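The feed-forward step above can be sketched as a small per-token MLP. The sizes and weight names below are illustrative assumptions, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Five token embeddings ("I really like going to"), each of dimension 8.
tokens = rng.normal(size=(5, 8))

# Feed-forward weights: expand, apply a nonlinearity, project back down.
W1 = rng.normal(size=(8, 32))
W2 = rng.normal(size=(32, 8))

def feed_forward(x):
    """Apply the same MLP to every token independently."""
    hidden = np.maximum(x @ W1, 0.0)  # ReLU nonlinearity
    return hidden @ W2                # project back to the model dimension

out = feed_forward(tokens)
print(out.shape)  # (5, 8): one updated vector per token
```

Because the same weights act on each row separately, every token is transformed independently of the others, exactly as the diagram suggests.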

14 of 27

Attention Operation

[Diagram: every token of "I really like going to" attends to every other token.]

15 of 27

Attention Operation

[Diagram: a single attention weight, 0.3, between one pair of tokens.]

16 of 27

Attention Operation

[Diagram: attention weights between token pairs, e.g. 0.3 and 0.1.]

17 of 27

Attention Operation

[Diagram: one token's attention weights over all five tokens: 0.3, 0.1, 0.2, 0.3, 0.1 (summing to 1).]

18 of 27

Attention Operation

[Diagram: the full 5×5 matrix of attention weights between the tokens, one row per token, e.g.
0.3 0.1 0.2 0.3 0.1
0.3 0.2 0.3 0.2 0.0
0.3 0.1 0.1 0.4 0.1
0.4 0.0 0.3 0.1 0.3
0.2 0.0 0.3 0.5 0.0]
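A weight matrix like the one on the slide comes from scaled dot-product attention: queries and keys score every token pair, and a softmax turns each row into weights that sum to 1. Projection sizes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))  # "I really like going to"

# Learned projections to queries, keys, and values (illustrative sizes).
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv

# Scaled dot-product attention: each row is one token's weights
# over all tokens, like the 5x5 matrix on the slide.
weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
out = weights @ V

print(weights.shape)        # (5, 5)
print(weights.sum(axis=1))  # each row sums to 1
```

Each output row is then a weighted mix of the value vectors, so every token's new representation blends information from the whole sequence.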

19 of 27

Repeat for Each Subspace = An “Attention Head”

[Diagram: several attention-weight matrices over the same tokens, one per subspace; each subspace is an attention head.]
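Repeating the operation per subspace can be sketched as follows; the head count and dimensions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))  # 5 tokens, model dimension 8
n_heads, d_head = 2, 4            # split dimension 8 into 2 subspaces of 4

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for h in range(n_heads):
    # Each head gets its own projections into a smaller subspace.
    Wq, Wk, Wv = (rng.normal(size=(8, d_head)) for _ in range(3))
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    w = softmax(Q @ K.T / np.sqrt(d_head))  # one 5x5 weight matrix per head
    heads.append(w @ V)

# Concatenate the head outputs back to the model dimension.
out = np.concatenate(heads, axis=-1)
print(out.shape)  # (5, 8)
```

Each head computes its own 5×5 weight matrix, so different heads can attend to different token relationships in parallel.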

20 of 27

[Diagram: a Transformer block: several Attention heads followed by Feed-Forward + Nonlinearity over the tokens "I really like going to".]

21 of 27

[Diagram: several Attention heads over the tokens.]

22 of 27

[Diagram: several Attention heads over the tokens.]

Repeat the Feed-Forward and Attention operations, layer after layer.
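Stacking layers can be sketched by alternating the two operations in a loop. The residual connections and small weight scale are standard practice, assumed here rather than shown on the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))  # token representations

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    return softmax(Q @ K.T / np.sqrt(x.shape[-1])) @ V

def feed_forward(x, W1, W2):
    return np.maximum(x @ W1, 0.0) @ W2

n_layers = 3
for _ in range(n_layers):
    # Fresh weights per layer; residuals keep activations stable.
    Wq, Wk, Wv = (rng.normal(size=(8, 8)) * 0.1 for _ in range(3))
    W1, W2 = rng.normal(size=(8, 32)) * 0.1, rng.normal(size=(32, 8)) * 0.1
    x = x + attention(x, Wq, Wk, Wv)
    x = x + feed_forward(x, W1, W2)

print(x.shape)  # (5, 8): shape is preserved through every layer
```

Because each block maps (5, 8) to (5, 8), blocks can be stacked to any depth.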

23 of 27

Either Classification or Next Token Prediction

[Diagram: the final token representations of "I really like going to" pass through a Feed-Forward head that outputs Class: "Happy".]

24 of 27

Either Classification or Next Token Prediction

[Diagram: the same representations pass through a Feed-Forward head that predicts the next token: "EE229".]
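Next-token prediction can be sketched as projecting the last token's representation onto a vocabulary and taking a softmax. The vocabulary list and weight names are hypothetical (with "EE229" taken from the slide):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["I", "really", "like", "going", "to", "EE229", "sleep"]
x = rng.normal(size=(5, 8))  # final representations of the 5 input tokens

# A feed-forward "unembedding" maps the last token's vector to vocab logits.
W_out = rng.normal(size=(8, len(vocab)))
logits = x[-1] @ W_out

# Softmax turns logits into a probability distribution over the vocabulary.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(vocab[int(np.argmax(probs))])  # the predicted next token
```

Classification works the same way, except the output dimension indexes classes (e.g. "Happy") instead of vocabulary tokens.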

25 of 27

What about Images?

26 of 27

Patchify!
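Patchifying can be sketched as a pair of reshapes that cut the image into a grid of patches and flatten each patch into a vector "token" (ViT-style); the image and patch sizes below are illustrative:

```python
import numpy as np

# A toy 32x32 RGB image, split into 8x8 patches.
image = np.zeros((32, 32, 3))
P = 8

# Reshape into a grid of patches, then flatten each patch into a vector.
patches = image.reshape(32 // P, P, 32 // P, P, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * 3)

print(patches.shape)  # (16, 192): 16 patch "tokens", each of dimension 192
```

Each flattened patch is then embedded and fed to the same Transformer machinery as a word token.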

27 of 27

Similar Transformer Idea