1 of 27

Mechanistic Interpretability of Vision-Language-Action Models

Bear Häon, Amil Dravid, Shoumik Roychowdhury

2 of 27

Neural networks are everywhere… but can we trust them?

3 of 27

What is Mechanistic Interpretability?

[Diagram: a network maps input x to output y ("Giraffe"); training updates it via backprop.]

4 of 27

What is Mechanistic Interpretability?

[Diagram: the same input x → output y ("Giraffe") mapping; the network interior is still a black box.]

5 of 27

What is Mechanistic Interpretability?

[Diagram: input x → output y ("Giraffe")]

Goal: Identify internal “mechanisms” of the network

6 of 27

Transformer Architecture (simplified)

[Diagram: the input sentence split into tokens: I · really · like · going · to]

7 of 27

[Diagram: the tokens "I really like going to" pass through Feed-Forward + Nonlinearity.]

12 of 27

[Diagram: the tokens pass through Feed-Forward + Nonlinearity, then a Project step back to the model dimension.]

13 of 27

[Diagram: each of the tokens "I really like going to" is updated independently by Feed-Forward + Nonlinearity, giving a new representation per token.]
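The feed-forward step above can be sketched as a small per-token MLP. The sizes and weight names below are illustrative assumptions, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Five token embeddings ("I really like going to"), each of dimension 8.
tokens = rng.normal(size=(5, 8))

# Feed-forward weights: expand, apply a nonlinearity, project back down.
W1 = rng.normal(size=(8, 32))
W2 = rng.normal(size=(32, 8))

def feed_forward(x):
    """Apply the same MLP to every token independently."""
    hidden = np.maximum(x @ W1, 0.0)  # ReLU nonlinearity
    return hidden @ W2                # project back to the model dimension

out = feed_forward(tokens)
print(out.shape)  # (5, 8): one updated vector per token
```

Because the same weights act on each row separately, every token is transformed independently of the others, exactly as the diagram suggests.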

14 of 27

Attention Operation

[Diagram: every token of "I really like going to" attends to every other token.]

15 of 27

Attention Operation

[Diagram: a single attention weight, 0.3, between one pair of tokens.]

16 of 27

Attention Operation

[Diagram: attention weights between token pairs, e.g. 0.3 and 0.1.]

17 of 27

Attention Operation

[Diagram: one token's attention weights over all five tokens: 0.3, 0.1, 0.2, 0.3, 0.1 (summing to 1).]

18 of 27

Attention Operation

[Diagram: the full 5×5 matrix of attention weights between the tokens, one row per token, e.g.
0.3 0.1 0.2 0.3 0.1
0.3 0.2 0.3 0.2 0.0
0.3 0.1 0.1 0.4 0.1
0.4 0.0 0.3 0.1 0.3
0.2 0.0 0.3 0.5 0.0]
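A weight matrix like the one on the slide comes from scaled dot-product attention: queries and keys score every token pair, and a softmax turns each row into weights that sum to 1. Projection sizes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))  # "I really like going to"

# Learned projections to queries, keys, and values (illustrative sizes).
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv

# Scaled dot-product attention: each row is one token's weights
# over all tokens, like the 5x5 matrix on the slide.
weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
out = weights @ V

print(weights.shape)        # (5, 5)
print(weights.sum(axis=1))  # each row sums to 1
```

Each output row is then a weighted mix of the value vectors, so every token's new representation blends information from the whole sequence.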

19 of 27

Repeat for Each Subspace = An “Attention Head”

[Diagram: several attention-weight matrices over the same tokens, one per subspace; each subspace is an attention head.]
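Repeating the operation per subspace can be sketched as follows; the head count and dimensions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))  # 5 tokens, model dimension 8
n_heads, d_head = 2, 4            # split dimension 8 into 2 subspaces of 4

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for h in range(n_heads):
    # Each head gets its own projections into a smaller subspace.
    Wq, Wk, Wv = (rng.normal(size=(8, d_head)) for _ in range(3))
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    w = softmax(Q @ K.T / np.sqrt(d_head))  # one 5x5 weight matrix per head
    heads.append(w @ V)

# Concatenate the head outputs back to the model dimension.
out = np.concatenate(heads, axis=-1)
print(out.shape)  # (5, 8)
```

Each head computes its own 5×5 weight matrix, so different heads can attend to different token relationships in parallel.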

20 of 27

[Diagram: a Transformer block: several Attention heads followed by Feed-Forward + Nonlinearity over the tokens "I really like going to".]

21 of 27

[Diagram: several Attention heads over the tokens.]

22 of 27

[Diagram: several Attention heads over the tokens.]

Repeat the Feed-Forward and Attention operations, layer after layer.
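Stacking layers can be sketched by alternating the two operations in a loop. The residual connections and small weight scale are standard practice, assumed here rather than shown on the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))  # token representations

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    return softmax(Q @ K.T / np.sqrt(x.shape[-1])) @ V

def feed_forward(x, W1, W2):
    return np.maximum(x @ W1, 0.0) @ W2

n_layers = 3
for _ in range(n_layers):
    # Fresh weights per layer; residuals keep activations stable.
    Wq, Wk, Wv = (rng.normal(size=(8, 8)) * 0.1 for _ in range(3))
    W1, W2 = rng.normal(size=(8, 32)) * 0.1, rng.normal(size=(32, 8)) * 0.1
    x = x + attention(x, Wq, Wk, Wv)
    x = x + feed_forward(x, W1, W2)

print(x.shape)  # (5, 8): shape is preserved through every layer
```

Because each block maps (5, 8) to (5, 8), blocks can be stacked to any depth.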

23 of 27

Either Classification or Next Token Prediction

[Diagram: the final token representations of "I really like going to" pass through a Feed-Forward head that outputs Class: "Happy".]

24 of 27

Either Classification or Next Token Prediction

[Diagram: the same representations pass through a Feed-Forward head that predicts the next token: "EE229".]
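Next-token prediction can be sketched as projecting the last token's representation onto a vocabulary and taking a softmax. The vocabulary list and weight names are hypothetical (with "EE229" taken from the slide):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["I", "really", "like", "going", "to", "EE229", "sleep"]
x = rng.normal(size=(5, 8))  # final representations of the 5 input tokens

# A feed-forward "unembedding" maps the last token's vector to vocab logits.
W_out = rng.normal(size=(8, len(vocab)))
logits = x[-1] @ W_out

# Softmax turns logits into a probability distribution over the vocabulary.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(vocab[int(np.argmax(probs))])  # the predicted next token
```

Classification works the same way, except the output dimension indexes classes (e.g. "Happy") instead of vocabulary tokens.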

25 of 27

What about Images?

26 of 27

Patchify!
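Patchifying can be sketched as a pair of reshapes that cut the image into a grid of patches and flatten each patch into a vector "token" (ViT-style); the image and patch sizes below are illustrative:

```python
import numpy as np

# A toy 32x32 RGB image, split into 8x8 patches.
image = np.zeros((32, 32, 3))
P = 8

# Reshape into a grid of patches, then flatten each patch into a vector.
patches = image.reshape(32 // P, P, 32 // P, P, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * 3)

print(patches.shape)  # (16, 192): 16 patch "tokens", each of dimension 192
```

Each flattened patch is then embedded and fed to the same Transformer machinery as a word token.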

27 of 27

Similar Transformer Idea