1 of 19

Experiments with Mechanistic Interpretability

Shreyans Jain

2 of 19

3 of 19

Source: https://transformer-circuits.pub/2024/july-update/index.html#pivot-tables

4 of 19

Feature Understanding at the dawn of Interp

Source: https://playground.tensorflow.org/

5 of 19

Feature Understanding (at least for now!!)

Source: https://transformer-circuits.pub/2022/toy_model/index.html#motivation

6 of 19

Polysemanticity and Superposition

  • Polysemanticity: A neuron can activate on multiple seemingly unrelated topics, e.g. the same neuron firing on both basketball and dogs.

  • Superposition: A model can represent more than n features in an n-dimensional activation space (n = number of neurons in the hidden layer); a toy-model sketch follows below.
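A minimal sketch of the idea, loosely following the Toy Models of Superposition setup: 5 sparse features are squeezed through a 2-dimensional hidden space and then reconstructed. The sizes, sparsity, and hyperparameters below are illustrative assumptions, not values from the source.

# Toy model of superposition: more features than hidden dimensions (illustrative).
import torch
import torch.nn as nn

n_features, n_hidden = 5, 2          # 5 features forced into a 2-d activation space
feat_prob = 0.1                      # each feature is active only 10% of the time

W = nn.Parameter(torch.randn(n_hidden, n_features) * 0.1)
b = nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(10_000):
    # Sparse synthetic data: each feature fires independently with prob feat_prob.
    mask = (torch.rand(1024, n_features) < feat_prob).float()
    x = torch.rand(1024, n_features) * mask
    h = x @ W.T                      # project into the 2-d bottleneck
    x_hat = torch.relu(h @ W + b)    # reconstruct all 5 features from it
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Each column of W is the direction assigned to one feature; with sparse inputs,
# more than 2 columns typically retain substantial norm, i.e. the model stores
# more features than it has dimensions (superposition).
print(W.norm(dim=0))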

7 of 19

8 of 19

So… how do we interpret, then?

9 of 19

Enter… Sparse Autoencoders!

Source: https://transformer-circuits.pub/2022/toy_model/index.html#motivation

10 of 19

SAE Setup
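A minimal sparse-autoencoder sketch to make the setup concrete: an overcomplete ReLU encoder over model activations, trained to reconstruct them under an L1 sparsity penalty. The dimensions, penalty coefficient, and names (d_model, d_hidden, acts) are illustrative assumptions, not the exact configuration from the referenced work.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)   # overcomplete: d_hidden >> d_model
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))               # sparse feature activations
        x_hat = self.dec(f)                       # reconstruction of the input activations
        return x_hat, f

sae = SparseAutoencoder(d_model=512, d_hidden=8192)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3                                   # strength of the sparsity penalty

def train_step(acts):
    # acts: a batch of model activations, shape (batch, d_model)
    x_hat, f = sae(acts)
    loss = ((acts - x_hat) ** 2).mean() + l1_coeff * f.abs().sum(dim=-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

The rows of the learned decoder then serve as candidate, more monosemantic, feature directions.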

11 of 19

Effect of Non-uniform Sparsity on Superposition in Toy Models

Setup:

  • Case 1: feature probability varies within the instance (higher-importance features have lower sparsity, i.e. appear more often)
  • Case 2: feature probability varies within the instance (higher-importance features have higher sparsity, i.e. appear less often)
  • Case 3: feature probability varies within the instance (sparsity assigned at random, independent of feature importance)
  • Case 4: feature probability varies within the instance (sparsity assigned at random, independent of feature importance)
  • Case 5: constant feature importance; feature probability varies within the instance (sparsity assigned at random, independent of feature importance); a parameterisation sketch of all five cases follows below
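The sketch below assumes the toy-model convention of a per-feature importance weight and a per-feature activation probability (density = 1 - sparsity) and shows one way these cases could be wired up; the specific probabilities are placeholder assumptions, not the values used in the experiments.

import torch

n_features = 5
importance = 0.9 ** torch.arange(n_features).float()   # feature 0 is most important

# Per-feature activation probability for each case (illustrative numbers only).
cases = {
    "case1_important_features_dense":  torch.tensor([0.30, 0.20, 0.10, 0.05, 0.01]),
    "case2_important_features_sparse": torch.tensor([0.01, 0.05, 0.10, 0.20, 0.30]),
    "case3_random_density":            torch.rand(n_features) * 0.3,
    "case4_random_density":            torch.rand(n_features) * 0.3,
    # case 5: a random density vector as above, with importance = torch.ones(n_features)
}

def sample_batch(batch_size, feat_prob):
    # Each feature fires independently with its own probability.
    mask = (torch.rand(batch_size, n_features) < feat_prob).float()
    return torch.rand(batch_size, n_features) * mask

def weighted_mse(x, x_hat):
    # Importance-weighted reconstruction loss, as in the toy-model setup.
    return (importance * (x - x_hat) ** 2).mean()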

12 of 19

Effect of Non-uniform Sparsity on Superposition in Toy Models

Key Observations:

  • In no scenario are all 5 features ever represented; the representation maxes out at 4 features, and quite often at only 3 (a rough counting sketch follows below).
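One rough way to quantify this observation, assuming a trained toy-model weight matrix W of shape (n_hidden, n_features) as in the earlier sketch: count the features whose columns of W retain substantial norm. The threshold is an arbitrary illustrative choice.

import torch

def count_represented(W: torch.Tensor, threshold: float = 0.5) -> int:
    # A feature counts as "represented" if its column of W keeps substantial norm.
    col_norms = W.norm(dim=0)          # one norm per feature
    return int((col_norms > threshold).sum().item())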

13 of 19

Effect of Non-uniform Sparsity on Superposition in Toy Models

Key Observations:

  • When the highest-importance feature is given the least density, every feature ends up in superposition with other features; at no point is any feature represented independently, irrespective of its importance (an interference-checking sketch follows below).
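One way to check this under the same toy-model assumptions: look at the off-diagonal entries of W^T W. A feature is represented independently, in its own dedicated direction, only if it has near-zero interference with every other feature.

import torch

def interference(W: torch.Tensor) -> torch.Tensor:
    # Off-diagonal entries of W^T W: how much each pair of features shares directions.
    gram = W.T @ W
    return gram - torch.diag(torch.diag(gram))

def independently_represented(W: torch.Tensor, tol: float = 1e-2) -> torch.Tensor:
    # Boolean per feature: True if the feature interferes with no other feature.
    return interference(W).abs().max(dim=0).values < tol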

14 of 19

15 of 19

Effect of Non-uniform Sparsity on Superposition in Toy Models

Key Observations:

  • Even with feature importance held constant, almost every feature is still in superposition with other features, and at no point is any feature represented independently.

16 of 19

17 of 19

What's next?

  • ELI5 blog series on Mech Interp
  • Mechanistic analysis of why CoT works: how adding just "Let's think step by step" changes the answer so drastically.

18 of 19

Contact:

If you would like to collaborate on an interpretability or RL-related project, you can reach me at:

Discord: unnamed8802

Mail: jshrey8@gmail.com

Twitter: py_parrot

19 of 19

Thank you..!!!

Questions, anyone?