1 of 19

Experiments with Mechanistic Interpretability

Shreyans Jain

2 of 19

3 of 19

Source: https://transformer-circuits.pub/2024/july-update/index.html#pivot-tables

4 of 19

Feature Understanding at the dawn of Interp

Source: https://playground.tensorflow.org/

5 of 19

Feature Understanding (at least for now!!)

Source: https://transformer-circuits.pub/2022/toy_model/index.html#motivation

6 of 19

Polysemanticity and Superposition

  • Polysemanticity: A neuron can activate on multiple seemingly unrelated topics, e.g. the same neuron firing on both basketball and dogs.

  • Superposition: A model can represent more than n features in an n-dimensional activation space (n = number of neurons in the hidden layer); a toy-model sketch follows below.
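A minimal sketch of the idea, loosely following the Toy Models of Superposition setup: 5 sparse features are squeezed through a 2-dimensional hidden space and then reconstructed. The sizes, sparsity, and hyperparameters below are illustrative assumptions, not values from the source.

# Toy model of superposition: more features than hidden dimensions (illustrative).
import torch
import torch.nn as nn

n_features, n_hidden = 5, 2          # 5 features forced into a 2-d activation space
feat_prob = 0.1                      # each feature is active only 10% of the time

W = nn.Parameter(torch.randn(n_hidden, n_features) * 0.1)
b = nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(10_000):
    # Sparse synthetic data: each feature fires independently with prob feat_prob.
    mask = (torch.rand(1024, n_features) < feat_prob).float()
    x = torch.rand(1024, n_features) * mask
    h = x @ W.T                      # project into the 2-d bottleneck
    x_hat = torch.relu(h @ W + b)    # reconstruct all 5 features from it
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Each column of W is the direction assigned to one feature; with sparse inputs,
# more than 2 columns typically retain substantial norm, i.e. the model stores
# more features than it has dimensions (superposition).
print(W.norm(dim=0))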

7 of 19

8 of 19

So… how do we interpret, then?

9 of 19

Enter… Sparse Autoencoders!

Source: https://transformer-circuits.pub/2022/toy_model/index.html#motivation

10 of 19

SAE Setup
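A minimal sparse-autoencoder sketch to make the setup concrete: an overcomplete ReLU encoder over model activations, trained to reconstruct them under an L1 sparsity penalty. The dimensions, penalty coefficient, and names (d_model, d_hidden, acts) are illustrative assumptions, not the exact configuration from the referenced work.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)   # overcomplete: d_hidden >> d_model
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))               # sparse feature activations
        x_hat = self.dec(f)                       # reconstruction of the input activations
        return x_hat, f

sae = SparseAutoencoder(d_model=512, d_hidden=8192)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3                                   # strength of the sparsity penalty

def train_step(acts):
    # acts: a batch of model activations, shape (batch, d_model)
    x_hat, f = sae(acts)
    loss = ((acts - x_hat) ** 2).mean() + l1_coeff * f.abs().sum(dim=-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

The rows of the learned decoder then serve as candidate, more monosemantic, feature directions.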

11 of 19

Effect of Non-uniform Sparsity on Superposition in Toy Models

Setup:

  • Case 1: feature probability varies within the instance (higher-importance features have lower sparsity, i.e. appear more often)
  • Case 2: feature probability varies within the instance (higher-importance features have higher sparsity, i.e. appear less often)
  • Case 3: feature probability varies within the instance (sparsity assigned at random, independent of feature importance)
  • Case 4: feature probability varies within the instance (sparsity assigned at random, independent of feature importance)
  • Case 5: constant feature importance; feature probability varies within the instance (sparsity assigned at random, independent of feature importance); a parameterisation sketch of all five cases follows below
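The sketch below assumes the toy-model convention of a per-feature importance weight and a per-feature activation probability (density = 1 - sparsity) and shows one way these cases could be wired up; the specific probabilities are placeholder assumptions, not the values used in the experiments.

import torch

n_features = 5
importance = 0.9 ** torch.arange(n_features).float()   # feature 0 is most important

# Per-feature activation probability for each case (illustrative numbers only).
cases = {
    "case1_important_features_dense":  torch.tensor([0.30, 0.20, 0.10, 0.05, 0.01]),
    "case2_important_features_sparse": torch.tensor([0.01, 0.05, 0.10, 0.20, 0.30]),
    "case3_random_density":            torch.rand(n_features) * 0.3,
    "case4_random_density":            torch.rand(n_features) * 0.3,
    # case 5: a random density vector as above, with importance = torch.ones(n_features)
}

def sample_batch(batch_size, feat_prob):
    # Each feature fires independently with its own probability.
    mask = (torch.rand(batch_size, n_features) < feat_prob).float()
    return torch.rand(batch_size, n_features) * mask

def weighted_mse(x, x_hat):
    # Importance-weighted reconstruction loss, as in the toy-model setup.
    return (importance * (x - x_hat) ** 2).mean()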

12 of 19

Effect of Non-uniform Sparsity on Superposition in Toy Models

Key Observations:

  • In no scenario are all 5 features ever represented; the representation maxes out at 4 features, and quite often at only 3 (a rough counting sketch follows below).
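One rough way to quantify this observation, assuming a trained toy-model weight matrix W of shape (n_hidden, n_features) as in the earlier sketch: count the features whose columns of W retain substantial norm. The threshold is an arbitrary illustrative choice.

import torch

def count_represented(W: torch.Tensor, threshold: float = 0.5) -> int:
    # A feature counts as "represented" if its column of W keeps substantial norm.
    col_norms = W.norm(dim=0)          # one norm per feature
    return int((col_norms > threshold).sum().item())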

13 of 19

Effect of Non-uniform Sparsity on Superposition in Toy Models

Key Observations:

  • When the highest-importance feature is given the least density, every feature ends up in superposition with other features; at no point is any feature represented independently, irrespective of its importance (an interference-checking sketch follows below).
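One way to check this under the same toy-model assumptions: look at the off-diagonal entries of W^T W. A feature is represented independently, in its own dedicated direction, only if it has near-zero interference with every other feature.

import torch

def interference(W: torch.Tensor) -> torch.Tensor:
    # Off-diagonal entries of W^T W: how much each pair of features shares directions.
    gram = W.T @ W
    return gram - torch.diag(torch.diag(gram))

def independently_represented(W: torch.Tensor, tol: float = 1e-2) -> torch.Tensor:
    # Boolean per feature: True if the feature interferes with no other feature.
    return interference(W).abs().max(dim=0).values < tol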

14 of 19

15 of 19

Effect of Non-uniform Sparsity on Superposition in Toy Models

Key Observations:

  • Even with feature importance held constant, almost every feature is still in superposition with other features, and at no point is any feature represented independently.

16 of 19

17 of 19

What's next?

  • ELI5 blog series on Mech Interp
  • Mechanistic analysis of why CoT works: how adding just "Let's think step by step" changes the answer so drastically.

18 of 19

Contact:

If you would like to collaborate on an interpretability or RL-related project, you can reach me at:

Discord: unnamed8802

Mail: jshrey8@gmail.com

Twitter: py_parrot

19 of 19

Thank you..!!!

Questions, anyone?