Experiments with Mechanistic Interpretability
Shreyans Jain
Source: https://transformer-circuits.pub/2024/july-update/index.html#pivot-tables
Feature Understanding at the dawn of Interp
Source: https://playground.tensorflow.org/
Feature Understanding (at least for now!)
Source: https://transformer-circuits.pub/2022/toy_model/index.html#motivation
Polysemanticity and Superposition
So... how do we interpret them, then?
Enter: Sparse Autoencoders!
Source: https://transformer-circuits.pub/2022/toy_model/index.html#motivation
SAE Setup
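As a reference point, a commonly used sparse autoencoder (SAE) formulation is sketched below. All symbols here (encoder and decoder weights $W_e$, $W_d$, biases $b_e$, $b_d$, and sparsity coefficient $\lambda$) are assumptions for illustration; the exact setup used in these experiments may differ.

```latex
\begin{aligned}
f(x)      &= \mathrm{ReLU}\big(W_e (x - b_d) + b_e\big) \\
\hat{x}   &= W_d \, f(x) + b_d \\
\mathcal{L}(x) &= \lVert x - \hat{x} \rVert_2^2 + \lambda \, \lVert f(x) \rVert_1
\end{aligned}
```

The reconstruction term keeps $\hat{x}$ close to the model activation $x$, while the $L_1$ penalty pushes most entries of $f(x)$ to zero so that each active entry can be read as one candidate feature.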
Effect of Non-uniform Sparsity on Superposition in Toy Models
Setup:
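One way to picture non-uniform sparsity is at the data-generation step: each synthetic feature gets its own probability of being zero. The sketch below assumes the toy-model convention that an active feature is drawn uniformly from [0, 1); the helper names `make_sparsities` and `sample_features`, and the linear sparsity schedule, are hypothetical and not taken from the talk.

```python
import random


def make_sparsities(n_features, s_min=0.0, s_max=0.99):
    """Hypothetical schedule: sparsity rises linearly with feature index."""
    if n_features == 1:
        return [s_min]
    step = (s_max - s_min) / (n_features - 1)
    return [s_min + i * step for i in range(n_features)]


def sample_features(sparsities, rng):
    """Draw one feature vector.

    Feature i is zero with probability sparsities[i]; otherwise it is
    uniform on [0, 1) (assumed toy-model convention).
    """
    return [0.0 if rng.random() < s else rng.random() for s in sparsities]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
sparsities = make_sparsities(5)
batch = [sample_features(sparsities, rng) for _ in range(1000)]

# Empirical fraction of zeros per feature; it should roughly track
# the per-feature sparsity that was requested.
zero_frac = [
    sum(row[i] == 0.0 for row in batch) / len(batch)
    for i in range(len(sparsities))
]
print(zero_frac)
```

With uniform sparsity every entry of `sparsities` would be the same; varying it per feature is what lets one compare how dense and sparse features are treated by the same toy model.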
Effect of Non-uniform Sparsity on Superposition in Toy Models
Key Observations:
Effect of Non-uniform Sparsity on Superposition in Toy Models
Key Observations:
Effect of Non-uniform Sparsity on Superposition in Toy Models
Key Observations:
What's next?
Contact:
If you would like to collaborate on an interpretability or RL-related project, you can reach me on:
Discord: unnamed8802
Mail: jshrey8@gmail.com
Twitter: py_parrot
Thank you!
Questions, anyone?