Mechanistic Interpretability: Compiling Transformer Models into Code
Chenghao Yang
University of Chicago
Agenda
Background: Transformer-based LLMs
Vaswani, Ashish, et al. "Attention is all you need." NeurIPS 2017.
Elhage, Nelson, et al. "A Mathematical Framework for Transformer Circuits." Transformer Circuits Thread, 2021.
Encoder
Decoder
Background: Mechanistic Interpretability
Elhage, Nelson, et al. "A Mathematical Framework for Transformer Circuits." Transformer Circuits Thread, 2021.
Meng, Kevin, et al. "Locating and editing factual associations in GPT." NeurIPS 2022.
Transformer “Circuit”
Causal Intervention
Challenges
A Promising New Direction: Code/Algorithmic Equivalence for Transformers
Recurrent Neural Network (RNN)
Finite-State Automata (FSA)
Transformer
RASP: Restricted Access Sequence Processing Language
RASP: Basics
Attention Map
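The semantics of RASP's core primitives can be sketched in plain Python (an illustrative rendering, not the official RASP interpreter): `select` builds a boolean selection matrix from keys, queries, and a predicate, and `aggregate` averages the selected values in each row, mirroring uniform attention over the selected positions.

```python
# Illustrative Python semantics for RASP-style primitives (a sketch,
# not the official interpreter). Sequences are plain Python lists.

def select(keys, queries, predicate):
    """Selection matrix: entry [q][k] is True iff predicate(keys[k], queries[q])."""
    return [[predicate(k, q) for k in keys] for q in queries]

def aggregate(selector, values, default=0.0):
    """Average the selected values in each row (uniform attention)."""
    out = []
    for row in selector:
        picked = [v for chosen, v in zip(row, values) if chosen]
        out.append(sum(picked) / len(picked) if picked else default)
    return out

def selector_width(selector):
    """Count the selected elements in each row."""
    return [sum(row) for row in selector]

# Example: each position attends to all positions with key <= its query.
sel = select([1, 2, 3], [1, 2, 3], lambda k, q: k <= q)
```

The selection matrix plays the role of the attention map, which is why visualizing it lines up directly with transformer attention patterns.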
RASP: Basics (Cont’d)
Combining Different Selectors with Different Aggregators?
Reused Selectors?
Additional Operators
Putting It Together!
Reverse the String
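The classic RASP reverse program selects, for each query position q, the key position length-1-q, then aggregates the tokens. A minimal Python rendering of that logic:

```python
# Illustrative rendering of the RASP "reverse" program: each output
# position q selects the single key position n - 1 - q and copies its token.

def reverse_rasp(tokens):
    n = len(tokens)
    # selector: query position q selects key position k with k == n - 1 - q
    flip = [[k == n - 1 - q for k in range(n)] for q in range(n)]
    # aggregate: each output position copies its single selected token
    return [next(t for chosen, t in zip(row, tokens) if chosen) for row in flip]

print(reverse_rasp(list("abcde")))  # ['e', 'd', 'c', 'b', 'a']
```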
Histogram (with BOS token)
Powerful Selector_Width()
(“Count the selected elements for each row”)
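The histogram program combines a same-token selector with `selector_width`; the BOS token guarantees every attention row selects at least one position. A hedged Python rendering of the counting logic (the real RASP implementation realizes the count through attention, but the computed values match):

```python
# Sketch of the RASP histogram with a BOS token: select positions holding
# the same token, then count the selected positions per row (selector_width).

def histogram_with_bos(tokens):
    seq = ["<BOS>"] + list(tokens)
    # same-token selector (BOS matches only itself)
    same = [[k == q for k in seq] for q in seq]
    # selector_width: number of selected positions in each query row
    width = [sum(row) for row in same]
    return width[1:]  # drop the BOS position

print(histogram_with_bos("hello"))  # [1, 1, 2, 2, 1]
```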
Sorting
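One way to express the sorting program is to compute, for each token, how many tokens should precede it (strictly smaller tokens, with ties broken by original position), then route each token to that target index. A hedged sketch of this counting-based sort:

```python
# Counting-based sort in the spirit of the RASP sorting program (a sketch,
# not the paper's exact code): each token's target position is the number
# of tokens that must come before it.

def sort_rasp(tokens):
    n = len(tokens)
    # target rank: strictly smaller tokens, plus equal tokens at earlier
    # positions (a stable tie-break)
    rank = [sum(1 for j in range(n)
                if tokens[j] < tokens[i]
                or (tokens[j] == tokens[i] and j < i))
            for i in range(n)]
    out = [None] * n
    for i, r in enumerate(rank):
        out[r] = tokens[i]
    return out

print(sort_rasp([3, 1, 2, 1]))  # [1, 1, 2, 3]
```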
Dyck-k Language Recognition
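For the Dyck-1 special case, recognition reduces to a running balance of opens minus closes that must stay non-negative and end at zero; full Dyck-k additionally requires that each closing bracket match the type of its opener. A minimal sketch of the balance check:

```python
# Dyck-1 recognition via prefix balance (the k=1 special case; full Dyck-k
# also needs bracket-type matching). RASP computes the same per-prefix
# counts through attention-based aggregation.

def dyck1_ok(tokens):
    # running balance of "(" minus ")": valid iff the balance never dips
    # below zero and ends at exactly zero
    balance, never_negative = 0, True
    for t in tokens:
        balance += 1 if t == "(" else -1
        never_negative = never_negative and balance >= 0
    return never_negative and balance == 0

print(dyck1_ok(list("(()())")))  # True
print(dyck1_ok(list("())(")))    # False
```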
RASP: Upper Bound the Architecture Design
1. The RASP-predicted numbers of layers and heads are sufficient to solve the given task!
2. Also, since the evaluated tasks may each admit multiple solutions, the learned attention does not always match the RASP-predicted attention. For example, the model may learn to perform bucket sort instead.
The predicted numbers of layers and heads are pretty TIGHT upper bounds: reducing them even slightly leads to a significant performance drop.
RASP: Powerful Tools to Analyze Empirical Observations
Tay, Yi, et al. "Efficient transformers: A survey." ACM Computing Surveys 55.6 (2022): 1-28.
Press, Ofir, Noah A. Smith, and Omer Levy. "Improving Transformer Models by Reordering their Sublayers." ACL 2020.
Interim Discussions
Learning Transformer Program (NeurIPS’23)
The slides from this point on draw heavily on Dan Friedman's NeurIPS 2023 presentation: https://nips.cc/virtual/2023/oral/73853
Friedman, Dan, Alexander Wettig, and Danqi Chen. "Learning transformer programs." NeurIPS 2023.
Constraint-1: Disentangled Residual Stream
Vanilla Transformer (Interpreted as “Circuit”):
Different attention heads can “write” to the same subspace of the residual stream, making it difficult to disentangle the information flow from one component to another.
Disentangled Residual Stream:
1. The token embeddings encode a fixed set of discrete variables in orthogonal subspaces.
2. Each module reads a fixed set of variables and writes a new variable to a dedicated address.
Discrete Feature Extractor
Continuous Feature Extractor
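The disentanglement constraint can be mimicked with a toy “variable dictionary” (the class name and API below are illustrative assumptions, not the paper's code): every variable occupies its own dedicated address, and each module reads a fixed set of variables and writes a new variable exactly once.

```python
# Toy sketch of a disentangled residual stream: variables live at named,
# non-overlapping addresses; each module reads fixed variables and writes
# a new variable to a fresh address exactly once. (Names are assumptions.)

class DisentangledStream:
    def __init__(self, n_positions):
        self.vars = {}          # variable name -> per-position values
        self.n = n_positions

    def write(self, name, values):
        assert name not in self.vars, "each address is written exactly once"
        assert len(values) == self.n
        self.vars[name] = list(values)

    def read(self, name):
        return self.vars[name]

stream = DisentangledStream(4)
stream.write("tok", [1, 2, 3, 4])
# a module reads "tok" and writes its output to a new address "head_out"
stream.write("head_out", [2 * v for v in stream.read("tok")])
```

Because no two modules ever write to the same address, every value in the stream can be traced back to the single module that produced it.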
Constraint-2: Interpretable Sublayers
Optimization: Continuous Relaxation
Jang, Eric, Shixiang Gu, and Ben Poole. "Categorical Reparameterization with Gumbel-Softmax." ICLR 2017.
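The Gumbel-softmax trick gives a differentiable surrogate for sampling a one-hot categorical variable, which is how the discrete module choices are relaxed during training. A generic sketch of the standard estimator (not the Transformer Programs training code):

```python
import math
import random

# Standard Gumbel-softmax relaxation (Jang et al.): add Gumbel(0, 1) noise
# to the logits, divide by a temperature tau, and take a softmax.

def gumbel_softmax(logits, tau=1.0, seed=0):
    rng = random.Random(seed)
    # Gumbel(0, 1) noise: -log(-log(U)), U ~ Uniform(0, 1); clamp U away
    # from 0 to avoid log(0)
    noisy = [(logit - math.log(-math.log(max(rng.random(), 1e-12)))) / tau
             for logit in logits]
    m = max(noisy)                       # subtract max for stability
    exps = [math.exp(v - m) for v in noisy]
    z = sum(exps)
    return [e / z for e in exps]

probs = gumbel_softmax([2.0, 0.5, 0.1], tau=0.5)
```

Lowering the temperature `tau` pushes the sampled distribution toward a one-hot vector, so the relaxed program gradually commits to discrete choices.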
Evaluation of Transformer Program
Simple In-Context Learning
RASP Task
The learned program achieves perfect accuracy on the held-out test set.
NLP Task:
Evaluation Results
Generated Programs
However, NOT ALL generated code is easy to understand.
Final Comments