Mechanistic Interpretability: Compiling Transformer Models into Code
Chenghao Yang
University of Chicago
Agenda
Background: Transformer-based LLMs
Vaswani, Ashish, et al. "Attention is all you need." NeurIPS 2017.
Elhage, Nelson, et al. "A Mathematical Framework for Transformer Circuits." Transformer Circuits Thread, 2021.
Encoder
Decoder
Background: Mechanistic Interpretability
Elhage, Nelson, et al. "A Mathematical Framework for Transformer Circuits." Transformer Circuits Thread, 2021.
Meng, Kevin, et al. "Locating and editing factual associations in GPT." NeurIPS 2022.
Transformer “Circuit”
Causal Intervention
Challenges
A Promising New Direction: Code/Algorithmic Equivalence for Transformers
Recurrent Neural Network (RNN)
Finite-State Automata (FSA)
Transformer
RASP: Restricted Access Sequence Processing Language
RASP: Basics
Attention Map
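The semantics of RASP's core primitives can be sketched in plain Python (an illustrative rendering, not the official RASP interpreter): `select` builds a boolean selection matrix from keys, queries, and a predicate, and `aggregate` averages the selected values in each row, mirroring uniform attention over the selected positions.

```python
# Illustrative Python semantics for RASP-style primitives (a sketch,
# not the official interpreter). Sequences are plain Python lists.

def select(keys, queries, predicate):
    """Selection matrix: entry [q][k] is True iff predicate(keys[k], queries[q])."""
    return [[predicate(k, q) for k in keys] for q in queries]

def aggregate(selector, values, default=0.0):
    """Average the selected values in each row (uniform attention)."""
    out = []
    for row in selector:
        picked = [v for chosen, v in zip(row, values) if chosen]
        out.append(sum(picked) / len(picked) if picked else default)
    return out

def selector_width(selector):
    """Count the selected elements in each row."""
    return [sum(row) for row in selector]

# Example: each position attends to all positions with key <= its query.
sel = select([1, 2, 3], [1, 2, 3], lambda k, q: k <= q)
```

The selection matrix plays the role of the attention map, which is why visualizing it lines up directly with transformer attention patterns.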
RASP: Basics (Cont’d)
Combining Different Selectors with Different Aggregators?
Reused Selectors?
Additional Operators
Putting It Together!
Reverse the String
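The classic RASP reverse program selects, for each query position q, the key position length-1-q, then aggregates the tokens. A minimal Python rendering of that logic:

```python
# Illustrative rendering of the RASP "reverse" program: each output
# position q selects the single key position n - 1 - q and copies its token.

def reverse_rasp(tokens):
    n = len(tokens)
    # selector: query position q selects key position k with k == n - 1 - q
    flip = [[k == n - 1 - q for k in range(n)] for q in range(n)]
    # aggregate: each output position copies its single selected token
    return [next(t for chosen, t in zip(row, tokens) if chosen) for row in flip]

print(reverse_rasp(list("abcde")))  # ['e', 'd', 'c', 'b', 'a']
```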
Histogram (with BOS token)
Powerful Selector_Width()
(“Count the selected elements for each row”)
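The histogram program combines a same-token selector with `selector_width`; the BOS token guarantees every attention row selects at least one position. A hedged Python rendering of the counting logic (the real RASP implementation realizes the count through attention, but the computed values match):

```python
# Sketch of the RASP histogram with a BOS token: select positions holding
# the same token, then count the selected positions per row (selector_width).

def histogram_with_bos(tokens):
    seq = ["<BOS>"] + list(tokens)
    # same-token selector (BOS matches only itself)
    same = [[k == q for k in seq] for q in seq]
    # selector_width: number of selected positions in each query row
    width = [sum(row) for row in same]
    return width[1:]  # drop the BOS position

print(histogram_with_bos("hello"))  # [1, 1, 2, 2, 1]
```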
Sorting
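One way to express the sorting program is to compute, for each token, how many tokens should precede it (strictly smaller tokens, with ties broken by original position), then route each token to that target index. A hedged sketch of this counting-based sort:

```python
# Counting-based sort in the spirit of the RASP sorting program (a sketch,
# not the paper's exact code): each token's target position is the number
# of tokens that must come before it.

def sort_rasp(tokens):
    n = len(tokens)
    # target rank: strictly smaller tokens, plus equal tokens at earlier
    # positions (a stable tie-break)
    rank = [sum(1 for j in range(n)
                if tokens[j] < tokens[i]
                or (tokens[j] == tokens[i] and j < i))
            for i in range(n)]
    out = [None] * n
    for i, r in enumerate(rank):
        out[r] = tokens[i]
    return out

print(sort_rasp([3, 1, 2, 1]))  # [1, 1, 2, 3]
```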
Dyck-k Language Recognition
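For the Dyck-1 special case, recognition reduces to a running balance of opens minus closes that must stay non-negative and end at zero; full Dyck-k additionally requires that each closing bracket match the type of its opener. A minimal sketch of the balance check:

```python
# Dyck-1 recognition via prefix balance (the k=1 special case; full Dyck-k
# also needs bracket-type matching). RASP computes the same per-prefix
# counts through attention-based aggregation.

def dyck1_ok(tokens):
    # running balance of "(" minus ")": valid iff the balance never dips
    # below zero and ends at exactly zero
    balance, never_negative = 0, True
    for t in tokens:
        balance += 1 if t == "(" else -1
        never_negative = never_negative and balance >= 0
    return never_negative and balance == 0

print(dyck1_ok(list("(()())")))  # True
print(dyck1_ok(list("())(")))    # False
```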
RASP: Upper Bound the Architecture Design
1. The RASP-predicted numbers of layers and heads are sufficient to solve the given task!
2. Also, since the evaluated tasks may each admit multiple solutions, the learned attention does not always match the RASP-predicted attention. For example, the model may learn to perform bucket sort instead.
The predicted numbers of layers and heads are pretty TIGHT upper bounds: reducing them even slightly leads to a significant performance drop.
RASP: Powerful Tools to Analyze Empirical Observations
Tay, Yi, et al. "Efficient transformers: A survey." ACM Computing Surveys 55.6 (2022): 1-28.
Press, Ofir, Noah A. Smith, and Omer Levy. "Improving Transformer Models by Reordering their Sublayers." ACL 2020.
Interim Discussions
Learning Transformer Program (NeurIPS’23)
The slides from this point on draw heavily on Dan Friedman's NeurIPS 2023 presentation: https://nips.cc/virtual/2023/oral/73853
Friedman, Dan, Alexander Wettig, and Danqi Chen. "Learning transformer programs." NeurIPS 2023.
Constraint-1: Disentangled Residual Stream
Vanilla Transformer (Interpreted as “Circuit”):
Different attention heads can “write” to the same subspace of the residual stream, making it difficult to disentangle the information flow from one component to another.
Disentangled Residual Stream:
1. The token embeddings encode a fixed set of discrete variables in orthogonal subspaces.
2. Each module reads a fixed set of variables and writes a new variable to a dedicated address.
Discrete Feature Extractor
Continuous Feature Extractor
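The disentanglement constraint can be mimicked with a toy “variable dictionary” (the class name and API below are illustrative assumptions, not the paper's code): every variable occupies its own dedicated address, and each module reads a fixed set of variables and writes a new variable exactly once.

```python
# Toy sketch of a disentangled residual stream: variables live at named,
# non-overlapping addresses; each module reads fixed variables and writes
# a new variable to a fresh address exactly once. (Names are assumptions.)

class DisentangledStream:
    def __init__(self, n_positions):
        self.vars = {}          # variable name -> per-position values
        self.n = n_positions

    def write(self, name, values):
        assert name not in self.vars, "each address is written exactly once"
        assert len(values) == self.n
        self.vars[name] = list(values)

    def read(self, name):
        return self.vars[name]

stream = DisentangledStream(4)
stream.write("tok", [1, 2, 3, 4])
# a module reads "tok" and writes its output to a new address "head_out"
stream.write("head_out", [2 * v for v in stream.read("tok")])
```

Because no two modules ever write to the same address, every value in the stream can be traced back to the single module that produced it.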
Constraint-2: Interpretable Sublayers
Optimization: Continuous Relaxation
Jang, Eric, Shixiang Gu, and Ben Poole. "Categorical Reparameterization with Gumbel-Softmax." ICLR 2017.
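The Gumbel-softmax trick gives a differentiable surrogate for sampling a one-hot categorical variable, which is how the discrete module choices are relaxed during training. A generic sketch of the standard estimator (not the Transformer Programs training code):

```python
import math
import random

# Standard Gumbel-softmax relaxation (Jang et al.): add Gumbel(0, 1) noise
# to the logits, divide by a temperature tau, and take a softmax.

def gumbel_softmax(logits, tau=1.0, seed=0):
    rng = random.Random(seed)
    # Gumbel(0, 1) noise: -log(-log(U)), U ~ Uniform(0, 1); clamp U away
    # from 0 to avoid log(0)
    noisy = [(logit - math.log(-math.log(max(rng.random(), 1e-12)))) / tau
             for logit in logits]
    m = max(noisy)                       # subtract max for stability
    exps = [math.exp(v - m) for v in noisy]
    z = sum(exps)
    return [e / z for e in exps]

probs = gumbel_softmax([2.0, 0.5, 0.1], tau=0.5)
```

Lowering the temperature `tau` pushes the sampled distribution toward a one-hot vector, so the relaxed program gradually commits to discrete choices.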
Evaluation of Transformer Program
Simple In-Context Learning
RASP Task
The learned program achieves perfect accuracy on the held-out test set.
NLP Task:
Evaluation Results
Generated Programs
However, NOT ALL generated code is easy to understand.
Final Comments