PROGRESS MEASURES FOR GROKKING VIA
MECHANISTIC INTERPRETABILITY
Yorguin-José Mantilla-Ramos and Yousef Kotp
The overall problem…
Abrupt Emergent Behaviours
[6]
order parameter: quantifies the degree of organization of the system
Methodological Argument of the paper
Progress Measures: “metrics that precede and are causally linked to the phase transition, and which vary more smoothly”
The concept seems to have originated in this paper, and to be more empirically motivated than the usual dynamical-systems concepts from physics.
[2]
By finding “progress measures” through mechanistic interpretability we can gain insights on discontinuous emergent behavior.
A case study: Grokking in the Modular Addition Task
How to defend this claim?
Illustrate that on an exemplar of emergent behavior (e.g. grokking), this methodology arrives at meaningful insights.
Grokking serves as a case study of the overall methodological argument of the paper.
The specific problem
How does grokking occur in the modular addition task learned by transformers?
What is Grokking?
The phenomenon where:
“models abruptly transition to a generalizing solution after a large number of training steps, despite initially overfitting” [1]
“... sudden generalization after delayed memorization” [3]
TLDR: generalization after overfitting
Follow-up: How can we attempt to understand grokking?
What motivates the model to do this if it has already reached (presumably) the best training accuracy?
→ regularization in the cost function?
(this as a case study of understanding “emergent phenomena” in ML)
[1]
What is modular addition?
How do we know that 17:00h is 5 pm?
17 mod 12 = 5
So basically,
A mod B = remainder of A / B
Before that: What is Modular Arithmetic?
⇒ Going around the clock
What, then, is (A + B) mod C? → Modular Addition
It is possible for humans to find the relation:
How do transformers do it?
Do they implement it in a similar way?
Do they grok to find this?
Core idea: we can compose modular addition…
17 mod 12 =
( 15 + 2 ) mod 12
15 mod 12 = 3
2 mod 12 = 2
( 3 + 2 ) mod 12 =
5 mod 12 =
5
* this is obviously not the best way to decompose this addition (e.g. 12 + 5 is better)...
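To make the clock example concrete, here is a minimal Python sketch (the helper mod_add is my own, not from the paper) checking that (a + b) mod P = ((a mod P) + (b mod P)) mod P:

```python
# Minimal sketch: modular addition as "going around the clock".
# Helper name mod_add is illustrative, not from the paper.

def mod_add(a: int, b: int, p: int) -> int:
    """Compute (a + b) mod p."""
    return (a + b) % p

# The clock example from the slides: 17:00h is 5 pm.
assert 17 % 12 == 5

# Modular addition can be composed from partial reductions:
# (15 + 2) mod 12 == ((15 mod 12) + (2 mod 12)) mod 12 == (3 + 2) mod 12 == 5
assert mod_add(15, 2, 12) == ((15 % 12) + (2 % 12)) % 12 == 5
```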
What is the model used in this paper?
Usual schematic of a transformer
[4]
Too complex to study
Let’s simplify it
[5]
The simplified transformer circuit
[5]
Residual Stream: the sum of the outputs of all previous layers plus the original embedding
Embedding: includes the token and positional embeddings (t is the vector of tokens)
Attention: each head extracts relevant information for a token from the context of the other tokens via self-attention (may be repeated)
MLP: refines the embedding space of the model without interaction between tokens
Unembedding: maps the embedding space to a probability distribution over the vocabulary (in the paper this is also referred to as W_L)
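As a rough illustration of this circuit (not the paper's code: the shapes, random weights, and single-head simplification below are assumptions for brevity), a NumPy sketch of a one-layer transformer's residual stream:

```python
import numpy as np

# Illustrative shapes only; the real model uses d_model=128, 4 heads, a 512-unit MLP.
rng = np.random.default_rng(0)
vocab, n_ctx, d_model, d_mlp = 114, 3, 16, 64

W_E = rng.normal(size=(vocab, d_model))      # token embedding
W_pos = rng.normal(size=(n_ctx, d_model))    # positional embedding
W_Q = rng.normal(size=(d_model, d_model))
W_K = rng.normal(size=(d_model, d_model))
W_V = rng.normal(size=(d_model, d_model))
W_O = rng.normal(size=(d_model, d_model))
W_in = rng.normal(size=(d_model, d_mlp))
W_out = rng.normal(size=(d_mlp, d_model))
W_U = rng.normal(size=(d_model, vocab))      # unembedding (the W_L-like role)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def forward(tokens):
    # Residual stream = token embedding + positional embedding
    resid = W_E[tokens] + W_pos[: len(tokens)]
    # Attention (a single head, for brevity) writes back into the stream
    q, k, v = resid @ W_Q, resid @ W_K, resid @ W_V
    attn = softmax(q @ k.T / np.sqrt(d_model))
    resid = resid + (attn @ v) @ W_O
    # MLP refines each position independently and writes back into the stream
    resid = resid + np.maximum(resid @ W_in, 0) @ W_out  # ReLU MLP
    # Unembedding maps the final ("=") position to logits over the vocabulary
    return resid[-1] @ W_U

logits = forward(np.array([5, 7, 113]))  # [a, b, "="]
print(logits.shape)                      # (114,)
```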
The Specific Model and Task
Input: [5, 7, 113], i.e. [a, b, P], read as “(a + b) mod P =”
Embedding of dim. 128
(5 and 7 are used as indices into a token-to-embedding matrix)
P is held constant; its token actually encodes “=”
4 heads of dim. 32
Single MLP layer, 512 hidden units
Vocabulary of size P
We read the logits at the final token (always “=”) to get the predicted answer
Task: compute (a + b) mod P, where P is prime (e.g. P = 113) and held constant.
Data split: 30% train / 70% test, drawn from all possible pairs (a, b) with 0 ≤ a, b ≤ P − 1.
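A minimal sketch of how such a dataset and split could be generated (NumPy-based; variable names and the fixed random seed are my own, not the paper's code):

```python
import numpy as np

P = 113  # prime modulus, held constant
EQ = P   # the "=" token is encoded as P (113)

# All P*P pairs (a, b); inputs are [a, b, "="], labels are (a + b) mod P.
pairs = np.array([(a, b) for a in range(P) for b in range(P)])
inputs = np.column_stack([pairs, np.full(len(pairs), EQ)])
labels = (pairs[:, 0] + pairs[:, 1]) % P

# 30% train / 70% test split over all pairs.
rng = np.random.default_rng(0)
perm = rng.permutation(len(pairs))
n_train = int(0.3 * len(pairs))
train_idx, test_idx = perm[:n_train], perm[n_train:]
X_train, y_train = inputs[train_idx], labels[train_idx]
X_test, y_test = inputs[test_idx], labels[test_idx]
print(X_train.shape, X_test.shape)  # (3830, 3) (8939, 3)
```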
Hypothesis of how the algorithm works
Based on the insights gained through mechanistic interpretability
Hypothesis algorithm: Addition through Fourier Multiplication
Here we don’t have the answer, just its cosine…
We can get the “hour” by subtracting every possible hour c (each number in the vocabulary) and seeing where a + b − c lands at 0 ⇒ there cos(w(a+b−c)) = 1!
Note: c in yellow here is not the answer, just an arbitrary one.
Correct c in green
Problem: The cosine is periodic, lands at 1 in many places…
Solution found by the transformer → Use wave interference.
If you test cosines of different frequencies (w_k), they all share the correct token c as a maximum (value 1) of the function.
Constructive interference at the answer → big logit there
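A small numerical check of this interference picture (the frequency set below is arbitrary for illustration; a trained model selects its own handful of key frequencies): summing cos(w_k(a + b − c)) over several frequencies gives the biggest logit exactly at c = (a + b) mod P.

```python
import numpy as np

P = 113
a, b = 5, 7
ks = [3, 17, 42, 55, 70]          # arbitrary example frequencies (not a trained model's)
w = 2 * np.pi * np.array(ks) / P  # w_k = 2*pi*k / P

c = np.arange(P)
# "Logits": sum over frequencies of cos(w_k * (a + b - c)).
# Each term equals 1 only at c = (a + b) mod P; elsewhere the terms interfere destructively.
logits = np.cos(w[:, None] * (a + b - c)[None, :]).sum(axis=0)

assert logits.argmax() == (a + b) % P   # constructive interference -> biggest logit at the answer
print(logits.argmax())                  # 12
```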
Ok, so how did they reverse engineer it to arrive at this hypothesis?
Lines of evidence for the Fourier Algorithm
How to provide evidence for the hypothesized algorithm
Surprising Periodicity: Embedding Layer
Surprising Periodicity: Neuron-Logits map
Surprising Periodicity
Surprising Periodicity: Attention Heads and MLP
Surprising Periodicity: Logits
Trigonometric Identity Learning
u and v are learned weight vectors*
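For reference, these are the standard angle-addition identities this slide refers to, written in the slides’ notation with w_k = 2πk/P; the paper argues the network approximately implements them:

```latex
% Angle-addition identities behind the Fourier multiplication algorithm
\cos\big(w_k(a+b)\big) = \cos(w_k a)\cos(w_k b) - \sin(w_k a)\sin(w_k b)
\sin\big(w_k(a+b)\big) = \sin(w_k a)\cos(w_k b) + \cos(w_k a)\sin(w_k b)
% ...and the identity that turns these into logits peaked at c = (a+b) mod P:
\cos\big(w_k(a+b-c)\big) = \cos\big(w_k(a+b)\big)\cos(w_k c) + \sin\big(w_k(a+b)\big)\sin(w_k c)
```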
Logits Approximation by Weighted Sum
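The approximation referred to here can be written compactly; the key frequencies and coefficients α_k are fitted to the trained model, so no specific values are assumed:

```latex
% Logits approximated as a weighted sum of cosines over the key frequencies
\mathrm{logit}(c \mid a, b) \;\approx\; \sum_{k \in \text{key freqs}} \alpha_k \,\cos\!\big(w_k (a + b - c)\big),
\qquad w_k = \tfrac{2\pi k}{P}
```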
Ablation: Effect of Ablating Frequencies for Logits
Ablation: Direction of W_L
Understanding Grokking using Progress Measures
Grokking Results
1- Restricted loss
2- Excluded loss
“The idea is that the memorizing solution should be spread out in the Fourier domain, so that ablating a few directions will leave it mostly unaffected, while the generalizing solution will be hurt significantly” (a rough sketch of computing these two losses follows this list)
3- Gini coefficient of the L2 norms of W_E and W_L
4- L2 Norm of Weights
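A rough sketch of the restricted and excluded losses (the `all_logits` array is a hypothetical stand-in for the model's outputs over every (a, b) pair, and the frequency set and normalization are illustrative, not the paper's exact procedure): project the logits onto the cos/sin(w_k(a+b)) directions of the key frequencies, keep only those components for the restricted loss, and remove them for the excluded loss.

```python
import numpy as np

P = 113
KEY_FREQS = [14, 35, 41, 42, 52]   # illustrative; each trained run has its own key frequencies

# Hypothetical stand-in: all_logits[a, b, :] = model logits for the pair (a, b).
all_logits = np.random.randn(P, P, P)

A, B = np.meshgrid(np.arange(P), np.arange(P), indexing="ij")
labels = (A + B) % P

def project(logits, basis):
    """Keep only the component of the logits (over the (a, b) axes) along a unit-norm basis function."""
    coeff = np.tensordot(basis, logits, axes=([0, 1], [0, 1]))      # shape (P,): one coefficient per class
    return basis[:, :, None] * coeff[None, None, :]

key_component = np.zeros_like(all_logits)
for k in KEY_FREQS:
    w = 2 * np.pi * k / P
    for f in (np.cos, np.sin):
        basis = f(w * (A + B))
        basis = basis / np.linalg.norm(basis)
        key_component += project(all_logits, basis)

restricted_logits = key_component              # only the Fourier-multiplication directions survive
excluded_logits = all_logits - key_component   # those directions ablated

def cross_entropy(logits, labels):
    logits = logits - logits.max(axis=-1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -np.take_along_axis(logp, labels[..., None], axis=-1).mean()

# In the paper, restricted loss is evaluated on the test set and excluded loss on the train set
# (the train/test split is omitted here for brevity).
print(cross_entropy(restricted_logits, labels), cross_entropy(excluded_logits, labels))
```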
Phases of Grokking
Memorization (M), Circuit Formation (CF), Cleanup (C), Steady State (SS)
Progress Measures of Grokking: Memorization
Progress Measures of Grokking: Circuit Formation
This suggests that the model’s behavior on the train set transitions smoothly from the memorizing solution to the Fourier multiplication algorithm
Progress Measures of Grokking: Cleanup
L2 norm Analysis
Data fraction Analysis
Outro
Conclusion
Take-home messages
Original Grokking Paper
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin @ OpenAI
Vedant Misra @ Google
Will be presented by: Yuxing Tian and Zibo Shang
4/21/2025
References
[0] N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt, “Progress measures for grokking via mechanistic interpretability.” arXiv, 2023. doi: 10.48550/ARXIV.2301.05217. Available: https://arxiv.org/abs/2301.05217
[1] A. Power, Y. Burda, H. Edwards, I. Babuschkin, and V. Misra, “Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets.” arXiv, 2022. doi: 10.48550/ARXIV.2201.02177. Available: https://arxiv.org/abs/2201.02177
[2] B. Barak, B. L. Edelman, S. Goel, S. Kakade, E. Malach, and C. Zhang, “Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit.” arXiv, 2022. doi: 10.48550/ARXIV.2207.08799. Available: https://arxiv.org/abs/2207.08799
[3] K. Clauw, S. Stramaglia, and D. Marinazzo, “Information-Theoretic Progress Measures reveal Grokking is an Emergent Phase Transition.” arXiv, 2024. doi: 10.48550/ARXIV.2408.08944. Available: https://arxiv.org/abs/2408.08944
[4] A. Vaswani et al., “Attention Is All You Need.” arXiv, 2017. doi: 10.48550/ARXIV.1706.03762. Available: https://arxiv.org/abs/1706.03762
[5] N. Elhage et al., “A mathematical framework for transformer circuits.” Transformer Circuits Thread, 2021. Available: https://transformer-circuits.pub/2021/framework/index.html
[6] E. F. W. Heffern, H. Huelskamp, S. Bahar, and R. F. Inglis, “Phase transitions in biology: from bird flocks to population dynamics,” Proceedings of the Royal Society B: Biological Sciences, vol. 288, no. 1961, Oct. 2021. doi: 10.1098/rspb.2021.1111. Available: http://dx.doi.org/10.1098/rspb.2021.1111
Q & A