Keywords I've searched: mechanistic, mechanism, mechanistic interpretability, interpretability, interpretable, oversight, scaling law, feature, sparse, sae. In total, skimmed ~4000 titles so you don't have to.
Alice Rigg, twitter: @woog09, discord: woog
To search: grokking, manifold, explainability
https://openreview.net/group?id=ICLR.cc/2025/Conference&referrer=%5BHomepage%5D(%2F)#tab-active-submissions
Notes: I'll try to rearrange the list. The first ~120 rows are better than the ones after that; I'm just going through a long list of keywords and picking up things that seem interesting to me. Having any excitedness score is good. No score ~ wild west, but likely not earth-shattering.
✅ means I know who wrote it
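The mean and std columns in the table appear to be the plain average and the sample standard deviation of the ICLR ratings (for example, ratings 3,3,6,6 give 4.5 and 1.73). A minimal sketch for recomputing them, assuming ratings are stored as a comma-separated string; the helper name `summarize` is hypothetical and not part of the sheet:

```python
from statistics import mean, stdev


def summarize(ratings: str):
    """Return (mean, sample std) rounded to 2 places, or None if fewer than 2 scores."""
    # Keep only numeric entries, so a cell like "Desk reject" yields no scores.
    scores = [int(s) for s in ratings.split(",") if s.strip().isdigit()]
    if len(scores) < 2:
        return None  # e.g. a desk-rejected row with no ratings
    return round(mean(scores), 2), round(stdev(scores), 2)


print(summarize("3,3,6,6"))  # (4.5, 1.73)
```

Rows without ratings, such as desk rejects, have no mean or std, which this sketch reports as `None`.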
Row | How excited am I? (1-3) | ICLR ratings | mean | std | Title | Link | Comments (undoxxed)
4 | 0.9 | 3,3,6,6 | 4.5 | 1.73 | Transformers Struggle to Learn to Search Without In-context Exploration | https://openreview.net/forum?id=9cQB1Hwrtw | Check with Jannik checked and doxxed. Interesting but still toy and not generalizable. They should discuss relation to Jannik's work more. Not 1 bc circuit analysis is so 2023 | 76.41
5 | 0.8 | 6,8,3,8 | 6.25 | 2.36 | On the Role of Attention Heads in Large Language Model Safety | https://openreview.net/forum?id=h0Ak8A5yqw | Reminds me of Nina
6 | 1.5 | 8,5,5,6 | 6 | 1.41 | Mechanistic Permutability: Match Features Across Layers | https://openreview.net/forum?id=MDvecs7EvO | Features across layers
7 | 2 | 6,3,5 | 4.67 | 1.53 | Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders | https://openreview.net/forum?id=ghH6YYDs15 |
8 | 1 | 5,3,3,3 | 3.5 | 1 | Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs | https://openreview.net/forum?id=wI5uHZLeCZ | repost, isn't this already archival?
9 |  | 3,3,5 | 3.67 | 1.15 | Task Vectors are Cross-Modal | https://openreview.net/forum?id=McqeEcMSzy |
10 | 1 | 5,5,3,5 | 4.5 | 1 | A Causal Study on The Learnability of Formal Languages | https://openreview.net/forum?id=Oz9FTPINRe | they re-derive generalized backprop, did not cite it, independently did the same thing
11 |  | 6,10,3,5 | 6 | 2.94 | Linear Representations of Political Perspective Emerge in Large Language Models | https://openreview.net/forum?id=rwqShzb9li | bad
12 |  | 5,3,5,3 | 4 | 1.15 | From Feature Visualization to Visual Circuits: Effect of Model Perturbation | https://openreview.net/forum?id=YomQ3llPD2 |
13 | 1 | 5,5,5,5 | 5 | 0 | Adaptive Sparse Allocation with Mutual Choice & Feature Choice Sparse Autoencoders | https://openreview.net/forum?id=TIjBKgLyPN | mats
14 |  | 6,5,5,1 | 4.25 | 2.22 | Complexity of Injectivity and Verification of ReLU Neural Networks | https://openreview.net/forum?id=Vz5HgVwcdu | probably bad
15 | 3 | 8,6,8,5 | 6.75 | 1.5 | Bilinear MLPs enable weight-based mechanistic interpretability | https://openreview.net/forum?id=gI0kPklUKS | Meta-defining. mild activation fn assumptions make previously unanswerable mechint questions answerable. finds an AND gate within an mlp layer operating on sae features in a 6 layer transformer language model, without input statistics
16 | not interp | 5,6,5,6 | 5.5 | 0.58 | Animate Your Thoughts: Reconstruction of Dynamic Natural Vision from Human Brain Activity | https://openreview.net/forum?id=BpfsxFqhGa | non-interp
17 | not interp | 5,5,3,3 | 4 | 1.15 | Can Transformers Do Enumerative Geometry? | https://openreview.net/forum?id=4X9RpKH4Ls |
18 | 1 | 5,5,5,6 | 5.25 | 0.5 | Approaching Deep Learning through the Spectral Dynamics of Weights | https://openreview.net/forum?id=PJjHILiQHC |
19 |  | 5,3,3,3 | 3.5 | 1 | Planning in a recurrent neural network that plays Sokoban | https://openreview.net/forum?id=ORxjH9kTp8 | repost, bad
20 |  | 6,5,5,3,3 | 4.4 | 1.34 | A mechanistically interpretable neural network for regulatory genomics | https://openreview.net/forum?id=eR9C6c76j5 | sus
21 | 1 | 6,5,3,5 | 4.75 | 1.26 | Interpreting and Steering LLM Representations with Mutual Information-based Explanations on Sparse Autoencoders | https://openreview.net/forum?id=vc1i3a4O99 | Uses N2G
22 | not interp | 3,3,6,8 | 5 | 2.45 | Lines of Thought in Large Language Models | https://openreview.net/forum?id=zjAEa4s3sH | 100% reject
23 |  | 5,8,6,8,3 | 6 | 2.12 | Identifying and Tuning Safety Neurons in Large Language Models | https://openreview.net/forum?id=yR47RmND1m | Probably bad
24 |  | 6,5,3,3,5 | 4.4 | 1.34 | Interpretable Patterns in Random Initialization Unveil Final Representation | https://openreview.net/forum?id=bWT6OBJ71x | glorified lottery tickets
25 |  | 8,6,8,6,5 | 6.6 | 1.34 | Arithmetic Without Algorithms: Language Models Solve Math with a Bag of Heuristics | https://openreview.net/forum?id=O9YTt26r2P | Probably bad
26 | 2.9 | 3,5,3,3 | 3.5 | 1 | Decomposing The Dark Matter of Sparse Autoencoders | https://openreview.net/forum?id=5IZfo98rqr | simple idea and approach to a hard problem. methodology and quantitative numbers are nothing to write home about, but they nailed the novelty and relevance. there's good room for improvement and extensions
27 | not interp | 5,5,5,3 | 4.5 | 1 | BrainCodec: Neural fMRI codec for the decoding of cognitive brain states | https://openreview.net/forum?id=o6ddWvoyjK |
28 | not interp | 5,8,6,5,5 | 5.8 | 1.3 | Learning Color Equivariant Representations | https://openreview.net/forum?id=IXyfbaGlps |
29 |  | 5,6,3 | 4.67 | 1.53 | Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons | https://openreview.net/forum?id=1NkrxqY4jK |
30 |  | 8,3,5,5,6 | 5.4 | 1.82 | Towards scientific discovery with dictionary learning: Extracting biological concepts from microscopy foundation models | https://openreview.net/forum?id=uDIiL89ViX | repost, kinda bad
31 | 2 | 3,6,3,5 | 4.25 | 1.5 | Teaching LLMs to Decode Activations Into Natural Language | https://openreview.net/forum?id=cselR6Jne3 | Patchscopes-like + inversion-view-like. Updated to 2 for now. Note to self to read in more detail when I get time. Score may change up to +-0.5
32 |  | Desk reject | n/a | n/a | Evaluating Synthetic Activations composed of SAE Latents in GPT-2 | https://openreview.net/forum?id=U0y32WKeOd | repost
33 |  | 3,5,1,3,3 | 3 | 1.41 | Revisiting the expressiveness of CNNs: a mathematical framework for feature extraction | https://openreview.net/forum?id=laKmMbx6x4 | Probably bad
34 |  | 5,6,5,6 | 5.5 | 0.58 | Understanding the learned look-ahead behavior of chess neural networks | https://openreview.net/forum?id=Tl8EzmgsEp | repost, kinda mid
35 | 1 | 5,5,5,5 | 5 | 0 | Rethinking The Reliability of Representation Engineering in Large Language Models | https://openreview.net/forum?id=sYJQEgkkaI | Causal RepE
36 |  | 5,5,3,1 | 3.5 | 1.91 | Outcome-based Semifactual Explanation For Reinforcement Learning | https://openreview.net/forum?id=qhfZL46nPV |
37 |  | 5,5,6,3,3 | 4.4 | 1.34 | Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders | https://openreview.net/forum?id=Ch8s4FdUXS |
38 | 1 | 6,8,8,8 | 7.5 | 1 | Retrieval Head Mechanistically Explains Long-Context Factuality | https://openreview.net/forum?id=EytBpUGB1Z | KV-conclusion
39 |  | 5,8,6,3 | 5.5 | 2.08 | Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations | https://openreview.net/forum?id=94kQgWXojH | Sus, maybe
40 | 0.5 | 5,3,5,6 | 4.75 | 1.26 | Applying Sparse Autoencoders to Unlearn Knowledge in Language Models | https://openreview.net/forum?id=ZtvRqm6oBu |
41 | 1 | 3,5,3,3 | 3.5 | 1 | Information Structure in Large Language Models | https://openreview.net/forum?id=VB8xHF1Rdl | Vector entropy, message femi
42 |  | 5,6,8,5 | 6 | 1.41 | Towards Universality: Studying Mechanistic Similarity Across Language Model Architectures | https://openreview.net/forum?id=2J18i8T0oI | mamba interp
43 |  | 1,1,8,1 | 2.75 | 3.5 | Enhancing Integrated Gradients Using Emphasis Factors and Attention for Effective Explainability of Large Language Models | https://openreview.net/forum?id=IRvx66cxip | Decent or horrible
44 | not interp | 3,5,5,3 | 4 | 1.15 | Audio Prototypical Network for Controllable Music Recommendation | https://openreview.net/forum?id=pKDmt7pc6h |
45 |  | 5,6,3,3,5 | 4.4 | 1.34 | Words in Motion: Extracting Interpretable Control Vectors for Motion Transformers | https://openreview.net/forum?id=J9eKm7j6KD |
46 | 0.5 | 3,5,6 | 4.67 | 1.53 | Shapeshifters: Auditory cortical neurons switch from polysemantic to monosemantic under anesthesia | https://openreview.net/forum?id=i4jHy0ewke |
47 |  | 5,6,6,3 | 5 | 1.41 | Learning positional encodings in transformers depends on initialization | https://openreview.net/forum?id=fn0mjkZopf |
48 | not interp | 3,5,3,3,3 | 3.4 | 0.89 | Understanding Gradient Descent through the Training Jacobian | https://openreview.net/forum?id=kkVTeMvC9D |
49 |  | 5,8,5,6 | 6 | 1.41 | Efficient Dictionary Learning with Switch Sparse Autoencoders | https://openreview.net/forum?id=k2ZVAzVeMP | mats
50 | 1.2 | 6,6,3 | 5 | 1.73 | Interpreting Attention Layer Outputs with Sparse Autoencoders | https://openreview.net/forum?id=LphpWGimIa | repost, cleaned up, IOI
51 |  | 5,3,5,6 | 4.75 | 1.26 | Controlling Large Language Model Agents with Entropic Activation Steering | https://openreview.net/forum?id=YCu7H0kFS3 |
52 |  | 3,1,1,5 | 2.5 | 1.91 | pSAE-chiatry: Utilizing Sparse Autoencoders to Uncover Mental-Health-Related Features in Language Models | https://openreview.net/forum?id=LQdaXixB0g | Terrible
53 | 2.2 | 3,10,8,8,10 | 7.8 | 2.86 | Scaling and evaluating sparse autoencoders | https://openreview.net/forum?id=tcsZt9ZNKD | repost
54 | 1 | 5,6,8,8 | 6.75 | 1.5 | Interpreting the Second-Order Effects of Neurons in CLIP | https://openreview.net/forum?id=GPDcvoFGOL |
55 |  | 5,5,6,8 | 6 | 1.41 | How to Probe: Simple Yet Effective Techniques for Improving Post-hoc Explanations | https://openreview.net/forum?id=57NfyYxh5f | probably bad
56 | 1 | 8,5,6 | 6.33 | 1.53 | Forking Paths in Neural Text Generation | https://openreview.net/forum?id=8RCmNLeeXx |
57 | 2 | 6,5,6,5,5 | 5.4 | 0.55 | An X-Ray Is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation | https://openreview.net/forum?id=ZLAQ6Pjf9y | Actually good sae application
58 |  | 3,6,8,3,1 | 4.2 | 2.77 | Enabling Sparse Autoencoders for Topic Alignment in Large Language Models | https://openreview.net/forum?id=uinsufj5TR | Has code
59 | 0.5 | 3,3,8,1 | 3.75 | 2.99 | Mechanistic Insights: Circuit Transformations Across Input and Fine-Tuning Landscapes | https://openreview.net/forum?id=JZjW3k4Kyc | 36 pages, many circuits
60 |  | 3,3,3,3 | 3 | 0 | KAE: Kolmogorov-Arnold Auto-Encoder for Representation Learning | https://openreview.net/forum?id=K9xuqsaP0R | Max Tegmark Copium
61 |  | Desk reject | n/a | n/a | Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models | https://openreview.net/forum?id=u4XyECA6Zd | PCFG
62 | 1.5 | 6,6,6,8 | 6.5 | 1 | Interpreting Emergent Planning in Model-Free Reinforcement Learning | https://openreview.net/forum?id=DzGe40glxs | Causal sokoban interp
63 | 1.7 | 3,8,6,6 | 5.75 | 2.06 | Scaling Sparse Feature Circuits For Studying In-Context Learning | https://openreview.net/forum?id=Pa1vr1Prww | mats
64 |  | 8,3,3,3,5 | 4.4 | 2.19 | Which Attention Heads Matter for In-Context Learning? | https://openreview.net/forum?id=KadOFOsUpQ | Function Vectors
65 |  | 6,5,6,8,5 | 6 | 1.22 | Selective induction Heads: How Transformers Select Causal Structures in Context | https://openreview.net/forum?id=bnJgzAQjWf | Toy Attn-only, pretty figs
66 |  | 5,6,8 | 6.33 | 1.53 | Unifying and Verifying Mechanistic Interpretations: A Case Study with Group Operations | https://openreview.net/forum?id=8xxEBAtD7y | mats
67 | 2 | 3,8,5,3 | 4.75 | 2.36 | Automatically Interpreting Millions of Features in Large Language Models | https://openreview.net/forum?id=5lIXRf8Lnw | upgraded repost with new things in it
68 |  | 8,5,5,8 | 6.5 | 1.73 | Interpretable Language Modeling via Induction-head Ngram Models | https://openreview.net/forum?id=Zq8wylMZ8A |
69 | 0.5 | 3,5,5,5,6 | 4.8 | 1.1 | Locating Information in Large Language Models via Random Matrix Theory | https://openreview.net/forum?id=MmWkNmeDNE |
70 |  | 3,8,3,6 | 5 | 2.45 | Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration | https://openreview.net/forum?id=yBhSORdXqq | repost
71 |  | 8,3,8,6 | 6.25 | 2.36 | A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders | https://openreview.net/forum?id=LC2KxRwC3n | repost
72 |  | 3,1,1 | 1.67 | 1.15 | Sparsity beyond TopK: A Novel Cosine Loss for Sparse Binary Representations | https://openreview.net/pdf?id=UbLvSPMvMA |
73 | 1.5 | 8,6,5,6 | 6.25 | 1.26 | Look Before You Leap: Universal Emergent Mechanism for Retrieval in Language Models | https://openreview.net/forum?id=eIB1UZFcFg | repost, 1 year old
74 |  | 5,6,8 | 6.33 | 1.53 | Interpretable Pre-Trained Transformers for Heart Time-Series Data | https://openreview.net/forum?id=eciCtsqGc8 |
75 | 2.6 | 8,3,10,8 | 7.25 | 2.99 | Differential learning kinetics govern the transition from memorization to generalization during in-context learning | https://openreview.net/forum?id=INyi7qUdjZ | amazing, if only we still cared about ICL
76 |  | 3,3,3,3 | 3 | 0 | Generalization from Starvation: Hints of Universality in LLM Knowledge Graph Learning | https://openreview.net/forum?id=f7aWmxgSN4 |
77 |  | 3,3,5,3 | 3.5 | 1 | Transformers Use Causal World Models in Maze-Solving Tasks | https://openreview.net/forum?id=aE6QjMJ1mN | AISC v2: interesting, probably bad
78 | not interp | 6,5,5,5 | 5.25 | 0.5 | TopInG: Topologically Interpretable Graph Learning via Persistent Rationale Filtration | https://openreview.net/forum?id=ZaSOGF8Ojq | Probably bad
79 | 1 | 8,6,5,3 | 5.5 | 2.08 | Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control | https://openreview.net/forum?id=1Njl73JKjB | mats 5, ioi eval
80 |  | 3,5,3,3,3 | 3.4 | 0.89 | From Logits to Hierarchies: Hierarchical Clustering made Simple | https://openreview.net/forum?id=PmV9oPAtU9 |
81 |  | 3,5,3,3 | 3.5 | 1 | Understanding Reasoning in Chain-of-Thought from the Hopfieldian View | https://openreview.net/forum?id=OclSRDktp3 |
82 |  | 8,8,6,5 | 6.75 | 1.5 | What should a neuron aim for? Designing local objective functions based on information theory | https://openreview.net/forum?id=CLE09ESvul | Brain stuff, sus
83 | 2 | 1,3,8,5,5 | 4.4 | 2.61 | Loss in the Crowd: Hidden Breakthroughs in Language Model Training | https://openreview.net/forum?id=pK4Z6NZ2DB |
84 |  | 6,8,6,5 | 6.25 | 1.26 | Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable? | https://openreview.net/forum?id=5IWJBStfU7 |
85 |  | 8,3,8,6 | 6.25 | 2.36 | Interpretability of Language Models for Learning Hierarchical Structures | https://openreview.net/forum?id=J6qrIjTzoM | repost
86 | 0.5 | 8,5,5,8 | 6.5 | 1.73 | HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks | https://openreview.net/forum?id=6fDjUoEQvm | ravel / causal abstractions sequence
87 | 2 | 6,8,6,5 | 6.25 | 1.26 | The Computational Complexity of Circuit Discovery for Inner Interpretability | https://openreview.net/forum?id=QogcGNXJVw | Looks great if only we had this last year
88 |  | 3,5,3 | 3.67 | 1.15 | Unveiling Language Skills under Circuits | https://openreview.net/forum?id=VwyKSnMmrr |
89 |  | 3,3,3,3 | 3 | 0 | Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders | https://openreview.net/forum?id=NB8qn8iIW9 |
90 |  | 6,5,6 | 5.67 | 0.58 | Local vs distributed representations: What is the right basis for interpretability? | https://openreview.net/forum?id=fmWVPbRGC4 |
91 |  | 3,3,6,3 | 3.75 | 1.5 | Towards Meta-Models for Automated Interpretability | https://openreview.net/forum?id=1zDOkoZAtl |
92 | 1 | 3,5,6 | 4.67 | 1.53 | Incidental Polysemanticity: A New Obstacle for Mechanistic Interpretability | https://openreview.net/forum?id=OeHSkJ58TG |
93 |  | 3,6,5,5,3 | 4.4 | 1.34 | Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness | https://openreview.net/forum?id=g6Qc3p7JH5 |
94 |  | 5,3,5,6 | 4.75 | 1.26 | Towards Unifying Interpretability and Control: Evaluation via Intervention | https://openreview.net/forum?id=uOrfve3prk |
95 | 1.2 | 3,5,5 | 4.33 | 1.15 | Improving Neuron-level Interpretability with White-box Language Models | https://openreview.net/forum?id=6X7HaOEpZS | CRATE for language
96 |  | 5,6,3 | 4.67 | 1.53 | Sparse Attention Decomposition Applied to Circuit Tracing | https://openreview.net/forum?id=A2rfALKFBg | IOI
97 | 2.4 | 5,6,3,5 | 4.75 | 1.26 | Gradient Routing: Masking Gradients to Localize Computation in Neural Networks | https://openreview.net/forum?id=z1mLNhWFyY | not interp
98 |  | 5,5,5,3 | 4.5 | 1 | Understanding and Enhancing Context-Augmented Language Models Through Mechanistic Circuits | https://openreview.net/forum?id=sqsGBW8zQx |
99 |  | 8,6,5,3 | 5.5 | 2.08 | Sparse Autoencoders Reveal Temporal Difference Learning in Large Language Models | https://openreview.net/forum?id=2tIyA5cri8 |
100 |  | 3,5,3,6,3,5 | 4.17 | 1.33 | Causal Abstraction Finds Universal Representation of Race in Large Language Models | https://openreview.net/forum?id=jyjfRLnfww |