woog interp paper review

	A	B	C	D	E	F	H	I	M
1	so				Keywords I've searched: mechanistic, mechanism, mechanistic interpretability, interpretability, interpretable, oversight, scaling law, feature, sparse, sae. In total, skimmed ~4000 titles so you don't have to	Alice Rigg twitter:@woog09 discord:woog		To search: grokking, manifold, explainability
2	https://openreview.net/group?id=ICLR.cc/2025/Conference&referrer=%5BHomepage%5D(%2F)#tab-active-submissions				Notes: I'll try to rearrange the list. The first ~120 rows are better than the ones after that; I'm just going through a long list of keywords and picking up things that seem interesting to me. Any excitedness score at all is a good sign. No score means wild west, but likely not earth-shattering			✅ means I know who wrote it
3	How excited am I? (1-3)	ICLR ratings	mean	std	Title	Link		Comments (undoxxed)
4	0.9	3,3,6,6	4.5	1.73	Transformers Struggle to Learn to Search Without In-context Exploration	https://openreview.net/forum?id=9cQB1Hwrtw	✅	Checked with Jannik and doxxed. Interesting but still toy and not generalizable. They should discuss the relation to Jannik's work more. Not a 1 because circuit analysis is so 2023	76.41
5	0.8	6,8,3,8	6.25	2.36	On the Role of Attention Heads in Large Language Model Safety	https://openreview.net/forum?id=h0Ak8A5yqw		Reminds me of Nina
6	1.5	8,5,5,6	6	1.41	Mechanistic Permutability: Match Features Across Layers	https://openreview.net/forum?id=MDvecs7EvO		Features across layers
7	2	6,3,5	4.67	1.53	Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders	https://openreview.net/forum?id=ghH6YYDs15
8	1	5,3,3,3	3.5	1	Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs	https://openreview.net/forum?id=wI5uHZLeCZ	✅	repost, isn't this already archival?
9		3,3,5	3.67	1.15	Task Vectors are Cross-Modal	https://openreview.net/forum?id=McqeEcMSzy
10	1	5,5,3,5	4.5	1	A Causal Study on The Learnability of Formal Languages	https://openreview.net/forum?id=Oz9FTPINRe		they re-derive generalized backprop, did not cite it, independently did the same thing
11		6,10,3,5	6	2.94	Linear Representations of Political Perspective Emerge in Large Language Models	https://openreview.net/forum?id=rwqShzb9li		bad
12		5,3,5,3	4	1.15	From Feature Visualization to Visual Circuits: Effect of Model Perturbation	https://openreview.net/forum?id=YomQ3llPD2
13	1	5,5,5,5	5	0	Adaptive Sparse Allocation with Mutual Choice & Feature Choice Sparse Autoencoders	https://openreview.net/forum?id=TIjBKgLyPN	✅	mats
14		6,5,5,1	4.25	2.22	Complexity of Injectivity and Verification of ReLU Neural Networks	https://openreview.net/forum?id=Vz5HgVwcdu		probably bad
15	3	8,6,8,5	6.75	1.5	Bilinear MLPs enable weight-based mechanistic interpretability	https://openreview.net/forum?id=gI0kPklUKS	✅	Meta-defining. Mild activation-function assumptions make previously unanswerable mech interp questions answerable. Finds an AND gate within an MLP layer operating on SAE features in a 6-layer transformer language model, without input statistics
16	not interp	5,6,5,6	5.5	0.58	Animate Your Thoughts: Reconstruction of Dynamic Natural Vision from Human Brain Activity	https://openreview.net/forum?id=BpfsxFqhGa		non-interp
17	not interp	5,5,3,3	4	1.15	Can Transformers Do Enumerative Geometry?	https://openreview.net/forum?id=4X9RpKH4Ls
18	1	5,5,5,6	5.25	0.5	Approaching Deep Learning through the Spectral Dynamics of Weights	https://openreview.net/forum?id=PJjHILiQHC
19		5,3,3,3	3.5	1	Planning in a recurrent neural network that plays Sokoban	https://openreview.net/forum?id=ORxjH9kTp8	✅	repost, bad
20		6,5,5,3,3	4.4	1.34	A mechanistically interpretable neural network for regulatory genomics	https://openreview.net/forum?id=eR9C6c76j5		sus
21	1	6,5,3,5	4.75	1.26	Interpreting and Steering LLM Representations with Mutual Information-based Explanations on Sparse Autoencoders	https://openreview.net/forum?id=vc1i3a4O99		Uses N2G
22	not interp	3,3,6,8	5	2.45	Lines of Thought in Large Language Models	https://openreview.net/forum?id=zjAEa4s3sH		100% reject
23		5,8,6,8,3	6	2.12	Identifying and Tuning Safety Neurons in Large Language Models	https://openreview.net/forum?id=yR47RmND1m		Probably bad
24		6,5,3,3,5	4.4	1.34	Interpretable Patterns in Random Initialization Unveil Final Representation	https://openreview.net/forum?id=bWT6OBJ71x		glorified lottery tickets
25		8,6,8,6,5	6.6	1.34	Arithmetic Without Algorithms: Language Models Solve Math with a Bag of Heuristics	https://openreview.net/forum?id=O9YTt26r2P		Probably bad
26	2.9	3,5,3,3	3.5	1	Decomposing The Dark Matter of Sparse Autoencoders	https://openreview.net/forum?id=5IZfo98rqr	✅	Simple idea and approach to a hard problem. Methodology and quantitative numbers are nothing to write home about, but they nailed the novelty and relevance. There's good room for improvement and extensions
27	not interp	5,5,5,3	4.5	1	BrainCodec: Neural fMRI codec for the decoding of cognitive brain states	https://openreview.net/forum?id=o6ddWvoyjK
28	not interp	5,8,6,5,5	5.8	1.3	Learning Color Equivariant Representations	https://openreview.net/forum?id=IXyfbaGlps
29		5,6,3	4.67	1.53	Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons	https://openreview.net/forum?id=1NkrxqY4jK
30		8,3,5,5,6	5.4	1.82	Towards scientific discovery with dictionary learning: Extracting biological concepts from microscopy foundation models	https://openreview.net/forum?id=uDIiL89ViX		repost, kinda bad
31	2	3,6,3,5	4.25	1.5	Teaching LLMs to Decode Activations Into Natural Language	https://openreview.net/forum?id=cselR6Jne3		Patchscopes-like + inversion-view-like. Updated to 2 for now. Note to self: read in more detail when I get time. Score may change by up to ±0.5
32		Desk reject	n/a	n/a	Evaluating Synthetic Activations composed of SAE Latents in GPT-2	https://openreview.net/forum?id=U0y32WKeOd	✅	repost
33		3,5,1,3,3	3	1.41	Revisiting the expressiveness of CNNs: a mathematical framework for feature extraction	https://openreview.net/forum?id=laKmMbx6x4		Probably bad
34		5,6,5,6	5.5	0.58	Understanding the learned look-ahead behavior of chess neural networks	https://openreview.net/forum?id=Tl8EzmgsEp	✅	repost, kinda mid
35	1	5,5,5,5	5	0	Rethinking The Reliability of Representation Engineering in Large Language Models	https://openreview.net/forum?id=sYJQEgkkaI		Causal RepE
36		5,5,3,1	3.5	1.91	Outcome-based Semifactual Explanation For Reinforcement Learning	https://openreview.net/forum?id=qhfZL46nPV
37		5,5,6,3,3	4.4	1.34	Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders	https://openreview.net/forum?id=Ch8s4FdUXS
38	1	6,8,8,8	7.5	1	Retrieval Head Mechanistically Explains Long-Context Factuality	https://openreview.net/forum?id=EytBpUGB1Z		KV-conclusion
39		5,8,6,3	5.5	2.08	Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations	https://openreview.net/forum?id=94kQgWXojH		Sus, maybe
40	0.5	5,3,5,6	4.75	1.26	Applying Sparse Autoencoders to Unlearn Knowledge in Language Models	https://openreview.net/forum?id=ZtvRqm6oBu	✅
41	1	3,5,3,3	3.5	1	Information Structure in Large Language Models	https://openreview.net/forum?id=VB8xHF1Rdl		Vector entropy, message femi
42		5,6,8,5	6	1.41	Towards Universality: Studying Mechanistic Similarity Across Language Model Architectures	https://openreview.net/forum?id=2J18i8T0oI	✅	mamba interp
43		1,1,8,1	2.75	3.5	Enhancing Integrated Gradients Using Emphasis Factors and Attention for Effective Explainability of Large Language Models	https://openreview.net/forum?id=IRvx66cxip		Decent or horrible
44	not interp	3,5,5,3	4	1.15	Audio Prototypical Network for Controllable Music Recommendation	https://openreview.net/forum?id=pKDmt7pc6h
45		5,6,3,3,5	4.4	1.34	Words in Motion: Extracting Interpretable Control Vectors for Motion Transformers	https://openreview.net/forum?id=J9eKm7j6KD
46	0.5	3,5,6	4.67	1.53	Shapeshifters: Auditory cortical neurons switch from polysemantic to monosemantic under anesthesia	https://openreview.net/forum?id=i4jHy0ewke
47		5,6,6,3	5	1.41	Learning positional encodings in transformers depends on initialization	https://openreview.net/forum?id=fn0mjkZopf
48	not interp	3,5,3,3,3	3.4	0.89	Understanding Gradient Descent through the Training Jacobian	https://openreview.net/forum?id=kkVTeMvC9D	✅
49		5,8,5,6	6	1.41	Efficient Dictionary Learning with Switch Sparse Autoencoders	https://openreview.net/forum?id=k2ZVAzVeMP	✅	mats
50	1.2	6,6,3	5	1.73	Interpreting Attention Layer Outputs with Sparse Autoencoders	https://openreview.net/forum?id=LphpWGimIa	✅	repost, cleaned up, IOI
51		5,3,5,6	4.75	1.26	Controlling Large Language Model Agents with Entropic Activation Steering	https://openreview.net/forum?id=YCu7H0kFS3
52		3,1,1,5	2.5	1.91	pSAE-chiatry: Utilizing Sparse Autoencoders to Uncover Mental-Health-Related Features in Language Models	https://openreview.net/forum?id=LQdaXixB0g		Terrible
53	2.2	3,10,8,8,10	7.8	2.86	Scaling and evaluating sparse autoencoders	https://openreview.net/forum?id=tcsZt9ZNKD	✅	repost
54	1	5,6,8,8	6.75	1.5	Interpreting the Second-Order Effects of Neurons in CLIP	https://openreview.net/forum?id=GPDcvoFGOL
55		5,5,6,8	6	1.41	How to Probe: Simple Yet Effective Techniques for Improving Post-hoc Explanations	https://openreview.net/forum?id=57NfyYxh5f		probably bad
56	1	8,5,6	6.33	1.53	Forking Paths in Neural Text Generation	https://openreview.net/forum?id=8RCmNLeeXx
57	2	6,5,6,5,5	5.4	0.55	An X-Ray Is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation	https://openreview.net/forum?id=ZLAQ6Pjf9y	✅	Actually good sae application
58		3,6,8,3,1	4.2	2.77	Enabling Sparse Autoencoders for Topic Alignment in Large Language Models	https://openreview.net/forum?id=uinsufj5TR		Has code
59	0.5	3,3,8,1	3.75	2.99	Mechanistic Insights: Circuit Transformations Across Input and Fine-Tuning Landscapes	https://openreview.net/forum?id=JZjW3k4Kyc		36 pages, many circuits
60		3,3,3,3	3	0	KAE: Kolmogorov-Arnold Auto-Encoder for Representation Learning	https://openreview.net/forum?id=K9xuqsaP0R	✅	Max Tegmark Copium
61		Desk reject	n/a	n/a	Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models	https://openreview.net/forum?id=u4XyECA6Zd		PCFG
62	1.5	6,6,6,8	6.5	1	Interpreting Emergent Planning in Model-Free Reinforcement Learning	https://openreview.net/forum?id=DzGe40glxs		Causal sokoban interp
63	1.7	3,8,6,6	5.75	2.06	Scaling Sparse Feature Circuits For Studying In-Context Learning	https://openreview.net/forum?id=Pa1vr1Prww	✅	mats
64		8,3,3,3,5	4.4	2.19	Which Attention Heads Matter for In-Context Learning?	https://openreview.net/forum?id=KadOFOsUpQ		Function Vectors
65		6,5,6,8,5	6	1.22	Selective induction Heads: How Transformers Select Causal Structures in Context	https://openreview.net/forum?id=bnJgzAQjWf		Toy Attn-only, pretty figs
66		5,6,8	6.33	1.53	Unifying and Verifying Mechanistic Interpretations: A Case Study with Group Operations	https://openreview.net/forum?id=8xxEBAtD7y	✅	mats
67	2	3,8,5,3	4.75	2.36	Automatically Interpreting Millions of Features in Large Language Models	https://openreview.net/forum?id=5lIXRf8Lnw	✅	upgraded repost with new things in it
68		8,5,5,8	6.5	1.73	Interpretable Language Modeling via Induction-head Ngram Models	https://openreview.net/forum?id=Zq8wylMZ8A
69	0.5	3,5,5,5,6	4.8	1.1	Locating Information in Large Language Models via Random Matrix Theory	https://openreview.net/forum?id=MmWkNmeDNE
70		3,8,3,6	5	2.45	Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration	https://openreview.net/forum?id=yBhSORdXqq	✅	repost
71		8,3,8,6	6.25	2.36	A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders	https://openreview.net/forum?id=LC2KxRwC3n	✅	repost
72		3,1,1	1.67	1.15	Sparsity beyond TopK: A Novel Cosine Loss for Sparse Binary Representations	https://openreview.net/pdf?id=UbLvSPMvMA
73	1.5	8,6,5,6	6.25	1.26	Look Before You Leap: Universal Emergent Mechanism for Retrieval in Language Models	https://openreview.net/forum?id=eIB1UZFcFg	✅	repost, 1 year old
74		5,6,8	6.33	1.53	Interpretable Pre-Trained Transformers for Heart Time-Series Data	https://openreview.net/forum?id=eciCtsqGc8
75	2.6	8,3,10,8	7.25	2.99	Differential learning kinetics govern the transition from memorization to generalization during in-context learning	https://openreview.net/forum?id=INyi7qUdjZ	✅	amazing, if only we still cared about ICL
76		3,3,3,3	3	0	Generalization from Starvation: Hints of Universality in LLM Knowledge Graph Learning	https://openreview.net/forum?id=f7aWmxgSN4
77		3,3,5,3	3.5	1	Transformers Use Causal World Models in Maze-Solving Tasks	https://openreview.net/forum?id=aE6QjMJ1mN	✅	AISC v2: interesting, probably bad
78	not interp	6,5,5,5	5.25	0.5	TopInG: Topologically Interpretable Graph Learning via Persistent Rationale Filtration	https://openreview.net/forum?id=ZaSOGF8Ojq		Probably bad
79	1	8,6,5,3	5.5	2.08	Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control	https://openreview.net/forum?id=1Njl73JKjB	✅	mats 5, ioi eval
80		3,5,3,3,3	3.4	0.89	From Logits to Hierarchies: Hierarchical Clustering made Simple	https://openreview.net/forum?id=PmV9oPAtU9
81		3,5,3,3	3.5	1	Understanding Reasoning in Chain-of-Thought from the Hopfieldian View	https://openreview.net/forum?id=OclSRDktp3
82		8,8,6,5	6.75	1.5	What should a neuron aim for? Designing local objective functions based on information theory	https://openreview.net/forum?id=CLE09ESvul		Brain stuff, sus
83	2	1,3,8,5,5	4.4	2.61	Loss in the Crowd: Hidden Breakthroughs in Language Model Training	https://openreview.net/forum?id=pK4Z6NZ2DB	✅
84		6,8,6,5	6.25	1.26	Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable?	https://openreview.net/forum?id=5IWJBStfU7
85		8,3,8,6	6.25	2.36	Interpretability of Language Models for Learning Hierarchical Structures	https://openreview.net/forum?id=J6qrIjTzoM	✅	repost
86	0.5	8,5,5,8	6.5	1.73	HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks	https://openreview.net/forum?id=6fDjUoEQvm	✅	ravel / causal abstractions sequence
87	2	6,8,6,5	6.25	1.26	The Computational Complexity of Circuit Discovery for Inner Interpretability	https://openreview.net/forum?id=QogcGNXJVw		Looks great if only we had this last year
88		3,5,3	3.67	1.15	Unveiling Language Skills under Circuits	https://openreview.net/forum?id=VwyKSnMmrr
89		3,3,3,3	3	0	Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders	https://openreview.net/forum?id=NB8qn8iIW9
90		6,5,6	5.67	0.58	Local vs distributed representations: What is the right basis for interpretability?	https://openreview.net/forum?id=fmWVPbRGC4
91		3,3,6,3	3.75	1.5	Towards Meta-Models for Automated Interpretability	https://openreview.net/forum?id=1zDOkoZAtl
92	1	3,5,6	4.67	1.53	Incidental Polysemanticity: A New Obstacle for Mechanistic Interpretability	https://openreview.net/forum?id=OeHSkJ58TG
93		3,6,5,5,3	4.4	1.34	Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness	https://openreview.net/forum?id=g6Qc3p7JH5
94		5,3,5,6	4.75	1.26	Towards Unifying Interpretability and Control: Evaluation via Intervention	https://openreview.net/forum?id=uOrfve3prk
95	1.2	3,5,5	4.33	1.15	Improving Neuron-level Interpretability with White-box Language Models	https://openreview.net/forum?id=6X7HaOEpZS	✅	CRATE for language
96		5,6,3	4.67	1.53	Sparse Attention Decomposition Applied to Circuit Tracing	https://openreview.net/forum?id=A2rfALKFBg		IOI
97	2.4	5,6,3,5	4.75	1.26	Gradient Routing: Masking Gradients to Localize Computation in Neural Networks	https://openreview.net/forum?id=z1mLNhWFyY	✅	not interp
98		5,5,5,3	4.5	1	Understanding and Enhancing Context-Augmented Language Models Through Mechanistic Circuits	https://openreview.net/forum?id=sqsGBW8zQx
99		8,6,5,3	5.5	2.08	Sparse Autoencoders Reveal Temporal Difference Learning in Large Language Models	https://openreview.net/forum?id=2tIyA5cri8
100		3,5,3,6,3,5	4.17	1.33	Causal Abstraction Finds Universal Representation of Race in Large Language Models	https://openreview.net/forum?id=jyjfRLnfww