Keywords I've searched: mechanistic, mechanism, mechanistic interpretability, interpretability, interpretable, oversight, scaling law, feature, sparse, sae. In total, skimmed ~4000 titles so you don't have to. Still to search: grokking, manifold, explainability.

Alice Rigg (twitter: @woog09, discord: woog)

Submissions list: https://openreview.net/group?id=ICLR.cc/2025/Conference&referrer=%5BHomepage%5D(%2F)#tab-active-submissions

Notes: I'll try to rearrange the list. The first ~120 rows are better than the ones after that; I'm just going through a long list of keywords and picking up things that seem interesting to me. Having any excitedness score is good. No score ~ wild west, but likely not earth-shattering. ✅ means I know who wrote it.
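The mean and std columns below are just the sample mean and sample (n-1) standard deviation of each row's ICLR ratings; desk rejects have no ratings, so those cells read n/a. A minimal sketch of the computation (Python; `summarize` is a hypothetical helper name, not anything from the sheet):

```python
from statistics import mean, stdev

def summarize(ratings: list[float]) -> tuple[float, float] | None:
    """Sample mean and sample (n-1) standard deviation, rounded to 2 decimals
    as displayed in the table. Returns None when there are fewer than two
    ratings (e.g. desk rejects), where the table shows n/a."""
    if len(ratings) < 2:
        return None
    return round(mean(ratings), 2), round(stdev(ratings), 2)

# e.g. the row rated 3,3,6,6:
print(summarize([3, 3, 6, 6]))  # (4.5, 1.73)
```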
| How excited am I? (1-3) | ICLR ratings | mean | std | Title | Link | ✅ | Comments (undoxxed) |
|---|---|---|---|---|---|---|---|
| 0.9 | 3,3,6,6 | 4.5 | 1.73 | Transformers Struggle to Learn to Search Without In-context Exploration | https://openreview.net/forum?id=9cQB1Hwrtw | ✅ | Check with Jannik: checked and doxxed. Interesting but still toy and not generalizable. They should discuss the relation to Jannik's work more. Not 1 bc circuit analysis is so 2023 |
| 0.8 | 6,8,3,8 | 6.25 | 2.36 | On the Role of Attention Heads in Large Language Model Safety | https://openreview.net/forum?id=h0Ak8A5yqw |  | Reminds me of Nina |
| 1.5 | 8,5,5,6 | 6 | 1.41 | Mechanistic Permutability: Match Features Across Layers | https://openreview.net/forum?id=MDvecs7EvO |  | Features across layers |
| 2 | 6,3,5 | 4.67 | 1.53 | Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders | https://openreview.net/forum?id=ghH6YYDs15 |  |  |
| 1 | 5,3,3,3 | 3.5 | 1 | Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs | https://openreview.net/forum?id=wI5uHZLeCZ | ✅ | repost, isn't this already archival? |
|  | 3,3,5 | 3.67 | 1.15 | Task Vectors are Cross-Modal | https://openreview.net/forum?id=McqeEcMSzy |  |  |
| 1 | 5,5,3,5 | 4.5 | 1 | A Causal Study on The Learnability of Formal Languages | https://openreview.net/forum?id=Oz9FTPINRe |  | they re-derive generalized backprop without citing it; independently did the same thing |
|  | 6,10,3,5 | 6 | 2.94 | Linear Representations of Political Perspective Emerge in Large Language Models | https://openreview.net/forum?id=rwqShzb9li |  | bad |
|  | 5,3,5,3 | 4 | 1.15 | From Feature Visualization to Visual Circuits: Effect of Model Perturbation | https://openreview.net/forum?id=YomQ3llPD2 |  |  |
| 1 | 5,5,5,5 | 5 | 0 | Adaptive Sparse Allocation with Mutual Choice & Feature Choice Sparse Autoencoders | https://openreview.net/forum?id=TIjBKgLyPN | ✅ | mats |
|  | 6,5,5,1 | 4.25 | 2.22 | Complexity of Injectivity and Verification of ReLU Neural Networks | https://openreview.net/forum?id=Vz5HgVwcdu |  | probably bad |
| 3 | 8,6,8,5 | 6.75 | 1.5 | Bilinear MLPs enable weight-based mechanistic interpretability | https://openreview.net/forum?id=gI0kPklUKS | ✅ | Meta-defining. Mild activation-function assumptions make previously unanswerable mech interp questions answerable. Finds an AND gate within an MLP layer operating on SAE features in a 6-layer transformer language model, without input statistics |
| not interp | 5,6,5,6 | 5.5 | 0.58 | Animate Your Thoughts: Reconstruction of Dynamic Natural Vision from Human Brain Activity | https://openreview.net/forum?id=BpfsxFqhGa |  | non-interp |
| not interp | 5,5,3,3 | 4 | 1.15 | Can Transformers Do Enumerative Geometry? | https://openreview.net/forum?id=4X9RpKH4Ls |  |  |
| 1 | 5,5,5,6 | 5.25 | 0.5 | Approaching Deep Learning through the Spectral Dynamics of Weights | https://openreview.net/forum?id=PJjHILiQHC |  |  |
|  | 5,3,3,3 | 3.5 | 1 | Planning in a recurrent neural network that plays Sokoban | https://openreview.net/forum?id=ORxjH9kTp8 | ✅ | repost, bad |
|  | 6,5,5,3,3 | 4.4 | 1.34 | A mechanistically interpretable neural network for regulatory genomics | https://openreview.net/forum?id=eR9C6c76j5 |  | sus |
| 1 | 6,5,3,5 | 4.75 | 1.26 | Interpreting and Steering LLM Representations with Mutual Information-based Explanations on Sparse Autoencoders | https://openreview.net/forum?id=vc1i3a4O99 |  | Uses N2G |
| not interp | 3,3,6,8 | 5 | 2.45 | Lines of Thought in Large Language Models | https://openreview.net/forum?id=zjAEa4s3sH |  | 100% reject |
|  | 5,8,6,8,3 | 6 | 2.12 | Identifying and Tuning Safety Neurons in Large Language Models | https://openreview.net/forum?id=yR47RmND1m |  | Probably bad |
|  | 6,5,3,3,5 | 4.4 | 1.34 | Interpretable Patterns in Random Initialization Unveil Final Representation | https://openreview.net/forum?id=bWT6OBJ71x |  | glorified lottery tickets |
|  | 8,6,8,6,5 | 6.6 | 1.34 | Arithmetic Without Algorithms: Language Models Solve Math with a Bag of Heuristics | https://openreview.net/forum?id=O9YTt26r2P |  | Probably bad |
| 2.9 | 3,5,3,3 | 3.5 | 1 | Decomposing The Dark Matter of Sparse Autoencoders | https://openreview.net/forum?id=5IZfo98rqr | ✅ | Simple idea and approach to a hard problem. Methodology and quantitative numbers are nothing to write home about, but they nailed the novelty and relevance. There's good room for improvement and extensions |
| not interp | 5,5,5,3 | 4.5 | 1 | BrainCodec: Neural fMRI codec for the decoding of cognitive brain states | https://openreview.net/forum?id=o6ddWvoyjK |  |  |
| not interp | 5,8,6,5,5 | 5.8 | 1.3 | Learning Color Equivariant Representations | https://openreview.net/forum?id=IXyfbaGlps |  |  |
|  | 5,6,3 | 4.67 | 1.53 | Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons | https://openreview.net/forum?id=1NkrxqY4jK |  |  |
|  | 8,3,5,5,6 | 5.4 | 1.82 | Towards scientific discovery with dictionary learning: Extracting biological concepts from microscopy foundation models | https://openreview.net/forum?id=uDIiL89ViX |  | repost, kinda bad |
| 2 | 3,6,3,5 | 4.25 | 1.5 | Teaching LLMs to Decode Activations Into Natural Language | https://openreview.net/forum?id=cselR6Jne3 |  | Patchscopes-like + inversion-view-like. Updated to 2 for now. Note to self to read in more detail when I get time. Score may change by up to ±0.5 |
| Desk reject |  | n/a | n/a | Evaluating Synthetic Activations composed of SAE Latents in GPT-2 | https://openreview.net/forum?id=U0y32WKeOd | ✅ | repost |
|  | 3,5,1,3,3 | 3 | 1.41 | Revisiting the expressiveness of CNNs: a mathematical framework for feature extraction | https://openreview.net/forum?id=laKmMbx6x4 |  | Probably bad |
|  | 5,6,5,6 | 5.5 | 0.58 | Understanding the learned look-ahead behavior of chess neural networks | https://openreview.net/forum?id=Tl8EzmgsEp | ✅ | repost, kinda mid |
| 1 | 5,5,5,5 | 5 | 0 | Rethinking The Reliability of Representation Engineering in Large Language Models | https://openreview.net/forum?id=sYJQEgkkaI |  | Causal RepE |
|  | 5,5,3,1 | 3.5 | 1.91 | Outcome-based Semifactual Explanation For Reinforcement Learning | https://openreview.net/forum?id=qhfZL46nPV |  |  |
|  | 5,5,6,3,3 | 4.4 | 1.34 | Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders | https://openreview.net/forum?id=Ch8s4FdUXS |  |  |
| 1 | 6,8,8,8 | 7.5 | 1 | Retrieval Head Mechanistically Explains Long-Context Factuality | https://openreview.net/forum?id=EytBpUGB1Z |  | KV-conclusion |
|  | 5,8,6,3 | 5.5 | 2.08 | Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations | https://openreview.net/forum?id=94kQgWXojH |  | Sus, maybe |
| 0.5 | 5,3,5,6 | 4.75 | 1.26 | Applying Sparse Autoencoders to Unlearn Knowledge in Language Models | https://openreview.net/forum?id=ZtvRqm6oBu | ✅ |  |
| 1 | 3,5,3,3 | 3.5 | 1 | Information Structure in Large Language Models | https://openreview.net/forum?id=VB8xHF1Rdl |  | Vector entropy, message femi |
|  | 5,6,8,5 | 6 | 1.41 | Towards Universality: Studying Mechanistic Similarity Across Language Model Architectures | https://openreview.net/forum?id=2J18i8T0oI | ✅ | mamba interp |
|  | 1,1,8,1 | 2.75 | 3.5 | Enhancing Integrated Gradients Using Emphasis Factors and Attention for Effective Explainability of Large Language Models | https://openreview.net/forum?id=IRvx66cxip |  | Decent or horrible |
| not interp | 3,5,5,3 | 4 | 1.15 | Audio Prototypical Network for Controllable Music Recommendation | https://openreview.net/forum?id=pKDmt7pc6h |  |  |
|  | 5,6,3,3,5 | 4.4 | 1.34 | Words in Motion: Extracting Interpretable Control Vectors for Motion Transformers | https://openreview.net/forum?id=J9eKm7j6KD |  |  |
| 0.5 | 3,5,6 | 4.67 | 1.53 | Shapeshifters: Auditory cortical neurons switch from polysemantic to monosemantic under anesthesia | https://openreview.net/forum?id=i4jHy0ewke |  |  |
|  | 5,6,6,3 | 5 | 1.41 | Learning positional encodings in transformers depends on initialization | https://openreview.net/forum?id=fn0mjkZopf |  |  |
| not interp | 3,5,3,3,3 | 3.4 | 0.89 | Understanding Gradient Descent through the Training Jacobian | https://openreview.net/forum?id=kkVTeMvC9D | ✅ |  |
|  | 5,8,5,6 | 6 | 1.41 | Efficient Dictionary Learning with Switch Sparse Autoencoders | https://openreview.net/forum?id=k2ZVAzVeMP | ✅ | mats |
| 1.2 | 6,6,3 | 5 | 1.73 | Interpreting Attention Layer Outputs with Sparse Autoencoders | https://openreview.net/forum?id=LphpWGimIa | ✅ | repost, cleaned up, IOI |
|  | 5,3,5,6 | 4.75 | 1.26 | Controlling Large Language Model Agents with Entropic Activation Steering | https://openreview.net/forum?id=YCu7H0kFS3 |  |  |
|  | 3,1,1,5 | 2.5 | 1.91 | pSAE-chiatry: Utilizing Sparse Autoencoders to Uncover Mental-Health-Related Features in Language Models | https://openreview.net/forum?id=LQdaXixB0g |  | Terrible |
| 2.2 | 3,10,8,8,10 | 7.8 | 2.86 | Scaling and evaluating sparse autoencoders | https://openreview.net/forum?id=tcsZt9ZNKD | ✅ | repost |
| 1 | 5,6,8,8 | 6.75 | 1.5 | Interpreting the Second-Order Effects of Neurons in CLIP | https://openreview.net/forum?id=GPDcvoFGOL |  |  |
|  | 5,5,6,8 | 6 | 1.41 | How to Probe: Simple Yet Effective Techniques for Improving Post-hoc Explanations | https://openreview.net/forum?id=57NfyYxh5f |  | probably bad |
| 1 | 8,5,6 | 6.33 | 1.53 | Forking Paths in Neural Text Generation | https://openreview.net/forum?id=8RCmNLeeXx |  |  |
| 2 | 6,5,6,5,5 | 5.4 | 0.55 | An X-Ray Is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation | https://openreview.net/forum?id=ZLAQ6Pjf9y | ✅ | Actually good sae application |
|  | 3,6,8,3,1 | 4.2 | 2.77 | Enabling Sparse Autoencoders for Topic Alignment in Large Language Models | https://openreview.net/forum?id=uinsufj5TR |  | Has code |
| 0.5 | 3,3,8,1 | 3.75 | 2.99 | Mechanistic Insights: Circuit Transformations Across Input and Fine-Tuning Landscapes | https://openreview.net/forum?id=JZjW3k4Kyc |  | 36 pages, many circuits |
|  | 3,3,3,3 | 3 | 0 | KAE: Kolmogorov-Arnold Auto-Encoder for Representation Learning | https://openreview.net/forum?id=K9xuqsaP0R | ✅ | Max Tegmark Copium |
| Desk reject |  | n/a | n/a | Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models | https://openreview.net/forum?id=u4XyECA6Zd |  | PCFG |
| 1.5 | 6,6,6,8 | 6.5 | 1 | Interpreting Emergent Planning in Model-Free Reinforcement Learning | https://openreview.net/forum?id=DzGe40glxs |  | Causal sokoban interp |
| 1.7 | 3,8,6,6 | 5.75 | 2.06 | Scaling Sparse Feature Circuits For Studying In-Context Learning | https://openreview.net/forum?id=Pa1vr1Prww | ✅ | mats |
|  | 8,3,3,3,5 | 4.4 | 2.19 | Which Attention Heads Matter for In-Context Learning? | https://openreview.net/forum?id=KadOFOsUpQ |  | Function Vectors |
|  | 6,5,6,8,5 | 6 | 1.22 | Selective induction Heads: How Transformers Select Causal Structures in Context | https://openreview.net/forum?id=bnJgzAQjWf |  | Toy Attn-only, pretty figs |
|  | 5,6,8 | 6.33 | 1.53 | Unifying and Verifying Mechanistic Interpretations: A Case Study with Group Operations | https://openreview.net/forum?id=8xxEBAtD7y | ✅ | mats |
| 2 | 3,8,5,3 | 4.75 | 2.36 | Automatically Interpreting Millions of Features in Large Language Models | https://openreview.net/forum?id=5lIXRf8Lnw | ✅ | upgraded repost with new things in it |
|  | 8,5,5,8 | 6.5 | 1.73 | Interpretable Language Modeling via Induction-head Ngram Models | https://openreview.net/forum?id=Zq8wylMZ8A |  |  |
| 0.5 | 3,5,5,5,6 | 4.8 | 1.1 | Locating Information in Large Language Models via Random Matrix Theory | https://openreview.net/forum?id=MmWkNmeDNE |  |  |
|  | 3,8,3,6 | 5 | 2.45 | Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration | https://openreview.net/forum?id=yBhSORdXqq | ✅ | repost |
|  | 8,3,8,6 | 6.25 | 2.36 | A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders | https://openreview.net/forum?id=LC2KxRwC3n | ✅ | repost |
|  | 3,1,1 | 1.67 | 1.15 | Sparsity beyond TopK: A Novel Cosine Loss for Sparse Binary Representations | https://openreview.net/pdf?id=UbLvSPMvMA |  |  |
| 1.5 | 8,6,5,6 | 6.25 | 1.26 | Look Before You Leap: Universal Emergent Mechanism for Retrieval in Language Models | https://openreview.net/forum?id=eIB1UZFcFg | ✅ | repost, 1 year old |
|  | 5,6,8 | 6.33 | 1.53 | Interpretable Pre-Trained Transformers for Heart Time-Series Data | https://openreview.net/forum?id=eciCtsqGc8 |  |  |
| 2.6 | 8,3,10,8 | 7.25 | 2.99 | Differential learning kinetics govern the transition from memorization to generalization during in-context learning | https://openreview.net/forum?id=INyi7qUdjZ | ✅ | amazing, if only we still cared about ICL |
|  | 3,3,3,3 | 3 | 0 | Generalization from Starvation: Hints of Universality in LLM Knowledge Graph Learning | https://openreview.net/forum?id=f7aWmxgSN4 |  |  |
|  | 3,3,5,3 | 3.5 | 1 | Transformers Use Causal World Models in Maze-Solving Tasks | https://openreview.net/forum?id=aE6QjMJ1mN | ✅ | AISC v2: interesting, probably bad |
| not interp | 6,5,5,5 | 5.25 | 0.5 | TopInG: Topologically Interpretable Graph Learning via Persistent Rationale Filtration | https://openreview.net/forum?id=ZaSOGF8Ojq |  | Probably bad |
| 1 | 8,6,5,3 | 5.5 | 2.08 | Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control | https://openreview.net/forum?id=1Njl73JKjB | ✅ | mats 5, IOI eval |
|  | 3,5,3,3,3 | 3.4 | 0.89 | From Logits to Hierarchies: Hierarchical Clustering made Simple | https://openreview.net/forum?id=PmV9oPAtU9 |  |  |
|  | 3,5,3,3 | 3.5 | 1 | Understanding Reasoning in Chain-of-Thought from the Hopfieldian View | https://openreview.net/forum?id=OclSRDktp3 |  |  |
|  | 8,8,6,5 | 6.75 | 1.5 | What should a neuron aim for? Designing local objective functions based on information theory | https://openreview.net/forum?id=CLE09ESvul |  | Brain stuff, sus |
| 2 | 1,3,8,5,5 | 4.4 | 2.61 | Loss in the Crowd: Hidden Breakthroughs in Language Model Training | https://openreview.net/forum?id=pK4Z6NZ2DB | ✅ |  |
|  | 6,8,6,5 | 6.25 | 1.26 | Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable? | https://openreview.net/forum?id=5IWJBStfU7 |  |  |
|  | 8,3,8,6 | 6.25 | 2.36 | Interpretability of Language Models for Learning Hierarchical Structures | https://openreview.net/forum?id=J6qrIjTzoM | ✅ | repost |
| 0.5 | 8,5,5,8 | 6.5 | 1.73 | HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks | https://openreview.net/forum?id=6fDjUoEQvm | ✅ | RAVEL / causal abstractions sequence |
| 2 | 6,8,6,5 | 6.25 | 1.26 | The Computational Complexity of Circuit Discovery for Inner Interpretability | https://openreview.net/forum?id=QogcGNXJVw |  | Looks great; if only we had this last year |
|  | 3,5,3 | 3.67 | 1.15 | Unveiling Language Skills under Circuits | https://openreview.net/forum?id=VwyKSnMmrr |  |  |
|  | 3,3,3,3 | 3 | 0 | Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders | https://openreview.net/forum?id=NB8qn8iIW9 |  |  |
|  | 6,5,6 | 5.67 | 0.58 | Local vs distributed representations: What is the right basis for interpretability? | https://openreview.net/forum?id=fmWVPbRGC4 |  |  |
|  | 3,3,6,3 | 3.75 | 1.5 | Towards Meta-Models for Automated Interpretability | https://openreview.net/forum?id=1zDOkoZAtl |  |  |
| 1 | 3,5,6 | 4.67 | 1.53 | Incidental Polysemanticity: A New Obstacle for Mechanistic Interpretability | https://openreview.net/forum?id=OeHSkJ58TG |  |  |
|  | 3,6,5,5,3 | 4.4 | 1.34 | Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness | https://openreview.net/forum?id=g6Qc3p7JH5 |  |  |
|  | 5,3,5,6 | 4.75 | 1.26 | Towards Unifying Interpretability and Control: Evaluation via Intervention | https://openreview.net/forum?id=uOrfve3prk |  |  |
| 1.2 | 3,5,5 | 4.33 | 1.15 | Improving Neuron-level Interpretability with White-box Language Models | https://openreview.net/forum?id=6X7HaOEpZS | ✅ | CRATE for language |
|  | 5,6,3 | 4.67 | 1.53 | Sparse Attention Decomposition Applied to Circuit Tracing | https://openreview.net/forum?id=A2rfALKFBg |  | IOI |
| 2.4 | 5,6,3,5 | 4.75 | 1.26 | Gradient Routing: Masking Gradients to Localize Computation in Neural Networks | https://openreview.net/forum?id=z1mLNhWFyY | ✅ | not interp |
|  | 5,5,5,3 | 4.5 | 1 | Understanding and Enhancing Context-Augmented Language Models Through Mechanistic Circuits | https://openreview.net/forum?id=sqsGBW8zQx |  |  |
|  | 8,6,5,3 | 5.5 | 2.08 | Sparse Autoencoders Reveal Temporal Difference Learning in Large Language Models | https://openreview.net/forum?id=2tIyA5cri8 |  |  |
|  | 3,5,3,6,3,5 | 4.17 | 1.33 | Causal Abstraction Finds Universal Representation of Race in Large Language Models | https://openreview.net/forum?id=jyjfRLnfww |  |  |