Task -- Names
Task -- Aerith, Bob, Charlie
High-Value - try the exercises before looking at the solutions!
ARENA: Transformer from Scratch (Sections 1 & 2)
ARENA: Basic Techniques in Mech Interp (Indirect Object Identification) Tutorial -- Isaiah, Ed, Emil, Iuliia, Uzay, Oli, Jake, Chuqiao, Anna, Anton, Atticus, Denis
ARENA: Induction Heads in a 2L Model Tutorial -- Ed, Iuliia, Uzay, Oli, Jake, Victor, Anna, Anton, Atticus, Denis

Has tutorial - try the exercises before looking at the solutions!
ARENA: Superposition & SAEs -- Shivam, Iuliia, Ed, Denis
ARENA: Function Vectors & Model Steering -- Shivam, Uzay, Chuqiao, Isaiah, Oli, Jake, Jasmine, Ed, Matt L, Iuliia, Jim, Devon
ARENA: Training and Sampling Tutorial (Sections 3 & 4)
ARENA: Interpreting a parenthesis balancing algorithmic model -- Adam, Oli, Denis, Isaiah, Jake, Matt L, Ed, Iuliia, Jim
ARENA: Reverse-Engineer Modular Addition -- Emil, Chuqiao, Oli, Tim H, Matt L, Ed, Victor, Iuliia
ARENA: OthelloGPT -- Adam, Chuqiao, Oli, Jake, Matt L, Ed, Iuliia

More free-form

Approachable-ish
Look for interesting SAE features in Neuronpedia. Load the model and SAE in a Colab and check that you get consistent results, and that the feature does what you think it does. -- Denis, Ed, Siyu
Reverse-engineer an SAE latent. Start with direct feature attribution to see what contributed to it; if you feel ambitious, look for a circuit, or use attribution patching to see how latents in earlier layers affect it. -- Oli, Isaiah, Matt L, Victor, wei jie, Chuqiao, Sigurd
Train an SAE on a language model (eg gelu-1l) with SAELens
Train an SAE on a toy model, writing the SAE from scratch (eg replicate Sharkey et al) -- Shivam
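For the from-scratch route, a minimal sketch in PyTorch (the class name, init scheme, L1 coefficient, and the random stand-in data are placeholders of my own, not taken from any specific paper):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Reconstruct activations through an overcomplete ReLU bottleneck."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_hidden) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(d_hidden, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_hidden))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        acts = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)  # sparse latents
        recon = acts @ self.W_dec + self.b_dec
        return recon, acts

def sae_loss(x, recon, acts, l1_coeff=1e-3):
    # reconstruction error plus an L1 sparsity penalty on the latents
    return ((recon - x) ** 2).sum(-1).mean() + l1_coeff * acts.abs().sum(-1).mean()

# toy training loop on random vectors standing in for cached model activations
sae = SparseAutoencoder(d_model=16, d_hidden=64)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
data = torch.randn(256, 16)
for _ in range(100):
    recon, acts = sae(data)
    loss = sae_loss(data, recon, acts)
    opt.zero_grad(); loss.backward(); opt.step()
```

For the toy-model version, the data would instead be generated from a known sparse feature dictionary, so you can check whether the learned latents recover the ground-truth features.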
Look for a circuit on an interesting behaviour in GPT-2 Small (use the ARENA IOI tutorial as a base) -- Oli, Victor, Anna, Anton
Replicate Activation Steering with SAEs, using steering for a simple concept (eg anger or positive sentiment) on Gemma 2 2B (try a residual stream SAE in a mid-late layer, accessible from SAELens) -- Ed, Anna
Investigate how Gemma 2 2B does addition (I think it can?), eg "105+321=" -> "426". Can you find important components or SAE features? Possible inspiration. -- Shivam (might also want to look into the helix paper from Tegmark), Adam, Oli, Armel, Matt L, Ed, Victor, wei jie, Isaiah, Chuqiao, Paul, Siyu
Write your own hooks to do path patching between a pair of heads in TransformerLens (query, key and value are each separate patches) -- Oli, Matt L, Victor
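The core mechanic (cache an activation on one run, overwrite with it on another) can be prototyped with plain PyTorch forward hooks on a toy model before porting to TransformerLens, where you would target separate hook points for queries, keys, and values; the toy model below is a stand-in, not a transformer:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# toy stand-in for a transformer: layers we can attach hooks to
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))

cache = {}

def cache_hook(module, inputs, output):
    cache["layer0"] = output.detach()

def patch_hook(module, inputs, output):
    # returning a value from a forward hook replaces the module's output
    return cache["layer0"]

clean, corrupt = torch.randn(1, 8), torch.randn(1, 8)

# 1) clean run: cache the activation at layer 0
h = model[0].register_forward_hook(cache_hook)
clean_out = model(clean)
h.remove()

# 2) corrupt run with the clean activation patched in
h = model[0].register_forward_hook(patch_hook)
patched_out = model(corrupt)
h.remove()

# everything downstream of layer 0 only sees the patched activation, so in
# this toy model the patched corrupt run reproduces the clean output exactly
```

Path patching proper is this pattern applied along a specific sender-to-receiver path while freezing other paths; getting the freezing right is most of the exercise.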
Use few-shot prompting / ICL to demonstrate a capability in Gemma 2 2B or 9B (eg like table 1 in the task vectors paper). What is the most complex behaviour you can elicit? -- Adam, Oli, Armel, Matt L, Anna

More ambitious
Train an SAE on a language model (eg gelu-1l), writing the SAE from scratch (ie replicate Towards Monosemanticity) -- Uzay, Chuqiao
Write your own JumpReLU SAE implementation and train it on a model (gelu-1l should be easiest, an early layer of a larger model should be doable and more interesting - can you get comparable performance to Gemma Scope?!)
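A sketch of just the forward pass (storing `log_theta` to keep the threshold positive is my own choice here; note the hard threshold has zero gradient, so the actual training recipe needs straight-through-style pseudo-gradients for the threshold, which this omits):

```python
import torch
import torch.nn as nn

class JumpReLU(nn.Module):
    """JumpReLU: pass z through unchanged when z > theta, else output 0.
    theta is a learnable per-latent threshold (stored as log_theta)."""
    def __init__(self, d_hidden: int):
        super().__init__()
        self.log_theta = nn.Parameter(torch.zeros(d_hidden))  # theta starts at 1.0

    def forward(self, z):
        theta = self.log_theta.exp()
        return z * (z > theta).float()

act = JumpReLU(d_hidden=4)
z = torch.tensor([[-1.0, 0.5, 1.5, 3.0]])
out = act(z)  # thresholds are all 1.0 initially, so only 1.5 and 3.0 pass
```

This activation would replace the plain ReLU in an otherwise standard SAE; the interesting work is in training the thresholds and matching Gemma Scope's reported sparsity/fidelity trade-off.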
Find a circuit where nodes are SAE features (I recommend approaching this like the excellent Sparse Feature Circuits paper, though you can simplify and eg only use attribution patching, not integrated gradients; and just find nodes rather than nodes+edges (ie assume the circuit is the complete subgraph)) -- Emil (I would be happy to also just do a deep dive on the Sparse Feature Circuits paper), Oli, Victor, Anna, Siyu, Isaiah, Lily
Replicate the Chinchilla Multiple Choice Circuit on Gemma 2 9B with attribution patching. Can you find important features? -- Oli, Jim

Feel free to add your own!
Shivam (@jager.bomber on discord): Toy models of crosscoders for model diffing. Take two different toy model setups with different configurations (same #features and feature directions, diff #features and feature directions, etc). Train a crosscoder to reconstruct model 2's latents from model 1's latents. Could be a testbed for many interesting experiments.
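One possible minimal starting point for the toy setup, sketched in PyTorch (all shapes, the sparsity level, and the encode-model-1 / decode-to-model-2 design are placeholder assumptions for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# two toy "models": random dictionaries mapping the same sparse ground-truth
# features into each model's activation space
n_features, d1, d2 = 10, 6, 8
F1 = torch.randn(n_features, d1)
F2 = torch.randn(n_features, d2)
feats = (torch.rand(512, n_features) < 0.1).float() * torch.rand(512, n_features)
acts1, acts2 = feats @ F1, feats @ F2  # paired activations from models 1 and 2

class ToyCrosscoder(nn.Module):
    """Encode model-1 activations into a shared sparse code, decode into model 2."""
    def __init__(self, d_in, d_code, d_out):
        super().__init__()
        self.enc = nn.Linear(d_in, d_code)
        self.dec = nn.Linear(d_code, d_out)

    def forward(self, x):
        code = torch.relu(self.enc(x))
        return self.dec(code), code

cc = ToyCrosscoder(d1, 32, d2)
opt = torch.optim.Adam(cc.parameters(), lr=1e-2)
for _ in range(200):
    pred, code = cc(acts1)
    # cross-reconstruction loss plus an L1 sparsity penalty on the shared code
    loss = ((pred - acts2) ** 2).mean() + 1e-4 * code.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

Because the ground-truth features are known, you can then check whether the shared code recovers them, and how that degrades as the two toy models' feature sets diverge.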
Shivam: Train a small transformer on a reasonably sized dataset (eg a 2-4 layer transformer on TinyStories or an equivalent dataset). Train a cross-layer transcoder, test out variants like BatchTopK, or replicate some results from the recent Anthropic paper, and open-source the codebase. -- Oli
https://github.com/oli-clive-griffin/model-diffing
Oli: I'd be interested in trying to replicate APD (Attribution-based Parameter Decomposition) -- Jake, Anna, Siyu
Emil: 1-day trial sprint - train a linear probe for deception/deceptive alignment on Llama 405B (using nnsight, presumably?).
1. We make the model deceptively aligned, following the alignment faking paper.
2. Train a linear probe to find this deceptive alignment (idea needs some refining/clarification).
3. Can we use this probe to find deceptive alignment OOD? I'd be very excited about that.
Idea graciously stolen from Neel. -- Shivam, Anna, Siyu
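For step 2, the probe training itself is the easy part; here is a hedged sketch of it on synthetic activations with a planted "deception" direction (everything below is a stand-in for real activations cached from Llama via nnsight):

```python
import torch

torch.manual_seed(0)
d_model = 32
direction = torch.randn(d_model)  # planted synthetic "deception" direction

# synthetic activations: label-1 examples get the direction added
X = torch.randn(512, d_model)
y = (torch.rand(512) < 0.5).float()
X = X + y[:, None] * 2.0 * direction

# a linear probe is just logistic regression on the activations
w = torch.zeros(d_model, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([w, b], lr=0.05)
for _ in range(300):
    logits = X @ w + b
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, y)
    opt.zero_grad(); loss.backward(); opt.step()

acc = (((X @ w + b) > 0).float() == y).float().mean()
```

The hard parts are steps 1 and 3: getting trustworthy labels for "deceptively aligned" forward passes, and evaluating the probe on genuinely out-of-distribution prompts rather than held-out splits of the same data.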
Tim: Go through at least parts of the PPO section in ARENA, and maybe do some sort of replication of GRPO (see eg this colab notebook, this video) -- Isaiah
Tim: Implement various architectural improvements from DeepSeek-V3 (eg MLA)
Jasmine: Work through a guide on modifying LLM architectures to understand how various components function. (Shivam: found a Llama 3.2 from-scratch tutorial, might be helpful.)
Jasmine: Write some cleaner functions to implement the logit lens.
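A minimal version of the core logit-lens computation, with toy tensors standing in for a real model's final LayerNorm and unembedding matrix:

```python
import torch

def logit_lens(resid, ln_final, W_U, k=5):
    """Project an intermediate residual-stream vector into vocab space:
    apply the final LayerNorm, then the unembedding, and return top-k token ids."""
    logits = ln_final(resid) @ W_U  # [batch, d_vocab]
    return logits.topk(k, dim=-1).indices

# toy shapes standing in for a real model's ln_final / W_U
d_model, d_vocab = 16, 100
ln_final = torch.nn.LayerNorm(d_model)
W_U = torch.randn(d_model, d_vocab)
resid = torch.randn(2, d_model)
topk = logit_lens(resid, ln_final, W_U, k=5)
```

A cleaner library version would mostly add conveniences around this: iterating over layers, decoding the token ids back to strings, and handling models whose unembedding expects a different normalization.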
Victor: Explore data distributions in the residual stream of large models (norms, cosine sims, clusters) across properties (model size, layers, checkpoints, code vs non-code) -- Santi
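The basic per-layer statistics can be sketched like this (random activations stand in for real cached residual streams; the function name is my own):

```python
import torch

def resid_stats(resid):
    """Per-token norms and mean pairwise cosine similarity of residual vectors."""
    norms = resid.norm(dim=-1)                     # [n_tokens]
    unit = resid / norms[:, None].clamp_min(1e-8)  # normalize rows
    cos = unit @ unit.T                            # [n_tokens, n_tokens]
    n = resid.shape[0]
    off_diag = cos[~torch.eye(n, dtype=torch.bool)]
    return norms, off_diag.mean()

resid = torch.randn(64, 32)  # stand-in: 64 token positions, d_model = 32
norms, mean_cos = resid_stats(resid)
```

Running this per layer and per checkpoint, and comparing code vs non-code inputs, would give the distributions the project asks about; clustering could start from the same normalized vectors.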
Bartosz: Fine-tune a small model (gpt2?) using LoRA on some well-defined task. Then analyze what changed in the model during fine-tuning (eg by training a crosscoder, or just try patching) -- Jim
Jim: Train Unsupervised Steering Vectors on a small/medium reasoning model, eg R1-Distilled-Qwen14B, and look for reasoning-related vectors (eg a backtracking vector). I did this with R1-Distilled-Qwen1.5B in my application, but the code had a bug and I want to do this with a somewhat larger model. (Another thing I want to test is whether some steering vectors only have a large effect at the end of a thought; this seems intuitive to me, because my theory is that after completing a thought the model decides whether to continue this line of thought or backtrack.) -- Jim, Emil
Anton: I actually did something similar for my application and found this bizarre vector which seems to suppress/enhance reasoning in R1-Llama-8B depending on the strength of the hook - would be curious to explore this further together. -- Siyu, Shivam, wei jie
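The steering mechanic itself (adding a fixed vector to a layer's output during the forward pass) can be sketched with a plain PyTorch forward hook; the layer, vector, and strength below are arbitrary placeholders, and the unsupervised part (finding the vector) is the actual project:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# toy "layer" standing in for a transformer block's residual-stream output
layer = nn.Linear(8, 8)

steering_vector = torch.randn(8)  # placeholder; would be learned or extracted

def steering_hook(module, inputs, output, strength=4.0):
    # add the scaled steering vector to every position's activation
    return output + strength * steering_vector

x = torch.randn(3, 8)
base_out = layer(x)
handle = layer.register_forward_hook(steering_hook)
steered_out = layer(x)
handle.remove()
```

Testing the "end of a thought" hypothesis would mean applying the hook only at selected token positions (eg just before a thought delimiter) and comparing the effect size against applying it everywhere.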
Anna: Replicate the emergent misalignment work on one of the smaller models (probably Qwen2.5-Coder-32B-Instruct) and attempt model diffing or stage-wise model diffing to compare the base and finetuned models. -- Jacob, Emil, Ed
Jacob: Better model diffing: can we identify circuits (rather than just features) which are exclusive to the original or finetuned model? We could study this by patching activations from one model into the other, eg use AtP*, but instead of the corrupt activations being taken from a different prompt, they are taken from a different model on the same prompt. -- Ida, Ed
Ida: Replicate part of the faithfulness of chain-of-thought reasoning work and use TransformerLens to patch activations step-by-step between correct and incorrect completions on short reasoning tasks with single-token answers (eg GSM8K), checking whether any specific step in the CoT is important for wrong answers. -- Jacob, Jasmine
Denis (don't hesitate to DM!): Follow up on the planning-in-poems findings from the Biology paper. First, establish which models can even produce rhymed lines; second, simplify the setup by trying to find a steering vector that induces a specific ending of a line; finally, apply SAEs or transcoders if viable. -- Denis