Sign up by adding your name next to a task (example row: "Task | Aerith, Bob, Charlie").

High-Value - try the exercises before looking at the solutions!

| Task | Names |
|---|---|
| ARENA: Transformer from Scratch (Sections 1 & 2) | |
| ARENA: Basic Techniques in Mech Interp (Indirect Object Identification) tutorial - see the logit-diff sketch below this table | Isaiah, Ed, Emil, Iuliia, Uzay, Oli, Jake, Chuqiao, Anna, Anton, Atticus, Denis |
| ARENA: Induction Heads in a 2L Model tutorial | Ed, Iuliia, Uzay, Oli, Jake, Victor, Anna, Anton, Atticus, Denis |
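
A minimal sketch of the kind of measurement the IOI tutorial builds up to: load GPT-2 Small in TransformerLens and compute the logit difference between the indirect object and the subject. The prompt and names here are illustrative, not taken from the tutorial.

```python
import torch
from transformer_lens import HookedTransformer

# GPT-2 Small is the model the IOI paper and the ARENA tutorial study.
model = HookedTransformer.from_pretrained("gpt2")

prompt = "When John and Mary went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)

with torch.no_grad():
    logits = model(tokens)

# Logit difference between the correct indirect object (" Mary")
# and the distractor subject (" John") at the final position.
mary = model.to_single_token(" Mary")
john = model.to_single_token(" John")
logit_diff = logits[0, -1, mary] - logits[0, -1, john]
print(f"IO-S logit diff: {logit_diff.item():.3f}")
```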

Has tutorial - try the exercises before looking at the solutions!

| Task | Names |
|---|---|
| ARENA: Superposition & SAEs | Shivam, Iuliia, Ed, Denis |
| ARENA: Function Vectors & Model Steering - see the steering hook sketch below this table | Shivam, Uzay, Chuqiao, Isaiah, Oli, Jake, Jasmine, Ed, Matt L, Iuliia, Jim, Devon |
| ARENA: Training and Sampling Tutorial (Sections 3 & 4) | |
| ARENA: Interpreting a parenthesis balancing algorithmic model | Adam, Oli, Denis, Isaiah, Jake, Matt L, Ed, Iuliia, Jim |
| ARENA: Reverse-Engineer Modular Addition | Emil, Chuqiao, Oli, Tim H, Matt L, Ed, Victor, Iuliia |
| ARENA: OthelloGPT | Adam, Chuqiao, Oli, Jake, Matt L, Ed, Iuliia |
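
A minimal sketch of the activation-addition style of steering the Function Vectors & Model Steering exercises work towards, assuming TransformerLens. The layer, coefficient, and contrast prompts are illustrative placeholders, not values from the exercises.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
LAYER, COEF = 6, 4.0  # illustrative; both are worth sweeping
hook_name = f"blocks.{LAYER}.hook_resid_post"

# Crude steering vector: difference of residual-stream means on contrast prompts.
_, cache_pos = model.run_with_cache("I love this, it is wonderful")
_, cache_neg = model.run_with_cache("I hate this, it is terrible")
steer = cache_pos[hook_name].mean(1) - cache_neg[hook_name].mean(1)  # [1, d_model]

def steering_hook(resid, hook):
    # Add the steering vector at every position of the residual stream.
    return resid + COEF * steer.to(resid.device)

with model.hooks(fwd_hooks=[(hook_name, steering_hook)]):
    print(model.generate("I think the movie was", max_new_tokens=20))
```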

More free-form

Approachable-ish

| Task | Names |
|---|---|
| Look for interesting SAE features in Neuronpedia. Load the model and SAE in a Colab and check that you get consistent results, and that the feature does what you think it does (see the SAELens sketch below this table). | Denis, Ed, Siyu |
| Reverse-engineer an SAE latent. Start with direct feature attribution to see what contributed to it; if you feel ambitious, look for a circuit, or use attribution patching to see how latents in earlier layers affect it. | Oli, Isaiah, Matt L, Victor, wei jie, Chuqiao, Sigurd |
| Train an SAE on a language model (e.g. gelu-1l) with SAELens. | |
| Train an SAE on a toy model, writing the SAE from scratch (e.g. replicate Sharkey et al.) - see the from-scratch SAE sketch below. | Shivam |
| Look for a circuit for an interesting behaviour in GPT-2 Small (use the ARENA IOI tutorial as a base). | Oli, Victor, Anna, Anton |
| Replicate Activation Steering with SAEs, steering a simple concept (e.g. anger or positive sentiment) on Gemma 2 2B (try a residual stream SAE in a mid-late layer, accessible from SAELens). | Ed, Anna |
| Investigate how Gemma 2 2B does addition (I think it can?), e.g. "105+321=" -> "426". Can you find important components or SAE features? Possible inspiration. | Shivam (might also want to look into the helix paper from Tegmark's group), Adam, Oli, Armel, Matt L, Ed, Victor, wei jie, Isaiah, Chuqiao, Paul, Siyu |
| Write your own hooks to do path patching between a pair of heads in TransformerLens (query, key and value are each separate patches). | Oli, Matt L, Victor |
| Use few-shot prompting / ICL to demonstrate a capability in Gemma 2 2B or 9B (e.g. like Table 1 in the task vectors paper). What is the most complex behaviour you can elicit? | Adam, Oli, Armel, Matt L, Anna |
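
For the Neuronpedia feature-checking task above, a minimal sketch using SAELens to load a pretrained GPT-2 Small SAE and inspect one feature. The release/id strings come from the SAELens pretrained directory, the feature index is a hypothetical placeholder, and the `from_pretrained` return signature varies a little across sae_lens versions.

```python
import torch
from transformer_lens import HookedTransformer
from sae_lens import SAE

model = HookedTransformer.from_pretrained("gpt2")

# A residual-stream SAE for GPT-2 Small from the SAELens pretrained directory.
sae, cfg_dict, sparsity = SAE.from_pretrained(
    release="gpt2-small-res-jb", sae_id="blocks.7.hook_resid_pre"
)

FEATURE_IDX = 123  # hypothetical: whatever feature you found on Neuronpedia

prompt = "The Eiffel Tower is in Paris"
_, cache = model.run_with_cache(prompt)
acts = cache[sae.cfg.hook_name]   # [batch, pos, d_model]
feature_acts = sae.encode(acts)   # [batch, pos, d_sae]

# Check the feature fires where Neuronpedia says it should.
for tok, act in zip(model.to_str_tokens(prompt), feature_acts[0, :, FEATURE_IDX]):
    print(f"{tok!r}: {act.item():.3f}")
```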
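
And for the train-an-SAE-from-scratch task, a minimal sketch of a vanilla ReLU SAE with an L1 sparsity penalty, trained here on random stand-in data (you would feed it toy-model activations instead). Standard refinements like decoder-column normalisation and resampling dead latents are omitted.

```python
import torch
import torch.nn as nn

class ToySAE(nn.Module):
    """Vanilla ReLU sparse autoencoder with a tied decoder bias."""
    def __init__(self, d_in: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_in, d_sae) * 0.1)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_in) * 0.1)
        self.b_dec = nn.Parameter(torch.zeros(d_in))

    def forward(self, x):
        f = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)  # latents
        x_hat = f @ self.W_dec + self.b_dec                         # reconstruction
        return x_hat, f

sae = ToySAE(d_in=64, d_sae=256)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coef = 1e-3  # sparsity vs reconstruction trade-off, needs tuning

for step in range(1000):
    x = torch.randn(512, 64)  # replace with activations from your toy model
    x_hat, f = sae(x)
    loss = ((x_hat - x) ** 2).mean() + l1_coef * f.abs().sum(-1).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```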

More ambitious

| Task | Names |
|---|---|
| Train an SAE on a language model (e.g. gelu-1l), writing the SAE from scratch (i.e. replicate Towards Monosemanticity). | Uzay, Chuqiao |
| Write your own JumpReLU SAE implementation and train it on a model (gelu-1l should be easiest; an early layer of a larger model should be doable and more interesting - can you get comparable performance to Gemma Scope?!) - see the JumpReLU sketch below this table. | |
| Find a circuit where the nodes are SAE features (I recommend approaching this like the excellent Sparse Feature Circuits paper, though you can simplify, e.g. only use attribution patching rather than integrated gradients, and just find nodes rather than nodes + edges, i.e. assume the circuit is the complete subgraph). | Emil (I would be happy to also just do a deep dive on the Sparse Feature Circuits paper), Oli, Victor, Anna, Siyu, Isaiah, Lily |
| Replicate the Chinchilla Multiple Choice Circuit on Gemma 2 9B with attribution patching. Can you find important features? | Oli, Jim |
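
A minimal sketch of the JumpReLU forward pass: latents keep their full pre-activation wherever it clears a learned per-latent threshold and are zero elsewhere. Training the threshold requires a straight-through estimator (see the JumpReLU / Gemma Scope papers), which is omitted here.

```python
import torch
import torch.nn as nn

class JumpReLUSAE(nn.Module):
    def __init__(self, d_in: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_in, d_sae) * 0.1)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_in) * 0.1)
        self.b_dec = nn.Parameter(torch.zeros(d_in))
        # Per-latent threshold, parameterised in log space to stay positive.
        self.log_threshold = nn.Parameter(torch.full((d_sae,), -2.0))

    def forward(self, x):
        pre = x @ self.W_enc + self.b_enc
        theta = self.log_threshold.exp()
        # JumpReLU: no shrinkage, just a hard gate at the threshold.
        f = pre * (pre > theta)
        x_hat = f @ self.W_dec + self.b_dec
        return x_hat, f
```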

Feel free to add your own!

| Task | Names |
|---|---|
| Shivam (@jager.bomber on Discord): Toy models of crosscoders for model diffing. Take two toy model setups with different configurations (same #features and feature directions, different #features and feature directions, etc.). Train a crosscoder to reconstruct model 2's latents from model 1's latents. Could be a testbed for many interesting experiments. | |
| Shivam: Train a small transformer on a reasonably sized dataset (e.g. a 2-4 layer transformer on TinyStories or an equivalent dataset). Train a cross-layer transcoder, test out variants like BatchTopK or replicate some results from the recent Anthropic paper, and open-source the codebase. | Oli (https://github.com/oli-clive-griffin/model-diffing) |
| Oli: I'd be interested in trying to replicate APD (Attribution-based Parameter Decomposition). | Jake, Anna, Siyu |
| Emil: 1-day trial sprint - train a linear probe for deception/deceptive alignment on Llama 405B (using nnsight, presumably?). 1. Make the model deceptively aligned following the alignment faking paper. 2. Train a linear probe to find this deceptive alignment (idea needs some refining/clarification). 3. Can we use this probe to find deceptive alignment OOD? I'd be very excited about that. Idea graciously stolen from Neel. See the probe sketch below this table. | Shivam, Anna, Siyu |
| Tim: Go through at least parts of the PPO section in ARENA, and maybe do some sort of replication of GRPO (see e.g. this Colab notebook, this video). | Isaiah |
| Tim: Implement various architectural improvements from DeepSeek-V3 (e.g. MLA). | |
| Jasmine: Work through a guide on modifying LLM architectures to understand how the various components function. | (Shivam) Found a Llama 3.2 from-scratch tutorial, might be helpful |
| Jasmine: Write some cleaner functions to implement the logit lens - see the logit lens sketch below this table. | |
| Victor: Explore data distributions in the residual stream of large models (norms, cosine similarities, clusters) across properties (model size, layers, checkpoints, code vs non-code) - see the residual-stream stats sketch below. | Santi |
| Bartosz: Fine-tune a small model (GPT-2?) using LoRA on some well-defined task. Then analyse what changed in the model during fine-tuning (e.g. by training a crosscoder, or just by patching). | Jim |
| Jim: Train unsupervised steering vectors on a small/medium reasoning model, e.g. R1-Distilled-Qwen-14B, and look for reasoning-related vectors (e.g. a backtracking vector). I did this with R1-Distilled-Qwen-1.5B in my application, but the code had a bug and I want to redo it with a somewhat larger model. (Another thing I want to test is whether some steering vectors only have a large effect at the end of a thought; this seems intuitive to me, because my theory is that after completing a thought the model decides whether to continue this line of thought or backtrack.) | Jim, Emil, Anton (I actually did something similar for my application and found this bizarre vector which seems to suppress/enhance reasoning in R1-Llama-8B depending on the strength of the hook - would be curious to explore this further together), Siyu, Shivam, wei jie |
| Anna: Replicate the emergent misalignment work on one of the smaller models (probably Qwen2.5-Coder-32B-Instruct) and attempt model diffing or stage-wise model diffing to compare the base and fine-tuned models. | Jacob, Emil, Ed |
| Jacob: Better model diffing - can we identify circuits (rather than just features) which are exclusive to the original or fine-tuned model? We could study this by patching activations from one model into the other, e.g. use AtP*, but instead of the corrupt activations being taken from a different prompt, they are taken from a different model on the same prompt. | Ida, Ed |
| Ida: Replicate part of the faithfulness of chain-of-thought reasoning work and use TransformerLens to patch activations step-by-step between correct and incorrect completions on short reasoning tasks with single-token answers (e.g. GSM8K), checking whether any specific CoT step is important for wrong answers. | Jacob, Jasmine |
| Denis (don't hesitate to DM!): Follow up on the planning-in-poems findings from the Biology paper. First, establish which models can even produce rhymed lines; second, simplify the setup by trying to find a steering vector that induces a specific ending of a line; finally, apply SAEs or transcoders if viable. | Denis |
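
For the logit lens item above, a minimal sketch using TransformerLens's accumulated residual stream: apply the final LayerNorm and the unembedding to the residual stream after each block and read off the top prediction at the last position.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The capital of France is")

with torch.no_grad():
    _, cache = model.run_with_cache(tokens)

# Residual stream after each block, with the final LayerNorm folded in so
# multiplying by the unembedding is meaningful (the standard logit lens).
resid, labels = cache.accumulated_resid(apply_ln=True, return_labels=True)
logits = resid[:, 0, -1, :] @ model.W_U  # one row per layer, at last position

for label, layer_logits in zip(labels, logits):
    print(f"{label}: {model.to_string(layer_logits.argmax())!r}")
```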
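
For Emil's probing idea, the hard parts are the deceptive-alignment dataset and running Llama 405B; the probe mechanics themselves are simple. A minimal sketch on GPT-2 with hypothetical stand-in prompts and labels:

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
hook_name = "blocks.6.hook_resid_post"  # illustrative layer choice

# Hypothetical stand-in data: in the real project these would be prompts
# labelled for deceptive vs honest behaviour.
prompts = ["I will tell the truth here.", "I will hide my real goal here."]
labels = np.array([0, 1])

feats = []
with torch.no_grad():
    for p in prompts:
        _, cache = model.run_with_cache(p)
        feats.append(cache[hook_name][0, -1].numpy())  # last-token residual

probe = LogisticRegression(max_iter=1000).fit(np.stack(feats), labels)
print("train accuracy:", probe.score(np.stack(feats), labels))
```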
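
And for Victor's residual-stream distributions idea, a minimal sketch of the kind of per-layer statistics involved (norms and pairwise cosine similarities between positions), run on GPT-2 Small for illustration:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
with torch.no_grad():
    _, cache = model.run_with_cache("The quick brown fox jumps over the lazy dog")

for layer in range(model.cfg.n_layers):
    resid = cache[f"blocks.{layer}.hook_resid_post"][0]  # [pos, d_model]
    norms = resid.norm(dim=-1)
    unit = resid / norms[:, None]
    cossims = unit @ unit.T  # pairwise cosine similarity between positions
    off_diag = cossims[~torch.eye(len(cossims), dtype=torch.bool)]
    print(f"layer {layer}: mean norm {norms.mean():.1f}, "
          f"mean pairwise cossim {off_diag.mean():.3f}")
```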