1 of 10

Minor sparse coding update

The updated work presented here was carried out primarily in

January and February 2023.

Lee Sharkey, Dan Braun, Beren Millidge

2 of 10

Toy data experimental setup (recap)

  • Train sparse autoencoders on toy data
  • Ground truth features are known
  • Compare features learned by autoencoders with ground truth features
  • Train many autoencoders with different L1 coefficients and dictionary sizes

[Diagram: data → dictionary → reconstruction]
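
Below is a minimal sketch of this setup, assuming a single-hidden-layer autoencoder with a ReLU encoder, untied encoder/decoder weights, and an L1 penalty on the codes; layer sizes, optimizer, and sweep values are illustrative placeholders rather than the exact configuration used here.

```python
# Minimal sparse autoencoder sketch (hyperparameter values are placeholders).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, input_dim: int, dict_size: int):
        super().__init__()
        self.encoder = nn.Linear(input_dim, dict_size)
        self.decoder = nn.Linear(dict_size, input_dim, bias=False)

    def forward(self, x):
        codes = torch.relu(self.encoder(x))   # sparse feature activations
        recon = self.decoder(codes)           # decoder weights act as the dictionary
        return recon, codes

def train_step(sae, x, optimizer, l1_coeff):
    recon, codes = sae(x)
    recon_loss = ((recon - x) ** 2).mean()    # reconstruction term
    sparsity_loss = codes.abs().mean()        # L1 sparsity term
    loss = recon_loss + l1_coeff * sparsity_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Grid of runs over L1 coefficients and dictionary sizes (placeholder values).
input_dim = 64
for l1_coeff in [1e-4, 3e-4, 1e-3, 3e-3]:
    for dict_ratio in [1, 2, 4, 8]:
        sae = SparseAutoencoder(input_dim, dict_ratio * input_dim)
        optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)
        # ... iterate train_step over batches of toy data or LM activations ...
```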

3 of 10

Toy data experimental setup (recap)

Mean Max Cosine Similarity (MMCS) between learned dictionary and ground truth features:

[Plot: yellow = ground truth features recovered; pink/purple = ground truth features not recovered]
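
For reference, here is one way MMCS can be computed; this is my own sketch of the metric as described above, not necessarily the exact code used for these plots.

```python
import torch
import torch.nn.functional as F

def mmcs(learned_dict: torch.Tensor, ground_truth: torch.Tensor) -> float:
    """Mean Max Cosine Similarity: for each ground truth feature, take the max
    cosine similarity with any learned dictionary element, then average over
    ground truth features.

    learned_dict: (n_learned, d) learned dictionary directions
    ground_truth: (n_true, d) ground truth feature directions
    """
    learned = F.normalize(learned_dict, dim=1)
    truth = F.normalize(ground_truth, dim=1)
    cos = truth @ learned.T                    # (n_true, n_learned)
    return cos.max(dim=1).values.mean().item()
```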

4 of 10

Previously…

Interim report showed substantial differences between toy data and language model data

1: Number of dead neurons

2: Loss stickiness

3: Similarity of learned dictionaries with larger dictionaries

[Plots: each metric compared between toy data and language model data]
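
A sketch of how the dead-neuron count can be measured, assuming "dead" means a dictionary element that never activates above a threshold anywhere in the dataset; the report's exact criterion may differ.

```python
import torch

@torch.no_grad()
def count_dead_neurons(sae, batches, threshold: float = 0.0) -> int:
    """Count dictionary elements whose activation never exceeds `threshold`
    over the dataset (simple definition; other criteria are possible)."""
    max_act = None
    for x in batches:                          # batches of input activations
        _, codes = sae(x)                      # codes: (batch, dict_size)
        batch_max = codes.max(dim=0).values
        max_act = batch_max if max_act is None else torch.maximum(max_act, batch_max)
    return int((max_act <= threshold).sum().item())
```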

5 of 10

A few new results

A few changes compared with interim report

  • Toy data
    • Now 8x as many ground truth features as data dimensions instead of only 2x
      • Now 512 ground truth features
      • Will explore higher multiplication factors when I can devote more time to this project
    • Power law decay instead of exponential decay (see the sketch after this list)
  • Language model
    • Smaller: Residual dimension = 16 instead of 256
    • (Still 6 layers)
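
A sketch of toy data generation under these changes, assuming the power law applies to each ground truth feature's activation probability; the exponent, base probability, and magnitude distribution below are illustrative assumptions, not the report's exact values.

```python
import torch
import torch.nn.functional as F

def sample_toy_data(n_samples: int, n_features: int = 512, data_dim: int = 64,
                    alpha: float = 1.1, base_p: float = 0.05):
    """Each sample is a sparse linear combination of ground truth feature
    directions, with per-feature activation probabilities decaying as a
    power law."""
    # 512 random unit-norm ground truth features in a 64-d space (8x overcomplete).
    features = F.normalize(torch.randn(n_features, data_dim), dim=1)
    # Power-law decay of activation probability over feature index.
    probs = base_p * torch.arange(1, n_features + 1).float() ** (-alpha)
    active = torch.rand(n_samples, n_features) < probs            # sparse on/off mask
    coeffs = active.float() * torch.rand(n_samples, n_features)   # random magnitudes
    return coeffs @ features, features                            # data, ground truth dict
```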

6 of 10

A small update (Toy data)

Mean Max Cosine Similarity (MMCS) between learned dictionary and ground truth features:

[Plot: yellow = ground truth features recovered; pink/purple = ground truth features not recovered]

7 of 10

New results

1: Number of dead neurons

2: Loss stickiness (deprecated)

3: Similarity of learned dictionaries with larger dictionaries

[Plots: each metric shown for toy data and language model data]

8 of 10

What are the features?

Dataset examples that maximally activate particular dictionary elements

Example: Feature 324
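
A sketch of how such max-activating dataset examples can be collected for a given dictionary element; the function and argument names are my own, not the interface mentioned on the next slide.

```python
import torch

@torch.no_grad()
def top_activating_examples(sae, activations, tokens, feature_idx: int, k: int = 10):
    """Return the k token positions whose sparse code most strongly activates
    dictionary element `feature_idx` (e.g. feature 324).

    activations: (n_positions, d) LM activations being decomposed
    tokens: list of n_positions token strings aligned with `activations`
    """
    _, codes = sae(activations)                # codes: (n_positions, dict_size)
    scores = codes[:, feature_idx]
    top = torch.topk(scores, k)
    return [(tokens[i], scores[i].item()) for i in top.indices.tolist()]
```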

9 of 10

What are the features? (Random sample)

  • 721: LikeLiked (a word from copypasting chat forum/reddit text?)
  • 782: Mostly proper nouns for people (names)
  • 792: Patients/individuals in a medical context
  • 805: A colon after the word ‘question’, spelt in any way
  • 811: Polysemantic. Zero score/upvotes. OR the word ‘my’
  • 849: Units of time or distance
  • 869: Another ‘score’ token, but not associated with upvotes: scientific ‘score’s or musical ‘score’s
  • 890: Polysemantic?
  • 897: Tokens that cap the end off, like closing brackets but not brackets exactly.
  • 903: Polysemantic?
  • 924: Polysemantic?
  • 947: The token “‘t” as in contracted negation.
  • 984: The tokens “Today” and “Now”. But also “Yes” and “yes”
  • 60: Greek letters
  • 62: Full stop after “blogger” in URLs
  • 73: Some kind of quantifiers/modifier-like thing?
  • 240: August and march (but also "section")
  • 324: Commas after numbers
  • 470: Either of the full stops in the word "U.S."
  • 541: (Kind of) scientific nouns?
  • 570: The 9 in 19xx dates
  • 579: Decimal points in numbers in inequalities
  • 580: Broken/dead neuron? Unclear.
  • 583: Appears polysemantic for “Eqs” (i.e. equations) and brain words (medull- and mening-)
  • 591: Period after Mr, Mrs, Dr
  • 593: Polysemantic? Mostly capitalized.
  • 620: The word ‘parking’ and token ‘Im’
  • 621: Small units of distance (cm, mm) or the word ‘minimum’.
  • 657: The words ‘score’ and ‘upvotes’. Some notion of ‘points’ that are won?
  • 677: The token ‘rational’ and the concept of ‘terrorist’ (both ‘terror’ and ‘error’ are highlighted)
  • 696: Hyphens between words in a list

(See interface for more dataset examples)

10 of 10

Next steps

  • Repeat for larger LMs
  • More comprehensive toy data investigations
    • Increase number of ground truth features to more than 8 x data dimension
  • Get even more consistent features on LM data
    • Train on even more data; combine similar features across different autoencoders.
    • Demonstrate greater monosemanticity than either a) neurons or b) random directions
  • Advance beyond grid search
    • Find optimal L1 coefficient and dict size parameters for any LM using Bayesian optimization
  • Model editing
    • Causal tracing a la ROME to select features
    • Less-than-rank-one edits (silence, activate, or swap features; see the sketch at the end of this list)
  • Speeding up learning
    • Better initialization (DONE by Pierre Peigné at SERI-MATS)
  • Evaluate feasibility for future models
    • Identify scaling law for num_features with respect to data dimension
  • Analyse features
    • What is the distribution of features?
      • Power law? What is the correlation structure? What magnitudes?
    • Automated labelling?
    • Semantic geometry?
    • Validation
      • Less-than-rank-one model editing?
      • Automated causal scrubbing with features instead of neurons
  • Circuit construction
    • Connect features to weights, automatically identify all the circuits
    • Extract only the circuits for a particular bounded task.
    • How does RLHF change circuits?
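
As flagged in the "less-than-rank-one edits" item above, here is an illustrative sketch of what such an edit could look like: encode activations into sparse feature codes, silence or overwrite selected features, and decode. It assumes the SparseAutoencoder sketch from earlier and is not the planned implementation.

```python
import torch

@torch.no_grad()
def edit_features(sae, x, silence=(), activate=None):
    """'Less-than-rank-one' edit sketch: modify selected sparse feature codes
    and reconstruct the activations from the edited codes."""
    _, codes = sae(x)
    for idx in silence:
        codes[:, idx] = 0.0                    # silence a feature
    for idx, value in (activate or {}).items():
        codes[:, idx] = value                  # activate / overwrite a feature
    return sae.decoder(codes)                  # edited reconstruction of x

# Example: silence feature 324 and pin feature 591 to a fixed activation.
# edited = edit_features(sae, acts, silence=[324], activate={591: 5.0})
```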