1 of 10

Minor sparse coding update

The updated work presented here was carried out primarily in

January and February 2023.

Lee Sharkey, Dan Braun, Beren Millidge

2 of 10

Toy data experimental setup (recap)

  • Train sparse autoencoders on toy data
  • Ground truth features are known
  • Compare features learned by autoencoders with ground truth features
  • Train many autoencoders with different L1 coefficients and dictionary sizes

[Diagram: data → dictionary → reconstruction]
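
Below is a minimal sketch of this setup, assuming a single-hidden-layer autoencoder with a ReLU encoder, untied encoder/decoder weights, and an L1 penalty on the codes; layer sizes, optimizer, and sweep values are illustrative placeholders rather than the exact configuration used here.

```python
# Minimal sparse autoencoder sketch (hyperparameter values are placeholders).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, input_dim: int, dict_size: int):
        super().__init__()
        self.encoder = nn.Linear(input_dim, dict_size)
        self.decoder = nn.Linear(dict_size, input_dim, bias=False)

    def forward(self, x):
        codes = torch.relu(self.encoder(x))   # sparse feature activations
        recon = self.decoder(codes)           # decoder weights act as the dictionary
        return recon, codes

def train_step(sae, x, optimizer, l1_coeff):
    recon, codes = sae(x)
    recon_loss = ((recon - x) ** 2).mean()    # reconstruction term
    sparsity_loss = codes.abs().mean()        # L1 sparsity term
    loss = recon_loss + l1_coeff * sparsity_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Grid of runs over L1 coefficients and dictionary sizes (placeholder values).
input_dim = 64
for l1_coeff in [1e-4, 3e-4, 1e-3, 3e-3]:
    for dict_ratio in [1, 2, 4, 8]:
        sae = SparseAutoencoder(input_dim, dict_ratio * input_dim)
        optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)
        # ... iterate train_step over batches of toy data or LM activations ...
```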

3 of 10

Toy data experimental setup (recap)

Mean Max Cosine Similarity (MMCS) between learned dictionary and ground truth features:

[Plot: yellow = ground truth features recovered; pink/purple = ground truth features not recovered]
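
For reference, here is one way MMCS can be computed; this is my own sketch of the metric as described above, not necessarily the exact code used for these plots.

```python
import torch
import torch.nn.functional as F

def mmcs(learned_dict: torch.Tensor, ground_truth: torch.Tensor) -> float:
    """Mean Max Cosine Similarity: for each ground truth feature, take the max
    cosine similarity with any learned dictionary element, then average over
    ground truth features.

    learned_dict: (n_learned, d) learned dictionary directions
    ground_truth: (n_true, d) ground truth feature directions
    """
    learned = F.normalize(learned_dict, dim=1)
    truth = F.normalize(ground_truth, dim=1)
    cos = truth @ learned.T                    # (n_true, n_learned)
    return cos.max(dim=1).values.mean().item()
```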

4 of 10

Previously…

Interim report showed substantial differences between toy data and language model data

1: Number of dead neurons

2: Loss stickiness

3: Similarity of learned dictionaries with larger dictionaries

[Plots: each metric compared between toy data and language model data]
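
A sketch of how the dead-neuron count can be measured, assuming "dead" means a dictionary element that never activates above a threshold anywhere in the dataset; the report's exact criterion may differ.

```python
import torch

@torch.no_grad()
def count_dead_neurons(sae, batches, threshold: float = 0.0) -> int:
    """Count dictionary elements whose activation never exceeds `threshold`
    over the dataset (simple definition; other criteria are possible)."""
    max_act = None
    for x in batches:                          # batches of input activations
        _, codes = sae(x)                      # codes: (batch, dict_size)
        batch_max = codes.max(dim=0).values
        max_act = batch_max if max_act is None else torch.maximum(max_act, batch_max)
    return int((max_act <= threshold).sum().item())
```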

5 of 10

A few new results

A few changes compared with interim report

  • Toy data
    • Now 8x as many ground truth features as data dimensions instead of only 2x
      • Now 512 ground truth features
      • Will explore higher multiplication factors when I can devote more time to this project
    • Power law decay instead of exponential decay (see the sketch after this list)
  • Language model
    • Smaller: Residual dimension = 16 instead of 256
    • (Still 6 layers)
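
A sketch of toy data generation under these changes, assuming the power law applies to each ground truth feature's activation probability; the exponent, base probability, and magnitude distribution below are illustrative assumptions, not the report's exact values.

```python
import torch
import torch.nn.functional as F

def sample_toy_data(n_samples: int, n_features: int = 512, data_dim: int = 64,
                    alpha: float = 1.1, base_p: float = 0.05):
    """Each sample is a sparse linear combination of ground truth feature
    directions, with per-feature activation probabilities decaying as a
    power law."""
    # 512 random unit-norm ground truth features in a 64-d space (8x overcomplete).
    features = F.normalize(torch.randn(n_features, data_dim), dim=1)
    # Power-law decay of activation probability over feature index.
    probs = base_p * torch.arange(1, n_features + 1).float() ** (-alpha)
    active = torch.rand(n_samples, n_features) < probs            # sparse on/off mask
    coeffs = active.float() * torch.rand(n_samples, n_features)   # random magnitudes
    return coeffs @ features, features                            # data, ground truth dict
```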

6 of 10

A small update (Toy data)

Mean Max Cosine Similarity (MMCS) between learned dictionary and ground truth features:

[Plot: yellow = ground truth features recovered; pink/purple = ground truth features not recovered]

7 of 10

New results

1: Number of dead neurons

2: Loss stickiness (deprecated)

3: Similarity of learned dictionaries with larger dictionaries

[Plots: each metric shown for toy data and language model data]

8 of 10

What are the features?

Dataset examples that maximally activate particular dictionary elements

Example: Feature 324
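
A sketch of how such max-activating dataset examples can be collected for a given dictionary element; the function and argument names are my own, not the interface mentioned on the next slide.

```python
import torch

@torch.no_grad()
def top_activating_examples(sae, activations, tokens, feature_idx: int, k: int = 10):
    """Return the k token positions whose sparse code most strongly activates
    dictionary element `feature_idx` (e.g. feature 324).

    activations: (n_positions, d) LM activations being decomposed
    tokens: list of n_positions token strings aligned with `activations`
    """
    _, codes = sae(activations)                # codes: (n_positions, dict_size)
    scores = codes[:, feature_idx]
    top = torch.topk(scores, k)
    return [(tokens[i], scores[i].item()) for i in top.indices.tolist()]
```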

9 of 10

What are the features? (Random sample)

  • 721: LikeLiked (a word from copypasting chat forum/reddit text?)
  • 782: Mostly proper nouns for people (names)
  • 792: Patients/individuals in a medical context
  • 805: A colon after the word ‘question’, spelt in any way
  • 811: Polysemantic. Zero score/upvotes. OR the word ‘my’
  • 849: Units of time or distance
  • 869: Another ‘score’ token, but not associated with upvotes: scientific ‘score’s or musical ‘score’s
  • 890: Polysemantic?
  • 897: Tokens that cap the end off, like closing brackets but not brackets exactly.
  • 903: Polysemantic?
  • 924: Polysemantic?
  • 947: The token “‘t” as in contracted negation.
  • 984: The tokens “Today” and “Now”. But also “Yes” and “yes”
  • 60: Greek letters
  • 62: Full stop after “blogger” in URLs
  • 73: Some kind of quantifiers/modifier-like thing?
  • 240: August and march (but also "section")
  • 324: Commas after numbers
  • 470: Either of the full stops in the word "U.S."
  • 541: (Kind of) scientific nouns?
  • 570: The 9 in 19xx dates
  • 579: Decimal points in numbers in inequalities
  • 580: Broken/dead neuron? Unclear.
  • 583: Appears polysemantic for “Eqs” (i.e. equations) and brain words (medull- and mening-)
  • 591: Period after Mr, Mrs, Dr
  • 593: Polysemantic? Mostly capitalized.
  • 620: The word ‘parking’ and token ‘Im’
  • 621: Small units of distance (cm, mm) or the word ‘minimum’.
  • 657: The words ‘score’ and ‘upvotes’. Some notion of ‘points’ that are won?
  • 677: The token ‘rational’ and the concept of ‘terrorist’ (both ‘terror’ and ‘error’ are highlighted)
  • 696: Hyphens between words in a list

(See interface for more dataset examples)

10 of 10

Next steps

  • Repeat for larger LMs
  • More comprehensive toy data investigations
    • Increase number of ground truth features to more than 8 x data dimension
  • Get even more consistent features on LM data
    • Train on even more data; combine similar features across different autoencoders.
    • Demonstrate greater monosemanticity than either a) neurons or b) random directions
  • Advance beyond grid search
    • Find optimal L1 coefficient and dict size parameters for any LM using Bayesian optimization
  • Model editing
    • Causal tracing a la ROME to select features
    • Less-than-rank-one edits (silence, activate, or swap features; see the sketch at the end of this list)
  • Speeding up learning
    • Better initialization (DONE by Pierre Peigné at SERI-MATS)
  • Evaluate feasibility for future models
    • Identify scaling law for num_features with respect to data dimension
  • Analyse features
    • What is the distribution of features?
      • Power law? What is the correlation structure? What magnitudes?
    • Automated labelling?
    • Semantic geometry?
    • Validation
      • Less-than-rank-one model editing?
      • Automated causal scrubbing with features instead of neurons
  • Circuit construction
    • Connect features to weights, automatically identify all the circuits
    • Extract only the circuits for a particular bounded task.
    • How does RLHF change circuits?
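
As flagged in the "less-than-rank-one edits" item above, here is an illustrative sketch of what such an edit could look like: encode activations into sparse feature codes, silence or overwrite selected features, and decode. It assumes the SparseAutoencoder sketch from earlier and is not the planned implementation.

```python
import torch

@torch.no_grad()
def edit_features(sae, x, silence=(), activate=None):
    """'Less-than-rank-one' edit sketch: modify selected sparse feature codes
    and reconstruct the activations from the edited codes."""
    _, codes = sae(x)
    for idx in silence:
        codes[:, idx] = 0.0                    # silence a feature
    for idx, value in (activate or {}).items():
        codes[:, idx] = value                  # activate / overwrite a feature
    return sae.decoder(codes)                  # edited reconstruction of x

# Example: silence feature 324 and pin feature 591 to a fixed activation.
# edited = edit_features(sae, acts, silence=[324], activate={591: 5.0})
```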