Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon

USVSN Sai Prashanth^🍀, Alvin Deng^🍀, Kyle O'Brien^🍀, Jyothir S V^🍀,

Mohammad Aflah Khan, Jaydeep Borkar, Christopher A. Choquette-Choo, Jacob Ray Fuehne,

Stella Biderman, Tracy Ke^🌿, Katherine Lee^🌿, Naomi Saphra^🌿

Our Taxonomy

Analysis Across Scale

Taxonomy Validation

Which Properties Lead to Memorization?

Analysis Across Training Time

Recitation: Highly duplicated sequences

Reconstruction: Sequences with trivial continuations

Recollection: Memorized sequences that neither of the other two categories explains
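
The three categories above can be read as a decision rule applied to a sequence that has already been flagged as memorized. The following is only a minimal sketch of that reading, not the authors' implementation. The duplicate cutoff `DUP_THRESHOLD`, the helpers `is_repeating` and `is_incrementing`, the order in which the checks are applied, and the function name `categorize` are all assumptions introduced here for illustration.

```python
from typing import Sequence

# Hypothetical cutoff for "highly duplicated"; the poster does not state one.
DUP_THRESHOLD = 5


def is_repeating(tokens: Sequence[str], max_period: int = 4) -> bool:
    """Return True if tokens are a short unit (length <= max_period) repeated at least twice."""
    n = len(tokens)
    for period in range(1, max_period + 1):
        if n >= 2 * period and all(tokens[i] == tokens[i % period] for i in range(n)):
            return True
    return False


def is_incrementing(tokens: Sequence[str]) -> bool:
    """Return True if the numeric tokens form an arithmetic progression with nonzero step."""
    nums = [int(t) for t in tokens if t.isascii() and t.isdigit()]
    if len(nums) < 3:
        return False
    step = nums[1] - nums[0]
    return step != 0 and all(b - a == step for a, b in zip(nums, nums[1:]))


def categorize(tokens: Sequence[str], duplicate_count: int) -> str:
    """Assign a memorized sequence to one taxonomy category (assumed check order)."""
    # Recitation: the sequence appears many times in the training data.
    if duplicate_count > DUP_THRESHOLD:
        return "recitation"
    # Reconstruction: the continuation follows a trivial pattern (simple heuristics here).
    if is_repeating(tokens) or is_incrementing(tokens):
        return "reconstruction"
    # Recollection: everything the two categories above do not explain.
    return "recollection"
```

For example, under these assumptions `categorize(["a", "b", "a", "b", "a", "b"], 1)` falls under reconstruction, while a rare, non-templated sequence such as `categorize(["the", "quick", "brown", "fox"], 1)` falls under recollection.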

Takeaway:

Larger models tend to memorize rarer text that can’t be reconstructed.

Takeaway:

The increase in memorization is not explained by duplication alone.

Is our taxonomy useful for predicting memorization?

Takeaway:

Predicting memorization with our taxonomy outperforms a baseline that treats memorization as a single homogeneous phenomenon.