Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon
USVSN Sai Prashanth, Alvin Deng, Kyle O'Brien, Jyothir S V,
Mohammad Aflah Khan, Jaydeep Borkar, Christopher A. Choquette-Choo, Jacob Ray Fuehne,
Stella Biderman, Tracy Ke, Katherine Lee, Naomi Saphra
Our Taxonomy
Analysis Across Scale
Taxonomy Validation
Which Properties Lead to Memorization?
Analysis Across Training Time
Recitation: Highly-duplicated sequences
Reconstruction: Sequences with trivial continuations
Recollection: Memories which can't be explained by other categories
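The three categories above can be read as a simple decision rule. The following is a minimal sketch of that reading, not the authors' implementation. Everything named here is an assumption made for illustration: the `dup_threshold` cutoff, the choice to check duplication before trivial continuations, and the two heuristics (`is_repeating`, `is_incrementing`) used as stand-ins for "trivial continuations".

```python
def is_repeating(tokens):
    """Return True if the sequence is a shorter unit repeated at least twice.

    Hypothetical heuristic for a "trivial continuation".
    """
    n = len(tokens)
    for period in range(1, n // 2 + 1):
        if all(tokens[i] == tokens[i % period] for i in range(n)):
            return True
    return False


def is_incrementing(tokens):
    """Return True if every token is an integer and the values form an
    arithmetic progression with a nonzero step.

    Hypothetical heuristic for a "trivial continuation".
    """
    try:
        nums = [int(t) for t in tokens]
    except (ValueError, TypeError):
        return False
    if len(nums) < 3:
        return False
    step = nums[1] - nums[0]
    return step != 0 and all(b - a == step for a, b in zip(nums, nums[1:]))


def classify(tokens, duplicate_count, dup_threshold=5):
    """Assign a memorized sequence to one taxonomy category.

    dup_threshold is an assumed cutoff for "highly duplicated";
    the order of the checks is also an assumption.
    """
    if duplicate_count > dup_threshold:
        return "recitation"
    if is_repeating(tokens) or is_incrementing(tokens):
        return "reconstruction"
    # Anything the other two categories cannot explain.
    return "recollection"
```

For example, under these assumptions `classify(["a", "b", "a", "b"], duplicate_count=1)` returns `"reconstruction"`, while a non-templated sequence seen only once falls through to `"recollection"`.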
Takeaway:
Larger models tend to memorize rarer text that can't be reconstructed.
Takeaway:
Increase in memorization isn't solely explained by duplication.
Is our taxonomy useful for predicting memorization?
Takeaway:
Leveraging our taxonomy for memorization classification outperforms a baseline that treats memorization as a homogeneous phenomenon.