Off-Topic Memento Toolkit (OTMT) to Identify Topical Outliers in Web Archive Collections
Shawn M. Jones
Martin Klein
Michele C. Weigle
Michael L. Nelson
smjones@lanl.gov
mklein@lanl.gov
mweigle@cs.odu.edu
mln@cs.odu.edu
@shawnmjones
@mart1nkle1n
@weiglemc
@phonedude_mln
Pages in web archive collections go off-topic for many reasons
Technical problems
Page gone
Hacking
Moving on from topic
Of 8 similarity measures, word count worked best
The OTMT detects off-topic mementos by processing TimeMaps
First memento
Considered memento
The OTMT’s general algorithm
For each TimeMap in a collection:
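The poster lists only the loop header, so the following is a minimal sketch of how such a loop might look, not the OTMT's actual implementation. It assumes each TimeMap is already available as a list of memento texts in capture order, compares every considered memento against the first memento using word count (the best-performing measure reported here), and flags a memento when its word count falls sharply. The names `find_off_topic` and `process_collection`, the data layout, and the threshold value are all illustrative assumptions.

```python
import re


def word_count(text):
    """Count runs of word characters in a memento's extracted text."""
    return len(re.findall(r"\w+", text))


def find_off_topic(timemap, threshold=-0.85):
    """Return the indices of mementos flagged as off-topic.

    timemap:   list of memento texts in capture order (hypothetical layout)
    threshold: assumed cutoff for the relative change in word count;
               illustrative only, not necessarily the value the OTMT uses
    """
    if not timemap:
        return []
    first_count = word_count(timemap[0])
    if first_count == 0:
        # Nothing to compare against, so flag nothing.
        return []
    off_topic = []
    for index, text in enumerate(timemap[1:], start=1):
        # Relative change of the considered memento versus the first memento.
        change = (word_count(text) - first_count) / first_count
        if change < threshold:
            off_topic.append(index)
    return off_topic


def process_collection(collection):
    """Apply the check to every TimeMap, keyed by seed URI."""
    return {uri: find_off_topic(texts) for uri, texts in collection.items()}
```

Flagged mementos are only reported, never removed, which matches the goal of identifying them for further decision-making.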
Archivists create web archive collections with a topic in mind
The sheer number of mementos to process means that researchers will need to quickly identify mementos with low information value.
Off-topic mementos have low information value.
We want to identify these mementos, not delete them, so they remain available for further decision-making.
Identifying them lets us exclude them from selection as exemplars for storytelling.
Archivists select seeds for their collection.
Some collections have thousands of seeds.
Each capture of a seed becomes a memento.
For more information
OTMT and Hypercane are part of the Dark and Stormy Archives Project
OTMT is best run via Hypercane:
Our thanks to IIPC and IMLS for funding the Dark and Stormy Archives Project!
Sørensen-Dice: F1 = 0.649
TF Simhash: F1 = 0.523
Raw Simhash: F1 = 0.578
Jaccard Distance: F1 = 0.651
LSI Topic Modeling: F1 = 0.711
TF-IDF Cosine Similarity: F1 = 0.766
Byte Count: F1 = 0.756
Word Count: F1 = 0.788
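To make two of the set-based measures concrete, here is a hedged sketch of Jaccard distance and a Sørensen-Dice distance computed between the first memento and a considered memento. The tokenization (lowercased runs of word characters) and the function names are assumptions for illustration; the OTMT's own preprocessing may differ.

```python
import re


def tokens(text):
    """Split text into a set of lowercased word tokens (assumed tokenization)."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard_distance(first_text, considered_text):
    """1 - |A ∩ B| / |A ∪ B|; 0.0 means identical vocabularies."""
    a, b = tokens(first_text), tokens(considered_text)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def sorensen_dice_distance(first_text, considered_text):
    """1 - 2|A ∩ B| / (|A| + |B|); 0.0 means identical vocabularies."""
    a, b = tokens(first_text), tokens(considered_text)
    if not a and not b:
        return 0.0
    return 1.0 - 2 * len(a & b) / (len(a) + len(b))
```

With both functions, a larger distance means the considered memento shares less vocabulary with the first memento, so a collection-specific cutoff would be needed to decide when a memento counts as off-topic.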