
Off-Topic Memento Toolkit (OTMT) to Identify Topical Outliers in Web Archive Collections

Shawn M. Jones

Martin Klein

Michele C. Weigle

Michael L. Nelson

smjones@lanl.gov

mklein@lanl.gov

mweigle@cs.odu.edu

mln@cs.odu.edu

@shawnmjones

@mart1nkle1n

@weiglemc

@phonedude_mln

Technical problems

Page gone

Hacking

Moving on from topic

Pages in web archive collections go off-topic for many reasons

Of 8 similarity measures, word count worked best

The OTMT detects off-topic mementos by processing TimeMaps

First memento

Considered memento

The OTMT’s general algorithm

For each TimeMap in a collection:

    Get the first memento
    Preprocess it
    For each memento in the TimeMap:
        Get the memento
        Preprocess it
        Compute the similarity to the first memento using a given measure
        Save the score
A threshold value determines if a memento is on- or off-topic
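
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the OTMT's actual code: it assumes the memento texts have already been fetched and passed in as strings, `preprocess` is a hypothetical stand-in for real boilerplate removal, the word count measure is assumed to be the relative change in word count against the first memento, and the threshold of -0.5 is chosen only for the example.

```python
import re
from typing import Callable, List

def preprocess(text: str) -> List[str]:
    """Hypothetical preprocessing: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def word_count_change(first: List[str], current: List[str]) -> float:
    """Assumed word count measure: relative change in word count versus the first memento."""
    if not first:
        return 0.0
    return (len(current) - len(first)) / len(first)

def score_timemap(
    mementos: List[str],
    measure: Callable[[List[str], List[str]], float] = word_count_change,
) -> List[float]:
    """Score every memento in one TimeMap against the first memento."""
    first = preprocess(mementos[0])
    return [measure(first, preprocess(m)) for m in mementos]

def label(scores: List[float], threshold: float) -> List[str]:
    """Apply the threshold: scores below it are treated as off-topic."""
    return ["off-topic" if s < threshold else "on-topic" for s in scores]

# One TimeMap, oldest capture first (illustrative text only)
timemap = [
    "the quick brown fox jumps over the lazy dog today",
    "the quick brown fox jumps over the lazy dog again",
    "page not found",
]
scores = score_timemap(timemap)
labels = label(scores, threshold=-0.5)
```

Because the measure is passed in as a function, any of the other similarity measures could be swapped in without changing the loop; only the direction and value of the threshold would need to change.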

Archivists create web archive collections with a topic in mind

The sheer number of mementos to process means that researchers will need to quickly identify mementos with low information value.

Off-topic mementos have low information value.

We want to identify, not delete, these for further decision-making.

Identifying them lets us exclude them when selecting exemplars for storytelling.

Each capture of a seed becomes a memento

Some collections have thousands of seeds.

Archivists select seeds for their collection.

For more information

OTMT and Hypercane are part of the Dark and Stormy Archives Project

https://oduwsdl.github.io/dsa/

OTMT Preprint:

https://arxiv.org/abs/1806.06870

OTMT GitHub:

https://github.com/oduwsdl/off-topic-memento-toolkit

OTMT is best run via Hypercane:

https://oduwsdl.github.io/hypercane/

Our thanks to IIPC and IMLS for funding the Dark and Stormy Archives Project!

Sørensen-Dice: F₁ = 0.649

TF Simhash: F₁ = 0.523

Raw Simhash: F₁ = 0.578

Jaccard Distance: F₁ = 0.651

LSI Topic Modeling: F₁ = 0.711

TF-IDF Cosine Similarity: F₁ = 0.766

Byte Count: F₁ = 0.756

Word Count: F₁ = 0.788
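
For readers unfamiliar with the set-based measures listed here, the sketch below shows the textbook definitions of Jaccard distance and a Sørensen-Dice distance over sets of word tokens. It is illustrative only: the function names are hypothetical, and the OTMT's own tokenization and exact formulation may differ.

```python
from typing import Set

def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 - |A ∩ B| / |A ∪ B|; 0.0 means identical sets, 1.0 means disjoint sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def sorensen_dice_distance(a: Set[str], b: Set[str]) -> float:
    """1 - 2|A ∩ B| / (|A| + |B|); same 0.0 to 1.0 range as Jaccard distance."""
    total = len(a) + len(b)
    if total == 0:
        return 0.0
    return 1.0 - 2 * len(a & b) / total

# Token sets for a first memento and a later memento (illustrative only)
first_tokens = {"flood", "relief", "donate", "volunteer"}
later_tokens = {"donate", "volunteer", "domain", "sale"}

jd = jaccard_distance(first_tokens, later_tokens)        # 1 - 2/6
sd = sorensen_dice_distance(first_tokens, later_tokens)  # 1 - 4/8
```

With distances like these, a larger value means the later memento shares fewer terms with the first one, so a memento would be flagged as off-topic when its distance rises above the chosen threshold.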