
Off-Topic Memento Toolkit (OTMT) to Identify Topical Outliers in Web Archive Collections

Shawn M. Jones

Martin Klein

Michele C. Weigle

Michael L. Nelson

smjones@lanl.gov

mklein@lanl.gov

mweigle@cs.odu.edu

mln@cs.odu.edu

@shawnmjones

@mart1nkle1n

@weiglemc

@phonedude_mln

Technical problems

Page gone

Hacking

Moving on from topic

Pages in web archive collections go off-topic for many reasons

Of 8 similarity measures, word count worked best

The OTMT detects off-topic mementos by processing TimeMaps

First memento

Considered memento

The OTMT’s general algorithm

For each TimeMap in a collection:

  1. Get the first memento
  2. Preprocess it
  3. For each memento in the TimeMap
    1. Get the memento
    2. Preprocess it
    3. Compute the similarity to the first memento using a given measure
    4. Save the score
    5. Compare the score to a threshold value to determine if the memento is on- or off-topic
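
The steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the OTMT's own code: `preprocess`, `word_count_score`, `label_timemap`, the `fetch` callable, and the threshold of -0.5 are hypothetical names and values chosen for the example. The word count score is assumed here to be the relative change in word count against the first memento, and the preprocessing is reduced to simple tokenization.

```python
import re
from typing import Callable, Dict, List, Tuple


def preprocess(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens.

    A stand-in for real preprocessing, which would also strip markup.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def word_count_score(first: List[str], other: List[str]) -> float:
    """Relative change in word count versus the first memento (assumed measure)."""
    if not first:
        return 0.0
    return (len(other) - len(first)) / len(first)


def label_timemap(
    timemap: List[str],
    fetch: Callable[[str], str],
    threshold: float = -0.5,  # hypothetical value, not taken from the poster
) -> Dict[str, Tuple[float, str]]:
    """Score every memento in one TimeMap against the first memento."""
    results: Dict[str, Tuple[float, str]] = {}
    if not timemap:
        return results
    # Steps 1-2: get and preprocess the first memento.
    first_tokens = preprocess(fetch(timemap[0]))
    # Step 3: get, preprocess, score, and label each memento.
    for uri in timemap:
        tokens = preprocess(fetch(uri))
        score = word_count_score(first_tokens, tokens)
        label = "off-topic" if score < threshold else "on-topic"
        results[uri] = (score, label)
    return results
```

Passing `fetch` as a parameter keeps the sketch independent of any particular archive client; in practice it would retrieve the memento content for a URI-M. Mementos are only labeled, never removed, which matches the goal of identifying rather than deleting them.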

Archivists create web archive collections with a topic in mind

The sheer number of mementos to process means that researchers will need to quickly identify mementos with low information value.

Off-topic mementos have low information value.

We want to identify, not delete, these for further decision-making.

Identifying them lets us exclude them from selection as exemplars for storytelling.

Each capture of a seed becomes a memento

Some collections have thousands of seeds.

Archivists select seeds for their collection.

For more information

OTMT and Hypercane are part of the Dark and Stormy Archives Project

https://oduwsdl.github.io/dsa/

OTMT is best run via Hypercane:

https://oduwsdl.github.io/hypercane/

Our thanks to IIPC and IMLS for funding the Dark and Stormy Archives Project!

Sørensen-Dice, F1: 0.649

TF Simhash, F1: 0.523

Raw Simhash, F1: 0.578

Jaccard Distance, F1: 0.651

LSI Topic Modeling, F1: 0.711

TF-IDF Cosine Similarity, F1: 0.766

Byte Count, F1: 0.756

Word Count, F1: 0.788
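
For readers unfamiliar with the set-based measures listed above, the sketch below shows the standard textbook definitions of Jaccard distance and the Sørensen-Dice distance over two token lists. The function names are hypothetical, and treating each memento as a set of tokens is an assumption; how the OTMT tokenizes and applies these measures may differ.

```python
from typing import Iterable


def jaccard_distance(a: Iterable[str], b: Iterable[str]) -> float:
    """1 - |A ∩ B| / |A ∪ B|; 0.0 means identical token sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty documents are treated as identical
    return 1.0 - len(set_a & set_b) / len(union)


def sorensen_dice_distance(a: Iterable[str], b: Iterable[str]) -> float:
    """1 - 2|A ∩ B| / (|A| + |B|); 0.0 means identical token sets."""
    set_a, set_b = set(a), set(b)
    total = len(set_a) + len(set_b)
    if total == 0:
        return 0.0  # two empty documents are treated as identical
    return 1.0 - 2 * len(set_a & set_b) / total


first = ["storm", "rain", "wind", "news"]
later = ["storm", "rain", "sale", "shoes"]
print(jaccard_distance(first, later))
print(sorensen_dice_distance(first, later))
```

With both measures, a larger distance from the first memento suggests the considered memento has drifted further from the original topic.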