Two Different Approaches for Collecting, Analysing and Selecting Primary Sources from Web Archives

Federico Nanni

Data and Web Science Group

University of Mannheim

The starting point of my research is the transition from analogue to digital materials that we have been

From Analogue to Born-Digital Materials

Traditional!!! Define born-digital

Impact on the Historical Method

Defining a subject of investigation

Identifying the evidence

Interpreting the evidence

Creating a narrative

Identifying the evidence

Abundance

Scarcity

Research Question

Which methodologies should we combine with the historical method in order to re-collect, analyse and select born-digital materials?

Dealing with Scarcity

The website

Check the Portale d’Ateneo (high quality)

Put an opaque overlay on whatever is not being discussed at the moment

Archival Research

  • University archives
  • Newspaper archives
  • Forums, blogs, Usenet

Web Archival Research

Oral Interviews

Creation and role of:

  • Portale d’Ateneo
  • Department sub-domains
  • Alma-Net
  • The website

A Combination of Sources and Methods

F. Nanni, “Reconstructing a website’s lost past – Methodological issues concerning the history of www.unibo.it”, in Digital Humanities Quarterly, 2017.

Dealing with Abundance

Events

The Orange Revolution

Event collection

And the early stages?

Building Entity-Centric Event Collections

Using related concepts and entities to retrieve a comprehensive set of relevant documents.

Evaluation

(on the New York Times Corpus)

Many documents are not retrieved when using only the event-name approach.

F. Nanni, S. P. Ponzetto and L. Dietz, “Building Entity-Centric Event Collections”, JCDL 2017.

Conclusion

  • For dealing with born-digital materials, a new form of criticism is necessary

  • Combining methodologies from the fields of internet studies and natural language processing is the key

  • Offering this interdisciplinary preparation to the new generations of historians is fundamental

Questions?

Federico Nanni

Data and Web Science Group

University of Mannheim

federico@informatik.uni-mannheim.de

Final Application

Quantifying Attention to Foreign Elections with Text Analysis (together with A. Elshehawy, N. Marinov).

Pre-print available on SSRN.

Case Studies

Types of event:

  • 15 unexpected elections
  • 15 political crises
  • 15 civil wars

Datasets:

  • New York Times Corpus
  • US Congress Dataset
  • TREC KBA Stream Corpus

For each case study, we created an evaluation dataset.

Collecting Entities

Collect relevant entities from:

  • An initial pool of relevant documents (all entities mentioned in the context of the event-name).
  • The Wikipedia page of the event (the entities it links to, i.e., its outlinks).

Ranking Entities

We use RDF2Vec (Ristoski et al., 2016) and rank the entities by the cosine similarity between the vector representation of each entity and that of the event.
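
The ranking step can be illustrated with a minimal sketch. It assumes the entity and event vectors have already been obtained; the three-dimensional vectors and the names `entity_a`, `entity_b`, `entity_c` and `rank_entities` are invented for illustration and are not real RDF2Vec output or part of the original work.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_entities(event_vec, entity_vecs):
    """Return entity names sorted by decreasing cosine similarity to the event vector."""
    scored = [(cosine(event_vec, vec), name) for name, vec in entity_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored]


# Toy vectors, assumed for illustration only (not real RDF2Vec embeddings).
event_vec = [1.0, 0.0, 0.0]
entity_vecs = {
    "entity_a": [0.9, 0.1, 0.0],
    "entity_b": [0.0, 1.0, 0.0],
    "entity_c": [0.5, 0.5, 0.0],
}
ranking = rank_entities(event_vec, entity_vecs)
```

Here `entity_a` points almost in the same direction as the event vector, so it is ranked first, while `entity_b` is orthogonal to it and is ranked last.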

Getting Contextual Passages

For each relevant entity, we collect a text passage from the Wikipedia page of the event by retrieving the first passage (i.e., three sentences) that contains a mention of the entity.
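
A rough sketch of this passage-selection step follows. Several details are assumptions rather than the method as implemented: the text is split into consecutive, non-overlapping chunks of three sentences, sentences are split naively on end punctuation, and an entity mention is detected by case-insensitive substring match. The function names and the sample text are hypothetical.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def first_passage(text, entity, size=3):
    """Return the first `size`-sentence passage that mentions `entity`, or None."""
    sentences = split_sentences(text)
    # Assumption: passages are consecutive, non-overlapping chunks of `size` sentences.
    for start in range(0, len(sentences), size):
        passage = " ".join(sentences[start:start + size])
        if entity.lower() in passage.lower():
            return passage
    return None


# Placeholder text, invented for illustration.
sample_text = (
    "Alpha opens the text. Beta follows. Gamma closes the first chunk. "
    "Delta starts the second. The entity Foo appears here. Epsilon ends it. "
    "Foo appears again later."
)
passage = first_passage(sample_text, "Foo")
```

Only the first matching chunk is returned, so the later mention of the entity in the final sentence is ignored.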

Query Expansion Solutions

Place

Entities

Passages (i.e., entities in context)

GloVe vector representation of Entities

GloVe vector representation of Passages (our-light)

All combined with Learning to Rank (our-full)

Cosine similarity between expanded query and document
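
The embedding-based scoring can be sketched as follows. This is an illustration under stated assumptions, not the implementation behind our-light or our-full: query and document vectors are built by averaging word vectors (the slides do not say how vectors are combined), the two-dimensional embeddings are toy values rather than real GloVe vectors, and all names (`mean_vector`, `rank_documents`, the document ids) are hypothetical. The Learning to Rank combination is not covered.

```python
import math


def mean_vector(tokens, embeddings):
    """Average the vectors of the tokens found in `embeddings`; None if none are found."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_documents(query_tokens, expansion_tokens, documents, embeddings):
    """Rank document ids by cosine similarity to the expanded query vector."""
    query_vec = mean_vector(query_tokens + expansion_tokens, embeddings)
    scores = {}
    for doc_id, tokens in documents.items():
        doc_vec = mean_vector(tokens, embeddings)
        if query_vec is None or doc_vec is None:
            scores[doc_id] = 0.0
        else:
            scores[doc_id] = cosine(query_vec, doc_vec)
    return sorted(scores, key=scores.get, reverse=True)


# Toy 2-dimensional embeddings, assumed for illustration (not real GloVe vectors).
embeddings = {
    "election": [1.0, 0.0],
    "protest": [0.8, 0.2],
    "kyiv": [0.7, 0.3],
    "football": [0.0, 1.0],
    "match": [0.1, 0.9],
}
documents = {
    "doc_on_topic": ["protest", "kyiv"],      # never mentions the query term itself
    "doc_off_topic": ["football", "match"],
}
ranked = rank_documents(["election"], ["protest", "kyiv"], documents, embeddings)
```

In this toy setup the on-topic document is ranked first even though it never contains the query term, which is the kind of document an event-name-only search would miss.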

Evaluation on NY Times

Evaluation on US Congressional Record

% of Missing Docs Using the Event-Name
