Slow, Painful, and Expensive: Current Challenges in Text-Mining Corpus Construction for the Digital Humanities
Matt Warner, Nichole Nomura¹, Carmen Thong¹, Alix Keener, Alex Sherman², Gabi Keane², Maciej Kurzynski², Mark Algee-Hewitt
¹ ² Equal contributions
Presentation Overview
Project Background
Text and Data Mining: Demonstrating Fair Use
Mellon-Funded Public Knowledge Project
Two-part grant:
Project progress:
Project depends on two corpora
Corpus Building: Literary Lab Antecedents
The Mark and Mark corpus of 20th-century fiction: 2014-15
Canon/Archive: 2014-17
Theory Project
Methods
Corpus Selection
title | source_type | source | google_books_return | unique_id | auth1_first | auth1_last | field | first_pub_date | born_digital |
Anarchism and Other Essays | Ebook | PREVIOUSLY_OWNED_EBOOK | FALSE | 1 | Emma | Goldman | Feminism | 1910 | Born Digital |
The Morality of Birth Control | Ebook | PUBLIC_DOMAIN | FALSE | 2 | Margaret | Sanger | Feminism | 1921 | Born Digital |
The Second Sex | Ebook | INDIVIDUAL_EBOOK_PURCHASE | FALSE | 3 | Simone | de Beauvoir | Feminism | 1949 | Born Digital |
"Women as a Minority Group" | Article | ARTICLE | FALSE | 4 | Helen Mayer | Hacker | Feminism | 1951 | Born Digital |
The Feminine Mystique | | LIBRARY_SCAN | FALSE | 5 | Betty | Friedan | Feminism | 1963 | Digitized |
Scenes of Subjection | | DESTRUCTIVE_SCAN | FALSE | 24 | Saidiya V. | Hartman | Feminism; Black Studies | 1997 | Digitized |
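As a rough illustration of how a metadata table like this one could feed the source analysis later in the talk, the sketch below tallies the six sample rows by acquisition source, field, and born-digital status. It is not the project's actual pipeline: the dictionary layout, the helper names `count_by` and `count_by_field`, and the choice to treat a work's fields as a list (so that Scenes of Subjection counts under both Feminism and Black Studies) are all assumptions made for the example.

```python
from collections import Counter

# Sample rows transcribed from the table above (only the columns used here).
# Representing "field" as a list and "born_digital" as a boolean is an
# assumption for illustration, not the project's actual schema.
ROWS = [
    {"unique_id": 1, "source": "PREVIOUSLY_OWNED_EBOOK",
     "fields": ["Feminism"], "first_pub_date": 1910, "born_digital": True},
    {"unique_id": 2, "source": "PUBLIC_DOMAIN",
     "fields": ["Feminism"], "first_pub_date": 1921, "born_digital": True},
    {"unique_id": 3, "source": "INDIVIDUAL_EBOOK_PURCHASE",
     "fields": ["Feminism"], "first_pub_date": 1949, "born_digital": True},
    {"unique_id": 4, "source": "ARTICLE",
     "fields": ["Feminism"], "first_pub_date": 1951, "born_digital": True},
    {"unique_id": 5, "source": "LIBRARY_SCAN",
     "fields": ["Feminism"], "first_pub_date": 1963, "born_digital": False},
    {"unique_id": 24, "source": "DESTRUCTIVE_SCAN",
     "fields": ["Feminism", "Black Studies"], "first_pub_date": 1997,
     "born_digital": False},
]


def count_by(rows, key):
    """Count rows by the value of a single-valued column."""
    return Counter(row[key] for row in rows)


def count_by_field(rows):
    """Count rows per field; a work with two fields counts once in each."""
    return Counter(field for row in rows for field in row["fields"])


if __name__ == "__main__":
    print("By source:", dict(count_by(ROWS, "source")))
    print("By field:", dict(count_by_field(ROWS)))
    print("Born digital vs. digitized:", dict(count_by(ROWS, "born_digital")))
```

Counting a multi-field work once per field means the field totals can exceed the number of titles, which matters when comparing field-level breakdowns against the corpus size.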
Acquisition Process: Overview
Acquisitions: Electronic Books
Acquisitions: Google Books Return
Acquisitions: Scanned Books
Acquisitions: Open Access and Public Domain
Data Management
Data Management: Security
Data Management: DRM-Removal
Data Management: Metadata and Databases
Analysis of Sources
Analysis of Sources: Overview
Analysis of Sources: By Field
Analysis of Sources: Dates
Analysis of Sources: Articles and Monographs
Conclusions
Born-digital corpora remain difficult and expensive to compile