1 of 30

Slow, Painful, and Expensive: Current Challenges in Text-Mining Corpus Construction for the Digital Humanities

Matt Warner, Nichole Nomura1, Carmen Thong1, Alix Keener, Alex Sherman2, Gabi Keane2, Maciej Kurzynski2, Mark Algee-Hewitt

1, 2 Equal contributions

2 of 30

Presentation Overview

  1. Project Background
  2. Methods
    1. Corpus Selection
    2. Acquisition Process
    3. Data Management
  3. Analysis of Sources

3 of 30

Project Background

Text and Data Mining: Demonstrating Fair Use

Mellon-funded Public Knowledge Project

2-part grant:

  1. Map the relationship between literary theory and criticism in the 20th century
  2. Demonstrate the importance of access to in-copyright materials via the TDM exemption to the DMCA

4 of 30

Project Background

Text and Data Mining: Demonstrating Fair Use

Project progress:

  1. Corpus acquisition phase coming to a close (article submitted to DHQ)
  2. Analysis phase ongoing throughout 2024 (we are submitting an abstract to ADHO and are aiming for an article on the results of the project late 2024)

5 of 30

Text and Data Mining: Demonstrating Fair Use

Project depends on 2 corpora

  1. Corpus of Literary Criticism (theoretical models applied to text): ~25K recovered Google Books
  2. Corpus of Literary Theory (works on the theoretical models themselves): ~500 hand-assembled texts acquired via the DMCA exemption (to demonstrate fair use)

6 of 30

Corpus Building: Literary Lab Antecedents

The Mark and Mark corpus of 20th century fiction: 2014-15

  • One of the first attempts to build an in-copyright 20th century corpus
  • Built from 5 lists of 20th century fiction (3 “best of”, 1 experimental fiction, 1 bestselling fiction)
  • Supplemented with list built from survey of MELUS/PSA/Feminist Press members

7 of 30

Corpus Building: Literary Lab Antecedents

The Mark and Mark corpus of 20th century fiction: 2014-15

  • Overall project was a success.
  • All works from corpus 1.0 (based on the 5 lists) were acquired
  • Most challenging works were bestsellers from the early 20th century
  • ~80% of PSA list acquired

8 of 30

Corpus Building: Literary Lab Antecedents

Canon/Archive: 2014-17

  • Early project work in 2014 to assemble a random sample of 674 works of 18th century fiction (based on Raven/Garside bibliography)
  • 215 works were already held by the Literary Lab
  • 300 works were held by ProQuest/Gale/HathiTrust
  • 30 were held on microfiche only
  • 100 were only in print
  • 10 were unfindable

9 of 30

Corpus Building: Literary Lab Antecedents

Canon/Archive: 2014-17

  • 50 print-only books were held by the British Library and 50 by Harvard/UCLA ($1,000-$20,000 per novel for digitized copies)
  • 6 were held by ProQuest ($25,000 per novel)
  • HathiTrust sent 70% of their holdings of the sample 6 months later
  • Overall, we were only able to retrieve ~70% of the sample

10 of 30

Corpus Building: Literary Lab Antecedents

Theory Project

  • Working with a corpus of 20th century non-fiction, could we do better in 2023 than in either 2015 or 2016?
  • Does the DMCA make a difference in our ability to legally acquire texts?
  • What other challenges remain that the DMCA does not help with?

11 of 30

Methods

  1. Corpus Design
  2. Acquisition Overview
  3. Acquisition Types
  4. Data Management

12 of 30

Corpus Selection

title | source_type | source | google_books_return | unique_id | auth1_first | auth1_last | field | first_pub_date | born_digital
Anarchism and Other Essays | Ebook | PREVIOUSLY_OWNED_EBOOK | FALSE | 1 | Emma | Goldman | Feminism | 1910 | Born Digital
The Morality of Birth Control | Ebook | PUBLIC_DOMAIN | FALSE | 2 | Margaret | Sanger | Feminism | 1921 | Born Digital
The Second Sex | Ebook | INDIVIDUAL_EBOOK_PURCHASE | FALSE | 3 | Simone | de Beauvoir | Feminism | 1949 | Born Digital
"Women as a Minority Group" | Article | ARTICLE | FALSE | 4 | Helen Mayer | Hacker | Feminism | 1951 | Born Digital
The Feminine Mystique | Print | LIBRARY_SCAN | FALSE | 5 | Betty | Friedan | Feminism | 1963 | Digitized
Scenes of Subjection | Print | DESTRUCTIVE_SCAN | FALSE | 24 | Saidiya V. | Hartman | Feminism / Black Studies | 1997 | Digitized
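
The metadata schema above maps naturally onto a small record type. A minimal sketch, assuming the spreadsheet is exported to CSV with the column names shown (file path and helper names are hypothetical, not part of the project's actual tooling):

```python
import csv
from dataclasses import dataclass

@dataclass
class CorpusEntry:
    """One row of the corpus-selection spreadsheet."""
    title: str
    source_type: str          # Ebook | Article | Print
    source: str               # e.g. PREVIOUSLY_OWNED_EBOOK, LIBRARY_SCAN
    google_books_return: bool
    unique_id: int
    auth1_first: str
    auth1_last: str
    field: str                # may name more than one field
    first_pub_date: int
    born_digital: str         # Born Digital | Digitized

def load_corpus(path):
    """Parse the metadata CSV into typed records."""
    with open(path, newline="", encoding="utf-8") as f:
        return [
            CorpusEntry(
                title=row["title"],
                source_type=row["source_type"],
                source=row["source"],
                google_books_return=row["google_books_return"] == "TRUE",
                unique_id=int(row["unique_id"]),
                auth1_first=row["auth1_first"],
                auth1_last=row["auth1_last"],
                field=row["field"],
                first_pub_date=int(row["first_pub_date"]),
                born_digital=row["born_digital"],
            )
            for row in csv.DictReader(f)
        ]
```

Typed records make downstream tallies (by source, field, or date) trivial and catch malformed rows at load time rather than mid-analysis.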

13 of 30

Acquisition Process: Overview

  1. Prioritize born-digital texts whenever possible, ideally from SUL holdings
  2. Purchase any lacking ebooks
  3. Fill small number of gaps in collection via digitization
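
The three-step priority above amounts to a simple triage rule per title. A sketch under those assumptions (the function and return labels are hypothetical illustrations, not the project's actual workflow code):

```python
def choose_acquisition_route(in_sul_ebooks: bool, purchasable_ebook: bool) -> str:
    """Pick an acquisition route in the stated priority order:
    SUL-held born-digital copy, then ebook purchase, then digitization."""
    if in_sul_ebooks:
        return "SUL_EBOOK"        # 1. prefer born-digital SUL holdings
    if purchasable_ebook:
        return "EBOOK_PURCHASE"   # 2. buy an ebook where one exists
    return "DIGITIZATION"         # 3. scan a print copy as a last resort
```

Encoding the policy this way makes the fallback order explicit and auditable across hundreds of titles.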

14 of 30

Acquisition Process: Overview

15 of 30

Acquisitions: Electronic Books

  • Of our 402 titles, 116 (~28%) were available as purchased, non-subscription ebooks via SUL (DMCA). Of these, 51 allowed full-text download; the remaining 65 were DRM-protected or fragmented
  • Of those 65 DRM-protected or fragmented texts, 35 were acquired via publisher request (under little-used SUL license terms)
  • Where SUL did not already own an ebook, we requested a purchase when possible (thanks, Rebecca Wingfield!)
  • For many texts, the library holds multiple print copies (especially for titles published before the digital era) but has had little reason or capacity to duplicate them with an ebook copy
  • Non-SUL ebooks…

16 of 30

Acquisitions: Google Books Return

  • Texts from SUL scanned by Google Books
  • Coverage is unpredictable
  • OCR is out-of-date
  • Access requires submitting requests, and data should not be copied to local machines

17 of 30

Acquisitions: Google Books Return

18 of 30

Acquisitions: Scanned Books

  • Library funding and labor
    • Pro: institutional proof we own the book
    • Con: Library timelines and funding
  • Google Books
    • Unique to us; not reproducible at other institutions
    • OCR quality is vintage
  • Destructive Scan
    • Method: bandsaw and feed-scanner
    • Pro: cheap, relatively fast
    • Con: book storage, book destruction
  • Overhead Scan
    • For books we couldn’t bear to destroy

19 of 30

Acquisitions: Open Access and Public Domain

  • Open Access titles have to be manually requested for cataloging and inclusion in SearchWorks (crowdsourced, in a way)
  • Ebooks available both for purchase and as OA don’t display OA status in the library ordering platform
  • What is public domain can seem clear, but it varies from place to place, and historical changes further complicate evaluations (some copyright changes are retroactive, some are not)
    • Often means prioritizing older editions over newer ones
    • Works released into the public domain by their authors are also not always clearly marked as such

20 of 30

Data Management

  1. Data Security
  2. Ebook Processing
  3. Metadata and Data Format

21 of 30

Data Management: Security

  • Data obtained using the DMCA-exemption must be kept secure, using "those measures that the institution uses to keep its own highly confidential information secure."
  • Stanford classified our data as "high risk"—equivalent to SSNs, PHI, credit card numbers, donor info, etc.
    • Not compatible with OAK / Sherlock
    • High-risk setup for a personal computer is invasive and requires extensive monitoring

22 of 30

Data Management: DRM-Removal

  • Almost all textual DRM is Adobe Digital Editions
  • Tool-makers are engaged in an arms-race with DRM-makers (tools aren't reliable)
  • Tool-making is not protected by the DMCA exemption
  • Workflow requires manual use of Adobe Digital Editions
    • Does not run on Linux and has no command-line interface
    • Hard to scale to large corpora

23 of 30

Data Management: Metadata and Databases

  • Clash between the library data model (edition/book) and the DH data model (work or text)
  • Needed data from library MARC records but did not systematically collect it
  • Library records lack certain scholarly information (e.g. first publication date)
  • Would have been better to link MARC records to our metadata and manage it all in a database
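
The database gestured at above might separate work-level scholarly metadata from edition-level, MARC-derived records. A minimal sketch using SQLite, with hypothetical table and column names and illustrative values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE work (                 -- DH unit: the work or text
    work_id        INTEGER PRIMARY KEY,
    title          TEXT NOT NULL,
    author         TEXT,
    field          TEXT,
    first_pub_date INTEGER          -- scholarly info MARC records lack
);
CREATE TABLE edition (              -- library unit: the edition or book
    edition_id  INTEGER PRIMARY KEY,
    work_id     INTEGER NOT NULL REFERENCES work(work_id),
    marc_record TEXT,               -- raw MARC blob or its identifier
    pub_date    INTEGER,            -- this edition's publication date
    source      TEXT                -- e.g. SUL_EBOOK, DESTRUCTIVE_SCAN
);
""")

# Illustrative rows: one work linked to one acquired edition.
conn.execute(
    "INSERT INTO work VALUES (1, 'The Second Sex', 'Simone de Beauvoir', 'Feminism', 1949)"
)
conn.execute(
    "INSERT INTO edition VALUES (1, 1, NULL, 2011, 'INDIVIDUAL_EBOOK_PURCHASE')"
)

# A join recovers both the scholarly date and the edition's own date.
row = conn.execute("""
    SELECT w.title, w.first_pub_date, e.pub_date
    FROM work w JOIN edition e ON e.work_id = w.work_id
""").fetchone()
```

The one-to-many link lets several acquired editions (scan, ebook, Google Books return) hang off a single work without duplicating the scholarly metadata.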

24 of 30

Analysis of Sources

  1. Overview
  2. By field
  3. By Date

25 of 30

Analysis of Sources: Overview

  • About 60% of our texts were available as ebooks or articles
  • Deliberately few articles in our corpus, but very easy to locate
  • Scanning was a vastly more important source than we expected
  • Google Books was nice to have (it saved scanning) but not crucial for this project
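
Figures like the ~60% above can be tallied straight from the source_type column. A sketch assuming the metadata is already in memory as a list of source-type strings (the function name is our own):

```python
from collections import Counter

def source_breakdown(source_types):
    """Return each source type's share of the corpus as a percentage."""
    counts = Counter(source_types)
    total = sum(counts.values())
    return {src: round(100 * n / total, 1) for src, n in counts.items()}

# Tiny illustrative sample, not the real 402-title corpus:
shares = source_breakdown(["Ebook", "Ebook", "Article", "Print", "Print"])
# → {'Ebook': 40.0, 'Article': 20.0, 'Print': 40.0}
```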

26 of 30

Analysis of Sources: By Field

27 of 30

Analysis of Sources: By Field

28 of 30

Analysis of Sources: Dates

29 of 30

Analysis of Sources: Articles and Monographs

  • Monographs are important for Humanities citation and tenure
  • Monograph citation network research has been done by hand for a number of logistical reasons

  • Previous DH work has mostly used articles
    • (Goldstone and Underwood, 2012; Riddell, 2014; Feeney, 2017; Ambrosino et al., 2018; Piper, 2020)
  • Articles are easier to get

30 of 30

Conclusions

Born-digital corpora remain difficult and expensive to compile

  1. The DMCA exemption is challenging to navigate
  2. For academic writing, ebook availability remains patchy
  3. Scanning books is logistically simple and affordable