1 of 30

Slow, Painful, and Expensive: Current Challenges in Text-Mining Corpus Construction for the Digital Humanities

Matt Warner, Nichole Nomura1, Carmen Thong1, Alix Keener, Alex Sherman2, Gabi Keane2, Maciej Kurzynski2, Mark Algee-Hewitt

1, 2 Equal contributions

2 of 30

Presentation Overview

  1. Project Background
  2. Methods
    1. Corpus Selection
    2. Acquisition Process
    3. Data Management
  3. Analysis of Sources

3 of 30

Project Background

Text and Data Mining: Demonstrating Fair Use

Mellon-funded Public Knowledge Project

2-part grant:

  1. Map the relationship between literary theory and criticism in the 20th century
  2. Demonstrate the importance of access to in-copyright materials via the TDM exemption to the DMCA

4 of 30

Project Background

Text and Data Mining: Demonstrating Fair Use

Project progress:

  1. Corpus acquisition phase coming to a close (article submitted to DHQ)
  2. Analysis phase ongoing throughout 2024 (we are submitting an abstract to ADHO and are aiming for an article on the results of the project late 2024)

5 of 30

Text and Data Mining: Demonstrating Fair Use

Project depends on 2 corpora

  1. Corpus of Literary Criticism (theoretical models applied to text): ~25K recovered Google Books
  2. Corpus of Literary Theory (works on the theoretical models themselves): ~500 hand-assembled texts acquired via the DMCA exemption (to demonstrate fair use)

6 of 30

Corpus Building: Literary Lab Antecedents

The Mark and Mark corpus of 20th century fiction: 2014-15

  • One of the first attempts to build an in-copyright 20th century corpus
  • Built from 5 lists of 20th century fiction (3 “best of”, 1 experimental fiction, 1 bestselling fiction)
  • Supplemented with list built from survey of MELUS/PSA/Feminist Press members

7 of 30

Corpus Building: Literary Lab Antecedents

The Mark and Mark corpus of 20th century fiction: 2014-15

  • Overall project was a success.
  • All works from corpus 1.0 (based on the 5 lists) were acquired
  • Most challenging works were bestsellers from the early 20th century
  • ~80% of PSA list acquired

8 of 30

Corpus Building: Literary Lab Antecedents

Canon/Archive: 2014-17

  • Early project work in 2014 to assemble a random sample of 674 works of 18th century fiction (based on Raven/Garside bibliography)
  • 215 works were already held by the Literary Lab
  • 300 works were held by ProQuest/Gale/HathiTrust
  • 30 were held on microfiche only
  • 100 were only in print
  • 10 were unfindable

9 of 30

Corpus Building: Literary Lab Antecedents

Canon/Archive: 2014-17

  • 50 print-only books were held by the British Library and 50 by Harvard/UCLA ($1,000-$20,000 per novel for digitized copies)
  • 6 were held by ProQuest ($25,000 per novel)
  • HathiTrust sent 70% of their holdings of the sample 6 months later
  • Overall, we were only able to retrieve ~70% of the sample

10 of 30

Corpus Building: Literary Lab Antecedents

Theory Project

  • Working with a corpus of 20th century non-fiction, could we do better in 2023 than in either 2015 or 2016?
  • Does the DMCA make a difference in our ability to legally acquire texts?
  • What other challenges remain that the DMCA does not help with?

11 of 30

Methods

  1. Corpus Design
  2. Acquisition Overview
  3. Acquisition Types
  4. Data Management

12 of 30

Corpus Selection

title | source_type | source | google_books_return | unique_id | auth1_first | auth1_last | field | first_pub_date | born_digital
Anarchism and Other Essays | Ebook | PREVIOUSLY_OWNED_EBOOK | FALSE | 1 | Emma | Goldman | Feminism | 1910 | Born Digital
The Morality of Birth Control | Ebook | PUBLIC_DOMAIN | FALSE | 2 | Margaret | Sanger | Feminism | 1921 | Born Digital
The Second Sex | Ebook | INDIVIDUAL_EBOOK_PURCHASE | FALSE | 3 | Simone | de Beauvoir | Feminism | 1949 | Born Digital
"Women as a Minority Group" | Article | ARTICLE | FALSE | 4 | Helen Mayer | Hacker | Feminism | 1951 | Born Digital
The Feminine Mystique | Print | LIBRARY_SCAN | FALSE | 5 | Betty | Friedan | Feminism | 1963 | Digitized
Scenes of Subjection | Print | DESTRUCTIVE_SCAN | FALSE | 24 | Saidiya V. | Hartman | Feminism / Black Studies | 1997 | Digitized
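
The metadata schema above maps naturally onto a small record type. A minimal sketch, assuming the spreadsheet is exported to CSV with the column names shown (file path and helper names are hypothetical, not part of the project's actual tooling):

```python
import csv
from dataclasses import dataclass

@dataclass
class CorpusEntry:
    """One row of the corpus-selection spreadsheet."""
    title: str
    source_type: str          # Ebook | Article | Print
    source: str               # e.g. PREVIOUSLY_OWNED_EBOOK, LIBRARY_SCAN
    google_books_return: bool
    unique_id: int
    auth1_first: str
    auth1_last: str
    field: str                # may name more than one field
    first_pub_date: int
    born_digital: str         # Born Digital | Digitized

def load_corpus(path):
    """Parse the metadata CSV into typed records."""
    with open(path, newline="", encoding="utf-8") as f:
        return [
            CorpusEntry(
                title=row["title"],
                source_type=row["source_type"],
                source=row["source"],
                google_books_return=row["google_books_return"] == "TRUE",
                unique_id=int(row["unique_id"]),
                auth1_first=row["auth1_first"],
                auth1_last=row["auth1_last"],
                field=row["field"],
                first_pub_date=int(row["first_pub_date"]),
                born_digital=row["born_digital"],
            )
            for row in csv.DictReader(f)
        ]
```

Typed records make downstream tallies (by source, field, or date) trivial and catch malformed rows at load time rather than mid-analysis.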

13 of 30

Acquisition Process: Overview

  1. Prioritize born-digital texts whenever possible, ideally from SUL holdings
  2. Purchase any lacking ebooks
  3. Fill small number of gaps in collection via digitization
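
The three-step priority above amounts to a simple triage rule per title. A sketch under those assumptions (the function and return labels are hypothetical illustrations, not the project's actual workflow code):

```python
def choose_acquisition_route(in_sul_ebooks: bool, purchasable_ebook: bool) -> str:
    """Pick an acquisition route in the stated priority order:
    SUL-held born-digital copy, then ebook purchase, then digitization."""
    if in_sul_ebooks:
        return "SUL_EBOOK"        # 1. prefer born-digital SUL holdings
    if purchasable_ebook:
        return "EBOOK_PURCHASE"   # 2. buy an ebook where one exists
    return "DIGITIZATION"         # 3. scan a print copy as a last resort
```

Encoding the policy this way makes the fallback order explicit and auditable across hundreds of titles.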

14 of 30

Acquisition Process: Overview

15 of 30

Acquisitions: Electronic Books

  • Of our 402 titles, 116 (~28%) were available as purchased, non-subscription ebooks via SUL (DMCA). Of these, 51 allowed full-text download; the remaining 65 were DRM-protected or fragmented
  • Of those 65 DRM-protected or fragmented texts, 35 were acquired via publisher request (under little-used SUL license terms)
  • Where SUL did not already own an ebook, we requested a purchase when possible (thanks, Rebecca Wingfield!)
  • For many texts, the library holds multiple print copies (especially for titles published before the digital era) but has had little reason or capacity to duplicate them with an ebook copy
  • Non-SUL ebooks…

16 of 30

Acquisitions: Google Books Return

  • Texts from SUL scanned by Google Books
  • Coverage is unpredictable
  • OCR is out-of-date
  • Access requires submitting requests, and data should not be copied to local machines

17 of 30

Acquisitions: Google Books Return

18 of 30

Acquisitions: Scanned Books

  • Library funding and labor
    • Pro: institutional proof we own the book
    • Con: Library timelines and funding
  • Google Books
    • Unique to us; not reproducible at other institutions
    • OCR quality is vintage
  • Destructive Scan
    • Method: bandsaw and feed-scanner
    • Pro: cheap, relatively fast
    • Con: book storage, book destruction
  • Overhead Scan
    • For books we couldn’t bear to destroy

19 of 30

Acquisitions: Open Access and Public Domain

  • Open Access titles have to be manually requested for cataloging and inclusion in SearchWorks (crowdsourced, in a way)
  • Ebooks available both for purchase and as OA don’t display OA status in the library ordering platform
  • What is public domain can seem clear, but it varies from place to place, and historical changes further complicate evaluations (some copyright changes are retroactive, some are not)
    • Often means prioritizing older editions over newer ones
    • Works released into the public domain by their authors are also not always clearly marked as such

20 of 30

Data Management

  1. Data Security
  2. Ebook Processing
  3. Metadata and Data Format

21 of 30

Data Management: Security

  • Data obtained using the DMCA-exemption must be kept secure, using "those measures that the institution uses to keep its own highly confidential information secure."
  • Stanford classified our data as "high risk"—equivalent to SSNs, PHI, credit card numbers, donor info, etc.
    • Not compatible with OAK / Sherlock
    • High-risk setup for a personal computer is invasive and requires extensive monitoring

22 of 30

Data Management: DRM-Removal

  • Almost all textual DRM is Adobe Digital Editions
  • Tool-makers are engaged in an arms-race with DRM-makers (tools aren't reliable)
  • Tool-making is not protected by the DMCA exemption
  • Workflow requires manual use of Adobe Digital Editions
    • Does not run on Linux and has no command-line interface
    • Hard to scale to large corpora

23 of 30

Data Management: Metadata and Databases

  • Clash between the library data model (edition/book) and the DH data model (work or text)
  • Needed data from library MARC records but did not systematically collect it
  • Library records lack certain scholarly information (e.g. first publication date)
  • Would have been better to link MARC records to our metadata and manage it all in a database
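
The database gestured at above might separate work-level scholarly metadata from edition-level, MARC-derived records. A minimal sketch using SQLite, with hypothetical table and column names and illustrative values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE work (                 -- DH unit: the work or text
    work_id        INTEGER PRIMARY KEY,
    title          TEXT NOT NULL,
    author         TEXT,
    field          TEXT,
    first_pub_date INTEGER          -- scholarly info MARC records lack
);
CREATE TABLE edition (              -- library unit: the edition or book
    edition_id  INTEGER PRIMARY KEY,
    work_id     INTEGER NOT NULL REFERENCES work(work_id),
    marc_record TEXT,               -- raw MARC blob or its identifier
    pub_date    INTEGER,            -- this edition's publication date
    source      TEXT                -- e.g. SUL_EBOOK, DESTRUCTIVE_SCAN
);
""")

# Illustrative rows: one work linked to one acquired edition.
conn.execute(
    "INSERT INTO work VALUES (1, 'The Second Sex', 'Simone de Beauvoir', 'Feminism', 1949)"
)
conn.execute(
    "INSERT INTO edition VALUES (1, 1, NULL, 2011, 'INDIVIDUAL_EBOOK_PURCHASE')"
)

# A join recovers both the scholarly date and the edition's own date.
row = conn.execute("""
    SELECT w.title, w.first_pub_date, e.pub_date
    FROM work w JOIN edition e ON e.work_id = w.work_id
""").fetchone()
```

The one-to-many link lets several acquired editions (scan, ebook, Google Books return) hang off a single work without duplicating the scholarly metadata.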

24 of 30

Analysis of Sources

  1. Overview
  2. By field
  3. By Date

25 of 30

Analysis of Sources: Overview

  • About 60% of our texts were available as ebooks or articles
  • Deliberately few articles in our corpus, but very easy to locate
  • Scanning was a vastly more important source than we expected
  • Google Books was nice to have (it saved scanning) but not crucial for this project
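
Figures like the ~60% above can be tallied straight from the source_type column. A sketch assuming the metadata is already in memory as a list of source-type strings (the function name is our own):

```python
from collections import Counter

def source_breakdown(source_types):
    """Return each source type's share of the corpus as a percentage."""
    counts = Counter(source_types)
    total = sum(counts.values())
    return {src: round(100 * n / total, 1) for src, n in counts.items()}

# Tiny illustrative sample, not the real 402-title corpus:
shares = source_breakdown(["Ebook", "Ebook", "Article", "Print", "Print"])
# → {'Ebook': 40.0, 'Article': 20.0, 'Print': 40.0}
```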

26 of 30

Analysis of Sources: By Field

27 of 30

Analysis of Sources: By Field

28 of 30

Analysis of Sources: Dates

29 of 30

Analysis of Sources: Articles and Monographs

  • Monographs are important for Humanities citation and tenure
  • Monograph citation network research has been done by hand for a number of logistical reasons

  • Previous DH work has mostly used articles
    • (Goldstone and Underwood, 2012; Riddell, 2014; Feeney, 2017; Ambrosino et al., 2018; Piper, 2020)
  • Articles are easier to get

30 of 30

Conclusions

Born-digital corpora remain difficult and expensive to compile

  1. The DMCA exemption is challenging to navigate
  2. For academic writing, ebook availability remains patchy
  3. Scanning books is logistically simple and affordable