1 of 23

Expanding Digital Assyriology with Machine Learning

Émilie Pagé-Perron

University of Toronto

Leipzig, March 21st 2017

2 of 23

Outline

  • State of NLP for "cuneiform languages"
  • Roadblocks to the expansion of methodologies
  • Possible solutions
  • Concrete actions at the CDLI


Global Philology Open Conference - February 2017

© Michel Royon / Wikimedia Commons

3 of 23

Way back CDLI


1998

2003

2010

4 of 23


http://cdli.ucla.edu

5 of 23

Access: Licenses and technical barriers

Restrictive licenses

Scraper required


6 of 23

CDLI in numbers

  • 320,000 individual cuneiform texts
  • 2.65 million lines of transliteration
  • 300,000 archival images (TIFF and some RTI)


7 of 23


8 of 23

Encoding standards


CDLI-ATF

ORACC-ATF

Steve Tinney & Eleanor Robson, Oracc: The Open Richly Annotated Cuneiform Corpus

9 of 23

Challenges brought by encoding schemes

  • Their multiplicity
  • Some are not yet machine-readable
  • Variation in the level of text structure information they encode
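
To make the point about structure concrete, here is a minimal sketch of reading an ATF-style transliteration in Python. It assumes only the common line conventions (`&` for the text header, `#atf: lang` for the language, `@` for object and surface markers, numbered lines for the transliteration) and skips everything else. The `AtfText` class, the `parse_atf` function, and the sample text are hypothetical illustrations, not a CDLI or Oracc tool and not a real tablet.

```python
import re
from dataclasses import dataclass, field


@dataclass
class AtfText:
    pid: str = ""
    designation: str = ""
    language: str = ""
    # Each entry is (surface, line label, transliteration).
    lines: list = field(default_factory=list)


def parse_atf(source: str) -> AtfText:
    """Parse a small subset of ATF; unknown line types are skipped."""
    text = AtfText()
    surface = ""
    for raw in source.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("&"):
            # Header line: "&<id> = <designation>"
            pid, _, designation = line[1:].partition("=")
            text.pid = pid.strip()
            text.designation = designation.strip()
        elif line.startswith("#atf:"):
            parts = line[len("#atf:"):].split()
            if len(parts) == 2 and parts[0] == "lang":
                text.language = parts[1]
        elif line.startswith("@"):
            # Structure marker; the most recent one labels the lines below it.
            surface = line[1:]
        elif line.startswith(("#", "$")):
            continue  # comments, translations, and state lines are ignored here
        else:
            match = re.match(r"^(\d+'?)\.\s+(.*)$", line)
            if match:
                text.lines.append((surface, match.group(1), match.group(2)))
    return text


# Invented sample for illustration only.
SAMPLE = """&P000000 = Hypothetical text
#atf: lang sux
@tablet
@obverse
1. 1(u) udu
2. ki lugal-ta
@reverse
1. kiszib3 lugal
"""

parsed = parse_atf(SAMPLE)
for surface, label, content in parsed.lines:
    print(surface, label, content)
```

A scheme that records less structure (for example, no surface markers) would leave the `surface` field empty, which is one way the variation listed above shows up downstream.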


10 of 23


11 of 23

Nammu


12 of 23

Limitations in current methods

  • Require a high degree of human involvement
  • Contextual disambiguation not available


13 of 23

Statistical methods

  • Social Network Analysis
  • Semantic Network Analysis
  • Topic Analysis
  • Other collocation and clustering tasks
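
As an illustration of the collocation tasks mentioned above, the following sketch scores adjacent word pairs by pointwise mutual information (PMI), one common collocation measure. The function name `pmi_bigrams`, the `min_count` threshold, and the toy token lists are assumptions made for this example; they are not taken from the studies named on this slide.

```python
import math
from collections import Counter


def pmi_bigrams(lines, min_count=2):
    """Rank adjacent token pairs by PMI, highest first.

    `lines` is a list of token lists, one per line of transliteration.
    Pairs seen fewer than `min_count` times are dropped, because PMI
    is unreliable for rare events.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in lines:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented toy lines, already tokenised.
toy_lines = [
    ["udu", "niga", "ki", "lugal"],
    ["udu", "niga", "ki", "ensi2"],
    ["masz2", "ki", "lugal"],
]

for pair, score in pmi_bigrams(toy_lines):
    print(pair, round(score, 2))
```

On a real corpus the same counts could feed a clustering or network step; the threshold would need tuning to the corpus size.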


Sara Brumfield 2013

14 of 23


15 of 23

Machine Translation and Automated Analysis of Cuneiform Languages (MTAAC)

University of Toronto

  • Heather Baker (Assyriology)
  • Émilie Pagé-Perron (Assyriology)

UCLA

  • Robert K. Englund (Assyriology)
  • Prashant Rajput (Comp. Sc.)

University of Frankfurt

  • Christian Chiarcos (Comp. ling.)
  • Maria Sukhareva (Comp. ling.)
  • Ilya Khait (Assyriology)


16 of 23

Administrative accounts

  • Represent 90% of all cuneiform inscriptions
  • Remain mostly untranslated
  • Rich in information about society and economy
  • Mostly homogeneous and formulaic


17 of 23

NLP Pipeline

  • Pre-processing
  • Morphological analysis
  • Part of Speech tagging
  • Syntactic parsing
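
The first three stages listed above can be sketched as a chain of small functions. This is a toy under stated assumptions, not the MTAAC implementation: the flag characters follow common ATF usage for damaged, uncertain, corrected, and broken signs, while the suffix table, the lexicon, and all function names are hypothetical and far too small for real use. Syntactic parsing is left out of the sketch.

```python
import re

# Pre-processing: editorial flags to strip from a transliterated line.
FLAGS = re.compile(r"[#?!*\[\]]")

# Toy resources, invented for this example.
CASE_SUFFIXES = {"ta": "ABL", "ra": "DAT", "sze3": "TERM"}
LEXICON = {"lugal": "N", "udu": "N", "ki": "N", "dumu": "N"}


def preprocess(line):
    """Remove editorial flags and split the line into word tokens."""
    return FLAGS.sub("", line).split()


def analyse(token):
    """Toy morphological analysis: split off a final case marker if known."""
    stem, sep, last = token.rpartition("-")
    if sep and last in CASE_SUFFIXES:
        return stem, CASE_SUFFIXES[last]
    return token, None


def tag(stem):
    """Toy part-of-speech tagging by lexicon lookup."""
    return LEXICON.get(stem, "UNK")


def run_pipeline(line):
    """Run the three stages in order and collect one record per token."""
    records = []
    for token in preprocess(line):
        stem, case = analyse(token)
        records.append(
            {"form": token, "stem": stem, "case": case, "pos": tag(stem)}
        )
    return records


for record in run_pipeline("ki# [lugal]-ta"):
    print(record)
```

Keeping each stage as a separate function means a lookup table can later be swapped for a trained model without changing the stages around it.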


18 of 23

Ontologies

Adopted:

  • Ontologies of Linguistic Annotation

Potential:

  • CIDOC-CRM (ModRef Project)
  • Pleiades
  • PeriodO
  • Snap:drgn
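
A minimal sketch of what linking to such resources could look like: each statement is written as an N-Triples line connecting a text to a place and a period. The predicates are real Dublin Core terms, but every `example.org` URI below is a placeholder and the `triple` helper is hypothetical; real links would use the identifiers published by the gazetteers themselves.

```python
DCTERMS = "http://purl.org/dc/terms/"


def triple(subject, predicate, obj, literal=False):
    """Serialise one statement as an N-Triples line."""
    if literal:
        escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
        obj_term = f'"{escaped}"'
    else:
        obj_term = f"<{obj}>"
    return f"<{subject}> <{predicate}> {obj_term} ."


# Placeholder identifiers, for illustration only.
text_uri = "https://example.org/text/P000000"
place_uri = "https://example.org/place/hypothetical-site"
period_uri = "https://example.org/period/hypothetical-period"

statements = [
    triple(text_uri, DCTERMS + "title", "Hypothetical text", literal=True),
    triple(text_uri, DCTERMS + "spatial", place_uri),
    triple(text_uri, DCTERMS + "temporal", period_uri),
]

print("\n".join(statements))
```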


"If Open Access is about removing barriers, Linked Open Data is about creating bridges."

19 of 23

Preservation and sustainability

  • Public repositories for data & code

  • Multiple international backup and service mirrors

  • Codebase maintenance and updates


20 of 23

International Backups


21 of 23

CDLI Framework Update Project "Phoenix"

  • Make the code base more sustainable
  • Document the initiative's code and practices throughout
  • Enable graduate students in the humanities to code extensions themselves
  • Renew data displays for usability
  • Become a model of best practice in web accessibility


22 of 23

Salient points

  • Shared encoding scheme for cuneiform languages
  • Shared NLP toolkit
  • Integrating machine learning into research methods
  • Multidisciplinary collaboration
  • Linked Open Data


23 of 23

Thank you!


Contact

Émilie Pagé-Perron

Near and Middle Eastern Civilizations

University of Toronto

epageperron@gmail.com

@sohnyrin

CDLI

Robert K. Englund, UCLA, Director

Jürgen Renn, MPIWG, Co-Director

Jacob Dahl, University of Oxford, Co-PI

Bertrand Lafont, CNRS, Co-PI

Émilie Pagé-Perron, UofT, Co-PI

Collaborators

Christian Chiarcos, University of Frankfurt

Maria Sukhareva, University of Frankfurt

Ilya Khait, Berlin Free University