1 of 56

ETDSuite: A Toolkit to Mine Electronic Theses and Dissertations To Enrich Scholarly Big Data Using Natural Language Processing and Computer Vision

Presented By:

Muntabir Hasan Choudhury

Ph.D. Candidate

Advisor: Dr. Jian Wu

Committee Members: Michael L. Nelson, Michele C. Weigle, Sampath Jayarathna, Edward A. Fox (Virginia Tech)

Department of Computer Science

Old Dominion University, Norfolk, Virginia

November 6, 2024

@TasinChoudhury @WebSciDL

Dissertation Defense Examination

2 of 56

Outline

  • Overview
  • Research Problem
  • Research Questions
  • Research Goals & Objectives
  • Dataset & Benchmarks
  • Related Work
  • Methodology
  • Evaluation & Results
  • Conclusion

ETDSuite: A Toolkit to Mine ETDs To Enrich Scholarly Big Data using NLP & CV @TasinChoudhury @WebSciDL, Dissertation Defense 11/06/24

3 of 56

What are Electronic Theses and Dissertations (ETDs)?

  • Since 1997, when Virginia Tech pioneered the practice, students have submitted theses electronically.
  • ETDs represent the scholarly work of students pursuing higher education, submitted in partial fulfillment of degree requirements.
  • Hosted by university library repositories or ProQuest.
  • Different from regular scholarly papers (journals and conference proceedings):
    • Topics may shift across chapters, and together the chapters present the major contributions of a research work.
    • Contain rich metadata, bibliographies, figures, tables, and the latest discoveries.
    • The metadata schema (e.g., degree, department, disciplines) differs from that of regular scholarly papers.
    • Citation styles vary across academic disciplines.

4 of 56

ETDs have Layouts Different from Regular Scholarly Articles

Page types shown: Abstract, Acknowledgement, Table of Contents

5 of 56

ETDs have Layouts Different from Regular Scholarly Articles

Page types shown: Chapter, List of Figures, List of Tables

6 of 56

ETDs have Layouts Different from Regular Scholarly Articles

Page types shown: Appendix, Chapter Abstract, Curriculum Vitae

7 of 56

Mining ETDs is Important

  • ETDs contain detailed domain knowledge, metadata, extensive bibliographies, and useful graphs, tables, and figures, which are their key document elements.
  • Mining ETDs can be beneficial:
    • Improves discoverability and readability of long documents.
    • Enhances graduate education by aligning curricula with academic and industry demands.
    • Enriches and enhances metadata quality for digital libraries (DL).
  • ETDs should be accessible to researchers, students, and librarians.
  • However, ETDs are still understudied because of their length (e.g., 100 – 200 pages).
  • ETD repositories have limited computational tools and services to extract, parse, and segment ETDs.
    • This makes it hard to access and discover the knowledge buried in ETDs.

8 of 56

Extracting ETD Metadata Automatically is Challenging

  • Metadata represents a key aspect of digital objects for discoverability.
  • For developing any scalable DL system, metadata is crucial in retrieving relevant documents.
  • However, we found discrepancies in library-provided ground-truth metadata.
  • Moreover, ETDs in repositories are often accompanied by incomplete, little, or no metadata:
    • Posing great challenges to accessibility.
    • Leading to improper indexing and inaccurate retrieval.
  • GROBID, CERMINE, and ParsCit were designed for extracting metadata from regular scholarly papers and failed to generalize well to ETDs.
  • In addition, the visual layout of the cover page differs across universities.

9 of 56

Cover Pages of ETDs Across Universities – Extracting Metadata is Challenging

Figure: annotated cover pages from MIT (1965), Stanford (1970), and the University of Michigan (1971). Labeled fields: Title, Author, University, Degree, Year, Program, and Advisor (labeled on one page only).

10 of 56

ETD Cover Page Layouts are Evolving – Extracting Metadata became Challenging

Figure 1: Measurement of the cosine distance with respect to the ETD in 1970

11 of 56

Metadata Represents a Key Aspect of Digital Objects but Quality Becomes a Concern

  • DL systems have adopted Dublin Core (DC) to standardize metadata formats (e.g., ETD-MS v1.1).
    • Bui and Park [1] analyzed 659 metadata item records and found:
      • Inaccurate, incomplete, and inconsistent metadata elements.
      • For example, dates appear as both yyyy-mm-dd and mm-dd-yyyy.
    • Upon investigation, we found missing values [2] (Figure 2).
    • Existing work:
      • Proposed a framework for assessing quality and mechanisms [3].
      • Crowdsourcing – letting users correct metadata [4].
      • Drawbacks:
        • Difficult to control the user population.
        • Slow and thus not scalable.

Figure 2: Missing Values: Year (87%), Department (55%), University (98%)

12 of 56

Optical Character Recognition (OCR) is Challenging for Scanned ETDs

  • ETDs can be scanned or born-digital.
  • For scanned ETDs, OCR is one of the first steps in any further work with the ETD data.
  • Recognizing text from scanned ETDs is challenging:
    • Poor image resolution.
    • Typewritten text.
    • Imperfection of OCR technology.

Figure 3: OCR Challenges for ETDs: (i) scribble, (ii) stamp, (iii) overlapped letters, and (iv) copyright character

Figure 4: Noisy OCR Text using Tesseract

13 of 56

Limitations of the Existing Frameworks to Segment ETDs

  • Ahuja et al. [1] proposed an object detection model by fine-tuning YOLOv7 [2].
    • A bottom-up approach that automatically annotates major structural components.
    • Underperformed in detecting minority classes (e.g., date, degree, equation).
  • Multimodal frameworks for visual document understanding:
    • Differ in methodological approach and in the pre-training tasks they introduce.
    • Evaluated on the RVL-CDIP dataset [3] for document image classification.
    • Despite the novelty of their architectures, when fine-tuned on ETDs:
      • LayoutLMv2-base [4] achieved 9% accuracy.
      • DocFormer [5] achieved 28% accuracy.

14 of 56

Segmentation of ETDs is Challenging Due to Lack of Labeled Dataset

  • Large labeled training datasets are lacking [1].
  • The distribution of available ETD pages is skewed:
      • It is easy to find “Chapter” pages, but
      • Minority classes are scarce (Figure 5).
  • Solution: Adding synthesized labeled samples for the minority classes.
      • Document Domain Randomization [2] can generate synthesized, labeled scholarly samples by randomizing layout, font styles, and semantics.
      • Key Randomization Aspects: alters font styles/sizes, non-text elements (tables, figures), and spatial relationships to mimic diverse document layouts.
      • However, it lacks the controlled structural patterns needed to accurately mimic ETD documents, making it less effective for generating realistic ETD pages.

Figure 5: Number of training samples for each class

15 of 56

Accurately Parsing Citation Strings from ETDs is Challenging

  • Citation parsing is necessary for developing citation graphs representing connections between published works.
  • Academic disciplines have adopted different citation styles in their publications.
  • For example:
    • APA format – Education
    • MLA format – Humanities
  • More than 3,000 different citation styles are found in ETDs from various disciplines [1].
  • Due to the variation in citation styles, parsing citations accurately is challenging.
  • Existing frameworks such as Neural ParsCit [2] use deep learning to overcome the challenge of parsing citations accurately.
    • However, the model was trained on focused domains (e.g., Computer Science) with fewer citation styles.

16 of 56

Research Questions

  • We raised four research questions (RQs) to address the research problems.
  • RQ1: Can we develop an AI method to extract metadata from the cover pages of scanned and born-digital ETDs?
  • RQ2: Library-provided metadata often exhibits incomplete, inconsistent, and incorrect values. How can we leverage AI methods to improve metadata quality?
  • RQ3: Will latent features that encode text and vision modalities outperform latent features obtained from a single modality in ETD page classification?
  • RQ4: Is it possible to design a universal parser that accurately parses metadata from multi-style and multi-type citations as they appear in ETDs?

17 of 56

Research Goals & Objectives

  • Our goal is to develop a software toolkit called ETDSuite, containing advanced machine learning methods, designed to transform raw ETDs into structured, enriched scholarly data through metadata extraction (i.e., RQ1), metadata enhancement (i.e., RQ2), page segmentation (i.e., RQ3), and citation parsing (i.e., RQ4), leveraging natural language processing and computer vision models.
    • To address the four RQs, we developed the following frameworks:
      • AutoMeta [1], a framework to extract metadata automatically.
      • MetaEnhance [2], a framework to improve metadata quality.
      • ETDPC [3], a framework to classify ETD pages by different types.
      • LMParsCit, a framework to parse citation strings in many styles.

18 of 56

Datasets & Benchmarks for RQ1 (i.e., AutoMeta), RQ2 (i.e., MetaEnhance), RQ3 (i.e., ETDPC), and RQ4 (i.e., LMParsCit)

| Dataset | Count | Description | Format | Task |
|---|---|---|---|---|
| AutoMeta-ETD500 | 500 | Scanned ETDs with annotated metadata of ETD cover pages | XML / PDF / PNG | Used for extracting metadata from scanned ETDs |
| ETDPC-ETD500 | 92,371 | ETD pages manually labeled into 13 categories, with positions and sizes of the text and the bounding boxes (bbox) extracted by AWS Textract, an OCR tool | TXT / JSON / PNG | Used for classifying ETD pages |
| MetaEnhance-ETDQual500 | 500 | A 500-ETD evaluation benchmark built by combining subsets (i.e., university, year, department, degree fields) using multiple criteria. Each criterion ensures that multiple error types (e.g., missing values, misspellings, and incorrect values) are represented | PDF / CSV | Used for improving and enhancing the quality of metadata |
| ETDCite | 1,653 | A benchmark of annotated citation strings, collected from 20 ETDs, representing 5 citation types (e.g., journals, conferences, books) and 18 citation styles (e.g., APA, MLA, Harvard, Chicago) | JSONL | Used for parsing citation strings in many styles |

Dataset is available on Harvard Dataverse:

AutoMeta-ETD500: https://doi.org/10.7910/DVN/18D6AZ

ETDPC-ETD500: https://doi.org/10.7910/DVN/MSFVLQ

MetaEnhance-ETDQual500: https://doi.org/10.7910/DVN/PI3U1V

ETDCite: https://doi.org/10.7910/DVN/ANU6LM

Table 1: Datasets and benchmarks compiled by crawling 114 US universities, from which over 533,047 full-text ETDs were collected.

19 of 56

RQ1: General Machine Learning Approaches to Extract Metadata Designed for Journal and Conference Articles

  • GROBID [1]:
    • Developed to extract bibliographic metadata from born-digital papers.
    • Based on 11 Conditional Random Field (CRF) models that apply sequence tagging to lexical and layout information.
    • Can extract header and bibliographic metadata (e.g., title, authors, affiliations, etc.).
    • Achieved an accuracy of 74.9% but failed to extract metadata from scanned documents.
  • CERMINE [2]:
    • Developed to extract structured bibliographic data (e.g., title, author, volume, issue, etc.).
    • Hybrid approach: support vector machines [3] and simple rule-based models.
    • Achieved an average F1 score of 0.775 for most metadata types.
    • Limitation – scanned pages will not be properly processed.

[1] Lopez, P. (2009). GROBID: Combining Automatic Bibliographic Data Recognition and Term Extraction for Scholarship Publications. ECDL’09.

[2] Tkaczyk, D. et al. CERMINE: Automatic Extraction of Structured Metadata from Scientific Literature. IJDAR.

[3] Shmilovici, A. (2005). Support Vector Machines. In: Maimon, O., Rokach, L. (eds) Data Mining and Knowledge Discovery Handbook. Springer, Boston.

20 of 56

RQ1: Heuristic Based Rules and Performance on Seven Key Metadata Fields [1]

Figure 6: Rules for extracting metadata, accuracy measures. Acln% and Aocr% are based on TXT-clean and TXT-OCR datasets, respectively.

Figure 7: Example of Degree Extraction from MIT and Virginia Tech Sample

Regular Expression

21 of 56

RQ1: Developed AutoMeta Using CRF Model with Sequence Tagging Method by Incorporating Textual & Visual Features for Scanned ETDs

  • CRF is a discriminative model.
  • Features can depend on each other, and the model considers future observations.
  • We tagged each token of the annotated fields following the BIO tagging schema.
  • The BIO tagging schema has been applied to Named Entity Recognition [1] and Key Phrase Extraction [2].

  • We also tagged each token with Parts of Speech (POS).

Tags shown in Figure 8: B-title, I-title, I-title, B-author, I-author, I-author

[1] Wu et al. (2017). Hesdk: A hybrid approach to extracting scientific domain knowledge entities. JCDL’17.

[2] Councill et al. (2008). ParsCit: an open-source CRF reference string parsing package. LREC’08.

Figure 8: Sequence tagging using BIO schema
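
As a rough illustration of the BIO schema in Figure 8, the sketch below assigns B- and I- tags to the tokens of each annotated field and O to everything else. The function name bio_tag, the (start, end, field) span format, and the sample tokens are assumptions made for this example; they are not taken from AutoMeta.

```python
from typing import List, Tuple


def bio_tag(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Return one BIO tag per token.

    Each span is (start, end, field) over token indices, with end exclusive.
    Tokens outside every span are tagged "O".
    """
    tags = ["O"] * len(tokens)
    for start, end, field in spans:
        tags[start] = f"B-{field}"          # first token of the field
        for i in range(start + 1, end):
            tags[i] = f"I-{field}"          # remaining tokens of the field
    return tags


# Hypothetical cover-page tokens; the title and author are made up.
tokens = ["Mining", "Scanned", "Theses", "by", "Jane", "Q.", "Doe"]
spans = [(0, 3, "title"), (4, 7, "author")]
tags = bio_tag(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

The resulting tag sequence is what a CRF sequence tagger would be trained to predict, one label per token.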

22 of 56

RQ1: Textual Features for AutoMeta

  • Whether all the characters in the string are uppercase or lowercase.
  • Whether all the characters in the string are numeric.
  • Last three characters of the current token.
  • Last two characters of the current token.
  • POS tag of each token, assigned based on its context.
  • Last two characters in the POS tag of the current token.
  • Whether the first character of consecutive tokens is uppercase.
  • Whether the first character is uppercase for the token that is not at the beginning or end of the document.
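
The textual features listed above can be sketched as a per-token feature dictionary, the form commonly fed to CRF toolkits. This is a minimal sketch: the function name token_features, the feature keys, and the sample tokens are assumptions, and the POS tags are supplied by hand here rather than produced by a tagger.

```python
from typing import Dict, List


def token_features(tokens: List[str], pos_tags: List[str], i: int) -> Dict[str, object]:
    """Build a feature dictionary for the token at position i."""
    tok, pos = tokens[i], pos_tags[i]
    starts_upper = tok[:1].isupper()
    next_starts_upper = i + 1 < len(tokens) and tokens[i + 1][:1].isupper()
    feats: Dict[str, object] = {
        "is_upper": tok.isupper(),        # all characters uppercase
        "is_lower": tok.islower(),        # all characters lowercase
        "is_numeric": tok.isdigit(),      # all characters numeric
        "suffix3": tok[-3:],              # last three characters
        "suffix2": tok[-2:],              # last two characters
        "pos": pos,                       # POS tag of the token
        "pos_suffix2": pos[-2:],          # last two characters of the POS tag
        # current and next token both start with an uppercase letter
        "consecutive_capitalized": starts_upper and next_starts_upper,
    }
    # Capitalization flag only for tokens that are neither first nor last.
    if 0 < i < len(tokens) - 1:
        feats["interior_capitalized"] = starts_upper
    return feats


# Hypothetical tokens with hand-assigned POS tags (NNP = proper noun).
tokens = ["Old", "Dominion", "University", "1970"]
pos_tags = ["NNP", "NNP", "NNP", "CD"]
features = [token_features(tokens, pos_tags, i) for i in range(len(tokens))]
print(features[1])
```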

Figure 9: TXT-clean - a cleansed version of TXT-OCR

Proper Noun (NNP)

23 of 56

RQ1: Visual Features for AutoMeta

  • When annotating a document:
    • Humans leverage visual or text features or both.
    • For example, the thesis title usually appears at the top of the cover page.
  • Visual information is represented by the corner coordinates of the bounding box (Bbox) of a text span.
  • This information is available from hOCR files (an XML-based format).
  • Visual features:
    • Left margin – x1 as a feature for all tokens in the same line.
    • Upper left corner – y2 as a feature for all tokens.
    • Bottom right corner – y1 as a feature for all tokens.
    • All features are normalized to the range 0 to 1.
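
A minimal sketch of how bounding-box features could be read from hOCR and normalized. hOCR stores a "bbox x0 y0 x1 y1" property in an element's title attribute; dividing by the page width and height is an assumed normalization, and the function names and sample values are hypothetical. How these four values map onto the x1, y1, and y2 names above depends on the convention in Figure 10, which is not assumed here.

```python
import re
from typing import Tuple

# Matches the bbox property inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def parse_bbox(title_attr: str) -> Tuple[int, int, int, int]:
    """Extract (left, top, right, bottom) pixel coordinates from a title attribute."""
    match = BBOX_RE.search(title_attr)
    if match is None:
        raise ValueError(f"no bbox found in: {title_attr!r}")
    left, top, right, bottom = (int(g) for g in match.groups())
    return left, top, right, bottom


def normalize_bbox(
    bbox: Tuple[int, int, int, int], page_width: int, page_height: int
) -> Tuple[float, float, float, float]:
    """Scale pixel coordinates into the range 0 to 1 using the page size."""
    left, top, right, bottom = bbox
    return (
        left / page_width,
        top / page_height,
        right / page_width,
        bottom / page_height,
    )


# Hypothetical word on a 1200 x 1600 pixel page image.
bbox = parse_bbox("bbox 300 150 900 210; x_wconf 93")
print(normalize_bbox(bbox, page_width=1200, page_height=1600))
```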

Figure 10: Bounding Box measurement for a token

24 of 56

RQ1: AutoMeta, a CRF Model Incorporating Text and Visual Features, Achieved F1 of 0.813 – 0.96 and Increased F1 Significantly, by 0.7% to 10.6%

Figure 11: Performance (F1) comparison of the models

25 of 56

RQ2: Existing Works on Metadata Quality Improvement Rely on Manual Approaches, Which are Slow and Not Scalable

  • Manual Correction:
    • When Microsoft Academic was online:
      • Allowed users to change header information (e.g., title, authors, year, abstract, DOI, URL, etc.)
    • Wu et al. [1] proposed crowdsourcing to improve the metadata quality for CiteSeerX.
    • Showed that user corrections were a reliable source of high-quality metadata.
    • Limitation: the paper only examined authors and titles.
  • Semi-automatic Approach:
    • Inconsistent metadata exists because records do not follow the DC schema [2].
    • Emphasized measures including accuracy, completeness, and consistency.
    • Compared published methods that proposed guidelines, best practices, and quality assurance.
    • Advocated the development of a framework for assessing the quality and mechanisms.
    • Limitation: no AI-based framework has been proposed and implemented.

26 of 56

RQ2: Proposed MetaEnhance, Which Uses AI Methods and Models to Automatically Improve Metadata Quality and Is More Scalable Than Manual Approaches

Figure 12: MetaEnhance Framework – Error Detection, Error Correction, and Canonicalization

27 of 56

RQ2: Error Detection (incorrect or missing) and Correction (AutoMeta)

  • Title (dc.title): may contain incorrect or missing values.
    • For example, incorrect title: DMA Recitals
      • The correct title is: Expanding Vision: The Music of Alyssa Morris.
    • Solution:
      • Adopted the title classifier developed by Rohatgi et al. [1] to detect errors.
      • The classifier was evaluated on the SciDocs [2] dataset.
      • Achieved an F1 score of 0.96.
      • Features: # of tokens, # of special characters, # of capital letters, stop words, consecutive punctuations, and minimum, median, and max TF-IDF.

28 of 56

RQ2: Error Detection (incorrect or missing) and Correction (AutoMeta)

  • Author and Advisor (dc.creator and dc.contributor): may contain incorrect name values.
  • Solution:
    • Used the named entity recognition (NER) model implemented in the FlairNLP [1] package.
    • The model was pre-trained and evaluated on the OntoNotes dataset.
    • Achieved an F1 score of 0.90.
    • An error is detected if the name is classified as a type other than PERSON.
    • We observed that many advisor names include their respective role:
      • For example: Mark Pankow, Co-Chair (dc.contributor.role)
      • Used a regular expression to parse the names and store the role in a separate column.
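
One way to sketch the role-splitting step is a regular expression that peels a trailing role off the advisor string. The pattern, the small role vocabulary (Chair, Advisor, Member, with an optional Co- prefix), and the function name split_name_role are assumptions for illustration; the actual expression used in MetaEnhance is not shown on the slide.

```python
import re
from typing import Optional, Tuple

# Assumed role vocabulary; a real list would likely be longer.
ROLE_RE = re.compile(
    r"^\s*(?P<name>[^,]+?)\s*,\s*(?P<role>(?:Co-)?(?:Chair|Advisor|Member))\s*$",
    re.IGNORECASE,
)


def split_name_role(value: str) -> Tuple[str, Optional[str]]:
    """Split 'Name, Role' into (name, role); role is None when absent."""
    match = ROLE_RE.match(value)
    if match:
        return match.group("name"), match.group("role")
    return value.strip(), None


print(split_name_role("Mark Pankow, Co-Chair"))
print(split_name_role("Jian Wu"))
```

Restricting the second part to known role words keeps a value written as "Last, First" from being split by mistake.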

[1] Akbik et al. (2018). Contextual String Embeddings for Sequence Labeling. COLING’2018

29 of 56

RQ2: Error Detection (acronyms) and Correction (canonicalization)

  • We observed acronyms or colloquial names for the following fields (Table 2):

  • Solution: adopted a dictionary-based approach.
    • University – compiled 832 names of universities and their acronyms.
    • Degree – compiled 234 degree names and their acronyms.
    • Department – compiled 232 different academic department names and their acronyms.
    • Performed a fuzzy string matching against the corresponding dictionary.
    • To disambiguate department surface names (e.g., Dept of CS), used Sentence Transformers [1].
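
The dictionary lookup with fuzzy matching can be sketched with Python's standard difflib module. The tiny dictionary, the 0.85 cutoff, and the function name canonicalize are assumptions for this example; the slide does not say which fuzzy-matching library or threshold was used, and the Sentence Transformers disambiguation step is not covered here.

```python
import difflib
from typing import Dict, Optional

# Hypothetical miniature dictionary: surface form (lowercased) -> canonical name.
UNIVERSITIES: Dict[str, str] = {
    "jhu": "Johns Hopkins University",
    "johns hopkins university": "Johns Hopkins University",
    "odu": "Old Dominion University",
    "old dominion university": "Old Dominion University",
}


def canonicalize(value: str, dictionary: Dict[str, str], cutoff: float = 0.85) -> Optional[str]:
    """Map a surface value to its canonical name, or None if nothing is close enough."""
    key = value.strip().lower()
    if key in dictionary:                      # exact hit (acronym or full name)
        return dictionary[key]
    close = difflib.get_close_matches(key, list(dictionary), n=1, cutoff=cutoff)
    return dictionary[close[0]] if close else None


print(canonicalize("JHU", UNIVERSITIES))
print(canonicalize("Johns Hopkins Univeristy", UNIVERSITIES))
print(canonicalize("xyz", UNIVERSITIES))
```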

| Field | Acronym/Colloquial | Full Name |
|---|---|---|
| University (thesis.degree.generator) | JHU, jhu | Johns Hopkins University |
| Degree (thesis.degree.name) | M.PHIL, MPHIL | Master of Philosophy |
| Department (thesis.degree.discipline) | MSE, MSCE | Materials Science and Engineering |

[1] Nils Reimers and Iryna Gurevych. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP-IJCNLP’19.

Table 2: Example of acronym/colloquial names and their full names

30 of 56

RQ2: Error Detection (may contain misspellings) and Correction

  • We also observed misspellings for the department field.
    • For example: “College of Muisc” and “Graduhte Studies in English”.
    • Solution: adopted a Python library – pyspellchecker.
      • Uses the Levenshtein distance algorithm to compute the minimum edit distance.
      • Generates candidate permutations through insertion, deletion, and substitution.
      • If the edit distance between the original word and the field value is >= 2, it is considered a spelling error.
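
For reference, the minimum edit distance mentioned above can be computed with the classic dynamic-programming recurrence. This is a plain Levenshtein sketch (insertion, deletion, substitution, each costing 1); it is not pyspellchecker's internal code, which may be organized differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))             # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        curr = [i]                             # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(
                min(
                    prev[j] + 1,               # delete char_a
                    curr[j - 1] + 1,           # insert char_b
                    prev[j - 1] + cost,        # substitute (or match)
                )
            )
        prev = curr
    return prev[-1]


print(levenshtein("Muisc", "Music"))
```

Under this definition a swap of two adjacent letters, as in "Muisc", counts as two edits.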

31 of 56

RQ2: Error Detection (inconsistency) and Correction (need canonicalization)

  • Year (dc.date.issued): the format is inconsistent across libraries.
    • For example: mm-dd-yyyy or yyyy-mm-dd.
    • Solution:
      • First, verified that the date field is valid using to_datetime from the pandas library.
      • Second, checked the year against a dictionary listing the range 1880 – 2023.
      • If an error is found, the correction module uses AutoMeta to overwrite the value.
      • To canonicalize the surface value:
        • Used to_datetime from the pandas library to parse the date field.
        • Extracted the year, month, and day.
        • Finally, stored the values in three separate columns.
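
The validate-then-canonicalize flow can be sketched as follows. To keep the example dependency-free it uses datetime.strptime from the standard library as a stand-in for the pandas to_datetime call named above; the two accepted formats, the function name, and the return shape are assumptions for illustration.

```python
from datetime import datetime
from typing import Optional, Tuple

# Assumed input formats, taken from the two examples on the slide.
FORMATS = ("%Y-%m-%d", "%m-%d-%Y")


def canonicalize_date(
    value: str, min_year: int = 1880, max_year: int = 2023
) -> Optional[Tuple[int, int, int]]:
    """Return (year, month, day), or None if the value is invalid or out of range."""
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue                            # try the next format
        if min_year <= parsed.year <= max_year:
            return parsed.year, parsed.month, parsed.day
        return None                             # parsed, but year outside the range
    return None                                 # no format matched: flag as an error


print(canonicalize_date("2019-05-17"))
print(canonicalize_date("05-17-2019"))
print(canonicalize_date("2019-13-40"))
```

A None result corresponds to the case where the correction module would fall back to the value extracted by AutoMeta.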

32 of 56

RQ2: Evaluation Benchmark Dataset (i.e., ETDQual500) – Distribution of Errors

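| Field | #Missing | #Acronym | #Spell | #Incorrect |
|---|---|---|---|---|
| Title | 0 | 0 | 0 | 1 |
| Author | 2 | 0 | 0 | 0 |
| Advisor | 150 | 35 | 0 | 0 |
| University | 6 | 43 | 0 | 0 |
| Year | 172 | 1 | 0 | 0 |
| Degree | 156 | 82 | 0 | 4 |
| Department | 269 | 85 | 2 | 0 |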

Table 3: Distribution of errors in the 500-ETD benchmark dataset.

33 of 56

RQ2: MetaEnhance Achieved Nearly Perfect F1 Scores in Detecting Errors and 0.85 to 1.00 For Correcting Five of Seven Key Metadata Fields

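| Field | P (ED) | R (ED) | F1 (ED) | P (ECC) | R (ECC) | F1 (ECC) |
|---|---|---|---|---|---|---|
| Title | 0.997 | 1.0 | 0.998 | 0.0 | 0.0 | 0.0 |
| Author | 0.996 | 1.0 | 0.997 | 0.0 | 0.0 | 0.0 |
| Advisor | 0.920 | 0.990 | 0.950 | 1.0 | 1.0 | 1.0 |
| University | 1.0 | 1.0 | 1.0 | 0.740 | 1.0 | 0.850 |
| Year | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Degree | 1.0 | 1.0 | 1.0 | 0.980 | 1.0 | 0.980 |
| Department | 0.996 | 1.0 | 0.997 | 0.997 | 1.0 | 0.980 |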

Table 4: Performance of Error Detection (ED) and Error Correction and Canonicalization (ECC). P, R, and F1 denote precision, recall, and F1 score.
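
As a reminder of how the F1 columns relate to precision and recall, here is a small sketch using the standard harmonic-mean definition. The function name is an assumption; the sample inputs are the University ECC precision and recall from Table 4.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# University, Error Correction and Canonicalization: P = 0.740, R = 1.0
print(round(f1_score(0.740, 1.0), 2))
```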

34 of 56

RQ3: Existing Works That Fine-Tune Vision-Based Models to Segment ETDs Yield Low Performance

  • Fine Tuning YOLOv7:
    • Existing multimodal frameworks performed well on general document layout analysis.
    • Ahuja et al. proposed to leverage visual features and fine-tune YOLOv7.
    • Limitation: lack of training samples led to low performance on minority classes.
  • Fine Tuning VGG16 [1]:
    • Fine-tuned a VGG16 model on the ETDPC-ETD500 dataset.
    • Compared against DocFormer and LayoutLMv2.
    • Surprisingly, the VGG16 model performed slightly better.

Figure 13: Comparison against baseline models. We report classification accuracy and macro F1 on the test set

35 of 56

Figure 14: ETDPC – A Multimodality Framework (I – Image, and T - Text)

RQ3: Proposed ETDPC, A Two Stream Multimodal Classification Model with Cross-Attention That Uses a Vision Encoder (e.g., ResNet-50v2) and a Text Encoder (e.g., BERT with Talking Heads Attention)

36 of 56

RQ3: Method – Adopted ResNet50 (Version 2) as a Visual Modality

  • We chose ResNet50 version 2 [1] as our visual encoder:
    • ResNet50v2 is an improvement over the original ResNet50 [2].
    • It performed better than the original ResNet on the ImageNet-1k [3] and CIFAR-10/100 datasets.

Figure 15: ResNet vs. ResNetv2

Source: https://shorturl.at/xBSZ0

[1] He et al. 2016. Identity Mappings in Deep Residual Networks. ECCV

[2] He et al. 2016. Deep Residual Learning for Image Recognition. CVPR

[3] Russakovsky et al. 2015. ImageNet Large Scale Visual Recognition Challenge. arXiv:1409.0575.

37 of 56

RQ3: Method – Adopted BERT with Talking-Heads as a Textual Modality

  • We used BERT with the Talking-Heads Attention [1].
    • Replaced multi-head attention and ordinary dense layer with GELU in BERT [2].
  • To generate the embeddings:
    • First, we used AWS Textract (an OCR tool) to extract text from the ETD images.
    • We used the pre-trained model of BERT with Talking Heads Attention (large).
    • We performed tokenization and extracted the following features:
      • Input type ids: which sequence a token belongs to when there is more than one
      • Input mask: whether a token should be attended or not
      • Input word ids: the indices corresponding to each token in the sentence
    • Finally, these features are fed through a trainable embedding layer.
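
To make the three input features concrete, the toy encoder below builds input word ids, an input mask, and input type ids for a single text sequence. It is only a sketch: the vocabulary and its ids are made up, and whitespace splitting stands in for the WordPiece tokenization that the pre-trained model's own preprocessing would perform.

```python
from typing import Dict, List

# Hypothetical toy vocabulary; real BERT vocabularies use different ids.
VOCAB: Dict[str, int] = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "list": 4, "of": 5, "figures": 6,
}


def encode(text: str, max_len: int = 8) -> Dict[str, List[int]]:
    """Encode one sequence into fixed-length BERT-style input features."""
    words = text.lower().split()[: max_len - 2]        # leave room for [CLS]/[SEP]
    tokens = ["[CLS]"] + words + ["[SEP]"]
    word_ids = [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokens]
    pad = max_len - len(word_ids)
    return {
        "input_word_ids": word_ids + [VOCAB["[PAD]"]] * pad,  # vocabulary indices
        "input_mask": [1] * len(word_ids) + [0] * pad,        # 1 = attend, 0 = padding
        "input_type_ids": [0] * max_len,                      # single sequence: all 0
    }


for name, values in encode("List of Figures").items():
    print(name, values)
```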

[1] Shazeer et al. 2020. Talking-Heads Attention. arXiv:2003.02436.

[2] Devlin et al. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT

38 of 56

RQ3: Method – Multimodal with Cross-Attention

  • The dimensions of the embeddings from both encoders are unified:
    • Applied a linear projection to map them to a common dimension.
    • Our model uses one 256-D projection layer with a dropout rate of 0.8.
    • Later, the model fetches each embedding projection and performs an early fusion.
  • To focus on the most important pixels of an image that relate to the corresponding text:
    • We used cross-attention [1, 2]:
      • Cross-attention asymmetrically combines two separate embedding sequences.
    • Further, we concatenated the early fusion of the projection layer with the attention sequence.
    • Finally, we pass it through a softmax layer for classification.
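
The asymmetry of cross-attention can be illustrated with a minimal single-head scaled dot-product version in plain Python: one sequence supplies the queries and the other supplies the keys and values. This sketch omits the learned projection matrices, dropout, fusion, and classification layers, and treating the text embeddings as queries and the image embeddings as keys and values is an assumption, not a detail confirmed by the slides.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """For each query vector, return an attention-weighted mix of the value vectors."""
    dim = len(keys[0])
    out_dim = len(values[0])
    output: Matrix = []
    for q in queries:
        # Scaled dot product between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        output.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(out_dim)]
        )
    return output


# Hypothetical 2-D embeddings: two text tokens attend over three image regions.
text_emb = [[1.0, 0.0], [0.0, 1.0]]
image_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = cross_attention(text_emb, image_emb, image_emb)
print(attended)
```

The output has one row per query, so its length follows the text sequence while its content is drawn from the image sequence.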

[1] Chen et al. CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification. ICCV

[2] Wei et al. Multi-Modality Cross Attention Network for Image and Sentence Matching. CVPR

39 of 56

RQ3: Proposed a Method to Augment Minority ETD Pages

  • To mitigate the data imbalance problem, we generate synthesized training samples.
    • We paraphrase the text extracted by the OCR (i.e., AWS Textract).
    • We convert the text into images.
    • We perform an image-based transformation.
  • Adopted Google’s PEGASUS [1] for paraphrasing.
  • Used ImgAug for image-based transformation, and adopted the following transformations [2]:
    • Additive Gaussian Noise – adds noise from Gaussian distributions elementwise to images.
    • Salt-and-pepper noise – replaces pixels in images with white & black-ish colors.
    • Gaussian blur – smooths the image with a Gaussian kernel (e.g., σ = 0.5).
    • Linear Contrast – incorporates the scanning effect (e.g., α = 1).
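
To show what one of these transformations does at the pixel level, the sketch below applies salt-and-pepper noise to a grayscale image stored as nested lists. It is a hand-written illustration, not ImgAug's implementation; the function name, the amount parameter, and the pure white/black replacement values are assumptions.

```python
import random
from typing import List, Optional

Image = List[List[int]]  # grayscale pixels in the range 0..255


def salt_and_pepper(image: Image, amount: float = 0.05, seed: Optional[int] = None) -> Image:
    """Return a copy where roughly `amount` of the pixels become white (255) or black (0)."""
    rng = random.Random(seed)
    noisy = [row[:] for row in image]          # copy so the input stays untouched
    for row in noisy:
        for x in range(len(row)):
            if rng.random() < amount:
                row[x] = 255 if rng.random() < 0.5 else 0
    return noisy


# Hypothetical 4 x 8 mid-gray "page" image.
page = [[128] * 8 for _ in range(4)]
for row in salt_and_pepper(page, amount=0.2, seed=42):
    print(row)
```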

[1] Zhang et al. 2020. PEGASUS: Pre-Training with Extracted Gap-Sentences for Abstractive Summarization. ICML

[2] Kahu et al. 2021. ScanBank: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations. JCDL

40 of 56

RQ3: Experiments – Fine Tuning Hyperparameters & Training

  • Hardware: a single Tesla V100-SXM2 GPU for training and a 12-core CPU for other tasks.
  • We heuristically tuned the hyperparameters.
    • Optimizer: Adam with a weight decay of 0.004.
    • Epsilon: 1e-07, clip value: 2.0, learning rate: 0.001, dropout rate: 0.8.
    • Loss: sparse categorical focal loss.
    • To avoid overfitting, we set up early stopping while monitoring validation loss.
  • Split the labeled ETD pages into training, validation (25%), and test (15%) sets.
  • Our model, with around 460M total parameters, is trained on the original and augmented ETD pages.
  • Trained the model with a batch size of 32 for 40 epochs, each taking around 1 hour.
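
For a single sample, the sparse categorical focal loss named above can be sketched as the cross-entropy term scaled by (1 - p_t)^gamma, where p_t is the predicted probability of the true class given as an integer label. The gamma value of 2.0 is a common default and an assumption here, since the slides do not state it; class weighting is omitted.

```python
import math
from typing import Sequence


def sparse_categorical_focal_loss(
    probs: Sequence[float], label: int, gamma: float = 2.0
) -> float:
    """Focal loss for one sample given class probabilities and an integer label."""
    p_t = max(probs[label], 1e-12)             # guard against log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# A confident, correct prediction is down-weighted far more than an uncertain one.
print(sparse_categorical_focal_loss([0.9, 0.05, 0.05], label=0))
print(sparse_categorical_focal_loss([0.4, 0.3, 0.3], label=0))
```

With gamma set to 0 the expression reduces to ordinary cross-entropy, which is why the focusing term mainly helps on imbalanced classes such as the minority page types.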

41 of 56

RQ3: ETDPC Outperforms the Previous Models, Achieving F1 of 0.84 – 0.96 for 9 out of 13 Categories

Figure 16: Performance on ETD samples in the test set – a) performance of the one-level classifier (i.e., Case a), where ETDPC is trained on ETDPC–ETD500; b) performance of the two-level classifier (i.e., Case b), trained first on chapter vs. non-chapter pages and then tested on the remaining categories, including 21,171 manually labeled samples; and c) performance of the two-level classifier (Case c), trained on 21,171 manually labeled samples and 5,984 augmented samples. We highlight the categories with remarkable improvements in F1 scores.

Case a: single-level classifier

Case b: two-level classifier

  • Level 1: chapter vs. non-chapter
  • Level 2: other categories of non-chapter

Case c: two-level classifier + data augmentation

42 of 56

RQ3: Performed an Ablation Study by Changing the Text Encoder in the Multimodal Model and by Using Individual Modalities

Figure 17: Ablation Study – a) Experiment 1 illustrates the performance increment when changing the original BERT model to BERT with Talking-Heads. b) Experiment 2 illustrates the performance of using the individual modalities vs. the multimodal model with cross-attention.

(a) Experiment 1

(b) Experiment 2

BERT with Talking-Heads (red) beats BERT-large (blue)

Multimodality (yellow) outperforms single modality (blue and red)


43 of 56

RQ4: Existing Methods Employed by Open-Source Citation Parsers to Extract Metadata Fields


| Citation Parser | Approach | Extracted Fields |
| --- | --- | --- |
| | Regular Expression | author, date, editor, genre, issue, pages, publisher, title, volume, year |
| BibPro [1] | Template Matching | author, title, venue, volume, issue, page, date, journal, booktitle, techReport |
| CERMINE [2] | ML (CRF) | author, issue, pages, title, volume, year, DOI, ISSN |
| GROBID [3] | ML (CRF) | authors, booktitle, date, editor, issue, journal, location, note, pages, publisher, title, volume, web, institution |
| ParsCit [4] | ML (CRF) | author, booktitle, date, editor, institution, journal, location, title, volume |
| Neural ParsCit [5] | Deep Learning (BiLSTM-CRF) | author, booktitle, date, editor, institution, journal, location, note, pages, publisher, tech, title, volume |
| TransParsCit [6] | Deep Learning (BiLSTM-Transformer-CRF) | author, container title, date, page, publisher, punc, title, volume |

[1] Chen et al. (2008). BibPro: A Citation Parser Based on Sequence Alignment Techniques. WAINA’08.

[2] Tkaczyk et al. CERMINE: Automatic Extraction of Structured Metadata from Scientific Literature. IJDAR.

[3] Lopez (2009). GROBID: Combining Automatic Bibliographic Data Recognition and Term Extraction for Scholarship Publications. ECDL’09.

[4] Councill et al. (2008). ParsCit: An Open-Source CRF Reference String Parsing Package. LREC’08.

[5] Prasad et al. (2018). Neural ParsCit: A Deep Learning Based Reference String Parser. IJDL.

[6] Uddin, MD S. (2022). TransParsCit: A Transformer-Based Citation Parser Trained on Large-Scale Synthesized Data. MS Thesis, Computer Science, Old Dominion University.

44 of 56

RQ4: Existing Training Datasets for the Citation Parsing Task


| Dataset Name | #Instances | Domain |
| --- | --- | --- |
| Cora | 1,295 | Computer Science |
| CiteSeer | 1,563 | Artificial Intelligence |
| UMass | 1,829 | STEM |
| FLUX-CiM HS | 2,000 | Health Science |
| PubMed Central | Varies | Biomedical |
| GROTOAP2 | 6,858 | Biomedical and Computer Science |
| Venice | 40,000 | Humanities |
| GIANT | 991 million (~1B) | Multi-Domain (~1,500 Citation Styles) |

45 of 56

RQ4: Existing Models Are Trained on Focused Domains, Which May Lead to Poor Performance on ETD Citation Strings Consisting of Many Styles from Various Academic Disciplines

  • Machine Learning Approach (i.e., CRF):
    • CERMINE
      • Used 4,000 parsed citations from CiteSeer, Cora-Ref, and PubMed Central (PMC).
      • Outperformed the existing tools (e.g., GROBID, ParsCit), achieving an F1 score of 0.933.
    • GROBID (current version – 0.7.3)
      • Achieved an F1 score of 0.87 on a PMC set of 1,943 PDFs containing ~90K references.
      • Achieved an F1 score of 0.89 on a set of 2,000 PDFs from bioRxiv.
  • Deep Learning Approach (i.e., BiLSTM & CRF):
    • Neural ParsCit
      • Used word embeddings and character-based word embeddings as features.
      • Trained on 4.3M reference strings extracted from the ACM Digital Library.
      • Achieved F1 scores of 0.86 – 0.99, depending on the metadata field.
      • Performed poorly (e.g., F1 scores of 0.51 – 0.83) on GIANT, which consists of citations in many styles.


46 of 56

RQ4: TransParsCit, Combining a Recurrent Neural Network (e.g., LSTM, BiLSTM) with Transformer and CRF, Performed Poorly at Parsing Citations

  • Deep Learning Approach (i.e., BiLSTM-Transformer-CRF):
    • TransParsCit, a deep learning model with Transformer [1] and CRF.
    • Trained on the large-scale GIANT dataset, covering about 1,500 citation styles.
    • Underperformed (e.g., F1 scores of 0.62 – 0.98) on the CORA dataset in extracting key metadata fields (e.g., title, date, volume, publisher) when compared against Neural ParsCit.
    • Two possible reasons:
      • Custom tokenizer (i.e., splitting a sentence into smaller units).
      • Combination of BiLSTM [2] and Transformer – the two rely on different mechanisms.


[1] Vaswani et al. (2017). Attention is all you need. NeurIPS’17.

[2] Schuster et al. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing.

47 of 56

RQ4: We Propose LMParsCit, an LLM-Based Model, and Contribute a Challenging Benchmark to Overcome the Limitations of Existing Methods

  • RNNs (e.g., BiLSTM) require large training data and more memory, and are computationally expensive.
  • The Transformer was introduced to reduce training time and improve performance.
    • Widely adopted for NLP tasks – Q&A, Sequence Classification, and NER tasks.
    • Achieved state-of-the-art performance.
  • Thus, adopting a Transformer-based model to replace the BiLSTM is a promising approach.
    • We propose LMParsCit, whose core architecture relies on an LLM (e.g., Llama3-8b-instruct).
      • Extract key metadata fields with high accuracy: title, authors, venue (i.e., container-title (e.g., journals, conferences, etc.) and publisher (e.g., institutions, organizations)), and year.
    • We contribute an evaluation benchmark, called ETDCite.
      • Consisting of 1,653 citation strings in many styles (e.g., APA, MLA, IEEE, ACM).
      • Covering multiple citation types (e.g., journals, conferences, etc.).
      • Representing various academic disciplines (e.g., arts, humanities, engineering, etc.).


48 of 56

RQ4: ETDCite – A Benchmark compiled from ETDs


Figure 18: ETDCite evaluation benchmark – distribution of ETD references across University, Discipline, Year, and Bibliography Types (e.g., J – Journal, Conf. – Conference, B – Book, In-B – In Book, TR – Tech Report, Th – Thesis)

49 of 56

RQ4: ETDCite – A Benchmark compiled from ETDs


Figure 19: Heatmap demonstrating the relationship between citation types and citation styles

50 of 56

RQ4: ETDCite – A Benchmark compiled from ETDs


Figure 20: Heatmap demonstrating the relationship between academic disciplines and citation styles

51 of 56

RQ4: ETDCite – A Benchmark compiled from ETDs


Figure 21: Heatmap demonstrating the relationship between academic disciplines and citation types

52 of 56

RQ4: Prompt Engineering – Conceptual Overview

  • Design inputs by providing examples (i.e., shots) to guide LLMs to produce desired outputs.
    • Instruction-tuned: LLMs trained to follow instructions in a conversation.

<instruction> Extract Metadata from the following citation: Smith, J. (2021). Deep Learning in Practice. <output> {“author”: “Smith, J.”, “title”: “Deep Learning in Practice”, “year”: “2021”}

    • Few-shot vs. Zero-shot:
      • In few-shot prompting, several examples (typically, n < 5) are provided to guide the model.
      • In zero-shot prompting, the model is asked to perform a task without seeing any example.
    • Chain-of-Thought: encourages the model to reason step by step, improving the quality of reasoning-based tasks.

Q: What is 12 + 15 divided by 3?

A: Let’s solve it step by step. 15 divided by 3 is 5. Now, add 12 and 5. The answer is 17.
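
The difference between zero-shot and few-shot prompts can be sketched in a few lines of Python. This is a minimal illustration, not the LMParsCit prompt: `build_prompt` is a hypothetical helper, the `<instruction>`/`<output>` layout simply mirrors the example on this slide, and the query citation is made up.

```python
import json

INSTRUCTION = "Extract Metadata from the following citation:"

def build_prompt(citation, shots=()):
    """Build a prompt; each shot is a (citation, fields) pair shown before the query."""
    parts = []
    for example_citation, fields in shots:
        parts.append(
            f"<instruction> {INSTRUCTION} {example_citation} <output> {json.dumps(fields)}"
        )
    # The query ends with an empty <output> slot for the model to complete.
    parts.append(f"<instruction> {INSTRUCTION} {citation} <output>")
    return "\n".join(parts)

shot = (
    "Smith, J. (2021). Deep Learning in Practice.",
    {"author": "Smith, J.", "title": "Deep Learning in Practice", "year": "2021"},
)
query = "Doe, A. (2019). Parsing References at Scale."  # made-up citation

zero_shot = build_prompt(query)                # no examples
few_shot = build_prompt(query, shots=[shot])   # one worked example first
```

Both prompts end with an open `<output>` slot; the few-shot variant only adds worked examples in front of the query.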


53 of 56

RQ4: LMParsCit – Llama3-8b-instruct Fine-tuned on GIANT 1B, Incorporating Prompts


Optimization:

  • Batch size of 2, with gradient accumulation over 4 steps.
  • Learning rate set to 2e-4.
  • AdamW optimizer with 8-bit precision and a weight decay of 0.01 (to help prevent overfitting).

Hardware:

  • 16 CPU cores, 64GB of RAM, and one A40 GPU with 46GB of video memory.
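
The optimization settings above imply an effective batch size, which the following sketch works out. `FineTuneConfig` is a hypothetical container used only for illustration, not a class from any training library, and the calculation assumes the single GPU listed under Hardware.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FineTuneConfig:
    """Hypothetical holder for the hyperparameters listed on this slide."""
    per_device_batch_size: int = 2
    gradient_accumulation_steps: int = 4
    learning_rate: float = 2e-4
    weight_decay: float = 0.01

    @property
    def effective_batch_size(self) -> int:
        # Gradients from several small batches are summed before each
        # optimizer step, so one update sees batch_size * steps examples
        # (single-GPU assumption).
        return self.per_device_batch_size * self.gradient_accumulation_steps

config = FineTuneConfig()
print(config.effective_batch_size)  # 8
```

Gradient accumulation keeps per-step memory at the level of a batch of 2 while each weight update is based on 8 examples.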

Figure 22: LMParsCit Framework

54 of 56

RQ4: LMParsCit Achieved State-of-the-art Performance on CORA Compared against Neural ParsCit


Figure 23: Performance comparison of two LLM-based methods (LMParsCit – fine-tuning only, and LMParsCitPrompt – fine-tuning & prompt engineering) on CORA

CORA:

  • Figure 23 shows that LMParsCit achieves an F1 score of 0.99 in all fields.
  • Compared against Neural ParsCit, LMParsCit improves performance by 0.6% to 6.94%, depending on the field.
  • Incorporating prompt engineering further improves performance over Neural ParsCit. For example:
    • The title field improves by 0.16% with fine-tuning only; incorporating prompts raises the improvement to 1.86%.
    • The venue field also sees a 6.94% improvement from Neural ParsCit to LMParsCit.
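
For readers unfamiliar with per-field F1, the sketch below shows one way such a score can be computed. It is an assumption for illustration only: the slides do not state whether F1 is computed at the token level or the field level, and this version uses exact-match, field-level scoring on made-up records. `field_f1` is a hypothetical helper.

```python
def field_f1(predictions, references, field):
    """Exact-match F1 for one metadata field over paired parsed citations."""
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        p, r = pred.get(field), ref.get(field)
        if p is not None and p == r:
            tp += 1                      # predicted value matches the reference
        else:
            if p is not None:
                fp += 1                  # predicted something wrong or spurious
            if r is not None:
                fn += 1                  # reference value was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up records: the second predicted title has a stray period.
references = [{"title": "A", "year": "2021"}, {"title": "B", "year": "2019"}]
predictions = [{"title": "A", "year": "2021"}, {"title": "B.", "year": "2019"}]

title_f1 = field_f1(predictions, references, "title")
year_f1 = field_f1(predictions, references, "year")
```

Under exact matching, a single stray character counts as both a false positive and a false negative, so scores computed this way are stricter than token-level scores.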

55 of 56

RQ4: LMParsCit Achieved State-of-the-art Performance on ETDCite Compared against Neural ParsCit


Figure 24: Performance comparison of two LLM-based methods (LMParsCit – fine-tuning only, and LMParsCitPrompt – fine-tuning & prompt engineering) on ETDCite

ETDCite:

  • Figure 24 shows that LMParsCit achieves F1 scores of 0.94 to 0.99, depending on the metadata field.
  • Compared against Neural ParsCit, LMParsCit improves performance by 0.8% to 32.8%, depending on the field.
  • Incorporating prompt engineering further improves performance over Neural ParsCit. For example:
    • The title field improves by 7.2% with fine-tuning only; incorporating prompts raises the improvement to 9.4%.
    • The author field sees the largest improvement: 28.4% with fine-tuning only, rising to 32.8% when prompts are incorporated.

56 of 56

Conclusion

  • We developed ETDSuite, a comprehensive suite consisting of four frameworks: AutoMeta, MetaEnhance, ETDPC, and LMParsCit, each targeting key challenges in mining ETDs.
  • AutoMeta, leveraging a CRF model with visual features, achieved F1 scores between 81.3% and 96% across various metadata fields, establishing it as a reliable solution for metadata extraction.
  • MetaEnhance uses AI-driven methods to detect, correct, and standardize metadata, achieving F1 scores ranging from 95% to 99%. This framework corrected errors in 85% to 98% of cases, significantly improving metadata quality and consistency.
  • ETDPC, a cross-attentive, two-stream multimodal model, attained F1 scores between 84% and 96% for 9 out of 13 ETD page categories, demonstrating high accuracy in document classification.
  • LMParsCit, powered by the Llama3-8b-instruct model, achieved F1 scores of 99% on CORA and 94% to 99% on ETDCite for extracting key metadata fields from diverse citation styles, types, and academic disciplines.
  • Through ETDSuite, we contributed valuable evaluation benchmarks (e.g., MetaEnhance-ETD500, ETDCite), datasets (e.g., ETD-500), and tools that enhance library repositories, addressing critical issues of accessibility, discoverability, and readability for ETDs.
