1 of 56

ETDSuite: A Toolkit to Mine Electronic Theses and Dissertations To Enrich Scholarly Big Data Using Natural Language Processing and Computer Vision

Presented By:

Muntabir Hasan Choudhury

Ph.D. Candidate

Advisor: Dr. Jian Wu

Committee Members: Michael L. Nelson, Michele C. Weigle, Sampath Jayarathna, Edward A. Fox (Virginia Tech)

Department of Computer Science

Old Dominion University, Norfolk, Virginia

November 6, 2024

@TasinChoudhury @WebSciDL

Dissertation Defense Examination

2 of 56

Outline

Overview
Research Problem
Research Questions
Research Goals & Objectives
Dataset & Benchmarks
Related Work
Methodology
Evaluation & Results
Conclusion

ETDSuite: A Toolkit to Mine ETDs To Enrich Scholarly Big Data using NLP & CV @TasinChoudhury @WebSciDL, Dissertation Defense 11/06/24

3 of 56

What are Electronic Theses and Dissertations (ETDs)?

Since 1997, when Virginia Tech pioneered the practice, students have been submitting theses electronically.
Represent the scholarly work of students pursuing higher education, submitted in partial fulfillment of degree requirements.
Hosted by university library repositories or ProQuest.
Different from regular scholarly papers (journals and conference proceedings):

Topics may shift across chapters, which together present the major contributions of a research work.
Contain rich metadata, bibliographies, figures, tables, and the latest discoveries.
The metadata schema (e.g., degree, department, discipline) differs from that of regular scholarly papers.
Citation styles vary across academic disciplines.

4 of 56

ETDs have Layouts Different from Regular Scholarly Articles

Abstract

Acknowledgement

Table of Contents

5 of 56

ETDs have Layouts Different from Regular Scholarly Articles

Chapter

List of Figures

List of Tables

6 of 56

ETDs have Layouts Different from Regular Scholarly Articles

Appendix

Chapter Abstract

Curriculum Vitae

7 of 56

Mining ETDs is Important

ETDs contain detailed domain knowledge, metadata, extensive bibliographies, and useful graphs, tables, and figures, which are their key document elements.
Mining ETDs can be beneficial:

Improves discoverability and readability of long documents.
Enhances graduate education by aligning curricula with academic and industry demands.
Enriches and enhances metadata quality for digital libraries (DL).

ETDs should be accessible to researchers, students, and librarians.
However, ETDs are still understudied because of their length (e.g., 100–200 pages).
ETD repositories have limited computational tools and services to extract, parse, and segment ETDs.

This leads to accessibility challenges in discovering the knowledge buried in ETDs.

8 of 56

Extracting ETD Metadata Automatically is Challenging

Metadata represents a key aspect of digital objects for discoverability.
For any scalable DL system, metadata is crucial for retrieving relevant documents.
However, we found discrepancies in the library-provided ground-truth metadata.
Moreover, ETDs in repositories are often accompanied by incomplete, little, or no metadata:

Posing great challenges to accessibility.
Leading to improper indexing and inaccurate retrieval.

GROBID, CERMINE, and ParsCit were designed to extract metadata from regular scholarly papers and fail to generalize well to ETDs.
In addition, the visual layout of the cover page differs across universities.

GROBID: https://grobid.readthedocs.io/en/latest/Install-Grobid/

CERMINE: https://github.com/CeON/CERMINE

ParsCit: https://github.com/knmnyn/ParsCit

9 of 56

Figure: Annotated ETD cover pages from MIT (1965), Stanford (1970), and University of Michigan (1971). The labeled fields (title, author, university, degree, year, program, and, on one page, advisor) appear in a different order on each page.

Cover Pages of ETD Across Universities – Task for Extracting Metadata is Challenging

10 of 56

ETD Cover Page Layouts are Evolving – Extracting Metadata became Challenging

Figure 1: Measurement of the cosine distance with respect to the ETD in 1970
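
Figure 1 is expressed in cosine distance. As a minimal sketch, assuming each cover page is represented by some numeric feature vector (the slide does not say how the vectors were built, and the values below are made up), the distance to a reference page could be computed like this:

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical layout vectors: a 1970 reference page and a later page.
reference_1970 = [1.0, 0.0, 2.0]
later_page = [0.0, 1.0, 2.0]
print(round(cosine_distance(reference_1970, later_page), 3))
```

A distance of 0 means the two vectors point in the same direction; larger values mean the layouts, as encoded, have drifted further from the reference.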

11 of 56

Metadata Represents a Key Aspect of Digital Objects but Quality Becomes a Concern

DL systems have adopted Dublin Core (DC) to standardize metadata formats (e.g., ETD-MS v1.1).

Bui and Park [1] analyzed 659 metadata item records and found:

Inaccurate, incomplete, and inconsistent metadata elements.
For example, dates recorded as both yyyy-mm-dd and mm-dd-yyyy.

Further investigation found missing values [2] (Figure 2).
Existing work:

Proposed frameworks for assessing metadata quality and mechanisms [3].
Crowdsourcing – letting users correct metadata [4].
Drawbacks:

Difficult to control the user population.
Slow and thus not scalable.

[1] Bui et al. (2006). An Assessment of Metadata Quality: A Case Study of the National Science Digital Library Metadata Repository. CAIS/ACSI’06.

[2] Uddin et al. (2021). Building A Large Collection of Multi-domain Electronic Theses and Dissertations. IEEE Big Data.

[3] Jung-Ran Park (2009). Metadata Quality in Digital Repositories: A Survey of the Current State of the Art. Cataloging & Classification Quarterly.

[4] Wu et al. (2014). The impact of user corrections on a crawl-based digital library: A CiteSeerX perspective. IEEE CollaborateCom’14.

Figure 2: Missing Values: Year (87%), Department (55%), University (98%)

12 of 56

Optical Character Recognition (OCR) is Challenging for Scanned ETDs

ETDs can be scanned or born-digital.
For scanned ETDs, OCR is one of the first steps in any further work with the ETD data.
Recognizing text from scanned ETDs is challenging:

Poor image resolution.
Typewritten text.
Imperfection of OCR technology.

Figure 3: OCR Challenges for ETDs: (i) scribble, (ii) stamp, (iii) overlapped letters, and (iv) copyright character

Figure 4: Noisy OCR Text using Tesseract

13 of 56

Limitations of the Existing Frameworks to Segment ETDs

Ahuja et al. [1] proposed an object detection model by fine tuning YOLOv7 [2].

A bottom-up approach that automatically annotates major structural components.
Underperformed in detecting minority classes (e.g., date, degree, equation).

Multimodal frameworks for visual document understanding:

Differ in their methodological approaches and in the pre-training tasks they introduce.
Evaluated on the RVL-CDIP dataset [3] for document image classification.
Despite their novel architectures, they performed poorly when fine-tuned on ETDs:

LayoutLMv2-base [4] achieved 9% accuracy.
DocFormer [5] achieved 28% accuracy.

[1] Ahuja et al. (2022). Parsing Electronic Theses and Dissertations Using Object Detection. WIESP@ACL’22.

[2] Wang et al. (2022). YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv:2207.02696.

[3] Harley et al. (2015). Evaluation of deep convolutional nets for document image classification and retrieval. ICDAR’15.

[4] Xu et al. (2021). LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding. ACL/IJCNLP’21.

[5] Appalaraju et al. (2021). DocFormer: End-to-End Transformer for Document Understanding. ICCV’21.

14 of 56

Segmentation of ETDs is Challenging Due to Lack of Labeled Dataset

Large labeled training datasets are lacking [1].
The distribution of available ETD pages is skewed:

It is easy to find “Chapter” pages, but
minority classes are scarce (Figure 5).

Solution: Adding synthesized labeled samples for the minority classes.

Document Domain Randomization [2] is capable of generating synthesized scholarly labeled samples by randomizing layout, font styles, and semantics.
Key Randomization Aspects: alters font styles/sizes, non-text elements (tables, figures), and spatial relationships to mimic diverse document layouts.
However, it lacks the controlled structural patterns needed to accurately mimic ETD documents, making it less effective for generating realistic ETD pages.

[1] Kahu et al. (2021). ScanBank: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations. JCDL’21.

[2] Ling, M. et al. (2021). Document Domain Randomization for Deep Learning Document Layout Extraction. ICDAR’21.

Figure 5: Number of training samples for each class

15 of 56

Accurately Parsing Citation Strings from ETDs is Challenging

Citation parsing is necessary for developing citation graphs representing connections between published works.
Academic disciplines have adopted different citation styles in their publications.
For example:

APA format – Education
MLA format – Humanities

More than 3,000 different citation styles found in ETDs from various disciplines [1].
Because of this variation in citation styles, parsing citations accurately is challenging.
Existing frameworks such as Neural ParsCit [2] use deep learning to parse citations more accurately.

However, the model was trained on focused domains (e.g., Computer Science) with fewer citation styles.

[1] Park et al. (2012). A hybrid two-stage approach for discipline-independent canonical representation extraction from references. JCDL’12.

[2] Prasad et al. (2018). Neural ParsCit: A Deep Learning Based Reference String Parser. IJDL.

16 of 56

Research Questions

We raised four research questions (RQs) to address the research problems.
RQ1: Can we develop an AI method to extract metadata from the cover pages of scanned and born-digital ETDs?
RQ2: Library-provided metadata often contains incomplete, inconsistent, and incorrect values. How can we leverage AI methods to improve metadata quality?
RQ3: Will latent features that encode text and vision modalities outperform latent features obtained from a single modality in ETD page classification?
RQ4: Is it possible to design a universal parser that accurately parses metadata from multi-style and multi-type citations as they appear in ETDs?

17 of 56

Research Goals & Objectives

Our goal is to develop a software toolkit called ETDSuite, which contains advanced machine learning methods designed to transform raw ETDs into structured, enriched scholarly data through metadata extraction (RQ1), metadata enhancement (RQ2), page segmentation (RQ3), and citation parsing (RQ4), leveraging natural language processing and computer vision models.

We developed the following frameworks to address the four RQs:

AutoMeta [1], a framework to extract metadata automatically.
MetaEnhance [2], a framework to improve metadata quality.
ETDPC [3], a framework to classify ETD pages by different types.
LMParsCit, a framework to parse citation strings in many styles.

[1] Choudhury et al. (2021). Automatic Metadata Extraction Incorporating Visual Features from Scanned Electronic Theses and Dissertations. JCDL’21.

[2] Choudhury et al. (2023). MetaEnhance: Metadata Quality Improvement for Electronic Theses and Dissertations of University Libraries. JCDL’23. (Best Paper Award)

[3] Choudhury et al. (2024). ETDPC: A Multimodality Framework for Classifying Pages in Electronic Theses and Dissertations. IAAI’24.

18 of 56

Datasets & Benchmarks for RQ1 (i.e., AutoMeta), RQ2 (i.e., MetaEnhance), RQ3 (i.e., ETDPC), and RQ4 (i.e., LMParsCit)

Dataset	Count	Description	Format	Task
AutoMeta – ETD500	500	Scanned ETDs with annotated metadata from the ETD cover pages	XML / PDF / PNG	Used for extracting metadata from scanned ETDs
ETDPC – ETD500	92,371	ETD pages manually labeled into 13 categories. Includes the positions and sizes of the text and the bounding boxes (bbox) extracted by AWS Textract, an OCR tool	TXT / JSON / PNG	Used for classifying ETD pages
MetaEnhance – ETDQual500	500	A benchmark of 500 ETDs built by combining subsets (i.e., university, year, department, and degree fields) selected using multiple criteria. Each criterion ensures that multiple error types (e.g., missing values, misspellings, and incorrect values) are represented	PDF / CSV	Used for improving and enhancing the quality of metadata
ETDCite	1,653	A benchmark of annotated citation strings collected from 20 ETDs, representing 5 citation types (e.g., journals, conferences, books) and 18 citation styles (e.g., APA, MLA, Harvard, Chicago)	JSONL	Used for parsing citation strings in many styles

Datasets are available on Harvard Dataverse:

AutoMeta-ETD500: https://doi.org/10.7910/DVN/18D6AZ

ETDPC-ETD500: https://doi.org/10.7910/DVN/MSFVLQ

MetaEnhance-ETDQual500: https://doi.org/10.7910/DVN/PI3U1V

ETDCite: https://doi.org/10.7910/DVN/ANU6LM

Table 1: Datasets and benchmarks compiled by crawling 114 US universities and collecting over 533,047 full-text ETDs.

19 of 56

RQ1: General Machine Learning Approaches to Extract Metadata Designed for Journal and Conference Articles

GROBID [1]:

Developed to extract bibliographic metadata from born-digital papers.
Based on 11 Conditional Random Field (CRF) models that apply sequence tagging using lexical and layout information.
Can extract header and bibliographic metadata (e.g., title, authors, affiliations, etc.)
Achieved an accuracy of 74.9% but failed to extract metadata from scanned documents.

CERMINE [2]:

Developed to extract structured bibliographic data (e.g., title, author, volume, issue, etc.).
Hybrid Approach – support vector machine [3], and simple rule-based models.
Achieved an average F1 score of 0.775 for most metadata types.
Limitation – scanned pages will not be properly processed.

[1] Lopez, P. (2009). GROBID: Combining Automatic Bibliographic Data Recognition and Term Extraction for Scholarship Publications. ECDL’09.

[2] Tkaczyk, D. et al. CERMINE: Automatic Extraction of Structured Metadata from Scientific Literature. IJDAR.

[3] Shmilovici, A. (2005). Support Vector Machines. In: Maimon, O., Rokach, L. (eds) Data Mining and Knowledge Discovery Handbook. Springer, Boston.

20 of 56

RQ1: Heuristic Based Rules and Performance on Seven Key Metadata Fields [1]

Figure 6: Rules for extracting metadata, accuracy measures. Acln% and Aocr% are based on TXT-clean and TXT-OCR datasets, respectively.

Figure 7: Example of Degree Extraction from MIT and Virginia Tech Sample

[1] Choudhury et al. (2020). A heuristic baseline method for metadata extraction from scanned electronic theses and dissertations. JCDL’20. (Best Poster Award)

Regular Expression

21 of 56

RQ1: Developed AutoMeta Using CRF Model with Sequence Tagging Method by Incorporating Textual & Visual Features for Scanned ETDs

CRF is a discriminative model.
Features can depend on each other, and the model considers future observations.
We tagged each token of the annotated fields following the BIO tagging schema.
The BIO tagging schema has been applied in studies of Named Entity Recognition [1] and Key Phrase Extraction [2].

We also tagged each token with its part of speech (POS).
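
To make the BIO schema concrete, here is a minimal sketch of how annotated field spans could be turned into per-token tags. The function name, the span format, and the sample cover-page tokens are illustrative assumptions, not AutoMeta's actual code:

```python
def bio_tags(tokens, spans):
    """Assign BIO tags to tokens.

    spans is a list of (start, end, label) tuples with end exclusive;
    tokens outside every span are tagged "O".
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"           # first token of the field
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"           # remaining tokens of the field
    return tags

# Hypothetical cover-page tokens with two annotated fields.
tokens = ["Mining", "Scanned", "Theses", "by", "Jane", "Doe"]
spans = [(0, 3, "title"), (4, 6, "author")]
for token, tag in zip(tokens, bio_tags(tokens, spans)):
    print(f"{token}\t{tag}")
```

Each field opens with a B- tag and continues with I- tags, which lets a sequence model recover field boundaries even when two fields sit next to each other.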

B-title

I-title

B-author

I-author

[1] Wu et al. (2017). Hesdk: A hybrid approach to extracting scientific domain knowledge entities. JCDL’17.

[2] Councill et al. (2008). ParsCit: an open-source CRF reference string parsing package. LREC’08.

Figure 8: Sequence tagging using BIO schema

22 of 56

RQ1: Textual Features for AutoMeta

Whether all the characters in the string are uppercase or lowercase.
Whether all the characters in the string are numeric.
Last three characters of the current token.
Last two characters of the current token.
POS tag of the token, assigned based on its context.
Last two characters in the POS tag of the current token.
Whether the first character of consecutive tokens is uppercase.
Whether the first character is uppercase for the token that is not at the beginning or end of the document.
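
The features listed above map naturally onto a per-token feature dictionary, the input format commonly used by CRF toolkits. The sketch below is an illustration under assumptions: the feature names are invented, and the POS tags are passed in rather than computed, since the slide does not name the tagger:

```python
def token_features(tokens, pos_tags, i):
    """Build a feature dict for token i from the tokens and their POS tags."""
    tok, pos = tokens[i], pos_tags[i]
    starts_upper = tok[:1].isupper()
    feats = {
        "all_upper": tok.isupper(),
        "all_lower": tok.islower(),
        "all_numeric": tok.isdigit(),
        "suffix3": tok[-3:],          # last three characters
        "suffix2": tok[-2:],          # last two characters
        "pos": pos,
        "pos_suffix2": pos[-2:],      # last two characters of the POS tag
        # first character uppercase, only for tokens not at either end
        "inner_starts_upper": 0 < i < len(tokens) - 1 and starts_upper,
    }
    if i + 1 < len(tokens):
        # this token and the next both start with an uppercase letter
        feats["consecutive_upper"] = starts_upper and tokens[i + 1][:1].isupper()
    return feats

tokens = ["Old", "Dominion", "University", "2024"]
pos_tags = ["NNP", "NNP", "NNP", "CD"]
print(token_features(tokens, pos_tags, 1))
```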

Figure 9: TXT-clean - a cleansed version of TXT-OCR

Proper Noun (NNP)

23 of 56

RQ1: Visual Features for AutoMeta

When annotating a document:

Humans leverage visual or text features or both.
For example, thesis titles usually appear at the top of the cover page.

Visual information is represented by the corner coordinates of the bounding box (bbox) of a text span.
This information is available from hOCR files (an XML-based format).
Visual features:

Left margin – x1 as a feature for all tokens in the same line.
Upper left corner – y2 as a feature for all tokens.
Bottom right corner – y1 as a feature for all tokens.
Features have been normalized between 0 and 1.
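
As a rough illustration of the normalization step, the sketch below scales bounding-box coordinates into the range 0 to 1 by dividing by the page dimensions. Dividing by page width and height is an assumption (the slide only states that features were normalized), and the sample coordinates and page size are made up:

```python
def normalize_bbox(bbox, page_width, page_height):
    """Scale (x1, y1, x2, y2) pixel coordinates into the range [0, 1]."""
    x1, y1, x2, y2 = bbox
    return (
        x1 / page_width,
        y1 / page_height,
        x2 / page_width,
        y2 / page_height,
    )

# Hypothetical token box on a 2550 x 3300 pixel scan.
print(normalize_bbox((255, 330, 1020, 396), page_width=2550, page_height=3300))
```

Normalizing this way makes positions comparable across scans of different resolutions, so "near the top of the page" means the same thing for every ETD.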

Figure 10: Bounding Box measurement for a token

24 of 56

RQ1: AutoMeta, Leveraging CRF Model By Incorporating Text and Visual Features, achieved 0.813 – 0.96 F1, Increased the F1 Significantly by 0.7% to 10.6%

Figure 11: Performance (F1) comparison of the models

25 of 56

RQ2: Existing Works on Metadata Quality Improvement Rely on Manual Approaches, Which are Slow and Not Scalable

Manual Correction:

When Microsoft Academic was online:

Allowed users to change header information (e.g., title, authors, year, abstract, DOI, URL, etc.)

Wu et al. [1] proposed crowdsourcing to improve the metadata quality for CiteSeerX.
Showed that user corrections were a reliable source of high-quality metadata.
Limitation: the paper only examined authors and titles.

Semi-automatic Approach:

Found inconsistent metadata caused by not following the DC schema [2].
Emphasized quality criteria including accuracy, completeness, and consistency.
Compared published methods that proposed guidelines, best practices, and quality assurance.
Advocated the development of a framework for assessing metadata quality and mechanisms.
Limitation: no AI-based framework has been proposed or implemented.

CiteSeerX: https://citeseerx.ist.psu.edu/

[1] Wu et al. (2014). The impact of user corrections on a crawl-based digital library: A CiteSeerX perspective. IEEE CollaborateCom’14.

[2] Bui et al. (2006). An Assessment of Metadata Quality: A Case Study of the National Science Digital Library Metadata Repository. CAIS/ACSI’06.

26 of 56

RQ2: Proposed MetaEnhance, Focuses on AI methods and Models to Automatically Improve Metadata Quality, Which is More Scalable Than Manual Approach

Figure 12: MetaEnhance Framework – Error Detection, Error Correction, and Canonicalization

27 of 56

RQ2: Error Detection (incorrect or missing) and Correction (AutoMeta)

Title (dc.title): can contain an incorrect title or a missing value.

For Example: Incorrect Title – DMA Recitals

The correct title is: Expanding Vision: The Music of Alyssa Morris.

Solution:

Adopted the title classifier developed by Rohatgi et al. [1] to detect errors.
The classifier was evaluated on the SciDocs [2] dataset.
Achieved F1 score of 0.96.
Features: # of tokens, # of special characters, # of capital letters, stop words, consecutive punctuations, and minimum, median, and max TF-IDF.
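
Several of the surface features above are simple counts. The sketch below shows how they might be computed for the two titles on this slide. It is not the classifier of Rohatgi et al.: the feature names and the tiny stop-word list are assumptions, and the TF-IDF features are left out because they require a reference corpus:

```python
import string

# Tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}

def title_features(title):
    """Compute simple surface features of a candidate title string."""
    tokens = title.split()
    return {
        "n_tokens": len(tokens),
        "n_special": sum(ch in string.punctuation for ch in title),
        "n_capitals": sum(ch.isupper() for ch in title),
        "n_stop_words": sum(
            t.lower().strip(string.punctuation) in STOP_WORDS for t in tokens
        ),
        "consecutive_punct": any(
            a in string.punctuation and b in string.punctuation
            for a, b in zip(title, title[1:])
        ),
    }

print(title_features("DMA Recitals"))
print(title_features("Expanding Vision: The Music of Alyssa Morris"))
```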

[1] Rohatgi et al. (2021). What Were People Searching For? A Query Log Analysis of An Academic Search Engine. JCDL’21.

[2] Cohan et al. (2020). SPECTER: Document-level Representation Learning using Citation-informed Transformers. ACL’20.

28 of 56

RQ2: Error Detection (incorrect or missing) and Correction (AutoMeta)

Author and Advisor (dc.creator and dc.contributor): can contain incorrect name values.
Solution:

Used the named entity recognition (NER) model implemented in the FlairNLP [1] package.
The model was pre-trained and evaluated on the OntoNotes dataset.
Achieved F1 score of 0.90.
An error is detected if the name is classified as a type other than PERSON.
We observed that many advisor names include the advisor's role:

For example: Mark Pankow, Co-Chair (dc.contributor.role)
We used a regular expression to parse the names and store the role in a separate column.
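
A minimal sketch of the name and role split follows. The pattern and the list of role words are assumptions chosen to cover the example on this slide; the actual expression used in MetaEnhance is not shown in the deck:

```python
import re

# Hypothetical role vocabulary; extend as needed for a real collection.
ROLE_PATTERN = re.compile(
    r"^(?P<name>.+?),\s*(?P<role>(?:Co-)?(?:Chair|Advisor|Member))\s*$",
    re.IGNORECASE,
)

def split_name_and_role(value):
    """Return (name, role); role is None when no trailing role is found."""
    match = ROLE_PATTERN.match(value.strip())
    if match:
        return match.group("name").strip(), match.group("role")
    return value.strip(), None

print(split_name_and_role("Mark Pankow, Co-Chair"))
print(split_name_and_role("Jian Wu"))
```

Anchoring the role at the end of the string keeps values such as "Pankow, Mark" intact, because the text after the comma is not a known role word.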

[1] Akbik et al. (2018). Contextual String Embeddings for Sequence Labeling. COLING’18.

OntoNotes dataset : https://catalog.ldc.upenn.edu/LDC2013T19

29 of 56

RQ2: Error Detection (acronyms) and Correction (canonicalization)

Observed acronyms or colloquial names in the university, degree, and department fields (Table 2).

Solution: adopted a dictionary-based approach.

University – compiled 832 names of universities and their acronyms.
Degree – compiled 234 degree names and their acronyms.
Department – compiled 232 academic department names and their acronyms.
Performed fuzzy string matching against the corresponding dictionary.
To disambiguate department surface names (e.g., Dept of CS), used Sentence Transformers [1].
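
The dictionary lookup with a fuzzy fallback can be sketched with Python's standard difflib module. This is a stand-in for illustration: the slide does not name the fuzzy matching library, the three-entry dictionary is a toy, and the 0.8 cutoff is an arbitrary assumption:

```python
import difflib

# Toy acronym dictionary; the real one holds 832 university names.
UNIVERSITIES = {
    "jhu": "Johns Hopkins University",
    "odu": "Old Dominion University",
    "vt": "Virginia Tech",
}

def canonicalize(value, dictionary, cutoff=0.8):
    """Map an acronym or near-miss spelling to its full name, else None."""
    key = value.strip().lower()
    if key in dictionary:                      # exact acronym hit
        return dictionary[key]
    names = list(dictionary.values())
    lowered = [name.lower() for name in names]
    close = difflib.get_close_matches(key, lowered, n=1, cutoff=cutoff)
    if close:                                  # fuzzy hit on a full name
        return names[lowered.index(close[0])]
    return None

print(canonicalize("JHU", UNIVERSITIES))
print(canonicalize("Johns Hopkins Univ", UNIVERSITIES))
```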

Field	Acronym/Colloquial	Full Name
University (thesis.degree.grantor)	JHU, jhu	Johns Hopkins University
Degree (thesis.degree.name)	M.PHIL, MPHIL	Master of Philosophy
Department (thesis.degree.discipline)	MSE, MSCE	Materials Science and Engineering

[1] Nils Reimers and Iryna Gurevych. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP-IJCNLP’19.

Table 2: Example of acronym/colloquial names and their full names

30 of 56

RQ2: Error Detection (may contain misspellings) and Correction

We also observed misspellings for the department field.

For example: “College of Muisc” and “Graduhte Studies in English”.
Solution: adopted a Python library – pyspellchecker.

Uses the Levenshtein distance algorithm to compute the minimum edit distance.
Generates candidate permutations through insertion, deletion, and substitution.
If the edit distance between the original word and the field value is >= 2, it is considered a spelling error.
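
For reference, the Levenshtein distance itself can be computed with a short dynamic program. This is a plain textbook implementation written for illustration, not pyspellchecker's internal code; it counts insertions, deletions, and substitutions only:

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions from a to b."""
    prev = list(range(len(b) + 1))        # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                        # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,              # delete ca
                curr[j - 1] + 1,          # insert cb
                prev[j - 1] + cost,       # substitute or match
            ))
        prev = curr
    return prev[-1]

# The two misspellings from this slide against their intended words.
print(levenshtein("Muisc", "Music"))
print(levenshtein("Graduhte", "Graduate"))
```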

Pyspellchecker: https://pypi.org/project/pyspellchecker/

31 of 56

RQ2: Error Detection (inconsistency) and Correction (need canonicalization)

Year (dc.date.issued): the format is inconsistent across libraries.

For example: mm–dd–yyyy or yyyy–mm–dd.
Solution:

First, verified that the date field is valid using to_datetime from the pandas library.
Second, checked it against a dictionary listing the years from 1880 to 2023.
If an error is found, the correction module uses AutoMeta to overwrite the value.
To canonicalize the surface value:

Used to_datetime from the pandas library to parse the date field.
Produced the year, month, and day as outputs.
Finally, stored the value in three separate columns.
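
The validate-then-split logic can be sketched as follows. To keep the example dependency-free it uses the standard library's datetime.strptime as a stand-in for pandas to_datetime, and it tries only the two formats named on this slide; both choices are assumptions made for illustration:

```python
from datetime import datetime

FORMATS = ("%Y-%m-%d", "%m-%d-%Y")   # the two formats named on the slide
MIN_YEAR, MAX_YEAR = 1880, 2023      # accepted year range from the slide

def canonicalize_date(value):
    """Return (year, month, day), or None if invalid or out of range."""
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue                  # try the next format
        if MIN_YEAR <= parsed.year <= MAX_YEAR:
            return parsed.year, parsed.month, parsed.day
        return None                   # parsed, but year is implausible
    return None                       # no format matched

print(canonicalize_date("2015-05-09"))
print(canonicalize_date("05-09-2015"))
```

Both inputs resolve to the same three values, which can then be stored in separate year, month, and day columns. A None result corresponds to the case where the correction module would fall back to AutoMeta.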

32 of 56

RQ2: Evaluation Benchmark Dataset (i.e., ETDQual500) – Distribution of Errors

Field	#Missing	#Acronym	#Spell	#Incorrect
Title	0	0	0	1
Author	2	0	0	0
Advisor	150	35	0	0
University	6	43	0	0
Year	172	1	0	0
Degree	156	82	0	4
Department	269	85	2	0

Table 3: Error distribution in the 500-ETD benchmark dataset.

33 of 56

RQ2: MetaEnhance Achieved Nearly Perfect F1 Scores in Detecting Errors and 0.85 to 1.00 For Correcting Five of Seven Key Metadata Fields

Field	P_ED	R_ED	F1_ED	P_ECC	R_ECC	F1_ECC
Title	0.997	1.0	0.998	0.0	0.0	0.0
Author	0.996	1.0	0.997	0.0	0.0	0.0
Advisor	0.920	0.990	0.950	1.0	1.0	1.0
University	1.0	1.0	1.0	0.740	1.0	0.850
Year	1.0	1.0	1.0	1.0	1.0	1.0
Degree	1.0	1.0	1.0	0.980	1.0	0.980
Department	0.996	1.0	0.997	0.997	1.0	0.980

Table 4: Performance of Error Detection (ED) and Error Corrections and Canonicalization (ECC)
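
For readers less familiar with the metric, F1 is the harmonic mean of precision (P) and recall (R). The helper below is a generic sketch, not code from MetaEnhance; the sample call uses the University row of the ECC columns:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# University, ECC columns of Table 4: P = 0.740, R = 1.0.
print(round(f1_score(0.740, 1.0), 2))
```

Because it is a harmonic mean, F1 stays close to the lower of the two values, which is why a precision of 0.740 with perfect recall yields roughly 0.85 rather than the arithmetic average of 0.87.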

34 of 56

RQ3: Existing Work Fine-Tuned Vision-Based Models to Segment ETDs but Yielded Low Performance

Fine Tuning YOLOv7:

Existing multimodal frameworks performed well on general document layout analysis.
Ahuja et al. proposed leveraging visual features by fine tuning YOLOv7.
Limitation: the lack of training samples led to low performance on minority classes.

Fine Tuning VGG16 [1]:

Fine tuned VGG16 model on the ETDPC-ETD500 dataset.
Compared against DocFormer and LayoutLMv2.
Surprisingly, the model performed slightly better.

[1] Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition (2015).

Figure 13: Comparison against baseline models. We report classification accuracy and macro F1 on the test set

35 of 56

Figure 14: ETDPC – A Multimodality Framework (I – Image, and T - Text)

RQ3: Proposed ETDPC, A Two Stream Multimodal Classification Model with Cross-Attention That Uses a Vision Encoder (e.g., ResNet-50v2) and a Text Encoder (e.g., BERT with Talking Heads Attention)

36 of 56

RQ3: Method – Adopted ResNet50 (Version 2) as a Visual Modality

We chose ResNet50 version 2 [1] as our visual encoder:

ResNet50v2 [1] is an improvement over the original ResNet50 [2].
Performed better than the original ResNet on the ImageNet-1k [3] and CIFAR-10/100 datasets.

Figure 15: ResNet vs. ResNetv2

Source: https://shorturl.at/xBSZ0

[1] He et al. 2016. Identity Mappings in Deep Residual Networks. ECCV

[2] He et al. 2016. Deep Residual Learning for Image Recognition. CVPR

[3] Russakovsky et al. 2015. ImageNet Large Scale Visual Recognition Challenge. arXiv:1409.0575.

CIFAR10/100: https://www.cs.toronto.edu/~kriz/cifar.html

37 of 56

RQ3: Method – Adopted BERT with Talking-Heads as a Textual Modality

We used BERT with the Talking-Heads Attention [1].

This variant replaces the multi-head attention in BERT [2] with talking-heads attention, and the ordinary dense layer with GELU.

To generate the embeddings:

First, we used AWS Textract (an OCR tool) to extract text from the ETD images.
We used the pre-trained model of BERT with Talking Heads Attention (large).
We performed tokenization and extracted the following features:

Input type ids: which sequence a token belongs to when there is more than one
Input mask: whether a token should be attended or not
Input word ids: the indices corresponding to each token in the sentence

Finally, these features are fed through a trainable embedding layer.
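
The three input features can be illustrated with a toy packing function. The vocabulary ids below are hypothetical, and the special-token ids (0 for padding, 101 for [CLS], 102 for [SEP]) follow the standard uncased BERT vocabulary, which is an assumption about the model used here:

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102          # assumed special-token ids
TOY_VOCAB = {"list": 2001, "of": 2002, "figures": 2003}   # hypothetical ids

def pack_inputs(tokens, max_len=8):
    """Build padded BERT-style inputs for a single text segment."""
    ids = [CLS_ID] + [TOY_VOCAB[t] for t in tokens] + [SEP_ID]
    if len(ids) > max_len:
        raise ValueError("sequence longer than max_len")
    padding = max_len - len(ids)
    return {
        "input_word_ids": ids + [PAD_ID] * padding,    # vocabulary indices
        "input_mask": [1] * len(ids) + [0] * padding,  # 1 = attend, 0 = padding
        "input_type_ids": [0] * max_len,               # single segment, all 0
    }

print(pack_inputs(["list", "of", "figures"]))
```

With a single text segment per page, the type ids are all zero; they only become informative when two sequences are packed into one input.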

[1] Shazeer et al. 2020. Talking-Heads Attention. arXiv:2003.02436.

[2] Devlin et al. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT

AWS Textract: https://aws.amazon.com/textract/ocr/

38 of 56

RQ3: Method – Multimodal with Cross-Attention

The dimensions of the embeddings from both encoders are unified:

Applied a linear projection to map the dimensions.
Our model uses one 256-D projection layer with a dropout rate of 0.8.
Later, the model fetches each embedding projection and performs an early fusion.

To focus on the most important pixels of an image that relate to the corresponding text:

We used cross-attention [1, 2]:

Cross-attention asymmetrically combines two separate embedding sequences.

Further, we concatenated the early fusion of the projection layer with the attention sequence.
Finally, we pass the result through the softmax layer for classification.
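
The asymmetry of cross-attention is easiest to see in a toy version: queries come from one sequence, while keys and values come from the other. The sketch below is scaled dot-product attention in plain Python with no learned weight matrices. Using text embeddings as queries and image embeddings as keys and values is an assumption for illustration, as are the tiny 2-D vectors:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Each query attends over all keys and returns a weighted sum of values."""
    dim = len(keys[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        attended.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return attended

# Hypothetical projected embeddings (2-D here instead of 256-D).
text_emb = [[1.0, 0.0], [0.0, 1.0]]
image_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = cross_attention(text_emb, image_emb, image_emb)
fused = [t + a for t, a in zip(text_emb, attended)]   # row-wise concatenation
print(fused)
```

The output has one row per query, so swapping the roles of the two sequences changes both the shape and the meaning of the result. That is the sense in which the combination is asymmetric.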

[1] Chen et al. CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification. ICCV

[2] Wei et al. Multi-Modality Cross Attention Network for Image and Sentence Matching. CVPR

39 of 56

RQ3: Proposed a Method to Augment Minority ETD Pages

To mitigate the data imbalance problem, we generate synthesized training samples.

We paraphrase the text extracted by the OCR (i.e., AWS Textract).
We convert the text into images.
We perform an image-based transformation.

Adopted Google’s PEGASUS [1] for paraphrasing.
Used ImgAug for image-based transformation, and adopted the following transformations [2]:

Additive Gaussian noise – adds noise sampled from Gaussian distributions elementwise to images.
Salt-and-pepper noise – replaces pixels in images with white and black-ish colors.
Gaussian blur – smooths the image with a Gaussian kernel (e.g., σ = 0.5).
Linear Contrast – incorporates the scanning effect (e.g., ɑ = 1).
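
To show what one of these transformations does, here is a dependency-free sketch of salt-and-pepper noise on a grayscale page stored as a nested list. It is a simplified stand-in and does not use the ImgAug API; the noise amount, the seed, and the 20 x 20 page are arbitrary assumptions:

```python
import random

def salt_and_pepper(image, amount=0.05, seed=0):
    """Return a copy of a 2D grayscale image (values 0..255) with random
    pixels replaced by black (0) or white (255)."""
    rng = random.Random(seed)              # seeded for reproducibility
    noisy = [row[:] for row in image]      # leave the input untouched
    for row in noisy:
        for x in range(len(row)):
            if rng.random() < amount:
                row[x] = rng.choice((0, 255))
    return noisy

page = [[128] * 20 for _ in range(20)]     # hypothetical uniform gray page
noisy_page = salt_and_pepper(page, amount=0.1)
changed = sum(a != b for ra, rb in zip(page, noisy_page) for a, b in zip(ra, rb))
print(f"{changed} of {20 * 20} pixels changed")
```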

[1] Zhang et al. 2020. PEGASUS: Pre-Training with Extracted Gap-Sentences for Abstractive Summarization. ICML

[2] Kahu et al. 2021. ScanBank: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations. JCDL

ImgAug: https://imgaug.readthedocs.io/en/latest/index.html

40 of 56

RQ3: Experiments – Fine Tuning Hyperparameters & Training

Hardware: a single Tesla V100-SXM2 GPU for training and a 12-core CPU to perform other tasks.
We heuristically fine tuned the hyperparameters.

Optimizer: Adam with a weight decay of 0.004.
Epsilon: 1e-07, clip value: 2.0, learning rate: 0.001, dropout rate: 0.8.
Loss: sparse categorical focal loss.
To avoid overfitting, we set up early stopping while monitoring the validation loss.
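The sparse categorical focal loss named above can be sketched as follows. The focusing parameter `gamma` is not stated on the slide, so the default of 2.0 is an assumption, and `min_prob` is only a numerical clamp introduced here (it is unrelated to the optimizer's epsilon).

```python
import math

def sparse_categorical_focal_loss(labels, probs, gamma=2.0, min_prob=1e-7):
    """Mean focal loss for integer class labels.

    labels: list of int class indices (the "sparse" label format).
    probs:  list of per-class probability lists, one per sample.
    gamma:  focusing parameter (assumed value; not given in the source).
    """
    losses = []
    for y, p in zip(labels, probs):
        p_t = max(p[y], min_prob)  # probability assigned to the true class
        # (1 - p_t)^gamma down-weights samples the model already gets right.
        losses.append(-((1.0 - p_t) ** gamma) * math.log(p_t))
    return sum(losses) / len(losses)

# Hypothetical predictions: one easy sample and one hard sample.
easy = sparse_categorical_focal_loss([0], [[0.9, 0.1]])
hard = sparse_categorical_focal_loss([0], [[0.2, 0.8]])
```

With `gamma=0` the expression reduces to ordinary cross-entropy. Larger values shift the training signal toward hard, often minority-class, samples, which is why this loss suits imbalanced page categories.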

Split the labeled ETD pages into train, validation (25%), and test sets (15%).
Our model with around 460M total parameters is trained on the original and augmented ETD pages.
Trained the model with a batch size of 32 for 40 epochs, each taking around 1 hour.

41 of 56

RQ3: ETDPC Outperforms the Previous Models, Achieving F1 of 0.84 – 0.96 for 9 out of 13 Categories

Figure 16: Performance on ETD samples in the test set – a) performance of one-level classifier (i.e., Case a), where ETDPC is trained on ETDPC–ETD500; b) performance of two-level classifier (i.e., Case b), training first on chapter vs. non-chapter pages, and test on the remaining categories, including 21,171 manually labeled samples; and c) performance of the two-level classifier (Case c), trained on 21,171 manually labeled samples and 5,984 augmented samples. We highlight the categories with remarkable improvements of F1 scores.

Case a: single-level classifier

Case b: two-level classifier

Level 1: chapter vs. non-chapter
Level 2: other categories of non-chapter

Case c: two-level classifier + data augmentation
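The two-level routing in Cases b and c can be sketched as a simple cascade. The rule-based `toy_level1` and `toy_level2` functions are hypothetical stand-ins for the trained models, and the category names are illustrative only.

```python
def classify_page(page, level1, level2):
    """Route a page through the two-level cascade.

    level1 decides chapter vs. non-chapter. Only non-chapter pages
    are passed to level2, which assigns one of the other categories.
    """
    if level1(page) == "chapter":
        return "chapter"
    return level2(page)

def toy_level1(page):
    # Hypothetical rule standing in for the chapter vs. non-chapter model.
    return "chapter" if page["text"].lower().startswith("chapter") else "non-chapter"

def toy_level2(page):
    # Hypothetical rule standing in for the non-chapter category model.
    text = page["text"].lower()
    for label in ("abstract", "acknowledgement", "table of contents"):
        if text.startswith(label):
            return label
    return "other"

pages = [
    {"text": "Chapter 1 Introduction"},
    {"text": "Abstract This thesis studies"},
    {"text": "References"},
]
predictions = [classify_page(p, toy_level1, toy_level2) for p in pages]
```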

42 of 56

RQ3: Performed an Ablation Study on Changing the Text Encoder in the Multimodal Model and on Using Individual Modalities

Figure 17: Ablation Study – a) Experiment 1 illustrates the performance increment when changing the original BERT model to BERT with Talking-Heads. b) Experiment 2 illustrates the performance of using the individual modalities vs. the multimodal model with cross-attention.

(a) Experiment 1

(b) Experiment 2

BERT with Talking-Heads (red) beats BERT-large (blue)

Multimodality (yellow) outperforms single modality (blue and red)

43 of 56

RQ4: Existing Methods Employed to Extract Metadata Fields by Open Source Citation Parser

Citation Parser	Approach	Extracted Fields
Biblio	Regular Expression	author, date, editor, genre, issue, pages, publisher, title, volume, year
BibPro [1]	Template Matching	author, title, venue, volume, issue, page, date, journal, booktitle, techReport
CERMINE [2]	ML (CRF)	author, issue, pages, title, volume, year, DOI, ISSN
GROBID [3]	ML (CRF)	authors, booktitle, date, editor, issue, journal, location, note, pages, publisher, title, volume, web, institution
ParsCit [4]	ML (CRF)	author, booktitle, date, editor, institution, journal, location, title, volume
Neural ParsCit [5]	Deep Learning (BiLSTM-CRF)	author, booktitle, date, editor, institution, journal, location, note, pages, publisher, tech, title, volume
TransParsCit [6]	Deep Learning (BiLSTM-Transformer-CRF)	author, container title, date, page, publisher, punc, title, volume

[1] Chen et al. (2008). BibPro: A Citation Parser Based on Sequence Alignment Techniques. WAINA’08

[2] Tkaczyk, D. et al. CERMINE: Automatic Extraction of Structured Metadata from Scientific Literature. IJDAR.

[3] Lopez, P. (2009). GROBID: Combining Automatic Bibliographic Data Recognition and Term Extraction for Scholarship Publications. ECDL’09.

[4] Councill et al. (2008). ParsCit: an open-source CRF reference string parsing package. LREC’08.

[5] Prasad et al. (2018). Neural ParsCit: A Deep Learning Based Reference String Parser. IJDL.

[6] Uddin, MD S. (2022). TransParsCit: A Transformer-Based Citation Parser Trained on Large-Scale Synthesized Data. Master of Science (MS), Thesis, Computer Science, Old Dominion University

44 of 56

RQ4: List of Existing Training Datasets for the Citation Parsing Task

Dataset Name	#Instances	Domain
Cora	1,295	Computer Science
CiteSeer	1,563	Artificial Intelligence
UMass	1,829	STEM
FLUX-CiM HS	2,000	Health Science
PubMed Central	Varies	Biomedical
GROTOAP2	6,858	Biomedical and Computer Science
Venice	40,000	Humanities
GIANT	991 million (~1B)	Multi-Domain (~1,500 Citation Styles)

45 of 56

RQ4: Existing Models Trained on Focused Domain, May Lead to a Poor Performance on ETD Citation Strings Consisting of Many Styles from Various Academic Disciplines

Machine Learning Approach (i.e., CRF):

CERMINE

Used 4,000 parsed citations from CiteSeer, Cora-Ref, and PubMed Central (PMC).
Outperformed the existing tools (e.g., GROBID, ParsCit), achieving 0.933 F1 score.

GROBID (current version – 0.7.3)

Achieved F1 score of 0.87 against PMC set of 1,943 PDFs, containing ~ 90K references.
Achieved F1 score of 0.89 on a set of 2,000 PDFs from bioRxiv.

Deep Learning Approach (i.e., BiLSTM & CRF)

Neural ParsCit

Used word embeddings and character-based word embeddings as the feature set.
Trained on 4.3M reference strings extracted from ACM digital library.
Achieved F1 score from 0.86 – 0.99 depending on metadata fields.
Performed poorly (e.g., 0.51 – 0.83 F1 score) on GIANT, which consists of citations in many styles.

46 of 56

RQ4: TransParsCit using Recurrent Neural Network (e.g., LSTM, BiLSTM) with Transformer and CRF Led to a Poor Performance to Parse Citations

Deep Learning Approach (i.e., BiLSTM-Transformer-CRF):

TransParsCit, a deep learning model with Transformer [1] and CRF.
Trained on a large-scale GIANT dataset, covering 1500 citation styles.
Underperformed (e.g., F1 score 0.62 – 0.98) on the CORA dataset in extracting key metadata fields (e.g., title, date, volume, publisher) when compared against Neural ParsCit.
Two possible reasons:

Custom tokenizer (i.e., splitting a sentence into smaller units).
Combination of BiLSTM [2] and Transformer – the two rely on different mechanisms.

[1] Vaswani et al. (2017). Attention is all you need. NeurIPS’17.

[2] Schuster et al. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing.

47 of 56

RQ4: We Propose LMParsCit, Using an LLM-Based Model, and Contribute a Challenging Benchmark to Overcome the Limitations of Existing Methods

RNN (i.e., BiLSTM) requires large training data and more memory, and is computationally expensive.
Transformer was introduced with the purpose of reducing training time and improving performance.

Widely adopted for NLP tasks – Q&A, Sequence Classification, and NER tasks.
Achieved state of the art performance.

Thus, adopting a transformer-based model in place of the BiLSTM is a promising approach.

We propose LMParsCit, where the core architecture relies on an LLM (e.g., Llama3-8b-instruct).

Extract key metadata fields with high accuracy: title, authors, venue (i.e., container-title (e.g., journals, conferences, etc.) and publisher (e.g., institutions, organizations)), and year.

We contribute an evaluation benchmark, called ETDCite

Consisting of 1,653 citation strings in many styles (e.g., APA, MLA, IEEE, ACM)
Consisting of citation strings based on citation types (e.g., journals, conferences, etc.)
Representing various academic disciplines (e.g., arts, humanities, engineering, etc.)

48 of 56

RQ4: ETDCite – A Benchmark compiled from ETDs

Figure 18: ETDCite evaluation benchmark, distribution of the ETD reference across University, Discipline, Year, and Bibliography Types (e.g., J – Journal, Conf. – Conference, B – Book, In-B – In Book, TR – Tech Report, Th – Thesis)

49 of 56

RQ4: ETDCite – A Benchmark compiled from ETDs

Figure 19: Heatmap demonstrating the relationship between citation types vs. citation Styles

50 of 56

RQ4: ETDCite – A Benchmark compiled from ETDs

Figure 20: Heatmap demonstrating the relationship between academic disciplines vs. citation styles

51 of 56

RQ4: ETDCite – A Benchmark compiled from ETDs

Figure 21: Heatmap demonstrating the relationship between academic disciplines vs. citation types

52 of 56

RQ4: Prompt Engineering – Conceptual Overview

Design inputs by providing examples (i.e., shots) to guide LLMs to produce desired outputs.

Instruction-tuned: LLMs trained to follow instructions in a conversation.

<instruction> Extract Metadata from the following citation: Smith, J. (2021). Deep Learning in Practice. <output> {“author”: “Smith, J.”, “title”: “Deep Learning in Practice”, “year”: “2021”}

Few-shot vs. Zero-shot:

In few-shot prompting, several examples (typically, n < 5) are provided to guide the model.
In zero-shot prompting, the model is asked to perform a task without seeing any example.
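The difference between zero-shot and few-shot prompts can be sketched with a small prompt builder. The `<instruction>` / `<output>` layout follows the example on this slide. The helper names and the query citation are hypothetical, and the actual chat template used for Llama3-8b-instruct may differ.

```python
import json

INSTRUCTION = "Extract metadata from the following citation:"

def build_prompt(citation, shots=()):
    """Build a prompt; an empty `shots` gives a zero-shot prompt."""
    parts = []
    for example_citation, example_metadata in shots:
        # Each shot pairs a citation with its expected JSON output.
        parts.append(
            f"<instruction> {INSTRUCTION} {example_citation} "
            f"<output> {json.dumps(example_metadata)}"
        )
    # The final block leaves the output open for the model to complete.
    parts.append(f"<instruction> {INSTRUCTION} {citation} <output>")
    return "\n".join(parts)

def parse_output(text):
    """Parse the model's JSON output into a dictionary."""
    return json.loads(text)

# Shot taken from the example on this slide.
shot = (
    "Smith, J. (2021). Deep Learning in Practice.",
    {"author": "Smith, J.", "title": "Deep Learning in Practice", "year": "2021"},
)
query = "Doe, A. (2019). Parsing Citations."  # hypothetical citation

zero_shot_prompt = build_prompt(query)
few_shot_prompt = build_prompt(query, shots=[shot])
```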

Chain-of-Thought: encourages the model to think logically step-by-step, improving the quality of reasoning based tasks.

Q: What is 12 + 15 divided by 3?

A: Let’s solve it step by step. 15 divided by 3 is 5. Now, add 12 and 5. The answer is 17.

53 of 56

RQ4: LMParsCit – Llama3-8b-instruct Fine-tuned on GIANT 1B, Incorporating Prompts

Optimization:

Batch size of 2 and gradient accumulation over 4 steps.
Learning rate is set to 2e-4.
Employs AdamW optimizer with 8-bit precision and a weight decay of 0.01 (prevents overfitting).
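With a batch size of 2 and gradient accumulation over 4 steps, each weight update effectively sees 2 × 4 = 8 samples. The sketch below shows the accumulation pattern on a single scalar weight. It uses a plain gradient-descent update in place of 8-bit AdamW to stay short, and the gradient values are hypothetical.

```python
def train_with_accumulation(per_batch_grads, accumulation_steps, learning_rate, weight=0.0):
    """Apply one update per `accumulation_steps` mini-batches.

    NOTE: uses a plain gradient-descent step, not AdamW (simplifying assumption).
    """
    accumulated = 0.0
    updates = 0
    for step, grad in enumerate(per_batch_grads, start=1):
        # Average the gradients over the accumulation window.
        accumulated += grad / accumulation_steps
        if step % accumulation_steps == 0:
            weight -= learning_rate * accumulated
            accumulated = 0.0
            updates += 1
    return weight, updates

BATCH_SIZE = 2
ACCUMULATION_STEPS = 4
LEARNING_RATE = 2e-4

effective_batch_size = BATCH_SIZE * ACCUMULATION_STEPS

# Eight mini-batches with a hypothetical constant gradient of 1.0.
weight, updates = train_with_accumulation([1.0] * 8, ACCUMULATION_STEPS, LEARNING_RATE)
```

Accumulating gradients in this way trades extra steps for memory, which helps when a larger batch does not fit on a single GPU.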

Hardware:

16 CPU cores, 64GB of RAM, and one A40 GPU with 46GB of video memory.

Figure 22: LMParsCit Framework

54 of 56

RQ4: LMParsCit Performance on CORA, Comparing against Neural ParsCit, Achieved State-of-the-art Performance

Figure 23: Performance comparison of two LLM based methods (LMParsCit – only fine tuning, and LMParsCitPrompt – fine tuning & prompt engineering) on CORA

CORA:

Figure 23 shows that LMParsCit achieves an F1 score of 0.99 in all fields.
Compared against Neural ParsCit, LMParsCit improves performance by 0.6% to 6.94%, depending on the field.
Incorporating prompt engineering further improves the performance over Neural ParsCit. For example:

The title field improves by 0.16% with fine tuning only and by 1.86% when prompts are incorporated.
The venue field also improves by 6.94% from Neural ParsCit to LMParsCit.

55 of 56

RQ4: LMParsCit Performance on ETDCite, Comparing against Neural ParsCit, Achieved State-of-the-art Performance

Figure 24: Performance comparison of two LLM based methods (LMParsCit – only fine tuning, and LMParsCitPrompt – fine tuning & prompt engineering) on ETDCite

ETDCite:

Figure 24 shows that LMParsCit achieves F1 scores of 0.94 to 0.99, depending on the metadata field.
Compared against Neural ParsCit, LMParsCit improves performance by 0.8% to 32.8%, depending on the field.
Incorporating prompt engineering further improves the performance over Neural ParsCit. For example:

The title field improves by 7.2% with fine tuning only and by 9.4% when prompts are incorporated.
The author field sees the largest gain, improving by 28.4% with fine tuning only and by 32.8% when prompts are incorporated.
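A per-field F1 score of the kind reported above can be computed as sketched below. The slides do not state the matching criterion, so exact string match per field is an assumption, and the gold and predicted records are hypothetical.

```python
def field_f1(gold, predicted, field):
    """F1 for one metadata field under exact-match scoring (assumed criterion)."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        g_val, p_val = g.get(field), p.get(field)
        if p_val is not None and p_val == g_val:
            tp += 1
        else:
            if p_val is not None:
                fp += 1   # predicted a value that is wrong or spurious
            if g_val is not None:
                fn += 1   # missed the gold value
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical gold annotations and parser outputs.
gold = [
    {"title": "Deep Learning in Practice", "year": "2021"},
    {"title": "Parsing Citations", "year": "2019"},
]
predicted = [
    {"title": "Deep Learning in Practice", "year": "2021"},
    {"title": "Parsing", "year": "2019"},
]

title_f1 = field_f1(gold, predicted, "title")
year_f1 = field_f1(gold, predicted, "year")
```

Here the truncated second title counts as both a false positive and a false negative, so the title score drops while the year score is unaffected.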

56 of 56

Conclusion

We developed ETDSuite, a comprehensive suite consisting of four frameworks: AutoMeta, MetaEnhance, ETDPC, and LMParsCit, each targeting key challenges in the mining of ETDs.
AutoMeta, leveraging a CRF model with visual features, achieved F1 scores between 81.3% and 96% across various metadata fields, establishing it as a reliable solution for metadata extraction.
MetaEnhance uses AI-driven methods to detect, correct, and standardize metadata, achieving F1 scores ranging from 95% to 99%. This framework corrected errors in 85% to 98% of cases, significantly improving metadata quality and consistency.
ETDPC, a cross-attentive, two-stream multimodal model, attained F1 scores between 84% and 96% for 9 out of 13 ETD page categories, demonstrating high accuracy in document classification.
LMParsCit, powered by the Llama3-8b-instruct model, set a new standard in citation parsing, achieving an impressive 99.9% F1 score for extracting key metadata fields from diverse citation styles, types, and academic disciplines.
Through ETDSuite, we contributed valuable evaluation benchmarks (e.g., MetaEnhance-ETD500, ETDCite), datasets (e.g., ETD-500), and tools that enhance library repositories, addressing critical issues of accessibility, discoverability, and readability for ETDs.
