1 of 12

A Summary of Contributions to the ETD Project

Presented By:

Lamia Salsabil

PhD Student

Advisor: Dr. Jian Wu

Department of Computer Science

Old Dominion University, Norfolk, Virginia

November 11, 2022

@liya_lamia @WebSciDL

2 of 12

Research Study 1: ETD Segmentation

3 of 12

Page-Level Segmentation

Segment ETDs at the page level
Goal: Build and prepare a dataset for page-level segmentation

Page Labeling
ETD Text & Data Extraction

4 of 12

Page Labeling

Manually labeled 92,375 pages from 500 scanned ETDs
Each page assigned to one of 14 categories
Page labels stored in text files
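
The slides do not describe the layout of the label files. As a minimal sketch, assuming a hypothetical format with one tab-separated "page number, label" pair per line, the labels could be loaded and tallied as below. The function names and the category names in the sample are illustrative only.

```python
from collections import Counter


def parse_page_labels(text):
    """Parse lines of the assumed form <page_number><TAB><label> into a dict."""
    labels = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        page, label = line.split("\t", 1)
        labels[int(page)] = label.strip()
    return labels


def label_counts(labels):
    """Count how many pages fall into each category."""
    return Counter(labels.values())


# Hypothetical file content; these category names are not taken from the slides.
sample = "1\tTitle\n2\tAbstract\n3\tChapter\n4\tChapter\n"
page_labels = parse_page_labels(sample)
print(label_counts(page_labels))
```

A per-category tally like this is a quick way to check class balance before training a page-level classifier.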

Figure 1: ETD Pages - 14 Classes

5 of 12

ETD Text & Data Extraction

Extracted Text

Bounding Box

Figure 2: ETD Text & Data Extraction Pipeline

6 of 12

Research Study 2: Computational Reproducibility Study Using URLs Linking to Open Access Datasets/Software

7 of 12

Past Work

Goal: To study computational reproducibility using URLs linking to open access datasets and software
Contribution:

Developed a hybrid classifier, OADSClassifier, to automatically identify OADS-URLs in a scientific paper
OADSClassifier achieves a best F1 score of 0.92
Investigated the disciplinary dependencies and chronological trends of OADS in electronic theses and dissertations (ETDs) using OADSClassifier
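
For reference, the F1 score is the harmonic mean of precision and recall. A minimal sketch of the computation follows; the counts in the usage line are made up for illustration and are not results from this work.

```python
def f1_score(tp, fp, fn):
    """Compute F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only: precision = recall = 0.8, so F1 = 0.8.
print(round(f1_score(tp=8, fp=2, fn=2), 2))
```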

OADS: Open Access Datasets and Software

OADS-URLs: URLs linking to OADS

8 of 12

Recent Work

Goal:

To classify URLs in scholarly articles into one of 5 classes:

Third Party Dataset
Third Party Software
Author Provided Dataset
Author Provided Software
General URLs

We want to develop a classifier based on an attention mechanism and transfer learning to classify the URLs into these 5 classes

9 of 12

Dataset and Challenges

1,028 sentences containing URLs
Data Sources: CORD-19 Dataset, In-house corpus of SBS Papers, arXiv Dataset, ETD Corpus
There are 5 classes
Challenges:

Small number of samples across the 5 classes
Experimented with Text-GCN and HAHNN

Figure 3: Sentences with URLs and their classes

10 of 12

Research Study 3: Text Quality Comparison

11 of 12

Text Extraction from Born-Digital ETDs

Compared two standard libraries, PDFPlumber and SymbolScraper
Built the comparison dataset by selecting 100 born-digital ETDs
Compared the number of characters and tokens extracted by each library
Inspected a sample of citation strings curated from the extracted text
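
The character and token comparison can be sketched as follows. This assumes whitespace tokenization and that each library's output is already available as a plain string. The slides do not state how tokens were counted, and the sample strings are invented for illustration.

```python
def text_stats(text):
    """Return character and whitespace-token counts for one extracted text."""
    return {"characters": len(text), "tokens": len(text.split())}


def compare_extractions(text_a, text_b):
    """Return the per-metric difference (A minus B) between two extractions."""
    a, b = text_stats(text_a), text_stats(text_b)
    return {key: a[key] - b[key] for key in a}


# Invented outputs from two hypothetical extraction runs on the same passage.
out_a = "Deep learning for document analysis"
out_b = "Deep learningfor document analysis"
diff = compare_extractions(out_a, out_b)
print(diff)
```

A missing space lowers both counts by one here, which is why comparing characters and tokens together can hint at merged words.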

12 of 12

Citation String Based Comparison

Manually inspected 6,762 extracted citation strings
SymbolScraper failed to identify white space between two words in 19 citation strings
SymbolScraper failed to extract the entire string for three citation strings
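
The inspection described here was manual. As a hedged sketch only, a heuristic like the one below could pre-screen citation strings for the two failure types, assuming a reference extraction of the same string is available (for example, from the other library). The function names and sample strings are hypothetical, and this is not the method used in the study.

```python
def _squash(text):
    """Remove all whitespace so only the extracted characters are compared."""
    return "".join(text.split())


def has_missing_spaces(candidate, reference):
    """True if both strings hold the same characters but candidate has fewer tokens."""
    same_chars = _squash(candidate) == _squash(reference)
    return same_chars and len(candidate.split()) < len(reference.split())


def is_incomplete(candidate, reference):
    """True if candidate contains fewer non-space characters than reference."""
    return len(_squash(candidate)) < len(_squash(reference))


# Invented citation strings for illustration.
reference = "A. Author and B. Writer. An example title. 2020."
merged = "A. Author and B.Writer. An example title. 2020."
partial = "A. Author and B. Writer. An example"

print(has_missing_spaces(merged, reference))
print(is_incomplete(partial, reference))
```

Strings flagged this way would still need manual confirmation, since the reference extraction can itself contain errors.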