1 of 12

A Summary of Contributions to the ETD Project

Presented By:

Lamia Salsabil

PhD Student

Advisor: Dr. Jian Wu

Department of Computer Science

Old Dominion University, Norfolk, Virginia

November 11, 2022

@liya_lamia @WebSciDL

2 of 12

Research Study 1: ETD Segmentation

3 of 12

Page-Level Segmentation

  • Segment ETDs at the page level
  • Goal: Build and prepare a dataset for page-level segmentation
    • Page Labeling
    • ETD Text & Data Extraction

4 of 12

Page Labeling

  • Manually labeled 92,375 pages from 500 scanned ETDs
  • Labels span 14 categories
  • Page labels stored in text files
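As an illustration of how label files like these could be consumed, here is a minimal sketch. The file layout (one `page_number<TAB>label` pair per line) and the label names in the sample are assumptions made for the example; the slides do not specify the actual format or list the 14 categories.

```python
from collections import Counter
from io import StringIO


def read_page_labels(lines):
    """Parse 'page_number<TAB>label' lines into a {page: label} dict.

    The tab-separated layout is a hypothetical format, not the project's.
    """
    labels = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        page, label = line.split("\t", 1)
        labels[int(page)] = label.strip()
    return labels


# Hypothetical label file content, held in memory for the example.
sample = StringIO("1\ttitle\n2\tabstract\n3\tchapter\n4\tchapter\n")
page_labels = read_page_labels(sample)

# Per-category page counts, useful for checking class balance.
label_counts = Counter(page_labels.values())
```

The same function accepts an open file handle, so one label file per ETD could be read in a loop.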

Figure 1: ETD Pages - 14 Classes

5 of 12

ETD Text & Data Extraction

Figure 2: ETD Text & Data Extraction Pipeline (outputs: extracted text and bounding boxes)

6 of 12

Research Study 2: Computational Reproducibility Study Using URLs Linking to Open Access Datasets and Software

7 of 12

Past Work

  • Goal: Study computational reproducibility using URLs that link to open access datasets and software
  • Contributions:
    • Developed a hybrid classifier, OADSClassifier, to automatically identify OADS-URLs in scientific papers
    • OADSClassifier achieves a best F1 score of 0.92
    • Investigated the disciplinary dependencies and chronological trends of OADS in electronic theses and dissertations (ETDs) using OADSClassifier

OADS: Open Access Datasets and Software

OADS-URLs: URLs linking to OADS
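For readers less familiar with the metric reported for OADSClassifier, the sketch below shows how an F1 score is computed from true positives, false positives, and false negatives. The counts are illustrative only and are not taken from the OADSClassifier evaluation.

```python
def f1_score(tp, fp, fn):
    """Return F1, the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative counts (not from the study): precision and recall
# are both 0.8 here, so F1 is also 0.8 up to floating-point rounding.
example_f1 = f1_score(tp=8, fp=2, fn=2)
```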

8 of 12

Recent Work

  • Goal:
    • Classify URLs in scholarly articles into one of five classes:
      • Third Party Dataset
      • Third Party Software
      • Author Provided Dataset
      • Author Provided Software
      • General URLs
  • Plan: Develop a classifier based on attention mechanisms and transfer learning to assign each URL to one of these five classes

9 of 12

Dataset and Challenges

  • 1,028 sentences containing URLs
  • Data sources: CORD-19 dataset, in-house corpus of SBS papers, arXiv dataset, ETD corpus
  • Five classes
  • Challenges:
    • Small number of samples across the five classes
    • Experimented with Text-GCN and HAHNN
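To make the dataset shape concrete, here is a minimal sketch of pulling URLs out of labeled sentences and checking how samples are spread across the five classes. The regular expression, the helper name `extract_urls`, and the three sample sentences (with example.org URLs) are assumptions for illustration; they are not drawn from the actual corpus or preprocessing code.

```python
import re
from collections import Counter

# The five classes named on the "Recent Work" slide.
CLASSES = (
    "Third Party Dataset",
    "Third Party Software",
    "Author Provided Dataset",
    "Author Provided Software",
    "General URLs",
)

# Simple URL pattern (an assumption): stops at whitespace and common
# closing punctuation.
URL_PATTERN = re.compile(r"https?://[^\s)\]>,;]+")


def extract_urls(sentence):
    """Return URLs found in a sentence, without a trailing period."""
    return [url.rstrip(".") for url in URL_PATTERN.findall(sentence)]


# Hypothetical labeled sentences, for illustration only.
samples = [
    ("Our code is available at https://example.org/our-code.",
     "Author Provided Software"),
    ("We used the corpus hosted at https://example.org/corpus (accessed 2022).",
     "Third Party Dataset"),
    ("See https://example.org/about for details.", "General URLs"),
]

# Class distribution, and classes with no samples at all.
class_counts = Counter(label for _, label in samples)
missing_classes = [c for c in CLASSES if class_counts[c] == 0]
```

With only about a thousand sentences, a per-class count like this is a quick way to spot classes that are too sparse to train on.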

Figure 3: Sentences with URLs and their classes

10 of 12

Research Study 3: Text Quality Comparison

11 of 12

Text Extraction from Born-Digital ETDs

  • Compared two standard text extraction libraries, PDFPlumber and SymbolScraper
  • Built the comparison dataset by selecting 100 born-digital ETDs
  • Compared the number of characters and tokens extracted by each library
  • Inspected a sample of citation strings curated from the extracted text

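The character and token comparison can be sketched as follows. This assumes whitespace tokenization, which the slides do not specify, and it operates on plain strings rather than calling either library; the two sample strings are made up for the example.

```python
def text_stats(text):
    """Count characters and whitespace-separated tokens in a text."""
    return {"characters": len(text), "tokens": len(text.split())}


def compare_extractions(text_a, text_b):
    """Return per-metric differences (text_a minus text_b)."""
    stats_a, stats_b = text_stats(text_a), text_stats(text_b)
    return {key: stats_a[key] - stats_b[key] for key in stats_a}


# Hypothetical outputs of two extractors for the same passage; the
# second has lost the space between the first two words.
output_a = "Deep learning for ETDs"
output_b = "Deeplearning for ETDs"
difference = compare_extractions(output_a, output_b)
```

A lost space shows up as one fewer character and one fewer token, which is why the two counts are informative together.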
12 of 12

Citation String Based Comparison

  • Manually inspected 6,762 extracted citation strings
  • SymbolScraper failed to identify the white space between two words in 19 citation strings
  • For three citation strings, SymbolScraper failed to extract the entire string
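The inspection reported above was manual. As a hypothetical complement, the sketch below shows one way such issues could be flagged automatically by comparing an extracted string against a reference copy. The function name, the category names, and the sample citation are all assumptions for illustration.

```python
def strip_whitespace(text):
    """Remove all whitespace so strings can be compared by content."""
    return "".join(text.split())


def check_extraction(reference, extracted):
    """Classify an extracted string relative to a reference string.

    Returns one of four hypothetical categories:
    'empty', 'text_differs', 'merged_words', or 'match'.
    """
    if not extracted.strip():
        return "empty"
    if strip_whitespace(reference) != strip_whitespace(extracted):
        return "text_differs"
    if len(extracted.split()) < len(reference.split()):
        # Same characters but fewer tokens: a space was dropped.
        return "merged_words"
    return "match"


# Made-up citation string, for illustration only.
reference = "Doe, J. Example title. 2020."
results = [
    check_extraction(reference, "Doe, J. Example title. 2020."),
    check_extraction(reference, "Doe, J. Exampletitle. 2020."),
    check_extraction(reference, "Doe, J. Example"),
    check_extraction(reference, ""),
]
```

A check like this could narrow thousands of citation strings down to the few that need a human look.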
