ETDSuite: A Toolkit to Mine Electronic Theses and Dissertations To Enrich Scholarly Big Data Using Natural Language Processing and Computer Vision
Presented By:
Muntabir Hasan Choudhury
Ph.D. Candidate
Advisor: Dr. Jian Wu
Committee Members: Michael L. Nelson, Michele C. Weigle, Sampath Jayarathna, Edward A. Fox (Virginia Tech)
Department of Computer Science
Old Dominion University, Norfolk, Virginia
November 6, 2024
@TasinChoudhury @WebSciDL
Dissertation Defense Examination
Outline
ETDSuite: A Toolkit to Mine ETDs To Enrich Scholarly Big Data using NLP & CV @TasinChoudhury @WebSciDL, Dissertation Defense 11/06/24
What are Electronic Theses and Dissertations (ETDs)?
ETDs have Layouts Different from Regular Scholarly Articles
Abstract
Acknowledgement
Table of Contents
Chapter
List of Figures
List of Tables
Appendix
Chapter Abstract
Curriculum Vitae
Mining ETDs is Important
Extracting ETD Metadata Automatically is Challenging
ETD Cover Pages Vary Across Universities, Making Metadata Extraction Challenging
Figure: Sample ETD cover pages from MIT (1965), Stanford (1970), and the University of Michigan (1971), annotated with fields such as Title, Author, University, Degree, Year, Program, and Advisor
ETD Cover Page Layouts Evolve Over Time, Making Metadata Extraction Challenging
Figure 1: Measurement of the cosine distance with respect to the ETD in 1970
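The cosine distance in Figure 1 can be illustrated with a minimal sketch. The slide does not say how cover pages are turned into vectors, so the feature vectors and the `cosine_distance` helper below are assumptions made only for illustration.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical feature vectors: a 1970 reference page and a later page.
reference_1970 = [1.0, 0.0, 2.0]
later_page = [0.0, 1.0, 2.0]
print(cosine_distance(reference_1970, later_page))
```

A distance near 0 means the later page resembles the 1970 reference; larger values indicate greater layout drift under whatever representation is chosen.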
Metadata Represents a Key Aspect of Digital Objects but Quality Becomes a Concern
[1] Bui et al. (2006). An Assessment of Metadata Quality: A Case Study of the National Science Digital Library Metadata Repository. CAIS/ACSI’06.
[2] Uddin et al. (2021). Building A Large Collection of Multi-domain Electronic Theses and Dissertations. IEEE Big Data.
[3] Jung-Ran Park (2009). Metadata Quality in Digital Repositories: A Survey of the Current State of the Art. Cataloging & Classification Quarterly.
[4] Wu et al. (2014). The impact of user corrections on a crawl-based digital library: A CiteSeerX perspective. IEEE CollaborateCom’14.
Figure 2: Missing Values: Year (87%), Department (55%), University (98%)
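Missing-value rates like those in Figure 2 can be computed per field as sketched below. The record layout and the `missing_rate` helper are hypothetical; the toy records are not the repository data behind the figure.

```python
def missing_rate(records, field):
    """Fraction of records whose field is absent, None, or blank."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if not str(r.get(field) or "").strip())
    return missing / len(records)

# Toy metadata records, for illustration only.
records = [
    {"title": "A", "year": "1970", "university": ""},
    {"title": "B", "year": None},
    {"title": "C", "year": " ", "university": "MIT"},
    {"title": "D", "year": "1999", "university": None},
]

for field in ("title", "year", "university"):
    print(field, missing_rate(records, field))
```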
Optical Character Recognition (OCR) is Challenging for Scanned ETDs
Figure 3: OCR Challenges for ETDs: (i) scribble, (ii) stamp, (iii) overlapped letters, and (iv) copyright character
Figure 4: Noisy OCR Text using Tesseract
Limitations of the Existing Frameworks to Segment ETDs
[1] Ahuja et al. (2022). Parsing Electronic Theses and Dissertations Using Object Detection. WIESP@ACL’22.
[2] Wang et al. (2022). YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv:2207.02696.
[3] Xu et al. (2021). LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding. ACL/IJCNLP’21.
[4] Appalaraju et al. (2021). DocFormer: End-to-End Transformer for Document Understanding. ICCV’21.
[5] Harley et al. (2015). Evaluation of deep convolutional nets for document image classification and retrieval. ICDAR’15.
Segmentation of ETDs is Challenging Due to a Lack of Labeled Datasets
[1] Kahu et al. (2021). ScanBank: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations. JCDL’21.
[2] Ling et al. (2021). Document Domain Randomization for Deep Learning Document Layout Extraction. ICDAR’21.
Figure 5: Number of training samples for each class
Accurately Parsing Citation Strings from ETDs is Challenging
[1] Park et al. (2012). A hybrid two-stage approach for discipline-independent canonical representation extraction from references. JCDL’12.
[2] Prasad et al. (2018). Neural ParsCit: A Deep Learning Based Reference String Parser. IJDL.
Research Questions
Research Goals & Objectives
[1] Choudhury et al. (2021). Automatic Metadata Extraction Incorporating Visual Features from Scanned Electronic Theses and Dissertations. JCDL’21.
[2] Choudhury et al. (2023). MetaEnhance: Metadata Quality Improvement for Electronic Theses and Dissertations of University Libraries. JCDL’23. (Best Paper Award)
[3] Choudhury et al. (2024). ETDPC: A Multimodality Framework for Classifying Pages in Electronic Theses and Dissertations. IAAI’24.
Datasets & Benchmarks for RQ1 (i.e., AutoMeta), RQ2 (i.e., MetaEnhance), RQ3 (i.e., ETDPC), and RQ4 (i.e., LMParsCit)
Dataset | Count | Description | Format | Task |
AutoMeta-ETD500 | 500 | Scanned ETDs with annotated metadata from their cover pages | XML / PDF / PNG | Extracting metadata from scanned ETDs |
ETDPC-ETD500 | 92,371 | ETD pages manually labeled into 13 categories, with the positions and sizes of the text and the bounding boxes (bbox) extracted by AWS Textract, an OCR tool | TXT / JSON / PNG | Classifying ETD pages |
MetaEnhance-ETDQual500 | 500 | Benchmark of 500 ETDs built by combining subsets (i.e., university, year, department, degree fields) selected using multiple criteria. Each criterion ensures that multiple error types (e.g., missing values, misspellings, and incorrect values) are represented | PDF / CSV | Improving and enhancing the quality of metadata |
ETDCite | 1,653 | Benchmark of annotated citation strings collected from 20 ETDs, representing 5 citation types (e.g., journals, conferences, books) and 18 citation styles (e.g., APA, MLA, Harvard, Chicago) | JSONL | Parsing citation strings in many styles |
Datasets are available on Harvard Dataverse:
AutoMeta-ETD500: https://doi.org/10.7910/DVN/18D6AZ
ETDPC-ETD500: https://doi.org/10.7910/DVN/MSFVLQ
MetaEnhance-ETDQual500: https://doi.org/10.7910/DVN/PI3U1V
ETDCite: https://doi.org/10.7910/DVN/ANU6LM
Table 1: Datasets and benchmarks compiled by crawling 114 US universities, which collected over 533,047 ETDs in full text.
RQ1: General Machine Learning Approaches to Metadata Extraction Are Designed for Journal and Conference Articles
[1] Lopez, P. (2009). GROBID: Combining Automatic Bibliographic Data Recognition and Term Extraction for Scholarship Publications. ECDL’09.
[2] Tkaczyk et al. (2015). CERMINE: Automatic Extraction of Structured Metadata from Scientific Literature. IJDAR.
[3] Shmilovici, A. (2005). Support Vector Machines. In: Maimon, O., Rokach, L. (eds) Data Mining and Knowledge Discovery Handbook. Springer, Boston.
RQ1: Heuristic-Based Rules and Their Performance on Seven Key Metadata Fields [1]
Figure 6: Rules for extracting metadata, accuracy measures. Acln% and Aocr% are based on TXT-clean and TXT-OCR datasets, respectively.
Figure 7: Example of Degree Extraction from MIT and Virginia Tech Sample
[1] Choudhury et al. (2020). A heuristic baseline method for metadata extraction from scanned electronic theses and dissertations. JCDL’20. (Best Poster Award)
Regular Expression
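A regular-expression rule for the degree field might look like the sketch below. The pattern and the `extract_degree` helper are illustrative assumptions and do not reproduce the rules shown in Figure 6; a real rule set would need to cover many more degree names.

```python
import re

# Illustrative pattern only; not the rule set from the slide.
DEGREE_PATTERN = re.compile(
    r"\b(Doctor of Philosophy|Master of Science|Master of Arts"
    r"|Ph\.?\s?D\.?|M\.?S\.?)(?!\w)",
    re.IGNORECASE,
)

def extract_degree(cover_text):
    """Return the first degree phrase found in cover-page text, or None."""
    # Collapse line breaks and repeated spaces that OCR often introduces.
    flat = re.sub(r"\s+", " ", cover_text)
    match = DEGREE_PATTERN.search(flat)
    return match.group(1) if match else None

sample = "Submitted in partial fulfillment\nof the degree of\nDoctor of   Philosophy"
print(extract_degree(sample))
```

Whitespace is normalized before matching because OCR text often breaks a degree phrase across lines.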
RQ1: Developed AutoMeta Using CRF Model with Sequence Tagging Method by Incorporating Textual & Visual Features for Scanned ETDs
Example tags: B-title, I-title, I-title, B-author, I-author, I-author
[1] Wu et al. (2017). Hesdk: A hybrid approach to extracting scientific domain knowledge entities. JCDL’17.
[2] Councill et al. (2008). ParsCit: an open-source CRF reference string parsing package. LREC’08.
Figure 8: Sequence tagging using BIO schema
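The BIO schema in Figure 8 can be sketched as a small function that turns labeled token spans into per-token tags. The tokens, spans, and the `bio_tags` helper are hypothetical examples, not data from the AutoMeta corpus.

```python
def bio_tags(tokens, spans):
    """Assign BIO tags to tokens.

    spans: list of (start, end, label) in token indices, end exclusive.
    Tokens outside every span are tagged "O".
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the field
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the field
    return tags

# Hypothetical cover-page tokens.
tokens = ["Mining", "Scanned", "Theses", "by", "Jane", "Q.", "Doe"]
spans = [(0, 3, "title"), (4, 7, "author")]
for token, tag in zip(tokens, bio_tags(tokens, spans)):
    print(token, tag)
```

A sequence tagger such as a CRF is then trained to predict one of these tags for every token.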
RQ1: Textual Features for AutoMeta
Figure 9: TXT-clean - a cleansed version of TXT-OCR
Proper Noun (NNP)
RQ1: Visual Features for AutoMeta
Figure 10: Bounding Box measurement for a token
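Visual features can be derived from a token's bounding box roughly as sketched below. The specific features (normalized center, width, height) and the `bbox_features` helper are assumptions for illustration; the slide does not list the exact features AutoMeta uses.

```python
def bbox_features(bbox, page_width, page_height):
    """Normalize a token bounding box (x0, y0, x1, y1) by the page size."""
    x0, y0, x1, y1 = bbox
    return {
        "x_center": (x0 + x1) / 2 / page_width,
        "y_center": (y0 + y1) / 2 / page_height,
        "width": (x1 - x0) / page_width,
        "height": (y1 - y0) / page_height,
    }

# Hypothetical token box on a 1000 x 2000 pixel page.
features = bbox_features((100, 50, 300, 90), page_width=1000, page_height=2000)
print(features)
```

Normalizing by page size keeps the features comparable across scans of different resolutions.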
RQ1: AutoMeta, a CRF Model Incorporating Text and Visual Features, Achieved F1 of 0.813 to 0.96 and Increased F1 Significantly by 0.7% to 10.6%
Figure 11: Performance (F1) comparison of the models
RQ2: Existing Works on Metadata Quality Improvement Rely on Manual Approaches, Which are Slow and Not Scalable
CiteSeerX: https://citeseerx.ist.psu.edu/
[1] Wu et al. (2014). The impact of user corrections on a crawl-based digital library: A CiteSeerX perspective. IEEE CollaborateCom’14.
[2] Bui et al. (2006). An Assessment of Metadata Quality: A Case Study of the National Science Digital Library Metadata Repository. CAIS/ACSI’06.
RQ2: Proposed MetaEnhance, Which Uses AI Methods and Models to Automatically Improve Metadata Quality and Is More Scalable Than Manual Approaches
Figure 12: MetaEnhance Framework – Error Detection, Error Correction, and Canonicalization
RQ2: Error Detection (incorrect or missing) and Correction (AutoMeta)
[1] Rohatgi et al. (2021). What Were People Searching For? A Query Log Analysis of An Academic Search Engine. JCDL’21.
[2] Cohan et al. (2020). SPECTER: Document-level Representation Learning using Citation-informed Transformers. ACL’20.
[1] Akbik et al. (2018). Contextual String Embeddings for Sequence Labeling. COLING’18.
OntoNotes dataset: https://catalog.ldc.upenn.edu/LDC2013T19
RQ2: Error Detection (acronyms) and Correction (canonicalization)
Field | Acronym/Colloquial | Full Name |
University (thesis.degree.generator) | JHU, jhu | Johns Hopkins University |
Degree (thesis.degree.name) | M.PHIL, MPHIL | Master of Philosophy |
Department (thesis.degree.discipline) | MSE, MSCE | Materials Science and Engineering |
[1] Nils Reimers and Iryna Gurevych. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP-IJCNLP’19.
Table 2: Example of acronym/colloquial names and their full names
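The mapping in Table 2 can be sketched as a lookup keyed on a normalized form of the acronym. This is a deliberately simple stand-in: the slide cites Sentence-BERT, whereas the sketch below uses a plain dictionary, and the `canonicalize` and `normalize_key` helpers are hypothetical.

```python
import re

# Entries taken from Table 2; a real table would be far larger.
CANONICAL_NAMES = {
    "university": {"jhu": "Johns Hopkins University"},
    "degree": {"mphil": "Master of Philosophy"},
    "department": {
        "mse": "Materials Science and Engineering",
        "msce": "Materials Science and Engineering",
    },
}

def normalize_key(value):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", value.lower())

def canonicalize(field, value):
    """Return the full name for a known acronym, else the value unchanged."""
    return CANONICAL_NAMES.get(field, {}).get(normalize_key(value), value)

print(canonicalize("degree", "M.PHIL"))
print(canonicalize("university", "jhu"))
```

Normalizing the key lets variants such as "M.PHIL" and "MPHIL" resolve to the same entry; values with no entry pass through untouched.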
RQ2: Error Detection (may contain misspellings) and Correction
Pyspellchecker: https://pypi.org/project/pyspellchecker/
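Dictionary-based spelling correction can be illustrated with the standard library alone. The sketch below uses `difflib.get_close_matches` as a stand-in for pyspellchecker, so it shows the general idea rather than the tool named on the slide; the vocabulary and helper names are hypothetical.

```python
import difflib

# Hypothetical vocabulary of words expected in department names.
VOCABULARY = ["computer", "science", "engineering", "department", "physics"]

def correct_word(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the input if none is close."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

def correct_phrase(phrase):
    """Correct each whitespace-separated word independently."""
    return " ".join(correct_word(w) for w in phrase.split())

print(correct_phrase("Computr Sciense"))
```

Corrected words come back in lowercase because matching is done on lowercased text; restoring the original casing is left out of this sketch.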
RQ2: Error Detection (inconsistency) and Correction (need canonicalization)
RQ2: Evaluation Benchmark Dataset (i.e., ETDQual500) – Distribution of Errors
Field | #Missing | #Acronym | #Spell | #Incorrect |
Title | 0 | 0 | 0 | 1 |
Author | 2 | 0 | 0 | 0 |
Advisor | 150 | 35 | 0 | 0 |
University | 6 | 43 | 0 | 0 |
Year | 172 | 1 | 0 | 0 |
Degree | 156 | 82 | 0 | 4 |
Department | 269 | 85 | 2 | 0 |
Table 3: Error distribution in the ETDQual500 benchmark dataset of 500 ETDs.
RQ2: MetaEnhance Achieved Nearly Perfect F1 Scores in Detecting Errors and 0.85 to 1.00 For Correcting Five of Seven Key Metadata Fields
Field | P (ED) | R (ED) | F1 (ED) | P (ECC) | R (ECC) | F1 (ECC) |
Title | 0.997 | 1.0 | 0.998 | 0.0 | 0.0 | 0.0 |
Author | 0.996 | 1.0 | 0.997 | 0.0 | 0.0 | 0.0 |
Advisor | 0.920 | 0.990 | 0.950 | 1.0 | 1.0 | 1.0 |
University | 1.0 | 1.0 | 1.0 | 0.740 | 1.0 | 0.850 |
Year | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
Degree | 1.0 | 1.0 | 1.0 | 0.980 | 1.0 | 0.980 |
Department | 0.996 | 1.0 | 0.997 | 0.997 | 1.0 | 0.980 |
Table 4: Performance of Error Detection (ED) and Error Corrections and Canonicalization (ECC)
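For reference, F1 is conventionally the harmonic mean of precision (P) and recall (R). The sketch below assumes Table 4 follows that standard definition; values recomputed from the rounded P and R entries need not match the reported F1 exactly.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score(1.0, 1.0))            # perfect precision and recall
print(round(f1_score(0.5, 1.0), 4))  # generic example, not a table row
print(f1_score(0.0, 0.0))            # the degenerate case, handled explicitly
```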
RQ3: Existing Work Fine-Tuned Vision-Based Models to Segment ETDs, Yielding Low Performance
[1] Simonyan and Zisserman (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR’15.
Figure 13: Comparison against baseline models. We report classification accuracy and macro F1 on the test set
Figure 14: ETDPC, a Multimodality Framework (I: Image, T: Text)
RQ3: Proposed ETDPC, A Two Stream Multimodal Classification Model with Cross-Attention That Uses a Vision Encoder (e.g., ResNet-50v2) and a Text Encoder (e.g., BERT with Talking Heads Attention)
RQ3: Method – Adopted ResNet50 (Version 2) as a Visual Modality
Figure 15: ResNet vs. ResNetv2
Source: https://shorturl.at/xBSZ0
[1] He et al. (2016). Identity Mappings in Deep Residual Networks. ECCV’16.
[2] He et al. (2016). Deep Residual Learning for Image Recognition. CVPR’16.
[3] Russakovsky et al. (2015). ImageNet Large Scale Visual Recognition Challenge. arXiv:1409.0575.
CIFAR10/100: https://www.cs.toronto.edu/~kriz/cifar.html
RQ3: Method – Adopted BERT with Talking-Heads as a Textual Modality
[1] Shazeer et al. (2020). Talking-Heads Attention. arXiv:2003.02436.
[2] Devlin et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT’19.
AWS Textract: https://aws.amazon.com/textract/ocr/
RQ3: Method – Multimodal with Cross-Attention
[1] Chen et al. (2021). CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification. ICCV’21.
[2] Wei et al. (2020). Multi-Modality Cross Attention Network for Image and Sentence Matching. CVPR’20.
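Cross-attention between modalities can be sketched as scaled dot-product attention in which queries come from one modality and keys and values from the other. This pure-Python version omits the learned projection matrices and multiple heads of a real layer, and the toy feature vectors are hypothetical; it is not the ETDPC implementation.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Attend from each query vector over the key/value vectors."""
    d = len(keys[0])
    output = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        output.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return output

# Hypothetical features: two text tokens attend over three image patches.
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_patches = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
attended = cross_attention(text_tokens, image_patches, image_patches)
print(attended)
```

Each output row is a weighted average of the image patch vectors, with weights determined by how well the text token matches each patch.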
RQ3: Proposed a Method to Augment Minority ETD Pages
[1] Zhang et al. (2020). PEGASUS: Pre-Training with Extracted Gap-Sentences for Abstractive Summarization. ICML’20.
[2] Kahu et al. (2021). ScanBank: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations. JCDL’21.
RQ3: Experiments – Fine Tuning Hyperparameters & Training
RQ3: ETDPC Outperforms the Previous Models, Achieving F1 of 0.84 – 0.96 for 9 out of 13 Categories
Figure 16: Performance on ETD samples in the test set – a) performance of one-level classifier (i.e., Case a), where ETDPC is trained on ETDPC–ETD500; b) performance of two-level classifier (i.e., Case b), training first on chapter vs. non-chapter pages, and test on the remaining categories, including 21,171 manually labeled samples; and c) performance of the two-level classifier (Case c), trained on 21,171 manually labeled samples and 5,984 augmented samples. We highlight the categories with remarkable improvements of F1 scores.
Case a: single-level classifier
Case b: two-level classifier
Case c: two-level classifier + data augmentation
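One way to read the two-level setup in Cases b and c is as a router: a first model separates chapter pages from non-chapter pages, and a second model assigns the remaining categories. The sketch below follows that reading, which is an assumption; both models are replaced by toy callables with hypothetical names.

```python
def classify_page(page, is_chapter, classify_non_chapter):
    """Route a page through two levels.

    is_chapter and classify_non_chapter are stand-ins for trained models.
    """
    if is_chapter(page):
        return "Chapter"
    return classify_non_chapter(page)

# Toy stand-ins keyed on page text; real models would use image and text features.
def toy_is_chapter(page):
    return page["text"].lower().startswith("chapter")

def toy_non_chapter(page):
    return "Abstract" if "abstract" in page["text"].lower() else "Other"

pages = [{"text": "Chapter 1 Introduction"}, {"text": "ABSTRACT This thesis"}]
for page in pages:
    print(classify_page(page, toy_is_chapter, toy_non_chapter))
```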
RQ3: Performed an Ablation Study on Changing the Text Encoder in the Multimodal Model and on Using Individual Modalities
Figure 17: Ablation Study – a) Experiment 1 illustrates the performance increment when changing the original BERT model to BERT with Talking-Heads. b) Experiment 2 illustrates the performance of using the individual modalities vs. the multimodal model with cross-attention.
(a) Experiment 1
(b) Experiment 2
BERT with Talking-Heads (red) beats BERT-large (blue)
Multimodality (yellow) outperforms single modality (blue and red)
RQ4: Existing Open-Source Citation Parsers and the Metadata Fields They Extract
Citation Parser | Approach | Extracted Fields |
 | Regular Expression | author, date, editor, genre, issue, pages, publisher, title, volume, year |
BibPro [1] | Template Matching | author, title, venue, volume, issue, page, date, journal, booktitle, techReport |
CERMINE [2] | ML (CRF) | author, issue, pages, title, volume, year, DOI, ISSN |
GROBID [3] | ML (CRF) | authors, booktitle, date, editor, issue, journal, location, note, pages, publisher, title, volume, web, institution |
ParsCit [4] | ML (CRF) | author, booktitle, date, editor, institution, journal, location, title, volume |
Neural ParsCit [5] | Deep Learning (BiLSTM-CRF) | author, booktitle, date, editor, institution, journal, location, note, pages, publisher, tech, title, volume |
TransParsCit [6] | Deep Learning (BiLSTM-Transformer-CRF) | author, container title, date, page, publisher, punc, title, volume |
[1] Chen et al. (2008). BibPro: A Citation Parser Based on Sequence Alignment Techniques. WAINA’08.
[2] Tkaczyk et al. (2015). CERMINE: Automatic Extraction of Structured Metadata from Scientific Literature. IJDAR.
[3] Lopez, P. (2009). GROBID: Combining Automatic Bibliographic Data Recognition and Term Extraction for Scholarship Publications. ECDL’09.
[4] Councill et al. (2008). ParsCit: an open-source CRF reference string parsing package. LREC’08.
[5] Prasad et al. (2018). Neural ParsCit: A Deep Learning Based Reference String Parser. IJDL.
[6] Uddin (2022). TransParsCit: A Transformer-Based Citation Parser Trained on Large-Scale Synthesized Data. Master of Science (MS) Thesis, Computer Science, Old Dominion University.
RQ4: Existing Training Datasets for the Citation Parsing Task
Dataset Name | #Instances | Domain |
Cora | 1,295 | Computer Science |
CiteSeer | 1,563 | Artificial Intelligence |
UMass | 1,829 | STEM |
FLUX-CiM HS | 2,000 | Health Science |
PubMed Central | Varies | Biomedical |
GROTOAP2 | 6,858 | Biomedical and Computer Science |
Venice | 40,000 | Humanities |
GIANT | 991 million (~1B) | Multi-Domain (~1,500 Citation Styles) |
RQ4: Existing Models Are Trained on Focused Domains, Which May Lead to Poor Performance on ETD Citation Strings Written in Many Styles Across Academic Disciplines
RQ4: TransParsCit, Which Combines Recurrent Neural Networks (e.g., LSTM, BiLSTM) with a Transformer and CRF, Performed Poorly at Parsing Citations
[1] Vaswani et al. (2017). Attention is all you need. NeurIPS’17.
[2] Schuster et al. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing.
RQ4: We Propose LMParsCit, an LLM-Based Model, and Contribute a Challenging Benchmark to Overcome the Limitations of Existing Methods
RQ4: ETDCite – A Benchmark compiled from ETDs
Figure 18: ETDCite evaluation benchmark, distribution of the ETD reference across University, Discipline, Year, and Bibliography Types (e.g., J – Journal, Conf. – Conference, B – Book, In-B – In Book, TR – Tech Report, Th – Thesis)
Figure 19: Heatmap demonstrating the relationship between citation types vs. citation styles
Figure 20: Heatmap demonstrating the relationship between academic disciplines vs. citation styles
Figure 21: Heatmap demonstrating the relationship between academic disciplines vs. citation types
RQ4: Prompt Engineering – Conceptual Overview
<instruction> Extract Metadata from the following citation: Smith, J. (2021). Deep Learning in Practice.
<output> {"author": "Smith, J.", "title": "Deep Learning in Practice", "year": "2021"}
Q: What is 12 + 15 divided by 3?
A: Let’s solve it step by step. 15 divided by 3 is 5. Now, add 12 and 5. The answer is 17.
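The instruction-style prompt above can be assembled and its JSON output parsed as in the following sketch. No model is called: the `response` string is a stand-in for model output, and the `build_prompt` and `parse_output` helpers are hypothetical rather than part of LMParsCit.

```python
import json

def build_prompt(citation):
    """Wrap a citation string in the instruction format shown on the slide."""
    return (
        "<instruction> Extract Metadata from the following citation: "
        f"{citation} <output> "
    )

def parse_output(raw):
    """Parse model output as a JSON object; return None if that fails."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None

prompt = build_prompt("Smith, J. (2021). Deep Learning in Practice.")
# Stand-in for a model response; a real pipeline would generate this text.
response = '{"author": "Smith, J.", "title": "Deep Learning in Practice", "year": "2021"}'
fields = parse_output(response)
print(prompt)
print(fields["title"], fields["year"])
```

Returning `None` for malformed output lets a caller retry or fall back rather than crash when the model produces text that is not valid JSON.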
RQ4: LMParsCit – Using GIANT 1B, Fine-tuned Llama3-8b-instruct Incorporating Prompts
Optimization:
Hardware:
Figure 22: LMParsCit Framework
RQ4: LMParsCit Achieved State-of-the-Art Performance on CORA, Compared against Neural ParsCit
Figure 23: Performance comparison of two LLM-based methods (LMParsCit: fine-tuning only; LMParsCitPrompt: fine-tuning and prompt engineering) on CORA
CORA:
RQ4: LMParsCit Achieved State-of-the-Art Performance on ETDCite, Compared against Neural ParsCit
Figure 24: Performance comparison of two LLM-based methods (LMParsCit: fine-tuning only; LMParsCitPrompt: fine-tuning and prompt engineering) on ETDCite
ETDCite:
Conclusion