The AstroBibPile: Building a Dataset to Support AI-enabled Bibliography Curation Efforts
Alberto Accomazzi
IVOA Interop | 22 May 2024
What is the NASA Science Explorer?
NASA SciX is a literature-based, open digital information system covering and unifying the research disciplines funded by the NASA Science Mission Directorate.
It represents an extension of the NASA Astrophysics Data System to include all literature relevant to NASA Science research.
SciX supports NASA’s Open Science efforts and enables interdisciplinary research and collaboration.
SciX currently indexes 20M articles and 260k data and software records, and provides links to almost 500k data products.
Context
The NASA Science Explorer (SciX) is primarily a literature database. SciX does not aim to be an index for all research data products, but rather to make relevant data products discoverable from the literature, whenever feasible, through citations or data links.
Some types of data which are of most interest to SciX:
Two Strategies for Metadata Enrichment - Curation
Curation of Bibliographies
ADS has been aggregating and exposing connections between bibliographic records and data products which are curated by librarians and archivists.
The largest contributors to this effort are projects in astrophysics which track astronomical objects (SIMBAD and NED), data catalogs (VizieR), and archives (Chandra, MAST, ESA, NOIRLab, etc.).
This provides a way to enrich new and existing bibliographic records whenever associated data is identified or entered in a knowledge base.
Thanks to librarians and archivists for enabling this capability!
Two Strategies for Metadata Enrichment - TDM
Text Mining of the Literature
SciX obtains and processes the full text of all papers in its database, maintains a citation database, and mines links to data products.
SciX detects the citation (in a reference list) or mention (in a data availability statement) of a software or data product, and records it in its database.
This helps, but doesn’t replace, the curation work described earlier, which requires human evaluation of the content and context in which data products are mentioned in the literature.
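As a rough illustration of what detecting a mention can involve, the sketch below scans a passage of text for DOIs and arXiv identifiers. It is not the SciX pipeline: the regular expressions are simplified assumptions, the function name `find_identifiers` is hypothetical, and the example sentence is made up for the demonstration (it reuses two identifiers cited elsewhere in this talk).

```python
import re

# Simplified patterns (assumptions): real DOI and arXiv matching has more edge cases.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
ARXIV_RE = re.compile(r"\barXiv:(\d{4}\.\d{4,5})", re.IGNORECASE)


def find_identifiers(text):
    """Return the DOIs and arXiv IDs mentioned in a passage of text."""
    # Strip punctuation that commonly trails an identifier at the end of a sentence.
    dois = [m.rstrip(".,;)") for m in DOI_RE.findall(text)]
    arxiv_ids = ARXIV_RE.findall(text)
    return {"doi": dois, "arxiv": arxiv_ids}


# Made-up data availability statement, used only to exercise the function.
statement = (
    "The data underlying this article are available at "
    "https://doi.org/10.5281/zenodo.8415073. See also arXiv:2312.08579."
)
print(find_identifiers(statement))
```

Pattern matching of this kind only finds candidate links. Deciding whether a paper actually used the data product it mentions is the part that still calls for human evaluation.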
Curation Workflow (Observatory Bibliographies)
Current Process:
“There is tremendous diversity in the ways bibliographers track publications and maintain databases, due to parameters such as resources (personnel, time, budget, IT capabilities), type of observatory, historical practices, and reporting requirements to funders and outside agencies.” (Observatory Bibliographers Collaboration 2024, arXiv:2401.00060)
Curation Workflow (Observatory Bibliographies)
Future Trends:
“Efforts are underway to implement an automated paper classification system at STScI/MAST to identify science papers within a set of candidate papers; however, even if this product comes to fruition in the 2020s, it is expected that human intervention will be needed to extract additional information about the paper” (Observatory Bibliographers Collaboration 2024).
“It is worth noting that this ML approach does not completely remove human involvement in the process. Human expertise and learning are needed for marginal cases that are not resolved by existing capabilities. [...] There needs to be a continuing education program for retraining and updating the classifier models with new literature, which will require new labels identified by human experts periodically” (Chen et al. 2022).
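One common way to keep humans in the loop for marginal cases is to route papers by classifier confidence. The sketch below is a generic illustration of that idea, not the STScI/MAST system or the method of Chen et al.; the function name `triage`, the thresholds, and the paper identifiers and scores are all hypothetical.

```python
def triage(score, low=0.2, high=0.8):
    """Route a paper by classifier score (hypothetical thresholds).

    score: estimated probability that the paper is a science paper
    for the observatory in question.
    """
    if score >= high:
        return "accept"        # confident positive: add to the bibliography
    if score <= low:
        return "reject"        # confident negative: discard
    return "human_review"      # marginal case: send to a bibliographer


# Illustrative scores only; a real system would take them from a trained model.
candidates = {"paper-A": 0.95, "paper-B": 0.55, "paper-C": 0.05}
decisions = {paper: triage(score) for paper, score in candidates.items()}
print(decisions)
```

Labels assigned during human review can then be fed back as new training data, which is the periodic retraining the quoted passage calls for.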
Human vs. Machine
Human Curation
Automated Text Mining
What might be Possible
Use NLP and AI to accelerate progress
Some of these techniques have been successfully applied to identifying papers that use Heliophysics missions and to detecting Planetary Feature Names in the literature.
Shapurian et al., arXiv:2312.08579
Buonomo et al., https://doi.org/10.5281/zenodo.8415073
What’s Missing - Labeled Datasets and Methods
Datasets
We need uniform, open datasets that can be used to train tools in support of the curation effort
Methods
We need to make methodologies explicit so we can enable reproducibility and scalability of effort
A Proposal: The “AstroBibPile” Open Dataset
AstroBibPile: Call for Contributors
Please get in touch! aaccomazzi@cfa.harvard.edu
Thank You!