1 of 13

The AstroBibPile: Building a Dataset to Support AI-enabled Bibliography Curation Efforts

Alberto Accomazzi

IVOA Interop | 22 May 2024

2 of 13

What is the NASA Science Explorer?

NASA SciX is a literature-based, open digital information system covering and unifying the research disciplines funded by the NASA Science Mission Directorate.

It represents an extension of the NASA Astrophysics Data System to include all literature relevant to NASA Science research.

SciX supports NASA’s Open Science efforts and enables interdisciplinary research and collaboration.

SciX currently indexes 20M articles and 260k data and software records, and provides links to almost 500k data products.

https://SciXplorer.org

3 of 13

Context

The NASA Science Explorer (SciX) is primarily a literature database. SciX does not aim to be an index for all research data products, but rather to make relevant data products discoverable from the literature, whenever feasible, through citations or data links.

Some types of data which are of most interest to SciX:

Datasets “close” to publications, whether as data behind the figures (DbF), supporting archival links, or citations, since they supplement the science presented therein; examples include VizieR catalogs, text-mined Zenodo links, archival data links, and data citations
Reference catalogs, collections, and services, which are highly used and cited; examples include 2MASS, WISE, CSC, etc.
Software records either mentioned or cited in scientific publications.

4 of 13

Two Strategies for Metadata Enrichment - Curation

Curation of Bibliographies

ADS has been aggregating and exposing connections between bibliographic records and data products which are curated by librarians and archivists.

The largest contributors to this effort are projects in astrophysics which track astronomical objects (SIMBAD and NED), data catalogs (VizieR), and archives (Chandra, MAST, ESA, NOIRLab, etc.).

This provides a way to enrich new and existing bibliographic records whenever associated data is identified or entered in a knowledge base.

Thanks to librarians and archivists for enabling this capability!

5 of 13

Two Strategies for Metadata Enrichment - TDM

Text Mining of the Literature

SciX obtains and processes the full text of all papers in its database, maintains a citation database, and mines links to data products.

SciX detects the citation (in a reference list) or mention (in a data availability statement) of a software or data product, and records it in its database.

This helps, but doesn’t replace, the curation work described earlier, which requires human evaluation of the content and context in which data products are mentioned in the literature.
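The detection step can be illustrated with a minimal sketch. It assumes the full text has already been split into named sections; the section names and the `find_data_mentions` helper are hypothetical, not part of the SciX pipeline. It only looks for Zenodo DOIs, which share the prefix 10.5281, and records the section in which each one was found, so a reference-list citation can be told apart from a mention in a data availability statement.

```python
import re

# Zenodo DOIs use the prefix 10.5281 followed by "zenodo.<number>".
ZENODO_DOI = re.compile(r"10\.5281/zenodo\.\d+", re.IGNORECASE)


def find_data_mentions(sections):
    """Return (section_name, doi) pairs for every Zenodo DOI found.

    `sections` maps a section name (hypothetical keys such as
    "references" or "data_availability") to that section's plain text.
    """
    hits = []
    for name, text in sections.items():
        for match in ZENODO_DOI.finditer(text):
            hits.append((name, match.group(0).lower()))
    return hits


# Toy input: the sentences are invented, the DOI is the one cited later in this talk.
paper = {
    "data_availability": "The catalog is available at https://doi.org/10.5281/zenodo.8415073.",
    "references": "No data products are cited here.",
}
mentions = find_data_mentions(paper)
print(mentions)
```

A pattern match of this kind only finds where a product is named; deciding whether the data were actually used still needs the human evaluation described above.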

6 of 13

Curation Workflow (Observatory Bibliographies)

Current Process:

Identifying candidate Publications through search of ADS/CrossRef

Scope of journals being considered
Refereed vs. non-refereed publications

Evaluation of Publications for Inclusion

Science Papers ≅ use of data
Engineering Papers ≅ instruments
Non-science Papers ≅ mention of data

“There is tremendous diversity in the ways bibliographers track publications and maintain databases, due to parameters such as resources (personnel, time, budget, IT capabilities), type of observatory, historical practices, and reporting requirements to funders and outside agencies.” (Observatory Bibliographers Collaboration 2024, arXiv:2401.00060)

7 of 13

Curation Workflow (Observatory Bibliographies)

Future Trends:

“Efforts are underway to implement an automated paper classification system at STScI/MAST to identify science papers within a set of candidate papers; however, even if this product comes to fruition in the 2020s, it is expected that human intervention will be needed to extract additional information about the paper” (Observatory Bibliographers Collaboration 2024).

“It is worth noting that this ML approach does not completely remove human involvement in the process. Human expertise and learning are needed for marginal cases that are not resolved by existing capabilities. [...] There needs to be a continuing education program for retraining and updating the classifier models with new literature, which will require new labels identified by human experts periodically” (Chen et al. 2022).

8 of 13

Human vs. Machine

Human Curation

Process for identifying and evaluating relevant papers requires a subject matter expert driving the effort
Librarians/archivists define principles behind the curation of each bibliography, based on the needs of each project
Human involvement is expensive and is often a limiting factor in the curation process
The involvement of a human in the loop makes the process somewhat subjective

Automated Text Mining

Useful for finding documents which contain the required information, but additional analysis of sentiment and intent is difficult
Difficult to capture the nuance behind the mention of a dataset in a paper or its relationship with other findings in the study
Can be implemented at scale for all the records indexed in SciX
Forces the curation process to become explicit and implementable, thus increasing its reproducibility

9 of 13

What might be Possible

Use NLP and AI to accelerate progress

Named Entity Recognition: find and normalize mentions of missions, telescopes, instruments
Knowledge Graphs: facilitate disambiguation and relevance of concepts in papers
Large Language Models: use the capabilities of LLMs for reasoning and for classifying data use vs. mention

Some of these techniques have been successfully applied to identification of papers using Heliophysics missions and Planetary Feature Names detection in the literature.

Shapurian et al., arXiv:2312.08579

Buonomo et al., https://doi.org/10.5281/zenodo.8415073
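As a rough illustration of the "find and normalize" idea behind named entity recognition, the sketch below uses a small hand-written alias table rather than a trained model. The table, the `find_entities` helper, and the sample sentences are assumptions made for this example; a real system would rely on a curated vocabulary and statistical or LLM-based tagging.

```python
import re

# Illustrative alias table (an assumption): surface form -> canonical name.
ALIASES = {
    "2MASS": "2MASS",
    "Two Micron All Sky Survey": "2MASS",
    "WISE": "WISE",
    "Wide-field Infrared Survey Explorer": "WISE",
    "Chandra": "Chandra",
}

# Try longer surface forms first so full names win over abbreviations.
# Matching is case-sensitive so that the adjective "wise" is not tagged.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(a) for a in sorted(ALIASES, key=len, reverse=True))
    + r")\b"
)


def find_entities(text):
    """Return (surface_form, canonical_name) pairs in order of appearance."""
    return [(m.group(1), ALIASES[m.group(1)]) for m in _PATTERN.finditer(text)]


entities = find_entities(
    "We combined Two Micron All Sky Survey photometry with WISE colors."
)
print(entities)
```

Dictionary lookup handles normalization but not disambiguation, which is where knowledge graphs and language models are expected to help.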

10 of 13

What’s Missing - Labeled Datasets and Methods

Datasets

Each bibliography is curated by a different team using different criteria for inclusion
The data used to create the bibliography (scientific papers) is not always accessible due to license restrictions
The annotated bibliographies, and the set of metadata associated with them, are stored in different formats and different archives

We need uniform, open datasets that can be used to train tools in support of the curation effort

Methods

Each observatory/archive has separately developed its own methodology for obtaining full text and applying TDM techniques
Criteria for evaluating “use” vs “mention” of data are non-uniform and have evolved over time, so making them explicit is useful
There is no economy of scale when each observatory works on its own

We need to make methodologies explicit so we can enable reproducibility and scalability of effort
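One way to picture a uniform, explicit record is sketched below. The field names, the label vocabulary (loosely following the science / engineering / mention distinction from the curation workflow), and the JSON Lines serialization are all assumptions made for illustration, not an agreed AstroBibPile schema.

```python
import json
from dataclasses import asdict, dataclass

# Hypothetical label vocabulary, loosely based on the workflow categories.
ALLOWED_LABELS = {"science", "engineering", "mention"}


@dataclass
class BibEntry:
    paper_id: str      # identifier of the paper (placeholder values below)
    bibliography: str  # which bibliography the label belongs to
    label: str         # one of ALLOWED_LABELS
    criteria: str      # free-text statement of the rule that was applied

    def __post_init__(self):
        # Reject labels outside the shared vocabulary so records stay comparable.
        if self.label not in ALLOWED_LABELS:
            raise ValueError(f"unknown label: {self.label!r}")


def to_jsonl(entries):
    """Serialize entries as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in entries)


# Placeholder record; none of these values describe a real paper.
entry = BibEntry(
    paper_id="EXAMPLE-0001",
    bibliography="example-observatory",
    label="science",
    criteria="paper analyzes data from the observatory",
)
line = to_jsonl([entry])
print(line)
```

Storing the criterion next to each label is one way to make the "use" vs "mention" rules explicit, so that they can be compared across bibliographies.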

11 of 13

A Proposal: The “AstroBibPile” Open Dataset

Collect the data and methodologies behind the major active bibliographies in Astronomy and publish them to HuggingFace, along with a collection of OA papers that can be used for their analysis
Submit a proposal for a new WIESP workshop with this as a shared task at one of the 2025 ACL meetings: https://www.aclweb.org/portal/content/joint-call-workshops-proposals-eaclaclnaaclemnlp-2024
Extend to the rest of Space Sciences (and possibly Earth Sciences)

12 of 13

AstroBibPile: Call for Contributors

Do you have data that can be useful for this effort? Please consider contributing to the AstroBibPile.
Are you interested in developing AI techniques to support the creation and maintenance of the bibliographies?
Do you have additional use cases that could benefit from the AstroBibPile dataset?
Would you like to know more?

Please get in touch! aaccomazzi@cfa.harvard.edu

https://docs.google.com/document/d/1zDW61dOvpYxaNi74U74F39vzmZzxQc6iP05rIsbLjZU/edit?usp=sharing