Automated classification of scientific publications linked to GES DISC datasets
Rohan Dayal1,2, Irina Gerasimov1,2, Armin Mehrabian1,2, Jennifer Wei1, Mohammad Khayat1,2, Andrey Savtchenko1,2
1Code 610.2, NASA Goddard Space Flight Center, Greenbelt, MD, USA; 2ADNET Systems Inc., Lanham, MD, USA
NASA/Goddard Earth Sciences Data and Information Services Center (GES DISC), https://disc.gsfc.nasa.gov/

Background and Motivation
- The data collections archived and distributed by the NASA GES DISC data center are widely used in a broad range of Earth Science studies.
- GES DISC collects the resulting publications and provides their citations to users, so it is helpful to categorize these papers by how they relate to the datasets they are associated with.
- GES DISC scientists defined four major categories of publications with respect to how they relate to the Earth Science collections (see the table at the end of this section).
- The categorization currently requires manual labeling. We seek to build an automated classifier, using manually labeled publications and their abstracts as training data for supervised machine learning algorithms that predict the category.

Data Distribution and Machine Learning
- Publication category distribution:
- application: 466 publications
- algorithm: 219 publications
- validation: 143 publications
- overview: 135 publications
- The distribution is heavily skewed toward “application”
- The underlying population of publications is skewed the same way
- Models trained on such an imbalanced distribution may not perform well
- Two approaches to mitigate this are undersampling and oversampling (a sketch follows this list)
- The two algorithms of interest are Random Forest and Naive Bayes
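To make the two rebalancing approaches concrete, here is a minimal sketch using the third-party imbalanced-learn package on toy features shaped like the sample above; the package choice and the random feature matrix are assumptions, not taken from the poster.

```python
# Minimal sketch of undersampling vs. oversampling with imbalanced-learn.
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

rng = np.random.default_rng(0)
X = rng.random((963, 50))  # 963 abstracts x 50 TF-IDF features (toy values)
y = np.array(["application"] * 466 + ["algorithm"] * 219
             + ["validation"] * 143 + ["overview"] * 135)

# Undersampling: shrink every class down to the minority size (135 "overview").
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)

# Oversampling: replicate minority samples up to the majority size (466).
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)

print(len(y_under), len(y_over))  # 4*135 = 540 and 4*466 = 1864
```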
- The Naive Bayes algorithm is based on Bayes’ theorem
- It uses each word’s TF-IDF weight to estimate the probability that a word belongs to a category
- A document is assigned to the category whose score, the category prior multiplied by the probability of each individual word given that category, is the highest (a worked example follows below)
- Ex: score(App) = P(App) * P(word1|App) * P(word2|App) * …
- score(Val) = P(Val) * P(word1|Val) * P(word2|Val) * …
- If score(Val) > score(App), the document is classified as “validation”
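A hand-worked toy version of this decision rule; all priors and word probabilities below are made-up illustrative values, not numbers from the actual model.

```python
# Decision rule: score(c) = P(c) * prod_i P(word_i | c)
priors = {"application": 0.48, "validation": 0.15}
likelihoods = {  # P(word | category), hypothetical values
    "application": {"drought": 0.020, "trend": 0.015},
    "validation":  {"drought": 0.004, "trend": 0.006},
}

doc = ["drought", "trend"]
scores = {}
for cat in priors:
    score = priors[cat]
    for word in doc:
        score *= likelihoods[cat][word]  # multiply in each word's probability
    scores[cat] = score

print(scores)                       # application ~1.44e-4, validation ~3.6e-6
print(max(scores, key=scores.get))  # -> "application"
```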
- The Random Forest algorithm is a tree ensemble algorithm
- A single decision tree can model simple scenarios on its own
- Random Forest trains multiple decision trees, each fit to a different segment of the training sample
- As an ensemble, the output of the algorithm as a whole is the most popular output among all of the trees in the “forest” (see the sketch below)
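A minimal sketch of the voting idea with scikit-learn’s RandomForestClassifier; the random features and labels are placeholders, and the hard-vote loop is shown only to illustrate the ensemble (scikit-learn itself averages the trees’ class probabilities).

```python
# Each tree votes; the forest's answer is the most popular class.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 20))                       # 200 documents x 20 features
y = rng.choice(["application", "algorithm"], size=200)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Collect each tree's hard vote for the first document and take the majority.
votes = np.array([tree.predict(X[:1])[0] for tree in forest.estimators_])
majority = forest.classes_[np.bincount(votes.astype(int)).argmax()]
print(majority, forest.predict(X[:1])[0])       # usually identical
```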
- Data pre-processing steps are then applied to clean the abstracts (one possible implementation follows this list):
- Lowercasing all text, so that the uppercase and lowercase forms of a word count as one
- Removing all punctuation marks, since they carry little information
- Removing all stop words to make the data less noisy
- Lemmatizing the text, i.e. reverting each word to its base form
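One possible implementation of these four cleaning steps, assuming NLTK for the stop-word list and lemmatizer (the poster does not name its tooling).

```python
# Clean an abstract: lowercase, strip punctuation, drop stop words, lemmatize.
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean(abstract: str) -> str:
    text = abstract.lower()                                           # 1. lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # 2. drop punctuation
    tokens = [w for w in text.split() if w not in stop_words]         # 3. remove stop words
    return " ".join(lemmatizer.lemmatize(w) for w in tokens)          # 4. lemmatize

print(clean("Precipitation estimates were validated against rain gauges."))
# -> "precipitation estimate validated rain gauge"
```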
- The text is converted to a numerical format through TF-IDF weighting (a sketch follows this list)
- TF-IDF is the frequency of a word in the current document multiplied by the inverse of its frequency across documents
- Each individual word in an abstract has its TF-IDF weight calculated
- Combining these weights across an abstract creates a vector representation of the abstract
- These TF-IDF vectors are used to train the machine learning algorithms
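A minimal sketch of this vectorization step, assuming scikit-learn’s TfidfVectorizer; the two strings stand in for cleaned abstracts from the step above.

```python
# Turn cleaned abstracts into TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer

cleaned = ["precipitation retrieval algorithm sensor radiance",
           "precipitation trend drought analysis"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(cleaned)      # one sparse TF-IDF vector per abstract
print(vectorizer.get_feature_names_out())  # the vocabulary behind each column
print(X.toarray())  # the shared word "precipitation" gets a lower weight
```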
Future Work
- We will apply the classifier to the citations of newer publications, as well as to citations related to other GES DISC data
- We will integrate the classifier directly into the Zotero system so that classification occurs automatically, and we will develop a workflow for the remaining final labeling stages
Binary Classifier and Results
- While the initial multi-class results were not ideal, another approach that could still substantially reduce manual effort is binary classification between “application” and non-“application” publications
- “Application” publications make up the vast majority of the population of publications that use GES DISC collections
- Our sample contains 466 applications and 497 non-applications, so with only marginal undersampling the data can be balanced
- Running the same multinomial Naive Bayes algorithm as a binary classifier between these two categories produced promising results (a sketch of the evaluation follows below)
- The test accuracy score was 0.771, with precision and recall values similarly in the 0.7-0.8 range
- The learning curve for this Naive Bayes model is much closer to the ideal learning curve and shows that overfitting did not occur
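A sketch of how such a binary model and its scores could be produced, assuming a scikit-learn pipeline; the eight placeholder abstracts and labels below are invented stand-ins for the 963 labeled ones.

```python
# Train/test a binary Naive Bayes classifier over TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

abstracts = ["flood risk modeling with precipitation data",
             "aerosol trend analysis over urban regions",
             "crop yield estimation from soil moisture",
             "dust transport event case study",
             "retrieval algorithm for water vapor profiles",
             "ground validation of rainfall estimates",
             "overview of the mission and field campaign",
             "radiative transfer basis of the sensor product"]
labels = ["application"] * 4 + ["non-application"] * 4

X_train, X_test, y_train, y_test = train_test_split(
    abstracts, labels, test_size=0.25, stratify=labels, random_state=42)

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))  # accuracy, precision, recall
```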
The testing accuracies of the algorithms under the different sampling distributions are as follows:

| Algorithm | Original Sample | Undersampling | Oversampling |
| Naive Bayes | 0.648 | 0.481 | 0.636 |
| Random Forest | 0.634 | 0.494 | 0.914 |
- The undersampling approach was the poorest performer, likely because of the data lost when every category is reduced to the minority-category size
- Oversampling with Random Forest appears promising, but further analysis demonstrates that the model was overfit to the dataset
- A learning curve should show converging training and cross-validation curves. The learning curve for Random Forest lacks this convergence: it maintains a gap between the cross-validation and training curves, and the training score stays constant, indicating overfitting (a sketch of the diagnostic follows the figure below)
[Figures: the ideal learning curve vs. the Random Forest learning curve]
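A sketch of the learning-curve diagnostic itself; training a Random Forest on random labels (a deliberate toy setup, not the poster’s data) reproduces the overfitting signature described above.

```python
# Learning-curve diagnostic: compare training vs. cross-validation scores.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(0)
X = rng.random((300, 20))
y = rng.integers(0, 2, 300)        # random labels: nothing real to learn

sizes, train_scores, cv_scores = learning_curve(
    RandomForestClassifier(random_state=0), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5))

print(train_scores.mean(axis=1))   # stays near 1.0: memorizing the training set
print(cv_scores.mean(axis=1))      # stays near 0.5: the gap never closes
```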
| Paper Category | Description |
| Algorithm | Written by scientists to describe the mathematical basis of the algorithms used to create datasets. These papers describe algorithms that convert sensor observations (radiances, brightness) into physical variables (water vapor, precipitation), and can be descriptions of a modeling or assimilation approach. They help dataset users better understand the mathematics involved in dataset production, as well as dataset provenance, structure, parameters, limitations, etc. |
| Validation | Describe validation of the datasets by comparison between datasets from different sources, e.g. platforms/instruments and models, or the ground validation of Earth observational data. Written by science teams or the teams producing ground observation data. Important for dataset creators as well as users in understanding dataset uncertainties, errors, and limitations in applicability. |
| Application | The largest category, covering how a dataset was used in practice: analysis of events, phenomena, system modeling, environmental trends, etc. Created by the science teams and researchers who acquire the data. These papers have the largest diversity of publication sources in terms of location and science impact factor, and can help evaluate dataset usability for specific studies as well as dataset science impact. |
| Overview | General descriptions of the missions, sensors, projects, experiments, field campaigns, or assessments that produced the datasets. Written by principal investigators. Highly recommended for understanding the basics of a project’s goals and the observational and mathematical principles employed. |