NSDC Project Fall 2023: Galaxy Classification

Gianna Pedroza, Aidan Nguyen, Tishi Avvaru, Benjamin Yu

The Dataset: Galaxy Zoo

In the original Galaxy Zoo project, volunteers classified images of Sloan Digital Sky Survey galaxies as belonging to one of six categories: elliptical, clockwise spiral, anticlockwise spiral, edge-on, star/don't know, or merger.

The Dataset: Galaxy Zoo Table

The table gives, for each galaxy, the fraction of the vote in each of the six categories, along with flags identifying the system as spiral, elliptical, or uncertain.

The Galaxy Zoo project collected simple classifications of nearly 900,000 galaxies drawn from the Sloan Digital Sky Survey, contributed by hundreds of thousands of volunteers.

Goals:

  • Classify the galaxies, based on their physical features, as spiral, elliptical, or uncertain.
  • Find the model that best fits the dataset by testing multiple classification models: kNN, Logistic Regression, Naive Bayes, and a Neural Network.
  • Learn about different libraries that can be used for classification and machine learning.

Cleaning the Data:

  • To clean the data, we first loaded it with pandas to view the different features, then removed the features that were not useful for classification:
    • OBJID: ID number of the given galaxy
    • RA: Right Ascension coordinate of the galaxy
    • DEC: Declination coordinate of the galaxy
    • NVOTE: Number of people who voted on the given galaxy
  • The three classifications were stored in separate columns, so we combined them into one column with 1 = spiral, 2 = elliptical, and 3 = uncertain.
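The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the tiny table, the column names `P_EL`, `P_CS`, `SPIRAL`, `ELLIPTICAL`, `UNCERTAIN`, and the new `CLASS` column are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical miniature of the Galaxy Zoo table; column names are assumed.
raw = pd.DataFrame({
    "OBJID": [1, 2, 3],
    "RA": [10.0, 20.0, 30.0],
    "DEC": [-1.0, 0.5, 2.0],
    "NVOTE": [40, 25, 60],
    "P_EL": [0.10, 0.85, 0.45],
    "P_CS": [0.85, 0.10, 0.40],
    "SPIRAL": [1, 0, 0],
    "ELLIPTICAL": [0, 1, 0],
    "UNCERTAIN": [0, 0, 1],
})


def clean(df):
    # Drop identifier, position, and vote-count columns.
    df = df.drop(columns=["OBJID", "RA", "DEC", "NVOTE"])
    # Combine the three 0/1 flag columns into a single label column:
    # 1 = spiral, 2 = elliptical, 3 = uncertain.
    flags = ["SPIRAL", "ELLIPTICAL", "UNCERTAIN"]
    codes = {"SPIRAL": 1, "ELLIPTICAL": 2, "UNCERTAIN": 3}
    df["CLASS"] = df[flags].idxmax(axis=1).map(codes)
    return df.drop(columns=flags)


cleaned = clean(raw)
print(cleaned)
```

Here `idxmax(axis=1)` picks, for each row, the name of the flag column that is set, which assumes exactly one flag is 1 per galaxy.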

Visualization:

  • To visualize the features of all three classifications, we created color-coded histograms.
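One possible way to draw such a histogram with matplotlib is sketched below. The feature values, the feature name `P_EL`, and the color choices are made up for illustration and are not taken from the project.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Hypothetical values of a single feature, grouped by classification.
feature_by_class = {
    "spiral": [0.05, 0.10, 0.12, 0.20, 0.22],
    "elliptical": [0.70, 0.82, 0.85, 0.90, 0.95],
    "uncertain": [0.35, 0.40, 0.48, 0.55, 0.60],
}
colors = {"spiral": "tab:blue", "elliptical": "tab:red", "uncertain": "tab:gray"}

fig, ax = plt.subplots()
for name, values in feature_by_class.items():
    # Shared bins and partial transparency keep overlapping classes visible.
    ax.hist(values, bins=10, range=(0.0, 1.0), alpha=0.5,
            color=colors[name], label=name)
ax.set_xlabel("P_EL (assumed feature name)")
ax.set_ylabel("Number of galaxies")
ax.legend()
# fig.savefig("feature_histogram.png")  # uncomment to write the figure to disk
```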

Model 1: K-Nearest Neighbors

  • K-Nearest Neighbors (KNN) - a supervised learning classifier that uses proximity to make predictions or classifications about a data point
    • K refers to the number of “neighbors” that are checked to determine the classification of a given point
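To make the role of K concrete, here is a small from-scratch sketch of the idea in plain Python. It is not the project's implementation (which may well have used a library classifier); the function name `knn_predict` and the sample points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Keep the labels of the k closest points and return the most common one.
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical two-feature points; labels use 1 = spiral, 2 = elliptical, 3 = uncertain.
train_points = [(0.9, 0.1), (0.8, 0.2), (0.85, 0.05), (0.1, 0.9), (0.2, 0.8), (0.5, 0.5)]
train_labels = [1, 1, 1, 2, 2, 3]

print(knn_predict(train_points, train_labels, (0.75, 0.15), k=3))
```

A larger `k` smooths out noisy neighbors but can blur the boundary between classes, so it is usually tuned on held-out data.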

Model 2: Logistic Regression

  • Logistic Regression model - a supervised learning model often used for classification and predictive analysis
    • estimates the probability of an event occurring based on a given dataset of independent variables
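The probability estimate described above comes from passing a weighted sum of the features through the logistic (sigmoid) function. The sketch below shows only the binary case, with made-up weights; a three-class problem like this one is typically handled by a multinomial or one-vs-rest extension of the same idea.

```python
import math


def predict_probability(features, weights, bias):
    """Return the logistic-model probability that the event occurs."""
    # Linear combination of the independent variables.
    z = bias + sum(w * x for w, x in zip(weights, features))
    # The sigmoid maps any real number into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights: the first feature pushes toward the event, the second away.
weights = [4.0, -4.0]
bias = 0.0

p = predict_probability([0.9, 0.1], weights, bias)
print(round(p, 3))
```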

Model 3: Naive Bayes

  • Naive Bayes is a classification algorithm that uses Bayes' theorem to estimate the probability that an item belongs to a specific class.
    • This classification model is called “naive” because it assumes that the data features are independent of one another given the class
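As an illustration of the independence assumption, the following sketch implements a small Gaussian Naive Bayes classifier in plain Python. Which Naive Bayes variant the project used is not stated, so the Gaussian likelihood, the function names, and the sample data are all assumptions.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate a prior and per-feature mean/variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        columns = list(zip(*members))
        means = [sum(col) / len(col) for col in columns]
        # A small constant keeps every variance strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / len(col) + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (len(members) / len(rows), means, variances)
    return model


def predict_nb(model, row):
    """Return the class with the highest log posterior for `row`."""
    def log_posterior(label):
        prior, means, variances = model[label]
        total = math.log(prior)
        # The "naive" step: per-feature log likelihoods are simply added,
        # which treats the features as independent given the class.
        for x, m, var in zip(row, means, variances):
            total += -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)


# Hypothetical two-feature data; 1 = spiral, 2 = elliptical.
rows = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
labels = [1, 1, 2, 2]
nb_model = fit_gaussian_nb(rows, labels)
print(predict_nb(nb_model, (0.85, 0.1)))
```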

Model 4: Neural Network

  • Network of neurons with associated weights designed to predict output
  • Final Design: TensorFlow Sequential Model
  • 90% accuracy on validation data
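The slides do not describe the layers of the final design, so the sketch below only shows the general shape of a TensorFlow Sequential classifier for three classes. The layer sizes, the feature count `NUM_FEATURES`, the optimizer, and the random stand-in data are all assumptions, not the project's settings.

```python
import numpy as np
import tensorflow as tf

NUM_FEATURES = 9  # assumed number of input features
NUM_CLASSES = 3   # spiral, elliptical, uncertain


def build_model():
    # Hypothetical architecture: one hidden layer, softmax over the three classes.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(NUM_FEATURES,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


model = build_model()

# Random stand-in data; real features would come from the cleaned table.
rng = np.random.default_rng(0)
X = rng.random((32, NUM_FEATURES)).astype("float32")
y = rng.integers(1, 4, size=32) - 1  # labels 1..3 shifted to 0..2 for this loss

model.fit(X, y, epochs=1, validation_split=0.25, verbose=0)
probs = model.predict(X[:4], verbose=0)
print(probs.shape)
```

The `validation_split` argument holds out part of the training data, which is one common way to obtain a validation accuracy figure.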

Comparisons:

             Model 1          Model 2          Model 3          Model 4
Accuracy     87%              86%              79%              90.35%
Precision    80%, 68%, 95%    81%, 59%, 96%    79%, 46%, 93%
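For reference, accuracy and precision can be computed as in the sketch below. It assumes precision is reported separately for each class, which the slides do not state explicitly; the function names and the example labels are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision_per_class(y_true, y_pred, classes=(1, 2, 3)):
    """For each class, the fraction of predictions of that class that were right."""
    result = {}
    for c in classes:
        # True labels of every item the model assigned to class c.
        predicted_c = [t for t, p in zip(y_true, y_pred) if p == c]
        if predicted_c:
            result[c] = sum(t == c for t in predicted_c) / len(predicted_c)
        else:
            result[c] = None  # undefined when the class is never predicted
    return result


# Hypothetical labels: 1 = spiral, 2 = elliptical, 3 = uncertain.
y_true = [1, 1, 2, 2, 3, 3, 3, 1]
y_pred = [1, 1, 2, 3, 3, 3, 1, 1]

print(accuracy(y_true, y_pred))
print(precision_per_class(y_true, y_pred))
```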

Citations