Finding the Genre - Multi-Class Classification
Made by: Ortal Lasry and Or Cohen Raviv
Date: 12.11.24
Background:
We utilized a Spotify dataset from Kaggle, which contains 114 track genres and 14 main numeric features (e.g., duration_ms, danceability) and a boolean feature indicating explicit content. The dataset originally had 20 columns and 114,000 rows, but after removing duplicates (keeping only the first occurrence), we retained 87,867 rows. The dataset also presents significant class imbalance in the target variable, track_genre.
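The deduplication step can be sketched with pandas as follows. This is a minimal illustration on a toy table, not the actual preprocessing code. The column used to identify duplicates (`track_id` here) is an assumption, since the text only states that the first occurrence was kept.

```python
import pandas as pd

# Toy stand-in for the Kaggle Spotify table (values are illustrative only).
tracks = pd.DataFrame(
    {
        "track_id": ["a1", "a1", "b2", "c3", "c3"],
        "track_genre": ["rock", "alt-rock", "jazz", "pop", "dance"],
        "danceability": [0.52, 0.52, 0.61, 0.80, 0.80],
    }
)

# Assumption: duplicates are identified by track_id. keep="first" retains
# the first occurrence of each track and drops the later repeats.
deduped = tracks.drop_duplicates(subset="track_id", keep="first").reset_index(drop=True)

print(len(tracks), "->", len(deduped))
```

Because the same track can be listed under several genres, the choice of key column decides which genre label survives, so it is worth checking the class counts again after this step.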
Business Question:
How can we accurately classify music genres using audio features to improve recommendations and content organization on streaming platforms?
Analysis Summary:
Preliminary exploration showed that some genres had more track IDs than others, which contributes to the class imbalance, and that subgenres within a broader genre share overlapping characteristics, which complicates pattern interpretation. For example, 'rock', 'rock-n-roll', and 'alt-rock' all fall under the broader 'rock' category. To address this, we grouped the 114 genres into 10 classes, aiming to reduce genre overlap and improve classification accuracy. In addition, we accounted for class imbalance when fitting the ML models.
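The grouping and the imbalance handling can be sketched as below. The mapping is hypothetical apart from the three rock subgenres named above. The weighting scheme (inverse class frequency, the same heuristic as scikit-learn's `class_weight="balanced"`) is an assumption, since the text does not state which scheme was used.

```python
from collections import Counter

# Hypothetical mapping from fine-grained genres to broader classes.
# Only the three rock subgenres are taken from the text.
GENRE_TO_CLASS = {
    "rock": "rock",
    "rock-n-roll": "rock",
    "alt-rock": "rock",
}


def group_genre(genre):
    """Return the broader class for a genre, or the genre itself if unmapped."""
    return GENRE_TO_CLASS.get(genre, genre)


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


raw = ["rock", "alt-rock", "rock-n-roll", "rock", "jazz", "classical"]
grouped = [group_genre(g) for g in raw]
weights = balanced_class_weights(grouped)
print(weights)
```

Rare classes receive larger weights, so misclassifying them costs more during training than misclassifying a frequent class.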
Feature Engineering and Descriptive Analysis:
Descriptive statistics revealed right-skewed distributions in variables such as duration_ms, speechiness, acousticness, and instrumentalness, while loudness was left-skewed. These patterns guided our feature engineering approach.
Measuring skewness according to Fisher's skewness:
Note: Fisher’s skewness coefficient: positive values indicate right skew, negative values indicate left skew, and values close to zero suggest a symmetric distribution.
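As a minimal sketch of the coefficient described in the note, the function below computes the moment-based Fisher-Pearson skewness g1 = m3 / m2^1.5 in plain Python. The sample lists are made up for illustration, and which variant (biased or bias-corrected) was used in the analysis is not stated.

```python
def fisher_skewness(values):
    """Fisher-Pearson skewness g1 = m3 / m2**1.5 using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant sample: treat as symmetric
    return m3 / m2 ** 1.5


right_skewed = [1, 1, 1, 2, 2, 3, 10]       # long tail to the right
left_skewed = [-10, -3, -2, -2, -1, -1, -1]  # long tail to the left
symmetric = [1, 2, 3, 4, 5]

for name, sample in [("right", right_skewed), ("left", left_skewed), ("sym", symmetric)]:
    print(name, round(fisher_skewness(sample), 3))
```

`scipy.stats.skew` with its default settings computes this same biased estimator, while `pandas.Series.skew` applies a bias correction, so the two can differ slightly on small samples.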
The correlation matrix between all quantitative features revealed a high correlation between energy and loudness (r=0.75) and between energy and acousticness (r=-0.72).
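A correlation matrix of this kind can be computed with pandas as sketched below. The data are synthetic stand-ins built so that loudness rises and acousticness falls with energy. They are not the Spotify values and will not reproduce the coefficients reported above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins: loudness rises with energy, acousticness falls with it.
energy = rng.uniform(0, 1, n)
features = pd.DataFrame(
    {
        "energy": energy,
        "loudness": -20 + 15 * energy + rng.normal(0, 3, n),
        "acousticness": np.clip(1 - energy + rng.normal(0, 0.25, n), 0, 1),
    }
)

# Pairwise Pearson correlation between all numeric columns.
corr = features.corr(method="pearson")
print(corr.round(2))
```

Strongly correlated pairs carry overlapping information, which is one reason to consider dropping or combining such features before modeling.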
Bar Chart for the Categorical Feature 'Explicit' (Presence of Explicit Language):
As expected, some genres have higher frequencies of explicit content than others. For example, Hip Hop and Death Metal show the highest frequencies of explicit content, while Classical music has none.
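The per-genre frequencies behind such a bar chart can be derived as sketched here. The rows are invented for illustration and the column names `genre` and `explicit` are assumptions about the table layout.

```python
import pandas as pd

# Invented rows for illustration only.
tracks = pd.DataFrame(
    {
        "genre": ["hip-hop", "hip-hop", "hip-hop", "classical", "classical", "rock", "rock"],
        "explicit": [True, True, False, False, False, True, False],
    }
)

# The mean of a boolean column is the share of True values in each group.
explicit_share = (
    tracks.groupby("genre")["explicit"].mean().sort_values(ascending=False)
)
print(explicit_share)

# explicit_share.plot(kind="bar") would draw the chart (requires matplotlib).
```

Using shares rather than raw counts keeps the comparison fair when genres contain different numbers of tracks.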
Modeling Approach and Evaluation:
In the first step, we split the dataset into training and testing sets (proportions 0.3/0.7) and used 5 features that exhibited high variance across music genres. At this stage we predicted only 5 music genres, using ML models including Logistic Regression, Voting Classifier, Random Forest, and XGBoost. We assessed feature importance and found that ‘explicit’, ‘danceability’, and ‘speechiness’ made the greatest contributions to reducing Gini impurity. The Random Forest model achieved the highest accuracy (0.8), with a test ROC AUC of 0.95.
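The first modeling step can be sketched with scikit-learn as follows. The data are synthetic (five features, five classes) and every hyperparameter is an assumption. The 70% train / 30% test split and the one-vs-rest ROC AUC are likewise assumed readings of the setup rather than confirmed details.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 5-feature, 5-class data standing in for the genre subset.
X, y = make_classification(
    n_samples=1000, n_features=5, n_informative=4, n_redundant=0,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)

# Assumption: 70% train / 30% test, stratified so every class is in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# class_weight="balanced" is an assumed way of handling class imbalance.
model = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=0
)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
# Multi-class ROC AUC needs class probabilities; "ovr" averages one-vs-rest scores.
auc = roc_auc_score(y_test, model.predict_proba(X_test), multi_class="ovr")
# Impurity-based (Gini) importances, normalized to sum to 1.
importances = model.feature_importances_

print(f"accuracy={accuracy:.2f}  roc_auc={auc:.2f}")
print(importances.round(3))
```

Impurity-based importances tend to favor features with many distinct values, so permutation importance on the test set is a useful cross-check.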
Following this, we expanded our approach to predict all genres (114 music genres grouped into 10 classes) and included eight features in total. We compared each model's accuracy with the original features against its accuracy with the engineered features. The results showed minor accuracy improvements, leading us to proceed with XGBoost using only the engineered features, which achieved the highest accuracy of 0.5 and a test ROC AUC of 0.84. An alternative approach using PCA and K-Means resulted in the highest accuracy, 0.76 (using Random Forest and engineered features). Most F-scores and precision scores were similar to the accuracy score.
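The text does not say how PCA and K-Means were combined with the classifier. One common pattern, shown here purely as an assumed illustration on synthetic data, is to project the scaled features with PCA, cluster the projection with K-Means, and pass the cluster label to a Random Forest as an extra feature. The component and cluster counts are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic 8-feature, 10-class data standing in for the grouped genres.
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=5, n_redundant=0,
    n_classes=10, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Fit every unsupervised step on the training data only to avoid leakage.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=3, random_state=0).fit(scaler.transform(X_train))
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(
    pca.transform(scaler.transform(X_train))
)


def add_cluster_feature(X_raw):
    """Append the K-Means cluster id (computed in PCA space) as a new column."""
    projected = pca.transform(scaler.transform(X_raw))
    return np.column_stack([X_raw, kmeans.predict(projected)])


clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(add_cluster_feature(X_train), y_train)
accuracy = clf.score(add_cluster_feature(X_test), y_test)
print(f"accuracy with cluster feature: {accuracy:.2f}")
```

Comparing this score against the same classifier without the cluster column shows whether the clustering step adds information for a given dataset.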
Project Summary and Insights:
Future Research: