1 of 21

CS 4641 Final Presentation

Group 8:

Wendy Pan, Qianyu Zheng, Yurui Wu, Junjie Tang, Yuanming Luo

2 of 21

Introduction & Problem

Disorganized library

Without effective grouping, music in a library is ordered only by when it was added, which leaves it disorganized. A recommendation algorithm working from such a library cannot effectively select the music a customer is most likely to pick.

Economic concerns

When consumers are constantly presented with irrelevant suggestions, poor recommendation algorithms not only make the user experience worse but may also cause streaming platforms to incur financial losses.

Machine learning

ML algorithms can be used to group music by genre and to predict the genre of new music.

We use the GTZAN dataset.

3 of 21

Strategy 1

Stacked MFCC & Chroma

4 of 21

Preprocessing Workflow

5 of 21

Results

6 of 21

Discussion and Reflection

The initial model, trained on the full MFCC and Chroma graphs, reached 60% accuracy, possibly because the high dimensionality of the input hindered the models' ability to learn and generalize.
Proposed solution:

Reduce feature set by extracting key characteristics (e.g., statistical measures, domain-specific metrics) from MFCC and Chroma graphs.
Introduce additional features such as spectrograms, tempo graphs, and RMS values to enrich the feature set.

Objective: Improve model performance by balancing the complexity and representational power of the features, so that the model can learn and generalize more effectively.
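The dimensionality argument above can be illustrated with a minimal sketch. It collapses a coefficients-by-frames matrix into a per-coefficient mean and variance. The matrix shape (32 rows, 1000 frames), the toy values, and the function name `summarize_frames` are assumptions for illustration and are not taken from the project.

```python
from statistics import fmean, pvariance

def summarize_frames(feature_matrix):
    """Collapse a (coefficients x frames) matrix into per-coefficient mean and variance."""
    summary = []
    for row in feature_matrix:
        summary.append(fmean(row))
        summary.append(pvariance(row))
    return summary

# Toy stand-in for a stacked MFCC + Chroma matrix (hypothetical shape and values).
n_coeffs, n_frames = 32, 1000
toy_matrix = [
    [(c + 1) * ((t % 7) - 3) / 10 for t in range(n_frames)]
    for c in range(n_coeffs)
]

flattened_size = n_coeffs * n_frames   # input size if the full matrix is fed to a model
summary = summarize_frames(toy_matrix)
print(flattened_size, len(summary))
```

Under these assumed shapes, the model input shrinks from tens of thousands of values per clip to a few dozen, at the cost of discarding the time ordering of the frames.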

7 of 21

Strategy 2.1

Manual Feature Engineering

For each audio clip, generate 77 features:

Mean and variance of the frame-level root-mean-square (RMS) energy, spectral centroid, spectral bandwidth, spectral rolloff, zero-crossing rate, harmony, 20 MFCCs, and 12 Chroma bins, plus tempo.
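As a minimal sketch of the "mean and variance over frames" idea, the code below computes two of the listed descriptors (RMS and zero-crossing rate) in plain Python on a synthetic sine wave. The frame length, hop length, sample rate, and test signal are assumptions; the spectral descriptors, MFCCs, Chroma, and tempo would normally come from an audio analysis library and are not shown.

```python
import math
from statistics import fmean, pvariance

def frames(signal, frame_length=2048, hop_length=512):
    """Yield overlapping frames (frame and hop sizes are illustrative defaults)."""
    for start in range(0, len(signal) - frame_length + 1, hop_length):
        yield signal[start:start + frame_length]

def rms(frame):
    """Root-mean-square energy of one frame."""
    return math.sqrt(fmean([x * x for x in frame]))

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose sign differs."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def mean_var_features(signal):
    """Summarize frame-level descriptors by their mean and variance."""
    rms_values = [rms(f) for f in frames(signal)]
    zcr_values = [zero_crossing_rate(f) for f in frames(signal)]
    return {
        "rms_mean": fmean(rms_values), "rms_var": pvariance(rms_values),
        "zcr_mean": fmean(zcr_values), "zcr_var": pvariance(zcr_values),
    }

# Synthetic one-second 440 Hz tone at an assumed 22050 Hz sample rate.
sample_rate, freq, amplitude = 22050, 440.0, 0.5
signal = [amplitude * math.sin(2 * math.pi * freq * n / sample_rate)
          for n in range(sample_rate)]
features = mean_var_features(signal)
print(features)
```

Each descriptor contributes two numbers per clip regardless of clip length, which is what keeps the feature vector small and fixed-size.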

8 of 21

Results

9 of 21

Discussion and Reflection

The SVM's RBF kernel handles the non-linearity in audio data effectively: it implicitly maps the features into a higher-dimensional space where a maximum-margin hyperplane can separate the genres.
The regularization parameter C balances the trade-off between maximizing the margin and minimizing classification errors, enhancing robustness and generalization.
Despite its strengths, manual feature selection remains challenging and time-consuming, necessitating domain expertise and significant experimentation.
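For reference, the roles of the RBF kernel and of C described above correspond to the standard soft-margin SVM formulation, written here for the binary case. The symbols (w, b, slack variables ξ_i, feature map φ, kernel width γ) are the usual textbook notation and are not taken from the slides; multi-class genre prediction is typically built from several such binary problems.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{s.t.}\quad
y_i\bigl(w^{\top}\phi(x_i) + b\bigr) \ge 1 - \xi_i,\qquad \xi_i \ge 0,

K(x, x') = \phi(x)^{\top}\phi(x') = \exp\bigl(-\gamma\,\lVert x - x'\rVert^{2}\bigr)
```

A large C penalizes margin violations heavily and fits the training data more tightly; a small C tolerates more violations in exchange for a wider margin.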

10 of 21

Strategy 2.2

Manual Feature Engineering with Split Training

11 of 21

Pipeline

Features:

Same 77 features as in Strategy 2.1

k = 10 is the current best-performing value for hyperparameter k.
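A minimal sketch of split training, assuming k is the number of equal-length, non-overlapping subsamples cut from each clip (the slides do not define k explicitly). The function names and the toy clips are hypothetical.

```python
def split_into_subsamples(signal, k=10):
    """Split a clip into k equal-length, non-overlapping subsamples (remainder dropped)."""
    segment_length = len(signal) // k
    if segment_length == 0:
        raise ValueError("signal is shorter than k samples")
    return [signal[i * segment_length:(i + 1) * segment_length] for i in range(k)]

def expand_dataset(clips, labels, k=10):
    """Give every subsample the genre label of the clip it came from."""
    sub_clips, sub_labels, clip_ids = [], [], []
    for clip_id, (clip, label) in enumerate(zip(clips, labels)):
        for segment in split_into_subsamples(clip, k):
            sub_clips.append(segment)
            sub_labels.append(label)
            clip_ids.append(clip_id)   # remember the parent clip of each subsample
    return sub_clips, sub_labels, clip_ids

# Toy clips standing in for audio sample arrays.
clips = [list(range(105)), list(range(200))]
labels = ["jazz", "rock"]
sub_clips, sub_labels, clip_ids = expand_dataset(clips, labels, k=10)
print(len(sub_clips), len(sub_clips[0]), len(sub_clips[10]))
```

Keeping the parent clip id for each subsample makes it possible to split train and test sets by clip, so that segments of the same recording do not end up on both sides.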

12 of 21

Results

SVM

MLP

5% improvement across models!

13 of 21

Discussion and Reflection

Advantages of local feature extraction on subsamples:

Local pattern recognition: The model can learn to recognize local patterns and characteristics that may be obscured or averaged out when considering the full audio at once.
Increased training data and reduced overfitting: Each subsample acts as a unique example, expanding the dataset size by up to 10x, reducing overfitting.
Aggregation of predictions: Combines insights from multiple subsamples for more robust and accurate overall predictions.

Proposed challenges and limitations:

Genre mismatch in subsamples: When an audio sample is split into subsamples, there is no guarantee that each subsample is representative of the full track's genre, which introduces noise into training.
Mislabeling impact: Subsamples mislabeled based on the full track's genre can lead to incorrect training and poor model performance.
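The slides do not say how subsample predictions are aggregated; one common choice is a majority vote per clip, sketched below. The function name and the example predictions are hypothetical.

```python
from collections import Counter

def aggregate_by_majority(sub_predictions, clip_ids):
    """Combine per-subsample genre predictions into one prediction per clip."""
    votes = {}
    for clip_id, genre in zip(clip_ids, sub_predictions):
        votes.setdefault(clip_id, Counter())[genre] += 1
    # Pick the most frequent genre among each clip's subsamples.
    return {clip_id: counter.most_common(1)[0][0] for clip_id, counter in votes.items()}

# Made-up predictions for two clips with 10 subsamples each.
clip_ids = [0] * 10 + [1] * 10
sub_predictions = ["rock"] * 7 + ["metal"] * 3 + ["jazz"] * 6 + ["blues"] * 4
clip_predictions = aggregate_by_majority(sub_predictions, clip_ids)
print(clip_predictions)
```

A vote of this kind tolerates a minority of subsamples that are misclassified or unrepresentative of the track, which partly offsets the genre-mismatch problem listed above.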

14 of 21

Strategy 3

Two-Step VGG Fine Tuning

15 of 21

Spectrogram processing

16 of 21

Model structure

Step 1: Train the head

Step 2: Train one conv block

17 of 21

Results

18 of 21

Discussion and Reflection

The two-step fine-tuning process adapts the model to the specific task while leveraging pre-trained ImageNet features.
It focuses on fine-tuning the later convolutional layers.
It combines audio-to-image conversion, transfer learning, gradual layer adaptation, and precise preprocessing to enhance model effectiveness, showing promising results for genre classification.
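The two-step schedule can be sketched without any deep learning framework as a plan of which layer groups are trainable in each step. The group names are hypothetical labels (VGG-16 and VGG-19 both have five convolutional blocks). The choices that the unfrozen block is the last one and that the head keeps training in step 2 are assumptions based on the notes above.

```python
# Hypothetical layer-group names for a VGG-style network with a new classification head.
LAYER_GROUPS = ["block1", "block2", "block3", "block4", "block5", "head"]

def trainable_groups(step):
    """Return the layer groups updated in a fine-tuning step; all others stay frozen."""
    if step == 1:
        return {"head"}              # step 1: train only the new head
    if step == 2:
        return {"block5", "head"}    # step 2: also unfreeze one conv block (assumed: the last)
    raise ValueError("step must be 1 or 2")

def freeze_plan(step):
    """Map every layer group to True (trainable) or False (frozen)."""
    active = trainable_groups(step)
    return {group: group in active for group in LAYER_GROUPS}

for step in (1, 2):
    print(step, freeze_plan(step))
```

In a real framework, such a plan would be applied by toggling each layer's trainable flag before each training run, usually with a smaller learning rate in the second step so the pre-trained weights change only gradually.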

19 of 21

Comparison

Performance: (best) 3 > 2 > 1 (worst)

Computational Cost: (cheapest) 1 < 2 < 3 (most expensive)

Requirement for Domain-Specific Knowledge: (need most) 2 > 1 > 3 (need least)

Model Interpretability: (most interpretable) 2

20 of 21

References

[1] Tzanetakis, G., & Cook, P. (2002). Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, 10(5), 293-302. doi:10.1109/tsa.2002.800560

[2] Li, T., Ogihara, M., & Li, Q. (2003, July). A comparative study on content-based music genre classification. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 282-289).

[3] Burred, J. J., & Lerch, A. (2003, September). A hierarchical approach to automatic musical genre classification. In Proceedings of the 6th international conference on digital audio effects (pp. 8-11).

[4] Ndou, N., Ajoodha, R., & Jadhav, A. (2021, April). Music genre classification: A review of deep-learning and traditional machine-learning approaches. In 2021 IEEE International IOT, Electronics and Mechatronics Conference (pp. 1-6). IEEE.

[5] Bahuleyan, H. (2018). Music Genre Classification using Machine Learning Techniques. arXiv:1804.01149v1

21 of 21

Thank You