Deep embeddings with Essentia models
Pablo Alonso-Jiménez Dmitry Bogdanov Xavier Serra
Music Technology Group, Universitat Pompeu Fabra
Extracting embeddings
Essentia has dedicated algorithms to perform inference with each model.
The `output` parameter selects which layer of the network to retrieve. It defaults to the main task of the network (e.g., music tags, BPM bins, separated audio) but can be set to point to any layer of interest.
For the music auto-tagging models we retrieve the penultimate layer as embeddings; for Tempo-CNN, the logits of the last layer; for Spleeter, the concatenation of the bottleneck layers of each stem; and for the feature-extractor models, the output proposed by their authors.
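The valid values for `output` are the node names of the frozen graph. As a side note, they can be listed with TensorFlow itself; a minimal sketch, assuming the TensorFlow Python package is available and using the MusiCNN graph from the example below:

import tensorflow as tf

# Parse the frozen graph and print every node name so a layer of interest
# (e.g., an embedding layer) can be chosen for the `output` parameter.
graph_def = tf.compat.v1.GraphDef()
with tf.io.gfile.GFile('msd-musicnn-1.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

for node in graph_def.node:
    print(node.name)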
TensorFlow integration in Essentia
Essentia is an open-source C++/Python library for audio signal processing, developed at the MTG-UPF and licensed under Affero GPLv3.
Pre-trained models
We have prepared various MIR models for several tasks. They can also be used as embedding extractors. The following plots show the embeddings produced with our models for a two-minute rock track.
from essentia.standard import MonoLoader, TensorflowPredictMusiCNN

audio = MonoLoader(filename='your_song.mp3', sampleRate=16000)()
# 'model/dense/BiasAdd' is the penultimate layer, retrieved as embeddings
musicnn_embs = TensorflowPredictMusiCNN(graphFilename='msd-musicnn-1.pb',
                                        output='model/dense/BiasAdd')(audio)
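As an illustration, the patch-wise embeddings returned by the snippet above can be visualized as a matrix over time; a minimal sketch assuming matplotlib and NumPy are installed (the plot layout is illustrative, not the exact figures from the poster):

import numpy as np
import matplotlib.pyplot as plt

# musicnn_embs comes from the snippet above: one embedding vector per patch
embs = np.array(musicnn_embs)          # shape: (patches, 200)
plt.imshow(embs.T, aspect='auto', origin='lower')
plt.xlabel('patch index (time)')
plt.ylabel('embedding dimension')
plt.title('MusiCNN embeddings')
plt.colorbar()
plt.show()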
Uses in MIR
Our main goal is to provide fast C++ inference for state-of-the-art deep learning models in Essentia, making them suitable for deployment in diverse MIR applications.
We host a collection of models for specific use-cases (auto-tagging, tempo estimation, source separation, music classification by genre, mood, and instrumentation).
Some of these models produce embeddings suitable for transfer learning.
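For transfer learning, the patch-wise embeddings are typically aggregated into a single track-level vector before training downstream models. A minimal sketch of this aggregation step (mean pooling with NumPy, re-using the `musicnn_embs` array from the MusiCNN example above):

import numpy as np

# musicnn_embs has one 200-dimensional row per analysis patch;
# mean pooling over patches yields a single track-level vector
track_embedding = np.mean(np.array(musicnn_embs), axis=0)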
[Figure: comparison of data processing pipelines, annotated with functionalities and design criteria.
Data processing pipeline found in common MIR projects: data loader (e.g., FFmpeg), feature extraction (e.g., SciPy), pre-trained model (e.g., PyTorch, TensorFlow), and aggregation (e.g., NumPy) run as separate components, each with its own Python bindings over C/C++ code, before producing the results.
Data processing pipeline in Essentia: data loader, feature extraction, pre-trained model (TensorFlow), and aggregation into the Essentia Pool run under a single set of Essentia bindings, with control in Python and the data flow in C++.]
This makes Essentia a suitable framework for both research and deployment scenarios.
Model | Task | Parameters | Embedding dimensions | Training size | Supervision
MusiCNN | music auto-tagging | 787K | 200 | 220/350K | fully supervised
VGG-I | music auto-tagging | 605K | 256 | 220/350K | fully supervised
Tempo-CNN | tempo estimation | 1.2M | 256 | 11K | fully supervised
OpenL3 | feature extractor | 4.7M | 512 | 296K | self-supervised
VGGish | feature extractor | 62M | 128 | 70M | fully supervised
Spleeter | source separation | 49M | 1280 | unknown | fully supervised
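The feature-extractor models can be used in the same way with their dedicated algorithms. A sketch for VGGish follows; the graph filename matches the model distributed on the Essentia models page, but the embedding node name is an assumption and should be verified (e.g., with the graph-inspection snippet above):

from essentia.standard import MonoLoader, TensorflowPredictVGGish

audio = MonoLoader(filename='your_song.mp3', sampleRate=16000)()
# 'model/vggish/embeddings' is an assumed node name for the 128-d embeddings
vggish_embs = TensorflowPredictVGGish(graphFilename='audioset-vggish-3.pb',
                                      output='model/vggish/embeddings')(audio)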
from essentia.standard import MonoLoader, TensorflowPredictTempoCNN

audio = MonoLoader(filename='your_song.mp3', sampleRate=11025)()
# '1x1/Relu0_reshape' exposes the last-layer logits, used as embeddings
tempocnn_embs = TensorflowPredictTempoCNN(graphFilename='deepsquare-k16-3.pb',
                                          output='1x1/Relu0_reshape')(audio)
Downstream tasks
We compared the capabilities of the pre-trained models as feature extractors in 16 downstream tasks.
Genre recognition:
- dortmund: alternative, blues, electronic, folk-country, funksoulrnb, jazz, pop, raphiphop, rock (1820 excerpts)
- gtzan: blues, classic, country, disco, hiphop, jazz, metal, pop, reggae, rock (1000 excerpts)
- rosamerica: classic, dance, hip hop, jazz, pop, rhythm and blues, rock, speech (400 full tracks)

Mood detection:
- acoustic: acoustic, non-acoustic (321 full tracks)
- aggressive: aggressive, non-aggressive (280 full tracks/excerpts)
- electronic: electronic, non-electronic (332 full tracks/excerpts)
- happy: happy, non-happy (302 excerpts)
- party: party, non-party (349 excerpts)
- relaxed: relaxed, non-relaxed (446 full tracks/excerpts)
- sad: sad, non-sad (230 full tracks/excerpts)

Miscellaneous audio tasks:
- voice/instrumental: voice, instrumental (1000 excerpts)
- tonal/atonal: tonal, atonal (345 full tracks)
- gender: female, male (3311 full tracks)
- danceability: danceable, non-danceable (306 full tracks)
- fs-loop-ds: bass, fx, melody, percussion, other (2104 excerpts)
- urbansound8k: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gunshot, jackhammer, siren, street music (8732 excerpts)
The pre-trained models are compared by training a multilayer perceptron for each task on top of the proposed embeddings.
The models are compared in two ways: using 5-fold cross-validation (5F) and evaluating on the MTG-Jamendo dataset (JA), for which we collected annotations following the taxonomies of all our tasks.
The table below also contains a column for the accuracy drop (AD), the difference between the two metrics, used as a proxy for the generalization capabilities in each case.
The best embeddings for each task are shaded light/medium grey for each metric. The results are expressed in terms of class-weighted accuracies.
MusiCNN and VGG-I are trained on two versions of MSD-Last.fm targeting the top 50 and the top 200 tags (T200 models), resulting in training sizes of 220K and 350K.
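As an illustration of this downstream setup, a small classifier can be trained and cross-validated on mean-pooled track-level embeddings. This is a minimal sketch with scikit-learn, not the exact configuration behind the reported numbers; the file names and MLP size are illustrative:

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X: (n_tracks, embedding_dims) mean-pooled embeddings, y: (n_tracks,) labels
X = np.load('embeddings.npy')  # hypothetical pre-computed features
y = np.load('labels.npy')      # hypothetical task annotations

clf = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(128,), max_iter=500))

# 5-fold cross-validation (5F) with class-weighted (balanced) accuracy
scores = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=5),
                         scoring='balanced_accuracy')
print(f'5F balanced accuracy: {scores.mean():.3f}')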
These models are available online at https://essentia.upf.edu/models/
More examples at: https://essentia.upf.edu/machine_learning.html