
Deep embeddings with Essentia models

Pablo Alonso-Jiménez Dmitry Bogdanov Xavier Serra

Music Technology Group, Universitat Pompeu Fabra

Extracting embeddings

Essentia has dedicated algorithms to perform inference with each model.

With the `output` parameter we can select the layer of the network to retrieve. It defaults to the main task of the network (e.g., music tag activations, BPM bins, separated audio) but can be set to point to any layer of interest.

For the music auto-tagging models we retrieve the penultimate layer as embeddings; for Tempo-CNN, the logits of the last layer; for Spleeter, the concatenation of the bottleneck layers of each stem; and for the feature-extractor models, the output proposed by their authors.
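For instance, a minimal sketch with the MusiCNN algorithm (that 'model/Sigmoid' is the default task-output layer for this model is an assumption here; layer names vary per model):

from essentia.standard import MonoLoader, TensorflowPredictMusiCNN

audio = MonoLoader(filename='your_song.mp3', sampleRate=16000)()

# With no `output` given, we get the network's main task, here per-patch
# auto-tagging activations (assumed default layer: 'model/Sigmoid').
tag_activations = TensorflowPredictMusiCNN(graphFilename='msd-musicnn-1.pb')(audio)

# Any intermediate layer can be selected by name instead, as in the embedding
# extraction snippets below.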

TensorFlow integration in Essentia

Essentia is an open-source C++/Python library for audio signal processing, developed at the MTG-UPF and licensed under Affero GPLv3.

Pre-trained models

We have prepared various MIR models for several tasks. They can also be used as embedding extractors. The following plots show the embeddings produced with our models for a two-minute rock track.

from essentia.standard import MonoLoader, TensorflowPredictMusiCNN

audio = MonoLoader(filename='your_song.mp3', sampleRate=16000)()
musicnn_embs = TensorflowPredictMusiCNN(graphFilename='msd-musicnn-1.pb',
                                        output='model/dense/BiasAdd')(audio)
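The embeddings are returned per analysis patch; for track-level use they can be summarized, e.g. by mean-pooling (a minimal sketch; the exact aggregation used in our downstream experiments is an assumption here):

import numpy as np

# musicnn_embs has shape (n_patches, 200): one 200-dim vector per patch.
track_embedding = np.mean(musicnn_embs, axis=0)  # a single track-level vector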

Uses in MIR

Our main goal is to provide fast C++ inference with state-of-the-art deep learning models in Essentia, suitable for deployment in diverse MIR applications.

We host a collection of models for specific use cases (auto-tagging, tempo estimation, source separation, and music classification by genre, mood, and instrumentation).

Some of these models produce embeddings suitable for transfer learning.

[Figure: data processing pipelines, comparing control and data flow. In common MIR projects, a Python pipeline chains a data loader (e.g., FFmpeg), feature extraction (e.g., SciPy), a pre-trained model (e.g., PyTorch, TensorFlow), and aggregation of results (e.g., NumPy), crossing the Python/C-C++ boundary through bindings at each stage. In Essentia, the data loader, feature extraction, pre-trained model (TensorFlow), and aggregation into an Essentia Pool all run in C++; the Python bindings carry only control, while the data flow stays in C++.]

Functionalities

  • Audio features
    • Spectral features
    • Rhythm and tempo
    • Tonality and melody
    • Fingerprinting
  • Inference with TensorFlow models
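As a minimal sketch of the audio-features functionality listed above, computing MFCCs for a single frame with Essentia's standard-mode algorithms (default parameters; a 2048-sample frame matches MFCC's default input size):

from essentia.standard import MonoLoader, Windowing, Spectrum, MFCC

audio = MonoLoader(filename='your_song.mp3')()
window, spectrum, mfcc = Windowing(type='hann'), Spectrum(), MFCC()

# One frame for brevity; in practice, iterate over overlapping frames.
bands, coeffs = mfcc(spectrum(window(audio[:2048])))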

Design criteria

  • C++ with Python bindings
  • Large-scale deployment
  • Real-time processing
  • Cross-platform
    • (Linux, macOS, Windows, iOS, Android, JS)


This makes Essentia a suitable framework for both research and deployment scenarios.

Model     | Task               | Parameters | Embedding dims | Training size | Supervision
MusiCNN   | music auto-tagging | 787K       | 200            | 220/350K      | fully supervised
VGG-I     | music auto-tagging | 605K       | 256            | 220/350K      | fully supervised
Tempo-CNN | tempo estimation   | 1.2M       | 256            | 11K           | fully supervised
OpenL3    | feature extractor  | 4.7M       | 512            | 296K          | self-supervised
VGGish    | feature extractor  | 62M        | 128            | 70M           | fully supervised
Spleeter  | source separation  | 49M        | 1280           | unknown       | fully supervised

from essentia.standard import MonoLoader, TensorflowPredictTempoCNN

audio = MonoLoader(filename='your_song.mp3', sampleRate=11025)()
tempocnn_embs = TensorflowPredictTempoCNN(graphFilename='deepsquare-k16-3.pb',
                                          output='1x1/Relu0_reshape')(audio)
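For reference, a sketch of querying the model's main task instead of embeddings (assumptions: the default `output` yields per-patch activations over BPM classes, and the index-to-BPM mapping is defined by the Tempo-CNN model specification):

import numpy as np

# Default output: per-patch BPM class activations (the exact layer semantics
# are an assumption; check the model's metadata for the mapping to BPM values).
bpm_activations = TensorflowPredictTempoCNN(graphFilename='deepsquare-k16-3.pb')(audio)
global_activations = np.mean(bpm_activations, axis=0)  # aggregate over patches
bpm_class = int(np.argmax(global_activations))         # most likely BPM class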

Downstream tasks

We compared the capabilities of the pre-trained models as feature extractors in 16 downstream tasks.

Genre recognition
  • dortmund: alternative, blues, electronic, folk-country, funksoulrnb, jazz, pop, raphiphop, rock (1820 excerpts)
  • gtzan: blues, classic, country, disco, hiphop, jazz, metal, pop, reggae, rock (1000 excerpts)
  • rosamerica: classic, dance, hip hop, jazz, pop, rhythm and blues, rock, speech (400 full tracks)

Mood detection
  • acoustic: acoustic, non-acoustic (321 full tracks)
  • aggressive: aggressive, non-aggressive (280 full tracks/excerpts)
  • electronic: electronic, non-electronic (332 full tracks/excerpts)
  • happy: happy, non-happy (302 excerpts)
  • party: party, non-party (349 excerpts)
  • relaxed: relaxed, non-relaxed (446 full tracks/excerpts)
  • sad: sad, non-sad (230 full tracks/excerpts)

Miscellaneous audio tasks
  • voice/instrumental: voice, instrumental (1000 excerpts)
  • tonal/atonal: tonal, atonal (345 full tracks)
  • gender: female, male (3311 full tracks)
  • danceability: danceable, non-danceable (306 full tracks)
  • fs-loop-ds: bass, fx, melody, percussion, other (2104 excerpts)
  • urbansound8k: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gunshot, jackhammer, siren, street music (8732 excerpts)

The pre-trained models are compared by training a multilayer perceptron for each task on top of the proposed embeddings.

The models are compared in two ways: using 5-fold cross-validation (5F) and evaluating on the MTG-Jamendo dataset (JA), for which we collected annotations following the taxonomies of all our tasks.
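As an illustration of this protocol, a hedged sketch of the 5F evaluation for one task (assuming scikit-learn's MLPClassifier as the multilayer perceptron; the hyperparameters and file names are placeholders, not those of our experiments):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# X: one aggregated embedding per track; y: the task's labels (e.g., genres).
X = np.load('embeddings.npy')  # hypothetical precomputed embeddings
y = np.load('labels.npy')      # hypothetical labels

mlp = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500)
scores = cross_val_score(mlp, X, y, cv=5)  # the 5-fold (5F) metric
print(scores.mean())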

The table below also contains a column for the accuracy drop (AD), the difference between the two metrics (AD = 5F - JA), used as a proxy for the generalization capabilities in each case.

The results are expressed in terms of class-weighted accuracies, and the best embeddings for each task are shaded light/medium grey for each metric.

  • MusiCNN tends to be more successful in 5-fold cross-validation
  • VGG-like models tend to suffer less accuracy drop (better generalization)
  • In the MusiCNN model, more tags and data tend to be beneficial for generalization
  • Models not trained for classification (OpenL3, Spleeter, Tempo-CNN) are less powerful as embedding extractors for these tasks

MusiCNN and VGG-I were each trained on two versions of the MSD-Last.fm dataset targeting the top 50 and the top 200 tags (T200 models), resulting in training sizes of 220K and 350K tracks, respectively.

These models are available online at https://essentia.upf.edu/models/
