
Deep embeddings with Essentia models

Pablo Alonso-Jiménez Dmitry Bogdanov Xavier Serra

Music Technology Group, Universitat Pompeu Fabra

Extracting embeddings

Essentia has dedicated algorithms to perform inference with each model.

With the `output` parameter we can select the layer of the network to retrieve. It defaults to the main task of the network (e.g., music tag activations, BPM bins, separated audio) but can be set to point to any layer of interest.

For the music auto-tagging models we retrieve the penultimate layer as embeddings; for Tempo-CNN, the logits of the last layer; for Spleeter, the concatenation of the bottleneck layers of each stem; and for the feature-extractor models, the output proposed by their authors.
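For instance, a minimal sketch with the MusiCNN algorithm (that 'model/Sigmoid' is the default task-output layer for this model is an assumption here; layer names vary per model):

from essentia.standard import MonoLoader, TensorflowPredictMusiCNN

audio = MonoLoader(filename='your_song.mp3', sampleRate=16000)()

# With no `output` given, we get the network's main task, here per-patch
# auto-tagging activations (assumed default layer: 'model/Sigmoid').
tag_activations = TensorflowPredictMusiCNN(graphFilename='msd-musicnn-1.pb')(audio)

# Any intermediate layer can be selected by name instead, as in the embedding
# extraction snippets below.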

TensorFlow integration in Essentia

Essentia is an open-source C++/Python library for audio signal processing, developed at the MTG-UPF and licensed under Affero GPLv3.

Pre-trained models

We have prepared various MIR models for several tasks. They can also be used as embedding extractors. The following plots show the embeddings produced with our models for a two-minute rock track.

from essentia.standard import MonoLoader, TensorflowPredictMusiCNN

audio = MonoLoader(filename='your_song.mp3', sampleRate=16000)()
musicnn_embs = TensorflowPredictMusiCNN(graphFilename='msd-musicnn-1.pb',
                                        output='model/dense/BiasAdd')(audio)
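The embeddings are returned per analysis patch; for track-level use they can be summarized, e.g. by mean-pooling (a minimal sketch; the exact aggregation used in our downstream experiments is an assumption here):

import numpy as np

# musicnn_embs has shape (n_patches, 200): one 200-dim vector per patch.
track_embedding = np.mean(musicnn_embs, axis=0)  # a single track-level vector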

Uses in MIR

Our main goal is to provide fast C++ inference with state-of-the-art deep learning models in Essentia, suitable for deployment in diverse MIR applications.

We host a collection of models for specific use cases (auto-tagging, tempo estimation, source separation, and music classification by genre, mood, and instrumentation).

Some of these models produce embeddings suitable for transfer learning.

[Figure: data processing pipelines, comparing control and data flow. In common MIR projects, a Python pipeline chains a data loader (e.g., FFmpeg), feature extraction (e.g., SciPy), a pre-trained model (e.g., PyTorch, TensorFlow), and aggregation of results (e.g., NumPy), crossing the Python/C-C++ boundary through bindings at each stage. In Essentia, the data loader, feature extraction, pre-trained model (TensorFlow), and aggregation into an Essentia Pool all run in C++; the Python bindings carry only control, while the data flow stays in C++.]

Functionalities

  • Audio features
    • Spectral features
    • Rhythm and tempo
    • Tonality and melody
    • Fingerprinting
  • Inference with TensorFlow models
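As a minimal sketch of the audio-features functionality listed above, computing MFCCs for a single frame with Essentia's standard-mode algorithms (default parameters; a 2048-sample frame matches MFCC's default input size):

from essentia.standard import MonoLoader, Windowing, Spectrum, MFCC

audio = MonoLoader(filename='your_song.mp3')()
window, spectrum, mfcc = Windowing(type='hann'), Spectrum(), MFCC()

# One frame for brevity; in practice, iterate over overlapping frames.
bands, coeffs = mfcc(spectrum(window(audio[:2048])))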

Design criteria

  • C++ with Python bindings
  • Large-scale deployment
  • Real-time processing
  • Cross-platform
    • (Linux, macOS, Windows, iOS, Android, JS)


This makes Essentia a suitable framework for both research and deployment scenarios.

Model     | Task               | Parameters | Embedding dims | Training size | Supervision
MusiCNN   | music auto-tagging | 787K       | 200            | 220/350K      | fully supervised
VGG-I     | music auto-tagging | 605K       | 256            | 220/350K      | fully supervised
Tempo-CNN | tempo estimation   | 1.2M       | 256            | 11K           | fully supervised
OpenL3    | feature extractor  | 4.7M       | 512            | 296K          | self-supervised
VGGish    | feature extractor  | 62M        | 128            | 70M           | fully supervised
Spleeter  | source separation  | 49M        | 1280           | unknown       | fully supervised

from essentia.standard import MonoLoader, TensorflowPredictTempoCNN

audio = MonoLoader(filename='your_song.mp3', sampleRate=11025)()
tempocnn_embs = TensorflowPredictTempoCNN(graphFilename='deepsquare-k16-3.pb',
                                          output='1x1/Relu0_reshape')(audio)
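For reference, a sketch of querying the model's main task instead of embeddings (assumptions: the default `output` yields per-patch activations over BPM classes, and the index-to-BPM mapping is defined by the Tempo-CNN model specification):

import numpy as np

# Default output: per-patch BPM class activations (the exact layer semantics
# are an assumption; check the model's metadata for the mapping to BPM values).
bpm_activations = TensorflowPredictTempoCNN(graphFilename='deepsquare-k16-3.pb')(audio)
global_activations = np.mean(bpm_activations, axis=0)  # aggregate over patches
bpm_class = int(np.argmax(global_activations))         # most likely BPM class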

Downstream tasks

We compared the capabilities of the pre-trained models as feature extractors in 16 downstream tasks.

Genre recognition
  • dortmund: alternative, blues, electronic, folk-country, funksoulrnb, jazz, pop, raphiphop, rock (1820 excerpts)
  • gtzan: blues, classic, country, disco, hiphop, jazz, metal, pop, reggae, rock (1000 excerpts)
  • rosamerica: classic, dance, hip hop, jazz, pop, rhythm and blues, rock, speech (400 full tracks)

Mood detection
  • acoustic: acoustic, non-acoustic (321 full tracks)
  • aggressive: aggressive, non-aggressive (280 full tracks/excerpts)
  • electronic: electronic, non-electronic (332 full tracks/excerpts)
  • happy: happy, non-happy (302 excerpts)
  • party: party, non-party (349 excerpts)
  • relaxed: relaxed, non-relaxed (446 full tracks/excerpts)
  • sad: sad, non-sad (230 full tracks/excerpts)

Miscellaneous audio tasks
  • voice/instrumental: voice, instrumental (1000 excerpts)
  • tonal/atonal: tonal, atonal (345 full tracks)
  • gender: female, male (3311 full tracks)
  • danceability: danceable, non-danceable (306 full tracks)
  • fs-loop-ds: bass, fx, melody, percussion, other (2104 excerpts)
  • urbansound8k: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gunshot, jackhammer, siren, street music (8732 excerpts)

The pre-trained models are compared by training a multilayer perceptron for each task on top of the proposed embeddings.

The models are compared in two ways: using 5-fold cross-validation (5F) and evaluating on the MTG-Jamendo dataset (JA), for which we collected annotations following the taxonomies of all our tasks.
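As an illustration of this protocol, a hedged sketch of the 5F evaluation for one task (assuming scikit-learn's MLPClassifier as the multilayer perceptron; the hyperparameters and file names are placeholders, not those of our experiments):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# X: one aggregated embedding per track; y: the task's labels (e.g., genres).
X = np.load('embeddings.npy')  # hypothetical precomputed embeddings
y = np.load('labels.npy')      # hypothetical labels

mlp = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500)
scores = cross_val_score(mlp, X, y, cv=5)  # the 5-fold (5F) metric
print(scores.mean())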

The table below also contains a column for the accuracy drop (AD), the difference between the two metrics (AD = 5F - JA), used as a proxy for the generalization capabilities in each case.

The results are expressed in terms of class-weighted accuracies, and the best embeddings for each task are shaded light/medium grey for each metric.

  • MusiCNN tends to be more successful in 5-fold cross-validation
  • VGG-like models tend to suffer less accuracy drop (better generalization)
  • In the MusiCNN model, more tags and data tend to be beneficial for generalization
  • Models not trained for classification (OpenL3, Spleeter, Tempo-CNN) are less powerful as embedding extractors for these tasks

MusiCNN and VGG-I were each trained on two versions of the MSD-Last.fm dataset targeting the top 50 and the top 200 tags (T200 models), resulting in training sizes of 220K and 350K tracks, respectively.

These models are available online at https://essentia.upf.edu/models/
