1 of 203

Training Sound Event Classifiers Using Different Types of Supervision

Eduardo Fonseca

December 1st, 2021

Supervisors:

Dr. Xavier Serra i Casals

Dr. Frederic Font Corbera

Board:

Dr. Emmanouil Benetos

Dr. Annamaria Mesaros

Dr. Marius Miron

2 of 203

Sound event classification

  • Automatic identification of all kinds of everyday sounds

2

Introduction

Motivation

3 of 203

Applications

3

Introduction

Motivation

4 of 203

A challenging endeavour

4

Introduction

Motivation

5 of 203

Sound event classification before this thesis

  • Research context
    • methods → supervised learning: feature engineering → deep learning
    • datasets → limited size and vocabulary (e.g., urban or domestic)

5

Introduction

Motivation

6 of 203

Sound event classification before this thesis

  • Research context
    • methods → supervised learning: feature engineering → deep learning
    • datasets → limited size and vocabulary (e.g., urban or domestic)
  • Limitations
    • deep learning approaches are data-hungry
    • each dataset covers only a small fraction of the variety of everyday sounds

6

Introduction

Motivation

7 of 203

Sound event classification before this thesis

  • Research context
    • methods → supervised learning: feature engineering → deep learning
    • datasets → limited size and vocabulary (e.g., urban or domestic)
  • Limitations
    • deep learning approaches are data-hungry
    • each dataset covers only a small fraction of the variety of everyday sounds
  • Hindering the development of general-purpose classifiers
    • recognize hundreds of sound classes

7

Introduction

Motivation

8 of 203

But sound data were available ...

  • Large amounts of diverse everyday sound data

  • Two main aspects in common
    • very large amounts of audio
    • lack of reliable homogeneous labels → user-provided metadata

8

Introduction

Motivation

9 of 203

AudioCommons

  • Horizon 2020 EU-funded AudioCommons
    • automatic description of sounds
    • exploit Freesound for research purposes

9

Introduction

Motivation

10 of 203

What can we do about it?

  • What can we do to improve the coverage & performance of sound classifiers?

  • In this thesis, we identify four research avenues

10

Introduction

Motivation

11 of 203

1. Building a new dataset

  • Most evident option → better datasets
    • larger size
    • larger vocabulary
  • Web repositories → sources for sound event dataset creation

  • In this thesis → new sound event dataset
    • fully-open & distributable
    • large-vocabulary
    • reliable labels

11

Introduction

Research directions

12 of 203

2. Improving generalization

  • Costly manual annotation → less than ideal amount of training data

  • Techniques to increase generalization to unseen examples → paramount

  • In this thesis → improve generalization
    • CNN architectural modifications → robustness to small time/frequency shifts
    • data augmentation

12

Introduction

Research directions

13 of 203

3. Learning with noisy labels

  • Label noise is a reality
    1. transition to larger datasets w/ less precise labeling
    2. labels can be inferred automatically (e.g. from user-provided metadata)
  • Supervision given by noisy labels → only feasible choice
    • pressing issue for sound event classification
  • In this thesis → learning with noisy labels
    • dataset to support label noise research
    • techniques to mitigate the effect of label noise

13

Introduction

Research directions

14 of 203

4. Self-supervised learning

  • Textual labels accompanying audio → not always available
    • unlabeled data is much more abundant
  • Self-supervised learning → learning representations without external supervision
    • downstream tasks such as classification
  • In this thesis → self-supervised learning
    • strategies to learn audio representations from unlabeled data

14

Introduction

Research directions

15 of 203

Objectives of this thesis

  • Research on dataset creation as well as supervised & unsupervised learning
    • train large-vocabulary sound event classifiers

15

Introduction

Objectives

16 of 203

Objectives of this thesis

  • Research on dataset creation as well as supervised & unsupervised learning
    • train large-vocabulary sound event classifiers
  • Objectives
    1. Build an annotated open dataset of sound events, of larger coverage and size

16

Introduction

Objectives

17 of 203

Objectives of this thesis

  • Research on dataset creation as well as supervised & unsupervised learning
    • train large-vocabulary sound event classifiers
  • Objectives
    • Build an annotated open dataset of sound events, of larger coverage and size
    • Devise a learning method to improve generalization to new unseen examples

17

Introduction

Objectives

18 of 203

Objectives of this thesis

  • Research on dataset creation as well as supervised & unsupervised learning
    • train large-vocabulary sound event classifiers
  • Objectives
    • Build an annotated open dataset of sound events, of larger coverage and size
    • Devise a learning method to improve generalization to new unseen examples
    • Develop techniques to mitigate the negative effect of label noise when training sound event classifiers

18

Introduction

Objectives

19 of 203

Objectives of this thesis

  • Research on dataset creation as well as supervised & unsupervised learning
    • train large-vocabulary sound event classifiers
  • Objectives
    • Build an annotated open dataset of sound events, of larger coverage and size
    • Devise a learning method to improve generalization to new unseen examples
    • Develop techniques to mitigate the negative effect of label noise when training sound event classifiers
    • Develop methodologies for learning sound event representations in unsupervised fashion

19

Introduction

Objectives

20 of 203

Objectives of this thesis

  • Research on dataset creation as well as supervised & unsupervised learning
    • train large-vocabulary sound event classifiers
  • Objectives
    • Build an annotated open dataset of sound events, of larger coverage and size
    • Devise a learning method to improve generalization to new unseen examples
    • Develop techniques to mitigate the negative effect of label noise when training sound event classifiers
    • Develop methodologies for learning sound event representations in unsupervised fashion
    • Release data and source code as open resources → open and reproducible research

20

Introduction

Objectives

21 of 203

Outline

  1. Introduction - Chapter 1
  2. The Freesound Dataset 50k (FSD50K) - Chapter 3
  3. Improving Sound Event Classification by Increasing Shift Invariance in Convolutional Neural Networks - Chapter 4
  4. Training Sound Event Classifiers With Noisy Labels - Chapter 5
  5. Self-Supervised Learning of Sound Event Representations - Chapter 6
  6. DCASE Challenge Tasks Organization - Appendix A
  7. Summary and Conclusions - Chapter 7

21

Introduction

22 of 203

Outline

  • Introduction
  • The Freesound Dataset 50k (FSD50K)
  • Improving Sound Event Classification by Increasing Shift Invariance in Convolutional Neural Networks
  • Training Sound Event Classifiers With Noisy Labels
  • Self-Supervised Learning of Sound Event Representations
  • DCASE Challenge Tasks Organization
  • Summary and Conclusions

22

23 of 203

2. The Freesound Dataset 50k (FSD50K)

Fonseca, E., Favory, X., Pons, J., Font, F., Serra, X, FSD50K: an open dataset of human-labeled sound events. In press in IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021.

Fonseca, E., Pons, J., Favory, X., Font, F., Bogdanov, D., Ferraro, A., Oramas, S., Porter, A., Serra, X, Freesound Datasets: A platform for the creation of open audio datasets. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), 2017.

  • Motivation
  • The creation of FSD50K
  • FSD50K description
  • Experiments
  • Summary

24 of 203

Why?

  • Most existing datasets → relatively small and/or domain-specific
  • AudioSet
    • unprecedented size, coverage and diversity
    • suffers from openness and stability issues
  • Sound event research lags behind in terms of dataset availability

24

The Freesound Dataset 50k (FSD50K)

Motivation

25 of 203

Data acquisition

  • Source of audio

  • Vocabulary

  • Infrastructure

25

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

26 of 203

Freesound

  • Online collaborative audio clip sharing site
  • 500,000+ audio clips
  • Wide variety of audio content
  • User-provided tags
  • Creative Commons licenses

26

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

27 of 203

AudioSet Ontology

  • Hierarchy with 632 sound event classes
  • Per-class textual description
  • Most comprehensive set of everyday sounds
    • convenient for Freesound

27

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

28 of 203

Freesound Annotator

  • Website → collaborative creation of open audio datasets based on Freesound

  • Tools → exploration / annotation / monitoring

28

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

29 of 203

Candidate labels nomination

  • List of keywords per class

29

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

30 of 203

Candidate labels nomination

  • List of keywords per class
  • Each class populated w/ corresponding clips

30

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

31 of 203

Candidate labels nomination

  • List of keywords per class
  • Each class populated w/ corresponding clips

Outcome: 268k clips

31

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

32 of 203

Validation task

  • Manually validate candidate labels nominated in the previous stage
  • Annotation tool w/ two phases
    1. training phase → get familiar with the class
    2. validation phase

32

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

33 of 203

Validation task

  • Internal quality assessment → get feedback for improvement
  • Features: FAQs / “Present” ratings split into PP (predominant) & PNP (not predominant) / inter-annotator agreement / loudness normalization / class prioritization

33

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

34 of 203

Validation task

  • Annotation campaign
    • divide classes according to estimated level of difficulty
    • gather annotations using crowdsourcing and hired annotators
    • 350+ raters (including 6 hired annotators)
  • Outcome: 51k clips

34

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

35 of 203

Data split

  • Split the data into development and evaluation
  • Main criteria
    • non-divisibility of uploaders

35

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

[Diagram: data per class (c0, c1, c2, ...) split into dev and eval sets]

36 of 203

Data split

  • Split the data into development and evaluation
  • Main criteria
    • non-divisibility of uploaders
    • small uploaders for evaluation set

36

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

[Diagram: data per class (c0, c1, c2, ...) split into dev and eval sets]

37 of 203

Data split

  • Split the data into development and evaluation
  • Main criteria
    • non-divisibility of uploaders
    • small uploaders for evaluation set

  • Procedure:
    1. sort uploaders by size and diversity
    2. allocate data to evaluation set

  • Outcome: two candidate subsets disjoint in terms of uploaders

37

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

[Diagram: data per class (c0, c1, c2, ...) split into dev and eval sets]

38 of 203

Refinement task for evaluation set

  • Validation of candidate labels proposed by a simple nomination system
    • labels correct but potentially incomplete
    • goal → exhaustive labelling
  • Adding missing labels is complex
    • hired annotators → deep understanding of ontology & FAQs
    • interface for exploration of large-vocabulary

38

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

39 of 203

Refinement task for evaluation set

  • Annotation tool w/ two phases
    • training phase
    • refinement phase
      • review existing labels
      • add any missing labels

39

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

Favory et al., Facilitating the manual annotation of sounds when using large taxonomies, FRUCT 2018

40 of 203

After manual annotation

  • Candidate development set → correct but potentially incomplete labels
  • Candidate evaluation set → exhaustively-labeled
  • 51k clips / 395 classes

40

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

[Photos: the people behind FSD50K - hired annotators, the author, and collaborators]

41 of 203

Post-processing

  • Determine FSD50K vocabulary (200 classes)
    • merge small leaf nodes with their parents

41

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

42 of 203

Post-processing

  • Determine FSD50K vocabulary (200 classes)
    • merge small leaf nodes with their parents
  • Balancing development/evaluation sets

42

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

43 of 203

Post-processing

  • Determine FSD50K vocabulary (200 classes)
    • merge small leaf nodes with their parents
  • Balancing development/evaluation sets
  • Development = train + validation
    • minimize WC (same uploader & class)
    • allow BC (same uploader / ≠ class)

43

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

[Diagram: development data per class (c0, c1, ...) split into train and val, illustrating WC and BC contamination]

44 of 203

Post-processing

  • Determine FSD50K vocabulary (200 classes)
    • merge small leaf nodes with their parents
  • Balancing development/evaluation sets
  • Development = train + validation
    • minimize WC (same uploader & class)
    • allow BC (same uploader / ≠ class)
  • Hierarchical label propagation

44

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

[Diagram: development data per class (c0, c1, ...) split into train and val, illustrating WC and BC contamination]

45 of 203

Freesound Dataset 50k (FSD50K)

  • 51k audio clips / 108h audio / 200 sound event classes
  • Human sounds, sounds of things, animals, natural sounds, music, ...

45

The Freesound Dataset 50k (FSD50K)

FSD50K description

Clips / duration:
  development set: train 36,796 clips / 70.5 hours; val 4,170 clips / 9.9 hours
  eval set: 10,231 clips / 27.9 hours

46 of 203

Freesound Dataset 50k (FSD50K)

  • 51k audio clips / 108h audio / 200 sound event classes
  • Human sounds, sounds of things, animals, natural sounds, music, ...
  • Metadata (raw annotations, sound predominance, Freesound metadata, FAQs)
  • Creative Commons licenses

46

The Freesound Dataset 50k (FSD50K)

FSD50K description

Clips / duration:
  development set: train 36,796 clips / 70.5 hours; val 4,170 clips / 9.9 hours
  eval set: 10,231 clips / 27.9 hours

47 of 203

Limitations

  • Label noise → refinement task to quantify label noise after the validation task (on 12k clips)
    • 94.3% of incoming labels → verified as correct
    • 50.9% of clips → contain additional unlabeled sound material (missing labels)
  • Data imbalance
  • Data bias in development set
  • Some parts of the vocabulary are not very specific

47

The Freesound Dataset 50k (FSD50K)

FSD50K description

48 of 203

Impact of train/validation separation

  • Three train/validation splits
    • random sampling
    • iterative stratification
    • proposed approach
  • All 3 validation sets → similar size

48

The Freesound Dataset 50k (FSD50K)

Experiments

49 of 203

Impact of train/validation separation

  • Three train/validation splits
    • random sampling
    • iterative stratification
    • proposed approach
  • All 3 validation sets → similar size
  • Main difference → uploaders “shared” between train and validation

[Diagram: random sampling / iterative stratification → both WC & BC contamination; proposed approach → minimize WC contamination]

49

The Freesound Dataset 50k (FSD50K)

Experiments

50 of 203

Impact of train/validation separation

  • Ignore WC contamination → validation perf is overly optimistic

50

The Freesound Dataset 50k (FSD50K)

Experiments

51 of 203

Impact of train/validation separation

  • Ignore WC contamination → validation perf is overly optimistic
  • Minimize WC contamination → validation perf is a good proxy for evaluation perf

51

The Freesound Dataset 50k (FSD50K)

Experiments

52 of 203

Summary

  • Dataset creation
    • human validation of nominated candidate labels
    • refinement process → exhaustive labelling of the evaluation set
  • Crowdsourcing & recruited trained annotators
  • Special emphasis on careful curation of evaluation set → unprecedented

52

The Freesound Dataset 50k (FSD50K)

Summary

53 of 203

Impact

  • Largest fully-open dataset of human-labeled sound events, and second largest overall after AudioSet

53

The Freesound Dataset 50k (FSD50K)

Summary

54 of 203

Impact

  • Largest fully-open dataset of human-labeled sound events, and second largest overall after AudioSet
  • First paper in “Dataset Papers” section of IEEE Challenges & Data Collection

54

The Freesound Dataset 50k (FSD50K)

Summary

55 of 203

Impact

  • Largest fully-open dataset of human-labeled sound events, and second largest overall after AudioSet
  • First paper in “Dataset Papers” section of IEEE Challenges & Data Collection
  • Enabled 6 audio challenges
    • DCASE 2018 Task 2, 2019 Task 2, 2019 Task 4, 2020 Task 4, 2021 Task 4, HEAR Challenge
  • Source for creation of 7 datasets
    • FUSS, GISE-51, USM-SED, Divide_and_Remaster, LibriFSD50K, FSD-MIX-SED, FSD-MIX-CLIPS
  • Used in research beyond sound classification & detection
    • universal sound separation, speech enhancement, few-shot learning, representation learning, federated learning

55

The Freesound Dataset 50k (FSD50K)

Summary

56 of 203

Outline

  • Introduction
  • The Freesound Dataset 50k (FSD50K)
  • Improving Sound Event Classification by Increasing Shift Invariance in Convolutional Neural Networks
  • Training Sound Event Classifiers With Noisy Labels
  • Self-Supervised Learning of Sound Event Representations
  • DCASE Challenge Tasks Organization
  • Summary and Conclusions

56

57 of 203

3. Improving Sound Event Classification by Increasing Shift Invariance in

Convolutional Neural Networks

Fonseca, E., Ferraro, A., Serra, X., Improving sound event classification by increasing shift invariance in convolutional neural networks. Under review in IEEE Signal Processing Letters, 2021.

  • Motivation
  • Method
  • Experiments
  • Takeaways

58 of 203

Why?

  • CNNs are one of the cornerstones of sound classification
  • One of the commonly-assumed properties of CNNs → shift invariance
    • output predictions are not affected by small shifts in input signal
  • Recent works in computer vision uncover
    • small shifts can change network’s predictions substantially

58

Improving SET by Increasing Shift Invariance in CNNs

Motivation

Azulay, A. & Weiss, Y., Why do deep convolutional networks generalize so poorly to small image transformations? JMLR 2018

59 of 203

Is this really a problem?

  • Apply time/freq shifts to input spectrograms of {1,3,5} frames/bands
    • analyze network’s robustness against shifts
  • Classification consistency: % of cases net predicts same top class for original and shifted
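A minimal sketch of this consistency check, assuming a trained PyTorch classifier and a batch of log-mel patches (names are illustrative; a circular roll stands in for the shifts used in the experiments):

    import torch

    def classification_consistency(model, patches, shift=1, dim=-1):
        # Fraction of patches whose top predicted class is unchanged after
        # shifting the input by `shift` bins along `dim`
        # (dim=-1: time frames, dim=-2: frequency bands).
        model.eval()
        with torch.no_grad():
            top_orig = model(patches).argmax(dim=1)
            shifted = torch.roll(patches, shifts=shift, dims=dim)  # wrap-around shift
            top_shift = model(shifted).argmax(dim=1)
        return (top_orig == top_shift).float().mean().item()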

59

Improving SET by Increasing Shift Invariance in CNNs

Motivation

60 of 203

Is this really a problem? So it seems...

  • By applying a time shift of 1 frame (10ms), top prediction changes 18% of the time

60

Improving SET by Increasing Shift Invariance in CNNs

Motivation

61 of 203

Is this really a problem? So it seems...

  • As time shifts increase, the network becomes less consistent

61

Improving SET by Increasing Shift Invariance in CNNs

Motivation

62 of 203

Is this really a problem? So it seems...

  • For freq shifts, consistency is even worse

62

Improving SET by Increasing Shift Invariance in CNNs

Motivation

63 of 203

How to increase shift invariance in CNNs?

  • One of the problems: improperly executed subsampling operations (aliasing)
    • prevalent in CNNs (strided conv or pool)
  • Techniques to improve subsampling in CNNs
    1. low-pass filter based solutions
    2. architectural changes to explicitly enforce shift invariance

63

Improving SET by Increasing Shift Invariance in CNNs

Method

64 of 203

Proposed approach: overview

  • Focus: subsampling operations within max-pooling layers

64

Improving SET by Increasing Shift Invariance in CNNs

Method

65 of 203

Proposed approach: overview

  • Focus: subsampling operations within max-pooling layers
  • Max-pooling layer with square kernel of size k and stride s
    • a unit-stride max-pooling operation with size k
    • a subsampling operation with stride s

65

Improving SET by Increasing Shift Invariance in CNNs

Method

[Diagram: x → MP(k,1) → stride-s subsampling → y_mp]

66 of 203

Proposed approach: overview

  • Focus: subsampling operations within max-pooling layers
  • Max-pooling layer with square kernel of size k and stride s
    • a unit-stride max-pooling operation with size k
    • a subsampling operation with stride s
  • We can add a low-pass filter before subsampling

66

Improving SET by Increasing Shift Invariance in CNNs

Method

[Diagram: x → MP(k,1) → stride-s → y_mp, vs. x → MP(k,1) → LPF(m,n) → stride-s → y_lpf]

67 of 203

Proposed approach: overview

  • Focus: subsampling operations within max-pooling layers
  • Max-pooling layer with square kernel of size k and stride s
    • a unit-stride max-pooling operation with size k
    • a subsampling operation with stride s
  • We add a low-pass filter before subsampling
  • We substitute naive subsampling by a more sophisticated strategy

67

Improving SET by Increasing Shift Invariance in CNNs

Method

[Diagram: x → MP(k,1) → stride-s → y_mp; x → MP(k,1) → LPF(m,n) → stride-s → y_lpf; x → MP(k,1) → APS → y_aps]

68 of 203

Low-pass filtering before subsampling

68

Improving SET by Increasing Shift Invariance in CNNs

Method

[Illustration: unit-stride max-pooling followed by subsampling]

69 of 203

Low-pass filtering before subsampling

Low-pass filters

  • Non-trainable binomial kernels (BlurPool)
  • Trainable kernel & softmax function (TLPF)
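A minimal PyTorch sketch of the BlurPool-style variant (unit-stride max-pooling, then a fixed binomial low-pass filter applied depthwise with stride s); the TLPF variant would instead learn the kernel and normalize it with a softmax. Class name, kernel size and padding mode are illustrative assumptions, not the thesis implementation:

    import math
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MaxBlurPool2d(nn.Module):
        # Anti-aliased max-pooling: MP(k, stride=1) -> binomial LPF -> stride-s subsampling.
        def __init__(self, channels, k=2, stride=2, filt_size=3):
            super().__init__()
            self.pool = nn.MaxPool2d(kernel_size=k, stride=1)
            coeffs = torch.tensor([math.comb(filt_size - 1, i) for i in range(filt_size)],
                                  dtype=torch.float32)        # e.g. [1, 2, 1]
            kernel = torch.outer(coeffs, coeffs)
            kernel = kernel / kernel.sum()                     # normalized 2D binomial kernel
            # one identical, non-trainable filter per channel (depthwise convolution)
            self.register_buffer("kernel",
                                 kernel.view(1, 1, filt_size, filt_size).repeat(channels, 1, 1, 1))
            self.stride, self.channels, self.pad = stride, channels, filt_size // 2

        def forward(self, x):
            x = self.pool(x)                                   # dense (unit-stride) max-pooling
            x = F.pad(x, [self.pad] * 4, mode="reflect")
            return F.conv2d(x, self.kernel, stride=self.stride, groups=self.channels)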

69

Improving SET by Increasing Shift Invariance in CNNs

Method

Zhang, Making convolutional networks shift-invariant again, ICML 2019

[Illustration: max-pooling → low-pass filtering with LPF(m,n) → subsampling]

70 of 203

Adaptive polyphase sampling

  • APS is a downsampling mechanism
    • addresses lack of shift invariance caused by subsampling operations
  • Observation: subsampling a patch and its shifted-by-one-bin version
    • can yield different results when bins are always sampled at the same fixed positions

70

Improving SET by Increasing Shift Invariance in CNNs

Method

Chaman & Dokmanic, Truly shift-invariant convolutional neural networks, CVPR 2021

[Illustration: naive stride-s subsampling after MP(k,1), applied to an original and a shifted feature map]

71 of 203

Adaptive polyphase sampling

  • When subsampling a feature map, instead of always using the same grid
    • multiple candidate grids could actually be used

71

Improving SET by Increasing Shift Invariance in CNNs

Method


72 of 203

Adaptive polyphase sampling

  • When subsampling a feature map, instead of always using the same grid
    • multiple candidate grids could actually be used
  • Subsampling operation with stride s = 2
    • four possible grids → four possible candidate subsampled feature maps

72

Improving SET by Increasing Shift Invariance in CNNs

Method


73 of 203

Adaptive polyphase sampling

  • APS → select subsampling grid adaptively by maximizing output energy (l1 norm)
    • grid follows shift at the input
    • increase robustness to input shifts
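A rough sketch of APS for one stride-s subsampling step, applied to the dense (unit-stride) max-pooled feature map; for simplicity one grid is chosen for the whole batch, whereas the original method selects per example:

    import torch

    def adaptive_polyphase_sampling(x, stride=2):
        # x: (batch, channels, H, W) feature map after unit-stride max-pooling.
        # Returns the polyphase component (candidate grid) with maximum l1 norm,
        # so the chosen grid follows shifts at the input.
        best, best_norm = None, None
        for i in range(stride):
            for j in range(stride):
                cand = x[:, :, i::stride, j::stride]   # one candidate subsampling grid
                norm = cand.abs().sum()                # l1 norm of the candidate
                if best_norm is None or norm > best_norm:
                    best, best_norm = cand, norm
        return best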

73

Improving SET by Increasing Shift Invariance in CNNs

Method

Chaman & Dokmanic, Truly shift-invariant convolutional neural networks, CVPR 2021

[Illustration: naive stride-s subsampling vs. adaptive polyphase sampling (APS) after MP(k,1), for original and shifted feature maps]

74 of 203

Experimental setup

  • Task: FSD50K multi-label classification
  • Baseline model: VGG41
    • features several max-pooling layers
  • VGG42 → width x2

74

Improving SET by Increasing Shift Invariance in CNNs

Experiments

[Architecture diagram, 1.2M weights: 3x (2x Conv2D(3,3) + BN + ReLU → Max-Pool 2x2) → 2x Conv2D(3,3) + BN + ReLU → Global-Pool max(avg(freq)) → Dense(200) + Sigmoid]

75 of 203

Evaluation using a small model

  • All methods outperform the baseline system

75

Improving SET by Increasing Shift Invariance in CNNs

Experiments

76 of 203

Evaluation using a small model

  • All methods outperform the baseline system
  • Low-pass filtering feature maps is helpful
  • Trainable vs. non-trainable? not critical, yet TLPF → slightly higher mAP

76

Improving SET by Increasing Shift Invariance in CNNs

Experiments

77 of 203

Evaluation using a small model

  • All methods outperform the baseline system
  • Low-pass filtering feature maps is helpful
  • Trainable vs. non-trainable? not critical, yet TLPF → slightly higher mAP
  • APS l1 and TLPF 5x5 → on par performance

77

Improving SET by Increasing Shift Invariance in CNNs

Experiments

78 of 203

Evaluation using a small model

  • All methods outperform the baseline system
  • Low-pass filtering feature maps is helpful
  • Trainable vs. non-trainable? not critical, yet TLPF → slightly higher mAP
  • APS l1 and TLPF 5x5 → on par performance
  • Combining them → small further performance boost

78

Improving SET by Increasing Shift Invariance in CNNs

Experiments

79 of 203

Evaluation using regularization and a larger model

  • Why mixup?
    • improve generalization & analyze performance w/ strong regularizer
  • Boosts remain solid → proposed methods help beyond what regularization provides

79

Improving SET by Increasing Shift Invariance in CNNs

Experiments

80 of 203

Evaluation using regularization and a larger model

  • VGG42 + mixup → proposed methods beneficial in more competitive conditions

80

Improving SET by Increasing Shift Invariance in CNNs

Experiments

81 of 203

Trainable vs. fixed LPF

  • Substitute TLPF by BlurPool in best system
    • performance differences are not large, yet TLPF → slightly higher mAP

81

Improving SET by Increasing Shift Invariance in CNNs

Experiments

82 of 203

Trainable vs. fixed LPF

  • BlurPool → all low-pass filters are fixed by construction
  • TLPF → multiple filters are learned w/ different patterns

82

Improving SET by Increasing Shift Invariance in CNNs

Experiments

Zhang, Making convolutional networks shift-invariant again, ICML 2019

[Filter visualizations: BlurPool (fixed binomial kernels) vs. TLPF (learned kernels with varied patterns)]

83 of 203

Characterizing the increase of shift invariance

  • “proposed” = TLPF & APS → higher robustness to the applied shifts
    • higher classification consistency in all cases

83

Improving SET by Increasing Shift Invariance in CNNs

Experiments

84 of 203

Characterizing the increase of shift invariance

[Plots: classification consistency vs. time frames / frequency bands shifted; time shift @ water dripping, frequency shift @ keyboard]

84

Improving SET by Increasing Shift Invariance in CNNs

Experiments


85 of 203

Characterizing the increase of shift invariance

[Plots: classification consistency vs. shift; time shift @ water dripping, frequency shift @ keyboard]

85

Improving SET by Increasing Shift Invariance in CNNs

Experiments

86 of 203

Comparison with previous work

  • Our best system obtains state-of-the-art mAP, outperforming
    • baseline models by a large margin
    • PSLA approach (collection of training techniques)
    • slightly outperforming Transformer-based approaches

86

Improving SET by Increasing Shift Invariance in CNNs

Experiments

Gong et al., PSLA: Improving audio event classification with pretraining, sampling, labeling, and aggregation. arXiv 2021

Verma & Berger, Audio transformers: Transformer architectures for large scale audio understanding. Adieu convolutions, arXiv 2021

87 of 203

Takeaways

  • Models evaluated → only partial shift invariance
  • Inserting proposed pooling methods into VGG variants
    • higher robustness to time/frequency shifts
    • recognition boosts

87

Improving SET by Increasing Shift Invariance in CNNs

Takeaways

88 of 203

Essentia TensorFlow models

  • In progress → adding models to the Essentia TF model zoo

88

Improving SET by Increasing Shift Invariance in CNNs

Takeaways

89 of 203

Outline

  • Introduction
  • The Freesound Dataset 50k (FSD50K)
  • Improving Sound Event Classification by Increasing Shift Invariance in Convolutional Neural Networks
  • Training Sound Event Classifiers With Noisy Labels
  • Self-Supervised Learning of Sound Event Representations
  • DCASE Challenge Tasks Organization
  • Summary and Conclusions

89

90 of 203

4. Training Sound Event Classifiers

with Noisy Labels

Fonseca, E., Plakal, M., Ellis, D. P. W., Font, F., Favory, X., Serra, X., Learning sound event classifiers from web audio with noisy labels. In Proceedings of the 44th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.

Fonseca, E., Font, F., Serra, X., Model-agnostic approaches to handling noisy labels when training sound event classifiers. In Proceedings of the 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2019.

* Fonseca, E., Hershey, S., Plakal, M., Ellis, D. P. W., Jansen, A., Moore, R. C., Addressing missing labels in large-scale sound event recognition using a teacher-student framework with loss masking. IEEE Signal Processing Letters, 2020.

* Work done during an internship at Google Research.

  • Motivation
  • FSDnoisy18k
  • Loss functions
  • Addressing the problem of missing labels
  • Takeaways

91 of 203

Label noise in sound event classification

  • Performance decrease / increased complexity

91

Training Sound Event Classifiers with Noisy Labels

Motivation

[Chart: hours of audio vs. labelling precision. ESC-50 (3 h), Chime-home (7 h), UrbanSound8K (9 h): exhaustive labelling; FSD50K (108 h), AudioSet (5800 h): less precise labelling]

92 of 203

FSDnoisy18k

  • 20 classes / 18k audio clips / 42.5h of audio

92

Training Sound Event Classifiers with Noisy Labels

FSDnoisy18k

93 of 203

The creation of FSDnoisy18k

  • Freesound
    • audio content & metadata (tags)

  • AudioSet Ontology
    • 20 class labels

93

Training Sound Event Classifiers with Noisy Labels

FSDnoisy18k

94 of 203

The creation of FSDnoisy18k

  • List of keywords for every class
  • Each class populated w/ corresponding clips

94

Training Sound Event Classifiers with Noisy Labels

FSDnoisy18k

95 of 203

The creation of FSDnoisy18k

95

Training Sound Event Classifiers with Noisy Labels

FSDnoisy18k

[Diagram: candidate clips per class (e.g., Rain: 1234.wav, 5678.wav, 7654.wav, ...) → validation task → clean train set & test set; non-validated candidates → noisy train set]

96 of 203

Label noise distribution in FSDnoisy18k

  • in-vocabulary (IV) → events that are part of our target class set (closed-set)
  • out-of-vocabulary (OOV) → events not covered by the class set (open-set)

96

Training Sound Event Classifiers with Noisy Labels

FSDnoisy18k

97 of 203

FSDnoisy18k

  • 20 classes / 18k clips / 42.5 h
  • Single-label data
  • Proportion train_noisy / train_clean = 90% / 10%
  • Per-class varying degree of types and amount of label noise

97

Training Sound Event Classifiers with Noisy Labels

FSDnoisy18k

Clips / duration:
  train set: noisy 15,813 clips / 38.8 h; clean 1,772 clips / 2.4 h
  test set: 947 clips / 1.4 h

98 of 203

Noise-robust loss functions

  • Default loss function for multi-class setting → Categorical Cross-Entropy (CCE)

98

Training Sound Event Classifiers with Noisy Labels

Loss functions

CCE = −Σ_k y_k log(ŷ_k), where y are the target labels and ŷ the predictions

99 of 203

Noise-robust loss functions

  • CCE is sensitive to label noise: emphasis on difficult examples (weighting)
    • beneficial for clean data
    • detrimental for noisy data

99

Training Sound Event Classifiers with Noisy Labels

Loss functions

100 of 203

Noise-robust loss functions

  • Generalized cross-entropy loss intuition
    • CCE → sensitive to noisy labels (weighting)
    • Mean Absolute Error (MAE)
      • avoid weighting
      • difficult convergence

100

Training Sound Event Classifiers with Noisy Labels

Loss functions

Ghosh et al., Robust loss functions under label noise for deep neural networks, AAAI 2017

101 of 203

Noise-robust loss functions

  • Lq loss is a generalization of CCE and MAE
    • q = 1 → Lq = MAE
    • q → 0 → Lq = CCE
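As a reference, a small PyTorch sketch of the Lq (generalized cross-entropy) loss for the single-label setting; q=0.7 is an illustrative default, not necessarily the value used in the thesis experiments:

    import torch
    import torch.nn.functional as F

    def lq_loss(logits, targets, q=0.7):
        # L_q = (1 - p_y^q) / q, with p_y the predicted probability of the labeled class.
        # q -> 0 recovers CCE; q = 1 yields MAE-like behavior.
        probs = F.softmax(logits, dim=1)
        p_y = probs.gather(1, targets.unsqueeze(1)).squeeze(1).clamp_min(1e-7)
        return ((1.0 - p_y.pow(q)) / q).mean()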

101

Training Sound Event Classifiers with Noisy Labels

Loss functions

Zhang & Sabuncu, Generalized cross-entropy loss for training deep neural networks with noisy labels, NeurIPS 2018

102 of 203

Noise-robust loss functions

102

Training Sound Event Classifiers with Noisy Labels

Loss functions

103 of 203

Noise-robust loss functions

  • Supervision by user-provided tags can be useful for sound event classification

103

Training Sound Event Classifiers with Noisy Labels

Loss functions

104 of 203

Noise-robust loss functions

  • Supervision by user-provided tags can be useful for sound event classification
  • Lq works well for sound classification tasks with OOV (and some IV) noise

104

Training Sound Event Classifiers with Noisy Labels

Loss functions

L_soft defined in Reed et al., Training deep neural networks on noisy labels with bootstrapping, ICLR 2015

105 of 203

Noise-robust loss functions

  • noisy set & Lq → 1.9% boost (little engineering effort)
  • noisy set & 2.4h curated data → 5.1% boost (significant manual effort)

105

Training Sound Event Classifiers with Noisy Labels

Loss functions

106 of 203

Loss-based instance selection

  • Deep networks in presence of label noise
    • problem is more severe as learning progresses

106

Training Sound Event Classifiers with Noisy Labels

Loss functions

Arpit et al., A closer look at memorization in deep networks, ICML 2017

[Diagram: learning over epochs; before epoch n1 the network learns easy & general patterns, afterwards it memorizes label noise]

107 of 203

Loss-based instance selection

  • Learning process as a two-stage process
  • After n1 epochs
    • model has converged to some extent → use it for instance selection
      • identify instances with large training loss
      • ignore them for gradient update

107

Training Sound Event Classifiers with Noisy Labels

Loss functions

[Diagram: stage 1 (up to epoch n1): regular training with Lq]

108 of 203

Loss-based instance selection

  • Approach 1
    • discard large-loss instances from each mini-batch
    • dynamically at every iteration
    • time-dependent loss function
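A sketch of this per-mini-batch variant: after epoch n1, compute unreduced losses, drop the largest ones, and backpropagate only through the rest (the drop fraction is an illustrative hyperparameter):

    import torch

    def pruned_batch_loss(per_example_loss, drop_frac=0.1):
        # per_example_loss: (batch,) tensor of unreduced losses (e.g. Lq per example).
        # Discard the `drop_frac` largest-loss instances from the gradient update.
        batch = per_example_loss.numel()
        keep = max(1, int(round(batch * (1.0 - drop_frac))))
        kept, _ = torch.topk(per_example_loss, keep, largest=False)  # smallest-loss examples
        return kept.mean()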

108

Training Sound Event Classifiers with Noisy Labels

Loss functions

[Diagram: stage 1 (up to epoch n1): regular training with Lq; stage 2: discard large-loss instances at each mini-batch]

109 of 203

Loss-based instance selection

  • Approach 2
    • use checkpoint to predict scores on whole dataset
    • convert to loss values
    • prune dataset, keeping a subset to continue learning
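A sketch of this pruning step, assuming a data loader that also yields example indices and an unreduced loss function (both illustrative assumptions):

    import torch

    def prune_dataset(model, loader, loss_fn, drop_frac=0.1, device="cpu"):
        # Score every training example with the current checkpoint and keep the
        # (1 - drop_frac) fraction with the smallest loss for further training.
        model.eval()
        losses, indices = [], []
        with torch.no_grad():
            for x, y, idx in loader:                           # loader assumed to yield indices too
                per_ex = loss_fn(model(x.to(device)), y.to(device))  # unreduced, shape (batch,)
                losses.append(per_ex.cpu())
                indices.append(idx)
        losses, indices = torch.cat(losses), torch.cat(indices)
        keep = int(round(len(losses) * (1.0 - drop_frac)))
        kept_idx = indices[torch.topk(losses, keep, largest=False).indices]
        return kept_idx                                        # e.g. to build a pruned Subset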

109

Training Sound Event Classifiers with Noisy Labels

Loss functions

[Diagram: stage 1 (up to epoch n1): regular training with Lq; dataset pruning; stage 2: regular training with Lq on the pruned set]

110 of 203

Loss-based instance selection

110

Training Sound Event Classifiers with Noisy Labels

Loss functions

111 of 203

Loss-based instance selection

  • Pruning dataset slightly outperforms discarding at mini-batch

111

Training Sound Event Classifiers with Noisy Labels

Loss functions

112 of 203

Loss-based instance selection

  • Pruning dataset slightly outperforms discarding at mini-batch
  • Pruning dataset is more stable

112

Training Sound Event Classifiers with Noisy Labels

Loss functions

113 of 203

Addressing the problem of missing labels

113

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

114 of 203

Addressing the problem of missing labels

  • AudioSet creation process → similar to FSD50K’s
    • nomination & validation process can fail → missing labels

114

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

115 of 203

Addressing the problem of missing labels

  • AudioSet creation process → similar to FSD50K’s
    • nomination & validation process can fail → missing labels
  • Labels in AudioSet
    • explicit (positive or negative) → received human validation
    • implicit negative → no human validation

115

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

116 of 203

Addressing the problem of missing labels

  • AudioSet creation process → similar to FSD50K’s
    • nomination & validation process can fail → missing labels
  • Labels in AudioSet
    • explicit (positive or negative) → received human validation
    • implicit negative → no human validation → missing “Present” labels?

116

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

117 of 203

Addressing the problem of missing labels

  • AudioSet creation process → similar to FSD50K’s
    • nomination & validation process can fail → missing labels
  • Labels in AudioSet
    • explicit (positive or negative) → received human validation
    • implicit negative → no human validation → missing “Present” labels?

117

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

118 of 203

Addressing the problem of missing labels

  • AudioSet creation process → similar to FSD50K’s
    • nomination & validation process can fail → missing labels
  • Labels in AudioSet
    • explicit (positive or negative) → received human validation
    • implicit negative → no human validation → missing “Present” labels?

118

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

119 of 203

Teacher-student framework

  • Skeptical teacher → detect potential missing labels per class
    • train teacher

119

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

[Diagram: train set with labels y → train teacher model]

120 of 203

Teacher-student framework

  • Skeptical teacher → detect potential missing labels per class
    • train teacher
    • predict scores for train set

120

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

[Diagram: train set with labels y → train teacher model → predict scores for all train clips over the 527 labels]

121 of 203

Teacher-student framework

  • Skeptical teacher → detect potential missing labels per class
    • train teacher
    • predict scores for train set
    • hypothesis: top-scored negatives → missing “Present” labels

121

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

[Diagram: per-class score distribution over implicit negatives → discard top-scored negatives; explicit labels untouched]

122 of 203

Teacher-student framework

  • Skeptical teacher → detect potential missing labels per class
    • train teacher
    • predict scores for train set
    • hypothesis: top-scored negatives → missing “Present” labels
    • flag top-scored negatives in new enhanced label set

122

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

[Diagram: teacher scores → flag/discard top-scored implicit negatives → enhanced label set]

123 of 203

Teacher-student framework

  • Train student model using enhanced label set & loss masking

123

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

[Diagram: train set with enhanced label set → train student model with loss masking → predict on eval set → lwlrap]

124 of 203

Teacher-student framework

  • Train student model using enhanced label set & loss masking
    • create a binary mask using the information encoded in enhanced label set

124

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

[Diagram: train set with enhanced label set → train student model with loss masking → predict on eval set → lwlrap]

125 of 203

Teacher-student framework

  • Train student model using enhanced label set & loss masking
    • create a binary mask using the information encoded in enhanced label set

    • apply mask to negative term of loss → discard loss contributions of missing labels
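A minimal sketch of the masked binary cross-entropy used for the student; `mask_neg` is assumed to flag, per clip and class, the implicit negatives identified by the teacher as likely missing "Present" labels (names are illustrative):

    import torch

    def masked_bce(logits, labels, mask_neg):
        # logits, labels, mask_neg: (batch, classes); labels are multi-hot targets.
        # mask_neg = 1 where a negative label is flagged as a potential missing label,
        # so its contribution to the negative term of the loss is discarded.
        probs = torch.sigmoid(logits).clamp(1e-7, 1 - 1e-7)
        pos_term = labels * torch.log(probs)
        neg_term = (1 - labels) * torch.log(1 - probs) * (1 - mask_neg)
        return -(pos_term + neg_term).mean()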

125

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

[Diagram: train set with enhanced label set → train student model with loss masking → predict on eval set → lwlrap]

126 of 203

Experimental setup

  • AudioSet → two train sets
    • similar class distribution
    • size proportion 1:5
  • Two models of different capacity

126

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

127 of 203

Performance vs. discarded negatives

  • Train a ResNet-50 teacher
  • Generate 18 new label sets. For each:
    • different threshold tc ∈ [0, 20]%
    • train a student & report eval

127

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

128 of 203

Performance vs. discarded negatives

  • Train a ResNet-50 teacher
  • Generate 18 new label sets. For each:
    • different threshold tc ∈ [0, 20]%
    • train a student & report eval
  • Each point → one experiment trial

128

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

129 of 203

Insight from results

  • Method yields boosts in all cases
    • best operating points discarding 2 - 6%

129

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

130 of 203

Insight from results

  • Method yields boosts in all cases
    • best operating points discarding 2 - 6%
  • Main pattern → consistent steep increase
    • most of the boost: removing just ≈1%

130

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

131 of 203

Insight from results

  • Method yields boosts in all cases
    • best operating points discarding 2 - 6%
  • Main pattern → consistent steep increase
    • most of the boost: removing just ≈1%
  • Effect of train set size
    • boost is higher when train set is smaller
    • effect still observable with 6800h+

131

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

132 of 203

Summary & takeaways

  • FSDnoisy18k to support label noise research
    • empirical characterization of label noise
    • large amounts of Freesound audio & tags → feasible for training sound event classifiers

132

Training Sound Event Classifiers with Noisy Labels

Takeaways

133 of 203

Summary & takeaways

  • FSDnoisy18k to support label noise research
    • empirical characterization of label noise
    • large amounts of Freesound audio & tags → feasible for training sound event classifiers
  • Simple model-agnostic approaches → improve performance
    • noise-robust loss functions → effective in mitigating effect of label noise

133

Training Sound Event Classifiers with Noisy Labels

Takeaways

134 of 203

Summary & takeaways

  • FSDnoisy18k to support label noise research
    • empirical characterization of label noise
    • large amounts of Freesound audio & tags → feasible for training sound event classifiers
  • Simple model-agnostic approaches → improve performance
    • noise-robust loss functions → effective in mitigating effect of label noise
    • rejecting noisy samples during training → more effective

134

Training Sound Event Classifiers with Noisy Labels

Takeaways

135 of 203

Summary & takeaways

  • FSDnoisy18k to support label noise research
    • empirical characterization of label noise
    • large amounts of Freesound audio & tags → feasible for training sound event classifiers
  • Simple model-agnostic approaches → improve performance
    • noise-robust loss functions → effective in mitigating effect of label noise
    • rejecting noisy samples during training → more effective
    • addressing missing labels → pathology in AudioSet labelling
      • boost is higher when train set is smaller, but still observable with massive data

135

Training Sound Event Classifiers with Noisy Labels

Takeaways

136 of 203

Outline

  • Introduction
  • The Freesound Dataset 50k (FSD50K)
  • Improving Sound Event Classification by Increasing Shift Invariance in Convolutional Neural Networks
  • Training Sound Event Classifiers With Noisy Labels
  • Self-Supervised Learning of Sound Event Representations
  • DCASE Challenge Tasks Organization
  • Summary and Conclusions

136

137 of 203

5. Self-Supervised Learning of

Sound Event Representations

^ Fonseca, E., Ortego, D., McGuinness, K., O’Connor, N. E., Serra, X., Unsupervised contrastive learning of sound event representations. In Proceedings of the 46th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.

^ Work done in collaboration with Dublin City University.

* Fonseca, E., Jansen, A., Ellis, D. P. W., Wisdom, S., Tagliasacchi, M., Hershey, J. R., Plakal, M., Hershey, S., Moore, R. C., Serra, X., Self-supervised learning from automatically separated sound scenes. In Proceedings of the 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2021.

* “Best Audio Representation Learning Paper Award” at WASPAA 2021. Work done during an internship at Google Research.

  • Motivation
  • Similarity maximization for sound event representation learning
  • Representation learning from automatically separated sound scenes
  • Takeaways

138 of 203

Why?

  • Common scenario in sound event research
    • few manually-labeled data but abundant unlabeled data
  • Self-supervised learning
    • proxy learning task
      • learn mapping from inputs to low dimensional representations
    • use representations for downstream tasks e.g. classification

138

Self-Supervised Learning of Sound Event Representations

Motivation

139 of 203

Self-supervised contrastive representation learning

  • Contrastive learning is learning by comparing
    • we compare pairs of input examples
      • positive pairs of similar inputs
      • negative pairs of unrelated inputs
  • Goal is an embedding space where representations …
    • of similar examples → close together
    • of dissimilar examples → further away

139

Self-Supervised Learning of Sound Event Representations

Motivation

140 of 203

Building a proxy learning task

To compare pairs of positive examples:

  • How to generate the pairs of positive examples?
    • Composition of data augmentation methods
  • Once generated, how to compare them?
    • Similarity maximization

140

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning

141 of 203

Building a proxy learning task

To compare pairs of positive examples:

  • How to generate the pairs of positive examples?
    • Composition of data augmentation methods
  • Once generated, how to compare them?
    • Proxy task of similarity maximization

141

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning

142 of 203

Proposed approach: overview

  • Similarity maximization, inspired by SimCLR
    • maximize similarity between differently augmented views of sound events
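For reference, a compact sketch of the SimCLR-style objective assumed here (normalized temperature-scaled cross-entropy over the metric embeddings z produced by the heads; the temperature value is illustrative):

    import torch
    import torch.nn.functional as F

    def nt_xent_loss(z1, z2, temperature=0.1):
        # z1, z2: (N, dim) metric embeddings of the two views of each clip.
        # Each view's positive is the other view of the same clip; the remaining
        # 2N - 2 embeddings in the batch act as negatives.
        z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)       # (2N, dim)
        sim = z @ z.t() / temperature                            # scaled cosine similarities
        n = z1.size(0)
        sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool, device=z.device), float("-inf"))
        targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
        return F.cross_entropy(sim, targets)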

142

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning

Chen et al., A simple framework for contrastive learning of visual representations, ICML 2020

[Diagram: clip → TPS → mix-back → augmentation front-end (DA, DA′) → shared-weight encoders → heads]

143 of 203

Proposed approach: Temporal Proximity sampling

  • Sample two patches at random within clip log-mel spectrogram
    • natural data augmentation
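A small sketch of temporal proximity sampling: two fixed-length patches are drawn at independent random start frames from the same clip's log-mel spectrogram (the patch length is an illustrative assumption, and the clip is assumed to be at least that long):

    import torch

    def temporal_proximity_sample(logmel, patch_frames=96):
        # logmel: (freq, time) spectrogram of one clip.
        # Returns two patches sampled at independent random positions.
        t = logmel.size(1)
        starts = torch.randint(0, max(1, t - patch_frames + 1), (2,))
        return (logmel[:, starts[0]:starts[0] + patch_frames],
                logmel[:, starts[1]:starts[1] + patch_frames])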

143

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning


144 of 203

Proposed approach: mix-back

  • Mix incoming patch with a background patch
    • reduce mutual information & retain semantics by sound transparency
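A hedged sketch of the idea behind mix-back: the incoming patch is mixed with an energy-matched background patch from another clip, with a small random weight so the foreground stays dominant (the mixing domain, weighting and energy adjustment here are simplifying assumptions, not the exact thesis formulation):

    import torch

    def mix_back(patch, background, beta=0.4):
        # patch, background: T-F patches of equal shape.
        # lam ~ U(0, beta) keeps the incoming (foreground) patch dominant.
        lam = torch.rand(()).item() * beta
        bg = background * (patch.norm() / background.norm().clamp_min(1e-8))  # rough energy match
        return (1.0 - lam) * patch + lam * bg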

144

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning

[Diagram: mix-back blocks highlighted in the pipeline, after TPS]

145 of 203

Proposed approach: other data augmentations

  • Simple methods for on-the-fly computation on T-F patches
  • Random resized cropping, compression, Gaussian noise addition, specAugment, random time/frequency shifts
  • Hyperparameters randomly sampled from a distribution

145

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning

Park et al., SpecAugment: A simple data augmentation method for automatic speech recognition. InterSpeech 2019

[Diagram: augmentation front-end (DA, DA′) highlighted in the pipeline]

146 of 203

Proposed approach: encoder and head

  • Convolutional encoder → extract low-dimensional embeddings h (for downstream tasks)
  • MLP head → map h to metric embedding z

146

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning

[Diagram: shared-weight encoders and heads highlighted in the pipeline]

147 of 203

Proposed approach: contrastive loss

147

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning

Chen et al., A simple framework for contrastive learning of visual representations, ICML 2020

[Diagram: full pipeline; the contrastive loss (normalized temperature-scaled cross-entropy, as in SimCLR) is computed on the head outputs]

148 of 203

Evaluation using FSDnoisy18k

  • Unsupervised representation learning
    • train on train_noisy without labels
    • validate on train_clean using labels in kNN Evaluation
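One simple way to run this kNN evaluation, assuming embeddings extracted by the frozen encoder for train_clean and their labels (cross-validated accuracy is used here as an illustrative protocol):

    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    def knn_accuracy(embeddings, labels, k=5, folds=5):
        # embeddings: (N, dim) features from the frozen encoder; labels: (N,) class indices.
        clf = KNeighborsClassifier(n_neighbors=k, metric="cosine")
        return cross_val_score(clf, embeddings, labels, cv=folds).mean()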

148

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning

149 of 203

Evaluation using FSDnoisy18k

  • Unsupervised representation learning
    • train on train_noisy without labels
    • validate on train_clean using labels in kNN Evaluation

  • Evaluation of the representation using supervised tasks
    • Model fine tuning after initializing w/ pre-trained weights on two downstream tasks
        • train on train_noisy
        • train on train_clean

149

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning

150 of 203

Ablation study: Temporal Proximity sampling

  • Best: sampling at random
  • Worst: using same patch

150

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning


151 of 203

Ablation study: mix-back

  • Lightly mixing patches with unrelated backgrounds helps
  • Adjusting patch energy is beneficial
    • foreground dominant over background

151

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning

152 of 203

Ablation study: data augmentation

  • Explore DAs individually
    • random resized cropping (RRC): small stretch in time/freq & small freq transposition
    • SpecAugment (time/freq masking)
  • Explore DA compositions
    • RRC + compression + Gaussian noise addition
    • RRC + SpecAugment

152

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning

Park et al., SpecAugment: A simple data augmentation method for automatic speech recognition. InterSpeech 2019

153 of 203

Evaluation of learned representations

  • Supervised baselines: CRNN ≈ VGG-like > ResNet-18
    • ResNet-18: too much capacity for the relatively small amount of (noisy-labeled) data

153

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning

154 of 203

Evaluation of learned representations

  • End-to-end model fine tuning
    • Goal: measure benefit wrt training from scratch in noisy- & small-data regimes
    • Unsupervised contrastive pre-training is best in all cases

154

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning

155 of 203

Evaluation of learned representations

  • End-to-end model fine tuning
    • Goal: measure benefit wrt training from scratch in noisy- & small-data regimes
    • Unsupervised contrastive pre-training is best in all cases
    • ResNet-18:
      • lowest accuracy trained from scratch (limited by data or label quality)
      • top accuracy w/ unsupervised pre-training (alleviate these problems)

155

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning

156 of 203

Takeaways

  • Successful representation learning by tuning the compound of
    • positive patch sampling, mix-back & data augmentation
  • Unsupervised contrastive pre-training can
    • mitigate the impact of data scarcity
    • increase robustness against noisy labels

156

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning

157 of 203

Let’s recap for our next framework

To compare pairs of positive examples:

  • How to generate the pairs of positive examples?
    • Composition of data augmentation methods
  • Once generated, how to compare them?
    • Similarity maximization

157

Self-Supervised Learning of Sound Event Representations

158 of 203

Previously, how to generate pairs of positive examples?

  • Previously, composition of data augmentation methods
    • temporal proximity sampling
    • cropping
    • artificial mixing
    • time/freq masking
    • shifts
  • Artificial & handcrafted transformations with tunable hyperparameters
  • Risk of introducing somewhat unrealistic domain shift?

158

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

159 of 203

Sound separation to generate views for contrastive learning

  • Real-world sound scenes: time-varying collections of sound events
  • Association of sound events with mixture and each other is semantically constrained
    • Not all classes co-occur naturally

159

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

160 of 203

Sound separation to generate views for contrastive learning

  • Decompose sound scene (mixture) into
    • simpler separated channels that share semantics with the mixture and with each other

160

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

[Diagram: mixture → Sound Separation → channels]

161 of 203

Sound separation to generate views for contrastive learning

  • Decompose sound scene (mixture) into
    • simpler separated channels that share semantics with the mixture and with each other
  • Unlike previous approaches to generate views, sound separation:
    • input-dependent & reduces need for parameter tuning
  • Comparing mixture vs channel meets recommended guidelines
    • mutual information between views is reduced
    • some relevant semantic information is preserved

161

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

Tian et al., What makes for good views for contrastive learning? NeurIPS 2020

[Diagram: mixture → Sound Separation → channels]

162 of 203

How to compare pairs of examples?

Two popular proxy tasks:

  • Similarity Maximization (SimCLR)
    • maximize the similarity between differently-augmented views
  • Coincidence Prediction
    • predict whether a pair of examples occur within a temporal proximity

162

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

Jansen et al., Coincidence, categorization, and consolidation: learning to recognize sounds with minimal supervision. ICASSP 2020

163 of 203

How to compare pairs of examples?

Two popular proxy tasks:

  • Similarity Maximization (SimCLR)
    • maximize the similarity between differently-augmented views
  • Coincidence Prediction
    • predict whether a pair of examples occurs within a temporal proximity
  • We propose to optimize them jointly as a multi-task objective
  • Same goal → semantically structured embedding space, pursued in different way
    • SM: co-locate representations of positives
    • CP: weaker condition → get a representation that supports coincidence prediction

163

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

164 of 203

Proposed approach: overview

164

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

[Diagram: mixture → MixIT sound separation → random channel selection → augmentation front-end (DA, DA′, DA′′, DA′′′) → encoders → similarity heads (similarity maximization) and concat → coincidence head (coincidence prediction)]

165 of 203

Augmentation Front-end

165

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

[Framework diagram repeated, highlighting the sound separation & augmentation front-end stage]

166 of 203

MixIT for unsupervised sound separation

  • Mixture invariant training (MixIT)
    • fully unsupervised
    • promising results in Universal Sound Separation
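The sketch below illustrates the core idea of the MixIT objective with a brute-force search over source-to-mixture assignments and a negative-SNR loss; it is a conceptual illustration only (the practical implementation in Wisdom et al. uses a thresholded SNR loss and trains the separation network end to end).

import itertools
import numpy as np

def neg_snr(ref, est, eps=1e-8):
    # negative signal-to-noise ratio between a reference and an estimate
    return -10.0 * np.log10(np.sum(ref ** 2) / (np.sum((ref - est) ** 2) + eps) + eps)

def mixit_loss(x1, x2, estimated_sources):
    # the separation model sees the mixture of mixtures x1 + x2 and outputs M
    # source estimates; each estimate is assigned to exactly one reference
    # mixture, and the best assignment defines the (unsupervised) loss
    m = estimated_sources.shape[0]
    best = np.inf
    for assignment in itertools.product([0, 1], repeat=m):   # 2^M binary assignments
        a = np.asarray(assignment)
        est1 = estimated_sources[a == 0].sum(axis=0)          # sources remixed into x1
        est2 = estimated_sources[a == 1].sum(axis=0)          # sources remixed into x2
        best = min(best, neg_snr(x1, est1) + neg_snr(x2, est2))
    return best

# toy usage: 4 estimated sources for two 1-second reference mixtures at 16 kHz
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=16000), rng.normal(size=16000)
print(mixit_loss(x1, x2, rng.normal(size=(4, 16000))))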

166

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

Wisdom et al., Unsupervised sound separation using mixture invariant training. NeurIPS 2020

167 of 203

A simple composition of data augmentation methods

  • Use more than one augmentation → more challenging proxy task
  • We combine sound separation with
    • Temporal proximity sampling
    • SpecAugment

167

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

Park et al., SpecAugment: A simple data augmentation method for automatic speech recognition. InterSpeech 2019

168 of 203

Proxy learning tasks

168

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

[Framework diagram repeated, highlighting the similarity maximization and coincidence prediction heads]

169 of 203

Coincidence prediction (CP)

  • Based on slowness prior: waveforms vary quickly ↔ semantics change slowly
    • stable representation to explain semantics
    • representation would support prediction of coincidence in temporal proximity

169

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

Wiskott & Sejnowski, Slow feature analysis: Unsupervised learning of invariances, Neural Computation 2002

[Diagram: two encoders → concatenated embeddings → coincidence head]

170 of 203

Coincidence prediction (CP)

  • Based on slowness prior: waveforms vary quickly ↔ semantics change slowly
    • stable representation to explain semantics
    • representation would support prediction of coincidence in temporal proximity
  • Encoder: extract low-dimensional embeddings h & concatenate pairs
  • Coincidence Head: map [h_m, h_c] to the probability that the pair is coinciding
    • binary classification task → binary cross entropy loss
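A minimal PyTorch sketch of this branch (embedding and hidden dimensions are assumptions): the two embeddings are concatenated and mapped to a coincidence logit, trained with binary cross entropy against 1 for coinciding pairs and 0 otherwise.

import torch
import torch.nn as nn

class CoincidenceHead(nn.Module):
    # maps a concatenated pair of embeddings [h_m, h_c] to a coincidence logit
    def __init__(self, emb_dim=128, hidden=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * emb_dim, hidden),
                                 nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, h_m, h_c):
        return self.net(torch.cat([h_m, h_c], dim=1)).squeeze(1)   # (N,) logits

# toy usage: 8 coinciding pairs (label 1) and 8 non-coinciding pairs (label 0)
head = CoincidenceHead()
h_m, h_c = torch.randn(16, 128), torch.randn(16, 128)
labels = torch.cat([torch.ones(8), torch.zeros(8)])
loss = nn.functional.binary_cross_entropy_with_logits(head(h_m, h_c), labels)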

170

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

[Diagram: two encoders → concatenated embeddings → coincidence head]

171 of 203

Evaluation

  • Downstream classification with a shallow model on AudioSet (mAP)
    • train & eval a shallow network on top of the learned representation
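A sketch of this protocol is shown below (assuming embeddings have already been extracted with the frozen encoder; probe size, epochs, and learning rate are illustrative): a small classifier is trained on the representation and macro-averaged mAP is reported.

import torch
import torch.nn as nn
from sklearn.metrics import average_precision_score

def train_and_eval_probe(train_emb, train_labels, eval_emb, eval_labels,
                         hidden=512, epochs=20, lr=1e-3):
    # shallow downstream classifier on frozen embeddings; labels are multi-hot
    # float tensors; returns macro-averaged mean average precision (mAP)
    probe = nn.Sequential(nn.Linear(train_emb.shape[1], hidden), nn.ReLU(),
                          nn.Linear(hidden, train_labels.shape[1]))
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):                      # full-batch training, for brevity
        opt.zero_grad()
        loss = nn.functional.binary_cross_entropy_with_logits(probe(train_emb), train_labels)
        loss.backward()
        opt.step()
    with torch.no_grad():
        scores = torch.sigmoid(probe(eval_emb)).numpy()
    return average_precision_score(eval_labels.numpy(), scores, average='macro')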

171

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

172 of 203

Sound separation for contrastive learning

  • Comparing input mixture w/ separated channels
    • better representations than using only the input mixture
    • outperforms SpecAugment (SA)

172

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

173 of 203

How about using a separation model before convergence?

  • Four audio processors → four training checkpoints of a single separation network

173

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

174 of 203

How about using a separation model before convergence?

  • All processors provide valid forms of augmentation
  • Combining some of them using an OR rule can be helpful
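One simple reading of the OR rule, sketched below under the assumption that the retained checkpoints are available as callable separation models: for each example, one checkpoint is drawn at random and its output channels are used as views.

import numpy as np

rng = np.random.default_rng(0)

def separate_with_or_rule(mixture_waveform, checkpoint_models):
    # OR-rule combination of separation-model checkpoints: per example,
    # pick one checkpoint at random and use its separated channels as views
    model = checkpoint_models[rng.integers(len(checkpoint_models))]
    return model(mixture_waveform)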

174

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

175 of 203

Jointly optimizing both proxy tasks

  • Training the entire framework: similarity maximization & coincidence prediction
    • small boosts across the board
  • Key ingredient:
    • combining the diverse processing applied by the separation model as learning progresses

175

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

176 of 203

Comparison with previous work

  • Proposed framework
    • outperforms some past approaches, including MultiModal (MM) ones
    • competitive with SOTA

176

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

Jansen et al., Unsupervised learning of semantic audio representations. ICASSP 2018

Jansen et al., Coincidence, categorization, and consolidation: learning to recognize sounds with minimal supervision. ICASSP 2020

Alayrac et al., Self-supervised multimodal versatile networks. NeurIPS 2020

Wang & van den Oord, Multi-format contrastive learning of audio representations. SAS Workshop, NeurIPS 2020

177 of 203

Takeaways

  • Sound separation → valid augmentation to generate views for contrastive learning
  • Learning to associate sound mixtures w/ their separated channels elicits semantic structure in the learned representation
  • Transformations by different checkpoints of the same separation model
    • valid augmentations for generating positives
  • Benefit in jointly training similarity maximization and coincidence prediction

177

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

178 of 203

Best Audio Representation Learning Paper Award at WASPAA 2021

178

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

179 of 203

Outline

  • Introduction
  • The Freesound Dataset 50k (FSD50K)
  • Improving Sound Event Classification by Increasing Shift Invariance in Convolutional Neural Networks
  • Training Sound Event Classifiers With Noisy Labels
  • Self-Supervised Learning of Sound Event Representations
  • DCASE Challenge Tasks Organization
  • Summary and Conclusions

179

180 of 203

6. DCASE Challenge Tasks Organization

Fonseca, E., Plakal, M., Font, F., Ellis, D. P. W., Serra, X., Audio tagging with noisy labels and minimal supervision. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE), 2019.

Fonseca, E., Plakal, M., Font, F., Ellis, D. P. W., Favory, X., Pons, J., Serra, X., General-purpose tagging of Freesound audio with AudioSet labels: Task description, dataset, and baseline. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE), 2018.

  • DCASE 2018 Task 2: General-purpose Audio Tagging of Freesound Content with AudioSet Labels
  • DCASE 2019 Task 2: Audio Tagging with Noisy Labels and Minimal Supervision

181 of 203

DCASE Challenge Tasks organization

181

DCASE Challenge Tasks Organization

182 of 203

DCASE Challenge Tasks organization

  • Design of problem formulation
  • Development of audio datasets
  • Management of Kaggle platform & discussion forums

182

DCASE Challenge Tasks Organization

183 of 203

Kaggle platform

183

DCASE Challenge Tasks Organization

184 of 203

Open knowledge

  • 2 open datasets (FSDKaggle2018 & FSDKaggle2019) and baselines
  • Exchange of ideas & papers & code on Kaggle forums
  • Code for winning solutions was released

184

DCASE Challenge Tasks Organization

185 of 203

Outline

  • Introduction
  • The Freesound Dataset 50k (FSD50K)
  • Improving Sound Event Classification by Increasing Shift Invariance in Convolutional Neural Networks
  • Training Sound Event Classifiers With Noisy Labels
  • Self-Supervised Learning of Sound Event Representations
  • DCASE Challenge Tasks Organization
  • Summary and Conclusions

185

186 of 203

7. Summary and Conclusions

  • Technical contributions
  • Other academic contributions & merits
  • Publications

187 of 203

Technical contributions

  • A comprehensive review of sound event recognition & different topics covered

187

Summary and Conclusions

188 of 203

Technical contributions

  • A comprehensive review of sound event recognition & different topics covered
  • FSD50K → the largest fully-open dataset of human-labeled sound events

188

Summary and Conclusions

189 of 203

Technical contributions

  • A comprehensive review of sound event recognition & different topics covered
  • FSD50K → the largest fully-open dataset of human-labeled sound events
  • A comprehensive characterization of FSD50K

189

Summary and Conclusions

190 of 203

Technical contributions

  • A comprehensive review of sound event recognition & different topics covered
  • FSD50K → the largest fully-open dataset of human-labeled sound events
  • A comprehensive characterization of FSD50K
  • Architectural modifications to increase shift invariance in CNNs

190

Summary and Conclusions

191 of 203

Technical contributions

  • A comprehensive review of sound event recognition & different topics covered
  • FSD50K → the largest fully-open dataset of human-labeled sound events
  • A comprehensive characterization of FSD50K
  • Architectural modifications to increase shift invariance in CNNs
  • FSDnoisy18k → first dataset to support the investigation of label noise in SET

191

Summary and Conclusions

192 of 203

Technical contributions

  • A comprehensive review of sound event recognition & different topics covered
  • FSD50K → the largest fully-open dataset of human-labeled sound events
  • A comprehensive characterization of FSD50K
  • Architectural modifications to increase shift invariance in CNNs
  • FSDnoisy18k → first dataset to support the investigation of label noise in SET
  • Techniques to mitigate the effect of label noise during training of sound event classifiers

192

Summary and Conclusions

193 of 203

Technical contributions

  • A comprehensive review of sound event recognition & different topics covered
  • FSD50K → the largest fully-open dataset of human-labeled sound events
  • A comprehensive characterization of FSD50K
  • Architectural modifications to increase shift invariance in CNNs
  • FSDnoisy18k → first dataset to support the investigation of label noise in SET
  • Techniques to mitigate the effect of label noise during training of sound event classifiers
  • Self-supervised contrastive learning frameworks for learning audio representations

193

Summary and Conclusions

194 of 203

Other academic contributions & merits

  • Best Audio Representation Learning Paper Award at WASPAA 2021
    • “Self-Supervised Learning From Automatically Separated Sound Scenes”
  • Developer of awarded proposals for Google Faculty Research Awards 2017 & 2018
  • Technical Program Co-Chair of DCASE Workshop 2021
  • Challenge co-organizer
    • DCASE2018 Task2, DCASE2019 Task2, DCASE2021 Task4
    • Holistic Evaluation of Audio Representations (HEAR) 2021
  • Reviewer:
    • EURASIP, IEEE SPL, ICASSP, WASPAA, DCASE, EUSIPCO, IJCNN, MMSP, ISMIR
  • Resources for reproducibility
    • 4 datasets available from Zenodo
    • 5 code repositories available from GitHub

194

Summary and Conclusions

195 of 203

19 peer-reviewed publications

  • 3 Journal Articles as first author in TASLP and IEEE SPL
    • Fonseca, E., Favory, X., Pons, J., Font, F., & Serra, X. (2020). FSD50K: an Open Dataset of Human-Labeled Sound Events. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
    • Fonseca, E., Hershey, S., Plakal, M., Ellis, D. P. W., Jansen, A., & Moore, R. C. (2020). Addressing Missing Labels in Large-Scale Sound Event Recognition Using a Teacher-Student Framework With Loss Masking. IEEE Signal Processing Letters
    • Fonseca, E., Ferraro, A., & Serra, X. (2021). Improving Sound Event Classification by Increasing Shift Invariance in Convolutional Neural Networks. Under review in IEEE Signal Processing Letters.

195

Summary and Conclusions

196 of 203

19 peer-reviewed publications

  • 9 Conference Articles as first author in ICASSP, WASPAA, DCASE, SMC, ISMIR
    • Fonseca, E., Jansen, A., Ellis, D. P. W., Wisdom, S., Tagliasacchi, M., Hershey, J. R., Plakal, M., Hershey, S., Moore, R. C., & Serra, X. (2021). Self-Supervised Learning from Automatically Separated Sound Scenes. In Proceedings of the 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).
    • Fonseca, E., Ortego, D., McGuinness, K., O’Connor, N. E., & Serra, X. (2021). Unsupervised Contrastive Learning of Sound Event Representations. In Proceedings of the 46th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
    • Fonseca, E., Font, F., & Serra, X. (2019). Model-agnostic Approaches to Handling Noisy Labels When Training Sound Event Classifiers. In Proceedings of the 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).
    • Fonseca, E., Plakal, M., Font, F., Ellis, D. P. W., & Serra, X. (2019). Audio Tagging with Noisy Labels and Minimal Supervision. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019)
    • Fonseca, E., Plakal, M., Ellis, D. P. W., Font, F., Favory, X., & Serra, X. (2019). Learning Sound Event Classifiers from Web Audio with Noisy Labels. In Proceedings of the 44th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
    • Fonseca, E., Plakal, M., Font, F., Ellis, D. P. W., Favory, X., Pons, J., & Serra, X. (2018). General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018).
    • Fonseca, E., Gong, R., & Serra, X. (2018). A Simple Fusion of Deep and Shallow Learning for Acoustic Scene Classification. In Proceedings of 5th Sound & Music Computing Conference (SMC 2018).
    • Fonseca, E., Pons, J., Favory, X., Font, F., Bogdanov, D., Ferraro, A., Oramas, S., Porter, A., & Serra, X. (2017). Freesound Datasets: A Platform for the Creation of Open Audio Datasets. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR).
    • Fonseca, E., Gong, R., Bogdanov, D., Slizovskaia, O., Gómez, E., & Serra, X. (2017). Acoustic Scene Classification by Ensembling Gradient Boosting Machine and Convolutional Neural Networks. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017).

196

Summary and Conclusions

197 of 203

19 peer-reviewed publications

  • 7 Conference Articles through collaborations in ICASSP, DCASE, FRUCT
    • Zinemanas, P., Rocamora, M., Fonseca, E., Font, F., & Serra, X. (2021). Towards Interpretable Sound Event Detection with Attention Based on Prototypes. Accepted for publication in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021).
    • Hershey, S., Ellis, D. P. W., Fonseca, E., Jansen, A., Liu, C., Moore, R. C., & Plakal, M. (2021). The Benefit of Temporally-Strong Labels in Audio Event Classification. In Proceedings of the 46th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
    • Turpault, N., Serizel, R., Wisdom, S., Erdogan, H., Hershey, J.R., Fonseca, E., Seetharaman, P., & Salamon, J. (2021). Sound Event Detection and Separation: a Benchmark on DESED Synthetic Soundscapes. In Proceedings of the 46th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
    • Wisdom, S., Erdogan, H., Ellis, D. P. W., Serizel, R., Turpault, N., Fonseca, E., Salamon, J., Seetharaman, P., & Hershey, J.R. (2021). What’s All the FUSS About Free Universal Sound Separation Data?. In Proceedings of the 46th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
    • Turpault, N., Wisdom, S., Erdogan, H., Hershey, J.R., Serizel, R., Fonseca, E., Seetharaman, P., & Salamon, J. (2020). Improving Sound Event Detection In Domestic Environments Using Sound Separation. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2020 Workshop (DCASE2020).
    • Pérez-López, A., Fonseca, E., & Serra, X. (2019). A Hybrid Parametric-Deep Learning Approach for Sound Event Localization and Detection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019).
    • Favory, F., Fonseca, E., & Serra, X. (2018). Facilitating the Manual Annotation of Sounds When Using Large Taxonomies. In Proceedings of the 23rd Conference of Open Innovations Association, FRUCT.

197

Summary and Conclusions

198 of 203

Evolution of research context in sound event classification

198

Summary and Conclusions

[Diagram: axes of vocabulary size vs. supervision weakness; starting point: supervised classification with clean labels on small vocabularies]

199 of 203

Evolution of research context in sound event classification

199

Summary and Conclusions

[Diagram: moving towards large vocabularies and weaker supervision: supervised classification with clean labels, supervised classification with noisy labels, and self-supervised representation learning]

200 of 203

Evolution of research context in sound event classification

200

Summary and Conclusions

[Diagram: the same landscape, marking where this thesis sits: large-vocabulary work spanning supervised classification with clean labels, supervised classification with noisy labels, and self-supervised representation learning]

201 of 203

Thank you!

201

202 of 203

Training Sound Event Classifiers Using Different Types of Supervision

Eduardo Fonseca

December 1st, 2021

Supervisors:

Dr. Xavier Serra i Casals

Dr. Frederic Font Corbera

Board:

Dr. Emmanouil Benetos

Dr. Annamaria Mesaros

Dr. Marius Miron

203 of 203

Credits