Training Sound Event Classifiers Using Different Types of Supervision
Eduardo Fonseca
December 1st, 2021
Supervisors:
Dr. Xavier Serra i Casals
Dr. Frederic Font Corbera
Board:
Dr. Emmanouil Benetos
Dr. Annamaria Mesaros
Dr. Marius Miron
Sound event classification
2
Introduction
Motivation
Applications
3
Introduction
Motivation
A challenging endeavour
4
Introduction
Motivation
Sound event classification before this thesis
5
Introduction
Motivation
Sound event classification before this thesis
6
Introduction
Motivation
Sound event classification before this thesis
7
Introduction
Motivation
But sound data were available ...
8
Introduction
Motivation
AudioCommons
9
Introduction
Motivation
What can we do about it?
10
Introduction
Motivation
1. Building a new dataset
11
Introduction
Research directions
2. Improving generalization
12
Introduction
Research directions
3. Learning with noisy labels
13
Introduction
Research directions
4. Self-supervised learning
14
Introduction
Research directions
Objectives of this thesis
15
Introduction
Objectives
Objectives of this thesis
16
Introduction
Objectives
Objectives of this thesis
17
Introduction
Objectives
Objectives of this thesis
18
Introduction
Objectives
Objectives of this thesis
19
Introduction
Objectives
Objectives of this thesis
20
Introduction
Objectives
Outline
21
Introduction
Outline
22
2. The Freesound Dataset 50k (FSD50K)
Fonseca, E., Favory, X., Pons, J., Font, F., Serra, X., FSD50K: an open dataset of human-labeled sound events. In press in IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021.
Fonseca, E., Pons, J., Favory, X., Font, F., Bogdanov, D., Ferraro, A., Oramas, S., Porter, A., Serra, X., Freesound Datasets: A platform for the creation of open audio datasets. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), 2017.
Why?
24
The Freesound Dataset 50k (FSD50K)
Motivation
Data acquisition
25
The Freesound Dataset 50k (FSD50K)
The creation of FSD50K
Freesound
26
The Freesound Dataset 50k (FSD50K)
The creation of FSD50K
AudioSet Ontology
27
The Freesound Dataset 50k (FSD50K)
The creation of FSD50K
Freesound Annotator
28
The Freesound Dataset 50k (FSD50K)
The creation of FSD50K
Candidate labels nomination
29
The Freesound Dataset 50k (FSD50K)
The creation of FSD50K
Candidate labels nomination
30
The Freesound Dataset 50k (FSD50K)
The creation of FSD50K
Candidate labels nomination
Outcome: 268k clips
31
The Freesound Dataset 50k (FSD50K)
The creation of FSD50K
Validation task
32
The Freesound Dataset 50k (FSD50K)
The creation of FSD50K
Validation task
33
The Freesound Dataset 50k (FSD50K)
The creation of FSD50K
Validation task
34
The Freesound Dataset 50k (FSD50K)
The creation of FSD50K
Data split
35
The Freesound Dataset 50k (FSD50K)
The creation of FSD50K
[Figure: the labeled data are split, class by class (c0, c1, c2, ...), into dev and eval sets]
Data split
36
The Freesound Dataset 50k (FSD50K)
The creation of FSD50K
Data split
37
The Freesound Dataset 50k (FSD50K)
The creation of FSD50K
Refinement task for evaluation set
38
The Freesound Dataset 50k (FSD50K)
The creation of FSD50K
Refinement task for evaluation set
39
The Freesound Dataset 50k (FSD50K)
The creation of FSD50K
Favory et al., Facilitating the manual annotation of sounds when using large taxonomies, FRUCT 2018
After manual annotation
40
The Freesound Dataset 50k (FSD50K)
The creation of FSD50K
heroes behind FSD50K
(hired annotators)
me
great collaborators
Post-processing
41
The Freesound Dataset 50k (FSD50K)
The creation of FSD50K
Post-processing
42
The Freesound Dataset 50k (FSD50K)
The creation of FSD50K
Post-processing
43
The Freesound Dataset 50k (FSD50K)
The creation of FSD50K
[Figure: the dev set is split, class by class (c0, c1, c2, ...), into train and val sets, controlling WC and BC contamination]
Post-processing
44
The Freesound Dataset 50k (FSD50K)
The creation of FSD50K
Freesound Dataset 50k (FSD50K)
45
The Freesound Dataset 50k (FSD50K)
FSD50K description
Clips / duration:
development set: train 36,796 clips / 70.5 hours + val 4,170 clips / 9.9 hours
evaluation set: 10,231 clips / 27.9 hours
Freesound Dataset 50k (FSD50K)
46
The Freesound Dataset 50k (FSD50K)
FSD50K description
Limitations
47
The Freesound Dataset 50k (FSD50K)
FSD50K description
Impact of train/validation separation
48
The Freesound Dataset 50k (FSD50K)
Experiments
Impact of train/validation separation
both WC & BC contamination
minimize WC contamination
49
The Freesound Dataset 50k (FSD50K)
Experiments
Impact of train/validation separation
50
The Freesound Dataset 50k (FSD50K)
Experiments
Impact of train/validation separation
51
The Freesound Dataset 50k (FSD50K)
Experiments
Summary
52
The Freesound Dataset 50k (FSD50K)
Summary
Impact
53
The Freesound Dataset 50k (FSD50K)
Summary
Impact
54
The Freesound Dataset 50k (FSD50K)
Summary
Impact
55
The Freesound Dataset 50k (FSD50K)
Summary
Outline
56
3. Improving Sound Event Classification by Increasing Shift Invariance in
Convolutional Neural Networks
Fonseca, E., Ferraro, A., Serra, X., Improving sound event classification by increasing shift invariance in convolutional neural networks. Under review in IEEE Signal Processing Letters, 2021.
Why?
58
Improving SET by Increasing Shift Invariance in CNNs
Motivation
Azulay, A. & Weiss, Y., Why do deep convolutional networks generalize so poorly to small image transformations? JMLR 2019
Is this really a problem?
59
Improving SET by Increasing Shift Invariance in CNNs
Motivation
Is this really a problem? So it seems...
60
Improving SET by Increasing Shift Invariance in CNNs
Motivation
Is this really a problem? So it seems...
61
Improving SET by Increasing Shift Invariance in CNNs
Motivation
Is this really a problem? So it seems...
62
Improving SET by Increasing Shift Invariance in CNNs
Motivation
How to increase shift invariance in CNNs?
63
Improving SET by Increasing Shift Invariance in CNNs
Method
Proposed approach: overview
64
Improving SET by Increasing Shift Invariance in CNNs
Method
Proposed approach: overview
65
Improving SET by Increasing Shift Invariance in CNNs
Method
[Figure: baseline path: x → MPk,1 with stride s → ymp]
Proposed approach: overview
66
Improving SET by Increasing Shift Invariance in CNNs
Method
[Figure: two paths: x → MPk,1 (stride s) → ymp; x → MPk,1 → LPFm,n (stride s) → ylpf]
Proposed approach: overview
67
Improving SET by Increasing Shift Invariance in CNNs
Method
[Figure: three paths: x → MPk,1 (stride s) → ymp; x → MPk,1 → LPFm,n (stride s) → ylpf; x → MPk,1 → APS → yaps]
Low-pass filtering before subsampling
68
Improving SET by Increasing Shift Invariance in CNNs
Method
[Figure: strided max-pooling viewed as a dense max( ) followed by subsampling]
Low-pass filtering before subsampling
Low-pass filters
69
Improving SET by Increasing Shift Invariance in CNNs
Method
Zhang, Making convolutional networks shift-invariant again, ICML 2019
[Figure: dense max( ), then a low-pass conv( ) with LPFm,n, then subsampling]
Adaptive polyphase sampling
70
Improving SET by Increasing Shift Invariance in CNNs
Method
Chaman & Dokmanic, Truly shift-invariant convolutional neural networks, CVPR 2021
[Figure: naive subsampling (MPk,1 with stride s) of an original vs. a shifted feature map selects different samples]
Adaptive polyphase sampling
71
Improving SET by Increasing Shift Invariance in CNNs
Method
Adaptive polyphase sampling
72
Improving SET by Increasing Shift Invariance in CNNs
Method
Adaptive polyphase sampling
73
Improving SET by Increasing Shift Invariance in CNNs
Method
Chaman & Dokmanic, Truly shift-invariant convolutional neural networks, CVPR 2021
[Figure: naive subsampling (stride-s MPk,1) vs. adaptive polyphase sampling (MPk,1 + APS) on an original and a shifted feature map; APS keeps the same samples in both cases]
Experimental setup
74
Improving SET by Increasing Shift Invariance in CNNs
Experiments
[Figure: VGG-like model, 1.2M weights:
3x { 2x Conv2D(3,3) + BN + ReLU → Max-Pool 2x2 }
2x Conv2D(3,3) + BN + ReLU
Global-Pool max(avg(freq))
Dense(200) + Sigmoid]
Evaluation using a small model
75
Improving SET by Increasing Shift Invariance in CNNs
Experiments
Evaluation using a small model
76
Improving SET by Increasing Shift Invariance in CNNs
Experiments
Evaluation using a small model
77
Improving SET by Increasing Shift Invariance in CNNs
Experiments
Evaluation using a small model
78
Improving SET by Increasing Shift Invariance in CNNs
Experiments
Evaluation using regularization and a larger model
79
Improving SET by Increasing Shift Invariance in CNNs
Experiments
Evaluation using regularization and a larger model
80
Improving SET by Increasing Shift Invariance in CNNs
Experiments
Trainable vs. fixed LPF
81
Improving SET by Increasing Shift Invariance in CNNs
Experiments
Trainable vs. fixed LPF
82
Improving SET by Increasing Shift Invariance in CNNs
Experiments
Zhang, Making convolutional networks shift-invariant again, ICML 2019
[Figure: fixed BlurPool kernel vs. trainable low-pass filter (TLPF) kernels]
Characterizing the increase of shift invariance
83
Improving SET by Increasing Shift Invariance in CNNs
Experiments
Characterizing the increase of shift invariance
[Figure panels: time shift @ water dripping; frequency shift @ keyboard]
84
Improving SET by Increasing Shift Invariance in CNNs
Experiments
[Figure axis: time frames shifted]
Characterizing the increase of shift invariance
85
Improving SET by Increasing Shift Invariance in CNNs
Experiments
Comparison with previous work
86
Improving SET by Increasing Shift Invariance in CNNs
Experiments
Gong et al., PSLA: Improving audio event classification with pretraining, sampling, labeling, and aggregation. arXiv 2021
Verma & Berger, Audio transformers: Transformer architectures for large scale audio understanding. Adieu convolutions, arXiv 2021
Takeaways
87
Improving SET by Increasing Shift Invariance in CNNs
Takeaways
Essentia TensorFlow models
88
Improving SET by Increasing Shift Invariance in CNNs
Takeaways
Outline
89
4. Training Sound Event Classifiers
with Noisy Labels
Fonseca, E., Plakal, M., Ellis, D. P. W., Font, F., Favory, X., Serra, X., Learning sound event classifiers from web audio with noisy labels. In Proceedings of the 44th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
Fonseca, E., Font, F., Serra, X., Model-agnostic approaches to handling noisy labels when training sound event classifiers. In Proceedings of the 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2019.
* Fonseca, E., Hershey, S., Plakal, M., Ellis, D. P. W., Jansen, A., Moore, R. C., Addressing missing labels in large-scale sound event recognition using a teacher-student framework with loss masking. IEEE Signal Processing Letters, 2020.
* Work done during an internship at Google Research.
Label noise in sound event classification
91
Training Sound Event Classifiers with Noisy Labels
Motivation
[Figure: hours of audio per dataset, ordered from exhaustive to less precise labelling:
ESC-50: 3 h | CHiME-Home: 7 h | UrbanSound8K: 9 h | FSD50K: 108 h | AudioSet: 5800 h]
FSDnoisy18k
92
Training Sound Event Classifiers with Noisy Labels
FSDnoisy18k
The creation of FSDnoisy18k
93
Training Sound Event Classifiers with Noisy Labels
FSDnoisy18k
The creation of FSDnoisy18k
94
Training Sound Event Classifiers with Noisy Labels
FSDnoisy18k
The creation of FSDnoisy18k
95
Training Sound Event Classifiers with Noisy Labels
FSDnoisy18k
[Figure: candidate clips (1234.wav, 5678.wav, 7654.wav, ...) nominated for a class such as Rain; clips confirmed in the validation task form the clean train set and the test set, and the remaining clips form the noisy train set]
Label noise distribution in FSDnoisy18k
96
Training Sound Event Classifiers with Noisy Labels
FSDnoisy18k
FSDnoisy18k
97
Training Sound Event Classifiers with Noisy Labels
FSDnoisy18k
Clips / duration:
train set: noisy 15,813 clips / 38.8 h + clean 1,772 clips / 2.4 h
test set: 947 clips / 1.4 h
Noise-robust loss functions
98
Training Sound Event Classifiers with Noisy Labels
Loss functions
[Figure: model predictions compared against target labels]
Noise-robust loss functions
99
Training Sound Event Classifiers with Noisy Labels
Loss functions
Noise-robust loss functions
100
Training Sound Event Classifiers with Noisy Labels
Loss functions
Ghosh et al., Robust loss functions under label noise for deep neural networks, AAAI 2017
Noise-robust loss functions
101
Training Sound Event Classifiers with Noisy Labels
Loss functions
Zhang & Sabuncu, Generalized cross-entropy loss for training deep neural networks with noisy labels, NeurIPS 2018
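The generalized cross-entropy loss is a one-liner, so a small sketch may help. Assuming p is the probability the model assigns to the labeled class; the default q = 0.7 is an illustrative value, not necessarily the thesis setting.

```python
import numpy as np

def lq_loss(p, q=0.7):
    """Generalized cross-entropy (Zhang & Sabuncu, NeurIPS 2018):
    L_q(p) = (1 - p^q) / q. As q -> 0 it recovers cross-entropy -log(p)
    (fast to fit, sensitive to label noise); at q = 1 it becomes the
    MAE-like loss 1 - p (noise robust, slower to fit). Intermediate q
    trades off the two regimes."""
    return (1.0 - p ** q) / q
```

Compared with cross-entropy, L_q penalizes a confidently wrong prediction far less, which is what keeps noisy labels from dominating training.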
Noise-robust loss functions
102
Training Sound Event Classifiers with Noisy Labels
Loss functions
Noise-robust loss functions
103
Training Sound Event Classifiers with Noisy Labels
Loss functions
Noise-robust loss functions
104
Training Sound Event Classifiers with Noisy Labels
Loss functions
L_soft defined in Reed et al., Training deep neural networks on noisy labels with bootstrapping, ICLR 2015
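The soft bootstrapping target can be sketched directly from its definition. A minimal numpy sketch; beta = 0.95 and the toy vectors are illustrative.

```python
import numpy as np

def soft_bootstrap_loss(y, p, beta=0.95):
    """Soft bootstrapping (Reed et al., ICLR 2015): the training target is a
    convex combination of the (possibly noisy) label y and the model's own
    prediction p, so a confident model can partially overrule a wrong label.
    y, p: per-class target / predicted probability arrays."""
    eps = 1e-12  # numerical floor for the log
    target = beta * y + (1.0 - beta) * p
    return -np.sum(target * np.log(p + eps))
```

With beta = 1 this is plain cross-entropy; with beta < 1, a prediction that confidently disagrees with the label incurs a smaller loss than under cross-entropy.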
Noise-robust loss functions
105
Training Sound Event Classifiers with Noisy Labels
Loss functions
Loss-based instance selection
106
Training Sound Event Classifiers with Noisy Labels
Loss functions
Arpit et al., A closer look at memorization in deep networks, ICML 2017
[Figure: learning vs. epoch: up to epoch n1 networks learn easy & general patterns; afterwards they memorize label noise]
Loss-based instance selection
107
Training Sound Event Classifiers with Noisy Labels
Loss functions
[Figure: stage 1: regular training with Lq up to epoch n1]
Loss-based instance selection
108
Training Sound Event Classifiers with Noisy Labels
Loss functions
[Figure: stage 1: regular training with Lq up to epoch n1; stage 2: discard high-loss instances at the mini-batch level]
Loss-based instance selection
109
Training Sound Event Classifiers with Noisy Labels
Loss functions
[Figure: stage 1: regular training with Lq up to epoch n1; stage 2: prune the dataset, then continue regular training with Lq]
Loss-based instance selection
110
Training Sound Event Classifiers with Noisy Labels
Loss functions
Loss-based instance selection
111
Training Sound Event Classifiers with Noisy Labels
Loss functions
Loss-based instance selection
112
Training Sound Event Classifiers with Noisy Labels
Loss functions
Addressing the problem of missing labels
113
Training Sound Event Classifiers with Noisy Labels
Addressing the problem of missing labels
Addressing the problem of missing labels
114
Training Sound Event Classifiers with Noisy Labels
Addressing the problem of missing labels
Addressing the problem of missing labels
115
Training Sound Event Classifiers with Noisy Labels
Addressing the problem of missing labels
Addressing the problem of missing labels
116
Training Sound Event Classifiers with Noisy Labels
Addressing the problem of missing labels
Addressing the problem of missing labels
117
Training Sound Event Classifiers with Noisy Labels
Addressing the problem of missing labels
Addressing the problem of missing labels
118
Training Sound Event Classifiers with Noisy Labels
Addressing the problem of missing labels
Teacher-student framework
119
Training Sound Event Classifiers with Noisy Labels
Addressing the problem of missing labels
[Figure: a teacher model is trained on the train set with labels y]
Teacher-student framework
120
Training Sound Event Classifiers with Noisy Labels
Addressing the problem of missing labels
[Figure: the trained teacher predicts scores for every train-set clip, yielding a clips x 527-labels score matrix]
Teacher-student framework
121
Training Sound Event Classifiers with Noisy Labels
Addressing the problem of missing labels
[Figure: from the score distribution, implicit negatives with high teacher scores are discarded; explicit labels are kept]
Teacher-student framework
122
Training Sound Event Classifiers with Noisy Labels
Addressing the problem of missing labels
[Figure: discarding high-scoring implicit negatives from the score matrix yields an enhanced label set y]
Teacher-student framework
123
Training Sound Event Classifiers with Noisy Labels
Addressing the problem of missing labels
[Figure: a student model is trained on the train set with the enhanced labels y and loss masking, then evaluated on the eval set with d′ and lwlrap]
Teacher-student framework
124
Training Sound Event Classifiers with Noisy Labels
Addressing the problem of missing labels
Teacher-student framework
125
Training Sound Event Classifiers with Noisy Labels
Addressing the problem of missing labels
Experimental setup
126
Training Sound Event Classifiers with Noisy Labels
Addressing the problem of missing labels
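Since the students are evaluated with lwlrap, a minimal sketch of the metric may help. This version pools every positive (clip, label) pair with equal weight, which is equivalent to the label-weighted formulation; the tiny matrices in the test are illustrative.

```python
import numpy as np

def lwlrap(truth, scores):
    """Label-weighted label-ranking average precision: for each positive
    label of each clip, compute the precision of the positives ranked at or
    above it in that clip's score ranking, then average over all positive
    (clip, label) pairs so every test label counts equally."""
    truth = np.asarray(truth, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    precisions = []
    for t, s in zip(truth, scores):
        order = np.argsort(-s)      # class indices, best score first
        hits = t[order]             # which ranked classes are positive
        cum_hits = np.cumsum(hits)
        ranks = np.arange(1, len(s) + 1)
        precisions.extend((cum_hits[hits] / ranks[hits]).tolist())
    return float(np.mean(precisions)) if precisions else 0.0
```

A perfect ranking scores 1.0; each positive pushed below a negative lowers the average.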
Performance vs. discarded negatives
127
Training Sound Event Classifiers with Noisy Labels
Addressing the problem of missing labels
Performance vs. discarded negatives
128
Training Sound Event Classifiers with Noisy Labels
Addressing the problem of missing labels
Insight from results
129
Training Sound Event Classifiers with Noisy Labels
Addressing the problem of missing labels
Insight from results
130
Training Sound Event Classifiers with Noisy Labels
Addressing the problem of missing labels
Insight from results
131
Training Sound Event Classifiers with Noisy Labels
Addressing the problem of missing labels
Summary & takeaways
132
Training Sound Event Classifiers with Noisy Labels
Takeaways
Summary & takeaways
133
Training Sound Event Classifiers with Noisy Labels
Takeaways
Summary & takeaways
134
Training Sound Event Classifiers with Noisy Labels
Takeaways
Summary & takeaways
135
Training Sound Event Classifiers with Noisy Labels
Takeaways
Outline
136
5. Self-Supervised Learning of
Sound Event Representations
^ Fonseca, E., Ortego, D., McGuinness, K., O’Connor, N. E., Serra, X., Unsupervised contrastive learning of sound event representations. In Proceedings of the 46th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.
^ Work done in collaboration with Dublin City University.
* Fonseca, E., Jansen, A., Ellis, D. P. W., Wisdom, S., Tagliasacchi, M., Hershey, J. R., Plakal, M., Hershey, S., Moore, R. C., Serra, X., Self-supervised learning from automatically separated sound scenes. In Proceedings of the 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2021.
* “Best Audio Representation Learning Paper Award” at WASPAA 2021. Work done during an internship at Google Research.
Why?
138
Self-Supervised Learning of Sound Event Representations
Motivation
Self-supervised contrastive representation learning
139
Self-Supervised Learning of Sound Event Representations
Motivation
Building a proxy learning task
To compare pairs of positive examples:
140
Self-Supervised Learning of Sound Event Representations
Similarity Maximization for Sound Event Representation Learning
Building a proxy learning task
To compare pairs of positive examples:
141
Self-Supervised Learning of Sound Event Representations
Similarity Maximization for Sound Event Representation Learning
Proposed approach: overview
142
Self-Supervised Learning of Sound Event Representations
Similarity Maximization for Sound Event Representation Learning
Chen et al., A simple framework for contrastive learning of visual representations, ICML 2020
[Figure: pipeline: TPS samples two patches from a clip; each passes mix-back and a data augmentation (DA, DA′) in the augmentation front-end; shared-weight encoders and heads embed both views]
Proposed approach: Temporal Proximity sampling
143
Self-Supervised Learning of Sound Event Representations
Similarity Maximization for Sound Event Representation Learning
[Figure: TPS draws two temporally close patches from the same clip]
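The sampling rule can be sketched on a 1-D frame index. A minimal numpy sketch; patch_len and max_gap are illustrative parameter names, not the thesis settings.

```python
import numpy as np

def sample_positive_pair(clip, patch_len, max_gap):
    """Temporal Proximity sampling (TPS): draw two patches from the same
    clip whose start frames are at most max_gap apart. Temporally close
    patches likely contain the same sound event, so they can serve as a
    positive pair for contrastive learning."""
    rng = np.random.default_rng()
    hi = len(clip) - patch_len
    t1 = int(rng.integers(0, hi + 1))
    lo2 = max(0, t1 - max_gap)
    hi2 = min(hi, t1 + max_gap)
    t2 = int(rng.integers(lo2, hi2 + 1))
    return clip[t1:t1 + patch_len], clip[t2:t2 + patch_len]
```

Sampling both patches from anywhere in the clip would be the looser alternative; bounding the gap keeps the shared content between the views.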
Proposed approach: mix-back
144
Self-Supervised Learning of Sound Event Representations
Similarity Maximization for Sound Event Representation Learning
[Figure: each TPS patch is mixed back with a background patch before augmentation]
Proposed approach: other data augmentations
145
Self-Supervised Learning of Sound Event Representations
Similarity Maximization for Sound Event Representation Learning
Park et al., SpecAugment: A simple data augmentation method for automatic speech recognition. InterSpeech 2019
[Figure: after TPS and mix-back, augmentations DA and DA′ are applied in the augmentation front-end]
Proposed approach: encoder and head
146
Self-Supervised Learning of Sound Event Representations
Similarity Maximization for Sound Event Representation Learning
[Figure: the two augmented views pass through shared-weight encoders and projection heads]
Proposed approach: contrastive loss
147
Self-Supervised Learning of Sound Event Representations
Similarity Maximization for Sound Event Representation Learning
Chen et al., A simple framework for contrastive learning of visual representations, ICML 2020
[Figure: the head outputs of the two views are compared with a contrastive loss]
Evaluation using FSDnoisy18k
148
Self-Supervised Learning of Sound Event Representations
Similarity Maximization for Sound Event Representation Learning
Evaluation using FSDnoisy18k
149
Self-Supervised Learning of Sound Event Representations
Similarity Maximization for Sound Event Representation Learning
Ablation study: Temporal Proximity sampling
150
Self-Supervised Learning of Sound Event Representations
Similarity Maximization for Sound Event Representation Learning
TPS
Ablation study: mix-back
151
Self-Supervised Learning of Sound Event Representations
Similarity Maximization for Sound Event Representation Learning
Ablation study: data augmentation
152
Self-Supervised Learning of Sound Event Representations
Similarity Maximization for Sound Event Representation Learning
Park et al., SpecAugment: A simple data augmentation method for automatic speech recognition. InterSpeech 2019
Evaluation of learned representations
153
Self-Supervised Learning of Sound Event Representations
Similarity Maximization for Sound Event Representation Learning
Evaluation of learned representations
154
Self-Supervised Learning of Sound Event Representations
Similarity Maximization for Sound Event Representation Learning
Evaluation of learned representations
155
Self-Supervised Learning of Sound Event Representations
Similarity Maximization for Sound Event Representation Learning
Takeaways
156
Self-Supervised Learning of Sound Event Representations
Similarity Maximization for Sound Event Representation Learning
Let’s recap for our next framework
To compare pairs of positive examples:
157
Self-Supervised Learning of Sound Event Representations
Previously, how to generate pairs of positive examples?
158
Self-Supervised Learning of Sound Event Representations
Representation Learning from Automatically Separated Sound Scenes
Sound separation to generate views for contrastive learning
159
Self-Supervised Learning of Sound Event Representations
Representation Learning from Automatically Separated Sound Scenes
Sound separation to generate views for contrastive learning
160
Self-Supervised Learning of Sound Event Representations
Representation Learning from Automatically Separated Sound Scenes
[Figure: a sound separation model decomposes a mixture into channels]
Sound separation to generate views for contrastive learning
161
Self-Supervised Learning of Sound Event Representations
Representation Learning from Automatically Separated Sound Scenes
Tian et al., What makes for good views for contrastive learning? NeurIPS 2020
How to compare pairs of examples?
Two popular proxy tasks:
162
Self-Supervised Learning of Sound Event Representations
Representation Learning from Automatically Separated Sound Scenes
Jansen et al., Coincidence, categorization, and consolidation: learning to recognize sounds with minimal supervision. ICASSP 2020
How to compare pairs of examples?
Two popular proxy tasks:
163
Self-Supervised Learning of Sound Event Representations
Representation Learning from Automatically Separated Sound Scenes
Proposed approach: overview
164
Self-Supervised Learning of Sound Event Representations
Representation Learning from Automatically Separated Sound Scenes
[Figure: sound separation + augmentation front-end: MixIT separates the input; random channel selection and augmentations DA, DA′, DA′′, DA′′′ create views; encoders feed similarity heads (similarity maximization) and, via concatenated embeddings, a coincidence head (coincidence prediction)]
165
Self-Supervised Learning of Sound Event Representations
Representation Learning from Automatically Separated Sound Scenes
MixIT for unsupervised sound separation
166
Self-Supervised Learning of Sound Event Representations
Representation Learning from Automatically Separated Sound Scenes
Wisdom et al., Unsupervised sound separation using mixture invariant training. NeurIPS 2020
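The mixture-invariant objective can be sketched by brute force over assignments. A minimal numpy sketch with an MSE criterion for clarity (the actual MixIT work uses an SNR-based loss); function and variable names are illustrative.

```python
import numpy as np
from itertools import product

def mixit_loss(m1, m2, est_sources):
    """Mixture invariant training (Wisdom et al., NeurIPS 2020), sketched:
    the separation model hears the mixture of mixtures m1 + m2 and outputs
    est_sources; the loss is the best reconstruction error over all ways of
    assigning each estimated source to one of the two input mixtures, so no
    ground-truth sources are ever needed."""
    n = len(est_sources)
    best = np.inf
    for assign in product([0, 1], repeat=n):   # each source -> m1 or m2
        a = np.asarray(assign)
        r1 = est_sources[a == 0].sum(axis=0) if (a == 0).any() else 0.0
        r2 = est_sources[a == 1].sum(axis=0) if (a == 1).any() else 0.0
        err = float(np.mean((m1 - r1) ** 2) + np.mean((m2 - r2) ** 2))
        best = min(best, err)
    return best
```

If the estimated sources can be partitioned to reconstruct both mixtures exactly, the loss is zero; the exhaustive search is exponential in the number of sources, which is fine for the small source counts used here.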
A simple composition of data augmentation methods
167
Self-Supervised Learning of Sound Event Representations
Representation Learning from Automatically Separated Sound Scenes
Park et al., SpecAugment: A simple data augmentation method for automatic speech recognition. InterSpeech 2019
Proxy learning tasks
168
Self-Supervised Learning of Sound Event Representations
Representation Learning from Automatically Separated Sound Scenes
[Figure: the two proxy tasks on the separated and augmented views: similarity maximization between similarity-head outputs, and coincidence prediction on concatenated encoder embeddings]
Coincidence prediction (CP)
169
Self-Supervised Learning of Sound Event Representations
Representation Learning from Automatically Separated Sound Scenes
Wiskott & Sejnowski, Slow feature analysis: Unsupervised learning of invariances, Neural Computation 2002
[Figure: two encoders; their embeddings are concatenated and fed to a coincidence head]
Coincidence prediction (CP)
170
Self-Supervised Learning of Sound Event Representations
Representation Learning from Automatically Separated Sound Scenes
Evaluation
171
Self-Supervised Learning of Sound Event Representations
Representation Learning from Automatically Separated Sound Scenes
Sound separation for contrastive learning
172
Self-Supervised Learning of Sound Event Representations
Representation Learning from Automatically Separated Sound Scenes
How about using a separation model before convergence?
173
Self-Supervised Learning of Sound Event Representations
Representation Learning from Automatically Separated Sound Scenes
How about using a separation model before convergence?
174
Self-Supervised Learning of Sound Event Representations
Representation Learning from Automatically Separated Sound Scenes
Jointly optimizing both proxy tasks
175
Self-Supervised Learning of Sound Event Representations
Representation Learning from Automatically Separated Sound Scenes
Comparison with previous work
176
Self-Supervised Learning of Sound Event Representations
Representation Learning from Automatically Separated Sound Scenes
Jansen et al., Unsupervised learning of semantic audio representations. ICASSP 2018
Jansen et al., Coincidence, categorization, and consolidation: learning to recognize sounds with minimal supervision. ICASSP 2020
Alayrac et al., Self-supervised multimodal versatile networks. NeurIPS 2020
Wang & van den Oord. Multi-format contrastive learning of audio representations. SAS Workshop NeurIPS 2020
Takeaways
177
Self-Supervised Learning of Sound Event Representations
Representation Learning from Automatically Separated Sound Scenes
Best Audio Representation Learning Paper Award at WASPAA21
178
Self-Supervised Learning of Sound Event Representations
Representation Learning from Automatically Separated Sound Scenes
Outline
179
6. DCASE Challenge Tasks Organization
Fonseca, E., Plakal, M., Font, F., Ellis, D. P. W., Serra, X., Audio tagging with noisy labels and minimal supervision. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE), 2019.
Fonseca, E., Plakal, M., Font, F., Ellis, D. P. W., Favory, X., Pons, J., Serra, X., General-purpose tagging of Freesound audio with AudioSet labels: Task description, dataset, and baseline. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE), 2018.
DCASE Challenge Tasks organization
181
DCASE Challenge Tasks Organization
DCASE Challenge Tasks organization
182
DCASE Challenge Tasks Organization
Kaggle platform
183
DCASE Challenge Tasks Organization
Open knowledge
184
DCASE Challenge Tasks Organization
Outline
185
7. Summary and Conclusions
Technical contributions
187
Summary and Conclusions
Technical contributions
188
Summary and Conclusions
Technical contributions
189
Summary and Conclusions
Technical contributions
190
Summary and Conclusions
Technical contributions
191
Summary and Conclusions
Technical contributions
192
Summary and Conclusions
Technical contributions
193
Summary and Conclusions
Other academic contributions & merits
194
Summary and Conclusions
19 peer-reviewed publications
195
Summary and Conclusions
19 peer-reviewed publications
196
Summary and Conclusions
19 peer-reviewed publications
197
Summary and Conclusions
Evolution of research context in sound event classification
198
Summary and Conclusions
[Figure: axes: vocabulary size vs. supervision weakness; starting point: supervised classification with clean labels on small vocabularies]
Evolution of research context in sound event classification
199
Summary and Conclusions
[Figure: from small to large vocabularies and along increasing supervision weakness: supervised classification with clean labels → supervised classification with noisy labels → self-supervised representation learning]
Evolution of research context in sound event classification
200
Summary and Conclusions
[Figure: this thesis spans supervised classification with clean labels, supervised classification with noisy labels, and self-supervised representation learning, across the vocabulary-size and supervision-weakness axes]
Thank you!
201
Training Sound Event Classifiers Using Different Types of Supervision
Eduardo Fonseca
December 1st, 2021
Supervisors:
Dr. Xavier Serra i Casals
Dr. Frederic Font Corbera
Board:
Dr. Emmanouil Benetos
Dr. Annamaria Mesaros
Dr. Marius Miron
Credits