1 of 203

Training Sound Event Classifiers Using Different Types of Supervision

Eduardo Fonseca

December 1st, 2021

Supervisors:

Dr. Xavier Serra i Casals

Dr. Frederic Font Corbera

Board:

Dr. Emmanouil Benetos

Dr. Annamaria Mesaros

Dr. Marius Miron

2 of 203

Sound event classification

  • Automatic identification of all kinds of everyday sounds

2

Introduction

Motivation

3 of 203

Applications

3

Introduction

Motivation

4 of 203

A challenging endeavour

4

Introduction

Motivation

5 of 203

Sound event classification before this thesis

  • Research context
    • methods → supervised learning: feature engineering → deep learning
    • datasets → limited size and vocabulary (e.g., urban or domestic)

5

Introduction

Motivation

6 of 203

Sound event classification before this thesis

  • Research context
    • methods → supervised learning: feature engineering → deep learning
    • datasets → limited size and vocabulary (e.g., urban or domestic)
  • Limitations
    • deep learning approaches are data-hungry
    • each dataset covers only a small fraction of the variety of everyday sounds

6

Introduction

Motivation

7 of 203

Sound event classification before this thesis

  • Research context
    • methods → supervised learning: feature engineering → deep learning
    • datasets → limited size and vocabulary (e.g., urban or domestic)
  • Limitations
    • deep learning approaches are data-hungry
    • each dataset covers only a small fraction of the variety of everyday sounds
  • Hindering the development of general-purpose classifiers
    • recognize hundreds of sound classes

7

Introduction

Motivation

8 of 203

But sound data were available ...

  • Large amounts of diverse everyday sound data

  • Two main aspects in common
    • very large amounts of audio
    • lack of reliable homogeneous labels → user-provided metadata

8

Introduction

Motivation

9 of 203

AudioCommons

  • Horizon 2020 EU-funded AudioCommons
    • automatic description of sounds
    • exploit Freesound for research purposes

9

Introduction

Motivation

10 of 203

What can we do about it?

  • What can we do to improve the coverage & performance of sound classifiers?

  • In this thesis, we identify four research avenues

10

Introduction

Motivation

11 of 203

1. Building a new dataset

  • Most evident option → better datasets
    • larger size
    • larger vocabulary
  • Web repositories → sources for sound event dataset creation

  • In this thesis → new sound event dataset
    • fully-open & distributable
    • large-vocabulary
    • reliable labels

11

Introduction

Research directions

12 of 203

2. Improving generalization

  • Costly manual annotation → less than ideal amount of training data

  • Techniques to increase generalization to unseen examples → paramount

  • In this thesis → improve generalization
    • CNN architectural modifications → robustness to small time/frequency shifts
    • data augmentation

12

Introduction

Research directions

13 of 203

3. Learning with noisy labels

  • Label noise is a reality
    1. transition to larger datasets w/ less precise labeling
    2. labels can be inferred automatically (e.g. from user-provided metadata)
  • Supervision given by noisy labels → only feasible choice
    • pressing issue for sound event classification
  • In this thesis → learning with noisy labels
    • dataset to support label noise research
    • techniques to mitigate the effect of label noise

13

Introduction

Research directions

14 of 203

4. Self-supervised learning

  • Textual labels accompanying audio → not always available
    • unlabeled data is much more abundant
  • Self-supervised learning → learning representations without external supervision
    • downstream tasks such as classification
  • In this thesis → self-supervised learning
    • strategies to learn audio representations from unlabeled data

14

Introduction

Research directions

15 of 203

Objectives of this thesis

  • Research on dataset creation as well as supervised & unsupervised learning
    • train large-vocabulary sound event classifiers

15

Introduction

Objectives

16 of 203

Objectives of this thesis

  • Research on dataset creation as well as supervised & unsupervised learning
    • train large-vocabulary sound event classifiers
  • Objectives
    1. Build an annotated open dataset of sound events, of larger coverage and size

16

Introduction

Objectives

17 of 203

Objectives of this thesis

  • Research on dataset creation as well as supervised & unsupervised learning
    • train large-vocabulary sound event classifiers
  • Objectives
    • Build an annotated open dataset of sound events, of larger coverage and size
    • Devise a learning method to improve generalization to new unseen examples

17

Introduction

Objectives

18 of 203

Objectives of this thesis

  • Research on dataset creation as well as supervised & unsupervised learning
    • train large-vocabulary sound event classifiers
  • Objectives
    • Build an annotated open dataset of sound events, of larger coverage and size
    • Devise a learning method to improve generalization to new unseen examples
    • Develop techniques to mitigate the negative effect of label noise when training sound event classifiers

18

Introduction

Objectives

19 of 203

Objectives of this thesis

  • Research on dataset creation as well as supervised & unsupervised learning
    • train large-vocabulary sound event classifiers
  • Objectives
    • Build an annotated open dataset of sound events, of larger coverage and size
    • Devise a learning method to improve generalization to new unseen examples
    • Develop techniques to mitigate the negative effect of label noise when training sound event classifiers
    • Develop methodologies for learning sound event representations in unsupervised fashion

19

Introduction

Objectives

20 of 203

Objectives of this thesis

  • Research on dataset creation as well as supervised & unsupervised learning
    • train large-vocabulary sound event classifiers
  • Objectives
    • Build an annotated open dataset of sound events, of larger coverage and size
    • Devise a learning method to improve generalization to new unseen examples
    • Develop techniques to mitigate the negative effect of label noise when training sound event classifiers
    • Develop methodologies for learning sound event representations in unsupervised fashion
    • Release data and source code as open resources → open and reproducible research

20

Introduction

Objectives

21 of 203

Outline

  1. Introduction - Chapter 1
  2. The Freesound Dataset 50k (FSD50K) - Chapter 3
  3. Improving Sound Event Classification by Increasing Shift Invariance in Convolutional Neural Networks - Chapter 4
  4. Training Sound Event Classifiers With Noisy Labels - Chapter 5
  5. Self-Supervised Learning of Sound Event Representations - Chapter 6
  6. DCASE Challenge Tasks Organization - Appendix A
  7. Summary and Conclusions - Chapter 7

21

Introduction

22 of 203

Outline

  • Introduction
  • The Freesound Dataset 50k (FSD50K)
  • Improving Sound Event Classification by Increasing Shift Invariance in Convolutional Neural Networks
  • Training Sound Event Classifiers With Noisy Labels
  • Self-Supervised Learning of Sound Event Representations
  • DCASE Challenge Tasks Organization
  • Summary and Conclusions

22

23 of 203

2. The Freesound Dataset 50k (FSD50K)

Fonseca, E., Favory, X., Pons, J., Font, F., Serra, X, FSD50K: an open dataset of human-labeled sound events. In press in IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021.

Fonseca, E., Pons, J., Favory, X., Font, F., Bogdanov, D., Ferraro, A., Oramas, S., Porter, A., Serra, X, Freesound Datasets: A platform for the creation of open audio datasets. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), 2017.

  • Motivation
  • The creation of FSD50K
  • FSD50K description
  • Experiments
  • Summary

24 of 203

Why?

  • Most existing datasets → relatively small and/or domain-specific
  • AudioSet
    • unprecedented size, coverage and diversity
    • suffers from openness and stability issues
  • Sound event research lags behind in terms of dataset availability

24

The Freesound Dataset 50k (FSD50K)

Motivation

25 of 203

Data acquisition

  • Source of audio

  • Vocabulary

  • Infrastructure

25

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

26 of 203

Freesound

  • Online collaborative audio clip sharing site
  • 500,000+ audio clips
  • Wide variety of audio content
  • User-provided tags
  • Creative Commons licenses

26

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

27 of 203

AudioSet Ontology

  • Hierarchy with 632 sound event classes
  • Per-class textual description
  • Most comprehensive set of everyday sounds
    • convenient for Freesound

27

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

28 of 203

Freesound Annotator

  • Website → collaborative creation of open audio datasets based on Freesound

  • Tools → exploration / annotation / monitoring

28

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

29 of 203

Candidate labels nomination

  • List of keywords per class

29

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

30 of 203

Candidate labels nomination

  • List of keywords per class
  • Each class populated w/ corresponding clips

30

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

31 of 203

Candidate labels nomination

  • List of keywords per class
  • Each class populated w/ corresponding clips

Outcome: 268k clips

31

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

32 of 203

Validation task

  • Manually validate candidate labels nominated in the previous stage
  • Annotation tool w/ two phases
    1. training phase → get familiar with the class
    2. validation phase

32

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

33 of 203

Validation task

  • Internal quality assessment → get feedback for improvement
  • Features: FAQs / “Present” ratings split into PP (predominant) & PNP (not predominant) / inter-annotator agreement / loudness normalization / class prioritization

33

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

34 of 203

Validation task

  • Annotation campaign
    • divide classes according to estimated level of difficulty
    • gather annotations using crowdsourcing and hired annotators
    • 350+ raters (including 6 hired annotators)
  • Outcome: 51k clips

34

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

35 of 203

Data split

  • Split the data into development and evaluation
  • Main criteria
    • non-divisibility of uploaders

35

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

[Diagram: data per class (c0, c1, c2, ...) split into dev and eval sets]

36 of 203

Data split

  • Split the data into development and evaluation
  • Main criteria
    • non-divisibility of uploaders
    • small uploaders for evaluation set

36

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

[Diagram: data per class (c0, c1, c2, ...) split into dev and eval sets]

37 of 203

Data split

  • Split the data into development and evaluation
  • Main criteria
    • non-divisibility of uploaders
    • small uploaders for evaluation set

  • Procedure:
    1. sort uploaders by size and diversity
    2. allocate data to evaluation set

  • Outcome: two candidate subsets disjoint in terms of uploaders

37

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

[Diagram: data per class (c0, c1, c2, ...) split into dev and eval sets]

38 of 203

Refinement task for evaluation set

  • Validation of candidate labels proposed by a simple nomination system
    • labels correct but potentially incomplete
    • goal → exhaustive labelling
  • Adding missing labels is complex
    • hired annotators → deep understanding of ontology & FAQs
    • interface for exploration of large-vocabulary

38

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

39 of 203

Refinement task for evaluation set

  • Annotation tool w/ two phases
    • training phase
    • refinement phase
      • review existing labels
      • add any missing labels

39

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

Favory et al., Facilitating the manual annotation of sounds when using large taxonomies, FRUCT 2018

40 of 203

After manual annotation

  • Candidate development set → correct but potentially incomplete labels
  • Candidate evaluation set → exhaustively-labeled
  • 51k clips / 395 classes

40

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

[Photos: the people behind FSD50K - hired annotators, the author, and collaborators]

41 of 203

Post-processing

  • Determine FSD50K vocabulary (200 classes)
    • merge small leaf nodes with their parents

41

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

42 of 203

Post-processing

  • Determine FSD50K vocabulary (200 classes)
    • merge small leaf nodes with their parents
  • Balancing development/evaluation sets

42

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

43 of 203

Post-processing

  • Determine FSD50K vocabulary (200 classes)
    • merge small leaf nodes with their parents
  • Balancing development/evaluation sets
  • Development = train + validation
    • minimize WC (same uploader & class)
    • allow BC (same uploader / ≠ class)

43

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

[Diagram: development data per class (c0, c1, ...) split into train and val, illustrating WC and BC contamination]

44 of 203

Post-processing

  • Determine FSD50K vocabulary (200 classes)
    • merge small leaf nodes with their parents
  • Balancing development/evaluation sets
  • Development = train + validation
    • minimize WC (same uploader & class)
    • allow BC (same uploader / ≠ class)
  • Hierarchical label propagation

44

The Freesound Dataset 50k (FSD50K)

The creation of FSD50K

[Diagram: development data per class (c0, c1, ...) split into train and val, illustrating WC and BC contamination]

45 of 203

Freesound Dataset 50k (FSD50K)

  • 51k audio clips / 108h audio / 200 sound event classes
  • Human sounds, sounds of things, animals, natural sounds, music, ...

45

The Freesound Dataset 50k (FSD50K)

FSD50K description

Clips / duration:
  development set: train 36,796 clips / 70.5 hours; val 4,170 clips / 9.9 hours
  eval set: 10,231 clips / 27.9 hours

46 of 203

Freesound Dataset 50k (FSD50K)

  • 51k audio clips / 108h audio / 200 sound event classes
  • Human sounds, sounds of things, animals, natural sounds, music, ...
  • Metadata (raw annotations, sound predominance, Freesound metadata, FAQs)
  • Creative Commons licenses

46

The Freesound Dataset 50k (FSD50K)

FSD50K description

Clips / duration:
  development set: train 36,796 clips / 70.5 hours; val 4,170 clips / 9.9 hours
  eval set: 10,231 clips / 27.9 hours

47 of 203

Limitations

  • Label noise → refinement task to quantify label noise after the validation task (on 12k clips)
    • 94.3% of incoming labels → verified as correct
    • 50.9% of clips → contain additional unlabeled sound material (missing labels)
  • Data imbalance
  • Data bias in development set
  • Some parts of the vocabulary are not very specific

47

The Freesound Dataset 50k (FSD50K)

FSD50K description

48 of 203

Impact of train/validation separation

  • Three train/validation splits
    • random sampling
    • iterative stratification
    • proposed approach
  • All 3 validation sets → similar size

48

The Freesound Dataset 50k (FSD50K)

Experiments

49 of 203

Impact of train/validation separation

  • Three train/validation splits
    • random sampling
    • iterative stratification
    • proposed approach
  • All 3 validation sets → similar size
  • Main difference → uploaders “shared” between train and validation

[Diagram: random sampling / iterative stratification → both WC & BC contamination; proposed approach → minimize WC contamination]

49

The Freesound Dataset 50k (FSD50K)

Experiments

50 of 203

Impact of train/validation separation

  • Ignore WC contamination → validation perf is overly optimistic

50

The Freesound Dataset 50k (FSD50K)

Experiments

51 of 203

Impact of train/validation separation

  • Ignore WC contamination → validation perf is overly optimistic
  • Minimize WC contamination → validation perf is a good proxy for evaluation perf

51

The Freesound Dataset 50k (FSD50K)

Experiments

52 of 203

Summary

  • Dataset creation
    • human validation of nominated candidate labels
    • refinement process → exhaustive labelling of the evaluation set
  • Crowdsourcing & recruited trained annotators
  • Special emphasis on careful curation of evaluation set → unprecedented

52

The Freesound Dataset 50k (FSD50K)

Summary

53 of 203

Impact

  • Largest fully-open dataset of human-labeled sound events, and second largest overall after AudioSet

53

The Freesound Dataset 50k (FSD50K)

Summary

54 of 203

Impact

  • Largest fully-open dataset of human-labeled sound events, and second largest overall after AudioSet
  • First paper in “Dataset Papers” section of IEEE Challenges & Data Collection

54

The Freesound Dataset 50k (FSD50K)

Summary

55 of 203

Impact

  • Largest fully-open dataset of human-labeled sound events, and second largest overall after AudioSet
  • First paper in “Dataset Papers” section of IEEE Challenges & Data Collection
  • Enabled 6 audio challenges
    • DCASE 2018 Task 2, 2019 Task 2, 2019 Task 4, 2020 Task 4, 2021 Task 4, HEAR Challenge
  • Source for creation of 7 datasets
    • FUSS, GISE-51, USM-SED, Divide_and_Remaster, LibriFSD50K, FSD-MIX-SED, FSD-MIX-CLIPS
  • Used in research beyond sound classification & detection
    • universal sound separation, speech enhancement, few-shot learning, representation learning, federated learning

55

The Freesound Dataset 50k (FSD50K)

Summary

56 of 203

Outline

  • Introduction
  • The Freesound Dataset 50k (FSD50K)
  • Improving Sound Event Classification by Increasing Shift Invariance in Convolutional Neural Networks
  • Training Sound Event Classifiers With Noisy Labels
  • Self-Supervised Learning of Sound Event Representations
  • DCASE Challenge Tasks Organization
  • Summary and Conclusions

56

57 of 203

3. Improving Sound Event Classification by Increasing Shift Invariance in

Convolutional Neural Networks

Fonseca, E., Ferraro, A., Serra, X., Improving sound event classification by increasing shift invariance in convolutional neural networks. Under review in IEEE Signal Processing Letters, 2021.

  • Motivation
  • Method
  • Experiments
  • Takeaways

58 of 203

Why?

  • CNNs are one of the cornerstones of sound classification
  • One of the commonly-assumed properties of CNNs → shift invariance
    • output predictions are not affected by small shifts in input signal
  • Recent works in computer vision uncover
    • small shifts can change network’s predictions substantially

58

Improving SET by Increasing Shift Invariance in CNNs

Motivation

Azulay, A. & Weiss, Y., Why do deep convolutional networks generalize so poorly to small image transformations? JMLR 2018

59 of 203

Is this really a problem?

  • Apply time/freq shifts to input spectrograms of {1,3,5} frames/bands
    • analyze network’s robustness against shifts
  • Classification consistency: % of cases net predicts same top class for original and shifted
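A minimal sketch of this consistency check, assuming a trained PyTorch classifier and a batch of log-mel patches (names are illustrative; a circular roll stands in for the shifts used in the experiments):

    import torch

    def classification_consistency(model, patches, shift=1, dim=-1):
        # Fraction of patches whose top predicted class is unchanged after
        # shifting the input by `shift` bins along `dim`
        # (dim=-1: time frames, dim=-2: frequency bands).
        model.eval()
        with torch.no_grad():
            top_orig = model(patches).argmax(dim=1)
            shifted = torch.roll(patches, shifts=shift, dims=dim)  # wrap-around shift
            top_shift = model(shifted).argmax(dim=1)
        return (top_orig == top_shift).float().mean().item()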

59

Improving SET by Increasing Shift Invariance in CNNs

Motivation

60 of 203

Is this really a problem? So it seems...

  • By applying a time shift of 1 frame (10ms), top prediction changes 18% of the time

60

Improving SET by Increasing Shift Invariance in CNNs

Motivation

61 of 203

Is this really a problem? So it seems...

  • As time shifts increase, the network becomes less consistent

61

Improving SET by Increasing Shift Invariance in CNNs

Motivation

62 of 203

Is this really a problem? So it seems...

  • For freq shifts, consistency is even worse

62

Improving SET by Increasing Shift Invariance in CNNs

Motivation

63 of 203

How to increase shift invariance in CNNs?

  • One of the problems: improperly executed subsampling operations (aliasing)
    • prevalent in CNNs (strided conv or pool)
  • Techniques to improve subsampling in CNNs
    1. low-pass filter based solutions
    2. architectural changes to explicitly enforce shift invariance

63

Improving SET by Increasing Shift Invariance in CNNs

Method

64 of 203

Proposed approach: overview

  • Focus: subsampling operations within max-pooling layers

64

Improving SET by Increasing Shift Invariance in CNNs

Method

65 of 203

Proposed approach: overview

  • Focus: subsampling operations within max-pooling layers
  • Max-pooling layer with square kernel of size k and stride s
    • a unit-stride max-pooling operation with size k
    • a subsampling operation with stride s

65

Improving SET by Increasing Shift Invariance in CNNs

Method

[Diagram: x → MP(k,1) → stride-s subsampling → y_mp]

66 of 203

Proposed approach: overview

  • Focus: subsampling operations within max-pooling layers
  • Max-pooling layer with square kernel of size k and stride s
    • a unit-stride max-pooling operation with size k
    • a subsampling operation with stride s
  • We can add a low-pass filter before subsampling

66

Improving SET by Increasing Shift Invariance in CNNs

Method

[Diagram: x → MP(k,1) → stride-s → y_mp, vs. x → MP(k,1) → LPF(m,n) → stride-s → y_lpf]

67 of 203

Proposed approach: overview

  • Focus: subsampling operations within max-pooling layers
  • Max-pooling layer with square kernel of size k and stride s
    • a unit-stride max-pooling operation with size k
    • a subsampling operation with stride s
  • We add a low-pass filter before subsampling
  • We substitute naive subsampling by a more sophisticated strategy

67

Improving SET by Increasing Shift Invariance in CNNs

Method

[Diagram: x → MP(k,1) → stride-s → y_mp; x → MP(k,1) → LPF(m,n) → stride-s → y_lpf; x → MP(k,1) → APS → y_aps]

68 of 203

Low-pass filtering before subsampling

68

Improving SET by Increasing Shift Invariance in CNNs

Method

[Illustration: unit-stride max-pooling followed by subsampling]

69 of 203

Low-pass filtering before subsampling

Low-pass filters

  • Non-trainable binomial kernels (BlurPool)
  • Trainable kernel & softmax function (TLPF)
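A minimal PyTorch sketch of the BlurPool-style variant (unit-stride max-pooling, then a fixed binomial low-pass filter applied depthwise with stride s); the TLPF variant would instead learn the kernel and normalize it with a softmax. Class name, kernel size and padding mode are illustrative assumptions, not the thesis implementation:

    import math
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MaxBlurPool2d(nn.Module):
        # Anti-aliased max-pooling: MP(k, stride=1) -> binomial LPF -> stride-s subsampling.
        def __init__(self, channels, k=2, stride=2, filt_size=3):
            super().__init__()
            self.pool = nn.MaxPool2d(kernel_size=k, stride=1)
            coeffs = torch.tensor([math.comb(filt_size - 1, i) for i in range(filt_size)],
                                  dtype=torch.float32)        # e.g. [1, 2, 1]
            kernel = torch.outer(coeffs, coeffs)
            kernel = kernel / kernel.sum()                     # normalized 2D binomial kernel
            # one identical, non-trainable filter per channel (depthwise convolution)
            self.register_buffer("kernel",
                                 kernel.view(1, 1, filt_size, filt_size).repeat(channels, 1, 1, 1))
            self.stride, self.channels, self.pad = stride, channels, filt_size // 2

        def forward(self, x):
            x = self.pool(x)                                   # dense (unit-stride) max-pooling
            x = F.pad(x, [self.pad] * 4, mode="reflect")
            return F.conv2d(x, self.kernel, stride=self.stride, groups=self.channels)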

69

Improving SET by Increasing Shift Invariance in CNNs

Method

Zhang, Making convolutional networks shift-invariant again, ICML 2019

[Illustration: max-pooling → low-pass filtering with LPF(m,n) → subsampling]

70 of 203

Adaptive polyphase sampling

  • APS is a downsampling mechanism
    • addresses lack of shift invariance caused by subsampling operations
  • Observation: subsampling a patch and its shifted-by-one-bin version
    • can yield different results when bins are always sampled at the same fixed positions

70

Improving SET by Increasing Shift Invariance in CNNs

Method

Chaman & Dokmanic, Truly shift-invariant convolutional neural networks, CVPR 2021

[Illustration: naive stride-s subsampling after MP(k,1), applied to an original and a shifted feature map]

71 of 203

Adaptive polyphase sampling

  • When subsampling a feature map, instead of always using the same grid
    • multiple candidate grids could actually be used

71

Improving SET by Increasing Shift Invariance in CNNs

Method


72 of 203

Adaptive polyphase sampling

  • When subsampling a feature map, instead of always using the same grid
    • multiple candidate grids could actually be used
  • Subsampling operation with stride s = 2
    • four possible grids → four possible candidate subsampled feature maps

72

Improving SET by Increasing Shift Invariance in CNNs

Method


73 of 203

Adaptive polyphase sampling

  • APS → select subsampling grid adaptively by maximizing output energy (l1 norm)
    • grid follows shift at the input
    • increase robustness to input shifts
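A rough sketch of APS for one stride-s subsampling step, applied to the dense (unit-stride) max-pooled feature map; for simplicity one grid is chosen for the whole batch, whereas the original method selects per example:

    import torch

    def adaptive_polyphase_sampling(x, stride=2):
        # x: (batch, channels, H, W) feature map after unit-stride max-pooling.
        # Returns the polyphase component (candidate grid) with maximum l1 norm,
        # so the chosen grid follows shifts at the input.
        best, best_norm = None, None
        for i in range(stride):
            for j in range(stride):
                cand = x[:, :, i::stride, j::stride]   # one candidate subsampling grid
                norm = cand.abs().sum()                # l1 norm of the candidate
                if best_norm is None or norm > best_norm:
                    best, best_norm = cand, norm
        return best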

73

Improving SET by Increasing Shift Invariance in CNNs

Method

Chaman & Dokmanic, Truly shift-invariant convolutional neural networks, CVPR 2021

[Illustration: naive stride-s subsampling vs. adaptive polyphase sampling (APS) after MP(k,1), for original and shifted feature maps]

74 of 203

Experimental setup

  • Task: FSD50K multi-label classification
  • Baseline model: VGG41
    • features several max-pooling layers
  • VGG42 → width x2

74

Improving SET by Increasing Shift Invariance in CNNs

Experiments

[Architecture diagram, 1.2M weights: 3x (2x Conv2D(3,3) + BN + ReLU → Max-Pool 2x2) → 2x Conv2D(3,3) + BN + ReLU → Global-Pool max(avg(freq)) → Dense(200) + Sigmoid]

75 of 203

Evaluation using a small model

  • All methods outperform the baseline system

75

Improving SET by Increasing Shift Invariance in CNNs

Experiments

76 of 203

Evaluation using a small model

  • All methods outperform the baseline system
  • Low-pass filtering feature maps is helpful
  • Trainable vs. non-trainable? not critical, yet TLPF → slightly higher mAP

76

Improving SET by Increasing Shift Invariance in CNNs

Experiments

77 of 203

Evaluation using a small model

  • All methods outperform the baseline system
  • Low-pass filtering feature maps is helpful
  • Trainable vs. non-trainable? not critical, yet TLPF → slightly higher mAP
  • APS l1 and TLPF 5x5 → on par performance

77

Improving SET by Increasing Shift Invariance in CNNs

Experiments

78 of 203

Evaluation using a small model

  • All methods outperform the baseline system
  • Low-pass filtering feature maps is helpful
  • Trainable vs. non-trainable? not critical, yet TLPF → slightly higher mAP
  • APS l1 and TLPF 5x5 → on par performance
  • Combining them → small further performance boost

78

Improving SET by Increasing Shift Invariance in CNNs

Experiments

79 of 203

Evaluation using regularization and a larger model

  • Why mixup?
    • improve generalization & analyze performance w/ strong regularizer
  • Boosts remain solid → proposed methods help beyond what regularization provides

79

Improving SET by Increasing Shift Invariance in CNNs

Experiments

80 of 203

Evaluation using regularization and a larger model

  • VGG42 + mixup → proposed methods beneficial in more competitive conditions

80

Improving SET by Increasing Shift Invariance in CNNs

Experiments

81 of 203

Trainable vs. fixed LPF

  • Substitute TLPF by BlurPool in best system
    • performance differences are not large, yet TLPF → slightly higher mAP

81

Improving SET by Increasing Shift Invariance in CNNs

Experiments

82 of 203

Trainable vs. fixed LPF

  • BlurPool → all low-pass filters are fixed by construction
  • TLPF → multiple filters are learned w/ different patterns

82

Improving SET by Increasing Shift Invariance in CNNs

Experiments

Zhang, Making convolutional networks shift-invariant again, ICML 2019

[Filter visualizations: BlurPool (fixed binomial kernels) vs. TLPF (learned kernels with varied patterns)]

83 of 203

Characterizing the increase of shift invariance

  • “proposed” = TLPF & APS → higher robustness to the applied shifts
    • higher classification consistency in all cases

83

Improving SET by Increasing Shift Invariance in CNNs

Experiments

84 of 203

Characterizing the increase of shift invariance

[Plots: classification consistency vs. time frames / frequency bands shifted; time shift @ water dripping, frequency shift @ keyboard]

84

Improving SET by Increasing Shift Invariance in CNNs

Experiments


85 of 203

Characterizing the increase of shift invariance

[Plots: classification consistency vs. shift; time shift @ water dripping, frequency shift @ keyboard]

85

Improving SET by Increasing Shift Invariance in CNNs

Experiments

86 of 203

Comparison with previous work

  • Our best system obtains state-of-the-art mAP, outperforming
    • baseline models by a large margin
    • PSLA approach (collection of training techniques)
    • slightly outperforming Transformer-based approaches

86

Improving SET by Increasing Shift Invariance in CNNs

Experiments

Gong et al., PSLA: Improving audio event classification with pretraining, sampling, labeling, and aggregation. arXiv 2021

Verma & Berger, Audio transformers: Transformer architectures for large scale audio understanding. Adieu convolutions, arXiv 2021

87 of 203

Takeaways

  • Models evaluated → only partial shift invariance
  • Inserting proposed pooling methods into VGG variants
    • higher robustness to time/frequency shifts
    • recognition boosts

87

Improving SET by Increasing Shift Invariance in CNNs

Takeaways

88 of 203

Essentia TensorFlow models

  • In progress → adding models to the Essentia TF model zoo

88

Improving SET by Increasing Shift Invariance in CNNs

Takeaways

89 of 203

Outline

  • Introduction
  • The Freesound Dataset 50k (FSD50K)
  • Improving Sound Event Classification by Increasing Shift Invariance in Convolutional Neural Networks
  • Training Sound Event Classifiers With Noisy Labels
  • Self-Supervised Learning of Sound Event Representations
  • DCASE Challenge Tasks Organization
  • Summary and Conclusions

89

90 of 203

4. Training Sound Event Classifiers

with Noisy Labels

Fonseca, E., Plakal, M., Ellis, D. P. W., Font, F., Favory, X., Serra, X., Learning sound event classifiers from web audio with noisy labels. In Proceedings of the 44th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.

Fonseca, E., Font, F., Serra, X., Model-agnostic approaches to handling noisy labels when training sound event classifiers. In Proceedings of the 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2019.

* Fonseca, E., Hershey, S., Plakal, M., Ellis, D. P. W., Jansen, A., Moore, R. C., Addressing missing labels in large-scale sound event recognition using a teacher-student framework with loss masking. IEEE Signal Processing Letters, 2020.

* Work done during an internship at Google Research.

  • Motivation
  • FSDnoisy18k
  • Loss functions
  • Addressing the problem of missing labels
  • Takeaways

91 of 203

Label noise in sound event classification

  • Performance decrease / increased complexity

91

Training Sound Event Classifiers with Noisy Labels

Motivation

[Chart: hours of audio vs. labelling precision. ESC-50 (3 h), Chime-home (7 h), UrbanSound8K (9 h): exhaustive labelling; FSD50K (108 h), AudioSet (5800 h): less precise labelling]

92 of 203

FSDnoisy18k

  • 20 classes / 18k audio clips / 42.5h of audio

92

Training Sound Event Classifiers with Noisy Labels

FSDnoisy18k

93 of 203

The creation of FSDnoisy18k

  • Freesound
    • audio content & metadata (tags)

  • AudioSet Ontology
    • 20 class labels

93

Training Sound Event Classifiers with Noisy Labels

FSDnoisy18k

94 of 203

The creation of FSDnoisy18k

  • List of keywords for every class
  • Each class populated w/ corresponding clips

94

Training Sound Event Classifiers with Noisy Labels

FSDnoisy18k

95 of 203

The creation of FSDnoisy18k

95

Training Sound Event Classifiers with Noisy Labels

FSDnoisy18k

[Diagram: candidate clips per class (e.g., Rain: 1234.wav, 5678.wav, 7654.wav, ...) → validation task → clean train set & test set; non-validated candidates → noisy train set]

96 of 203

Label noise distribution in FSDnoisy18k

  • in-vocabulary (IV) → events that are part of our target class set (closed-set)
  • out-of-vocabulary (OOV) → events not covered by the class set (open-set)

96

Training Sound Event Classifiers with Noisy Labels

FSDnoisy18k

97 of 203

FSDnoisy18k

  • 20 classes / 18k clips / 42.5 h
  • Single-label data
  • Proportion train_noisy / train_clean = 90% / 10%
  • Per-class varying degree of types and amount of label noise

97

Training Sound Event Classifiers with Noisy Labels

FSDnoisy18k

Clips / duration:
  train set: noisy 15,813 clips / 38.8 h; clean 1,772 clips / 2.4 h
  test set: 947 clips / 1.4 h

98 of 203

Noise-robust loss functions

  • Default loss function for multi-class setting → Categorical Cross-Entropy (CCE)

98

Training Sound Event Classifiers with Noisy Labels

Loss functions

CCE = −Σ_k y_k log(ŷ_k), where y are the target labels and ŷ the predictions

99 of 203

Noise-robust loss functions

  • CCE is sensitive to label noise: emphasis on difficult examples (weighting)
    • beneficial for clean data
    • detrimental for noisy data

99

Training Sound Event Classifiers with Noisy Labels

Loss functions

100 of 203

Noise-robust loss functions

  • Generalized cross-entropy loss intuition
    • CCE → sensitive to noisy labels (weighting)
    • Mean Absolute Error (MAE)
      • avoid weighting
      • difficult convergence

100

Training Sound Event Classifiers with Noisy Labels

Loss functions

Ghosh et al., Robust loss functions under label noise for deep neural networks, AAAI 2017

101 of 203

Noise-robust loss functions

  • Lq loss is a generalization of CCE and MAE
    • q = 1 → Lq = MAE
    • q → 0 → Lq = CCE
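As a reference, a small PyTorch sketch of the Lq (generalized cross-entropy) loss for the single-label setting; q=0.7 is an illustrative default, not necessarily the value used in the thesis experiments:

    import torch
    import torch.nn.functional as F

    def lq_loss(logits, targets, q=0.7):
        # L_q = (1 - p_y^q) / q, with p_y the predicted probability of the labeled class.
        # q -> 0 recovers CCE; q = 1 yields MAE-like behavior.
        probs = F.softmax(logits, dim=1)
        p_y = probs.gather(1, targets.unsqueeze(1)).squeeze(1).clamp_min(1e-7)
        return ((1.0 - p_y.pow(q)) / q).mean()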

101

Training Sound Event Classifiers with Noisy Labels

Loss functions

Zhang & Sabuncu, Generalized cross-entropy loss for training deep neural networks with noisy labels, NeurIPS 2018

102 of 203

Noise-robust loss functions

102

Training Sound Event Classifiers with Noisy Labels

Loss functions

103 of 203

Noise-robust loss functions

  • Supervision by user-provided tags can be useful for sound event classification

103

Training Sound Event Classifiers with Noisy Labels

Loss functions

104 of 203

Noise-robust loss functions

  • Supervision by user-provided tags can be useful for sound event classification
  • Lq works well for sound classification tasks with OOV (and some IV) noise

104

Training Sound Event Classifiers with Noisy Labels

Loss functions

L_soft defined in Reed et al., Training deep neural networks on noisy labels with bootstrapping, ICLR 2015

105 of 203

Noise-robust loss functions

  • noisy set & Lq → 1.9% boost (little engineering effort)
  • noisy set & 2.4h curated data → 5.1% boost (significant manual effort)

105

Training Sound Event Classifiers with Noisy Labels

Loss functions

106 of 203

Loss-based instance selection

  • Deep networks in presence of label noise
    • problem is more severe as learning progresses

106

Training Sound Event Classifiers with Noisy Labels

Loss functions

Arpit et al., A closer look at memorization in deep networks, ICML 2017

[Diagram: learning over epochs; before epoch n1 the network learns easy & general patterns, afterwards it memorizes label noise]

107 of 203

Loss-based instance selection

  • Learning process as a two-stage process
  • After n1 epochs
    • model has converged to some extent → use it for instance selection
      • identify instances with large training loss
      • ignore them for gradient update

107

Training Sound Event Classifiers with Noisy Labels

Loss functions

[Diagram: stage 1 (up to epoch n1): regular training with Lq]

108 of 203

Loss-based instance selection

  • Approach 1
    • discard large-loss instances from each mini-batch
    • dynamically at every iteration
    • time-dependent loss function
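A sketch of this per-mini-batch variant: after epoch n1, compute unreduced losses, drop the largest ones, and backpropagate only through the rest (the drop fraction is an illustrative hyperparameter):

    import torch

    def pruned_batch_loss(per_example_loss, drop_frac=0.1):
        # per_example_loss: (batch,) tensor of unreduced losses (e.g. Lq per example).
        # Discard the `drop_frac` largest-loss instances from the gradient update.
        batch = per_example_loss.numel()
        keep = max(1, int(round(batch * (1.0 - drop_frac))))
        kept, _ = torch.topk(per_example_loss, keep, largest=False)  # smallest-loss examples
        return kept.mean()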

108

Training Sound Event Classifiers with Noisy Labels

Loss functions

[Diagram: stage 1 (up to epoch n1): regular training with Lq; stage 2: discard large-loss instances at each mini-batch]

109 of 203

Loss-based instance selection

  • Approach 2
    • use checkpoint to predict scores on whole dataset
    • convert to loss values
    • prune dataset, keeping a subset to continue learning
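A sketch of this pruning step, assuming a data loader that also yields example indices and an unreduced loss function (both illustrative assumptions):

    import torch

    def prune_dataset(model, loader, loss_fn, drop_frac=0.1, device="cpu"):
        # Score every training example with the current checkpoint and keep the
        # (1 - drop_frac) fraction with the smallest loss for further training.
        model.eval()
        losses, indices = [], []
        with torch.no_grad():
            for x, y, idx in loader:                           # loader assumed to yield indices too
                per_ex = loss_fn(model(x.to(device)), y.to(device))  # unreduced, shape (batch,)
                losses.append(per_ex.cpu())
                indices.append(idx)
        losses, indices = torch.cat(losses), torch.cat(indices)
        keep = int(round(len(losses) * (1.0 - drop_frac)))
        kept_idx = indices[torch.topk(losses, keep, largest=False).indices]
        return kept_idx                                        # e.g. to build a pruned Subset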

109

Training Sound Event Classifiers with Noisy Labels

Loss functions

[Diagram: stage 1 (up to epoch n1): regular training with Lq; dataset pruning; stage 2: regular training with Lq on the pruned set]

110 of 203

Loss-based instance selection

110

Training Sound Event Classifiers with Noisy Labels

Loss functions

111 of 203

Loss-based instance selection

  • Pruning dataset slightly outperforms discarding at mini-batch

111

Training Sound Event Classifiers with Noisy Labels

Loss functions

112 of 203

Loss-based instance selection

  • Pruning dataset slightly outperforms discarding at mini-batch
  • Pruning dataset is more stable

112

Training Sound Event Classifiers with Noisy Labels

Loss functions

113 of 203

Addressing the problem of missing labels

113

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

114 of 203

Addressing the problem of missing labels

  • AudioSet creation process → similar to FSD50K’s
    • nomination & validation process can fail → missing labels

114

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

115 of 203

Addressing the problem of missing labels

  • AudioSet creation process → similar to FSD50K’s
    • nomination & validation process can fail → missing labels
  • Labels in AudioSet
    • explicit (positive or negative) → received human validation
    • implicit negative → no human validation

115

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

116 of 203

Addressing the problem of missing labels

  • AudioSet creation process → similar to FSD50K’s
    • nomination & validation process can fail → missing labels
  • Labels in AudioSet
    • explicit (positive or negative) → received human validation
    • implicit negative → no human validation → missing “Present” labels?

116

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

117 of 203

Addressing the problem of missing labels

  • AudioSet creation process → similar to FSD50K’s
    • nomination & validation process can fail → missing labels
  • Labels in AudioSet
    • explicit (positive or negative) → received human validation
    • implicit negative → no human validation → missing “Present” labels?

117

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

118 of 203

Addressing the problem of missing labels

  • AudioSet creation process → similar to FSD50K’s
    • nomination & validation process can fail → missing labels
  • Labels in AudioSet
    • explicit (positive or negative) → received human validation
    • implicit negative → no human validation → missing “Present” labels?

118

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

119 of 203

Teacher-student framework

  • Skeptical teacher → detect potential missing labels per class
    • train teacher

119

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

[Diagram: train set with labels y → train teacher model]

120 of 203

Teacher-student framework

  • Skeptical teacher → detect potential missing labels per class
    • train teacher
    • predict scores for train set

120

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

[Diagram: train set with labels y → train teacher model → predict scores for all train clips over the 527 labels]

121 of 203

Teacher-student framework

  • Skeptical teacher → detect potential missing labels per class
    • train teacher
    • predict scores for train set
    • hypothesis: top-scored negatives → missing “Present” labels

121

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

[Diagram: per-class score distribution over implicit negatives → discard top-scored negatives; explicit labels untouched]

122 of 203

Teacher-student framework

  • Skeptical teacher → detect potential missing labels per class
    • train teacher
    • predict scores for train set
    • hypothesis: top-scored negatives → missing “Present” labels
    • flag top-scored negatives in new enhanced label set

122

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

[Diagram: teacher scores → flag/discard top-scored implicit negatives → enhanced label set]

123 of 203

Teacher-student framework

  • Train student model using enhanced label set & loss masking

123

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

[Diagram: train set with enhanced label set → train student model with loss masking → predict on eval set → lwlrap]

124 of 203

Teacher-student framework

  • Train student model using enhanced label set & loss masking
    • create a binary mask using the information encoded in enhanced label set

124

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

[Diagram: train set with enhanced label set → train student model with loss masking → predict on eval set → lwlrap]

125 of 203

Teacher-student framework

  • Train student model using enhanced label set & loss masking
    • create a binary mask using the information encoded in enhanced label set

    • apply mask to negative term of loss → discard loss contributions of missing labels
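A minimal sketch of the masked binary cross-entropy used for the student; `mask_neg` is assumed to flag, per clip and class, the implicit negatives identified by the teacher as likely missing "Present" labels (names are illustrative):

    import torch

    def masked_bce(logits, labels, mask_neg):
        # logits, labels, mask_neg: (batch, classes); labels are multi-hot targets.
        # mask_neg = 1 where a negative label is flagged as a potential missing label,
        # so its contribution to the negative term of the loss is discarded.
        probs = torch.sigmoid(logits).clamp(1e-7, 1 - 1e-7)
        pos_term = labels * torch.log(probs)
        neg_term = (1 - labels) * torch.log(1 - probs) * (1 - mask_neg)
        return -(pos_term + neg_term).mean()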

125

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

[Diagram: train set with enhanced label set → train student model with loss masking → predict on eval set → lwlrap]

126 of 203

Experimental setup

  • AudioSet → two train sets
    • similar class distribution
    • size proportion 1:5
  • Two models of different capacity

126

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

127 of 203

Performance vs. discarded negatives

  • Train a ResNet-50 teacher
  • Generate 18 new label sets. For each:
    • different threshold tc ∈ [0, 20]%
    • train a student & report eval

127

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

128 of 203

Performance vs. discarded negatives

  • Train a ResNet-50 teacher
  • Generate 18 new label sets. For each:
    • different threshold tc ∈ [0, 20]%
    • train a student & report eval
  • Each point → one experiment trial

128

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

129 of 203

Insight from results

  • Method yields boosts in all cases
    • best operating points discarding 2 - 6%

129

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

130 of 203

Insight from results

  • Method yields boosts in all cases
    • best operating points discarding 2 - 6%
  • Main pattern → consistent steep increase
    • most of the boost: removing just ≈1%

130

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

131 of 203

Insight from results

  • Method yields boosts in all cases
    • best operating points discarding 2 - 6%
  • Main pattern → consistent steep increase
    • most of the boost: removing just ≈1%
  • Effect of train set size
    • boost is higher when train set is smaller
    • effect still observable with 6800h+

131

Training Sound Event Classifiers with Noisy Labels

Addressing the problem of missing labels

132 of 203

Summary & takeaways

  • FSDnoisy18k to support label noise research
    • empirical characterization of label noise
    • large amounts of Freesound audio & tags → feasible for training sound event classifiers

132

Training Sound Event Classifiers with Noisy Labels

Takeaways

133 of 203

Summary & takeaways

  • FSDnoisy18k to support label noise research
    • empirical characterization of label noise
    • large amounts of Freesound audio & tags → feasible for training sound event classifiers
  • Simple model-agnostic approaches → improve performance
    • noise-robust loss functions → effective in mitigating effect of label noise

133

Training Sound Event Classifiers with Noisy Labels

Takeaways

134 of 203

Summary & takeaways

  • FSDnoisy18k to support label noise research
    • empirical characterization of label noise
    • large amounts of Freesound audio & tags → feasible for training sound event classifiers
  • Simple model-agnostic approaches → improve performance
    • noise-robust loss functions → effective in mitigating effect of label noise
    • rejecting noisy samples during training → more effective

134

Training Sound Event Classifiers with Noisy Labels

Takeaways

135 of 203

Summary & takeaways

  • FSDnoisy18k to support label noise research
    • empirical characterization of label noise
    • large amounts of Freesound audio & tags → feasible for training sound event classifiers
  • Simple model-agnostic approaches → improve performance
    • noise-robust loss functions → effective in mitigating effect of label noise
    • rejecting noisy samples during training → more effective
    • addressing missing labels → pathology in AudioSet labelling
      • boost is higher when train set is smaller, but still observable with massive data

135

Training Sound Event Classifiers with Noisy Labels

Takeaways

136 of 203

Outline

  • Introduction
  • The Freesound Dataset 50k (FSD50K)
  • Improving Sound Event Classification by Increasing Shift Invariance in Convolutional Neural Networks
  • Training Sound Event Classifiers With Noisy Labels
  • Self-Supervised Learning of Sound Event Representations
  • DCASE Challenge Tasks Organization
  • Summary and Conclusions

136

137 of 203

5. Self-Supervised Learning of

Sound Event Representations

^ Fonseca, E., Ortego, D., McGuinness, K., O’Connor, N. E., Serra, X., Unsupervised contrastive learning of sound event representations. In Proceedings of the 46th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.

^ Work done in collaboration with Dublin City University.

* Fonseca, E., Jansen, A., Ellis, D. P. W., Wisdom, S., Tagliasacchi, M., Hershey, J. R., Plakal, M., Hershey, S., Moore, R. C., Serra, X., Self-supervised learning from automatically separated sound scenes. In Proceedings of the 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2021.

* “Best Audio Representation Learning Paper Award” at WASPAA 2021. Work done during an internship at Google Research.

  • Motivation
  • Similarity maximization for sound event representation learning
  • Representation learning from automatically separated sound scenes
  • Takeaways

138 of 203

Why?

  • Common scenario in sound event research
    • few manually-labeled data but abundant unlabeled data
  • Self-supervised learning
    • proxy learning task
      • learn mapping from inputs to low dimensional representations
    • use representations for downstream tasks e.g. classification

138

Self-Supervised Learning of Sound Event Representations

Motivation

139 of 203

Self-supervised contrastive representation learning

  • Contrastive learning is learning by comparing
    • we compare pairs of input examples
      • positive pairs of similar inputs
      • negative pairs of unrelated inputs
  • Goal is an embedding space where representations …
    • of similar examples → close together
    • of dissimilar examples → further away

139

Self-Supervised Learning of Sound Event Representations

Motivation

140 of 203

Building a proxy learning task

To compare pairs of positive examples:

  • How to generate the pairs of positive examples?
    • Composition of data augmentation methods
  • Once generated, how to compare them?
    • Similarity maximization

140

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning

141 of 203

Building a proxy learning task

To compare pairs of positive examples:

  • How to generate the pairs of positive examples?
    • Composition of data augmentation methods
  • Once generated, how to compare them?
    • Proxy task of similarity maximization

141

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning

142 of 203

Proposed approach: overview

  • Similarity maximization, inspired by SimCLR
    • maximize similarity between differently augmented views of sound events
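For reference, a compact sketch of the SimCLR-style objective assumed here (normalized temperature-scaled cross-entropy over the metric embeddings z produced by the heads; the temperature value is illustrative):

    import torch
    import torch.nn.functional as F

    def nt_xent_loss(z1, z2, temperature=0.1):
        # z1, z2: (N, dim) metric embeddings of the two views of each clip.
        # Each view's positive is the other view of the same clip; the remaining
        # 2N - 2 embeddings in the batch act as negatives.
        z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)       # (2N, dim)
        sim = z @ z.t() / temperature                            # scaled cosine similarities
        n = z1.size(0)
        sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool, device=z.device), float("-inf"))
        targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
        return F.cross_entropy(sim, targets)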

142

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning

Chen et al., A simple framework for contrastive learning of visual representations, ICML 2020

[Diagram: clip → TPS → mix-back → augmentation front-end (DA, DA′) → shared-weight encoders → heads]

143 of 203

Proposed approach: Temporal Proximity sampling

  • Sample two patches at random within clip log-mel spectrogram
    • natural data augmentation
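A small sketch of temporal proximity sampling: two fixed-length patches are drawn at independent random start frames from the same clip's log-mel spectrogram (the patch length is an illustrative assumption, and the clip is assumed to be at least that long):

    import torch

    def temporal_proximity_sample(logmel, patch_frames=96):
        # logmel: (freq, time) spectrogram of one clip.
        # Returns two patches sampled at independent random positions.
        t = logmel.size(1)
        starts = torch.randint(0, max(1, t - patch_frames + 1), (2,))
        return (logmel[:, starts[0]:starts[0] + patch_frames],
                logmel[:, starts[1]:starts[1] + patch_frames])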

143

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning


144 of 203

Proposed approach: mix-back

  • Mix incoming patch with a background patch
    • reduce mutual information & retain semantics by sound transparency
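A hedged sketch of the idea behind mix-back: the incoming patch is mixed with an energy-matched background patch from another clip, with a small random weight so the foreground stays dominant (the mixing domain, weighting and energy adjustment here are simplifying assumptions, not the exact thesis formulation):

    import torch

    def mix_back(patch, background, beta=0.4):
        # patch, background: T-F patches of equal shape.
        # lam ~ U(0, beta) keeps the incoming (foreground) patch dominant.
        lam = torch.rand(()).item() * beta
        bg = background * (patch.norm() / background.norm().clamp_min(1e-8))  # rough energy match
        return (1.0 - lam) * patch + lam * bg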

144

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning

[Diagram: mix-back blocks highlighted in the pipeline, after TPS]

145 of 203

Proposed approach: other data augmentations

  • Simple methods for on-the-fly computation on T-F patches
  • Random resized cropping, compression, Gaussian noise addition, specAugment, random time/frequency shifts
  • Hyperparameters randomly sampled from a distribution

145

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning

Park et al., SpecAugment: A simple data augmentation method for automatic speech recognition. InterSpeech 2019

[Diagram: augmentation front-end (DA, DA′) highlighted in the pipeline]

146 of 203

Proposed approach: encoder and head

  • Convolutional encoder → extract low-dimensional embeddings h (for downstream tasks)
  • MLP head → map h to metric embedding z

146

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning

[Diagram: shared-weight encoders and heads highlighted in the pipeline]

147 of 203

Proposed approach: contrastive loss

147

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning

Chen et al., A simple framework for contrastive learning of visual representations, ICML 2020

[Diagram: full pipeline; the contrastive loss (normalized temperature-scaled cross-entropy, as in SimCLR) is computed on the head outputs]

148 of 203

Evaluation using FSDnoisy18k

  • Unsupervised representation learning
    • train on train_noisy without labels
    • validate on train_clean using labels in kNN Evaluation
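One simple way to run this kNN evaluation, assuming embeddings extracted by the frozen encoder for train_clean and their labels (cross-validated accuracy is used here as an illustrative protocol):

    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    def knn_accuracy(embeddings, labels, k=5, folds=5):
        # embeddings: (N, dim) features from the frozen encoder; labels: (N,) class indices.
        clf = KNeighborsClassifier(n_neighbors=k, metric="cosine")
        return cross_val_score(clf, embeddings, labels, cv=folds).mean()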

148

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning

149 of 203

Evaluation using FSDnoisy18k

  • Unsupervised representation learning
    • train on train_noisy without labels
    • validate on train_clean using labels in kNN Evaluation

  • Evaluation of the representation using supervised tasks
    • Model fine tuning after initializing w/ pre-trained weights on two downstream tasks
        • train on train_noisy
        • train on train_clean

149

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning

150 of 203

Ablation study: Temporal Proximity sampling

  • Best: sampling at random
  • Worst: using same patch

150

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning


151 of 203

Ablation study: mix-back

  • Lightly mixing patches with unrelated backgrounds helps
  • Adjusting patch energy is beneficial
    • foreground dominant over background

151

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning

152 of 203

Ablation study: data augmentation

  • Explore DAs individually
    • random resized cropping (RRC): small stretch in time/freq & small freq transposition
    • SpecAugment (time/freq masking)
  • Explore DA compositions
    • RRC + compression + Gaussian noise addition
    • RRC + SpecAugment

152

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning

Park et al., SpecAugment: A simple data augmentation method for automatic speech recognition. InterSpeech 2019

153 of 203

Evaluation of learned representations

  • Supervised baselines: CRNN ≈ VGG-like > ResNet-18
    • ResNet-18: too much capacity for the relatively small amount of (noisy-labeled) data

153

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning

154 of 203

Evaluation of learned representations

  • End-to-end model fine tuning
    • Goal: measure benefit wrt training from scratch in noisy- & small-data regimes
    • Unsupervised contrastive pre-training is best in all cases

154

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning

155 of 203

Evaluation of learned representations

  • End-to-end model fine tuning
    • Goal: measure benefit wrt training from scratch in noisy- & small-data regimes
    • Unsupervised contrastive pre-training is best in all cases
    • ResNet-18:
      • lowest accuracy trained from scratch (limited by data or label quality)
      • top accuracy w/ unsupervised pre-training (alleviate these problems)

155

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning

156 of 203

Takeaways

  • Successful representation learning by tuning the compound of
    • positive patch sampling, mix-back & data augmentation
  • Unsupervised contrastive pre-training can
    • mitigate the impact of data scarcity
    • increase robustness against noisy labels

156

Self-Supervised Learning of Sound Event Representations

Similarity Maximization for Sound Event Representation Learning

157 of 203

Let’s recap for our next framework

To compare pairs of positive examples:

  • How to generate the pairs of positive examples?
    • Composition of data augmentation methods
  • Once generated, how to compare them?
    • Similarity maximization

157

Self-Supervised Learning of Sound Event Representations

158 of 203

Previously, how to generate pairs of positive examples?

  • Previously, composition of data augmentation methods
    • temporal proximity sampling
    • cropping
    • artificial mixing
    • time/freq masking
    • shifts
  • Artificial & handcrafted transformations with tunable hyperparameters
  • Risk of introducing somewhat unrealistic domain shift?

158

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

159 of 203

Sound separation to generate views for contrastive learning

  • Real-world sound scenes: time-varying collections of sound events
  • Association of sound events with mixture and each other is semantically constrained
    • Not all classes co-occur naturally

159

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

160 of 203

Sound separation to generate views for contrastive learning

  • Decompose sound scene (mixture) into
    • simpler separated channels that share semantics with the mixture and with each other

160

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

[Diagram: mixture → Sound Separation → channels]

161 of 203

Sound separation to generate views for contrastive learning

  • Decompose sound scene (mixture) into
    • simpler separated channels that share semantics with the mixture and with each other
  • Unlike previous approaches to generate views, sound separation:
    • input-dependent & reduces need for parameter tuning
  • Comparing mixture vs channel meets recommended guidelines
    • mutual information between views is reduced
    • some relevant semantic information is preserved

161

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

Tian et al., What makes for good views for contrastive learning? NeurIPS 2020

[Diagram: mixture → Sound Separation → channels]

162 of 203

How to compare pairs of examples?

Two popular proxy tasks:

  • Similarity Maximization (SimCLR)
    • maximize the similarity between differently-augmented views
  • Coincidence Prediction
    • predict whether a pair of examples occur within a temporal proximity

162

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

Jansen et al., Coincidence, categorization, and consolidation: learning to recognize sounds with minimal supervision. ICASSP 2020

163 of 203

How to compare pairs of examples?

Two popular proxy tasks:

  • Similarity Maximization (SimCLR)
    • maximize the similarity between differently-augmented views
  • Coincidence Prediction
    • predict whether a pair of examples occurs within a temporal proximity
  • We propose to optimize them jointly as a multi-task objective
  • Same goal → semantically structured embedding space, pursued in different way
    • SM: co-locate representations of positives
    • CP: weaker condition → get a representation that supports coincidence prediction

163

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

164 of 203

Proposed approach: overview

164

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

[Diagram: mixture → MixIT sound separation → random channel selection → augmentation front-end (DA, DA′, DA′′, DA′′′) → encoders → similarity heads (similarity maximization) and concat → coincidence head (coincidence prediction)]

165 of 203

Augmentation Front-end

165

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

[Framework diagram repeated, highlighting the sound separation & augmentation front-end stage]

166 of 203

MixIT for unsupervised sound separation

  • Mixture invariant training (MixIT)
    • fully unsupervised
    • promising results in Universal Sound Separation
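The sketch below illustrates the core idea of the MixIT objective with a brute-force search over source-to-mixture assignments and a negative-SNR loss; it is a conceptual illustration only (the practical implementation in Wisdom et al. uses a thresholded SNR loss and trains the separation network end to end).

import itertools
import numpy as np

def neg_snr(ref, est, eps=1e-8):
    # negative signal-to-noise ratio between a reference and an estimate
    return -10.0 * np.log10(np.sum(ref ** 2) / (np.sum((ref - est) ** 2) + eps) + eps)

def mixit_loss(x1, x2, estimated_sources):
    # the separation model sees the mixture of mixtures x1 + x2 and outputs M
    # source estimates; each estimate is assigned to exactly one reference
    # mixture, and the best assignment defines the (unsupervised) loss
    m = estimated_sources.shape[0]
    best = np.inf
    for assignment in itertools.product([0, 1], repeat=m):   # 2^M binary assignments
        a = np.asarray(assignment)
        est1 = estimated_sources[a == 0].sum(axis=0)          # sources remixed into x1
        est2 = estimated_sources[a == 1].sum(axis=0)          # sources remixed into x2
        best = min(best, neg_snr(x1, est1) + neg_snr(x2, est2))
    return best

# toy usage: 4 estimated sources for two 1-second reference mixtures at 16 kHz
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=16000), rng.normal(size=16000)
print(mixit_loss(x1, x2, rng.normal(size=(4, 16000))))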

166

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

Wisdom et al., Unsupervised sound separation using mixture invariant training. NeurIPS 2020

167 of 203

A simple composition of data augmentation methods

  • Use more than one augmentation → more challenging proxy task
  • We combine sound separation with
    • Temporal proximity sampling
    • SpecAugment

167

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

Park et al., SpecAugment: A simple data augmentation method for automatic speech recognition. InterSpeech 2019

168 of 203

Proxy learning tasks

168

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

[Framework diagram repeated, highlighting the similarity maximization and coincidence prediction heads]

169 of 203

Coincidence prediction (CP)

  • Based on slowness prior: waveforms vary quickly ↔ semantics change slowly
    • stable representation to explain semantics
    • representation would support prediction of coincidence in temporal proximity

169

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

Wiskott & Sejnowski, Slow feature analysis: Unsupervised learning of invariances, Neural Computation 2002

[Diagram: two encoders → concatenated embeddings → coincidence head]

170 of 203

Coincidence prediction (CP)

  • Based on slowness prior: waveforms vary quickly ↔ semantics change slowly
    • stable representation to explain semantics
    • representation would support prediction of coincidence in temporal proximity
  • Encoder: extract low-dimensional embeddings h & concatenate pairs
  • Coincidence Head: map [h_m, h_c] to the probability that the pair is coinciding
    • binary classification task → binary cross entropy loss
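A minimal PyTorch sketch of this branch (embedding and hidden dimensions are assumptions): the two embeddings are concatenated and mapped to a coincidence logit, trained with binary cross entropy against 1 for coinciding pairs and 0 otherwise.

import torch
import torch.nn as nn

class CoincidenceHead(nn.Module):
    # maps a concatenated pair of embeddings [h_m, h_c] to a coincidence logit
    def __init__(self, emb_dim=128, hidden=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * emb_dim, hidden),
                                 nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, h_m, h_c):
        return self.net(torch.cat([h_m, h_c], dim=1)).squeeze(1)   # (N,) logits

# toy usage: 8 coinciding pairs (label 1) and 8 non-coinciding pairs (label 0)
head = CoincidenceHead()
h_m, h_c = torch.randn(16, 128), torch.randn(16, 128)
labels = torch.cat([torch.ones(8), torch.zeros(8)])
loss = nn.functional.binary_cross_entropy_with_logits(head(h_m, h_c), labels)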

170

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

[Diagram: two encoders → concatenated embeddings → coincidence head]

171 of 203

Evaluation

  • Downstream classification with a shallow model on AudioSet (mAP)
    • train & eval a shallow network on top of the learned representation
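A sketch of this protocol is shown below (assuming embeddings have already been extracted with the frozen encoder; probe size, epochs, and learning rate are illustrative): a small classifier is trained on the representation and macro-averaged mAP is reported.

import torch
import torch.nn as nn
from sklearn.metrics import average_precision_score

def train_and_eval_probe(train_emb, train_labels, eval_emb, eval_labels,
                         hidden=512, epochs=20, lr=1e-3):
    # shallow downstream classifier on frozen embeddings; labels are multi-hot
    # float tensors; returns macro-averaged mean average precision (mAP)
    probe = nn.Sequential(nn.Linear(train_emb.shape[1], hidden), nn.ReLU(),
                          nn.Linear(hidden, train_labels.shape[1]))
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):                      # full-batch training, for brevity
        opt.zero_grad()
        loss = nn.functional.binary_cross_entropy_with_logits(probe(train_emb), train_labels)
        loss.backward()
        opt.step()
    with torch.no_grad():
        scores = torch.sigmoid(probe(eval_emb)).numpy()
    return average_precision_score(eval_labels.numpy(), scores, average='macro')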

171

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

172 of 203

Sound separation for contrastive learning

  • Comparing input mixture w/ separated channels
    • better representations than using only the input mixture
    • outperforms SpecAugment (SA)

172

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

173 of 203

How about using a separation model before convergence?

  • Four audio processors → four training checkpoints of a single separation network

173

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

174 of 203

How about using a separation model before convergence?

  • All processors provide valid forms of augmentation
  • Combining some of them using an OR rule can be helpful
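One simple reading of the OR rule, sketched below under the assumption that the retained checkpoints are available as callable separation models: for each example, one checkpoint is drawn at random and its output channels are used as views.

import numpy as np

rng = np.random.default_rng(0)

def separate_with_or_rule(mixture_waveform, checkpoint_models):
    # OR-rule combination of separation-model checkpoints: per example,
    # pick one checkpoint at random and use its separated channels as views
    model = checkpoint_models[rng.integers(len(checkpoint_models))]
    return model(mixture_waveform)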

174

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

175 of 203

Jointly optimizing both proxy tasks

  • Training the entire framework: similarity maximization & coincidence prediction
    • small boosts across the board
  • Key ingredient:
    • combining the diverse processing applied by the separation model as learning progresses

175

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

176 of 203

Comparison with previous work

  • Proposed framework
    • outperforms some past approaches, including MultiModal (MM) ones
    • competitive with SOTA

176

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

Jansen et al., Unsupervised learning of semantic audio representations. ICASSP 2018

Jansen et al., Coincidence, categorization, and consolidation: learning to recognize sounds with minimal supervision. ICASSP 2020

Alayrac et al., Self-supervised multimodal versatile networks. NeurIPS 2020

Wang & van den Oord, Multi-format contrastive learning of audio representations. SAS Workshop, NeurIPS 2020

177 of 203

Takeaways

  • Sound separation → valid augmentation to generate views for contrastive learning
  • Learning to associate sound mixtures w/ their separated channels elicits semantic structure in the learned representation
  • Transformations by different checkpoints of the same separation model
    • valid augmentations for generating positives
  • Benefit in jointly training similarity maximization and coincidence prediction

177

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

178 of 203

Best Audio Representation Learning Paper Award at WASPAA 2021

178

Self-Supervised Learning of Sound Event Representations

Representation Learning from Automatically Separated Sound Scenes

179 of 203

Outline

  • Introduction
  • The Freesound Dataset 50k (FSD50K)
  • Improving Sound Event Classification by Increasing Shift Invariance in Convolutional Neural Networks
  • Training Sound Event Classifiers With Noisy Labels
  • Self-Supervised Learning of Sound Event Representations
  • DCASE Challenge Tasks Organization
  • Summary and Conclusions

179

180 of 203

6. DCASE Challenge Tasks Organization

Fonseca, E., Plakal, M., Font, F., Ellis, D. P. W., Serra, X., Audio tagging with noisy labels and minimal supervision. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE), 2019.

Fonseca, E., Plakal, M., Font, F., Ellis, D. P. W., Favory, X., Pons, J., Serra, X., General-purpose tagging of Freesound audio with AudioSet labels: Task description, dataset, and baseline. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE), 2018.

  • DCASE 2018 Task 2: General-purpose Audio Tagging of Freesound Content with AudioSet Labels
  • DCASE 2019 Task 2: Audio Tagging with Noisy Labels and Minimal Supervision

181 of 203

DCASE Challenge Tasks organization

181

DCASE Challenge Tasks Organization

182 of 203

DCASE Challenge Tasks organization

  • Design of problem formulation
  • Development of audio datasets
  • Management of Kaggle platform & discussion forums

182

DCASE Challenge Tasks Organization

183 of 203

Kaggle platform

183

DCASE Challenge Tasks Organization

184 of 203

Open knowledge

  • 2 open datasets (FSDKaggle2018 & FSDKaggle2019) and baselines
  • Exchange of ideas & papers & code on Kaggle forums
  • Code for winning solutions was released

184

DCASE Challenge Tasks Organization

185 of 203

Outline

  • Introduction
  • The Freesound Dataset 50k (FSD50K)
  • Improving Sound Event Classification by Increasing Shift Invariance in Convolutional Neural Networks
  • Training Sound Event Classifiers With Noisy Labels
  • Self-Supervised Learning of Sound Event Representations
  • DCASE Challenge Tasks Organization
  • Summary and Conclusions

185

186 of 203

7. Summary and Conclusions

  • Technical contributions
  • Other academic contributions & merits
  • Publications

187 of 203

Technical contributions

  • A comprehensive review of sound event recognition & different topics covered

187

Summary and Conclusions

188 of 203

Technical contributions

  • A comprehensive review of sound event recognition & different topics covered
  • FSD50K → the largest fully-open dataset of human-labeled sound events

188

Summary and Conclusions

189 of 203

Technical contributions

  • A comprehensive review of sound event recognition & different topics covered
  • FSD50K → the largest fully-open dataset of human-labeled sound events
  • A comprehensive characterization of FSD50K

189

Summary and Conclusions

190 of 203

Technical contributions

  • A comprehensive review of sound event recognition & different topics covered
  • FSD50K → the largest fully-open dataset of human-labeled sound events
  • A comprehensive characterization of FSD50K
  • Architectural modifications to increase shift invariance in CNNs

190

Summary and Conclusions

191 of 203

Technical contributions

  • A comprehensive review of sound event recognition & different topics covered
  • FSD50K → the largest fully-open dataset of human-labeled sound events
  • A comprehensive characterization of FSD50K
  • Architectural modifications to increase shift invariance in CNNs
  • FSDnoisy18k → first dataset to support the investigation of label noise in SET

191

Summary and Conclusions

192 of 203

Technical contributions

  • A comprehensive review of sound event recognition & different topics covered
  • FSD50K → the largest fully-open dataset of human-labeled sound events
  • A comprehensive characterization of FSD50K
  • Architectural modifications to increase shift invariance in CNNs
  • FSDnoisy18k → first dataset to support the investigation of label noise in SET
  • Techniques to mitigate the effect of label noise during training of sound event classifiers

192

Summary and Conclusions

193 of 203

Technical contributions

  • A comprehensive review of sound event recognition & different topics covered
  • FSD50K → the largest fully-open dataset of human-labeled sound events
  • A comprehensive characterization of FSD50K
  • Architectural modifications to increase shift invariance in CNNs
  • FSDnoisy18k → first dataset to support the investigation of label noise in SET
  • Techniques to mitigate the effect of label noise during training of sound event classifiers
  • Self-supervised contrastive learning frameworks for learning audio representations

193

Summary and Conclusions

194 of 203

Other academic contributions & merits

  • Best Audio Representation Learning Paper Award at WASPAA 2021
    • “Self-Supervised Learning From Automatically Separated Sound Scenes”
  • Developer of awarded proposals for Google Faculty Research Awards 2017 & 2018
  • Technical Program Co-Chair of DCASE Workshop 2021
  • Challenge co-organizer
    • DCASE2018 Task2, DCASE2019 Task2, DCASE2021 Task4
    • Holistic Evaluation of Audio Representations (HEAR) 2021
  • Reviewer:
    • EURASIP, IEEE SPL, ICASSP, WASPAA, DCASE, EUSIPCO, IJCNN, MMSP, ISMIR
  • Resources for reproducibility
    • 4 datasets available from Zenodo
    • 5 code repositories available from GitHub

194

Summary and Conclusions

195 of 203

19 peer-reviewed publications

  • 3 Journal Articles as first author in TASLP and IEEE SPL
    • Fonseca, E., Favory, X., Pons, J., Font, F., & Serra, X. (2020). FSD50K: an Open Dataset of Human-Labeled Sound Events. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
    • Fonseca, E., Hershey, S., Plakal, M., Ellis, D. P. W., Jansen, A., & Moore, R. C. (2020). Addressing Missing Labels in Large-Scale Sound Event Recognition Using a Teacher-Student Framework With Loss Masking. IEEE Signal Processing Letters
    • Fonseca, E., Ferraro, A., & Serra, X. (2021). Improving Sound Event Classification by Increasing Shift Invariance in Convolutional Neural Networks. Under review in IEEE Signal Processing Letters.

195

Summary and Conclusions

196 of 203

19 peer-reviewed publications

  • 9 Conference Articles as first author in ICASSP, WASPAA, DCASE, SMC, ISMIR
    • Fonseca, E., Jansen, A., Ellis, D. P. W., Wisdom, S., Tagliasacchi, M., Hershey, J. R., Plakal, M., Hershey, S., Moore, R. C., & Serra, X. (2021). Self-Supervised Learning from Automatically Separated Sound Scenes. In Proceedings of the 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).
    • Fonseca, E., Ortego, D., McGuinness, K., O’Connor, N. E., & Serra, X. (2021). Unsupervised Contrastive Learning of Sound Event Representations. In Proceedings of the 46th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
    • Fonseca, E., Font, F., & Serra, X. (2019). Model-agnostic Approaches to Handling Noisy Labels When Training Sound Event Classifiers. In Proceedings of the 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).
    • Fonseca, E., Plakal, M., Font, F., Ellis, D. P. W., & Serra, X. (2019). Audio Tagging with Noisy Labels and Minimal Supervision. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019)
    • Fonseca, E., Plakal, M., Ellis, D. P. W., Font, F., Favory, X., & Serra, X. (2019). Learning Sound Event Classifiers from Web Audio with Noisy Labels. In Proceedings of the 44th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
    • Fonseca, E., Plakal, M., Font, F., Ellis, D. P. W., Favory, X., Pons, J., & Serra, X. (2018). General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018).
    • Fonseca, E., Gong, R., & Serra, X. (2018). A Simple Fusion of Deep and Shallow Learning for Acoustic Scene Classification. In Proceedings of 5th Sound & Music Computing Conference (SMC 2018).
    • Fonseca, E., Pons, J., Favory, X., Font, F., Bogdanov, D., Ferraro, A., Oramas, S., Porter, A., & Serra, X. (2017). Freesound Datasets: A Platform for the Creation of Open Audio Datasets. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR).
    • Fonseca, E., Gong, R., Bogdanov, D., Slizovskaia, O., Gómez, E., & Serra, X. (2017). Acoustic Scene Classification by Ensembling Gradient Boosting Machine and Convolutional Neural Networks. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017).

196

Summary and Conclusions

197 of 203

19 peer-reviewed publications

  • 7 Conference Articles through collaborations in ICASSP, DCASE, FRUCT
    • Zinemanas, P., Rocamora, M., Fonseca, E., Font, F., & Serra, X. (2021). Towards Interpretable Sound Event Detection with Attention Based on Prototypes. Accepted for publication in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021).
    • Hershey, S., Ellis, D. P. W., Fonseca, E., Jansen, A., Liu, C., Moore, R. C., & Plakal, M. (2021). The Benefit of Temporally-Strong Labels in Audio Event Classification. In Proceedings of the 46th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
    • Turpault, N., Serizel, R., Wisdom, S., Erdogan, H., Hershey, J.R., Fonseca, E., Seetharaman, P., & Salamon, J. (2021). Sound Event Detection and Separation: a Benchmark on DESED Synthetic Soundscapes. In Proceedings of the 46th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
    • Wisdom, S., Erdogan, H., Ellis, D. P. W., Serizel, R., Turpault, N., Fonseca, E., Salamon, J., Seetharaman, P., & Hershey, J.R. (2021). What’s All the FUSS About Free Universal Sound Separation Data?. In Proceedings of the 46th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
    • Turpault, N., Wisdom, S., Erdogan, H., Hershey, J.R., Serizel, R., Fonseca, E., Seetharaman, P., & Salamon, J. (2020). Improving Sound Event Detection In Domestic Environments Using Sound Separation. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2020 Workshop (DCASE2020).
    • Pérez-López, A., Fonseca, E., & Serra, X. (2019). A Hybrid Parametric-Deep Learning Approach for Sound Event Localization and Detection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019).
    • Favory, F., Fonseca, E., & Serra, X. (2018). Facilitating the Manual Annotation of Sounds When Using Large Taxonomies. In Proceedings of the 23rd Conference of Open Innovations Association, FRUCT.

197

Summary and Conclusions

198 of 203

Evolution of research context in sound event classification

198

Summary and Conclusions

[Diagram: axes of vocabulary size vs. supervision weakness; starting point: supervised classification with clean labels on small vocabularies]

199 of 203

Evolution of research context in sound event classification

199

Summary and Conclusions

[Diagram: moving towards large vocabularies and weaker supervision: supervised classification with clean labels, supervised classification with noisy labels, and self-supervised representation learning]

200 of 203

Evolution of research context in sound event classification

200

Summary and Conclusions

[Diagram: the same landscape, marking where this thesis sits: large-vocabulary work spanning supervised classification with clean labels, supervised classification with noisy labels, and self-supervised representation learning]

201 of 203

Thank you!

201

202 of 203

Training Sound Event Classifiers Using Different Types of Supervision

Eduardo Fonseca

December 1st, 2021

Supervisors:

Dr. Xavier Serra i Casals

Dr. Frederic Font Corbera

Board:

Dr. Emmanouil Benetos

Dr. Annamaria Mesaros

Dr. Marius Miron

203 of 203

Credits