1 of 16

Rainforest Connection Species Audio Detection

Automate the detection of bird and frog species in a tropical soundscape

Ryan Chesler

San Diego Machine Learning Meetup


About me

  • Data Scientist @ H2O.ai
  • Kaggle Grandmaster
  • Organizer of SDML


Problem

  • Detect 24 different species of frogs and birds from audio recordings
  • Output a ranking of the most likely species


Data

  • 57 GB of 1-minute FLAC audio files
  • Partially strongly labeled
  • Given both true positive and false positive labels


Labelling procedure

  • Scaled labeling with template matching → expert verification
  • Selected golden samples of each species, then searched for similar clips
  • Had experts listen to the clips found by template matching and label whether each was actually the bird/frog
  • The labels given to us are incomplete

https://www.sciencedirect.com/science/article/pii/S1574954120300637
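The template-matching step above can be sketched as a normalized cross-correlation of a "golden" call template slid along the time axis of a spectrogram. Everything below is synthetic (shapes, amplitudes, and the planted call position are illustrative, not from the actual labeling pipeline):

```python
import numpy as np

def template_match(spec, template):
    """Slide a call template along the time axis; return one
    normalized-correlation score per time offset."""
    th, tw = template.shape
    t = (template - template.mean()) / (template.std() + 1e-9)
    scores = np.empty(spec.shape[1] - tw + 1)
    for x in range(len(scores)):
        patch = spec[:th, x : x + tw]
        p = (patch - patch.mean()) / (patch.std() + 1e-9)
        scores[x] = np.mean(p * t)
    return scores

rng = np.random.default_rng(0)
template = rng.normal(size=(40, 10))   # "golden sample" of one call
spec = rng.normal(size=(40, 200))      # noisy spectrogram
spec[:, 120:130] += 4 * template       # the same call occurs at frame 120
print(int(np.argmax(template_match(spec, template))))
```

Peaks in the score curve become candidate clips, which the experts then verify by listening.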


Metric (LWLRAP)

  • The metric cares about the order of predictions, not their absolute values

Label = [1, 0, 0], Pred = [0.5, 0.7, 0.1] → Score = 0.5

Label = [1, 0, 0], Pred = [0.6, 0.7, 0.1] → Score = 0.5

Label = [1, 0, 0], Pred = [0.8, 0.7, 0.1] → Score = 1.0
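With a single true label, label-ranking average precision reduces to 1 / (rank of the true label), which reproduces the scores above. A minimal numpy sketch (the full competition metric, LWLRAP, additionally weights each label's contribution, which this toy version omits):

```python
import numpy as np

def lrap(label, pred):
    """Label-ranking average precision for one sample.

    For each true label, take the fraction of labels ranked at or above
    it (by predicted score) that are also true, then average. With one
    true label this is simply 1 / rank of that label."""
    label = np.asarray(label, dtype=bool)
    pred = np.asarray(pred, dtype=float)
    order = np.argsort(-pred)              # indices sorted by descending score
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(pred) + 1)   # rank 1 = highest score
    true_ranks = ranks[label]
    return float(np.mean([np.sum(true_ranks <= r) / r for r in true_ranks]))

print(lrap([1, 0, 0], [0.5, 0.7, 0.1]))  # 0.5
print(lrap([1, 0, 0], [0.8, 0.7, 0.1]))  # 1.0
```

Note that raising the true score from 0.5 to 0.6 changes nothing, because the ordering against 0.7 is unchanged.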


Baseline Solution

  • Convert the waveform to a spectrogram
  • Train a convolutional neural network (CNN) on the labeled subsets of the data
  • Take ~5 s clips cropped near where the data was labeled
  • At prediction time, make predictions on rolling 5 s windows

[Diagram: waveform (2,880,000 x 1) → spectrogram (257 x 11249) → CNN]
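The stated dimensions (2,880,000 samples → a 257 x 11249 spectrogram) are consistent with a 60 s recording at 48 kHz and an STFT with n_fft = 512 and hop = 256; those parameters are my inference, not stated on the slide. A numpy-only sketch of the spectrogram and the rolling prediction windows:

```python
import numpy as np

def spectrogram(wave, n_fft=512, hop=256):
    """Magnitude spectrogram: frame the waveform, window, FFT each frame."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T  # (n_fft // 2 + 1, n_frames)

def rolling_windows(wave, sr, win_sec=5, hop_sec=1):
    """Overlapping clips for sliding-window inference at prediction time."""
    win, hop = int(win_sec * sr), int(hop_sec * sr)
    for start in range(0, len(wave) - win + 1, hop):
        yield wave[start : start + win]

sr = 48_000                          # assumed sample rate
wave = np.random.randn(60 * sr)      # stand-in for one 1-minute recording
spec = spectrogram(wave)
print(spec.shape)                    # (257, 11249), matching the slide
```

Each 5 s window is scored by the CNN, and the per-window predictions are aggregated (e.g. max) into recording-level species scores.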


Validation

  • Difficult to do properly because the data is only partially labeled
  • Trade-off between over-optimistically cropping the region of interest and validating against unlabeled regions of the data


Our solution

  • Similar to the baseline: an ensemble of various CNN models
  • Trained on varying clip sizes
  • Probed the leaderboard for distribution correction
  • Used only the true positive labels
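One common way to do such a distribution correction is the standard prior-shift rule, p_test(y|x) ∝ p_train(y|x) · p_test(y) / p_train(y); whether our team used exactly this form is not stated, and the priors below are hypothetical (in practice the test priors would come from leaderboard probing):

```python
import numpy as np

def prior_correction(probs, train_prior, test_prior):
    """Rescale per-species scores by the ratio of estimated test to
    train class priors. A ranking metric only needs the relative
    order within each clip, so no renormalization is required."""
    w = np.asarray(test_prior) / np.asarray(train_prior)
    return probs * w

train_prior = np.array([0.30, 0.50, 0.20])  # hypothetical training frequencies
test_prior  = np.array([0.10, 0.60, 0.30])  # hypothetical probed test frequencies
probs = np.array([[0.6, 0.5, 0.4]])
print(prior_correction(probs, train_prior, test_prior))  # ~[[0.2, 0.6, 0.6]]
```

A species believed to be rarer in the test set gets scaled down, demoting it in the ranking even when its raw score is high.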


Things we explored

  • Mimicking the template-matching procedure
  • Frequency-specific models
  • Alternate aggregations instead of max over clips
  • Transformers, LSTMs, and other time-series models


#1 Solution

  • Ensemble of CNNs predicting at the recording level with weak labels and, more granularly, at the time level with hard labels
  • Masked the loss so training only runs against labeled species and times
  • Relied on the LB for validation
  • Hand-labeled some missing annotations
  • Pseudolabels
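The masked loss can be sketched as follows; this is a numpy toy (the winning solution presumably used a deep-learning framework, and the shapes and values here are illustrative only):

```python
import numpy as np

def masked_bce(pred, target, mask, eps=1e-7):
    """Binary cross-entropy averaged only over labeled (mask == 1) cells.

    Unlabeled species/time cells contribute nothing to the loss, so the
    model is never penalized for its predictions on species nobody
    annotated in that clip."""
    pred = np.clip(pred, eps, 1 - eps)
    loss = -(target * np.log(pred) + (1 - target) * np.log(1 - pred))
    return float(np.sum(loss * mask) / max(np.sum(mask), 1))

# Toy batch: 2 clips x 3 species; the 3rd species was never labeled.
pred   = np.array([[0.9, 0.2, 0.5], [0.1, 0.8, 0.5]])
target = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
mask   = np.array([[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]])
print(masked_bce(pred, target, mask))
```

Changing the predictions in the masked column leaves the loss untouched, which is exactly the point with partially labeled data.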


#1 Solution (Continued)

  • Weak-label models
    • Train/predict on the full 60 seconds
    • Pseudolabels from the hard-label model + hand labels
    • Noisy-student training, sampling across several teacher models
  • Post-processing
    • Distribution shift correction for the test set


#2 Solution

  • Masked loss on true positives and false positives
  • Soft pseudolabels created from 0.5 s sliding windows
  • Mixed training between pseudolabels and cropping on the original labels (4-10 pseudolabel loops used)
  • Added a channel to the spectrograms to encode position; translation invariance is harmful in this task
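One plausible reading of that extra position channel is a CoordConv-style coordinate map stacked onto the spectrogram; the exact encoding the #2 team used is not specified, so this is a sketch:

```python
import numpy as np

def add_freq_position_channel(spec):
    """Stack a channel that encodes each row's frequency-bin position.

    A plain CNN is translation invariant along the frequency axis, but a
    call at 2 kHz is a different event from the same shape at 8 kHz, so
    the network is given explicit position information."""
    n_freq, n_time = spec.shape
    position = np.tile(np.linspace(0.0, 1.0, n_freq)[:, None], (1, n_time))
    return np.stack([spec, position])  # (2, n_freq, n_time)

out = add_freq_position_channel(np.random.rand(257, 100))
print(out.shape)  # (2, 257, 100)
```

The first convolution can now learn frequency-dependent filters, breaking the unwanted invariance.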


#3 Solution

  • Blend of 8 models using TPs, FPs, and hand labels
  • Post-processed predictions by thresholding species 3
  • Large hand-labeling effort on the training set
  • Found external data containing the species and labeled that as well
  • Many augmentations, different sampling methods, and injected audio from the Freesound50k set
  • Pseudolabels weren't as useful here


Other interesting solutions

  • The #11 solution trained on frequency-masked and time-cropped inputs
  • Corrected distributions using meta-information from the paper associated with the dataset
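The frequency masking mentioned for the #11 solution can be sketched as a SpecAugment-style augmentation that zeroes a random band of frequency bins; the band width and other details here are illustrative assumptions:

```python
import numpy as np

def freq_mask(spec, max_width=20, rng=None):
    """Return a copy of the spectrogram with one random band of
    frequency bins zeroed out (SpecAugment-style augmentation)."""
    if rng is None:
        rng = np.random.default_rng()
    out = spec.copy()
    width = int(rng.integers(1, max_width + 1))
    start = int(rng.integers(0, out.shape[0] - width + 1))
    out[start : start + width, :] = 0.0
    return out

spec = np.ones((257, 100))
masked = freq_mask(spec, rng=np.random.default_rng(0))
print(int((masked == 0).all(axis=1).sum()))  # number of masked bins, 1..20
```

Hiding random bands forces the model not to rely on any single frequency region, which pairs naturally with the frequency-sensitive nature of species calls.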