1 of 16

Rainforest Connection Species Audio Detection

Automate the detection of bird and frog species in a tropical soundscape

Ryan Chesler

San Diego Machine Learning Meetup


About me

  • Data Scientist @ H2O.ai
  • Kaggle Grandmaster
  • Organizer of SDML


Problem

  • Detect 24 different species of frogs and birds from audio recordings
  • Output a ranking of the most likely species


Data

  • 57 GB of 1-minute FLAC audio files
  • Partially strongly labeled
  • Given both true positive and false positive labels


Labelling procedure

  • Scaled labeling with template matching → expert verification
  • Selected golden samples of each species, then searched for similar clips
  • Had experts listen to the clips found by template matching and label whether each was actually the bird/frog
  • The labels given to us are incomplete

https://www.sciencedirect.com/science/article/pii/S1574954120300637
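The template-matching step above can be sketched as a normalized cross-correlation of a "golden" call template slid along the time axis of a spectrogram. Everything below is synthetic (shapes, amplitudes, and the planted call position are illustrative, not from the actual labeling pipeline):

```python
import numpy as np

def template_match(spec, template):
    """Slide a call template along the time axis; return one
    normalized-correlation score per time offset."""
    th, tw = template.shape
    t = (template - template.mean()) / (template.std() + 1e-9)
    scores = np.empty(spec.shape[1] - tw + 1)
    for x in range(len(scores)):
        patch = spec[:th, x : x + tw]
        p = (patch - patch.mean()) / (patch.std() + 1e-9)
        scores[x] = np.mean(p * t)
    return scores

rng = np.random.default_rng(0)
template = rng.normal(size=(40, 10))   # "golden sample" of one call
spec = rng.normal(size=(40, 200))      # noisy spectrogram
spec[:, 120:130] += 4 * template       # the same call occurs at frame 120
print(int(np.argmax(template_match(spec, template))))
```

Peaks in the score curve become candidate clips, which the experts then verify by listening.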


Metric (LWLRAP)

  • The metric cares about the order of predictions, not their absolute values

Label = [1, 0, 0], Pred = [0.5, 0.7, 0.1] → Score = 0.5

Label = [1, 0, 0], Pred = [0.6, 0.7, 0.1] → Score = 0.5

Label = [1, 0, 0], Pred = [0.8, 0.7, 0.1] → Score = 1.0
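With a single true label, label-ranking average precision reduces to 1 / (rank of the true label), which reproduces the scores above. A minimal numpy sketch (the full competition metric, LWLRAP, additionally weights each label's contribution, which this toy version omits):

```python
import numpy as np

def lrap(label, pred):
    """Label-ranking average precision for one sample.

    For each true label, take the fraction of labels ranked at or above
    it (by predicted score) that are also true, then average. With one
    true label this is simply 1 / rank of that label."""
    label = np.asarray(label, dtype=bool)
    pred = np.asarray(pred, dtype=float)
    order = np.argsort(-pred)              # indices sorted by descending score
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(pred) + 1)   # rank 1 = highest score
    true_ranks = ranks[label]
    return float(np.mean([np.sum(true_ranks <= r) / r for r in true_ranks]))

print(lrap([1, 0, 0], [0.5, 0.7, 0.1]))  # 0.5
print(lrap([1, 0, 0], [0.8, 0.7, 0.1]))  # 1.0
```

Note that raising the true score from 0.5 to 0.6 changes nothing, because the ordering against 0.7 is unchanged.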


Baseline Solution

  • Convert the waveform to a spectrogram
  • Train a convolutional neural network (CNN) on the labeled subsets of the data
  • Take ~5 s clips cropped near where the data was labeled
  • At prediction time, make predictions on rolling 5 s windows

[Diagram: waveform (2,880,000 x 1) → spectrogram (257 x 11249) → CNN]
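The stated dimensions (2,880,000 samples → a 257 x 11249 spectrogram) are consistent with a 60 s recording at 48 kHz and an STFT with n_fft = 512 and hop = 256; those parameters are my inference, not stated on the slide. A numpy-only sketch of the spectrogram and the rolling prediction windows:

```python
import numpy as np

def spectrogram(wave, n_fft=512, hop=256):
    """Magnitude spectrogram: frame the waveform, window, FFT each frame."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T  # (n_fft // 2 + 1, n_frames)

def rolling_windows(wave, sr, win_sec=5, hop_sec=1):
    """Overlapping clips for sliding-window inference at prediction time."""
    win, hop = int(win_sec * sr), int(hop_sec * sr)
    for start in range(0, len(wave) - win + 1, hop):
        yield wave[start : start + win]

sr = 48_000                          # assumed sample rate
wave = np.random.randn(60 * sr)      # stand-in for one 1-minute recording
spec = spectrogram(wave)
print(spec.shape)                    # (257, 11249), matching the slide
```

Each 5 s window is scored by the CNN, and the per-window predictions are aggregated (e.g. max) into recording-level species scores.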


Validation

  • Difficult to do properly because the data is only partially labeled
  • Trade-off between over-optimistically cropping the region of interest and validating against unlabeled regions of the data


Our solution

  • Similar to the baseline: an ensemble of various CNN models
  • Trained on varying clip sizes
  • Probed the leaderboard for distribution correction
  • Used only the true positive labels
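One common way to do such a distribution correction is the standard prior-shift rule, p_test(y|x) ∝ p_train(y|x) · p_test(y) / p_train(y); whether our team used exactly this form is not stated, and the priors below are hypothetical (in practice the test priors would come from leaderboard probing):

```python
import numpy as np

def prior_correction(probs, train_prior, test_prior):
    """Rescale per-species scores by the ratio of estimated test to
    train class priors. A ranking metric only needs the relative
    order within each clip, so no renormalization is required."""
    w = np.asarray(test_prior) / np.asarray(train_prior)
    return probs * w

train_prior = np.array([0.30, 0.50, 0.20])  # hypothetical training frequencies
test_prior  = np.array([0.10, 0.60, 0.30])  # hypothetical probed test frequencies
probs = np.array([[0.6, 0.5, 0.4]])
print(prior_correction(probs, train_prior, test_prior))  # ~[[0.2, 0.6, 0.6]]
```

A species believed to be rarer in the test set gets scaled down, demoting it in the ranking even when its raw score is high.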


Things we explored

  • Mimicking the template-matching procedure
  • Frequency-specific models
  • Alternate aggregations instead of max over clips
  • Transformers, LSTMs, and other time-series models


#1 Solution

  • Ensemble of CNNs predicting at the recording level with weak labels and, more granularly, at the time level with hard labels
  • Masked the loss so training only runs against labeled species and times
  • Relied on the LB for validation
  • Hand-labeled some missing annotations
  • Pseudolabels
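The masked loss can be sketched as follows; this is a numpy toy (the winning solution presumably used a deep-learning framework, and the shapes and values here are illustrative only):

```python
import numpy as np

def masked_bce(pred, target, mask, eps=1e-7):
    """Binary cross-entropy averaged only over labeled (mask == 1) cells.

    Unlabeled species/time cells contribute nothing to the loss, so the
    model is never penalized for its predictions on species nobody
    annotated in that clip."""
    pred = np.clip(pred, eps, 1 - eps)
    loss = -(target * np.log(pred) + (1 - target) * np.log(1 - pred))
    return float(np.sum(loss * mask) / max(np.sum(mask), 1))

# Toy batch: 2 clips x 3 species; the 3rd species was never labeled.
pred   = np.array([[0.9, 0.2, 0.5], [0.1, 0.8, 0.5]])
target = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
mask   = np.array([[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]])
print(masked_bce(pred, target, mask))
```

Changing the predictions in the masked column leaves the loss untouched, which is exactly the point with partially labeled data.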


#1 Solution (Continued)

  • Weak-label models
    • Train/predict on the full 60 seconds
    • Pseudolabels from the hard-label model + hand labels
    • Noisy-student training, sampling across several teacher models
  • Post-processing
    • Distribution shift correction for the test set


#2 Solution

  • Masked loss on true positives and false positives
  • Soft pseudolabels created from 0.5 s sliding windows
  • Mixed training between pseudolabels and cropping on the original labels (4-10 pseudolabel loops used)
  • Added a channel to the spectrograms to encode position; translation invariance is harmful in this task
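One plausible reading of that extra position channel is a CoordConv-style coordinate map stacked onto the spectrogram; the exact encoding the #2 team used is not specified, so this is a sketch:

```python
import numpy as np

def add_freq_position_channel(spec):
    """Stack a channel that encodes each row's frequency-bin position.

    A plain CNN is translation invariant along the frequency axis, but a
    call at 2 kHz is a different event from the same shape at 8 kHz, so
    the network is given explicit position information."""
    n_freq, n_time = spec.shape
    position = np.tile(np.linspace(0.0, 1.0, n_freq)[:, None], (1, n_time))
    return np.stack([spec, position])  # (2, n_freq, n_time)

out = add_freq_position_channel(np.random.rand(257, 100))
print(out.shape)  # (2, 257, 100)
```

The first convolution can now learn frequency-dependent filters, breaking the unwanted invariance.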


#3 Solution

  • Blend of 8 models using TPs, FPs, and hand labels
  • Post-processed predictions by thresholding species 3
  • Large hand-labeling effort on the training set
  • Found external data containing the species and labeled that as well
  • Many augmentations, different sampling methods, and injected audio from the Freesound50k set
  • Pseudolabels weren't as useful here


Other interesting solutions

  • The #11 solution trained on frequency-masked and time-cropped inputs
  • Corrected distributions using meta-information from the paper associated with the dataset
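The frequency masking mentioned for the #11 solution can be sketched as a SpecAugment-style augmentation that zeroes a random band of frequency bins; the band width and other details here are illustrative assumptions:

```python
import numpy as np

def freq_mask(spec, max_width=20, rng=None):
    """Return a copy of the spectrogram with one random band of
    frequency bins zeroed out (SpecAugment-style augmentation)."""
    if rng is None:
        rng = np.random.default_rng()
    out = spec.copy()
    width = int(rng.integers(1, max_width + 1))
    start = int(rng.integers(0, out.shape[0] - width + 1))
    out[start : start + width, :] = 0.0
    return out

spec = np.ones((257, 100))
masked = freq_mask(spec, rng=np.random.default_rng(0))
print(int((masked == 0).all(axis=1).sum()))  # number of masked bins, 1..20
```

Hiding random bands forces the model not to rely on any single frequency region, which pairs naturally with the frequency-sensitive nature of species calls.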