TL;DR

  • Human annotation is expensive, so we automatically detect and classify great ape calls from continuous audio recordings.
  • Three data sets from different great ape lineages, collected during field research: chimpanzees, orangutans, and bonobos.
  • Wav2vec 2.0, pretrained on 1,000 hours of human speech, transfers surprisingly well as an acoustic feature extractor paired with an LSTM (see the sketch below).
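A minimal PyTorch sketch of such a pipeline, assuming the HuggingFace facebook/wav2vec2-base checkpoint as a frozen feature extractor; the class name, LSTM size, and the choice to freeze the encoder are illustrative assumptions, not necessarily the exact configuration used here:

  import torch
  import torch.nn as nn
  from transformers import Wav2Vec2Model

  class GreatApeCallClassifier(nn.Module):
      """Per-frame call classification: wav2vec 2.0 features -> BiLSTM -> logits."""

      def __init__(self, num_classes: int, hidden_size: int = 256):
          super().__init__()
          # Encoder pretrained on human speech, reused purely as a feature extractor.
          self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
          self.encoder.requires_grad_(False)  # assumption: frozen, no fine-tuning
          self.lstm = nn.LSTM(
              input_size=self.encoder.config.hidden_size,  # 768 for the base model
              hidden_size=hidden_size,
              batch_first=True,
              bidirectional=True,
          )
          self.head = nn.Linear(2 * hidden_size, num_classes)

      def forward(self, waveform: torch.Tensor) -> torch.Tensor:
          # waveform: (batch, samples), 16 kHz mono audio
          with torch.no_grad():
              feats = self.encoder(waveform).last_hidden_state  # (batch, frames, 768)
          out, _ = self.lstm(feats)
          return self.head(out)  # (batch, frames, num_classes) frame-level logits

Frame-level predictions can then be merged along the time axis into detected call events.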

Discussion

  • Wav2vec 2.0 transfers from human speech (high resource) to great ape calls (low resource); what about other animals?
  • Do the same observations hold if we move from the vocal-auditory channel to the manual-visual channel (gestures)?
  • These results fit into the broader picture of decoding the communication systems of non-human animals, e.g., the Earth Species Project.

Automatic Sound Event Detection and Classification of Great Ape Calls Using Neural Networks

Zifan Jiang (1,2), Adrian Soldati (1,3), Isaac Schamberg (2), Adriano R. Lameira (4), Steven Moran (1,5)

(1) University of Neuchâtel, (2) University of Zurich, (3) University of St Andrews, (4) University of Warwick, (5) University of Miami

Contact Email: j22melody@gmail.com

We gratefully acknowledge funding from: UK Research & Innovation, Future Leaders Fellowship (MR/T04229X/1; ARL), Swiss National Science Foundation (PCEFP1_186841: SM, ZJ; 310030_185324: AS), St Leonard's College (AS), and Swissuniversities (AS). We also thank Klaus Zuberbühler, Josep Call, and the field assistants of the Budongo Conservation Field Station.

Method

Data

Results

Table 2: We run all experiments three times with different random seeds and report the mean and standard deviation. acc. stands for frame-level accuracy; f1 stands for the frame-level average F1-score, weighted by the number of true instances per class. For hyper-parameters, we start E1 with batch_size = 1 and dropout = 0.4, and keep these values by default unless otherwise specified in the table.
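
For reference, both metrics can be computed per frame with scikit-learn; the label arrays below are hypothetical, with 0 as the background (no call) class:

  import numpy as np
  from sklearn.metrics import accuracy_score, f1_score

  y_true = np.array([0, 0, 1, 1, 2, 0])  # hypothetical per-frame gold labels
  y_pred = np.array([0, 1, 1, 1, 2, 0])  # hypothetical per-frame predictions

  acc = accuracy_score(y_true, y_pred)               # frame-level accuracy
  f1 = f1_score(y_true, y_pred, average="weighted")  # weighted by true instances per class
  print(f"acc. = {acc:.3f}, f1 = {f1:.3f}")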