RECAP: RETRIEVAL-AUGMENTED AUDIO CAPTIONING
Sreyan Ghosh, Sonal Kumar, Chandra Kiran Reddy Evuru,
Ramani Duraiswami, Dinesh Manocha
Primary Motivation
Audio captioning is the fundamental task of describing the contents of an audio sample using natural language.
Introduction and Main Contributions
We propose RECAP, REtrieval Augmented Audio CAPtioning, a simple and scalable audio captioning system. Like other audio captioning systems in the literature, RECAP is built on an audio encoder and a language decoder (GPT-2 in our setting). We introduce three novel changes to this standard recipe:
RECAP is lightweight, fast to train (as we only optimize the cross-attention layers), and can exploit any large text-caption datastore in a training-free fashion.
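The retrieval-augmented idea above can be sketched in a few lines: embed the input audio in a CLAP-style joint audio-text space, retrieve the nearest datastore captions by cosine similarity, and prepend them to the decoder's prompt. This is a minimal toy sketch, not the released implementation; the embeddings are made up, and the function names (`retrieve_captions`, `build_prompt`) are illustrative, not RECAP's API.

```python
import numpy as np

def retrieve_captions(audio_emb, caption_embs, captions, k=2):
    """Return the k datastore captions closest to the audio embedding
    under cosine similarity (stand-in for a CLAP joint embedding space)."""
    a = audio_emb / np.linalg.norm(audio_emb)
    c = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    idx = np.argsort(c @ a)[::-1][:k]
    return [captions[i] for i in idx]

def build_prompt(retrieved):
    """Concatenate retrieved captions into a prompt prefix; the language
    decoder (GPT-2 in RECAP) then generates conditioned on this prefix."""
    return "Similar captions: " + " | ".join(retrieved) + "\nCaption:"

# Toy datastore: 3 captions with made-up 4-d embeddings.
captions = ["a dog barks loudly", "rain falls on a roof", "a car engine starts"]
caption_embs = np.array([[1.0, 0.1, 0.0, 0.0],
                         [0.0, 1.0, 0.1, 0.0],
                         [0.0, 0.0, 1.0, 0.1]])
audio_emb = np.array([0.9, 0.2, 0.0, 0.1])  # pretend CLAP audio embedding

prompt = build_prompt(retrieve_captions(audio_emb, caption_embs, captions))
print(prompt)
```

Because the datastore enters only through this retrieval step, swapping in a new caption datastore requires no retraining, which is what makes the approach "training-free" on the datastore side.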
Proposed Methodology
RECAP Advantages
Experimental Setting
Training Datasets
Evaluation Datasets
(both in-domain and out-of-domain settings)
Datastore
(both in-domain and out-of-domain settings)
Evaluation Metrics
Training Hyper-parameters
Results
Results Analysis
Comparison of RECAP with Kim et al. [6] (SOTA) on compositional instances from the Clotho (1.) and AudioCaps (4.) test sets.
OOD comparison with a model trained on AudioCaps and inferred on a Clotho test instance containing an audio event never seen during training (2.), and vice versa (3.).
Code, Data and Checkpoints
Datastore size reported in the paper: ~660k pairs vs. open-sourced release: >10M pairs (2.2M real + >8M synthetic)
Strong CLAP model trained on 10M+ audio-caption pairs
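At the scale of the open-sourced datastore (>10M caption embeddings), retrieval still has to be efficient. The sketch below shows one simple way to do exact top-k inner-product search over a large matrix of L2-normalized embeddings by scanning in batches, so the full score vector never has to be materialized at once. This is an assumption-laden illustration, not the authors' released code; at production scale an ANN index (e.g. FAISS) would typically replace the exact scan.

```python
import numpy as np

def batched_topk(query, embs, k=5, batch=100_000):
    """Exact top-k search by inner product over L2-normalized embeddings,
    scanned in batches (embs may be a np.memmap over a 10M-row datastore)."""
    cand_scores, cand_ids = [], []
    for start in range(0, len(embs), batch):
        scores = embs[start:start + batch] @ query
        m = min(k, len(scores))
        top = np.argpartition(scores, -m)[-m:]  # unordered batch top-m
        cand_scores.append(scores[top])
        cand_ids.append(top + start)            # map back to global row ids
    scores = np.concatenate(cand_scores)
    ids = np.concatenate(cand_ids)
    order = np.argsort(scores)[::-1][:k]        # final exact ordering
    return ids[order], scores[order]

# Demo: 10k-row toy datastore; the query is row 42's own embedding.
rng = np.random.default_rng(1)
embs = rng.standard_normal((10_000, 16)).astype(np.float32)
embs /= np.linalg.norm(embs, axis=1, keepdims=True)
ids, scores = batched_topk(embs[42], embs, k=3, batch=2_048)
print(ids[0])  # row 42 retrieves itself first (cosine similarity 1.0)
```

Normalizing rows up front turns cosine similarity into a plain inner product, which is also the form most ANN libraries index directly.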