1 of 10

RECAP: RETRIEVAL-AUGMENTED AUDIO CAPTIONING

Sreyan Ghosh, Sonal Kumar, Chandra Kiran Reddy Evuru,

Ramani Duraiswami, Dinesh Manocha

2 of 10

Primary Motivation

  • Most prior audio captioning systems do not perform well in cross-domain settings. The primary reason is that the distribution of unique audio events shifts with the domain. For example, the AudioCaps benchmark dataset contains several audio concepts (e.g., the sound of jazz or an interview) that Clotho, another benchmark dataset, does not.

  • Models need to be re-trained to caption novel audio concepts, as new audio concepts keep emerging (e.g., new versions of an online game).

  • Long-tailed audio concepts are difficult to learn. Audio concepts that are abundant in the training data (e.g., human speech) are learnt better by captioning models than concepts that are rare (e.g., a champagne pop).

  • Training audio captioning systems is computationally expensive. Typical captioning systems in the literature have ~100-500 million parameters, and continually re-training these models for newer domains is expensive.

Audio captioning is the fundamental task of describing the contents of an audio sample using natural language.

3 of 10

Introduction and Main Contributions

We propose RECAP (REtrieval-Augmented Audio CAPtioning), a simple and scalable audio captioning system. Like other audio captioning systems in the literature, RECAP is built on an audio encoder and a language decoder (GPT-2 in our setting). We introduce three novel changes to this setup:

  1. Instead of employing an audio encoder pre-trained only on audio, we use CLAP [1] as our audio encoder. CLAP is pre-trained on audio-text pairs to learn the correspondence between audio and text by projecting them into a shared latent space. Thus, CLAP hidden state representations are better suited for captioning due to their enhanced linguistic comprehension.
  2. We condition caption generation on the audio by introducing new cross-attention layers between CLAP and GPT-2.
  3. Finally, beyond conditioning on the audio, we also condition on a custom-constructed prompt during training and inference. Inspired by retrieval-augmented generation (RAG), we construct the prompt from the top-k captions most similar to the audio, retrieved from a datastore using CLAP (see the sketch at the end of this slide).

RECAP is lightweight, fast to train (as we only optimize the cross-attention layers), and can exploit any large text-caption datastore in a training-free fashion.
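
Below is a minimal, self-contained sketch of the retrieval-augmented prompt construction described in contribution 3. The helpers clap_audio_embed and clap_text_embed and the prompt template are illustrative stand-ins rather than the paper's implementation; random vectors replace real CLAP embeddings so the snippet runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)

def clap_audio_embed(audio_path: str) -> np.ndarray:
    # Placeholder for a real CLAP audio encoder (assumption): returns an
    # L2-normalized embedding in CLAP's shared audio-text space.
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)

def clap_text_embed(captions: list[str]) -> np.ndarray:
    # Placeholder for a real CLAP text encoder (assumption).
    m = rng.normal(size=(len(captions), 512))
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def build_prompt(audio_path: str, datastore_captions: list[str], k: int = 4) -> str:
    """Retrieve the top-k datastore captions most similar to the audio and
    pack them into a text prompt that conditions the GPT-2 decoder."""
    audio_emb = clap_audio_embed(audio_path)            # shape (d,)
    caption_embs = clap_text_embed(datastore_captions)  # shape (N, d)
    sims = caption_embs @ audio_emb                     # cosine similarity (vectors are normalized)
    topk = np.argsort(-sims)[:k]
    retrieved = [datastore_captions[i] for i in topk]
    # The prompt template below is illustrative; the paper's exact wording may differ.
    return "Similar sounds have been described as: " + " | ".join(retrieved) + ". This sound is"
```

Because retrieval happens outside the model, growing or replacing the caption datastore requires no re-training.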

4 of 10

Proposed Methodology

5 of 10

RECAP Advantages

  • Training-Time Efficiency: RECAP builds on retrieval-augmented generation (RAG) and does not need to store all information in its weights, since it has access to external knowledge from a text datastore. Additionally, CLAP generates an audio embedding that correlates well with its corresponding textual description, further lowering training time thanks to this superior understanding of the audio content.

  • Test-Time Advantages: Conditioning on a prompt constructed from the top-k in-domain captions allows RECAP to caption novel concepts never seen during training and improves the captioning of audio with multiple events. Additionally, RECAP can switch to new domains simply by switching the datastore (a small sketch follows below).
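
A minimal sketch of this test-time domain switching, assuming each domain's datastore is kept as raw captions plus precomputed L2-normalized CLAP text embeddings; all names and data below are illustrative, with random vectors standing in for real CLAP embeddings.

```python
import numpy as np

def normalize(m: np.ndarray) -> np.ndarray:
    return m / np.linalg.norm(m, axis=-1, keepdims=True)

rng = np.random.default_rng(0)

# Per-domain datastores: captions plus their precomputed CLAP text embeddings.
datastores = {
    "audiocaps": {
        "captions": ["a man speaks while jazz music plays", "an interview in a quiet room"],
        "embeddings": normalize(rng.normal(size=(2, 512))),
    },
    "clotho": {
        "captions": ["waves crash against rocks on a shore", "birds chirp in a forest"],
        "embeddings": normalize(rng.normal(size=(2, 512))),
    },
}

def retrieve(audio_emb: np.ndarray, domain: str, k: int = 2) -> list[str]:
    """Switching domains only means indexing a different datastore;
    the captioning model itself is not re-trained."""
    ds = datastores[domain]
    sims = ds["embeddings"] @ audio_emb               # cosine similarity (normalized vectors)
    return [ds["captions"][i] for i in np.argsort(-sims)[:k]]

# The same audio embedding retrieves a different prompt per domain.
audio_emb = normalize(rng.normal(size=512))
print(retrieve(audio_emb, "audiocaps"))
print(retrieve(audio_emb, "clotho"))
```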

6 of 10

Experimental Setting

Training Datasets

  1. AudioCaps
  2. Clotho
  3. AudioCaps + Clotho

Evaluation Datasets

  1. AudioCaps
  2. Clotho

(In both in-domain and out-of-domain settings)

Datastore

  1. DS
  2. DS_caps or DS_clotho
  3. DS_large

(In both in-domain and out-of-domain settings)

Evaluation Metrics

  1. BLEU-1
  2. BLEU-2
  3. BLEU-3
  4. BLEU-4
  5. METEOR
  6. ROUGE-L
  7. CIDEr
  8. SPICE
  9. SPIDEr
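
As a rough illustration of how such corpus-level captioning metrics can be computed, the sketch below runs the standard pycocoevalcap toolkit on toy data; the toolkit choice is our assumption rather than the paper's stated evaluation code, and SPIDEr is simply the average of SPICE and CIDEr.

```python
# pip install pycocoevalcap  (the METEOR and SPICE scorers additionally require Java)
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# Toy references and predictions keyed by audio id; a real evaluation uses
# all (possibly multiple) reference captions available per test clip.
refs = {"a1": ["a dog barks while a car passes by"],
        "a2": ["rain falls and thunder rumbles in the distance"]}
hyps = {"a1": ["a dog is barking near a road"],
        "a2": ["heavy rain with thunder"]}

bleu, _ = Bleu(4).compute_score(refs, hyps)   # [BLEU-1, BLEU-2, BLEU-3, BLEU-4]
cider, _ = Cider().compute_score(refs, hyps)
print(bleu, cider)
# SPIDEr = (SPICE + CIDEr) / 2 once the SPICE scorer has also been run.
```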

Training Hyper-parameters

  1. Adam Optimizer
  2. Learning Rate of 5e-5
  3. Trained for 100 epochs
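
The sketch below mirrors this training setup (frozen CLAP and GPT-2, trainable cross-attention only, Adam with a 5e-5 learning rate, 100 epochs) using tiny placeholder modules and a dummy loss so that it runs as-is; it is not the paper's actual training code.

```python
import torch
from torch import nn

# Placeholder submodules standing in for the real components (assumption):
# a pretrained CLAP audio encoder, a pretrained GPT-2 decoder, and the
# newly introduced cross-attention layers that connect them.
model = nn.ModuleDict({
    "clap": nn.Linear(512, 768),             # stands in for the frozen CLAP encoder
    "gpt2": nn.Linear(768, 768),             # stands in for the frozen GPT-2 decoder
    "cross_attention": nn.Linear(768, 768),  # stands in for the trainable cross-attention
})

# Freeze the pretrained encoder and decoder; optimize only the cross-attention layers.
for name, module in model.items():
    for p in module.parameters():
        p.requires_grad = (name == "cross_attention")

# Hyper-parameters listed above: Adam, learning rate 5e-5, 100 epochs.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=5e-5
)

# Dummy batch in place of (audio, retrieved prompt, reference caption) data.
audio_features = torch.randn(8, 512)
targets = torch.randn(8, 768)

for epoch in range(100):
    hidden = model["gpt2"](model["cross_attention"](model["clap"](audio_features)))
    loss = nn.functional.mse_loss(hidden, targets)  # stands in for next-token cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```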

7 of 10

Results

8 of 10

Results

9 of 10

Results Analysis

Comparison of RECAP with Kim et al. [6] (SOTA) on compositional instances from the Clotho (1.) and AudioCaps (4.) test sets.

Out-of-domain (OOD) comparison with a model trained on AudioCaps and evaluated on a Clotho test instance containing an audio event never seen during training (2.), and vice versa (3.).

10 of 10

Code, Data and Checkpoints

Datastore size promised in the paper: ~660k captions vs. open-sourced: >10M (2.2M real + >8M synthetic)

Strong CLAP model trained on 10M+ audio-caption pairs