Contrastive Language-Audio Pre-training:

Expanding CLIP to Audio

Yusong Wu*, Tianyu Zhang*, Ke Chen*, Richard Vencu, Christoph Schuhmann

*: equal contribution

Collaborators

CLIP: Unsupervised Multi-modal Contrastive Learning

CLIP: SOTA Performance

  • Matches ResNet-101 accuracy on ImageNet zero-shot
  • Robust across various domains
  • More natural object detection

CLIP: Foundation of Other Models

Expanding CLIP to Other Modalities

  • Humans perceive more than two modalities
  • Jointly learning multiple modalities lets them enhance each other
  • Enables audio-related downstream tasks
  • This work: expand CLIP to audio

Overview

  • Larger dataset with natural text
  • CLAP model with audio encoder
  • Experiments and takeaways

Previous Works Have Limited Data

  • Trained on small datasets
  • Or trained only on label data (e.g. "Cat, Dog, Cars, …")

Ours: Mixture Dataset with Natural Text

  • ~2,000,000 samples, 5,893 hours
  • Both audio-text pairs and audio-label pairs
    • Label example: "Cat, Dog, Cars, …"
    • Caption example: "A muddled noise of broken channel of the TV."

Convert Labels to Text Prompts

For label-only data, construct a text prompt:

Label: cat, dog, car

Text prompt: Sound of cat, dog, and car.
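
A minimal Python sketch of this prompt construction (the exact template used in training may differ):

    def labels_to_prompt(labels):
        # Join class labels into a natural-language caption,
        # e.g. ["cat", "dog", "car"] -> "Sound of cat, dog, and car."
        if len(labels) == 1:
            return f"Sound of {labels[0]}."
        return "Sound of " + ", ".join(labels[:-1]) + f", and {labels[-1]}."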

CLAP: Adding an Audio Encoder to CLIP

[Architecture diagram: a text encoder and an audio encoder, each followed by an MLP projection head; the legend marks which parameters are frozen vs. trained; the two embedding spaces are tied together by symmetric cross-entropy (contrastive) losses.]
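
A minimal PyTorch sketch of the symmetric contrastive loss in the diagram. The projection-head shape and the temperature (logit scale) follow CLIP and are assumptions here, not details read off the slide:

    import torch
    import torch.nn.functional as F

    class ProjectionHead(torch.nn.Module):
        # MLP that maps an encoder's output into the shared embedding space.
        def __init__(self, in_dim, out_dim=512):
            super().__init__()
            self.net = torch.nn.Sequential(
                torch.nn.Linear(in_dim, out_dim),
                torch.nn.GELU(),
                torch.nn.Linear(out_dim, out_dim),
            )

        def forward(self, x):
            return self.net(x)

    def clap_loss(audio_emb, text_emb, logit_scale):
        # Symmetric cross-entropy over the batch similarity matrix, as in CLIP.
        a = F.normalize(audio_emb, dim=-1)           # (B, D)
        t = F.normalize(text_emb, dim=-1)            # (B, D)
        logits = logit_scale * a @ t.T               # (B, B) scaled cosine sims
        labels = torch.arange(len(logits), device=logits.device)
        # Matching audio/text pairs sit on the diagonal; classify in both
        # directions and average the two losses.
        return (F.cross_entropy(logits, labels)
                + F.cross_entropy(logits.T, labels)) / 2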

Audio Encoder

  • CNN-based: Pretrained Audio Neural Networks (PANN), 81M parameters
  • Transformer-based: Hierarchical Token-Semantic Audio Transformer (HTS-AT), 28M parameters

Data Loading

  • Audio: 10 s segments
    • Randomly select a segment from longer audio
    • Zero-pad shorter audio
    • Input representation: Mel-spectrogram, 64 frequency bins
    • Sample rate: 48 kHz, frame size: 480 → 1000 input frames
  • Text:
    • If a clip has more than one caption, randomly select one
    • Tokenized by a sub-word tokenizer (same as GPT); a sketch of the full loading step follows below
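
A sketch of the loading step above using torchaudio; the FFT window size is an assumption (the slide specifies only 64 mel bins, 48 kHz, and a frame size of 480):

    import random
    import torch
    import torchaudio

    SAMPLE_RATE = 48_000
    CLIP_SAMPLES = 10 * SAMPLE_RATE      # 10 s -> 480,000 samples
    HOP = 480                            # frame size 480 -> ~1000 frames

    # 64 mel bins; n_fft=1024 is an assumed value, not from the slide.
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=SAMPLE_RATE, n_fft=1024, hop_length=HOP, n_mels=64)

    def load_example(waveform, captions):
        n = waveform.shape[-1]
        if n > CLIP_SAMPLES:             # randomly select a 10 s segment
            start = random.randint(0, n - CLIP_SAMPLES)
            waveform = waveform[..., start:start + CLIP_SAMPLES]
        elif n < CLIP_SAMPLES:           # zero-pad shorter audio
            waveform = torch.nn.functional.pad(
                waveform, (0, CLIP_SAMPLES - n))
        text = random.choice(captions)   # pick one of possibly many captions
        return mel(waveform), text       # (64 mel bins, ~1000 frames), caption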

Multi-GPU Training with an Efficient Dataloader

  • Server: AWS instance with 8 A100 GPUs
  • Pre-process data into a unified format
  • Pack samples into tars read by WebDataset (sketched below)
  • Avoid processing data in the dataloader or training loop
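
A sketch of this pipeline with the webdataset library; the shard names, sample keys, and FLAC/JSON layout are assumptions for illustration:

    import io
    import json
    import torchaudio
    import webdataset as wds

    # Packing (done once, offline): write pre-processed samples into tar shards.
    sink = wds.TarWriter("shard-000000.tar")
    sink.write({
        "__key__": "sample000000",
        "flac": open("example.flac", "rb").read(),
        "json": {"caption": "Sound of a dog barking."},
    })
    sink.close()

    def decode(sample):
        # Only cheap byte decoding happens at train time; the heavy
        # preprocessing was already done when the shards were written.
        wav, sr = torchaudio.load(io.BytesIO(sample["flac"]))
        caption = json.loads(sample["json"])["caption"]
        return wav, caption

    # Loading: sequential tar reads stream efficiently to all 8 GPUs.
    dataset = (wds.WebDataset("shard-{000000..000123}.tar")
               .shuffle(1000)
               .map(decode))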

Training

  • Keep most hyperparameters the same as CLIP
  • 1e-3 learning rate with warm-up, then cosine learning-rate decay (see the sketch below)
  • AdamW optimizer, 0.1 weight decay
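
A PyTorch sketch of this recipe; the warm-up length and total step count are assumptions, since the slide does not state them:

    import math
    import torch

    def make_optimizer(model, total_steps, warmup_steps=1000):
        opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)

        def lr_lambda(step):
            if step < warmup_steps:          # linear warm-up to the peak LR
                return step / max(1, warmup_steps)
            # cosine decay from the peak LR over the remaining steps
            t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
            return 0.5 * (1.0 + math.cos(math.pi * t))

        sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
        return opt, sched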

Evaluation Metrics

  • Text-to-Audio R@1 / R@5 / R@10 (top-1/5/10 retrieval accuracy, computed as sketched below): larger is better
  • Audio-to-Text R@1 / R@5 / R@10: larger is better
  • Validation loss
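
A sketch of the R@K computation, assuming row i of each embedding matrix corresponds to the i-th ground-truth pair and that the embeddings are L2-normalized:

    import torch

    def recall_at_k(query_emb, target_emb, ks=(1, 5, 10)):
        # Text-to-audio R@K: query=text, target=audio (swap for audio-to-text).
        sims = query_emb @ target_emb.T                   # (N, N) similarities
        ranked = sims.argsort(dim=-1, descending=True)    # targets per query
        truth = torch.arange(len(sims)).unsqueeze(-1)     # ground-truth index
        hits = ranked == truth                            # one hit per row
        return {k: hits[:, :k].any(dim=-1).float().mean().item() for k in ks}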

Exp set 1: Small dataset overfit?

  • Train loss vs. val loss: the model clearly overfits
  • Yet the retrieval-based eval metrics show a quasi-monotonically increasing trend

Exp set 2: Large dataset vs. small dataset

  • Introduced a ~30x larger dataset: AudioSet
  • Less overfitting in val loss on the large dataset
  • Metrics increase quickly at first on the small dataset, then plateau
  • Metrics continue to increase on the large dataset

Exp set 3: Transformer > CNN

  • Although the two models differ in size, the transformer eventually outperforms the CNN
  • Transformer: faster runtime

Exp set 4: Model Scaling w.r.t. Dataset Size

  • Trained on 1x, 0.5x, and 0.25x of the largest dataset
  • Smaller datasets overfit more in val loss
  • Larger datasets give better performance and a better growth rate

Conclusion

  • Collected a large audio dataset
  • Successfully built and trained a CLAP model
  • Presented preliminary experimental results

Limitations and Future Work

  • Collect more data
  • More evaluation and downstream tasks
  • More experiments
  • Better audio encoders