Contrastive Language-Audio Pre-training:

Expanding CLIP to Audio

Yusong Wu*, Tianyu Zhang*, Ke Chen*, Richard Vencu, Christoph Schuhmann

*: equal contribution

Collaborators

CLIP: Unsupervised Multi-modal Contrastive Learning

CLIP: SOTA Performance

  • Matches ResNet-101 accuracy on ImageNet zero-shot
  • Robust across various domains
  • More natural object detection

CLIP: Foundation of Other Models

Expanding CLIP to Other Modalities

  • Humans perceive more than two modalities
  • Jointly learning multiple modalities lets them enhance each other
  • Enables audio-related downstream tasks
  • This work: expand CLIP to audio

Overview

  • Larger dataset with natural text
  • CLAP model with audio encoder
  • Experiments and takeaways

Previous Works Have Limited Data

  • Trained on small datasets
  • Or trained only on label data (e.g. "Cat, Dog, Cars, …")

Ours: Mixture Dataset with Natural Text

  • ~2,000,000 samples, 5,893 hours
  • Both audio-text pairs and audio-label pairs
    • Label example: "Cat, Dog, Cars, …"
    • Caption example: "A muddled noise of broken channel of the TV."

Convert Labels to Text Prompts

For label-only data, construct a text prompt:

Label: cat, dog, car

Text prompt: Sound of cat, dog, and car.
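
A minimal Python sketch of this prompt construction (the exact template used in training may differ):

    def labels_to_prompt(labels):
        # Join class labels into a natural-language caption,
        # e.g. ["cat", "dog", "car"] -> "Sound of cat, dog, and car."
        if len(labels) == 1:
            return f"Sound of {labels[0]}."
        return "Sound of " + ", ".join(labels[:-1]) + f", and {labels[-1]}."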

CLAP: Adding an Audio Encoder to CLIP

[Architecture diagram: a text encoder and an audio encoder, each followed by an MLP projection head; the legend marks which parameters are frozen vs. trained; the two embedding spaces are tied together by symmetric cross-entropy (contrastive) losses.]
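
A minimal PyTorch sketch of the symmetric contrastive loss in the diagram. The projection-head shape and the temperature (logit scale) follow CLIP and are assumptions here, not details read off the slide:

    import torch
    import torch.nn.functional as F

    class ProjectionHead(torch.nn.Module):
        # MLP that maps an encoder's output into the shared embedding space.
        def __init__(self, in_dim, out_dim=512):
            super().__init__()
            self.net = torch.nn.Sequential(
                torch.nn.Linear(in_dim, out_dim),
                torch.nn.GELU(),
                torch.nn.Linear(out_dim, out_dim),
            )

        def forward(self, x):
            return self.net(x)

    def clap_loss(audio_emb, text_emb, logit_scale):
        # Symmetric cross-entropy over the batch similarity matrix, as in CLIP.
        a = F.normalize(audio_emb, dim=-1)           # (B, D)
        t = F.normalize(text_emb, dim=-1)            # (B, D)
        logits = logit_scale * a @ t.T               # (B, B) scaled cosine sims
        labels = torch.arange(len(logits), device=logits.device)
        # Matching audio/text pairs sit on the diagonal; classify in both
        # directions and average the two losses.
        return (F.cross_entropy(logits, labels)
                + F.cross_entropy(logits.T, labels)) / 2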

Audio Encoder

  • CNN-based: Pretrained Audio Neural Networks (PANN), 81M parameters
  • Transformer-based: Hierarchical Token-Semantic Audio Transformer (HTS-AT), 28M parameters

Data Loading

  • Audio: 10 s segments
    • Randomly select a segment from longer audio
    • Zero-pad shorter audio
    • Input representation: Mel-spectrogram, 64 frequency bins
    • Sample rate: 48 kHz, frame size: 480 → 1000 input frames
  • Text:
    • If a clip has more than one caption, randomly select one
    • Tokenized by a sub-word tokenizer (same as GPT); a sketch of the full loading step follows below
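
A sketch of the loading step above using torchaudio; the FFT window size is an assumption (the slide specifies only 64 mel bins, 48 kHz, and a frame size of 480):

    import random
    import torch
    import torchaudio

    SAMPLE_RATE = 48_000
    CLIP_SAMPLES = 10 * SAMPLE_RATE      # 10 s -> 480,000 samples
    HOP = 480                            # frame size 480 -> ~1000 frames

    # 64 mel bins; n_fft=1024 is an assumed value, not from the slide.
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=SAMPLE_RATE, n_fft=1024, hop_length=HOP, n_mels=64)

    def load_example(waveform, captions):
        n = waveform.shape[-1]
        if n > CLIP_SAMPLES:             # randomly select a 10 s segment
            start = random.randint(0, n - CLIP_SAMPLES)
            waveform = waveform[..., start:start + CLIP_SAMPLES]
        elif n < CLIP_SAMPLES:           # zero-pad shorter audio
            waveform = torch.nn.functional.pad(
                waveform, (0, CLIP_SAMPLES - n))
        text = random.choice(captions)   # pick one of possibly many captions
        return mel(waveform), text       # (64 mel bins, ~1000 frames), caption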

Multi-GPU Training with an Efficient Dataloader

  • Server: AWS instance with 8 A100 GPUs
  • Pre-process data into a unified format
  • Pack samples into tars read by WebDataset (sketched below)
  • Avoid processing data in the dataloader or training loop
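
A sketch of this pipeline with the webdataset library; the shard names, sample keys, and FLAC/JSON layout are assumptions for illustration:

    import io
    import json
    import torchaudio
    import webdataset as wds

    # Packing (done once, offline): write pre-processed samples into tar shards.
    sink = wds.TarWriter("shard-000000.tar")
    sink.write({
        "__key__": "sample000000",
        "flac": open("example.flac", "rb").read(),
        "json": {"caption": "Sound of a dog barking."},
    })
    sink.close()

    def decode(sample):
        # Only cheap byte decoding happens at train time; the heavy
        # preprocessing was already done when the shards were written.
        wav, sr = torchaudio.load(io.BytesIO(sample["flac"]))
        caption = json.loads(sample["json"])["caption"]
        return wav, caption

    # Loading: sequential tar reads stream efficiently to all 8 GPUs.
    dataset = (wds.WebDataset("shard-{000000..000123}.tar")
               .shuffle(1000)
               .map(decode))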

Training

  • Keep most hyperparameters the same as CLIP
  • 1e-3 learning rate with warm-up, then cosine learning-rate decay (see the sketch below)
  • AdamW optimizer, 0.1 weight decay
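
A PyTorch sketch of this recipe; the warm-up length and total step count are assumptions, since the slide does not state them:

    import math
    import torch

    def make_optimizer(model, total_steps, warmup_steps=1000):
        opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)

        def lr_lambda(step):
            if step < warmup_steps:          # linear warm-up to the peak LR
                return step / max(1, warmup_steps)
            # cosine decay from the peak LR over the remaining steps
            t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
            return 0.5 * (1.0 + math.cos(math.pi * t))

        sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
        return opt, sched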

Evaluation Metrics

  • Text-to-Audio R@1 / R@5 / R@10 (top-1/5/10 retrieval accuracy, computed as sketched below): larger is better
  • Audio-to-Text R@1 / R@5 / R@10: larger is better
  • Validation loss
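
A sketch of the R@K computation, assuming row i of each embedding matrix corresponds to the i-th ground-truth pair and that the embeddings are L2-normalized:

    import torch

    def recall_at_k(query_emb, target_emb, ks=(1, 5, 10)):
        # Text-to-audio R@K: query=text, target=audio (swap for audio-to-text).
        sims = query_emb @ target_emb.T                   # (N, N) similarities
        ranked = sims.argsort(dim=-1, descending=True)    # targets per query
        truth = torch.arange(len(sims)).unsqueeze(-1)     # ground-truth index
        hits = ranked == truth                            # one hit per row
        return {k: hits[:, :k].any(dim=-1).float().mean().item() for k in ks}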

Exp set 1: Small dataset overfit?

  • Train loss vs. val loss: the model clearly overfits
  • Yet the retrieval-based eval metrics show a quasi-monotonically increasing trend

Exp set 2: Large dataset vs. small dataset

  • Introduced a ~30x larger dataset: AudioSet
  • Less overfitting in val loss on the large dataset
  • Metrics increase quickly at first on the small dataset, then plateau
  • Metrics continue to increase on the large dataset

Exp set 3: Transformer > CNN

  • Although the two models differ in size, the transformer eventually outperforms the CNN
  • Transformer: faster runtime

Exp set 4: Model Scaling w.r.t. Dataset Size

  • Trained on 1x, 0.5x, and 0.25x of the largest dataset
  • Smaller datasets overfit more in val loss
  • Larger datasets give better performance and a better growth rate

Conclusion

  • Collected a large audio dataset
  • Successfully built and trained a CLAP model
  • Presented preliminary experimental results

Limitations and Future Work

  • Collect more data
  • More evaluation and downstream tasks
  • More experiments
  • Better audio encoders