Contrastive Language-Audio Pre-training:
Expanding CLIP to Audio
Yusong Wu*, Tianyu Zhang*, Ke Chen*, Richard Vencu, Christoph Schuhmann
*: equal contribution
Collaborators
CLIP: Unsupervised Multi-modal Contrastive Learning
CLIP: SOTA Performance
CLIP: Foundation of Other Models
Expanding CLIP to Other Modalities
Overview
Previous Works Have Limited Data
Cat, Dog, Cars, …
Ours: Mixture Dataset with Natural Text
Cat, Dog, Cars, …
A muddled noise of broken channel of the TV. …
Convert Labels to Text Prompts
For labeled data, construct a text prompt:
Labels: cat, dog, car
Text prompt: The sound of cat, dog, and car.
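This label-to-prompt conversion can be sketched as a simple templating function (the exact template wording here is illustrative, not necessarily the one used in the project):

```python
def labels_to_prompt(labels):
    """Turn a list of class labels into a natural-language prompt.

    The template string is an illustrative assumption; the actual
    prompt wording used in training may differ.
    """
    if not labels:
        return ""
    if len(labels) == 1:
        return f"The sound of {labels[0]}."
    return "The sound of " + ", ".join(labels[:-1]) + f", and {labels[-1]}."

print(labels_to_prompt(["cat", "dog", "car"]))
# prints: The sound of cat, dog, and car.
```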
CLAP: Adding Audio Encoder to CLIP
MLP
MLP
Text encoder
Audio encoder
Cross-entropy Loss
Cross-entropy Loss
Frozen Parameters
Trainable Parameters
Losses
Text / Audio Networks
Audio Encoder
Pretrained Audio Neural Networks (PANN)
81M parameters
Audio Encoder
Hierarchical Token-Semantic Audio Transformer (HTS-AT), 28M parameters
Data Loading
Multi-GPU Training with Efficient Dataloader
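A common way to keep many GPUs fed efficiently is to pre-shard the dataset and hand each (rank, worker) pair a disjoint subset of shards, so no two processes read the same file. A generic sketch, assuming tar-style shards (the shard naming and worker counts are illustrative, not the project's actual pipeline):

```python
def shards_for_worker(shards, rank, world_size, worker_id, num_workers):
    """Split a global shard list so every (rank, worker) pair reads a
    disjoint subset. A stride-based split keeps the load balanced."""
    per_rank = shards[rank::world_size]      # split across GPU ranks
    return per_rank[worker_id::num_workers]  # then across dataloader workers

shards = [f"shard-{i:05d}.tar" for i in range(8)]
# 2 GPUs with 2 dataloader workers each -> 4 disjoint subsets
print(shards_for_worker(shards, rank=0, world_size=2, worker_id=0, num_workers=2))
# prints: ['shard-00000.tar', 'shard-00004.tar']
```

The disjoint split avoids both duplicate samples across GPUs and redundant disk reads, which matters once the dataset no longer fits in memory.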
Training
Evaluation Metrics
Top 1/5/10 Accuracy
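Top-1/5/10 retrieval accuracy can be computed directly from the audio-text similarity matrix; a minimal sketch, assuming matched pairs share a row index:

```python
import numpy as np

def topk_accuracy(similarity, k):
    """Fraction of queries whose true match (the diagonal entry)
    ranks within the top-k of its row of the similarity matrix."""
    n = similarity.shape[0]
    # indices of the k highest-scoring candidates per query
    topk = np.argsort(-similarity, axis=1)[:, :k]
    hits = sum(i in topk[i] for i in range(n))
    return hits / n

sim = np.array([[0.9, 0.1, 0.0],
                [0.2, 0.8, 0.3],
                [0.6, 0.5, 0.4]])  # row 2's true match only ranks 3rd
print(topk_accuracy(sim, 1))  # 2 of 3 matches rank first
print(topk_accuracy(sim, 3))  # all matches fall within the top 3
```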
Exp set 1: Small dataset overfit?
Exp set 2: Large dataset vs small dataset
Exp set 3: Transformer > CNN
Exp set 4: Model Scaling wrt. Dataset Size
Conclusion
Limitations and Future Works