MMVA: Multimodal Matching Based on Valence and Arousal across Images, Music, and Musical Captions
Suhwan Choi, Kyu Won Kim, Myungjoo Kang
Artificial Intelligence for Music Workshop at AAAI 2025
Background
Image-music Alignment Example
Music 1
Music 2
“scary music”
Related Work: How to quantify emotion?
S. Zhao et al., ACM MM, 2020
IMEMNet Dataset Sample
Matching Score: 0.095
Matching Score: 0.9965
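The two sample scores above come from VA (valence-arousal) annotations. The exact scoring formula is not shown on the slide, so the sketch below is an assumption for illustration: a score of one minus the normalized Euclidean distance between the image's and the music's VA coordinates, with both values in [0, 1].

```python
import math

def va_matching_score(va_image, va_music):
    """Hypothetical VA-based matching score: 1 minus the Euclidean
    distance between (valence, arousal) pairs, normalized by the
    diagonal of the unit VA square so the result lies in [0, 1]."""
    dist = math.dist(va_image, va_music)
    max_dist = math.sqrt(2.0)  # largest possible distance in [0, 1]^2
    return 1.0 - dist / max_dist
```

Under this assumed formula, emotionally close image-music pairs score near 1 and emotionally distant pairs score near 0, matching the contrast between the two samples above.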
Dataset: IMEMNet-C
IMEMNet-C: Dataset Preparation
Method: Random Multimodal Matching
IMEMNet (CDCML): requires 1:1 matched multimodal data; uses 144,435 predefined image-music pairs.
IMEMNet-C (MMVA: Multimodal Matching based on Valence and Arousal): does not require matched multimodal data; image-music pairs are sampled randomly.
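The random-matching idea above can be sketched in a few lines. Instead of relying on predefined 1:1 pairs, arbitrary image-music pairs are sampled and each pair is labeled with a VA-derived score. The data layout and the L1-based score below are assumptions for illustration, not the paper's exact procedure.

```python
import random

def random_va_pairs(images, musics, num_pairs, seed=0):
    """Sketch of random multimodal matching: sample arbitrary
    image-music pairs, so no predefined 1:1 pairing is needed.
    `images` and `musics` are lists of (id, (valence, arousal))
    with VA values assumed to lie in [0, 1]."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        img_id, img_va = rng.choice(images)
        mus_id, mus_va = rng.choice(musics)
        # Hypothetical supervision signal: VA similarity in [0, 1].
        score = 1.0 - (abs(img_va[0] - mus_va[0])
                       + abs(img_va[1] - mus_va[1])) / 2.0
        pairs.append((img_id, mus_id, score))
    return pairs
```

Because any image can be paired with any music, the number of usable training pairs is no longer capped by a fixed pairing like IMEMNet's 144,435.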
Method: Training Objective
Asymmetrical matching
1. Emotion (VA) Prediction
2. Zero-shot Experiments
3. Ablation Experiments
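The slide names the training objective but does not spell it out. One plausible formulation, sketched below as an assumption, is to regress the cross-modal cosine similarity of image and music embeddings onto the continuous VA matching score, rather than using a binary matched/unmatched label.

```python
def va_matching_loss(img_embs, mus_embs, va_scores):
    """Hypothetical sketch of a VA-supervised matching objective:
    mean squared error between rescaled cosine similarity and the
    continuous VA matching score of each image-music pair."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb)

    loss = 0.0
    for a, b, s in zip(img_embs, mus_embs, va_scores):
        sim01 = (cos(a, b) + 1.0) / 2.0  # rescale [-1, 1] -> [0, 1]
        loss += (sim01 - s) ** 2
    return loss / len(va_scores)
```

Supervising with a continuous score lets randomly sampled pairs contribute graded signal instead of being forced into positive/negative buckets.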
Experiments
Emotion Prediction Results
Image-to-Music Prompt Generation
Prompt Template for Music Prompt Generation
"This is an ambient music piece. The melody is being played by the female vocalist while no instrument is playing in the background. The rhythm is slow. The atmosphere is easy going."
"This is a pop funk music piece. The melody is being played by the male vocalist while synthesiser is playing in the background. The rhythm is fast. The atmosphere is energetic."
"This is a country music piece. The melody is being played by the male vocalist while synthesiser is playing in the background. The rhythm is fast. The atmosphere is energetic."
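The three examples above follow one fixed caption template filled with musical attributes. The field names below (genre, vocalist, instrument, rhythm, atmosphere) are assumptions inferred from the examples, not the paper's stated schema.

```python
def build_music_prompt(genre, vocalist, instrument, rhythm, atmosphere):
    """Fill a caption template, inferred from the slide's examples,
    with musical attributes to produce a music-generation prompt."""
    return (f"This is a {genre} music piece. The melody is being played "
            f"by the {vocalist} vocalist while {instrument} is playing "
            f"in the background. The rhythm is {rhythm}. "
            f"The atmosphere is {atmosphere}.")
```

For example, `build_music_prompt("pop funk", "male", "synthesiser", "fast", "energetic")` reproduces the second caption above (the first caption uses a slight template variant for the no-instrument case).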
Music Prompt Generation Results
Text-Music Retrieval
Video Summarization
Ablation Experiments
Conclusions & Limitations
Thank you!