1 of 23

MMVA: Multimodal Matching Based on Valence and Arousal across Images, Music, and Musical Captions

Suhwan Choi, Kyu Won Kim, Myungjoo Kang

 Artificial Intelligence for Music Workshop at AAAI 2025

2 of 23


  • Multimodal data: data paired across different modalities
  • Conventional multimodal learning requires 1:1-matched pairs across modalities.
  • Such precise 1:1-matched data is difficult to obtain.
  • Music adds a further challenge: its emotional interpretation is subjective.

Background

3 of 23


Background

  • Our multimodal experience is complex.
    • sight, taste, music, smell
    • What does it mean for multimodal senses to be “matching”?

4 of 23


Image-music Alignment Example

Music 1

Music 2

5 of 23


Background

  • Matching image and music means that their emotions are well-aligned.

  • Vision-music interaction: hardly explored in AI

  • How can we generate/recommend music based on images and videos?

“scary music”

6 of 23


  • Cross-modal Deep Continuous Metric Learning (CDCML)
    • Image-Music-Emotion-Matching-Net (IMEMNet) dataset
      • image-music matching based on valence and arousal
    • Scores emotion as a two-dimensional vector:
      • Valence: how positive or negative the emotion is
      • Arousal: the intensity of the emotion

Related Work: How to quantify emotion?

S. Zhao et al., ACM MM, 2020
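In IMEMNet, the image-music matching score is derived from the distance between the two VA annotations. As a rough sketch (not the paper's exact formula), a score in [0, 1] can be obtained from two VA vectors like this:

```python
import math

def va_matching_score(va_a, va_b):
    """Similarity in [0, 1] between two (valence, arousal) vectors in [0, 1]^2.
    Illustrative sketch: 1 minus the normalized Euclidean distance."""
    dist = math.dist(va_a, va_b)   # Euclidean distance between VA points
    max_dist = math.sqrt(2)        # diagonal of the unit square
    return 1.0 - dist / max_dist

# Nearly identical emotions yield a score near 1; opposite corners yield 0.
score = va_matching_score((0.9, 0.8), (0.85, 0.75))
```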

7 of 23


IMEMNet Dataset Sample

Matching Score 0.095

Matching Score 0.9965

8 of 23


Dataset: IMEMNet-C

  • Added a text modality to the IMEMNet dataset.
    • Each song receives a musical caption describing genre, instruments, tempo, etc.
    • Captions are generated from the songs with LP-MusicCaps.
    • 1:1 matching between songs and captions
      • Each caption shares the VA scores of its song.
    • Incorporating text expands the scope of zero-shot tasks:
      • retrieving music from a text query
      • musical prompt generation

9 of 23


IMEMNet-C: Dataset Preparation

  • Generate musical caption with audio-to-text model (LP-MusicCaps)
  • Generated captions include redundant phrases about audio quality. Examples:
    • The low quality recording features a mellow synth pad playing in the background. It sounds calming, relaxing and hypnotic.
    • The low quality recording features a rock song that consists of a passionate male vocal singing over sustained melody, groovy bass, and punchy snare.
    • This song contains an acoustic piano playing a minor chord composition … The audio quality is poor.
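One lightweight way to handle such phrases is a pair of regex rewrites. This is an illustrative sketch, not the actual cleaning pipeline; `clean_caption` and its patterns are assumptions:

```python
import re

def clean_caption(caption: str) -> str:
    """Hypothetical cleaner for LP-MusicCaps output: rewrite or drop
    audio-quality remarks while keeping the musical content."""
    # "The low quality recording features ..." -> "The recording features ..."
    caption = re.sub(r"[Tt]he (?:low|poor)[- ]quality (recording|audio)",
                     r"The \1", caption)
    # Drop standalone quality sentences such as "The audio quality is poor."
    caption = re.sub(r"\s*The audio quality is (?:poor|low)\.", "", caption)
    return re.sub(r"\s+", " ", caption).strip()

cleaned = clean_caption("The low quality recording features a rock song.")
# -> "The recording features a rock song."
```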

10 of 23


IMEMNet-C: Dataset Preparation

11 of 23


Method: Random Multimodal Matching

Requires 1:1-matched multimodal data:

  • IMEMNet (CDCML): 144,435 predefined image-music pairs

Does not require matched multimodal data:

  • IMEMNet-C: 24,756 images and 25,944 songs
  • #images × #songs ≈ 642 million possible pairs!

MMVA: Multimodal Matching based on Valence and Arousal
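The sampling idea above can be sketched as follows. Instead of iterating over a fixed pair list, any image can be paired with any song on the fly; `va_score` is a placeholder for whatever function maps two VA annotations to a similarity target:

```python
import random

def sample_batch(images, songs, batch_size, va_score):
    """Random multimodal matching sketch: pair images and songs at random,
    deriving the supervision target from their VA annotations rather than
    from a predefined pair list.
    `images`/`songs` are lists of (features, (valence, arousal)) tuples."""
    batch = []
    for _ in range(batch_size):
        img_feat, img_va = random.choice(images)
        song_feat, song_va = random.choice(songs)
        batch.append((img_feat, song_feat, va_score(img_va, song_va)))
    return batch
```

Because pairs are drawn at random, the effective training set covers all #images × #songs combinations instead of a fixed subset.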

12 of 23


  • Image encoder: CLIP image encoder
  • Music encoder: MERT
  • Text encoder: RoBERTa

Method: Training Objective

Asymmetrical matching

  • Image-text: not 1:1 matched
  • Image-song: not 1:1 matched
  • Music-text: 1:1 matched
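A hedged sketch of how the supervised signals might combine: VA regression for each modality plus a similarity-prediction term whose target is derived from the ground-truth VA annotations. This is an illustration of the idea, not the paper's exact objective; all names here are assumptions:

```python
def mmva_loss(pred_va_img, pred_va_music, pred_sim,
              true_va_img, true_va_music, va_score, weight=1.0):
    """Illustrative combined objective: per-modality VA regression (MSE)
    plus a similarity-prediction term. `va_score` maps two ground-truth
    VA vectors to the similarity target."""
    mse = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    va_loss = mse(pred_va_img, true_va_img) + mse(pred_va_music, true_va_music)
    sim_target = va_score(true_va_img, true_va_music)  # derived, not annotated
    sim_loss = (pred_sim - sim_target) ** 2
    return va_loss + weight * sim_loss
```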

13 of 23

1. Emotion (VA) Prediction

      • supervised task

2. Zeroshot Experiments

      • Image-to-Music prompt generation
      • Text-music retrieval
      • Video summarization

3.Ablation Experiments


Experiments

14 of 23


Emotion Prediction Results

  • Metric
    • MSE: Mean Squared Error
    • MAE: Mean Absolute Error

  • Split ratio for image and music data
    • Train: 85%, validation 5%, test 10%

  • Outperforms prior methods on all metrics except image-arousal MAE.

15 of 23

  • Genre
    • Template: 'This is a {genre} music piece'
    • Candidates: 'rock', 'pop funk', 'ambient', 'metal', 'electronic', 'hip hop', 'jazz', 'classic', 'country'
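Template-based zero-shot genre selection can be sketched as below: fill the template with each candidate, embed the prompts, and pick the candidate closest to the image embedding. `embed_image`, `embed_text`, and `cosine` are hypothetical stand-ins for the trained encoders and a similarity function:

```python
GENRES = ['rock', 'pop funk', 'ambient', 'metal', 'electronic',
          'hip hop', 'jazz', 'classic', 'country']

def pick_genre(image, embed_image, embed_text, cosine):
    """Zero-shot genre choice: score each templated prompt against the
    image embedding and return the best-matching candidate."""
    img_vec = embed_image(image)
    prompts = [f"This is a {g} music piece" for g in GENRES]
    scores = [cosine(img_vec, embed_text(p)) for p in prompts]
    return GENRES[max(range(len(GENRES)), key=scores.__getitem__)]
```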


Image-to-Music Prompt Generation

16 of 23

Prompt Template for Music Prompt Generation

17 of 23


This is a ambient music piece. The melody is being played by the female vocalist while no instrument is playing in the background. The rhythm is slow. The atmosphere is easy going.

This is a pop funk music piece. The melody is being played by the male vocalist while synthesiser is playing in the background. The rhythm is fast. The atmosphere is energetic.

This is a country music piece. The melody is being played by the male vocalist while synthesiser is playing in the background. The rhythm is fast. The atmosphere is energetic.

Music Prompt Generation Results

18 of 23


Text-Music Retrieval

  • Competitive performance against specialized methods

19 of 23


Video Summarization

  • Assumption: emotion is heightened or extreme at the highlights of a movie.

  • Predict emotion (arousal) scores for the video's frames with the MMVA image encoder.

  • Score a frame every 2 seconds based on its arousal.

  • Use the knapsack algorithm to remove unimportant (low-scoring) segments until 15% of the video length remains.
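The segment-selection step can be sketched as a standard 0/1 knapsack over integer segment lengths (a generic implementation, not the authors' code): maximize the total arousal of kept segments subject to a length budget such as `int(0.15 * total_seconds)`.

```python
def knapsack_summary(scores, lengths, budget):
    """0/1 knapsack: choose segments maximizing total arousal score subject
    to a total-length budget. Returns kept segment indices in temporal order."""
    n = len(scores)
    # dp[i][c] = best score using the first i segments within capacity c
    dp = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, v = lengths[i - 1], scores[i - 1]
        for c in range(budget + 1):
            dp[i][c] = dp[i - 1][c]
            if w <= c:
                dp[i][c] = max(dp[i][c], dp[i - 1][c - w] + v)
    # Backtrack to recover which segments were kept.
    keep, c = [], budget
    for i in range(n, 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            keep.append(i - 1)
            c -= lengths[i - 1]
    return sorted(keep)
```

With uniform 2-second segments this reduces to keeping the top-scoring 15% of segments; the DP form also handles variable-length segments.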

20 of 23


Ablation Experiments

  • No random multimodal matching
    • uses the predefined train/val/test splits from the original IMEMNet dataset
  • No similarity predictor
    • removes the similarity predictor during training

21 of 23


Ablation Experiments

  • The full model outperforms all ablation settings.

  • This shows that random multimodal matching and the similarity predictor are effective design choices.

22 of 23

  • Proposed a multimodal matching method that does not require predefined multimodal pairs.
  • Outperformed previous multimodal VA prediction methods.
  • Introduced new zero-shot tasks based on valence and arousal.
  • Limitations:
    • MMVA may not capture other, non-emotional aspects of multimodal relationships.
      • e.g. semantic alignment
    • Emotion similarity may not be generalizable across diverse cultural and demographic contexts.


Conclusions & Limitations

23 of 23

Thank you!