1 of 23

MMVA: Multimodal Matching Based on Valence and Arousal across Images, Music, and Musical Captions

Suhwan Choi, Kyu Won Kim, Myungjoo Kang

 Artificial Intelligence for Music Workshop at AAAI 2025

2 of 23


  • Multimodal data: data paired across different modalities
  • Conventional multimodal learning requires 1:1-matched pairs across modalities.
  • Such precise 1:1-matched data is difficult to obtain.
  • Music adds a further challenge: its emotional interpretation is subjective.

Background

3 of 23


Background

  • Our multimodal experience is complex.
    • sight, taste, music, smell
    • What does it mean for multimodal senses to be “matching”?

4 of 23


Image-music Alignment Example

Music 1

Music 2

5 of 23


Background

  • Matching image and music means that their emotions are well-aligned.

  • Vision-music interaction: hardly explored in AI

  • How can we generate/recommend music based on images and videos?

“scary music”

6 of 23


  • Cross-modal Deep Continuous Metric Learning (CDCML)
    • Image-Music-Emotion-Matching-Net (IMEMNet) dataset
      • image-music matching based on valence and arousal
    • Scores emotion as a two-dimensional vector:
      • Valence: how positive or negative the emotion is
      • Arousal: the intensity of the emotion

Related Work: How to quantify emotion?

S. Zhao et al., ACM MM, 2020
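In IMEMNet, the image-music matching score is derived from the distance between the two VA annotations. As a rough sketch (not the paper's exact formula), a score in [0, 1] can be obtained from two VA vectors like this:

```python
import math

def va_matching_score(va_a, va_b):
    """Similarity in [0, 1] between two (valence, arousal) vectors in [0, 1]^2.
    Illustrative sketch: 1 minus the normalized Euclidean distance."""
    dist = math.dist(va_a, va_b)   # Euclidean distance between VA points
    max_dist = math.sqrt(2)        # diagonal of the unit square
    return 1.0 - dist / max_dist

# Nearly identical emotions yield a score near 1; opposite corners yield 0.
score = va_matching_score((0.9, 0.8), (0.85, 0.75))
```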

7 of 23


IMEMNet Dataset Sample

Matching Score 0.095

Matching Score 0.9965

8 of 23


Dataset: IMEMNet-C

  • Added a text modality to the IMEMNet dataset.
    • Each song receives a musical caption describing genre, instruments, tempo, etc.
    • Captions are generated from the songs with LP-MusicCaps.
    • 1:1 matching between songs and captions
      • Each caption shares the VA scores of its song.
    • Incorporating text expands the scope of zero-shot tasks:
      • retrieving music from a text query
      • musical prompt generation

9 of 23


IMEMNet-C: Dataset Preparation

  • Generate musical caption with audio-to-text model (LP-MusicCaps)
  • Generated captions include redundant phrases about audio quality. Examples:
    • The low quality recording features a mellow synth pad playing in the background. It sounds calming, relaxing and hypnotic.
    • The low quality recording features a rock song that consists of a passionate male vocal singing over sustained melody, groovy bass, and punchy snare.
    • This song contains an acoustic piano playing a minor chord composition … The audio quality is poor.
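One lightweight way to handle such phrases is a pair of regex rewrites. This is an illustrative sketch, not the actual cleaning pipeline; `clean_caption` and its patterns are assumptions:

```python
import re

def clean_caption(caption: str) -> str:
    """Hypothetical cleaner for LP-MusicCaps output: rewrite or drop
    audio-quality remarks while keeping the musical content."""
    # "The low quality recording features ..." -> "The recording features ..."
    caption = re.sub(r"[Tt]he (?:low|poor)[- ]quality (recording|audio)",
                     r"The \1", caption)
    # Drop standalone quality sentences such as "The audio quality is poor."
    caption = re.sub(r"\s*The audio quality is (?:poor|low)\.", "", caption)
    return re.sub(r"\s+", " ", caption).strip()

cleaned = clean_caption("The low quality recording features a rock song.")
# -> "The recording features a rock song."
```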

10 of 23


IMEMNet-C: Dataset Preparation

11 of 23


Method: Random Multimodal Matching

Requires 1:1-matched multimodal data:

  • IMEMNet (CDCML): 144,435 predefined image-music pairs

Does not require matched multimodal data:

  • IMEMNet-C: 24,756 images and 25,944 songs
  • #images × #songs ≈ 642 million possible pairs!

MMVA: Multimodal Matching based on Valence and Arousal
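The sampling idea above can be sketched as follows. Instead of iterating over a fixed pair list, any image can be paired with any song on the fly; `va_score` is a placeholder for whatever function maps two VA annotations to a similarity target:

```python
import random

def sample_batch(images, songs, batch_size, va_score):
    """Random multimodal matching sketch: pair images and songs at random,
    deriving the supervision target from their VA annotations rather than
    from a predefined pair list.
    `images`/`songs` are lists of (features, (valence, arousal)) tuples."""
    batch = []
    for _ in range(batch_size):
        img_feat, img_va = random.choice(images)
        song_feat, song_va = random.choice(songs)
        batch.append((img_feat, song_feat, va_score(img_va, song_va)))
    return batch
```

Because pairs are drawn at random, the effective training set covers all #images × #songs combinations instead of a fixed subset.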

12 of 23


  • Image encoder: CLIP image encoder
  • Music encoder: MERT
  • Text encoder: RoBERTa

Method: Training Objective

Asymmetrical matching

  • Image-text: not 1:1 matched
  • Image-song: not 1:1 matched
  • Music-text: 1:1 matched
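A hedged sketch of how the supervised signals might combine: VA regression for each modality plus a similarity-prediction term whose target is derived from the ground-truth VA annotations. This is an illustration of the idea, not the paper's exact objective; all names here are assumptions:

```python
def mmva_loss(pred_va_img, pred_va_music, pred_sim,
              true_va_img, true_va_music, va_score, weight=1.0):
    """Illustrative combined objective: per-modality VA regression (MSE)
    plus a similarity-prediction term. `va_score` maps two ground-truth
    VA vectors to the similarity target."""
    mse = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    va_loss = mse(pred_va_img, true_va_img) + mse(pred_va_music, true_va_music)
    sim_target = va_score(true_va_img, true_va_music)  # derived, not annotated
    sim_loss = (pred_sim - sim_target) ** 2
    return va_loss + weight * sim_loss
```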

13 of 23

1. Emotion (VA) Prediction

      • supervised task

2. Zeroshot Experiments

      • Image-to-Music prompt generation
      • Text-music retrieval
      • Video summarization

3.Ablation Experiments


Experiments

14 of 23


Emotion Prediction Results

  • Metric
    • MSE: Mean Squared Error
    • MAE: Mean Absolute Error

  • Split ratio for image and music data
    • Train: 85%, validation 5%, test 10%

  • Outperforms prior methods on all metrics except image-arousal MAE.

15 of 23

  • Genre
    • Template: 'This is a {genre} music piece'
    • Candidates: 'rock', 'pop funk', 'ambient', 'metal', 'electronic', 'hip hop', 'jazz', 'classic', 'country'
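Template-based zero-shot genre selection can be sketched as below: fill the template with each candidate, embed the prompts, and pick the candidate closest to the image embedding. `embed_image`, `embed_text`, and `cosine` are hypothetical stand-ins for the trained encoders and a similarity function:

```python
GENRES = ['rock', 'pop funk', 'ambient', 'metal', 'electronic',
          'hip hop', 'jazz', 'classic', 'country']

def pick_genre(image, embed_image, embed_text, cosine):
    """Zero-shot genre choice: score each templated prompt against the
    image embedding and return the best-matching candidate."""
    img_vec = embed_image(image)
    prompts = [f"This is a {g} music piece" for g in GENRES]
    scores = [cosine(img_vec, embed_text(p)) for p in prompts]
    return GENRES[max(range(len(GENRES)), key=scores.__getitem__)]
```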


Image-to-Music Prompt Generation

16 of 23

Prompt Template for Music Prompt Generation

17 of 23


This is a ambient music piece. The melody is being played by the female vocalist while no instrument is playing in the background. The rhythm is slow. The atmosphere is easy going.

This is a pop funk music piece. The melody is being played by the male vocalist while synthesiser is playing in the background. The rhythm is fast. The atmosphere is energetic.

This is a country music piece. The melody is being played by the male vocalist while synthesiser is playing in the background. The rhythm is fast. The atmosphere is energetic.

Music Prompt Generation Results

18 of 23


Text-Music Retrieval

  • Competitive performance against specialized methods

19 of 23


Video Summarization

  • Assumption: emotion is heightened or extreme at the highlights of a movie.

  • Predict emotion (arousal) scores for the video's frames with the MMVA image encoder.

  • Score a frame every 2 seconds based on its arousal.

  • Use the knapsack algorithm to remove unimportant (low-scoring) segments until 15% of the video length remains.
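The segment-selection step can be sketched as a standard 0/1 knapsack over integer segment lengths (a generic implementation, not the authors' code): maximize the total arousal of kept segments subject to a length budget such as `int(0.15 * total_seconds)`.

```python
def knapsack_summary(scores, lengths, budget):
    """0/1 knapsack: choose segments maximizing total arousal score subject
    to a total-length budget. Returns kept segment indices in temporal order."""
    n = len(scores)
    # dp[i][c] = best score using the first i segments within capacity c
    dp = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, v = lengths[i - 1], scores[i - 1]
        for c in range(budget + 1):
            dp[i][c] = dp[i - 1][c]
            if w <= c:
                dp[i][c] = max(dp[i][c], dp[i - 1][c - w] + v)
    # Backtrack to recover which segments were kept.
    keep, c = [], budget
    for i in range(n, 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            keep.append(i - 1)
            c -= lengths[i - 1]
    return sorted(keep)
```

With uniform 2-second segments this reduces to keeping the top-scoring 15% of segments; the DP form also handles variable-length segments.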

20 of 23


Ablation Experiments

  • No random multimodal matching
    • uses the predefined train/val/test splits from the original IMEMNet dataset
  • No similarity predictor
    • removes the similarity predictor during training

21 of 23


Ablation Experiments

  • The full model outperforms all ablation settings.

  • This shows that random multimodal matching and the similarity predictor are effective design choices.

22 of 23

  • Proposed a multimodal matching method that does not require predefined multimodal pairs.
  • Outperformed previous multimodal VA prediction methods.
  • Introduced new zero-shot tasks based on valence and arousal.
  • Limitations:
    • MMVA may not capture other, non-emotional aspects of multimodal relationships.
      • e.g. semantic alignment
    • Emotion similarity may not be generalizable across diverse cultural and demographic contexts.


Conclusions & Limitations

23 of 23

Thank you!