1 of 25

Audio-driven Motion Synthesis

2024. 3. 22.

Yongwoo Lee

2 of 25

Index

  • Background : Audio features

  • Gesture generation
    • Objectives
    • Challenges
    • Recent works

  • Dance generation
    • Objectives
    • Challenges
    • Recent works

3 of 25

Progress of this domain

  • Gesture Synthesis

[Neff et al., TOG 2008]

[GestureDiffuCLIP, SIGGRAPH Asia 2023]

(text-driven)

4 of 25

Progress of this domain

  • Dance Synthesis

[Transflower, SIGGRAPH Asia 2021]

[Listen, denoise, action, SIGGRAPH 2023]

5 of 25

Audio-driven motion synthesis

  • Gesture & Dance
    • Generate realistic motion from a given sound.
    • Extract audio features from the sound, then find motion appropriate to those features.
    • Input : audio (& control inputs such as text), plus previous motion if the model is auto-regressive.

Output : Motion

[Audio]

[Motion]

[Data-driven approach]

[Text](Optional)

6 of 25

Audio-driven motion synthesis

  • How do we handle acoustic signals?

7 of 25

Audio-driven motion synthesis

  • Audio features
    • How should we analyze the audio signal?
      • From time domain to frequency domain
    • What kinds of sound do we hear at time t?
      • Spectrogram helps you analyze the sound.

[Overview of audio processing]
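The time-to-frequency step above can be sketched as a short-time Fourier transform. A minimal numpy version (the frame length of 512 and hop of 128 are illustrative choices, not values from any paper):

```python
import numpy as np

def stft_spectrogram(signal, frame_len=512, hop=128):
    """Magnitude spectrogram: Hann-windowed frames -> FFT -> absolute value."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequency bins: frame_len // 2 + 1 of them
    return np.abs(np.fft.rfft(frames, axis=1))

# A 440 Hz tone at 16 kHz should peak near bin 440 / 16000 * 512 = 14
sr = 16000
t = np.arange(sr) / sr
spec = stft_spectrogram(np.sin(2 * np.pi * 440 * t))
peak_bin = int(spec[0].argmax())
```

Each row of `spec` answers "what kinds of sound do we hear at time t": a column of per-frequency magnitudes for one short window.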

8 of 25

Audio-driven motion synthesis

  • Audio features - MFCC (Mel-Frequency Cepstral Coefficients)
    1. Take the STFT (Short-Time Fourier Transform) of the audio signal.
    2. Map from the Hz scale to the mel scale, which is based on human auditory perception (apply mel filter banks).
    3. Apply the log, then the DCT (Discrete Cosine Transform).
    4. The MFCCs are the coefficients that express the amplitudes of the resulting spectrum.

[Overview of audio processing]
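The four steps above can be sketched in numpy. This is a generic MFCC pipeline, not any paper's implementation; 26 filters and 13 coefficients are common defaults chosen for illustration:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Step 2: triangular filters spaced evenly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def dct2(x, n_coeffs):
    """Step 3b: DCT-II along the last axis, keeping the first n_coeffs."""
    n = x.shape[-1]
    basis = np.cos(np.pi / n * (np.arange(n) + 0.5)[None, :]
                   * np.arange(n_coeffs)[:, None])
    return x @ basis.T

def mfcc(power_spec, sr, n_fft, n_filters=26, n_coeffs=13):
    mel_energies = power_spec @ mel_filterbank(n_filters, n_fft, sr).T
    log_mel = np.log(mel_energies + 1e-10)   # step 3a: log compression
    return dct2(log_mel, n_coeffs)           # step 4: the cepstral coefficients

# One windowed frame of a 440 Hz tone as a toy power spectrum (step 1)
sr, n_fft = 16000, 512
frame = np.sin(2 * np.pi * 440 * np.arange(n_fft) / sr) * np.hanning(n_fft)
power = np.abs(np.fft.rfft(frame))[None, :] ** 2
coeffs = mfcc(power, sr, n_fft)              # shape (1, 13)
```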

9 of 25

Audio-driven motion synthesis

  • Mel-spectrogram & MFCC
    • Both translate audio data into feature vectors.
    • Energy per audio frequency band vs. coefficients describing the spectral shape
    • MFCC decouples the correlation between frequency bands.

[Overview of audio processing]
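The decorrelation claim can be checked on synthetic data: neighbouring mel bands of a smooth spectrum are strongly correlated, and the DCT step largely removes that correlation. A small numpy demo (the AR(1) model across bands is an illustrative stand-in for real log-mel data):

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_bands = 500, 26

# Synthetic log-mel energies: an AR(1) walk across bands, so neighbouring
# frequency bands move together, as they do in real spectra.
log_mel = np.zeros((n_frames, n_bands))
log_mel[:, 0] = rng.standard_normal(n_frames)
for b in range(1, n_bands):
    log_mel[:, b] = 0.95 * log_mel[:, b - 1] + 0.3 * rng.standard_normal(n_frames)

# DCT-II across the band axis (the step that turns log-mel into MFCCs)
basis = np.cos(np.pi / n_bands * (np.arange(n_bands) + 0.5)[None, :]
               * np.arange(n_bands)[:, None])
mfccs = log_mel @ basis.T

def mean_adjacent_corr(x):
    c = np.corrcoef(x, rowvar=False)     # columns = variables
    return float(np.mean(np.abs(np.diag(c, k=1))))

band_corr = mean_adjacent_corr(log_mel)  # high: bands are correlated
coef_corr = mean_adjacent_corr(mfccs)    # much lower after the DCT
```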

10 of 25

Audio-driven motion synthesis

  • (+) Audio features from pre-trained models
    • Rhythmic Gesticulator : used a vq-wav2vec encoder and fine-tuned the model.
    • EDGE : extracts audio representations from Jukebox (memory-efficient and fast extraction).

[Rhythmic Gesticulator]

[EDGE]

11 of 25

Audio-driven motion synthesis

  • Gesture synthesis & dance synthesis:
    how much do they differ?

12 of 25

Gesture Synthesis

  • Objectives
    • Harmony between vocalization and motion, for a natural look
    • Emphasis of speech through non-verbal communication
    • Applications : games (NPCs), films, digital humans
      • Ubisoft, Neo NPC
      • NVIDIA 2022, Audio2Gesture

13 of 25

Gesture Synthesis

  • Challenges
    • Two modalities can be given : audio & text
    • Must match the temporal relation between audio & motion.
      • Speech has irregular beats.
    • Must understand language semantics (e.g. sign language)
    • Some argue there is no direct correlation between speech and motion:
      • There is no single "true" gesture; instead there are many appropriate gesture options.
      • Others classify gestures into 6 categories:
        • Adaptors, emblems, deictics, iconics, metaphorics, and beats

14 of 25

Gesture Synthesis

  • Recent work - ZeroEGGs (Eurographics 2023)
    • A short example motion clip provides the motion style, without additional training.
    • The same speech can generate various gestures depending on the example motion clip (even an unseen motion).
    • Enables style manipulation directly in the latent space.
      • Speech encoder : audio -> 1D convolution layers -> speech vectors
      • Style encoder : motion -> 1D convolution layers -> style vectors
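As a shape-level sketch only (random weights; not ZeroEGGs' actual architecture), the two encoders both reduce a feature sequence with 1D convolutions, and the style encoder additionally pools over time so one clip yields one style vector. The frame counts and channel sizes below are made up for illustration:

```python
import numpy as np

def conv1d(x, w, b):
    """Valid 1D convolution: x is (T, C_in), w is (K, C_in, C_out)."""
    k = w.shape[0]
    windows = np.stack([x[t:t + k] for t in range(x.shape[0] - k + 1)])
    return np.tensordot(windows, w, axes=([1, 2], [0, 1])) + b  # (T-K+1, C_out)

rng = np.random.default_rng(1)

# Speech encoder: per-frame audio features -> per-frame speech vectors
audio_feats = rng.standard_normal((120, 13))          # e.g. 120 frames of MFCCs
speech_vecs = np.maximum(
    conv1d(audio_feats, rng.standard_normal((5, 13, 32)) * 0.1, np.zeros(32)),
    0.0)                                              # ReLU

# Style encoder: example motion clip -> one style vector (pooled over time)
motion_clip = rng.standard_normal((120, 69))          # e.g. 23 joints x 3
style_seq = np.maximum(
    conv1d(motion_clip, rng.standard_normal((5, 69, 32)) * 0.1, np.zeros(32)),
    0.0)
style_vec = style_seq.mean(axis=0)                    # a single latent style code
```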

15 of 25

Gesture Synthesis

  • Recent work - Rhythmic Gesticulator (SIGGRAPH ASIA 2022)
    • Disentangles rhythmic and semantic information:
      extracts high-level and low-level audio features, then correlates them.
    • Unsupervised learning => learns gesture semantics and style without detailed annotations.

16 of 25

Gesture Synthesis

  • Recent work - Rhythmic Gesticulator (SIGGRAPH ASIA 2022)
    • Segments audio sequences based on beat identification.
    • Builds gesture categories using a VQ-VAE.
    • Silent-period hint : needed because the LSTM cannot stop gestures on time by itself.
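Beat-based segmentation can be sketched with a simple spectral-flux onset detector (a generic technique, not Rhythmic Gesticulator's actual beat tracker); frames where the flux peaks mark candidate segment boundaries:

```python
import numpy as np

def onset_envelope(signal, frame_len=512, hop=128):
    """Spectral flux: the summed positive change in magnitude between frames."""
    window = np.hanning(frame_len)
    n = 1 + (len(signal) - frame_len) // hop
    mags = np.abs(np.fft.rfft(
        np.stack([signal[i * hop : i * hop + frame_len] * window
                  for i in range(n)]), axis=1))
    return np.maximum(np.diff(mags, axis=0), 0.0).sum(axis=1)

def beat_frames(flux, threshold):
    """Local maxima of the flux above a threshold = candidate boundaries."""
    return [i for i in range(1, len(flux) - 1)
            if flux[i] > threshold and flux[i] >= flux[i - 1] and flux[i] > flux[i + 1]]

# 2 s of silence at 16 kHz with short 1 kHz clicks every 0.5 s
sr = 16000
clicks = (1000, 9000, 17000, 25000)
sig = np.zeros(2 * sr)
for start in clicks:
    sig[start:start + 400] = np.sin(2 * np.pi * 1000 * np.arange(400) / sr)
flux = onset_envelope(sig)
beats = beat_frames(flux, threshold=0.3 * flux.max())
```

Each detected beat frame lands near one of the click positions; slicing the audio at those frames gives the beat-aligned segments.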

17 of 25

Dance Synthesis

  • Objectives
    • Dance can take various forms, even for a single piece of music and a single genre.
      • In other words, the mapping is too vague to be deterministic.
    • One dance motion can fit various pieces of music,
      and one piece of music can accommodate diverse dance motions.
    • Nevertheless, dance has been transmitted culturally.
    • Freestyle vs. choreography

18 of 25

Dance Synthesis

  • Challenges
    • Multi-modal (audio and motion are separate modalities)
      • Genres and styles can be additional conditions.
    • Match the temporal correlation between audio & motion.
    • Recent works show merely repetitive or local results.
      • Hard to reflect global audio features (global consistency)
    • Temporal consistency
    • Strong dependence on datasets
    • Foot sliding
    • Many papers treat genre as a discrete condition.
      • But music cannot be classified into a single genre. (Genre is not a discrete feature.)

19 of 25

Dance Synthesis

  • Recent work - EDGE: Editable Dance Generation from Music (CVPR 2023)
    • Advantages of diffusion models:
      • In-betweening & joint-wise editing (like in-painting)
      • Arbitrarily long sequences : generate 5-second clips,
        with 2.5 s overlaps when generating the next clip.
    • Easy to use?
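One simple way to realize the overlap idea is a linear crossfade over the shared frames. (EDGE itself enforces consistency during diffusion sampling; this is only an illustrative stand-in, with 150-frame clips standing for 5 s at 30 fps.)

```python
import numpy as np

def stitch(clips, overlap):
    """Chain fixed-length clips, linearly crossfading over `overlap` frames."""
    out = clips[0].astype(float).copy()
    fade_in = np.linspace(0.0, 1.0, overlap)[:, None]   # 0 -> 1 over the overlap
    for clip in clips[1:]:
        head, tail = out[:-overlap], out[-overlap:]
        blended = tail * (1.0 - fade_in) + clip[:overlap] * fade_in
        out = np.concatenate([head, blended, clip[overlap:]])
    return out

# Three "5 s" clips of 150 frames with a 75-frame ("2.5 s") overlap;
# constant values make the crossfade easy to inspect.
clips = [np.full((150, 3), v) for v in (0.0, 1.0, 2.0)]
motion = stitch(clips, overlap=75)   # 150 + 2 * (150 - 75) = 300 frames
```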

20 of 25

Dance Synthesis

  • Recent work - EDGE: Editable Dance generation from music (CVPR 2023)
    • Uses a frozen Jukebox model to encode the input music into embeddings.
    • Eliminates foot-sliding physical implausibilities.
      • Trained with physical realism in mind.
      • Predicts heel and toe contact for each foot, and maintains consistency with its own predictions.
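The "consistency with its own predictions" idea can be sketched as a loss that penalizes foot velocity exactly where the model predicts contact (a simplified form for illustration, not EDGE's exact formulation):

```python
import numpy as np

def contact_consistency_loss(foot_pos, contact_prob):
    """Penalize foot velocity on frames the model itself marks as in contact.

    foot_pos: (T, n_feet, 3) foot joint positions
    contact_prob: (T-1, n_feet) predicted contact probability per transition
    """
    vel = np.diff(foot_pos, axis=0)          # (T-1, n_feet, 3)
    speed_sq = (vel ** 2).sum(axis=-1)       # (T-1, n_feet)
    return float((speed_sq * contact_prob).mean())

# A foot that slides while "in contact" is penalized; a planted foot is not.
T = 10
sliding = np.zeros((T, 1, 3)); sliding[:, 0, 0] = np.linspace(0, 1, T)
planted = np.zeros((T, 1, 3))
contact = np.ones((T - 1, 1))
loss_slide = contact_consistency_loss(sliding, contact)
loss_plant = contact_consistency_loss(planted, contact)
```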

21 of 25

Dance Synthesis

  • Recent work - Rhythm is a Dancer (TVCG 2021)
    • Generates motions with long-term consistency of the global context.
      • Also considers audio beats and music features.
    • Manually extracts and divides audio features into 2 groups.
      • Rhythmic features : rhythm and tempo
      • Spectral features : notes, pitch, melody
    • Hierarchical system with 3 levels
      • Pose : per frame
      • Motif : per motion block
      • Choreography : per piece of music (global context)
    • LSTM + AdaIN layers

22 of 25

Dance Synthesis

  • Recent work - Rhythm is a Dancer (TVCG 2021)

23 of 25

Dance Synthesis

  • Recent work - Audio-driven motion synthesis with diffusion models (SIGGRAPH 2023)
    • Novel network structures and models using diffusion models (inspired by DiffWave)
    • Uses a Conformer instead of a Transformer.
    • Blending and style interpolation : guided diffusion
    • Does not learn any explicit semantics of purposeful gesturing or dancing.

24 of 25

Etc

  • Evaluation methods : how natural is the motion?
    • Several metrics have been proposed.
      • Every paper suggests its own metric to validate its own model.
      • e.g. Fréchet Gesture Distance, Fréchet Template Distance
      • Beat Alignment Score
    • Physical accuracy of ground-contact behaviors
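One common form of the Beat Alignment Score averages, over music beats, a Gaussian of the distance to the nearest kinematic beat (definitions vary between papers; the beat times and σ below are illustrative, and kinematic beats are typically taken at local minima of joint velocity):

```python
import numpy as np

def beat_alignment_score(music_beats, motion_beats, sigma=3.0):
    """Mean exp(-d^2 / (2 sigma^2)) over music beats, where d is the gap
    (in frames) to the nearest kinematic beat; 1.0 = perfect alignment."""
    music = np.asarray(music_beats, float)
    motion = np.asarray(motion_beats, float)
    d = np.abs(music[:, None] - motion[None, :]).min(axis=1)
    return float(np.exp(-d ** 2 / (2 * sigma ** 2)).mean())

perfect = beat_alignment_score([10, 20, 30], [10, 20, 30])  # exactly aligned
offset = beat_alignment_score([10, 20, 30], [13, 23, 33])   # each beat 3 frames late
```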

  • Motion quality seems to depend highly on the datasets.

25 of 25