1 of 25

Audio-driven Motion Synthesis

2024. 3. 22.

Yongwoo Lee

2 of 25

Index

  • Background : Audio features

  • Gesture generation
    • Objectives
    • Challenges
    • Recent works

  • Dance generation
    • Objectives
    • Challenges
    • Recent works

3 of 25

Progress of this domain

  • Gesture Synthesis

[Neff et al., TOG 2008]

[GestureDiffuCLIP, SIGGRAPH Asia 2023]

(text-driven)

4 of 25

Progress of this domain

  • Dance Synthesis

[Transflower, SIGGRAPH Asia 2021]

[Listen, denoise, action, SIGGRAPH 2023]

5 of 25

Audio-driven motion synthesis

  • Gesture & Dance
    • Generate realistic motion from a given sound.
    • Extract audio features from the sound, then find motion appropriate to those features.
    • Input : audio (& control inputs such as text), plus previous motion if the model is auto-regressive.

Output : Motion

[Audio]

[Motion]

[Data-driven approach]

[Text](Optional)

6 of 25

Audio-driven motion synthesis

  • How do we handle acoustic signals?

7 of 25

Audio-driven motion synthesis

  • Audio features
    • How should we analyze the audio signal?
      • From time domain to frequency domain
    • What kinds of sound do we hear at time t?
      • Spectrogram helps you analyze the sound.

[Overview of audio processing]
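The time-to-frequency step above can be sketched as a short-time Fourier transform. A minimal numpy version (the frame length of 512 and hop of 128 are illustrative choices, not values from any paper):

```python
import numpy as np

def stft_spectrogram(signal, frame_len=512, hop=128):
    """Magnitude spectrogram: Hann-windowed frames -> FFT -> absolute value."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequency bins: frame_len // 2 + 1 of them
    return np.abs(np.fft.rfft(frames, axis=1))

# A 440 Hz tone at 16 kHz should peak near bin 440 / 16000 * 512 = 14
sr = 16000
t = np.arange(sr) / sr
spec = stft_spectrogram(np.sin(2 * np.pi * 440 * t))
peak_bin = int(spec[0].argmax())
```

Each row of `spec` answers "what kinds of sound do we hear at time t": a column of per-frequency magnitudes for one short window.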

8 of 25

Audio-driven motion synthesis

  • Audio features - MFCC (Mel-Frequency Cepstral Coefficients)
    1. Take the STFT (Short-Time Fourier Transform) of the audio signal.
    2. Map from the Hz scale to the mel scale, which is based on human auditory perception (apply mel filter banks).
    3. Apply the log, then the DCT (Discrete Cosine Transform).
    4. The MFCCs are the coefficients that express the amplitudes of the resulting spectrum.

[Overview of audio processing]
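The four steps above can be sketched in numpy. This is a generic MFCC pipeline, not any paper's implementation; 26 filters and 13 coefficients are common defaults chosen for illustration:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Step 2: triangular filters spaced evenly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def dct2(x, n_coeffs):
    """Step 3b: DCT-II along the last axis, keeping the first n_coeffs."""
    n = x.shape[-1]
    basis = np.cos(np.pi / n * (np.arange(n) + 0.5)[None, :]
                   * np.arange(n_coeffs)[:, None])
    return x @ basis.T

def mfcc(power_spec, sr, n_fft, n_filters=26, n_coeffs=13):
    mel_energies = power_spec @ mel_filterbank(n_filters, n_fft, sr).T
    log_mel = np.log(mel_energies + 1e-10)   # step 3a: log compression
    return dct2(log_mel, n_coeffs)           # step 4: the cepstral coefficients

# One windowed frame of a 440 Hz tone as a toy power spectrum (step 1)
sr, n_fft = 16000, 512
frame = np.sin(2 * np.pi * 440 * np.arange(n_fft) / sr) * np.hanning(n_fft)
power = np.abs(np.fft.rfft(frame))[None, :] ** 2
coeffs = mfcc(power, sr, n_fft)              # shape (1, 13)
```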

9 of 25

Audio-driven motion synthesis

  • Mel-spectrogram & MFCC
    • Both translate audio data into feature vectors.
    • Energy per audio frequency band vs. coefficients describing the spectral shape
    • MFCC decouples the correlation between frequency bands.

[Overview of audio processing]
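The decorrelation claim can be checked on synthetic data: neighbouring mel bands of a smooth spectrum are strongly correlated, and the DCT step largely removes that correlation. A small numpy demo (the AR(1) model across bands is an illustrative stand-in for real log-mel data):

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_bands = 500, 26

# Synthetic log-mel energies: an AR(1) walk across bands, so neighbouring
# frequency bands move together, as they do in real spectra.
log_mel = np.zeros((n_frames, n_bands))
log_mel[:, 0] = rng.standard_normal(n_frames)
for b in range(1, n_bands):
    log_mel[:, b] = 0.95 * log_mel[:, b - 1] + 0.3 * rng.standard_normal(n_frames)

# DCT-II across the band axis (the step that turns log-mel into MFCCs)
basis = np.cos(np.pi / n_bands * (np.arange(n_bands) + 0.5)[None, :]
               * np.arange(n_bands)[:, None])
mfccs = log_mel @ basis.T

def mean_adjacent_corr(x):
    c = np.corrcoef(x, rowvar=False)     # columns = variables
    return float(np.mean(np.abs(np.diag(c, k=1))))

band_corr = mean_adjacent_corr(log_mel)  # high: bands are correlated
coef_corr = mean_adjacent_corr(mfccs)    # much lower after the DCT
```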

10 of 25

Audio-driven motion synthesis

  • (+) Audio features from pre-trained models
    • Rhythmic Gesticulator : used a vq-wav2vec encoder and fine-tuned the model.
    • EDGE : extracts audio representations from Jukebox (memory-efficient and fast extraction).

[Rhythmic Gesticulator]

[EDGE]

11 of 25

Audio-driven motion synthesis

  • Gesture synthesis & dance synthesis:
    how much do they differ?

12 of 25

Gesture Synthesis

  • Objectives
    • Harmony between vocalization and motion, for a natural look
    • Emphasis of speech through non-verbal communication
    • Applications : games (NPCs), films, digital humans
      • Ubisoft, Neo NPC
      • NVIDIA 2022, Audio2Gesture

13 of 25

Gesture Synthesis

  • Challenges
    • Two modalities can be given : audio & text
    • Must match the temporal relation between audio & motion.
      • Speech has irregular beats.
    • Must understand language semantics (e.g. sign language)
    • Some argue there is no direct correlation between speech and motion:
      • There is no single "true" gesture; instead there are many appropriate gesture options.
      • Others classify gestures into 6 categories:
        • Adaptors, emblems, deictics, iconics, metaphorics, and beats

14 of 25

Gesture Synthesis

  • Recent work - ZeroEGGs (Eurographics 2023)
    • A short example motion clip provides the motion style, without additional training.
    • The same speech can generate various gestures depending on the example motion clip (even an unseen motion).
    • Enables style manipulation directly in the latent space.
      • Speech encoder : audio -> 1D convolution layers -> speech vectors
      • Style encoder : motion -> 1D convolution layers -> style vectors
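As a shape-level sketch only (random weights; not ZeroEGGs' actual architecture), the two encoders both reduce a feature sequence with 1D convolutions, and the style encoder additionally pools over time so one clip yields one style vector. The frame counts and channel sizes below are made up for illustration:

```python
import numpy as np

def conv1d(x, w, b):
    """Valid 1D convolution: x is (T, C_in), w is (K, C_in, C_out)."""
    k = w.shape[0]
    windows = np.stack([x[t:t + k] for t in range(x.shape[0] - k + 1)])
    return np.tensordot(windows, w, axes=([1, 2], [0, 1])) + b  # (T-K+1, C_out)

rng = np.random.default_rng(1)

# Speech encoder: per-frame audio features -> per-frame speech vectors
audio_feats = rng.standard_normal((120, 13))          # e.g. 120 frames of MFCCs
speech_vecs = np.maximum(
    conv1d(audio_feats, rng.standard_normal((5, 13, 32)) * 0.1, np.zeros(32)),
    0.0)                                              # ReLU

# Style encoder: example motion clip -> one style vector (pooled over time)
motion_clip = rng.standard_normal((120, 69))          # e.g. 23 joints x 3
style_seq = np.maximum(
    conv1d(motion_clip, rng.standard_normal((5, 69, 32)) * 0.1, np.zeros(32)),
    0.0)
style_vec = style_seq.mean(axis=0)                    # a single latent style code
```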

15 of 25

Gesture Synthesis

  • Recent work - Rhythmic Gesticulator (SIGGRAPH ASIA 2022)
    • Disentangles rhythmic and semantic information:
      extracts high-level and low-level audio features, then correlates them.
    • Unsupervised learning => learns gesture semantics and style without detailed annotations.

16 of 25

Gesture Synthesis

  • Recent work - Rhythmic Gesticulator (SIGGRAPH ASIA 2022)
    • Segments audio sequences based on beat identification.
    • Builds gesture categories using a VQ-VAE.
    • Silent-period hint : needed because the LSTM cannot stop gestures on time by itself.
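Beat-based segmentation can be sketched with a simple spectral-flux onset detector (a generic technique, not Rhythmic Gesticulator's actual beat tracker); frames where the flux peaks mark candidate segment boundaries:

```python
import numpy as np

def onset_envelope(signal, frame_len=512, hop=128):
    """Spectral flux: the summed positive change in magnitude between frames."""
    window = np.hanning(frame_len)
    n = 1 + (len(signal) - frame_len) // hop
    mags = np.abs(np.fft.rfft(
        np.stack([signal[i * hop : i * hop + frame_len] * window
                  for i in range(n)]), axis=1))
    return np.maximum(np.diff(mags, axis=0), 0.0).sum(axis=1)

def beat_frames(flux, threshold):
    """Local maxima of the flux above a threshold = candidate boundaries."""
    return [i for i in range(1, len(flux) - 1)
            if flux[i] > threshold and flux[i] >= flux[i - 1] and flux[i] > flux[i + 1]]

# 2 s of silence at 16 kHz with short 1 kHz clicks every 0.5 s
sr = 16000
clicks = (1000, 9000, 17000, 25000)
sig = np.zeros(2 * sr)
for start in clicks:
    sig[start:start + 400] = np.sin(2 * np.pi * 1000 * np.arange(400) / sr)
flux = onset_envelope(sig)
beats = beat_frames(flux, threshold=0.3 * flux.max())
```

Each detected beat frame lands near one of the click positions; slicing the audio at those frames gives the beat-aligned segments.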

17 of 25

Dance Synthesis

  • Objectives
    • Dance can take various forms, even for a single piece of music and a single genre.
      • In other words, the mapping is too vague to be deterministic.
    • One dance motion can fit various pieces of music,
      and one piece of music can accommodate diverse dance motions.
    • Nevertheless, dance has been transmitted culturally.
    • Freestyle vs. choreography

18 of 25

Dance Synthesis

  • Challenges
    • Multi-modal (audio and motion are separate modalities)
      • Genres and styles can be additional conditions.
    • Match the temporal correlation between audio & motion.
    • Recent works show merely repetitive or local results.
      • Hard to reflect global audio features (global consistency)
    • Temporal consistency
    • Strong dependence on datasets
    • Foot sliding
    • Many papers treat genre as a discrete condition.
      • But music cannot be classified into a single genre. (Genre is not a discrete feature.)

19 of 25

Dance Synthesis

  • Recent work - EDGE: Editable Dance Generation from Music (CVPR 2023)
    • Advantages of diffusion models:
      • In-betweening & joint-wise editing (like in-painting)
      • Arbitrarily long sequences : generate 5-second clips,
        with 2.5 s overlaps when generating the next clip.
    • Easy to use?
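One simple way to realize the overlap idea is a linear crossfade over the shared frames. (EDGE itself enforces consistency during diffusion sampling; this is only an illustrative stand-in, with 150-frame clips standing for 5 s at 30 fps.)

```python
import numpy as np

def stitch(clips, overlap):
    """Chain fixed-length clips, linearly crossfading over `overlap` frames."""
    out = clips[0].astype(float).copy()
    fade_in = np.linspace(0.0, 1.0, overlap)[:, None]   # 0 -> 1 over the overlap
    for clip in clips[1:]:
        head, tail = out[:-overlap], out[-overlap:]
        blended = tail * (1.0 - fade_in) + clip[:overlap] * fade_in
        out = np.concatenate([head, blended, clip[overlap:]])
    return out

# Three "5 s" clips of 150 frames with a 75-frame ("2.5 s") overlap;
# constant values make the crossfade easy to inspect.
clips = [np.full((150, 3), v) for v in (0.0, 1.0, 2.0)]
motion = stitch(clips, overlap=75)   # 150 + 2 * (150 - 75) = 300 frames
```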

20 of 25

Dance Synthesis

  • Recent work - EDGE: Editable Dance generation from music (CVPR 2023)
    • Uses a frozen Jukebox model to encode the input music into embeddings.
    • Eliminates foot-sliding physical implausibilities.
      • Trained with physical realism in mind.
      • Predicts heel and toe contact for each foot, and maintains consistency with its own predictions.
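The "consistency with its own predictions" idea can be sketched as a loss that penalizes foot velocity exactly where the model predicts contact (a simplified form for illustration, not EDGE's exact formulation):

```python
import numpy as np

def contact_consistency_loss(foot_pos, contact_prob):
    """Penalize foot velocity on frames the model itself marks as in contact.

    foot_pos: (T, n_feet, 3) foot joint positions
    contact_prob: (T-1, n_feet) predicted contact probability per transition
    """
    vel = np.diff(foot_pos, axis=0)          # (T-1, n_feet, 3)
    speed_sq = (vel ** 2).sum(axis=-1)       # (T-1, n_feet)
    return float((speed_sq * contact_prob).mean())

# A foot that slides while "in contact" is penalized; a planted foot is not.
T = 10
sliding = np.zeros((T, 1, 3)); sliding[:, 0, 0] = np.linspace(0, 1, T)
planted = np.zeros((T, 1, 3))
contact = np.ones((T - 1, 1))
loss_slide = contact_consistency_loss(sliding, contact)
loss_plant = contact_consistency_loss(planted, contact)
```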

21 of 25

Dance Synthesis

  • Recent work - Rhythm is a Dancer (TVCG 2021)
    • Generates motions with long-term consistency of the global context.
      • Also considers audio beats and music features.
    • Manually extracts and divides audio features into 2 groups.
      • Rhythmic features : rhythm and tempo
      • Spectral features : notes, pitch, melody
    • Hierarchical system with 3 levels
      • Pose : per frame
      • Motif : per motion block
      • Choreography : per piece of music (global context)
    • LSTM + AdaIN layers

22 of 25

Dance Synthesis

  • Recent work - Rhythm is a Dancer (TVCG 2021)

23 of 25

Dance Synthesis

  • Recent work - Audio-driven motion synthesis with diffusion models (SIGGRAPH 2023)
    • Novel network structures and models using diffusion models (inspired by DiffWave)
    • Uses a Conformer instead of a Transformer.
    • Blending and style interpolation : guided diffusion
    • Does not learn any explicit semantics of purposeful gesturing or dancing.

24 of 25

Etc

  • Evaluation methods : how natural is the motion?
    • Several metrics have been proposed.
      • Every paper suggests its own metric to validate its own model.
      • e.g. Fréchet Gesture Distance, Fréchet Template Distance
      • Beat Alignment Score
    • Physical accuracy of ground-contact behaviors
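One common form of the Beat Alignment Score averages, over music beats, a Gaussian of the distance to the nearest kinematic beat (definitions vary between papers; the beat times and σ below are illustrative, and kinematic beats are typically taken at local minima of joint velocity):

```python
import numpy as np

def beat_alignment_score(music_beats, motion_beats, sigma=3.0):
    """Mean exp(-d^2 / (2 sigma^2)) over music beats, where d is the gap
    (in frames) to the nearest kinematic beat; 1.0 = perfect alignment."""
    music = np.asarray(music_beats, float)
    motion = np.asarray(motion_beats, float)
    d = np.abs(music[:, None] - motion[None, :]).min(axis=1)
    return float(np.exp(-d ** 2 / (2 * sigma ** 2)).mean())

perfect = beat_alignment_score([10, 20, 30], [10, 20, 30])  # exactly aligned
offset = beat_alignment_score([10, 20, 30], [13, 23, 33])   # each beat 3 frames late
```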

  • Motion quality seems to depend highly on the datasets.

25 of 25