Humans communicate through multiple modalities:
Gestures
Speech
Language
Facial expression
Emotion
Can we model humans’ multimodal behavior with one single model?
The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion
Changan Chen*, Juze Zhang*, Shrinidhi Kowshika Lakshmikanth*, Yusu Fang, Ruizhi Shao, Gordon Wetzstein, Li Fei-Fei, Ehsan Adeli. Stanford University. *indicates equal contribution.
[Figure: Multimodal vocabulary. The spoken phrase "I'm excited!" is discretized into modality-specific codebooks, each contributing tokens to one shared vocabulary: face (<face_id_1><face_id_2> ...), hand (<hand_id_1><hand_id_2> ...), upper body (<upper_id_1><upper_id_2> ...), lower body (<lower_id_1><lower_id_2> ...), audio (<audio_id_1><audio_id_2> ...), and text (<text_id_1><text_id_2> ...).]
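The idea of a shared multimodal vocabulary can be sketched in a few lines: each modality's codebook indices are offset into disjoint ranges of one global token ID space. The codebook sizes below are illustrative placeholders, not the paper's actual values.

```python
# Sketch: merge per-modality codebooks into one unified vocabulary by
# offsetting each codebook's local indices into a disjoint global ID range.
def build_vocab(codebook_sizes):
    """Map each (modality, local_index) pair to a unique global token ID."""
    vocab, offset = {}, 0
    for modality, size in codebook_sizes.items():
        for i in range(size):
            vocab[(modality, i)] = offset + i
        offset += size
    return vocab

# Illustrative sizes only; the real codebook sizes may differ.
codebook_sizes = {
    "face": 256, "hand": 256, "upper": 256,
    "lower": 256, "audio": 500, "text": 1000,
}
vocab = build_vocab(codebook_sizes)

# Tokens from different modalities now live in one shared ID space:
print(vocab[("face", 2)])   # 2
print(vocab[("hand", 2)])   # 258
print(vocab[("audio", 0)])  # 1024
```

Because every modality maps into the same ID space, a single language model can consume and emit mixed-token sequences without knowing which modality a token came from.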
[Figure: Tokenization and model architecture. Motion is quantized by a motion tokenizer (a motion encoder and decoder with four codebooks, one each for the face, hands, upper body, and lower body); audio is tokenized with HuBERT; text is tokenized with SentencePiece. Input mixed tokens drawn from the motion, audio, and text codebooks pass through language encoders, whose outputs serve as key and value for autoregressive decoders that generate output mixed tokens by next-token prediction, starting from <bos>.]
We use language models to connect the language, speech, and motion modalities. With generative pre-training followed by task-specific post-training, our model supports a range of motion understanding and generation tasks.
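The generative pre-training objective is standard next-token prediction over mixed-token sequences. Below is a minimal, dependency-free sketch of that objective (average cross-entropy of each ground-truth next token given the prefix); the toy logits and 4-token vocabulary are invented for illustration.

```python
import math

# Sketch of the next-token-prediction loss over a mixed token sequence.
def nll_next_token(logits_per_step, targets):
    """Average negative log-likelihood of each target token.

    logits_per_step[t] holds unnormalized scores over the vocabulary for
    position t; targets[t] is the ground-truth next token at that position.
    """
    total = 0.0
    for logits, target in zip(logits_per_step, targets):
        z = sum(math.exp(x) for x in logits)  # softmax normalizer
        total += -math.log(math.exp(logits[target]) / z)
    return total / len(targets)

# Toy example: 4-token vocabulary, 3 prediction steps after <bos>.
logits = [[2.0, 0.1, 0.1, 0.1],
          [0.1, 2.0, 0.1, 0.1],
          [0.1, 0.1, 0.1, 2.0]]
targets = [0, 1, 3]
loss = nll_next_token(logits, targets)
print(round(loss, 3))
```

In practice this loss is computed in parallel across the sequence by a transformer decoder; the scalar loop above only makes the objective explicit.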
Co-speech Gesture Generation
GT
EMAGE
Ours
Liu et al., EMAGE, CVPR'24
Editable Gesture Generation
Can you generate the lower body movement of a man walking in a circular path, and generate the upper body movement according to the speech?
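Editable generation amounts to assembling a conditioning prompt in which some body-part tokens are given and the rest are left for the decoder to fill in. The sketch below shows one plausible way such a mixed-token prompt could be laid out; the separator tokens and helper function are hypothetical, not the paper's actual interface.

```python
# Hedged sketch of a mixed-token prompt for editable gesture generation:
# lower-body motion tokens are supplied as conditioning, and the decoder
# would emit the matching upper-body tokens after <bos>. All separator
# token names and this helper are illustrative assumptions.
def build_prompt(text_tokens, audio_tokens, lower_tokens):
    """Concatenate conditioning streams into one decoder prompt."""
    return (["<text>"] + text_tokens
            + ["<audio>"] + audio_tokens
            + ["<lower>"] + lower_tokens
            + ["<bos>"])

prompt = build_prompt(
    ["<text_id_1>", "<text_id_2>"],
    ["<audio_id_1>", "<audio_id_2>"],
    ["<lower_id_1>", "<lower_id_2>"],
)
print(prompt[-1])  # <bos>
```

The key design point is that editing a single body part requires no architectural change: it is just a different arrangement of conditioning tokens in the same unified vocabulary.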
Text-to-motion Generation
T2M-GPT
MDM
MotionGPT
Ours
A person is walking normally in a circle
Zhang et al., T2M-GPT, CVPR'23
Tevet et al., MDM, ICLR'23
Jiang et al., MotionGPT, NeurIPS'23
Emotion Understanding
What emotion is conveyed by the movements in the body motion?
The person expresses the emotion of “contempt”
Ours
Thank you for watching!