Humans communicate through multiple modalities:
Gestures
Speech
Language
Facial expression
Emotion
Can we model humans’ multimodal behavior with one single model?
The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion
Changan Chen*, Juze Zhang*, Shrinidhi Kowshika Lakshmikanth*, Yusu Fang, Ruizhi Shao, Gordon Wetzstein, Li Fei-Fei, Ehsan Adeli. Stanford University. *indicates equal contribution.
[Figure: Multimodal vocabulary. The spoken phrase "I'm excited!" is discretized into modality-specific codebooks, each contributing tokens to one shared vocabulary: face (<face_id_1><face_id_2> ...), hand (<hand_id_1><hand_id_2> ...), upper body (<upper_id_1><upper_id_2> ...), lower body (<lower_id_1><lower_id_2> ...), audio (<audio_id_1><audio_id_2> ...), and text (<text_id_1><text_id_2> ...).]
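The idea of a shared multimodal vocabulary can be sketched in a few lines: each modality's codebook indices are offset into disjoint ranges of one global token ID space. The codebook sizes below are illustrative placeholders, not the paper's actual values.

```python
# Sketch: merge per-modality codebooks into one unified vocabulary by
# offsetting each codebook's local indices into a disjoint global ID range.
def build_vocab(codebook_sizes):
    """Map each (modality, local_index) pair to a unique global token ID."""
    vocab, offset = {}, 0
    for modality, size in codebook_sizes.items():
        for i in range(size):
            vocab[(modality, i)] = offset + i
        offset += size
    return vocab

# Illustrative sizes only; the real codebook sizes may differ.
codebook_sizes = {
    "face": 256, "hand": 256, "upper": 256,
    "lower": 256, "audio": 500, "text": 1000,
}
vocab = build_vocab(codebook_sizes)

# Tokens from different modalities now live in one shared ID space:
print(vocab[("face", 2)])   # 2
print(vocab[("hand", 2)])   # 258
print(vocab[("audio", 0)])  # 1024
```

Because every modality maps into the same ID space, a single language model can consume and emit mixed-token sequences without knowing which modality a token came from.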
[Figure: Tokenization and model architecture. Motion is quantized by a motion tokenizer (a motion encoder and decoder with four codebooks, one each for the face, hands, upper body, and lower body); audio is tokenized with HuBERT; text is tokenized with SentencePiece. Input mixed tokens drawn from the motion, audio, and text codebooks pass through language encoders, whose outputs serve as key and value for autoregressive decoders that generate output mixed tokens by next-token prediction, starting from <bos>.]
We use language models to connect the language, speech, and motion modalities. With generative pre-training followed by task-specific post-training, our model supports a range of motion understanding and generation tasks.
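The generative pre-training objective is standard next-token prediction over mixed-token sequences. Below is a minimal, dependency-free sketch of that objective (average cross-entropy of each ground-truth next token given the prefix); the toy logits and 4-token vocabulary are invented for illustration.

```python
import math

# Sketch of the next-token-prediction loss over a mixed token sequence.
def nll_next_token(logits_per_step, targets):
    """Average negative log-likelihood of each target token.

    logits_per_step[t] holds unnormalized scores over the vocabulary for
    position t; targets[t] is the ground-truth next token at that position.
    """
    total = 0.0
    for logits, target in zip(logits_per_step, targets):
        z = sum(math.exp(x) for x in logits)  # softmax normalizer
        total += -math.log(math.exp(logits[target]) / z)
    return total / len(targets)

# Toy example: 4-token vocabulary, 3 prediction steps after <bos>.
logits = [[2.0, 0.1, 0.1, 0.1],
          [0.1, 2.0, 0.1, 0.1],
          [0.1, 0.1, 0.1, 2.0]]
targets = [0, 1, 3]
loss = nll_next_token(logits, targets)
print(round(loss, 3))
```

In practice this loss is computed in parallel across the sequence by a transformer decoder; the scalar loop above only makes the objective explicit.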
Co-speech Gesture Generation
GT
EMAGE
Ours
Liu et al., EMAGE, CVPR'24
Editable Gesture Generation
Can you generate the lower body movement of a man walking in a circular path, and generate the upper body movement according to the speech?
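Editable generation amounts to assembling a conditioning prompt in which some body-part tokens are given and the rest are left for the decoder to fill in. The sketch below shows one plausible way such a mixed-token prompt could be laid out; the separator tokens and helper function are hypothetical, not the paper's actual interface.

```python
# Hedged sketch of a mixed-token prompt for editable gesture generation:
# lower-body motion tokens are supplied as conditioning, and the decoder
# would emit the matching upper-body tokens after <bos>. All separator
# token names and this helper are illustrative assumptions.
def build_prompt(text_tokens, audio_tokens, lower_tokens):
    """Concatenate conditioning streams into one decoder prompt."""
    return (["<text>"] + text_tokens
            + ["<audio>"] + audio_tokens
            + ["<lower>"] + lower_tokens
            + ["<bos>"])

prompt = build_prompt(
    ["<text_id_1>", "<text_id_2>"],
    ["<audio_id_1>", "<audio_id_2>"],
    ["<lower_id_1>", "<lower_id_2>"],
)
print(prompt[-1])  # <bos>
```

The key design point is that editing a single body part requires no architectural change: it is just a different arrangement of conditioning tokens in the same unified vocabulary.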
Text-to-motion Generation
T2M-GPT
MDM
MotionGPT
Ours
A person is walking normally in a circle
Zhang et al., T2M-GPT, CVPR'23
Tevet et al., MDM, ICLR'23
Jiang et al., MotionGPT, NeurIPS'23
Emotion Understanding
What emotion is conveyed by the movements in the body motion?
The person expresses the emotion of “contempt”
Ours
Thank you for watching!