Motion Lens
Achyut Kowshik
Bada Kwon
Nikita Demidov
Prajnan Goswami
Interpreting the Text Encoder’s Role in Text-to-Motion Generative Models
What is Text-to-Motion?
“A person walks forward”
Generative Motion Model
Why Text-to-Motion?
Motivation for Motion Lens
How does the text encoder influence diverse human motions?
For example:
Diffusion Lens Recap
Visualizing the text encoder’s intermediate states.
Motion Lens
Layer 0
Layer i
Layer 11 (final)
Prompt
"a person playing violin"
Generative Motion Model
Human Motion Representation
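The pipeline on this slide can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the 12 "layers" and the motion model below are hypothetical stand-ins, and applying the final layer norm to each intermediate state is an assumption carried over from the Diffusion Lens setup. The point is only the control flow: stop the text encoder after layer i, normalize, and hand that state to the unchanged generative motion model.

```python
import math
from typing import Callable, List

Vector = List[float]

NUM_LAYERS = 12  # layers 0..11, matching the slide


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def make_toy_layer(index: int) -> Callable[[Vector], Vector]:
    """Hypothetical stand-in for one transformer block (a residual update)."""
    def layer(x: Vector) -> Vector:
        return [v + math.tanh(v + 0.1 * (index + 1)) for v in x]
    return layer


LAYERS = [make_toy_layer(i) for i in range(NUM_LAYERS)]


def encode_up_to(x: Vector, last_layer: int) -> Vector:
    """Run layers 0..last_layer, skip the rest, then apply the final norm."""
    for layer in LAYERS[: last_layer + 1]:
        x = layer(x)
    return layer_norm(x)  # assumption: final norm is reused for every layer


def toy_motion_model(condition: Vector, num_frames: int = 4) -> List[Vector]:
    """Hypothetical generator: maps a conditioning vector to a short 'motion'."""
    return [[c * (t + 1) for c in condition] for t in range(num_frames)]


def motion_lens(prompt_embedding: Vector) -> List[List[Vector]]:
    """Generate one motion per encoder layer from the same prompt embedding."""
    return [
        toy_motion_model(encode_up_to(prompt_embedding, i))
        for i in range(NUM_LAYERS)
    ]
```

Calling `motion_lens(embedding)` returns a list of 12 motions, one per layer; the last entry corresponds to the ordinary, unmodified generation from the final layer.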
Generative Human Motion Models
MotionDiffuse:
Generative Human Motion Models
T2M-GPT:
Generative Human Motion Models
*More details on the model architectures are in the Appendix.
Method
Testing Prompts Focus:
Results
Layer 0 and 1
Exception: T2M-GPT shows forward movement in layer 1 in some cases.
Results: “a person does hand stand”
Figure: generated motions from MotionDiffuse and T2M-GPT at Layer 0, Layer 1, and the final layer.
Results: “an injured person slowly hobbles forward”
Figure: generated motions from MotionDiffuse and T2M-GPT at Layer 0, Layer 1, and the final layer.
Results
Layer 2
Results: “a person walks forward”
Figure: generated motions from MotionDiffuse and T2M-GPT at Layer 2 and the final layer.
Results: “a person jumps”
Figure: generated motions from MotionDiffuse and T2M-GPT at Layer 2 and the final layer.
Results: “a person raises their right arm and waves slowly side-to-side five time, lowers it gently”
Figure: generated motions from MotionDiffuse and T2M-GPT at Layer 2 and the final layer.
Results: “a person jogs forward and stops, then jumps”
Figure: generated motions from MotionDiffuse and T2M-GPT at Layer 2 and the final layer.
Results
Layer 3, 4 and 5 (early-middle layers)
Results: “a person performs jumping jacks, he then jogs in place”
Figure: generated motions from MotionDiffuse and T2M-GPT at the early-middle layers and the final layer.
Results: “a person does hand stand”
Figure: generated motions from MotionDiffuse and T2M-GPT at the early-middle layers and the final layer.
Results: “a person playing violin”
Figure: generated motions from MotionDiffuse and T2M-GPT at the early-middle layers and the final layer.
Results
Layer 6, 7, 8 and 9
Result: “a person jogs forward and stops, then jumps”
Figure: generated motions from T2M-GPT at Layer 6, Layer 7, Layer 8, Layer 9, and the final layer.
Result: “a person jumps”
Figure: generated motions from MotionDiffuse at Layer 6, Layer 7, Layer 8, Layer 9, and the final layer.
Results
Layer 10 and 11 (Final Layers)
Results: “a person hops on his right foot, he then hops on his left foot”
Figure: generated motions from MotionDiffuse and T2M-GPT at Layer 10 and the final layer.
Results: “a person walks backward”
Figure: generated motions from MotionDiffuse and T2M-GPT at Layer 10 and the final layer.
Discussion
What can we interpret to some extent with Motion Lens?
*more examples in the Appendix
Discussion
What is still NOT interpretable with Motion Lens?
Thank you!
Appendix
MotionDiffuse
Motion Denoiser
CLIP
“A person walks forward”
T = 1000 steps
Noisy motion representation
Clean motion representation
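The noisy-to-clean loop in this diagram can be illustrated with a toy reverse-diffusion sampler. This is a minimal sketch under stated assumptions, not MotionDiffuse's code: the linear noise schedule, the noise-prediction (DDPM-style) update, and `oracle_denoiser` are all assumptions for illustration. The oracle cheats by knowing the clean target; in the real model a learned denoiser conditioned on CLIP text features takes its place.

```python
import math
import random
from typing import List

T = 1000  # number of diffusion steps, as on the slide

# Linear noise schedule (an assumption; the slide does not specify one).
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alphas = [1.0 - b for b in betas]
alpha_bars: List[float] = []
running = 1.0
for a in alphas:
    running *= a
    alpha_bars.append(running)


def oracle_denoiser(x_t: List[float], t: int, x0: List[float]) -> List[float]:
    """Hypothetical stand-in for the text-conditioned motion denoiser.

    Returns the exact noise that maps the clean motion x0 to x_t at step t.
    """
    scale = math.sqrt(1.0 - alpha_bars[t])
    root = math.sqrt(alpha_bars[t])
    return [(xt - root * c) / scale for xt, c in zip(x_t, x0)]


def sample(x0_target: List[float], seed: int = 0) -> List[float]:
    """Start from Gaussian noise and denoise for T steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in x0_target]  # noisy motion representation
    for t in reversed(range(T)):
        eps = oracle_denoiser(x, t, x0_target)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * e) / math.sqrt(alphas[t]) for xi, e in zip(x, eps)]
        if t > 0:
            sigma = math.sqrt(betas[t])
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean  # no noise is added at the final step
    return x  # clean motion representation
```

Because the oracle predicts the noise perfectly, `sample(target)` recovers `target` up to floating-point error, which makes the loop easy to sanity-check before swapping in a learned denoiser.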
T2M-GPT
Motion Language Datasets
T2M-GPT Results
Result: “a person takes a free kick with his left foot, he then celebrates with his arms up in the air”
Figure: generated motions from T2M-GPT at Layer 0, Layer 3, Layer 6, Layer 9, and the final layer.
Result: “A person spins in place clockwise two full rotations, slows down, then carefully sits cross-legged on the ground”
Figure: generated motions from T2M-GPT at Layer 0, Layer 3, Layer 6, Layer 9, and the final layer.
MotionDiffuse Results
Result: “a person slowly walks backwards, he then sits down on a chair”
Figure: generated motions from MotionDiffuse at Layer 0, Layer 3, Layer 6, Layer 9, and the final layer.
Result: “an injured person slowly hobbles forward”
Figure: generated motions from MotionDiffuse at Layer 0, Layer 3, Layer 6, Layer 9, and the final layer.
Result: “a person playing violin”
Figure: generated motions from MotionDiffuse at Layer 0, Layer 3, Layer 6, Layer 9, and the final layer.