Motion Lens

Interpreting the Text Encoder's Role in Text-to-Motion Generative Models

Achyut Kowshik
Bada Kwon
Nikita Demidov
Prajnan Goswami

What is Text-to-Motion?

[Diagram: the prompt "A person walks forward" goes into a Generative Motion Model, which produces the corresponding motion]

Why Text-to-Motion?

  • Animating human motion is time-consuming, even for skilled professionals.
  • Text-to-motion lowers the skill barrier for non-experts.

Motivation for Motion Lens

How does the text encoder influence diverse human motions?

For example:

  • When is directional control achieved?
  • How are multiple motions resolved in a complex prompt?
  • How is left-and-right dexterity handled?

Diffusion Lens Recap

Visualizing the text encoder's intermediate states.

Motion Lens

[Diagram: the prompt "a person playing violin" is encoded by the text encoder; the hidden states at Layer 0, Layer i, and Layer 11 (final) are each fed to the Generative Motion Model]
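A minimal sketch of this pipeline, assuming the Hugging Face transformers CLIP text encoder; the checkpoint name and the motion-model call at the end are illustrative assumptions, not the models' real APIs:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# CLIP text encoder (ViT-B/32 checkpoint assumed here for illustration).
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

def motion_lens_states(prompt: str):
    """Return the text encoder's hidden states at every layer for `prompt`."""
    tokens = tokenizer(prompt, return_tensors="pt", padding="max_length")
    with torch.no_grad():
        out = text_encoder(**tokens, output_hidden_states=True)
    # Tuple of tensors: the input embeddings plus one entry per encoder layer
    # (12 layers for the base CLIP text encoder).
    return out.hidden_states

# Motion Lens then conditions the generator on an intermediate state instead
# of the final one, e.g. (hypothetical API):
#   motion = motion_model.generate(cond=motion_lens_states("a person playing violin")[i])
```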

Human Motion Representation

    • A motion consists of a sequence of frames (1, 2, 3, …, F).
    • Each frame consists of 22 joints, such as the shoulder, hip, and knee.
    • Each joint is represented by a 3D position, a 6D rotation, and a linear velocity.
    • The hip is the anchored (root) joint and carries the motion's overall angular velocity and linear velocity.
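A sketch of the array layout this representation implies; the exact channel ordering and dimensionality vary by dataset (HumanML3D, for example, flattens these per-frame features into one vector), so treat the shapes below as illustrative assumptions:

```python
import numpy as np

F = 120   # number of frames (e.g. ~4 s of motion at 30 fps)
J = 22    # joints per frame: shoulder, hip, knee, ...

pos = np.zeros((F, J, 3))   # 3D position of each joint
rot = np.zeros((F, J, 6))   # 6D continuous rotation representation
vel = np.zeros((F, J, 3))   # linear velocity of each joint

# Root (hip) channels carry the motion's global trajectory.
root_angular_vel = np.zeros((F, 1))   # rotation about the vertical axis per frame
root_linear_vel = np.zeros((F, 2))    # planar translation per frame
```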


Generative Human Motion Models

MotionDiffuse:

    • Mingyuan Zhang, PhD student at Nanyang Technological University, Singapore.
    • Zhongang Cai, Senior Research Scientist at SenseTime Research.
    • Ziwei Liu (advisor), Professor at Nanyang Technological University.
    • Other collaborators: Liang Pan, Fangzhou Hong, Xinying Guo, Lei Yang.

Generative Human Motion Models

T2M-GPT:

    • Jianrong Zhang, Jilin University, China.
    • Yangsong Zhang, Shanghai Jiao Tong University, China.
    • Xiaodong Cun, Assistant Professor at Great Bay University, China.
    • Shaoli Huang, Senior Researcher at Tencent AI Lab.
    • Xi Shen, Chief Scientist at Intellindust.

Generative Human Motion Models

  • MotionDiffuse denoises noisy motion to generate clean motion.
  • T2M-GPT uses an autoregressive approach to generate the motion sequence.
  • Both MotionDiffuse and T2M-GPT use the same CLIP text encoder, as the sketch below illustrates.
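A minimal sketch contrasting the two sampling paradigms under shared CLIP conditioning; every name here (denoiser, gpt, vqvae_decoder) is an illustrative stand-in, not the models' real APIs:

```python
import torch

def generate_diffusion(denoiser, clip_emb, T=1000, shape=(120, 263)):
    """MotionDiffuse-style sampling: start from noise, denoise for T steps."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        x = denoiser(x, t, cond=clip_emb)  # predict a slightly cleaner motion
    return x

def generate_autoregressive(gpt, vqvae_decoder, clip_emb, max_tokens=50):
    """T2M-GPT-style sampling: predict discrete motion tokens left to right."""
    tokens = []
    for _ in range(max_tokens):
        tokens.append(gpt(tokens, cond=clip_emb))  # next motion code
    return vqvae_decoder(tokens)                   # tokens -> continuous motion
```

Motion Lens swaps `clip_emb` for an intermediate-layer state in both cases, which is what makes the two models directly comparable.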

*More details on the model architectures in the Appendix.

Method

Test prompt categories:

    • Simple motions such as "walking" and "jumping".
    • Human-object interactions such as "playing violin" and "kicking a ball".
    • Interactive or gestural actions such as "clapping" and "waving".
    • Stylized motions such as "walking sadly" or "injured".
    • Conjugate motions such as "jump and then walk".

Results

Layers 0 and 1

      • Hands wave in random directions and the motions are arbitrary.

Exception: T2M-GPT shows forward movement at layer 1 in some cases.

Results: "a person does hand stand"

[Figure: MotionDiffuse and T2M-GPT outputs at Layer 0, Layer 1, and the final layer]

Results: "an injured person slowly hobbles forward"

[Figure: MotionDiffuse and T2M-GPT outputs at Layer 0, Layer 1, and the final layer]

Results

Layer 2

      • Sensitive to the complexity of the prompt.
      • Tends to initiate movement for simple prompts involving foot motion.

Results: "a person walks forward"

[Figure: MotionDiffuse and T2M-GPT outputs at Layer 2 and the final layer]

Results: "a person jumps"

[Figure: MotionDiffuse and T2M-GPT outputs at Layer 2 and the final layer]

Results: "a person raises their right arm and waves slowly side-to-side five time, lowers it gently"

[Figure: MotionDiffuse and T2M-GPT outputs at Layer 2 and the final layer]

Results: "a person jogs forward and stops, then jumps"

[Figure: MotionDiffuse and T2M-GPT outputs at Layer 2 and the final layer]

Results

Layers 3, 4, and 5 (early-middle layers)

      • The initial motion consistent with the prompt starts to appear.

Results: "a person performs jumping jacks, he then jogs in place"

[Figure: MotionDiffuse and T2M-GPT outputs at the early-middle layers and the final layer]

Results: "a person does hand stand"

[Figure: MotionDiffuse and T2M-GPT outputs at the early-middle layers and the final layer]

Results: "a person playing violin"

[Figure: MotionDiffuse and T2M-GPT outputs at the early-middle layers and the final layer]

Results

Layers 6, 7, 8, and 9

      • Mixed results, depending on prompt complexity.
      • More upper-body movement is introduced in these layers.
      • In most cases, some parts of the motion are resolved by layer 9.

Result: "a person jogs forward and stops, then jumps"

[Figure: T2M-GPT outputs at Layers 6, 7, 8, 9 and the final layer]

Result: "a person jumps"

[Figure: MotionDiffuse outputs at Layers 6, 7, 8, 9 and the final layer]

Results

Layers 10 and 11 (final layers)

      • Refinement of the overall motion sequence in most cases.
      • Better directional control and positioning.
      • In a few cases, layer 10 does not align with the input prompt. For example: "a person walks backward".

Results: "a person hops on his right foot, he then hops on his left foot"

[Figure: MotionDiffuse and T2M-GPT outputs at Layer 10 and the final layer]

Results: "a person walks backward"

[Figure: MotionDiffuse and T2M-GPT outputs at Layer 10 and the final layer]

Discussion

What can we interpret, to some extent, with Motion Lens?

  • The initial layers (0 to 2) show some foot movement for both simple and complex prompts.
  • The early-middle layers (3 to 5) show signs of initiating the motion.
  • Upper-body movement, human-object interaction, and better arm positioning are refined in the later layers (6 to 9).
  • Layers 10 and 11 add directional control and refine the overall motion to align with the input prompt.

*More examples in the Appendix.

Discussion

What is still NOT interpretable with Motion Lens?

  • Left and right dexterity (wave the right hand, kick with the left foot, etc.).
  • Counting an action (wave the hand 5 times).
  • When stylized motion appears (an injured person walking). Occasionally the later layers (7 to 10) show some stylized motion.

*More examples in the Appendix.

Discussion

What is still NOT interpretable with Motion Lens?

  • Why do the diffusion and autoregressive approaches generate different motion sequences for the same intermediate text embeddings?
  • How do the models reconstruct the entire motion in cases where the intermediate result at layer 10 still does not align with the prompt?


Thank you!



Appendix


MotionDiffuse

[Diagram: the prompt "A person walks forward" is encoded by CLIP and conditions a Motion Denoiser, which refines a noisy motion representation into a clean one over T = 1000 steps]
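The diagram implies standard DDPM-style sampling; below is a minimal sketch of one reverse step with the epsilon-prediction parameterization (notation from Ho et al., 2020). The denoiser call and its conditioning argument are illustrative assumptions:

```python
import torch

def ddpm_step(denoiser, x_t, t, clip_emb, betas):
    """One reverse-diffusion step x_t -> x_{t-1} (DDPM, epsilon parameterization)."""
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    alpha_bar_t = torch.prod(1.0 - betas[: t + 1])    # cumulative product of alphas
    eps = denoiser(x_t, t, cond=clip_emb)             # predicted noise component
    mean = (x_t - beta_t / torch.sqrt(1.0 - alpha_bar_t) * eps) / torch.sqrt(alpha_t)
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + torch.sqrt(beta_t) * noise          # sigma_t^2 = beta_t variant
```

Running this step for t = 999, …, 0 turns pure noise into a clean motion representation, matching the T = 1000 steps shown in the diagram.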

T2M-GPT

[Figure: T2M-GPT architecture overview]

T2M-GPT

[Figure: T2M-GPT architecture, continued]
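T2M-GPT is a two-stage model: a VQ-VAE quantizes motion into discrete codes, and a GPT then predicts those codes from the text embedding. A minimal sketch with simplified (linear instead of convolutional) layers; all module names and sizes are illustrative:

```python
import torch
import torch.nn as nn

class MotionVQVAE(nn.Module):
    """Stage 1 (sketch): map motion to discrete codebook indices and back."""
    def __init__(self, dim=263, codebook_size=512, latent=512):
        super().__init__()
        self.enc = nn.Linear(dim, latent)      # the real model uses 1D convolutions
        self.codebook = nn.Embedding(codebook_size, latent)
        self.dec = nn.Linear(latent, dim)

    def quantize(self, motion):                # motion: (frames, dim)
        z = self.enc(motion)
        dists = torch.cdist(z, self.codebook.weight)  # distance to each code
        return dists.argmin(dim=-1)                   # discrete motion tokens

    def decode(self, tokens):                  # tokens: (frames,)
        return self.dec(self.codebook(tokens))

# Stage 2 (sketch): a GPT autoregressively samples motion tokens conditioned on
# the CLIP text embedding, and the VQ-VAE decoder reconstructs the motion:
#   tokens = gpt.sample(prefix=clip_emb)   # hypothetical sampler
#   motion = vqvae.decode(tokens)
```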

Motion-Language Datasets

[Figure: overview of motion-language datasets]


T2M-GPT Results


Result: "a person takes a free kick with his left foot, he then celebrates with his arms up in the air"

[Figure: T2M-GPT outputs at Layers 0, 3, 6, 9 and the final layer]

Result: "A person spins in place clockwise two full rotations, slows down, then carefully sits cross-legged on the ground"

[Figure: T2M-GPT outputs at Layers 0, 3, 6, 9 and the final layer]


MotionDiffuse Results


Result: "a person slowly walks backwards, he then sits down on a chair"

[Figure: MotionDiffuse outputs at Layers 0, 3, 6, 9 and the final layer]

Result: "an injured person slowly hobbles forward"

[Figure: MotionDiffuse outputs at Layers 0, 3, 6, 9 and the final layer]

Result: "a person playing violin"

[Figure: MotionDiffuse outputs at Layers 0, 3, 6, 9 and the final layer]