1 of 23

REINFORCEMENT LEARNING OF MOTION FROM VIDEOS

Xuefei Li

Advisor: Stephen J. Guy


2 of 23

Introduction

  • Physics-based simulation of passive phenomena is common, but modeling the motion of humans and animals remains challenging.
  • Manually designed controllers can be stylized and engaging, but they have several drawbacks:

They lack generality;

They require considerable effort to construct, and the range of motions they produce is small relative to the space of all possible reactions.


3 of 23

Introduction

  • Adopting motions from the real world could be a solution
  • How can the motion be obtained?

1. Motion capture (mocap)

A mocap system records motion with tracking cameras or with non-optical approaches such as inertial or mechanical sensors. It can recreate complex movements and realistic physical interactions in a physically accurate manner.

However, mocap requires dedicated hardware and specialized software to capture and process the data, so collecting a large dataset is expensive.


4 of 23

Introduction

  • Adopting motions from the real world could be a solution
  • How can the motion be obtained?

2. In-the-wild video

Advantages: easy to obtain; an abundant and flexible source of motions to learn from.

Drawbacks: low quality, inaccurate, unstable.


5 of 23

Introduction

  • Goal: train physically simulated characters to imitate reference motions extracted from everyday videos


6 of 23

Related work

  • Monocular human pose estimation: 2D: OpenPose; 3D: HMR, VideoPose3D, VIBE
  • Pose transfer
  • Reinforcement Learning


7 of 23

OpenPose

  • Real-time multi-person keypoint detection library for body, face, hands, and foot estimation
  • OpenPose uses a multi-stage CNN architecture that predicts 2D confidence maps of body parts as well as part affinity fields (PAFs)
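To make the role of the confidence maps concrete, here is a minimal sketch of how a keypoint could be read off each map for a single person. It does not call OpenPose; the function name, the `(parts, H, W)` array layout, and the `threshold` value are assumptions made for illustration. Grouping peaks into several people with PAFs is not shown.

```python
import numpy as np

def keypoints_from_confidence_maps(conf_maps, threshold=0.1):
    """Return one (x, y, score) per body part from maps of shape (parts, H, W).

    Hypothetical helper for the single-person case: the peak of each map is
    taken as the keypoint, and weak peaks are reported as None.
    """
    keypoints = []
    for part_map in conf_maps:
        flat_index = int(np.argmax(part_map))
        y, x = np.unravel_index(flat_index, part_map.shape)
        score = float(part_map[y, x])
        keypoints.append((int(x), int(y), score) if score >= threshold else None)
    return keypoints

# Two synthetic 4x5 maps: one confident peak, one peak below the threshold.
maps = np.zeros((2, 4, 5))
maps[0, 1, 3] = 0.9
maps[1, 2, 0] = 0.05
demo_keypoints = keypoints_from_confidence_maps(maps)
```

In this toy input the first part is detected at column 3, row 1, and the second part is rejected because its peak is too weak.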


Testing the Crazy Uptown Funk flashmob in Sydney video sequence with OpenPose

8 of 23

Challenges of 3D pose estimation

  • Lack of large-scale ground-truth 3D annotations for in-the-wild images
  • Inherent ambiguities in single-view 2D-to-3D mapping
  • Depth ambiguity: multiple 3D body configurations can explain the same 2D projection

Temporal information about the motion and knowledge of body shape would help
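The depth ambiguity can be illustrated with a small numerical sketch. It assumes an ideal pinhole camera with the focal length set to 1; the function name and the joint coordinates are invented for the example.

```python
import numpy as np

def project(points_3d, focal_length=1.0):
    """Pinhole projection of (N, 3) camera-space points to (N, 2) image points."""
    points_3d = np.asarray(points_3d, dtype=float)
    return focal_length * points_3d[:, :2] / points_3d[:, 2:3]

# Two joints of a made-up pose, in camera coordinates (x, y, depth).
pose_near = np.array([[0.2, 0.5, 2.0],
                      [0.4, -0.1, 2.5]])

# Slide only the second joint farther away along its viewing ray.
depth_scale = np.array([[1.0], [1.6]])
pose_far = pose_near * depth_scale

same_image = np.allclose(project(pose_near), project(pose_far))
```

The two poses differ in 3D, yet `same_image` is true: a single view cannot tell them apart, which is why extra cues such as temporal context are useful.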


9 of 23

VideoPose3D

  • 3D human pose estimation in video using 2D keypoint trajectories
  • A temporal dilated convolutional model with residual connections that takes a sequence of 2D poses as input and outputs 3D pose estimates
  • The training process is semi-supervised
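The effect of stacking dilated temporal convolutions can be sketched in a few lines. This is a simplified illustration rather than the VideoPose3D network: residual connections, channel dimensions, and learned weights are omitted, and the kernel size and dilations below are example values chosen for the demo.

```python
import numpy as np

def receptive_field(kernel_sizes, dilations):
    """Number of input frames that influence one output frame."""
    frames = 1
    for kernel_size, dilation in zip(kernel_sizes, dilations):
        frames += (kernel_size - 1) * dilation
    return frames

def dilated_conv1d(signal, kernel, dilation):
    """Valid (unpadded) 1D dilated correlation over the time axis."""
    span = (len(kernel) - 1) * dilation
    out = np.zeros(len(signal) - span)
    for tap, weight in enumerate(kernel):
        start = tap * dilation
        out += weight * signal[start:start + len(out)]
    return out

example_dilations = [1, 3, 9]
example_rf = receptive_field([3, 3, 3], example_dilations)

# A sequence exactly as long as the receptive field collapses to one frame.
sequence = np.ones(example_rf)
averaging_kernel = np.full(3, 1.0 / 3.0)
for dilation in example_dilations:
    sequence = dilated_conv1d(sequence, averaging_kernel, dilation)
```

Because the dilation grows with depth, the temporal context grows exponentially with the number of layers while the number of weights grows only linearly.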


10 of 23

HMR: Human Mesh Recovery


11 of 23

VIBE: Video Inference for Human Body Pose and Shape Estimation

  • VIBE builds on HMR but also exploits temporal information: after extracting features from each image, it passes them through a temporal generator network so that temporal context is retained before regression.
  • During training, it uses an existing large-scale motion capture dataset (AMASS) as a reference for realistic human motion.


12 of 23

Pose transfer

Figure panels: pose format of COCO; pose format of SMPL; humanoid

13 of 23

Reinforcement Learning (RL)

  • In a physics-based simulation environment, the character cannot simply replay the motion; it must actually learn to perform the skill.
  • RL lets an agent learn from interaction with its environment: it learns a policy that maps states to actions so as to maximize cumulative reward.
  • Value iteration methods have been applied to develop kinematic controllers that can mimic motion clips
  • Using deep neural networks to build agents has improved performance on a range of challenging simulated tasks
  • Policy gradient methods have been shown to be more reliable in continuous control problems
  • Additional objectives, such as penalties for undesirable behaviors, are used
  • DeepMimic is a state-of-the-art RL method for character control. It takes a character model and a kinematic reference motion as input, and enables the character to imitate the reference motion while satisfying task objectives.
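As a rough illustration of what "imitating the reference motion" means as an RL objective, the sketch below scores how closely simulated joint rotations match reference rotations. It is loosely modeled on an exponentiated pose-error term of the kind used in imitation rewards; the function names, the `(w, x, y, z)` quaternion order, and `scale=2.0` are assumptions for this example, not values taken from these slides.

```python
import numpy as np

def quaternion_angle(q_a, q_b):
    """Smallest rotation angle (radians) between two unit quaternions."""
    dot = abs(float(np.dot(q_a, q_b)))
    return 2.0 * np.arccos(np.clip(dot, -1.0, 1.0))

def pose_reward(sim_rotations, ref_rotations, scale=2.0):
    """Reward in (0, 1]; equals 1 when every joint matches the reference."""
    error = sum(quaternion_angle(s, r) ** 2
                for s, r in zip(sim_rotations, ref_rotations))
    return float(np.exp(-scale * error))

identity = np.array([1.0, 0.0, 0.0, 0.0])
quarter_turn_z = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])

perfect_reward = pose_reward([identity], [identity])
off_reward = pose_reward([quarter_turn_z], [identity])
```

A perfect match gives a reward of 1, and the reward decays smoothly toward 0 as the joint rotations drift away from the reference, which gives the policy a dense learning signal.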


14 of 23

Implementation

  • Pose estimation
  • Motion reconstruction
  • Motion imitation


15 of 23

Pose estimation

  • Video taken with an iPhone X
  • Built with PyTorch and Caffe2
  • 2D pose estimation using OpenPose
  • The 2D keypoints are fed into the temporal model


16 of 23

Motion reconstruction

  • Input: a receptive field of 161 consecutive frames with J=17 joints, giving shape (161, 17x2)
  • Dilated convolutions
  • 4 ResNet-style blocks, each surrounded by a skip connection
  • Output: (1, 51), i.e. 17 joints x 3 coordinates for a single frame
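The shape bookkeeping above can be sketched as follows. The trained network is replaced by a placeholder that only reproduces the stated input and output shapes, and the sliding-window helper with edge padding is an assumption about how one pose per frame could be obtained from a longer video; all names are hypothetical.

```python
import numpy as np

NUM_FRAMES, NUM_JOINTS = 161, 17

def placeholder_model(window):
    """Stand-in for the trained network: zeros with the stated output shape."""
    assert window.shape == (NUM_FRAMES, NUM_JOINTS * 2)
    return np.zeros((1, NUM_JOINTS * 3))

def sliding_windows(sequence, window=NUM_FRAMES):
    """One window per frame, centered on that frame, repeating edge frames."""
    half = window // 2
    padded = np.pad(sequence, ((half, half), (0, 0)), mode="edge")
    return np.stack([padded[t:t + window] for t in range(len(sequence))])

# One window of 2D keypoints: (161, 17, 2) flattened to (161, 34).
keypoints_2d = np.zeros((NUM_FRAMES, NUM_JOINTS, 2))
model_input = keypoints_2d.reshape(NUM_FRAMES, NUM_JOINTS * 2)

model_output = placeholder_model(model_input)   # (1, 51)
pose_3d = model_output.reshape(NUM_JOINTS, 3)   # (17, 3): x, y, z per joint

# A 10-frame clip yields 10 windows, each centered on its own frame.
clip = np.arange(10 * 34, dtype=float).reshape(10, 34)
windows = sliding_windows(clip)
```

Each window's middle row is the frame being estimated, so the model sees equal amounts of past and future context.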


17 of 23

Motion reconstruction

  • VIBE


18 of 23

Motion reconstruction


19 of 23

Motion reconstruction

  • Comparison between VideoPose3D and VIBE


20 of 23

Motion imitation

  • Motions obtained by VIBE: sequences of poses

Transformed into the following per-frame representation (4D rotations are quaternions; 1D rotations are single joint angles):

root position (3D), root rotation (4D), chest rotation (4D), neck rotation (4D), right hip rotation (4D), right knee rotation (1D), right ankle rotation (4D), right shoulder rotation (4D), right elbow rotation (1D), left hip rotation (4D), left knee rotation (1D), left ankle rotation (4D), left shoulder rotation (4D), left elbow rotation (1D)
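A minimal sketch of this transformation is given below. It assumes the estimated joint rotations are available as axis-angle vectors (as in SMPL pose parameters) and that quaternions are stored in `(w, x, y, z)` order; the field names, the helper functions, and the component order are assumptions for illustration.

```python
import numpy as np

# (field name, number of values), in the order listed above.
FRAME_LAYOUT = [
    ("root_position", 3), ("root_rotation", 4),
    ("chest_rotation", 4), ("neck_rotation", 4),
    ("right_hip_rotation", 4), ("right_knee_rotation", 1),
    ("right_ankle_rotation", 4), ("right_shoulder_rotation", 4),
    ("right_elbow_rotation", 1),
    ("left_hip_rotation", 4), ("left_knee_rotation", 1),
    ("left_ankle_rotation", 4), ("left_shoulder_rotation", 4),
    ("left_elbow_rotation", 1),
]
FRAME_SIZE = sum(size for _, size in FRAME_LAYOUT)

def axis_angle_to_quaternion(axis_angle):
    """Convert a 3D axis-angle vector to a unit quaternion (w, x, y, z)."""
    axis_angle = np.asarray(axis_angle, dtype=float)
    angle = np.linalg.norm(axis_angle)
    if angle < 1e-8:
        return np.array([1.0, 0.0, 0.0, 0.0])
    axis = axis_angle / angle
    return np.concatenate(([np.cos(angle / 2)], np.sin(angle / 2) * axis))

def pack_frame(values):
    """Concatenate per-joint values into one flat frame vector."""
    parts = []
    for name, size in FRAME_LAYOUT:
        part = np.atleast_1d(np.asarray(values[name], dtype=float))
        if part.shape != (size,):
            raise ValueError(f"{name} needs {size} values, got {part.shape}")
        parts.append(part)
    return np.concatenate(parts)

# Rest pose: root at the origin, identity rotations, zero hinge angles.
rest_values = {}
for name, size in FRAME_LAYOUT:
    if size == 3:
        rest_values[name] = np.zeros(3)
    elif size == 4:
        rest_values[name] = axis_angle_to_quaternion(np.zeros(3))
    else:
        rest_values[name] = 0.0
rest_frame = pack_frame(rest_values)
half_turn_z = axis_angle_to_quaternion([0.0, 0.0, np.pi])
```

Knees and elbows are treated as hinge joints with a single angle, so a full 3D rotation estimated for those joints has to be reduced to one scalar before packing.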

  • Reinforcement Learning

Simulation environment: PyBullet

State: positions, rotations, and linear and angular velocities of each link, expressed relative to the root

Action: target orientations for the PD controllers at each joint

The policy determines which actions to apply at each timestep in order to reproduce the desired motion
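To show how a target orientation chosen by the policy becomes a joint torque, here is a minimal PD-control sketch for a single hinge joint (such as a knee or elbow). It does not call PyBullet; the gains `kp` and `kd`, the unit inertia, and the time step are assumed values for the demo, and spherical joints would need a quaternion-based error instead of a scalar difference.

```python
def pd_torque(target_angle, angle, angular_velocity, kp=300.0, kd=30.0):
    """Proportional-derivative torque pulling the joint toward its target."""
    return kp * (target_angle - angle) - kd * angular_velocity

def simulate_joint(target_angle, steps=480, dt=1.0 / 240.0, inertia=1.0):
    """Integrate one hinge joint with semi-implicit Euler steps."""
    angle, angular_velocity = 0.0, 0.0
    for _ in range(steps):
        torque = pd_torque(target_angle, angle, angular_velocity)
        angular_velocity += dt * torque / inertia
        angle += dt * angular_velocity
    return angle

final_angle = simulate_joint(target_angle=0.5)
```

Letting the policy output targets for PD controllers, rather than raw torques, means the low-level controller handles the fast corrective dynamics while the policy only decides where each joint should go.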


21 of 23

Motion imitation


22 of 23

Motion imitation


Test on mocap data after training with 8 workers for 1 day

Test on in-the-wild video after training with 8 workers for 1 day

23 of 23

Motion imitation
