1 of 23

REINFORCEMENT LEARNING OF MOTION FROM VIDEOS

Xuefei Li

Advisor: Stephen J. Guy

2 of 23

Introduction

Physics-based simulation of passive phenomena is common, but modeling the motion of humans and animals remains challenging.
Manually designed controllers can produce stylized and engaging motion, but they have significant drawbacks:

They do not generalize;

They require considerable effort to construct, and the range of motions is limited to the reactions built into the controller.

3 of 23

Introduction

Adopting motions from the real world could be a solution.
How can the motion be obtained?

1. Motion capture (mocap)

A mocap system uses tracking cameras, or non-optical approaches that measure inertial or mechanical motion. It can recreate complex movement and realistic physical interactions in a physically accurate manner.

However, mocap requires dedicated hardware and specialized software to capture and process the data, which makes a large dataset expensive to obtain.

4 of 23

Introduction

Adopting motions from the real world could be a solution.
How can the motion be obtained?

2. In-the-wild video

Easy to obtain; an abundant and flexible source of motions to learn from.

However, motion extracted from such video is low quality, inaccurate, and unstable.

5 of 23

Introduction

Goal: train physically simulated characters to imitate reference motions from everyday videos

6 of 23

Related work

Monocular human pose estimation: 2D (OpenPose); 3D (HMR, VideoPose3D, VIBE)
Pose transfer
Reinforcement Learning

7 of 23

OpenPose

Real-time multi-person keypoint detection library for body, face, hand, and foot estimation
OpenPose uses a multi-stage CNN architecture that predicts 2D confidence maps of body parts as well as part affinity fields (PAFs)
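
To make the role of a confidence map concrete, here is a minimal sketch of reading one keypoint out of one map. It assumes a single person and a single body part, so the keypoint is simply the strongest response. The function name `peak_of_confidence_map` and the `threshold` value are hypothetical, and this is not the OpenPose API; the real pipeline finds several peaks per map and uses the PAFs to group them into people.

```python
def peak_of_confidence_map(conf_map, threshold=0.1):
    """Return (x, y, score) of the strongest response, or None if it is too weak.

    conf_map is a 2D list of floats: conf_map[y][x] is the network's confidence
    that the body part is located at pixel (x, y).
    """
    best = None
    for y, row in enumerate(conf_map):
        for x, score in enumerate(row):
            if best is None or score > best[2]:
                best = (x, y, score)
    # Assumed threshold: weak peaks are treated as "part not visible".
    if best is None or best[2] < threshold:
        return None
    return best


# Toy 3x4 confidence map with one clear peak at x=1, y=1.
toy_map = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.2, 0.0],
    [0.0, 0.2, 0.1, 0.0],
]
keypoint = peak_of_confidence_map(toy_map)
```

Running one such lookup per body-part map gives a 2D skeleton for a single-person frame.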

Testing OpenPose on the "Crazy Uptown Funk flashmob in Sydney" video sequence

8 of 23

Challenges of 3D pose estimation

Lack of large-scale ground-truth 3D annotations for in-the-wild images
Inherent ambiguities in single-view 2D-to-3D mapping
Depth ambiguity: multiple 3D body configurations can explain the same 2D projection
…

Temporal information about the motion, together with body shape, would help
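
The depth ambiguity can be illustrated with a small sketch. It assumes an ideal pinhole camera at the origin looking along +Z; the `project` helper, the focal length, and the two wrist positions are made up for illustration.

```python
def project(point_3d, focal_length=1000.0):
    """Perspective-project a camera-space point (x, y, z) onto the image plane."""
    x, y, z = point_3d
    return (focal_length * x / z, focal_length * y / z)


# Two different 3D positions for the same joint, on the same ray from the camera.
wrist_near = (0.2, 0.5, 2.0)
wrist_far = (0.4, 1.0, 4.0)  # twice as far away, scaled by 2

pixel_near = project(wrist_near)
pixel_far = project(wrist_far)
```

Both points land on the same pixel, so a single frame cannot tell them apart. Constraints across frames, such as constant bone lengths, are one way to narrow down the options.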

9 of 23

VideoPose3D

3D human pose estimation in video using 2D keypoint trajectories
Temporal dilated convolutional model with residual connections, which takes in a sequence of 2D poses and outputs 3D pose estimates
The training process is semi-supervised

10 of 23

HMR: Human Mesh Recovery

11 of 23

VIBE: Video Inference for Human Body Pose and Shape Estimation

Inspired by HMR, but also makes use of temporal information. After extracting features from each image, VIBE uses a temporal generator network to retain temporal information across frames before regression.
Refers to an existing large-scale motion capture dataset (AMASS) during training.

12 of 23

Pose transfer

Pose formats compared: COCO, SMPL, humanoid character

13 of 23

Reinforcement Learning (RL)

In a physics-based simulation environment, the character cannot simply replay the motion; it must actually learn to perform the skill.
RL lets an agent learn from interaction: rewards guide it toward a mapping from states to actions.
Value iteration methods have been applied to develop kinematic controllers that can mimic motion clips
Deep neural networks have improved agent performance on a range of challenging simulated tasks
Policy gradient methods have been shown to be more reliable in continuous control problems
Additional objectives, such as penalties on undesirable behaviors, are used
DeepMimic is a state-of-the-art RL method for character control. It takes a character model and a kinematic reference motion as input, and trains the character to imitate the reference motion while satisfying task objectives.
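
As an illustration of what "imitating the reference motion" can mean as a reward, here is a minimal sketch of a pose term: the exponential of the negative summed squared rotation error between simulated and reference joints. The function names and the `scale` value are assumptions for this sketch, not the exact DeepMimic objective, which combines several weighted terms.

```python
import math


def quat_angle(q_a, q_b):
    """Angle in radians of the rotation taking unit quaternion q_a to q_b.

    Only a dot product is used, so any component order works as long as
    both quaternions use the same one.
    """
    dot = abs(sum(a * b for a, b in zip(q_a, q_b)))
    return 2.0 * math.acos(min(1.0, dot))  # clamp guards against rounding above 1


def pose_reward(sim_rotations, ref_rotations, scale=2.0):
    """Reward in (0, 1]: 1 for a perfect match, decaying with squared error."""
    error = sum(quat_angle(s, r) ** 2 for s, r in zip(sim_rotations, ref_rotations))
    return math.exp(-scale * error)


identity = (1.0, 0.0, 0.0, 0.0)
quarter_turn = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))  # 90 degrees

perfect = pose_reward([identity, identity], [identity, identity])
off_by_90 = pose_reward([identity, quarter_turn], [identity, identity])
```

The reward is bounded, so a single badly tracked joint lowers the return without producing an unbounded penalty.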

14 of 23

Implementation

Pose estimation
Motion reconstruction
Motion imitation

15 of 23

Pose estimation

Video taken with an iPhone X
Built with PyTorch and Caffe2
2D pose estimation using OpenPose
2D poses are fed into the temporal model

16 of 23

Motion reconstruction

Input: a receptive field of 161 consecutive frames with J = 17 joints, shape (161, 17×2)
Dilated convolutions
4 ResNet-style blocks, each surrounded by a skip connection
Output: one 3D pose of shape (1, 51), i.e. 17 joints × 3 coordinates
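
The link between dilated convolutions and the number of input frames needed for one output pose can be sketched as follows. This assumes unpadded 1D convolutions whose dilation is the product of all preceding filter widths. The helper names are hypothetical, and the filter widths in the example are illustrative; the slides do not state which widths produce the 161-frame input.

```python
def receptive_field(filter_widths):
    """Number of input frames that influence a single output frame."""
    frames = 1
    dilation = 1
    for width in filter_widths:
        frames += (width - 1) * dilation  # each layer widens the window
        dilation *= width                 # assumed: dilation grows with depth
    return frames


def output_length(num_input_frames, filter_widths):
    """Frames left after running the unpadded convolution stack."""
    return num_input_frames - receptive_field(filter_widths) + 1


small = receptive_field([3, 3, 3])
large = receptive_field([3, 3, 3, 3, 3])
single_pose = output_length(small, [3, 3, 3])
```

When the input length equals the receptive field, the stack collapses the whole window into a single output frame, which matches the (T, 34) to (1, 51) shapes listed above.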

17 of 23

Motion reconstruction

VIBE

18 of 23

Motion reconstruction

19 of 23

Motion reconstruction

Comparison between VideoPose3D and VIBE

20 of 23

Motion imitation

Motions obtained by VIBE: sequences of poses

Transformed into:

root position (3D), root rotation (4D), chest rotation (4D), neck rotation (4D), right hip rotation (4D), right knee rotation (1D), right ankle rotation (4D), right shoulder rotation (4D), right elbow rotation (1D), left hip rotation (4D), left knee rotation (1D), left ankle rotation (4D), left shoulder rotation (4D), left elbow rotation (1D)
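
The per-frame layout listed above can be written down as a table of named slices. This is a minimal sketch: the field names, `frame_slices`, and `split_frame` are hypothetical helpers, and only the order and dimensions come from the list.

```python
# (name, number of values), in the order given in the list above.
FRAME_LAYOUT = [
    ("root_position", 3), ("root_rotation", 4),
    ("chest_rotation", 4), ("neck_rotation", 4),
    ("right_hip_rotation", 4), ("right_knee_rotation", 1),
    ("right_ankle_rotation", 4), ("right_shoulder_rotation", 4),
    ("right_elbow_rotation", 1),
    ("left_hip_rotation", 4), ("left_knee_rotation", 1),
    ("left_ankle_rotation", 4), ("left_shoulder_rotation", 4),
    ("left_elbow_rotation", 1),
]


def frame_slices(layout):
    """Map each field name to its slice in the flat frame vector."""
    slices, start = {}, 0
    for name, size in layout:
        slices[name] = slice(start, start + size)
        start += size
    return slices, start


SLICES, FRAME_SIZE = frame_slices(FRAME_LAYOUT)


def split_frame(frame):
    """Split one flat pose vector into named components."""
    if len(frame) != FRAME_SIZE:
        raise ValueError(f"expected {FRAME_SIZE} values, got {len(frame)}")
    return {name: list(frame[s]) for name, s in SLICES.items()}


example = split_frame(list(range(FRAME_SIZE)))
```

With these dimensions each frame holds 43 values: 4D entries are quaternions for spherical joints, and 1D entries are angles for the hinge joints (knees and elbows).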

Reinforcement Learning

Simulation environment: PyBullet

State: relative positions, rotations, linear and angular velocities of each link with respect to the root

Action: target orientations for PD controllers at each joint

The policy determines which action to apply at each timestep in order to reproduce the desired motion
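
How a target orientation becomes a joint torque can be sketched for a single hinge joint such as a knee. The gains `kp` and `kd`, the unit inertia, the timestep, and the simple integrator are all assumptions for illustration; in the actual setup the physics engine handles the dynamics.

```python
def pd_torque(target_angle, angle, velocity, kp=300.0, kd=30.0):
    """Proportional-derivative torque pulling the joint toward target_angle."""
    return kp * (target_angle - angle) - kd * velocity


def simulate_joint(target_angle, steps=2000, dt=0.001, inertia=1.0):
    """Drive a frictionless 1-DoF joint from rest at angle 0 toward the target."""
    angle, velocity = 0.0, 0.0
    for _ in range(steps):
        torque = pd_torque(target_angle, angle, velocity)
        velocity += (torque / inertia) * dt  # semi-implicit Euler: velocity first
        angle += velocity * dt               # then position with the new velocity
    return angle


final_angle = simulate_joint(target_angle=1.0)
```

The policy only has to output `target_angle` for each joint at each timestep; the PD controller turns that target into a torque.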

21 of 23

Motion imitation

22 of 23

Motion imitation

Test on mocap data after training with 8 workers for 1 day

Test on in-the-wild video after training with 8 workers for 1 day
