1 of 23

Temporal Representation Learning on Monocular Videos for 3D Human Pose Estimation

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022

Sina Honari, Victor Constantin,

Helge Rhodin, Mathieu Salzmann, Pascal Fua

CVLab, EPFL

2 of 23

Self-Supervised Learning of Temporal Features

Motivation:

Extract temporal information from single-view RGB videos without supervision.

We need to:

    • Detect the person
    • Extract temporal (time-variant) information

Goal:

Use temporal information for 3D human pose estimation

3 of 23

Self-Supervised Learning of Temporal Features

Extracting time-variant features through image reconstruction
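
A minimal sketch of this idea in PyTorch (module names and layer sizes are my own illustrative choices, not the paper's architecture): an encoder-decoder is trained to reconstruct each input frame, so the latent code is forced to retain the frame's time-variant content without any pose labels.

```python
import torch
import torch.nn as nn

class FrameAutoencoder(nn.Module):
    def __init__(self, latent_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(            # 128x128 RGB -> latent vector
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(128 * 16 * 16, latent_dim),
        )
        self.decoder = nn.Sequential(            # latent vector -> 128x128 RGB
            nn.Linear(latent_dim, 128 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (128, 16, 16)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, frames):                   # frames: (B, 3, 128, 128)
        z = self.encoder(frames)
        return self.decoder(z), z

model = FrameAutoencoder()
frames = torch.rand(8, 3, 128, 128)             # a batch of video frames
recon, z = model(frames)
loss = nn.functional.mse_loss(recon, frames)    # self-supervised: no labels
```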


4 of 23

Self-Supervised Learning of Temporal Features

Enforcing similarity between nearby frames and dissimilarity between temporally distant frames

5 of 23

Contrastive Learning
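
The objective can be sketched as an InfoNCE-style time-contrastive loss (a hedged reconstruction, not necessarily the paper's exact formulation): the latent code of a temporally nearby frame serves as the positive, that of a temporally distant frame as the negative.

```python
import torch
import torch.nn.functional as F

def time_contrastive_loss(z_anchor, z_near, z_far, temperature=0.1):
    """z_*: (B, D) latents of an anchor frame, a temporally close frame
    (positive) and a temporally distant frame (negative)."""
    z_anchor = F.normalize(z_anchor, dim=1)
    z_near = F.normalize(z_near, dim=1)
    z_far = F.normalize(z_far, dim=1)
    pos = (z_anchor * z_near).sum(dim=1) / temperature  # anchor-positive similarity
    neg = (z_anchor * z_far).sum(dim=1) / temperature   # anchor-negative similarity
    logits = torch.stack([pos, neg], dim=1)             # (B, 2)
    labels = torch.zeros(logits.size(0), dtype=torch.long)  # positive = class 0
    return F.cross_entropy(logits, labels)
```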


6 of 23

Gravity Guided Detection


7 of 23

Gravity Guided Detection

We assume:

    • the center of the bounding box corresponds to the diver's center of gravity
    • the diver travels in a plane parallel to the camera

8 of 23

Gravity Guided Detection
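
Under the two assumptions on the previous slide, the diver is in free fall, so the vertical coordinate of the bounding-box center must follow a parabola in time. A minimal sketch of this prior (function names and constants are illustrative, not the paper's code): fitting y(t) = y0 + v·t + a·t² to the per-frame detections by least squares yields gravity-consistent box centers and flags outlier detections.

```python
import numpy as np

def fit_ballistic_trajectory(t, y):
    """Least-squares fit of y(t) = y0 + v*t + a*t^2 to detected box centers."""
    A = np.stack([np.ones_like(t), t, t ** 2], axis=1)  # design matrix
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)      # [y0, v, a]
    return coeffs

t = np.linspace(0.0, 1.0, 30)                     # frame timestamps (seconds)
y_true = 50 + 20 * t + 180 * t ** 2               # ideal parabolic fall (pixels)
y_obs = y_true + np.random.normal(0, 3, t.shape)  # noisy per-frame detections
y0, v, a = fit_ballistic_trajectory(t, y_obs)
y_smooth = y0 + v * t + a * t ** 2                # gravity-consistent centers
```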


9 of 23


10 of 23

Gravity Guided Detection


11 of 23

Pose Estimation


12 of 23

Results on Pose Estimation – H36M PoseTrack

Ablation over: Detects? / Decodes? / Contrastive loss? / Latent split?

Percentage of labelled pose data:

| Model | 0.3%   | 1%     | 14%    | 50%     | 100% (35K) |
| DSL   | 227.24 | 220.64 | 198.34 | 193.08  | 191.21     |
| DSL   | 236.32 | 211.85 | 202.57 | 199.16  | 198.22     |
| DSL   | 185.75 | 159.34 | 135.93 | 129.62  | 127.52     |
| DSL   | 149.32 | 122.44 | 100.23 | 97.38   | 95.39      |
| -     | 187.25 | 161.53 | 130.21 | 117.044 | 114.73     |
| CSS   | 163.61 | 137.52 | 110.61 | 104.23  | 99.32      |

NMPJPE (Normalized Mean Per Joint Error) – lower is better

13 of 23

Comparison with Self-Supervised Learning Models

NMPJPE (Normalized Mean Per Joint Error) – lower is better

MV: Multi-View approaches

SV: Single-View approaches

Each model extracts latent features without supervision; the features are then evaluated on pose estimation

14 of 23

Leveraging Weak Labels

When 2D pose labels are available, we can use 2D detection together with image resynthesis in the first phase.

In the second phase, the time-variant features (optionally together with the 2D pose) are lifted to 3D pose, as sketched below.
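
A minimal sketch of this second phase (shapes, sizes, and module names are assumptions, not the paper's code): a small MLP lifts the frozen time-variant features, optionally concatenated with 2D keypoints, to 3D joint positions.

```python
import torch
import torch.nn as nn

J = 17  # number of joints (dataset-dependent)

class PoseLifter(nn.Module):
    def __init__(self, feat_dim=256, use_2d=True):
        super().__init__()
        in_dim = feat_dim + (2 * J if use_2d else 0)
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, 3 * J),              # 3D coordinates of J joints
        )

    def forward(self, feats, kp2d=None):
        x = feats if kp2d is None else torch.cat([feats, kp2d.flatten(1)], dim=1)
        return self.mlp(x).view(-1, J, 3)

lifter = PoseLifter()
feats = torch.rand(8, 256)    # frozen time-variant features from phase one
kp2d = torch.rand(8, J, 2)    # weak 2D labels or detections (optional)
pose3d = lifter(feats, kp2d)  # (8, 17, 3)
```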

15 of 23

Ablation – Results on H36M (in mm)

| 1st-phase          | 2nd-phase | MPJPE ↓ | N-MPJPE ↓ | P-MPJPE ↓ |
| Resynthesis        |           | 100.3   | 99.3      | 74.9      |
| Resynthesis + 2D   |           | 74.5    | 74.1      | 57.3      |
| Resynthesis + 2D   |           | 73.3    | 72.9      | 56.1      |
| Resynthesis + 2D * |           | 64.2    | 63.7      | 50.1      |
| Resynthesis + 2D * |           | 62.6    | 62.3      | 48.2      |
| Resynthesis + 2D * |           | 60.1    | 59.7      | 46.0      |

Models with * use ResNet-101 instead of ResNet-50
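
For reference, the three metrics can be sketched as follows (standard definitions paraphrased from the literature, not the paper's evaluation code): MPJPE is the raw mean per-joint distance in mm; N-MPJPE first rescales the prediction to best match the ground truth; P-MPJPE first applies a full Procrustes alignment (rotation, translation, and scale).

```python
import numpy as np

def mpjpe(pred, gt):                        # pred, gt: (J, 3) arrays in mm
    return np.linalg.norm(pred - gt, axis=1).mean()

def n_mpjpe(pred, gt):                      # optimal global rescaling first
    s = (pred * gt).sum() / (pred * pred).sum()
    return mpjpe(s * pred, gt)

def p_mpjpe(pred, gt):                      # Procrustes alignment first
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    P, G = pred - mu_p, gt - mu_g           # center both poses
    U, S, Vt = np.linalg.svd(P.T @ G)
    R = U @ Vt                              # optimal rotation: P @ R ~ G
    if np.linalg.det(R) < 0:                # avoid an improper reflection
        U[:, -1] *= -1
        S[-1] *= -1
        R = U @ Vt
    s = S.sum() / (P ** 2).sum()            # optimal scale
    return mpjpe(s * P @ R + mu_g, gt)
```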

16 of 23

Comparison with SOTA

Results on H36M

Results on MPI-INF-3DHP

17 of 23

Mask Prediction + Reconstructed Input

18 of 23

Mask Prediction + Reconstructed Input

19 of 23

Mask Prediction + Reconstructed Input

20 of 23

Single View 3D Pose Prediction

21 of 23

Single View 3D Pose Prediction

22 of 23

Single View 3D Pose Prediction

23 of 23

Conclusion: Self-Supervised Learning of Temporal Features

    • The model learns to capture temporal information from monocular videos
    • The temporal information is then used for 3D human pose estimation