Temporal Representation Learning on Monocular Videos for 3D Human Pose Estimation
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Sina Honari, Victor Constantin,
Helge Rhodin, Mathieu Salzmann, Pascal Fua
CVLab, EPFL
Self-Supervised Learning of Temporal Features
Motivation:
Extract temporal information from single-view RGB videos without supervision.
Goal:
Use temporal information to improve 3D human pose estimation.
Extracting time-variant features through image reconstruction
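The slides do not detail the architecture; a toy sketch of the latent-split idea (the linear "encoder" and all dimensions are illustrative assumptions): each frame is encoded into a latent vector that is split into a time-variant part and a time-invariant part, and a decoder trained to reconstruct a frame from the variant part of one time step combined with the invariant part of another is pushed to store motion in the variant part and static content in the invariant part.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(frame, W):
    # Toy linear "encoder": flatten the frame and project it to a latent vector.
    return W @ frame.ravel()

def split_latent(z, n_variant):
    # Split the latent into a time-variant part (pose/motion) and a
    # time-invariant part (appearance/background).
    return z[:n_variant], z[n_variant:]

# Two frames of the same (toy) video.
frame_t  = rng.normal(size=(8, 8))
frame_t2 = rng.normal(size=(8, 8))

latent_dim, n_variant = 16, 6
W = rng.normal(size=(latent_dim, 64))  # illustrative encoder weights

var_t,  inv_t  = split_latent(encode(frame_t,  W), n_variant)
var_t2, inv_t2 = split_latent(encode(frame_t2, W), n_variant)

# Recombine: time-variant features from t2 with time-invariant features from t.
# A decoder reconstructing frame t2 from this pairing cannot rely on the
# invariant slot for motion information.
z_mix = np.concatenate([var_t2, inv_t])
assert z_mix.shape == (latent_dim,)
```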
Enforcing similarity of nearby frames and dissimilarity of away frames
Contrastive Learning
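The deck does not spell out the exact objective; a minimal sketch of a time-contrastive InfoNCE-style loss expressing the idea above (the function name, temperature, and windowing scheme are illustrative assumptions):

```python
import numpy as np

def time_contrastive_loss(feats, tau=0.1, pos_window=1):
    """InfoNCE-style loss over per-frame features of shape (T, D):
    frames within `pos_window` time steps are treated as positives,
    all other frames as negatives."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)  # unit-normalize
    sim = (f @ f.T) / tau                                     # scaled cosine sims
    T = len(f)
    losses = []
    for i in range(T):
        logits = np.delete(sim[i], i)              # drop self-similarity
        log_denom = np.log(np.exp(logits).sum())
        for j in range(T):
            if j != i and abs(j - i) <= pos_window:
                idx = j if j < i else j - 1        # index shift after deletion
                losses.append(log_denom - logits[idx])  # -log softmax(positive)
    return float(np.mean(losses))
```

Features that vary smoothly over time (nearby frames similar, distant frames dissimilar) yield a lower loss than temporally scrambled features, which is exactly the structure the objective rewards.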
Gravity Guided Detection
We assume the center of the bounding box corresponds to the diver's center of gravity.
We assume the diver travels in a plane parallel to the camera plane.
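The slides only state these two assumptions. Under them, the center of gravity undergoes free fall in a fronto-parallel plane, so its image trajectory is linear in x and parabolic in y over time. One plausible sketch (the fitting scheme and all names are assumptions, not necessarily the paper's exact method) refines noisy per-frame detections by least-squares fitting those motion models:

```python
import numpy as np

def fit_gravity_trajectory(t, cx, cy):
    """Fit bbox-center trajectories under the free-fall assumption:
    horizontal motion is linear in time, vertical motion is quadratic
    (constant gravity). Returns smoothed centers at the input times."""
    px = np.polyfit(t, cx, deg=1)   # x(t) = a1*t + a0 (no horizontal accel.)
    py = np.polyfit(t, cy, deg=2)   # y(t) = b2*t^2 + b1*t + b0 (b2 ~ g/2 px units)
    return np.polyval(px, t), np.polyval(py, t)

# Noisy detections of a simulated dive (image y grows downward).
t = np.linspace(0.0, 1.0, 20)
true_x = 50 + 30 * t
true_y = 100 - 80 * t + 0.5 * 9.81 * 40 * t**2   # gravity scaled to pixels
rng = np.random.default_rng(0)
cx = true_x + rng.normal(0, 2, t.shape)
cy = true_y + rng.normal(0, 2, t.shape)

sx, sy = fit_gravity_trajectory(t, cx, cy)
# The fitted trajectory is typically closer to the true path than the raw
# detections, since the physics prior averages out detection noise.
```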
Pose Estimation
Results on Pose Estimation – H36M PoseTrack
The last five columns give the percentage of labelled pose data.

| Detects? | Decodes? | Contrastive loss? | Latent split? | 0.3% | 1% | 14% | 50% | 100% (35K) |
|:---:|:---:|:---:|:---:|---:|---:|---:|---:|---:|
| ✘ | ✘ | DSL | ✘ | 227.24 | 220.64 | 198.34 | 193.08 | 191.21 |
| ✓ | ✘ | DSL | ✘ | 236.32 | 211.85 | 202.57 | 199.16 | 198.22 |
| ✓ | ✓ | DSL | ✘ | 185.75 | 159.34 | 135.93 | 129.62 | 127.52 |
| ✓ | ✓ | DSL | ✓ | 149.32 | 122.44 | 100.23 | 97.38 | 95.39 |
| ✓ | ✓ | – | ✘ | 187.25 | 161.53 | 130.21 | 117.044 | 114.73 |
| ✓ | ✓ | CSS | ✓ | 163.61 | 137.52 | 110.61 | 104.23 | 99.32 |
N-MPJPE (Normalized Mean Per Joint Position Error), in mm – lower is better
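A minimal sketch of the metric, assuming poses are already root-aligned (the function names are illustrative): N-MPJPE rescales the predicted pose by the closed-form least-squares optimal scalar before measuring the mean per-joint Euclidean error, which removes global scale ambiguity.

```python
import numpy as np

def mpjpe(pred, gt):
    # Mean Euclidean distance per joint; pred and gt have shape (J, 3).
    return np.linalg.norm(pred - gt, axis=1).mean()

def n_mpjpe(pred, gt):
    """Normalized MPJPE: rescale the prediction by the optimal scalar s
    (closed-form least-squares solution) before computing MPJPE."""
    s = np.sum(pred * gt) / np.sum(pred * pred)
    return mpjpe(s * pred, gt)
```

For example, a prediction that is exactly twice the ground truth has a large MPJPE but an N-MPJPE of zero, since the optimal scale s = 0.5 aligns it perfectly.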
Comparison with Self-Supervised Learning Models
N-MPJPE (Normalized Mean Per Joint Position Error), in mm – lower is better
MV: Multi-View approaches
SV: Single-View approaches
Each model extracts latent features without supervision; these features are then evaluated on pose estimation.
Leveraging Weak Labels
When 2D pose labels are available, we can use 2D detection together with image re-synthesis in the first training phase.
In the second phase, the time-variant features can be used (optionally together with the 2D pose) to lift to 3D pose.
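A toy forward pass of such a second-phase lifter (the MLP, its sizes, and all names are illustrative assumptions, not the paper's architecture): the time-variant features from phase one are concatenated with optional 2D keypoints and mapped to 3D joint positions.

```python
import numpy as np

rng = np.random.default_rng(0)

def lift_to_3d(z_variant, pose_2d, params):
    """Toy second-phase lifter: concatenate time-variant features with
    2D keypoints and map them to 3D joints with a small MLP."""
    x = np.concatenate([z_variant, pose_2d.ravel()])
    W1, b1, W2, b2 = params
    h = np.maximum(0.0, W1 @ x + b1)       # ReLU hidden layer
    return (W2 @ h + b2).reshape(-1, 3)    # (J, 3) joint positions

J, latent_dim, hidden = 17, 32, 64         # illustrative sizes
params = (rng.normal(size=(hidden, latent_dim + 2 * J)), np.zeros(hidden),
          rng.normal(size=(3 * J, hidden)), np.zeros(3 * J))

z = rng.normal(size=latent_dim)            # time-variant features (phase 1)
kp2d = rng.normal(size=(J, 2))             # weak 2D labels, if available
pose3d = lift_to_3d(z, kp2d, params)
assert pose3d.shape == (J, 3)
```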
Ablation – Results on H36M (in mm)
| 1st-phase | 2nd-phase | MPJPE ↓ | N-MPJPE ↓ | P-MPJPE ↓ |
|:---|:---|---:|---:|---:|
| Resynthesis | | 100.3 | 99.3 | 74.9 |
| Resynthesis + 2D | | 74.5 | 74.1 | 57.3 |
| Resynthesis + 2D | | 73.3 | 72.9 | 56.1 |
| Resynthesis + 2D \* | | 64.2 | 63.7 | 50.1 |
| Resynthesis + 2D \* | | 62.6 | 62.3 | 48.2 |
| Resynthesis + 2D \* | | 60.1 | 59.7 | 46.0 |

Models with \* use ResNet-101 instead of ResNet-50.
Comparison with SOTA
Results on H36M
Results on MPI-INF-3DHP
Mask Prediction + Reconstructed Input
Single View 3D Pose Prediction
Conclusion:
Self-supervised learning of temporal features from single-view RGB videos provides an effective representation for 3D human pose estimation.