
Towards Universal 3D Lifting

Keerthan Bhat Hekkadka, Roshan Roy

Advisors - Prof. Laszlo Jeni, Mosam Dabhi

Motivation

Architecture

Experiments

References

Extracting 3D structure from casual captures of non-rigid, deformable objects is of great relevance in 3D research.

Proposed Approach

I. Image-to-2D keypoints

II. 2D-to-3D keypoints

Limitation - Massive motion capture rigs are expensive, inflexible and require accurate multi-view camera calibration.

Can we lift 3D structure in-the-wild without prior object knowledge and without 2D-3D semantic correspondences?

Stable Keypoints [1]

Detects semantic 2D keypoints using knowledge from a pretrained Stable Diffusion model

3D-LFM: Lifting Foundation Model [2]

Universal 2D-to-3D lifting of rigid and non-rigid (deformable) objects

[1] E. Hedlin et al., “Unsupervised Keypoints from Pretrained Diffusion Models,” arXiv preprint arXiv:2312.00065, 2023.

[2] M. Dabhi, L. A. Jeni, and S. Lucey, “3D-LFM: Lifting Foundation Model,” 2023.

[3] W. Zhu, X. Ma, Z. Liu, L. Liu, W. Wu, and Y. Wang, “MotionBERT: A Unified Perspective on Learning Human Motion Representations,” 2023.

[4] S. K. Dwivedi, Y. Sun, P. Patel, Y. Feng, and M. J. Black, “TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation,” CVPR, 2024.

Noise: detected keypoints that are not object-centric

3D-LFM++ performs close to SOTA without using temporal information!

1. Image-to-2D analysis

2. Adding noise robustness to 3D-LFM

3. 3D pose-to-mesh

Noise: keypoints switching between the right and left legs

IV. 3D pose-to-mesh

Noisy input to [2] affects MPJPE
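For reference, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth 3D joints. The sketch below is a minimal illustration, not the evaluation code used for these experiments: the function name `mpjpe` and the toy coordinates are made up, and no alignment step (such as root centering) is applied.

```python
import math

def mpjpe(pred, gt):
    """Mean per-joint position error for one pose.

    pred, gt: equal-length sequences of (x, y, z) joint positions.
    Returns the mean Euclidean distance between corresponding joints.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)

# Toy two-joint pose with made-up coordinates (units: millimetres).
gt = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)]
pred = [(3.0, 4.0, 0.0), (100.0, 0.0, 12.0)]
print(mpjpe(pred, gt))  # per-joint errors are 5 and 12, so the mean is 8.5
```

Because the metric is a plain average over joints, a single badly placed or swapped keypoint in the lifted output raises it directly.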

3D-LFM++ robust against missing keypoints

3D-LFM++ robust against noisy keypoints

Tokenization of human pose to obtain SMPL pose from pre-trained codebook
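A codebook lookup of this kind can be illustrated with nearest-neighbour vector quantization. The sketch below assumes the codebook is queried in the usual vector-quantization way; the function `quantize`, the 2D toy codebook, and the feature values are hypothetical and are not taken from [4].

```python
import math

def quantize(feature, codebook):
    """Return (index, entry) of the codebook vector nearest to `feature`."""
    idx = min(range(len(codebook)), key=lambda i: math.dist(feature, codebook[i]))
    return idx, codebook[idx]

# Toy codebook with made-up entries; a real one would be learned and
# pre-trained, with far more entries and higher-dimensional vectors.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
features = [(0.9, 0.1), (0.1, 0.8)]
tokens = [quantize(f, codebook)[0] for f in features]
print(tokens)  # [1, 2]
```

Each continuous feature is replaced by the index of its closest codebook entry, so a pose becomes a short sequence of discrete tokens that a decoder can map back to pose parameters.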

TokenHMR [4]

III. Robustness against noise

Leverages train-time noise augmentation to improve test-time robustness
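One way such train-time augmentation could look is sketched below, covering the three noise types named on this poster: jittered keypoints, left/right switching, and missing keypoints. This is an assumption-laden illustration, not the 3D-LFM++ training code: the function `augment_keypoints` and the hyperparameters `sigma`, `drop_prob`, and `swap_prob` are hypothetical.

```python
import random

def augment_keypoints(kpts, sigma=0.02, drop_prob=0.1,
                      swap_pairs=(), swap_prob=0.1, rng=None):
    """Return (noisy_kpts, visible_mask) for one 2D pose.

    kpts: list of (x, y) in normalized image coordinates.
    sigma, drop_prob, swap_prob: hypothetical values, not from the poster.
    swap_pairs: (left, right) index pairs that may be exchanged.
    """
    rng = rng or random.Random()
    # 1) Gaussian jitter on every coordinate (noisy detections).
    noisy = [(x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma))
             for x, y in kpts]
    # 2) Occasionally exchange left/right joints (keypoint switching).
    for left, right in swap_pairs:
        if rng.random() < swap_prob:
            noisy[left], noisy[right] = noisy[right], noisy[left]
    # 3) Randomly drop keypoints: zero them and record them in the mask.
    mask = [rng.random() >= drop_prob for _ in noisy]
    noisy = [p if keep else (0.0, 0.0) for p, keep in zip(noisy, mask)]
    return noisy, mask

# Example: a three-joint pose where joints 1 and 2 form a left/right pair.
pose = [(0.5, 0.2), (0.4, 0.8), (0.6, 0.8)]
noisy_pose, visible = augment_keypoints(pose, swap_pairs=[(1, 2)],
                                        rng=random.Random(0))
```

Applying a fresh draw to each training sample exposes the lifter to the same corruptions it is expected to see from an upstream 2D detector at test time.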

MotionBERT [3]

Pipeline (figure labels): extract 2D keypoints, lift 2D to 3D, lift tokenized 3D to mesh

Universal image-to-mesh foundation model

Image-to-mesh (minimal human supervision)

OOD Generalization (foundational)

Temporal Consistency (preserve semantics)

Universal image-to-mesh lifting

Human mesh reconstructed using [4]