Towards Universal 3D Lifting
Keerthan Bhat Hekkadka, Roshan Roy
Advisors - Prof. Laszlo Jeni, Mosam Dabhi
Motivation
Architecture
Experiments
References
Extracting 3D structure from casual captures of non-rigid, deformable objects is of great relevance in 3D research.
Proposed Approach
II. 2D-to-3D keypoints
Limitation - Massive motion capture rigs are expensive, inflexible and require accurate multi-view camera calibration.
Can we lift 3D structure in-the-wild without prior object knowledge and without 2D-3D semantic correspondences?
Stable Keypoints [1]
Detects semantic 2D keypoints using knowledge from a pretrained Stable Diffusion model
3D-LFM: Lifting Foundation Model [2]
Universal 2D-to-3D lifting of rigid and non-rigid (deformable) objects
[1] E. Hedlin et al., “Unsupervised Keypoints from Pretrained Diffusion Models,” arXiv preprint arXiv:2312.00065, 2023.
[2] M. Dabhi, L. A. Jeni, and S. Lucey, “3D-LFM: Lifting Foundation Model,” 2023.
[3] W. Zhu, X. Ma, Z. Liu, L. Liu, W. Wu, and Y. Wang, “MotionBERT: A Unified Perspective on Learning Human Motion Representations,” 2023.
[4] S. K. Dwivedi, Y. Sun, P. Patel, Y. Feng, and M. J. Black, “TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation,” CVPR, 2024.
Noise: Detection of non-object-centric keypoints
3D-LFM++ performs close to SOTA without using temporal information!
2. Adding noise robustness to 3D-LFM
3. 3D pose-to-mesh
Noise: Keypoints switching between the right and left legs
IV. 3D pose-to-mesh
Noisy input to 3D-LFM [2] degrades MPJPE
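MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth 3D joints. A minimal sketch of the metric, assuming joints are given as plain (x, y, z) tuples in matching order; the function name `mpjpe` is illustrative and not taken from any of the cited codebases, and no alignment step (e.g. root or Procrustes alignment) is applied:

```python
import math

def mpjpe(pred, gt):
    """Mean per-joint position error between two lists of 3D joints.

    pred, gt: sequences of (x, y, z) tuples with the same length and joint order.
    Returns the mean Euclidean distance, in the same units as the inputs.
    """
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and of equal length")
    # Sum the per-joint Euclidean distances, then average over joints.
    total = sum(math.dist(p, g) for p, g in zip(pred, gt))
    return total / len(pred)

# Toy example: first joint is exact, second joint is off by 5 units.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 3.0, 4.0)]
print(mpjpe(pred, gt))  # 2.5
```

Under this definition, any perturbation of the input that moves predicted joints away from the ground truth raises the score, so lower is better.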
3D-LFM++ is robust against missing keypoints
3D-LFM++ is robust against noisy keypoints
Tokenization of human pose to obtain SMPL pose from pre-trained codebook
TokenHMR [4]
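The general idea of tokenizing a pose against a codebook can be illustrated with a nearest-neighbor lookup. This is a toy sketch of generic vector quantization only, not the TokenHMR implementation: the codebook values, the 2D feature vectors, and the names `quantize` and `decode` are all assumptions made for illustration, whereas a real codebook is learned and pre-trained.

```python
import math

def quantize(vectors, codebook):
    """Map each feature vector to the index (token) of its nearest codebook entry."""
    tokens = []
    for v in vectors:
        # Nearest neighbor under Euclidean distance.
        best = min(range(len(codebook)), key=lambda k: math.dist(v, codebook[k]))
        tokens.append(best)
    return tokens

def decode(tokens, codebook):
    """Recover an approximate vector for each token by codebook lookup."""
    return [codebook[t] for t in tokens]

# Hypothetical 3-entry codebook and three pose feature vectors.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
features = [(0.1, -0.1), (0.9, 1.2), (4.0, 4.0)]
tokens = quantize(features, codebook)
print(tokens)                    # [0, 1, 2]
print(decode(tokens, codebook))  # [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
```

Because every token decodes to an entry of the codebook, the output is restricted to combinations the codebook can represent, which is the motivation for using a codebook trained on valid poses.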
III. Robustness against noise
Leverages train-time noise augmentation to improve test-time robustness
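A minimal sketch of what train-time noise augmentation on 2D keypoints could look like, covering the three noise types named on this poster: jittered keypoints, missing keypoints, and left/right switching. The function name, parameters, and default probabilities are assumptions for illustration and are not the settings used in 3D-LFM++.

```python
import random

def augment_keypoints(keypoints, sigma=0.02, drop_prob=0.1,
                      swap_pairs=(), swap_prob=0.1, rng=None):
    """Return (noisy_keypoints, visibility_mask) for one training sample.

    keypoints:  list of (x, y) tuples, assumed normalized image coordinates.
    sigma:      std. dev. of Gaussian jitter added to each coordinate.
    drop_prob:  probability of masking a keypoint out (simulates a missed detection).
    swap_pairs: index pairs (i, j), e.g. left/right joints, that may be exchanged.
    swap_prob:  probability of exchanging each pair.
    """
    rng = rng or random.Random()
    pts = [tuple(p) for p in keypoints]

    # Simulate left/right confusion by swapping symmetric joints.
    for i, j in swap_pairs:
        if rng.random() < swap_prob:
            pts[i], pts[j] = pts[j], pts[i]

    noisy, mask = [], []
    for x, y in pts:
        if rng.random() < drop_prob:
            # Missing keypoint: zero it out and mark it as not visible.
            noisy.append((0.0, 0.0))
            mask.append(0)
        else:
            # Detected keypoint: add Gaussian jitter to both coordinates.
            noisy.append((x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma)))
            mask.append(1)
    return noisy, mask

# Usage: indices 1 and 2 stand in for a hypothetical left/right leg pair.
sample = [(0.5, 0.1), (0.4, 0.8), (0.6, 0.8)]
noisy, mask = augment_keypoints(sample, swap_pairs=[(1, 2)], rng=random.Random(0))
```

Applying a fresh random draw to every sample at each training step exposes the lifting model to the same kinds of corruption it is expected to meet at test time.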
MotionBERT [3]
Figure (pipeline stages): extract 2D keypoints; lift 2D to 3D; lift tokenized 3D to mesh
Universal image-to-mesh foundation model
Image-to-mesh
(minimal human supervision)
OOD Generalization
(foundational)
Temporal Consistency
(preserve semantics)
Universal image-to-mesh lifting
Human mesh reconstructed using TokenHMR [4]