
Towards Universal State Estimation and

Reconstruction in the Wild

Team 15 - Bhuvan Jhamb, Chenwei Lyu

Mentors - Nikhil Keetha, Dr. Sebastian Scherer


· Imagine what’s in the future…

What is common in all of these? - SLAM

Motivation

https://medium.com/@recogni/autonomous-vehicles-and-a-system-of-connected-cars-944f86275663

https://www.reuters.com/technology/apple-ramps-up-vision-pro-production-plans-february-launch-bloomberg-news-2023-12-20/

https://www.af.mil/News/Article-Display/Article/2551037/robot-dogs-arrive-at-tyndall-afb/

Autonomous Vehicle

AR/VR

Robotics


Motivation

· SLAM: Simultaneous Localization and Mapping

  • Localization: Determining the robot's position in the environment.
  • Mapping: Creating a map of the environment.


https://www.amazon.science/latest-news/how-zoox-vehicles-find-themselves-in-an-ever-changing-world

· Autonomous Vehicle

  • Cars need real-time reconstruction of the scene and real-time localization

Motivation


Motivation

· Autonomous Vehicle

· AR/VR

  • The headset needs to know its orientation and build a map of its surroundings

https://mobilesyrup.com/2023/06/05/apple-unveils-mixed-reality-headset-wwdc-2023/


· Autonomous Vehicle

· AR/VR

· Robotics

  • Robots need to understand and interact with their environment

Motivation

https://m.kangnamtimes.com/tech/article/73975/


· What is common in all of these?

  • All these applications need precise camera tracking and high-fidelity reconstruction - and often we need to do these together (SLAM)
  • Dense reconstructions significantly enhance interactivity and utility - especially for AR/VR and robotics as humans and robots start to share workspaces

Motivation

https://medium.com/@recogni/autonomous-vehicles-and-a-system-of-connected-cars-944f86275663

https://www.reuters.com/technology/apple-ramps-up-vision-pro-production-plans-february-launch-bloomberg-news-2023-12-20/

https://www.af.mil/News/Article-Display/Article/2551037/robot-dogs-arrive-at-tyndall-afb/


Motivation

· Why Are Existing Methods Insufficient?

  • Classical SLAM: brittle, does not use priors, and requires extensive fine-tuning
  • Implicit (NeRF-based) SLAM: does not translate well to real robotic systems
  • No explicit, dense, learning-based SLAM currently delivers real-time, high-accuracy visual odometry (VO) and dense mapping

Hence this leads to our project:

Towards Universal State Estimation and Reconstruction in the Wild


Literature Review

  1. SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM
  2. AnyLoc: Towards Universal Visual Place Recognition
  3. DUSt3R: Geometric 3D Vision Made Easy


A scene can be represented explicitly as a collection of 3D Gaussians

Image Source: https://rmurai.co.uk/projects/GaussianSplattingSLAM/

Gaussian Splatting: Brief Review


Offline (3D Gaussian Splatting): RGB images + GT poses + initial point cloud → 3D Gaussians representing the scene

Online (SplaTAM): incremental RGB-D frames → camera poses + 3D Gaussians representing the scene

SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM


SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM

Gaussian Splatting meets SLAM


Assumptions

Gaussians are modelled as

  • Isotropic
  • View Independent

With these assumptions, each Gaussian has 8 parameters (compared to 59 in the original 3DGS):

  • View Independent Color - 3 params (rgb)
  • Position - 3 params (xyz)
  • Radius - 1 param (r)
  • Opacity - 1 param (o)
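
As a rough illustration (not SplaTAM's actual code), the per-Gaussian state under these assumptions could be held in a structure like the one below; the field names are hypothetical.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class IsotropicGaussian:
    """One scene Gaussian under SplaTAM's simplifying assumptions (8 params)."""
    position: np.ndarray  # (3,) world-space center (x, y, z)
    color: np.ndarray     # (3,) view-independent RGB
    radius: float         # single isotropic scale
    opacity: float        # in [0, 1]


# In practice the map is stored as dense arrays (one row per Gaussian),
# e.g. an (N, 3) position array, an (N, 3) color array, an (N,) radius
# array and an (N,) opacity array, so rendering and optimization can be
# vectorized on the GPU.
```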

Image source (link)

SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM


How SplaTAM works:

Initialization

For every new frame:

  • Tracking
  • Gaussian Densification

Map Update

SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM


How SplaTAM works:

Initialization

For every new frame:

  • Tracking
  • Gaussian Densification

Map Update

Initialization

Camera Pose:

  • Modelled as identity for the 1st frame

Map Initialization:

For the first frame, for each pixel we add a new Gaussian with the following parameters:

  • Color - same as the pixel
  • Position - obtained by unprojecting the RGB-D values
  • Radius - (depth value) / (focal length)
  • Opacity - 0.5
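
A minimal sketch of this per-pixel initialization, assuming a pinhole camera with focal lengths (fx, fy) and principal point (cx, cy); this is illustrative, not SplaTAM's implementation.

```python
import numpy as np


def init_gaussians_from_rgbd(rgb, depth, fx, fy, cx, cy):
    """Create one isotropic Gaussian per pixel of the first RGB-D frame.

    rgb:   (H, W, 3) float array in [0, 1]
    depth: (H, W) metric depth
    Returns a dict of arrays: positions (N, 3), colors (N, 3), radii (N,), opacities (N,)
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))

    # Unproject every pixel to a 3D point in the camera frame
    # (the camera pose of the first frame is taken as identity).
    z = depth.reshape(-1)
    x = (u.reshape(-1) - cx) * z / fx
    y = (v.reshape(-1) - cy) * z / fy
    positions = np.stack([x, y, z], axis=-1)

    colors = rgb.reshape(-1, 3)        # color = pixel color
    radii = z / fx                     # radius = depth / focal length
    opacities = np.full_like(z, 0.5)   # opacity initialized to 0.5

    return {"positions": positions, "colors": colors,
            "radii": radii, "opacities": opacities}
```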

SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM


Tracking

  • Initialize the pose using a constant-velocity assumption

  • Update the camera pose with gradient descent, keeping the Gaussian parameters fixed, to minimize the rendering loss (see the sketch below)
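
A heavily simplified PyTorch sketch of the tracking step: the pose (here a quaternion and translation) is the only optimized variable, the map is frozen, and `render` stands in for a hypothetical differentiable rasterizer returning color, depth and a silhouette image. The silhouette masking and hyperparameters are illustrative.

```python
import torch


def track_frame(gaussians, rgb_gt, depth_gt, init_quat, init_trans,
                render, iters=40, lr=2e-3):
    """Optimize the camera pose of one frame while keeping the map fixed.

    init_quat / init_trans come from the constant-velocity forward
    projection of the previous poses.
    """
    quat = init_quat.clone().requires_grad_(True)    # rotation, shape (4,)
    trans = init_trans.clone().requires_grad_(True)  # translation, shape (3,)
    opt = torch.optim.Adam([quat, trans], lr=lr)

    for _ in range(iters):
        opt.zero_grad()
        rgb, depth, sil = render(gaussians, quat / quat.norm(), trans)
        mask = (sil > 0.99).float()  # only use well-observed pixels
        loss = (mask.unsqueeze(-1) * (rgb - rgb_gt).abs()).sum() \
             + (mask * (depth - depth_gt).abs()).sum()
        loss.backward()
        opt.step()

    return quat.detach() / quat.detach().norm(), trans.detach()
```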

How SplaTAM works:

Initialization

For every new frame:

  • Tracking
  • Gaussian Densification

Map Update

SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM


Gaussian Densification

  • Identify regions where new Gaussians need to be added by computing a densification mask
  • Insert the Gaussians using the same initialization strategy as for the 1st frame (see the sketch below)
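
One plausible way to compute such a mask (illustrative thresholds, not the exact criteria from the paper): a pixel requests new Gaussians where the current map does not explain it, i.e. where the rendered silhouette is low or the measured depth lies well in front of the rendered depth. The function below works on NumPy or PyTorch arrays.

```python
def densification_mask(sil, depth_render, depth_gt,
                       sil_thresh=0.5, depth_margin=0.05):
    """Boolean mask of pixels where new Gaussians should be inserted.

    sil:          (H, W) rendered silhouette / accumulated opacity
    depth_render: (H, W) depth rendered from the current map
    depth_gt:     (H, W) sensor depth for the new frame
    """
    not_covered = sil < sil_thresh                       # map is empty here
    in_front = depth_gt < (depth_render - depth_margin)  # new surface in front of the map
    return not_covered | in_front
```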

How SplaTAM works:

Initialization

For every new frame:

  • Tracking
  • Gaussian Densification

Map Update

SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM


Map Update

  • This step happens every k keyframes

  • Updates/Optimizes the parameters of the 3D Gaussian Map

  • This is the same as fitting Gaussians to views with known camera poses (vanilla Gaussian Splatting) - see the sketch below
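
A matching sketch of the map update, with the roles reversed relative to tracking: the keyframe poses are held fixed and the Gaussian parameters are optimized over a set of selected keyframes (`render` is again a hypothetical differentiable rasterizer).

```python
import torch


def update_map(params, keyframes, render, iters=60, lr=1e-3):
    """Optimize Gaussian parameters against keyframes with fixed poses.

    params:    dict of tensors (positions, colors, radii, opacities),
               each created with requires_grad=True
    keyframes: list of (rgb_gt, depth_gt, quat, trans); poses are frozen
    """
    opt = torch.optim.Adam(params.values(), lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        loss = 0.0
        for rgb_gt, depth_gt, quat, trans in keyframes:
            rgb, depth, _ = render(params, quat, trans)
            loss = loss + (rgb - rgb_gt).abs().sum() + (depth - depth_gt).abs().sum()
        loss.backward()
        opt.step()
    return params
```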

How SplaTAM works:

Initialization

For every new frame:

  • Tracking
  • Gaussian Densification

Map Update

SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM


  • Results

SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM


Limitations/Potential Improvements:

  • Low speed / not real-time (0.5-1 FPS)

  • Needs accurate, near pixel-perfect depth

  • Subpar tracking compared to feature-based/indirect methods

  • No support for relocalization

  • Both memory and compute requirements grow quickly as the number of Gaussians increases

SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM


Why This Paper

Proof of concept that 3D Gaussians can be a useful representation for dense SLAM

SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM


Some Progress:

Achieved a ~1.3x improvement in FPS by replacing PyTorch autodiff with a custom CUDA kernel

Making the implementation more efficient alone won't be enough

We are exploring algorithmic changes to overcome the limitations of SplaTAM

  • One such direction is feature-metric SLAM
  • Another possible direction is training a network to perform feed-forward SplaTAM

SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM


Literature Review

  • SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM
  • AnyLoc: Towards Universal Visual Place Recognition
  • DUSt3R: Geometric 3D Vision Made Easy


AnyLoc: Towards Universal Visual Place Recognition

Humans & Robots alike need to know where they are

for Scene Understanding & Navigation


AnyLoc: Towards Universal Visual Place Recognition


AnyLoc: Towards Universal Visual Place Recognition

Current SOTA methods perform well within their training distribution (urban scenes) but do not generalize to diverse conditions


AnyLoc: Towards Universal Visual Place Recognition

AnyLoc Solution:

· Use Intermediate Features from Self-Supervised ViT


AnyLoc: Towards Universal Visual Place Recognition

AnyLoc Solution:

· Use Intermediate Features from Self-Supervised ViT

· Unsupervised Local Feature Aggregation
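
AnyLoc's aggregation of choice is VLAD over dense ViT patch descriptors, with the vocabulary built by unsupervised k-means over the target domain. A minimal, self-contained VLAD sketch (not AnyLoc's code) follows; random features stand in for DINOv2 patch outputs in the toy usage.

```python
import numpy as np


def vlad(descriptors, centers):
    """VLAD aggregation of local descriptors against a k-means vocabulary.

    descriptors: (N, D) per-patch ViT features for one image
    centers:     (K, D) cluster centers (e.g. from k-means over the domain)
    Returns an L2-normalized (K * D,) global descriptor.
    """
    # Hard-assign each descriptor to its nearest center.
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)

    K, D = centers.shape
    out = np.zeros((K, D))
    for k in range(K):
        members = descriptors[assign == k]
        if len(members):
            # Sum of residuals to the assigned center.
            out[k] = (members - centers[k]).sum(axis=0)

    # Per-cluster (intra) normalization, then global L2 normalization.
    norms = np.maximum(np.linalg.norm(out, axis=1, keepdims=True), 1e-12)
    v = (out / norms).reshape(-1)
    return v / (np.linalg.norm(v) + 1e-12)


# Toy usage with random "patch features" standing in for DINOv2 outputs.
feats = np.random.randn(256, 64)
vocab = np.random.randn(8, 64)
global_desc = vlad(feats, vocab)   # shape (512,)
```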


AnyLoc: Towards Universal Visual Place Recognition

AnyLoc Results:

· Achieves up to 4x higher performance than existing methods across diverse environments


AnyLoc: Towards Universal Visual Place Recognition

Visually Degraded Environment (Hawkins)

500 km Aerial Dataset (VP-Air)


Why This Paper:

  • Insights into utilizing ViT features for robust universal SLAM

Some Progress:

  • Completed distillation from DINO-v2 to Efficient ViT.
  • The distilled model will be evaluated using the AnyLoc benchmark.
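
For context, here is a generic sketch of feature distillation between a frozen teacher and a smaller student ViT; the module interfaces and the MSE objective are assumptions for illustration, not the exact recipe used in our experiments.

```python
import torch
import torch.nn.functional as F


def distill_step(student, teacher, images, optimizer, proj=None):
    """One feature-distillation step: match student patch features to the teacher's.

    student, teacher: modules mapping images -> (B, N_patches, D) features
    proj:             optional linear layer mapping the student dim to the teacher dim
    optimizer:        built over the student's (and proj's) parameters
    """
    with torch.no_grad():
        target = teacher(images)          # frozen teacher features
    pred = student(images)
    if proj is not None:
        pred = proj(pred)                 # align feature dimensions
    loss = F.mse_loss(pred, target)       # a cosine objective is another common choice
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```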

AnyLoc: Towards Universal Visual Place Recognition


Literature Review

  • SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM
  • AnyLoc: Towards Universal Visual Place Recognition
  • DUSt3R: Geometric 3D Vision Made Easy


(Dense and Unconstrained Stereo 3D Reconstruction)

DUSt3R: Geometric 3D Vision Made Easy


From these 2 images alone, can we infer

  • Dense 3D reconstruction
  • Camera Extrinsics
  • Camera Intrinsics
  • Pixel Correspondences

DUSt3R can!!

DUSt3R: Geometric 3D Vision Made Easy


DUSt3R: Geometric 3D Vision Made Easy


Classical Structure-from-Motion pipeline:

  • Multiple separate modules

  • Usually little or no information sharing between modules

  • No priors are utilized

Image source (link)

DUSt3R: Geometric 3D Vision Made Easy


The DUSt3R Pipeline

DUSt3R: Geometric 3D Vision Made Easy


The DUSt3R Pipeline

Each pixel of the input image carries a color (r, g, b); the corresponding entry of the pointmap assigns that pixel a 3D point (x, y, z).
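
Because a pointmap ties every pixel (u, v) to a 3D point (x, y, z), quantities such as the camera intrinsics can be recovered directly from it. Below is a rough sketch assuming a pinhole camera with the principal point at the image center; DUSt3R itself uses a robust Weiszfeld-style estimator rather than this plain least squares.

```python
import numpy as np


def focal_from_pointmap(pointmap):
    """Estimate a single focal length from an (H, W, 3) pointmap.

    Pinhole model with the principal point at the image center:
    u - cx = f * x / z  and  v - cy = f * y / z  for every pixel.
    """
    H, W, _ = pointmap.shape
    u, v = np.meshgrid(np.arange(W) - (W - 1) / 2.0,
                       np.arange(H) - (H - 1) / 2.0)
    x, y, z = pointmap[..., 0], pointmap[..., 1], pointmap[..., 2]
    a = x / z   # predicted horizontal ray slope per pixel
    b = y / z   # predicted vertical ray slope per pixel
    # Closed-form least squares for f minimizing ||f*a - u||^2 + ||f*b - v||^2
    return (u * a + v * b).sum() / (a * a + b * b).sum()
```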

DUSt3R: Geometric 3D Vision Made Easy


The DUSt3R Pipeline

DUSt3R: Geometric 3D Vision Made Easy


The DUSt3R Architecture - Pretraining

CroCo cross-view completion: given a reference image and a masked image from a 2nd viewpoint, the network predicts the masked content.

Source: CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View Completion

DUSt3R: Geometric 3D Vision Made Easy


The DUSt3R Architecture - Details

  • Dataset: Mixture of datasets with 8.5M pairs in total

  • Loss: confidence-weighted L2 loss (see the sketch below)
  • Training cost: 8x A100 GPUs for ~3 days
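
The confidence-weighted loss has roughly the following shape: the per-pixel pointmap regression error is scaled by a predicted confidence, with a log-confidence penalty so the network cannot drive all confidences down. The paper's scale normalization of the pointmaps is omitted in this sketch.

```python
import torch


def confidence_weighted_loss(pred_pts, gt_pts, conf, alpha=0.2, valid=None):
    """DUSt3R-style confidence-weighted regression loss (simplified sketch).

    pred_pts, gt_pts: (B, H, W, 3) predicted / ground-truth pointmaps
    conf:             (B, H, W) predicted per-pixel confidence (>= 1)
    valid:            optional (B, H, W) bool mask of pixels with ground truth
    """
    err = torch.linalg.norm(pred_pts - gt_pts, dim=-1)   # per-pixel L2 error
    loss = conf * err - alpha * torch.log(conf)          # confidence weighting + penalty
    if valid is not None:
        loss = loss[valid]
    return loss.mean()
```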

DUSt3R: Geometric 3D Vision Made Easy


Experiments - Minimal Overlap

DUSt3R: Geometric 3D Vision Made Easy


DUSt3R: Geometric 3D Vision Made Easy


Experiments - Cross View

DUSt3R: Geometric 3D Vision Made Easy


Experiments - Cross View

DUSt3R: Geometric 3D Vision Made Easy


The DUSt3R Architecture - Scaling to >2 images

The architecture can be extended to support multiple images at once by

  • Doing pairwise prediction

  • Bringing all predictions into a common world coordinate frame

  • Forming a minimum spanning tree over the pairs and running first-order optimization (global alignment - see the sketch below)
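
A compressed sketch of the global-alignment idea: per-image world pointmaps, plus a similarity transform per pair, are jointly refined by gradient descent so that each pairwise prediction agrees with the world pointmaps it involves. Initialization along the minimum spanning tree is omitted, and this parameterization is a simplification of the paper's.

```python
import torch


def global_align(pair_preds, image_shapes, iters=300, lr=1e-2):
    """Align pairwise DUSt3R predictions into one world frame (simplified sketch).

    pair_preds:   list of (i, j, X_i, X_j, C_i, C_j) where X_* are (H, W, 3)
                  pointmaps of images i and j expressed in camera i's frame
                  and C_* are their (H, W) confidences.
    image_shapes: list of (H, W) per image.
    """
    world = [torch.randn(H, W, 3, requires_grad=True) for H, W in image_shapes]
    sims = [{"s": torch.zeros(1, requires_grad=True),                    # log-scale
             "q": torch.tensor([1., 0., 0., 0.], requires_grad=True),   # rotation quaternion
             "t": torch.zeros(3, requires_grad=True)} for _ in pair_preds]
    opt = torch.optim.Adam(world + [p for sim in sims for p in sim.values()], lr=lr)

    def rotate(q, pts):
        # Rotate points by a (normalized) quaternion via its rotation matrix.
        w, x, y, z = q / q.norm()
        R = torch.stack([
            torch.stack([1 - 2*(y*y + z*z), 2*(x*y - w*z), 2*(x*z + w*y)]),
            torch.stack([2*(x*y + w*z), 1 - 2*(x*x + z*z), 2*(y*z - w*x)]),
            torch.stack([2*(x*z - w*y), 2*(y*z + w*x), 1 - 2*(x*x + y*y)]),
        ])
        return pts @ R.T

    for _ in range(iters):
        opt.zero_grad()
        loss = 0.0
        for (i, j, X_i, X_j, C_i, C_j), sim in zip(pair_preds, sims):
            s = sim["s"].exp()
            for idx, X, C in ((i, X_i, C_i), (j, X_j, C_j)):
                aligned = s * rotate(sim["q"], X) + sim["t"]
                # Confidence-weighted distance between world and aligned pair prediction.
                loss = loss + (C * torch.linalg.norm(world[idx] - aligned, dim=-1)).sum()
        loss.backward()
        opt.step()
    return world
```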

DUSt3R: Geometric 3D Vision Made Easy


The DUSt3R Architecture - Scaling to >2 images

DUSt3R: Geometric 3D Vision Made Easy


DUSt3R - Failure Cases (Illumination Changes)

source (link)

DUSt3R: Geometric 3D Vision Made Easy


DUSt3R - Failure Cases (Humans)

DUSt3R: Geometric 3D Vision Made Easy


DUSt3R - Failure Cases (Long term Sequential Data)

DUSt3R: Geometric 3D Vision Made Easy


Why This Paper

  • Strong utilization of priors in geometric vision

  • Beats SOTA on two-view registration - a foundational building block for SLAM

  • The architecture can be easily extended to incorporate other deep learning modules (think Semantic DUSt3R, DUSt3R + Gaussians, etc.)

DUSt3R: Geometric 3D Vision Made Easy


SplaTAM:

Proof of concept that 3D Gaussians can be a very useful representation for Dense SLAM

AnyLoc:

Insights into utilizing ViT features for robust universal SLAM

DUSt3R:

An influential way to bake geometry into a deep neural network

Conclusion


Feature-Metric SLAM

  • Use features for tracking, while retaining the Gaussian Splatting backend

Feed-Forward SLAM

  • Extending the DUSt3R framework to sequential data to enable feed-forward localization and mapping

Current Work Directions


Thanks to Mentors and Collaborators

Nikhil Keetha, Jay Karhade, Sebastian Scherer, Sourav Garg, Akash Sharma, Shibo Zhao, Yao He


Thanks For Listening!