
Towards Universal State Estimation and

Reconstruction in the Wild

Team 15 - Bhuvan Jhamb, Chenwei Lyu

Mentors - Nikhil Keetha, Dr. Sebastian Scherer


· Imagine what’s in the future…

What is common in all of these? - SLAM

Motivation

https://medium.com/@recogni/autonomous-vehicles-and-a-system-of-connected-cars-944f86275663

https://www.reuters.com/technology/apple-ramps-up-vision-pro-production-plans-february-launch-bloomberg-news-2023-12-20/

https://www.af.mil/News/Article-Display/Article/2551037/robot-dogs-arrive-at-tyndall-afb/

Autonomous Vehicle

AR/VR

Robotics


Motivation

· SLAM: Simultaneous Localization and Mapping

  • Localization: Determining the robot's position in the environment.
  • Mapping: Creating a map of the environment.


https://www.amazon.science/latest-news/how-zoox-vehicles-find-themselves-in-an-ever-changing-world

· Autonomous Vehicle

  • Cars need real-time reconstruction of the scene and real-time localization

Motivation


Motivation

· Autonomous Vehicle

· AR/VR

  • The headset needs to know its orientation and build a map of its surroundings

https://mobilesyrup.com/2023/06/05/apple-unveils-mixed-reality-headset-wwdc-2023/


· Autonomous Vehicle

· AR/VR

· Robotics

  • Robots need to understand and interact with their environment

Motivation

https://m.kangnamtimes.com/tech/article/73975/


· What is common in all of these?

  • All these applications need precise camera tracking and high-fidelity reconstruction - and often we need to do these together (SLAM)
  • Dense reconstructions significantly enhance interactivity and utility - especially for AR/VR and robotics as humans and robots start to share workspaces

Motivation

https://medium.com/@recogni/autonomous-vehicles-and-a-system-of-connected-cars-944f86275663

https://www.reuters.com/technology/apple-ramps-up-vision-pro-production-plans-february-launch-bloomberg-news-2023-12-20/

https://www.af.mil/News/Article-Display/Article/2551037/robot-dogs-arrive-at-tyndall-afb/


Motivation

· Why Are Existing Methods Insufficient?

  • Classical SLAM: brittle, does not use priors, and requires extensive fine-tuning
  • Implicit (NeRF-based) SLAM: does not translate well to real robotic systems
  • No explicit, dense, learning-based SLAM currently delivers real-time, high-accuracy visual odometry (VO) and dense mapping

Hence this leads to our project:

Towards Universal State Estimation and Reconstruction in the Wild


Literature Review

  1. SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM
  2. AnyLoc: Towards Universal Visual Place Recognition
  3. DUSt3R: Geometric 3D Vision Made Easy


A scene can be represented explicitly as a collection of 3D Gaussians

Image Source: https://rmurai.co.uk/projects/GaussianSplattingSLAM/

Gaussian Splatting: Brief Review


Offline (3D Gaussian Splatting): RGB images + GT poses + initial point cloud → 3D Gaussians representing the scene

Online (SplaTAM): incremental RGB-D frames → camera poses + 3D Gaussians representing the scene

SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM


SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM

Gaussian Splatting meets SLAM


Assumptions

Gaussians are modelled as

  • Isotropic
  • View Independent

With these assumptions, each Gaussian has 8 parameters (compared to 59 in the original 3DGS):

  • View Independent Color - 3 params (rgb)
  • Position - 3 params (xyz)
  • Radius - 1 param (r)
  • Opacity - 1 param (o)
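
As a rough illustration (not SplaTAM's actual code), the per-Gaussian state under these assumptions could be held in a structure like the one below; the field names are hypothetical.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class IsotropicGaussian:
    """One scene Gaussian under SplaTAM's simplifying assumptions (8 params)."""
    position: np.ndarray  # (3,) world-space center (x, y, z)
    color: np.ndarray     # (3,) view-independent RGB
    radius: float         # single isotropic scale
    opacity: float        # in [0, 1]


# In practice the map is stored as dense arrays (one row per Gaussian),
# e.g. an (N, 3) position array, an (N, 3) color array, an (N,) radius
# array and an (N,) opacity array, so rendering and optimization can be
# vectorized on the GPU.
```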

Image source (link)

SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM


How SplaTAM works:

Initialization

For every new frame:

  • Tracking
  • Gaussian Densification

Map Update

SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM


How SplaTAM works:

Initialization

For every new frame:

  • Tracking
  • Gaussian Densification

Map Update

Initialization

Camera Pose:

  • Modelled as identity for the 1st frame

Map Initialization:

For the first frame, for each pixel we add a new Gaussian with the following parameters:

  • Color - same as the pixel
  • Position - obtained by unprojecting the RGB-D values
  • Radius - (depth value) / (focal length)
  • Opacity - 0.5
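
A minimal sketch of this per-pixel initialization, assuming a pinhole camera with focal lengths (fx, fy) and principal point (cx, cy); this is illustrative, not SplaTAM's implementation.

```python
import numpy as np


def init_gaussians_from_rgbd(rgb, depth, fx, fy, cx, cy):
    """Create one isotropic Gaussian per pixel of the first RGB-D frame.

    rgb:   (H, W, 3) float array in [0, 1]
    depth: (H, W) metric depth
    Returns a dict of arrays: positions (N, 3), colors (N, 3), radii (N,), opacities (N,)
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))

    # Unproject every pixel to a 3D point in the camera frame
    # (the camera pose of the first frame is taken as identity).
    z = depth.reshape(-1)
    x = (u.reshape(-1) - cx) * z / fx
    y = (v.reshape(-1) - cy) * z / fy
    positions = np.stack([x, y, z], axis=-1)

    colors = rgb.reshape(-1, 3)        # color = pixel color
    radii = z / fx                     # radius = depth / focal length
    opacities = np.full_like(z, 0.5)   # opacity initialized to 0.5

    return {"positions": positions, "colors": colors,
            "radii": radii, "opacities": opacities}
```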

SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM


Tracking

  • Initialize the pose using a constant-velocity assumption

  • Update the camera pose with gradient descent, keeping the Gaussian parameters fixed, to minimize the rendering loss (see the sketch below)
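
A heavily simplified PyTorch sketch of the tracking step: the pose (here a quaternion and translation) is the only optimized variable, the map is frozen, and `render` stands in for a hypothetical differentiable rasterizer returning color, depth and a silhouette image. The silhouette masking and hyperparameters are illustrative.

```python
import torch


def track_frame(gaussians, rgb_gt, depth_gt, init_quat, init_trans,
                render, iters=40, lr=2e-3):
    """Optimize the camera pose of one frame while keeping the map fixed.

    init_quat / init_trans come from the constant-velocity forward
    projection of the previous poses.
    """
    quat = init_quat.clone().requires_grad_(True)    # rotation, shape (4,)
    trans = init_trans.clone().requires_grad_(True)  # translation, shape (3,)
    opt = torch.optim.Adam([quat, trans], lr=lr)

    for _ in range(iters):
        opt.zero_grad()
        rgb, depth, sil = render(gaussians, quat / quat.norm(), trans)
        mask = (sil > 0.99).float()  # only use well-observed pixels
        loss = (mask.unsqueeze(-1) * (rgb - rgb_gt).abs()).sum() \
             + (mask * (depth - depth_gt).abs()).sum()
        loss.backward()
        opt.step()

    return quat.detach() / quat.detach().norm(), trans.detach()
```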

How SplaTAM works:

Initialization

For every new frame:

  • Tracking
  • Gaussian Densification

Map Update

SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM


Gaussian Densification

  • Identify regions where new Gaussians need to be added by computing a densification mask
  • Insert the Gaussians using the same initialization strategy as for the 1st frame (see the sketch below)
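
One plausible way to compute such a mask (illustrative thresholds, not the exact criteria from the paper): a pixel requests new Gaussians where the current map does not explain it, i.e. where the rendered silhouette is low or the measured depth lies well in front of the rendered depth. The function below works on NumPy or PyTorch arrays.

```python
def densification_mask(sil, depth_render, depth_gt,
                       sil_thresh=0.5, depth_margin=0.05):
    """Boolean mask of pixels where new Gaussians should be inserted.

    sil:          (H, W) rendered silhouette / accumulated opacity
    depth_render: (H, W) depth rendered from the current map
    depth_gt:     (H, W) sensor depth for the new frame
    """
    not_covered = sil < sil_thresh                       # map is empty here
    in_front = depth_gt < (depth_render - depth_margin)  # new surface in front of the map
    return not_covered | in_front
```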

How SplaTAM works:

Initialization

For every new frame:

  • Tracking
  • Gaussian Densification

Map Update

SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM


Map Update

  • This step happens every k keyframes

  • Updates/Optimizes the parameters of the 3D Gaussian Map

  • This is the same as fitting Gaussians to views with known camera poses (vanilla Gaussian Splatting) - see the sketch below
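
A matching sketch of the map update, with the roles reversed relative to tracking: the keyframe poses are held fixed and the Gaussian parameters are optimized over a set of selected keyframes (`render` is again a hypothetical differentiable rasterizer).

```python
import torch


def update_map(params, keyframes, render, iters=60, lr=1e-3):
    """Optimize Gaussian parameters against keyframes with fixed poses.

    params:    dict of tensors (positions, colors, radii, opacities),
               each created with requires_grad=True
    keyframes: list of (rgb_gt, depth_gt, quat, trans); poses are frozen
    """
    opt = torch.optim.Adam(params.values(), lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        loss = 0.0
        for rgb_gt, depth_gt, quat, trans in keyframes:
            rgb, depth, _ = render(params, quat, trans)
            loss = loss + (rgb - rgb_gt).abs().sum() + (depth - depth_gt).abs().sum()
        loss.backward()
        opt.step()
    return params
```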

How SplaTAM works:

Initialization

For every new frame:

  • Tracking
  • Gaussian Densification

Map Update

SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM


  • Results

SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM


Limitations/Potential Improvements:

  • Low speed / not real-time (0.5-1 FPS)

  • Needs accurate, near pixel-perfect depth

  • Subpar tracking compared to feature-based/indirect methods

  • No support for relocalization

  • Both memory and compute requirements grow quickly as the number of Gaussians increases

SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM


Why This Paper

Proof of concept that 3D Gaussians can be a useful representation for dense SLAM

SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM


Some Progress:

Achieved a ~1.3x improvement in FPS by replacing PyTorch autodiff with a custom CUDA kernel

Making the implementation more efficient alone won't be enough

We are exploring algorithmic changes to overcome the limitations of SplaTAM

  • One such direction is feature-metric SLAM
  • Another possible direction is training a network to perform feed-forward SplaTAM

SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM


Literature Review

  • SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM
  • AnyLoc: Towards Universal Visual Place Recognition
  • DUSt3R: Geometric 3D Vision Made Easy


AnyLoc: Towards Universal Visual Place Recognition

Humans & Robots alike need to know where they are

for Scene Understanding & Navigation


AnyLoc: Towards Universal Visual Place Recognition


AnyLoc: Towards Universal Visual Place Recognition

Current SOTA methods perform well within their training distribution (urban scenes) but do not generalize to diverse conditions


AnyLoc: Towards Universal Visual Place Recognition

AnyLoc Solution:

· Use Intermediate Features from Self-Supervised ViT


AnyLoc: Towards Universal Visual Place Recognition

AnyLoc Solution:

· Use Intermediate Features from Self-Supervised ViT

· Unsupervised Local Feature Aggregation
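
AnyLoc's aggregation of choice is VLAD over dense ViT patch descriptors, with the vocabulary built by unsupervised k-means over the target domain. A minimal, self-contained VLAD sketch (not AnyLoc's code) follows; random features stand in for DINOv2 patch outputs in the toy usage.

```python
import numpy as np


def vlad(descriptors, centers):
    """VLAD aggregation of local descriptors against a k-means vocabulary.

    descriptors: (N, D) per-patch ViT features for one image
    centers:     (K, D) cluster centers (e.g. from k-means over the domain)
    Returns an L2-normalized (K * D,) global descriptor.
    """
    # Hard-assign each descriptor to its nearest center.
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)

    K, D = centers.shape
    out = np.zeros((K, D))
    for k in range(K):
        members = descriptors[assign == k]
        if len(members):
            # Sum of residuals to the assigned center.
            out[k] = (members - centers[k]).sum(axis=0)

    # Per-cluster (intra) normalization, then global L2 normalization.
    norms = np.maximum(np.linalg.norm(out, axis=1, keepdims=True), 1e-12)
    v = (out / norms).reshape(-1)
    return v / (np.linalg.norm(v) + 1e-12)


# Toy usage with random "patch features" standing in for DINOv2 outputs.
feats = np.random.randn(256, 64)
vocab = np.random.randn(8, 64)
global_desc = vlad(feats, vocab)   # shape (512,)
```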


AnyLoc: Towards Universal Visual Place Recognition

AnyLoc Results:

· Achieves up to 4x higher performance than existing methods across diverse environments


AnyLoc: Towards Universal Visual Place Recognition

Visually Degraded Environment (Hawkins)

500 km Aerial Dataset (VP-Air)


Why This Paper:

  • Insights into utilizing ViT features for robust universal SLAM

Some Progress:

  • Completed distillation from DINO-v2 to Efficient ViT.
  • The distilled model will be evaluated using the AnyLoc benchmark.
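
For context, here is a generic sketch of feature distillation between a frozen teacher and a smaller student ViT; the module interfaces and the MSE objective are assumptions for illustration, not the exact recipe used in our experiments.

```python
import torch
import torch.nn.functional as F


def distill_step(student, teacher, images, optimizer, proj=None):
    """One feature-distillation step: match student patch features to the teacher's.

    student, teacher: modules mapping images -> (B, N_patches, D) features
    proj:             optional linear layer mapping the student dim to the teacher dim
    optimizer:        built over the student's (and proj's) parameters
    """
    with torch.no_grad():
        target = teacher(images)          # frozen teacher features
    pred = student(images)
    if proj is not None:
        pred = proj(pred)                 # align feature dimensions
    loss = F.mse_loss(pred, target)       # a cosine objective is another common choice
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```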

AnyLoc: Towards Universal Visual Place Recognition


Literature Review

  • SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM
  • AnyLoc: Towards Universal Visual Place Recognition
  • DUSt3R: Geometric 3D Vision Made Easy


(Dense and Unconstrained Stereo 3D Reconstruction)

DUSt3R: Geometric 3D Vision Made Easy


From these 2 images alone, can we infer

  • Dense 3D reconstruction
  • Camera Extrinsics
  • Camera Intrinsics
  • Pixel Correspondences

DUSt3R can!!

DUSt3R: Geometric 3D Vision Made Easy


DUSt3R: Geometric 3D Vision Made Easy


Classical Structure-from-Motion pipeline:

  • Multiple separate modules

  • Usually little or no information sharing between modules

  • No priors are utilized

Image source (link)

DUSt3R: Geometric 3D Vision Made Easy


The DUSt3R Pipeline

DUSt3R: Geometric 3D Vision Made Easy


The DUSt3R Pipeline

Each pixel of the input image carries a color (r, g, b); the corresponding entry of the pointmap assigns that pixel a 3D point (x, y, z).
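
Because a pointmap ties every pixel (u, v) to a 3D point (x, y, z), quantities such as the camera intrinsics can be recovered directly from it. Below is a rough sketch assuming a pinhole camera with the principal point at the image center; DUSt3R itself uses a robust Weiszfeld-style estimator rather than this plain least squares.

```python
import numpy as np


def focal_from_pointmap(pointmap):
    """Estimate a single focal length from an (H, W, 3) pointmap.

    Pinhole model with the principal point at the image center:
    u - cx = f * x / z  and  v - cy = f * y / z  for every pixel.
    """
    H, W, _ = pointmap.shape
    u, v = np.meshgrid(np.arange(W) - (W - 1) / 2.0,
                       np.arange(H) - (H - 1) / 2.0)
    x, y, z = pointmap[..., 0], pointmap[..., 1], pointmap[..., 2]
    a = x / z   # predicted horizontal ray slope per pixel
    b = y / z   # predicted vertical ray slope per pixel
    # Closed-form least squares for f minimizing ||f*a - u||^2 + ||f*b - v||^2
    return (u * a + v * b).sum() / (a * a + b * b).sum()
```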

DUSt3R: Geometric 3D Vision Made Easy


The DUSt3R Pipeline

DUSt3R: Geometric 3D Vision Made Easy


The DUSt3R Architecture - Pretraining

CroCo cross-view completion: given a reference image and a masked image from a 2nd viewpoint, the network predicts the masked content.

Source: CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View Completion

DUSt3R: Geometric 3D Vision Made Easy


The DUSt3R Architecture - Details

  • Dataset: Mixture of datasets with 8.5M pairs in total

  • Loss: confidence-weighted L2 loss (see the sketch below)
  • Training cost: 8x A100 GPUs for ~3 days
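
The confidence-weighted loss has roughly the following shape: the per-pixel pointmap regression error is scaled by a predicted confidence, with a log-confidence penalty so the network cannot drive all confidences down. The paper's scale normalization of the pointmaps is omitted in this sketch.

```python
import torch


def confidence_weighted_loss(pred_pts, gt_pts, conf, alpha=0.2, valid=None):
    """DUSt3R-style confidence-weighted regression loss (simplified sketch).

    pred_pts, gt_pts: (B, H, W, 3) predicted / ground-truth pointmaps
    conf:             (B, H, W) predicted per-pixel confidence (>= 1)
    valid:            optional (B, H, W) bool mask of pixels with ground truth
    """
    err = torch.linalg.norm(pred_pts - gt_pts, dim=-1)   # per-pixel L2 error
    loss = conf * err - alpha * torch.log(conf)          # confidence weighting + penalty
    if valid is not None:
        loss = loss[valid]
    return loss.mean()
```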

DUSt3R: Geometric 3D Vision Made Easy


Experiments - Minimal Overlap

DUSt3R: Geometric 3D Vision Made Easy


DUSt3R: Geometric 3D Vision Made Easy


Experiments - Cross View

DUSt3R: Geometric 3D Vision Made Easy


Experiments - Cross View

DUSt3R: Geometric 3D Vision Made Easy


The DUSt3R Architecture - Scaling to >2 images

The architecture can be extended to support multiple images at once by

  • Doing pairwise prediction

  • Bringing all predictions into a common world coordinate frame

  • Forming a minimum spanning tree over the pairs and running first-order optimization (global alignment - see the sketch below)
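
A compressed sketch of the global-alignment idea: per-image world pointmaps, plus a similarity transform per pair, are jointly refined by gradient descent so that each pairwise prediction agrees with the world pointmaps it involves. Initialization along the minimum spanning tree is omitted, and this parameterization is a simplification of the paper's.

```python
import torch


def global_align(pair_preds, image_shapes, iters=300, lr=1e-2):
    """Align pairwise DUSt3R predictions into one world frame (simplified sketch).

    pair_preds:   list of (i, j, X_i, X_j, C_i, C_j) where X_* are (H, W, 3)
                  pointmaps of images i and j expressed in camera i's frame
                  and C_* are their (H, W) confidences.
    image_shapes: list of (H, W) per image.
    """
    world = [torch.randn(H, W, 3, requires_grad=True) for H, W in image_shapes]
    sims = [{"s": torch.zeros(1, requires_grad=True),                    # log-scale
             "q": torch.tensor([1., 0., 0., 0.], requires_grad=True),   # rotation quaternion
             "t": torch.zeros(3, requires_grad=True)} for _ in pair_preds]
    opt = torch.optim.Adam(world + [p for sim in sims for p in sim.values()], lr=lr)

    def rotate(q, pts):
        # Rotate points by a (normalized) quaternion via its rotation matrix.
        w, x, y, z = q / q.norm()
        R = torch.stack([
            torch.stack([1 - 2*(y*y + z*z), 2*(x*y - w*z), 2*(x*z + w*y)]),
            torch.stack([2*(x*y + w*z), 1 - 2*(x*x + z*z), 2*(y*z - w*x)]),
            torch.stack([2*(x*z - w*y), 2*(y*z + w*x), 1 - 2*(x*x + y*y)]),
        ])
        return pts @ R.T

    for _ in range(iters):
        opt.zero_grad()
        loss = 0.0
        for (i, j, X_i, X_j, C_i, C_j), sim in zip(pair_preds, sims):
            s = sim["s"].exp()
            for idx, X, C in ((i, X_i, C_i), (j, X_j, C_j)):
                aligned = s * rotate(sim["q"], X) + sim["t"]
                # Confidence-weighted distance between world and aligned pair prediction.
                loss = loss + (C * torch.linalg.norm(world[idx] - aligned, dim=-1)).sum()
        loss.backward()
        opt.step()
    return world
```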

DUSt3R: Geometric 3D Vision Made Easy


The DUSt3R Architecture - Scaling to >2 images

DUSt3R: Geometric 3D Vision Made Easy


DUSt3R - Failure Cases (Illumination Changes)

source (link)

DUSt3R: Geometric 3D Vision Made Easy


DUSt3R - Failure Cases (Humans)

DUSt3R: Geometric 3D Vision Made Easy


DUSt3R - Failure Cases (Long term Sequential Data)

DUSt3R: Geometric 3D Vision Made Easy


Why This Paper

  • Strong utilization of priors in geometric vision

  • Beats SOTA on two-view registration - a foundational building block for SLAM

  • The architecture can be easily extended to incorporate other deep learning modules (think Semantic DUSt3R, DUSt3R + Gaussians, etc.)

DUSt3R: Geometric 3D Vision Made Easy


SplaTAM:

Proof of concept that 3D Gaussians can be a very useful representation for Dense SLAM

AnyLoc:

Insights into utilizing ViT features for robust universal SLAM

DUSt3R:

An influential way to bake geometry into a deep neural network

Conclusion


Feature-Metric SLAM

  • Use features for tracking, while retaining the Gaussian Splatting backend

Feed-Forward SLAM

  • Extending the DUSt3R framework to sequential data to enable feed-forward localization and mapping

Current Work Directions


Thanks to Mentors and Collaborators

Nikhil Keetha, Jay Karhade, Sebastian Scherer, Sourav Garg, Akash Sharma, Shibo Zhao, Yao He


Thanks For Listening!