1 of 36

Towards Universal State Estimation and Reconstruction in the Wild

Team 15 - Bhuvan Jhamb, Chenwei Lyu

Mentors - Nikhil Keetha, Dr. Sebastian Scherer

2 of 36

Motivation

· Imagine what’s in the future…

Autonomous Vehicle
https://medium.com/@recogni/autonomous-vehicles-and-a-system-of-connected-cars-944f86275663

AR/VR
https://www.reuters.com/technology/apple-ramps-up-vision-pro-production-plans-february-launch-bloomberg-news-2023-12-20/

Robotics
https://www.af.mil/News/Article-Display/Article/2551037/robot-dogs-arrive-at-tyndall-afb/

What do all of these have in common? SLAM

3 of 36

Motivation

· SLAM: Simultaneous Localization and Mapping

Localization: Determining the robot's position in the environment.
Mapping: Creating a map of the environment.

4 of 36

Motivation

· Why Are Existing Methods Insufficient?

Classical SLAM: brittle, does not use priors, requires extensive fine-tuning
Implicit SLAM: not practical for robotic systems
No explicit, dense, learning-based SLAM exists that delivers real-time, high-accuracy VO and dense mapping

Hence this leads to our project:

Towards Universal State Estimation and Reconstruction in the Wild

5 of 36

Classical Structure From Motion Pipeline:

Multiple modules

Usually no or minimal information sharing

Priors are not well utilized

Image source (link)

Recap:

6 of 36

The DUSt3R Pipeline

DUSt3R: Geometric 3D Vision Made Easy

Recap:

(Dense and Unconstrained Stereo 3D Reconstruction)

7 of 36

The DUSt3R Pipeline

DUSt3R: Geometric 3D Vision Made Easy

Recap:

8 of 36

DUSt3R: Geometric 3D Vision Made Easy

Experiments - Minimal Overlap

Recap:

9 of 36

Experiments - Cross View

DUSt3R: Geometric 3D Vision Made Easy

Recap:

10 of 36

Generalizable and Robust SLAM for “In the Wild” Settings:

Is DUSt3R the ideal way to exploit priors in pose estimation and mapping

- in ill-posed settings (very sparse views, pure rotation, planar data, etc.)

- or in challenging scenarios (extreme lighting variations, etc.)?

Directions Explored

Recap:

How to extend the DUSt3R framework to sequential data?

11 of 36

Scaling to >2 images

DUSt3R: Geometric 3D Vision Made Easy

Failure Cases (Long term Sequential Data)

12 of 36

Other works extending DUSt3R to SLAM settings

Inherent limitations of DUSt3R…

13 of 36

DUSt3R fails on Out Of Distribution Data

DUSt3R: Geometric 3D Vision Made Easy

(Humans)

source (link)

14 of 36

Limitations of DUSt3R:

Not generalizable enough for precise large-scale reconstruction in in-the-wild settings (yet)

Is this just a scale issue?

15 of 36

New models came out over the summer!

DUSt3R

Geometrical Matching

Feature-based Matching

MASt3R

16 of 36

Precise correspondences enable precise reconstruction!

Given precise correspondences, reconstruction is a geometrically grounded task

Instead of teaching the network to reason about geometry, we focus on learning precise correspondences, which can be utilized for reconstruction
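
To illustrate why reconstruction becomes a purely geometric step once correspondences are precise, here is a minimal sketch of linear (DLT) triangulation for one matched pixel pair. The camera matrices, the 3D point, and the helper names `triangulate_dlt` and `project` are made up for this example and are not taken from the project.

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Linear (DLT) triangulation of a single correspondence.

    P1, P2: 3x4 projection matrices of the two cameras.
    x1, x2: matched pixel coordinates (u, v) in image 1 and image 2.
    """
    # Each view contributes two linear constraints on the homogeneous point X.
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point with a 3x4 projection matrix."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Hypothetical two-view setup: shared intrinsics, second camera offset along x.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])

X_true = np.array([0.2, -0.1, 4.0])
x1, x2 = project(P1, X_true), project(P2, X_true)
X_est = triangulate_dlt(P1, P2, x1, x2)
```

With noise-free matches the point is recovered up to numerical precision; any error in the matches propagates directly into the 3D estimate, which is the motivation for focusing on correspondence quality.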

17 of 36

Precise correspondences enable precise reconstruction!

Tartan VO

Flowmap

Classical SfM Pipeline

18 of 36

Image Matching

Sparse Matching

Dense Matching

Generalised Matching

e.g. GMFlow

e.g. SAM Meets Point Tracking

19 of 36

Our solution: Match Anything

One architecture for both dense and sparse matching

Match Anything

Simple architecture trained with large scale data

20 of 36

Architecture

Geometrical Matching

Feature-based Matching

MASt3R

Match Anything

1. Simplify the architecture to be more generalizable

Include more datasets for training

21 of 36

Architecture

2. Only predict flow in the covisible region

This makes matching a 2D problem, as the network does not need to reason about occlusions or geometry!
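
One way to restrict flow supervision to the covisible region is a masked end-point error. The slides do not state the actual training loss, so the sketch below is an assumption for illustration only; `masked_epe` and its arguments are hypothetical names.

```python
import numpy as np

def masked_epe(flow_pred, flow_gt, covis_mask):
    """Mean end-point error computed over covisible pixels only.

    flow_pred, flow_gt: (H, W, 2) flow fields in pixels.
    covis_mask: (H, W) boolean mask, True where the pixel is covisible.
    """
    # Per-pixel Euclidean distance between predicted and ground-truth flow.
    err = np.linalg.norm(flow_pred - flow_gt, axis=-1)
    n = covis_mask.sum()
    if n == 0:
        return 0.0  # no covisible pixels, nothing to supervise
    # Pixels outside the mask contribute nothing to the average.
    return float((err * covis_mask).sum() / n)
```

Errors in occluded or out-of-view pixels are ignored, so the network is never penalized for regions where no valid 2D match exists.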

22 of 36

Co-visible Mask Generation

Project pixels from camera 1 into 3D space and then back to camera 2

FoV mask: if the coordinates of the projected pixel are within the image boundaries, it is in the FoV

Occlusion mask: if the depth of the projected pixel is close to the real depth of that pixel, it is visible
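
The mask generation above can be sketched in NumPy as follows. This is a minimal illustration, not the project's implementation: it assumes both cameras share the intrinsics `K`, that depth value 0 marks invalid pixels, that the "real depth" is read from camera 2's depth map at the rounded projected location, and that `rel_tol` is a hypothetical relative depth tolerance.

```python
import numpy as np

def covisible_masks(depth1, depth2, K, T_2_from_1, rel_tol=0.05):
    """Return (fov_mask, visible_mask) for the pixels of camera 1.

    depth1, depth2: (H, W) depth maps; 0 marks an invalid pixel (assumption).
    K: 3x3 intrinsics shared by both cameras (assumption).
    T_2_from_1: 4x4 rigid transform from camera-1 to camera-2 coordinates.
    """
    H, W = depth1.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Unproject camera-1 pixels into 3D, then move them into camera 2's frame.
    pts1 = (np.linalg.inv(K) @ pix.T) * depth1.reshape(1, -1)
    pts2 = T_2_from_1[:3, :3] @ pts1 + T_2_from_1[:3, 3:4]
    z2 = pts2[2]
    valid = (depth1.reshape(-1) > 0) & (z2 > 1e-6)

    # Project into camera 2 and round to the nearest pixel.
    proj = K @ pts2
    z_safe = np.where(valid, z2, 1.0)
    ui = np.round(proj[0] / z_safe).astype(int)
    vi = np.round(proj[1] / z_safe).astype(int)

    # FoV mask: the projected pixel lands inside the image.
    fov = valid & (ui >= 0) & (ui < W) & (vi >= 0) & (vi < H)

    # Occlusion mask: projected depth agrees with camera 2's observed depth.
    d2 = np.zeros_like(z2)
    d2[fov] = depth2[vi[fov], ui[fov]]
    visible = fov & (d2 > 0) & (np.abs(z2 - d2) <= rel_tol * d2)
    return fov.reshape(H, W), visible.reshape(H, W)
```

The covisible region is then the `visible` mask; a point that projects inside the image but lies behind a closer surface in camera 2 fails the depth check and is treated as occluded.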

23 of 36

Sampling Method

· To get image pairs, we need to design a sampling method.

1. Accumulate Point Cloud:

From posed depth images, unproject into a point cloud and accumulate.

2. Voxelize:

Voxel-downsample the point clouds and camera positions to create an enumerable scene representation.

3. Calculate Covisibility:

For every camera position and every voxel, determine whether the camera can see the voxel. Save the results to a list.
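
Steps 2 and 3 can be sketched as below, under stated assumptions. The input is one world-frame point array per camera (the output of step 1), and a camera is taken to "see" a voxel if at least one of its own depth points falls inside it; the slides do not specify the visibility test, so that rule, and the names `voxel_keys` and `build_covisibility`, are hypothetical.

```python
import numpy as np

def voxel_keys(points, voxel_size):
    """Map (N, 3) world-frame points to integer voxel indices."""
    return np.floor(points / voxel_size).astype(np.int64)

def build_covisibility(per_camera_points, voxel_size):
    """Voxelize the scene and record which voxels each camera observes.

    per_camera_points: list of (N_i, 3) arrays, one per camera, already
        unprojected from posed depth images into the world frame.
    Returns (voxels, visible): voxels is a sorted list of occupied voxel
    index tuples; visible[c] is the set of voxel ids seen by camera c.
    """
    # Assumption: a camera sees a voxel if one of its own points lies in it.
    cam_voxels = [
        set(map(tuple, voxel_keys(p, voxel_size).tolist()))
        for p in per_camera_points
    ]
    # The union over all cameras is the enumerable scene representation.
    voxels = sorted(set().union(*cam_voxels))
    voxel_id = {v: i for i, v in enumerate(voxels)}
    visible = [{voxel_id[v] for v in vs} for vs in cam_voxels]
    return voxels, visible
```

Storing visibility as sets of voxel ids keeps the later pair scoring cheap, since the overlap between two cameras is a set intersection.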

24 of 36

Sampling Method

4. Generate Samples:

Randomly select a base camera and a target voxel. Keep only the candidate camera positions that form the required angle with the base camera when looking at the target voxel.

Score all candidates by visibility, using the precomputed covisibility list. Keep N candidates and add them to the pair list.

Reference ray
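
A minimal sketch of the pair-generation step follows. Several details are assumptions made for illustration: the target voxel is drawn from the voxels the base camera sees, the angle is measured between the two rays from the target voxel to each camera, and the score is the number of voxels both cameras see. The function name `sample_pairs` and its parameters are hypothetical.

```python
import numpy as np

def sample_pairs(cam_positions, voxel_centers, visible, angle_range_deg, n_keep, rng):
    """Pick a base camera and target voxel, return up to n_keep image pairs.

    cam_positions: (C, 3) camera centers; voxel_centers: (V, 3) voxel centers.
    visible: list of sets, visible[c] = voxel ids seen by camera c.
    angle_range_deg: (lo, hi) allowed angle at the target voxel, in degrees.
    """
    base = int(rng.integers(len(cam_positions)))
    if not visible[base]:
        return []
    # Assumption: the target voxel is one the base camera actually sees.
    target = int(rng.choice(sorted(visible[base])))

    # Reference ray: from the target voxel towards the base camera.
    ref_ray = cam_positions[base] - voxel_centers[target]
    ref_ray = ref_ray / np.linalg.norm(ref_ray)

    lo, hi = angle_range_deg
    scored = []
    for c in range(len(cam_positions)):
        if c == base or target not in visible[c]:
            continue
        ray = cam_positions[c] - voxel_centers[target]
        ray = ray / np.linalg.norm(ray)
        angle = np.degrees(np.arccos(np.clip(ref_ray @ ray, -1.0, 1.0)))
        if lo <= angle <= hi:
            # Assumed score: how many voxels both cameras see.
            scored.append((len(visible[base] & visible[c]), c))

    scored.sort(reverse=True)
    return [(base, c) for _, c in scored[:n_keep]]
```

Calling this repeatedly with different angle ranges would yield pairs covering a spread of viewpoint changes, from near-identical views to wide baselines.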

25 of 36

Datasets

26 of 36

Results

Season

Day-Night

27 of 36

Results

Scale

Perspective

28 of 36

WxBS result

Tested: 90

58

0.69

29 of 36

Profiling Match Anything (On AGX Orin)

Batch size	Forward-pass time (ms)	Time per image (ms)
1	92.12	92.12
2	103.54	51.72
4	186.96	46.74
8	349.27	43.65

~20 FPS Performance
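
The relation between forward-pass time, per-image time, and frame rate can be checked with a few lines of Python. The forward-pass times are copied from the table above; the helper names are hypothetical.

```python
# Forward-pass time in milliseconds per batch size, copied from the table.
forward_ms = {1: 92.12, 2: 103.54, 4: 186.96, 8: 349.27}

def per_image_ms(batch_size, total_ms):
    """Amortized time per image for one batched forward pass."""
    return total_ms / batch_size

def fps(batch_size, total_ms):
    """Images processed per second at the given batch size."""
    return 1000.0 * batch_size / total_ms

for batch_size, total_ms in forward_ms.items():
    print(f"batch {batch_size}: "
          f"{per_image_ms(batch_size, total_ms):.2f} ms/image, "
          f"{fps(batch_size, total_ms):.1f} FPS")
```

Throughput improves with batch size because the fixed per-call overhead is shared across more images, at the cost of higher latency per batch.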

30 of 36

Limitations

Extreme Dark

Comprehensive Change

31 of 36

Utilizing Match Anything for reconstruction

End-to-end approaches: robust but not generalizable enough

Very modular approaches: generalizable but not robust enough

Somewhere in between fully modular and fully E2E

32 of 36

Utilizing Match Anything for reconstruction

Tartan VO

Flowmap

MAC VO

33 of 36

Utilizing Match Anything for reconstruction

Replace with Match Anything

34 of 36

Ongoing Work

Current model training is in progress
Start another training run with all 10 datasets

MAC VO Integration

Comprehensive benchmarking and profiling

Source - ChatGPT

35 of 36

Thanks to Mentors and Collaborators

Nikhil Keetha

Jay Karhade

Sebastian Scherer

Yuchen Zhang

Yutian Chen

Yuheng Qiu

36 of 36

Thanks For Listening!