1 of 67

NICE-SLAM: Neural Implicit Scalable Encoding for SLAM

CVPR 2022

Zihan Zhu, Songyou Peng, Viktor Larsson, Weiwei Xu, Hujun Bao, Zhaopeng Cui, Martin R. Oswald, Marc Pollefeys

2 of 67

Introduction

  • SLAM?

Simultaneous Localization and Mapping

Robot pose

Map points

(in real time)

3 of 67

Introduction

RADAR

event camera

Ultrasound

LiDAR

Wheel encoder

infrared camera

GNSS

monocular camera

RGB-D camera

Photometric sensors

  • Sensors for SLAM

4 of 67

Introduction

event camera

infrared camera

monocular camera

RGB-D camera

Localization

Mapping

  • Visual SLAM

Photometric sensors

5 of 67

Introduction

  • Output

NICE-SLAM demo video

6 of 67

Method

  • iMAP: Implicit mapping and positioning in real-time[1]

[1] Sucar, Edgar, et al. "iMAP: Implicit mapping and positioning in real-time." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.

Map information

Mesh, Point clouds, Voxel...

Storage

7 of 67

Method

  • iMAP: Implicit mapping and positioning in real-time[1]


Map information


Storage

NeRF[2]

[2] Mildenhall, Ben, et al. "Nerf: Representing scenes as neural radiance fields for view synthesis." European conference on computer vision. Springer, Cham, 2020.

8 of 67

Method

  • Concept of iMAP(Mapping)


NeRF[2]

RGB-D camera

9 of 67

Method

  • Concept of iMAP(Mapping)


[Figure] iMAP mapping: a ray is cast from the camera origin along the viewing direction; sampled positions are fed to an MLP that outputs color and density. Occupancy is computed from density and the inter-sample distance, transmittance from the accumulated occupancy, and the transmittance-weighted sums give the rendered color and depth.
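The rendering in this figure can be sketched as a minimal NumPy volume renderer in the NeRF/iMAP style, assuming samples have already been drawn along the ray (variable names are mine, not the paper's):

```python
import numpy as np

def render_ray(sigma, color, t_vals):
    """Volume-render color and depth along one ray, iMAP/NeRF style.

    sigma:  (N,) density at each sample
    color:  (N, 3) RGB at each sample
    t_vals: (N,) sorted sample depths along the ray
    """
    delta = np.diff(t_vals)
    delta = np.append(delta, delta[-1])            # inter-sample distances
    occ = 1.0 - np.exp(-sigma * delta)             # occupancy o_i = 1 - exp(-sigma_i * delta_i)
    trans = np.concatenate(([1.0], np.cumprod(1.0 - occ)[:-1]))  # transmittance: prod over j < i of (1 - o_j)
    w = occ * trans                                # ray-termination weights
    rgb = (w[:, None] * color).sum(axis=0)         # rendered color
    depth = (w * t_vals).sum()                     # rendered depth
    return rgb, depth
```

A near-opaque sample dominates the weights, so the rendered depth snaps to that sample's distance along the ray.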

10 of 67

Method

  • Concept of iMAP(Mapping)


[Figure] iMAP pipeline: a NeRF-style MLP maps position to color and density; the rendered color and depth are supervised by the RGB-D camera.

11 of 67

Method

  • Concept of iMAP(Mapping)


[Figure] Mapping: the same pipeline; the color and depth losses from the RGB-D input update the MLP, i.e. the map.

12 of 67

Method

  • NeRF[2]


The scene must be bounded!

13 of 67

Method

  • Concept of iMAP(Tracking)


[Figure] The same iMAP pipeline (RGB-D input, MLP, rendered color and depth), now used for tracking.

14 of 67

Method

  • Concept of iMAP(Tracking)


[Figure] Tracking: the camera pose is updated at 10 Hz against the rendered color and depth.

15 of 67

Method

  • Concept of iMAP(Tracking)


[Figure] Tracking: the color and depth losses update the camera pose rather than the network.

16 of 67

Method

  • iMAP: Implicit mapping and positioning in real-time[1]


17 of 67

Method

  • NICE SLAM

NICE SLAM

  • Handles larger scenes than iMAP
  • Better reconstruction than iMAP
  • A large network still cannot be used

NICE SLAM

position

color

iMAP

position

MLP

color

density

  • Cannot handle large scenes
  • Limited sample count → degraded reconstruction
  • Cannot use a large network

iMAP

Occupancy

18 of 67

Method

  • NICE SLAM
  • A pretrained decoder is used to estimate the occupancy of each sample point
  • The decoder input is tri-linearly interpolated from the feature-grid cells adjacent to the sample point

NICE SLAM

position

color

Occupancy

Every feature in the feature grid is a 32-dimensional learnable parameter.
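The tri-linear interpolation step could look like the sketch below (illustrative only; the real system interpolates 32-dimensional features at several grid resolutions, and `trilerp` is my name, not the paper's):

```python
import numpy as np

def trilerp(grid, p):
    """Tri-linearly interpolate a feature grid at a continuous point.

    grid: (X, Y, Z, C) array of learnable features (C = 32 in NICE-SLAM)
    p:    (3,) query position in voxel coordinates, strictly inside the grid
    """
    lo = np.floor(p).astype(int)      # lower corner of the enclosing cell
    frac = p - lo                     # fractional offset inside the cell
    out = np.zeros(grid.shape[-1])
    for dx in (0, 1):                 # blend the 8 corner features
        for dy in (0, 1):
            for dz in (0, 1):
                wx = frac[0] if dx else 1.0 - frac[0]
                wy = frac[1] if dy else 1.0 - frac[1]
                wz = frac[2] if dz else 1.0 - frac[2]
                out += wx * wy * wz * grid[lo[0] + dx, lo[1] + dy, lo[2] + dz]
    return out
```

Because the weights are linear in the query position, gradients flow back to the eight surrounding grid features, which is what makes the grid learnable.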

19 of 67

Method

  • NICE SLAM
  • Uses a decoder pretrained as in Convolutional Occupancy Networks[1]
  • The CNN provides translation equivariance
  • Using a U-Net as the CNN enables local-global integration

[1] Peng, Songyou, et al. "Convolutional occupancy networks." European Conference on Computer Vision. Springer, Cham, 2020.

20 of 67

Method

  • NICE SLAM

NICE SLAM

position

color

Occupancy

  • A color decoder is used to estimate the color of each sample point
  • The decoder input is tri-linearly interpolated from the feature-grid cells adjacent to the sample point

Every feature in the feature grid is a 32-dimensional learnable parameter.

Color Decoder

Color

21 of 67

Method

  • NICE SLAM

22 of 67

Method

  • NICE SLAM

Occupancy

Color

23 of 67

Method

  • NICE SLAM

Occupancy

Color

24 of 67

Method

  • NICE SLAM

25 of 67

Method

  • The goal is high-quality mapping
  • Occupancy is predicted in a coarse-to-fine manner
  • Detailed geometric information is learned
  • The pretrained decoders are not updated

0.16m

0.32m

mid

fine

Mid-level feature

High-level feature

26 of 67

Method

  • NICE SLAM

27 of 67

Method

  • The main purpose is mapping that supports tracking; extrapolated geometry helps the tracker
  • Even when only part of a region is observed, the rest is predicted
  • High-level (coarse) geometric information is captured
  • Optimized independently of the mid/fine features

Even from a partial view, the rest can be filled in!

2m

28 of 67

Method

  • NICE SLAM

29 of 67

Method

0.16m

fine

Fully Connected

Color

Not pretrained!

  • Used to improve tracking performance
  • A learnable decoder is used

30 of 67

Method

  • NICE SLAM

31 of 67

Method

  • Sample uniformly along the ray (but only within the scene bound)
  • Since depth is observed, also sample uniformly near that depth

Ray

Ray
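The sampling strategy above might be sketched as follows (the sample counts and the near-depth band width here are illustrative placeholders, not the paper's values):

```python
import numpy as np

def sample_along_ray(near, far, depth, n_strat=32, n_surf=16, band=0.1, rng=None):
    """Draw depths along a ray: stratified-uniform samples inside the scene
    bound plus extra samples concentrated near the observed sensor depth."""
    if rng is None:
        rng = np.random.default_rng(0)
    # stratified: one uniform sample per bin between near and far
    edges = np.linspace(near, far, n_strat + 1)
    strat = edges[:-1] + rng.random(n_strat) * np.diff(edges)
    # surface-guided: uniform samples in a narrow band around the sensor depth
    surf = depth + (rng.random(n_surf) * 2.0 - 1.0) * band
    return np.sort(np.concatenate([strat, surf]))
```

Concentrating samples near the measured depth is what lets an RGB-D pipeline get away with far fewer samples per ray than an RGB-only NeRF.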

32 of 67

Method

  • Important incoming frames are registered as keyframes (there are multiple keyframes)
  • Only M (= 200) pixels (rays) in total are sampled from the current frame plus the keyframes

Ray

Ray

keyframe

current frame

33 of 67

Method

  • Important incoming frames are set as keyframes (there are multiple keyframes)
  • Keyframes prevent forgetting

34 of 67


1. To select keyframes, compute the proportion P of pixels whose depth error is smaller than a threshold.

2. If P is under a threshold, this frame is a keyframe and is added to the keyframe set.

Continual Neural Mapping: Learning An Implicit Scene Representation from Sequential Observations

F1

F2

F0

F3

F4

F5

F6

Method
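The two-step keyframe test above could be sketched like this (the error and proportion thresholds are placeholders, not values from the paper):

```python
import numpy as np

def is_keyframe(d_obs, d_rendered, err_thresh=0.1, p_thresh=0.65):
    """Step 1: P = proportion of pixels whose depth error is under err_thresh.
    Step 2: the frame becomes a keyframe when P falls below p_thresh,
    i.e. when the current map explains too little of the frame."""
    valid = d_obs > 0                                  # skip invalid depth readings
    err = np.abs(d_obs[valid] - d_rendered[valid])
    P = float((err < err_thresh).mean())
    return P < p_thresh
```

The intuition: a frame that the current map already renders well adds little information, so only poorly explained frames enter the keyframe set.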

38 of 67

Method

  • Rendering

[Figure] Rendering: points sampled along a ray (camera origin, viewing direction) are fed to the MLP, which outputs color and density; occupancy and the inter-sample distance give the weights for the rendered color and depth.

  • Each sampled point is passed through the networks (plus feature grids) to obtain its color and occupancy

39 of 67

Method

  • Rendering

Ray

Camera origin:

Ray

Viewing direction:

Coarse-level Occupancy:

  • Each sampled point is passed through the networks (plus feature grids) to obtain its color and occupancy

NICE SLAM

position

color

Occupancy

fine-level Occupancy:

40 of 67

Method

  • Rendering

Ray

Camera origin:

Ray

Viewing direction:

Coarse-level Occupancy:

  • Each sampled point is passed through the networks (plus feature grids) to obtain its color and occupancy

NICE SLAM

position

color

Occupancy

fine-level Occupancy:

Transformation

41 of 67

Method

  • NICE SLAM

42 of 67

Method

  • Loss

Geometric loss

Photometric loss

GT

Rendered image

Normalize
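The two loss terms on this slide might be combined as below (the L1 forms and the weighting factor `lam` are illustrative assumptions, not the paper's exact hyperparameters):

```python
import numpy as np

def mapping_loss(d_gt, d_pred, c_gt, c_pred, lam=0.2):
    """Geometric (depth) L1 loss plus a weighted photometric (color) L1 loss,
    averaged over the M sampled pixels.

    d_gt, d_pred: (M,) ground-truth and rendered depths
    c_gt, c_pred: (M, 3) ground-truth and rendered colors
    """
    geometric = np.abs(d_gt - d_pred).mean()
    photometric = np.abs(c_gt - c_pred).mean()
    return geometric + lam * photometric
```

Down-weighting the photometric term keeps the depth supervision, which is metrically accurate, in charge of the geometry.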

43 of 67

Method

  • Loss

Geometric loss

Photometric loss

GT

Rendered image

Normalize

Coarse-level Occupancy:

fine-level Occupancy:

44 of 67

Method

  • Loss

Geometric loss

Photometric loss

GT

Rendered image

Normalize

45 of 67

Method

  • Occupancy is predicted in a coarse-to-fine manner

  • Detailed geometric information is learned

0.16m

0.32m

mid

fine

46 of 67

Method

mid/fine-level

position

Occupancy

First stage

Second stage

Geometric loss

Coarse-level Occupancy:

fine-level Occupancy:

  • The fine geometric loss is applied at each of the two stages

Mid and fine feature grids are optimized

47 of 67

Method

  • The fine geometric loss is applied at each of the two stages

mid/fine-level

position

Occupancy

First stage

Second stage

Geometric loss

Coarse level

position

Occupancy

Coarse-level Occupancy:

fine-level Occupancy:

The coarse feature grid is optimized

Mid and fine feature grids are optimized

48 of 67

Method

  • Both the feature grids and the color decoder are optimized
  • Photometric information helps tracking

Photometric loss

Coarse-level Occupancy:

fine-level Occupancy:

Fully Connected

Color

GT

Rendered image

49 of 67

Method

Coarse-level Occupancy:

fine-level Occupancy:

Loss

Rendering image

Get sample

Fed to network

NICE SLAM

position

color

Occupancy

Camera origin:

Viewing direction:

Transformation:

50 of 67

Method

Loss

  • Local Bundle adjustment

: rotations and translations of the keyframes

: weighting factor

51 of 67

Method

  • NICE SLAM

52 of 67

Method

  • The pose of the current frame is estimated
  • The first frame is set as the global frame
  • Both photometric and geometric losses are used

53 of 67

Method

  • Depth can change very sensitively with camera motion; to keep the loss from over-focusing on such regions, a modified loss is used
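One way to realize this modification: down-weight each pixel's depth residual by the variance of depth along its ray, so uncertain regions (e.g. object edges under fast motion) contribute less. A sketch of the idea, not the exact formulation:

```python
import numpy as np

def tracking_depth_loss(d_gt, d_pred, d_var, eps=1e-10):
    """Depth residuals divided by the per-pixel depth standard deviation
    along the ray; high-variance (uncertain) pixels contribute less.

    d_gt, d_pred: (M,) observed and rendered depths
    d_var:        (M,) depth variance along each ray
    """
    return (np.abs(d_gt - d_pred) / np.sqrt(d_var + eps)).mean()
```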

54 of 67

Method

  • NICE SLAM

55 of 67

Method

  • NICE SLAM

Mid- & fine-level geometric and color mapping thread (Mapping)

Coarse-level geometric mapping Thread

Camera tracking Thread

56 of 67

Experiments

  • NICE SLAM
  • Reconstruction
  • Tracking
  • Computation & Runtime.

Dataset: Replica, ScanNet, TUM RGB-D


57 of 67

Experiments

  • NICE SLAM
  • Reconstruction

Dataset: Replica, ScanNet, TUM RGB-D

  • Accuracy (cm): the average distance between sampled points from �the reconstructed mesh and the nearest ground-truth point

  • Completion (cm): the average distance between sampled points from �the ground-truth mesh and the nearest reconstructed mesh

  • Completion Ratio (%): the percentage of points in the reconstructed mesh with Completion under 5 cm
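The three metrics can be computed from two sampled point sets with brute-force nearest neighbors (fine for small clouds; distances assumed in meters, and whether the ratio is taken over GT or reconstructed points varies by convention — here it follows the completion direction):

```python
import numpy as np

def reconstruction_metrics(rec_pts, gt_pts, thresh=0.05):
    """Accuracy: mean distance from reconstructed points to the nearest GT point.
    Completion: mean distance from GT points to the nearest reconstructed point.
    Completion ratio: fraction of GT points with completion under thresh (5 cm).

    rec_pts: (N, 3) points sampled from the reconstructed mesh
    gt_pts:  (M, 3) points sampled from the ground-truth mesh
    """
    d = np.linalg.norm(rec_pts[:, None, :] - gt_pts[None, :, :], axis=-1)
    acc = d.min(axis=1).mean()                    # rec -> nearest GT
    comp = d.min(axis=0).mean()                   # GT -> nearest rec
    comp_ratio = (d.min(axis=0) < thresh).mean()  # completion under 5 cm
    return acc, comp, comp_ratio
```

For real meshes one would sample many points and use a k-d tree instead of the O(N·M) distance matrix.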

58 of 67

Experiments

  • NICE SLAM
  • Reconstruction

Dataset: Replica, ScanNet, TUM RGB-D

+ Depth estimation improves substantially

+ Better reconstruction

- Higher memory consumption

59 of 67

Experiments

  • NICE SLAM
  • Reconstruction

Dataset: Replica, ScanNet, TUM RGB-D

60 of 67

Experiments

  • NICE SLAM
  • Tracking

Dataset: Replica, ScanNet, TUM RGB-D

- Lower performance than existing explicit-representation methods

+ Much better performance than existing implicit-representation methods

61 of 67

Experiments

  • NICE SLAM
  • Computation & Runtime.

Dataset: Replica, ScanNet, TUM RGB-D

+ Much better mapping performance compared to iMAP

Tracking: 200 samples

Mapping: 1000 samples

FLOPs: the number of floating-point operations needed to obtain color and occupancy for one 3D point

62 of 67

Conclusion

  • NICE SLAM
  • Using feature grids, mapping is more efficient and accurate than with previous implicit representations

63 of 67

Conclusion

  • NICE SLAM

Q & A

64 of 67

Question

  • NICE SLAM

Camera pose initialization during tracking

65 of 67

  • NICE SLAM

Why 3-level Feature Grids?

Considering memory consumption and real-time capability, three levels work best!

Question

66 of 67

Question

  • NICE SLAM

Why is the Mid-level Output not a Residual to the Coarse-level Output?

Unlike mid-fine, the difference in grid size between mid and coarse is too large.
