1 of 67

NICE-SLAM: Neural Implicit Scalable Encoding for SLAM

CVPR 2022

Zihan Zhu, Songyou Peng, Viktor Larsson, Weiwei Xu, Hujun Bao, Zhaopeng Cui, Martin R. Oswald, Marc Pollefeys

2 of 67

Introduction

  • SLAM?

Simultaneous Localization and Mapping

Robot pose

Map points

(in real time)

3 of 67

Introduction

RADAR

event camera

Ultrasound

LiDAR

Wheel encoder

infrared camera

GNSS

monocular camera

RGB-D camera

Photometric sensors

  • Sensors for SLAM

4 of 67

Introduction

event camera

infrared camera

monocular camera

RGB-D camera

Localization

Mapping

  • Visual SLAM

Photometric sensors

5 of 67

Introduction

  • Output

NICE-SLAM demo video

6 of 67

Method

  • iMAP: Implicit mapping and positioning in real-time[1]

[1] Sucar, Edgar, et al. "iMAP: Implicit mapping and positioning in real-time." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.

Map information

Mesh, Point clouds, Voxel...

Storage

7 of 67

Method

  • iMAP: Implicit mapping and positioning in real-time[1]


Map information


Storage

NeRF[2]

[2] Mildenhall, Ben, et al. "Nerf: Representing scenes as neural radiance fields for view synthesis." European conference on computer vision. Springer, Cham, 2020.

8 of 67

Method

  • Concept of iMAP(Mapping)


NeRF[2]

RGB-D camera

9 of 67

Method

  • Concept of iMAP(Mapping)


[Figure] iMAP mapping: a ray is cast from the camera origin along the viewing direction; sampled positions are fed to an MLP that outputs color and density. Occupancy is computed from density and the inter-sample distance, transmittance from the accumulated occupancy, and the transmittance-weighted sums give the rendered color and depth.
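The rendering in this figure can be sketched as a minimal NumPy volume renderer in the NeRF/iMAP style, assuming samples have already been drawn along the ray (variable names are mine, not the paper's):

```python
import numpy as np

def render_ray(sigma, color, t_vals):
    """Volume-render color and depth along one ray, iMAP/NeRF style.

    sigma:  (N,) density at each sample
    color:  (N, 3) RGB at each sample
    t_vals: (N,) sorted sample depths along the ray
    """
    delta = np.diff(t_vals)
    delta = np.append(delta, delta[-1])            # inter-sample distances
    occ = 1.0 - np.exp(-sigma * delta)             # occupancy o_i = 1 - exp(-sigma_i * delta_i)
    trans = np.concatenate(([1.0], np.cumprod(1.0 - occ)[:-1]))  # transmittance: prod over j < i of (1 - o_j)
    w = occ * trans                                # ray-termination weights
    rgb = (w[:, None] * color).sum(axis=0)         # rendered color
    depth = (w * t_vals).sum()                     # rendered depth
    return rgb, depth
```

A near-opaque sample dominates the weights, so the rendered depth snaps to that sample's distance along the ray.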

10 of 67

Method

  • Concept of iMAP(Mapping)


[Figure] iMAP pipeline: a NeRF-style MLP maps position to color and density; the rendered color and depth are supervised by the RGB-D camera.

11 of 67

Method

  • Concept of iMAP(Mapping)


[Figure] Mapping: the same pipeline; the color and depth losses from the RGB-D input update the MLP, i.e. the map.

12 of 67

Method

  • NeRF[2]


The scene must be bounded!

13 of 67

Method

  • Concept of iMAP(Tracking)


[Figure] The same iMAP pipeline (RGB-D input, MLP, rendered color and depth), now used for tracking.

14 of 67

Method

  • Concept of iMAP(Tracking)


[Figure] Tracking: the camera pose is updated at 10 Hz against the rendered color and depth.

15 of 67

Method

  • Concept of iMAP(Tracking)


[Figure] Tracking: the color and depth losses update the camera pose rather than the network.

16 of 67

Method

  • iMAP: Implicit mapping and positioning in real-time[1]


17 of 67

Method

  • NICE SLAM

NICE SLAM

  • Handles larger scenes than iMAP
  • Better reconstruction than iMAP
  • A large network still cannot be used

NICE SLAM

position

color

iMAP

position

MLP

color

density

  • Cannot handle large scenes
  • Limited sample count → degraded reconstruction
  • Cannot use a large network

iMAP

Occupancy

18 of 67

Method

  • NICE SLAM
  • A pretrained decoder is used to estimate the occupancy of each sample point
  • The decoder input is tri-linearly interpolated from the feature-grid cells adjacent to the sample point

NICE SLAM

position

color

Occupancy

Every feature in the feature grid is a 32-dimensional learnable parameter.
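The tri-linear interpolation step could look like the sketch below (illustrative only; the real system interpolates 32-dimensional features at several grid resolutions, and `trilerp` is my name, not the paper's):

```python
import numpy as np

def trilerp(grid, p):
    """Tri-linearly interpolate a feature grid at a continuous point.

    grid: (X, Y, Z, C) array of learnable features (C = 32 in NICE-SLAM)
    p:    (3,) query position in voxel coordinates, strictly inside the grid
    """
    lo = np.floor(p).astype(int)      # lower corner of the enclosing cell
    frac = p - lo                     # fractional offset inside the cell
    out = np.zeros(grid.shape[-1])
    for dx in (0, 1):                 # blend the 8 corner features
        for dy in (0, 1):
            for dz in (0, 1):
                wx = frac[0] if dx else 1.0 - frac[0]
                wy = frac[1] if dy else 1.0 - frac[1]
                wz = frac[2] if dz else 1.0 - frac[2]
                out += wx * wy * wz * grid[lo[0] + dx, lo[1] + dy, lo[2] + dz]
    return out
```

Because the weights are linear in the query position, gradients flow back to the eight surrounding grid features, which is what makes the grid learnable.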

19 of 67

Method

  • NICE SLAM
  • Uses a decoder pretrained as in Convolutional Occupancy Networks[1]
  • The CNN provides translation equivariance
  • Using a U-Net as the CNN enables local-global integration

[1] Peng, Songyou, et al. "Convolutional occupancy networks." European Conference on Computer Vision. Springer, Cham, 2020.

20 of 67

Method

  • NICE SLAM

NICE SLAM

position

color

Occupancy

  • A color decoder is used to estimate the color of each sample point
  • The decoder input is tri-linearly interpolated from the feature-grid cells adjacent to the sample point

Every feature in the feature grid is a 32-dimensional learnable parameter.

Color Decoder

Color

21 of 67

Method

  • NICE SLAM

22 of 67

Method

  • NICE SLAM

Occupancy

Color

23 of 67

Method

  • NICE SLAM

Occupancy

Color

24 of 67

Method

  • NICE SLAM

25 of 67

Method

  • The goal is high-quality mapping
  • Occupancy is predicted in a coarse-to-fine manner
  • Detailed geometric information is learned
  • The pretrained decoders are not updated

0.16m

0.32m

mid

fine

Mid-level feature

High-level feature

26 of 67

Method

  • NICE SLAM

27 of 67

Method

  • The main purpose is mapping that supports tracking; extrapolated geometry helps the tracker
  • Even when only part of a region is observed, the rest is predicted
  • High-level (coarse) geometric information is captured
  • Optimized independently of the mid/fine features

Even from a partial view, the rest can be filled in!

2m

28 of 67

Method

  • NICE SLAM

29 of 67

Method

0.16m

fine

Fully Connected

Color

Not pretrained!

  • Used to improve tracking performance
  • A learnable decoder is used

30 of 67

Method

  • NICE SLAM

31 of 67

Method

  • Sample uniformly along the ray (but only within the scene bound)
  • Since depth is observed, also sample uniformly near that depth

Ray

Ray
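The sampling strategy above might be sketched as follows (the sample counts and the near-depth band width here are illustrative placeholders, not the paper's values):

```python
import numpy as np

def sample_along_ray(near, far, depth, n_strat=32, n_surf=16, band=0.1, rng=None):
    """Draw depths along a ray: stratified-uniform samples inside the scene
    bound plus extra samples concentrated near the observed sensor depth."""
    if rng is None:
        rng = np.random.default_rng(0)
    # stratified: one uniform sample per bin between near and far
    edges = np.linspace(near, far, n_strat + 1)
    strat = edges[:-1] + rng.random(n_strat) * np.diff(edges)
    # surface-guided: uniform samples in a narrow band around the sensor depth
    surf = depth + (rng.random(n_surf) * 2.0 - 1.0) * band
    return np.sort(np.concatenate([strat, surf]))
```

Concentrating samples near the measured depth is what lets an RGB-D pipeline get away with far fewer samples per ray than an RGB-only NeRF.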

32 of 67

Method

  • Important incoming frames are registered as keyframes (there are multiple keyframes)
  • Only M (= 200) pixels (rays) in total are sampled from the current frame plus the keyframes

Ray

Ray

keyframe

current frame

33 of 67

Method

  • Important incoming frames are set as keyframes (there are multiple keyframes)
  • Keyframes prevent forgetting

34 of 67


1. To select keyframes, compute the proportion P of pixels whose depth error is smaller than a threshold.

2. If P is under a threshold, this frame is a keyframe and is added to the keyframe set.

Continual Neural Mapping: Learning An Implicit Scene Representation from Sequential Observations

F1

F2

F0

F3

F4

F5

F6

Method
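The two-step keyframe test above could be sketched like this (the error and proportion thresholds are placeholders, not values from the paper):

```python
import numpy as np

def is_keyframe(d_obs, d_rendered, err_thresh=0.1, p_thresh=0.65):
    """Step 1: P = proportion of pixels whose depth error is under err_thresh.
    Step 2: the frame becomes a keyframe when P falls below p_thresh,
    i.e. when the current map explains too little of the frame."""
    valid = d_obs > 0                                  # skip invalid depth readings
    err = np.abs(d_obs[valid] - d_rendered[valid])
    P = float((err < err_thresh).mean())
    return P < p_thresh
```

The intuition: a frame that the current map already renders well adds little information, so only poorly explained frames enter the keyframe set.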

38 of 67

Method

  • Rendering

[Figure] Rendering: points sampled along a ray (camera origin, viewing direction) are fed to the MLP, which outputs color and density; occupancy and the inter-sample distance give the weights for the rendered color and depth.

  • Each sampled point is passed through the networks (plus feature grids) to obtain its color and occupancy

39 of 67

Method

  • Rendering

Ray

Camera origin:

Ray

Viewing direction:

Coarse-level Occupancy:

  • Each sampled point is passed through the networks (plus feature grids) to obtain its color and occupancy

NICE SLAM

position

color

Occupancy

fine-level Occupancy:

40 of 67

Method

  • Rendering

Ray

Camera origin:

Ray

Viewing direction:

Coarse-level Occupancy:

  • Each sampled point is passed through the networks (plus feature grids) to obtain its color and occupancy

NICE SLAM

position

color

Occupancy

fine-level Occupancy:

Transformation

41 of 67

Method

  • NICE SLAM

42 of 67

Method

  • Loss

Geometric loss

Photometric loss

GT

Rendered image

Normalize
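The two loss terms on this slide might be combined as below (the L1 forms and the weighting factor `lam` are illustrative assumptions, not the paper's exact hyperparameters):

```python
import numpy as np

def mapping_loss(d_gt, d_pred, c_gt, c_pred, lam=0.2):
    """Geometric (depth) L1 loss plus a weighted photometric (color) L1 loss,
    averaged over the M sampled pixels.

    d_gt, d_pred: (M,) ground-truth and rendered depths
    c_gt, c_pred: (M, 3) ground-truth and rendered colors
    """
    geometric = np.abs(d_gt - d_pred).mean()
    photometric = np.abs(c_gt - c_pred).mean()
    return geometric + lam * photometric
```

Down-weighting the photometric term keeps the depth supervision, which is metrically accurate, in charge of the geometry.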

43 of 67

Method

  • Loss

Geometric loss

Photometric loss

GT

Rendered image

Normalize

Coarse-level Occupancy:

fine-level Occupancy:

44 of 67

Method

  • Loss

Geometric loss

Photometric loss

GT

Rendered image

Normalize

45 of 67

Method

  • Occupancy is predicted in a coarse-to-fine manner

  • Detailed geometric information is learned

0.16m

0.32m

mid

fine

46 of 67

Method

mid/fine-level

position

Occupancy

First stage

Second stage

Geometric loss

Coarse-level Occupancy:

fine-level Occupancy:

  • The fine geometric loss is applied at each of the two stages

Mid and fine feature grids are optimized

47 of 67

Method

  • The fine geometric loss is applied at each of the two stages

mid/fine-level

position

Occupancy

First stage

Second stage

Geometric loss

Coarse level

position

Occupancy

Coarse-level Occupancy:

fine-level Occupancy:

The coarse feature grid is optimized

Mid and fine feature grids are optimized

48 of 67

Method

  • Both the feature grids and the color decoder are optimized
  • Photometric information helps tracking

Photometric loss

Coarse-level Occupancy:

fine-level Occupancy:

Fully Connected

Color

GT

Rendered image

49 of 67

Method

Coarse-level Occupancy:

fine-level Occupancy:

Loss

Rendering image

Get sample

Fed to network

NICE SLAM

position

color

Occupancy

Camera origin:

Viewing direction:

Transformation:

50 of 67

Method

Loss

  • Local Bundle adjustment

: rotations and translations of the keyframes

: weighting factor

51 of 67

Method

  • NICE SLAM

52 of 67

Method

  • The pose of the current frame is estimated
  • The first frame is set as the global frame
  • Both photometric and geometric losses are used

53 of 67

Method

  • Depth can change very sensitively with camera motion; to keep the loss from over-focusing on such regions, a modified loss is used
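One way to realize this modification: down-weight each pixel's depth residual by the variance of depth along its ray, so uncertain regions (e.g. object edges under fast motion) contribute less. A sketch of the idea, not the exact formulation:

```python
import numpy as np

def tracking_depth_loss(d_gt, d_pred, d_var, eps=1e-10):
    """Depth residuals divided by the per-pixel depth standard deviation
    along the ray; high-variance (uncertain) pixels contribute less.

    d_gt, d_pred: (M,) observed and rendered depths
    d_var:        (M,) depth variance along each ray
    """
    return (np.abs(d_gt - d_pred) / np.sqrt(d_var + eps)).mean()
```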

54 of 67

Method

  • NICE SLAM

55 of 67

Method

  • NICE SLAM

Mid- & fine-level geometric and color mapping thread (Mapping)

Coarse-level geometric mapping Thread

Camera tracking Thread

56 of 67

Experiments

  • NICE SLAM
  • Reconstruction
  • Tracking
  • Computation & Runtime.

Dataset: Replica, ScanNet, TUM RGB-D


57 of 67

Experiments

  • NICE SLAM
  • Reconstruction

Dataset: Replica, ScanNet, TUM RGB-D

  • Accuracy (cm): the average distance between sampled points from �the reconstructed mesh and the nearest ground-truth point

  • Completion (cm): the average distance between sampled points from �the ground-truth mesh and the nearest reconstructed mesh

  • Completion Ratio (%): the percentage of points in the reconstructed mesh with Completion under 5 cm
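The three metrics can be computed from two sampled point sets with brute-force nearest neighbors (fine for small clouds; distances assumed in meters, and whether the ratio is taken over GT or reconstructed points varies by convention — here it follows the completion direction):

```python
import numpy as np

def reconstruction_metrics(rec_pts, gt_pts, thresh=0.05):
    """Accuracy: mean distance from reconstructed points to the nearest GT point.
    Completion: mean distance from GT points to the nearest reconstructed point.
    Completion ratio: fraction of GT points with completion under thresh (5 cm).

    rec_pts: (N, 3) points sampled from the reconstructed mesh
    gt_pts:  (M, 3) points sampled from the ground-truth mesh
    """
    d = np.linalg.norm(rec_pts[:, None, :] - gt_pts[None, :, :], axis=-1)
    acc = d.min(axis=1).mean()                    # rec -> nearest GT
    comp = d.min(axis=0).mean()                   # GT -> nearest rec
    comp_ratio = (d.min(axis=0) < thresh).mean()  # completion under 5 cm
    return acc, comp, comp_ratio
```

For real meshes one would sample many points and use a k-d tree instead of the O(N·M) distance matrix.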

58 of 67

Experiments

  • NICE SLAM
  • Reconstruction

Dataset: Replica, ScanNet, TUM RGB-D

+ Depth estimation improves substantially

+ Better reconstruction

- Higher memory consumption

59 of 67

Experiments

  • NICE SLAM
  • Reconstruction

Dataset: Replica, ScanNet, TUM RGB-D

60 of 67

Experiments

  • NICE SLAM
  • Tracking

Dataset: Replica, ScanNet, TUM RGB-D

- Lower performance than existing explicit-representation methods

+ Much better performance than existing implicit-representation methods

61 of 67

Experiments

  • NICE SLAM
  • Computation & Runtime.

Dataset: Replica, ScanNet, TUM RGB-D

+ Much better mapping performance compared to iMAP

Tracking: 200 samples

Mapping: 1000 samples

FLOPs: the number of floating-point operations needed to obtain color and occupancy for one 3D point

62 of 67

Conclusion

  • NICE SLAM
  • Using feature grids, mapping is more efficient and accurate than with previous implicit representations

63 of 67

Conclusion

  • NICE SLAM

Q & A

64 of 67

Question

  • NICE SLAM

Camera pose initialization during tracking

65 of 67

  • NICE SLAM

Why 3-level Feature Grids?

Considering memory consumption and real-time capability, three levels work best!

Question

66 of 67

Question

  • NICE SLAM

Why is the Mid-level Output not a Residual to the Coarse-level Output?

Unlike mid-fine, the difference in grid size between mid and coarse is too large.
