1 of 21

Visual Odometry with Deep Learning

2022.8.19

2 of 21

Introduction

Classical geometric method:

Image Sequence → Feature Detection (SIFT/SURF/ORB…) → Feature Matching → Outlier Rejection → Motion Estimation (using camera Calibration) → Bundle Adjustment → Pose

Deep learning method: replace the hand-crafted geometric pipeline with a network that maps the image sequence to pose end-to-end.
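For concreteness, a minimal two-frame sketch of the geometric pipeline above using OpenCV; the frame file names are placeholders, the intrinsics are KITTI-like illustrative values, and a real system would add keyframing and bundle adjustment.

```python
import cv2
import numpy as np

# Two consecutive grayscale frames (placeholder file names).
img1 = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# Feature detection + description (ORB).
orb = cv2.ORB_create(nfeatures=2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Feature matching: brute force with Hamming distance (binary descriptors).
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# Camera intrinsics from calibration (KITTI-like placeholder values).
K = np.array([[718.856, 0.0, 607.1928],
              [0.0, 718.856, 185.2157],
              [0.0, 0.0, 1.0]])

# Outlier rejection + motion estimation: essential matrix with RANSAC,
# then decomposition into a relative rotation R and unit translation t.
E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
```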

3 of 21

Absolute Pose Generation

In the VO task, every image in the dataset corresponds to a 3×4 pose matrix [R | t]. R is the 3×3 rotation matrix of the left-camera coordinate system relative to the first frame of the scene, and t is the 3×1 translation vector giving the position of the left camera in the coordinate system of the first frame.

1. Transformation matrix:

T = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix} \in SE(3), \quad \text{pose of first frame } T_0 = I_{4\times 4}

Rotation matrix in 3D space:

4 of 21

Absolute Pose Generation

Elemental rotations about the x, y, and z axes:

R_x(\varphi) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\varphi & -\sin\varphi \\ 0 & \sin\varphi & \cos\varphi \end{bmatrix}, \quad
R_y(\theta) = \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix}, \quad
R_z(\psi) = \begin{bmatrix} \cos\psi & -\sin\psi & 0 \\ \sin\psi & \cos\psi & 0 \\ 0 & 0 & 1 \end{bmatrix}

With R = R_z(\psi) R_y(\theta) R_x(\varphi) (ZYX convention), the Euler angles are recovered from R by

\varphi = \operatorname{atan2}(R_{32}, R_{33}), \quad \theta = -\arcsin(R_{31}), \quad \psi = \operatorname{atan2}(R_{21}, R_{11})

As a result, the absolute pose can be denoted as the 6-D vector

p = [\varphi, \theta, \psi, x, y, z]^T

whose first three components are the Euler angles and whose last three form the translation vector t.
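As a concrete illustration, a minimal sketch of this conversion, assuming KITTI-style pose rows of 12 numbers (a flattened 3×4 [R | t]) and the ZYX Euler convention above:

```python
import numpy as np

def pose_to_6d(pose_line: str) -> np.ndarray:
    """Convert one KITTI-style pose row (flattened 3x4 [R|t]) into the
    6-D absolute pose [phi, theta, psi, x, y, z] (ZYX Euler convention)."""
    T = np.array(pose_line.split(), dtype=np.float64).reshape(3, 4)
    R, t = T[:, :3], T[:, 3]
    phi = np.arctan2(R[2, 1], R[2, 2])               # rotation about x
    theta = -np.arcsin(np.clip(R[2, 0], -1.0, 1.0))  # rotation about y
    psi = np.arctan2(R[1, 0], R[0, 0])               # rotation about z
    return np.concatenate(([phi, theta, psi], t))
```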

5 of 21

Relative Pose Generation

Relative Pose Schema (world coordinate → camera coordinate)

The relative transform between adjacent frames k and k+1 with absolute poses T_k and T_{k+1} is

T_{k,k+1} = T_k^{-1} T_{k+1}

2. Euler angles:

The relative pose of two adjacent frames can be denoted as the 6-D vector

p_{k,k+1} = [\Delta\varphi, \Delta\theta, \Delta\psi, \Delta x, \Delta y, \Delta z]^T

whose first three components are the relative Euler angles and whose last three form the displacement vector.
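A minimal numpy sketch of this step, assuming the absolute poses come as 3×4 [R | t] matrices as above; the relative Euler angles then follow from the rotation block exactly as on the previous slides:

```python
import numpy as np

def to_homogeneous(T34: np.ndarray) -> np.ndarray:
    """Lift a 3x4 [R|t] absolute pose to a 4x4 homogeneous matrix."""
    T = np.eye(4)
    T[:3, :] = T34
    return T

def relative_transform(T_k: np.ndarray, T_k1: np.ndarray) -> np.ndarray:
    """Relative transform from frame k to frame k+1: T_k^{-1} @ T_{k+1}."""
    return np.linalg.inv(to_homogeneous(T_k)) @ to_homogeneous(T_k1)
```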

6 of 21

Network Architectures

1. CNN Layers:

Use a pretrained FlowNetSimple model to learn the geometric features between two adjacent frames.

2. LSTM Layers:

Let the network learn the relationship among multiple successive poses, since the difference between adjacent poses is sometimes small.
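A PyTorch skeleton of this CNN+LSTM design is sketched below; the convolution sizes are illustrative stand-ins rather than the exact FlowNetSimple configuration, and in practice the encoder would be initialized from pretrained FlowNetSimple weights.

```python
import torch
import torch.nn as nn

class DeepVOLike(nn.Module):
    """CNN over stacked frame pairs + LSTM over time, regressing one
    6-D relative pose per adjacent pair. Layer sizes are illustrative."""
    def __init__(self, hidden: int = 1000):
        super().__init__()
        # Stand-in for the FlowNetSimple encoder: each time step stacks
        # two RGB frames on the channel axis (6 input channels).
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(128, 256, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.lstm = nn.LSTM(input_size=256, hidden_size=hidden,
                            num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 6)  # [Euler angles, translation]

    def forward(self, pairs: torch.Tensor) -> torch.Tensor:
        # pairs: (batch, seq, 6, H, W)
        b, s, c, h, w = pairs.shape
        feats = self.encoder(pairs.view(b * s, c, h, w)).view(b, s, 256)
        out, _ = self.lstm(feats)  # temporal modeling across steps
        return self.head(out)      # (batch, seq, 6) relative poses
```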

7 of 21

Cost Function

The network is trained to minimize the MSE between the estimated and ground-truth relative poses:

\theta^* = \operatorname{argmin}_{\theta} \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{t} \left\| \hat{p}_k - p_k \right\|_2^2 + \kappa \left\| \hat{\varphi}_k - \varphi_k \right\|_2^2

where \hat{p}_k and \hat{\varphi}_k are the predicted translation and Euler angles at step k, p_k and \varphi_k the ground-truth labels, N the number of training samples, and \kappa a scale factor balancing the translation and rotation terms (the K compared in the experiments below).
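Rendered directly in PyTorch, the loss might look like the sketch below, assuming predictions and labels are (batch, seq, 6) tensors with the Euler angles in the first three channels, matching the 6-D vector defined earlier:

```python
import torch

def deepvo_loss(pred: torch.Tensor, gt: torch.Tensor,
                kappa: float = 100.0) -> torch.Tensor:
    """MSE on translation plus kappa-weighted MSE on Euler angles.
    pred, gt: (batch, seq, 6) = [phi, theta, psi, x, y, z] per step."""
    ang_err = torch.mean((pred[..., :3] - gt[..., :3]) ** 2)
    trans_err = torch.mean((pred[..., 3:] - gt[..., 3:]) ** 2)
    return trans_err + kappa * ang_err
```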

8 of 21

KITTI VO Benchmark

Autonomous Driving Platform

KITTI Odometry sequences 00-10

9 of 21

Trajectories

Ground Truth of sequences 00-10

10 of 21

Trajectories

11 of 21

nuScenes

nuScenes car setup

 

nuScenes schema

Use the inverse of the pose matrix to transform data from the global coordinate frame to the camera coordinate frame:

P_{cam} = T^{-1} P_{global}
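A minimal sketch of that transform, assuming a 4×4 camera-to-global pose matrix (in nuScenes this would be composed from the ego_pose and the sensor's calibrated_sensor records) and Nx3 points in the global frame:

```python
import numpy as np

def global_to_camera(points_global: np.ndarray,
                     T_cam_to_global: np.ndarray) -> np.ndarray:
    """Map Nx3 global-frame points into the camera frame by applying
    the inverse of the 4x4 camera-to-global pose matrix."""
    pts_h = np.hstack([points_global, np.ones((len(points_global), 1))])
    pts_cam = (np.linalg.inv(T_cam_to_global) @ pts_h.T).T
    return pts_cam[:, :3]
```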

12 of 21

Trajectories

Ground Truth of scenes 0-849

13 of 21

Experiments

KITTI Loss

nuScenes Loss

14 of 21

Experiments

Test on KITTI Sequences

Epoch=10
Seq    Translation RMSE    Rotation RMSE
01     506.189331          2.378539
03     72.12484            0.510368
05     79.494957           2.086924
07     17.149811           2.947729
09     40.579418           1.762159
mean   143.107671          1.9371438

Epoch=120
Seq    Translation RMSE    Rotation RMSE
01     528.226257          0.981712
03     53.838741           0.453822
05     86.06134            1.83037
07     13.937366           2.656275
09     57.504028           2.550753
mean   147.913546          1.694586

15 of 21

Experiments

Test on KITTI Sequences (Epoch=10)

K = 50
Seq    Translation RMSE    Rotation RMSE
01     526.587524          3.802394
03     22.536444           0.869126
05     42.410206           2.308919
07     37.448059           0.890943
09     31.28471            2.076861
mean   132.053389          1.9896486

K = 100
Seq    Translation RMSE    Rotation RMSE
01     506.189331          2.378539
03     72.12484            0.510368
05     79.494957           2.086924
07     17.149811           2.947729
09     40.579418           1.762159
mean   143.107671          1.9371438

16 of 21

Experiments

Comparison between K=50 and K=100 (Epoch=10), where K is the scale factor κ in the cost function

17 of 21

Experiments

Trajectories on test sequences 00-10 for KITTI

18 of 21

Experiments

19 of 21

Experiments

Trajectories on test scenes 0-849 for nuScenes

20 of 21

Conclusion

  1. Pose calculation is carried out on the KITTI VO benchmark and the nuScenes dataset. Rotation theory is used to solve for the absolute pose of each single image in a scene and the relative pose between two adjacent images, using rotation matrices and Euler angles respectively, which provides the data labels for the subsequent deep-learning training experiments.
  2. A DeepVO estimation method is proposed. It takes multiple consecutive three-channel RGB frames as input and directly outputs the relative pose between adjacent frames end-to-end. The main idea is to use the FlowNetSimple network to extract geometric-relationship features between adjacent frames in the image sequence, feed the feature sequence into an LSTM network to learn over the time series, and finally reduce the feature dimension to regress the relative pose between consecutive frames.

21 of 21

Drawbacks & Future

  1. Comparing the trajectories on KITTI and nuScenes, the predicted results on the nuScenes dataset are much better than those on KITTI. This is because the nuScenes routes are divided into 850 scenes and each scene is nearly a straight line, whereas the KITTI sequences contain many turns. In the future, we can therefore use the scene tokens to concatenate multiple scenes into a complete route for training.
  2. Design a new network structure. FlowNetSimple has many parameters and its training process is time-consuming, so a lighter backbone such as SqueezeNet or MobileNet could be used to train the model more quickly. For DeepVO, a GRU with fewer parameters could replace the LSTM, or an attention mechanism could be added to the RNN to achieve higher-accuracy pose estimation.