1 of 21

Visual Odometry with Deep Learning

2022.8.19

2 of 21

Introduction

Classical geometric method:

Image Sequence → Feature Detection (SIFT/SURF/ORB…) → Feature Matching → Outlier Rejection → Motion Estimation (using camera Calibration) → Bundle Adjustment → Pose

Deep learning method: replace the hand-crafted geometric pipeline with a network that maps the image sequence to pose end-to-end.
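For concreteness, a minimal two-frame sketch of the geometric pipeline above using OpenCV; the frame file names are placeholders, the intrinsics are KITTI-like illustrative values, and a real system would add keyframing and bundle adjustment.

```python
import cv2
import numpy as np

# Two consecutive grayscale frames (placeholder file names).
img1 = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# Feature detection + description (ORB).
orb = cv2.ORB_create(nfeatures=2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Feature matching: brute force with Hamming distance (binary descriptors).
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# Camera intrinsics from calibration (KITTI-like placeholder values).
K = np.array([[718.856, 0.0, 607.1928],
              [0.0, 718.856, 185.2157],
              [0.0, 0.0, 1.0]])

# Outlier rejection + motion estimation: essential matrix with RANSAC,
# then decomposition into a relative rotation R and unit translation t.
E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
```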

3 of 21

Absolute Pose Generation

In the VO task, every image in the dataset corresponds to a 3×4 pose matrix [R | t]. R is the 3×3 rotation matrix of the left-camera coordinate system relative to the first frame of the scene, and t is the 3×1 translation vector giving the position of the left camera in the coordinate system of the first frame.

1. Transformation matrix:

T = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix} \in SE(3), \quad \text{pose of first frame } T_0 = I_{4\times 4}

Rotation matrix in 3D space:

4 of 21

Absolute Pose Generation

Elemental rotations about the x, y, and z axes:

R_x(\varphi) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\varphi & -\sin\varphi \\ 0 & \sin\varphi & \cos\varphi \end{bmatrix}, \quad
R_y(\theta) = \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix}, \quad
R_z(\psi) = \begin{bmatrix} \cos\psi & -\sin\psi & 0 \\ \sin\psi & \cos\psi & 0 \\ 0 & 0 & 1 \end{bmatrix}

With R = R_z(\psi) R_y(\theta) R_x(\varphi) (ZYX convention), the Euler angles are recovered from R by

\varphi = \operatorname{atan2}(R_{32}, R_{33}), \quad \theta = -\arcsin(R_{31}), \quad \psi = \operatorname{atan2}(R_{21}, R_{11})

As a result, the absolute pose can be denoted as the 6-D vector

p = [\varphi, \theta, \psi, x, y, z]^T

whose first three components are the Euler angles and whose last three form the translation vector t.
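As a concrete illustration, a minimal sketch of this conversion, assuming KITTI-style pose rows of 12 numbers (a flattened 3×4 [R | t]) and the ZYX Euler convention above:

```python
import numpy as np

def pose_to_6d(pose_line: str) -> np.ndarray:
    """Convert one KITTI-style pose row (flattened 3x4 [R|t]) into the
    6-D absolute pose [phi, theta, psi, x, y, z] (ZYX Euler convention)."""
    T = np.array(pose_line.split(), dtype=np.float64).reshape(3, 4)
    R, t = T[:, :3], T[:, 3]
    phi = np.arctan2(R[2, 1], R[2, 2])               # rotation about x
    theta = -np.arcsin(np.clip(R[2, 0], -1.0, 1.0))  # rotation about y
    psi = np.arctan2(R[1, 0], R[0, 0])               # rotation about z
    return np.concatenate(([phi, theta, psi], t))
```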

5 of 21

Relative Pose Generation

Relative Pose Schema (world coordinate → camera coordinate)

The relative transform between adjacent frames k and k+1 with absolute poses T_k and T_{k+1} is

T_{k,k+1} = T_k^{-1} T_{k+1}

2. Euler angles:

The relative pose of two adjacent frames can be denoted as the 6-D vector

p_{k,k+1} = [\Delta\varphi, \Delta\theta, \Delta\psi, \Delta x, \Delta y, \Delta z]^T

whose first three components are the relative Euler angles and whose last three form the displacement vector.
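A minimal numpy sketch of this step, assuming the absolute poses come as 3×4 [R | t] matrices as above; the relative Euler angles then follow from the rotation block exactly as on the previous slides:

```python
import numpy as np

def to_homogeneous(T34: np.ndarray) -> np.ndarray:
    """Lift a 3x4 [R|t] absolute pose to a 4x4 homogeneous matrix."""
    T = np.eye(4)
    T[:3, :] = T34
    return T

def relative_transform(T_k: np.ndarray, T_k1: np.ndarray) -> np.ndarray:
    """Relative transform from frame k to frame k+1: T_k^{-1} @ T_{k+1}."""
    return np.linalg.inv(to_homogeneous(T_k)) @ to_homogeneous(T_k1)
```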

6 of 21

Network Architectures

1. CNN Layers:

Use a pretrained FlowNetSimple model to learn the geometric features between two adjacent frames.

2. LSTM Layers:

Let the network learn the relationship among multiple successive poses, since the difference between adjacent poses is sometimes small.
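A PyTorch skeleton of this CNN+LSTM design is sketched below; the convolution sizes are illustrative stand-ins rather than the exact FlowNetSimple configuration, and in practice the encoder would be initialized from pretrained FlowNetSimple weights.

```python
import torch
import torch.nn as nn

class DeepVOLike(nn.Module):
    """CNN over stacked frame pairs + LSTM over time, regressing one
    6-D relative pose per adjacent pair. Layer sizes are illustrative."""
    def __init__(self, hidden: int = 1000):
        super().__init__()
        # Stand-in for the FlowNetSimple encoder: each time step stacks
        # two RGB frames on the channel axis (6 input channels).
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(128, 256, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.lstm = nn.LSTM(input_size=256, hidden_size=hidden,
                            num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 6)  # [Euler angles, translation]

    def forward(self, pairs: torch.Tensor) -> torch.Tensor:
        # pairs: (batch, seq, 6, H, W)
        b, s, c, h, w = pairs.shape
        feats = self.encoder(pairs.view(b * s, c, h, w)).view(b, s, 256)
        out, _ = self.lstm(feats)  # temporal modeling across steps
        return self.head(out)      # (batch, seq, 6) relative poses
```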

7 of 21

Cost Function

The network is trained to minimize the MSE between the estimated and ground-truth relative poses:

\theta^* = \operatorname{argmin}_{\theta} \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{t} \left\| \hat{p}_k - p_k \right\|_2^2 + \kappa \left\| \hat{\varphi}_k - \varphi_k \right\|_2^2

where \hat{p}_k and \hat{\varphi}_k are the predicted translation and Euler angles at step k, p_k and \varphi_k the ground-truth labels, N the number of training samples, and \kappa a scale factor balancing the translation and rotation terms (the K compared in the experiments below).
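Rendered directly in PyTorch, the loss might look like the sketch below, assuming predictions and labels are (batch, seq, 6) tensors with the Euler angles in the first three channels, matching the 6-D vector defined earlier:

```python
import torch

def deepvo_loss(pred: torch.Tensor, gt: torch.Tensor,
                kappa: float = 100.0) -> torch.Tensor:
    """MSE on translation plus kappa-weighted MSE on Euler angles.
    pred, gt: (batch, seq, 6) = [phi, theta, psi, x, y, z] per step."""
    ang_err = torch.mean((pred[..., :3] - gt[..., :3]) ** 2)
    trans_err = torch.mean((pred[..., 3:] - gt[..., 3:]) ** 2)
    return trans_err + kappa * ang_err
```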

8 of 21

KITTI VO Benchmark

Autonomous Driving Platform

KITTI Odometry sequences 00-10

9 of 21

Trajectories

Ground Truth of sequences 00-10

10 of 21

Trajectories

11 of 21

nuScenes

nuScenes car setup

 

nuScenes schema

Use the inverse of the pose matrix to transform data from the global coordinate frame to the camera coordinate frame:

P_{cam} = T^{-1} P_{global}
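A minimal sketch of that transform, assuming a 4×4 camera-to-global pose matrix (in nuScenes this would be composed from the ego_pose and the sensor's calibrated_sensor records) and Nx3 points in the global frame:

```python
import numpy as np

def global_to_camera(points_global: np.ndarray,
                     T_cam_to_global: np.ndarray) -> np.ndarray:
    """Map Nx3 global-frame points into the camera frame by applying
    the inverse of the 4x4 camera-to-global pose matrix."""
    pts_h = np.hstack([points_global, np.ones((len(points_global), 1))])
    pts_cam = (np.linalg.inv(T_cam_to_global) @ pts_h.T).T
    return pts_cam[:, :3]
```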

12 of 21

Trajectories

Ground Truth of scenes 0-849

13 of 21

Experiments

KITTI Loss

nuScenes Loss

14 of 21

Experiments

Test on KITTI Sequences

Epoch=10
Seq    Translation RMSE    Rotation RMSE
01     506.189331          2.378539
03     72.12484            0.510368
05     79.494957           2.086924
07     17.149811           2.947729
09     40.579418           1.762159
mean   143.107671          1.9371438

Epoch=120
Seq    Translation RMSE    Rotation RMSE
01     528.226257          0.981712
03     53.838741           0.453822
05     86.06134            1.83037
07     13.937366           2.656275
09     57.504028           2.550753
mean   147.913546          1.694586

15 of 21

Experiments

Test on KITTI Sequences (Epoch=10)

K = 50
Seq    Translation RMSE    Rotation RMSE
01     526.587524          3.802394
03     22.536444           0.869126
05     42.410206           2.308919
07     37.448059           0.890943
09     31.28471            2.076861
mean   132.053389          1.9896486

K = 100
Seq    Translation RMSE    Rotation RMSE
01     506.189331          2.378539
03     72.12484            0.510368
05     79.494957           2.086924
07     17.149811           2.947729
09     40.579418           1.762159
mean   143.107671          1.9371438

16 of 21

Experiments

Comparison between K=50 and K=100 (Epoch=10), where K is the scale factor κ in the cost function

17 of 21

Experiments

Trajectories on test sequences 00-10 for KITTI

18 of 21

Experiments

19 of 21

Experiments

Trajectories on test scenes 0-849 for nuScenes

20 of 21

Conclusion

  1. Pose calculation is carried out on the KITTI VO benchmark and the nuScenes dataset. Rotation theory is used to solve for the absolute pose of each single image in a scene and the relative pose between two adjacent images, using rotation matrices and Euler angles respectively, which provides the data labels for the subsequent deep-learning training experiments.
  2. A DeepVO estimation method is proposed. It takes multiple consecutive three-channel RGB frames as input and directly outputs the relative pose between adjacent frames end-to-end. The main idea is to use the FlowNetSimple network to extract geometric-relationship features between adjacent frames in the image sequence, feed the feature sequence into an LSTM network to learn over the time series, and finally reduce the feature dimension to regress the relative pose between consecutive frames.

21 of 21

Drawbacks & Future

  1. Comparing the trajectories on KITTI and nuScenes, the predicted results on the nuScenes dataset are much better than those on KITTI. This is because the nuScenes routes are divided into 850 scenes and each scene is nearly a straight line, whereas the KITTI sequences contain many turns. In the future, we can therefore use the scene tokens to concatenate multiple scenes into a complete route for training.
  2. Design a new network structure. FlowNetSimple has many parameters and its training process is time-consuming, so a lighter backbone such as SqueezeNet or MobileNet could be used to train the model more quickly. For DeepVO, a GRU with fewer parameters could replace the LSTM, or an attention mechanism could be added to the RNN to achieve higher-accuracy pose estimation.