1 of 17

Reinforced Feature Points:

Optimizing Feature Detection and Description for a High-Level Task

Aritra Bhowmik, Stefan Gumhold, Carsten Rother, Eric Brachmann

Presented by: Arash Sadeghi Amjadi


2 of 17

Introduction

Task: Image matching, a core problem of computer vision.
Practical usage of image matching
History of the task
In this work, a complete vision pipeline was designed, specifically for solving the task of relative pose estimation.


Finding and matching sparse 2D feature points across images has been a long-standing problem in computer vision.
Feature detection algorithms enable the creation of vivid 3D models from image collections, building maps for robotic agents, recognizing places and precise locations as well as recognizing objects.
Train featur …
Naturally, the design of feature detection and description algorithms has received tremendous attention in computer vision research since its early days. Although invented three decades ago, the seminal SIFT algorithm remains the gold standard feature detection pipeline to this day.

One prominent strain of current research aims to keep the concept of sparse feature detection but replaces hand-crafted designs like SIFT with data-driven, learned representations. Initial works largely focused on learning to compare image patches to yield expressive feature descriptors. Fewer works attempt to learn feature detection or a complete architecture for feature detection and description.
Independent studies suggest that these learned pipelines have not yet reached the accuracy of their classical counterparts due to limited generalization abilities.

In this work, a more radical approach was taken. Instead of hand-crafting a training procedure that emulates aspects of high-level vision pipelines, the feature detector was embedded in a complete vision pipeline during training.

Examples of high-level vision pipelines: camera re-localization, structure-from-motion or SLAM.
The question of whether it is more beneficial to find many matches or few, reliable matches does not have to be settled by hand. All these aspects are guided solely by the task loss, i.e. by minimizing the relative pose error between two images.

Their approach was realized using the SuperPoint architecture, a fully convolutional CNN for feature detection and description, pre-trained on synthetic and homography-warped real images. Their training scheme can be applied to architectures other than SuperPoint, like LIFT or R2D2, and also to separate networks for feature detection and description.

3 of 17

Introduction

Contribution

A new training methodology
Applying the proposed method to a state-of-the-art architecture
After training, SuperPoint reaches, and slightly exceeds, the accuracy of SIFT


4 of 17

Method

Task: estimating the relative transformation between two images, I and I’
Key points x_i of image I are found by the “detection network”
The descriptor of key point x_i, d(x_i; w), is computed by the “description network”
Both networks use a joint architecture, SuperPoint, with most weights w shared between the two networks.


5 of 17

Method

Main goal: optimizing the learnable parameters w to improve the accuracy of the high-level task.
Problem: key point selection and feature matching are discrete, non-differentiable operations. Components of the vision pipeline might also be non-differentiable.
Solution: borrowing from reinforcement learning, feature detection and matching are formulated as probabilistic actions.


6 of 17

Method


7 of 17

Method

Part 1: Probabilistic Key Point Selection

Reformulating the key point heat map as a probability distribution over key point locations, parameterized by w.
Sampling N key points independently

Joint probability of sampling key points independently in each image
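
As a minimal sketch of Part 1, assuming the heat map is available as a small grid of non-negative scores: the heat map is normalized into a distribution, N key points are drawn independently, and log P(X; w) is the sum of the per-point log-probabilities. For two images sampled independently, log P(X, X’; w) would simply be the sum of the two log-probabilities. The function names are illustrative and not taken from the original work.

```python
import math
import random

def heatmap_to_distribution(heatmap):
    """Normalize a non-negative heat map (list of rows) so that all entries sum to 1."""
    total = sum(sum(row) for row in heatmap)
    return [[v / total for v in row] for row in heatmap]

def sample_keypoints(prob, n, rng):
    """Draw n key points independently (with replacement); return them with log P(X; w)."""
    height, width = len(prob), len(prob[0])
    locations = [(r, c) for r in range(height) for c in range(width)]
    weights = [prob[r][c] for r, c in locations]
    keypoints = rng.choices(locations, weights=weights, k=n)
    # Independence: log P(X; w) is the sum of the per-point log-probabilities.
    log_prob = sum(math.log(prob[r][c]) for r, c in keypoints)
    return keypoints, log_prob

# Toy heat map standing in for the detector output (assumption).
heatmap = [[0.0, 0.1, 0.0],
           [0.7, 0.0, 0.2],
           [0.0, 0.0, 0.0]]
prob = heatmap_to_distribution(heatmap)
keypoints, log_prob = sample_keypoints(prob, n=4, rng=random.Random(0))
```

Locations with zero probability are never drawn, so the logarithm is always defined for the sampled points.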


8 of 17

Method

Part 2: Probabilistic Feature Matching

Probability of match between two key points with descriptors denoted as

Complete set of M matches between I and I’ sampled independently
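
One plausible way to turn descriptor distances into match probabilities, assumed here for illustration, is a softmax over negative descriptor distances across all key point pairs: pairs with similar descriptors receive high probability, unrelated pairs receive probability close to zero. The function name and the toy 2D descriptors are hypothetical.

```python
import math

def match_distribution(desc_a, desc_b):
    """Softmax over negative descriptor distances for all pairs (i, j)."""
    neg_dist = {
        (i, j): -math.dist(da, db)
        for i, da in enumerate(desc_a)
        for j, db in enumerate(desc_b)
    }
    normalizer = sum(math.exp(v) for v in neg_dist.values())
    return {pair: math.exp(v) / normalizer for pair, v in neg_dist.items()}

# Toy 2D descriptors (assumption); real descriptors are high-dimensional.
desc_a = [(0.0, 0.0), (1.0, 0.0)]
desc_b = [(0.0, 0.1), (5.0, 5.0)]
match_probs = match_distribution(desc_a, desc_b)
best_pair = max(match_probs, key=match_probs.get)
```

With matches sampled independently, the probability of a whole match set would be the product of the individual match probabilities.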


9 of 17

Method

Part 3: Learning Objective

Training data has the form (I, I’, T*), where T* is the ground truth transformation.
The loss value l(M, X, X’) is a scalar that depends on the key points X and X’ and on the matches M that were selected among the key points.
Loss to be minimized:
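
Written out with the notation above, and assuming the sampling distribution factorizes into key point selection (Part 1) and matching (Part 2), the expected task loss would read:

```latex
L(w) = \mathbb{E}_{M, X, X' \sim P(M, X, X'; w)} \left[ \ell(M, X, X') \right],
\qquad
P(M, X, X'; w) = P(M \mid X, X'; w) \, P(X, X'; w)
```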

Problem: calculating the expectation is infeasible. Solution: using an initialized, pre-trained network like SuperPoint. For such a network:

The heat map predicted by the detector is sparse. Only a few image locations have an impact.
Matches among unrelated key points have a large descriptor distance and no impact on the expectation.


Note that we do not need ground truth key point locations X or ground truth image correspondences M.
For relative pose estimation, calculating ℓ entails robust fitting of the essential matrix, its decomposition to yield an estimated relative camera transformation T̂, and its comparison to a ground truth transformation T*.
selecting X and X’ + selecting matches
Observation 1) means we can just sample from the key point heat map, and ignore other image locations. Observation 2) means that for the key points we selected, we do not have to realise a complete matching of all key points in X to all key points in X’. Instead, we rely on a k-nearest-neighbour matching with some small k. All nearest neighbours beyond k likely have large descriptor distances, and hence near zero probability. In practice, we found no advantage in using k > 1, which means we can do a normal nearest neighbour matching during training when calculating P(M | X, X’; w).

10 of 17

Method

Part 3: Learning Objective

Updating learnable parameter following classic REINFORCE algorithm:
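
Assuming the expected loss L(w) factorizes into key point selection P(X, X’; w) and matching P(M | X, X’; w), applying the product rule and the identity ∂P = P ∂ log P gives a REINFORCE-style gradient with one term per sampling stage:

```latex
\frac{\partial}{\partial w} L(w)
  = \mathbb{E}_{X, X'} \left[ \mathbb{E}_{M \mid X, X'} \left[ \ell(M, X, X') \right]
      \frac{\partial}{\partial w} \log P(X, X'; w) \right]
  + \mathbb{E}_{X, X'} \left[ \mathbb{E}_{M \mid X, X'} \left[ \ell(M, X, X')
      \frac{\partial}{\partial w} \log P(M \mid X, X'; w) \right] \right]
```

Both expectations can be approximated by sampling, so the loss ℓ itself never has to be differentiated.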


11 of 17

Experiments

Part 1: Relative pose estimation

Network architecture
Task Description
Datasets
Training
Test


SuperPoint is a fully-convolutional neural network which processes full-sized images. The network has two output heads: one produces a heat map from which key points can be picked, and the other head produces 256-dimensional descriptors as a dense descriptor field over the image. The descriptor output of SuperPoint fits well into our training methodology, as we can look up descriptors for arbitrary image locations without doing repeated forward passes of the network. Both output heads share a common encoder which processes the image and reduces its dimensionality, while the output heads act as decoders. We use the network weights provided by the authors as an initialization.
We calculate the relative camera pose between a pair of images by robust fitting of the essential matrix. We show an overview of the processing pipeline in Fig. 2. The feature detector produces a set of tentative image correspondences. We estimate the essential matrix using the 5-point algorithm [33] in conjunction with a robust estimator. For the robust estimator, we conducted experiments with a standard RANSAC [17] estimator, as well as with the recent NG-RANSAC [10]. NG-RANSAC uses a neural network to suppress outlier correspondences, and to guide RANSAC sampling towards promising candidates for the essential matrix. As a learning-based robust estimator, NG-RANSAC is particularly interesting in our setup, since we can refine it in conjunction with SuperPoint during end-to-end training.
To facilitate comparison to other methods, we follow the evaluation protocol of Yi et al. [59] for relative pose estimation. They evaluate using a collection of 7 outdoor and 16 indoor datasets from various sources [47, 21, 57]. One outdoor scene and one indoor scene serve as training data, the remaining 21 scenes serve as test set. All datasets come with co-visibility information for the selection of suitable image pairs, and ground truth poses.
We sample 600 key points for each image. Next, we perform a nearest neighbour matching between key points, accepting only matches of mutual nearest neighbours in both images. We randomly choose 50% of all matches from this distribution for the relative pose estimation pipeline. We fit the essential matrix, and estimate the relative pose up to scale. We measure the angle between the estimated and ground truth rotation, as well as the angle between the estimated and the ground truth translation vector. We take the maximum of both angles as our task loss ℓ. For difficult image pairs, essential matrix estimation can fail, and the task loss can be very large. To limit the influence of such large losses, we apply a square root soft clamping [10] of the loss after a value of 25°, and a hard clamping after a value of 75°. To approximate the expected task loss L(w) and its gradients we draw key points n_X = 3 times, and, for each set of key points, we draw n_M = 3 sets of matches.
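
The task loss described above can be sketched in a few lines, assuming rotations are given as 3x3 nested lists and translations as 3-vectors. The exact form of the square root soft clamp is an assumption here: `sqrt(soft * loss)` is used because it is continuous at the 25° threshold. All function names are illustrative.

```python
import math

def rotation_angle_deg(R_est, R_gt):
    """Angle of the relative rotation R_est^T R_gt, in degrees."""
    # trace(R_est^T R_gt) equals the sum of element-wise products.
    trace = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

def translation_angle_deg(t_est, t_gt):
    """Angle between two translation directions (scale is unobservable), in degrees."""
    dot = sum(a * b for a, b in zip(t_est, t_gt))
    norm = math.hypot(*t_est) * math.hypot(*t_gt)
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))

def clamp_loss(loss, soft=25.0, hard=75.0):
    """Square root soft clamp above `soft`, hard clamp at `hard` (assumed form)."""
    if loss > soft:
        loss = math.sqrt(soft * loss)  # continuous at loss == soft
    return min(loss, hard)

def task_loss(R_est, t_est, R_gt, t_gt):
    """Maximum of rotation and translation angular error, then clamped."""
    return clamp_loss(max(rotation_angle_deg(R_est, R_gt),
                          translation_angle_deg(t_est, t_gt)))

# Toy example: estimate is off by a 90 degree rotation about the z-axis.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rot_z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
example_loss = task_loss(rot_z_90, (1.0, 0.0, 0.0), identity, (1.0, 0.0, 0.0))
```

In this toy case the raw error of 90° is soft-clamped to sqrt(25 · 90) ≈ 47.4°, which limits the influence of a failed estimate on the gradient.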
For testing, we revert to a deterministic procedure for feature detection, instead of doing sampling. We select the strongest 2000 key points from the detector heat map using local non-max suppression. We remove very weak key points with a heat map value below 0.00015.
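
A minimal sketch of this deterministic test-time selection, assuming the heat map is a small grid of scores. The suppression `radius` is a hypothetical parameter, since the notes do not state the neighbourhood size; ties within a neighbourhood are all kept in this simplified version.

```python
def select_keypoints(heatmap, max_points=2000, min_score=0.00015, radius=1):
    """Keep local maxima with score >= min_score, strongest first, at most max_points."""
    height, width = len(heatmap), len(heatmap[0])
    candidates = []
    for r in range(height):
        for c in range(width):
            score = heatmap[r][c]
            if score < min_score:
                continue  # drop very weak responses
            neighbourhood = [
                heatmap[rr][cc]
                for rr in range(max(0, r - radius), min(height, r + radius + 1))
                for cc in range(max(0, c - radius), min(width, c + radius + 1))
            ]
            if score >= max(neighbourhood):  # local non-max suppression
                candidates.append((score, r, c))
    candidates.sort(reverse=True)
    return [(r, c) for _, r, c in candidates[:max_points]]

# Toy heat map (assumption): 0.2 is suppressed by 0.9, 0.0001 is below the threshold.
toy_heatmap = [[0.0, 0.2, 0.0, 0.0],
               [0.0, 0.9, 0.0, 0.0001],
               [0.0, 0.0, 0.0, 0.0],
               [0.5, 0.0, 0.0, 0.3]]
strongest_two = select_keypoints(toy_heatmap, max_points=2)
all_points = select_keypoints(toy_heatmap)
```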

12 of 17

Experiments

Part 1: Relative pose estimation


13 of 17

Experiments

Part 1: Relative pose estimation


We report test accuracy in accordance with Yi et al. [59], who calculate the pose error as the maximum of rotation and translation angular error. For each dataset, the area under the cumulative error curve (AUC) is calculated and the mean AUC for outdoor and indoor datasets is reported separately.
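
To illustrate the metric, the following sketch computes the area under a cumulative pose error curve as the mean recall over evenly spaced angular thresholds. The maximum threshold of 20° and the 1° step are assumptions for illustration and not taken from the evaluation protocol; the function name is hypothetical.

```python
def pose_auc(errors_deg, max_threshold_deg=20, step_deg=1):
    """Mean of the cumulative error curve sampled at step, 2*step, ..., max threshold."""
    thresholds = range(step_deg, max_threshold_deg + 1, step_deg)
    recalls = [sum(e <= t for e in errors_deg) / len(errors_deg) for t in thresholds]
    return sum(recalls) / len(recalls)

# Toy pose errors in degrees for four image pairs (assumption).
toy_errors = [0.5, 3.0, 12.0, 40.0]
toy_auc = pose_auc(toy_errors)
```

A pair whose error exceeds the largest threshold, such as the 40° case here, never contributes to the curve, so the metric rewards many accurate poses rather than a low average error.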
For RootSIFT, we apply Lowe’s ratio criterion [26] to filter matches where the distance ratio of the nearest and second nearest neighbor is above 0.8.
We found that the excellent accuracy of RootSIFT is largely due to the effectiveness of Lowe’s ratio filter for removing unreliable SIFT matches.
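
The ratio criterion mentioned above can be sketched as follows, assuming descriptors are plain coordinate tuples compared by Euclidean distance. A match is kept only if its nearest neighbour is clearly closer than the second nearest. The function name and toy descriptors are illustrative.

```python
import math

def ratio_test_matches(desc_a, desc_b, max_ratio=0.8):
    """Return (i, j) pairs whose nearest/second-nearest distance ratio is at most max_ratio."""
    matches = []
    for i, da in enumerate(desc_a):
        ranked = sorted((math.dist(da, db), j) for j, db in enumerate(desc_b))
        if len(ranked) < 2:
            continue  # the ratio is undefined with fewer than two candidates
        (best, j), (second, _) = ranked[0], ranked[1]
        if second > 0 and best / second <= max_ratio:
            matches.append((i, j))
    return matches

# Toy descriptors (assumption): the second query is equally close to two candidates.
query = [(0.0, 0.0), (4.5, 5.0)]
candidates = [(0.0, 0.1), (3.0, 4.0), (6.0, 6.0)]
kept = ratio_test_matches(query, candidates)
```

The second query descriptor is ambiguous between two candidates, so its ratio is close to 1 and the match is discarded, which is exactly the kind of unreliable match the filter is meant to remove.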
For our best result, we show the impact of training SuperPoint vs. NG-RANSAC end-to-end. Init. for SuperPoint means weights provided by DeTone et al. [14], init. for NG-RANSAC means training according to Brachmann and Rother [10] for SuperPoint. We show results worse than the RootSIFT baseline in red, and results better than or equal to RootSIFT in green.

14 of 17

Experiments

Part 2: Low-Level Matching Accuracy


We analyse the performance of Reinforced SuperPoint, trained for relative pose estimation (see previous section), on the H-Patches [3] benchmark. The benchmark consists of 116 test sequences showing images under increasing viewpoint and illumination changes. We adhere to the evaluation protocol of Dusmanu et al. [16]. That is, we find key points and matches between image pairs of a sequence, accepting only matches of mutual nearest neighbours between two images. We calculate the reprojection error of each match using the ground truth homography.
We measure the average percentage of correct matches for thresholds ranging from 1px to 10px for the reprojection error.
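
The evaluation steps above can be sketched for a single image pair, assuming 2D key points, descriptors as coordinate tuples, and a 3x3 ground truth homography given as nested lists. Averaging over all image pairs of the benchmark is omitted; the function names are illustrative.

```python
import math

def mutual_nearest_neighbours(desc_a, desc_b):
    """Pairs (i, j) where j is i's nearest neighbour in B and i is j's nearest neighbour in A."""
    nn_ab = [min(range(len(desc_b)), key=lambda j: math.dist(da, desc_b[j])) for da in desc_a]
    nn_ba = [min(range(len(desc_a)), key=lambda i: math.dist(db, desc_a[i])) for db in desc_b]
    return [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]

def project(H, point):
    """Apply a 3x3 homography to a 2D point."""
    x, y = point
    u, v, s = (row[0] * x + row[1] * y + row[2] for row in H)
    return (u / s, v / s)

def matching_accuracy(kp_a, kp_b, matches, H, thresholds=range(1, 11)):
    """Fraction of matches with reprojection error of at most t pixels, per threshold t."""
    errors = [math.dist(project(H, kp_a[i]), kp_b[j]) for i, j in matches]
    return {t: sum(e <= t for e in errors) / len(errors) for t in thresholds}

# Toy data (assumption): the homography is a pure shift of 10 px along x.
H_gt = [[1.0, 0.0, 10.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
kp_a, kp_b = [(0.0, 0.0), (5.0, 5.0)], [(10.5, 0.0), (18.0, 5.0)]
desc_a, desc_b = [(0.0, 0.0), (1.0, 1.0)], [(0.1, 0.0), (1.0, 0.9)]
toy_matches = mutual_nearest_neighbours(desc_a, desc_b)
accuracy = matching_accuracy(kp_a, kp_b, toy_matches, H_gt)
```

Here the two matches have reprojection errors of 0.5 px and 3 px, so half of the matches count as correct at the 1 px threshold and all of them from 3 px onwards.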
Left: We show the mean matching accuracy for SuperPoint before and after being trained for relative pose estimation. Results of competitors as reported in [16]. Right: Area under the curve (AUC) for the plots on the left.

15 of 17

Experiments

Part 2: Low-Level Matching Accuracy


16 of 17

Experiments

Part 3: Structure-from-Motion


We select three of the smaller scenes from the benchmark, and extract key points and matches using SuperPoint and Reinforced SuperPoint. We create a sparse SfM reconstruction using COLMAP [45], and report the number of reconstructed 3D points, the average track length of features (indicating feature stability across views), and the average reprojection error (indicating key point precision). We report our results in Table 2, and confirm the findings of our previous experiments. While the number of key points reduces, the matching quality increases, as measured by track length and reprojection error. For reference, we also show results for DSP-SIFT [15], the best of all SIFT variants on the benchmark [46], and GeoDesc [27], a learned descriptor which achieves state-of-the-art results on the benchmark. Note that SuperPoint only provides pixel-accurate key point locations, compared to the sub-pixel accuracy of DSP-SIFT and GeoDesc. Hence, the reprojection error of SuperPoint is higher.

17 of 17

Thank you for your attention!
