Cone Detection by Faster R-CNN
2022.7.30
Introduction
Object detection is a field of computer vision that takes videos or still images as input and runs them through algorithms in order to locate and classify objects such as people and cars.
Representative models include R-CNN, Fast R-CNN, Faster R-CNN, and YOLO.
Goal: train a Faster R-CNN model on a dataset of traffic cones to locate and classify cones within an image.
Cone Dataset
GitHub dataset: 123 annotated images
Data Augmentation: Albumentations API
Annotation files (five fields per line):
(1) First field: class label
(2) Second field: x-coordinate of the cone center
(3) Third field: y-coordinate of the cone center
(4) Fourth field: width of the cone box
(5) Fifth field: height of the cone box
An 80/20 percent split for training and validation.
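The annotation format and split above can be sketched in Python. This assumes the common YOLO-style convention that the coordinates are normalized to [0, 1]; the helper names are illustrative, not the project's actual code.

```python
import random

def parse_annotation_line(line):
    """Parse one annotation line: '<label> <cx> <cy> <w> <h>',
    with center/size values assumed normalized to [0, 1]."""
    parts = line.split()
    label = int(parts[0])
    cx, cy, w, h = (float(v) for v in parts[1:5])
    return label, cx, cy, w, h

def train_val_split(items, train_frac=0.8, seed=0):
    """Shuffle and split into 80% training / 20% validation."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]

label, cx, cy, w, h = parse_annotation_line("0 0.512 0.430 0.120 0.250")
train, val = train_val_split(range(123))  # the 123 annotated images
```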
Augmentations applied:
Horizontal flip
Median Blur
Crop
Contrast
Four steps to implement Faster R-CNN:
- CNN layers: pass the image through a pretrained VGG16 network to obtain the corresponding feature map.
- Region Proposal Network: use the RPN to generate region proposals. It has two tasks: classification and regression.
- ROI Pooling: combine the feature map and the region proposals to produce proposal feature maps.
- Classification: use the proposal feature maps to classify each region.
Faster R-CNN = RPN + Fast R-CNN
Region Proposal Network
(1) Generate anchors at every position of the feature map output by the convolution layers.
(2) Use softmax to determine whether each region is positive (likely contains an object) or negative (no object), based on texture, shape, and color features.
(3) Bounding box regression finds the translation and scaling parameters that fit each anchor more closely to the ground truth (the actual bounding box of the object).
(4) The proposal layer takes the positive anchors together with the regression parameters and outputs the precise dimensions of the proposed regions.
Anchors: a set of 9 bounding boxes per position, combining 3 scales with 3 aspect ratios.
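A minimal sketch of how the 9 base anchors can be generated, using the scale and aspect-ratio values from the Faster R-CNN paper; each anchor keeps an area of scale squared while its height/width ratio varies.

```python
import numpy as np

def generate_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return the 9 base anchors as (x1, y1, x2, y2) centered at the
    origin: 3 scales x 3 aspect ratios, each with area scale**2."""
    anchors = []
    for scale in scales:
        for ratio in ratios:
            # Keep area = scale**2 while setting height/width = ratio.
            w = scale / np.sqrt(ratio)
            h = scale * np.sqrt(ratio)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

anchors = generate_anchors()
```

At inference these base anchors are shifted to every feature-map position, which is why the number of anchors is the number of positions times 9.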
Fast R-CNN
General steps:
(1) Use selective search to generate 1K to 2K region proposals.
(2) Put the image through the pretrained VGG16 network to get the corresponding feature map. The generated region proposals are projected onto the feature map to obtain feature vectors.
(3) Use an ROI pooling layer to reshape them to a fixed size. From each ROI feature vector, a softmax layer predicts the class of the proposed region along with the offset values for the bounding box.
Selective Search:
1. Generate an initial sub-segmentation, producing many small candidate regions.
2. Use a greedy algorithm to recursively combine similar regions into larger ones.
3. Use the generated regions to produce the final candidate region proposals.
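The greedy merging step (2) can be illustrated with a toy sketch. Here a region is reduced to a bounding box plus a normalized color histogram, and similarity is histogram intersection; the real algorithm also uses texture, size, and fill similarities, so this is only the shape of the idea.

```python
import numpy as np

def merge_similar_regions(regions, threshold=0.7):
    """Toy greedy merging: repeatedly merge the most similar pair of
    regions until no pair's similarity exceeds `threshold`. Every
    region (initial or merged) contributes a proposal box."""
    def similarity(a, b):
        # Histogram intersection of the two normalized color histograms.
        return np.minimum(a[1], b[1]).sum()

    def merge(a, b):
        (ax1, ay1, ax2, ay2), ha = a
        (bx1, by1, bx2, by2), hb = b
        box = (min(ax1, bx1), min(ay1, by1), max(ax2, bx2), max(ay2, by2))
        return box, (ha + hb) / 2  # averaged normalized histograms

    regions = list(regions)
    proposals = [r[0] for r in regions]
    while len(regions) > 1:
        pairs = [(similarity(regions[i], regions[j]), i, j)
                 for i in range(len(regions))
                 for j in range(i + 1, len(regions))]
        best, i, j = max(pairs)
        if best < threshold:
            break
        merged = merge(regions[i], regions[j])
        regions = [r for k, r in enumerate(regions) if k not in (i, j)]
        regions.append(merged)
        proposals.append(merged[0])  # merged regions are proposals too
    return proposals

regions = [((0, 0, 10, 10), np.array([1.0, 0.0])),
           ((5, 5, 15, 15), np.array([0.9, 0.1])),
           ((20, 20, 30, 30), np.array([0.0, 1.0]))]
proposals = merge_similar_regions(regions)
```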
RPN Multi-task Loss
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)
The first term is the classification loss L_cls, the second the bbox regression loss L_reg.
Parameters of the equation:
(1) p_i denotes the predicted probability that the i-th anchor is the true label.
(2) p_i* is 1 when the sample is positive and 0 otherwise.
(3) t_i denotes the bounding box regression parameters predicted for the i-th anchor.
(4) t_i* denotes the regression parameters of the corresponding ground truth box of the i-th anchor.
(5) N_cls denotes the number of all samples in a mini-batch, which is 256.
(6) N_reg denotes the number of anchor positions (not the number of anchors), which is about 2400.
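Under these definitions, the RPN loss can be sketched in PyTorch. This is a simplification: real implementations sample anchors into the mini-batch and work on logits rather than probabilities.

```python
import torch
import torch.nn.functional as F

def rpn_loss(p, labels, t, t_star, lam=10.0, n_reg=2400):
    """RPN multi-task loss sketch:
    (1/N_cls) sum BCE(p_i, p_i*) + lam * (1/N_reg) sum p_i* smoothL1(t_i, t_i*)
    `p`: objectness probabilities, `labels` (p_i*): 1 for positive
    anchors and 0 for negative ones, `t`/`t_star`: predicted and
    ground-truth box regression parameters."""
    cls_loss = F.binary_cross_entropy(p, labels)  # mean over N_cls samples
    reg_per_anchor = F.smooth_l1_loss(t, t_star, reduction="none").sum(dim=1)
    reg_loss = (labels * reg_per_anchor).sum() / n_reg  # positives only
    return cls_loss + lam * reg_loss

p = torch.tensor([0.9, 0.1])       # objectness for two anchors
labels = torch.tensor([1.0, 0.0])  # one positive, one negative
t = torch.zeros(2, 4)
t_star = torch.zeros(2, 4)
loss = rpn_loss(p, labels, t, t_star)
```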
RPN Multi-task Loss (loss terms)
Binary cross entropy: L_cls(p_i, p_i*) = -[p_i* log(p_i) + (1 - p_i*) log(1 - p_i)]
Bbox regression loss: L_reg(t_i, t_i*) = Σ_j smoothL1(t_i,j - t*_i,j), where smoothL1(x) = 0.5x² if |x| < 1, and |x| - 0.5 otherwise.
(1) t_i denotes the bounding box regression parameters predicted for the i-th anchor.
(2) t_i* denotes the regression parameters of the corresponding ground truth box of the i-th anchor.
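For reference, the smooth L1 function itself, in its standard form with beta = 1 as used by common Faster R-CNN implementations: quadratic near zero for stable gradients, linear for large errors so outliers do not dominate.

```python
import torch

def smooth_l1(x, beta=1.0):
    """Smooth L1 for the bbox regression term:
    0.5 * x**2 / beta  if |x| < beta,
    |x| - 0.5 * beta   otherwise."""
    absx = x.abs()
    return torch.where(absx < beta, 0.5 * x ** 2 / beta, absx - 0.5 * beta)
```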
Fast R-CNN Multi-task Loss
L(p, u, t^u, v) = L_cls(p, u) + λ [u ≥ 1] L_reg(t^u, v)
The first term is the classification loss L_cls, the second the bbox regression loss L_reg.
Parameters of the equation:
(1) p denotes the softmax probability distribution predicted by the classifier.
(2) u is the ground truth category label of the proposal.
(3) t^u denotes the regression parameters predicted by the bounding box regressor for the ground truth class u.
(4) v denotes the bounding box regression parameters of the ground truth.
(5) The classification loss L_cls is again the cross-entropy loss function.
(6) The bbox regression loss L_reg is the same smooth L1 loss as in the RPN.
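A sketch of this loss under the definitions above; it is simplified, since real implementations handle proposal sampling and the loss weighting differently.

```python
import torch
import torch.nn.functional as F

def fast_rcnn_loss(class_logits, labels, box_regression, box_targets,
                   lam=1.0):
    """Fast R-CNN multi-task loss sketch.
    `class_logits`: (N, num_classes) scores, `labels` (u): ground truth
    class per proposal (0 = background), `box_regression`: (N,
    num_classes, 4) class-specific box parameters, `box_targets` (v):
    (N, 4) ground truth regression parameters."""
    cls_loss = F.cross_entropy(class_logits, labels)
    # Regression only for foreground proposals ([u >= 1]), using the
    # box predicted for the ground-truth class u.
    fg = labels > 0
    if fg.any():
        t_u = box_regression[fg, labels[fg]]  # (num_fg, 4)
        reg_loss = F.smooth_l1_loss(t_u, box_targets[fg],
                                    reduction="sum") / labels.numel()
    else:
        reg_loss = torch.tensor(0.0)
    return cls_loss + lam * reg_loss

# Two proposals, two classes (background, cone); a near-perfect
# prediction gives a loss close to zero.
class_logits = torch.tensor([[10.0, -10.0], [-10.0, 10.0]])
labels = torch.tensor([0, 1])
box_regression = torch.zeros(2, 2, 4)
box_targets = torch.zeros(2, 4)
loss = fast_rcnn_loss(class_logits, labels, box_regression, box_targets)
```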
Advantages & Drawbacks
Advantages
Drawbacks
Experiment results
Figure: original images and their prediction results.
Result Comparison with YOLOv3
YOLOv3 test on video
Analysis:
(1) From the table and video results, both the Faster R-CNN and YOLOv3 models are effective on the traffic cone dataset.
(2) The Faster R-CNN in our project maintains a high recall rate while keeping precision relatively high.
(3) YOLOv3 sacrifices recall in order to ensure high precision.
(4) As a result, the Faster R-CNN model in our project performs better than YOLOv3.
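The precision/recall trade-off described here can be made concrete with hypothetical counts; these numbers are illustrative, not our measured results.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# A detector that proposes fewer, more confident boxes (the
# YOLOv3-like behaviour above) can raise precision while missing more
# cones, i.e. lowering recall.
p1, r1 = precision_recall(tp=90, fp=10, fn=10)  # high recall
p2, r2 = precision_recall(tp=70, fp=2, fn=30)   # higher precision
```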
Conclusion & Future Study
(1) In this project, we reproduced the code of the Faster R-CNN model to detect traffic cones in images.
(2) The Faster R-CNN model introduces RPNs for efficient and accurate region proposal generation.
(3) By sharing convolutional features with the downstream detection network, the region proposal step is nearly cost-free.
(4) The learned RPN also improves region proposal quality and thus overall object detection accuracy.
(5) We compared the results of the Faster R-CNN and YOLOv3 models and identified drawbacks to address in future work.
References
[1] J. Annis, D. Floyd, S. Fontes, and M. Navarrete. Parking Analysis via Image Processing. CSU Bakersfield, Bakersfield.
[2] Albumentations Documentation: What Is Image Augmentation. https://albumentations.ai/docs/introduction/image_augmentation/.
[3] L. Du et al. Overview of Two-Stage Object Detection Algorithms. ResearchGate, May 2020. https://www.researchgate.net/figure/Network-structure-diagram-of-Faster-R-CNN-Faster-R-CNN-is-mainly-divided-into-the_fig1_341871095/actions#reference. Accessed 15 July 2022.
[4] J. Freid. Frontiers: Visual Identity That Begins to Transform Advertising. Translated by Fred, 10 Sept. 2018. https://zhuanlan.zhihu.com/p/44239428.
[5] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In NIPS, 2015.
[6] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, et al. Selective Search for Object Recognition. International Journal of Computer Vision, 2013, 104(2): 154-171.