Cone Detection by Faster R-CNN
2022.7.30
Introduction
Object detection is a field of computer vision that takes videos or still images as input and runs them through algorithms in order to locate and classify objects such as people and cars.
Representative models include R-CNN, Fast R-CNN, Faster R-CNN, and YOLO.
Goal: train a Faster R-CNN model on a dataset of traffic cones to locate and classify cones within an image.
Cone Dataset
GitHub dataset: 123 annotated images
Data Augmentation: Albumentations API
Annotation files (five fields per line):
(1) First field: class label
(2) Second field: x-coordinate of the cone center
(3) Third field: y-coordinate of the cone center
(4) Fourth field: width of the cone box
(5) Fifth field: height of the cone box
An 80/20 percent split for training and validation.
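The annotation format and split above can be sketched in Python. This assumes the common YOLO-style convention that the coordinates are normalized to [0, 1]; the helper names are illustrative, not the project's actual code.

```python
import random

def parse_annotation_line(line):
    """Parse one annotation line: '<label> <cx> <cy> <w> <h>',
    with center/size values assumed normalized to [0, 1]."""
    parts = line.split()
    label = int(parts[0])
    cx, cy, w, h = (float(v) for v in parts[1:5])
    return label, cx, cy, w, h

def train_val_split(items, train_frac=0.8, seed=0):
    """Shuffle and split into 80% training / 20% validation."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]

label, cx, cy, w, h = parse_annotation_line("0 0.512 0.430 0.120 0.250")
train, val = train_val_split(range(123))  # the 123 annotated images
```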
Augmentations applied:
Horizontal flip
Median Blur
Crop
Contrast
Four steps to implement Faster R-CNN:
- CNN layers: pass the image through a pretrained VGG16 network to obtain the corresponding feature map.
- Region Proposal Network: use the RPN to generate region proposals. It has two tasks: classification and regression.
- ROI Pooling: combine the feature map and the region proposals to produce proposal feature maps.
- Classification: use the proposal feature maps to classify each region.
Faster R-CNN = RPN + Fast R-CNN
Region Proposal Network
(1) Generate anchors at every position of the feature map output by the convolution layers.
(2) Use softmax to determine whether each region is positive (likely contains an object) or negative (no object), based on texture, shape, and color features.
(3) Bounding box regression finds the translation and scaling parameters that fit each anchor more closely to the ground truth (the actual bounding box of the object).
(4) The proposal layer takes the positive anchors together with the regression parameters and outputs the precise dimensions of the proposed regions.
Anchors: a set of 9 bounding boxes per position, combining 3 scales with 3 aspect ratios.
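A minimal sketch of how the 9 base anchors can be generated, using the scale and aspect-ratio values from the Faster R-CNN paper; each anchor keeps an area of scale squared while its height/width ratio varies.

```python
import numpy as np

def generate_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return the 9 base anchors as (x1, y1, x2, y2) centered at the
    origin: 3 scales x 3 aspect ratios, each with area scale**2."""
    anchors = []
    for scale in scales:
        for ratio in ratios:
            # Keep area = scale**2 while setting height/width = ratio.
            w = scale / np.sqrt(ratio)
            h = scale * np.sqrt(ratio)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

anchors = generate_anchors()
```

At inference these base anchors are shifted to every feature-map position, which is why the number of anchors is the number of positions times 9.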
Fast R-CNN
General steps:
(1) Use selective search to generate 1K to 2K region proposals.
(2) Put the image through the pretrained VGG16 network to get the corresponding feature map. The generated region proposals are projected onto the feature map to obtain feature vectors.
(3) Use an ROI pooling layer to reshape them to a fixed size. From each ROI feature vector, a softmax layer predicts the class of the proposed region along with the offset values for the bounding box.
Selective Search:
1. Generate an initial sub-segmentation, producing many small candidate regions.
2. Use a greedy algorithm to recursively combine similar regions into larger ones.
3. Use the generated regions to produce the final candidate region proposals.
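The greedy merging step (2) can be illustrated with a toy sketch. Here a region is reduced to a bounding box plus a normalized color histogram, and similarity is histogram intersection; the real algorithm also uses texture, size, and fill similarities, so this is only the shape of the idea.

```python
import numpy as np

def merge_similar_regions(regions, threshold=0.7):
    """Toy greedy merging: repeatedly merge the most similar pair of
    regions until no pair's similarity exceeds `threshold`. Every
    region (initial or merged) contributes a proposal box."""
    def similarity(a, b):
        # Histogram intersection of the two normalized color histograms.
        return np.minimum(a[1], b[1]).sum()

    def merge(a, b):
        (ax1, ay1, ax2, ay2), ha = a
        (bx1, by1, bx2, by2), hb = b
        box = (min(ax1, bx1), min(ay1, by1), max(ax2, bx2), max(ay2, by2))
        return box, (ha + hb) / 2  # averaged normalized histograms

    regions = list(regions)
    proposals = [r[0] for r in regions]
    while len(regions) > 1:
        pairs = [(similarity(regions[i], regions[j]), i, j)
                 for i in range(len(regions))
                 for j in range(i + 1, len(regions))]
        best, i, j = max(pairs)
        if best < threshold:
            break
        merged = merge(regions[i], regions[j])
        regions = [r for k, r in enumerate(regions) if k not in (i, j)]
        regions.append(merged)
        proposals.append(merged[0])  # merged regions are proposals too
    return proposals

regions = [((0, 0, 10, 10), np.array([1.0, 0.0])),
           ((5, 5, 15, 15), np.array([0.9, 0.1])),
           ((20, 20, 30, 30), np.array([0.0, 1.0]))]
proposals = merge_similar_regions(regions)
```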
RPN Multi-task Loss
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)
The first term is the classification loss L_cls, the second the bbox regression loss L_reg.
Parameters of the equation:
(1) p_i denotes the predicted probability that the i-th anchor is the true label.
(2) p_i* is 1 when the sample is positive and 0 otherwise.
(3) t_i denotes the bounding box regression parameters predicted for the i-th anchor.
(4) t_i* denotes the regression parameters of the corresponding ground truth box of the i-th anchor.
(5) N_cls denotes the number of all samples in a mini-batch, which is 256.
(6) N_reg denotes the number of anchor positions (not the number of anchors), which is about 2400.
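Under these definitions, the RPN loss can be sketched in PyTorch. This is a simplification: real implementations sample anchors into the mini-batch and work on logits rather than probabilities.

```python
import torch
import torch.nn.functional as F

def rpn_loss(p, labels, t, t_star, lam=10.0, n_reg=2400):
    """RPN multi-task loss sketch:
    (1/N_cls) sum BCE(p_i, p_i*) + lam * (1/N_reg) sum p_i* smoothL1(t_i, t_i*)
    `p`: objectness probabilities, `labels` (p_i*): 1 for positive
    anchors and 0 for negative ones, `t`/`t_star`: predicted and
    ground-truth box regression parameters."""
    cls_loss = F.binary_cross_entropy(p, labels)  # mean over N_cls samples
    reg_per_anchor = F.smooth_l1_loss(t, t_star, reduction="none").sum(dim=1)
    reg_loss = (labels * reg_per_anchor).sum() / n_reg  # positives only
    return cls_loss + lam * reg_loss

p = torch.tensor([0.9, 0.1])       # objectness for two anchors
labels = torch.tensor([1.0, 0.0])  # one positive, one negative
t = torch.zeros(2, 4)
t_star = torch.zeros(2, 4)
loss = rpn_loss(p, labels, t, t_star)
```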
RPN Multi-task Loss (loss terms)
Binary cross entropy: L_cls(p_i, p_i*) = -[p_i* log(p_i) + (1 - p_i*) log(1 - p_i)]
Bbox regression loss: L_reg(t_i, t_i*) = Σ_j smoothL1(t_i,j - t*_i,j), where smoothL1(x) = 0.5x² if |x| < 1, and |x| - 0.5 otherwise.
(1) t_i denotes the bounding box regression parameters predicted for the i-th anchor.
(2) t_i* denotes the regression parameters of the corresponding ground truth box of the i-th anchor.
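For reference, the smooth L1 function itself, in its standard form with beta = 1 as used by common Faster R-CNN implementations: quadratic near zero for stable gradients, linear for large errors so outliers do not dominate.

```python
import torch

def smooth_l1(x, beta=1.0):
    """Smooth L1 for the bbox regression term:
    0.5 * x**2 / beta  if |x| < beta,
    |x| - 0.5 * beta   otherwise."""
    absx = x.abs()
    return torch.where(absx < beta, 0.5 * x ** 2 / beta, absx - 0.5 * beta)
```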
Fast R-CNN Multi-task Loss
L(p, u, t^u, v) = L_cls(p, u) + λ [u ≥ 1] L_reg(t^u, v)
The first term is the classification loss L_cls, the second the bbox regression loss L_reg.
Parameters of the equation:
(1) p denotes the softmax probability distribution predicted by the classifier.
(2) u is the ground truth category label of the proposal.
(3) t^u denotes the regression parameters predicted by the bounding box regressor for the ground truth class u.
(4) v denotes the bounding box regression parameters of the ground truth.
(5) The classification loss L_cls is again the cross-entropy loss function.
(6) The bbox regression loss L_reg is the same smooth L1 loss as in the RPN.
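A sketch of this loss under the definitions above; it is simplified, since real implementations handle proposal sampling and the loss weighting differently.

```python
import torch
import torch.nn.functional as F

def fast_rcnn_loss(class_logits, labels, box_regression, box_targets,
                   lam=1.0):
    """Fast R-CNN multi-task loss sketch.
    `class_logits`: (N, num_classes) scores, `labels` (u): ground truth
    class per proposal (0 = background), `box_regression`: (N,
    num_classes, 4) class-specific box parameters, `box_targets` (v):
    (N, 4) ground truth regression parameters."""
    cls_loss = F.cross_entropy(class_logits, labels)
    # Regression only for foreground proposals ([u >= 1]), using the
    # box predicted for the ground-truth class u.
    fg = labels > 0
    if fg.any():
        t_u = box_regression[fg, labels[fg]]  # (num_fg, 4)
        reg_loss = F.smooth_l1_loss(t_u, box_targets[fg],
                                    reduction="sum") / labels.numel()
    else:
        reg_loss = torch.tensor(0.0)
    return cls_loss + lam * reg_loss

# Two proposals, two classes (background, cone); a near-perfect
# prediction gives a loss close to zero.
class_logits = torch.tensor([[10.0, -10.0], [-10.0, 10.0]])
labels = torch.tensor([0, 1])
box_regression = torch.zeros(2, 2, 4)
box_targets = torch.zeros(2, 4)
loss = fast_rcnn_loss(class_logits, labels, box_regression, box_targets)
```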
Advantages & Drawbacks
Advantages
Drawbacks
Experiment results
Figure: original images and their prediction results.
Result Comparison with YOLOv3
YOLOv3 test on video
Analysis:
(1) From the table and video results, both the Faster R-CNN and YOLOv3 models are effective on the traffic cone dataset.
(2) The Faster R-CNN in our project maintains a high recall rate while keeping precision relatively high.
(3) YOLOv3 sacrifices recall in order to ensure high precision.
(4) As a result, the Faster R-CNN model in our project performs better than YOLOv3.
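The precision/recall trade-off described here can be made concrete with hypothetical counts; these numbers are illustrative, not our measured results.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# A detector that proposes fewer, more confident boxes (the
# YOLOv3-like behaviour above) can raise precision while missing more
# cones, i.e. lowering recall.
p1, r1 = precision_recall(tp=90, fp=10, fn=10)  # high recall
p2, r2 = precision_recall(tp=70, fp=2, fn=30)   # higher precision
```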
Conclusion & Future Study
(1) In this project, we reproduced the code of the Faster R-CNN model to detect traffic cones in images.
(2) The Faster R-CNN model introduces RPNs for efficient and accurate region proposal generation.
(3) By sharing convolutional features with the downstream detection network, the region proposal step is nearly cost-free.
(4) The learned RPN also improves region proposal quality and thus overall object detection accuracy.
(5) We compared the results of the Faster R-CNN and YOLOv3 models and identified drawbacks to address in future work.
References
[1] J. Annis, D. Floyd, S. Fontes, and M. Navarrete. Parking Analysis via Image Processing. CSU Bakersfield, Bakersfield.
[2] Albumentations Documentation: What Is Image Augmentation. https://albumentations.ai/docs/introduction/image_augmentation/.
[3] L. Du et al. Overview of Two-Stage Object Detection Algorithms. ResearchGate, May 2020. https://www.researchgate.net/figure/Network-structure-diagram-of-Faster-R-CNN-Faster-R-CNN-is-mainly-divided-into-the_fig1_341871095/actions#reference. Accessed 15 July 2022.
[4] J. Freid. Frontiers: Visual Identity That Begins to Transform Advertising. Translated by Fred, 10 Sept. 2018. https://zhuanlan.zhihu.com/p/44239428.
[5] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In NIPS, 2015.
[6] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, et al. Selective Search for Object Recognition. International Journal of Computer Vision, 2013, 104(2): 154-171.