
Baseline Assessment of Object Detection Models of Partially Occluded Objects and their Parts

Darius Jefferson II, Dr. Andre Harrison

Presented by Darius Jefferson II (Computer Scientist)

DEVCOM Army Research Laboratory

OCT 2022

PUBLIC RELEASE

MOTIVATION

  • Popular object detection models are starting to be incorporated into military technology for use on the battlefield
  • These models can improve Soldiers’ situational awareness by identifying hazardous objects
  • Problems arise, however, when these models attempt to identify partially occluded objects
  • Occlusion is common in the real world, yet most research does not directly account for it
    • Battlefields especially provide numerous factors (weather, explosions, etc.) that can cause occlusion
  • It is imperative to determine the effectiveness of current object detection models on occluded objects to ensure Soldier safety


APPROACH

  • Aim: Perform a baseline assessment of 3 state-of-the-art object detection models on 2 popular object recognition datasets containing partially occluded objects
  • Models
    • Gonzalez-Garcia (GG)
    • Faster R-CNN
    • YOLOv5
  • Datasets
    • PASCAL VOC 2010
    • PASCAL-Part


Datasets


PASCAL VOC 2010

  • The Pattern Analysis, Statistical Modeling and Computational Learning Visual Object Classes (PASCAL VOC) datasets provide images standardized for object class recognition
  • All VOC datasets were made for a series of object recognition competitions (2005-2012)
  • This experiment specifically uses VOC 2010
  • VOC 2010 contains 20 object classes, mainly animals, vehicles, and common household items
  • VOC 2010 is annotated with XML files containing a variety of tags that describe the objects in a scene. Of particular interest are the occluded and difficult tags (see the parsing sketch after this list)
    • The occluded tag contains a binary value indicating whether or not an object is occluded
    • The difficult tag contains a binary value indicating whether an object is considered “difficult” to detect. The Gonzalez-Garcia model and the original PASCAL VOC challenges intentionally skipped difficult objects
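
As a concrete illustration, here is a minimal sketch of reading these two tags with Python's standard library. The annotation path is hypothetical; the tag names follow the VOC XML schema described above.

    # Minimal sketch: reading the occluded and difficult tags from a VOC 2010
    # XML annotation. The file path is hypothetical.
    import xml.etree.ElementTree as ET

    tree = ET.parse("VOC2010/Annotations/2010_000001.xml")  # hypothetical path
    for obj in tree.getroot().iter("object"):
        name = obj.findtext("name")
        occluded = obj.findtext("occluded") == "1"
        difficult = obj.findtext("difficult") == "1"
        print(f"{name}: occluded={occluded}, difficult={difficult}")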


OCCLUDED & DIFFICULT EXAMPLES


PASCAL-PART

  • PASCAL-Part dataset extends VOC 2010 by breaking down most of VOC 2010’s object classes into their semantic parts and annotating them
  • Semantic parts are those sections of an object that can be easily recognized and described by humans
  • PASCAL-Part contains 105 part classes
  • 2 major differences from VOC 2010:
    • PASCAL-Part is annotated with MATLAB (.mat) files rather than XML (see the loading sketch below)
    • Part annotations do not denote occlusion

Object/Parts Semantic Hierarchy (figure): Person → head, torso, arm, leg; Bicycle → wheel, saddle, handlebar, chainwheel
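
As an illustration, here is a minimal sketch of inspecting one of these MATLAB annotation files from Python. The field names ('anno', 'objects', 'parts', 'part_name') follow the published PASCAL-Part devkit but should be treated as assumptions here, and the path is hypothetical.

    # Minimal sketch: listing the annotated semantic parts of each object in a
    # PASCAL-Part .mat annotation. Field names are assumptions from the devkit.
    import numpy as np
    from scipy.io import loadmat

    anno = loadmat("Annotations_Part/2010_000001.mat", simplify_cells=True)["anno"]
    for obj in np.atleast_1d(anno["objects"]):
        part_names = [p["part_name"] for p in np.atleast_1d(obj["parts"])]
        print(obj["class"], "->", part_names)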


Object Detection Models


GONZALEZ-GARCIA (GG) MODEL

  • MATLAB-based model that performs object/parts detection
  • Uses Fast R-CNN as a backbone
  • Input images are sent through convolutional layers, after which Region-of-Interest (RoI) pooling creates 2 sets of proposals (object and part) per image
    • Features from the object proposals are sent to the object class and object appearance branches
      • Object class branch uses object class to infer the presence of its parts
      • Object appearance branch uses object viewpoint to infer the presence of parts
    • Features from the part proposals are sent to the part appearance branch and the relative location branch
      • The part proposals sent to the relative location branch are scored by OffsetNet according to their locations within their corresponding object
  • The outputs of the part appearance, object class, and object appearance branches are combined to form 1 unified part representation
    • This representation is scored and then fed into a bounding-box regression layer
    • Simultaneously, it is also combined with the classification scores from the relative location branch to form the model’s final outputs

Overview of the GG model (from Gonzalez-Garcia et al., see Reference 1)
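
To make the fusion step concrete, below is a schematic sketch in Python (not the authors' MATLAB code) of how the four branch outputs could combine into final part scores. The additive form and the weights are illustrative assumptions, not the published formulation.

    import numpy as np

    # Schematic sketch only: one plausible fusion of the GG model's branch
    # outputs into final per-class part scores. Weights are illustrative.
    def fuse_part_scores(part_appearance, obj_class_prior, obj_appearance_prior,
                         relative_location, w=(1.0, 0.5, 0.5, 0.5)):
        # Unified part representation: part appearance modulated by what the
        # object's class and appearance imply about which parts are present.
        unified = (w[0] * np.asarray(part_appearance)
                   + w[1] * np.asarray(obj_class_prior)
                   + w[2] * np.asarray(obj_appearance_prior))
        # The unified scores feed bounding-box regression (not shown) and are
        # combined with the relative-location (OffsetNet) scores at the end.
        return unified + w[3] * np.asarray(relative_location)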


FASTER R-CNN

  • Python/PyTorch-based object detection model
  • Implementation obtained from Facebook’s Detectron2 model zoo, as the original implementation has been deprecated
  • All model implementations within Detectron2’s model zoo were designed with parallel computing in mind (for training)
  • Composed of two modules: a Region Proposal Network (RPN) and Fast R-CNN
    • The RPN tells Fast R-CNN where to look in an image
  • An image is sent through convolutional layers to generate feature maps; the RPN uses these maps to propose candidate regions, which Fast R-CNN then classifies and refines into detections

Overview of Faster R-CNN (from Ren et al., see Reference 8)
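
For reference, here is a minimal sketch of loading a Faster R-CNN from the Detectron2 model zoo. The particular config file is illustrative, since the slides do not state which variant was used.

    from detectron2 import model_zoo
    from detectron2.config import get_cfg
    from detectron2.engine import DefaultPredictor

    # Minimal sketch: a pretrained Faster R-CNN from Detectron2's model zoo.
    # The config name is illustrative, not necessarily the variant used here.
    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file(
        "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
        "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
    predictor = DefaultPredictor(cfg)
    # outputs = predictor(image_bgr)  # image_bgr: HxWx3 uint8 array, BGR order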


YOLOV5

  • Python/PyTorch-based object detection model
  • A successor to YOLOv3 and the earlier YOLO series of object detection models
  • One of the first YOLO models implemented in Python/PyTorch rather than the C-based Darknet framework
  • YOLO models use a single neural network that divides the input image into a grid
    • Each grid cell is predicted upon and weighted independently of the others
    • This contrasts with other models that repeatedly apply their networks at multiple regions and scales within an image
  • Offers several model architecture variants (differing in size and speed) for training
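
As an illustration, here is a minimal sketch of loading a pretrained YOLOv5 model through torch.hub, the interface documented in the Ultralytics repository. The 'yolov5s' variant is an illustrative choice, not necessarily the one used in this work.

    import torch

    # Minimal sketch: load a pretrained YOLOv5 model via torch.hub.
    model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
    # results = model("image.jpg")  # accepts a path, URL, PIL image, or array
    # results.print()               # summarize the detections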


OBJECT DETECTION METRICS

  • Several standard object detection metrics were used in this experiment
  • Precision measures how often a model is correct when it predicts the positive class (the class you are trying to predict)
  • Recall measures the fraction of all actual positive instances that the model correctly detected (both the objects it found and those it missed entirely)
  • Positive predictions are determined using intersection-over-union (IoU)
    • IoU is the percentage overlap between a predicted object’s bounding box and its ground-truth bounding box (intersection area divided by union area)
    • In this experiment, predictions were considered “correct” at 50% IoU (AP50)
  • Average precision (AP) is computed as the area under the precision-recall curve (the definition used in this work)
    • AP50 was calculated for each object class to indicate its accuracy
  • Mean average precision (mAP) is the mean of the APs across all object classes and is used to indicate overall model performance (see the sketch after this list)
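
To make these definitions concrete, here is a minimal sketch of the underlying computations (box format and names are illustrative).

    # Minimal sketch of the metrics above. Boxes are axis-aligned
    # (x1, y1, x2, y2) tuples; all names are illustrative.
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union

    def precision(tp, fp):
        return tp / (tp + fp)  # correct positives / all predicted positives

    def recall(tp, fn):
        return tp / (tp + fn)  # correct positives / all actual positives

    # At AP50, a prediction is a true positive when iou(pred, truth) >= 0.5;
    # mAP is then the mean of the per-class APs (areas under the PR curves).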


Experiment


EXPERIMENT

  • The main objective was to determine the effectiveness of each of the 3 object detection models on the VOC 2010 and PASCAL-Part datasets
  • 1st half:
    • Evaluated the models’ overall performance
      • Trained on VOC 2010 and tested on its validation set
    • Investigated the models’ effectiveness on partially occluded objects
      • Using the occluded and difficult tags, objects were evaluated across breakout categories (see the sketch after this list):
        • Aggregate (validation set, no difficult objects)
        • Occluded & Difficult
        • Occluded & Non-Difficult
        • Unoccluded & Difficult
        • Unoccluded & Non-Difficult
  • 2nd half:
    • Determined the models’ effectiveness on the semantic parts of objects
      • Trained on PASCAL-Part and tested on its validation set
      • Occlusion was not evaluated because PASCAL-Part does not annotate part occlusion
      • Each model treats parts as though they were whole objects
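
A minimal sketch of how each annotated object maps to a breakout category from its two tags (the function name is illustrative):

    # Minimal sketch: assign an object to one of the four breakout categories.
    def breakout_category(occluded: bool, difficult: bool) -> str:
        occ = "Occluded" if occluded else "Unoccluded"
        diff = "Difficult" if difficult else "Non-Difficult"
        return f"{occ} & {diff}"

    # e.g., breakout_category(True, False) -> "Occluded & Non-Difficult"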


Results


OBJECT CLASS RESULTS - AGGREGATE

Aggregate mAPs

GG: 55.09%

Faster R-CNN: 68.56%

YOLOv5: 67.92%


OBJECT CLASS RESULTS – BREAKOUTS 1

Occluded & Difficult mAPs

GG: 0.63%

Faster R-CNN: 3.59%

YOLOv5: 2.51%


OBJECT CLASS RESULTS – BREAKOUTS 2

Occluded & Non-Difficult mAPs

GG: 31.83%

Faster R-CNN: 50.84%

YOLOv5: 47.63%


OBJECT CLASS RESULTS – BREAKOUTS 3

Unoccluded & Difficult mAPs

GG: 1.59%

Faster R-CNN: 6.64%

YOLOv5: 4.10%


OBJECT CLASS RESULTS – BREAKOUTS 4

Unoccluded & Non-Difficult mAPs

GG: 57.24%

Faster R-CNN: 71.83%

YOLOv5: 73.36%


PARTS CLASS RESULTS

Parts Class mAPs

GG: 26.12%

Faster R-CNN: 22.78%

YOLOv5: 36.25%


Conclusion


CONCLUSION AND FUTURE WORK

  • All 3 object detection models showed strong overall performance
  • Faster R-CNN performed the best on partially occluded objects out of the 3
  • All 3 models faltered on difficult objects, whether occluded or unoccluded
  • YOLOv5 performed the best on semantic parts detection
    • The other 2 models struggled
    • YOLOv5 could be modified to use its parts detections to infer the presence of the parent object, potentially improving performance on occluded objects
  • In the future, these models’ training sets will be augmented with synthetic data to see if their accuracy on real-world objects improves
    • Synthetic images will include examples of occluded/difficult objects and their semantic parts
  • Additionally, attributes from these models will be incorporated into a new, custom parts detection model


REFERENCES 1

  1. Gonzalez-Garcia A, Modolo D, Ferrari V. Objects as context for detecting their semantic parts. arXiv.org; 2017 [accessed 2021 Sep 10]. https://arxiv.org/abs/1703.09529.
  2. Wu Y, Massa F, Girshick R, Kirillov A, Lo W-Y. Detectron2: A PyTorch-based modular object detection library. Facebook AI Research; 2019 Oct 10 [accessed 2021 Sep 10]. https://ai.facebook.com/blog/-detectron2-a-pytorch-based-modular-object-detection-library-/.
  3. Jocher G, Stoken A, Chaurasia A, Borovec J, NanoCode012, TaoXie, Kwon Y, Michael K, Changyu L, Fang J, et al. ultralytics/yolov5: v6.0 - YOLOv5n 'Nano' models, Roboflow integration, TensorFlow export, OpenCV DNN support. Ultralytics; 2021 Oct [accessed 2022 Jan 28]. https://github.com/ultralytics/yolov5.
  4. Everingham M, van Gool L, Williams C, Winn J, Zisserman A. The PASCAL visual object classes challenge 2010 (VOC 2010) results; 2010 [accessed 2021 Sep 10]. http://host.robots.ox.ac.uk/pascal/VOC/voc2010/.
  5. Chen X, Mottaghi R, Liu X, Fidler S, Urtasun R, Yuille A. The PASCAL-Part Dataset; 2014 [accessed 2022 Jun 13]. http://roozbehm.info/pascal-parts/pascal-parts.html.
  6. Gonzalez-Garcia A, Modolo D, Ferrari V. Do semantic parts emerge in convolutional neural networks? International Journal of Computer Vision. 2017 Oct;126(5):476–494.
  7. Girshick R. Fast R-CNN. arXiv.org; 2015 Sep 27 [accessed 2021 Sep 14]. https://arxiv.org/abs/1504.08083.
  8. Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. arXiv.org; 2016 Jan 6 [accessed 2021 Sep 14]. https://arxiv.org/abs/1506.01497.
  9. He K, Gkioxari G, Dollar P, Girshick R. Mask R-CNN. arXiv.org; 2018 Jan 24 [accessed 2021 Sep 14]. https://arxiv.org/abs/1703.06870.
  10. Lin T-Y, Goyal P, Girshick R, He K, Dollar P. Focal loss for dense object detection. arXiv.org; 2018 Feb 7 [accessed 2021 Sep 14]. https://arxiv.org/abs/1708.02002.
  11. Guler RA, Neverova N, Kokkinos I. DensePose: dense human pose estimation in the wild. arXiv.org; 2018 Feb 1 [accessed 2021 Sep 14]. https://arxiv.org/abs/1802.00434.


REFERENCES 2

  1. Cai Z, Vasconcelos N. Cascade R-CNN: delving into high quality object detection. arXiv.org; 2017 Dec 3 [accessed 2021 Sep 14]. https://arxiv.org/abs/1712.00726.
  2. Kirillov A, Girshick R, He K, Dollar P. Panoptic feature pyramid networks. arXiv.org; 2019 Apr 10 [accessed 2021 Sep 14]. https://arxiv.org/abs/1901.02446.
  3. Chen X, Girshick R, He K, Dollar P. TensorMask: a foundation for dense object segmentation. arXiv.org; 2019 Aug 27 [accessed 2021 Sep 14]. https://arxiv.org/abs/1903.12174.
  4. Redmon J. YOLO: Real-time object detection; 2018 [accessed 2021 Sep 10]. https://pjreddie.com/darknet/yolo/.
  5. YOLOv5 Documentation; 2021 [accessed 2022 Jan 28]. https://docs.ultralytics.com.
  6. Bochkovskiy A, Wang C-Y, Liao H-YM. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv.org; 2020 Apr 23 [accessed 2022 Sep 15]. https://arxiv.org/abs/2004.10934.
  7. Everingham M, Winn J. devkit_doc.avi; 2010 May 8 [accessed 2021 Sep 10]. http://host.robots.ox.ac.uk/pascal/VOC/voc2010/devkit_doc_08-May-2010.pdf.
  8. Google. Machine Learning Glossary. developers.google.com; [accessed 2022 Sep 14]. https://developers.google.com/machine-learning/glossary.
