
Baseline Assessment of Object Detection Models of Partially Occluded Objects and their Parts

Darius Jefferson II, Dr. Andre Harrison

Presented by Darius Jefferson II (Computer Scientist)

DEVCOM Army Research Laboratory

OCT 2022

PUBLIC RELEASE

MOTIVATION

  • Popular object detection models are starting to be incorporated into military technology for use on the battlefield
  • These models can improve Soldiers’ situational awareness by identifying hazardous objects
  • Problems arise, however, when these models attempt to identify partially occluded objects
  • Occlusion is common in the real world, yet most research does not directly account for it
    • Battlefields especially provide numerous factors (weather, explosions, etc.) that can cause occlusion
  • It is imperative to determine the effectiveness of current object detection models on occluded objects to ensure Soldier safety


APPROACH

  • Aim: Perform a baseline assessment of 3 state-of-the-art object detection models on 2 popular object recognition datasets containing partially occluded objects
  • Models
    • Gonzalez-Garcia (GG)
    • Faster R-CNN
    • YOLOv5
  • Datasets
    • PASCAL VOC 2010
    • PASCAL-Part


Datasets


PASCAL VOC 2010

  • The Pattern Analysis, Statistical Modeling and Computational Learning Visual Object Classes (PASCAL VOC) datasets provide images standardized for object class recognition
  • All VOC datasets were made for a series of object recognition competitions (2005-2012)
  • This experiment specifically uses VOC 2010
  • VOC 2010 contains 20 object classes, mainly animals, vehicles, and common household items
  • VOC 2010 is annotated with XML files containing a variety of tags that describe the objects in a scene. Of particular interest are the occluded and difficult tags (see the parsing sketch after this list)
    • The occluded tag contains a binary value indicating whether or not an object is occluded
    • The difficult tag contains a binary value indicating whether an object is considered “difficult” to detect. The Gonzalez-Garcia model and the original PASCAL VOC challenges intentionally skipped difficult objects
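
As a concrete illustration, here is a minimal sketch of reading these two tags with Python's standard library. The annotation path is hypothetical; the tag names follow the VOC XML schema described above.

    # Minimal sketch: reading the occluded and difficult tags from a VOC 2010
    # XML annotation. The file path is hypothetical.
    import xml.etree.ElementTree as ET

    tree = ET.parse("VOC2010/Annotations/2010_000001.xml")  # hypothetical path
    for obj in tree.getroot().iter("object"):
        name = obj.findtext("name")
        occluded = obj.findtext("occluded") == "1"
        difficult = obj.findtext("difficult") == "1"
        print(f"{name}: occluded={occluded}, difficult={difficult}")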


OCCLUDED & DIFFICULT EXAMPLES


PASCAL-PART

  • PASCAL-Part dataset extends VOC 2010 by breaking down most of VOC 2010’s object classes into their semantic parts and annotating them
  • Semantic parts are those sections of an object that can be easily recognized and described by humans
  • PASCAL-Part contains 105 part classes
  • 2 major differences from VOC 2010:
    • PASCAL-Part is annotated with MATLAB (.mat) files rather than XML (see the loading sketch below)
    • Part annotations do not denote occlusion

Object/Parts Semantic Hierarchy (figure): Person → head, torso, arm, leg; Bicycle → wheel, saddle, handlebar, chainwheel
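
As an illustration, here is a minimal sketch of inspecting one of these MATLAB annotation files from Python. The field names ('anno', 'objects', 'parts', 'part_name') follow the published PASCAL-Part devkit but should be treated as assumptions here, and the path is hypothetical.

    # Minimal sketch: listing the annotated semantic parts of each object in a
    # PASCAL-Part .mat annotation. Field names are assumptions from the devkit.
    import numpy as np
    from scipy.io import loadmat

    anno = loadmat("Annotations_Part/2010_000001.mat", simplify_cells=True)["anno"]
    for obj in np.atleast_1d(anno["objects"]):
        part_names = [p["part_name"] for p in np.atleast_1d(obj["parts"])]
        print(obj["class"], "->", part_names)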


Object Detection Models


GONZALEZ-GARCIA (GG) MODEL

  • MATLAB-based model that performs object/parts detection
  • Uses Fast R-CNN as a backbone
  • Input images are sent through convolutional layers, after which Region-of-Interest (RoI) pooling creates 2 sets of proposals (object and part) per image
    • Features from the object proposals are sent to the object class and object appearance branches
      • Object class branch uses object class to infer the presence of its parts
      • Object appearance branch uses object viewpoint to infer the presence of parts
    • Features from the part proposals are sent to the part appearance branch and the relative location branch
      • The part proposals sent to the relative location branch are scored by OffsetNet according to their locations within their corresponding object
  • The outputs of the part appearance, object class, and object appearance branches are combined to form 1 unified part representation
    • This representation is scored and then fed into a bounding-box regression layer
    • Simultaneously, it is also combined with the classification scores from the relative location branch to form the model’s final outputs

Overview of the GG model (from Gonzalez-Garcia et al., see Reference 1)
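
To make the fusion step concrete, below is a schematic sketch in Python (not the authors' MATLAB code) of how the four branch outputs could combine into final part scores. The additive form and the weights are illustrative assumptions, not the published formulation.

    import numpy as np

    # Schematic sketch only: one plausible fusion of the GG model's branch
    # outputs into final per-class part scores. Weights are illustrative.
    def fuse_part_scores(part_appearance, obj_class_prior, obj_appearance_prior,
                         relative_location, w=(1.0, 0.5, 0.5, 0.5)):
        # Unified part representation: part appearance modulated by what the
        # object's class and appearance imply about which parts are present.
        unified = (w[0] * np.asarray(part_appearance)
                   + w[1] * np.asarray(obj_class_prior)
                   + w[2] * np.asarray(obj_appearance_prior))
        # The unified scores feed bounding-box regression (not shown) and are
        # combined with the relative-location (OffsetNet) scores at the end.
        return unified + w[3] * np.asarray(relative_location)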


FASTER R-CNN

  • Python/PyTorch-based object detection model
  • Implementation obtained from Facebook’s Detectron2 model zoo, as the original implementation has been deprecated
  • All model implementations within Detectron2’s model zoo were designed with parallel computing in mind (for training)
  • Composed of two modules: a Region Proposal Network (RPN) and Fast R-CNN
    • The RPN tells Fast R-CNN where to look in an image
  • An image is sent through convolutional layers to generate feature maps; the RPN uses these maps to propose candidate regions, which Fast R-CNN then classifies and refines into detections

Overview of Faster R-CNN (from Ren et al., see Reference 8)
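
For reference, here is a minimal sketch of loading a Faster R-CNN from the Detectron2 model zoo. The particular config file is illustrative, since the slides do not state which variant was used.

    from detectron2 import model_zoo
    from detectron2.config import get_cfg
    from detectron2.engine import DefaultPredictor

    # Minimal sketch: a pretrained Faster R-CNN from Detectron2's model zoo.
    # The config name is illustrative, not necessarily the variant used here.
    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file(
        "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
        "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
    predictor = DefaultPredictor(cfg)
    # outputs = predictor(image_bgr)  # image_bgr: HxWx3 uint8 array, BGR order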


YOLOV5

  • Python/PyTorch-based object detection model
  • A successor to YOLOv3 and the earlier YOLO series of object detection models
  • One of the first YOLO models implemented in Python/PyTorch rather than the C-based Darknet framework
  • YOLO models use a single neural network that divides the input image into a grid
    • Each grid cell is predicted upon and weighted independently of the others
    • This contrasts with other models that repeatedly apply their networks at multiple regions and scales within an image
  • Offers several model architecture variants (differing in size and speed) for training
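
As an illustration, here is a minimal sketch of loading a pretrained YOLOv5 model through torch.hub, the interface documented in the Ultralytics repository. The 'yolov5s' variant is an illustrative choice, not necessarily the one used in this work.

    import torch

    # Minimal sketch: load a pretrained YOLOv5 model via torch.hub.
    model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
    # results = model("image.jpg")  # accepts a path, URL, PIL image, or array
    # results.print()               # summarize the detections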


OBJECT DETECTION METRICS

  • Several standard object detection metrics were used in this experiment
  • Precision measures how often a model is correct when it predicts the positive class (the class you are trying to predict)
  • Recall measures the fraction of all actual positive instances that the model correctly detected (both the objects it found and those it missed entirely)
  • Positive predictions are determined using intersection-over-union (IoU)
    • IoU is the percentage overlap between a predicted object’s bounding box and its ground-truth bounding box (intersection area divided by union area)
    • In this experiment, predictions were considered “correct” at 50% IoU (AP50)
  • Average precision (AP) is computed as the area under the precision-recall curve (the definition used in this work)
    • AP50 was calculated for each object class to indicate its accuracy
  • Mean average precision (mAP) is the mean of the APs across all object classes and is used to indicate overall model performance (see the sketch after this list)
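
To make these definitions concrete, here is a minimal sketch of the underlying computations (box format and names are illustrative).

    # Minimal sketch of the metrics above. Boxes are axis-aligned
    # (x1, y1, x2, y2) tuples; all names are illustrative.
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union

    def precision(tp, fp):
        return tp / (tp + fp)  # correct positives / all predicted positives

    def recall(tp, fn):
        return tp / (tp + fn)  # correct positives / all actual positives

    # At AP50, a prediction is a true positive when iou(pred, truth) >= 0.5;
    # mAP is then the mean of the per-class APs (areas under the PR curves).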


Experiment


EXPERIMENT

  • The main objective was to determine the effectiveness of each of the 3 object detection models on the VOC 2010 and PASCAL-Part datasets
  • 1st half:
    • Evaluated the models’ overall performance
      • Trained on VOC 2010 and tested on its validation set
    • Investigated the models’ effectiveness on partially occluded objects
      • Using the occluded and difficult tags, objects were evaluated across breakout categories (see the sketch after this list):
        • Aggregate (validation set, no difficult objects)
        • Occluded & Difficult
        • Occluded & Non-Difficult
        • Unoccluded & Difficult
        • Unoccluded & Non-Difficult
  • 2nd half:
    • Determined the models’ effectiveness on the semantic parts of objects
      • Trained on PASCAL-Part and tested on its validation set
      • Occlusion was not evaluated because PASCAL-Part does not annotate part occlusion
      • Each model treats parts as though they were whole objects
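
A minimal sketch of how each annotated object maps to a breakout category from its two tags (the function name is illustrative):

    # Minimal sketch: assign an object to one of the four breakout categories.
    def breakout_category(occluded: bool, difficult: bool) -> str:
        occ = "Occluded" if occluded else "Unoccluded"
        diff = "Difficult" if difficult else "Non-Difficult"
        return f"{occ} & {diff}"

    # e.g., breakout_category(True, False) -> "Occluded & Non-Difficult"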


Results


OBJECT CLASS RESULTS - AGGREGATE

Aggregate mAPs

GG: 55.09%

Faster R-CNN: 68.56%

YOLOv5: 67.92%


OBJECT CLASS RESULTS – BREAKOUTS 1

Occluded & Difficult mAPs

GG: 0.63%

Faster R-CNN: 3.59%

YOLOv5: 2.51%


OBJECT CLASS RESULTS – BREAKOUTS 2

Occluded & Non-Difficult mAPs

GG: 31.83%

Faster R-CNN: 50.84%

YOLOv5: 47.63%


OBJECT CLASS RESULTS – BREAKOUTS 3

Unoccluded & Difficult mAPs

GG: 1.59%

Faster R-CNN: 6.64%

YOLOv5: 4.10%


OBJECT CLASS RESULTS – BREAKOUTS 4

Unoccluded & Non-Difficult mAPs

GG: 57.24%

Faster R-CNN: 71.83%

YOLOv5: 73.36%


PARTS CLASS RESULTS

Parts Class mAPs

GG: 26.12%

Faster R-CNN: 22.78%

YOLOv5: 36.25%


Conclusion


CONCLUSION AND FUTURE WORK

  • All 3 object detection models showed strong overall performance
  • Faster R-CNN performed the best on partially occluded objects out of the 3
  • All 3 models faltered on difficult objects, whether occluded or unoccluded
  • YOLOv5 performed the best on semantic parts detection
    • The other 2 models struggled
    • YOLOv5 could be modified to use its parts detections to infer the presence of the parent object, potentially improving performance on occluded objects
  • In the future, these models’ training sets will be augmented with synthetic data to see if their accuracy on real-world objects improves
    • Synthetic images will include examples of occluded/difficult objects and their semantic parts
  • Additionally, attributes from these models will be incorporated into a new, custom parts detection model


REFERENCES 1

  1. Gonzalez-Garcia A, Modolo D, Ferrari V. Objects as context for detecting their semantic parts. arXiv.org; 2017 [accessed 2021 Sep 10]. https://arxiv.org/abs/1703.09529.
  2. Wu Y, Massa F, Girshick R, Kirillov A, Lo W-Y. Detectron2: A PyTorch-based modular object detection library. Facebook AI Research; 2019 Oct 10 [accessed 2021 Sep 10]. https://ai.facebook.com/blog/-detectron2-a-pytorch-based-modular-object-detection-library-/.
  3. Jocher G, Stoken A, Chaurasia A, Borovec J, NanoCode012, TaoXie, Kwon Y, Michael K, Changyu L, Fang J, et al. ultralytics/yolov5: v6.0 - YOLOv5n 'Nano' models, Roboflow integration, TensorFlow export, OpenCV DNN support. Ultralytics; 2021 Oct [accessed 2022 Jan 28]. https://github.com/ultralytics/yolov5.
  4. Everingham M, van Gool L, Williams C, Winn J, Zisserman A. The PASCAL visual object classes challenge 2010 (VOC 2010) results; 2010 [accessed 2021 Sep 10]. http://host.robots.ox.ac.uk/pascal/VOC/voc2010/.
  5. Chen X, Mottaghi R, Liu X, Fidler S, Urtasun R, Yuille A. The PASCAL-Part Dataset; 2014 [accessed 2022 Jun 13]. http://roozbehm.info/pascal-parts/pascal-parts.html.
  6. Gonzalez-Garcia A, Modolo D, Ferrari V. Do semantic parts emerge in convolutional neural networks? International Journal of Computer Vision. 2017 Oct;126(5):476–494.
  7. Girshick R. Fast R-CNN. arXiv.org; 2015 Sep 27 [accessed 2021 Sep 14]. https://arxiv.org/abs/1504.08083.
  8. Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. arXiv.org; 2016 Jan 6 [accessed 2021 Sep 14]. https://arxiv.org/abs/1506.01497.
  9. He K, Gkioxari G, Dollar P, Girshick R. Mask R-CNN. arXiv.org; 2018 Jan 24 [accessed 2021 Sep 14]. https://arxiv.org/abs/1703.06870.
  10. Lin T-Y, Goyal P, Girshick R, He K, Dollar P. Focal loss for dense object detection. arXiv.org; 2018 Feb 7 [accessed 2021 Sep 14]. https://arxiv.org/abs/1708.02002.
  11. Guler RA, Neverova N, Kokkinos I. DensePose: dense human pose estimation in the wild. arXiv.org; 2018 Feb 1 [accessed 2021 Sep 14]. https://arxiv.org/abs/1802.00434.


REFERENCES 2

  1. Cai Z, Vasconcelos N. Cascade R-CNN: delving into high quality object detection. arXiv.org; 2017 Dec 3 [accessed 2021 Sep 14]. https://arxiv.org/abs/1712.00726.
  2. Kirillov A, Girshick R, He K, Dollar P. Panoptic feature pyramid networks. arXiv.org; 2019 Apr 10 [accessed 2021 Sep 14]. https://arxiv.org/abs/1901.02446.
  3. Chen X, Girshick R, He K, Dollar P. TensorMask: a foundation for dense object segmentation. arXiv.org; 2019 Aug 27 [accessed 2021 Sep 14]. https://arxiv.org/abs/1903.12174.
  4. Redmon J. YOLO: Real-time object detection; 2018 [accessed 2021 Sep 10]. https://pjreddie.com/darknet/yolo/.
  5. YOLOv5 Documentation; 2021 [accessed 2022 Jan 28]. https://docs.ultralytics.com.
  6. Bochkovskiy A, Wang C-Y, Liao H-YM. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv.org; 2020 Apr 23 [accessed 2022 Sep 15]. https://arxiv.org/abs/2004.10934.
  7. Everingham M, Winn J. devkit_doc.avi; 2010 May 8 [accessed 2021 Sep 10]. http://host.robots.ox.ac.uk/pascal/VOC/voc2010/devkit_doc_08-May-2010.pdf.
  8. Google. Machine Learning Glossary. developers.google.com; [accessed 2022 Sep 14]. https://developers.google.com/machine-learning/glossary.
