1 of 49

3D Object Detection/Classification

A Summary and Discussion by:

Deepak Warrier and Reza Averly

2 of 49

Outline

  • 3D Object Detection Task
  • Difficulty of 3D Point Clouds
  • Brief overview of PointNet + PointRCNN
  • PV-RCNN
  • CenterPoint
    • A brief overview of VoxelNet/PointPillars
  • PV-RCNN vs CenterPoint Experiment Results

3 of 49

Task: 3D Object Detection/Classification

Given a 3D Point Cloud...

  • Object Detection: produce a set of tuples (x,y,z,l,w,h,a), each indicating a bounding prism centered at (x,y,z), with size (l,w,h) and rotation a
  • Classification: produce a class, indicating the type of object detected
  • Segmentation: produce a class for each point
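As a concrete sketch of the detection output above, a minimal point-in-box test for one (x,y,z,l,w,h,a) prism might look like this; the axis conventions and helper name are illustrative, not from the slides:

```python
import math

def point_in_box(p, box):
    """Check whether 3D point p lies inside an (x, y, z, l, w, h, a) box.

    The box is centered at (x, y, z) with size (l, w, h) and yaw rotation
    `a` about the vertical axis. Axis conventions are illustrative.
    """
    x, y, z, l, w, h, a = box
    # Translate into the box frame, then undo the yaw rotation.
    dx, dy, dz = p[0] - x, p[1] - y, p[2] - z
    ca, sa = math.cos(-a), math.sin(-a)
    rx = ca * dx - sa * dy
    ry = sa * dx + ca * dy
    return abs(rx) <= l / 2 and abs(ry) <= w / 2 and abs(dz) <= h / 2

box = (0.0, 0.0, 0.0, 4.0, 2.0, 1.5, 0.0)
print(point_in_box((1.0, 0.5, 0.0), box))   # True: inside
print(point_in_box((3.0, 0.0, 0.0), box))   # False: outside along length
```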

4 of 49

Difficulty of 3D Point Clouds

  • *Unordered*
    • As opposed to images/voxels whose locations have a built-in ordering
  • Points that neighbor each other carry meaning together (local structure matters)
  • The underlying shapes are invariant to rotations and translations, so learned features should be too
  • Limited in providing dense information
    • No shape, texture, color, in contrast to image
  • Unstructured
    • Distance to neighboring points is not fixed, while image has 2D fixed grid
  • Irregularity
    • Points aren't evenly sampled (some regions are dense, others sparse)
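The "unordered" difficulty above is usually handled with a symmetric function. A tiny sketch of an order-invariant set feature (coordinate-wise max, the trick PointNet builds on):

```python
import random

def set_feature(points):
    """Order-invariant feature: coordinate-wise max over the point set.

    A symmetric function like max ignores point ordering, which is the
    core trick PointNet uses on unordered point clouds.
    """
    return tuple(max(p[i] for p in points) for i in range(3))

pts = [(0.1, 0.4, 0.2), (0.9, 0.2, 0.5), (0.3, 0.8, 0.1)]
shuffled = pts[:]
random.shuffle(shuffled)
print(set_feature(pts) == set_feature(shuffled))  # True: order does not matter
```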

5 of 49

3D Object Detection Method

  • Grid-based methods

Transform point clouds into 3D voxels or 2D bird's-eye-view maps

+ computationally efficient

- loses fine-grained localization accuracy

  • Point-based methods

Operate directly on the raw point cloud

+ higher localization accuracy

- higher computation cost

6 of 49

What we will discuss

Method        Data    Approach                     Task
PointNet      Point   Point-Based                  Classification, Segmentation
PointRCNN     Point   Point-Based                  3D Object Detection (Region Proposal)
PV-RCNN       Point   Point-Based and Voxel-Based  3D Object Detection (Region Proposal)
CenterPoint   Point   Voxel-Based                  3D Object Detection (Center Point)

7 of 49

PointNet: Purpose/Tasks

8 of 49

PointNet: General Architecture

T-Nets are learnable sub-networks embedded in the architecture

9 of 49

PointNet: General Architecture

T-Nets are learnable sub-networks embedded in the architecture

Point Identity is maintained through network

10 of 49

PointNet: T-Nets

T-Nets are learnable sub-networks; each learns an affine transformation of its input

An expanded view of 3x3 T-Net

Credit: Luis Gonzales @ Medium.com

Affine Transformations typically can be visualized like this.

The T-Net allows the actual parameters for the transformation to be learned.

Credit: Wikimedia Commons
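A minimal sketch of what applying a T-Net-predicted 3x3 transform amounts to: a per-point matrix multiply. The identity matrix below stands in for a learned transform:

```python
def apply_transform(points, T):
    """Apply a 3x3 transform (as a T-Net would predict) to each point."""
    return [tuple(sum(T[r][c] * p[c] for c in range(3)) for r in range(3))
            for p in points]

# The identity transform leaves the cloud unchanged; a trained T-Net would
# instead predict a matrix that canonicalizes the input's pose.
I = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
pts = [(1.0, 2.0, 3.0), (-1.0, 0.5, 0.0)]
print(apply_transform(pts, I) == pts)  # True
```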

11 of 49

PointNet: Key Features and Takeaways

  • Robust to perturbations and noise in the points
  • Learns a set of critical points
    • Helpful features to learn when using PN as a backbone for other models

12 of 49

PointRCNN

Uses PointNet++ (can use other backbones as well)

13 of 49

PointRCNN: Region Proposals

All of the dots shown are foreground points

For each detected foreground (FG) point, generate a 3D bounding box proposal

The x and z coordinates are assigned to bins; orientation is split into bins as well

A bin-based localization loss compares the predicted x and z bins with the target x and z bins; a similar loss is computed for orientation

y is regressed directly with an L1 loss

[Equation annotations: bin origin, bin size]
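The bin-plus-residual targets described above can be sketched as follows; `search_range` and `bin_size` are illustrative hyperparameters, not the paper's values:

```python
def bin_target(offset, search_range, bin_size):
    """Split a continuous offset into a bin index (classified with
    cross-entropy) and an in-bin residual (regressed with an L1-style loss).

    `search_range` and `bin_size` are illustrative hyperparameters.
    """
    shifted = offset + search_range            # move the origin to the first bin
    bin_idx = int(shifted // bin_size)
    residual = shifted - (bin_idx * bin_size + bin_size / 2)
    return bin_idx, residual

idx, res = bin_target(offset=1.3, search_range=3.0, bin_size=0.5)
print(idx, round(res, 2))  # bin 8, residual ~0.05
```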

14 of 49

PointRCNN: Region Proposals

Cross Entropy Classification Loss on x, z, and orientation

Smooth L1 Loss on box size and height

Overall loss on box regression

An example of Non-Maximal Suppression

Source: PyImageSearch
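The NMS step pictured above can be sketched as a greedy loop over score-sorted boxes (2D axis-aligned boxes for simplicity):

```python
def iou_2d(a, b):
    """IoU of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy non-maximum suppression: keep the best box, drop heavy overlaps."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou_2d(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 too much
```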

15 of 49

PointRCNN: Box Refinement

  • This stage is similar to the RCNN refinement stage
  • Box proposals are enlarged (by a constant amount)
    • Captures surrounding context
  • For each point inside the enlarged box
    • Points take the form (x, y, z, r, m, f)
      • m – in {0, 1}, indicates whether the point is a foreground point
      • f – a C-dimensional learned point feature
        • Generated by the backbone in the proposal stage
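The constant enlargement of proposals can be sketched as below; growing every size axis by a fixed margin is an assumption for illustration, not the paper's exact scheme:

```python
def enlarge_box(box, margin):
    """Grow an (x, y, z, l, w, h, a) proposal by a constant margin on each
    side of every size axis, so the refinement stage sees surrounding context."""
    x, y, z, l, w, h, a = box
    return (x, y, z, l + 2 * margin, w + 2 * margin, h + 2 * margin, a)

print(enlarge_box((0, 0, 0, 4.0, 2.0, 1.5, 0.1), margin=0.5))
# (0, 0, 0, 5.0, 3.0, 2.5, 0.1)
```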

16 of 49

PointRCNN: Box Refinement

  • The points in each box are transformed into the box's egocentric coordinate frame
  • Using the earlier residual loss functions, the boxes are refined
    • The angle is refined in a similar, bin-based form

Bin-based residual loss on orientation

17 of 49

Questions?

18 of 49

PV-RCNN

Combine point-based and voxel-based method

19 of 49

PV-RCNN: Voxel-Based

Motivation: Transform the point cloud into multi-scale voxels (a sparse 3D tensor) to generate region proposals

20 of 49

PV-RCNN: Voxel-Based

 

 

  • 4-layer 3D CNN using sparse convolutions with 3x3x3 kernels
  • Feature volumes at 1x, 2x, 4x, 8x downsampling
  • Output: 3D box proposals
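The point-to-voxel transform can be sketched with a sparse grid keyed by integer voxel indices; only occupied voxels are stored, mirroring the sparse 3D representation:

```python
from collections import defaultdict

def voxelize(points, voxel_size):
    """Group points into voxels keyed by integer grid index; only occupied
    voxels are stored, mirroring a sparse 3D representation."""
    grid = defaultdict(list)
    for p in points:
        key = tuple(int(c // voxel_size) for c in p)
        grid[key].append(p)
    return grid

pts = [(0.1, 0.2, 0.0), (0.3, 0.1, 0.2), (2.5, 0.0, 0.1)]
grid = voxelize(pts, voxel_size=1.0)
print(len(grid))     # 2 occupied voxels
print(sorted(grid))  # [(0, 0, 0), (2, 0, 0)]
```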

21 of 49

PV-RCNN: Point-Based

Motivation: Use keypoints data to enrich features for 3D box proposals

22 of 49

PV-RCNN: FPS + VSA

  • FPS (Furthest Point Sampling): sample a small, well-spread set of keypoints from the raw point cloud
  • VSA (Voxel Set Abstraction): aggregate multi-scale voxel features around each keypoint

23 of 49

PV-RCNN: VSA

  • Pool over a set of neighboring voxel-feature vectors
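The pooling over neighboring voxel-feature vectors might be sketched as a radius query plus max-pooling, a simplified stand-in for Voxel Set Abstraction:

```python
import math

def pool_voxel_features(keypoint, voxels, radius):
    """Element-wise max-pool the feature vectors of all voxels whose centers
    fall within `radius` of the keypoint (a simplified Voxel Set Abstraction).
    `voxels` is a list of (center, feature_vector) pairs."""
    near = [f for c, f in voxels if math.dist(keypoint, c) <= radius]
    if not near:
        return None
    return [max(col) for col in zip(*near)]

voxels = [((0.0, 0.0, 0.0), [0.2, 0.9]),
          ((0.5, 0.0, 0.0), [0.7, 0.1]),
          ((5.0, 5.0, 5.0), [1.0, 1.0])]
print(pool_voxel_features((0.0, 0.0, 0.0), voxels, radius=1.0))  # [0.7, 0.9]
```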

24 of 49

PV-RCNN: PKWM (Predicted Keypoint Weighting Module)

Motivation: foreground keypoints are more important than background ones

How: predict a weight for each keypoint

  • Keypoints inside a ground-truth box are labeled foreground
  • Data: point clouds and 3D boxes
  • Trained with a Focal Loss: a modified cross-entropy loss that handles the class imbalance between background and foreground
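A minimal binary focal loss, showing how the (1 - p_t)^gamma factor down-weights easy examples; gamma = 2 and alpha = 0.25 are the commonly used defaults from the focal loss paper, not values stated in the slides:

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction p in (0, 1) against label y in {0, 1}.

    The (1 - p_t)^gamma factor down-weights easy, well-classified examples,
    which is how the foreground/background imbalance among keypoints is
    handled. gamma and alpha are the common defaults, shown for illustration.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.95, 1)   # confident and correct -> tiny loss
hard = focal_loss(0.10, 1)   # confident and wrong   -> large loss
print(easy < hard)           # True
```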

25 of 49

PV-RCNN: First-Stage Recap

 

  • Transform points to voxels
  • For each keypoint, sample the neighboring voxels
  • Encode and pool their features
26 of 49

PV-RCNN: Second-Stage

  • Motivation: Refine proposal and predict confidence

27 of 49

PV-RCNN: ROI POOLING

  • Sample grid points randomly within each proposal and encode their features — similar to VSA:
    • sampled keypoints -> sampled grid points
    • neighboring voxels -> neighboring keypoints

28 of 49

PV-RCNN: CONFIDENCE + REFINEMENT

  • Each proposal has 216 (6x6x6) grid points with feature vectors
  • For confidence prediction:
    • Target confidence: y = min(1, max(0, 2 * IoU - 0.5))
      • IoU: Intersection over Union with the ground-truth box; the min/max clamp keeps the target in [0, 1]
    • Train using Cross-Entropy Loss
  • For box refinement, use a smooth-L1 loss
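The min/max-clamped confidence target described above can be written directly; the 2 * IoU - 0.5 mapping follows PV-RCNN's confidence branch:

```python
def confidence_target(iou):
    """Map a proposal's IoU with its ground-truth box to a training target
    in [0, 1]; the min/max clamp keeps the target in range."""
    return min(1.0, max(0.0, 2.0 * iou - 0.5))

print(confidence_target(0.10))  # 0.0  (poor overlap, clamped)
print(confidence_target(0.50))  # 0.5
print(confidence_target(0.90))  # 1.0  (clamped)
```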

29 of 49

Questions?

30 of 49

CenterPoint: Idea

  • Fundamental difference: use centers instead of full boxes to detect objects
    • Points have no orientation
      • The search space for boxes shrinks, since there is no need to search over rotations
      • The backbone is forced to learn rotational invariance

31 of 49

CenterPoint: Architecture

32 of 49

CenterPoint: The Feature Representation

The backbones in CenterPoint make use of encoded voxel- or pillar-based features of the 3D point cloud

  • Effectively "voxelize" the space and sample points for efficient representation
  • Captures variety of contextual shape information
  • Enables further work to be done via 2D convolutions, instead of 3D

A helpful representation, since these pillars can be flattened for the next stage

VoxelNet

PointPillars

33 of 49

CenterPoint: Keypoint Detection

  • The backbone (from the previous slide) produces a map-view feature representation M in R^(W x H x F)
  • CenterPoint then applies a (modified) CenterNet-style detection head
    • From the backbone features, it generates a K-channel Gaussian heatmap
      • K is the number of classes to predict

Source: Uri Almog

34 of 49

CenterPoint: Keypoint Detection

  • In "map view", objects are sparser than in images
    • Detected centers are rarely close together
    • With the standard CenterNet target, predictions bias towards background (as opposed to objects)
  • Fixed by enlarging the Gaussian: sigma = max(f(wl), t), where
    • w, l – width and length of the ground-truth box
    • f – CornerNet's radius function
    • t – the minimum radius of the Gaussian (2 was used in the paper)
  • The change helps to balance out the background bias
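The radius floor can be sketched as below; the stand-in radius function is NOT CornerNet's actual formula, only a hypothetical placeholder to show the clamping:

```python
def gaussian_radius_with_floor(w, l, radius_fn, t=2.0):
    """Enlarge the heatmap Gaussian: take a size-dependent radius (CornerNet's
    f in the paper) but never let it fall below the minimum radius t
    (t = 2 in the CenterPoint paper). `radius_fn` stands in for f."""
    return max(radius_fn(w * l), t)

# Illustrative stand-in radius function, NOT CornerNet's actual formula.
toy_fn = lambda area: 0.1 * area ** 0.5

print(gaussian_radius_with_floor(2.0, 5.0, toy_fn))    # small box -> floor of 2.0
print(gaussian_radius_with_floor(20.0, 50.0, toy_fn))  # large box -> ~3.16
```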

35 of 49

CenterPoint: The Regression Stage

  • The goal of the regression stage is to produce an 8-length vector of features per detected center
    • Sub-voxel location refinement: (o1, o2)
      • Where to refine within a given voxel
    • Height: y
      • Height data is lost in the map-view "image", so it is restored from this value
    • Box size: (w, l, h)
      • The net's end goal is this box and its center point, so the size must be regressed too
    • Yaw rotation: (sin(a), cos(a))
      • a is the yaw angle; regressing sin and cos gives a continuous target
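The continuous yaw target can be sketched as an encode/decode pair; atan2 recovers the angle from the regressed (sin, cos) values:

```python
import math

def encode_yaw(a):
    """Regress (sin a, cos a) instead of a directly: the pair is continuous
    across the -pi/pi wraparound, unlike the raw angle."""
    return math.sin(a), math.cos(a)

def decode_yaw(s, c):
    """Recover the yaw angle from the regressed pair."""
    return math.atan2(s, c)

a = 3.0  # near the wraparound at pi
s, c = encode_yaw(a)
print(abs(decode_yaw(s, c) - a) < 1e-9)  # True: round-trips exactly
```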

36 of 49

CenterPoint: The Regression Stage

  • The 8-tuple stores all the critical information needed for a 3D bounding box.
  • Each detected keypoint is regressed to the 8-tuple
    • L1 Loss is used to train and optimize the model
    • Handles boxes at various orientations/shapes/sizes by regressing size in log-scale

37 of 49

CenterPoint: The Regression Stage

The peak (center point) of the computed heatmap is used to index the corresponding regression-head output

Other locations also have regression outputs, but they are not indexed

38 of 49

CenterPoint: Refinement Stage (stage 2)

  • Pick 5 points on each predicted bounding box and get their point-features (from the backbone)

These 5 points

And their features

39 of 49

CenterPoint: Refinement Stage (stage 2)

  • Using the aforementioned features, pass them through an MLP
    • The MLP produces a confidence score and box refinement

Confidence score target: I = min(1, max(0, 2 * IoU - 0.5))

I is based on the IoU (Intersection over Union) between the predicted box and the ground-truth box

Trained with a Cross-Entropy Loss against the predicted confidence score
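The cross-entropy training of the confidence score might look like this minimal binary cross-entropy between the target I and the predicted score:

```python
import math

def bce(target, pred):
    """Binary cross-entropy between the IoU-based target I and the
    predicted confidence score (both in (0, 1))."""
    return -(target * math.log(pred) + (1 - target) * math.log(1 - pred))

print(round(bce(1.0, 0.9), 3))  # small loss when confident and correct
print(round(bce(1.0, 0.1), 3))  # large loss when confident and wrong
```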

40 of 49

Questions?

41 of 49

PointRCNN vs PV-RCNN vs CenterPoint

PointNet

PointRCNN:   Point-Based
PV-RCNN:     Voxel-Based + Point-Based
CenterPoint: Voxel-Based

42 of 49

Experiments: Comparing PV-RCNN and Center-Based (Waymo Dataset)

  • mAP/mAPH results:

Overall, the center-based model achieves a higher mean Average Precision than PV-RCNN

  • Likely due to the robustness offered by tracking centers, as opposed to full boxes

43 of 49

Ablation Studies

Anchor-based measurements are based on PV-RCNN

Once vehicles become too misaligned from the axes, anchor-based performance drops

  • It is hard to cover the full variety of orientations with axis-aligned anchors

44 of 49

PV-RCNN: Ablation Study

45 of 49

PV-RCNN: Ablation Study

Using only voxel-3 and voxel-4 features gives a sufficient performance boost

46 of 49

PV-RCNN: Ablation Study

47 of 49

PV-RCNN: Ablation Study

48 of 49

CenterPoint: Ablation Study

BEV (bird's-eye-view) features are sufficient in the CenterPoint model

49 of 49

Questions?