1 of 49

3D Object Detection/Classification

A Summary and Discussion by:

Deepak Warrier and Reza Averly

2 of 49

Outline

  • 3D Object Detection Task
  • Difficulty of 3D Point Clouds
  • Brief overview of PointNet + PointRCNN
  • PV-RCNN
  • CenterPoint
    • A brief overview of VoxelNet/PointPillars
  • PV-RCNN vs CenterPoint Experiment Results

3 of 49

Task: 3D Object Detection/Classification

Given a 3D Point Cloud...

  • Object Detection: produce a set of tuples (x,y,z,l,w,h,a), each indicating a bounding prism centered at (x,y,z), with size (l,w,h) and rotation a
  • Classification: produce a class, indicating the type of object detected
  • Segmentation: produce a class for each point
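As a concrete sketch of the detection output above, a minimal point-in-box test for one (x,y,z,l,w,h,a) prism might look like this; the axis conventions and helper name are illustrative, not from the slides:

```python
import math

def point_in_box(p, box):
    """Check whether 3D point p lies inside an (x, y, z, l, w, h, a) box.

    The box is centered at (x, y, z) with size (l, w, h) and yaw rotation
    `a` about the vertical axis. Axis conventions are illustrative.
    """
    x, y, z, l, w, h, a = box
    # Translate into the box frame, then undo the yaw rotation.
    dx, dy, dz = p[0] - x, p[1] - y, p[2] - z
    ca, sa = math.cos(-a), math.sin(-a)
    rx = ca * dx - sa * dy
    ry = sa * dx + ca * dy
    return abs(rx) <= l / 2 and abs(ry) <= w / 2 and abs(dz) <= h / 2

box = (0.0, 0.0, 0.0, 4.0, 2.0, 1.5, 0.0)
print(point_in_box((1.0, 0.5, 0.0), box))   # True: inside
print(point_in_box((3.0, 0.0, 0.0), box))   # False: outside along length
```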

4 of 49

Difficulty of 3D Point Clouds

  • *Unordered*
    • As opposed to images/voxels whose locations have a built-in ordering
  • Points that neighbor each other carry meaning together (local structure matters)
  • The underlying shapes are invariant to rotations and translations, so learned features should be too
  • Limited in providing dense information
    • No shape, texture, color, in contrast to image
  • Unstructured
    • Distance to neighboring points is not fixed, while image has 2D fixed grid
  • Irregularity
    • Points aren't evenly sampled (some regions are dense, others sparse)
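The "unordered" difficulty above is usually handled with a symmetric function. A tiny sketch of an order-invariant set feature (coordinate-wise max, the trick PointNet builds on):

```python
import random

def set_feature(points):
    """Order-invariant feature: coordinate-wise max over the point set.

    A symmetric function like max ignores point ordering, which is the
    core trick PointNet uses on unordered point clouds.
    """
    return tuple(max(p[i] for p in points) for i in range(3))

pts = [(0.1, 0.4, 0.2), (0.9, 0.2, 0.5), (0.3, 0.8, 0.1)]
shuffled = pts[:]
random.shuffle(shuffled)
print(set_feature(pts) == set_feature(shuffled))  # True: order does not matter
```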

5 of 49

3D Object Detection Method

  • Grid-based methods

Transform point clouds into 3D voxels or 2D bird's-eye-view maps

+ computationally efficient

- loses fine-grained localization accuracy

  • Point-based methods

Operate directly on the raw point cloud

+ higher localization accuracy

- higher computation cost

6 of 49

What we will discuss

Method        Data    Approach                     Task
PointNet      Point   Point-Based                  Classification, Segmentation
PointRCNN     Point   Point-Based                  3D Object Detection (Region Proposal)
PV-RCNN       Point   Point-Based and Voxel-Based  3D Object Detection (Region Proposal)
CenterPoint   Point   Voxel-Based                  3D Object Detection (Center Point)

7 of 49

PointNet: Purpose/Tasks

8 of 49

PointNet: General Architecture

T-Nets are learnable sub-networks embedded in the architecture

9 of 49

PointNet: General Architecture

T-Nets are learnable sub-networks embedded in the architecture

Point Identity is maintained through network

10 of 49

PointNet: T-Nets

T-Nets are learnable sub-networks; each learns an affine transformation of its input

An expanded view of 3x3 T-Net

Credit: Luis Gonzales @ Medium.com

Affine Transformations typically can be visualized like this.

The T-Net allows the actual parameters for the transformation to be learned.

Credit: Wikimedia Commons
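A minimal sketch of what applying a T-Net-predicted 3x3 transform amounts to: a per-point matrix multiply. The identity matrix below stands in for a learned transform:

```python
def apply_transform(points, T):
    """Apply a 3x3 transform (as a T-Net would predict) to each point."""
    return [tuple(sum(T[r][c] * p[c] for c in range(3)) for r in range(3))
            for p in points]

# The identity transform leaves the cloud unchanged; a trained T-Net would
# instead predict a matrix that canonicalizes the input's pose.
I = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
pts = [(1.0, 2.0, 3.0), (-1.0, 0.5, 0.0)]
print(apply_transform(pts, I) == pts)  # True
```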

11 of 49

PointNet: Key Features and Takeaways

  • Robust to perturbations and noise in the points
  • Learns a set of critical points
    • Helpful features to learn when using PN as a backbone for other models

12 of 49

PointRCNN

Uses PointNet++ (can use other backbones as well)

13 of 49

PointRCNN: Region Proposals

All of the dots shown are foreground points

For each detected foreground (FG) point, generate a 3D bounding box proposal

The x and z coordinates are assigned to bins; orientation is split into bins as well

A bin-based localization loss compares the predicted x and z bins with the target x and z bins; a similar loss is computed for orientation

y is regressed directly with an L1 loss

[Equation annotations: bin origin, bin size]
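The bin-plus-residual targets described above can be sketched as follows; `search_range` and `bin_size` are illustrative hyperparameters, not the paper's values:

```python
def bin_target(offset, search_range, bin_size):
    """Split a continuous offset into a bin index (classified with
    cross-entropy) and an in-bin residual (regressed with an L1-style loss).

    `search_range` and `bin_size` are illustrative hyperparameters.
    """
    shifted = offset + search_range            # move the origin to the first bin
    bin_idx = int(shifted // bin_size)
    residual = shifted - (bin_idx * bin_size + bin_size / 2)
    return bin_idx, residual

idx, res = bin_target(offset=1.3, search_range=3.0, bin_size=0.5)
print(idx, round(res, 2))  # bin 8, residual ~0.05
```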

14 of 49

PointRCNN: Region Proposals

Cross Entropy Classification Loss on x, z, and orientation

Smooth L1 Loss on box size and height

Overall loss on box regression

An example of Non-Maximal Suppression

Source: PyImageSearch
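The NMS step pictured above can be sketched as a greedy loop over score-sorted boxes (2D axis-aligned boxes for simplicity):

```python
def iou_2d(a, b):
    """IoU of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy non-maximum suppression: keep the best box, drop heavy overlaps."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou_2d(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 too much
```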

15 of 49

PointRCNN: Box Refinement

  • This stage is similar to the RCNN refinement stage
  • Box proposals are enlarged (by a constant amount)
    • Captures surrounding context
  • For each point inside the enlarged box
    • Points take the form (x, y, z, r, m, f)
      • m – in {0, 1}, indicates whether the point is a foreground point
      • f – a C-dimensional learned point feature
        • Generated by the backbone in the proposal stage
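The constant enlargement of proposals can be sketched as below; growing every size axis by a fixed margin is an assumption for illustration, not the paper's exact scheme:

```python
def enlarge_box(box, margin):
    """Grow an (x, y, z, l, w, h, a) proposal by a constant margin on each
    side of every size axis, so the refinement stage sees surrounding context."""
    x, y, z, l, w, h, a = box
    return (x, y, z, l + 2 * margin, w + 2 * margin, h + 2 * margin, a)

print(enlarge_box((0, 0, 0, 4.0, 2.0, 1.5, 0.1), margin=0.5))
# (0, 0, 0, 5.0, 3.0, 2.5, 0.1)
```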

16 of 49

PointRCNN: Box Refinement

  • The points in each box are transformed into the box's egocentric coordinate frame
  • Using the earlier residual loss functions, the boxes are refined
    • The angle is refined in a similar, bin-based form

Bin-based residual loss on orientation

17 of 49

Questions?

18 of 49

PV-RCNN

Combine point-based and voxel-based method

19 of 49

PV-RCNN: Voxel-Based

Motivation: Transform the point cloud into multi-scale voxels (a sparse 3D tensor) to generate region proposals

20 of 49

PV-RCNN: Voxel-Based

 

 

  • 4-layer 3D CNN using sparse convolutions with 3x3x3 kernels
  • Feature volumes at 1x, 2x, 4x, 8x downsampling
  • Output: 3D box proposals
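The point-to-voxel transform can be sketched with a sparse grid keyed by integer voxel indices; only occupied voxels are stored, mirroring the sparse 3D representation:

```python
from collections import defaultdict

def voxelize(points, voxel_size):
    """Group points into voxels keyed by integer grid index; only occupied
    voxels are stored, mirroring a sparse 3D representation."""
    grid = defaultdict(list)
    for p in points:
        key = tuple(int(c // voxel_size) for c in p)
        grid[key].append(p)
    return grid

pts = [(0.1, 0.2, 0.0), (0.3, 0.1, 0.2), (2.5, 0.0, 0.1)]
grid = voxelize(pts, voxel_size=1.0)
print(len(grid))     # 2 occupied voxels
print(sorted(grid))  # [(0, 0, 0), (2, 0, 0)]
```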

21 of 49

PV-RCNN: Point-Based

Motivation: Use keypoints data to enrich features for 3D box proposals

22 of 49

PV-RCNN: FPS + VSA

  • FPS (Furthest Point Sampling): sample a small, well-spread set of keypoints from the raw point cloud
  • VSA (Voxel Set Abstraction): aggregate multi-scale voxel features around each keypoint

23 of 49

PV-RCNN: VSA

  • Pool over a set of neighboring voxel-feature vectors
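The pooling over neighboring voxel-feature vectors might be sketched as a radius query plus max-pooling, a simplified stand-in for Voxel Set Abstraction:

```python
import math

def pool_voxel_features(keypoint, voxels, radius):
    """Element-wise max-pool the feature vectors of all voxels whose centers
    fall within `radius` of the keypoint (a simplified Voxel Set Abstraction).
    `voxels` is a list of (center, feature_vector) pairs."""
    near = [f for c, f in voxels if math.dist(keypoint, c) <= radius]
    if not near:
        return None
    return [max(col) for col in zip(*near)]

voxels = [((0.0, 0.0, 0.0), [0.2, 0.9]),
          ((0.5, 0.0, 0.0), [0.7, 0.1]),
          ((5.0, 5.0, 5.0), [1.0, 1.0])]
print(pool_voxel_features((0.0, 0.0, 0.0), voxels, radius=1.0))  # [0.7, 0.9]
```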

24 of 49

PV-RCNN: PKWM (Predicted Keypoint Weighting Module)

Motivation: foreground keypoints are more important than background ones

How: predict a weight for each keypoint

  • Keypoints inside a ground-truth box are labeled foreground
  • Data: point clouds and 3D boxes
  • Trained with a Focal Loss: a modified cross-entropy loss that handles the class imbalance between background and foreground
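A minimal binary focal loss, showing how the (1 - p_t)^gamma factor down-weights easy examples; gamma = 2 and alpha = 0.25 are the commonly used defaults from the focal loss paper, not values stated in the slides:

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction p in (0, 1) against label y in {0, 1}.

    The (1 - p_t)^gamma factor down-weights easy, well-classified examples,
    which is how the foreground/background imbalance among keypoints is
    handled. gamma and alpha are the common defaults, shown for illustration.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.95, 1)   # confident and correct -> tiny loss
hard = focal_loss(0.10, 1)   # confident and wrong   -> large loss
print(easy < hard)           # True
```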

25 of 49

PV-RCNN: First-Stage Recap

 

  • Transform points to voxels
  • For each keypoint, sample the neighboring voxels
  • Encode and pool their features
26 of 49

PV-RCNN: Second-Stage

  • Motivation: Refine proposal and predict confidence

27 of 49

PV-RCNN: ROI POOLING

  • Sample grid points randomly within each proposal and encode their features — similar to VSA:
    • sampled keypoints -> sampled grid points
    • neighboring voxels -> neighboring keypoints

28 of 49

PV-RCNN: CONFIDENCE + REFINEMENT

  • Each proposal has 216 (6x6x6) grid points with feature vectors
  • For confidence prediction:
    • Target confidence: y = min(1, max(0, 2 * IoU - 0.5))
      • IoU: Intersection over Union with the ground-truth box; the min/max clamp keeps the target in [0, 1]
    • Train using Cross-Entropy Loss
  • For box refinement, use a smooth-L1 loss
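The min/max-clamped confidence target described above can be written directly; the 2 * IoU - 0.5 mapping follows PV-RCNN's confidence branch:

```python
def confidence_target(iou):
    """Map a proposal's IoU with its ground-truth box to a training target
    in [0, 1]; the min/max clamp keeps the target in range."""
    return min(1.0, max(0.0, 2.0 * iou - 0.5))

print(confidence_target(0.10))  # 0.0  (poor overlap, clamped)
print(confidence_target(0.50))  # 0.5
print(confidence_target(0.90))  # 1.0  (clamped)
```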

29 of 49

Questions?

30 of 49

CenterPoint: Idea

  • Fundamental difference: use centers instead of full boxes to detect objects
    • Points have no orientation
      • The search space for boxes shrinks, since there is no need to search over rotations
      • The backbone is forced to learn rotational invariance

31 of 49

CenterPoint: Architecture

32 of 49

CenterPoint: The Feature Representation

The backbones in CenterPoint make use of encoded voxel- or pillar-based features of the 3D point cloud

  • Effectively "voxelize" the space and sample points for efficient representation
  • Captures variety of contextual shape information
  • Enables further work to be done via 2D convolutions, instead of 3D

A helpful representation, since these pillars can be flattened for the next stage

VoxelNet

PointPillars

33 of 49

CenterPoint: Keypoint Detection

  • The backbone (from the previous slide) produces a map-view feature representation M in R^(W x H x F)
  • CenterPoint then applies a (modified) CenterNet-style detection head
    • From the backbone features, it generates a K-channel Gaussian heatmap
      • K is the number of classes to predict

Source: Uri Almog

34 of 49

CenterPoint: Keypoint Detection

  • In "map view", objects are sparser than in images
    • Detected centers are rarely close together
    • With the standard CenterNet target, predictions bias towards background (as opposed to objects)
  • Fixed by enlarging the Gaussian: sigma = max(f(wl), t), where
    • w, l – width and length of the ground-truth box
    • f – CornerNet's radius function
    • t – the minimum radius of the Gaussian (2 was used in the paper)
  • The change helps to balance out the background bias
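The radius floor can be sketched as below; the stand-in radius function is NOT CornerNet's actual formula, only a hypothetical placeholder to show the clamping:

```python
def gaussian_radius_with_floor(w, l, radius_fn, t=2.0):
    """Enlarge the heatmap Gaussian: take a size-dependent radius (CornerNet's
    f in the paper) but never let it fall below the minimum radius t
    (t = 2 in the CenterPoint paper). `radius_fn` stands in for f."""
    return max(radius_fn(w * l), t)

# Illustrative stand-in radius function, NOT CornerNet's actual formula.
toy_fn = lambda area: 0.1 * area ** 0.5

print(gaussian_radius_with_floor(2.0, 5.0, toy_fn))    # small box -> floor of 2.0
print(gaussian_radius_with_floor(20.0, 50.0, toy_fn))  # large box -> ~3.16
```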

35 of 49

CenterPoint: The Regression Stage

  • The goal of the regression stage is to produce an 8-length vector of features per detected center
    • Sub-voxel location refinement: (o1, o2)
      • Where to refine within a given voxel
    • Height: y
      • Height data is lost in the map-view "image", so it is restored from this value
    • Box size: (w, l, h)
      • The net's end goal is this box and its center point, so the size must be regressed too
    • Yaw rotation: (sin(a), cos(a))
      • a is the yaw angle; regressing sin and cos gives a continuous target
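The continuous yaw target can be sketched as an encode/decode pair; atan2 recovers the angle from the regressed (sin, cos) values:

```python
import math

def encode_yaw(a):
    """Regress (sin a, cos a) instead of a directly: the pair is continuous
    across the -pi/pi wraparound, unlike the raw angle."""
    return math.sin(a), math.cos(a)

def decode_yaw(s, c):
    """Recover the yaw angle from the regressed pair."""
    return math.atan2(s, c)

a = 3.0  # near the wraparound at pi
s, c = encode_yaw(a)
print(abs(decode_yaw(s, c) - a) < 1e-9)  # True: round-trips exactly
```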

36 of 49

CenterPoint: The Regression Stage

  • The 8-tuple stores all the critical information needed for a 3D bounding box.
  • Each detected keypoint is regressed to the 8-tuple
    • L1 Loss is used to train and optimize the model
    • Handles boxes at various orientations/shapes/sizes by regressing size in log-scale

37 of 49

CenterPoint: The Regression Stage

The peak (center point) of the computed heatmap is used to index the corresponding regression-head output

Other locations also have regression outputs, but they are not indexed

38 of 49

CenterPoint: Refinement Stage (stage 2)

  • Pick 5 points on each predicted bounding box and get their point-features (from the backbone)

These 5 points

And their features

39 of 49

CenterPoint: Refinement Stage (stage 2)

  • Using the aforementioned features, pass them through an MLP
    • The MLP produces a confidence score and box refinement

Confidence score target: I = min(1, max(0, 2 * IoU - 0.5))

I is based on the IoU (Intersection over Union) between the predicted box and the ground-truth box

Trained with a Cross-Entropy Loss against the predicted confidence score
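The cross-entropy training of the confidence score might look like this minimal binary cross-entropy between the target I and the predicted score:

```python
import math

def bce(target, pred):
    """Binary cross-entropy between the IoU-based target I and the
    predicted confidence score (both in (0, 1))."""
    return -(target * math.log(pred) + (1 - target) * math.log(1 - pred))

print(round(bce(1.0, 0.9), 3))  # small loss when confident and correct
print(round(bce(1.0, 0.1), 3))  # large loss when confident and wrong
```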

40 of 49

Questions?

41 of 49

PointRCNN vs PV-RCNN vs CenterPoint

PointNet

PointRCNN:   Point-Based
PV-RCNN:     Voxel-Based + Point-Based
CenterPoint: Voxel-Based

42 of 49

Experiments: Comparing PV-RCNN and Center-Based (Waymo Dataset)

  • mAP/mAPH results:

Overall, the center-based model achieves a higher mean Average Precision than PV-RCNN

  • Likely due to the robustness offered by tracking centers, as opposed to full boxes

43 of 49

Ablation Studies

Anchor-based measurements are based on PV-RCNN

Once vehicles become too misaligned from the axes, anchor-based performance drops

  • It is hard to cover the full variety of orientations with axis-aligned anchors

44 of 49

PV-RCNN: Ablation Study

45 of 49

PV-RCNN: Ablation Study

Using only voxel-3 and voxel-4 features gives a sufficient performance boost

46 of 49

PV-RCNN: Ablation Study

47 of 49

PV-RCNN: Ablation Study

48 of 49

CenterPoint: Ablation Study

BEV (bird's-eye-view) features are sufficient in the CenterPoint model

49 of 49

Questions?