1 of 26

What You See is What You Get: Exploiting Visibility for 3D Object Detection

Peiyun Hu, Jason Ziglar, David Held, Deva Ramanan (CVPR 2020)

Nicholas Vadivelu

2020/07/07

Motivation

  • LiDAR sweeps are inherently “2.5D”
    • We can never see behind occluded objects (as with many sensor types)
    • This is why sweeps can be stored in 2D data structures (e.g. 2D range images)
  • Representing sweeps as 3D point clouds (sets of (x, y, z)) hides this information
  • Occupancy maps, though standard in robotic mapping, have not been used for 3D object detection

Contributions

  1. (Re)Introduce raycasting algorithms to efficiently compute visibility for a voxel grid (on the fly)
  2. Show that visibility can be combined with synthetic data augmentation and temporal aggregation of LiDAR sweeps
  3. Propose an approach to augment voxel-based networks with visibility features

Ray Casting Overview (2D case)

# V is visibility: a multichannel 2D feature map
V[:] <- UNKNOWN

for each LiDAR point (a, b, c):
    x, y, z <- source
    while (x, y, z) != (a, b, c):
        V[x, y, z] <- FREE
        x, y, z <- next voxel on ray
    V[a, b, c] <- BLOCKED

  • For N LiDAR points and map dimensions (l, w, h), time complexity is O(N · max(l, w, h))
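The traversal above can be sketched in NumPy as follows. This is a minimal 2D illustration, not the paper's implementation: it samples densely along each ray as a stand-in for exact Amanatides–Woo voxel stepping, and the grid layout, unit voxels, and function name are assumptions.

```python
import numpy as np

UNKNOWN, FREE, BLOCKED = 0, 1, 2

def compute_visibility(points, sensor, grid_shape, voxel_size=1.0):
    """Mark voxels between the sensor and each LiDAR return as FREE,
    the return's voxel as BLOCKED, and everything else UNKNOWN.
    Assumes a 2D grid anchored at the origin with non-negative coords."""
    V = np.full(grid_shape, UNKNOWN, dtype=np.uint8)
    for p in points:
        end = tuple((p / voxel_size).astype(int))
        # Dense sampling along the ray stands in for exact voxel stepping.
        n_steps = 2 * int(np.ceil(np.abs(p - sensor).max() / voxel_size)) + 1
        for t in np.linspace(0.0, 1.0, n_steps, endpoint=False):
            i, j = ((sensor + t * (p - sensor)) / voxel_size).astype(int)
            if (i, j) == end:
                break  # reached the endpoint voxel: stop marking FREE
            if 0 <= i < grid_shape[0] and 0 <= j < grid_shape[1]:
                V[i, j] = FREE
        if 0 <= end[0] < grid_shape[0] and 0 <= end[1] < grid_shape[1]:
            V[end] = BLOCKED
    return V
```

Each ray costs a number of steps proportional to the largest grid dimension, which matches the O(N · max(l, w, h)) bound on the slide.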

Object Augmentation

  • Copy-paste rarely seen objects into LiDAR scenes
  • Use visibility to avoid placing objects in occluded areas
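One way the visibility map could gate such placements (a sketch; the helper name and the acceptance threshold are assumptions, not from the paper): accept a candidate paste location only if most of the object's footprint lies in observed FREE space.

```python
import numpy as np

UNKNOWN, FREE, BLOCKED = 0, 1, 2

def placement_is_visible(V, footprint, min_free=0.8):
    """Accept a candidate paste location only if at least `min_free`
    of the object's footprint voxels were observed as FREE.
    `footprint` is a list of (i, j) voxel indices the object would cover."""
    states = np.array([V[i, j] for i, j in footprint])
    return bool(np.mean(states == FREE) >= min_free)
```

Footprints falling in UNKNOWN (occluded) space are rejected, so augmented objects only appear where the sensor could plausibly have seen them.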

Temporal Aggregation

  • Aggregate sweeps from different time points (compensating for motion)
  • Use Bayesian filtering to turn the 4D spatio-temporal visibility into a 3D posterior probability of occupancy
    • Follow OctoMap’s (Hornung et al. 2013) formulation
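The OctoMap-style update referenced above fits in a few lines: each voxel keeps a log-odds occupancy value that is incremented on a hit, decremented on a miss, and clamped. The sensor-model probabilities (0.7 hit, 0.4 miss) and clamping bounds (0.12, 0.97) below are OctoMap's defaults, not values from this paper.

```python
import math

def logodds(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

def prob(l):
    """Inverse of logodds: recover the probability."""
    return 1.0 - 1.0 / (1.0 + math.exp(l))

L_HIT, L_MISS = logodds(0.7), logodds(0.4)    # inverse sensor model
L_MIN, L_MAX = logodds(0.12), logodds(0.97)   # clamping bounds

def update(l, hit):
    """One recursive Bayesian update for a voxel, following OctoMap:
    L(n | z_1..t) = L(n | z_1..t-1) + L(n | z_t), then clamp."""
    l += L_HIT if hit else L_MISS
    return min(max(l, L_MIN), L_MAX)

# Aggregating sweeps: a voxel observed occupied in three sweeps, free in one.
l = 0.0  # prior p = 0.5
for hit in [True, True, False, True]:
    l = update(l, hit)
occupancy_posterior = prob(l)
```

Working in log-odds makes each sweep's update a single addition, and clamping keeps the map responsive to change, which is what makes the temporal aggregation cheap.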

Approach: A Two-stream Network

  • Augment the PointPillars architecture with a second stream that processes visibility

Experiments

  • Dataset: nuScenes 3D detection dataset
    • 1,000 scenes captured in 2 cities
    • Training set: 700 scenes (28,130 annotated frames)
    • Validation set: 150 scenes (6,019 annotated frames)

Ablation: Late vs Early Fusion

Ablation: Types of Object Augmentation

Ablation

Ablation: Object Augmentation

Ablation: Temporal Aggregation

Ablation: Visibility Stream

Related Work: Visibility

  • The Mobile Robot RHINO
    • Joachim Buhmann, Wolfram Burgard, Armin B. Cremers, Dieter Fox, Thomas Hofmann, Frank E. Schneider, Jiannis Strikos, and Sebastian Thrun, 1995
    • Uses a 2D probabilistic occupancy map built from sonar readings for navigation
  • OctoMap: An Efficient Probabilistic 3D Mapping Framework Based on Octrees
    • Armin Hornung, Kai M. Wurm, Maren Bennewitz, Cyrill Stachniss, and Wolfram Burgard, 2013
    • General-purpose 3D occupancy mapping
  • A Probabilistic Representation of LiDAR Range Data for Efficient 3D Object Detection
    • Theodore C. Yapo, Charles V. Stewart, and Richard J. Radke, 2008
    • Formulates object detection as a hypothesis-testing problem

Thoughts

  • Runtime? (they mention 24.4 ± 3.5 ms on an Intel i9)
  • Intelligent object augmentation works great
  • Not clear how the probabilistic aggregation over time differs from naively combining multiple time steps
  • Application to V2VNet:
    • V2VNet gets a lot of benefit from nearby vehicles having visibility in occluded areas
    • The GNN aggregates incoming messages via a mean
    • Including this visibility information with a learned aggregation (e.g. weights) could be interesting

Thanks for Listening!