1 of 25

Open-World Panoptic LiDAR Segmentation

Students: Meghana Ganesina, Anirudh Chakravarthy

Advisors: Aljosa Osep, Deva Ramanan, Shu Kong

MSCV capstone project overview, January 2022

2 of 25

Mobile robot perception

3 of 25

LiDAR-based mobile robot perception

4 of 25

Motivation

  • Prior work
    • Static models that do not adapt their behaviour over time
    • New data: we would need to manually label it and restart the training process
  • Would this setting generalize to the real world?

5 of 25

Our work

Continual learning for LiDAR panoptic segmentation via object discovery

  • Robot drives around and collects streams of LiDAR sensory data
  • At the end of the day: re-consolidate this data (offline).
    • Did we observe any “novel” object classes that we do not recognize?
    • If so: ask humans about this class (human-in-the-loop) or generate pseudo-labels (clustering)
    • Incrementally update the system

6 of 25

Open Set vs. Closed Set

7 of 25

Datasets

  • SemanticKITTI, Panoptic nuScenes

SemanticKITTI (Behley et al., ICCV’19)

Panoptic nuScenes (Fong et al., arXiv:2109.03805, ’21)

8 of 25

4D Panoptic LiDAR segmentation

[Architecture figure: an encoder-decoder network consumes a 4D point cloud (scans t, t+1, t+2, fused via point sampling) and predicts, per point, a semantic head (S), an objectness head (O), a point variance head (Σ), and point embeddings (ε), yielding 4D semantic + instance predictions.]

Aygün et al., 4D Panoptic LiDAR Segmentation, CVPR 2021.
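As a rough illustration, here is a minimal PyTorch-style sketch of the four prediction heads; layer shapes, names, and activations are our assumptions, not the authors’ implementation:

```python
# Hypothetical sketch of the four per-point prediction heads; only the head
# structure follows the figure above, everything else is assumed.
import torch
import torch.nn as nn

class PanopticHeads(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int, embed_dim: int = 32):
        super().__init__()
        self.semantic = nn.Linear(feat_dim, num_classes)  # S: semantic logits
        self.objectness = nn.Linear(feat_dim, 1)          # O: instance-center score
        self.variance = nn.Linear(feat_dim, 1)            # Sigma: point variance
        self.embedding = nn.Linear(feat_dim, embed_dim)   # eps: point embeddings

    def forward(self, point_feats: torch.Tensor) -> dict:
        # point_feats: (N, feat_dim) features from the encoder-decoder backbone
        return {
            "semantics": self.semantic(point_feats),                     # (N, C)
            "objectness": torch.sigmoid(self.objectness(point_feats)),   # (N, 1)
            "variance": torch.exp(self.variance(point_feats)),           # (N, 1), positive
            "embeddings": self.embedding(point_feats),                   # (N, E)
        }
```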

9 of 25

LiDAR Panoptic Segmentation (single-scan)

Single-scan LiDAR Panoptic Segmentation (Behley et al., ICCV’19, ICRA’21)

Semantic Segmentation

RangeNet++ - Milioto et al., IROS’19

KPConv - Thomas et al., CVPR’19

Object Detection

PointPillars - Lang et al., CVPR’19

10 of 25

Towards Open World Object Detection

K J Joseph, Salman Khan, Fahad Shahbaz Khan, Vineeth N Balasubramanian

CVPR’21

11 of 25

Open world recognition

  • Bendale et al., Towards open world recognition, CVPR’16
  • Key idea:
    • Recognize “novel” classes
    • Label and re-train

12 of 25

Open world detection

The premise:

  • The detector detects and classifies “known” object classes
  • The remaining anchor boxes are either classified as “unknown” objects or rejected as “stuff”/background classes
  • Detected “unknown” objects are then labeled by human annotators and used to (incrementally) re-train the detector

13 of 25

Method

  • Maximize discrimination between latent representations in the feature space
  • How? Contrastive learning!
  • Intuition: by “pushing apart” features for (distinct) “known” classes and “unknown” classes, it will become easier to identify unknown classes as novel

[Figure: class prototypes and feature vectors in the embedding space]

14 of 25

Experiments

[Results figures: catastrophic forgetting, known-unknown confusion, precision on known classes, precision on known + unknown classes]

15 of 25

Method

  • Base detector: Faster R-CNN
  • Add an additional (contrastive) loss that “pulls apart” features in the embedding space (N known classes + 1 “unknown” class); see the sketch below

[Figure: class prototypes and feature vectors in the embedding space]
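A minimal sketch of such a prototype-based contrastive loss, assuming per-RoI feature vectors and one running prototype per class (including “unknown”); the hinge margin and all names are illustrative, not the paper’s exact formulation:

```python
# Hypothetical prototype-based contrastive loss: pull each feature toward its
# own class prototype, push it at least `margin` away from all other prototypes.
import torch
import torch.nn.functional as F

def contrastive_clustering_loss(feats: torch.Tensor,      # (B, D) RoI features
                                labels: torch.Tensor,     # (B,) class ids, long
                                prototypes: torch.Tensor, # (C, D) class prototypes
                                margin: float = 10.0) -> torch.Tensor:
    dists = torch.cdist(feats, prototypes)                  # (B, C) distances
    own = F.one_hot(labels, num_classes=prototypes.size(0)).bool()
    pos = dists[own]                                        # (B,) distance to own prototype
    neg = F.relu(margin - dists).masked_fill(own, 0.0)      # hinge on all other prototypes
    return (pos + neg.sum(dim=1)).mean()
```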

16 of 25

Exemplar-based Open-Set Panoptic Segmentation Network

Jaedong Hwang, Seoung Wug Oh, Joon-Young Lee, Bohyung Han

CVPR’21

17 of 25

Open-world panoptic segmentation

  • Panoptic segmentation: instance + semantic segmentation
  • Open-world panoptic segmentation: remove the labels of a subset of classes and treat them as ‘unknown’

18 of 25

Method

  • Based on the Panoptic FPN two-stage top-down approach [Kirillov et al., CVPR’19]
  • Sample proposals
    • Criterion: > 50% of the proposal area lies in the ‘void’ region
    • Based on the ‘objectness’ score, extract features (1024-dim)
  • Clustering
    • Every 200 iterations, run k-means
    • Pick ‘good’ clusters (high average objectness score, small average cosine distance between centroid and elements) and store them as exemplars
  • Mining
    • Find exemplars in the incoming mini-batch: compute the cosine distance between proposals and exemplars (clusters) -> mined unknowns (used for supervision)
  • Standard cross-entropy loss + “negative supervision” term
    • Simply treat the mined unknowns as GT for the “unknown” class (see the sketch below)
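A rough sketch of this exemplar pipeline, using scikit-learn k-means; all thresholds and helper names are our assumptions, not the paper’s exact values:

```python
# Hypothetical exemplar mining: cluster void-proposal features, keep tight
# high-objectness clusters, then mark proposals near any exemplar as 'unknown'.
import numpy as np
from sklearn.cluster import KMeans

def pick_exemplars(feats, obj_scores, k=10, min_obj=0.5, max_cos_dist=0.3):
    """feats: (N, 1024) features of proposals lying >50% in the 'void' area."""
    km = KMeans(n_clusters=k, n_init=10).fit(feats)
    exemplars = []
    for c in range(k):
        members = feats[km.labels_ == c]
        centroid = km.cluster_centers_[c]
        cos_sim = members @ centroid / (
            np.linalg.norm(members, axis=1) * np.linalg.norm(centroid) + 1e-8)
        # keep 'good' clusters: high mean objectness, members close to centroid
        if obj_scores[km.labels_ == c].mean() > min_obj and (1 - cos_sim).mean() < max_cos_dist:
            exemplars.append(centroid)
    return np.stack(exemplars) if exemplars else np.empty((0, feats.shape[1]))

def mine_unknowns(batch_feats, exemplars, max_cos_dist=0.3):
    """Boolean mask: proposals close to any stored exemplar become 'unknown' pseudo-GT."""
    sims = batch_feats @ exemplars.T / (
        np.linalg.norm(batch_feats, axis=1, keepdims=True)
        * np.linalg.norm(exemplars, axis=1) + 1e-8)
    return (1 - sims).min(axis=1) < max_cos_dist
```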

19 of 25

Experiments

  • COCO dataset (COCO-Stuff)
  • Declare a fraction of classes as ‘unknown’
    • 5%: car, cow, pizza, toilet
    • 10%: boat, tie, zebra, stop sign
    • 20%: table, banana, bicycle, cake, sink, cat, keyboard
  • Observation: thanks to the “dense” GT, objects are always reasonably well delineated

20 of 25

Results

  • Utilizing the “void” boxes

  • Baselines

21 of 25

Roadmap

  • Instance mining - Jan 26
    • Define task sets, train models (supervised)
    • Instance segmentation (we are here)
    • Instance tracking
  • Instance grouping / discovery - Mar 15
    • Implement evaluation set-up (assume clusters are given)
    • Simple baseline (embeddings from a pre-trained network + clustering)
    • Self-supervised embedding learning + clustering
    • End-to-end model
  • Model re-training - Apr 15
    • Transfer labels from clusters to point clouds
    • Training from scratch, incremental update with replay


22 of 25

  1. Supervised Learning
  • Datasets: SemanticKITTI (Behley et al., ICCV’19); KITTI Raw (1.5 h of “raw” recordings; Geiger et al., CVPR’12)

• Task set 0
  • Labeled: road, building, vegetation, car, fence, human
  • “Not labeled” (held-out): sidewalk, truck, terrain, pole, parking, bicycle, traffic sign, motorcycle
  • Held-out instances to “discover”: truck, pole, bicycle, traffic sign, motorcycle
• Task set 1
  • Labeled: road, building, vegetation, car, fence, human, sidewalk, truck, terrain, pole
  • “Not labeled” (held-out): parking, bicycle, traffic sign, motorcycle
  • Held-out instances to “discover”: bicycle, traffic sign, motorcycle
• Task set 2
  • Labeled: road, building, vegetation, car, fence, human, sidewalk, truck, terrain, pole, parking, bicycle, traffic sign, motorcycle
  • “Not labeled” (held-out): -
  • Held-out instances to “discover”: -
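One way to realize these splits for training (a minimal sketch under our own assumptions about label encoding; UNKNOWN_ID is an arbitrary sentinel, not project code) is to collapse every held-out class into a single ‘unknown’ label:

```python
# Hypothetical label remapping: collapse every held-out class of a task set
# into a single 'unknown' id before supervised training.
import numpy as np

UNKNOWN_ID = 255  # assumed sentinel value, not a SemanticKITTI convention

def remap_labels(point_labels: np.ndarray, known_ids: list) -> np.ndarray:
    """Map every per-point label that is not in `known_ids` to UNKNOWN_ID."""
    out = point_labels.copy()
    out[~np.isin(point_labels, known_ids)] = UNKNOWN_ID
    return out
```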

23 of 25

  • Supervised Learning
  • We can do K-way classification reasonably well! (SemanticKITTI validation split)

4D-Panoptic (single scan):

• Task set 0: mIoU 0.7383, mIoU_kn 0.7215, IoU_unk 0.8388
• Task set 1: mIoU 0.7377, mIoU_kn 0.7594, IoU_unk 0.5214
• Task set 2: mIoU 0.6260, mIoU_kn 0.6260, IoU_unk N/A
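For reference, a sketch of how these metrics can be computed, with mIoU_kn averaging over known classes and IoU_unk scoring the single collapsed “unknown” class (assuming integer per-point label arrays; not the benchmark’s evaluation code):

```python
# Hypothetical metric computation for K known classes + one 'unknown' class.
import numpy as np

def per_class_iou(pred, gt, class_ids):
    ious = []
    for c in class_ids:
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        ious.append(inter / union if union > 0 else np.nan)  # NaN if class absent
    return np.array(ious)

def report(pred, gt, known_ids, unknown_id):
    iou_known = per_class_iou(pred, gt, known_ids)
    iou_unk = per_class_iou(pred, gt, [unknown_id])[0]
    return {
        "mIoU": np.nanmean(np.append(iou_known, iou_unk)),  # all classes
        "mIoU_kn": np.nanmean(iou_known),                   # known classes only
        "IoU_unk": iou_unk,                                 # collapsed unknown class
    }
```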

24 of 25

2. Instance mining

  • Goal: turn a stream of raw sensory data into a set of potential objects
  • Per-scan proposal generation
    • Hu et al., Learning to Optimally Segment Point Clouds, RAL’19
    • Instead of the learned regressor, average the “objectness” score (estimated by the network) across each segment (sketched below)
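A minimal sketch of that scoring rule (function names are placeholders; the candidate segments themselves would come from the Hu et al. pipeline):

```python
# Hypothetical segment scoring: rank candidate segments of one scan by the
# mean per-point objectness predicted by the network.
import numpy as np

def rank_proposals(point_objectness: np.ndarray, segment_ids: np.ndarray) -> list:
    """point_objectness: (N,) network scores; segment_ids: (N,) segment id per point."""
    segs = np.unique(segment_ids)
    scores = {int(s): float(point_objectness[segment_ids == s].mean()) for s in segs}
    return sorted(scores, key=scores.get, reverse=True)  # best-scoring segments first
```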

25 of 25

Thank you!
