1 of 28

Semantic Understanding of Road Scenes

Yu-Hsuan & Qibang

Advisors: Prof. Dong Huang, Ryan Lingo

2 of 28

Outline

Part 1: Capstone Project

Motivation
Background Knowledge
Dataset
Backbone
Possible Directions

Part 2: Paper Survey - SOTA Models

KITTI-360 dataset SOTA model
nuScenes dataset SOTA model
Waymo dataset SOTA model

3 of 28

Motivation

4 of 28

Preliminary

Three types of segmentation

As preliminaries, I want to give you some background knowledge, so I list three different segmentation methods here.

First is semantic segmentation, which assigns a label to every pixel: blue indicates sky, dark green indicates trees, and red indicates people. However, it cannot tell different people apart.

#What if we want to differentiate between different people in the same class? In this case, we use instance segmentation, which is closely related to object detection, except that the output is a mask instead of a bounding box. Unlike semantic segmentation, it usually does not label every pixel in the image; we are only interested in finding the objects. Each person has a different color, so we can tell them apart.

#Last is panoptic segmentation, which combines semantic and instance segmentation.

In our project, we focus on semantic or panoptic segmentation.
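
To make the difference between the three outputs concrete, here is a minimal sketch on a toy image. The class ids, the instance ids, and the `class * offset + instance` encoding are assumptions chosen for illustration, not the label format of any dataset in this deck.

```python
import numpy as np

# Toy 2x4 image. Class ids are hypothetical: 0 = sky, 1 = tree, 2 = person.
semantic = np.array([[0, 0, 1, 1],
                     [2, 2, 2, 2]])

# Instance ids for object pixels only; 0 means "no instance" (sky, trees).
instance = np.array([[0, 0, 0, 0],
                     [1, 1, 2, 2]])

def to_panoptic(semantic, instance, offset=1000):
    """Encode (class, instance) per pixel as class * offset + instance.

    The offset-based encoding is an illustrative assumption.
    """
    return semantic * offset + instance

panoptic = to_panoptic(semantic, instance)

# Semantic segmentation sees one "person" region; panoptic separates the two people.
num_people = len(np.unique(panoptic[semantic == 2]))
```

Here `semantic` alone cannot separate the two people in the bottom row, `instance` alone says nothing about sky or trees, and `panoptic` carries both pieces of information in one map.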

5 of 28

Dataset

SemanticKITTI

KITTI-360

6 of 28

nuScenes

Two diverse cities: Boston and Singapore

Waymo

Panoramic images

7 of 28

Backbone

Input

Output

Fusion

Reference: 2DPASS: 2D Priors Assisted Semantic Segmentation on LiDAR Point Clouds, Yan et al. (ECCV 2022)

8 of 28

Advantages of using 2D image and 3D LiDAR as input

xMUDA: Cross-Modal Unsupervised Domain Adaptation for 3D Semantic Segmentation, Jaritz et al. (CVPR 2020)

9 of 28

Possible directions to improve model robustness

Data Augmentation

Add more data produced by a Stable Diffusion model + domain consistency (light and shadow)

10 of 28

Possible Solution - Data Augmentation

Repopulating Street Scenes

Ref: Repopulating Street Scenes, Wang et al. (CVPR 2021)

11 of 28

12 of 28

Possible directions to improve model robustness

Data Augmentation

Add more data produced by a Stable Diffusion model + domain consistency (light and shadow)

Utilize different input data

KITTI also provides 360° images / panoramas, which contain global and local information

13 of 28

Possible directions to improve model robustness

Data Augmentation

Add more data produced by a Stable Diffusion model + domain consistency (light and shadow)

Utilize different input data

KITTI also provides 360° images / panoramas, which contain global and local information

Ease the gap between 2D and 3D data

How to fuse features from different data types (images / point clouds / panoramas)

14 of 28

How to fuse features from 2D/3D - SemanticKITTI (MSFSKD)

2DPASS: 2D Priors Assisted Semantic Segmentation on LiDAR Point Clouds, Yan et al. (ECCV 2022)

15 of 28

Possible directions to improve model robustness

Data Augmentation

Add more data produced by a Stable Diffusion model + domain consistency (light and shadow)

Utilize different input data

KITTI also provides 360° images / panoramas, which contain global and local information

Ease the gap between 2D and 3D data

How to fuse features from different data types (images / point clouds / panoramas)

Cope with dangerous scenes: fast-moving objects

Optical flow

16 of 28

Paper Survey - Current SOTA Model

17 of 28

Evaluation Metric

Intersection over Union (IoU)
Dice Coefficient, Pixel Accuracy, and Mean Accuracy Metrics (not object level)

Reference: https://learnopencv.com/intersection-over-union-iou-in-object-detection-and-segmentation/

https://pycad.co/the-difference-between-dice-and-dice-loss/

Before we look at the papers, I'll first introduce the evaluation metrics they use. The most common one is IoU, which stands for Intersection over Union. This part is the equation for IoU: the area of intersection divided by the area of union, where the union is the ground-truth area plus the predicted area minus the area of intersection. It measures the similarity between the predicted and ground-truth masks by how much of their area overlaps. In the case of semantic segmentation, the IoU of a class is the number of pixels correctly predicted as that class divided by the number of pixels that belong to that class in either the prediction or the ground truth. The average over all classes is called mIoU.

The Dice coefficient, pixel accuracy, and mean accuracy are also used. This part is the equation for the Dice coefficient.
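
As a rough illustration of how these metrics relate, the sketch below computes them from a confusion matrix with NumPy. The function names and the toy label maps are assumptions made for this example; it is not the evaluation code of any benchmark mentioned here.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes):
    """Rows are ground-truth classes, columns are predicted classes."""
    idx = target.ravel() * num_classes + pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def segmentation_metrics(pred, target, num_classes):
    cm = confusion_matrix(pred, target, num_classes).astype(float)
    tp = np.diag(cm)              # correctly predicted pixels per class
    fp = cm.sum(axis=0) - tp      # predicted as the class, but wrong
    fn = cm.sum(axis=1) - tp      # pixels of the class that were missed
    # Classes absent from both maps give 0/0 = NaN and are skipped by nanmean.
    with np.errstate(divide="ignore", invalid="ignore"):
        iou = tp / (tp + fp + fn)             # intersection / union
        dice = 2 * tp / (2 * tp + fp + fn)
        class_acc = tp / (tp + fn)
    return {
        "iou": iou,
        "miou": np.nanmean(iou),
        "dice": dice,
        "pixel_acc": tp.sum() / cm.sum(),
        "mean_acc": np.nanmean(class_acc),
    }

# Toy 2x3 label maps with three classes (values are hypothetical).
pred = np.array([[0, 0, 1], [1, 1, 2]])
target = np.array([[0, 1, 1], [1, 1, 2]])
metrics = segmentation_metrics(pred, target, num_classes=3)
```

On this toy input one pixel of class 1 is predicted as class 0, so the per-class IoU is 1/2, 3/4, and 1, and mIoU is their average, 0.75.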

18 of 28

SemanticKITTI SOTA: 2DPASS: 2D Priors Assisted Semantic Segmentation on LiDAR Point Clouds

Reference: 2DPASS: 2D Priors Assisted Semantic Segmentation on LiDAR Point Clouds

The first paper we are going to talk about is the 2DPASS paper that Yu-Hsuan mentioned earlier. In autonomous driving, cameras provide dense color information and fine-grained texture, but they are unreliable in low-light conditions. LiDARs, on the other hand, offer accurate and wide-ranging depth information regardless of lighting, but capture only sparse and textureless data. Because camera and LiDAR sensors capture complementary information, it makes sense to perform semantic segmentation through multi-modality data fusion. However, existing fusion-based approaches require paired input data in both the training and inference stages; for example, the LiDAR point clouds and camera images need to be strictly point-to-pixel mapped. This is not practical in most cases because the fields of view of cameras and LiDARs differ. Their computational cost is also high, as fusion-based models process both images and point clouds at runtime. Thus, the authors introduce a general training scheme that leverages multi-scale auxiliary modal fusion and knowledge distillation to acquire richer semantic and structural information from the multimodal data.

The upper part is their general model. It first crops a small patch from the original camera image as the 2D input. Then the cropped image patch and the LiDAR point cloud pass through the 2D and 3D encoders independently to generate multi-scale features. At each scale, the features go through MSFSKD, which stands for multi-scale fusion-to-single knowledge distillation and is shown here. Modality fusion is applied first to enhance the multi-modality feature. The enhanced feature then improves the 3D representation through unidirectional modality-preserving KD to get the 3D predictions. After this part, the feature maps are used to generate the final semantic scores using modality-specific decoders, which are supervised by pure 3D labels.
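
The fuse-then-distill idea at one scale can be sketched roughly as follows. This is a simplified NumPy illustration under stated assumptions, not the 2DPASS implementation: the residual `tanh` fusion layer, the shared classifier `w_cls`, the random features, and the KL direction (fused prediction as the fixed target for the 3D prediction) are all choices made for this example.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """Mean over points of KL(p || q); rows are class distributions."""
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def fuse_then_distill(feat_2d, feat_3d, w_fuse, w_cls):
    """One scale: enhance 3D features with 2D features, then distill into 3D.

    feat_2d, feat_3d: (N, C) per-point features (2D ones sampled at the
    pixels the points project to). w_fuse: (2C, C), w_cls: (C, K);
    both are hypothetical stand-ins for learned layers.
    """
    fused = feat_3d + np.tanh(np.concatenate([feat_2d, feat_3d], axis=-1) @ w_fuse)
    p_fused = softmax(fused @ w_cls)   # teacher: fused prediction (treated as fixed)
    p_3d = softmax(feat_3d @ w_cls)    # student: pure 3D prediction
    return kl_divergence(p_fused, p_3d)  # one direction only: fused -> 3D

rng = np.random.default_rng(0)
num_points, num_classes = 5, 3
total_loss = 0.0
for channels in (8, 16):  # two hypothetical feature scales
    feat_2d = rng.normal(size=(num_points, channels))
    feat_3d = rng.normal(size=(num_points, channels))
    w_fuse = rng.normal(size=(2 * channels, channels))
    w_cls = rng.normal(size=(channels, num_classes))
    total_loss += fuse_then_distill(feat_2d, feat_3d, w_fuse, w_cls)
```

Because the loss only pulls the 3D prediction toward the fused one, the image branch can be dropped at inference time, which is the practical benefit described above.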

19 of 28

SemanticKITTI SOTA: 2DPASS: 2D Priors Assisted Semantic Segmentation on LiDAR Point Clouds

Reference: 2DPASS: 2D Priors Assisted Semantic Segmentation on LiDAR Point Clouds

20 of 28

SemanticKITTI SOTA: 2DPASS: 2D Priors Assisted Semantic Segmentation on LiDAR Point Clouds

Reference: 2DPASS: 2D Priors Assisted Semantic Segmentation on LiDAR Point Clouds

21 of 28

Waymo Open Dataset SOTA: Cross-modal Learning for Domain Adaptation in 3D Semantic Segmentation

Reference: Cross-modal Learning for Domain Adaptation in 3D Semantic Segmentation

For the second dataset, the Waymo Open Dataset, we introduce the cross-modal learning for domain adaptation paper, which inspired the 2DPASS method. Like 2DPASS, the cross-modal learning framework aims to take advantage of the fact that the domain gap affects cameras and LiDARs differently. In their model, a 2D network and a 3D network take an image and a point cloud as inputs, respectively, and each predicts its own 3D segmentation labels; in this process the 2D predictions are lifted to 3D. Cross-modal learning then enforces consistency between the 2D and 3D predictions via mutual mimicking, which is beneficial for domain adaptation in both unsupervised and semi-supervised settings. That leads to the main idea of this paper: constrain the network to make correct predictions on labeled data and consistent predictions across modalities on unlabeled target-domain data. This closely aligns with our project, which aims to give accurate segmentation under rare conditions that may not exist in the training dataset.

The lower part is the architecture of the method. There are two independent network streams: a 2D stream (in red), which takes an image as input and uses a U-Net-style 2D ConvNet, and a 3D stream (in blue), which takes a point cloud as input and uses a U-Net-style 3D SparseConvNet. Because only the 3D points carry labels, the 3D points are projected into the image and the 2D features are sampled at the corresponding pixel locations. The four segmentation outputs consist of the main prediction and the mimicry prediction of each stream. Knowledge is transferred across modalities using KL divergences: the 2D mimicry prediction is trained to estimate the main 3D prediction, and vice versa.
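
The projection-and-sampling step can be sketched as below. It assumes a simple pinhole camera with intrinsics `K`, points already expressed in camera coordinates, and nearest-pixel sampling; the function names and all numbers are hypothetical, and the paper's own pipeline may differ in these details.

```python
import numpy as np

def project_points(points_cam, K):
    """Project (N, 3) camera-frame points (z forward) to (N, 2) pixel coords (u, v)."""
    uvw = points_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def sample_features(feature_map, points_cam, K):
    """Nearest-pixel sampling of an (H, W, C) feature map at projected points.

    Returns the features of the visible points and a boolean visibility mask.
    """
    h, w, _ = feature_map.shape
    in_front = points_cam[:, 2] > 0            # drop points behind the camera
    uv = np.full((len(points_cam), 2), -1.0)   # -1 marks "not projected"
    uv[in_front] = project_points(points_cam[in_front], K)
    cols = np.round(uv[:, 0]).astype(int)
    rows = np.round(uv[:, 1]).astype(int)
    visible = in_front & (cols >= 0) & (cols < w) & (rows >= 0) & (rows < h)
    return feature_map[rows[visible], cols[visible]], visible

# Hypothetical intrinsics for a 48x64 feature map.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])

# Toy feature map whose two channels store the (row, col) of each pixel.
rows, cols = np.meshgrid(np.arange(48), np.arange(64), indexing="ij")
feature_map = np.stack([rows, cols], axis=-1)

points = np.array([[0.0, 0.0, 10.0],    # projects to the principal point
                   [1.0, 0.5, 10.0],    # projects inside the image
                   [0.0, 0.0, -5.0],    # behind the camera
                   [10.0, 0.0, 10.0]])  # outside the field of view
feats, visible = sample_features(feature_map, points, K)
```

Only the first two points land inside the image, which mirrors the field-of-view mismatch between camera and LiDAR: points without a pixel simply have no 2D feature.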

22 of 28

Waymo Open Dataset SOTA: Cross-modal Learning for Domain Adaptation in 3D Semantic Segmentation

Reference: Cross-modal Learning for Domain Adaptation in 3D Semantic Segmentation

23 of 28

Waymo Open Dataset SOTA: Cross-modal Learning for Domain Adaptation in 3D Semantic Segmentation

Reference: Cross-modal Learning for Domain Adaptation in 3D Semantic Segmentation

24 of 28

KITTI-360 Dataset SOTA: Learning Multi-View Aggregation In the Wild for Large-Scale 3D Semantic Segmentation

Reference: Learning Multi-View Aggregation In the Wild for Large-Scale 3D Semantic Segmentation

The third paper we are going to talk about also focuses on combining 2D and 3D data. The authors propose learning to merge features from multiple images with a dedicated attention-based scheme. For each 3D point, the information from relevant images is aggregated based on the point's viewing conditions. To exploit the correspondence between points and image pixels and perform 3D point cloud semantic segmentation with features learned from both modalities, their approach starts by computing an occlusion-aware mapping between 3D points and pixels, then uses the viewing conditions in an attention scheme to aggregate the relevant image features for each 3D point. The right part is their bimodal 2D/3D architecture. The multi-view aggregation module combines a 2D convolutional encoder with a 3D network composed of an encoder, a decoder, and a classifier. Relevant 2D features are associated with each 3D point according to its viewing conditions in each compatible image. There are three fusion strategies, early, intermediate, and late fusion, which differ in where the fusion takes place in the network.
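
A minimal sketch of attention-based view aggregation is shown below. It assumes a single linear scoring function over a hypothetical viewing-condition descriptor and a masked softmax over views; the actual module in the paper is more elaborate, so treat every name and shape here as illustrative.

```python
import numpy as np

def aggregate_views(view_feats, view_conditions, valid, w_score):
    """Attention-weighted sum of image features over the views of each point.

    view_feats:      (P, V, C) image feature of each point in each view
    view_conditions: (P, V, D) viewing-condition descriptor (contents are
                     hypothetical, e.g. distance or viewing angle)
    valid:           (P, V) True where the point is seen in that view
    w_score:         (D,) hypothetical learned scoring weights
    """
    scores = view_conditions @ w_score                  # (P, V)
    masked = np.where(valid, scores, -np.inf)           # unseen views get no weight
    has_view = valid.any(axis=1)
    row_max = np.where(has_view, masked.max(axis=1), 0.0)
    weights = np.exp(masked - row_max[:, None])         # exp(-inf) = 0
    denom = weights.sum(axis=1, keepdims=True)
    # Points seen in no view keep all-zero weights and get a zero feature.
    weights = np.divide(weights, denom, out=np.zeros_like(weights), where=denom > 0)
    return np.einsum("pv,pvc->pc", weights, view_feats), weights

# Two points, three candidate views, two feature channels.
view_feats = np.array([[[1.0, 0.0], [3.0, 2.0], [100.0, 100.0]],
                       [[5.0, 5.0], [5.0, 5.0], [5.0, 5.0]]])
view_conditions = np.zeros((2, 3, 1))          # equal conditions -> equal scores
valid = np.array([[True, True, False],         # point 0 is seen in views 0 and 1
                  [False, False, False]])      # point 1 is seen in no view
point_feats, weights = aggregate_views(view_feats, view_conditions, valid, np.array([1.0]))
```

With equal scores, point 0 averages its two visible views and ignores the occluded third one, while point 1 falls back to a zero image feature and has to rely on the 3D branch.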

25 of 28

KITTI-360 Dataset SOTA: Learning Multi-View Aggregation In the Wild for Large-Scale 3D Semantic Segmentation

Reference: Learning Multi-View Aggregation In the Wild for Large-Scale 3D Semantic Segmentation

26 of 28

Compare & Contrast

2DPASS / Cross-Modal Learning / Multi-View Aggregation

All three papers combine 2D and 3D data to obtain richer information
2DPASS was inspired by Cross-Modal Learning
Pros / Cons

Unidirectional vs. Bidirectional

All three papers aim to take advantage of both 2D image data and 3D point cloud data to get more accurate semantic segmentation. However, Multi-View Aggregation focuses more on using an attention model to decide which image features to use for each point in the 3D scene, while 2DPASS and Cross-Modal Learning use a cross-modality structure to acquire richer semantic and structural information from the multimodal data. Although 2DPASS was inspired by Cross-Modal Learning, the two take different approaches to fusing. Cross-Modal Learning transfers knowledge in both directions, from 2D to 3D and from 3D to 2D. The 2DPASS authors argue that transferring in both directions causes each modality to lose its own characteristics, so they made the transfer unidirectional.

27 of 28

Conclusion

Improving the robustness of self-driving cars in unexpected situations

Identifying rare conditions
Data augmentation
Increasing / changing input data
Methods for easing the gap between 2D / 3D data
Coping with fast-moving objects

28 of 28

Questions?
