1 of 38

From BEV to Scene-as-Occupancy:

An Overview of Camera 3D Perception

Chonghao Sima

https://opendrivelab.com

Shanghai AI Laboratory | 上海人工智能实验室

3 of 38

Why we need 3D perception

Autonomous Driving: RoboTaxi, Truck, ADAS, Delivery

Robotics & Embodied AI: Housework, Agriculture, Logistics System, Industry

Why camera-based? Low cost, easy to deploy, long range, and rich in semantic appearance.

8 of 38

Core Issues in Camera-only 3D Perception

Accurate depth: bridging the gap between camera-based and LiDAR-based methods

[1] Depth Estimation Matters Most: Improving Per-Object Depth Estimation for Monocular 3D Detection and Tracking, arXiv:2206.03666

How to solve?

Pseudo-LiDAR track: predict depth to form a pseudo-LiDAR point cloud
Center-point track: use a 2D heatmap to infer the 3D pose
Depth pre-training: embed a 3D prior in the 2D pre-trained backbone
BEV view transformation: transform perspective-view features to BEV
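
As an illustration of the pseudo-LiDAR track listed above, the sketch below back-projects a predicted depth map through a pinhole camera model into a 3D point cloud. The function name, the intrinsics (fx, fy, cx, cy) and the toy depth values are assumptions made for this example and are not taken from any specific method.

```python
def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project a dense depth map (rows of depths in metres)
    into camera-frame 3D points (x right, y down, z forward)."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # skip pixels without a valid depth prediction
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Hypothetical 2x3 depth map; 0.0 marks a pixel with no prediction.
depth = [[10.0, 10.0, 0.0],
         [20.0, 20.0, 20.0]]
cloud = depth_to_pseudo_lidar(depth, fx=100.0, fy=100.0, cx=1.0, cy=0.5)
print(len(cloud), cloud[0])
```

The resulting points can then be fed to a detector designed for LiDAR point clouds, so the quality of the result depends directly on the accuracy of the predicted depth.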

12 of 38

Trending in BEV Perception

2021.7: HDMapNet
HD map given in BEV coordinates
Proposes aggregating features extracted from both camera and LiDAR
Outputs a BEV map

2021.10-12: DETR3D, BEVDet
Implicit processing of BEV features
Fused object detection in BEV using omnidirectional cameras

2022.3: BEVFormer, PersFormer
Explicitly constructs the BEV representation via camera parameters
Explicit processing of BEV features

2022.5: BEVFusion (DAMO Academy), BEVFusion (MIT), FUTR3D
Multimodal feature fusion in BEV
Fusion at the BEV-feature level

Core question: how can the view transformation from perspective view to BEV be modeled more effectively?

16 of 38

View Transformation

Issues:

From 3D to 2D: multiple 3D points hit the same 2D pixel

From 2D to 3D: depth is unknown

Either way, the transformation is ill-posed
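
Both directions of the ambiguity can be shown with a few lines of pinhole-camera arithmetic. This is a minimal sketch: the intrinsics in `K` and the sample points are made-up values chosen so that the numbers come out exactly.

```python
def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point (x, y, z) to a pixel (u, v)."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def unproject(pixel, depth, fx, fy, cx, cy):
    """Lift a pixel back to 3D; only possible once a depth is supplied."""
    u, v = pixel
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


K = dict(fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)  # assumed intrinsics

# 3D -> 2D: two points on the same viewing ray land on the same pixel.
near = (1.0, 0.5, 10.0)
far = (2.0, 1.0, 20.0)
pixel = project(near, **K)
same_pixel = project(far, **K) == pixel

# 2D -> 3D: the same pixel lifts to different points for different depths.
lifted_near = unproject(pixel, 10.0, **K)
lifted_far = unproject(pixel, 20.0, **K)
print(pixel, same_pixel, lifted_near, lifted_far)
```

Every view-transformation method has to resolve this ambiguity somehow, for example by predicting depth or by letting 3D queries gather evidence from the image.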

17 of 38

BEVFormer

Fuses multi-camera and temporal features using deformable attention.

BEV queries Q: queried to obtain the BEV feature map
Spatial cross-attention: fuses multi-camera features
Temporal self-attention: aggregates temporal BEV features
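
The lookup idea behind the BEV queries and spatial cross-attention can be sketched in a heavily simplified, non-learned form: each BEV cell is lifted to a pillar of 3D reference points, the points are projected into every camera, and the features they hit are averaged. This is not the BEVFormer implementation: the learned deformable offsets and attention weights are replaced by a plain average, and the two-camera rig, `ego_to_cam`, `sample`, `bev_lookup` and all numeric values are assumptions for illustration.

```python
def ego_to_cam(point, facing, cam_height=1.5):
    """Map an ego-frame point (x forward, y left, z up) into the frame of a
    camera (x right, y down, z forward) that faces 'front' or 'rear'."""
    x, y, z = point
    sign = 1.0 if facing == "front" else -1.0
    return (-sign * y, cam_height - z, sign * x)


def sample(feature_map, cam_point, fx, fy, cx, cy):
    """Nearest-neighbour sample of a single-channel feature map.
    Returns None when the point is not visible in this camera."""
    x, y, z = cam_point
    if z <= 0:
        return None  # behind the camera
    u = round(fx * x / z + cx)
    v = round(fy * y / z + cy)
    if 0 <= v < len(feature_map) and 0 <= u < len(feature_map[0]):
        return feature_map[v][u]
    return None


def bev_lookup(cameras, xs, ys, heights, intrinsics):
    """For every BEV cell, average the features hit by its pillar of points."""
    bev = []
    for x in xs:
        row = []
        for y in ys:
            hits = []
            for facing, fmap in cameras.items():
                for z in heights:
                    cam_point = ego_to_cam((x, y, z), facing)
                    value = sample(fmap, cam_point, **intrinsics)
                    if value is not None:
                        hits.append(value)
            row.append(sum(hits) / len(hits) if hits else 0.0)
        bev.append(row)
    return bev


# Constant feature maps (front = 1.0, rear = 2.0) make the result easy to check.
front = [[1.0] * 8 for _ in range(6)]
rear = [[2.0] * 8 for _ in range(6)]
intrinsics = dict(fx=4.0, fy=4.0, cx=4.0, cy=3.0)
bev = bev_lookup({"front": front, "rear": rear},
                 xs=[-10.0, 10.0], ys=[-2.0, 2.0], heights=[0.0, 1.5],
                 intrinsics=intrinsics)
print(bev)
```

Cells behind the vehicle only receive rear-camera features and cells ahead only receive front-camera features, which mirrors the idea that each BEV query interacts only with the views its reference points fall into.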

21 of 38

PersFormer

Input: Image in perspective view

Output: Lane lines in 3D space

Conventional 2D Lane Detection

Segmentation: SCNN [1], LaneAF [2]
Anchor-based detection: LaneATT [3], CondLaneNet (row-wise) [4]

[1] SCNN, AAAI 2018
[2] LaneAF, RA-L 2021
[3] LaneATT, CVPR 2021
[4] CondLaneNet, ICCV 2021

Problem

IPM (inverse perspective mapping): the flat-ground assumption does not always hold
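
A small numeric sketch shows how IPM goes wrong when the road is not flat. It assumes a forward-looking camera with zero pitch mounted 1.5 m above the nominal ground plane; the intrinsics in `K` and the sample lane points are hypothetical values.

```python
def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point (x right, y down, z forward)."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def ipm(pixel, cam_height, fx, fy, cx, cy):
    """Inverse perspective mapping: intersect the pixel's viewing ray with
    the assumed flat ground plane lying cam_height below the camera."""
    u, v = pixel
    if v <= cy:
        return None  # at or above the horizon, the ray never hits the ground
    forward = cam_height * fy / (v - cy)
    lateral = (u - cx) * forward / fx
    return (lateral, forward)


K = dict(fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)  # assumed intrinsics
CAM_HEIGHT = 1.5

# Lane point on flat ground, 30 m ahead and 2 m to the right.
flat = ipm(project((2.0, 1.5, 30.0), **K), CAM_HEIGHT, **K)

# Same lane point on an uphill road, 0.5 m above the assumed plane.
uphill = ipm(project((2.0, 1.0, 30.0), **K), CAM_HEIGHT, **K)

print(flat)    # approximately (2.0, 30.0): recovered correctly
print(uphill)  # approximately (3.0, 45.0): pushed 15 m too far and 1 m sideways
```

A 0.5 m height error at 30 m already shifts the lane point by 15 m in BEV, which is why lane detection in 3D cannot rely on a fixed flat-ground homography.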

22 of 38

PersFormer

Problem on the algorithm side

Overly simple designs of the view transformation between perspective view and bird's-eye view (BEV)

Problem on the dataset side

Lack of a large-scale real-world dataset

23 of 38

PersFormer

PersFormer (Perspective Transformer): an end-to-end 3D lane line detector with a Transformer-based spatial feature transformation module

OpenLane: a large-scale real-world 3D lane line dataset

24 of 38

PersFormer

IPM-based cross attention for BEV feature
Unified anchor design (2D/3D lane detection)
Auxiliary task (BEV lane segmentation)

25 of 38

OpenLane

Previously there was no large-scale real-world dataset for 3D lane detection, only small-scale synthetic datasets
Built upon the Waymo Open Dataset, providing a playground for multi-task learning in future work

26 of 38

BEVFormer & PersFormer

BEV perception has been prevalent in academia since 2022.

29 of 38

3D Occupancy Prediction

What is the problem with current 3D perception representations?

A 3D bounding box (a) ignores the detailed geometry of an irregular object, while 3D occupancy (b) captures the geometric shape well.

Mobileye (c) and Tesla (d) also favor such a representation.

[Figure: panels (a) 3D bounding box, (b) 3D occupancy, (c) Mobileye, (d) Tesla]
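
The contrast between (a) and (b) can be made concrete with a toy voxel grid. In this minimal sketch an L-shaped object (for example, a vehicle with a raised arm) is voxelized and compared against the axis-aligned box that encloses it. The object, the 1 m voxel size and both function names are assumptions for illustration.

```python
def voxelize(points, voxel_size):
    """Return the set of integer voxel indices containing at least one point."""
    return {tuple(int(c // voxel_size) for c in p) for p in points}


def bounding_box_voxels(voxels):
    """All voxel indices inside the axis-aligned box enclosing the voxels."""
    mins = [min(v[i] for v in voxels) for i in range(3)]
    maxs = [max(v[i] for v in voxels) for i in range(3)]
    return {(i, j, k)
            for i in range(mins[0], maxs[0] + 1)
            for j in range(mins[1], maxs[1] + 1)
            for k in range(mins[2], maxs[2] + 1)}


# Hypothetical L-shaped object: a 4 m base along x plus a 3 m mast along z.
base = [(x + 0.5, 0.5, 0.5) for x in range(4)]
mast = [(0.5, 0.5, z + 0.5) for z in range(1, 4)]

occupied = voxelize(base + mast, voxel_size=1.0)
box = bounding_box_voxels(occupied)

print(len(occupied), len(box), len(occupied) / len(box))
```

Here the box claims 16 voxels although the object fills only 7 of them, so more than half of the box volume is free space that a box-based representation would report as occupied.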

30 of 38

3D Occupancy Prediction

How does 3D perception evolve into 3D occupancy?

37 of 38

Scene as Occupancy

38 of 38

Challenge Stats

The most fiercely contested challenge this year

Participants: ranging from academia to industry
Teams: 149
Countries / Regions: 10
Submissions: 400+

SOTA performance more than doubles the baseline

mIoU: 23.7 (baseline) -> 54.19 (state of the art)
Metric variance (top three): 1.4
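
For readers unfamiliar with the metric, the sketch below computes mean intersection-over-union (mIoU) over per-voxel class labels. It is a generic illustration only: the challenge's exact evaluation protocol (class list, visibility masks, ignored labels) is not given here, and the function name and toy labels are assumptions.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU across classes for flat lists of per-voxel class labels.
    Classes absent from both prediction and target are skipped."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with six voxels and three classes.
target = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(round(score, 3))  # per-class IoUs are 1/3, 2/3 and 1/2, so the mean is 0.5
```

On the reported numbers, 54.19 / 23.7 is roughly 2.29, which is the basis for the statement that the best entry more than doubles the baseline.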
