1 of 39

From BEV to Scene-as-Occupancy:

An Overview of Camera 3D Perception

Chonghao Sima

Shanghai AI Laboratory | 上海人工智能实验室


2 of 39

Why we need 3D perception

RoboTaxi

Autonomous Driving

Truck

ADAS System

Delivery

Robotics & Embodied AI

Housework

Agriculture

Logistics system

Industry

Why camera-based? Low cost, easy to deploy, long range, and rich in semantic appearance.


6 of 39

Core Issues in Camera-only 3D Perception

Accurate Depth: Bridging the gap between camera-based and LiDAR-based methods

[1] Depth Estimation Matters Most: Improving Per-Object Depth Estimation for Monocular 3D Detection and Tracking, arxiv:2206.03666

How to solve?

  • Pseudo-LiDAR Track
    • Depth prediction to form the pseudo-LiDAR
  • Center-point Track
    • Heatmap in 2D to infer pose in 3D
  • Depth Pre-Training
    • Embed 3D prior in 2D pre-trained backbone
  • BEV View Transformation
    • Transforming perspective view feature to BEV
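As a rough illustration of the pseudo-LiDAR track, a predicted depth map can be unprojected into a camera-frame point cloud with the pinhole model. This is a minimal sketch, not any specific paper's implementation; the function name, the intrinsics (f, cx, cy), and the toy depth map are all assumptions made for the example.

```python
def depth_to_pseudo_lidar(depth, f, cx, cy):
    """Unproject a depth map (list of rows, metres) into camera-frame points.

    Assumed camera frame: x right, y down, z forward (pinhole model).
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            x = (u - cx) * z / f
            y = (v - cy) * z / f
            points.append((x, y, z))
    return points


# Toy 2x2 "predicted" depth map; values are illustrative only.
toy_depth = [[10.0, 0.0],
             [20.0, 5.0]]
pseudo_points = depth_to_pseudo_lidar(toy_depth, f=100.0, cx=0.5, cy=0.5)
```

The resulting points could then be passed to a LiDAR-style detector; any error in the predicted depth moves each point along its viewing ray.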


10 of 39

Trends in BEV Perception

2021.7

  • HDMapNet
    • Outputs a BEV map
    • Given an HD map in BEV coordinates
    • Proposes aggregating features extracted from both camera and LiDAR

2021.10-12

  • DETR3D, BEVDet
    • Implicitly processing BEV features
    • Fused object detection using omnidirectional cameras in BEV

2022.3

  • BEVFormer, PersFormer
    • Explicitly constructing the BEV representation via camera parameters
    • Explicitly processing BEV features

2022.5

  • BEVFusion (Damo Academy), BEVFusion (MIT), FUTR3D
    • Multimodal feature fusion in BEV
    • Fusion on the dimension of BEV features

Core Question: How to model the View Transformation from perspective view to BEV more effectively?


14 of 39

View Transformation

Issues:

  • From 3D to 2D:
    • Multiple 3D points will hit the same 2D pixel
  • From 2D to 3D:
    • Depth is unknown

Either way, the transformation is ill-posed
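Both issues can be seen in a few lines with a pinhole camera. The sketch below assumes made-up intrinsics and a camera frame with x right, y down, z forward; the function names are hypothetical.

```python
def project(point, f, cx, cy):
    """3D -> 2D: pinhole projection of a camera-frame point to a pixel."""
    x, y, z = point
    return (f * x / z + cx, f * y / z + cy)


def back_project(pixel, depth, f, cx, cy):
    """2D -> 3D: only defined once a depth value is supplied."""
    u, v = pixel
    return ((u - cx) * depth / f, (v - cy) * depth / f, depth)


F, CX, CY = 1000.0, 960.0, 540.0  # assumed intrinsics

# From 3D to 2D: two different points on the same ray hit the same pixel.
near = (1.0, 0.5, 10.0)
far = (3.0, 1.5, 30.0)
pixel_near = project(near, F, CX, CY)
pixel_far = project(far, F, CX, CY)

# From 2D to 3D: the same pixel yields different points for different depths.
guess_a = back_project(pixel_near, 10.0, F, CX, CY)
guess_b = back_project(pixel_near, 30.0, F, CX, CY)
```

Here `pixel_near` and `pixel_far` coincide, while `guess_a` and `guess_b` differ, so some prior (predicted depth, or a 3D query projected into the image) is needed to pick one.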


15 of 39

Two Ways to Address View Transformation

Way 1: From-2D-to-3D prior

    • Since depth is unknown, predict it
      • Lift, Splat, Shoot and its derivatives
      • Pseudo-LiDAR family

Way 2: From-3D-to-2D prior

    • Index local features according to the projection from 3D to 2D
      • DETR3D and its derivatives
      • Explicit BEV feature
    • Implicit 3D positional embedding
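Way 2 can be sketched as follows: take 3D reference points, project them into the image, and read the feature at the pixel each one lands on. This is a simplified illustration under assumed intrinsics, with nearest-pixel lookup on a scalar feature map. Real methods typically interpolate and learn the reference points, and the function names here are hypothetical.

```python
import math


def project_to_cell(point, f, cx, cy, width, height):
    """Return (row, col) of the feature cell a 3D point projects to, or None."""
    x, y, z = point
    if z <= 0:  # behind the camera
        return None
    col = math.floor(f * x / z + cx)
    row = math.floor(f * y / z + cy)
    if 0 <= col < width and 0 <= row < height:
        return (row, col)
    return None  # outside the field of view


def sample_reference_points(ref_points, feature_map, f, cx, cy):
    """Index image features by projecting each 3D reference point to 2D."""
    height, width = len(feature_map), len(feature_map[0])
    sampled = []
    for point in ref_points:
        cell = project_to_cell(point, f, cx, cy, width, height)
        sampled.append(None if cell is None else feature_map[cell[0]][cell[1]])
    return sampled


# Toy 4x4 feature map whose value encodes its own location (10*row + col).
toy_features = [[10 * r + c for c in range(4)] for r in range(4)]
toy_refs = [(1.0, 1.0, 4.0), (0.0, 0.0, 2.0), (0.0, 0.0, -1.0), (10.0, 0.0, 1.0)]
sampled = sample_reference_points(toy_refs, toy_features, f=4.0, cx=2.0, cy=1.0)
```

No depth is predicted here: the 3D location is given by the query, and only the projection (which is well defined) is used.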


19 of 39

BEVFormer

Fuses multi-camera and temporal features using deformable attention.

  • BEV queries Q: used as lookups to obtain the BEV feature map
  • Spatial cross-attention: fuses multi-camera features
  • Temporal self-attention: aggregates temporal BEV features
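To make the BEV query grid concrete, the sketch below maps a grid cell to a location around the ego vehicle and lifts it into a pillar of 3D reference points, which spatial cross-attention could then project into each camera. The grid size, cell size, and heights are illustrative assumptions, not values taken from the slides, and the attention itself is not shown.

```python
def bev_cell_to_ego(col, row, bev_w, bev_h, cell_size):
    """Map a BEV grid index to metric (x, y), with the ego at the grid centre."""
    x = (col - bev_w / 2) * cell_size
    y = (row - bev_h / 2) * cell_size
    return (x, y)


def pillar_reference_points(col, row, bev_w, bev_h, cell_size, heights):
    """Lift one BEV cell into 3D reference points at several assumed heights."""
    x, y = bev_cell_to_ego(col, row, bev_w, bev_h, cell_size)
    return [(x, y, z) for z in heights]


# Assumed configuration: 200 x 200 cells of 0.5 m, four sampled heights.
BEV_W, BEV_H, CELL = 200, 200, 0.5
HEIGHTS = [-1.0, 0.0, 1.0, 2.0]

centre = bev_cell_to_ego(100, 100, BEV_W, BEV_H, CELL)
corner = bev_cell_to_ego(0, 0, BEV_W, BEV_H, CELL)
pillar = pillar_reference_points(120, 100, BEV_W, BEV_H, CELL, HEIGHTS)
```

Each query therefore only needs to attend to the image regions its pillar projects to, rather than to every pixel of every camera.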



22 of 39

PersFormer

Input: Image in perspective view

Output: Lane lines in 3D space

Conventional 2D lane detection

  • Segmentation
    • SCNN [1]
    • LaneAF [2]
  • Anchor-based detection
    • LaneATT [3]
  • Row-wise
    • CondLaneNet [4]

[1] SCNN, AAAI 2018

[2] LaneAF, RA-L 2021

[3] LaneATT, CVPR 2021

[4] CondLaneNet, ICCV 2021

Problem

IPM (inverse perspective mapping): the assumption of flat ground does not always hold
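A small numeric sketch shows how much the flat-ground assumption can matter. It assumes a forward-looking pinhole camera with zero pitch, mounted 1.5 m above the ground, with made-up intrinsics; IPM is implemented as a ray-plane intersection, and the names are hypothetical.

```python
def project(point, f, cx, cy):
    """Pinhole projection; camera frame is x right, y down, z forward."""
    x, y, z = point
    return (f * x / z + cx, f * y / z + cy)


def ipm_ground_point(pixel, f, cx, cy, cam_height):
    """Back-project a pixel onto the flat ground plane y = cam_height."""
    u, v = pixel
    dy = (v - cy) / f
    if dy <= 0:
        raise ValueError("pixel is at or above the horizon")
    t = cam_height / dy  # forward distance where the ray meets the plane
    return (t * (u - cx) / f, cam_height, t)


F, CX, CY, CAM_H = 1000.0, 960.0, 540.0, 1.5  # assumed camera setup

# A lane point 30 m ahead on flat ground, and one 30 m ahead but 1 m higher
# (for example, on an uphill stretch).
flat_point = (1.0, CAM_H, 30.0)
uphill_point = (1.0, CAM_H - 1.0, 30.0)

flat_est = ipm_ground_point(project(flat_point, F, CX, CY), F, CX, CY, CAM_H)
uphill_est = ipm_ground_point(project(uphill_point, F, CX, CY), F, CX, CY, CAM_H)
```

Under these assumptions the flat point is recovered at 30 m, while the uphill point is placed about 90 m ahead and about 3 m to the side: a 1 m height change triples the estimated distance.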


23 of 39

PersFormer

  • Problem on the algorithm side
    • Overly simple designs for the view transformation between perspective view and bird’s-eye view (BEV)

  • Problem on the dataset side
    • Lack of a large-scale real-world dataset


24 of 39

PersFormer

  • PersFormer (Perspective Transformer), an end-to-end monocular 3D lane line detector with a Transformer-based spatial feature transformation module

  • OpenLane, a large-scale real-world 3D lane line dataset


25 of 39

PersFormer

    • IPM-based cross attention for BEV feature
    • Unified anchor design (2D/3D lane detection)
    • Auxiliary task (BEV lane segmentation)


26 of 39

OpenLane

    • Previously there was no large-scale real-world dataset for 3D lane detection, only small-scale synthetic datasets
    • Built upon the Waymo Open Dataset, providing a playground for multi-task learning in future work


27 of 39

BEVFormer & PersFormer

BEV perception has been prevalent in academia since 2022.


30 of 39

3D Occupancy Prediction

What is the problem with the current 3D perception representation?

  • A 3D bbox (a) ignores the detailed geometry of an irregular object,
  • while 3D occupancy (b) captures the geometric shape well.

  • Mobileye (c) and Tesla (d) also favor such a representation.

[Figure panels: (a) 3D bbox, (b) 3D occupancy, (c) Mobileye, (d) Tesla]
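The difference between the two representations can be illustrated with a toy irregular object built from two axis-aligned blocks (an L-shape, in integer voxel units). The shape and the helper names are assumptions chosen for the example.

```python
def voxelize(blocks):
    """Return the set of occupied voxel indices for a union of blocks.

    Each block is (x0, y0, z0, x1, y1, z1) in integer voxel units,
    with the upper bounds exclusive.
    """
    occupied = set()
    for x0, y0, z0, x1, y1, z1 in blocks:
        for i in range(x0, x1):
            for j in range(y0, y1):
                for k in range(z0, z1):
                    occupied.add((i, j, k))
    return occupied


def bbox_volume(voxels):
    """Volume (in voxels) of the tightest axis-aligned box around the voxels."""
    volume = 1
    for axis in range(3):
        coords = [v[axis] for v in voxels]
        volume *= max(coords) - min(coords) + 1
    return volume


# L-shaped object: a flat base plus a vertical part at one end.
l_shape = [(0, 0, 0, 4, 2, 1), (0, 0, 1, 1, 2, 3)]
occupied = voxelize(l_shape)
fill_ratio = len(occupied) / bbox_volume(occupied)
```

For this shape the bounding box spans 24 voxels but only 12 are occupied, so a box-only representation marks half of its volume as object when it is actually free space.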


31 of 39

3D Occupancy Prediction

How does 3D perception evolve into 3D occupancy?


38 of 39

Scene as Occupancy


39 of 39

Challenge Stats

  • The most fiercely contested challenge this year
    • Participants: ranging from academia to industry
    • Teams: 149
    • Countries / Regions: 10
    • Submissions: 400+
  • SOTA performance more than doubles the baseline
    • mIoU: 23.7 (baseline) -> 54.19 (state of the art)
    • Metric variance (top three): 1.4
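For reference, mean IoU over semantic classes can be computed as below. This is a generic sketch on a flattened list of voxel labels; the challenge's exact protocol (for example, which voxels or classes are ignored) is not given in the slides, so those details are left out, and the toy labels are made up.

```python
def mean_iou(pred, target, num_classes):
    """Average per-class IoU = TP / (TP + FP + FN) over classes that appear."""
    ious = []
    for c in range(num_classes):
        tp = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        fp = sum(1 for p, t in zip(pred, target) if p == c and t != c)
        fn = sum(1 for p, t in zip(pred, target) if p != c and t == c)
        union = tp + fp + fn
        if union == 0:
            continue  # class absent from both prediction and ground truth
        ious.append(tp / union)
    return sum(ious) / len(ious)


# Toy example with six voxels and three classes.
toy_target = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
toy_miou = mean_iou(toy_pred, toy_target, num_classes=3)

# Relative gain reported on the slide: state of the art vs. baseline.
gain = 54.19 / 23.7
```

On the toy labels the per-class IoUs are 1/3, 2/3 and 1/2, giving an mIoU of 0.5; the reported numbers correspond to a gain of roughly 2.3x over the baseline.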
