1 of 14

Javed Ahmad1,2, Alessio Del Bue2

Multimodal Fusion for 3D Object Detection

Preprint at arXiv

1Università degli Studi di Genova

2Istituto Italiano di Tecnologia

2 of 14

3D Scene Perception

Automated Mapping [1]

Augmented Reality [2]

Autonomous Driving [3]

Robot Vision System [4]

[1] Mapillary.com

[2] MEMEXProject.eu

[3] Motional.com

[4] Universal-Robots.com

3 of 14

3D Scene Perception – Multimodal Scenario

Perception

Fusion

combine information

3D Features

2D Features

Scene Model

  • Object Localization
  • Object Detection
  • Scene Segmentation

LiDAR

Cameras

4 of 14

Multimodal Fusion for 3D Object Detection – The Problem

How to Fuse?

LiDAR: 3D points

Camera: 2D RGB images

  • Diverse Sensors
  • Complementary Information
  • Image: Rich semantic features in 2D
  • LiDAR: Sparse but accurate in 3D localization

“A fusion scheme to leverage complementary information from diverse sensor such as LiDAR and Camera and perform 3D object detection”[1]

[1] Ahmad, J., & Del Bue, A. (2023). mmFUSION: Multimodal Fusion for 3D Objects Detection. arXiv preprint arXiv:2311.04058.

5 of 14

Multimodal Fusion for 3D Object Detection – The Idea

[Diagram: Multi-Sensors Data (Image, LiDAR Points, RADAR Points) → Network (Img-Backbone, Pts-Backbone, Pts-Backbone) → Common Space → Multimodal Fusion → Fused Information → Head (Box Head, Cls Head) → 3D Detection]

6 of 14

Multimodal Fusion for 3D Object Detection – The Fusion Problem

[Diagram: three fusion schemes, each taking an Image and LiDAR points as input and ending in "To Detection"]

  • a) Early Fusion [1]
  • b) Late Fusion [2]
  • c) Intermediate (Ours)

Diagram labels: Proj., Seg. / RoIs., proposals, Augmentation, Voxelized, Up-Proj., 3D Space, Features in Lower Volume Space, mmFUSION Features

[1] Early Fusion: Painted pointRCNN, MVXNet (as early), Fusion-Painting, IPOD, Frustum-PointNet, Frustum-ConvNet

[2] Late Fusion: MV3D, F-PointNet, UberATG-MMF, AVOD, MVXNet (as late)
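The "Augmentation" step of early fusion can be illustrated with point painting, where each LiDAR point is decorated with image semantics before it reaches the detector. The sketch below is a minimal NumPy illustration and not the implementation of any method cited above; the function name `paint_points`, the array layouts, and the assumption that every point has already been projected to a pixel location are hypothetical.

```python
import numpy as np

def paint_points(points, pixels, seg_scores):
    """Append per-pixel class scores to LiDAR points (early fusion sketch).

    points:     (N, 3) xyz coordinates in the LiDAR frame.
    pixels:     (N, 2) integer (u, v) pixel location of each point, assumed
                to be precomputed with the camera calibration.
    seg_scores: (H, W, C) per-pixel class scores from an image network.
    Returns an (M, 3 + C) array for the M points that land inside the image.
    """
    h, w, _ = seg_scores.shape
    u, v = pixels[:, 0], pixels[:, 1]
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    # Rows are indexed by v (image y), columns by u (image x).
    sampled = seg_scores[v[inside], u[inside]]
    return np.concatenate([points[inside], sampled], axis=1)

# Toy data: a 2 x 4 image with two classes and three LiDAR points.
points = np.array([[5.0, 1.0, 0.2], [9.0, -2.0, 0.1], [30.0, 0.0, 0.5]])
pixels = np.array([[1, 0], [2, 1], [7, 0]])  # last point falls outside the image
seg_scores = np.zeros((2, 4, 2))
seg_scores[0, 1] = [0.9, 0.1]
seg_scores[1, 2] = [0.2, 0.8]
painted = paint_points(points, pixels, seg_scores)
```

The point that projects outside the image is dropped here for simplicity; a real pipeline might keep it with zero scores instead. The intermediate scheme in panel c) differs in that it fuses learned features of both modalities rather than augmenting raw points.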

7 of 14

Multimodal Fusion for 3D Object Detection – Proposed Approach

[Architecture diagram, Parts A to D]

  • Image branch: Image, Backbone + Neck, Proj. Mat, Up-Proj. in 3D Volume, Img – Encoder
  • LiDAR branch: LiDAR, Voxelization [1400, 1600, 40], LiDAR – Encoder
  • mmFUSION: Cross-Modality Attention, Multi-Modality Attention, Joint Feature Generation, Decoder Layers
  • Layer labels: 3D Conv - Img, 3D Conv - LiDAR, 3D Conv, ReLU, ResModule
  • Tensor sizes: [176, 200, 20]; [128, 5, 200, 176] for each modality; Concat(128, 128); [256, 5, 200, 176]
  • 3D Detection Head: 3D Box Head, CLS Head
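The tensor sizes on this slide can be sanity-checked with a few lines of NumPy. The sketch below assumes a detection range and voxel size that happen to reproduce the [1400, 1600, 40] grid; the slide does not state these values, so `RANGE_MIN`, `RANGE_MAX`, `VOXEL_SIZE`, and the x, y, z axis order are hypothetical. It also shows how concatenating two 128-channel feature volumes gives the 256-channel volume, using a reduced spatial grid to keep the example light.

```python
import numpy as np

# Assumed values, not stated on the slide: a detection range and voxel size
# chosen so that the grid matches the [1400, 1600, 40] voxelization.
RANGE_MIN = np.array([0.0, -40.0, -3.0])  # x, y, z in metres (hypothetical)
RANGE_MAX = np.array([70.0, 40.0, 1.0])
VOXEL_SIZE = np.array([0.05, 0.05, 0.1])

def grid_shape():
    """Number of voxels along x, y, z."""
    return np.round((RANGE_MAX - RANGE_MIN) / VOXEL_SIZE).astype(int)

def voxel_index(point):
    """Integer voxel coordinates of a single xyz point inside the range."""
    return np.floor((np.asarray(point) - RANGE_MIN) / VOXEL_SIZE).astype(int)

def concat_modalities(img_feat, lidar_feat):
    """Stack image and LiDAR feature volumes along the channel axis."""
    if img_feat.shape != lidar_feat.shape:
        raise ValueError("both modalities must share one feature volume shape")
    return np.concatenate([img_feat, lidar_feat], axis=0)

# The slide uses [128, 5, 200, 176] per modality; a 4 x 4 spatial grid is
# used here only to keep memory use small.
img_feat = np.zeros((128, 5, 4, 4), dtype=np.float32)
lidar_feat = np.ones((128, 5, 4, 4), dtype=np.float32)
fused_input = concat_modalities(img_feat, lidar_feat)

shape = grid_shape()
index = voxel_index([10.02, 0.03, -0.95])
print(shape, index, fused_input.shape)
```

Because both modalities live in the same volume, the concatenation needs no per-point correspondence search; the channel axis simply grows from 128 to 256.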

8 of 14

Multimodal Fusion for 3D Object Detection – Visualization

[Diagram: Multi-Sensors Data (Image, LiDAR Points) → Img-Backbone, Pts-Backbone → Common Space → mmFUSION → Fused Information → Box Head, Cls Head → 3D Detection, shown with Image and LiDAR visualizations]

9 of 14

Multimodal Fusion for 3D Object Detection – Experiments

KITTI Setup

Comparison of our mmFUSION with well-known fusion schemes on the KITTI dataset (front camera + LiDAR)

Metric: AP11 at IoU = 0.7 on KITTI (val. set)

Metric: AP40 at IoU = 0.7 on KITTI (val. set)

Metric: AP40 at IoU = 0.7 on KITTI (test set)
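AP11 and AP40 differ only in the recall positions at which the precision-recall curve is sampled: AP11 averages interpolated precision at recall 0, 0.1, ..., 1, while AP40 uses 1/40, 2/40, ..., 1 and so skips recall 0. The sketch below illustrates that difference on a toy curve; the function name `interpolated_ap` and the toy values are hypothetical, and the matching of detections to ground truth at IoU = 0.7 is assumed to have been done beforehand.

```python
import numpy as np

def interpolated_ap(recall, precision, num_points):
    """Average precision sampled at fixed recall positions.

    recall, precision: 1-D arrays describing a precision-recall curve,
    obtained after matching detections to ground truth (for example at
    IoU = 0.7); that matching step is not shown here.
    num_points: 11 samples recall at 0, 0.1, ..., 1 (AP11);
                40 samples recall at 1/40, 2/40, ..., 1 (AP40).
    """
    recall = np.asarray(recall, dtype=float)
    precision = np.asarray(precision, dtype=float)
    if num_points == 11:
        positions = np.linspace(0.0, 1.0, 11)
    elif num_points == 40:
        positions = np.linspace(1.0 / 40, 1.0, 40)
    else:
        raise ValueError("num_points must be 11 or 40")
    total = 0.0
    for r in positions:
        # Small tolerance guards against floating-point noise in positions.
        reachable = recall >= r - 1e-9
        # Interpolated precision: best precision at any recall >= r,
        # or 0 if the detector never reaches that recall.
        total += precision[reachable].max() if reachable.any() else 0.0
    return total / num_points

# Toy curve with three operating points (made-up numbers).
recall = np.array([0.1, 0.4, 0.7])
precision = np.array([1.0, 0.8, 0.5])
ap11 = interpolated_ap(recall, precision, 11)
ap40 = interpolated_ap(recall, precision, 40)
print(ap11, ap40)
```

On this toy curve AP11 comes out higher than AP40, largely because the recall-0 sample always contributes the best precision on the curve. Numbers reported under the two metrics are therefore not directly comparable.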

10 of 14

Multimodal Fusion for 3D Object Detection – Experiments

nuScenes Setup

Comparison of our mmFUSION with well-known fusion schemes on the nuScenes dataset (6 cameras + LiDAR)

11 of 14

Multimodal Fusion for 3D Object Detection – Ablation Study

12 of 14

Multimodal Fusion for 3D Object Detection – Latency

13 of 14

Conclusions

Given multi-sensor input (LiDAR and camera):

We devised a novel intermediate fusion scheme, mmFUSION, for multimodal 3D object detection, a task central to applications such as autonomous driving and robotics.

14 of 14

A Basic Requirement for Multi-Sensor Based Perception

[Diagram: axes of the World Coordinates System, LiDAR Coordinates System, and Camera Coordinates System, together with the Image coordinates]
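Relating these coordinate systems is what allows LiDAR points and image pixels to be combined. A minimal sketch of the LiDAR to camera to image chain with a pinhole model is given below; the function name `lidar_to_image` and the calibration values `T` and `K` are hypothetical and stand in for whatever calibration a real dataset provides. A world frame would add one more rigid transform of the same form.

```python
import numpy as np

def lidar_to_image(points_lidar, T_cam_from_lidar, K):
    """Project LiDAR points into pixel coordinates with a pinhole model.

    points_lidar:     (N, 3) xyz points in the LiDAR coordinate system.
    T_cam_from_lidar: (4, 4) rigid transform from LiDAR to camera coordinates.
    K:                (3, 3) camera intrinsic matrix.
    Returns (pixels, depth, in_front): (N, 2) pixel coordinates, (N,) depth
    along the camera z axis, and a mask of points in front of the camera.
    """
    n = points_lidar.shape[0]
    homogeneous = np.hstack([points_lidar, np.ones((n, 1))])
    points_cam = (T_cam_from_lidar @ homogeneous.T).T[:, :3]
    depth = points_cam[:, 2]
    in_front = depth > 0
    projected = (K @ points_cam.T).T
    # Perspective division; points behind the camera get NaN pixels.
    safe_depth = np.where(in_front, depth, np.nan)
    pixels = projected[:, :2] / safe_depth[:, None]
    return pixels, depth, in_front

# Hypothetical calibration: LiDAR x forward, y left, z up; camera z forward,
# x right, y down; both sensors at the same position.
T = np.array([[0.0, -1.0, 0.0, 0.0],
              [0.0, 0.0, -1.0, 0.0],
              [1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
K = np.array([[700.0, 0.0, 600.0],
              [0.0, 700.0, 180.0],
              [0.0, 0.0, 1.0]])
points = np.array([[10.0, 1.0, 0.5], [-5.0, 0.0, 0.0]])
pixels, depth, in_front = lidar_to_image(points, T, K)
print(pixels, depth, in_front)
```

The second point lies behind the camera, so it is masked out rather than projected; without that check it would map to a misleading pixel location.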