Javed Ahmad1,2, Alessio Del Bue2
Multimodal Fusion for 3D Object Detection
Preprint at arXiv
1Università degli Studi di Genova
2Istituto Italiano Di Tecnologia
3D Scene Perception
Automated Mapping [1]
Augmented Reality [2]
Autonomous Driving [3]
Robot Vision System [4]
[1] Mapillary.com
[2] MEMEXProject.eu
[3] Motional.com
[4] Universal-Robots.com
3D Scene Perception – Multimodal Scenario
[Figure: LiDAR supplies 3D features and cameras supply 2D features; fusion combines the information into a scene model used for perception.]
Multimodal Fusion for 3D Object Detection – The Problem
How to fuse?
LiDAR: 3D points
Camera: 2D RGB images
“A fusion scheme to leverage complementary information from diverse sensors such as LiDAR and camera and perform 3D object detection” [1]
[1] Ahmad, J., & Del Bue, A. (2023). mmFUSION: Multimodal Fusion for 3D Objects Detection. arXiv preprint arXiv:2311.04058.
Multimodal Fusion for 3D Object Detection – The Idea
[Figure: Multimodal fusion network. Multi-sensor data (image, LiDAR points, RADAR points) pass through per-modality backbones (Img-Backbone, Pts-Backbone) into a common space; the fused information feeds the 3D detection head (Box Head, Cls Head).]
Multimodal Fusion for 3D Object Detection – The Fusion problem
a) Early Fusion [1]
b) Late Fusion [2]
c) Intermediate (Ours)
[Figure: the three fusion schemes, each taking an image and LiDAR points as input and ending in detection. Diagram labels: Proj., Seg. / RoIs., proposals, Augmentation, Voxelized, Up-Proj., 3D Space, Features in Lower Volume Space, mmFUSION Features.]
[1] Early Fusion: Painted PointRCNN, MVXNet (as early), Fusion-Painting, IPOD, Frustum-PointNet, Frustum-ConvNet
[2] Late Fusion: MV3D, F-PointNet, UberATG-MMF, AVOD, MVXNet (as late)
Multimodal Fusion for 3D Object Detection – Proposed Approach
[Figure: mmFUSION architecture, shown in four parts (Part A to Part D).]
Image branch: Image, Img-Encoder (Backbone + Neck), Up-Proj. in 3D Volume with Proj. Mat, feature shape [128, 5, 200, 176]
LiDAR branch: LiDAR, Voxelization [1400, 1600, 40], LiDAR-Encoder [176, 200, 20], feature shape [128, 5, 200, 176]
mmFUSION: 3D Conv - Img, 3D Conv - LiDAR, Joint Feature Generation with Concat(128, 128), Cross-Modality Attention, Multi-Modality Attention, ResModule (3D Conv, ReLU), Decoder Layers, feature shape [256, 5, 200, 176]
3D Detection Head: 3D Box Head, CLS Head
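The voxelization step in the architecture figure lists a grid of [1400, 1600, 40]. As a minimal sketch of how such a grid can arise, the snippet below maps a LiDAR point to integer voxel indices. The point-cloud range and voxel size are assumptions chosen only so that the grid matches the slide; the slides do not state them, and `grid_shape` and `voxel_index` are hypothetical helpers, not the authors' code.

```python
import math

# Assumed values (not given on the slides): range in metres as
# (x_min, y_min, z_min, x_max, y_max, z_max) and voxel edge lengths.
PC_RANGE = (0.0, -40.0, -3.0, 70.0, 40.0, 1.0)
VOXEL_SIZE = (0.05, 0.05, 0.1)


def grid_shape(pc_range=PC_RANGE, voxel_size=VOXEL_SIZE):
    """Number of voxels along x, y, z for the given range and voxel size."""
    mins, maxs = pc_range[:3], pc_range[3:]
    return tuple(round((hi - lo) / s) for lo, hi, s in zip(mins, maxs, voxel_size))


def voxel_index(point, pc_range=PC_RANGE, voxel_size=VOXEL_SIZE):
    """Return the (ix, iy, iz) voxel containing `point`, or None if out of range."""
    shape = grid_shape(pc_range, voxel_size)
    idx = tuple(
        math.floor((p - lo) / s)
        for p, lo, s in zip(point, pc_range[:3], voxel_size)
    )
    if all(0 <= i < n for i, n in zip(idx, shape)):
        return idx
    return None


print(grid_shape())                       # (1400, 1600, 40)
print(voxel_index((10.02, 0.02, -1.05)))  # (200, 800, 19)
```

Points that share a voxel index would then be grouped before being passed to the LiDAR encoder; points outside the assumed range are simply dropped here.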
[Figure: pipeline overview. Multi-sensor data (image, LiDAR points) pass through Img-Backbone and Pts-Backbone into a common space, where mmFUSION produces the fused information for 3D detection (Box Head, Cls Head).]
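The architecture figure labels the joint feature as Concat(128, 128) alongside a shape of [256, 5, 200, 176]. A minimal sketch of that shape bookkeeping follows, assuming both modality volumes are laid out as [C, D, H, W] and concatenated along the channel axis. This is one reading of the figure labels rather than the authors' implementation, and `concat_channels_shape` is a hypothetical helper.

```python
def concat_channels_shape(img_shape, lidar_shape):
    """Shape of the channel-wise concatenation of two [C, D, H, W] feature volumes.

    The spatial dimensions (D, H, W) must agree, which is what placing both
    modalities in a common 3D volume provides.
    """
    if len(img_shape) != 4 or len(lidar_shape) != 4:
        raise ValueError("expected [C, D, H, W] shapes")
    if tuple(img_shape[1:]) != tuple(lidar_shape[1:]):
        raise ValueError("spatial dimensions differ; features are not in a common space")
    return (img_shape[0] + lidar_shape[0], *img_shape[1:])


fused = concat_channels_shape((128, 5, 200, 176), (128, 5, 200, 176))
print(fused)  # (256, 5, 200, 176)
```

The check on the spatial dimensions is the point of the sketch: concatenation is only defined once image and LiDAR features occupy the same volume.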
Multimodal Fusion for 3D Object Detection – Visualization
Metric: AP11-IoU=0.7 on KITTI (val. set)
Metric: AP40-IoU=0.7 on KITTI (val. set)
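The two metric labels differ in how many recall positions are sampled when averaging interpolated precision. AP11 samples 11 recall levels (0, 0.1, ..., 1.0), while AP40 samples 40 levels (1/40, 2/40, ..., 1.0) and omits recall 0. A minimal sketch, assuming a precision-recall curve is already available as (recall, precision) pairs; `interpolated_ap` is a hypothetical helper and not the official KITTI evaluation code.

```python
def interpolated_ap(pr_points, num_samples):
    """Average interpolated precision over sampled recall levels.

    pr_points: list of (recall, precision) pairs from a detector's PR curve.
    num_samples: 11 -> recall levels 0, 0.1, ..., 1.0 (AP11)
                 40 -> recall levels 1/40, 2/40, ..., 1.0 (AP40)
    """
    if num_samples == 11:
        levels = [i / 10 for i in range(11)]
    elif num_samples == 40:
        levels = [i / 40 for i in range(1, 41)]
    else:
        raise ValueError("num_samples must be 11 or 40")

    total = 0.0
    for r in levels:
        # Interpolated precision: best precision at any recall >= r (0 if none).
        candidates = [p for rec, p in pr_points if rec >= r]
        total += max(candidates, default=0.0)
    return total / len(levels)


# Toy curve: perfect precision up to recall 0.5, then 0.5 precision at recall 1.0.
curve = [(0.0, 1.0), (0.5, 1.0), (1.0, 0.5)]
ap11 = interpolated_ap(curve, 11)
ap40 = interpolated_ap(curve, 40)
```

On the toy curve the two values differ (8.5/11 for AP11 versus 0.75 for AP40), which is why results reported under the two protocols are not directly comparable.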
Multimodal Fusion for 3D Object Detection – Experiments
Comparison of our mmFUSION with well-known fusion schemes on the KITTI dataset (front camera + LiDAR)
Metric: AP40-IoU=0.7 on KITTI (Test set)
KITTI Setup
Comparison of our mmFUSION with well-known fusion schemes on the nuScenes dataset (6 cameras + LiDAR)
Multimodal Fusion for 3D Object Detection – Experiments
nuScenes Setup
Multimodal Fusion for 3D Object Detection – Ablation Study
Multimodal Fusion for 3D Object Detection – Latency
Conclusions
Given multi-sensor input (LiDAR and camera):
We devised a novel fusion scheme for multimodal data to tackle 3D object detection, with direct relevance to applications in autonomous driving and robotics.
A Basic Requirement for Multi-Sensor-Based Perception
[Figure: the coordinate frames involved: world coordinate system, LiDAR coordinate system, camera coordinate system, and image coordinates.]
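Moving between these frames is a chain of transforms: a rigid transform takes a LiDAR point into the camera frame, and the camera intrinsics take it to image coordinates. A minimal sketch with made-up calibration values follows; `T_CAM_FROM_LIDAR`, `K`, `matvec`, and `project_lidar_to_image` are hypothetical names, and a real dataset supplies its own calibration matrices.

```python
# Assumed calibration (illustrative values only, not from any dataset).
# 4x4 rigid transform from a LiDAR frame (x forward, y left, z up)
# to a camera frame (x right, y down, z forward), with zero translation.
T_CAM_FROM_LIDAR = [
    [0.0, -1.0, 0.0, 0.0],
    [0.0, 0.0, -1.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
# 3x3 pinhole intrinsics: focal lengths (700, 700), principal point (600, 180).
K = [
    [700.0, 0.0, 600.0],
    [0.0, 700.0, 180.0],
    [0.0, 0.0, 1.0],
]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def project_lidar_to_image(point_lidar, T=T_CAM_FROM_LIDAR, K=K):
    """Project a 3D LiDAR point to pixel coordinates (u, v).

    Returns None when the point lies behind the camera (depth <= 0).
    """
    x, y, z = point_lidar
    xc, yc, zc, _ = matvec(T, [x, y, z, 1.0])  # LiDAR frame -> camera frame
    if zc <= 0:
        return None
    u, v, w = matvec(K, [xc, yc, zc])          # camera frame -> image plane
    return (u / w, v / w)


print(project_lidar_to_image((10.0, 1.0, -0.5)))  # (530.0, 215.0)
```

A world-to-LiDAR transform would be composed in the same way, as one more 4x4 matrix applied before `T_CAM_FROM_LIDAR`. Accurate calibration of these matrices is what makes the image and LiDAR features line up in a common space.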