Fields recorded for each paper: name, time, venue, title, link, notes, category, tl;dr, predecessor, backbone, 3D size, 3D shape, keypoints, 3D orientation, distance, 2D-to-3D tight optimization, required input, drawbacks, tricks and contributions, insights.

Mono3D (1512, CVPR 2016)
Title: Mono3D: Monocular 3D Object Detection for Autonomous Driving
Link: https://www.cs.toronto.edu/~urtasun/publications/chen_etal_cvpr16.pdf
Notes: mono3d.md
Category: direct 3D proposal
tl;dr: The pioneering paper on monocular 3D object detection (3DOD), with tons of hand-crafted features.
Predecessor: Mono3D; backbone: Faster RCNN
3D size: from 3 templates per class
3D shape: None; keypoints: None
3D orientation: scoring of dense proposals
Distance: scoring of dense proposals
2D-to-3D tight optimization: None
Required input: 2D bbox, 2D seg mask, 3D bbox
Tricks and contributions: shared feature maps (Mono3D)

Deep3DBox (1612, CVPR 2017)
Title: Deep3DBox: 3D Bounding Box Estimation Using Deep Learning and Geometry
Link: https://arxiv.org/abs/1612.00496
Notes: deep3dbox.md
Category: 2D/3D tight constraint
tl;dr: Monocular 3D object detection (3DOD) using the 2D bbox and geometry constraints.
Predecessor: Deep3DBox; backbone: MS-CNN
3D size: L2 loss for offset from subtype average
3D shape: None; keypoints: None
3D orientation: multi-bin for yaw
Distance: 2D/3D optimization
2D-to-3D tight optimization: the original Deep3DBox optimization
Required input: 2D bbox, 3D bbox, intrinsics
Drawbacks: locks in the error from 2D object detection

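A minimal sketch (not the authors' code) of the Deep3DBox-style 2D/3D tight constraint: given camera intrinsics K, object dimensions, global yaw, and a 2D box, solve for the 3D translation T such that the projected 3D box fits tightly inside the 2D box. Corner-to-edge assignments are brute-forced here for clarity; the paper prunes them using the estimated viewpoint.

```python
import itertools
import numpy as np

def solve_translation(K, dims, yaw, box2d):
    h, w, l = dims                      # object height, width, length (meters)
    xmin, ymin, xmax, ymax = box2d      # 2D box in pixels
    # 8 corners of the 3D box in object coordinates (origin at the box center)
    corners = np.array(list(itertools.product([l / 2, -l / 2],
                                              [h / 2, -h / 2],
                                              [w / 2, -w / 2])))  # (8, 3)
    R = np.array([[ np.cos(yaw), 0, np.sin(yaw)],
                  [ 0,           1, 0          ],
                  [-np.sin(yaw), 0, np.cos(yaw)]])
    corners = corners @ R.T             # rotate into camera-aligned axes

    best_T, best_err = None, np.inf
    # each 2D box edge (xmin, ymin, xmax, ymax) is touched by one projected 3D corner
    for idx in itertools.product(range(8), repeat=4):
        A, b = [], []
        for i, (edge, row) in enumerate(zip([xmin, ymin, xmax, ymax], [0, 1, 0, 1])):
            X = corners[idx[i]]
            # projection constraint: (K[row] - edge * K[2]) @ (X + T) = 0
            M = K[row] - edge * K[2]
            A.append(M)
            b.append(-M @ X)
        A, b = np.array(A), np.array(b)
        T, *_ = np.linalg.lstsq(A, b, rcond=None)   # 4 equations, 3 unknowns
        err = np.linalg.norm(A @ T - b)
        if T[2] > 0 and err < best_err:             # keep solutions in front of the camera
            best_T, best_err = T, err
    return best_T
```
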
Deep MANTA (1703, CVPR 2017)
Title: Deep MANTA: A Coarse-to-fine Many-Task Network for joint 2D and 3D vehicle analysis from monocular image
Link: https://arxiv.org/abs/1703.07570
Notes: deep_manta.md
Category: keypoints and shapes
tl;dr: Predict keypoints and use 2D/3D keypoint matching (EPnP) to get the position and orientation of the 3D bbox.
Predecessor: None; backbone: cascaded Faster RCNN
3D size: template classification, scaled by a scaling factor
3D shape: template classification, scaled by a scaling factor; keypoints: 36 keypoints
3D orientation: 6DoF pose by 2D/3D matching (EPnP)
Distance: 6DoF pose by 2D/3D matching (EPnP)
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, 103 3D CAD models with 36 keypoint annotations
Tricks and contributions: semi-auto labeling by fitting templates into the 3D bbox

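A minimal sketch (not the Deep MANTA code) of recovering 6DoF pose from predicted 2D keypoints and the matching 3D keypoints of a selected CAD template, using OpenCV's EPnP solver. `keypoints_2d`, `template_3d`, and `K` are placeholders for the network output, the chosen template, and the intrinsics.

```python
import cv2
import numpy as np

def pose_from_keypoints(keypoints_2d, template_3d, K):
    """keypoints_2d: (N, 2) pixel coords; template_3d: (N, 3) model coords; K: (3, 3)."""
    ok, rvec, tvec = cv2.solvePnP(
        template_3d.astype(np.float32),
        keypoints_2d.astype(np.float32),
        K.astype(np.float32),
        distCoeffs=None,
        flags=cv2.SOLVEPNP_EPNP,
    )
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)   # object rotation in the camera frame
    return R, tvec               # tvec gives the 3D position (and hence the distance)
```
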
3D-RCNN (1712, CVPR 2018)
Title: 3D-RCNN: Instance-level 3D Object Reconstruction via Render-and-Compare
Link: http://openaccess.thecvf.com/content_cvpr_2018/papers/Kundu_3D-RCNN_Instance-Level_3D_CVPR_2018_paper.pdf
Notes: 3d_rcnn.md
Category: keypoints and shapes
tl;dr: Inverse graphics: predict shape and pose, then render and compare.
Predecessor: Deep3DBox; backbone: Faster RCNN
3D size: subtype average
3D shape: TSDF encoding, PCA, 10-dim space; keypoints: 2D projection of the 3D center
3D orientation: viewpoint (azimuth, elevation, tilt) with improved weighted-average multi-bin
Distance: find depth by moving along the viewing ray until the 3D box fits tightly into the 2D box
2D-to-3D tight optimization: yes, move the 3D box along the ray until it fits tightly into the 2D bbox
Required input: 2D bbox, 3D bbox, 3D CAD

MLF (1712, CVPR 2018)
Title: MLF: Multi-Level Fusion based 3D Object Detection from Monocular Images
Link: http://openaccess.thecvf.com/content_cvpr_2018/papers/Xu_Multi-Level_Fusion_Based_CVPR_2018_paper.pdf
Notes: mlf.md
Category: feature transformation
tl;dr: Estimate a depth map from monocular RGB and concatenate it to form RGBD for monocular 3DOD.
Predecessor: Deep3DBox; backbone: Faster RCNN
3D size: offset from whole-dataset average
3D shape: None; keypoints: None
3D orientation: multi-bin, and smooth L1 for cos and sin
Distance: MonoDepth, smooth L1 for depth regression
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, pretrained depth model
Drawbacks: pretrained depth model
Tricks and contributions: point cloud as a 3-channel xyz map

MonoGRNet (1811, AAAI 2019)
Title: MonoGRNet: A Geometric Reasoning Network for Monocular 3D Object Localization
Link: https://arxiv.org/abs/1811.10247
Notes: monogrnet.md
Category: keypoints and shapes
tl;dr: Use the same network to estimate instance depth, the 2D bbox, and the 3D bbox.
Predecessor: MonoGRNet; backbone: MultiNet (YOLO + RoIAlign)
3D size: regress 8 corners in an allocentric coordinate system
3D shape: None; keypoints: 2D projection of the 3D center
3D orientation: regress 8 corners in an allocentric coordinate system
Distance: instance depth estimation (IDE) on a grid
Required input: 2D bbox, 3D bbox, intrinsics, depth map
Drawbacks: requires a depth map for training
Tricks and contributions: 2D/3D center loss, local/global corner loss; stagewise training to start 3D after 2D
Insights: instance depth estimation: pixel-level depth estimation does not focus on object localization by design; estimate the depth of the nearest object instance instead

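A minimal sketch (assumed, not from the paper) of the geometric step this design relies on: once the network predicts the 2D projection (u, v) of the 3D box center and the instance depth z, the 3D center is recovered by unprojecting with the camera intrinsics.

```python
import numpy as np

def unproject_center(u, v, z, K):
    """Lift the projected 3D-center pixel (u, v) with instance depth z to camera coordinates."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])
```
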
OFT (1811, BMVC 2019)
Title: OFT: Orthographic Feature Transform for Monocular 3D Object Detection
Link: https://arxiv.org/abs/1811.08188
Notes: oft.md
Category: feature transformation
tl;dr: Learn a projection of the camera image to BEV for 3D object detection.
Predecessor: OFT; backbone: ResNet18 + ResNet16 top-down network
3D size: L1 loss for offset from subtype average in log space
3D shape: None; keypoints: None
3D orientation: L1 on cos and sin
Distance: positional offset in BEV space from local peaks
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox (intrinsics learned)
Tricks and contributions: top-down network to reason in BEV

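A minimal sketch (heavily simplified relative to the paper) of an orthographic feature transform: voxel centers on a BEV grid are projected into the image with the intrinsics, and the image feature at that pixel is copied into the BEV map. The paper instead averages features over each voxel's projected area using integral images and collapses the height axis with learned weights; the single ground-plane height `y_cam` and nearest-neighbour sampling here are assumptions for brevity.

```python
import numpy as np

def image_to_bev(feat, K, x_range=(-40, 40), z_range=(0, 80), y_cam=1.5, res=0.5):
    """feat: (C, H, W) image feature map; K: (3, 3) intrinsics scaled to the feature map size."""
    C, H, W = feat.shape
    xs = np.arange(x_range[0], x_range[1], res)
    zs = np.arange(z_range[0], z_range[1], res)
    bev = np.zeros((C, len(zs), len(xs)), dtype=feat.dtype)
    for i, z in enumerate(zs):
        for j, x in enumerate(xs):
            if z <= 0:
                continue
            p = K @ np.array([x, y_cam, z])        # project a point near the ground plane
            u, v = int(p[0] / p[2]), int(p[1] / p[2])
            if 0 <= u < W and 0 <= v < H:
                bev[:, i, j] = feat[:, v, u]       # nearest-neighbour sampling for brevity
    return bev
```
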
Mono3D Track (1811, ICCV 2019)
Title: Joint Monocular 3D Vehicle Detection and Tracking
Link: https://arxiv.org/abs/1811.10742
Notes: mono_3d_tracking.md
Category: direct 3D proposal
tl;dr: Add LSTM-based 3D tracking on top of monocular 3D object detection.
Predecessor: Deep3DBox; backbone: Faster RCNN
3D size: L1 loss for offset from subtype average
3D shape: None; keypoints: 2D projection of the 3D center
3D orientation: multi-bin for local yaw in two bins
Distance: L1 loss on the inverse of the regressed disparity
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, intrinsics
Insights: regressing the 2D projection of the 3D center helps recover the amodal 3D bbox

GPP (1811, ArXiv)
Title: GPP: Ground Plane Polling for 6DoF Pose Estimation of Objects on the Road
Link: https://arxiv.org/abs/1811.06666
Notes: gpp.md
Category: keypoints and shapes
tl;dr: Regress the tireline and height, and project onto the best-fitting ground plane near the car.
Predecessor: GPP; backbone: RetinaNet + 2D/3D head
3D size: refined from subtype average
3D shape: None; keypoints: 2D projection of tirelines (observer-facing vertices)
3D orientation: coarse (8-bin) viewpoint classification
Distance: IPM based on the best-fitting ground plane
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, intrinsics, fitted road planes
Drawbacks: needs to collect and fit road data
Tricks and contributions: able to predict local road pose
Insights: NA

ROI-10D (1812, CVPR 2019)
Title: ROI-10D: Monocular Lifting of 2D Detection to 6D Pose and Metric Shape
Link: https://arxiv.org/abs/1812.02781
Notes: roi10d.md
Category: keypoints and shapes
tl;dr: Concatenate a depth map and a coordinate map to the RGB features, plus 2DOD and car shape reconstruction (6-dim latent space), for monocular 3DOD.
Predecessor: 3D-RCNN; backbone: Faster RCNN with FPN
3D size: offset from whole-dataset average
3D shape: TSDF encoding, 3D autoencoder, 6-dim space; keypoints: None
3D orientation: 4-d quaternion
Distance: regress depth z
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, intrinsics, pretrained depth model
Tricks and contributions: 8-corner loss; stagewise training to start 3D after 2D

Pseudo-Lidar (1812, CVPR 2019)
Title: Pseudo-LiDAR from Visual Depth Estimation: Bridging the Gap in 3D Object Detection for Autonomous Driving
Link: https://arxiv.org/abs/1812.07179
Notes: pseudo_lidar.md
Category: feature transformation
tl;dr: Estimate a depth map from the RGB image (mono/stereo) and use it to lift RGB to a point cloud.
Predecessor: Pseudo-lidar; backbone: Frustum-PointNet / AVOD
3D size: 3DOD on point cloud
3D shape: None; keypoints: None
3D orientation: 3DOD on point cloud
Distance: DORN depth estimation
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, intrinsics, pretrained depth model
Drawbacks: pretrained depth model
Insights: data representation matters

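A minimal sketch (assuming a KITTI-style pinhole camera and depth in meters) of the pseudo-lidar lifting step: every pixel of a predicted depth map is back-projected into a 3D point, turning the depth map into a point cloud that a lidar-based detector such as Frustum-PointNet or AVOD can consume.

```python
import numpy as np

def depth_to_pseudo_lidar(depth, K):
    """depth: (H, W) depth map in meters; K: (3, 3) camera intrinsics."""
    H, W = depth.shape
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)   # (H*W, 3) point cloud
```
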
Mono3D++ (1901, AAAI 2019)
Title: Mono3D++: Monocular 3D Vehicle Detection with Two-Scale 3D Hypotheses and Task Priors
Link: https://arxiv.org/abs/1901.03446
Notes: mono3d++.md
Category: keypoints and shapes
tl;dr: Monocular 3DOD based on 3D and 2D consistency, in particular landmarks and shape reconstruction.
Predecessor: DeepMANTA; backbone: SSD for 2D bbox, stacked hourglass for keypoints, MonoDepth for depth
3D shape: N basis shapes (N=?); keypoints: 14 landmarks
3D orientation: CE classification over 360 bins
Distance: MonoDepth, L1 loss
Required input: 2D bbox, 3D bbox, pretrained depth model, 3D CAD model with keypoints
Insights: cars should stay on the ground, should look like a car, and should be at a reasonable distance; how to ensure 2D/3D consistency between generated 3D vehicle hypotheses

GS3D (1903, CVPR 2019)
Title: GS3D: An Efficient 3D Object Detection Framework for Autonomous Driving
Link: https://arxiv.org/abs/1903.10955
Notes: gs3d.md
Category: 2D/3D tight constraint
tl;dr: Get a 3D bbox proposal ("guidance") from the 2D bbox plus prior knowledge, then refine the 3D bbox using surface features.
Predecessor: GS3D; backbone: Faster RCNN with VGG16 (2D+O)
3D size: subtype average
3D shape: None; keypoints: None
3D orientation: from RoIAligned features (possibly multi-bin)
Distance: approximated with bbox height * 0.93
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, intrinsics
Tricks and contributions: quality-aware loss, surface feature extraction

Pseudo-Lidar Color (1903, ICCV 2019)
Title: Accurate Monocular 3D Object Detection via Color-Embedded 3D Reconstruction for Autonomous Driving
Link: https://arxiv.org/abs/1903.11444
Notes: pseudo_lidar_color.md
Category: feature transformation
tl;dr: Concurrent work with Pseudo-lidar, but with color embedding.
Predecessor: Pseudo-lidar; backbone: Frustum-PointNet
3D size: 3DOD on point cloud
3D shape: None; keypoints: None
3D orientation: 3DOD on point cloud
Distance: various pretrained depth weights
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, intrinsics, pretrained depth model

BirdGAN (1904, IROS 2019)
Title: BirdGAN: Learning 2D to 3D Lifting for Object Detection in 3D for Autonomous Vehicles
Link: https://arxiv.org/abs/1904.08494
Notes: birdgan.md
Category: feature transformation
tl;dr: Learn to map the 2D perspective image to BEV with a GAN.
Predecessor: BirdGAN; backbone: DCGAN
3D size: oriented 2DOD on BEV point cloud
3D shape: None; keypoints: None
3D orientation: oriented 2DOD on BEV point cloud
Distance: oriented 2DOD on BEV point cloud
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox (intrinsics learned)
Drawbacks: in the clipping case, the frontal detectable depth is only about 10 to 15 meters

FQNet (1904, CVPR 2019)
Title: FQNet: Deep Fitting Degree Scoring Network for Monocular 3D Object Detection
Link: https://arxiv.org/abs/1904.12681
Notes: fqnet.md
Category: 2D/3D tight constraint
tl;dr: Train a network to score the 3D IoU of a projected 3D wireframe against the GT.
Predecessor: Deep3DBox; backbone: MS-CNN
3D size: k-means clustering and multi-bin
3D shape: None; keypoints: None
3D orientation: k-means clustering and multi-bin
Distance: approximated via optimization
2D-to-3D tight optimization: similar to Deep3DBox (details in appendix)
Required input: 2D bbox, 3D bbox, intrinsics

MonoPSR (1904, CVPR 2019)
Title: MonoPSR: Monocular 3D Object Detection Leveraging Accurate Proposals and Shape Reconstruction
Link: https://arxiv.org/abs/1904.01690
Notes: monopsr.md
Category: 2D/3D tight constraint
tl;dr: 3DOD by generating 3D proposals first and then reconstructing the local point cloud of the dynamic object.
Predecessor: Deep3DBox, Pseudo-lidar; backbone: MS-CNN
3D size: L2 loss for offset from subtype average
3D shape: None; keypoints: None
3D orientation: multi-bin for yaw
Distance: approximated with bbox height, then regress the residual from RoIAligned features
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, intrinsics
Tricks and contributions: shared feature maps (Mono3D)

CenterNet (1904, ArXiv)
Title: Objects as Points
Link: https://arxiv.org/pdf/1904.07850.pdf
Notes: centernet_ut.md
Category: direct 3D proposal
tl;dr: Object detection as detection of the center point of the object and regression of its associated properties.
Predecessor: CenterNet; backbone: DLA (U-Net-like)
3D size: L1 loss over absolute dimensions
3D shape: None; keypoints: None
3D orientation: multi-bin for global yaw in two overlapping bins
Distance: L1 loss on the inverse of the regressed disparity
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, intrinsics
Tricks and contributions: highly flexible network

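A minimal sketch (assumed output layout, not the authors' exact head) of decoding a multi-bin orientation output: the network classifies which angular bin the yaw falls into and regresses sin/cos of the residual within that bin; the decoded angle is the bin center plus the in-bin residual. The two bin centers below are assumptions for illustration.

```python
import numpy as np

def decode_multibin(bin_logits, residuals, bin_centers):
    """bin_logits: (B,); residuals: (B, 2) as (sin, cos); bin_centers: (B,) in radians."""
    b = int(np.argmax(bin_logits))                       # most confident bin
    sin_r, cos_r = residuals[b]
    angle = bin_centers[b] + np.arctan2(sin_r, cos_r)    # bin center + in-bin residual
    return (angle + np.pi) % (2 * np.pi) - np.pi         # wrap to [-pi, pi)

# Example with two overlapping bins (assumed centers).
bin_centers = np.array([-np.pi / 2, np.pi / 2])
```
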
MonoDIS (1905, ICCV 2019)
Title: MonoDIS: Disentangling Monocular 3D Object Detection
Link: https://arxiv.org/abs/1905.12365
Notes: monodis.md
Category: direct 3D proposal
tl;dr: End-to-end training of 2D and 3D heads on top of RetinaNet for monocular 3D object detection.
Predecessor: MonoGRNet; backbone: RetinaNet + 2D/3D head
3D size: offset from whole-dataset average, learned via 3D corner loss
3D shape: None; keypoints: 2D projection of the 3D center
3D orientation: learned via 3D corner loss
Distance: regressed from dataset average, learned via 3D corner loss
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, intrinsics
Tricks and contributions: signed IoU loss (pulls boxes together even before they intersect), disentangled learning
Insights: the disentangling transformation splits the original combinational loss (e.g., size and location of the bbox at the same time) into groups; each group contains the loss of only one group of parameters, with the rest taken from the GT

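A minimal sketch (with an assumed `corners()` helper that builds the 8 box corners from center, dimensions, and yaw) of the disentangling idea: the 3D corner loss is evaluated once per parameter group, with only that group taken from the prediction and every other group replaced by its ground-truth value.

```python
import numpy as np

def corners(center, dims, yaw):
    """Hypothetical helper: the 8 corners (8, 3) of a 3D box from its parameters."""
    h, w, l = dims
    x = np.array([ l,  l, -l, -l,  l,  l, -l, -l]) / 2
    y = np.array([ h,  h,  h,  h, -h, -h, -h, -h]) / 2
    z = np.array([ w, -w,  w, -w,  w, -w,  w, -w]) / 2
    R = np.array([[ np.cos(yaw), 0, np.sin(yaw)],
                  [ 0,           1, 0          ],
                  [-np.sin(yaw), 0, np.cos(yaw)]])
    return (R @ np.stack([x, y, z])).T + center

def disentangled_corner_loss(pred, gt):
    """pred/gt: dicts with keys 'center', 'dims', 'yaw'."""
    loss = 0.0
    for group in ("center", "dims", "yaw"):
        # only this group comes from the prediction; the rest are ground truth
        mixed = {k: (pred[k] if k == group else gt[k]) for k in gt}
        loss += np.abs(corners(**mixed) - corners(**gt)).mean()   # L1 over corners
    return loss
```
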
monogrnet_russian (1905, ArXiv)
Title: MonoGRNet 2: Monocular 3D Object Detection via Geometric Reasoning on Keypoints
Link: https://arxiv.org/abs/1905.05618
Notes: monogrnet_russian.md
Category: keypoints and shapes
tl;dr: Regress keypoints in the 2D image and use a 3D CAD model to infer depth.
Predecessor: DeepMANTA; backbone: Mask RCNN with FPN
3D size: smooth L1 loss for offset from subtype average in log space
3D shape: 5 CAD models; keypoints: 14 landmarks
3D orientation: multi-bin for yaw in 72 non-overlapping bins
Distance: approximated with windshield height
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, intrinsics
Tricks and contributions: semi-auto labeling by fitting templates into the 3D bbox

Pseudo-Lidar end2end (1905, ICCV 2019)
Title: Pseudo-lidar e2e: Monocular 3D Object Detection with Pseudo-LiDAR Point Cloud
Link: https://arxiv.org/abs/1903.09847
Notes: pseudo_lidar_e2e.md
Category: feature transformation
tl;dr: End-to-end pseudo-lidar training with a 2D/3D bbox consistency loss.
Predecessor: Pseudo-Lidar; backbone: Frustum-PointNet
3D size: 3DOD on point cloud
3D shape: None; keypoints: None
3D orientation: 3DOD on point cloud
Distance: DORN depth estimation
2D-to-3D tight optimization: bbox consistency loss
Required input: 2D bbox, 2D seg mask, 3D bbox, intrinsics
Drawbacks: pretrained depth model
Tricks and contributions: 2D/3D bbox consistency

Shift R-CNN (1905, IEEE ICIP 2019)
Title: Shift R-CNN: Deep Monocular 3D Object Detection with Closed-Form Geometric Constraints
Link: https://arxiv.org/abs/1905.09970
Notes: shift_rcnn.md
Category: 2D/3D tight constraint
tl;dr: Extends Deep3DBox by regressing residual center positions.
Predecessor: Deep3DBox; backbone: Faster RCNN
3D size: L2 loss for offset from subtype average
3D shape: None; keypoints: None
3D orientation: cos and sin, with unity constraint
Distance: approximated via optimization
2D-to-3D tight optimization: slightly different from Deep3DBox
Required input: 2D bbox, 3D bbox, intrinsics

BEV IPM OD (1906, IV 2019)
Title: BEV-IPM: Deep Learning based Vehicle Position and Orientation Estimation via Inverse Perspective Mapping Image
Link: https://ieeexplore.ieee.org/abstract/document/8814050
Notes: bev_od_ipm.md
Category: feature transformation
tl;dr: IPM of the pitch/roll-corrected camera image, then 2DOD performed on the IPM image.
Backbone: YOLOv3
3D size: oriented 2DOD on the BEV image
3D shape: None; keypoints: None
3D orientation: oriented 2DOD on the BEV image
Distance: oriented 2DOD on the BEV image
2D-to-3D tight optimization: None
Required input: 2D bbox, BEV oriented bbox, IMU correction
Drawbacks: only up to 40 meters
Tricks and contributions: motion cancellation using IMU
Insights: IPM assumptions: 1) the road is flat; 2) the mounting position of the camera is stationary (motion cancellation helps with this); 3) the vehicle to be detected is on the ground

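A minimal sketch (assumed calibration values, not the paper's pipeline) of inverse perspective mapping with OpenCV: four image points whose ground locations are known define a homography to a metric BEV image. The flat-road and fixed-camera assumptions listed above are exactly what make a single homography valid.

```python
import cv2
import numpy as np

# Hypothetical calibration: pixel coords of 4 ground points and their positions
# in meters (x lateral, z forward), plus the BEV rasterization resolution.
img_pts = np.float32([[400, 700], [880, 700], [700, 420], [580, 420]])
ground_m = np.float32([[-2.0, 8.0], [2.0, 8.0], [2.0, 40.0], [-2.0, 40.0]])
res = 0.05                                    # meters per BEV pixel
bev_size = (int(8 / res), int(40 / res))      # (width, height) of the BEV image

# Map metric ground coords to BEV pixel coords (x to the right, z away from the camera).
bev_pts = np.float32([[(x + 4.0) / res, (40.0 - z) / res] for x, z in ground_m])
H = cv2.getPerspectiveTransform(img_pts, bev_pts)

def to_bev(image):
    """Warp the (pitch/roll-corrected) camera image into the metric BEV view."""
    return cv2.warpPerspective(image, H, bev_size)
```
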
Pseudo-Lidar++ (1906, ArXiv)
Title: Pseudo-LiDAR++: Accurate Depth for 3D Object Detection in Autonomous Driving
Link: https://arxiv.org/abs/1906.06310
Notes: pseudo_lidar++.md
Category: feature transformation
tl;dr: Improve the depth estimation of pseudo-lidar with a stereo depth network (SDN) and sparse depth measurements on "landmark" pixels from few-line lidars.
Predecessor: Pseudo-lidar; backbone: Frustum-PointNet / AVOD
3D size: 3DOD on point cloud
3D shape: None; keypoints: None
3D orientation: 3DOD on point cloud
Distance: PSMNet fine-tuned stereo depth
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, pretrained depth model, sparse lidar data
Tricks and contributions: use sparse lidar to correct depth, stereo depth loss

SS3D (1906, ArXiv)
Title: SS3D: Monocular 3D Object Detection and Box Fitting Trained End-to-End Using Intersection-over-Union Loss
Link: https://arxiv.org/abs/1906.08070
Notes: ss3d.md
Category: direct 3D proposal
tl;dr: CenterNet-like structure that directly regresses 26 attributes per object to fit a 3D bbox.
Backbone: U-Net-like architecture
3D size: log size
3D shape: None; keypoints: 8 3D corners projected to 2D
3D orientation: cos and sin (multi-bin not suitable)
Distance: directly regressed with indirect supervision from a 3D IoU loss
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, intrinsics
Tricks and contributions: models uncertainty, directly regresses 26 numbers, 20 fps inference

TLNet (1906, CVPR 2019)
Title: TLNet: Triangulation Learning Network: from Monocular to Stereo 3D Object Detection
Link: https://arxiv.org/abs/1906.01193
Notes: tlnet.md
Category: direct 3D proposal
tl;dr: Place 3D anchors inside the frustum subtended by the 2D object detection as the monocular baseline.
Backbone: Faster RCNN with two refinement stages
3D size: refined from dataset average
3D shape: None; keypoints: None
3D orientation: refined from 0- and 90-degree anchors
Distance: refined from 3D anchors
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, intrinsics
Tricks and contributions: stereo coherence score and channel reweighting

M3D-RPN (1907, ICCV 2019)
Title: M3D-RPN: Monocular 3D Region Proposal Network for Object Detection
Link: https://arxiv.org/abs/1907.06038
Notes: m3d_rpn.md
Category: direct 3D proposal
tl;dr: Regress 2D and 3D bbox parameters simultaneously by precomputing 3D mean statistics for each 2D anchor.
Backbone: Faster RCNN
3D size: log size relative to the 3D anchor size
3D shape: None; keypoints: None
3D orientation: smooth L1 directly on the angle, with post-processing to refine
Distance: refined from 3D anchors
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, intrinsics
Drawbacks: angle post-processing
Tricks and contributions: 2D anchors with 2D/3D properties, depth-aware convolution, negative log IoU loss for 2D detection, directly regress 12 numbers
Insights: reliance on additional sub-networks introduces persistent noise

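A minimal sketch (assumed data layout, not the released code) of the per-anchor 3D statistics behind this design: for every 2D anchor template, collect the 3D parameters (e.g., depth, dimensions, yaw) of all ground-truth boxes whose 2D box best matches that anchor, and store their means as the anchor's 3D prior, against which the network regresses offsets. `iou_fn` is a placeholder for any pairwise 2D IoU function.

```python
from collections import defaultdict
import numpy as np

def anchor_3d_priors(anchors_2d, gt_boxes_2d, gt_params_3d, iou_fn):
    """anchors_2d: (A, 4); gt_boxes_2d: (G, 4); gt_params_3d: (G, P); iou_fn: pairwise IoU."""
    buckets = defaultdict(list)
    for box2d, params3d in zip(gt_boxes_2d, gt_params_3d):
        ious = iou_fn(box2d[None], anchors_2d)[0]        # IoU of this GT with every anchor
        buckets[int(np.argmax(ious))].append(params3d)   # assign the GT to its best anchor
    priors = np.zeros((len(anchors_2d), gt_params_3d.shape[1]))
    for a, params in buckets.items():
        priors[a] = np.mean(params, axis=0)              # per-anchor mean depth/dims/yaw
    return priors
```
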
ForeSeE (1909, ArXiv)
Title: ForeSeE: Task-Aware Monocular Depth Estimation for 3D Object Detection
Link: https://arxiv.org/abs/1909.07701
Notes: foresee_mono3dod.md
Category: feature transformation
tl;dr: Train a depth estimator focused on foreground moving objects and improve pseudo-lidar-based 3DOD.
Predecessor: Pseudo-lidar; backbone: Frustum-PointNet / AVOD
3D size: 3DOD on point cloud
3D shape: None; keypoints: None
3D orientation: 3DOD on point cloud
Distance: learn foreground/background depth individually
Required input: 2D bbox, 3D bbox, depth map
Tricks and contributions: depth combination: the element-wise maximum of the confidence vectors over C depth bins is taken and passed through a softmax
Insights: not all pixels are equal; an estimation error on a car is very different from the same error on a building

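A minimal sketch (assumed tensor shapes) of the foreground/background depth fusion described above: each head predicts per-pixel confidences over C discrete depth bins; the two predictions are merged by an element-wise maximum and re-normalized with a softmax before decoding the expected depth.

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_depth(fg_logits, bg_logits, bin_centers):
    """fg_logits, bg_logits: (C, H, W) per-bin confidences; bin_centers: (C,) in meters."""
    merged = np.maximum(fg_logits, bg_logits)          # element-wise max over the two heads
    probs = softmax(merged, axis=0)                    # renormalize across depth bins
    return np.tensordot(bin_centers, probs, axes=1)    # (H, W) expected depth
```
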
CasGeo (1909, ArXiv)
Title: 3D Bounding Box Estimation for Autonomous Vehicles by Cascaded Geometric Constraints and Depurated 2D Detections Using 3D Results
Link: https://arxiv.org/abs/1909.01867
Notes: casgeom.md
Category: 2D/3D tight constraint
tl;dr: Extends Deep3DBox by regressing the 3D bbox center on the bottom edge and adding viewpoint classification.
Predecessor: Deep3DBox; backbone: MS-CNN
3D size: refined from subtype average
3D shape: None; keypoints: 2D projection of the bottom surface center
3D orientation: multi-bin for yaw, coarse (4-bin) viewpoint estimation
Distance: initialized from the top/bottom surface center projections and approximated via optimization (Gauss-Newton)
2D-to-3D tight optimization: similar to Deep3DBox (details in appendix)
Required input: 2D bbox, 3D bbox, intrinsics
Tricks and contributions: regress the projection of the 3D height to help with the initial guess of distance

MVRA (1910, ICCV 2019)
Title: MVRA: Multi-View Reprojection Architecture for Orientation Estimation
Link: http://openaccess.thecvf.com/content_ICCVW_2019/papers/ADW/Choi_Multi-View_Reprojection_Architecture_for_Orientation_Estimation_ICCVW_2019_paper.pdf
Notes: mvra.md
Category: 2D/3D tight constraint
tl;dr: Builds the 2D/3D constraint optimization into the neural network and uses an iterative method to refine cropped cases.
Predecessor: Deep3DBox; backbone: Faster RCNN
3D size: refined from subtype average
3D shape: None; keypoints: None
3D orientation: multi-bin for yaw, viewpoint estimation, iterative trial-and-error for truncated objects
Distance: approximated via optimization
2D-to-3D tight optimization: similar to Deep3DBox (details in appendix)
Required input: 2D bbox, 3D bbox, intrinsics
Tricks and contributions: predicts better for truncated bboxes
Insights: NA

monoloco (1906, ArXiv)
Title: MonoLoco: Monocular 3D Pedestrian Localization and Uncertainty Estimation
Link: https://arxiv.org/abs/1906.06059
Notes: monoloco.md
Category: keypoints and shapes
tl;dr: BEV localization for pedestrians with uncertainty.
Predecessor: monoloco; backbone: Mask RCNN / Pif-Paf + MLP
3D size: None; 3D shape: None; keypoints: 14 landmarks; 3D orientation: None
Distance: approximated with the shoulder-hip segment height
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, intrinsics

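A minimal sketch (assumed, not MonoLoco's learned model, which uses an MLP with uncertainty) of the geometric prior behind the shoulder-hip segment: under a pinhole camera, a segment of known metric length h_real observed with pixel height h_px at focal length fy gives a distance estimate d ≈ fy * h_real / h_px. The 0.5 m segment length below is an assumed placeholder.

```python
import numpy as np

def distance_from_segment(kp_shoulder, kp_hip, fy, h_real=0.5):
    """kp_*: (2,) pixel coords; h_real: assumed shoulder-to-hip length in meters."""
    h_px = abs(kp_hip[1] - kp_shoulder[1])     # vertical pixel extent of the segment
    return fy * h_real / max(h_px, 1e-6)       # similar-triangles depth estimate
```
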