Fields recorded for each paper: name, time, venue, title, link, notes, category, tl;dr, predecessor, backbone, 3D size, 3D shape, keypoints, 3D orientation, distance, 2D-to-3D tight optimization, required input, drawbacks, tricks and contributions, insights.

Mono3D (1512, CVPR 2016)
Title: Mono3D: Monocular 3D Object Detection for Autonomous Driving
Link: https://www.cs.toronto.edu/~urtasun/publications/chen_etal_cvpr16.pdf
Notes: mono3d.md
Category: direct 3D proposal
tl;dr: The pioneering paper on monocular 3D object detection (3DOD), with tons of hand-crafted features.
Predecessor: Mono3D; backbone: Faster RCNN
3D size: from 3 templates per class
3D shape: None; keypoints: None
3D orientation: scoring of dense proposals
Distance: scoring of dense proposals
2D-to-3D tight optimization: None
Required input: 2D bbox, 2D seg mask, 3D bbox
Tricks and contributions: shared feature maps (Mono3D)

Deep3DBox (1612, CVPR 2017)
Title: Deep3DBox: 3D Bounding Box Estimation Using Deep Learning and Geometry
Link: https://arxiv.org/abs/1612.00496
Notes: deep3dbox.md
Category: 2D/3D tight constraint
tl;dr: Monocular 3D object detection (3DOD) using the 2D bbox and geometry constraints.
Predecessor: Deep3DBox; backbone: MS-CNN
3D size: L2 loss for offset from subtype average
3D shape: None; keypoints: None
3D orientation: multi-bin for yaw
Distance: 2D/3D optimization
2D-to-3D tight optimization: the original Deep3DBox optimization
Required input: 2D bbox, 3D bbox, intrinsics
Drawbacks: locks in the error from 2D object detection

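A minimal sketch (not the authors' code) of the Deep3DBox-style 2D/3D tight constraint: given camera intrinsics K, object dimensions, global yaw, and a 2D box, solve for the 3D translation T such that the projected 3D box fits tightly inside the 2D box. Corner-to-edge assignments are brute-forced here for clarity; the paper prunes them using the estimated viewpoint.

```python
import itertools
import numpy as np

def solve_translation(K, dims, yaw, box2d):
    h, w, l = dims                      # object height, width, length (meters)
    xmin, ymin, xmax, ymax = box2d      # 2D box in pixels
    # 8 corners of the 3D box in object coordinates (origin at the box center)
    corners = np.array(list(itertools.product([l / 2, -l / 2],
                                              [h / 2, -h / 2],
                                              [w / 2, -w / 2])))  # (8, 3)
    R = np.array([[ np.cos(yaw), 0, np.sin(yaw)],
                  [ 0,           1, 0          ],
                  [-np.sin(yaw), 0, np.cos(yaw)]])
    corners = corners @ R.T             # rotate into camera-aligned axes

    best_T, best_err = None, np.inf
    # each 2D box edge (xmin, ymin, xmax, ymax) is touched by one projected 3D corner
    for idx in itertools.product(range(8), repeat=4):
        A, b = [], []
        for i, (edge, row) in enumerate(zip([xmin, ymin, xmax, ymax], [0, 1, 0, 1])):
            X = corners[idx[i]]
            # projection constraint: (K[row] - edge * K[2]) @ (X + T) = 0
            M = K[row] - edge * K[2]
            A.append(M)
            b.append(-M @ X)
        A, b = np.array(A), np.array(b)
        T, *_ = np.linalg.lstsq(A, b, rcond=None)   # 4 equations, 3 unknowns
        err = np.linalg.norm(A @ T - b)
        if T[2] > 0 and err < best_err:             # keep solutions in front of the camera
            best_T, best_err = T, err
    return best_T
```
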
Deep MANTA (1703, CVPR 2017)
Title: Deep MANTA: A Coarse-to-fine Many-Task Network for joint 2D and 3D vehicle analysis from monocular image
Link: https://arxiv.org/abs/1703.07570
Notes: deep_manta.md
Category: keypoints and shapes
tl;dr: Predict keypoints and use 2D/3D keypoint matching (EPnP) to get the position and orientation of the 3D bbox.
Predecessor: None; backbone: cascaded Faster RCNN
3D size: template classification, scaled by a scaling factor
3D shape: template classification, scaled by a scaling factor; keypoints: 36 keypoints
3D orientation: 6DoF pose by 2D/3D matching (EPnP)
Distance: 6DoF pose by 2D/3D matching (EPnP)
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, 103 3D CAD models with 36 keypoint annotations
Tricks and contributions: semi-auto labeling by fitting templates into the 3D bbox

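A minimal sketch (not the Deep MANTA code) of recovering 6DoF pose from predicted 2D keypoints and the matching 3D keypoints of a selected CAD template, using OpenCV's EPnP solver. `keypoints_2d`, `template_3d`, and `K` are placeholders for the network output, the chosen template, and the intrinsics.

```python
import cv2
import numpy as np

def pose_from_keypoints(keypoints_2d, template_3d, K):
    """keypoints_2d: (N, 2) pixel coords; template_3d: (N, 3) model coords; K: (3, 3)."""
    ok, rvec, tvec = cv2.solvePnP(
        template_3d.astype(np.float32),
        keypoints_2d.astype(np.float32),
        K.astype(np.float32),
        distCoeffs=None,
        flags=cv2.SOLVEPNP_EPNP,
    )
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)   # object rotation in the camera frame
    return R, tvec               # tvec gives the 3D position (and hence the distance)
```
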
3D-RCNN (1712, CVPR 2018)
Title: 3D-RCNN: Instance-level 3D Object Reconstruction via Render-and-Compare
Link: http://openaccess.thecvf.com/content_cvpr_2018/papers/Kundu_3D-RCNN_Instance-Level_3D_CVPR_2018_paper.pdf
Notes: 3d_rcnn.md
Category: keypoints and shapes
tl;dr: Inverse graphics: predict shape and pose, then render and compare.
Predecessor: Deep3DBox; backbone: Faster RCNN
3D size: subtype average
3D shape: TSDF encoding, PCA, 10-dim space; keypoints: 2D projection of the 3D center
3D orientation: viewpoint (azimuth, elevation, tilt) with improved weighted-average multi-bin
Distance: find depth by moving along the viewing ray until the 3D box fits tightly into the 2D box
2D-to-3D tight optimization: yes, move the 3D box along the ray until it fits tightly into the 2D bbox
Required input: 2D bbox, 3D bbox, 3D CAD

MLF (1712, CVPR 2018)
Title: MLF: Multi-Level Fusion based 3D Object Detection from Monocular Images
Link: http://openaccess.thecvf.com/content_cvpr_2018/papers/Xu_Multi-Level_Fusion_Based_CVPR_2018_paper.pdf
Notes: mlf.md
Category: feature transformation
tl;dr: Estimate a depth map from monocular RGB and concatenate it to form RGBD for monocular 3DOD.
Predecessor: Deep3DBox; backbone: Faster RCNN
3D size: offset from whole-dataset average
3D shape: None; keypoints: None
3D orientation: multi-bin, and smooth L1 for cos and sin
Distance: MonoDepth, smooth L1 for depth regression
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, pretrained depth model
Drawbacks: pretrained depth model
Tricks and contributions: point cloud as a 3-channel xyz map

MonoGRNet (1811, AAAI 2019)
Title: MonoGRNet: A Geometric Reasoning Network for Monocular 3D Object Localization
Link: https://arxiv.org/abs/1811.10247
Notes: monogrnet.md
Category: keypoints and shapes
tl;dr: Use the same network to estimate instance depth, the 2D bbox, and the 3D bbox.
Predecessor: MonoGRNet; backbone: MultiNet (YOLO + RoIAlign)
3D size: regress 8 corners in an allocentric coordinate system
3D shape: None; keypoints: 2D projection of the 3D center
3D orientation: regress 8 corners in an allocentric coordinate system
Distance: instance depth estimation (IDE) on a grid
Required input: 2D bbox, 3D bbox, intrinsics, depth map
Drawbacks: requires a depth map for training
Tricks and contributions: 2D/3D center loss, local/global corner loss; stagewise training to start 3D after 2D
Insights: instance depth estimation: pixel-level depth estimation does not focus on object localization by design; estimate the depth of the nearest object instance instead

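A minimal sketch (assumed, not from the paper) of the geometric step this design relies on: once the network predicts the 2D projection (u, v) of the 3D box center and the instance depth z, the 3D center is recovered by unprojecting with the camera intrinsics.

```python
import numpy as np

def unproject_center(u, v, z, K):
    """Lift the projected 3D-center pixel (u, v) with instance depth z to camera coordinates."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])
```
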
OFT (1811, BMVC 2019)
Title: OFT: Orthographic Feature Transform for Monocular 3D Object Detection
Link: https://arxiv.org/abs/1811.08188
Notes: oft.md
Category: feature transformation
tl;dr: Learn a projection of the camera image to BEV for 3D object detection.
Predecessor: OFT; backbone: ResNet18 + ResNet16 top-down network
3D size: L1 loss for offset from subtype average in log space
3D shape: None; keypoints: None
3D orientation: L1 on cos and sin
Distance: positional offset in BEV space from local peaks
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox (intrinsics learned)
Tricks and contributions: top-down network to reason in BEV

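A minimal sketch (heavily simplified relative to the paper) of an orthographic feature transform: voxel centers on a BEV grid are projected into the image with the intrinsics, and the image feature at that pixel is copied into the BEV map. The paper instead averages features over each voxel's projected area using integral images and collapses the height axis with learned weights; the single ground-plane height `y_cam` and nearest-neighbour sampling here are assumptions for brevity.

```python
import numpy as np

def image_to_bev(feat, K, x_range=(-40, 40), z_range=(0, 80), y_cam=1.5, res=0.5):
    """feat: (C, H, W) image feature map; K: (3, 3) intrinsics scaled to the feature map size."""
    C, H, W = feat.shape
    xs = np.arange(x_range[0], x_range[1], res)
    zs = np.arange(z_range[0], z_range[1], res)
    bev = np.zeros((C, len(zs), len(xs)), dtype=feat.dtype)
    for i, z in enumerate(zs):
        for j, x in enumerate(xs):
            if z <= 0:
                continue
            p = K @ np.array([x, y_cam, z])        # project a point near the ground plane
            u, v = int(p[0] / p[2]), int(p[1] / p[2])
            if 0 <= u < W and 0 <= v < H:
                bev[:, i, j] = feat[:, v, u]       # nearest-neighbour sampling for brevity
    return bev
```
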
Mono3D Track (1811, ICCV 2019)
Title: Joint Monocular 3D Vehicle Detection and Tracking
Link: https://arxiv.org/abs/1811.10742
Notes: mono_3d_tracking.md
Category: direct 3D proposal
tl;dr: Add LSTM-based 3D tracking on top of monocular 3D object detection.
Predecessor: Deep3DBox; backbone: Faster RCNN
3D size: L1 loss for offset from subtype average
3D shape: None; keypoints: 2D projection of the 3D center
3D orientation: multi-bin for local yaw in two bins
Distance: L1 loss on the inverse of the regressed disparity
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, intrinsics
Insights: regressing the 2D projection of the 3D center helps recover the amodal 3D bbox

GPP (1811, ArXiv)
Title: GPP: Ground Plane Polling for 6DoF Pose Estimation of Objects on the Road
Link: https://arxiv.org/abs/1811.06666
Notes: gpp.md
Category: keypoints and shapes
tl;dr: Regress the tireline and height, and project onto the best-fitting ground plane near the car.
Predecessor: GPP; backbone: RetinaNet + 2D/3D head
3D size: refined from subtype average
3D shape: None; keypoints: 2D projection of tirelines (observer-facing vertices)
3D orientation: coarse (8-bin) viewpoint classification
Distance: IPM based on the best-fitting ground plane
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, intrinsics, fitted road planes
Drawbacks: needs to collect and fit road data
Tricks and contributions: able to predict local road pose
Insights: NA

ROI-10D (1812, CVPR 2019)
Title: ROI-10D: Monocular Lifting of 2D Detection to 6D Pose and Metric Shape
Link: https://arxiv.org/abs/1812.02781
Notes: roi10d.md
Category: keypoints and shapes
tl;dr: Concatenate a depth map and a coordinate map to the RGB features, plus 2DOD and car shape reconstruction (6-dim latent space), for monocular 3DOD.
Predecessor: 3D-RCNN; backbone: Faster RCNN with FPN
3D size: offset from whole-dataset average
3D shape: TSDF encoding, 3D autoencoder, 6-dim space; keypoints: None
3D orientation: 4-d quaternion
Distance: regress depth z
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, intrinsics, pretrained depth model
Tricks and contributions: 8-corner loss; stagewise training to start 3D after 2D

Pseudo-Lidar (1812, CVPR 2019)
Title: Pseudo-LiDAR from Visual Depth Estimation: Bridging the Gap in 3D Object Detection for Autonomous Driving
Link: https://arxiv.org/abs/1812.07179
Notes: pseudo_lidar.md
Category: feature transformation
tl;dr: Estimate a depth map from the RGB image (mono/stereo) and use it to lift RGB to a point cloud.
Predecessor: Pseudo-lidar; backbone: Frustum-PointNet / AVOD
3D size: 3DOD on point cloud
3D shape: None; keypoints: None
3D orientation: 3DOD on point cloud
Distance: DORN depth estimation
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, intrinsics, pretrained depth model
Drawbacks: pretrained depth model
Insights: data representation matters

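A minimal sketch (assuming a KITTI-style pinhole camera and depth in meters) of the pseudo-lidar lifting step: every pixel of a predicted depth map is back-projected into a 3D point, turning the depth map into a point cloud that a lidar-based detector such as Frustum-PointNet or AVOD can consume.

```python
import numpy as np

def depth_to_pseudo_lidar(depth, K):
    """depth: (H, W) depth map in meters; K: (3, 3) camera intrinsics."""
    H, W = depth.shape
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)   # (H*W, 3) point cloud
```
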
Mono3D++ (1901, AAAI 2019)
Title: Mono3D++: Monocular 3D Vehicle Detection with Two-Scale 3D Hypotheses and Task Priors
Link: https://arxiv.org/abs/1901.03446
Notes: mono3d++.md
Category: keypoints and shapes
tl;dr: Monocular 3DOD based on 3D and 2D consistency, in particular landmarks and shape reconstruction.
Predecessor: DeepMANTA; backbone: SSD for 2D bbox, stacked hourglass for keypoints, MonoDepth for depth
3D shape: N basis shapes (N=?); keypoints: 14 landmarks
3D orientation: CE classification over 360 bins
Distance: MonoDepth, L1 loss
Required input: 2D bbox, 3D bbox, pretrained depth model, 3D CAD model with keypoints
Insights: cars should stay on the ground, should look like a car, and should be at a reasonable distance; how to ensure 2D/3D consistency between generated 3D vehicle hypotheses

GS3D (1903, CVPR 2019)
Title: GS3D: An Efficient 3D Object Detection Framework for Autonomous Driving
Link: https://arxiv.org/abs/1903.10955
Notes: gs3d.md
Category: 2D/3D tight constraint
tl;dr: Get a 3D bbox proposal ("guidance") from the 2D bbox plus prior knowledge, then refine the 3D bbox using surface features.
Predecessor: GS3D; backbone: Faster RCNN with VGG16 (2D+O)
3D size: subtype average
3D shape: None; keypoints: None
3D orientation: from RoIAligned features (possibly multi-bin)
Distance: approximated with bbox height * 0.93
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, intrinsics
Tricks and contributions: quality-aware loss, surface feature extraction

Pseudo-Lidar Color (1903, ICCV 2019)
Title: Accurate Monocular 3D Object Detection via Color-Embedded 3D Reconstruction for Autonomous Driving
Link: https://arxiv.org/abs/1903.11444
Notes: pseudo_lidar_color.md
Category: feature transformation
tl;dr: Concurrent work with Pseudo-lidar, but with color embedding.
Predecessor: Pseudo-lidar; backbone: Frustum-PointNet
3D size: 3DOD on point cloud
3D shape: None; keypoints: None
3D orientation: 3DOD on point cloud
Distance: various pretrained depth weights
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, intrinsics, pretrained depth model

BirdGAN (1904, IROS 2019)
Title: BirdGAN: Learning 2D to 3D Lifting for Object Detection in 3D for Autonomous Vehicles
Link: https://arxiv.org/abs/1904.08494
Notes: birdgan.md
Category: feature transformation
tl;dr: Learn to map the 2D perspective image to BEV with a GAN.
Predecessor: BirdGAN; backbone: DCGAN
3D size: oriented 2DOD on BEV point cloud
3D shape: None; keypoints: None
3D orientation: oriented 2DOD on BEV point cloud
Distance: oriented 2DOD on BEV point cloud
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox (intrinsics learned)
Drawbacks: in the clipping case, the frontal detectable depth is only about 10 to 15 meters

FQNet (1904, CVPR 2019)
Title: FQNet: Deep Fitting Degree Scoring Network for Monocular 3D Object Detection
Link: https://arxiv.org/abs/1904.12681
Notes: fqnet.md
Category: 2D/3D tight constraint
tl;dr: Train a network to score the 3D IoU of a projected 3D wireframe against the GT.
Predecessor: Deep3DBox; backbone: MS-CNN
3D size: k-means clustering and multi-bin
3D shape: None; keypoints: None
3D orientation: k-means clustering and multi-bin
Distance: approximated via optimization
2D-to-3D tight optimization: similar to Deep3DBox (details in appendix)
Required input: 2D bbox, 3D bbox, intrinsics

MonoPSR (1904, CVPR 2019)
Title: MonoPSR: Monocular 3D Object Detection Leveraging Accurate Proposals and Shape Reconstruction
Link: https://arxiv.org/abs/1904.01690
Notes: monopsr.md
Category: 2D/3D tight constraint
tl;dr: 3DOD by generating 3D proposals first and then reconstructing the local point cloud of the dynamic object.
Predecessor: Deep3DBox, Pseudo-lidar; backbone: MS-CNN
3D size: L2 loss for offset from subtype average
3D shape: None; keypoints: None
3D orientation: multi-bin for yaw
Distance: approximated with bbox height, then regress the residual from RoIAligned features
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, intrinsics
Tricks and contributions: shared feature maps (Mono3D)

CenterNet (1904, ArXiv)
Title: Objects as Points
Link: https://arxiv.org/pdf/1904.07850.pdf
Notes: centernet_ut.md
Category: direct 3D proposal
tl;dr: Object detection as detection of the center point of the object and regression of its associated properties.
Predecessor: CenterNet; backbone: DLA (U-Net-like)
3D size: L1 loss over absolute dimensions
3D shape: None; keypoints: None
3D orientation: multi-bin for global yaw in two overlapping bins
Distance: L1 loss on the inverse of the regressed disparity
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, intrinsics
Tricks and contributions: highly flexible network

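A minimal sketch (assumed output layout, not the authors' exact head) of decoding a multi-bin orientation output: the network classifies which angular bin the yaw falls into and regresses sin/cos of the residual within that bin; the decoded angle is the bin center plus the in-bin residual. The two bin centers below are assumptions for illustration.

```python
import numpy as np

def decode_multibin(bin_logits, residuals, bin_centers):
    """bin_logits: (B,); residuals: (B, 2) as (sin, cos); bin_centers: (B,) in radians."""
    b = int(np.argmax(bin_logits))                       # most confident bin
    sin_r, cos_r = residuals[b]
    angle = bin_centers[b] + np.arctan2(sin_r, cos_r)    # bin center + in-bin residual
    return (angle + np.pi) % (2 * np.pi) - np.pi         # wrap to [-pi, pi)

# Example with two overlapping bins (assumed centers).
bin_centers = np.array([-np.pi / 2, np.pi / 2])
```
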
MonoDIS (1905, ICCV 2019)
Title: MonoDIS: Disentangling Monocular 3D Object Detection
Link: https://arxiv.org/abs/1905.12365
Notes: monodis.md
Category: direct 3D proposal
tl;dr: End-to-end training of 2D and 3D heads on top of RetinaNet for monocular 3D object detection.
Predecessor: MonoGRNet; backbone: RetinaNet + 2D/3D head
3D size: offset from whole-dataset average, learned via 3D corner loss
3D shape: None; keypoints: 2D projection of the 3D center
3D orientation: learned via 3D corner loss
Distance: regressed from dataset average, learned via 3D corner loss
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, intrinsics
Tricks and contributions: signed IoU loss (pulls boxes together even before they intersect), disentangled learning
Insights: the disentangling transformation splits the original combinational loss (e.g., size and location of the bbox at the same time) into groups; each group contains the loss of only one group of parameters, with the rest taken from the GT

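A minimal sketch (with an assumed `corners()` helper that builds the 8 box corners from center, dimensions, and yaw) of the disentangling idea: the 3D corner loss is evaluated once per parameter group, with only that group taken from the prediction and every other group replaced by its ground-truth value.

```python
import numpy as np

def corners(center, dims, yaw):
    """Hypothetical helper: the 8 corners (8, 3) of a 3D box from its parameters."""
    h, w, l = dims
    x = np.array([ l,  l, -l, -l,  l,  l, -l, -l]) / 2
    y = np.array([ h,  h,  h,  h, -h, -h, -h, -h]) / 2
    z = np.array([ w, -w,  w, -w,  w, -w,  w, -w]) / 2
    R = np.array([[ np.cos(yaw), 0, np.sin(yaw)],
                  [ 0,           1, 0          ],
                  [-np.sin(yaw), 0, np.cos(yaw)]])
    return (R @ np.stack([x, y, z])).T + center

def disentangled_corner_loss(pred, gt):
    """pred/gt: dicts with keys 'center', 'dims', 'yaw'."""
    loss = 0.0
    for group in ("center", "dims", "yaw"):
        # only this group comes from the prediction; the rest are ground truth
        mixed = {k: (pred[k] if k == group else gt[k]) for k in gt}
        loss += np.abs(corners(**mixed) - corners(**gt)).mean()   # L1 over corners
    return loss
```
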
monogrnet_russian (1905, ArXiv)
Title: MonoGRNet 2: Monocular 3D Object Detection via Geometric Reasoning on Keypoints
Link: https://arxiv.org/abs/1905.05618
Notes: monogrnet_russian.md
Category: keypoints and shapes
tl;dr: Regress keypoints in the 2D image and use a 3D CAD model to infer depth.
Predecessor: DeepMANTA; backbone: Mask RCNN with FPN
3D size: smooth L1 loss for offset from subtype average in log space
3D shape: 5 CAD models; keypoints: 14 landmarks
3D orientation: multi-bin for yaw in 72 non-overlapping bins
Distance: approximated with windshield height
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, intrinsics
Tricks and contributions: semi-auto labeling by fitting templates into the 3D bbox

Pseudo-Lidar end2end (1905, ICCV 2019)
Title: Pseudo-lidar e2e: Monocular 3D Object Detection with Pseudo-LiDAR Point Cloud
Link: https://arxiv.org/abs/1903.09847
Notes: pseudo_lidar_e2e.md
Category: feature transformation
tl;dr: End-to-end pseudo-lidar training with a 2D/3D bbox consistency loss.
Predecessor: Pseudo-Lidar; backbone: Frustum-PointNet
3D size: 3DOD on point cloud
3D shape: None; keypoints: None
3D orientation: 3DOD on point cloud
Distance: DORN depth estimation
2D-to-3D tight optimization: bbox consistency loss
Required input: 2D bbox, 2D seg mask, 3D bbox, intrinsics
Drawbacks: pretrained depth model
Tricks and contributions: 2D/3D bbox consistency

Shift R-CNN (1905, IEEE ICIP 2019)
Title: Shift R-CNN: Deep Monocular 3D Object Detection with Closed-Form Geometric Constraints
Link: https://arxiv.org/abs/1905.09970
Notes: shift_rcnn.md
Category: 2D/3D tight constraint
tl;dr: Extends Deep3DBox by regressing residual center positions.
Predecessor: Deep3DBox; backbone: Faster RCNN
3D size: L2 loss for offset from subtype average
3D shape: None; keypoints: None
3D orientation: cos and sin, with unity constraint
Distance: approximated via optimization
2D-to-3D tight optimization: slightly different from Deep3DBox
Required input: 2D bbox, 3D bbox, intrinsics

BEV IPM OD (1906, IV 2019)
Title: BEV-IPM: Deep Learning based Vehicle Position and Orientation Estimation via Inverse Perspective Mapping Image
Link: https://ieeexplore.ieee.org/abstract/document/8814050
Notes: bev_od_ipm.md
Category: feature transformation
tl;dr: IPM of the pitch/roll-corrected camera image, then 2DOD performed on the IPM image.
Backbone: YOLOv3
3D size: oriented 2DOD on the BEV image
3D shape: None; keypoints: None
3D orientation: oriented 2DOD on the BEV image
Distance: oriented 2DOD on the BEV image
2D-to-3D tight optimization: None
Required input: 2D bbox, BEV oriented bbox, IMU correction
Drawbacks: only up to 40 meters
Tricks and contributions: motion cancellation using IMU
Insights: IPM assumptions: 1) the road is flat; 2) the mounting position of the camera is stationary (motion cancellation helps with this); 3) the vehicle to be detected is on the ground

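A minimal sketch (assumed calibration values, not the paper's pipeline) of inverse perspective mapping with OpenCV: four image points whose ground locations are known define a homography to a metric BEV image. The flat-road and fixed-camera assumptions listed above are exactly what make a single homography valid.

```python
import cv2
import numpy as np

# Hypothetical calibration: pixel coords of 4 ground points and their positions
# in meters (x lateral, z forward), plus the BEV rasterization resolution.
img_pts = np.float32([[400, 700], [880, 700], [700, 420], [580, 420]])
ground_m = np.float32([[-2.0, 8.0], [2.0, 8.0], [2.0, 40.0], [-2.0, 40.0]])
res = 0.05                                    # meters per BEV pixel
bev_size = (int(8 / res), int(40 / res))      # (width, height) of the BEV image

# Map metric ground coords to BEV pixel coords (x to the right, z away from the camera).
bev_pts = np.float32([[(x + 4.0) / res, (40.0 - z) / res] for x, z in ground_m])
H = cv2.getPerspectiveTransform(img_pts, bev_pts)

def to_bev(image):
    """Warp the (pitch/roll-corrected) camera image into the metric BEV view."""
    return cv2.warpPerspective(image, H, bev_size)
```
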
Pseudo-Lidar++ (1906, ArXiv)
Title: Pseudo-LiDAR++: Accurate Depth for 3D Object Detection in Autonomous Driving
Link: https://arxiv.org/abs/1906.06310
Notes: pseudo_lidar++.md
Category: feature transformation
tl;dr: Improve the depth estimation of pseudo-lidar with a stereo depth network (SDN) and sparse depth measurements on "landmark" pixels from few-line lidars.
Predecessor: Pseudo-lidar; backbone: Frustum-PointNet / AVOD
3D size: 3DOD on point cloud
3D shape: None; keypoints: None
3D orientation: 3DOD on point cloud
Distance: PSMNet fine-tuned stereo depth
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, pretrained depth model, sparse lidar data
Tricks and contributions: use sparse lidar to correct depth, stereo depth loss

SS3D (1906, ArXiv)
Title: SS3D: Monocular 3D Object Detection and Box Fitting Trained End-to-End Using Intersection-over-Union Loss
Link: https://arxiv.org/abs/1906.08070
Notes: ss3d.md
Category: direct 3D proposal
tl;dr: CenterNet-like structure that directly regresses 26 attributes per object to fit a 3D bbox.
Backbone: U-Net-like architecture
3D size: log size
3D shape: None; keypoints: 8 3D corners projected to 2D
3D orientation: cos and sin (multi-bin not suitable)
Distance: directly regressed with indirect supervision from a 3D IoU loss
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, intrinsics
Tricks and contributions: models uncertainty, directly regresses 26 numbers, 20 fps inference

TLNet (1906, CVPR 2019)
Title: TLNet: Triangulation Learning Network: from Monocular to Stereo 3D Object Detection
Link: https://arxiv.org/abs/1906.01193
Notes: tlnet.md
Category: direct 3D proposal
tl;dr: Place 3D anchors inside the frustum subtended by the 2D object detection as the monocular baseline.
Backbone: Faster RCNN with two refinement stages
3D size: refined from dataset average
3D shape: None; keypoints: None
3D orientation: refined from 0- and 90-degree anchors
Distance: refined from 3D anchors
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, intrinsics
Tricks and contributions: stereo coherence score and channel reweighting

M3D-RPN (1907, ICCV 2019)
Title: M3D-RPN: Monocular 3D Region Proposal Network for Object Detection
Link: https://arxiv.org/abs/1907.06038
Notes: m3d_rpn.md
Category: direct 3D proposal
tl;dr: Regress 2D and 3D bbox parameters simultaneously by precomputing 3D mean statistics for each 2D anchor.
Backbone: Faster RCNN
3D size: log size relative to the 3D anchor size
3D shape: None; keypoints: None
3D orientation: smooth L1 directly on the angle, with post-processing to refine
Distance: refined from 3D anchors
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, intrinsics
Drawbacks: angle post-processing
Tricks and contributions: 2D anchors with 2D/3D properties, depth-aware convolution, negative log IoU loss for 2D detection, directly regress 12 numbers
Insights: reliance on additional sub-networks introduces persistent noise

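A minimal sketch (assumed data layout, not the released code) of the per-anchor 3D statistics behind this design: for every 2D anchor template, collect the 3D parameters (e.g., depth, dimensions, yaw) of all ground-truth boxes whose 2D box best matches that anchor, and store their means as the anchor's 3D prior, against which the network regresses offsets. `iou_fn` is a placeholder for any pairwise 2D IoU function.

```python
from collections import defaultdict
import numpy as np

def anchor_3d_priors(anchors_2d, gt_boxes_2d, gt_params_3d, iou_fn):
    """anchors_2d: (A, 4); gt_boxes_2d: (G, 4); gt_params_3d: (G, P); iou_fn: pairwise IoU."""
    buckets = defaultdict(list)
    for box2d, params3d in zip(gt_boxes_2d, gt_params_3d):
        ious = iou_fn(box2d[None], anchors_2d)[0]        # IoU of this GT with every anchor
        buckets[int(np.argmax(ious))].append(params3d)   # assign the GT to its best anchor
    priors = np.zeros((len(anchors_2d), gt_params_3d.shape[1]))
    for a, params in buckets.items():
        priors[a] = np.mean(params, axis=0)              # per-anchor mean depth/dims/yaw
    return priors
```
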
ForeSeE (1909, ArXiv)
Title: ForeSeE: Task-Aware Monocular Depth Estimation for 3D Object Detection
Link: https://arxiv.org/abs/1909.07701
Notes: foresee_mono3dod.md
Category: feature transformation
tl;dr: Train a depth estimator focused on foreground moving objects and improve pseudo-lidar-based 3DOD.
Predecessor: Pseudo-lidar; backbone: Frustum-PointNet / AVOD
3D size: 3DOD on point cloud
3D shape: None; keypoints: None
3D orientation: 3DOD on point cloud
Distance: learn foreground/background depth individually
Required input: 2D bbox, 3D bbox, depth map
Tricks and contributions: depth combination: the element-wise maximum of the confidence vectors over C depth bins is taken and passed through a softmax
Insights: not all pixels are equal; an estimation error on a car is very different from the same error on a building

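A minimal sketch (assumed tensor shapes) of the foreground/background depth fusion described above: each head predicts per-pixel confidences over C discrete depth bins; the two predictions are merged by an element-wise maximum and re-normalized with a softmax before decoding the expected depth.

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_depth(fg_logits, bg_logits, bin_centers):
    """fg_logits, bg_logits: (C, H, W) per-bin confidences; bin_centers: (C,) in meters."""
    merged = np.maximum(fg_logits, bg_logits)          # element-wise max over the two heads
    probs = softmax(merged, axis=0)                    # renormalize across depth bins
    return np.tensordot(bin_centers, probs, axes=1)    # (H, W) expected depth
```
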
CasGeo (1909, ArXiv)
Title: 3D Bounding Box Estimation for Autonomous Vehicles by Cascaded Geometric Constraints and Depurated 2D Detections Using 3D Results
Link: https://arxiv.org/abs/1909.01867
Notes: casgeom.md
Category: 2D/3D tight constraint
tl;dr: Extends Deep3DBox by regressing the 3D bbox center on the bottom edge and adding viewpoint classification.
Predecessor: Deep3DBox; backbone: MS-CNN
3D size: refined from subtype average
3D shape: None; keypoints: 2D projection of the bottom surface center
3D orientation: multi-bin for yaw, coarse (4-bin) viewpoint estimation
Distance: initialized from the top/bottom surface center projections and approximated via optimization (Gauss-Newton)
2D-to-3D tight optimization: similar to Deep3DBox (details in appendix)
Required input: 2D bbox, 3D bbox, intrinsics
Tricks and contributions: regress the projection of the 3D height to help with the initial guess of distance

MVRA (1910, ICCV 2019)
Title: MVRA: Multi-View Reprojection Architecture for Orientation Estimation
Link: http://openaccess.thecvf.com/content_ICCVW_2019/papers/ADW/Choi_Multi-View_Reprojection_Architecture_for_Orientation_Estimation_ICCVW_2019_paper.pdf
Notes: mvra.md
Category: 2D/3D tight constraint
tl;dr: Builds the 2D/3D constraint optimization into the neural network and uses an iterative method to refine cropped cases.
Predecessor: Deep3DBox; backbone: Faster RCNN
3D size: refined from subtype average
3D shape: None; keypoints: None
3D orientation: multi-bin for yaw, viewpoint estimation, iterative trial-and-error for truncated objects
Distance: approximated via optimization
2D-to-3D tight optimization: similar to Deep3DBox (details in appendix)
Required input: 2D bbox, 3D bbox, intrinsics
Tricks and contributions: predicts better for truncated bboxes
Insights: NA

monoloco (1906, ArXiv)
Title: MonoLoco: Monocular 3D Pedestrian Localization and Uncertainty Estimation
Link: https://arxiv.org/abs/1906.06059
Notes: monoloco.md
Category: keypoints and shapes
tl;dr: BEV localization for pedestrians with uncertainty.
Predecessor: monoloco; backbone: Mask RCNN / Pif-Paf + MLP
3D size: None; 3D shape: None; keypoints: 14 landmarks; 3D orientation: None
Distance: approximated with the shoulder-hip segment height
2D-to-3D tight optimization: None
Required input: 2D bbox, 3D bbox, intrinsics

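A minimal sketch (assumed, not MonoLoco's learned model, which uses an MLP with uncertainty) of the geometric prior behind the shoulder-hip segment: under a pinhole camera, a segment of known metric length h_real observed with pixel height h_px at focal length fy gives a distance estimate d ≈ fy * h_real / h_px. The 0.5 m segment length below is an assumed placeholder.

```python
import numpy as np

def distance_from_segment(kp_shoulder, kp_hip, fy, h_real=0.5):
    """kp_*: (2,) pixel coords; h_real: assumed shoulder-to-hip length in meters."""
    h_px = abs(kp_hip[1] - kp_shoulder[1])     # vertical pixel extent of the segment
    return fy * h_real / max(h_px, 1e-6)       # similar-triangles depth estimate
```
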