Recent Advances of
Binocular Stereo Vision
Hands on new models
1
Yaoyu Hu
yaoyuh@andrew.cmu.edu
2020-07-09
2
We will focus on one small part of computer vision:
passive binocular stereo reconstruction.
Since most RI people
already have some experience with it,
let’s get a little more involved.
Outline
3
Stereo vision 101
Recent non-learning methods
Recent learning methods
Datasets & benchmarks
Advanced learning
Uncertainty. Occlusion. Guided. Cost. Adaptive & online learning. Multi-task.
Related CV tasks
A recent review article.
Poggi, Matteo, Fabio Tosi, Konstantinos Batsos, Philippos Mordohai, and Stefano Mattoccia. "On the Synergies between Machine Learning and Stereo: a Survey." arXiv preprint arXiv:2004.08566 (2020).
4
Stereo vision 101
Quick review of binocular stereo vision.
Some tips on stereo calibration.
5
When people say “reconstruction”,
we usually refer to “dense reconstruction” or
“surface reconstruction”.
For dense reconstruction,
we often talk about reconstruction error,
validity, uncertainty, occlusion and efficiency.
6
Camera makes a movement along the x-axis.
Scene is stationary.
Two identical cameras placed along the x-axis.
Images are captured simultaneously.
Key observation: the images of objects move horizontally with magnitudes inversely proportional to the distance between the objects and the camera.
x
7
Image courtesy: Ioannis Gkioulekas, Course 15-862 @ CMU, Computational Photography, 2018 Fall.
Ref.
Tst.
8
Names:
binocular stereo reconstruction, stereo vision, stereo depth prediction
Disparity sensitivity
metric unit
pixel
For lower error sensitivity:
Move the camera closer to the object, use larger baseline, use longer focal length lens.
Does higher resolution/larger image size help?
If you have a disparity map for Ref., how can you recover the 3D points?
Orientation of coordinate systems? Points at infinity?
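A minimal sketch of recovering 3D points from a Ref. disparity map, assuming a rectified pair, a pinhole model with focal length f in pixels, baseline b in metric units, and principal point (cx, cy). The function names and the axis convention (x right, y down, z forward) are illustrative assumptions, not taken from the slides. Zero disparity is treated as a point at infinity and marked invalid.

```python
import numpy as np

def disparity_to_points(disp, f, b, cx, cy):
    """Back-project a Ref. disparity map to 3D points in the Ref. camera frame.

    Assumed convention: x right, y down, z forward.
    f: focal length in pixels, b: baseline in metric units.
    """
    h, w = disp.shape
    u, v = np.meshgrid(np.arange(w, dtype=np.float64),
                       np.arange(h, dtype=np.float64))
    valid = disp > 0                      # d == 0 is a point at infinity
    points = np.full((h, w, 3), np.nan)
    z = f * b / disp[valid]               # z = f * b / d
    points[valid, 0] = (u[valid] - cx) * z / f
    points[valid, 1] = (v[valid] - cy) * z / f
    points[valid, 2] = z
    return points, valid

def depth_error(z, f, b, disp_err=1.0):
    """First-order depth error caused by a disparity error: dz = z^2 / (f b) * dd."""
    return z * z / (f * b) * disp_err
```

The second function restates the sensitivity tips: the depth error grows with the square of the distance and shrinks with a larger baseline or a longer focal length (in pixels).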
The remaining question and the fundamental question:
how to find per-pixel correspondences?
How hard can it be?
9
Ref.
Tst.
How to find per-pixel correspondence? Simple principles may not work in the passive setting.
Stereo calibration and other tips
10
Outline
11
Stereo vision 101
Recent non-learning methods
Recent learning methods
Datasets & benchmarks
Advanced learning
Uncertainty. Occlusion. Guided. Cost. Adaptive & online learning. Multi-task.
Related CV tasks
Datasets & benchmarks.
Frequently used.
KITTI stereo
Scene Flow
Middlebury
Monocular, multi-view
NYU-Depth-v2
ETH3D
12
13
KITTI stereo http://www.cvlibs.net/datasets/kitti/eval_stereo.php
376 x 1241. Small capacity. Sparse label. Outdoor, self-driving. KITTI 2012, KITTI 2015
14
Scene Flow https://lmb.informatik.uni-freiburg.de/resources/datasets/SceneFlowDatasets.en.html
540 x 960, 30k training, 4000+ testing, good disparity range up to 192. Simulation, complex geometry.
15
Middlebury http://vision.middlebury.edu/stereo/data/
Small capacity < 100 cases. Large image size. Indoor. Complex geometry. Occlusion mask.
16
NYU-Depth-v2: https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html
480 x 640. Large capacity. Active RGBD sensor. Indoor.
Monocular.
RGB image
Depth
Segmentation
17
ETH3D https://www.eth3d.net/
4141 x 6220 DSLR images. Indoor and outdoor. Monocular. Challenging scenes.
Point cloud
Image
18
TartanAir. https://www.aicrowd.com/challenges/tartanair-visual-slam-mono-track
700k. 480x640. Depth + optical flow + pose. Simulation, environments not like scene flow.
Challenging scenes.
What metrics are used?
EPE (MAE)
1-pixel, 3-pixel
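The metrics above can be sketched in a few lines. This is a simplified version that assumes a boolean mask of pixels with ground truth; benchmark definitions vary (KITTI 2015, for example, adds a relative-error condition to its 3-pixel outlier test), so treat the function name and the dictionary keys as illustrative.

```python
import numpy as np

def stereo_metrics(pred, gt, valid, thresholds=(1.0, 3.0)):
    """End-point error (mean absolute disparity error) and N-pixel error rates.

    Only pixels where `valid` is True are evaluated (sparse labels).
    """
    err = np.abs(pred - gt)[valid]
    metrics = {"epe": float(err.mean())}
    for t in thresholds:
        # Fraction of evaluated pixels whose error exceeds t pixels.
        metrics[f"bad_{t:g}px"] = float((err > t).mean())
    return metrics
```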
19
Geiger, Andreas, Philip Lenz, and Raquel Urtasun. "Are we ready for autonomous driving? the kitti vision benchmark suite." CVPR 2012.
Tips and summary
20
Outline
21
Stereo vision 101
Recent non-learning methods
Recent learning methods
Datasets & benchmarks
Advanced learning
Uncertainty. Occlusion. Guided. Cost. Adaptive & online learning. Multi-task.
Related CV tasks
22
Recent non-learning methods
23
Hirschmüller, Heiko. "Accurate and efficient stereo processing by semi-global matching and mutual information." CVPR 2005, pp. 807-814.
Lr
From boundary to current pixel.
For every pixel in Ref., how many S(p,d)?
What does Lr encourage?
How to compute C(p, d)?
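A sketch of the path cost Lr for a single direction (left to right along one scanline), following the standard SGM recursion: Lr(p, d) = C(p, d) + min(Lr(p-r, d), Lr(p-r, d±1) + P1, min_k Lr(p-r, k) + P2) - min_k Lr(p-r, k). The penalty values and the function name are assumptions for illustration. S(p, d) is then the sum of Lr over all path directions.

```python
import numpy as np

def aggregate_left_to_right(cost, p1=1.0, p2=8.0):
    """Aggregate matching costs along one scanline, from the boundary to each pixel.

    cost: (W, D) array of matching costs C(p, d) for one image row.
    Returns L_r with the same shape.
    """
    w, _ = cost.shape
    L = np.empty_like(cost, dtype=np.float64)
    L[0] = cost[0]                                            # path starts at the boundary
    for x in range(1, w):
        prev = L[x - 1]
        prev_min = prev.min()
        from_minus = np.concatenate(([np.inf], prev[:-1])) + p1   # came from d - 1
        from_plus = np.concatenate((prev[1:], [np.inf])) + p1     # came from d + 1
        best = np.minimum(np.minimum(prev, from_minus),
                          np.minimum(from_plus, prev_min + p2))
        # Subtracting prev_min keeps the values bounded along the path.
        L[x] = cost[x] + best - prev_min
    return L
```

Small disparity changes cost P1, larger jumps cost P2, so the recursion favors piecewise-smooth disparity along the path.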
A little about matching cost
24
Ref.
Tst.
A little about matching cost
25
Ref.
Tst.
max number of disp.
For a pixel in Ref. image, compute matching cost against the Tst. image at all possible x-coordinate locations
max disp.
min disp.
true disp.
What is matching cost? A measure of similarity.
How to measure similarity?
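One of the simplest similarity measures is the sum of absolute differences (SAD) over a small window. The sketch below builds the cost for every candidate disparity and then picks the lowest cost per pixel (winner takes all). It assumes rectified grayscale images and a minimum disparity of 0; the function name and window radius are illustrative.

```python
import numpy as np

def sad_cost_volume(ref, tst, max_disp, radius=1):
    """SAD matching cost, shape (max_disp + 1, H, W).

    Ref. pixel (y, x) is compared with Tst. pixel (y, x - d).
    Locations with no counterpart in Tst. keep an infinite cost.
    """
    ref = ref.astype(np.float64)
    tst = tst.astype(np.float64)
    h, w = ref.shape
    k = 2 * radius + 1
    cost = np.full((max_disp + 1, h, w), np.inf)
    for d in range(max_disp + 1):
        diff = np.abs(ref[:, d:] - tst[:, :w - d])           # (H, W - d)
        padded = np.pad(diff, radius, mode="edge")
        # Sum the absolute differences over a k x k window.
        agg = sum(padded[dy:dy + h, dx:dx + w - d]
                  for dy in range(k) for dx in range(k))
        cost[d, :, d:] = agg
    return cost

def winner_takes_all(cost):
    """Pick the disparity with the lowest cost at every pixel."""
    return cost.argmin(axis=0)
```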
26
Have a guess: which one does OpenCV use?
Which one do deep-learning models use?
DL
DL
DL
DL
What does SGBM offer? A baseline.
27
28
Recent non-learning methods
Slanted-Plane Smoothing Stereo
Key Idea:
29
xL
xR
b
f
z
x
y
d
image plane
3D plane
In pixel-disparity space
p = [xp, yp, d]T
n
3D plane
In camera frame
In pixel-disparity space.
Plane-let / segment
30
x
y
d
image plane
3D plane
Pixel-disparity space.
p = [xp, yp, d]T
ni
Later: To calculate a pixel’s disparity
Quiz:
Are xp and yp world coordinates?
What is i ?
What are pi and ni?
What happens if ndi = 0?
Is the segment a plane in 3D camera frame? Why?
Need an initial disparity map to fit to. Where to get it?
Assume that the world is made of piecewise planar surfaces.
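A sketch of the plane-let idea in pixel-disparity space: inside one segment, disparity is modeled as d = a·x + b·y + c and the parameters are fitted to an initial disparity map by least squares. This is an illustrative simplification (no outlier handling), and the names are assumptions. With a normal n = (nx, ny, nd) the same plane reads a = -nx/nd, b = -ny/nd, which is why nd = 0 is a degenerate case.

```python
import numpy as np

def fit_disparity_plane(xs, ys, ds):
    """Least-squares fit of d = a*x + b*y + c to the pixels of one segment."""
    xs = np.asarray(xs, dtype=np.float64)
    ys = np.asarray(ys, dtype=np.float64)
    ds = np.asarray(ds, dtype=np.float64)
    A = np.column_stack([xs, ys, np.ones_like(xs)])
    theta, *_ = np.linalg.lstsq(A, ds, rcond=None)
    return theta                      # (a, b, c)

def plane_disparity(theta, x, y):
    """Disparity of pixel (x, y) predicted by the fitted plane."""
    a, b, c = theta
    return a * x + b * y + c
```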
SPS-Stereo
31
SGM
L
R
Segmentation & outlier pixels
Smoothing at boundaries, identify boundary type
Smoothing across segments, modify 𝜽i
Yamaguchi, K., McAllester, D., & Urtasun, R. “Efficient joint segmentation, occlusion labeling, stereo and flow estimation.” ECCV 2014.
Smoothing objective of SPS-Stereo. Energy function.
32
Census transform: Spangenberg, Robert, et al. "Weighted semi-global matching and center-symmetric census transform for robust driver assistance." 2013. Converts pixel values to a binary code; the Hamming distance measures the difference.
Smoothing objective of SPS-Stereo. Energy function.
33
seg.
plane params.
outlier flag
line label
ref. image
disp. SGM
color
position
depth
plane smoothness
label prior
boundary length
may be wrong
Use gradient + Hamming distance of Census transform of image patches as matching cost.
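A sketch of the plain census transform and the Hamming distance between census codes. It uses the basic variant (compare every neighbor with the center pixel), not the center-symmetric variant of the cited paper, and leaves out the gradient term. Function names are illustrative.

```python
import numpy as np

def census_transform(img, radius=1):
    """Encode each pixel as a bit string: a bit is 1 where the neighbor is darker than the center."""
    h, w = img.shape
    padded = np.pad(img, radius, mode="edge")
    code = np.zeros((h, w), dtype=np.uint64)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            if dy == radius and dx == radius:
                continue                                   # skip the center pixel
            neighbor = padded[dy:dy + h, dx:dx + w]
            code = (code << np.uint64(1)) | (neighbor < img).astype(np.uint64)
    return code

def hamming_distance(a, b):
    """Number of differing bits between two arrays of census codes."""
    x = np.bitwise_xor(a, b)
    count = np.zeros(x.shape, dtype=np.uint8)
    while x.any():
        count += (x & np.uint64(1)).astype(np.uint8)
        x = x >> np.uint64(1)
    return count
```

Because only the ordering of intensities matters, the code is unchanged under gain and offset changes between the two cameras, which is one reason it is a popular matching cost.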
SPS-Stereo
34
SGM
L
R
Segmentation & outlier pixels
Smoothing at boundaries, identify boundary type
Smoothing across segments, modify 𝜽i
Outer iteration
Inner iteration
TPS (Topology Preserving Segmentation), ETPS
35
s0, 𝜽0
s1, 𝜽1
s2, 𝜽2
s3, 𝜽3
s0, 𝜽0
s1, 𝜽1
s2, 𝜽2
s3, 𝜽3
Results
36
There is another recent related work from Michael Kaess’ team:
Zhang, Shuangli, Weijian Xie, Guofeng Zhang, Hujun Bao, and Michael Kaess. "Robust stereo matching with surface normal prediction." ICRA 2017.
Ref.
Disparity
Segmentation and boundaries
Gray: coplanar
Green: hinge
Red/blue: occlusion
Point cloud!
37
Recent non-learning methods
Fuse sparse depth measurements
38
Shivakumar, Shreyas S., Kartik Mohta, Bernd Pfrommer, Vijay Kumar, and Camillo J. Taylor. "Real time dense depth estimation by fusing stereo with sparse depth measurements." ICRA 2019.
RGB image
SGM
Neighborhood Support
Diffusion based
Anisotropic diffusion
KITTI LiDAR true depth
15% are sampled
39
3 types of fusion by equations.(2/3)
if current pixel has a measurement, trust the measurement.
Naïve Fusion
loop all d on the center pixel
fixed d on neighbor pixel
pixel guided weight
“We use the grayscale image as the guide, assuming that within small windowed regions, the grayscale intensities of two points on a surface having similar depth also have similar intensities.”
No update
Quiz: USHRT_MAX = ?
Neighborhood Promotion
40
3 types of fusion by equations.(3/3)
The original paper did not explain this part clearly.
> 0.7
0.4 < W <= 0.7
|dk - dv| > 1 -> dk != dv
The code tells the truth.
41
Recent non-learning methods
42
Lr
From boundary to current pixel.
Dense guidance: pull down the cost values inside disparity regions predicted by a deep-learning model
43
Center of the predicted disparity range by DL.
Constant, 0.1
Half range width.
From DL
Constant, 0.1
Cost from SGBM
(OpenCV)
Updated cost.
44
45
Recent non-learning methods
46
Ye, Mao, et al. "3D reconstruction in the presence of glasses by acoustic and stereo fusion." CVPR 2015.
Keller, John, and Sebastian Scherer. "A Stereo Algorithm for Thin Obstacles and Reflective Objects." arXiv preprint arXiv:1910.04874 (2019).
There is also much work on fusion with ToF (time-of-flight) cameras.
Summary
Non-learning methods
Matching cost + cost aggregation + search for the best + post-process + parameters
Multi-task
Guided
Fused
47
Hands on
non-learning methods
Let’s rock!
48
Sample data
Middlebury teddy. Size, disparity range. True disparity and mask.
Ways disparity maps can be represented (png with u16 type, pfm).
Scale factor for disparity maps with unsigned short type. (256)
The most important parameters for the SGM based methods are the min and max disparities.
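A sketch of converting between float disparity and the u16 representation with a scale factor of 256. It assumes that a raw value of 0 marks a pixel without ground truth (the KITTI convention); other datasets may use a different scale or invalid marker, so check the dataset notes. Function names are illustrative.

```python
import numpy as np

def decode_u16_disparity(raw, scale=256.0):
    """Convert a u16 disparity image to float disparity plus a validity mask."""
    raw = np.asarray(raw)
    valid = raw > 0                                    # assumption: 0 means "no ground truth"
    disp = raw.astype(np.float32) / scale
    return disp, valid

def encode_u16_disparity(disp, valid, scale=256.0):
    """Inverse of decode_u16_disparity; values are clipped to the u16 range."""
    raw = np.round(disp * scale)
    raw = np.clip(raw, 0, np.iinfo(np.uint16).max).astype(np.uint16)
    raw[~valid] = 0
    return raw
```

With a scale of 256 the largest representable disparity is just under 256 pixels, with a resolution of 1/256 pixel.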
49
50
Outline
51
Stereo vision 101
Recent non-learning methods
Recent learning methods
Datasets & benchmarks
Advanced learning
Uncertainty. Occlusion. Guided. Cost. Adaptive & online learning. Multi-task.
Related CV tasks
Deep-learning methods
52
A summary based on KITTI stereo benchmark.
53
Deep-learning methods
Common structure (usually supervised) & common components
54
Kendall, Alex, et al. "End-to-end learning of geometry and context for deep stereo regression." ICCV 2017.
Dosovitskiy, Alexey, et al. "Flownet: Learning optical flow with convolutional networks." ICCV 2015.
Cost volume
Cost regulation
Feat. Ext.
Multi-scale
Spatial pooling
Classification & Regression
Refinement
Feature extraction. Front end.
Pre-trained backbone: VGG¹, ResNet²
VGG16, VGG19, ResNet50, ResNet101
Auto-encoder like, encoder-decoder like: UNet³
Enlarge receptive field: SPP⁴
Feature manipulation: warping
55
56
UNet
57
Variants:
Weighted summation.
Zhao, Hengshuang, et al. "Pyramid scene parsing network." CVPR 2017.
Yang, Gengshan, et al. "Hierarchical deep stereo matching on high-resolution images." CVPR 2019.
SPP
58
1 Ilg, Eddy, et al. "Flownet 2.0: Evolution of optical flow estimation with deep networks." CVPR 2017.
2 Jaderberg, Max, et al. "Spatial transformer networks." NIPS 2015.
Warping¹˒²: per-pixel resampling of an image, done in a differentiable way.
left
right
disparity
warped right
right -> left
59
A good opportunity to dive into the source code.
How to make warping differentiable?
Sun, Deqing, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. "Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume." CVPR 2018.
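To make the warping question concrete, here is a NumPy sketch of warping the right image toward the left one with horizontal linear interpolation. The output is a weighted sum of two neighboring pixels, so it is linear in the image values and piecewise linear in the disparity; that is what lets gradients flow when the same operation is written in an autograd framework (PyTorch offers `torch.nn.functional.grid_sample` for this). The function name and the NumPy formulation are assumptions for illustration, not the code of the cited papers.

```python
import numpy as np

def warp_right_to_left(right, disp):
    """Sample the right image at x - d(y, x) with linear interpolation.

    right: (H, W) image, disp: (H, W) left-view disparity.
    Returns the warped image and a mask of pixels sampled inside the image.
    """
    h, w = right.shape
    xs = np.arange(w, dtype=np.float64)[None, :] - disp   # source x in the right image
    x0 = np.floor(xs).astype(int)
    t = xs - x0                                           # interpolation weight in [0, 1)
    x0c = np.clip(x0, 0, w - 1)
    x1c = np.clip(x0 + 1, 0, w - 1)
    rows = np.arange(h)[:, None]
    warped = (1.0 - t) * right[rows, x0c] + t * right[rows, x1c]
    inside = (xs >= 0) & (xs <= w - 1)
    return warped, inside
```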
60
Loss and optimizer
61
Supervised:
Smooth L1 loss is often good enough. May add weighting based on intensity gradient¹.
Adam optimizer is often good enough.
Unsupervised:
SSIM², edge-aware smoothness², consistency
(Save to unsupervised section.)
1 Pu, Can, Runzi Song, Radim Tylecek, Nanbo Li, and Robert B. Fisher. "Sdf-gan: Semi-supervised depth fusion with multi-scale adversarial networks." arXiv preprint arXiv:1803.06657 (2018).
2 Godard, Clément, Oisin Mac Aodha, and Gabriel J. Brostow. "Unsupervised monocular depth estimation with left-right consistency." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 270-279. 2017.
62
Deep-learning methods
Why cost volume?
Fully convolutional neural networks give fuzzy disparity predictions.
The cost volume is a medium through which disparity prediction becomes a classification over all candidate integer disparities.
It is not a new idea in the CV community. It was used before deep learning became popular.
63
A little about matching cost
64
Ref.
Tst.
max number of disp.
For a pixel in Ref. image, compute matching cost against the Tst. image at all possible x-coordinate locations
max disp.
min disp.
true disp.
Key idea:
Hey my neural-net, let me help to arrange the pixels so that you can easily compare the similarity between them. No worries, I’ve got everything in order.
65
Wu, Zhenyao, et al. "Semantic stereo matching with pyramid cost volumes." ICCV 2019.
Their paper also has an equation that shows the classification and regression of disparity values.
66
(C, H, W)
(C, H, W)
D=0
D=1
D=5
D=max
At new dimension D.
( D, C, H, W )
H
W
67
Yang, Gengshan, et al. "Hierarchical deep stereo matching on high-resolution images." CVPR 2019.
Chang, Jia-Ren, and Yong-Sheng Chen. "Pyramid stereo matching network." CVPR 2018.
Need a classification layer (on D) and a disparity regression layer to compute the disparity.
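A NumPy sketch of the two pieces mentioned here. The first function stacks shifted feature differences into a (D, C, H, W) volume; using the absolute difference is one option, while concatenating the two feature maps would give 2C channels instead. The second is the soft-argmin used for disparity regression in the Kendall et al. paper cited earlier: a softmax over the disparity dimension (classification) followed by the expected disparity (regression). In a real network, learned cost regulation layers sit between the two; the function names are illustrative.

```python
import numpy as np

def difference_cost_volume(ref_feat, tst_feat, max_disp):
    """Stack |ref - shifted tst| for every candidate disparity: (D, C, H, W).

    Columns x < d have no counterpart and are left at zero, so they are
    not meaningful costs.
    """
    c, h, w = ref_feat.shape
    volume = np.zeros((max_disp + 1, c, h, w))
    for d in range(max_disp + 1):
        volume[d, :, :, d:] = np.abs(ref_feat[:, :, d:] - tst_feat[:, :, :w - d])
    return volume

def soft_argmin(cost):
    """cost: (D, H, W). Softmax over -cost, then the expected disparity."""
    logits = -cost
    logits = logits - logits.max(axis=0, keepdims=True)   # numerical stability
    prob = np.exp(logits)
    prob /= prob.sum(axis=0, keepdims=True)
    d = np.arange(cost.shape[0]).reshape(-1, 1, 1)
    return (prob * d).sum(axis=0)
```

Because the output is an expectation rather than an argmin, it is differentiable and can take sub-pixel values.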
68
Deep-learning methods
Dosovitskiy, Alexey, et al. "Flownet: Learning optical flow with convolutional networks." ICCV 2015.
Ilg, Eddy, et al. "Flownet 2.0: Evolution of optical flow estimation with deep networks." CVPR 2017.
Why bother?
Have a guess?
69
70
k
k
k
k
Ref.
Tst.
Notes: We have to do cross-correlation for every pixel in Ref. against the Tst. image across all possible x-coordinate locations.
The input features are (C, H, W). The result cost volume is ( 2D+1, H, W ).
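A sketch of the 1D correlation described in the notes, with kernel size 1 (a per-pixel dot product over channels). It maps (C, H, W) features to a (2D+1, H, W) volume by searching shifts from -D to +D along x. This is a plain NumPy loop for illustration, not the custom CUDA layer used in the FlowNet code, and averaging over channels is an assumed normalization.

```python
import numpy as np

def correlation_volume(ref_feat, tst_feat, max_disp):
    """1D correlation with kernel size 1: (C, H, W) features -> (2D+1, H, W).

    Slice i holds the channel-averaged product of Ref. pixel x and
    Tst. pixel x - d for d = i - max_disp. Locations without a
    counterpart stay at zero.
    """
    c, h, w = ref_feat.shape
    volume = np.zeros((2 * max_disp + 1, h, w))
    for i, d in enumerate(range(-max_disp, max_disp + 1)):
        if d >= 0:
            volume[i, :, d:] = (ref_feat[:, :, d:] * tst_feat[:, :, :w - d]).mean(axis=0)
        else:
            volume[i, :, :w + d] = (ref_feat[:, :, :w + d] * tst_feat[:, :, -d:]).mean(axis=0)
    return volume
```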
71
Enough talking, let’s look at the code!
72
Enough talking, let’s look at the code!
Comments:
kernel size = 1
73
New variant: Try not to do correlation over all input channels.
Correlation
Group-wise correlation
Guo, Xiaoyang, et al. "Group-wise correlation stereo network." CVPR 2019.
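A sketch of the group-wise idea: split the channels into groups and correlate each group separately, so the volume keeps one similarity value per group instead of collapsing all channels into one. Shapes and names are illustrative assumptions; only non-negative disparities are searched here.

```python
import numpy as np

def groupwise_correlation(ref_feat, tst_feat, max_disp, num_groups):
    """Group-wise correlation volume of shape (D, G, H, W).

    Channels are split into G equal groups; each group gets its own
    channel-averaged product between Ref. pixel x and Tst. pixel x - d.
    """
    c, h, w = ref_feat.shape
    assert c % num_groups == 0, "channels must divide evenly into groups"
    ref_g = ref_feat.reshape(num_groups, c // num_groups, h, w)
    tst_g = tst_feat.reshape(num_groups, c // num_groups, h, w)
    volume = np.zeros((max_disp + 1, num_groups, h, w))
    for d in range(max_disp + 1):
        volume[d, :, :, d:] = (ref_g[..., d:] * tst_g[..., :w - d]).mean(axis=1)
    return volume
```

With a single group this reduces to the full correlation over all channels; with one group per channel it keeps every per-channel product.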
What have we discussed so far?
74
1 Pang, Jiahao, et al. "Cascade residual learning: A two-stage convolutional neural network for stereo matching." ICCVW 2017.
2 Batsos, Konstantinos, et al. "Recresnet: A recurrent residual cnn architecture for disparity map enhancement." 3DV 2018.
75
Deep-learning methods
Why?
76
77
Appearance/photometric
Edge-aware smoothness
Left-right consistency
Image patch should look similar after warping. (SSIM)
Pixel intensities should be the same between corresponding pixels.
Disparity discontinuity should only happen at object boundaries.
Later in the joint-edge prediction work3, the smoothness is defined based on detected edges.
Corresponding pixels in Ref. and Tst. images should agree with each other.
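A sketch of the edge-aware smoothness term in the spirit of Godard et al.: disparity gradients are penalized, but the penalty is down-weighted where the image itself has a strong gradient. It assumes a single-channel image and plain finite differences; the exact weighting used in any particular paper may differ.

```python
import numpy as np

def edge_aware_smoothness(disp, image):
    """Mean of |grad d| * exp(-|grad I|) over the x and y directions."""
    ddx = np.abs(disp[:, 1:] - disp[:, :-1])
    ddy = np.abs(disp[1:, :] - disp[:-1, :])
    idx = np.abs(image[:, 1:] - image[:, :-1])
    idy = np.abs(image[1:, :] - image[:-1, :])
    # Large image gradients (likely object boundaries) shrink the penalty.
    return float((ddx * np.exp(-idx)).mean() + (ddy * np.exp(-idy)).mean())
```

A disparity step that coincides with an image edge costs far less than the same step inside a uniform region, which matches the statement that discontinuities should only happen at object boundaries.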
78
1 Stein, Fridtjof. "Efficient computation of optical flow using the census transform." In Joint Pattern Recognition Symposium, pp. 79-86. Springer, Berlin, Heidelberg, 2004.
2 Meister, Simon, Junhwa Hur, and Stefan Roth. "Unflow: Unsupervised learning of optical flow with a bidirectional census loss." AAAI 2018. https://github.com/simonmeister/UnFlow/blob/master/src/e2eflow/core/losses.py
Census trans.
Ternary trans.1
How to make this differentiable?2
Why?
Disparity predictions do not look realistic.
For some areas of the Ref. image, it is impossible to find any match. Why?
So let’s make the neural net model the distribution of the training data and do better in guessing the missing matches.
79
80
81
Deep-learning methods
Why?
Accurate models are heavy.
Why do we need extreme accuracy (global EPE < 1.0 pixel) anyway?
Also, for autonomous navigation, we do not need to pursue high density.
Two types. On desktop-class GPU. On mobile GPU (TX2).
82
Real-time learning based method.
GANet
AnyNet
Basic idea?
Two types. On desktop-class GPU. On mobile GPU (TX2).
83
84
AnyNet¹: disparity prediction at any time.
1 Wang, Yan, et al. "Anytime stereo image depth estimation on mobile devices." ICRA 2019.
2 Liu, Sifei, et al. "Learning affinity via spatial propagation networks." NIPS 2017.
85
Spatial propagation model: SPNet.
86
What performance does AnyNet achieve?
Unfortunately, AnyNet is implemented in PyTorch 0.4.0 with custom layers (C++ and CUDA) that are now deprecated. If you are interested, I have a Docker image for AnyNet saved on perceptron:/data/datasets/yaoyuh/Docker/Anaconda/a3py3.6pt0.4.0.tar (7.1GB). My modified and tested version is hosted at https://github.com/huyaoyu/AnyNet
87
Something is missing. How to make a deep-neural net fast?
No magic happens.
Mittal, Sparsh. "A Survey on optimized implementation of deep learning models on the NVIDIA Jetson platform." Journal of Systems Architecture 97 (2019): 428-442.
Could be multi-task.
88
Dovesi, Pier Luigi, et al. "Real-time semantic stereo matching." ICRA 2020. No code available.
More accurate but slower than AnyNet.
89
Source code: TensorFlow
Tonioni, Alessio, et al. "Real-time self-adaptive deep stereo." CVPR 2019.
Domain shift. Adaptation. Online learning.
Key idea:
❐ Unsupervised learning.
❐ Multi-scale with separate layers.
❐ Only back-prop one layer at a time.
❐ Try to find out which layer to train upon each new input.
Summary
90
Cost volume
Cost regulation
Feat. Ext.
Multi-scale
Spatial pooling
Classification & Regression
Refinement
+ Unsupervised methods. Real-time considerations.
Outline
91
Stereo vision 101
Recent non-learning methods
Recent learning methods
Datasets & benchmarks
Advanced learning
Uncertainty. Occlusion. Guided. Cost. Adaptive & online learning. Multi-task.
Related CV tasks
What additional training data are available?
92
CityScapes: Segmentation.
Cordts, M., et al. “The cityscapes dataset for semantic urban scene understanding.” CVPR 2016.
93
Advanced learning
Brief review of the SOTA.
Again, matching cost.
94
Ref.
Tst.
max number of disp.
For a pixel in Ref. image, compute matching cost against the Tst. image at all possible x-coordinate locations
max disp.
min disp.
true disp.
What is matching cost? A measure of similarity.
A natural question: What about confidence?
Classical confidence measures.
95
# | Category | Abbreviation | Name |
1 | Matching cost | MSM/MC | Matching Score Measure/Minimum Cost |
2 | Local properties of the cost curve | CUR | Curvature |
3 | Local minima of the cost curve | PKR | Peak Ratio |
4 | Local minima of the cost curve | PKRN | Naive Peak Ratio |
5 | Local minima of the cost curve | MM | Maximum Margin |
6 | Local minima of the cost curve | MMN | Naive Maximum Margin |
7 | The entire cost curve | PRB | Probabilistic Measure |
8 | The entire cost curve | MLM | Maximum Likelihood Measure |
9 | The entire cost curve | AML | Attainable Maximum Likelihood |
10 | The entire cost curve | NEM | Negative Entropy Measure |
11 | The entire cost curve | NOI | Number of Inflection Points |
12 | The entire cost curve | WMN | Winner Margin |
13 | The entire cost curve | WMNN | Naive Winner Margin |
14 | Consistency between the left and right disparity maps | LRC | Left-Right Consistency |
15 | Consistency between the left and right disparity maps | LRD | Left-Right Difference |
16 | Distinctiveness-based confidence measure | DSM | Distinctive Similarity Measure |
17 | Distinctiveness-based confidence measure | SAMM | Self-Aware Matching Measure |
Hu, Xiaoyan, and Philippos Mordohai. "A quantitative evaluation of confidence measures for stereo vision." IEEE transactions on pattern analysis and machine intelligence 34, no. 11 (2012): 2121-2133.
Guess which one is the most effective when tested on multiple tasks?
My comments:
Try to reason about the costs.
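To make "reasoning about the costs" concrete, here is a sketch of four of the measures from the table, computed from the cost curve of a single pixel. The formulas follow the usual definitions (c1 is the minimum cost at disparity d1, c2 the second smallest cost anywhere on the curve); please check them against the Hu and Mordohai paper before relying on them. Higher values mean higher confidence.

```python
import numpy as np

def confidence_measures(cost_curve):
    """A few classical confidence measures for one pixel's cost curve."""
    c = np.asarray(cost_curve, dtype=np.float64)
    d1 = int(c.argmin())
    c1 = c[d1]                               # minimum cost
    c2 = np.partition(c, 1)[1]               # second smallest cost (naive: any position)
    left = c[max(d1 - 1, 0)]                 # neighbors of the minimum, clamped at the ends
    right = c[min(d1 + 1, len(c) - 1)]
    eps = 1e-12                              # guard against division by zero
    return {
        "MSM": -c1,                          # Matching Score Measure
        "CUR": -2.0 * c1 + left + right,     # Curvature around the minimum
        "PKRN": c2 / (c1 + eps),             # Naive Peak Ratio
        "MMN": c2 - c1,                      # Naive Maximum Margin
    }
```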
More classical measures.
96
# | Abbreviation | Name |
1 | PER | Perturbation measure |
2 | MDD | Median Deviation |
3 | MND | Mean Deviation |
4 | DD/DTD | Distance to Depth Discontinuity |
5 | IVAR | Variance of Intensities |
6 | GRAD | Magnitude of image gradients |
7 | DTE | Distance to Edge |
8 | DLB | Distance to the left border |
9 | DIB | Distance to the image border |
10 | DVAR | Disparity variance |
11 | SKEW | Skewness of the disparity |
Park, Min-Gyu, and Kuk-Jin Yoon. "Learning and selecting confidence measures for robust stereo matching." IEEE transactions on pattern analysis and machine intelligence 41, no. 6 (2018): 1397-1411.
My comments:
Try to reason beyond the costs.
Confidence & uncertainty from learning-based methods.
97
Direct/Classification/Supervised
Probabilistic/Regression/Unsupervised
Try to directly tell if the disparity prediction on a pixel is confident or not. (0-1 classification)
Try to estimate how much uncertainty the model has from
Classification on ending features¹. Classification on confidence measures²·¹.
98
1 Shaked, Amit, and Lior Wolf. "Improved stereo matching with constant highway networks and reflective confidence learning." CVPR 2017.
2.1 Poggi, Matteo, and Stefano Mattoccia. "Learning from scratch a confidence measure." In BMVC. 2016. (CCNN)
2.2 Poggi, Matteo, and Stefano Mattoccia. "Learning to predict stereo reliability enforcing local consistency of confidence maps." CVPR 2017.
Classification: Is the prediction close to the true value with a predefined margin? (1 or 3 pixels.)
Various kinds of confidence measures²·².
99
1 Mehltretter, Max, and Christian Heipke. "CNN-based Cost Volume Analysis as Confidence Measure for Dense Matching." ICCV 2019.
2 Kim, Sunok, et al. "Laf-net: Locally adaptive fusion networks for stereo confidence estimation." CVPR 2019.
There is a line of related work.
Confidence & uncertainty from learning-based methods.
100
Direct/Classification/Supervised
Probabilistic/Regression/Unsupervised
Try to directly tell if the disparity prediction on a pixel is confident or not. (0-1 classification)
Try to estimate how much uncertainty the model has from
101
Epistemic
Kendall, Alex, and Yarin Gal. "What uncertainties do we need in bayesian deep learning for computer vision?." NIPS 2017.
Kendall, A. G. (2019). Geometry and Uncertainty in Deep Learning for Computer Vision (Doctoral dissertation, University of Cambridge). The ideas may be traced back to (Le et al., 2005; Nix and Weigend, 1994).
Aleatoric
Homoscedastic
Heteroscedastic
Constant among different observations.
Observation specific.
102
Epistemic
Aleatoric
Uncertainty from training data.
Uncertainty from model.
Heteroscedastic
103
High supervised loss
Penalize small sp
False positive confidence
Uncertain
Low supervised loss
Penalize large sp
False negative confidence
Certain
Loss
=
Disparity loss
+
Regularization
Implementation could be as simple as adding a regression layer for sp right before disparity regression.
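A sketch of a heteroscedastic (per-pixel) aleatoric loss in the style of Kendall and Gal, written for an L1 disparity error. It assumes sp is the predicted log of the scale parameter, so the loss is exp(-sp)·|d - d*| + sp. The first term is the attenuated disparity loss and the second is the regularization that stops the network from predicting large uncertainty everywhere. Names are illustrative.

```python
import numpy as np

def aleatoric_l1_loss(pred, gt, log_sigma, valid=None):
    """Mean of exp(-s) * |pred - gt| + s, with s = log(sigma) predicted per pixel."""
    err = np.abs(pred - gt)
    per_pixel = np.exp(-log_sigma) * err + log_sigma
    if valid is not None:
        per_pixel = per_pixel[valid]          # keep only pixels with ground truth
    return float(per_pixel.mean())
```

For a fixed error e the loss is minimized at s = log e: a large error with a small s is penalized by the first term, and a small error with a large s is penalized by the second.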
104
Ground truth
Ground truth
𝞼
pred. w/ 𝞼
pred. w/o 𝞼
105
Ground truth
Ground truth
𝞼
pred. w/ 𝞼
pred. w/o 𝞼
106
My comments:
107
Advanced learning
Brief review of the SOTA.
How do the non-learning methods handle occlusions?
108
For SGBM (OpenCV) it is identified as a left-right inconsistency.
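A sketch of a left-right consistency check, assuming two disparity maps are available (one for the left view, one for the right view). A left pixel x with disparity d lands on right pixel x - d; if the right map disagrees by more than a threshold, the pixel is flagged as occluded or mismatched. OpenCV's StereoSGBM exposes a similar threshold as `disp12MaxDiff`; the function below is an independent illustration, not OpenCV's implementation.

```python
import numpy as np

def left_right_consistency(disp_left, disp_right, max_diff=1.0):
    """Return a boolean mask of left-view pixels that pass the left-right check."""
    h, w = disp_left.shape
    xs = np.arange(w)[None, :]
    # A left pixel x with disparity d corresponds to right pixel x - d.
    x_right = np.rint(xs - disp_left).astype(int)
    inside = (x_right >= 0) & (x_right < w)
    x_clipped = np.clip(x_right, 0, w - 1)
    rows = np.arange(h)[:, None]
    diff = np.abs(disp_left - disp_right[rows, x_clipped])
    return inside & (diff <= max_diff)
```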
Latest advancement:
Yan, Tingman, Yangzhou Gan, Zeyang Xia, and Qunfei Zhao. "Segment-based disparity refinement with occlusion handling for stereo matching." IEEE Transactions on Image Processing 28, no. 8 (2019): 3885-3897.
Learning-based methods.
109
Wang, Jialiang, and Todd Zickler. "Local detection of stereo occlusion boundaries." CVPR 2019.
Zhao, Shengyu, et al. "MaskFlownet: Asymmetric Feature Matching with Learnable Occlusion Mask." CVPR 2020.
Uncertainty
T. Laidlow, J. Czarnowski, A. Nicastro, R. Clark, and S. Leutenegger, “Towards the Probabilistic Fusion of Learned Priors into Standard Pipelines for 3D Reconstruction,” presented at the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, Aug. 2020.
H. Kim and B. Lee, “Probabilistic TSDF Fusion Using Bayesian Deep Learning for Dense 3D Reconstruction with a Single RGB Camera,” presented at the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, Aug. 2020. (Fuse multiple estimations.)
Duggal, Shivam, Shenlong Wang, Wei-Chiu Ma, Rui Hu, and Raquel Urtasun. "DeepPruner: Learning Efficient Stereo Matching via Differentiable PatchMatch." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4384-4393. 2019.
C. Liu, J. Gu, K. Kim, S. G. Narasimhan, and J. Kautz, “Neural RGB(r)D Sensing: Depth and Uncertainty From a Video Camera,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2019.
110
111
Advanced learning
112
Sparse measurement fusion. LiDAR fusion.
1 Uhrig, Jonas, et al. "Sparsity invariant cnns." 3DV 2017.
2 Park, Kihong, et al. "High-precision depth estimation with the 3d lidar and stereo fusion." ICRA 2018.
* Can perform depth completion to get a densified depth from the sparse measurements.
How to deal with extrinsic calibration? How to deal with the sparsity of data? How to deal with noise?
Known calibration.
Unknown calibration.
Interpolation*.
Direct.
Sparsity-invariant CNN¹.
Known calibration + interpolation + RGB guidance. NOT end-to-end, needs disparity as input².
Cross comparison (Probability of two sensors both failing is low)
Add assumptions.
113
Zhang, Junming, et al. “LiStereo: Generate Dense Depth Maps from LIDAR and Stereo Imagery.” ICRA 2020.
End-to-end. Supervised and unsupervised.
100% LiDAR
10% LiDAR
1% LiDAR
Sparsity-invariant convolutions are NOT better than regular convolution layers.
Another interesting work on enhancing an available model without retraining it.
Wang, Tsun-Hsuan, et al. "Plug-and-play: Improve depth prediction via sparse data propagation." ICRA 2019.
114
LidarStereoNet
Cheng, Xuelian, et al. "Noise-aware unsupervised deep lidar-stereo fusion." CVPR 2019.
Retain the sparse Lidar points (Dscl , Dscr ) that
are consistent in both stereo matching and Lidar measurements
115
Advanced learning
116
Cost volume
Cost regulation
Feat. Ext.
Multi-scale
Spatial pooling
Classification & Regression
Refinement
Previously, we discussed the common components.
Cost manipulations focus on constructing and regulating the cost representations.
117
Zhang, Feihu, et al. "Ga-net: Guided aggregation net for end-to-end stereo matching." CVPR 2019.
At each pixel, for a disparity channel d, apply kernel on channel d-1, d, and d+1.
118
Advanced learning
Attention
Correct the wrong prediction.
Enhance meaningful (cross-modal) information, mask and cover the misleading information.
119
120
Jie, Zequn, et al. "Left-right comparative recurrent model for stereo matching." CVPR 2018. No special loss definitions.
Left-Right Comparative Recurrent (LRCR)
Use the error of previous step’s LR comparison
soft attention
LR comparison
121
Kim, Sunok, et al. "Laf-net: Locally adaptive fusion networks for stereo confidence estimation." CVPR 2019.
Similar for multi-view stereo: Luo, Keyang, et al. "Attention-Aware Multi-View Stereo." CVPR 2020.
Locally Adaptive Fusion Networks
Multiplication
cost
disparity
color
No special loss definitions.
122
Adaptive sampling¹ or adaptive cost volume size².
1 Xu, Haofei, and Juyong Zhang. "AANet: Adaptive Aggregation Network for Efficient Stereo Matching." CVPR 2020.
2 Cheng, Shuo, et al. "Deep stereo using adaptive thin volume representation with uncertainty awareness." CVPR 2020.
123
Advanced learning
Why?
124
Issue: naive training leads to biased update.
125
Tonioni, Alessio, et al. "Learning to adapt for stereo." CVPR 2019.
Meta-learning.
Loss for base model adaptation
Base model
Updated base model
Intermediate models
126
Layers close to input have more severe domain shift issues.
1 Zhang, Zhenyu, et al. "Online Adaptation through Meta-Learning for Stereo Depth Estimation." arXiv 2019.
2 Zhang, Feihu, et al. "Domain-invariant Stereo Matching Networks." arXiv 2019.
3 Song, Xiao, et al. "AdaStereo: A Simple and Efficient Approach for Adaptive Stereo Matching." arXiv 2020.
Gradually change the BatchNorm layers of the first components¹, or use a new normalization approach².
Change the color style of the available training data with ground truth³, and manually normalize internal feature layers.
127
Generate sparse supervision from other methods.
New domain
Computed sparse disparity from another method
Confidence map
Trust all
Medium confidence
Trust high confidence only
Tonioni, Alessio, et al. "Unsupervised adaptation for deep stereo." ICCV 2017.
Multi-task: surface normal, segmentation, edge
Ramamonjisoa, Michaël, and Vincent Lepetit. "Sharpnet: Fast and accurate recovery of occluding contours in monocular depth estimation." In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0-0. 2019.
StereoNet: Guided Hierarchical Refinement for Real-Time Edge-Aware Depth Prediction
128
129
Advanced learning
Why?
Depth inference only works with knowledge and assumptions about the real world.
We are looking for cues for reliable stereo matches and reasonable spatial relationships for regularization (loss function).
Why don’t we add more knowledge as strong cues and regularizers?
130
Cues:
Segmentation, surface normal, and edge.
131
SegStereo. Key questions:
Yang, Guorun, et al. "Segstereo: Exploiting semantic information for disparity estimation." ECCV 2018.
Form of merging and splitting task-specific structures?
Loss function?
Supervised or unsupervised?
Segmentation is supervised
Disparity can be unsupervised.
132
Without segmentation cues.
With segmentation cues.
133
Wu, Zhenyao, et al. "Semantic stereo matching with pyramid cost volumes." ICCV 2019.
Complex procedures for merging cost volumes.
Special boundary-loss function.
Assumption:
Disparity discontinuities always happen at segmentation boundaries.
Similar to the unsupervised intensity-guided edge-aware smoothness loss.
134
Kusupati, Uday, et al. "Normal assisted stereo depth estimation." CVPR 2020. Lowest EPE on Scene Flow at the moment.
Joint normal estimation.
For a geometric concept such as the surface normal, an additional loss function can be defined from a geometric constraint: the depth (disparity) of points located on a surface should be consistent with the surface normal. The depth cannot change arbitrarily.
depth gradient based on disparity prediction
depth gradient based on consistency with surface normal
135
Joint edge prediction.
Song, Xiao, et al. "Edgestereo: An effective multi-task learning network for stereo matching and edge detection." International Journal of Computer Vision (2020): 1-21.
Concatenate the features.
Smoothness loss uses detected edges.
Edge detection trained by supervised learning on special datasets.
Outline
136
Stereo vision 101
Recent non-learning methods
Recent learning methods
Datasets & benchmarks
Advanced learning
Uncertainty. Occlusion. Guided. Cost. Adaptive & online learning. Multi-task.
Related CV tasks
137
Related CV tasks
Mono. depth + cam. pose
Optical flow + cam. pose
Mono. depth
Optical flow
Domain translation
(GAN or adversarial)
Cross-spectrum
Multi-spectrum
e.g. thermal
Depth completion
Multi-view stereo
Depth super resolution
Scene reconstruction
Dense map fusion
Active sensing
e.g. Realsense
Photometric stereo
More sensor fusion
e.g. ToF
Autonomy
Perception
Reconstruction
Summary
138
Stereo vision 101
Recent non-learning methods
Recent learning methods
Datasets & benchmarks
Advanced learning
Uncertainty. Occlusion. Guided. Cost. Adaptive & online learning. Multi-task.
Related CV tasks
Hands on
learning-based methods
Let’s rock ’n’ roll!
139
Google Colab notebooks. Please use your Andrew account to log in.
https://colab.research.google.com/drive/1eT_MVSnGy12ZwSWvmi6GodKcr5932t5W?usp=sharing
https://colab.research.google.com/drive/1LLn9CmoqFaqyulo8InUadPamxXfnjcD5?usp=sharing
Download the pre-trained models.
https://drive.google.com/drive/folders/1y0iGPGRdhwhW0lnLap4zJC15RuUkmk0b?usp=sharing
140
Correlation model has limited disparity range.
141
142
BACKUP SLIDES
Cross-spectrum
143
Liang, Mingyang, et al. "Unsupervised Cross-Spectral Stereo Matching by Learning to Synthesize." AAAI 2019.
144
CommonPython package
Data augmentation
If resizing, remember to scale the true disparity data and the disparity prediction.
Install libpng++.
Install Eigen3, latest version.
Install cnpy.
git clone the ROS code and rename to src. Create new catkin workspace.
Hands on
145
Docker
Images yaoyuh/cuda_ros_ocv4
python2, ROS kinetic basic, cuda9.2, OpenCV4.1.1 compiled with cuda basic.
cmake 3.14.6.
A virtualenv with python 3.5.2, pytorch 1.2, torchvision 0.4.0, OpenCV 4.1.1, NumPy 1.17.2
Command
nvidia-docker run -it --rm -v /data/datasets/:/data/datasets/ yaoyuh/cuda_ros_ocv4 /bin/bash
User:
yaoyuh:frc_member
stereo_sparse_depth_fusion
146
Sparse mask
SGM
Fuse naive
Fuse diffusion
Fuse neigh support
args: input directory.
SPS-Stereo
147
Run LocalRun.sh in src directory.
Disparity
Segmentation
occlusion (red/blue)
hinge (green)
coplanar (gray)
The severe issue of the fronto-parallel constraint/assumption.
148