1 of 148

Recent Advances of Binocular Stereo Vision

Hands on new models

Yaoyu Hu

yaoyuh@andrew.cmu.edu

2020-07-09

2 of 148

We will focus on one small part of computer vision: passive binocular stereo reconstruction.

Since most RI people already have some experience with it, let’s go a little deeper.

3 of 148

Outline

Stereo vision 101

  • Basic principles.
  • Stereo calibration.

Datasets & benchmarks

Recent non-learning methods

  • SPS-Stereo
  • Guided

Recent learning methods

  • Common components: ResNet, UNet, SPP.
  • 3D cost volume: direct construction, difference & construct.
  • Cross-correlation: 2D optical flow or 1D disparity.
  • Unsupervised training: SSIM, edge/gradient aware smoothness.
  • Real-time.

Advanced learning

Uncertainty. Occlusion. Guided. Cost. Adaptive & online learning. Multi-task.

Related CV tasks

4 of 148

A recent review article.

Poggi, Matteo, Fabio Tosi, Konstantinos Batsos, Philippos Mordohai, and Stefano Mattoccia. "On the Synergies between Machine Learning and Stereo: a Survey." arXiv preprint arXiv:2004.08566 (2020).

5 of 148

Stereo vision 101

Quick review of binocular stereo vision.

Some tips on stereo calibration.

When people say “reconstruction”, they usually mean “dense reconstruction” or “surface reconstruction”.

For dense reconstruction, we often talk about reconstruction error, validity, uncertainty, occlusion and efficiency.

6 of 148

One camera moves along the x-axis. The scene is stationary.

Or: two identical cameras are placed along the x-axis. Images are captured simultaneously.

Key observation: the images of objects move horizontally with magnitudes inversely proportional to the distance between the objects and the camera.

7 of 148

Image courtesy: Ioannis Gkioulekas, Course 15-862 @ CMU, Computational Photography, 2018 Fall.

Ref.

Tst.

  • Calibrated cameras.
  • Images are undistorted and rectified.

8 of 148

Names:

binocular stereo reconstruction, stereo vision, stereo depth prediction

Disparity sensitivity (depth in metric units, disparity in pixels).

For lower error sensitivity: move the camera closer to the object, use a larger baseline, or use a longer focal length lens.
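The equations on this slide did not survive extraction. As a hedged sketch, assume the standard rectified pinhole relation z = f·b/d (f in pixels, b in metric units, d in pixels); a disparity error δd then maps to a depth error of roughly z²/(f·b)·δd. The function names and the rig numbers are illustrative only.

```python
def depth_from_disparity(f_px, baseline_m, disp_px):
    """Depth z = f * b / d for a rectified pair (assumed pinhole model)."""
    return f_px * baseline_m / disp_px


def depth_error(f_px, baseline_m, depth_m, disp_err_px=1.0):
    """First-order depth error |dz| = z^2 / (f * b) * |dd|."""
    return depth_m ** 2 / (f_px * baseline_m) * disp_err_px


# Hypothetical rig: f = 1000 px, b = 0.1 m, 1-pixel disparity error.
for z in (2.0, 4.0):
    print(z, depth_error(1000.0, 0.1, z))
```

Under this model the error grows with z² and shrinks with f·b, which matches the three tips above.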

Does higher resolution/larger image size help?

If you have a disparity map for Ref., how can you recover the 3D points?

Orientation of coordinate systems? Points at infinity?
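One possible answer to the 3D recovery question, as a sketch only. It assumes a rectified pinhole pair, a camera frame with x right, y down and z forward, and disparity d = x_ref - x_tst. The intrinsics (`f_px`, `cx`, `cy`) and the baseline are made-up values. Pixels with d <= 0 (invalid, or points at infinity) are returned as NaN.

```python
import numpy as np


def disparity_to_points(disp, f_px, cx, cy, baseline):
    """Back-project a disparity map of the Ref. image to 3D points (H, W, 3)."""
    h, w = disp.shape
    u, v = np.meshgrid(np.arange(w, dtype=np.float64),
                       np.arange(h, dtype=np.float64))
    # d <= 0 has no finite depth under z = f * b / d.
    d = np.where(disp > 0, disp, np.nan).astype(np.float64)
    z = f_px * baseline / d
    x = (u - cx) * z / f_px
    y = (v - cy) * z / f_px
    return np.stack([x, y, z], axis=-1)


disp = np.array([[10.0, 0.0], [20.0, 40.0]])
pts = disparity_to_points(disp, f_px=100.0, cx=0.5, cy=0.5, baseline=0.2)
```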

The remaining, and fundamental, question: how to find per-pixel correspondences?

How hard can it be?

9 of 148

Ref.

Tst.

How to find per-pixel correspondences? Simple principles may not work in the passive setting.

  • Low texture, repetitive texture.
  • Occlusions.
  • Color inconsistency. Vignetting effects. Reflections. Defocus.
  • Low lighting and noise. Motion blur.
  • Calibration changes.
  • Apparent shape changes with the viewpoint. No exact match if the surface is not parallel to the camera.
  • Images are discrete in nature. Per-pixel.

10 of 148

Stereo calibration and other tips

  • Tools
    • ROS camera_calibration & OpenCV & Matlab
    • kalibr
  • For a custom stereo setup, make sure the cameras are time-synced and the timestamps of an image pair are the same. For Ximea xiC cameras, we have a working setup.
  • For accuracy, we may have to calibrate each camera individually and then do the stereo calibration.
  • For a custom stereo setup, you may have to implement custom auto-exposure. A simple implementation: brightness sampling & PD control.
  • In dark environments, fight against vignetting (auto or by calibration) and noise.
  • You may have to calibrate color as well. For the Shimizu project, we are using a color target. This only works offline.

12 of 148

Datasets & benchmarks.

Frequently used:

  • KITTI stereo
  • Scene Flow
  • Middlebury

Monocular, multi-view:

  • NYU-Depth-v2
  • ETH3D

13 of 148

KITTI stereo http://www.cvlibs.net/datasets/kitti/eval_stereo.php

376 x 1241. Small capacity. Sparse label. Outdoor, self-driving. KITTI 2012, KITTI 2015

14 of 148

Scene Flow https://lmb.informatik.uni-freiburg.de/resources/datasets/SceneFlowDatasets.en.html

540 x 960, 30k training, 4000+ testing, good disparity range up to 192. Simulation, complex geometry.

  • Large valid disparities.
  • Extreme large disparities.
  • Another driving scene.

15 of 148

Middlebury http://vision.middlebury.edu/stereo/data/

Small capacity < 100 cases. Large image size. Indoor. Complex geometry. Occlusion mask.

16 of 148

NYU-Depth-v2: https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html

480 x 640. Large capacity. Active RGBD sensor. Indoor.

Monocular.

RGB image

Depth

Segmentation

17 of 148

ETH3D https://www.eth3d.net/

4141 x 6220 DSLR images. Indoor and outdoor. Monocular. Challenging scenes.

Point cloud

Image

18 of 148

TartanAir. https://www.aicrowd.com/challenges/tartanair-visual-slam-mono-track

700k. 480x640. Depth + optical flow + pose. Simulation, environments not like scene flow.

Challenging scenes.

19 of 148

What metrics are used?

EPE (MAE)

1-pixel, 3-pixel

Geiger, Andreas, Philip Lenz, and Raquel Urtasun. "Are we ready for autonomous driving? the kitti vision benchmark suite." CVPR 2012.
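A minimal sketch of the two metrics named above: EPE (MAE) and the n-pixel error rate. It assumes a boolean validity mask, which sparse labels such as KITTI's require. The benchmarks define their own exact outlier rules (KITTI 2015, for instance, also applies a relative threshold), and this sketch leaves those out.

```python
import numpy as np


def epe(pred, gt, valid):
    """End-point error: mean absolute disparity error over valid pixels."""
    return np.abs(pred - gt)[valid].mean()


def n_pixel_error(pred, gt, valid, n=3.0):
    """Fraction of valid pixels whose absolute error exceeds n pixels."""
    return (np.abs(pred - gt)[valid] > n).mean()


gt = np.array([[10.0, 20.0], [30.0, 0.0]])   # 0 marks "no label" in this toy case
pred = np.array([[10.5, 24.0], [30.0, 5.0]])
valid = gt > 0
print(epe(pred, gt, valid), n_pixel_error(pred, gt, valid, 3.0))
```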

20 of 148

Tips and summary

  • For training deep neural networks
    • Filter bad training data.
    • Filter disparity range according to your needs.
    • Augmentation.
  • Summary

22 of 148

Recent non-learning methods

  • Baseline: SGM, SGBM
  • Multi-task
    • SPS-Stereo: Plane segmentation + stereo reconstruction.
  • Guided
    • Sparse depth measurements. ICRA 2019.
    • Dense guided. Our previous effort on high-resolution images.
  • Other

23 of 148

Hirschmüller, H. "Accurate and efficient stereo processing by semi-global matching and mutual information." CVPR 2005, pp. 807-814.

Lr: from boundary to current pixel.

For every pixel in Ref., how many S(p,d)?

What does Lr encourage?

How to compute C(p, d)?
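To make the Lr questions concrete, here is a sketch of one aggregation path (left to right) using the usual SGM recurrence. The penalties P1 and P2 and the toy costs are made-up values, and a real implementation runs several path directions over full images.

```python
import numpy as np


def aggregate_left_to_right(cost, p1, p2):
    """One SGM path (r = +x) over a cost slice of shape (W, D).

    L_r(p, d) = C(p, d) + min(L_r(p-r, d),
                              L_r(p-r, d-1) + P1,
                              L_r(p-r, d+1) + P1,
                              min_k L_r(p-r, k) + P2) - min_k L_r(p-r, k)
    """
    w, d = cost.shape
    L = np.empty((w, d), dtype=np.float64)
    L[0] = cost[0]                      # the path starts at the image boundary
    for x in range(1, w):
        prev = L[x - 1]
        prev_min = prev.min()
        minus = np.concatenate(([np.inf], prev[:-1])) + p1   # coming from d-1
        plus = np.concatenate((prev[1:], [np.inf])) + p1     # coming from d+1
        best = np.minimum(np.minimum(prev, minus),
                          np.minimum(plus, prev_min + p2))
        L[x] = cost[x] + best - prev_min
    return L


cost = np.array([[0.0, 5.0, 5.0],
                 [5.0, 4.5, 5.0],    # weak, ambiguous evidence at this pixel
                 [0.0, 5.0, 5.0]])
L = aggregate_left_to_right(cost, p1=1.0, p2=3.0)
print(cost[1].argmin(), L[1].argmin())
```

In this toy case the raw cost at the middle pixel prefers d = 1, while the aggregated cost keeps d = 0, the disparity of its neighbor. Summing L_r over all path directions gives S(p, d).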

24 of 148

A little about matching cost

Ref.

Tst.

25 of 148

A little about matching cost

Ref.

Tst.

max number of disp.

For a pixel in Ref. image, compute matching cost against the Tst. image at all possible x-coordinate locations

max disp.

min disp.

true disp.

What is matching cost? A measure of similarity.
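A toy sketch of the search described above, using the per-pixel absolute difference as the matching cost (summing it over a window would give SAD). It assumes Ref. is the left image, so pixel (y, x) in Ref. is compared against (y, x - d) in Tst. for d = 0 .. max_disp.

```python
import numpy as np


def ad_cost_volume(ref, tst, max_disp):
    """Absolute-difference cost, shape (max_disp + 1, H, W).

    Candidates that fall outside Tst. get an infinite cost.
    """
    h, w = ref.shape
    volume = np.full((max_disp + 1, h, w), np.inf)
    for d in range(max_disp + 1):
        volume[d, :, d:] = np.abs(ref[:, d:] - tst[:, :w - d])
    return volume


# Toy pair: Tst. is Ref. shifted left by 2 pixels, so the true disparity is 2.
ref = np.array([[1.0, 3.0, 7.0, 2.0, 9.0, 4.0]])
tst = np.roll(ref, -2, axis=1)
volume = ad_cost_volume(ref, tst, max_disp=3)
disp = volume.argmin(axis=0)            # winner-takes-all over d
```

The first two columns have no valid match at d = 2, which is one simple source of invalid pixels near the image border.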

26 of 148

How to measure similarity?

Guess which one OpenCV uses?

Which ones do deep-learning models use?

  • Direct subtraction (absolute difference of intensity, photometric difference, difference of gradient). Sum of absolute differences (SAD). Fronto-parallel.
  • All kinds of norms of patch difference.
  • Cross-correlation (zero-mean normalized cross-correlation, cosine distance).
  • Mutual information.
  • Bilateral filter. (Weighted difference)
  • SSIM (Structural SIMilarity index, statistics). https://www.cns.nyu.edu/~lcv/ssim/
  • Hamming distance of Census transform.
  • PatchMatch. Probabilistic method.
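For the Census item in the list above, a small sketch of a 3x3 census transform with a Hamming-distance cost. The window size, the bit order and the edge-replication padding are choices made for this illustration. They are not taken from any specific implementation.

```python
import numpy as np


def census_3x3(img):
    """3x3 census transform: one bit per neighbor, set if neighbor < center."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    code = np.zeros((h, w), dtype=np.uint8)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            neighbor = padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
            code = (code << 1) | (neighbor < img).astype(np.uint8)
    return code


def hamming(a, b):
    """Number of differing bits between two uint8 census codes."""
    diff = np.bitwise_xor(a, b)
    return np.unpackbits(diff[..., None], axis=-1).sum(axis=-1)


img = np.arange(9, dtype=np.float64).reshape(3, 3)
brighter = 2.0 * img + 10.0          # gain and offset change
same = hamming(census_3x3(img), census_3x3(brighter))
```

Because the code only stores intensity orderings, `same` is zero everywhere, which is why census costs tolerate the color inconsistencies listed earlier.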

(On the slide, four of the measures above are tagged "DL".)

27 of 148

What does SGBM offer as a baseline?

Post-process

Speckle filter + median filter.

Case-dependent parameters.

29 of 148

Slanted-Plane Smoothing Stereo

Key Idea:

  • Disparity values change slowly between visually similar neighbors.

[Figure: stereo geometry (xL, xR, baseline b, focal length f, depth z) and a 3D plane with normal n, drawn both in the camera frame and in pixel-disparity space (x, y, d) above the image plane, with p = [xp, yp, d]T.]

30 of 148

Plane-let / segment

[Figure: a plane-let in pixel-disparity space (x, y, d) above the image plane, with point p = [xp, yp, d]T and normal ni.]

Later: To calculate a pixel’s disparity

  • Find which plane-let/segment it belongs to.
  • Identify the segment index i.
  • Retrieve Ai, Bi, and Ci.
  • Use p’s coordinates xp and yp to calculate d.
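The four steps above can be sketched as a lookup. This assumes each segment i stores a plane in pixel-disparity space as d = Ai·x + Bi·y + Ci, which is what retrieving Ai, Bi and Ci suggests. The segment map and the plane values are invented for the example.

```python
import numpy as np


def disparity_from_planes(seg, planes):
    """Per-pixel disparity from per-segment planes d = A*x + B*y + C.

    seg:    (H, W) integer segment index i for each pixel.
    planes: (N, 3) rows (A_i, B_i, C_i) in pixel-disparity space.
    """
    h, w = seg.shape
    x, y = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    a, b, c = planes[seg].transpose(2, 0, 1)         # each of shape (H, W)
    return a * x + b * y + c


seg = np.array([[0, 0, 1, 1]])
planes = np.array([[0.5, 0.0, 10.0],    # slanted segment
                   [0.0, 0.0, 3.0]])    # fronto-parallel segment
d = disparity_from_planes(seg, planes)
```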

Quiz:

Are xp and yp world coordinates?

What is i ?

What are pi and ni?

What happens if ndi = 0?

Is the segment a plane in 3D camera frame? Why?

Need an initial disparity map to fit to. Where to get it?

Assume that the world is made of piecewise planar surfaces.

31 of 148

SPS-Stereo

Pipeline: L, R → SGM → Segmentation & outlier pixels → Smoothing at boundaries, identify boundary type → Smoothing across segments, modify 𝜽i

Yamaguchi, K., McAllester, D., & Urtasun, R. “Efficient joint segmentation, occlusion labeling, stereo and flow estimation.” ECCV 2014.

32 of 148

Smoothing objective of SPS-Stereo. Energy function.

Census transform: Spangenberg, Robert, et al. "Weighted semi-global matching and center-symmetric census transform for robust driver assistance." 2013. Converts pixel values to a binary code. The Hamming distance measures the difference.

33 of 148

Smoothing objective of SPS-Stereo. Energy function.

[Energy function annotations. Variables: seg., plane params., outlier flag, line label. Inputs: ref. image, disp. SGM. Terms: color, position, depth, plane smoothness, label prior, boundary length. Slide note: may be wrong.]

Use gradient + Hamming distance of Census transform of image patches as matching cost.

34 of 148

SPS-Stereo

Pipeline: L, R → SGM → Segmentation & outlier pixels → Smoothing at boundaries, identify boundary type → Smoothing across segments, modify 𝜽i

The figure marks an outer iteration and an inner iteration over these steps.

35 of 148

TPS (Topology Preserving Segmentation), ETPS

[Figure: four neighboring segments with their plane parameters, (s0, 𝜽0), (s1, 𝜽1), (s2, 𝜽2), (s3, 𝜽3), shown twice.]

  • A pixel may be labeled as an outlier.
  • Pixels on the boundaries are assigned one of three boundary types (coplanar, hinge, occlusion).

36 of 148

Results

There is another recent related work from Michael Kaess’ team:

Zhang, Shuangli, Weijian Xie, Guofeng Zhang, Hujun Bao, and Michael Kaess. "Robust stereo matching with surface normal prediction." ICRA 2017.

Ref.

Disparity

Segmentation and boundaries

Gray: coplanar

Green: hinge

Red/blue: occlusion

Point cloud!

38 of 148

Fuse sparse depth measurements

Shivakumar, Shreyas S., Kartik Mohta, Bernd Pfrommer, Vijay Kumar, and Camillo J. Taylor. "Real time dense depth estimation by fusing stereo with sparse depth measurements." ICRA 2019.

  • Depth estimation that fuses information from a stereo pair with sparse range measurements derived from a LIDAR sensor or a range camera.
  • Anisotropic diffusion and semi-global matching.
  • Three types of fusion.

RGB image

SGM

Neighborhood Support

Diffusion based

Anisotropic diffusion

KITTI LiDAR true depth

15% are sampled

39 of 148

3 types of fusion by equations. (2/3)

Naïve Fusion

  • If the current pixel has a measurement, trust the measurement.
  • Note: this only happens along d at xm, ym.
  • τd = 2
  • β = USHRT_MAX/10
  • ε = 0

Neighborhood Promotion

  • Loop over all d on the center pixel; fixed d on the neighbor pixel; pixel-guided weight.
  • “We use the grayscale image as the guide, assuming that within small windowed regions, the grayscale intensities of two points on a surface having similar depth also have similar intensities.”
  • Note: this only happens at neighbors within 3 pixels with d = dm.
  • β = USHRT_MAX/100
  • τn = T_CONF_WEIGHT (0.3)
  • ε = 0. This is aggressive.

No update

Quiz: USHRT_MAX = ?

40 of 148

3 types of fusion by equations. (3/3)

The original paper did not explain this part clearly.

  • Enlarge the effective area of a sparse measurement.
  • Within a neighborhood of a depth measurement, interpolate the measurement.
  • Bilateral filter. W is the weight related to the nearest measurement.
  • ε = 0.
  • γ = USHRT_MAX/100
  • β = USHRT_MAX/10

Equation annotations: > 0.7; 0.4 < W <= 0.7; |dk - dv| > 1 -> dk != dv.

The code tells the truth.

42 of 148

Lr

From boundary to current pixel.

43 of 148

Dense guidance: pull down the cost values inside disparity regions predicted by a deep-learning model

[Equation annotations: updated cost; cost from SGBM (OpenCV); center of the disparity range predicted by DL; half range width, from DL; two constants, both 0.1.]

46 of 148

Ye, Mao, et al. "3D reconstruction in the presence of glasses by acoustic and stereo fusion." CVPR 2015.

Keller, John, and Sebastian Scherer. "A Stereo Algorithm for Thin Obstacles and Reflective Objects." arXiv preprint arXiv:1910.04874 (2019).

There is much work on fusion with ToF (time-of-flight) cameras.

47 of 148

Summary

Non-learning methods

Matching cost + cost aggregation + search for the best + post-process + parameters

Multi-task

Guided

Fused

48 of 148

Hands on

non-learning methods

Let’s rock!

49 of 148

Sample data

Middlebury teddy. Size, disparity range. True disparity and mask.

Ways disparity maps can be represented (png with u16 type, pfm).

Scale factor for disparity maps with unsigned short type. (256)

The most important parameters for the SGM based methods are the min and max disparities.
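A sketch of the u16 convention mentioned above, with the scale factor 256. Reserving 0 for invalid pixels is an assumption of this example. Reading and writing the PNG or PFM file itself is left out.

```python
import numpy as np

SCALE = 256.0  # scale factor for disparity maps stored as unsigned short


def encode_disparity_u16(disp):
    """Float disparity -> uint16 values as they would be stored in a PNG.

    Assumption: 0 is reserved for invalid pixels (NaN or non-positive).
    """
    d = np.where(np.isfinite(disp) & (disp > 0), disp, 0.0)
    return np.clip(np.round(d * SCALE), 0, 65535).astype(np.uint16)


def decode_disparity_u16(raw):
    """uint16 PNG values -> float disparity and validity mask."""
    return raw.astype(np.float32) / SCALE, raw > 0


raw = encode_disparity_u16(np.array([[1.5, np.nan, 100.25]]))
restored, mask = decode_disparity_u16(raw)
```

With this scale the largest storable disparity is 65535 / 256, just under 256 pixels, at a resolution of 1/256 pixel.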

50 of 148

  • Try different sample fractions.
  • Check out the point cloud.
    • Measure with CloudCompare.
    • Manually compute the disparity values.

52 of 148

Deep-learning methods

A summary based on KITTI stereo benchmark.

53 of 148

Deep-learning methods

  • Common components
  • 3D cost volume
  • Cross-correlation
  • Unsupervised training
  • Real-time

54 of 148

Common structure (usually supervised) & common components

Kendall, Alex, et al. "End-to-end learning of geometry and context for deep stereo regression." ICCV 2017.

Dosovitskiy, Alexey, et al. "Flownet: Learning optical flow with convolutional networks." ICCV 2015.

Cost volume

Cost regularization

Feat. Ext.

Multi-scale

Spatial pooling

Classification & Regression

Refinement

55 of 148

Feature extraction. Front end.

Pre-trained backbones: VGG [1], ResNet [2]

VGG16, VGG19, ResNet50, ResNet101

Auto-encoder-like, encoder-decoder-like: UNet [3]

Enlarge the receptive field: SPP [4]

Feature manipulation: warping

  1. Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
  2. He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image recognition." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770-778. 2016.
  3. Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation." In International Conference on Medical image computing and computer-assisted intervention, pp. 234-241. Springer, Cham, 2015.
  4. He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Spatial pyramid pooling in deep convolutional networks for visual recognition." IEEE transactions on pattern analysis and machine intelligence 37, no. 9 (2015): 1904-1916.

56 of 148

UNet

57 of 148

SPP

Variants:

Weighted summation.

Zhao, Hengshuang, et al. "Pyramid scene parsing network." CVPR 2017.

Yang, Gengshan, et al. "Hierarchical deep stereo matching on high-resolution images." CVPR 2019.

58 of 148

Warping [1, 2]: per-pixel sampling of an image, done in a differentiable way.

[1] Ilg, Eddy, et al. "Flownet 2.0: Evolution of optical flow estimation with deep networks." CVPR 2017.

[2] Jaderberg, Max, et al. "Spatial transformer networks." NIPS 2015.

left

right

disparity

warped right

right -> left

59 of 148

A good opportunity to dive into the source code.

How to make warping differentiable?

Sun, Deqing, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. "Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume." CVPR 2018.
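One way to see why warping can be differentiable: sample with linear interpolation, so that each output value is a weighted sum of two neighboring pixels whose weights depend continuously on the disparity. The NumPy sketch below only illustrates the arithmetic for the right -> left case. It is not the PWC-Net code, and a framework sampler such as `torch.nn.functional.grid_sample` would be the usual choice inside a network.

```python
import numpy as np


def warp_right_to_left(right, disp):
    """Sample the right image at x - d with linear interpolation along x.

    right: (H, W) image or feature channel; disp: (H, W) left disparity.
    Samples outside the image are clamped to the border (an assumption).
    """
    h, w = right.shape
    xs = np.arange(w, dtype=np.float64)[None, :] - disp
    xs = np.clip(xs, 0.0, w - 1.0)
    x0 = np.floor(xs).astype(np.int64)
    x1 = np.minimum(x0 + 1, w - 1)
    frac = xs - x0                       # interpolation weight, linear in disp
    rows = np.arange(h)[:, None]
    return (1.0 - frac) * right[rows, x0] + frac * right[rows, x1]


right = np.array([[0.0, 10.0, 20.0, 30.0]])
disp = np.array([[0.0, 0.5, 1.0, 1.5]])
warped = warp_right_to_left(right, disp)
```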

61 of 148

Loss and optimizer

Supervised:

Smooth L1 loss is often good enough. May add weighting based on the intensity gradient [1].

Adam optimizer is often good enough.

Unsupervised:

SSIM [2], edge-aware smoothness [2], consistency

(Saved for the unsupervised section.)

[1] Pu, Can, Runzi Song, Radim Tylecek, Nanbo Li, and Robert B. Fisher. "Sdf-gan: Semi-supervised depth fusion with multi-scale adversarial networks." arXiv preprint arXiv:1803.06657 (2018).

[2] Godard, Clément, Oisin Mac Aodha, and Gabriel J. Brostow. "Unsupervised monocular depth estimation with left-right consistency." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 270-279. 2017.

63 of 148

Why cost volume?

Fully convolutional networks give fuzzy disparity predictions.

A cost volume is a medium through which disparity prediction becomes a classification over all candidate integer disparities.

It is not a new idea in the CV community. It was used before deep learning became popular.

64 of 148

A little about matching cost

Ref.

Tst.

max number of disp.

For a pixel in Ref. image, compute matching cost against the Tst. image at all possible x-coordinate locations

max disp.

min disp.

true disp.

Key idea:

Hey my neural-net, let me help to arrange the pixels so that you can easily compare the similarity between them. No worries, I’ve got everything in order.

65 of 148

Wu, Zhenyao, et al. "Semantic stereo matching with pyramid cost volumes." ICCV 2019.

In their paper, there is also an equation that shows the classification and regression of disparity values.

66 of 148

[Figure: two feature maps, each (C, H, W), are placed side by side at D=0, D=1, ..., D=5, ..., D=max along a new dimension D, giving a volume of shape (D, C, H, W).]

  • Direct construction: only place the features side by side.
  • Can also compute the difference between the features and place the differences in a similar way.
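Both constructions in the list above can be sketched in a few lines. The (D, C, H, W) axis order follows the slide. In the concatenation variant the channel axis holds both feature sets and so has size 2C here, which is this sketch's reading of "side by side". Zero-filling the columns that have no counterpart is also an assumption.

```python
import numpy as np


def concat_cost_volume(feat_ref, feat_tst, max_disp):
    """Stack Ref. features with Tst. features shifted by each disparity.

    feat_ref, feat_tst: (C, H, W). Returns (max_disp, 2C, H, W).
    """
    c, h, w = feat_ref.shape
    volume = np.zeros((max_disp, 2 * c, h, w), dtype=feat_ref.dtype)
    for d in range(max_disp):
        volume[d, :c, :, d:] = feat_ref[:, :, d:]
        volume[d, c:, :, d:] = feat_tst[:, :, :w - d]
    return volume


def difference_cost_volume(feat_ref, feat_tst, max_disp):
    """Same layout, but store feature differences: (max_disp, C, H, W)."""
    c, h, w = feat_ref.shape
    volume = np.zeros((max_disp, c, h, w), dtype=feat_ref.dtype)
    for d in range(max_disp):
        volume[d, :, :, d:] = feat_ref[:, :, d:] - feat_tst[:, :, :w - d]
    return volume


rng = np.random.default_rng(0)
feat_ref = rng.normal(size=(4, 3, 8))
feat_tst = np.roll(feat_ref, -2, axis=2)     # toy case: true disparity is 2
vol_cat = concat_cost_volume(feat_ref, feat_tst, max_disp=4)
vol_diff = difference_cost_volume(feat_ref, feat_tst, max_disp=4)
```

No similarity is computed at this stage. The following 3D convolutions are expected to learn the comparison.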

67 of 148

Yang, Gengshan, et al. "Hierarchical deep stereo matching on high-resolution images." CVPR 2019.

Chang, Jia-Ren, and Yong-Sheng Chen. "Pyramid stereo matching network." CVPR 2018.

Need a classification layer (on D) and a disparity regression layer to compute the disparity.
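A sketch of that final step in the soft-argmin style: a softmax over the D axis acts as the classification, and the expected disparity under that distribution acts as the regression. The sign convention (low cost means a good match) and the toy volume are assumptions of this example.

```python
import numpy as np


def soft_argmin(cost, axis=0):
    """Disparity regression from a regularized cost volume, e.g. (D, H, W).

    Returns the expected disparity and the per-disparity probabilities.
    """
    logits = -cost
    logits = logits - logits.max(axis=axis, keepdims=True)   # numerical stability
    prob = np.exp(logits)
    prob /= prob.sum(axis=axis, keepdims=True)
    d = np.arange(cost.shape[axis], dtype=np.float64)
    shape = [1] * cost.ndim
    shape[axis] = -1
    return (prob * d.reshape(shape)).sum(axis=axis), prob


# Two equally good candidates at d = 1 and d = 2 give a sub-pixel estimate.
cost = np.array([9.0, 0.0, 0.0, 9.0]).reshape(4, 1, 1)
disp, prob = soft_argmin(cost)
```

Unlike a hard argmin, this output is differentiable with respect to the cost volume and can land between integer disparities.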

68 of 148

Dosovitskiy, Alexey, et al. "Flownet: Learning optical flow with convolutional networks." ICCV 2015.

Ilg, Eddy, et al. "Flownet 2.0: Evolution of optical flow estimation with deep networks." CVPR 2017.

69 of 148

Why bother?

Have a guess?

70 of 148

[Figure: k × k patches in Ref. and Tst.]

Notes: We have to do cross-correlation for every pixel in Ref. against the Tst. image across all possible x-coordinate locations.

The input features are (C, H, W). The resulting cost volume is (2D+1, H, W).
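A sketch of the 1D case with kernel size 1: for every pixel in Ref., take the dot product over channels with the Tst. feature at each horizontal shift in [-D, D]. Averaging over channels and zero-filling shifts that leave the image are choices made for this example.

```python
import numpy as np


def correlation_1d(feat_ref, feat_tst, max_disp):
    """Kernel-size-1 correlation along x, output (2 * max_disp + 1, H, W).

    Channel k holds mean_C(feat_ref[:, y, x] * feat_tst[:, y, x + s])
    with shift s = k - max_disp.
    """
    c, h, w = feat_ref.shape
    out = np.zeros((2 * max_disp + 1, h, w), dtype=np.float64)
    for k, s in enumerate(range(-max_disp, max_disp + 1)):
        lo, hi = max(0, -s), min(w, w - s)      # x range where x + s is valid
        out[k, :, lo:hi] = (feat_ref[:, :, lo:hi] *
                            feat_tst[:, :, lo + s:hi + s]).mean(axis=0)
    return out


rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 2, 10))
feat /= np.linalg.norm(feat, axis=0, keepdims=True)   # normalize input features
shifted = np.roll(feat, -2, axis=2)                   # true shift is s = -2
corr = correlation_1d(feat, shifted, max_disp=3)
best = corr.argmax(axis=0)
```

With unit-norm features the dot product is a cosine similarity, so the true shift gives the largest response.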

71 of 148

Enough talking, let’s look at the code!

72 of 148

Comments:

  • Guess what kernel size people are using?
  • Is cross-correlation really measuring similarity?
  • Zero-mean normalized cross-correlation or cosine distance is also possible but slow.
  • Normalize the input features instead.

Kernel size = 1.

73 of 148

New variant: try not to do correlation over all input channels.

Correlation

Group-wise correlation

  • The correlation is computed on a subset of the total Nc channels.
  • There are Ng groups of channels.

Guo, Xiaoyang, et al. "Group-wise correlation stereo network." CVPR 2019.
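A sketch of the group-wise idea for a single disparity candidate: split the Nc channels into Ng groups and compute one correlation value per group instead of one for all channels. Averaging within each group follows the usual formulation. The shapes are example values.

```python
import numpy as np


def groupwise_correlation(feat_ref, feat_tst, num_groups):
    """Correlate within channel groups.

    feat_ref, feat_tst: (Nc, H, W), already aligned for one disparity.
    Returns (Ng, H, W): mean element-wise product over the Nc / Ng
    channels of each group.
    """
    c, h, w = feat_ref.shape
    assert c % num_groups == 0, "Nc must be divisible by Ng"
    prod = (feat_ref * feat_tst).reshape(num_groups, c // num_groups, h, w)
    return prod.mean(axis=1)


rng = np.random.default_rng(0)
f_ref = rng.normal(size=(8, 2, 3))
f_tst = rng.normal(size=(8, 2, 3))
gwc = groupwise_correlation(f_ref, f_tst, num_groups=4)
```

With Ng = 1 this reduces to the plain correlation. A larger Ng keeps more of the per-channel information in the cost volume.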

74 of 148

What have we discussed so far?

  • Common components: UNet-like front end, SPP, warping, loss functions, optimizer.
  • 3D cost volume.
  • Cross-correlation cost volume.
  • What will be skipped?
    • Cost volume regularization. Multi-layer with skip connections. Hour-glass.
    • Disparity refinement, if the final disparity is not sharp enough.
    • For refinement, a UNet-like encoder-decoder is usually good enough. (It only needs local information.)
    • Dilated kernel.
    • More complex refinement can be residual learning [1] or a recurrent model [2]. Attention may be another way.

[1] Pang, Jiahao, et al. "Cascade residual learning: A two-stage convolutional neural network for stereo matching." ICCVW 2017.

[2] Batsos, Konstantinos, et al. "Recresnet: A recurrent residual cnn architecture for disparity map enhancement." 3DV 2018.

76 of 148

Why?

  • The literature shows that most stereo models are trained in a supervised manner.
    • Zhou, Chao, et al. "Unsupervised learning of stereo matching." ICCV 2017.
    • Zhong, Yiran, et al. "Self-supervised learning for stereo matching with self-improving ability." arXiv 2017.
    • Smolyanskiy, Nikolai, et al. "On the importance of stereo for accurate depth estimation: An efficient semi-supervised deep neural network approach." CVPRW 2018.
  • Getting ground truth data is hard.
  • Many multi-task models have to be trained unsupervised.
    • Joint optical flow and stereo or depth. Occlusion. Adaptive model. Attention. Adversarial models. Scene understanding.
  • Multi-view stereo.
  • Monocular-depth.
  • What are we going to discuss?
    • Disparity/depth related loss function definitions.
    • Appearance/photometric loss (L1 + SSIM)
    • Edge-aware smoothness
    • Consistency
    • Census loss

77 of 148

  1. Godard, Clément, et al. "Unsupervised monocular depth estimation with left-right consistency." CVPR 2017.
  2. Godard, Clément, et al. “Digging into Self-Supervised Monocular Depth Prediction.” ICCV 2019.
  3. Song, Xiao, et al. "Edgestereo: An effective multi-task learning network for stereo matching and edge detection." International Journal of Computer Vision 2020.

Appearance/photometric

  • Image patches should look similar after warping. (SSIM)
  • Pixel intensities should be the same between corresponding pixels.

Edge-aware smoothness

  • Disparity discontinuities should only happen at object boundaries.
  • Later, in the joint edge-prediction work [3], the smoothness is defined based on detected edges.

Left-right consistency

  • Corresponding pixels in the Ref. and Tst. images should agree with each other.
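A sketch of an edge-aware smoothness term in the spirit of the idea above: disparity gradients are penalized, but less so where the image itself has a strong gradient. The exact weighting, a single grayscale channel and the lack of normalization are simplifying assumptions.

```python
import numpy as np


def edge_aware_smoothness(disp, image):
    """Mean of |grad d| * exp(-|grad I|) over the x and y directions."""
    dx_d = np.abs(disp[:, 1:] - disp[:, :-1])
    dy_d = np.abs(disp[1:, :] - disp[:-1, :])
    dx_i = np.abs(image[:, 1:] - image[:, :-1])
    dy_i = np.abs(image[1:, :] - image[:-1, :])
    return (dx_d * np.exp(-dx_i)).mean() + (dy_d * np.exp(-dy_i)).mean()


# A disparity step that coincides with an image edge costs far less
# than the same step on a textureless (flat) image.
step = np.zeros((4, 6))
step[:, 3:] = 5.0
edge_image = np.zeros((4, 6))
edge_image[:, 3:] = 10.0
flat_image = np.zeros((4, 6))
loss_aligned = edge_aware_smoothness(step, edge_image)
loss_flat = edge_aware_smoothness(step, flat_image)
```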

78 of 148

Census trans.

Ternary trans. [1]

How to make this differentiable? [2]

[1] Stein, Fridtjof. "Efficient computation of optical flow using the census transform." In Joint Pattern Recognition Symposium, pp. 79-86. Springer, Berlin, Heidelberg, 2004.

[2] Meister, Simon, Junhwa Hur, and Stefan Roth. "Unflow: Unsupervised learning of optical flow with a bidirectional census loss." AAAI 2018. https://github.com/simonmeister/UnFlow/blob/master/src/e2eflow/core/losses.py

79 of 148

Why?

Disparity predictions do not look realistic.

For some areas of the Ref. image, it is not possible to find any match. Why?

So let’s make the neural net model the distribution of the training data and do better at guessing the missing matches.

82 of 148

Why?

Accurate models are heavy.

Why do we need extreme accuracy (global EPE < 1.0 pixel) anyway?

Also, for autonomous navigation, we do not need to pursue high density.

Two types. On desktop-class GPU. On mobile GPU (TX2).

83 of 148

Real-time learning based method.

GANet

AnyNet

Basic idea?

84 of 148

AnyNet [1]: disparity prediction at any time.

[1] Wang, Yan, et al. "Anytime stereo image depth estimation on mobile devices." ICRA 2019.

[2] Liu, Sifei, et al. "Learning affinity via spatial propagation networks." NIPS 2017.

  • Multi-level prediction. Residual prediction.
  • Spatial propagation model [2] as refinement. (It improves the result significantly according to the authors. This model was originally designed to refine segmentation boundaries.)
  • Key feature: the model can be queried at any time to output its current best estimate from different stages.

85 of 148

Spatial propagation model: SPNet.

86 of 148

What performance does AnyNet achieve?

Unfortunately, AnyNet is implemented on PyTorch 0.4.0 with custom layers (C++ and CUDA) which are deprecated. If you are interested, I have a Docker image for AnyNet saved on perceptron:/data/datasets/yaoyuh/Docker/Anaconda/a3py3.6pt0.4.0.tar (7.1GB). And I have my modified and tested version hosted at https://github.com/huyaoyu/AnyNet

87 of 148

Something is missing. How to make a deep neural net fast?

  • Do less.
    • Low resolution.
    • Fewer levels.
  • Use a small model with fewer parameters.
    • Lightweight model.
    • Knowledge distillation.
  • Multi-scale.
    • Predict the residual instead of the full disparity.
    • Give a low-res result if requested.

No magic happens.

Mittal, Sparsh. "A Survey on optimized implementation of deep learning models on the NVIDIA Jetson platform." Journal of Systems Architecture 97 (2019): 428-442.

Could be multi-task.

88 of 148

Dovesi, Pier Luigi, et al. "Real-time semantic stereo matching." ICRA 2020. No code available.

  • Multi-stage.
  • Residuals.
  • AnyTime.

More accurate but slower than AnyNet.

89 of 148

Source code: TensorFlow

Tonioni, Alessio, et al. "Real-time self-adaptive deep stereo." CVPR 2019.

Domain shift. Adaptation. Online learning.

Key idea:

  • Unsupervised learning.
  • Multi-scale with separate layers.
  • Only back-prop one layer at a time.
  • Try to find out which layer to train upon each new input.

Speed:

  • 60 FPS on an NVIDIA 1080 Ti
  • 3 FPS on TX2

90 of 148

Summary

Cost volume

Cost regularization

Feat. Ext.

Multi-scale

Spatial pooling

Classification & Regression

Refinement

+ Unsupervised methods. Real-time considerations.

92 of 148

What additional training data are available?

CityScapes: Segmentation.

Cordts, M., et al. “The cityscapes dataset for semantic urban scene understanding.” CVPR 2016.

93 of 148

Advanced learning

A brief review of the SOTA.

  • Uncertainty estimation.
  • Occlusion.
  • Guided and fusion.
  • New cost manipulations.
  • Attention.
  • Adaptive & online learning.
  • Joint-target
    • Segmentation.
    • Surface normal.
    • Edge.

94 of 148

Again, matching cost.

Ref.

Tst.

max number of disp.

For a pixel in Ref. image, compute matching cost against the Tst. image at all possible x-coordinate locations

max disp.

min disp.

true disp.

What is matching cost? A measure of similarity.

A natural question: What about confidence?

95 of 148

Classical confidence measures.

#   Category                                               Abbreviation  Name
1   Matching cost                                          MSM/MC        Matching Score Measure/Minimum Cost
2   Local properties of the cost curve                     CUR           Curvature
3   Local minima of the cost curve                         PKR           Peak Ratio
4                                                          PKRN          Naive Peak Ratio
5                                                          MM            Maximum Margin
6                                                          MMN           Naive Maximum Margin
7   The entire cost curve                                  PRB           Probabilistic Measure
8                                                          MLM           Maximum Likelihood Measure
9                                                          AML           Attainable Maximum Likelihood
10                                                         NEM           Negative Entropy Measure
11                                                         NOI           Number of Inflection Points
12                                                         WMN           Winner Margin
13                                                         WMNN          Naive Winner Margin
14  Consistency between the left and right disparity maps  LRC           Left-Right Consistency
15                                                         LRD           Left-Right Difference
16  Distinctiveness-based confidence measure               DSM           Distinctive Similarity Measure
17                                                         SAMM          Self-Aware Matching Measure

(An empty Category cell continues the category of the row above.)

Hu, Xiaoyan, and Philippos Mordohai. "A quantitative evaluation of confidence measures for stereo vision." IEEE transactions on pattern analysis and machine intelligence 34, no. 11 (2012): 2121-2133.

Guess which one is the most effective when tested on multiple tasks?

My comments:

Try to reason about the costs.
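To reason about the costs concretely, here is a sketch of three of the listed measures computed from a cost volume of shape (D, H, W) and a pair of disparity maps. The formulas follow the common definitions (MSM = -c1, PKRN = c2/c1, LRC = -|dL - dR(x - dL)|). The small `eps` is added here only to avoid division by zero. Higher values mean more confidence.

```python
import numpy as np


def msm(cost):
    """Matching Score Measure: negative minimum cost, shape (H, W)."""
    return -cost.min(axis=0)


def pkrn(cost, eps=1e-6):
    """Naive Peak Ratio: second-smallest cost over the smallest cost."""
    two = np.partition(cost, 1, axis=0)[:2]     # two smallest costs, in order
    return (two[1] + eps) / (two[0] + eps)


def lrc(disp_left, disp_right):
    """Left-Right Consistency: -|d_L(x, y) - d_R(x - d_L(x, y), y)|."""
    h, w = disp_left.shape
    x = np.arange(w)[None, :]
    x_r = np.clip(np.rint(x - disp_left).astype(np.int64), 0, w - 1)
    rows = np.arange(h)[:, None]
    return -np.abs(disp_left - disp_right[rows, x_r])


# Pixel 0 has a distinct minimum; pixel 1 has two nearly equal candidates.
cost = np.array([[[1.0, 4.0]], [[5.0, 5.0]], [[9.0, 9.0]]])   # (D=3, H=1, W=2)
conf_pkrn = pkrn(cost)
conf_lrc = lrc(np.array([[0.0, 1.0, 1.0, 1.0]]),
               np.array([[1.0, 1.0, 1.0, 0.0]]))
```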

96 of 148

More classical measures.

#   Abbreviation  Name
1   PER           Perturbation measure
2   MDD           Median Deviation
3   MND           Mean Deviation
4   DD/DTD        Distance to Depth Discontinuity
5   IVAR          Variance of Intensities
6   GRAD          Magnitude of image gradients
7   DTE           Distance to Edge
8   DLB           Distance to the left border
9   DIB           Distance to the image border
10  DVAR          Disparity variance
11  SKEW          Skewness of the disparity

Park, Min-Gyu, and Kuk-Jin Yoon. "Learning and selecting confidence measures for robust stereo matching." IEEE transactions on pattern analysis and machine intelligence 41, no. 6 (2018): 1397-1411.

My comments:

Try to reason beyond the costs.

97 of 148

Confidence & uncertainty from learning-based methods.

Direct/Classification/Supervised

Try to directly tell if the disparity prediction on a pixel is confident or not. (0-1 classification)

Probabilistic/Regression/Unsupervised

Try to estimate how much uncertainty the model has from

  1. the model itself and
  2. the data the model is trained on.

98 of 148

Classification on ending faetures1. Classification on confidence measures2.1.

98

1 Shaked, Amit, and Lior Wolf. "Improved stereo matching with constant highway networks and reflective confidence learning." CVPR 2017.

2.1 Poggi, Matteo, and Stefano Mattoccia. "Learning from scratch a confidence measure." In BMVC. 2016. (CCNN)

2.2 Poggi, Matteo, and Stefano Mattoccia. "Learning to predict stereo reliability enforcing local consistency of confidence maps." CVPR 2017.

Classification: Is the prediction close to the true value with a predefined margin? (1 or 3 pixels.)
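The training target for this kind of classifier can be sketched in a few lines. This is only an illustration under assumptions: the function name `confidence_labels`, the use of NumPy, and the convention that missing ground truth is stored as a non-positive or non-finite value are all hypothetical and not taken from the cited papers.

```python
import numpy as np

def confidence_labels(disp_pred, disp_true, margin=3.0):
    """Label 1 where the prediction is within `margin` pixels of the truth, else 0.

    Pixels without ground truth (assumed here to be non-finite or <= 0)
    receive the label -1 so that a training loss can ignore them.
    """
    disp_pred = np.asarray(disp_pred, dtype=np.float64)
    disp_true = np.asarray(disp_true, dtype=np.float64)
    valid = np.isfinite(disp_true) & (disp_true > 0)
    labels = np.full(disp_true.shape, -1, dtype=np.int8)
    err = np.abs(disp_pred - disp_true)
    labels[valid] = (err[valid] <= margin).astype(np.int8)
    return labels

pred = np.array([[10.0, 20.0], [30.0, 40.0]])
true = np.array([[10.5, 25.0], [0.0, 40.0]])   # 0.0 marks a missing ground truth
labels_1px = confidence_labels(pred, true, margin=1.0)
```

A margin of 1 or 3 pixels matches the two options mentioned on the slide; the choice changes how many pixels count as "confident".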

Various kinds of confidence measures [2.2].

99 of 148

99

1 Mehltretter, Max, and Christian Heipke. "CNN-based Cost Volume Analysis as Confidence Measure for Dense Matching." ICCV 2019.

2 Kim, Sunok, et al. "Laf-net: Locally adaptive fusion networks for stereo confidence estimation." CVPR 2019.

There is a line of related work.

100 of 148

Confidence & uncertainty from learning-based methods.

100

Direct/Classification/Supervised

Probabilistic/Regression/Unsupervised

Try to directly tell if the disparity prediction on a pixel is confident or not. (0-1 classification)

Try to estimate how much uncertainty the model has from

  • the internals of the model itself and
  • the data a model is trained on.

101 of 148

101

Epistemic

  • Uncertainty of the model.
  • Due to the ignorance about which model generated the observed data.
  • Can be reduced by collecting more data.

Kendall, Alex, and Yarin Gal. "What uncertainties do we need in Bayesian deep learning for computer vision?" NIPS 2017.

Kendall, A. G. (2019). Geometry and Uncertainty in Deep Learning for Computer Vision (Doctoral dissertation, University of Cambridge). The ideas may be traced back to (Le et al., 2005; Nix and Weigend, 1994).

Aleatoric

Homoscedastic

Heteroscedastic

  • From data.
  • Observation noise. E.g., sensor noise.
  • Cannot be reduced with more data.

Constant among different observations.

Observation specific.

102 of 148

102

Epistemic

Aleatoric

Uncertainty from training data.

Uncertainty from model.

Heteroscedastic

  • BNN: Let the model “randomly” give multiple predictions, and estimate the statistics.
  • Monte Carlo dropout, kept active in both training and testing.
  • Assume the disparity prediction follows a probabilistic distribution, try to estimate the distribution from the training data.
  • Regression.
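The "multiple random predictions, then statistics" idea can be sketched as below. Everything here is an assumption for illustration: `mc_dropout_statistics` and `toy_predict` are hypothetical names, and the toy predictor only mimics a network whose dropout stays active at test time.

```python
import numpy as np

def mc_dropout_statistics(stochastic_predict, image_pair, num_samples=20):
    """Call a stochastic predictor several times; return per-pixel mean and variance."""
    samples = np.stack([stochastic_predict(image_pair) for _ in range(num_samples)])
    return samples.mean(axis=0), samples.var(axis=0)

# Toy stand-in for a stereo network with dropout left on at test time (hypothetical).
rng = np.random.default_rng(0)

def toy_predict(image_pair, drop_prob=0.5):
    left, right = image_pair
    features = np.stack([left, right, left - right])
    keep = rng.random(features.shape[0]) >= drop_prob      # randomly drop whole channels
    return (features * keep[:, None, None]).sum(axis=0) / (1.0 - drop_prob)

left = np.full((2, 3), 4.0)
right = np.full((2, 3), 1.0)
mean_disp, var_disp = mc_dropout_statistics(toy_predict, (left, right), num_samples=50)
```

The per-pixel variance across samples is read as the epistemic part; the aleatoric part needs the regression branch from the right-hand column.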

103 of 148

103

High supervised loss

Penalize small sp

False positive confidence

Uncertain

Low supervised loss

Penalize large sp

False negative confidence

Certain

Loss = Disparity loss + Regularization

Implementation could be as simple as adding a regression layer for sp right before disparity regression.
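A minimal sketch of such a loss, assuming that sp on the slide denotes a per-pixel log variance as in the Gaussian formulation of Kendall and Gal (2017). The function name and the NumPy implementation are illustrative only; a real model would use the training framework's tensors.

```python
import numpy as np

def heteroscedastic_loss(disp_pred, log_var, disp_true):
    """Disparity loss attenuated by exp(-s) plus a regularization term on s.

    log_var is the assumed per-pixel s = log(sigma^2).
    """
    residual_sq = (disp_pred - disp_true) ** 2
    data_term = 0.5 * np.exp(-log_var) * residual_sq   # large residual penalizes small s
    reg_term = 0.5 * log_var                           # small residual penalizes large s
    return float(np.mean(data_term + reg_term))

pred = np.array([12.0])
true = np.array([10.0])
loss_confident = heteroscedastic_loss(pred, np.array([0.0]), true)          # s = 0
loss_uncertain = heteroscedastic_loss(pred, np.array([np.log(4.0)]), true)  # s = log 4
```

With a residual of 2 pixels the larger s gives the lower loss, while with zero residual the regularization term makes the larger s more expensive. This matches the two cases drawn on the slide.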

104 of 148

104

Figure panels: ground truth, 𝞼, pred. w/ 𝞼, pred. w/o 𝞼.

105 of 148

105

Figure panels: ground truth, 𝞼, pred. w/ 𝞼, pred. w/o 𝞼.

106 of 148

106

  1. Liu, Chao, et al. "Neural RGB(r)D sensing: Depth and uncertainty from a video camera." CVPR 2019.
  2. Poggi, Matteo & Stefano Mattoccia. "On the uncertainty of self-supervised monocular depth estimation." CVPR 2020.
  3. Yang, Nan, & Daniel Cremers. "D3VO: Deep Depth, Deep Pose and Deep Uncertainty for Monocular Visual Odometry." CVPR 2020.
  4. Tonioni, Alessio, et al. "Unsupervised adaptation for deep stereo." ICCV 2017.

My comments:

  • This uncertainty treatment is not frequently seen in stereo work, but it is often seen in other tasks [1,2,3].
  • An additional use case for uncertainty estimation is to aid adaptive or online learning [4].

107 of 148

  • Uncertainty estimation.
  • Occlusion.
  • Guided and fusion.
  • New cost manipulations.
  • Attention.
  • Adaptive & online learning.
  • Joint-target
    • Segmentation.
    • Surface normal.
    • Edge.

107

Advanced learning

A brief review of the SOTA.

108 of 148

How do the non-learning methods handle occlusions?

108

In SGBM (OpenCV), an occlusion is identified as a left-right inconsistency.
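A left-right consistency check can be sketched as follows. This is a generic NumPy illustration, not OpenCV's internal code; in OpenCV the check is controlled by the `disp12MaxDiff` parameter of StereoSGBM. The function name and the `max_diff` default are assumptions.

```python
import numpy as np

def left_right_consistency_mask(disp_left, disp_right, max_diff=1.0):
    """True where the left disparity agrees with the right disparity at the matched pixel.

    Pixel (y, x) in the left image is assumed to match (y, x - d) in the right image.
    Pixels that fail the check are candidates for occlusion or mismatch.
    """
    h, w = disp_left.shape
    cols = np.arange(w)[None, :].repeat(h, axis=0)
    rows = np.arange(h)[:, None].repeat(w, axis=1)
    x_right = np.rint(cols - disp_left).astype(np.int64)
    in_bounds = (x_right >= 0) & (x_right < w)
    x_clamped = np.clip(x_right, 0, w - 1)
    diff = np.abs(disp_left - disp_right[rows, x_clamped])
    return in_bounds & (diff <= max_diff)

disp_l = np.array([[0.0, 0.0, 1.0, 1.0]])
disp_r = np.array([[0.0, 3.0, 1.0, 1.0]])
mask = left_right_consistency_mask(disp_l, disp_r, max_diff=0.5)
```

Pixels whose match falls outside the right image are also rejected here, which is a design choice of this sketch.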

Latest advancement:

Yan, Tingman, Yangzhou Gan, Zeyang Xia, and Qunfei Zhao. "Segment-based disparity refinement with occlusion handling for stereo matching." IEEE Transactions on Image Processing 28, no. 8 (2019): 3885-3897.

109 of 148

Learning-based methods.

109

Wang, Jialiang, and Todd Zickler. "Local detection of stereo occlusion boundaries." CVPR 2019.

Zhao, Shengyu, et al. "MaskFlownet: Asymmetric Feature Matching with Learnable Occlusion Mask." CVPR 2020.

110 of 148

Uncertainty

T. Laidlow, J. Czarnowski, A. Nicastro, R. Clark, and S. Leutenegger, “Towards the Probabilistic Fusion of Learned Priors into Standard Pipelines for 3D Reconstruction,” presented at the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, Aug. 2020.

H. Kim and B. Lee, “Probabilistic TSDF Fusion Using Bayesian Deep Learning for Dense 3D Reconstruction with a Single RGB Camera,” presented at the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, Aug. 2020. (Fuse multiple estimations.)

Duggal, Shivam, Shenlong Wang, Wei-Chiu Ma, Rui Hu, and Raquel Urtasun. "DeepPruner: Learning Efficient Stereo Matching via Differentiable PatchMatch." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4384-4393. 2019.

C. Liu, J. Gu, K. Kim, S. G. Narasimhan, and J. Kautz, “Neural RGB(r)D Sensing: Depth and Uncertainty From a Video Camera,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2019.

110

111 of 148

  • Uncertainty estimation.
  • Occlusion.
  • Guided and fusion.
  • New cost manipulations.
  • Attention.
  • Adaptive & online learning.
  • Joint-target
    • Segmentation.
    • Surface normal.
    • Edge.

111

Advanced learning

112 of 148

112

Sparse measurement fusion. LiDAR fusion.

1 Uhrig, Jonas, et al. "Sparsity invariant CNNs." 3DV 2017.

2 Park, Kihong, et al. "High-precision depth estimation with the 3D LiDAR and stereo fusion." ICRA 2018.

* Can perform depth completion to get a densified depth from the sparse measurements.

How to deal with extrinsic calibration? How to deal with the sparsity of data? How to deal with noise?

Known calibration.

Unknown calibration.

Interpolation*.

Direct.

Sparsity-invariant CNN [1].

Known calibration + interpolation + RGB guidance. NOT end-to-end; needs disparity as input [2].

Cross comparison (Probability of two sensors both failing is low)

Add assumptions.
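The sparsity-invariant convolution cited as [1] normalizes each output by the number of observed pixels in the window and propagates the validity mask by max pooling. Below is a single-channel NumPy sketch of that idea. The function name, the explicit loops, and the zero padding are simplifications for illustration and not the reference implementation.

```python
import numpy as np

def sparse_conv2d(x, mask, weight, bias=0.0, eps=1e-8):
    """Single-channel sparsity-invariant convolution with zero padding.

    x      : (H, W) sparse input, arbitrary finite values where mask == 0
    mask   : (H, W) 1 for observed pixels, 0 otherwise
    weight : (k, k) kernel with odd k
    Returns the filtered map and the propagated mask.
    """
    k = weight.shape[0]
    pad = k // 2
    xm = np.pad(x * mask, pad)
    mp = np.pad(mask.astype(np.float64), pad)
    h, w = x.shape
    out = np.zeros((h, w))
    new_mask = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            m_win = mp[i:i + k, j:j + k]
            x_win = xm[i:i + k, j:j + k]
            count = m_win.sum()
            # Normalize by the number of observed pixels in the window.
            out[i, j] = (x_win * weight).sum() / (count + eps) + bias
            new_mask[i, j] = float(count > 0)   # max pooling of the mask
    return out, new_mask

values = np.full((3, 3), 5.0)
kernel = np.ones((3, 3))
one_point = np.zeros((3, 3))
one_point[1, 1] = 1
out_sparse, mask_sparse = sparse_conv2d(values, one_point, kernel)
out_dense, mask_dense = sparse_conv2d(values, np.ones((3, 3)), kernel)
```

In this toy case the output is about 5 everywhere, whether one pixel or all pixels are observed. That is the invariance the layer is designed for.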

113 of 148

113

Zhang, Junming, et al. “LiStereo: Generate Dense Depth Maps from LIDAR and Stereo Imagery.” ICRA 2020.

End-to-end. Supervised and unsupervised.

100% LiDAR

10% LiDAR

1% LiDAR

Sparsity-invariant convolutions are NOT better than regular convolution layers.

Another interesting work on enhancing an available model without retraining it.

Wang, Tsun-Hsuan, et al. "Plug-and-play: Improve depth prediction via sparse data propagation." ICRA 2019.

114 of 148

114

LidarStereoNet

Cheng, Xuelian, et al. "Noise-aware unsupervised deep lidar-stereo fusion." CVPR 2019.

  • Try to filter the LiDAR data, removing noisy LiDAR points.
  • Unsupervised + sparsity-invariant CNN.

Retain the sparse LiDAR points (Dscl, Dscr) that are consistent in both stereo matching and LiDAR measurements.
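The retain-if-consistent idea can be illustrated with a simple threshold test. This is a deliberately simplified sketch and not the mechanism of the cited paper: the function name, the 3-pixel threshold, and the convention that 0 means "no LiDAR return" are all assumptions.

```python
import numpy as np

def filter_lidar_by_stereo(lidar_disp, stereo_disp, max_diff=3.0):
    """Keep sparse LiDAR disparities that agree with the stereo estimate.

    lidar_disp  : (H, W) sparse map, 0 where there is no LiDAR return (assumed)
    stereo_disp : (H, W) dense disparity from stereo matching
    Returns the cleaned sparse map and the boolean keep mask.
    """
    has_point = lidar_disp > 0
    consistent = np.abs(lidar_disp - stereo_disp) <= max_diff
    keep = has_point & consistent
    return np.where(keep, lidar_disp, 0.0), keep

lidar = np.array([[0.0, 20.0, 35.0]])
stereo = np.array([[18.0, 21.0, 20.0]])
cleaned, keep = filter_lidar_by_stereo(lidar, stereo)
```

Here the third LiDAR point disagrees with stereo by 15 pixels and is dropped as noise.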

115 of 148

  • Uncertainty estimation.
  • Occlusion.
  • Guided and fusion.
  • New cost manipulations.
  • Attention.
  • Adaptive & online learning.
  • Joint-target
    • Segmentation.
    • Surface normal.
    • Edge.

115

Advanced learning

116 of 148

116

Cost volume

Cost regulation

Feat. Ext.

Multi-scale

Spatial pooling

Classification & Regression

Refinement

Previously, we discussed the common components.

Cost manipulations focus on constructing and regulating the cost representations.

117 of 148

117

Zhang, Feihu, et al. "Ga-net: Guided aggregation net for end-to-end stereo matching." CVPR 2019.

At each pixel, for a disparity channel d, apply the kernel to channels d-1, d, and d+1.
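The "d-1, d, d+1" operation can be sketched on a cost volume as below. This is a strong simplification for illustration: the spatial window is reduced to 1x1, and the per-pixel weights are given as an input, whereas in the cited work they come from a guidance branch. The function name is hypothetical.

```python
import numpy as np

def aggregate_neighbor_disparities(cost, weights):
    """Mix each disparity channel with its two neighbors using per-pixel weights.

    cost    : (D, H, W) cost volume
    weights : (3, H, W) weights for channels d-1, d, and d+1 at each pixel
    The first and last channels reuse themselves as the missing neighbor.
    """
    padded = np.pad(cost, ((1, 1), (0, 0), (0, 0)), mode="edge")
    lower, center, upper = padded[:-2], padded[1:-1], padded[2:]
    return weights[0] * lower + weights[1] * center + weights[2] * upper

cost = np.arange(4, dtype=np.float64).reshape(4, 1, 1)
uniform = np.full((3, 1, 1), 1.0 / 3.0)
smoothed = aggregate_neighbor_disparities(cost, uniform)
```

Because the weights differ per pixel, the amount of smoothing along the disparity axis can adapt to the image content.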

118 of 148

  • Uncertainty estimation.
  • Occlusion.
  • Guided and fusion.
  • New cost manipulations.
  • Attention.
  • Adaptive & online learning.
  • Joint-target
    • Segmentation.
    • Surface normal.
    • Edge.

118

Advanced learning

119 of 148

Attention

Correct the wrong prediction.

Enhance meaningful (cross-modal) information, mask and cover the misleading information.

119

120 of 148

120

Jie, Zequn, et al. "Left-right comparative recurrent model for stereo matching." CVPR 2018. No special loss definitions.

Left-Right Comparative Recurrent (LRCR)

Use the error of the previous step’s LR comparison

soft attention

LR comparison

121 of 148

121

Kim, Sunok, et al. "Laf-net: Locally adaptive fusion networks for stereo confidence estimation." CVPR 2019.

Similar for multi-view stereo: Luo, Keyang, et al. "Attention-Aware Multi-View Stereo." CVPR 2020.

Locally Adaptive Fusion Networks

Multiplication

cost

disparity

color

No special loss definitions.

122 of 148

122

Adaptive sampling [1] or adaptive cost volume size [2].

1 Xu, Haofei, and Juyong Zhang. "AANet: Adaptive Aggregation Network for Efficient Stereo Matching." CVPR 2020.

2 Cheng, Shuo, et al. "Deep stereo using adaptive thin volume representation with uncertainty awareness." CVPR 2020.

123 of 148

  • Uncertainty estimation.
  • Occlusion.
  • Guided and fusion.
  • New cost manipulations.
  • Attention.
  • Adaptive & online learning.
  • Joint-target
    • Segmentation.
    • Surface normal.
    • Edge.

123

Advanced learning

124 of 148

Why?

124

  • Data are available sequentially or the target domain distribution changes continuously.
  • No true disparity/depth for fine-tuning.
  • Fundamental question:
  • Why not just use unsupervised learning on the fly?

125 of 148

Issue: naive training leads to biased updates.

125

Tonioni, Alessio, et al. "Learning to adapt for stereo." CVPR 2019.

Meta-learning.

Loss for base model adaptation

Base model

Updated base model

Intermediate models

126 of 148

126

Layers close to input have more severe domain shift issues.

1 Zhang, Zhenyu, et al. "Online Adaptation through Meta-Learning for Stereo Depth Estimation." arXiv 2019.

2 Zhang, Feihu, et al. "Domain-invariant Stereo Matching Networks." arXiv 2019.

3 Song, Xiao, et al. "AdaStereo: A Simple and Efficient Approach for Adaptive Stereo Matching." arXiv 2020.

Gradually change the BatchNorm layers of the first components [1], or use a new normalization approach [2].

Change the color style of the available training data with ground truth [3]. And manually normalize internal feature layers.

127 of 148

127

Generate sparse supervision from other methods.

New domain

Computed sparse disparity from another method

Confidence map

Trust all

Medium confidence

Trust high confidence only

Tonioni, Alessio, et al. "Unsupervised adaptation for deep stereo." ICCV 2017.
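The three trust levels on this slide correspond to different confidence thresholds on the proxy disparity. A hard-threshold version of such a loss can be sketched as below. The function name and the threshold values are illustrative assumptions, and the loss in the cited paper may contain further terms.

```python
import numpy as np

def confidence_masked_l1(disp_pred, disp_proxy, confidence, threshold=0.9):
    """Mean L1 error against proxy disparities, using only trusted pixels.

    disp_proxy : disparity computed by another (e.g. classical) method
    confidence : per-pixel confidence of the proxy in [0, 1]
    threshold  : 0 trusts every pixel; a high value trusts only the best ones
    """
    trusted = confidence >= threshold
    if not trusted.any():
        return 0.0
    return float(np.abs(disp_pred - disp_proxy)[trusted].mean())

pred = np.array([[10.0, 10.0, 10.0]])
proxy = np.array([[11.0, 14.0, 30.0]])
conf = np.array([[0.95, 0.6, 0.1]])
loss_trust_all = confidence_masked_l1(pred, proxy, conf, threshold=0.0)
loss_medium = confidence_masked_l1(pred, proxy, conf, threshold=0.5)
loss_high_only = confidence_masked_l1(pred, proxy, conf, threshold=0.9)
```

Raising the threshold trades supervision density for supervision quality: fewer pixels contribute, but the outlier proxy value at the low-confidence pixel no longer dominates.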

128 of 148

Multi-task: surface normal, segmentation, edge

Ramamonjisoa, Michaël, and Vincent Lepetit. "Sharpnet: Fast and accurate recovery of occluding contours in monocular depth estimation." In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0-0. 2019.

StereoNet: Guided Hierarchical Refinement for Real-Time Edge-Aware Depth Prediction

128

129 of 148

  • Uncertainty estimation.
  • Occlusion.
  • Guided and fusion.
  • New cost manipulations.
  • Attention.
  • Adaptive & online learning.
  • Joint-target
    • Segmentation.
    • Surface normal.
    • Edge.

129

Advanced learning

130 of 148

Why?

Depth inference only works with knowledge of and assumptions about the real world.

We are looking for cues for reliable stereo matches and reasonable spatial relationships for regulation (loss function).

Why don’t we add more knowledge as strong cues and regulation?

130

Cues:

Segmentation, surface normal, and edge.

131 of 148

131

SegStereo. Key questions:

Yang, Guorun, et al. "Segstereo: Exploiting semantic information for disparity estimation." ECCV 2018.

Form of merging and splitting task-specific structures?

Loss function?

Supervised or unsupervised?

Segmentation is supervised

Disparity can be unsupervised.

132 of 148

132

Without segmentation cues.

With segmentation cues.

133 of 148

133

Wu, Zhenyao, et al. "Semantic stereo matching with pyramid cost volumes." ICCV 2019.

Complex procedures for merging cost volumes.

Special boundary-loss function.

Assumption:

Disparity discontinuities always happen at segmentation boundaries.

Similar to the unsupervised intensity-guided edge-aware smoothness loss.
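For reference, the intensity-guided edge-aware smoothness loss mentioned here is commonly written as the disparity gradient weighted by exp(-|image gradient|). The NumPy sketch below shows that common form. It is not the boundary loss of the cited paper; to move toward it, one would replace the intensity gradient with a segmentation-boundary signal. The function name is an assumption.

```python
import numpy as np

def edge_aware_smoothness(disp, image):
    """Penalize disparity gradients, down-weighted where the image has strong gradients.

    disp  : (H, W) disparity map
    image : (H, W) grayscale intensities
    """
    d_dx = np.abs(disp[:, 1:] - disp[:, :-1])
    d_dy = np.abs(disp[1:, :] - disp[:-1, :])
    i_dx = np.abs(image[:, 1:] - image[:, :-1])
    i_dy = np.abs(image[1:, :] - image[:-1, :])
    return float((d_dx * np.exp(-i_dx)).mean() + (d_dy * np.exp(-i_dy)).mean())

disp = np.array([[0.0, 0.0, 4.0, 4.0], [0.0, 0.0, 4.0, 4.0]])
flat_image = np.zeros((2, 4))
edge_image = np.array([[0.0, 0.0, 10.0, 10.0], [0.0, 0.0, 10.0, 10.0]])
loss_no_edge = edge_aware_smoothness(disp, flat_image)
loss_with_edge = edge_aware_smoothness(disp, edge_image)
```

The same disparity step costs far less when it coincides with an image edge, which is the assumption stated on the slide in its segmentation form.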

134 of 148

134

Kusupati, Uday, et al. "Normal assisted stereo depth estimation." CVPR 2020. Lowest EPE on Scene Flow at the moment.

Joint normal estimation.

For a geometric concept such as the surface normal, an additional loss function can be defined from a geometric constraint: the depth (disparity) of points located on a surface should be consistent with the surface normal. The depth cannot change arbitrarily.

depth gradient based on disparity prediction

depth gradient based on consistency with surface normal
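One way to write the second quantity, under a pinhole camera model and a locally planar surface, is to derive the depth gradient implied by the normal. With P = Z * ((u-cx)/fx, (v-cy)/fy, 1) and n . P = const, differentiation gives dZ/du = -Z * n_x / (fx * (n . r)), where r is the ray through the pixel. The sketch below implements this and checks it on a synthetic plane. It is a possible formalization and not necessarily the exact loss of the cited paper; the intrinsics and all names are hypothetical.

```python
import numpy as np

def depth_gradient_from_normal(depth, normal, fx, fy, cx, cy):
    """Depth gradient (dZ/du, dZ/dv) implied by a per-pixel surface normal.

    depth  : (H, W) depth Z
    normal : (3, H, W) surface normal in camera coordinates
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w].astype(np.float64)
    ray_x = (u - cx) / fx
    ray_y = (v - cy) / fy
    n_dot_ray = normal[0] * ray_x + normal[1] * ray_y + normal[2]
    dz_du = -depth * normal[0] / (fx * n_dot_ray)
    dz_dv = -depth * normal[1] / (fy * n_dot_ray)
    return dz_du, dz_dv

# Synthetic tilted plane that passes through depth 5 on the optical axis.
fx = fy = 500.0
cx, cy = 15.0, 10.0
h, w = 20, 30
n = np.array([0.2, -0.1, 1.0])
n /= np.linalg.norm(n)
v, u = np.mgrid[0:h, 0:w].astype(np.float64)
n_dot_ray = n[0] * (u - cx) / fx + n[1] * (v - cy) / fy + n[2]
plane_depth = 5.0 * n[2] / n_dot_ray
normal_map = np.broadcast_to(n[:, None, None], (3, h, w))
dz_du, dz_dv = depth_gradient_from_normal(plane_depth, normal_map, fx, fy, cx, cy)
```

A consistency loss could then compare these values with finite differences of the predicted depth, for example by an L1 difference.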

135 of 148

135

Joint edge prediction.

Song, Xiao, et al. "Edgestereo: An effective multi-task learning network for stereo matching and edge detection." International Journal of Computer Vision (2020): 1-21.

  • Direct edge detection and supervision (trained separately).
  • Edge-aware smoothness loss (uses detected edges, not intensity gradients).

Concatenate the features.

Smoothness loss uses detected edges.

Edge detection trained by supervised learning on special datasets.

136 of 148

Outline

136

  • Common components.
  • 3D cost volume: direct construction, difference & construct.
  • Cross-correlation: 2D optical flow or 1D disparity.
  • Unsupervised training: SSIM, edge/gradient aware smoothness.
  • Real-time.

Stereo vision 101

Recent non-learning methods

  • Basic principles.
  • Stereo calibration.

Recent learning methods

Datasets & benchmarks

Advanced learning

Uncertainty. Occlusion. Guided. Cost. Adaptive & online learning. Multi-task.

Related CV tasks

  • SPS-Stereo
  • Guided

137 of 148

137

Related CV tasks

Mono. depth + cam. pose

Optical flow + cam. pose

Mono. depth

Optical flow

Domain translation

(GAN or adversarial)

Cross-spectrum

Multi-spectrum

e.g. thermal

Depth completion

Multi-view stereo

Depth super resolution

Scene reconstruction

Dense map fusion

Active sensing

e.g. Realsense

Photometric stereo

More sensor fusion

e.g. ToF

Autonomy

Perception

Reconstruction

138 of 148

Summary

138

  • Common components
  • 3D cost volume
  • Cross-correlation
  • Unsupervised training
  • Real-time.

Stereo vision 101

Recent non-learning methods

  • Basic principles.
  • Stereo calibration.

Recent learning methods

Datasets & benchmarks

Advanced learning

Uncertainty. Occlusion. Guided. Cost. Adaptive & online learning. Multi-task.

Related CV tasks

  • SPS-Stereo
  • Guided

139 of 148

Hands on

learning-based methods

Let’s rock ’n’ roll!

139

140 of 148

140

141 of 148

Correlation model has limited disparity range.
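The range limit follows from how a 1D correlation volume is built: it has one channel per candidate disparity, so anything beyond the largest candidate cannot be represented. The sketch below is a generic NumPy illustration, not the layer of any specific model. The function name is hypothetical, and `max_disp` is assumed to be smaller than the image width.

```python
import numpy as np

def correlation_1d(feat_left, feat_right, max_disp):
    """Horizontal correlation volume of shape (max_disp + 1, H, W).

    feat_left, feat_right : (C, H, W) feature maps
    Left pixel x is compared with right pixel x - d for d = 0..max_disp.
    """
    c, h, w = feat_left.shape
    volume = np.zeros((max_disp + 1, h, w))
    for d in range(max_disp + 1):
        volume[d, :, d:] = (feat_left[:, :, d:] * feat_right[:, :, :w - d]).mean(axis=0)
    return volume

# One-hot features: each column has a unique signature; the true disparity is 3.
width = 12
feat_l = np.eye(width)[:, None, :]            # (C=12, H=1, W=12)
feat_r = np.roll(feat_l, -3, axis=2)
best_with_range_4 = correlation_1d(feat_l, feat_r, max_disp=4).argmax(axis=0)
peak_with_range_2 = correlation_1d(feat_l, feat_r, max_disp=2).max()
```

With `max_disp=4` the peak sits at disparity 3, while with `max_disp=2` the volume contains no response at all for this toy input.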

141

142 of 148

142

BACKUP SLIDES

143 of 148

Cross-spectrum

143

Liang, Mingyang, et al. "Unsupervised Cross-Spectral Stereo Matching by Learning to Synthesize." AAAI 2019.

144 of 148

144

Common Python package

Data augmentation

If resizing, remember to scale the true disparity data and the disparity prediction.
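The reason is that disparity is a horizontal pixel offset, so its values must be multiplied by the width ratio whenever the image is resized. A nearest-neighbor sketch in plain NumPy is shown below. The function name is hypothetical, and a real pipeline would more likely use an image library's resize function.

```python
import numpy as np

def resize_with_disparity(image, disparity, new_h, new_w):
    """Nearest-neighbor resize of an image and its disparity map.

    image     : (H, W) or (H, W, C)
    disparity : (H, W), in pixels of the original resolution
    """
    h, w = disparity.shape
    rows = np.minimum((np.arange(new_h) * h / new_h).astype(np.int64), h - 1)
    cols = np.minimum((np.arange(new_w) * w / new_w).astype(np.int64), w - 1)
    image_r = image[rows][:, cols]
    # Disparity is a horizontal offset, so it scales with the width ratio only.
    disparity_r = disparity[rows][:, cols] * (new_w / w)
    return image_r, disparity_r

image = np.zeros((4, 6, 3))
disparity = np.full((4, 6), 8.0)
image_half, disparity_half = resize_with_disparity(image, disparity, 2, 3)
```

Halving the width turns a disparity of 8 pixels into 4 pixels. The same factor applies to a prediction made at the reduced size before it is compared with full-resolution ground truth.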

Install libpng++.

Install Eigen3, latest version.

Install cnpy.

git clone the ROS code and rename to src. Create new catkin workspace.

145 of 148

Hands on

145

Docker

Images yaoyuh/cuda_ros_ocv4

python2, ROS kinetic basic, cuda9.2, OpenCV4.1.1 compiled with cuda basic.

cmake 3.14.6.

A virtualenv with Python 3.5.2, PyTorch 1.2, torchvision 0.4.0, OpenCV 4.1.1, NumPy 1.17.2.

Command

nvidia-docker run -it --rm -v /data/datasets/:/data/datasets/ yaoyuh/cuda_ros_ocv4 /bin/bash

User:

yaoyuh:frc_member

146 of 148

stereo_sparse_depth_fusion

146

Sparse mask

SGM

Fuse naive

Fuse diffusion

Fuse neigh support

args: input directory.

147 of 148

SPS-Stereo

147

Run LocalRun.sh in src directory.

Disparity

Segmentation

occlusion (red/blue)

hinge (green)

coplanar (gray)

148 of 148

The severe issue of the fronto-parallel constraint/assumption.

148