1 of 98

Oct 31

2 of 98

Lecturer

Yue Zeng

3 of 98

Background

Qualitative vs. Quantitative Shape Recovery

4 of 98

Background

Geometric Approaches

5 of 98

Background

Origami World

6 of 98

Background

19th-century empiricist Hermann von Helmholtz's

Theory of Unconscious Inference:

“Our perception of the scene is based not only on the immediate sensory evidence, but on our long history of visual experiences and interactions with the world.”

Koenderink and colleagues found that humans have:

  • Limited depth perception from 2D images
  • Accurate perception of local surface orientation

7 of 98

Background

Intrinsic Images

Inferring scene layout

knowledge-based interpretation of outdoor natural scenes

8 of 98

Motivation

Our methodology:

aligns with Helmholtz’s philosophy of intuition and empiricism:

  • learning surface models from “experience” with a training set & drawing on diverse visual cues

corresponds to Gibson's notion of basic surface types:

  • classifying an image into support, vertical, and sky regions, though differing from his belief in the primacy of gradient-based methods.

Our surface layout is also philosophically similar to Marr's 2.5D sketch.

9 of 98

Key Challenges

1. Outdoor scenes often lack easily analyzable structured features, such as consistent vanishing lines, which complicates the estimation of 3D orientation.

Solution: generate multiple segmentations of the image and use a probabilistic approach to label regions.

2. Segmenting images into meaningful regions consistent with the 3D structure of the scene is hard, because existing segmentation algorithms may not produce regions that correspond to entire surfaces.

Solution: combine various cues, such as color, texture, perspective, and location, to improve the confidence in geometric labeling.

10 of 98

Problem Setting

1) We use statistical learning.

2) We are interested in a rough sense of the scene surfaces, not exact orientations.

3) Our surface layout complements the original image data rather than replacing it.

11 of 98

Geometric Classes

Goal: label an image of an outdoor scene into coarse geometric classes

300 outdoor images collected using Google image search:

  • Nearly all pixels belong to horizontal surfaces, nearly vertical surfaces, or the sky.
  • The camera axis is roughly aligned with the ground plane

12 of 98

Geometric Classes

Main Classes: “support”, “vertical”, and “sky”

  • Support surfaces are roughly parallel to the ground and could potentially support a solid object (road surfaces, lawns, dirt paths, lakes, and table tops).
  • Vertical surfaces are solid surfaces that are too steep to support an object (walls, cliffs, curb sides, people, trees, or cows).
  • The sky is simply the image region corresponding to the open air and clouds.

Subclasses: “planar surfaces” vs. “non-planar surfaces”

Planar surfaces face the “left”, “center”, or “right” of the viewer.

Non-planar surfaces are either “porous” or “solid”.

13 of 98

Cues for Labeling Surfaces

Location: likelihood of each geometric class given the x-y position

Color: likelihoods for each of the geometric main classes and subclasses given hue or saturation
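As a rough illustration (not the authors' code), the location cue can be thought of as a normalized histogram of class labels over image positions; a minimal numpy sketch, assuming per-pixel ground-truth masks and hypothetical helper names:

```python
import numpy as np

# Hypothetical sketch: estimate P(class | normalized x, y) from training label masks.
# `label_masks` is assumed to be a list of HxW integer arrays with values
# 0 = support, 1 = vertical, 2 = sky (names follow the paper's main classes).

N_CLASSES = 3
GRID = 8  # coarse spatial grid over normalized image coordinates

def location_prior(label_masks, n_classes=N_CLASSES, grid=GRID):
    counts = np.zeros((grid, grid, n_classes))
    for mask in label_masks:
        h, w = mask.shape
        ys, xs = np.mgrid[0:h, 0:w]
        # map pixel coordinates to grid cells in normalized image coordinates
        gy = (ys * grid // h).ravel()
        gx = (xs * grid // w).ravel()
        np.add.at(counts, (gy, gx, mask.ravel()), 1)
    # normalize per cell to get a likelihood of each class given location
    return counts / counts.sum(axis=2, keepdims=True).clip(min=1)

# Usage (with random stand-in data):
masks = [np.random.randint(0, N_CLASSES, (60, 80)) for _ in range(5)]
prior = location_prior(masks)   # shape (8, 8, 3)
print(prior[0, 0])              # class likelihoods for the top-left cell
```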

14 of 98

Cues for Labeling Surfaces

Texture (apply a subset of the filter bank designed by Leung and Malik)

15 filters: 6 edge, 6 bar, 1 Gaussian, and 2 Laplacian of Gaussian, with 19x19 pixel support, a single scale for the oriented and blob filters, and 6 orientations.
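A hedged sketch of such a filter bank (the sigma value and exact filter shapes are my assumptions, not the paper's parameters): oriented Gaussian-derivative "edge" and "bar" filters at 6 orientations, one Gaussian, and two Laplacian-of-Gaussian filters, applied by convolution:

```python
import numpy as np
from scipy.signal import fftconvolve

SIZE, SIGMA = 19, 2.0  # 19x19 support; sigma is an assumed value, not the paper's

def _grid(size):
    r = (size - 1) / 2
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    return y, x

def oriented_filter(theta, order, size=SIZE, sigma=SIGMA):
    """First-derivative ('edge', order=1) or second-derivative ('bar', order=2)
    of an elongated Gaussian, rotated to angle theta."""
    y, x = _grid(size)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    xr = x * np.cos(theta) + y * np.sin(theta)
    g = np.exp(-(xr**2 / (2 * (3 * sigma) ** 2) + yr**2 / (2 * sigma**2)))
    d = -yr / sigma**2 if order == 1 else (yr**2 / sigma**4 - 1 / sigma**2)
    f = d * g
    return f - f.mean()  # zero-mean so flat regions give no response

def log_filter(size=SIZE, sigma=SIGMA):
    y, x = _grid(size)
    r2 = x**2 + y**2
    f = (r2 - 2 * sigma**2) / sigma**4 * np.exp(-r2 / (2 * sigma**2))
    return f - f.mean()

def gaussian_filter2d(size=SIZE, sigma=SIGMA):
    y, x = _grid(size)
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return g / g.sum()

thetas = np.arange(6) * np.pi / 6
bank = ([oriented_filter(t, 1) for t in thetas] +    # 6 "edge" filters
        [oriented_filter(t, 2) for t in thetas] +    # 6 "bar" filters
        [gaussian_filter2d()] +                      # 1 Gaussian
        [log_filter(sigma=SIGMA), log_filter(sigma=2 * SIGMA)])  # 2 LoG

def texture_responses(gray, filters=bank):
    """Stack of absolute filter responses, one channel per filter."""
    return np.stack([np.abs(fftconvolve(gray, f, mode='same')) for f in filters], axis=-1)

responses = texture_responses(np.random.rand(120, 160))  # shape (120, 160, 15)
```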

15 of 98

Cues for Labeling Surfaces

Perspective: a “soft” estimate and an explicit estimate of vanishing points

Analyze features like lines, intersections, vanishing points, texture gradients, and horizon position to infer the 3D orientation and spatial relationships of planes in the scene.

Get preliminary information about vanishing points to infer plane directions

  • Compute statistics of lines and intersection points by finding long, straight edges in the image.

Determine which planes are likely to be vertical or horizontal

  • Compute vanishing points across the entire image using the EM approach.
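A rough sketch of the first two steps, assuming OpenCV's Canny and probabilistic Hough transform as stand-ins for the paper's line detector (thresholds are arbitrary): find long straight segments, then treat pairwise intersections in homogeneous coordinates as candidate vanishing points:

```python
import cv2
import numpy as np

def long_line_segments(gray, min_len=40):
    """Detect long, straight edges with Canny + probabilistic Hough."""
    edges = cv2.Canny(gray, 50, 150)
    lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=50,
                            minLineLength=min_len, maxLineGap=5)
    return [] if lines is None else [l[0] for l in lines]  # (x1, y1, x2, y2)

def intersections(segments):
    """Pairwise intersections in homogeneous coordinates: candidate vanishing points."""
    # represent each segment by the homogeneous line through its endpoints
    hlines = [np.cross([x1, y1, 1.0], [x2, y2, 1.0]) for x1, y1, x2, y2 in segments]
    pts = []
    for i in range(len(hlines)):
        for j in range(i + 1, len(hlines)):
            p = np.cross(hlines[i], hlines[j])
            if abs(p[2]) > 1e-6:          # skip (near-)parallel pairs
                pts.append(p[:2] / p[2])
    return np.array(pts)

gray = np.random.randint(0, 255, (240, 320), dtype=np.uint8)  # stand-in image
segs = long_line_segments(gray)
cands = intersections(segs)  # statistics of these points feed the "soft" perspective cues
```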

16 of 98

Cues for Labeling Surfaces

Determine the orientation of the planes in the area

  • Compute statistics related to vanishing points for each segment, including the number of pixels contributing to vertical or horizontal vanishing points, and use these to infer surface orientation.

Provide more clues to the surface direction

  • Calculate texture gradient to provide orientation cues for natural surfaces, using the difference between the segment's center of mass and the center of gradient magnitude to represent texture information.
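A small sketch of that texture-gradient statistic (helper names and normalization are mine): compare a segment's center of mass with its gradient-magnitude-weighted center:

```python
import numpy as np

def texture_gradient_cue(gray, segment_mask):
    """Offset between a segment's center of mass and its gradient-magnitude-weighted
    center; a nonzero offset hints at foreshortening on natural surfaces."""
    gy, gx = np.gradient(gray.astype(float))
    mag = np.hypot(gx, gy)
    ys, xs = np.nonzero(segment_mask)
    center_of_mass = np.array([xs.mean(), ys.mean()])
    w = mag[ys, xs]
    center_of_gradient = np.array([np.average(xs, weights=w + 1e-8),
                                   np.average(ys, weights=w + 1e-8)])
    h, wdt = gray.shape
    return (center_of_gradient - center_of_mass) / np.array([wdt, h])  # normalized offset

gray = np.random.rand(120, 160)
mask = np.zeros_like(gray, dtype=bool); mask[40:100, 20:120] = True
print(texture_gradient_cue(gray, mask))
```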

Provide more accurate directional clues

  • Estimate the horizon position from vanishing points near the image center (within 75% of image height). Average y-positions of multiple points; use horizon-relative segment coordinates (Features L3-L4) over absolute ones.
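A hedged sketch of the horizon step: average the y-positions of vanishing-point candidates falling in a central vertical band (the 75% window follows the slide; the rest is an assumed simplification), then express segment positions relative to that horizon:

```python
import numpy as np

def estimate_horizon(vp_candidates, image_h):
    """Average the y-coordinates of vanishing points lying within a central
    vertical band (here, the middle 75% of the image height)."""
    ys = vp_candidates[:, 1]
    band = (ys > 0.125 * image_h) & (ys < 0.875 * image_h)
    return float(ys[band].mean()) if band.any() else 0.5 * image_h  # fall back to mid-height

def horizon_relative_y(segment_centroid_y, horizon_y, image_h):
    """Segment y-position relative to the horizon, normalized by image height
    (in the spirit of the horizon-relative location features)."""
    return (segment_centroid_y - horizon_y) / image_h

vps = np.array([[150.0, 110.0], [900.0, 130.0], [200.0, -400.0]])  # (x, y) candidates
h0 = estimate_horizon(vps, image_h=240)
print(h0, horizon_relative_y(segment_centroid_y=180.0, horizon_y=h0, image_h=240))
```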

17 of 98

Surface layout estimation algorithm

Superpixels: small, nearly-uniform regions in the image

Pros:

  • group large homogeneous regions of the image together
  • divide heterogeneous regions into many smaller superpixels.

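The superpixels themselves can be obtained with any graph-based oversegmentation; a sketch using scikit-image's Felzenszwalb segmentation as a stand-in (parameter values are arbitrary):

```python
import numpy as np
from skimage.segmentation import felzenszwalb

def superpixels(image_rgb, scale=100, sigma=0.8, min_size=50):
    """Oversegment into small, nearly-uniform regions. Returns an integer label
    map where each id is one superpixel."""
    return felzenszwalb(image_rgb, scale=scale, sigma=sigma, min_size=min_size)

img = np.random.rand(120, 160, 3)
sp = superpixels(img)
print(sp.max() + 1, "superpixels")
```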

18 of 98

Surface layout estimation algorithm

Combines estimates from all of the segmentations

Need larger regions to use the more complex cues!

How can we find such regions?

Pros:

a task-based, efficient, and empirically generated sampling of segmentations

  • sample a small number of segmentations to represent the entire distribution
  • compute the segmentations using a simple method that groups superpixels into larger continuous segments

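A minimal sketch of the combination step as I read it (array layout and names are mine): each superpixel's label confidence is a homogeneity-weighted average of the label estimates of the segments containing it, across all sampled segmentations:

```python
import numpy as np

def combine_segmentations(seg_ids, p_homog, p_label):
    """
    seg_ids : list over segmentations; seg_ids[j][i] = id of the segment that
              contains superpixel i in segmentation j
    p_homog : list over segmentations; p_homog[j][s] = estimated probability that
              segment s is homogeneous (all one label)
    p_label : list over segmentations; p_label[j][s] = array of class likelihoods
              for segment s
    Returns an (n_superpixels, n_classes) array of combined label confidences.
    """
    n_sp = len(seg_ids[0])
    n_classes = p_label[0].shape[1]
    conf = np.zeros((n_sp, n_classes))
    weight = np.zeros(n_sp)
    for ids, ph, pl in zip(seg_ids, p_homog, p_label):
        conf += ph[ids, None] * pl[ids]   # weight each segment's label estimate
        weight += ph[ids]                 # by its homogeneity likelihood
    return conf / weight[:, None].clip(min=1e-8)

# Toy usage: 4 superpixels, 2 segmentations, 3 classes
seg_ids = [np.array([0, 0, 1, 1]), np.array([0, 1, 1, 2])]
p_homog = [np.array([0.9, 0.6]), np.array([0.8, 0.5, 0.7])]
p_label = [np.random.dirichlet(np.ones(3), size=2), np.random.dirichlet(np.ones(3), size=3)]
print(combine_segmentations(seg_ids, p_homog, p_label))
```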

19 of 98

Classifier

We use boosted decision trees for each classifier, trained with the logistic regression version of AdaBoost.
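As a rough stand-in for the authors' boosted decision trees (this is scikit-learn's gradient boosting with a logistic loss over shallow trees, not their logistic-AdaBoost implementation), with made-up feature vectors:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in data: one feature vector per segment (color, texture, location,
# and perspective cues concatenated), with main-class labels 0/1/2.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))
y = rng.integers(0, 3, size=500)

# Shallow trees + logistic loss, loosely mirroring "boosted decision trees
# with the logistic regression version of AdaBoost" (not the authors' code).
clf = GradientBoostingClassifier(n_estimators=20, max_depth=3, learning_rate=0.3)
clf.fit(X[:400], y[:400])
print("held-out accuracy:", clf.score(X[400:], y[400:]))
print("class confidences for one segment:", clf.predict_proba(X[400:401])[0])
```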

20 of 98

Experimental results

Pros: This algorithm is not highly sensitive to the number of segmentations and classification parameters.

21 of 98

Experimental results

Also easily extends to indoor images!

two experiments:

  1. measuring the accuracy when all classifiers are trained on outdoor images
  2. measuring the accuracy when the homogeneity and label classifiers are re-trained on indoor images with five-fold cross-validation.

Average classification accuracy on indoor images:

                        main classes    subclasses
  Before re-training    76.8%           44.9%
  After re-training     93.0%           76.3%

22 of 98

Results from multiple segmentations method. This figure displays an evenly-spaced sample of the best two-thirds of all of our results, sorted by main class accuracy from highest (upper-left) to lowest (lower-right).

23 of 98

Results from multiple segmentations method. This figure displays an evenly-spaced sample of the worst third of all of our results, sorted by main class accuracy from highest (upper-left) to lowest (lower-right).

24 of 98

Ablation

Explore two alternative frameworks for recovering surface layout and compare them with the multiple segmentations method:

  • a simple random field framework in which unary and pairwise potentials are defined over superpixels (main class: 86.2%, subclass: 53.5%)

  • a simulated annealing approach that searches for the most likely segmentation and labeling (main class: 85.9%, subclass: 61.6%)

  • the multiple segmentations method (main class: 88.1%, subclass: 61.5%)

25 of 98

Applications

automatic 3D reconstruction based on surface layout

object detection

26 of 98

Applications

navigation application

27 of 98

Future improvement

  1. the surface layout estimation could benefit from additional image cues, more accurate segmentations, and models of label relationships.
  2. a more complete notion of surface layout is required.
  3. we need to use our information about the surfaces and space of the scene in conjunction with other types of scene information.

28 of 98

Archeologist 1:

Past/Concurrent Monocular Geometry Methods

Christopher Conway

29 of 98

Overview of Monocular Geometry

  • Monocular geometry problems have been well researched over many years
  • Classic works such as Kanade’s Recovery of 3D Shape use various geometry assumptions and image shading or texture information
  • More modern works (covered by Archaeologist 2) have shifted to utilize deep learning methods
  • Derek Hoiem contributed a number of related works during his PhD
  • This section will summarize the key intuition behind a few of these works

30 of 98

Recovery of 3D Shape of an Object from Single View

One of the classic works in monocular geometry is by Takeo Kanade in 1981

Method consists of two parts:

  1. Origami theory: model the world as a collection of plane surfaces, recovering shapes qualitatively
  2. Map image regularities into shape constraints for probable shape recovery

31 of 98

Automatic Photo Pop-up

  • This 2005 paper by Derek Hoiem recovers 3D models as “texture-mapped planar billboards”
  • Geometric classes of “ground,” “vertical,” and “sky” are labeled from constellations of superpixels
  • Regions are “cut and folded” into the pop-up model

32 of 98

Putting Objects in Perspective

  • In 2006, Derek Hoiem focused on the interplay between objects and scene geometry, putting objects in perspective through their scale and image location
  • The method applies an object detector, filters candidates using surface orientation estimates, estimates the viewpoint, and uses this information to find objects (e.g., pedestrians)

33 of 98

Make3D: Learning 3D Scene Structure from a Single Still Image

  • A concurrent work is Make3D by Ashutosh Saxena et al. in 2008
  • It uses a Markov Random Field (MRF) to infer plane parameters in image patches, assuming the environment is made of many small planes
  • MRF is trained to model depth cues and relationships in the image

34 of 98

Closing the Loop on Scene Interpretation

  • In 2008, Derek Hoiem published a combined framework integrating estimates of surface orientation, occlusion boundaries, objects, viewpoint, and relative depth
  • The training method gets multiple segmentations, estimates the horizon, and performs local object detection

35 of 98

Closing the Loop on Scene Interpretation

  • The combined framework provides better results than previous work such as the 2005 Photo Pop-up paper

36 of 98

Archeologist 2:

Subsequent and Recent works

Yufeng Liu

37 of 98

38 of 98

Geometry constraints

Pipeline: line segment detection → vanishing points → box transform → iteratively optimize surface assignment and box transform

Open questions: How to predict the box? How to assign objects to box surfaces?

39 of 98

Geometry constraints

Pipeline: line segment detection → vanishing points → box transform → iteratively optimize surface assignment and box transform

Structured learning: { line-to-vanishing-point membership } -> box

“Recovering Surface Layout from an Image”

40 of 98

41 of 98

Pipeline: line segment detection → principal directions → rotate

Open questions: How to get segments? How to label segments? How to estimate planes? How to predict relations?

42 of 98

Pipeline: line segment detection → principal directions → rotate

Components: dense graph cut with alpha expansion; integer programming for the MAP scene configuration; Derek's work; hierarchical merging of regions

43 of 98

Contribution:

  1. A teacher-student model for cheap, large-scale monocular depth estimation (MDE) training on unlabeled data
  2. Proposes to inherit rich semantic priors from pretrained encoders

44 of 98

  • { Lines, Texture, Shape, etc. } -> encoder
  • { Surfaces, Normals } -> latent space
  • Structured Learning -> ViT attention
  • { Integer Programming } -> decoder

45 of 98

Vlas Zyrianov

Private Investigator

46 of 98

  • EECS Professor at Berkeley; PhD from UC Berkeley
  • Dean of CMU School of CS; PhD from University of Paris
  • CS Professor at UIUC; PhD from CMU

At the time: Professors at CMU and Prof. Hoiem’s PhD Co-Advisors

47 of 98

What inspired the work?

At the time, Prof. Hoiem was taking a CV class which had homework on implementing convolutional filters. During this time, he implemented a proof-of-concept convolution-based texture feature extractor. The approach successfully segmented ground vs. vertical pixels in an “image of a dirt pile.”

48 of 98

What inspired the work?


Insight: Local features can be a powerful tool for many downstream applications.

49 of 98

What inspired the work?


This insight was used to develop Automatic Photo Pop-up

Automatic Photo Pop-Up, SIGGRAPH’05

50 of 98

What inspired the work?


Automatic Photo Pop-Up, SIGGRAPH’05

Recovering Surface Layout from an Image, IJCV’07

51 of 98

What inspired the work?

Automatic Photo Pop-Up, SIGGRAPH’05

Recovering Surface Layout from an Image, IJCV’07

52 of 98

What inspired the work?

Similar theme: what local features can be extracted from images, and what applications can they have?

  • Photo Pop-up
  • Semantic Image Retrieval
  • Navigation

53 of 98

Industrial Practitioner

Zixuan

54 of 98

SkyPath AI: Next-Generation Pure Vision-based Drone Navigation

Reliable Navigation in Cluttered Urban and Indoor Spaces

55 of 98

SkyPath AI: Goal

Food, grocery and medicine delivered

  • to your table top

56 of 98

SkyPath AI: Goal

Food, grocery and medicine delivered

  • to your table top
  • in 15 min

57 of 98

SkyPath AI: Goal

Food, grocery and medicine delivered

  • to your table top
  • in 15 min
  • $0 tip

58 of 98

Existing Products – Large Market Size!

59 of 98

Limitations of Existing Products

Why aren’t we using it already?

  • High Cost: Lidar on the drone is expensive
  • Small Delivery Range: Short battery life due to too much navigation equipment
  • Far Delivery Spot: Deliveries only reach receiving shelves, which can be far from the customer

60 of 98

Limitations of Existing Products

SkyPath AI addresses all these limitations via a pure vision navigation algorithm!

  • Low Cost: Only a camera and GPS are needed
  • Large Delivery Range: Long battery life with minimal navigation equipment
  • Accurate Delivery Spot: Navigates in cluttered environments and delivers to your table top!

61 of 98

How does SkyPath AI work?

SkyPath AI builds a visual representation of the surrounding surfaces and estimates surface orientation for navigation.

Visual Input → Surface Estimation → Path Planning (with SLAM in 3D)

62 of 98

Market Projection – Huge Potential

63 of 98

Potential Impacts

Positive impact:

  • Enables affordable drone navigation for small businesses.
  • Short delivery times and low prices make people more willing to order delivery, helping the food industry and stimulating the economy.

Negative impact:

  • Drones freely capturing the surrounding environment will raise privacy concerns.
  • Accidentally dropping a delivery might hurt pedestrians or cause car accidents.

64 of 98

Critic

65 of 98

Coarse Geometric Classes

The authors’ approach classifies image pixels into coarse geometric classes like ground, vertical surfaces, and sky. While this makes the task computationally feasible, it oversimplifies real-world scenarios. For instance, complex objects like stairs, ramps, or transparent surfaces (glass walls) may defy easy classification. Could the system be too rigid, missing out on important subtle surface transitions that are critical for tasks like navigation or detailed object recognition?

66 of 98

Applications and Real-World Use Cases

The paper claims potential applications in navigation, object recognition, and scene understanding. However, there is a gap in discussing how well this method integrates with real-world systems. How does it perform in dynamic environments like autonomous vehicles where scenes change rapidly? What are the system’s time performance constraints, and is it fast enough for real-time processing? These practical considerations are crucial but are not thoroughly discussed in the paper.

67 of 98

Evaluation

The paper presents results primarily on outdoor images, but the evaluation set appears somewhat constrained. There is no mention of tests under challenging scenarios (e.g., nighttime or complex urban landscapes). This narrow evaluation potentially limits the scope of the findings. Could the model break down under these more difficult conditions, and if so, how can it be improved?

68 of 98

Graduate Student

Jiahua Dong

69 of 98

Single-image geometry gives clues about 3D geometry

  • Efficiency & Generalizability
    • Single-image geometry still benefits from more data
    • Easy to deploy on small devices, with lower price and computational cost

DUSt3R (3 views)

Marigold (single image)

DUSt3R (1 view)

70 of 98

Single-image geometry gives clues about 3D geometry

  • Consistent image & video editing

ControlNet

ControlVideo

71 of 98

Naive extension: layout-guided generation

The benefits of “Recovering Surface Layout from an Image”

  • Efficient
  • Tends to be a robust layout representation (high performance)

It could serve as another representation to “condition on” for diffusion models.

Diffusion

72 of 98

Idea 1 Grounded camera pose correction & 3D reconstruction

  • The camera pose is often unknown for a single image
  • Grounding information is important for 3D reconstruction

Without grounding information, the point cloud is distorted

73 of 98

Idea 1 Grounded camera pose correction & 3D reconstruction

Our method:

  • By leveraging a depth estimator and the “surface layout”, we can give a better camera assumption for monocular 3D point cloud reconstruction

Surface layout

74 of 98

Idea 1 Grounded camera pose correction & 3D reconstruction

Our method:

  • By leveraging a depth estimator and the “surface layout”, we can give a better camera assumption for monocular 3D point cloud reconstruction

Application:

  • Aligning the domain shift of 3D perception models
    • Training scenes like ScanNet often carry a Z≈0 assumption
  • Better single-view point cloud reconstruction

75 of 98

Idea 2 Single-image 3D reconstruction with 3DGS

  • Current works try to benefit from sparse-view information for reconstruction
    • Sensitive to domain shift and camera pose shift
    • Need multiple views

PixelSplat

MVSplat

76 of 98

Idea 2 Single-image 3D reconstruction with 3DGS

  • Current works try to benefit from sparse-view information for reconstruction
    • Sensitive to domain shift and camera pose shift
    • Need multiple views
  • Monocular methods have very strong domain assumptions

Triplane Meets Gaussian Splatting

77 of 98

Idea 2 Single-image 3D reconstruction with 3DGS

  • We leverage single-image geometry for single-view 3DGS generation
    • Local plane & semantic information from surface layout
    • Relative depth from Marigold for geometry scale
  • Supervision
    • Nearby views for novel view rendering
  • Output
    • 3DGS

78 of 98

Idea 2 Single-image 3D reconstruction with 3DGS

  • We leverage single-image geometry for single-view 3DGS generation
    • Local plane & semantic information from surface layout
    • Relative depth from Marigold for geometry scale
  • Supervision
    • Nearby views for novel view rendering
  • Output
    • 3DGS

  • Dataset
    • First train a 3DGS on each scene of ScanNet
    • Generate nearby novel views for training our method

79 of 98

Hacker 1

Rachel Moan

80 of 98

Putting Objects in Perspective

From “Putting Objects in Perspective”:

h_i → pixel height of the object

y_c → camera height

v_0 → horizon position

v_i → bottom position of the object

Goal: draw possible standing locations of people in an image

81 of 98

Extract depth maps and surface normals

Get depth maps and surface normals from GeoWizard

82 of 98

Estimate horizon line

  • Find the ground points
  • Fit a plane using RANSAC
  • Find the intersection of that plane with the image
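A hedged numpy-only sketch of this recipe: fit a plane to candidate ground points with a basic RANSAC loop, then read off an approximate horizon row as the image row where the viewing ray becomes parallel to the plane (ground-point selection, pinhole model, thresholds, and intrinsics are my assumptions):

```python
import numpy as np

def fit_plane_ransac(points, n_iters=200, thresh=0.02, rng=np.random.default_rng(0)):
    """Fit a plane n.x + d = 0 to 3D points with a basic RANSAC loop."""
    best_inliers, best = 0, None
    for _ in range(n_iters):
        p = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p[1] - p[0], p[2] - p[0])
        if np.linalg.norm(n) < 1e-9:
            continue
        n = n / np.linalg.norm(n)
        d = -n @ p[0]
        inliers = np.abs(points @ n + d) < thresh
        if inliers.sum() > best_inliers:
            best_inliers, best = inliers.sum(), (n, d)
    return best

def horizon_row(n, fy, cy):
    """Row where a ray through the image centerline becomes parallel to the plane:
    solve n . K^{-1}(cx, v, 1) = 0 for v (pinhole camera, plane normal in camera frame)."""
    nx, ny, nz = n
    return cy - fy * nz / ny

# Toy usage: synthetic ground points on y = 1.5 (camera looking forward, +y down)
rng = np.random.default_rng(1)
ground = np.column_stack([rng.uniform(-2, 2, 300),
                          1.5 + 0.005 * rng.normal(size=300),
                          rng.uniform(2, 10, 300)])
n, d = fit_plane_ransac(ground)
print("plane normal:", n, " horizon row ~", horizon_row(n, fy=500.0, cy=120.0))
```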

83 of 98

Draw people at random ground plane locations

84 of 98

85 of 98

Inserting objects

+

86 of 98

Inserting objects

Segment the elephant and get its mask using YOLOv8

87 of 98

Inserting objects

Choose some pixel location for the bottom of the elephant

Set the elephant's world height

Compute the elephant’s pixel height
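The pixel height follows from the ground-plane relation used in “Putting Objects in Perspective” (rows here grow downward; the camera height default is an assumption):

```python
def pixel_height(world_height_m, v_bottom, v_horizon, camera_height_m=1.6):
    """Ground-plane relation: an object of height H whose bottom sits at image row
    v_bottom (rows grow downward) appears  h = H * (v_bottom - v_horizon) / y_c
    pixels tall, where y_c is the camera height above the ground.
    camera_height_m = 1.6 is an assumed default, roughly eye level."""
    return world_height_m * (v_bottom - v_horizon) / camera_height_m

# Example: a 3 m tall elephant with its feet at row 420, horizon at row 250,
# camera 1.6 m above the ground -> scale the cut-out mask to this pixel height.
h_px = pixel_height(3.0, v_bottom=420, v_horizon=250)
print(round(h_px), "pixels tall")   # ~319 pixels
```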

88 of 98

89 of 98

Hacker 2

Ziyang Xie

90 of 98

91 of 98

Single View 3D Mesh Reconstruction

Goal: Single RGB Input → Colored Mesh

92 of 98

Leverage SOTA Depth Estimator

93 of 98

Mesh Sheet Method

94 of 98

Compared with Poisson Mesh Reconstruction

The mesh sheet introduces a connectivity prior and is more robust to outliers

95 of 98

Cut Mesh Connectivity Based on Depth Gradient
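A sketch of the mesh-sheet idea as described above (not the presenter's code; intrinsics and the threshold are assumed): back-project each depth pixel to a vertex, connect 4-neighbors into triangles, and drop faces that straddle a large relative depth jump:

```python
import numpy as np

def mesh_sheet(depth, fx=500.0, fy=500.0, cx=None, cy=None, grad_thresh=0.1):
    """Build a triangle-mesh 'sheet' from a depth map, cutting connectivity
    where the depth gradient is large (thresholds and intrinsics are assumed)."""
    h, w = depth.shape
    cx = (w - 1) / 2 if cx is None else cx
    cy = (h - 1) / 2 if cy is None else cy
    v, u = np.mgrid[0:h, 0:w]
    # back-project pixels to 3D vertices (pinhole model)
    verts = np.stack([(u - cx) / fx * depth, (v - cy) / fy * depth, depth], -1).reshape(-1, 3)

    idx = np.arange(h * w).reshape(h, w)
    # two triangles per 2x2 pixel block
    a, b, c, d = idx[:-1, :-1], idx[:-1, 1:], idx[1:, :-1], idx[1:, 1:]
    faces = np.concatenate([np.stack([a, b, c], -1).reshape(-1, 3),
                            np.stack([b, d, c], -1).reshape(-1, 3)])

    # cut faces whose vertices span a large relative depth jump
    z = depth.reshape(-1)[faces]
    keep = (z.max(1) - z.min(1)) / z.mean(1).clip(min=1e-6) < grad_thresh
    return verts, faces[keep]

depth = 2.0 + 0.01 * np.random.rand(120, 160)
depth[:, 80:] += 1.0            # a depth discontinuity that should be cut
verts, faces = mesh_sheet(depth)
print(len(verts), "vertices,", len(faces), "faces after the gradient cut")
```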

96 of 98

Comparison

w Gradient Cut

w/o Gradient Cut

97 of 98

More Results

98 of 98

Another Application

Single View 3D for Object Insertion

Insert & Render

Room + Carpet

User Placement