Oct 31
Lecturer
Yue Zeng
Background
Qualitative vs. Quantitative Shape Recovery
Background
Geometric Approaches
Background
Origami World
Background
The 19th-century empiricist Hermann von Helmholtz's Theory of Unconscious Inference:
“Our perception of the scene is based not only on the immediate sensory evidence, but on our long history of visual experiences and interactions with the world.”
Koenderink and colleagues found that humans have:
Background
Intrinsic Images
Inferring scene layout
knowledge-based interpretation of outdoor natural scenes
Motivation
Our methodology:
aligns with Helmholtz’s philosophy of intuition and empiricism:
corresponds to Gibson's notion of basic surface types:
Our surface layout is also philosophically similar to Marr’s sketch.
Key Challenges
1. Outdoor scenes often lack easily analyzable structured features, such as consistent vanishing lines, which complicates the estimation of 3D orientation.
Solution: generate multiple segmentations of an image and use a probabilistic approach to label regions.
2. Segmenting images into regions consistent with the 3D structure of the scene is hard because existing segmentation algorithms may not produce regions that correspond to entire surfaces.
Solution: combine cues such as color, texture, perspective, and location to improve confidence in the geometric labeling.
Problem Setting
1) We use statistical learning.
2) We are interested in a rough sense of the scene surfaces, not exact orientations.
3) Our surface layout complements the original image data rather than replacing it.
Geometric Classes
Goal: label an image of an outdoor scene into coarse geometric classes
300 outdoor images collected using Google image search:
Geometric Classes
Main Classes: “support”, “vertical”, and “sky”
Subclasses: “planar surfaces” vs “non-planar surfaces”
planar surfaces facing “left”, “center”, or “right” relative to the viewer
non-planar surfaces that are either “porous” or “solid”.
Cues for Labeling Surfaces
Location: likelihood of each geometric class given the x-y position
Color: likelihoods for each geometric main class and subclass given hue or saturation
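The location cue can be sketched as a normalized class-given-position histogram over a coarse spatial grid. This is an illustrative sketch, not the paper's implementation; the function name and the 8x8 grid resolution are our assumptions.

```python
import numpy as np

def location_prior(label_maps, n_classes, grid=8):
    """Estimate P(class | x, y) over a coarse grid from training label maps.

    label_maps: list of HxW integer arrays of per-pixel class labels.
    Returns an (n_classes, grid, grid) array of per-cell class probabilities.
    """
    counts = np.zeros((n_classes, grid, grid))
    for lm in label_maps:
        h, w = lm.shape
        ys = np.arange(h) * grid // h  # which grid row each pixel row falls in
        xs = np.arange(w) * grid // w  # which grid column each pixel column falls in
        for c in range(n_classes):
            mask = (lm == c)
            # accumulate how often class c appears in each grid cell
            for gy in range(grid):
                for gx in range(grid):
                    counts[c, gy, gx] += mask[ys == gy][:, xs == gx].sum()
    # normalize each cell so the class probabilities sum to 1 (avoid divide-by-zero)
    total = counts.sum(axis=0, keepdims=True)
    return counts / np.maximum(total, 1)
```

For an outdoor dataset, such a prior typically concentrates "sky" in the upper cells and "support" in the lower ones.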
Cues for Labeling Surfaces
Texture (apply a subset of the filter bank designed by Leung and Malik)
15 filters: 6 edge, 6 bar, 1 Gaussian, and 2 Laplacian of Gaussian, with 19x19-pixel support, a fixed scale for the oriented and blob filters, and 6 orientations.
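A Leung-Malik-style filter subset of this shape can be sketched with NumPy. The sigma values here are illustrative assumptions, not the paper's; the construction (Gaussian, two LoG scales, and first/second Gaussian derivatives at 6 orientations for the edge/bar filters) follows the standard LM recipe.

```python
import numpy as np

def lm_subset(size=19, sigma=2.0, n_orient=6):
    """Build 3 + 2*n_orient = 15 filters on a size x size support."""
    r = size // 2
    y, x = np.mgrid[-r:r + 1, -r:r + 1].astype(float)
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    g /= g.sum()  # unit-sum Gaussian (blob / smoothing filter)

    def log_filter(s):
        # Laplacian of Gaussian at scale s, forced to zero DC response
        gs = np.exp(-(x**2 + y**2) / (2 * s**2))
        f = (x**2 + y**2 - 2 * s**2) / s**4 * gs
        return f - f.mean()

    filters = [g, log_filter(sigma), log_filter(2 * sigma)]
    for i in range(n_orient):
        t = i * np.pi / n_orient
        u = x * np.cos(t) + y * np.sin(t)    # along-orientation axis
        v = -x * np.sin(t) + y * np.cos(t)   # across-orientation axis
        # elongated Gaussian envelope (3:1 aspect ratio, as in LM)
        env = np.exp(-(u**2) / (2 * (3 * sigma)**2) - v**2 / (2 * sigma**2))
        edge = -v / sigma**2 * env                     # 1st derivative: edge
        bar = (v**2 / sigma**2 - 1) / sigma**2 * env   # 2nd derivative: bar
        filters += [edge - edge.mean(), bar - bar.mean()]
    return np.stack(filters)  # shape (15, size, size)
```

Texture features are then the responses of each image region to these 15 filters (e.g., mean absolute response per filter).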
Cues for Labeling Surfaces
Perspective: a “soft” estimate and an explicit estimate of vanishing points
Analyze features like lines, intersections, vanishing points, texture gradients, and horizon position to infer the 3D orientation and spatial relationships of planes in the scene.
Get preliminary information about the vanishing point to infer the plane direction
Determine which planes are likely to be vertical or horizontal
Cues for Labeling Surfaces
Determine the orientation of the planes in the area
Provide more clues to the surface direction
Provide more accurate directional clues
Surface layout estimation algorithm
Start with superpixels: small, nearly-uniform regions in the image
Pros:
Surface layout estimation algorithm
Combines estimates from all of the segmentations
Need larger regions to use the more complex cues!
How can we find such regions?
Pros:
a task-based, efficient, and empirically effective sampling of segmentations.
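The combination step above can be sketched as a homogeneity-weighted average: each superpixel's class confidence averages the label likelihoods of every segment (across all segmentations) that contains it, weighted by how likely that segment is to be a single homogeneous surface. Names and data layout here are illustrative, not the paper's exact notation.

```python
import numpy as np

def combine_segmentations(memberships, seg_label_probs, seg_homogeneity):
    """Combine per-segment label estimates across multiple segmentations.

    memberships:      list over segmentations; each maps superpixel index -> segment id
    seg_label_probs:  list over segmentations; each is (n_segments, n_classes)
    seg_homogeneity:  list over segmentations; each is (n_segments,) confidence
                      that the segment is a single surface
    Returns (n_superpixels, n_classes) combined confidences.
    """
    n_sp = len(memberships[0])
    n_classes = seg_label_probs[0].shape[1]
    acc = np.zeros((n_sp, n_classes))
    wsum = np.zeros(n_sp)
    for member, probs, homo in zip(memberships, seg_label_probs, seg_homogeneity):
        for sp, seg in enumerate(member):
            # weight each segment's label distribution by its homogeneity
            acc[sp] += homo[seg] * probs[seg]
            wsum[sp] += homo[seg]
    return acc / wsum[:, None]
```

Averaging over many segmentations is what makes the method robust to any single segmentation being wrong.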
Classifier
We use boosted decision trees for each classifier, trained with the logistic regression version of AdaBoost.
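To make the boosting step concrete, here is a from-scratch toy sketch using decision stumps. Note the paper uses the logistic-regression (confidence-rated) variant of AdaBoost with decision trees; this discrete-AdaBoost stand-in is a simplification for illustration only.

```python
import numpy as np

def adaboost_stumps(X, y, n_rounds=20):
    """Discrete AdaBoost with decision stumps. X: (n, d); y in {-1, +1}."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)          # per-example weights
    stumps = []                       # (feature, threshold, polarity, alpha)
    for _ in range(n_rounds):
        best = None
        # exhaustively pick the stump with the lowest weighted error
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for pol in (1, -1):
                    pred = pol * np.where(X[:, j] > thr, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol, pred)
        err, j, thr, pol, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)     # numerical safety
        alpha = 0.5 * np.log((1 - err) / err)     # stump vote weight
        w *= np.exp(-alpha * y * pred)            # upweight mistakes
        w /= w.sum()
        stumps.append((j, thr, pol, alpha))

    def predict(Xq):
        s = sum(a * p * np.where(Xq[:, j] > t, 1, -1)
                for j, t, p, a in stumps)
        return np.where(s >= 0, 1, -1)
    return predict
```

In the actual system, one such boosted classifier is trained per geometric class (and per subclass), with the region cues as features.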
Experimental results
Pros: the algorithm is not highly sensitive to the number of segmentations or to the classification parameters.
Experimental results
Also easily extends to indoor images!
two experiments:
Average classification accuracy on indoor images:

| | main classes | subclasses |
| --- | --- | --- |
| before retraining | 76.8% | 44.9% |
| after retraining | 93.0% | 76.3% |
Results from multiple segmentations method. This figure displays an evenly-spaced sample of the best two-thirds of all of our results, sorted by main class accuracy from highest (upper-left) to lowest (lower-right).
Results from multiple segmentations method. This figure displays an evenly-spaced sample of the worst third of all of our results, sorted by main class accuracy from highest (upper-left) to lowest (lower-right).
Ablation
Explore two alternative frameworks for recovering surface layout:
main class: 86.2% subclass: 53.5%
main class: 85.9% subclass: 61.6%
main class: 88.1% subclass: 61.5%
Applications
automatic 3D reconstruction based on surface layout
object detection
Applications
navigation application
Future improvement
Archeologist 1:
Past/Concurrent Monocular Geometry Methods
Christopher Conway
Overview of Monocular Geometry
Recovery of the 3D Shape of an Object from a Single View
One of the classic works in monocular geometry is by Takeo Kanade in 1981
Method consists of two parts:
Automatic Photo Pop-up
Putting Objects in Perspective
Make3D: Learning 3D Scene Structure from a Single Still Image
Closing the Loop on Scene Interpretation
Archeologist 2:
Subsequent and Recent works
Yufeng Liu
Geometry constraints
Line Segment Detection
How to predict box?
Vanishing points
Box Transform
Iteratively optimize surface assignment and box transform
How to assign objects to box surfaces?
Structured Learning
{ line to vanishing point membership } -> box
“Recovering Surface Layout from an Image”
Line segment detection
Principal direction
Rotate
How to estimate planes?
How to label segments?
How to predict relation?
How to get segments?
Dense Graph Cut
Alpha expansion
Integer Programming
MAP of scene configuration
Derek’s work
Hierarchical region merging
Contribution:
{ Lines, Texture, Shape, etc } -> encoder
{ Surfaces, Normals } -> latent space
Structured Learning -> ViT attention
{ Integer Programming } -> decoder
Vlas Zyrianov
Private Investigator
At the time: Professors at CMU and Prof. Hoiem’s PhD Co-Advisors
What inspired the work?
At the time, Prof. Hoiem was taking a CV class which had homework on implementing convolutional filters. During this time, he implemented a proof-of-concept convolution-based texture feature extractor. The approach successfully segmented ground vs. vertical pixels in an “image of a dirt pile.”
Insight: Local features can be a powerful tool for many downstream applications.
This insight was used to develop Automatic Photo Pop-Up.
Automatic Photo Pop-Up, SIGGRAPH’05
Recovering Surface Layout from an Image, IJCV’07
Similar theme: what local features can be extracted from images, and what applications can they have?
Industrial Practitioner
Zixuan
SkyPath AI: Next-Generation Pure Vision-based Drone Navigation
Reliable Navigation in Cluttered Urban and Indoor Spaces
SkyPath AI: Goal
Food, grocery and medicine delivered
Existing Products – Large Market Size!
Limitations of Existing Products
Why aren’t we using it already?
Limitations of Existing Products
SkyPath AI addresses all these limitations via a pure vision navigation algorithm!
How does SkyPath AI work?
SkyPath AI builds a visual representation of the surrounding surfaces and estimates their orientations for navigation.
Visual Input
Surface Estimation
Path Planning (with SLAM in 3D)
Market Projection – Huge Potential
Potential Impacts
Positive impact:
Negative impact:
Critic
Coarse Geometric Classes
The authors’ approach classifies image pixels into coarse geometric classes like ground, vertical surfaces, and sky. While this makes the task computationally feasible, it oversimplifies real-world scenarios. For instance, complex objects like stairs, ramps, or transparent surfaces (glass walls) may defy easy classification. Could the system be too rigid, missing out on important subtle surface transitions that are critical for tasks like navigation or detailed object recognition?
Applications and Real-World Use Cases
The paper claims potential applications in navigation, object recognition, and scene understanding. However, there is a gap in discussing how well this method integrates with real-world systems. How does it perform in dynamic environments like autonomous vehicles where scenes change rapidly? What are the system’s time performance constraints, and is it fast enough for real-time processing? These practical considerations are crucial but are not thoroughly discussed in the paper.
Evaluation
The paper presents results primarily on outdoor images, but the evaluation set appears somewhat constrained. There is no mention of tests under challenging scenarios (e.g., nighttime or complex urban landscapes). This narrow evaluation potentially limits the scope of the findings. Could the model break down under these more difficult conditions, and if so, how can it be improved?
Graduate Student
Jiahua Dong
Single-image geometry gives clues about 3D geometry
Dust3R (3 views)
MariGold (single-image)
Dust3R (1 view)
Single-image geometry gives clues about 3D geometry
ControlNet
ControlVideo
Naive extension on layout guided generation
The benefits of “Recovering Surface Layout from an Image”
It could serve as another representation to “condition on” for diffusion models.
Diffusion
Idea 1 Grounded camera pose correction & 3D reconstruction
Without grounding information, the point cloud is distorted
Idea 1 Grounded camera pose correction & 3D reconstruction
Our method:
Surface layout
Idea 1 Grounded camera pose correction & 3D reconstruction
Our method:
Application:
Idea 2 Single-image 3D reconstruction with 3DGS
PixelSplat
MVSplat
Idea 2 Single-image 3D reconstruction with 3DGS
Triplane Meets Gaussian Splatting
Idea 2 Single-image 3D reconstruction with 3DGS
Hacker 1
Rachel Moan
Putting Objects in Perspective
From “Putting Objects in perspective”
h_i → pixel height of object i
y_c → camera height
v_0 → horizon position
v_i → bottom position of object i
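Assuming a roughly level camera and objects resting on the ground plane, these symbols combine into the standard relation from "Putting Objects in Perspective" (our hedged reconstruction, with v measured up from the image bottom):

```latex
% pixel height of object i, given its world height y_i,
% the camera height y_c, the horizon v_0, and the object bottom v_i:
h_i = \frac{y_i \,(v_0 - v_i)}{y_c}
```

Intuitively: the farther the object's bottom sits below the horizon, the closer it is to the camera, and the taller it appears in pixels.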
Goal: draw possible standing locations of people in an image
Extract depth maps and surface normals
Get depth maps and surface normals from GeoWizard
Estimate horizon line
Draw people at random ground plane locations
Inserting objects
Segment the elephant and get its mask using YOLOv8
Inserting objects
Choose some pixel location for the bottom of the elephant
Set the elephant's world height
Compute the elephant’s pixel height
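The pixel-height step follows directly from the ground-plane relation above; this is a hedged sketch with our own names, where v is measured up from the image bottom.

```python
def pixel_height(world_height, v_bottom, v_horizon, camera_height):
    """Pixel height of a ground-standing object: h = y * (v0 - v) / y_c.

    world_height:  object height in meters
    v_bottom:      image row of the object's bottom (from image bottom, px)
    v_horizon:     image row of the horizon (from image bottom, px)
    camera_height: camera height above the ground in meters
    """
    return world_height * (v_horizon - v_bottom) / camera_height

# e.g. a 3 m elephant whose bottom sits 100 px below the horizon,
# shot from a camera 1.5 m above the ground:
# pixel_height(3.0, 200, 300, 1.5) -> 200.0 px
```

The resized elephant is then pasted at the chosen bottom location with its segmentation mask.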
Hacker 2
Ziyang Xie
Single View 3D Mesh Reconstruction
Goal: Single RGB Input → Colored Mesh
Leverage SOTA Depth Estimator
Mesh Sheet Method
Compared with Poisson Mesh Reconstruction
The mesh sheet introduces a connectivity prior and is more robust to outliers
Cut Mesh Connectivity Based on Depth Gradient
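The mesh-sheet-with-gradient-cut idea can be sketched as follows: lift each depth pixel to a 3D vertex through a pinhole camera, triangulate neighboring pixels into a sheet, and drop triangles that span a large depth jump (a discontinuity). This is an illustrative sketch; the intrinsics and threshold are assumptions, not the project's actual values.

```python
import numpy as np

def depth_to_mesh(depth, fx=500.0, fy=500.0, grad_thresh=0.1):
    """Turn an HxW depth map into (vertices, triangle faces)."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w].astype(float)
    # back-project each pixel through a pinhole camera centered on the image
    X = (u - w / 2) * depth / fx
    Y = (v - h / 2) * depth / fy
    verts = np.stack([X, Y, depth], axis=-1).reshape(-1, 3)

    faces = []
    idx = lambda r, c: r * w + c
    for r in range(h - 1):
        for c in range(w - 1):
            quad = depth[r:r + 2, c:c + 2]
            # cut connectivity where the depth jump across the quad is too large
            if quad.max() - quad.min() > grad_thresh:
                continue
            faces.append((idx(r, c), idx(r + 1, c), idx(r, c + 1)))
            faces.append((idx(r + 1, c), idx(r + 1, c + 1), idx(r, c + 1)))
    return verts, np.array(faces)
```

Without the gradient cut, foreground and background would be bridged by long "rubber sheet" triangles at every depth discontinuity.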
Comparison
w Gradient Cut
w/o Gradient Cut
More Results
Another Application
Single View 3D for Object Insertion
Insert & Render
Room + Carpet
User Placement