6.8300/6.8301 Advances in Computer Vision
Spring 2024
Lectures: TR 1pm–2:30pm, room 26-100
Course Instructors: Sara Beery, Kaiming He, Mina Konaković Luković, Vincent Sitzmann
CI-M Instructors
TAs
Head TA: David Forman
Sarah Alnegheimish, Hyojin Bahng, Puja Balaji, Mehul Damani, David Fang, Pourya Habibzadeh, Yingcheng Liu, Joanna Materzynska, Safa Medin, Anushka Nair, McKinley Polen, Ishana Shastri, Demircan Tas, Clinton Wang, Sarah Zhang
Course content
https://advances-in-vision.github.io/
We will cover: Cameras, optics, signals, deep learning, applications, and practical research skills.
Undergraduate and Graduate versions of this class share the same lectures.
Grad vs UG Clarifications
CI-M Clarifications
Grading
Problem sets
Credit multiplier: 1.0 if submitted by the due date; 0.5 if submitted up to 1 week later.
Collaboration Policy
AI Assistants Policy
Final Project
Schedule Overview
- cameras, optics
- signals
- deep learning foundations
- modern CV
- applications
- CV in practice
Additional Information
Other questions:
We’ll be in the lobby just outside 26-100 after this class for any immediate questions today.
Other mechanisms to answer general questions about the class:
Lecture 1
Introduction to computer vision
1. Introduction to computer vision
To see
“What does it mean, to see? The plain man's answer (and Aristotle's, too) would be, to know what is where by looking.” (David Marr, Vision, 1982)
To discover from images what is present in the world, where things are, what actions are taking place, to predict and anticipate events in the world.
Exciting times in computer vision
Healthcare
Robotics
Driving
Gaming
Accessibility
Ecology
Exciting times in computer vision
“A cup of cat”
“A cup of coffee”
“A cat”
DALL-E 2 (OpenAI)
https://www.reddit.com/r/dalle2/comments/y4mygn/a_cup_of_cat/
Slide credit: Shuang Li
When some of us started…
Sheep
Airplane
Bed
Horse
Why is vision hard?
The input
What we see
What the machine gets
The camera is a measurement device, not a vision system
To see: perception vs. measurement
“Turning the Tables” by Roger Shepard
Depth processing is automatic, and we cannot shut it down.
A short story of vision research
The Greeks
Intromission theory
simulacra
Democritus (c. 460–370 BC)
The eye
Empedocles (500 BC)
Plato (360 BC)
“So much of fire as would not burn, but gave a gentle light”
Plato
Extramission (emission) theory
“And of the organs they first contrived the eyes to give light, and the principle according to which they were inserted was as follows: So much of fire as would not burn, but gave a gentle light, they formed into a substance akin to the light of every-day life; and the pure fire which is within us and related thereto they made to flow through the eyes in a stream smooth and dense, compressing the whole eye, and especially the centre part, so that it kept out everything of a coarser nature, and allowed to pass only this pure element. When the light of day surrounds the stream of vision, then like falls upon like, and they coalesce, and one body is formed by natural affinity in the line of vision, wherever the light that falls from within meets with an external object.”
Plato’s theory of vision (427-347 BC)
Euclid (325 BC)
http://philomatica.org/wp-content/uploads/2013/01/Optics-of-Euclid.pdf
“Let it be assumed that lines drawn directly from the eye pass through a space of great extent; and that the form of the space included within our vision is a cone…” Euclid (translated by Burton)
Remarkable key idea: light travels in straight lines
The goal of the first lecture and pset1 is to solve vision
Task: given a picture…
… recover the 3D scene structure
3D
Depth map
A Simple Visual System
A Simple World
A Simple World
http://www.packet.cc/files/mach-per-3D-solids.html
Lawrence Roberts, “Machine Perception of Three-Dimensional Solids”, often considered the first computer vision PhD thesis
Build your own simple world
A simple goal
To recover the 3D structure of the world from the 2D image
We will make this goal more explicit later.
A simple image formation model
Simple world rules:
A simple image formation model
Perspective projection
A simple image formation model
World and image coordinate systems
θ
Z
X
Y
World coordinates
(right-handed reference system)
A simple image formation model
(right-handed reference system)
Camera plane
World reference system
A simple image formation model
Image coordinates as a function of world coordinates:
x = X + x0
y = cos(θ) Y − sin(θ) Z + y0
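The projection above can be sketched in a few lines. This is a minimal illustration, not code from the lecture; `theta` is the camera tilt and `x0`, `y0` place the image origin, with values chosen only for demonstration.

```python
import numpy as np

def project(X, Y, Z, theta, x0=0.0, y0=0.0):
    """Map world coordinates (X, Y, Z) to image coordinates (x, y)
    using the simple-world projection: x = X + x0,
    y = cos(theta)*Y - sin(theta)*Z + y0."""
    x = X + x0
    y = np.cos(theta) * Y - np.sin(theta) * Z + y0
    return x, y

# A point on the ground plane (Y = 0): its image height depends only on Z.
x, y = project(X=1.0, Y=0.0, Z=2.0, theta=np.pi / 4)
```

Note that for ground points the image y-coordinate encodes depth Z directly, which is what makes the later inference tractable.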
A simple goal
To recover the 3D structure of the world from the 2D image
We want to recover X(x,y), Y(x,y), and Z(x,y) using the image I(x,y) as input.
x
y
I(x,y)
Why is this hard?
A simple visual system: the input image
y
x
I(x,y)
0
255
In this representation, the image is an array of intensity values (color values) indexed by location.
A better representation: Figure/ground
Ground
In our simple world: using the fact that objects have color and are darker than the ground.
For ground pixels, we know that Y(x, y) = 0
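The figure/ground rule above can be sketched as a simple threshold. This is an illustrative assumption, not the lecture's actual code; the threshold value 40 is made up for the example.

```python
import numpy as np

def figure_ground(I, threshold=40):
    """Split an intensity image into figure (dark object pixels)
    and ground (bright background pixels, where Y(x,y) = 0)."""
    figure = I < threshold   # objects are darker than the ground
    ground = ~figure
    return figure, ground

# Toy image: bright ground (200) with a dark object (10).
I = np.array([[200, 200, 10],
              [200, 10, 10]])
figure, ground = figure_ground(I)
```

In the simple world this binary map already carries 3D information: every ground pixel fixes Y(x, y) = 0.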
Figure/ground segmentation
Classical visual illusion: “two faces or a vase” (Rubin's vase)
A better representation: Edges
Occlusion
Change of surface orientation
Finding edges in the image
Image gradient: ∇I = (∂I/∂x, ∂I/∂y)
Approximate image derivative: ∂I/∂x ≈ I(x+1, y) − I(x, y)
Edge strength: E(x,y) = ‖∇I(x,y)‖
Edge orientation: θ(x,y) = arctan((∂I/∂y) / (∂I/∂x))
Edge normal: n(x,y) = ∇I(x,y) / ‖∇I(x,y)‖
Finding edges in the image: E(x,y) and n(x,y) computed from I(x,y)
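The gradient-based edge measures above can be sketched with finite differences. A minimal version, assuming a grayscale image stored as a 2D float array (this is illustrative code, not the pset solution):

```python
import numpy as np

def edge_measures(I):
    """Return edge strength E and orientation theta from image I,
    using forward differences as the derivative approximation."""
    dIdx = np.zeros_like(I, dtype=float)
    dIdy = np.zeros_like(I, dtype=float)
    dIdx[:, :-1] = I[:, 1:] - I[:, :-1]   # dI/dx ≈ I(x+1, y) - I(x, y)
    dIdy[:-1, :] = I[1:, :] - I[:-1, :]   # dI/dy ≈ I(x, y+1) - I(x, y)
    E = np.hypot(dIdx, dIdy)              # edge strength |∇I|
    theta = np.arctan2(dIdy, dIdx)        # edge orientation
    return E, theta

# A vertical step edge: strength 1 along the step, 0 elsewhere.
I = np.zeros((5, 5))
I[:, 3:] = 1.0
E, theta = edge_measures(I)
```

The edge normal n(x,y) is then just the gradient divided by E wherever E is nonzero.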
Edge classification
From edges to surface constraints
Unknowns: X(x,y), Y(x,y), Z(x,y)
From edges to surface constraints
… now things get a bit more complicated.
Generic view assumption
Image
Non-accidental properties
D. Lowe, 1985
I. Biederman, “Recognition-by-Components”, 1987
Non-accidental properties in the simple world
From edges to surface constraints
How can we relate the information in the pixels with 3D surfaces in the world?
Given the image, what can we say about X, Y and Z in the pixels that belong�to a vertical edge?
From edges to surface constraints
Given the image, what can we say about X, Y and Z in the pixels that belong to a horizontal 3D edge?
From edges to surface constraints
?
Assumption of planar faces:
Information has to be propagated from the edges
The “Rule of Nothing” (Ted Adelson): where you see nothing, assume nothing happens, and just propagate information from where something happened.
A simple inference scheme
All the constraints are linear:
Y(x,y) = 0  if (x,y) belongs to a ground pixel
∂Y/∂y = 1/cos(θ)  if (x,y) belongs to a vertical edge (Z constant along it)
Y constant along the edge direction  if (x,y) belongs to a horizontal edge
∂²Y/∂x² = ∂²Y/∂y² = 0 (planar faces)  if (x,y) is not on an edge
A similar set of constraints can be derived for Z.
Discrete approximation
We can transform every differential constraint into a discrete linear constraint on Y(x,y)
Y(x,y)
111 | 115 | 113 | 111 | 112 | 111 | 112 | 111 |
135 | 138 | 137 | 139 | 145 | 146 | 149 | 147 |
163 | 168 | 188 | 196 | 206 | 202 | 206 | 207 |
180 | 184 | 206 | 219 | 202 | 200 | 195 | 193 |
189 | 193 | 214 | 216 | 104 | 79 | 83 | 77 |
191 | 201 | 217 | 220 | 103 | 59 | 60 | 68 |
195 | 205 | 216 | 222 | 113 | 68 | 69 | 83 |
199 | 203 | 223 | 228 | 108 | 68 | 71 | 77 |
Derivative approximation kernel:
[ -1  1 ]

A slightly better approximation (the Sobel kernel):
[ -1  0  1 ]
[ -2  0  2 ]
[ -1  0  1 ]
(it is symmetric, and it averages horizontal derivatives over 3 vertical locations)
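The two kernels can be compared on a small step-edge array. A minimal sketch with a hand-rolled 'valid'-mode correlation (illustrative only; in practice one would use a library convolution):

```python
import numpy as np

def correlate2d_valid(I, K):
    """Correlate image I with kernel K, keeping only fully-overlapping
    ('valid') output positions."""
    kh, kw = K.shape
    H, W = I.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)
    return out

simple = np.array([[-1.0, 1.0]])                       # forward difference
sobel = np.array([[-1.0, 0, 1], [-2, 0, 2], [-1, 0, 1]])

I = np.tile(np.array([0.0, 0, 1, 1]), (4, 1))          # vertical step edge
d1 = correlate2d_valid(I, simple)                      # responds only at the step
d2 = correlate2d_valid(I, sobel)                       # smoothed, stronger response
```

On a noisy image the Sobel response is more stable, because each output averages three vertical rows.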
Discrete approximation
Y(x,y)
Transform the “image” Y(x,y) into a column vector:
Each discrete constraint then becomes one row vector over the vectorized Y. For example, the row encoding the derivative constraint at x=0, y=0 (for a 4×4 example, 16 pixels):

| 0 | 0 | 0 | 0 | 0 | -1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
A simple inference scheme
Stacking every constraint row into a matrix A, with the corresponding target values in a vector b, gives one large linear system:

A Y = b

A holds the constraint weights, Y is the vectorized surface, and b the constraint values. The least-squares solution is:

Y = (AᵀA)⁻¹ Aᵀ b
Results
Pipeline: Input → Figure/ground → Linear system → Output (X, Y, Z)
Changing viewpoint
Input
New viewpoints:
Generalization
Input
New viewpoint:
It seems to work!
… but the representation is wrong!
Violations of simple world assumptions
Violations of simple world assumptions
Shading is due to painted stripes
Violations of simple world assumptions
Shading is due to illumination
Generalization (2nd test): Impossible steps
Some keywords
Tasks: generic formulation
Image / Sequence
Labels
Image/sequence
Tasks: what humans care about
Tasks: what humans care about
Verification: is this a building?
Recognition: which building is this?
Tasks: what humans care about
Image classification: list all the objects present in the image
Tasks: what humans care about
Scene categorization
Tasks: what humans care about
Semantic segmentation: assign labels to all the pixels in the image
Building
People
Grass
Tree
Sky
Related tasks:
Tasks: what humans care about
Detection: Locate all the people in this image
Tasks: what humans care about
Recognition: who is this person?
Tasks: what humans care about
Rough 3D layout, depth ordering
Tasks: what humans care about
Making new images
Tasks: what humans care about
Adding missing content
Input image
Colorized output
Tasks: what humans care about
Predicting future events
What is going to happen?