6.8300/6.8301 Advances in Computer Vision
Spring 2024
Lectures: TR 1pm–2:30pm, room 26-100
Course Instructors: Sara Beery, Kaiming He, Mina Konaković Luković, Vincent Sitzmann
CI-M Instructors
TAs
Head TA: David Forman
Sarah Alnegheimish, Hyojin Bahng, Puja Balaji, Mehul Damani, David Fang, Pourya Habibzadeh, Yingcheng Liu, Joanna Materzynska, Safa Medin, Anushka Nair, McKinley Polen, Ishana Shastri, Demircan Tas, Clinton Wang, Sarah Zhang
Course content
https://advances-in-vision.github.io/
We will cover: Cameras, optics, signals, deep learning, applications, and practical research skills.
Undergraduate and Graduate versions of this class share the same lectures.
Grad vs UG Clarifications
CI-M Clarifications
Grading
Problem sets
Credit multiplier: 1.0 if submitted by the due date; 0.5 if submitted up to 1 week later.
Collaboration Policy
AI Assistants Policy
Final Project
Schedule Overview
- cameras, optics
- signals
- deep learning foundations
- modern CV
- applications
- CV in practice
Additional Information
Other questions:
We’ll be in the lobby just outside 26-100 after this class for any immediate questions today.
Other mechanisms to answer general questions about the class:
Lecture 1
Introduction to computer vision
1. Introduction to computer vision
To see
“What does it mean, to see? The plain man's answer (and Aristotle's, too) would be, to know what is where by looking.” (David Marr, Vision, 1982)
To discover from images what is present in the world, where things are, what actions are taking place, to predict and anticipate events in the world.
Exciting times in computer vision
Healthcare
Robotics
Driving
Gaming
Accessibility
Ecology
Exciting times in computer vision
“A cup of cat”
“A cup of coffee”
“A cat”
DALL-E 2 (OpenAI)
https://www.reddit.com/r/dalle2/comments/y4mygn/a_cup_of_cat/
Slide credit: Shuang Li
When some of us started…
Sheep
Airplane
Bed
Horse
Why is vision hard?
The input
What we see
What the machine gets
The camera is a measurement device, not a vision system
To see: perception vs. measurement
“Turning the Tables” by Roger Shepard
Depth processing is automatic, and we cannot shut it down.
A short story of vision research
The Greeks
Intromission theory
simulacra
Democritus (c. 460–370 BC)
The eye
Empedocles (500 BC)
Plato (360 BC)
“So much of fire as would not burn, but gave a gentle light”
Plato
Extramission (emission) theory
“And of the organs they first contrived the eyes to give light, and the principle according to which they were inserted was as follows: So much of fire as would not burn, but gave a gentle light, they formed into a substance akin to the light of every-day life; and the pure fire which is within us and related thereto they made to flow through the eyes in a stream smooth and dense, compressing the whole eye, and especially the centre part, so that it kept out everything of a coarser nature, and allowed to pass only this pure element. When the light of day surrounds the stream of vision, then like falls upon like, and they coalesce, and one body is formed by natural affinity in the line of vision, wherever the light that falls from within meets with an external object.”
Plato’s theory of vision (427-347 BC)
Euclid (325 BC)
http://philomatica.org/wp-content/uploads/2013/01/Optics-of-Euclid.pdf
“Let it be assumed that lines drawn directly from the eye pass through a space of great extent; and that the form of the space included within our vision is a cone…” Euclid (translated by Burton)
Remarkable key idea: light travels in straight lines
The goal of the first lecture and pset1 is to solve vision
Task: given a picture…
… recover the 3D scene structure
3D
Depth map
A Simple Visual System
A Simple World
A Simple World
http://www.packet.cc/files/mach-per-3D-solids.html
Lawrence Roberts, “Machine Perception of Three-Dimensional Solids”, often considered the first computer vision PhD thesis
Build your own simple world
A simple goal
To recover the 3D structure of the world from the 2D image
We will make this goal more explicit later.
A simple image formation model
Simple world rules:
A simple image formation model
Perspective projection
A simple image formation model
World and image coordinate systems
θ
Z
X
Y
World coordinates
(right-handed reference system)
A simple image formation model
(right-handed reference system)
Camera plane
World reference system
A simple image formation model
Image coordinates as a function of world coordinates:
x = X + x0
y = cos(θ) Y − sin(θ) Z + y0
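The projection above can be sketched in a few lines. This is a minimal illustration, not code from the lecture; `theta` is the camera tilt and `x0`, `y0` place the image origin, with values chosen only for demonstration.

```python
import numpy as np

def project(X, Y, Z, theta, x0=0.0, y0=0.0):
    """Map world coordinates (X, Y, Z) to image coordinates (x, y)
    using the simple-world projection: x = X + x0,
    y = cos(theta)*Y - sin(theta)*Z + y0."""
    x = X + x0
    y = np.cos(theta) * Y - np.sin(theta) * Z + y0
    return x, y

# A point on the ground plane (Y = 0): its image height depends only on Z.
x, y = project(X=1.0, Y=0.0, Z=2.0, theta=np.pi / 4)
```

Note that for ground points the image y-coordinate encodes depth Z directly, which is what makes the later inference tractable.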
A simple goal
To recover the 3D structure of the world from the 2D image
We want to recover X(x,y), Y(x,y), and Z(x,y) using the image I(x,y) as input.
x
y
I(x,y)
Why is this hard?
A simple visual system: the input image
y
x
I(x,y)
0
255
In this representation, the image is an array of intensity values (color values) indexed by location.
A better representation: Figure/ground
Ground
In our simple world: using the fact that objects have color and are darker than the ground.
For ground pixels, we know that Y(x, y) = 0
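The figure/ground rule above can be sketched as a simple threshold. This is an illustrative assumption, not the lecture's actual code; the threshold value 40 is made up for the example.

```python
import numpy as np

def figure_ground(I, threshold=40):
    """Split an intensity image into figure (dark object pixels)
    and ground (bright background pixels, where Y(x,y) = 0)."""
    figure = I < threshold   # objects are darker than the ground
    ground = ~figure
    return figure, ground

# Toy image: bright ground (200) with a dark object (10).
I = np.array([[200, 200, 10],
              [200, 10, 10]])
figure, ground = figure_ground(I)
```

In the simple world this binary map already carries 3D information: every ground pixel fixes Y(x, y) = 0.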
Figure/ground segmentation
Classical visual illusion: “two faces or a vase” (Rubin's vase)
A better representation: Edges
Occlusion
Change of surface orientation
Finding edges in the image
Image gradient: ∇I = (∂I/∂x, ∂I/∂y)
Approximate image derivative: ∂I/∂x ≈ I(x+1, y) − I(x, y)
Edge strength: E(x,y) = ‖∇I(x,y)‖
Edge orientation: θ(x,y) = arctan((∂I/∂y) / (∂I/∂x))
Edge normal: n(x,y) = ∇I(x,y) / ‖∇I(x,y)‖
Finding edges in the image: E(x,y) and n(x,y) computed from I(x,y)
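The gradient-based edge measures above can be sketched with finite differences. A minimal version, assuming a grayscale image stored as a 2D float array (this is illustrative code, not the pset solution):

```python
import numpy as np

def edge_measures(I):
    """Return edge strength E and orientation theta from image I,
    using forward differences as the derivative approximation."""
    dIdx = np.zeros_like(I, dtype=float)
    dIdy = np.zeros_like(I, dtype=float)
    dIdx[:, :-1] = I[:, 1:] - I[:, :-1]   # dI/dx ≈ I(x+1, y) - I(x, y)
    dIdy[:-1, :] = I[1:, :] - I[:-1, :]   # dI/dy ≈ I(x, y+1) - I(x, y)
    E = np.hypot(dIdx, dIdy)              # edge strength |∇I|
    theta = np.arctan2(dIdy, dIdx)        # edge orientation
    return E, theta

# A vertical step edge: strength 1 along the step, 0 elsewhere.
I = np.zeros((5, 5))
I[:, 3:] = 1.0
E, theta = edge_measures(I)
```

The edge normal n(x,y) is then just the gradient divided by E wherever E is nonzero.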
Edge classification
From edges to surface constraints
Unknowns: X(x,y), Y(x,y), Z(x,y)
From edges to surface constraints
… now things get a bit more complicated.
Generic view assumption
Image
Non-accidental properties
D. Lowe, 1985
I. Biederman, “Recognition-by-Components”, 1987
Non-accidental properties in the simple world
From edges to surface constraints
How can we relate the information in the pixels with 3D surfaces in the world?
Given the image, what can we say about X, Y and Z in the pixels that belong�to a vertical edge?
From edges to surface constraints
Given the image, what can we say about X, Y and Z in the pixels that belong to a horizontal 3D edge?
From edges to surface constraints
?
Assumption of planar faces:
Information has to be propagated from the edges
The “Rule of Nothing” (Ted Adelson): where you see nothing, assume nothing happens, and just propagate information from where something happened.
A simple inference scheme
All the constraints are linear:
Y(x,y) = 0  if (x,y) belongs to a ground pixel
∂Y/∂y = 1/cos(θ)  if (x,y) belongs to a vertical edge (Z constant along it)
Y constant along the edge direction  if (x,y) belongs to a horizontal edge
∂²Y/∂x² = ∂²Y/∂y² = 0 (planar faces)  if (x,y) is not on an edge
A similar set of constraints can be derived for Z.
Discrete approximation
We can transform every differential constraint into a discrete linear constraint on Y(x,y)
Y(x,y)
111 | 115 | 113 | 111 | 112 | 111 | 112 | 111 |
135 | 138 | 137 | 139 | 145 | 146 | 149 | 147 |
163 | 168 | 188 | 196 | 206 | 202 | 206 | 207 |
180 | 184 | 206 | 219 | 202 | 200 | 195 | 193 |
189 | 193 | 214 | 216 | 104 | 79 | 83 | 77 |
191 | 201 | 217 | 220 | 103 | 59 | 60 | 68 |
195 | 205 | 216 | 222 | 113 | 68 | 69 | 83 |
199 | 203 | 223 | 228 | 108 | 68 | 71 | 77 |
Derivative approximation kernel:
[ -1  1 ]

A slightly better approximation (the Sobel kernel):
[ -1  0  1 ]
[ -2  0  2 ]
[ -1  0  1 ]
(it is symmetric, and it averages horizontal derivatives over 3 vertical locations)
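The two kernels can be compared on a small step-edge array. A minimal sketch with a hand-rolled 'valid'-mode correlation (illustrative only; in practice one would use a library convolution):

```python
import numpy as np

def correlate2d_valid(I, K):
    """Correlate image I with kernel K, keeping only fully-overlapping
    ('valid') output positions."""
    kh, kw = K.shape
    H, W = I.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)
    return out

simple = np.array([[-1.0, 1.0]])                       # forward difference
sobel = np.array([[-1.0, 0, 1], [-2, 0, 2], [-1, 0, 1]])

I = np.tile(np.array([0.0, 0, 1, 1]), (4, 1))          # vertical step edge
d1 = correlate2d_valid(I, simple)                      # responds only at the step
d2 = correlate2d_valid(I, sobel)                       # smoothed, stronger response
```

On a noisy image the Sobel response is more stable, because each output averages three vertical rows.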
Discrete approximation
Y(x,y)
Transform the “image” Y(x,y) into a column vector:
Each discrete constraint then becomes one row vector over the vectorized Y. For example, the row encoding the derivative constraint at x=0, y=0 (for a 4×4 example, 16 pixels):

| 0 | 0 | 0 | 0 | 0 | -1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
A simple inference scheme
Stacking every constraint row into a matrix A, with the corresponding target values in a vector b, gives one large linear system:

A Y = b

A holds the constraint weights, Y is the vectorized surface, and b the constraint values. The least-squares solution is:

Y = (AᵀA)⁻¹ Aᵀ b
Results
Pipeline: Input → Figure/ground → Linear system → Output (X, Y, Z)
Changing viewpoint
Input
New viewpoints:
Generalization
Input
New viewpoint:
It seems to work!
… but the representation is wrong!
Violations of simple world assumptions
Violations of simple world assumptions
Shading is due to painted stripes
Violations of simple world assumptions
Shading is due to illumination
Generalization (2nd test): Impossible steps
Some keywords
Tasks: generic formulation
Image / Sequence
Labels
Image/sequence
Tasks: what humans care about
Tasks: what humans care about
Verification: is this a building?
Recognition: which building is this?
Tasks: what humans care about
Image classification: list all the objects present in the image
Tasks: what humans care about
Scene categorization
Tasks: what humans care about
Semantic segmentation: assign labels to all the pixels in the image
Building
People
Grass
Tree
Sky
Related tasks:
Tasks: what humans care about
Detection: Locate all the people in this image
Tasks: what humans care about
Recognition: who is this person?
Tasks: what humans care about
Rough 3D layout, depth ordering
Tasks: what humans care about
Making new images
Tasks: what humans care about
Adding missing content
Input image
Colorized output
Tasks: what humans care about
Predicting future events
What is going to happen?