1 of 132

Lecture 6

Introduction to Machine Learning

6.8300/1 Advances in Computer Vision

Spring 2024

Sara Beery, Kaiming He, Vincent Sitzmann, Mina Konaković Luković

2 of 132

Announcements

  • Pset 2 due on Thursday
  • PyTorch tutorials Tues 9:30-11am in 54-100 and Thurs 2:30-4pm in 2-190

3 of 132

6. Introduction to Machine Learning

  • “It’s all about the data!”
  • Formalisms of learning (Data, Compute, Objective Function, Hypothesis Space, Optimizer)
  • Case study #1: Linear least-squares
    • Empirical risk minimization
  • Case study #2: Image classification
    • Softmax regression
  • The problem of generalization

4 of 132

Edges

[Figure: edge types annotated in an image: occlusion, change of surface orientation, contact edge, vertical 3D edge, horizontal 3D edge]

5 of 132

[Slide credit: Alexei Efros]

6 of 132

[Hays and Efros. Scene Completion Using Millions of Photographs. SIGGRAPH 2007 and CACM October 2008.]

[Slide credit: Alexei Efros]

7 of 132

2 Million Flickr Images

[Slide credit: Alexei Efros]

8 of 132

… 200 total

[Slide credit: Alexei Efros]

9 of 132

[Slide credit: Alexei Efros]

10 of 132

11 of 132

[Slide credit: Alexei Efros]

12 of 132

[Slide credit: Alexei Efros]

13 of 132

[Slide credit: Alexei Efros]

14 of 132

[Slide credit: Alexei Efros]

15 of 132

[Slide credit: Alexei Efros]

16 of 132

[Slide credit: Alexei Efros]

17 of 132

[Slide credit: Alexei Efros]

18 of 132

… 200 scene matches

[Slide credit: Alexei Efros]

19 of 132

20 of 132

21 of 132

[Slide credit: Alexei Efros]

22 of 132

[Slide credit: Alexei Efros]

23 of 132

[Slide credit: Alexei Efros]

24 of 132

Why does it work?

[Slide credit: Alexei Efros]

25 of 132

[Slide credit: Alexei Efros]

26 of 132

Nearest neighbors from a collection of 20 thousand images

[Slide credit: Alexei Efros]

27 of 132

Nearest neighbors from a collection of 2 million images

[Slide credit: Alexei Efros]

28 of 132

“Unreasonable Effectiveness of Data”

Parts of our world can be explained by elegant mathematics

physics, chemistry, astronomy, etc.

But much cannot

psychology, economics, genetics, etc.

Enter The Data!

Great advances in several fields:

e.g., speech recognition, machine translation

Case study: Google

[Halevy, Norvig, Pereira 2009]

“For many tasks, once we have a billion or so examples, we essentially have a closed set that represents (or at least approximates) what we need…”

[Slide credit: Alexei Efros]

29 of 132

Learning versus inference

This lecture: Learning

Given data, automatically come up with the model that best explains it

Last lecture: Inference

Given a model, make inferences with it

30 of 132

31 of 132

The goal of learning is to extract lessons from past experience in order to solve future problems.

32 of 132

2 ☆ 3 = 36

7 ☆ 1 = 49

5 ☆ 2 = 100

2 ☆ 2 = 16

What does ☆ do?

33 of 132

Past experience

2 ☆ 3 = 36

7 ☆ 1 = 49

5 ☆ 2 = 100

2 ☆ 2 = 16

Future query

3 ☆ 5 = ?

Goal: answer future queries involving ☆

Approach: figure out what ☆ is doing by observing its behavior on examples

34 of 132

2 ☆ 3 = 36

7 ☆ 1 = 49

5 ☆ 2 = 100

2 ☆ 2 = 16

3 ☆ 5

225
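One hypothesis consistent with all four examples is a ☆ b = (a·b)², which predicts 225 for the future query, matching the slide. A quick Python check:

```python
def star(a, b):
    """One hypothesis consistent with the examples: a * b, squared."""
    return (a * b) ** 2

# Past experience: check the hypothesis against every training example.
examples = [(2, 3, 36), (7, 1, 49), (5, 2, 100), (2, 2, 16)]
assert all(star(a, b) == out for a, b, out in examples)

# Future query.
print(star(3, 5))  # 225, matching the slide's answer
```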

35 of 132

Learning from examples

(aka supervised learning)

36 of 132

Learning from examples

(aka supervised learning)

37 of 132

Learning from examples

(aka supervised learning)

38 of 132

Case study #1: Linear least squares

39 of 132

?

40 of 132

Hypothesis space

The relationship between X and Y is roughly linear:

41 of 132

Search for the parameters, θ,

that best fit the data.

Best fit in what sense?

42 of 132

Best fit in what sense?

Search for the parameters, θ,

that best fit the data.

The least-squares objective (aka loss) says the best fit is the function that minimizes the squared error between predictions and target values:

$\min_{\theta} \sum_{i=1}^{N} \big(f_\theta(x_i) - y_i\big)^2$

43 of 132

Best fit in what sense?

Search for the parameters, θ,

that best fit the data.

The least-squares objective (aka loss) says the best fit is the function that minimizes the squared error between predictions and target values:

$\min_{\theta} \sum_{i=1}^{N} \big(f_\theta(x_i) - y_i\big)^2$

44 of 132

Complete learning problem:

$\theta^* = \arg\min_{\theta} \sum_{i=1}^{N} \big(\theta^\top x_i - y_i\big)^2$

45 of 132

?

46 of 132

47 of 132

Why is squared error a good objective?

It follows from two modeling assumptions:

1. Observed Y’s are corrupted by Gaussian noise: $y_i = f_\theta(x_i) + \epsilon_i$, with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$

2. Training datapoints are independently and identically distributed (iid): $p(y_1, \ldots, y_N \mid x_1, \ldots, x_N) = \prod_{i=1}^{N} p(y_i \mid x_i)$

48 of 132

Why is squared error a good objective?

49 of 132

Why is squared error a good objective?

Under the assumption that the data is distributed as $p(y \mid x, \theta) = \mathcal{N}(y;\, f_\theta(x), \sigma^2)$, the most probable estimate of θ is:

$\hat{\theta} = \arg\max_\theta \prod_{i=1}^{N} p(y_i \mid x_i, \theta) = \arg\min_\theta \sum_{i=1}^{N} \big(y_i - f_\theta(x_i)\big)^2$

This is called the maximum likelihood estimator of θ.

50 of 132

How to minimize the objective w.r.t. θ?

Use an optimizer!

[Figure: a "machine with knobs": input goes in, output comes out, and a score is computed; the optimizer turns the knobs (parameters) to improve the score]

51 of 132

How to minimize the objective w.r.t. θ?

In the linear case, the learning problem

$\theta^* = \arg\min_\theta \|X\theta - y\|^2$

has a closed-form solution (the normal equations):

$\theta^* = (X^\top X)^{-1} X^\top y$
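A numerical sketch of the closed-form least-squares solution via the normal equations (NumPy, on a made-up toy dataset, not from the slides):

```python
import numpy as np

# Toy data: y is roughly linear in x, corrupted by Gaussian noise.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(-1, 1, 50)])  # bias column + feature
true_theta = np.array([2.0, -3.0])
y = X @ true_theta + 0.1 * rng.normal(size=50)

# Normal equations: solve (X^T X) theta = X^T y.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)  # close to [2, -3]
```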

52 of 132

Empirical Risk Minimization

(formalization of supervised learning)

Linear least squares learning problem

53 of 132

Empirical Risk Minimization

(formalization of supervised learning)

Objective function

(loss)

Hypothesis space

Training data
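Putting the three ingredients together, empirical risk minimization in symbols (one standard formulation):

```latex
\hat{f} = \arg\min_{f \in \mathcal{F}} \; \frac{1}{N} \sum_{i=1}^{N} L\big(f(x_i),\, y_i\big)
```

where $\mathcal{F}$ is the hypothesis space, $L$ is the loss, and $\{(x_i, y_i)\}_{i=1}^{N}$ is the training data.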

54 of 132

Case study #1: Linear least squares

55 of 132

56 of 132

Example 1: Linear least squares

57 of 132

Example 2: Program induction

58 of 132

Example 3: Deep learning

59 of 132

Space of all functions

True solution (needle)

Hypothesis space (haystack)

Space we will search

60 of 132

Space of all functions

True solution (needle)

Hypothesis space (haystack)

Space we will search

Deep nets

61 of 132

Space of all functions

True solution (needle)

Hypothesis space (haystack)

Space we will search

Linear functions

True solution is linear

62 of 132

Space of all functions

True solution (needle)

Hypothesis space (haystack)

Space we will search

Linear functions

True solution is nonlinear

63 of 132

Space of all functions

True solution (needle)

Hypothesis space (haystack)

Hypotheses consistent with data

64 of 132

Space of all functions

What happens as we increase the data?

True solution (needle)

Hypothesis space (haystack)

Hypotheses consistent with data

65 of 132

Space of all functions

What happens as we shrink the hypothesis space?

True solution (needle)

Hypothesis space (haystack)

Hypotheses consistent with data

66 of 132

Learning for vision

Big questions:

  1. How do you represent the input and output?
  2. What is the objective?
  3. What is the hypothesis space? (e.g., linear, polynomial, neural net?)
  4. How do you optimize? (e.g., gradient descent, Newton’s method?)
  5. What data do you train on?

67 of 132

Case study #2: Image classification

  1. How do you represent the input and output?
  2. What is the objective?
  3. Assume the hypothesis space is sufficiently expressive
  4. Assume we optimize perfectly
  5. Assume we train on exactly the data we care about

68 of 132

Image classification

“Fish”

Classifier

image x

label y

69 of 132

Image classification

“Fish”

Classifier

image x

label y

70 of 132

Image classification

“Fish”

Classifier

image x

label y

71 of 132

Image classification

image x

“Duck”

label y

Classifier

72 of 132

“Fish”

Training data

“Fish”

“Grizzly”

“Chameleon”

73 of 132

How to represent class labels?

Training data

“Fish”

“Grizzly”

“Chameleon”

Training data

1

2

3

One-hot vector

Training data

[1,0,0]

[0,1,0]

[0,0,1]
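The one-hot encoding above can be sketched in a few lines (class names taken from the slide; the helper `one_hot` is illustrative, not from the lecture):

```python
import numpy as np

labels = ["Fish", "Grizzly", "Chameleon"]            # class names from the slide
class_index = {name: k for k, name in enumerate(labels)}

def one_hot(name, num_classes=len(labels)):
    """Encode a class label as a one-hot vector."""
    v = np.zeros(num_classes)
    v[class_index[name]] = 1.0
    return v

print(one_hot("Grizzly"))  # [0. 1. 0.]
```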

74 of 132

What should the loss be?

0-1 loss (number of misclassifications): discrete, NP-hard to optimize!

Better: a surrogate loss that is continuous, differentiable, and convex.

75 of 132

[0,0,0,0,0,1,0,0,…]

Ground truth label

76 of 132

[Figure: the ground-truth label as a probability distribution over the classes (dolphin, cat, grizzly bear, angel fish, chameleon, iguana, elephant, clown fish): probability 1 on the true class, 0 everywhere else]

77 of 132

[Figure: the model's predicted probability distribution over the classes shown next to the one-hot ground-truth distribution, along with the log probability of each class]

78 of 132

[Figure: the score is the log probability the model assigned to the ground-truth class; the loss is its negation, i.e., how much better the model could have done]

79 of 132

[Figure: build of the previous slide: score and loss for the prediction vs. the ground-truth label]

80 of 132

[Figure: build of the previous slide: score and loss for the prediction vs. the ground-truth label]

81 of 132

Softmax regression (a.k.a. multinomial logistic regression)

logits: vector of K scores, one for each class

squash into a non-negative vector that sums to 1 — i.e. a probability mass function!

[Figure: logits for each class (dolphin, cat, grizzly bear, angel fish, chameleon, iguana, elephant, clown fish) squashed into a probability distribution]
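The softmax squashing step can be sketched in a few lines of NumPy (the max-subtraction is a standard numerical-stability trick, an addition not shown on the slide):

```python
import numpy as np

def softmax(logits):
    """Squash a vector of K scores into a non-negative vector that sums to 1."""
    z = logits - np.max(logits)   # stability trick: avoids overflow in exp
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p.sum())     # sums to 1 (up to float rounding): a probability mass function
print(p.argmax())  # 0: the largest logit gets the most probability
```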

82 of 132

Softmax regression (a.k.a. multinomial logistic regression)

predicted probability of each class given input x

The loss picks out the negative log-likelihood of the ground-truth class under the model's prediction.

Probabilistic interpretation: this is a maximum likelihood learner!
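A sketch of picking out the negative log-likelihood of the ground-truth class (made-up probabilities; the function name `cross_entropy` is illustrative):

```python
import numpy as np

def cross_entropy(probs, target_class):
    """Negative log-likelihood of the ground-truth class under the prediction."""
    return -np.log(probs[target_class])

probs = np.array([0.7, 0.2, 0.1])   # model's predicted distribution
print(cross_entropy(probs, 0))       # small loss: confident and correct
print(cross_entropy(probs, 2))       # large loss: true class got little mass
```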

83 of 132

Softmax regression (a.k.a. multinomial logistic regression)

84 of 132

The Problem of Generalization

85 of 132

Zhang et al. 2016

86 of 132

Thylacine

Chopin

87 of 132

Domain gap between the training domain and the testing domain (where we actually use our model) will cause us to fail to generalize.

[Figure: within the space of natural images, the training data and the test data occupy different, partially overlapping regions]

88 of 132

Linear regression

89 of 132

Linear regression

90 of 132

Polynomial regression

91 of 132

92 of 132

True data-generating process

93 of 132

True data-generating process

94 of 132

Training objective (empirical risk): $\frac{1}{N} \sum_{i=1}^{N} L\big(f(x_i), y_i\big)$

95 of 132

Training objective (empirical risk): $\frac{1}{N} \sum_{i=1}^{N} L\big(f(x_i), y_i\big)$

True objective (expected risk): $\mathbb{E}_{(x,y) \sim p_{\text{data}}}\big[L(f(x), y)\big]$

96 of 132

Generalization

“The central challenge in machine learning is that our algorithm must perform well on new, previously unseen inputs—not just those on which our model was trained. The ability to perform well on previously unobserved inputs is called generalization.

… [this is what] separates machine learning from optimization.”

— Deep Learning textbook (Goodfellow et al.)

97 of 132

2 ☆ 3 = 36

7 ☆ 1 = 49

5 ☆ 2 = 100

2 ☆ 2 = 16

What does ☆ do?

98 of 132

What happens as we add more basis functions?

99 of 132

What happens as we add more basis functions?

K = 1

100 of 132

What happens as we add more basis functions?

K = 2

101 of 132

What happens as we add more basis functions?

K = 3

102 of 132

What happens as we add more basis functions?

K = 4

103 of 132

What happens as we add more basis functions?

K = 5

104 of 132

What happens as we add more basis functions?

K = 6

105 of 132

What happens as we add more basis functions?

K = 7

106 of 132

What happens as we add more basis functions?

K = 8

107 of 132

What happens as we add more basis functions?

K = 9

108 of 132

What happens as we add more basis functions?

K = 10

This phenomenon is called overfitting.

It occurs when the model has too much capacity: too many free parameters, too few data points to pin those parameters down.
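The effect can be reproduced with a toy experiment (a sketch assuming polynomial basis functions, as in the plots, and made-up quadratic data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 11)                          # 11 training points
y = 2 * x**2 - x + 0.15 * rng.normal(size=x.size)   # true function is quadratic

errors = {}
for K in (1, 2, 10):
    coeffs = np.polyfit(x, y, deg=K)                # least-squares polynomial fit
    errors[K] = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"K={K:2d}  train MSE={errors[K]:.5f}")
# Training error only decreases as K grows (K=10 interpolates all 11 points),
# yet the high-K curve oscillates between the points: overfitting.
```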

109 of 132

K = 1

When the model does not have the capacity to capture the true function, we call this underfitting.

An underfit model will have high error on the training points. This error is known as approximation error.

110 of 132

K = 2

The true function is a quadratic, so a quadratic model (K = 2) fits quite well.

111 of 132

K = 10

Now we have zero approximation error — the curve passes exactly through each training point.

But we have high generalization error, reflected in the gap between the true function and the fit line. We want to do well on novel queries, which will be sampled from the green curve (plus noise).

112 of 132

Underfitting (K = 1): high error on train set, high error on test set

Appropriate model (K = 2): low error on train set, low error on test set

Overfitting (K = 10): lowest error on train set, high error on test set

113 of 132

Underfitting (K = 1): simple model, doesn’t fit the training data

Appropriate model (K = 2): simple model, fits the training data

Overfitting (K = 10): complex model, fits the training data

114 of 132

Underfitting

K = 1

Appropriate model

K = 2

Overfitting

K = 10

115 of 132

Example #2: Fitting a classifier

[Figure: three classifiers fit to the same data, illustrating an underfitting model, an appropriate model, and an overfitting model]

116 of 132

We need to control the capacity of the model (e.g., use the appropriate number of free parameters).

The capacity may be defined as the number of hypotheses under consideration in the hypothesis space.

Complex models with many free parameters have high capacity.

Simple models have low capacity.

117 of 132

Training error versus generalization error

[“Deep Learning”, Goodfellow et al.]

118 of 132

How do we know if we are underfitting or overfitting?

Cross-validation: measure prediction error on held-out validation data
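A sketch of validation-based model selection on made-up data (the split sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 32)
y = 2 * x**2 - x + 0.15 * rng.normal(size=x.size)   # true function is quadratic

x_tr, y_tr = x[:12], y[:12]                          # 12 training points
x_va, y_va = x[12:], y[12:]                          # 20 held-out validation points

def mse(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

train_err, val_err = {}, {}
for K in (1, 2, 10):
    coeffs = np.polyfit(x_tr, y_tr, deg=K)
    train_err[K], val_err[K] = mse(coeffs, x_tr, y_tr), mse(coeffs, x_va, y_va)
    print(f"K={K:2d}  train={train_err[K]:.4f}  val={val_err[K]:.4f}")
# K=1 underfits: high error on both sets. K=10 nearly interpolates the 12
# training points but typically does much worse on held-out points (overfitting).
```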

119 of 132

Fitting just right

Underfitting?

    • add more parameters (more features, more layers, etc.)

Overfitting?

    • remove parameters
    • add regularizers

Selecting a hypothesis space of functions with just the right capacity is known as model selection

120 of 132

Regularization

Empirical risk minimization with a regularizer R:

$\min_\theta \frac{1}{N} \sum_{i=1}^{N} L\big(f_\theta(x_i), y_i\big) + \lambda R(\theta)$

121 of 132

Regularized least squares

$\min_\theta \sum_{i=1}^{N} \big(y_i - f_\theta(x_i)\big)^2 + \lambda \|\theta\|^2$

Only use polynomial terms if you really need them! Most terms should be zero.

This is ridge regression, a.k.a. Tikhonov regularization.

(Probabilistic interpretation: R corresponds to a Gaussian prior over the values of the parameters.)
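Ridge regression keeps a closed form: the normal equations just gain a λI term. A NumPy sketch on made-up data (illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = X @ np.array([1.0, 0.0, 0.0, -2.0, 0.0]) + 0.1 * rng.normal(size=20)

def ridge(X, y, lam):
    """theta = (X^T X + lam * I)^(-1) X^T y, i.e. Tikhonov-regularized least squares."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

for lam in (0.0, 1.0, 100.0):
    theta = ridge(X, y, lam)
    print(f"lambda={lam:6.1f}  ||theta||={np.linalg.norm(theta):.3f}")
# Larger lambda shrinks the parameters toward zero (the Gaussian-prior view).
```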

122 of 132

[Adapted from “Deep Learning”, Goodfellow et al.]

High λ — ?

Medium λ — ?

Low λ — ?

[Figure: three fits labeled A, B, and C; match each λ to a curve]

123 of 132

[Adapted from “Deep Learning”, Goodfellow et al.]

High λ — A

Medium λ — B

Low λ — C

[Figure: the same three fits, labeled A, B, and C]

124 of 132

Regularized polynomial least squares regression

125 of 132

What’s the best prior? What’s a principle for selecting it?

[ITILA book, Chapter 28, MacKay, http://www.inference.org.uk/itprnn/book.pdf]

126 of 132

True data-generating process

127 of 132

This (that training and test data come from the same data-generating process) is a huge assumption!

Almost never true in practice!

128 of 132

Much more commonly, the training distribution differs from the test distribution: $p_{\text{train}}(x, y) \neq p_{\text{test}}(x, y)$

129 of 132

Our training data didn’t cover the part of the distribution that was tested

(biased data)

130 of 132

Zhang et al. 2016

131 of 132

Thylacine

Chopin

132 of 132

6. Introduction to Machine Learning

  • “It’s all about the data!”
  • Formalisms of learning (Data, Compute, Objective Function, Hypothesis Space, Optimizer)
  • Case study #1: Linear least-squares
    • Empirical risk minimization
  • Case study #2: Image classification
    • Softmax regression
  • The problem of generalization