1 of 132

Lecture 6

Introduction to Machine Learning

6.8300/1 Advances in Computer Vision

Spring 2024

Sara Beery, Kaiming He, Vincent Sitzmann, Mina Konaković Luković

2 of 132

Announcements

  • Pset 2 due on Thursday
  • PyTorch tutorials Tues 9:30-11am in 54-100 and Thurs 2:30-4pm in 2-190

3 of 132

6. Introduction to Machine Learning

  • “It’s all about the data!”
  • Formalisms of learning (Data, Compute, Objective Function, Hypothesis Space, Optimizer)
  • Case study #1: Linear least-squares
    • Empirical risk minimization
  • Case study #2: Image classification
    • Softmax regression
  • The problem of generalization

4 of 132

Edges

[Figure: edge types annotated in an image: occlusion, change of surface orientation, contact edge, vertical 3D edge, horizontal 3D edge]

5 of 132

[Slide credit: Alexei Efros]

6 of 132

[Hays and Efros. Scene Completion Using Millions of Photographs. SIGGRAPH 2007 and CACM October 2008.]

[Slide credit: Alexei Efros]

7 of 132

2 Million Flickr Images

[Slide credit: Alexei Efros]

8 of 132

… 200 total

[Slide credit: Alexei Efros]

9 of 132

[Slide credit: Alexei Efros]

10 of 132

11 of 132

[Slide credit: Alexei Efros]

12 of 132

[Slide credit: Alexei Efros]

13 of 132

[Slide credit: Alexei Efros]

14 of 132

[Slide credit: Alexei Efros]

15 of 132

[Slide credit: Alexei Efros]

16 of 132

[Slide credit: Alexei Efros]

17 of 132

[Slide credit: Alexei Efros]

18 of 132

… 200 scene matches

[Slide credit: Alexei Efros]

19 of 132

20 of 132

21 of 132

[Slide credit: Alexei Efros]

22 of 132

[Slide credit: Alexei Efros]

23 of 132

[Slide credit: Alexei Efros]

24 of 132

Why does it work?

[Slide credit: Alexei Efros]

25 of 132

[Slide credit: Alexei Efros]

26 of 132

Nearest neighbors from a collection of 20 thousand images

[Slide credit: Alexei Efros]

27 of 132

Nearest neighbors from a collection of 2 million images

[Slide credit: Alexei Efros]

28 of 132

“Unreasonable Effectiveness of Data”

Parts of our world can be explained by elegant mathematics

physics, chemistry, astronomy, etc.

But much cannot

psychology, economics, genetics, etc.

Enter The Data!

Great advances in several fields:

e.g., speech recognition, machine translation

Case study: Google

[Halevy, Norvig, Pereira 2009]

“For many tasks, once we have a billion or so examples, we essentially have a closed set that represents (or at least approximates) what we need…”

[Slide credit: Alexei Efros]

29 of 132

Learning versus inference

This lecture: Learning

Given data, automatically come up with the model that best explains it

Last lecture: Inference

Given a model, make inferences with it

30 of 132

31 of 132

The goal of learning is to extract lessons from past experience in order to solve future problems.

32 of 132

2 ☆ 3 = 36

7 ☆ 1 = 49

5 ☆ 2 = 100

2 ☆ 2 = 16

What does ☆ do?

33 of 132

Past experience

2 ☆ 3 = 36

7 ☆ 1 = 49

5 ☆ 2 = 100

2 ☆ 2 = 16

Future query

3 ☆ 5 = ?

Goal: answer future queries involving ☆

Approach: figure out what ☆ is doing by observing its behavior on examples

34 of 132

2 ☆ 3 = 36

7 ☆ 1 = 49

5 ☆ 2 = 100

2 ☆ 2 = 16

3 ☆ 5

225
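One hypothesis consistent with all four examples is a ☆ b = (a·b)², which predicts 225 for the future query, matching the slide. A quick Python check:

```python
def star(a, b):
    """One hypothesis consistent with the examples: a * b, squared."""
    return (a * b) ** 2

# Past experience: check the hypothesis against every training example.
examples = [(2, 3, 36), (7, 1, 49), (5, 2, 100), (2, 2, 16)]
assert all(star(a, b) == out for a, b, out in examples)

# Future query.
print(star(3, 5))  # 225, matching the slide's answer
```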

35 of 132

Learning from examples

(aka supervised learning)

36 of 132

Learning from examples

(aka supervised learning)

37 of 132

Learning from examples

(aka supervised learning)

38 of 132

Case study #1: Linear least squares

39 of 132

?

40 of 132

Hypothesis space

The relationship between X and Y is roughly linear:

41 of 132

Search for the parameters, θ,

that best fit the data.

Best fit in what sense?

42 of 132

Best fit in what sense?

Search for the parameters, θ,

that best fit the data.

The least-squares objective (aka loss) says the best fit is the function that minimizes the squared error between predictions and target values:

$\min_{\theta} \sum_{i=1}^{N} \big(f_\theta(x_i) - y_i\big)^2$

43 of 132

Best fit in what sense?

Search for the parameters, θ,

that best fit the data.

The least-squares objective (aka loss) says the best fit is the function that minimizes the squared error between predictions and target values:

$\min_{\theta} \sum_{i=1}^{N} \big(f_\theta(x_i) - y_i\big)^2$

44 of 132

Complete learning problem:

$\theta^* = \arg\min_{\theta} \sum_{i=1}^{N} \big(\theta^\top x_i - y_i\big)^2$

45 of 132

?

46 of 132

47 of 132

Why is squared error a good objective?

It follows from two modeling assumptions:

1. Observed Y’s are corrupted by Gaussian noise: $y_i = f_\theta(x_i) + \epsilon_i$, with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$

2. Training datapoints are independently and identically distributed (iid): $p(y_1, \ldots, y_N \mid x_1, \ldots, x_N) = \prod_{i=1}^{N} p(y_i \mid x_i)$

48 of 132

Why is squared error a good objective?

49 of 132

Why is squared error a good objective?

Under the assumption that the data is distributed as $p(y \mid x, \theta) = \mathcal{N}(y;\, f_\theta(x), \sigma^2)$, the most probable estimate of θ is:

$\hat{\theta} = \arg\max_\theta \prod_{i=1}^{N} p(y_i \mid x_i, \theta) = \arg\min_\theta \sum_{i=1}^{N} \big(y_i - f_\theta(x_i)\big)^2$

This is called the maximum likelihood estimator of θ.

50 of 132

How to minimize the objective w.r.t. θ?

Use an optimizer!

[Figure: a "machine with knobs": input goes in, output comes out, and a score is computed; the optimizer turns the knobs (parameters) to improve the score]

51 of 132

How to minimize the objective w.r.t. θ?

In the linear case, the learning problem

$\theta^* = \arg\min_\theta \|X\theta - y\|^2$

has a closed-form solution (the normal equations):

$\theta^* = (X^\top X)^{-1} X^\top y$
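A numerical sketch of the closed-form least-squares solution via the normal equations (NumPy, on a made-up toy dataset, not from the slides):

```python
import numpy as np

# Toy data: y is roughly linear in x, corrupted by Gaussian noise.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(-1, 1, 50)])  # bias column + feature
true_theta = np.array([2.0, -3.0])
y = X @ true_theta + 0.1 * rng.normal(size=50)

# Normal equations: solve (X^T X) theta = X^T y.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)  # close to [2, -3]
```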

52 of 132

Empirical Risk Minimization

(formalization of supervised learning)

Linear least squares learning problem

53 of 132

Empirical Risk Minimization

(formalization of supervised learning)

Objective function

(loss)

Hypothesis space

Training data
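Putting the three ingredients together, empirical risk minimization in symbols (one standard formulation):

```latex
\hat{f} = \arg\min_{f \in \mathcal{F}} \; \frac{1}{N} \sum_{i=1}^{N} L\big(f(x_i),\, y_i\big)
```

where $\mathcal{F}$ is the hypothesis space, $L$ is the loss, and $\{(x_i, y_i)\}_{i=1}^{N}$ is the training data.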

54 of 132

Case study #1: Linear least squares

55 of 132

56 of 132

Example 1: Linear least squares

57 of 132

Example 2: Program induction

58 of 132

Example 3: Deep learning

59 of 132

Space of all functions

True solution (needle)

Hypothesis space (haystack)

Space we will search

60 of 132

Space of all functions

True solution (needle)

Hypothesis space (haystack)

Space we will search

Deep nets

61 of 132

Space of all functions

True solution (needle)

Hypothesis space (haystack)

Space we will search

Linear functions

True solution is linear

62 of 132

Space of all functions

True solution (needle)

Hypothesis space (haystack)

Space we will search

Linear functions

True solution is nonlinear

63 of 132

Space of all functions

True solution (needle)

Hypothesis space (haystack)

Hypotheses consistent with data

64 of 132

Space of all functions

What happens as we increase the data?

True solution (needle)

Hypothesis space (haystack)

Hypotheses consistent with data

65 of 132

Space of all functions

What happens as we shrink the hypothesis space?

True solution (needle)

Hypothesis space (haystack)

Hypotheses consistent with data

66 of 132

Learning for vision

Big questions:

  1. How do you represent the input and output?
  2. What is the objective?
  3. What is the hypothesis space? (e.g., linear, polynomial, neural net?)
  4. How do you optimize? (e.g., gradient descent, Newton’s method?)
  5. What data do you train on?

67 of 132

Case study #2: Image classification

  1. How do you represent the input and output?
  2. What is the objective?
  3. Assume the hypothesis space is sufficiently expressive
  4. Assume we optimize perfectly
  5. Assume we train on exactly the data we care about

68 of 132

Image classification

“Fish”

Classifier

image x

label y

69 of 132

Image classification

“Fish”

Classifier

image x

label y

70 of 132

Image classification

“Fish”

Classifier

image x

label y

71 of 132

Image classification

image x

“Duck”

label y

Classifier

72 of 132

“Fish”

Training data

“Fish”

“Grizzly”

“Chameleon”

73 of 132

How to represent class labels?

Training data

“Fish”

“Grizzly”

“Chameleon”

Training data

1

2

3

One-hot vector

Training data

[1,0,0]

[0,1,0]

[0,0,1]
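The one-hot encoding above can be sketched in a few lines (class names taken from the slide; the helper `one_hot` is illustrative, not from the lecture):

```python
import numpy as np

labels = ["Fish", "Grizzly", "Chameleon"]            # class names from the slide
class_index = {name: k for k, name in enumerate(labels)}

def one_hot(name, num_classes=len(labels)):
    """Encode a class label as a one-hot vector."""
    v = np.zeros(num_classes)
    v[class_index[name]] = 1.0
    return v

print(one_hot("Grizzly"))  # [0. 1. 0.]
```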

74 of 132

What should the loss be?

0-1 loss (number of misclassifications): discrete, NP-hard to optimize!

Better: a surrogate loss that is continuous, differentiable, and convex.

75 of 132

[0,0,0,0,0,1,0,0,…]

Ground truth label

76 of 132

[Figure: the ground-truth label as a probability distribution over the classes (dolphin, cat, grizzly bear, angel fish, chameleon, iguana, elephant, clown fish): probability 1 on the true class, 0 everywhere else]

77 of 132

[Figure: the model's predicted probability distribution over the classes shown next to the one-hot ground-truth distribution, along with the log probability of each class]

78 of 132

[Figure: the score is the log probability the model assigned to the ground-truth class; the loss is its negation, i.e., how much better the model could have done]

79 of 132

[Figure: build of the previous slide: score and loss for the prediction vs. the ground-truth label]

80 of 132

[Figure: build of the previous slide: score and loss for the prediction vs. the ground-truth label]

81 of 132

Softmax regression (a.k.a. multinomial logistic regression)

logits: vector of K scores, one for each class

squash into a non-negative vector that sums to 1 — i.e. a probability mass function!

[Figure: logits for each class (dolphin, cat, grizzly bear, angel fish, chameleon, iguana, elephant, clown fish) squashed into a probability distribution]
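The softmax squashing step can be sketched in a few lines of NumPy (the max-subtraction is a standard numerical-stability trick, an addition not shown on the slide):

```python
import numpy as np

def softmax(logits):
    """Squash a vector of K scores into a non-negative vector that sums to 1."""
    z = logits - np.max(logits)   # stability trick: avoids overflow in exp
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p.sum())     # sums to 1 (up to float rounding): a probability mass function
print(p.argmax())  # 0: the largest logit gets the most probability
```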

82 of 132

Softmax regression (a.k.a. multinomial logistic regression)

predicted probability of each class given input x

The loss picks out the negative log-likelihood of the ground-truth class under the model's prediction.

Probabilistic interpretation: this is a maximum likelihood learner!
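A sketch of picking out the negative log-likelihood of the ground-truth class (made-up probabilities; the function name `cross_entropy` is illustrative):

```python
import numpy as np

def cross_entropy(probs, target_class):
    """Negative log-likelihood of the ground-truth class under the prediction."""
    return -np.log(probs[target_class])

probs = np.array([0.7, 0.2, 0.1])   # model's predicted distribution
print(cross_entropy(probs, 0))       # small loss: confident and correct
print(cross_entropy(probs, 2))       # large loss: true class got little mass
```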

83 of 132

Softmax regression (a.k.a. multinomial logistic regression)

84 of 132

The Problem of Generalization

85 of 132

Zhang et al. 2016

86 of 132

Thylacine

Chopin

87 of 132

Domain gap between the training domain and the testing domain (where we actually use our model) will cause us to fail to generalize.

[Figure: within the space of natural images, the training data and the test data occupy different, partially overlapping regions]

88 of 132

Linear regression

89 of 132

Linear regression

90 of 132

Polynomial regression

91 of 132

92 of 132

True data-generating process

93 of 132

True data-generating process

94 of 132

Training objective (empirical risk): $\frac{1}{N} \sum_{i=1}^{N} L\big(f(x_i), y_i\big)$

95 of 132

Training objective (empirical risk): $\frac{1}{N} \sum_{i=1}^{N} L\big(f(x_i), y_i\big)$

True objective (expected risk): $\mathbb{E}_{(x,y) \sim p_{\text{data}}}\big[L(f(x), y)\big]$

96 of 132

Generalization

“The central challenge in machine learning is that our algorithm must perform well on new, previously unseen inputs—not just those on which our model was trained. The ability to perform well on previously unobserved inputs is called generalization.

… [this is what] separates machine learning from optimization.”

— Deep Learning textbook (Goodfellow et al.)

97 of 132

2 ☆ 3 = 36

7 ☆ 1 = 49

5 ☆ 2 = 100

2 ☆ 2 = 16

What does ☆ do?

98 of 132

What happens as we add more basis functions?

99 of 132

What happens as we add more basis functions?

K = 1

100 of 132

What happens as we add more basis functions?

K = 2

101 of 132

What happens as we add more basis functions?

K = 3

102 of 132

What happens as we add more basis functions?

K = 4

103 of 132

What happens as we add more basis functions?

K = 5

104 of 132

What happens as we add more basis functions?

K = 6

105 of 132

What happens as we add more basis functions?

K = 7

106 of 132

What happens as we add more basis functions?

K = 8

107 of 132

What happens as we add more basis functions?

K = 9

108 of 132

What happens as we add more basis functions?

K = 10

This phenomenon is called overfitting.

It occurs when the model has too much capacity: too many free parameters, too few data points to pin those parameters down.
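The effect can be reproduced with a toy experiment (a sketch assuming polynomial basis functions, as in the plots, and made-up quadratic data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 11)                          # 11 training points
y = 2 * x**2 - x + 0.15 * rng.normal(size=x.size)   # true function is quadratic

errors = {}
for K in (1, 2, 10):
    coeffs = np.polyfit(x, y, deg=K)                # least-squares polynomial fit
    errors[K] = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"K={K:2d}  train MSE={errors[K]:.5f}")
# Training error only decreases as K grows (K=10 interpolates all 11 points),
# yet the high-K curve oscillates between the points: overfitting.
```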

109 of 132

K = 1

When the model does not have the capacity to capture the true function, we call this underfitting.

An underfit model will have high error on the training points. This error is known as approximation error.

110 of 132

K = 2

The true function is a quadratic, so a quadratic model (K = 2) fits quite well.

111 of 132

K = 10

Now we have zero approximation error — the curve passes exactly through each training point.

But we have high generalization error, reflected in the gap between the true function and the fit line. We want to do well on novel queries, which will be sampled from the green curve (plus noise).

112 of 132

Underfitting (K = 1): high error on train set, high error on test set

Appropriate model (K = 2): low error on train set, low error on test set

Overfitting (K = 10): lowest error on train set, high error on test set

113 of 132

Underfitting (K = 1): simple model, doesn’t fit the training data

Appropriate model (K = 2): simple model, fits the training data

Overfitting (K = 10): complex model, fits the training data

114 of 132

Underfitting

K = 1

Appropriate model

K = 2

Overfitting

K = 10

115 of 132

Example #2: Fitting a classifier

[Figure: three classifiers fit to the same data, illustrating an underfitting model, an appropriate model, and an overfitting model]

116 of 132

We need to control the capacity of the model (e.g., use the appropriate number of free parameters).

The capacity may be defined as the number of hypotheses under consideration in the hypothesis space.

Complex models with many free parameters have high capacity.

Simple models have low capacity.

117 of 132

Training error versus generalization error

[“Deep Learning”, Goodfellow et al.]

118 of 132

How do we know if we are underfitting or overfitting?

Cross-validation: measure prediction error on held-out validation data
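A sketch of validation-based model selection on made-up data (the split sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 32)
y = 2 * x**2 - x + 0.15 * rng.normal(size=x.size)   # true function is quadratic

x_tr, y_tr = x[:12], y[:12]                          # 12 training points
x_va, y_va = x[12:], y[12:]                          # 20 held-out validation points

def mse(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

train_err, val_err = {}, {}
for K in (1, 2, 10):
    coeffs = np.polyfit(x_tr, y_tr, deg=K)
    train_err[K], val_err[K] = mse(coeffs, x_tr, y_tr), mse(coeffs, x_va, y_va)
    print(f"K={K:2d}  train={train_err[K]:.4f}  val={val_err[K]:.4f}")
# K=1 underfits: high error on both sets. K=10 nearly interpolates the 12
# training points but typically does much worse on held-out points (overfitting).
```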

119 of 132

Fitting just right

Underfitting?

    • add more parameters (more features, more layers, etc.)

Overfitting?

    • remove parameters
    • add regularizers

Selecting a hypothesis space of functions with just the right capacity is known as model selection

120 of 132

Regularization

Empirical risk minimization with a regularizer R:

$\min_\theta \frac{1}{N} \sum_{i=1}^{N} L\big(f_\theta(x_i), y_i\big) + \lambda R(\theta)$

121 of 132

Regularized least squares

$\min_\theta \sum_{i=1}^{N} \big(y_i - f_\theta(x_i)\big)^2 + \lambda \|\theta\|^2$

Only use polynomial terms if you really need them! Most terms should be zero.

This is ridge regression, a.k.a. Tikhonov regularization.

(Probabilistic interpretation: R corresponds to a Gaussian prior over the values of the parameters.)
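Ridge regression keeps a closed form: the normal equations just gain a λI term. A NumPy sketch on made-up data (illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = X @ np.array([1.0, 0.0, 0.0, -2.0, 0.0]) + 0.1 * rng.normal(size=20)

def ridge(X, y, lam):
    """theta = (X^T X + lam * I)^(-1) X^T y, i.e. Tikhonov-regularized least squares."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

for lam in (0.0, 1.0, 100.0):
    theta = ridge(X, y, lam)
    print(f"lambda={lam:6.1f}  ||theta||={np.linalg.norm(theta):.3f}")
# Larger lambda shrinks the parameters toward zero (the Gaussian-prior view).
```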

122 of 132

[Adapted from “Deep Learning”, Goodfellow et al.]

High λ — ?

Medium λ — ?

Low λ — ?

[Figure: three fits labeled A, B, and C; match each λ to a curve]

123 of 132

[Adapted from “Deep Learning”, Goodfellow et al.]

High λ — A

Medium λ — B

Low λ — C

[Figure: the same three fits, labeled A, B, and C]

124 of 132

Regularized polynomial least squares regression

125 of 132

What’s the best prior? What’s a principle for selecting it?

[ITILA book, Chapter 28, MacKay, http://www.inference.org.uk/itprnn/book.pdf]

126 of 132

True data-generating process

127 of 132

This (that training and test data come from the same data-generating process) is a huge assumption!

Almost never true in practice!

128 of 132

Much more commonly, the training distribution differs from the test distribution: $p_{\text{train}}(x, y) \neq p_{\text{test}}(x, y)$

129 of 132

Our training data didn’t cover the part of the distribution that was tested

(biased data)

130 of 132

Zhang et al. 2016

131 of 132

Thylacine

Chopin

132 of 132

6. Introduction to Machine Learning

  • “It’s all about the data!”
  • Formalisms of learning (Data, Compute, Objective Function, Hypothesis Space, Optimizer)
  • Case study #1: Linear least-squares
    • Empirical risk minimization
  • Case study #2: Image classification
    • Softmax regression
  • The problem of generalization