Lecture 6
Introduction to Machine Learning
6.8300/1 Advances in Computer Vision
Spring 2024
Sara Beery, Kaiming He, Vincent Sitzmann, Mina Konaković Luković
Announcements
6. Introduction to Machine Learning
Edges
Occlusion
Change of surface orientation
Contact edge
Vertical 3D edge
Horizontal 3D edge
[Slide credit: Alexei Efros]
[Hays and Efros. Scene Completion Using Millions of Photographs. SIGGRAPH 2007 and CACM October 2008.]
2 Million Flickr Images
… 200 total
… 200 scene matches
Why does it work?
Nearest neighbors from a collection of 20 thousand images
Nearest neighbors from a collection of 2 million images
“Unreasonable Effectiveness of Data”
Parts of our world can be explained by elegant mathematics
physics, chemistry, astronomy, etc.
But much cannot
psychology, economics, genetics, etc.
Enter The Data!
Great advances in several fields:
e.g., speech recognition, machine translation
Case study: Google
[Halevy, Norvig, Pereira 2009]
“For many tasks, once we have a billion or so examples, we essentially have a closed set that represents (or at least approximates) what we need…”
Learning versus inference
This lecture: Learning
Given data, automatically come up with the model that best explains it
Last lecture: Inference
Given a model, make inferences from it (e.g., predictions on new inputs)
The goal of learning is to extract lessons from past experience in order to solve future problems.
2 ☆ 3 = 36
7 ☆ 1 = 49
5 ☆ 2 = 100
2 ☆ 2 = 16
What does ☆ do?
Past experience
Future query
3 ☆ 5 = ?
Goal: answer future queries involving ☆
Approach: figure out what ☆ is doing by observing its behavior on examples
3 ☆ 5 = 225
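The whole pipeline can be checked mechanically; here is one hypothesis consistent with all four examples (the function name `star` is ours, for illustration):

```python
# Hypothesis: a ☆ b = (a * b) ** 2
def star(a, b):
    return (a * b) ** 2

# Past experience: all four training examples
examples = [(2, 3, 36), (7, 1, 49), (5, 2, 100), (2, 2, 16)]
assert all(star(a, b) == out for a, b, out in examples)

# Future query
print(star(3, 5))  # 225
```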
Learning from examples
(aka supervised learning)
Case study #1: Linear least squares
Hypothesis space
The relationship between X and Y is roughly linear: $f_\theta(x) = \theta_1 x + \theta_0$
Search for the parameters, $\theta = (\theta_0, \theta_1)$, that best fit the data.
Best fit in what sense?
Search for the parameters, $\theta$, that best fit the data.
The least-squares objective (aka loss) says the best fit is the function that minimizes the squared error between predictions and target values:
$J(\theta) = \sum_{i=1}^{N} \left( f_\theta(x_i) - y_i \right)^2$
Complete learning problem:
$\theta^* = \arg\min_\theta \sum_{i=1}^{N} \left( f_\theta(x_i) - y_i \right)^2$
Why is squared error a good objective?
Follows from two modeling assumptions:
1. Observed Y’s are corrupted by Gaussian noise: $y_i = \theta^\top x_i + \epsilon_i$, with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$
2. Training datapoints are independently and identically distributed (iid): $p(y_1, \ldots, y_N \mid x_1, \ldots, x_N, \theta) = \prod_{i=1}^{N} p(y_i \mid x_i, \theta)$
Under the assumption that the data is distributed this way, the most probable estimate of θ is:
$\theta^* = \arg\max_\theta \prod_{i=1}^{N} p(y_i \mid x_i, \theta)$
This is called the max likelihood estimator of θ.
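Why maximizing the likelihood under Gaussian noise reduces to least squares, in one chain (the Gaussian normalizing constant drops out because it does not depend on $\theta$):

```latex
\theta^* = \arg\max_\theta \prod_{i=1}^{N} p(y_i \mid x_i, \theta)
         = \arg\max_\theta \sum_{i=1}^{N} \log p(y_i \mid x_i, \theta)
         = \arg\max_\theta \sum_{i=1}^{N} \left( -\frac{(y_i - \theta^\top x_i)^2}{2\sigma^2} \right)
         = \arg\min_\theta \sum_{i=1}^{N} (y_i - \theta^\top x_i)^2
```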
How to minimize the objective w.r.t. θ?
Use an optimizer!
[Figure: an optimizer as a machine with knobs: an input goes in, the machine produces an output, and a score measures how good that output is; learning turns the knobs to improve the score.]
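Turning the knobs can be sketched as gradient descent on the least-squares objective; the toy data, learning rate, and iteration count below are illustrative assumptions, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 2 * x + 1                          # noiseless samples of y = 2x + 1 for clarity

X = np.stack([x, np.ones_like(x)], axis=1)   # design matrix with a bias column
theta = np.zeros(2)                          # the "knobs" we turn
for _ in range(500):
    grad = 2 * X.T @ (X @ theta - y) / len(y)  # gradient of mean squared error
    theta -= 0.5 * grad                        # step against the gradient
print(theta)  # approaches [2, 1]
```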
How to minimize the objective w.r.t. θ?
In the linear case, there is a closed-form solution (the normal equations): $\theta^* = (X^\top X)^{-1} X^\top Y$
Learning problem
Solution
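A minimal sketch of the closed-form solution on toy data (the data-generating line is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 2 * x + 1 + 0.05 * rng.normal(size=50)   # noisy samples of y = 2x + 1

X = np.stack([x, np.ones_like(x)], axis=1)   # design matrix with a bias column
theta = np.linalg.solve(X.T @ X, X.T @ y)    # theta = (X^T X)^{-1} X^T y
print(theta)  # close to [2, 1]
```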
Empirical Risk Minimization
(formalization of supervised learning)
Linear least squares learning problem: $\theta^* = \arg\min_\theta \sum_{i=1}^{N} (\theta^\top x_i - y_i)^2$
In general: $\theta^* = \arg\min_\theta \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(f_\theta(x_i), y_i)$, where $\mathcal{L}$ is the objective function (loss), $f_\theta$ is drawn from the hypothesis space, and $\{(x_i, y_i)\}_{i=1}^{N}$ is the training data.
Case study #1: Linear least squares
Example 1: Linear least squares
Example 2: Program induction
Example 3: Deep learning
The needle-in-a-haystack picture: within the space of all functions lies the true solution (the needle); the hypothesis space (the haystack) is the space we will search.
Deep nets define a very large hypothesis space.
Linear functions define a small one: if the true solution is linear, the needle is in the haystack; if the true solution is nonlinear, it is not.
Only some hypotheses are consistent with the training data.
What happens as we increase the data? Fewer hypotheses remain consistent with it.
What happens as we shrink the hypothesis space? The search gets easier, but we risk excluding the true solution.
Learning for vision
Big questions:
Case study #2: Image classification
Image classification: a classifier maps an image x to a label y, e.g., “Fish” or “Duck”.
Training data
…
“Fish”
“Grizzly”
“Chameleon”
How to represent class labels?
Training data labels can be represented as integers (“Fish” → 1, “Grizzly” → 2, “Chameleon” → 3) or, better, as one-hot vectors:
“Fish” → [1,0,0]
“Grizzly” → [0,1,0]
“Chameleon” → [0,0,1]
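A minimal sketch of building one-hot vectors for the three example labels (numpy is used here as an illustration):

```python
import numpy as np

# Rows of the identity matrix are exactly the one-hot vectors
classes = ["Fish", "Grizzly", "Chameleon"]
one_hot = {c: np.eye(len(classes), dtype=int)[i] for i, c in enumerate(classes)}
print(one_hot["Grizzly"])  # [0 1 0]
```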
What should the loss be?
0-1 loss (number of misclassifications): discrete, NP-hard to optimize!
Cross-entropy loss: continuous, differentiable, convex.
Ground truth label: a one-hot vector, e.g. [0,0,0,0,0,1,0,0,…], over the classes (dolphin, cat, grizzly bear, angel fish, chameleon, iguana, elephant, clown fish, …).
[Figure sequence: bar charts over these classes comparing the model's prediction with the ground truth: the predicted probability of each class, its log probability (a score between $-\infty$ and 0), and the resulting loss, i.e., the negative log probability of the ground-truth class ("how much better you could have done").]
Softmax regression (a.k.a. multinomial logistic regression)
logits: vector of K scores, one for each class: $z = \theta^\top x$
softmax squashes the logits into a non-negative vector that sums to 1 — i.e. a probability mass function: $\mathrm{softmax}(z)_k = e^{z_k} / \sum_{j=1}^{K} e^{z_j}$
[Figure: the softmax turns the vector of class scores (dolphin, cat, grizzly bear, angel fish, chameleon, iguana, elephant, clown fish, …) into a probability distribution between 0 and 1 over those classes.]
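A minimal numpy sketch of the softmax squashing step (the logit values are made up for illustration):

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; the result is
    # non-negative and sums to 1, i.e. a probability mass function
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # illustrative logits for three classes
p = softmax(scores)
print(p)  # three probabilities summing to 1, ordered like the scores
```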
$p_\theta(\cdot \mid x) = \mathrm{softmax}(\theta^\top x)$: predicted probability of each class given input x
The cross-entropy loss picks out the -log likelihood of the ground truth class under the model prediction: $\mathcal{L} = -\log p_\theta(y \mid x)$
Probabilistic interpretation: softmax regression is a max likelihood learner!
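A sketch of the cross-entropy computation, picking out the -log likelihood of the ground-truth class (the probabilities are made up for illustration):

```python
import numpy as np

def cross_entropy(probs, onehot):
    # -log probability assigned to the ground-truth class
    return -np.log(probs[np.argmax(onehot)])

p = np.array([0.7, 0.2, 0.1])   # model's predicted class probabilities
y = np.array([1, 0, 0])         # ground-truth one-hot label
loss = cross_entropy(p, y)
print(round(loss, 3))  # 0.357, i.e. -log 0.7
```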
The Problem of Generalization
Zhang et al. 2016
Thylacine
Chopin
Domain gap between the training domain $p_{\text{train}}$ and the testing domain $p_{\text{test}}$ (where we actually use our model) will cause us to fail to generalize.
Space of natural images
Training data
Test data
Linear regression vs. polynomial regression, compared against the true data-generating process.
Training objective (empirical risk): $\frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(f_\theta(x_i), y_i)$
True objective (expected risk): $\mathbb{E}_{(x,y) \sim p_{\text{data}}} \left[ \mathcal{L}(f_\theta(x), y) \right]$
Generalization
“The central challenge in machine learning is that our algorithm must perform well on new, previously unseen inputs—not just those on which our model was trained. The ability to perform well on previously unobserved inputs is called generalization.
… [this is what] separates machine learning from optimization.”
— Deep Learning textbook (Goodfellow et al.)
2 ☆ 3 = 36
7 ☆ 1 = 49
5 ☆ 2 = 100
2 ☆ 2 = 16
What does ☆ do?
What happens as we add more basis functions?
[Figure sequence: polynomial fits to the same training points for K = 1, 2, 3, …, 10 basis functions.]
This phenomenon is called overfitting.
It occurs when the model has too much capacity, e.g., too many free parameters and too few data points to pin those parameters down.
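This can be seen numerically: in the sketch below (quadratic toy data, an assumption for illustration), training error only decreases as the number of basis functions K grows, even once the model starts overfitting:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=10)
y = x**2 + 0.1 * rng.normal(size=10)   # true function is a quadratic, plus noise

def train_error(K):
    # Least-squares fit with the polynomial basis [1, x, ..., x^K]
    X = np.stack([x**k for k in range(K + 1)], axis=1)
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((X @ theta - y) ** 2)

errors = {K: train_error(K) for K in (1, 2, 9)}
print(errors)  # training error can only shrink as K grows
```

At K = 9 the model has as many parameters as data points and interpolates the training set, yet that says nothing about error on new queries.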
K = 1
When the model does not have the capacity to capture the true function, we call this underfitting.
An underfit model will have high error on the training points. This error is known as approximation error.
K = 2
The true function is a quadratic, so a quadratic model (K=2) fits quite well.
K = 10
Now we have zero approximation error — the curve passes exactly through each training point.
But we have high generalization error, reflected in the gap between the true function and the fit line. We want to do well on novel queries, which will be sampled from the green curve (plus noise).
Underfitting (K = 1): simple model, doesn’t fit the training data. High error on train set, high error on test set.
Appropriate model (K = 2): simple model, fits the training data. Low error on train set, low error on test set.
Overfitting (K = 10): complex model, fits the training data. Lowest error on train set, high error on test set.
Example #2: Fitting a classifier
Underfitting
Appropriate model
Overfitting
We need to control the capacity of the model (e.g., use the appropriate number of free parameters).
The capacity may be defined as the number of hypotheses under consideration in the hypothesis space.
Complex models with many free parameters have high capacity.
Simple models have low capacity.
Training error versus generalization error
[“Deep Learning”, Goodfellow et al.]
How do we know if we are underfitting or overfitting?
Cross validation: measure prediction error on validation data
Fitting just right
Underfitting?
Overfitting?
Selecting a hypothesis space of functions with just the right capacity is known as model selection
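A minimal sketch of model selection with a held-out validation split (quadratic toy data assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=40)
y = x**2 + 0.1 * rng.normal(size=40)   # true function is a quadratic, plus noise

x_tr, y_tr = x[:30], y[:30]            # training split
x_va, y_va = x[30:], y[30:]            # held-out validation split

def val_error(K):
    # Fit a degree-K polynomial on the training split,
    # then measure prediction error on the validation split
    Xtr = np.stack([x_tr**k for k in range(K + 1)], axis=1)
    theta, *_ = np.linalg.lstsq(Xtr, y_tr, rcond=None)
    Xva = np.stack([x_va**k for k in range(K + 1)], axis=1)
    return np.mean((Xva @ theta - y_va) ** 2)

best_K = min(range(1, 11), key=val_error)
print(best_K)  # typically small, near the true degree of 2
```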
Regularization
Empirical risk minimization with a regularizer $R$: $\theta^* = \arg\min_\theta \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(f_\theta(x_i), y_i) + \lambda R(\theta)$
Regularized least squares: $\theta^* = \arg\min_\theta \sum_{i=1}^{N} (f_\theta(x_i) - y_i)^2 + \lambda \lVert \theta \rVert_2^2$
Only use polynomial terms if you really need them! Most terms should stay near zero.
This is ridge regression, a.k.a. Tikhonov regularization.
(Probabilistic interpretation: R is a Gaussian prior over values of the parameters.)
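A minimal sketch of ridge regression's closed form and its shrinkage effect (random toy data assumed for illustration):

```python
import numpy as np

def ridge_fit(X, y, lam):
    # theta = (X^T X + lam * I)^{-1} X^T y : Tikhonov-regularized least squares
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))
y = rng.normal(size=30)
theta_small = ridge_fit(X, y, 0.01)
theta_large = ridge_fit(X, y, 100.0)
# A larger lambda shrinks the parameters toward zero
print(np.linalg.norm(theta_large) < np.linalg.norm(theta_small))  # True
```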
[Adapted from “Deep Learning”, Goodfellow et al.]
Match the fits A, B, C to high, medium, and low λ.
Answer: High λ — A, Medium λ — B, Low λ — C
Regularized polynomial least squares regression
What’s the best prior? What’s a principle for selecting it?
[ITILA book, Chapter 28, MacKay, http://www.inference.org.uk/itprnn/book.pdf]
True data-generating process: so far we assumed $p_{\text{train}} = p_{\text{test}}$.
This is a huge assumption! Almost never true in practice!
Much more commonly, we have $p_{\text{train}} \neq p_{\text{test}}$: our training data didn’t cover the part of the distribution that was tested (biased data).