Ordinary Least Squares
Using linear algebra to derive the multiple linear regression model.
LECTURE 11
Plan for Next Few Lectures: Modeling
Modeling I: Intro to Modeling, Simple Linear Regression
[Data science lifecycle diagram: Question & Problem Formulation → Data Acquisition → Exploratory Data Analysis → Prediction and Inference → Reports, Decisions, and Solutions]
Modeling II: Different models, loss functions, linearization
Modeling III: Multiple Linear Regression (today)
Today’s Roadmap
Lecture 11, Data 100 Summer 2023
OLS Problem Formulation
Geometric Derivation
Performance: Residuals, Multiple R²
OLS Properties
Multiple Linear Regression Model
Multiple Linear Regression
Define the multiple linear regression model:
$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_p x_p$$
Here $\hat{y}$ is the predicted value (a single prediction) for a single observation with $p$ features $x_1, \dots, x_p$.
NBA 2018-2019 Dataset
How many points does an athlete score per game? PTS (average points/game)
To name a few factors: FG (field goals), AST (assists), 3PA (3-point attempts).
(assist: a pass to a teammate that directly leads to a goal)
Rows of the dataset correspond to individual players.
Multiple Linear Regression Model
How many points does an athlete score per game? PTS (average points/game)
Model PTS using a few factors, FG, AST, and 3PA:
$$\hat{y}_{\text{PTS}} = \theta_0 + \theta_1 \cdot \text{FG} + \theta_2 \cdot \text{AST} + \theta_3 \cdot \text{3PA}$$
Rows of the dataset correspond to individual players.
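To make this concrete, here is a minimal NumPy sketch of evaluating such a model for one player; the coefficient values and feature values below are made up for illustration, not fitted from the actual NBA data.

```python
import numpy as np

# Hypothetical fitted parameters: [intercept, FG, AST, 3PA] (illustrative values only)
theta = np.array([1.2, 2.5, 1.1, 0.4])

# One player's features, with a leading 1 paired with the intercept term
x = np.array([1.0, 6.0, 3.5, 2.0])   # [1, FG, AST, 3PA]

pts_hat = x @ theta                  # dot product: predicted points per game
print(pts_hat)                       # 20.85 for these made-up numbers
```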
Today's Goal: Ordinary Least Squares
1. Choose a model: Multiple Linear Regression
2. Choose a loss function: L2 Loss / Mean Squared Error (MSE)
3. Fit the model: minimize average loss with calculus or geometry
4. Evaluate model performance: visualize; Root MSE, Multiple R²
In statistics, this model + loss combination is called Ordinary Least Squares (OLS).
The solution to OLS is the parameter vector $\hat{\theta}$ that minimizes the average loss, also called the least squares estimate.
Today's Goal: Ordinary Least Squares
For each of our data points, the model makes a prediction:
$$\hat{y}_i = \theta_0 + \theta_1 x_{i,1} + \theta_2 x_{i,2} + \cdots + \theta_p x_{i,p}$$
To express all $n$ of these equations at once: Linear Algebra!!
From one feature to many features
Dataset for the Constant Model: just the outcomes $y_i$.
Dataset for SLR: one feature $x$ plus the outcome $y$.
Dataset for Multiple Linear Regression: $p$ features plus the outcome $y$. Row $i$ holds observation $i$, each column holds one feature (e.g., Feature 2), and the model produces a prediction $\hat{y}_i$ for each row.
[Linear Algebra] Vector Dot Product
The dot product (or inner product) is a vector operation that takes two vectors of the same length and returns a single number ("u dot v"):
$$\vec{u} \cdot \vec{v} = \vec{u}^T \vec{v} = u_1 v_1 + u_2 v_2 + \cdots + u_n v_n = \sum_{i=1}^{n} u_i v_i$$
Sidenote (not in scope): we can interpret the dot product geometrically as $\vec{u} \cdot \vec{v} = \|\vec{u}\| \, \|\vec{v}\| \cos\theta$, where $\theta$ is the angle between $\vec{u}$ and $\vec{v}$.
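A quick NumPy illustration of the dot product (the two vectors are arbitrary examples):

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

print(np.dot(u, v))   # 1*4 + 2*5 + 3*6 = 32.0
print(u @ v)          # the @ operator computes the same dot product
```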
Vector Notation
$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_p x_p$$
We want to collect all the $\theta$'s into a single vector $\theta = [\theta_0, \theta_1, \dots, \theta_p]^T$.
The part $\theta_1 x_1 + \cdots + \theta_p x_p$ looks a little like a dot product... 😤 What about the leftover $\theta_0$ term?
Trick: treat $\theta_0$ (the bias term, or intercept term) as being multiplied by a constant feature equal to 1. Then the whole model becomes one dot product:
$$\hat{y} = \theta_0 \cdot 1 + \theta_1 x_1 + \cdots + \theta_p x_p = \begin{bmatrix}1 & x_1 & \cdots & x_p\end{bmatrix}\theta$$
Matrix Notation
$\hat{y}_1 = \vec{x}_1^T \theta$, where $\vec{x}_1$ is datapoint/observation 1
$\hat{y}_2 = \vec{x}_2^T \theta$, where $\vec{x}_2$ is datapoint/observation 2
...
$\hat{y}_n = \vec{x}_n^T \theta$, where $\vec{x}_n$ is datapoint/observation n
For data point/observation 2, we have $\hat{y}_2 = \vec{x}_2^T \theta$.
Dimension check: $\vec{x}_2^T$ is $1 \times (p+1)$ and $\theta$ is $(p+1) \times 1$, so $\hat{y}_2$ is $1 \times 1$; the predictions $\hat{y}_i$ are single numbers, also called scalars.
Matrix Notation
Expand out each datapoint's (transposed) input: n row vectors, each with dimension (p+1):
$$\hat{y}_i = \begin{bmatrix}1 & x_{i,1} & x_{i,2} & \cdots & x_{i,p}\end{bmatrix}\theta, \quad i = 1, \dots, n$$
Matrix Notation
Vectorize the predictions and parameters to encapsulate all n equations into a single matrix equation:
$$\hat{\mathbb{Y}} = \begin{bmatrix}\hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_n\end{bmatrix} = \begin{bmatrix}1 & x_{1,1} & \cdots & x_{1,p} \\ 1 & x_{2,1} & \cdots & x_{2,p} \\ \vdots & & & \vdots \\ 1 & x_{n,1} & \cdots & x_{n,p}\end{bmatrix}\begin{bmatrix}\theta_0 \\ \theta_1 \\ \vdots \\ \theta_p\end{bmatrix} = \mathbb{X}\theta$$
Here $\mathbb{X}$ is the design matrix, with dimensions n x (p + 1).
The Design Matrix
We can use linear algebra to represent our predictions for all data points at once.
One step in this process is to stack all of our input features together into a design matrix $\mathbb{X}$.
Example design matrix for the NBA data: 708 rows x (3+1) cols, with columns for the all-ones feature, Field Goals, Assists, and 3-Point Attempts.
🤔 What do the rows and columns of the design matrix represent in terms of the observed data?
The Design Matrix
A row of $\mathbb{X}$ corresponds to one observation, e.g., all (p+1) features for datapoint 3.
A column of $\mathbb{X}$ corresponds to a feature, e.g., feature 1 for all n data points.
The special all-ones column is often called the bias/intercept feature.
(Example design matrix: 708 rows x (3+1) cols, with columns for the all-ones feature, Field Goals, Assists, and 3-Point Attempts.)
The Multiple Linear Regression Model using Matrix Notation
We can express our linear model on our entire dataset as follows:
$$\hat{\mathbb{Y}} = \mathbb{X}\theta$$
$\hat{\mathbb{Y}}$ is the prediction vector ($n \times 1$), $\mathbb{X}$ is the design matrix ($n \times (p+1)$), and $\theta$ is the parameter vector ($(p+1) \times 1$).
Note that our true output $\mathbb{Y} = [y_1, y_2, \dots, y_n]^T$ is also a vector.
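As a sketch of how this looks in code (assuming NumPy and pandas; the column values and parameter values are hypothetical), we can stack an all-ones column with the feature columns to form the design matrix and compute every prediction with one matrix-vector product:

```python
import numpy as np
import pandas as pd

# Hypothetical slice of the NBA data: one row per player
nba = pd.DataFrame({
    "FG":  [6.0, 4.2, 8.1],
    "AST": [3.5, 1.0, 5.7],
    "3PA": [2.0, 4.5, 1.3],
})

# Design matrix X: prepend an all-ones column for the intercept -> shape (n, p + 1)
X = np.column_stack([np.ones(len(nba)), nba.to_numpy()])

theta = np.array([1.2, 2.5, 1.1, 0.4])   # [intercept, FG, AST, 3PA], made-up values

Y_hat = X @ theta                        # all n predictions at once
print(Y_hat.shape)                       # (3,)
```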
Linear in Theta
An expression is "linear in theta" if it is a linear combination of the parameters $\theta_0, \theta_1, \dots, \theta_p$.
🤔 Which of the following expressions are linear in theta?
Linear in Theta
An expression is "linear in theta" if it is a linear combination of the parameters $\theta_0, \theta_1, \dots, \theta_p$.
Equivalently, "linear in theta" means the expression can be separated into a matrix product of two terms: a vector of thetas, and a matrix/vector not involving thetas.
Mean Squared Error
Today's Goal: Ordinary Least Squares
✅ 1. Choose a model: Multiple Linear Regression
2. Choose a loss function: L2 Loss / Mean Squared Error (MSE)
3. Fit the model: minimize average loss with calculus or geometry
4. Evaluate model performance: visualize; Root MSE, Multiple R²
More Linear Algebra!!
[Linear Algebra] Vector Norms and the L2 Vector Norm
The norm of a vector is some measure of that vector's size/length.
For the n-dimensional vector $\vec{x} = [x_1, x_2, \dots, x_n]^T$, the L2 vector norm is
$$\|\vec{x}\|_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2} = \sqrt{\sum_{i=1}^{n} x_i^2}$$
[Linear Algebra] The L2 Norm as a Measure of Length
The L2 vector norm is a generalization of the Pythagorean theorem to n dimensions.
It can therefore be used as a measure of the length of a vector: in $\mathbb{R}^2$, $\|\vec{x}\|_2 = \sqrt{x_1^2 + x_2^2}$ is exactly the length of the arrow from the origin to the point $(x_1, x_2)$.
[Linear Algebra] The L2 Norm as a Measure of Distance
The L2 vector norm can also be used as a measure of distance between two vectors: in $\mathbb{R}^2$, and in $\mathbb{R}^n$ generally, the distance between $\vec{u}$ and $\vec{v}$ is $\|\vec{u} - \vec{v}\|_2$.
Note: the square of the L2 norm of a vector is the sum of the squares of the vector's elements:
$$\|\vec{x}\|_2^2 = \sum_{i=1}^{n} x_i^2$$
Looks like Mean Squared Error!!
Mean Squared Error with L2 Norms
We can rewrite mean squared error as a squared L2 norm:
$$R(\theta) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \frac{1}{n}\|\mathbb{Y} - \hat{\mathbb{Y}}\|_2^2$$
With our linear model $\hat{\mathbb{Y}} = \mathbb{X}\theta$:
$$R(\theta) = \frac{1}{n}\|\mathbb{Y} - \mathbb{X}\theta\|_2^2$$
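A small numerical check (with made-up data) that the summation form and the squared-norm form of MSE agree:

```python
import numpy as np

Y = np.array([10.0, 12.5, 7.0])
X = np.array([[1.0, 6.0, 3.5, 2.0],
              [1.0, 4.2, 1.0, 4.5],
              [1.0, 8.1, 5.7, 1.3]])
theta = np.array([1.2, 2.5, 1.1, 0.4])

residual = Y - X @ theta
mse_sum  = np.mean(residual ** 2)                     # (1/n) * sum of squared errors
mse_norm = (np.linalg.norm(residual) ** 2) / len(Y)   # (1/n) * squared L2 norm

print(np.isclose(mse_sum, mse_norm))                  # True
```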
Ordinary Least Squares
The least squares estimate $\hat{\theta}$ is the parameter that minimizes the objective function $R(\theta)$:
$$\hat{\theta} = \underset{\theta}{\arg\min}\; \frac{1}{n}\|\mathbb{Y} - \mathbb{X}\theta\|_2^2$$
🤔 How should we interpret the OLS problem?
A. Minimize the mean squared error for the linear model
B. Minimize the distance between the true and predicted values $\mathbb{Y}$ and $\hat{\mathbb{Y}}$
C. Minimize the length of the residual vector, $\mathbb{Y} - \hat{\mathbb{Y}}$
D. All of the above
E. Something else
Ordinary Least Squares
The least squares estimate $\hat{\theta}$ is the parameter that minimizes the objective function $R(\theta) = \frac{1}{n}\|\mathbb{Y} - \mathbb{X}\theta\|_2^2$.
How should we interpret the OLS problem? All of these interpretations are equivalent (answer D):
A. Minimize the mean squared error for the linear model.
B. Minimize the distance between the true and predicted values $\mathbb{Y}$ and $\hat{\mathbb{Y}}$.
C. Minimize the length of the residual vector, $\mathbb{Y} - \hat{\mathbb{Y}}$.
The geometric interpretation, minimizing the length of the residual vector, is especially important for today.
Interlude
Geometric Derivation
Today's Goal: Ordinary Least Squares
✅ 1. Choose a model: Multiple Linear Regression
✅ 2. Choose a loss function: L2 Loss / Mean Squared Error (MSE)
3. Fit the model: minimize average loss with calculus or geometry
4. Evaluate model performance: visualize; Root MSE, Multiple R²
The calculus derivation of $\hat{\theta}$ requires matrix calculus (out of scope, but here's a link if you're interested). Instead, we will derive $\hat{\theta}$ using a geometric argument.
[Linear Algebra] Span
The set of all possible linear combinations of the columns of $\mathbb{X}$ is called the span of the columns of $\mathbb{X}$ (denoted $\text{span}(\mathbb{X})$), also called the column space.
[Linear Algebra] Matrix-Vector Multiplication
Approach 1: So far, we've thought of our model $\hat{\mathbb{Y}} = \mathbb{X}\theta$ as horizontally stacked predictions per datapoint: $\hat{y}_i = \vec{x}_i^T\theta$.
Approach 2: However, it is sometimes helpful to think of matrix-vector multiplication as being performed by columns. We can also think of $\mathbb{X}\theta$ as a linear combination of the feature (column) vectors, scaled by the parameters:
$$\mathbb{X}\theta = \theta_0\,\mathbb{X}_{:,0} + \theta_1\,\mathbb{X}_{:,1} + \cdots + \theta_p\,\mathbb{X}_{:,p}$$
Prediction is a Linear Combination of Columns
The set of all possible linear combinations of the columns of $\mathbb{X}$ is called the span of the columns of $\mathbb{X}$ (denoted $\text{span}(\mathbb{X})$), also called the column space.
Our prediction $\hat{\mathbb{Y}} = \mathbb{X}\theta$ is a linear combination of the columns of $\mathbb{X}$. Therefore $\hat{\mathbb{Y}} \in \text{span}(\mathbb{X})$.
Interpretation: our linear prediction will always lie in $\text{span}(\mathbb{X})$, even if the true values $\mathbb{Y}$ might not.
Goal: find the vector in $\text{span}(\mathbb{X})$ that is closest to $\mathbb{Y}$.
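The two views of matrix-vector multiplication give identical predictions; here is a minimal NumPy check on a toy design matrix (values made up):

```python
import numpy as np

X = np.array([[1.0, 6.0, 3.5],
              [1.0, 4.2, 1.0],
              [1.0, 8.1, 5.7]])        # toy design matrix
theta = np.array([1.2, 2.5, 1.1])

by_rows = X @ theta                                            # row view: one dot product per observation
by_cols = sum(theta[j] * X[:, j] for j in range(X.shape[1]))   # column view: weighted sum of the columns

print(np.allclose(by_rows, by_cols))   # True: the prediction vector lies in span(X)
```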
A thought experiment
Suppose you are a human being who can only stand on a blue sheet of paper, and you need to get as close as possible (in distance) to a light bulb located at the tip of a red arrow above the sheet. Where do you stand on the blue sheet?
Right below the lightbulb: that's the closest you can get, because you can't travel vertically!
Goal: minimize the L2 norm of the residual vector, i.e., get the predictions $\hat{\mathbb{Y}} = \mathbb{X}\theta$ to be "as close" to our true values $\mathbb{Y}$ as possible.
The residual vector is $e = \mathbb{Y} - \mathbb{X}\theta$.
How do we minimize this distance, the (squared) norm of the residual vector?
The vector in $\text{span}(\mathbb{X})$ that is closest to $\mathbb{Y}$ is the orthogonal projection of $\mathbb{Y}$ onto $\text{span}(\mathbb{X})$. (We will not prove this property of orthogonal projection: see Khan Academy.)
Thus, we should choose the $\theta$ that makes the residual vector $\mathbb{Y} - \mathbb{X}\theta$ orthogonal to $\text{span}(\mathbb{X})$.
[Linear Algebra] Orthogonality
1. Vector $\vec{a}$ and vector $\vec{b}$ are orthogonal if and only if their dot product is 0: $\vec{a}^T\vec{b} = 0$. This is a generalization of the notion of two vectors in 2D being perpendicular.
2. A vector $\vec{v}$ is orthogonal to $\text{span}(M)$, the span of the columns of a matrix $M$, if and only if $\vec{v}$ is orthogonal to each column of $M$.
Let's express 2 in matrix notation. Let $M = [\vec{m}_1 \; \vec{m}_2 \; \cdots \; \vec{m}_d]$, where each $\vec{m}_j$ is a column of $M$. Then $\vec{v}$ is orthogonal to each column of $M$ if and only if
$$M^T\vec{v} = \vec{0},$$
where $\vec{0}$ is the zero vector (a d-length vector full of 0s).
Ordinary Least Squares Proof
The least squares estimate $\hat{\theta}$ is the parameter that minimizes the objective function $R(\theta) = \frac{1}{n}\|\mathbb{Y} - \mathbb{X}\theta\|_2^2$.
Equivalently, this is the $\hat{\theta}$ such that the residual vector $\mathbb{Y} - \mathbb{X}\hat{\theta}$ is orthogonal to $\text{span}(\mathbb{X})$, where $\mathbb{X}$ is the design matrix and $\mathbb{Y} - \mathbb{X}\hat{\theta}$ is the residual vector.
Definition of orthogonality of $\mathbb{Y} - \mathbb{X}\hat{\theta}$ to $\text{span}(\mathbb{X})$ ($\vec{0}$ is the zero vector):
$$\mathbb{X}^T(\mathbb{Y} - \mathbb{X}\hat{\theta}) = \vec{0}$$
Rearrange terms to get the normal equation:
$$\mathbb{X}^T\mathbb{Y} - \mathbb{X}^T\mathbb{X}\hat{\theta} = \vec{0} \quad\Longleftrightarrow\quad \mathbb{X}^T\mathbb{X}\hat{\theta} = \mathbb{X}^T\mathbb{Y}$$
If $\mathbb{X}^T\mathbb{X}$ is invertible:
$$\hat{\theta} = (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T\mathbb{Y}$$
This result is so important that it deserves its own slide. 🎉🎉🎉
$$\hat{\theta} = (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T\mathbb{Y}$$
It is the least squares estimate, and it is the solution to the normal equation $\mathbb{X}^T\mathbb{X}\hat{\theta} = \mathbb{X}^T\mathbb{Y}$.
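A minimal NumPy sanity check of this result on synthetic data (all values made up). Rather than explicitly inverting $\mathbb{X}^T\mathbb{X}$, the sketch solves the normal equation directly, which is numerically preferable:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # design matrix with intercept column
true_theta = np.array([2.0, -1.0, 0.5, 3.0])
Y = X @ true_theta + rng.normal(scale=0.1, size=n)           # noisy observations

# Least squares estimate: solve X^T X theta = X^T Y
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(theta_hat)   # close to true_theta
```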
Least Squares Estimate
✅ 1. Choose a model: Multiple Linear Regression
✅ 2. Choose a loss function: L2 Loss / Mean Squared Error (MSE)
✅ 3. Fit the model: minimize average loss with calculus or geometry
4. Evaluate model performance: visualize; Root MSE, Multiple R²
Performance
Least Squares Estimate
With the model chosen (Multiple Linear Regression), the loss chosen (L2 loss / MSE), and the fit complete ✅✅✅, one step remains: evaluate model performance by visualizing residuals and computing Root MSE and Multiple R².
Multiple Linear Regression
$$\hat{\mathbb{Y}} = \mathbb{X}\theta$$
$\hat{\mathbb{Y}}$ is the prediction vector, $\mathbb{X}$ is the design matrix, and $\theta$ is the parameter vector. Note that our true output $\mathbb{Y}$ is also a vector.
Demo
[Visualization] Residual Plots (see notebook)
Compare:
Simple linear regression: plot residuals vs. the single feature x.
Multiple linear regression: plot residuals vs. the fitted (predicted) values $\hat{y}$.
Same interpretation as before (Data 8 textbook): for a good fit, the residuals should look randomly scattered around zero, with no patterns.
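A sketch of such a residual plot for a multiple linear regression fit, assuming NumPy and matplotlib and using synthetic data in place of the notebook's NBA data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data standing in for the demo dataset
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
Y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(size=n)

theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # least squares fit
Y_hat = X @ theta_hat
residuals = Y - Y_hat

# Multiple linear regression: plot residuals against fitted (predicted) values
plt.scatter(Y_hat, residuals, alpha=0.5)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()
```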
[Metrics] Multiple R²
Compare:
Simple linear regression: error is measured with RMSE; linearity is measured with the correlation coefficient, r.
Multiple linear regression: error is measured with RMSE; linearity is measured with Multiple R², also called the coefficient of determination.
[Metrics] Multiple R²
We define the multiple R² value as the ratio of the variance of our fitted values (predictions) $\hat{y}$ to the variance of our true values $y$:
$$R^2 = \frac{\text{variance of fitted values}}{\text{variance of true values}} = \frac{\sigma^2_{\hat{y}}}{\sigma^2_{y}}$$
It is also called the coefficient of determination.
R² ranges from 0 to 1 and is effectively "the proportion of variance that the model explains."
For OLS with an intercept term (e.g. with a bias column $\theta_0$), R² is equal to the square of the correlation between $y$ and $\hat{y}$.
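A small check (synthetic data, assumptions as in the earlier sketches) that the variance-ratio definition of R² matches the squared correlation between $y$ and $\hat{y}$ when the model has an intercept:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])    # intercept column included
Y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(size=n)
Y_hat = X @ np.linalg.solve(X.T @ X, X.T @ Y)                 # OLS fitted values

r2_variance = np.var(Y_hat) / np.var(Y)            # proportion of variance explained
r2_corr_sq  = np.corrcoef(Y, Y_hat)[0, 1] ** 2     # squared correlation of y and y-hat
print(np.isclose(r2_variance, r2_corr_sq))         # True
```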
[Metrics] Multiple R²
As we add more features, our fitted values tend to become closer and closer to our actual values, so R² increases (compare R² = 0.457 vs. R² = 0.609 for the two models shown in the demo).
Adding more features doesn't always mean our model is better, though! We will see why after the midterm.
OLS Properties
Residual Properties
When using the optimal parameter vector $\hat{\theta}$, our residuals $e = \mathbb{Y} - \hat{\mathbb{Y}}$ are orthogonal to $\text{span}(\mathbb{X})$: $\mathbb{X}^T e = \vec{0}$. Proof: this is the first line of our OLS estimate proof (previous slide).
For all linear models with an intercept term, the sum of residuals is zero: $\sum_{i} e_i = 0$.
For all linear models: since our predicted response $\hat{\mathbb{Y}}$ is in $\text{span}(\mathbb{X})$ by definition, and the residuals are orthogonal to $\text{span}(\mathbb{X})$, the predictions are orthogonal to the residuals: $\hat{\mathbb{Y}}^T e = 0$.
You will prove both properties in homework (proof hint: start from $\mathbb{X}^T e = \vec{0}$).
Properties When Our Model Has an Intercept Term
For all linear models with an intercept term, the sum of residuals is zero (previous slide): $\sum_{i} e_i = 0$.
It follows from the property above that for linear models with intercepts, the average predicted value is equal to the average true value: $\frac{1}{n}\sum_{i}\hat{y}_i = \frac{1}{n}\sum_{i} y_i$.
These properties hold when there is an intercept term, and not necessarily when there isn't.
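These properties are easy to verify numerically; a minimal sketch on synthetic data (made-up values) with an intercept column:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])   # design matrix with intercept column
Y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(size=n)

theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ theta_hat                                        # residual vector

print(np.allclose(X.T @ e, 0))                        # residuals are orthogonal to span(X)
print(np.isclose(e.sum(), 0))                         # residuals sum to zero (intercept present)
print(np.isclose((X @ theta_hat).mean(), Y.mean()))   # average prediction equals average true value
```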
Does a Unique Solution Always Exist?
| Model | Estimate | Unique? |
| Constant Model + MSE | $\hat{\theta}_0 = \text{mean}(y) = \bar{y}$ | Yes. Any set of values has a unique mean. |
| Constant Model + MAE | $\hat{\theta}_0 = \text{median}(y)$ | Yes, if n is odd. No, if n is even (return the average of the middle 2 values). |
| Simple Linear Regression + MSE | $\hat{\theta}_1 = r\frac{\sigma_y}{\sigma_x}$, $\hat{\theta}_0 = \bar{y} - \hat{\theta}_1\bar{x}$ | Yes. Any set of non-constant* values has a unique mean, SD, and correlation coefficient. |
| Ordinary Least Squares (Linear Model + MSE) | $\hat{\theta} = (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T\mathbb{Y}$ | ??? |
Understanding The Solution Matrices
$$\hat{\theta} = (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T\mathbb{Y}$$
In most settings, the number of observations is much larger than the number of features: n >> p.
Understanding The Solution Matrices
The Normal Equation: $\mathbb{X}^T\mathbb{X}\hat{\theta} = \mathbb{X}^T\mathbb{Y}$.
In practice, instead of directly inverting matrices, we can use more efficient numerical solvers to solve this system of linear equations directly.
Note that at least one solution always exists: intuitively, we can always draw a line of best fit for a given set of data, but there may be multiple lines that are "equally good". (A formal proof is beyond this course.)
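For instance (a sketch with made-up data), NumPy offers several equivalent routes; the solver-based ones avoid forming an explicit inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
Y = rng.normal(size=50)

theta_inv   = np.linalg.inv(X.T @ X) @ X.T @ Y       # explicit inverse (matches the formula)
theta_solve = np.linalg.solve(X.T @ X, X.T @ Y)      # solve the normal equation as a linear system
theta_lstsq = np.linalg.lstsq(X, Y, rcond=None)[0]   # dedicated least squares solver

print(np.allclose(theta_inv, theta_solve), np.allclose(theta_solve, theta_lstsq))
```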
Uniqueness of a Solution: Proof
Claim: the least squares estimate $\hat{\theta}$ is unique if and only if $\mathbb{X}$ is full column rank.
Proof sketch: the normal equation $\mathbb{X}^T\mathbb{X}\hat{\theta} = \mathbb{X}^T\mathbb{Y}$ has a unique solution exactly when $\mathbb{X}^T\mathbb{X}$ is invertible. Since $\text{rank}(\mathbb{X}^T\mathbb{X}) = \text{rank}(\mathbb{X})$, the $(p+1)\times(p+1)$ matrix $\mathbb{X}^T\mathbb{X}$ is invertible exactly when $\mathbb{X}$ has rank p + 1, i.e., when its columns are linearly independent (full column rank).
Uniqueness of a Solution: Interpretation
Claim: the least squares estimate $\hat{\theta}$ is unique if and only if $\mathbb{X}$ is full column rank.
When would we not have unique estimates?
1. If the design matrix is wide, with fewer data points than features (n < p + 1), its p + 1 columns cannot all be linearly independent.
2. If our design matrix has features that are linear combinations of other features.
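A quick way to see case 2 in code: adding a feature that is a linear combination of existing features makes the design matrix rank-deficient (the columns and values below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

X_full = np.column_stack([np.ones(n), x1, x2])            # linearly independent columns
X_bad  = np.column_stack([np.ones(n), x1, x2, x1 + x2])   # last column = x1 + x2

print(np.linalg.matrix_rank(X_full), X_full.shape[1])   # 3 3  -> full column rank: unique theta-hat
print(np.linalg.matrix_rank(X_bad),  X_bad.shape[1])    # 3 4  -> rank-deficient: theta-hat not unique
```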
Does a Unique Solution Always Exist?
| Model | Estimate | Unique? |
| Constant Model + MSE | $\hat{\theta}_0 = \text{mean}(y) = \bar{y}$ | Yes. Any set of values has a unique mean. |
| Constant Model + MAE | $\hat{\theta}_0 = \text{median}(y)$ | Yes, if n is odd. No, if n is even (return the average of the middle 2 values). |
| Simple Linear Regression + MSE | $\hat{\theta}_1 = r\frac{\sigma_y}{\sigma_x}$, $\hat{\theta}_0 = \bar{y} - \hat{\theta}_1\bar{x}$ | Yes. Any set of non-constant* values has a unique mean, SD, and correlation coefficient. |
| Ordinary Least Squares (Linear Model + MSE) | $\hat{\theta} = (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T\mathbb{Y}$ | Yes, if $\mathbb{X}$ is full column rank (all columns linearly independent, # datapoints >> # features). |
Ordinary Least Squares
Content credit: Acknowledgments
Lecture 11