1 of 70

Ordinary Least Squares

Using linear algebra to derive the multiple linear regression model.

Data 100, Summer 2023 @ UC Berkeley

Bella Crouch and Dominic Liu

Content credit: Acknowledgments


LECTURE 11


3 of 70

Plan for Next Few Lectures: Modeling

Modeling I: Intro to Modeling, Simple Linear Regression

Modeling II: Different models, loss functions, linearization

Modeling III: Multiple Linear Regression (today)

[Figure: the data science lifecycle, from question & problem formulation, through data acquisition and exploratory data analysis, to prediction and inference, and finally reports, decisions, and solutions.]

4 of 70

Today’s Roadmap

Lecture 11, Data 100 Summer 2023

OLS Problem Formulation

  • Multiple Linear Regression Model
  • Mean Squared Error

Geometric Derivation

Performance: Residuals, Multiple R2

OLS Properties

  • Residuals
  • The Bias/Intercept Term
  • Existence of a Unique Solution


5 of 70

Multiple Linear Regression Model

Lecture 11, Data 100 Summer 2023


6 of 70

Multiple Linear Regression

Define the multiple linear regression model:

\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_p x_p

Here \hat{y} is a single prediction: the predicted value of a single observation described by p features x_1, \dots, x_p.
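To make this scalar form concrete, here is a minimal NumPy sketch of one prediction; all parameter and feature values are made up for illustration.

```python
import numpy as np

# Made-up parameters for a model with p = 3 features.
theta_0 = 2.0                           # intercept / bias term
theta   = np.array([3.1, 0.5, -0.2])    # [theta_1, theta_2, theta_3]

# One observation's p feature values (also made up).
x = np.array([1.5, 0.8, 0.4])

# Single prediction: theta_0 + theta_1*x_1 + ... + theta_p*x_p
y_hat = theta_0 + theta @ x
print(y_hat)
```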

7 of 70

NBA 2018-2019 Dataset

How many points does an athlete score per game? PTS (average points/game)

To name a few factors:

  • FG: average # 2 point field goals
  • AST: average # of assists
  • 3PA: average # 3 point field goals attempted

(assist: a pass to a teammate that directly leads to a goal)

Rows correspond to individual players; FG, AST, and 3PA are columns of the table.

8 of 70

Multiple Linear Regression Model

Same question as the previous slide: predict PTS (average points/game) from a player's FG, AST, and 3PA. A multiple linear regression model uses all three features at once, e.g.

\widehat{\text{PTS}} = \theta_0 + \theta_1 \cdot \text{FG} + \theta_2 \cdot \text{AST} + \theta_3 \cdot \text{3PA}

9 of 70

Today’s Goal: Ordinary Least Squares

1. Choose a model: Multiple Linear Regression
2. Choose a loss function: L2 loss, averaged as Mean Squared Error (MSE)
3. Fit the model: minimize average loss with calculus or geometry
4. Evaluate model performance: visualize, Root MSE, Multiple R²

The solution to OLS is the parameter \hat{\theta} that minimizes the average loss, also called the least squares estimate.

In statistics, this model + loss is called Ordinary Least Squares (OLS).

10 of 70

Today’s Goal: Ordinary Least Squares

For each of our data points, the model is

\hat{y}_i = \theta_0 + \theta_1 x_{i,1} + \theta_2 x_{i,2} + \cdots + \theta_p x_{i,p}

Writing this compactly for all n data points at once calls for... Linear Algebra!!

11 of 70

From one feature to many features


Dataset for SLR

Dataset for Constant Model

Dataset for Multiple Linear Regression

12 of 70

From one feature to many features

Dataset for Multiple Linear Regression: each row is one observation i, each feature (feature 1, feature 2, ..., feature p) is a column, and the model predicts \hat{y}_i from all p features of observation i.

13 of 70

[Linear Algebra] Vector Dot Product

The dot product (or inner product) is a vector operation that

  • can only be carried out on two vectors of the same length,
  • sums up the products of the corresponding entries of the two vectors, and
  • returns a single number:

\vec{u} \cdot \vec{v} = \sum_{i=1}^{n} u_i v_i \qquad \text{(read: “u dot v”)}

Sidenote (not in scope): we can interpret the dot product geometrically:

  • It is the product of three things: the magnitudes of the two vectors and the cosine of the angle between them.
  • Another interpretation: 3Blue1Brown
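A quick NumPy illustration of the dot product (the two vectors here are arbitrary):

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

# Dot product: sum of the products of corresponding entries -> a single number.
print(np.dot(u, v))     # 32.0
print(u @ v)            # same thing, via the @ operator
print(np.sum(u * v))    # same thing, written out explicitly
```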

14 of 70

Vector Notation

14

We want to collect all the ‘s into a single vector

This part looks a little like a dot product…

😤What about this one???

15 of 70

Vector Notation

15

We want to collect all the ‘s into a single vector

bias term, intercept term

16 of 70

Matrix Notation

16

where is datapoint/observation 1

where is datapoint/observation 2

where is datapoint/observation n

17 of 70

Matrix Notation

17

where is datapoint/observation 1

where is datapoint/observation 2

where is datapoint/observation n

For data point/observation 2, we have

Dimension check

also called scalars

18 of 70

Matrix Notation

We have n row vectors \vec{x}_1^T, \dots, \vec{x}_n^T, each with dimension (p+1).

Expand out each datapoint's (transposed) input:

\hat{y}_i = \begin{bmatrix} 1 & x_{i,1} & x_{i,2} & \cdots & x_{i,p} \end{bmatrix} \vec{\theta}

19 of 70

Matrix Notation

With n row vectors, each of dimension (p+1), we can vectorize predictions and parameters to encapsulate all n equations into a single matrix equation:

\hat{Y} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_n \end{bmatrix} = \begin{bmatrix} \vec{x}_1^T \\ \vec{x}_2^T \\ \vdots \\ \vec{x}_n^T \end{bmatrix} \vec{\theta}

20 of 70

Matrix Notation

\hat{Y} = X \vec{\theta}, \quad \text{where } X = \begin{bmatrix} \vec{x}_1^T \\ \vec{x}_2^T \\ \vdots \\ \vec{x}_n^T \end{bmatrix} \text{ is the design matrix with dimensions } n \times (p + 1)

21 of 70

The Design Matrix

We can use linear algebra to represent our predictions of all data points at once.

One step in this process is to stack all of our input features together into a design matrix:

X = \begin{bmatrix} 1 & x_{1,1} & \cdots & x_{1,p} \\ 1 & x_{2,1} & \cdots & x_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & \cdots & x_{n,p} \end{bmatrix}

Example design matrix: 708 rows × (3 + 1) cols, with an all-ones column followed by the Field Goals, Assists, and 3-Point Attempts columns.

What do the rows and columns of the design matrix represent in terms of the observed data? 🤔

22 of 70

The Design Matrix

  • A row corresponds to one observation, e.g., all (p+1) features for datapoint 3.
  • A column corresponds to a feature, e.g., feature 1 for all n data points.
  • The special all-ones feature is often called the bias/intercept column.

In the example design matrix (708 rows × (3 + 1) cols), the columns are the all-ones intercept column, Field Goals, Assists, and 3-Point Attempts.
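For illustration, a minimal sketch of assembling such a design matrix with NumPy/pandas. The DataFrame here is a hypothetical stand-in for the NBA data (made-up values, only a few rows), not the actual lecture demo:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the NBA data (column names and values are made up;
# the real demo uses the full 2018-19 dataset with 708 rows).
nba = pd.DataFrame({
    "FG":  [1.8, 3.2, 5.0, 7.4],
    "AST": [0.6, 2.1, 4.3, 5.5],
    "3PA": [1.1, 2.5, 0.2, 4.0],
    "PTS": [5.1, 9.4, 13.2, 21.0],
})

# Design matrix X: an all-ones intercept column stacked with the p feature columns.
X = np.column_stack([np.ones(len(nba)), nba[["FG", "AST", "3PA"]].to_numpy()])
Y = nba["PTS"].to_numpy()

print(X.shape)   # (n, p + 1) -> (4, 4)
```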

23 of 70

The Multiple Linear Regression Model using Matrix Notation

We can express our linear model on our entire dataset as follows:

\hat{Y} = X \vec{\theta}

where X is the n × (p+1) design matrix, \hat{Y} is the n × 1 prediction vector, and \vec{\theta} is the (p+1) × 1 parameter vector.

Note that our true output Y = [y_1, y_2, \dots, y_n]^T is also a vector.
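A minimal sketch of prediction in matrix form, assuming a tiny made-up design matrix and parameter vector:

```python
import numpy as np

# A tiny hypothetical design matrix: n = 3 observations, p + 1 = 4 columns
# (intercept, FG, AST, 3PA), with made-up values.
X = np.array([
    [1.0, 1.8, 0.6, 1.1],
    [1.0, 3.2, 2.1, 2.5],
    [1.0, 5.0, 4.3, 0.2],
])
theta = np.array([0.5, 2.0, 1.0, 0.8])   # [theta_0, ..., theta_3], made up

# All n predictions at once: a single matrix-vector product.
Y_hat = X @ theta
print(Y_hat)   # one prediction per row of X
```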

24 of 70

Linear in Theta

An expression is “linear in theta” if it is a linear combination of the parameters \theta_0, \theta_1, \dots, \theta_p.

[Expressions 1–5 shown on the slide.]

Which of these expressions are linear in theta?


26 of 70

Linear in Theta

An expression is “linear in theta” if it is a linear combination of the parameters \theta.

“Linear in theta” means the expression can be separated into a matrix product of two terms: a vector of thetas, and a matrix/vector not involving thetas.

27 of 70

Mean Squared Error

Lecture 11, Data 100 Summer 2023


28 of 70

Today’s Goal: Ordinary Least Squares

With the model chosen (multiple linear regression), the next step is choosing the loss function: L2 loss, averaged as Mean Squared Error (MSE). Writing the MSE compactly will take…

More Linear Algebra!!

29 of 70

[Linear Algebra] Vector Norms and the L2 Vector Norm

The norm of a vector is some measure of that vector’s size/length.

  • The two norms we need to know for Data 100 are the L1 and L2 norms (sound familiar?).
  • Today, we focus on L2 norm. We’ll define the L1 norm another day.

For the n-dimensional vector \vec{x} = [x_1, x_2, \dots, x_n]^T, the L2 vector norm is

\|\vec{x}\|_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}
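A quick NumPy check of the definition (the vector is arbitrary):

```python
import numpy as np

x = np.array([3.0, 4.0])

# L2 norm: the square root of the sum of squared entries.
print(np.linalg.norm(x))          # 5.0
print(np.sqrt(np.sum(x ** 2)))    # the same computation, written out
```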

30 of 70

[Linear Algebra] The L2 Norm as a Measure of Length

The L2 vector norm is a generalization of the Pythagorean theorem into n dimensions.

It can therefore be used as a measure of the length of a vector.

[Figure: an example vector in \mathbb{R}^2, whose length is given by the Pythagorean theorem.]

31 of 70

[Linear Algebra] The L2 Norm as a Measure of Distance

The L2 vector norm is a generalization of the Pythagorean theorem into n dimensions.

It can also be used as a measure of distance between two vectors.

  • For n-dimensional vectors \vec{a} and \vec{b}, their distance is \|\vec{a} - \vec{b}\|_2.

Note: the square of the L2 norm of a vector is the sum of the squares of the vector's elements:

\|\vec{x}\|_2^2 = \sum_{i=1}^{n} x_i^2

Looks like Mean Squared Error!!

[Figure: two vectors in \mathbb{R}^2; the distance between them is the length of their difference.]

32 of 70

Mean Squared Error with L2 Norms

We can rewrite mean squared error as a squared L2 norm:

R(\theta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n} \|Y - \hat{Y}\|_2^2

With our linear model \hat{Y} = X\vec{\theta}:

R(\theta) = \frac{1}{n} \|Y - X\vec{\theta}\|_2^2
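A small numerical sanity check that the two forms of MSE agree, on made-up values:

```python
import numpy as np

# Made-up true values and predictions.
Y     = np.array([5.1, 9.4, 13.2, 21.0])
Y_hat = np.array([5.0, 9.0, 14.0, 20.5])

# MSE two ways: the average of squared residuals, and (1/n) * squared L2 norm.
mse_mean = np.mean((Y - Y_hat) ** 2)
mse_norm = np.linalg.norm(Y - Y_hat) ** 2 / len(Y)
print(mse_mean, mse_norm)   # same value (up to floating point)
```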

33 of 70

Ordinary Least Squares

The least squares estimate \hat{\theta} is the parameter \theta that minimizes the objective function R(\theta) = \frac{1}{n}\|Y - X\vec{\theta}\|_2^2:

A. Minimize the mean squared error for the linear model

B. Minimize the distance \|Y - \hat{Y}\|_2 between the true and predicted values Y and \hat{Y}

C. Minimize the length of the residual vector, Y - X\vec{\theta}

D. All of the above

E. Something else

How should we interpret the OLS problem? 🤔


35 of 70

Ordinary Least Squares

Answer: D, all of the above; they are equivalent descriptions of the same minimization. The interpretation that is most important for today is minimizing the length of the residual vector, which sets up the geometric derivation.

36 of 70

Interlude


37 of 70

Geometric Derivation

Lecture 11, Data 100 Summer 2023


38 of 70

Today’s Goal: Ordinary Least Squares

Step 3: fit the model by minimizing average loss.

The calculus derivation of the optimal parameters requires matrix calculus (out of scope, but here's a link if you're interested).

Instead, we will derive the least squares estimate using a geometric argument.

39 of 70

[Linear Algebra] Span

The set of all possible linear combinations of the columns of X is called the span of the columns of X (denoted Span(X)), also called the column space.

  • Intuitively, this is all of the vectors you can "reach" using the columns of X.
  • If each column of X has length n, Span(X) is a subspace of \mathbb{R}^n.

40 of 70

[Linear Algebra] Matrix-Vector Multiplication

Approach 1: So far, we've thought of our model \hat{Y} = X\vec{\theta} as horizontally stacked predictions per datapoint: row i of X dotted with \vec{\theta} gives \hat{y}_i.

Approach 2: However, it is sometimes helpful to think of matrix-vector multiplication as performed by columns. We can also think of X\vec{\theta} as a linear combination of the feature (column) vectors, scaled by the parameters:

X\vec{\theta} = \theta_0 X_{:,0} + \theta_1 X_{:,1} + \cdots + \theta_p X_{:,p}

41 of 70

Prediction is a Linear Combination of Columns

Recall: Span(X), the span of the columns of X, is the set of all vectors you can "reach" as linear combinations of the columns of X; it is a subspace of \mathbb{R}^n.

Our prediction \hat{Y} = X\vec{\theta} is a linear combination of the columns of X. Therefore \hat{Y} \in Span(X).

Interpret: our linear prediction will be in Span(X), even if the true values Y might not be.

Goal: find the vector in Span(X) that is closest to Y.

42 of 70

A thought experiment

Suppose you can only stand on the blue sheet of paper, and you need to get as close as possible (in distance) to a light bulb located at the tip of the red arrow.

Where do you stand on the blue sheet?

43 of 70

A thought experiment

Answer: right below the lightbulb. That's the closest you can get, because you can't travel vertically!

44 of 70

Goal: minimize the L2 norm of the residual vector, i.e., get the predictions \hat{Y} to be "as close" to our true values Y as possible.

The residual vector is e = Y - \hat{Y} = Y - X\vec{\theta}.


46 of 70

How do we minimize this distance, i.e., the (squared) norm of the residual vector?

The vector in Span(X) that is closest to Y is the orthogonal projection of Y onto Span(X).

We will not prove this property of orthogonal projection: see Khan Academy.

47 of 70

Thus, we should choose the \theta that makes the residual vector Y - X\vec{\theta} orthogonal to Span(X).

48 of 70

[Linear Algebra] Orthogonality

1. Vectors \vec{a} and \vec{b} are orthogonal if and only if their dot product is 0: \vec{a}^T \vec{b} = 0.

This is a generalization of the notion of two vectors in 2D being perpendicular.

2. A vector \vec{v} is orthogonal to Span(M), the span of the columns of a matrix M, if and only if \vec{v} is orthogonal to each column in M.

Let's express 2 in matrix notation. Let M = [\vec{m}_1 \; \vec{m}_2 \; \cdots \; \vec{m}_d], where each column \vec{m}_i has length n:

\vec{v} is orthogonal to each column of M \iff M^T \vec{v} = \vec{0}, the zero vector (a d-length vector full of 0s).

49 of 70

Ordinary Least Squares Proof

The least squares estimate \hat{\theta} is the parameter \theta that minimizes the objective function R(\theta) = \frac{1}{n}\|Y - X\vec{\theta}\|_2^2.

Equivalently, this is the \theta such that the residual vector Y - X\vec{\theta} is orthogonal to Span(X).

Definition of orthogonality of the residual vector to Span(X) of the design matrix X, where \vec{0} is the (p+1)-dimensional zero vector:

X^T (Y - X\hat{\theta}) = \vec{0}

Rearrange terms to get the normal equation:

X^T X \hat{\theta} = X^T Y

If X^T X is invertible:

\hat{\theta} = (X^T X)^{-1} X^T Y

50 of 70

This result is so important that it deserves its own slide:

\hat{\theta} = (X^T X)^{-1} X^T Y

It is the least squares estimate \hat{\theta}, and it is the solution to the normal equation X^T X \hat{\theta} = X^T Y.
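A minimal NumPy sketch of computing the least squares estimate on synthetic data (all numbers made up). Solving the normal equation as a linear system is generally preferred over forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: n = 100 observations, p = 3 features (all values made up).
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # n x (p + 1) design matrix
true_theta = np.array([2.0, 0.5, -1.0, 3.0])
Y = X @ true_theta + rng.normal(scale=0.1, size=n)           # noisy responses

# Least squares estimate: solve the normal equation X^T X theta = X^T Y.
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(theta_hat)   # close to true_theta
```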


🎉🎉🎉

52 of 70

Least Squares Estimate

Step 3 (fit the model) is done: the least squares estimate is \hat{\theta} = (X^T X)^{-1} X^T Y.

Up next, step 4: evaluate model performance with visualizations, Root MSE, and Multiple R².

53 of 70

Performance

Lecture 11, Data 100 Summer 2023



55 of 70

Multiple Linear Regression

Recall: \hat{Y} = X\vec{\theta}, with design matrix X, prediction vector \hat{Y}, parameter vector \vec{\theta}, and true output vector Y.

Demo


57 of 70

[Visualization] Residual Plots

See notebook

Simple linear regression: plot residuals vs. the single feature x.

Multiple linear regression: plot residuals vs. the fitted (predicted) values \hat{Y} (see the sketch below).

Same interpretation as before (Data 8 textbook):

  • A good residual plot shows no pattern.
  • A good residual plot also has a similar vertical spread throughout the entire plot. Otherwise (heteroscedasticity), the accuracy of the predictions is not reliable.
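A sketch of both styles of residual plot with matplotlib, using synthetic data rather than the lecture's demo notebook:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=2.0, size=n)

# Fit a two-feature model (x and x^2) via the normal equation, purely for illustration.
X = np.column_stack([np.ones(n), x, x ** 2])
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ theta_hat
residuals = y - y_hat

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(x, residuals, s=10)          # SLR-style: residuals vs. a single feature
axes[0].axhline(0, color="red")
axes[0].set(xlabel="x", ylabel="residual", title="Residuals vs. feature")

axes[1].scatter(y_hat, residuals, s=10)      # MLR-style: residuals vs. fitted values
axes[1].axhline(0, color="red")
axes[1].set(xlabel="fitted value", ylabel="residual", title="Residuals vs. fitted values")
plt.show()
```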

58 of 70

[Metrics] Multiple R²

Compare:

  • Simple linear regression. Error: RMSE. Linearity: correlation coefficient, r.
  • Multiple linear regression. Error: RMSE. Linearity: Multiple R², also called the coefficient of determination.

59 of 70

[Metrics] Multiple R²

We define the multiple R² value as the proportion of variance of our fitted values (predictions) \hat{y}_i relative to our true values y_i:

R^2 = \frac{\text{variance of fitted values}}{\text{variance of } y} = \frac{\sigma^2_{\hat{y}}}{\sigma^2_{y}}

Also called the coefficient of determination.

R² ranges from 0 to 1 and is effectively "the proportion of variance in y that the model explains."

For OLS with an intercept term (e.g., \theta_0), R² is equal to the square of the correlation between y and \hat{y}.

  • For SLR, R² = r², where r is the correlation between x and y.
  • The proof of these last two properties is beyond this course.

(A quick computational check of these facts is sketched below.)
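A quick computational check of these equivalences on synthetic SLR data (all values made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic SLR data: y depends linearly on x plus noise.
n = 200
x = rng.normal(size=n)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

# Fit SLR via the design-matrix formulation.
X = np.column_stack([np.ones(n), x])
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ theta_hat

# Multiple R^2 as a variance ratio, as squared corr(y, y_hat), and as r(x, y)^2.
r2_var  = np.var(y_hat) / np.var(y)
r2_corr = np.corrcoef(y, y_hat)[0, 1] ** 2
r_xy    = np.corrcoef(x, y)[0, 1]

print(r2_var, r2_corr, r_xy ** 2)   # numerically equal for SLR with an intercept
```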

60 of 70

[Metrics] Multiple R²

As we add more features, our fitted values tend to become closer and closer to our actual values. Thus, R² increases.

  • The SLR model (AST only) explains 45.7% of the variance in the true y: R² = 0.457.
  • The AST & 3PA model explains 60.9%: R² = 0.609.

Adding more features doesn't always mean our model is better, though! We will see why after the midterm.

61 of 70

OLS Properties

Lecture 11, Data 100 Summer 2023


62 of 70

Residual Properties

When using the optimal parameter vector \hat{\theta}, our residuals e = Y - \hat{Y} are orthogonal to Span(X):

X^T e = \vec{0}

Proof: first line of our OLS estimate proof (slide).

For all linear models, the predicted response \hat{Y} is in Span(X) by definition, and hence it is orthogonal to the residuals.

For all linear models with an intercept term, the sum of residuals is zero:

\sum_{i=1}^{n} e_i = 0

You will prove both properties in homework. (Proof hint)
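These properties are easy to check numerically. A sketch on synthetic data (made-up values), fitting via the normal equation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data whose design matrix includes an intercept column.
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ theta_hat
e = Y - Y_hat                     # residual vector

print(X.T @ e)                    # ~ zero vector: residuals are orthogonal to Span(X)
print(e.sum())                    # ~ 0: with an intercept, residuals sum to zero
print(Y.mean(), Y_hat.mean())     # equal: average prediction equals average true value
```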

63 of 70

Properties When Our Model Has an Intercept Term

For all linear models with an intercept term, the sum of residuals is zero.

  • This is the real reason why we don't directly use residuals as loss.
  • This is also why positive and negative residuals will cancel out in any residual plot where the (linear) model contains an intercept term, even if the model is terrible.

It follows from the property above that for linear models with intercepts, the average predicted value is equal to the average true value.

These properties are true when there is an intercept term, and not necessarily when there isn't. (previous slide)

64 of 70

Does a Unique Solution Always Exist?

Model: Constant Model + MSE
Estimate: \hat{\theta} = \bar{y} (the mean of the y values)
Unique? Yes. Any set of values has a unique mean.

Model: Constant Model + MAE
Estimate: \hat{\theta} = \text{median}(y)
Unique? Yes, if n is odd. No, if n is even: return the average of the middle 2 values.

Model: Simple Linear Regression + MSE
Estimate: \hat{\theta}_0 = \bar{y} - \hat{\theta}_1 \bar{x}, \quad \hat{\theta}_1 = r \frac{\sigma_y}{\sigma_x}
Unique? Yes. Any set of non-constant* values has a unique mean, SD, and correlation coefficient.

Model: Ordinary Least Squares (Linear Model + MSE)
Estimate: \hat{\theta} = (X^T X)^{-1} X^T Y
Unique? ???

65 of 70

Understanding The Solution Matrices

\hat{\theta} = (X^T X)^{-1} X^T Y, where X^T X has dimensions (p+1) × (p+1) and X^T Y has dimensions (p+1) × 1.

In most settings, # observations >> # features, i.e., n >> p.

66 of 70

Understanding The Solution Matrices

In practice, instead of directly inverting matrices, we can use more efficient numerical solvers to directly solve a system of linear equations.

The Normal Equation: X^T X \hat{\theta} = X^T Y

Note that at least one solution always exists. Intuitively, we can always draw a line of best fit for a given set of data, but there may be multiple lines that are "equally good". (Formal proof is beyond this course.)
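For illustration, three equivalent ways to compute the estimate in NumPy on synthetic data; np.linalg.lstsq is one example of a dedicated least squares solver:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])   # synthetic design matrix
Y = rng.normal(size=n)                                       # synthetic responses

# Three equivalent ways to get the least squares estimate (the first two assume
# X^T X is invertible; lstsq also handles the rank-deficient case).
theta_inv   = np.linalg.inv(X.T @ X) @ X.T @ Y      # textbook formula
theta_solve = np.linalg.solve(X.T @ X, X.T @ Y)     # solve the normal equation directly
theta_lstsq = np.linalg.lstsq(X, Y, rcond=None)[0]  # numerical least squares solver

print(np.allclose(theta_inv, theta_solve), np.allclose(theta_solve, theta_lstsq))
```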

67 of 70

Uniqueness of a Solution: Proof

Claim

The least squares estimate \hat{\theta} is unique if and only if X is full column rank.

Proof

  • The solution to the normal equation X^T X \hat{\theta} = X^T Y is the least squares estimate \hat{\theta}.
  • It has a unique solution if and only if the square matrix X^T X is invertible, which happens if and only if X^T X is full rank.
    • The rank of a square matrix is the max # of linearly independent columns it contains.
    • X^T X has shape (p + 1) x (p + 1), and therefore has max rank p + 1.
  • X^T X and X have the same rank (proof out of scope).
  • Therefore X^T X has rank p + 1 if and only if X has rank p + 1 (full column rank). (A quick numerical check of these rank facts is sketched below.)
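A quick numerical check of the rank facts above, on a synthetic design matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 3

# A generic design matrix: an intercept column plus p random feature columns.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

# X is full column rank iff rank(X) == p + 1, which is also the rank of X^T X.
print(np.linalg.matrix_rank(X))          # 4
print(np.linalg.matrix_rank(X.T @ X))    # 4: same rank as X
```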

68 of 70

Uniqueness of a Solution: Interpretation

Claim: The least squares estimate \hat{\theta} is unique if and only if X is full column rank.

When would we not have unique estimates? (Both cases are illustrated in the sketch below.)

  1. If our design matrix X is "wide":
     • (property of rank) If n < p + 1, then the rank of X is min(n, p + 1) = n < p + 1.
     • In other words, if we have way more features than observations, then \hat{\theta} is not unique.
     • Typically we have n >> p, so this is less of an issue.
  2. If our design matrix has features that are linear combinations of other features:
     • By definition, the rank of X is the number of linearly independent columns in X.
     • Example: if "Width", "Height", and "Perimeter" are all columns, then Perimeter = 2 * Width + 2 * Height, so X is not full rank.
     • Important with one-hot encoding (to discuss later).

[Figure: a "wide" design matrix with p + 1 feature columns but only n < p + 1 data points.]
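A sketch of both failure cases on synthetic data; the Width/Height/Perimeter columns and the wide matrix are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30
width, height = rng.uniform(1, 10, size=(2, n))

# Case 2: a feature that is a linear combination of others (Perimeter = 2W + 2H).
X_collinear = np.column_stack([np.ones(n), width, height, 2 * width + 2 * height])
print(X_collinear.shape[1], np.linalg.matrix_rank(X_collinear))   # 4 columns, rank 3

# Case 1: a "wide" design matrix with more columns than observations (n < p + 1).
X_wide = np.column_stack([np.ones(3), rng.normal(size=(3, 5))])
print(X_wide.shape, np.linalg.matrix_rank(X_wide))                # (3, 6), rank 3 < 6
```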

69 of 70

Does a Unique Solution Always Exist?

Same table as before; the OLS row can now be filled in.

Model: Ordinary Least Squares (Linear Model + MSE)
Estimate: \hat{\theta} = (X^T X)^{-1} X^T Y
Unique? Yes, if X is full column rank (all columns linearly independent; #datapoints >> #features).

70 of 70

Ordinary Least Squares

Content credit: Acknowledgments


Lecture 11