1 of 41

Linear Regression

(Reading: Ch 13)

(Slides adapted from Sandrine Dudoit and Joey Gonzalez)

UC Berkeley Data 100 Summer 2019

Sam Lau

Learning goals:

  • Reframe loss minimization framework for modeling.
  • Introduce multivariable linear models within the loss minimization framework.

2 of 41

Announcements

  • HW4 due Tuesday
  • HW5 out Tuesday, due Friday
  • Leo was out sick! Hopefully back today.
  • Today’s lecture might get split into two
    • Ask lots of questions if we’re going too fast

3 of 41

Last Time

  • Draw conclusions about a population using a sample through statistical estimation.
  • Make estimates by picking the estimator that minimizes empirical risk / loss.
  • Today:
    • Connect estimation with prediction and modeling.
    • First foray into machine learning with linear models.

4 of 41

Modeling

5 of 41

Making Predictions

  • Loss minimization framework useful for predictions too!
  • Suppose we have a dataset of cars and we’d like to predict fuel efficiency (miles per gallon, or mpg):

6 of 41

Models

  • To make a prediction, we choose a model.
    • Takes input data and outputs a prediction.
  • Constant model: prediction = θ
  • Simple linear model: prediction = θ0 + θ1 · x
    • Here x is the input data, the right-hand side is the recipe to compute the prediction, and θ0 and θ1 are the two model weights.

7 of 41

The Constant Model

  • Start simple: for the constant model, how do we pick θ?
  • Intuition: pick θ to be close to most of the values in the data.

8 of 41

Model Loss

  • Use xi to denote what we use to make predictions
  • Use yi to denote what we’re trying to predict
  • But both x and y come from a single sample
  • Idea: Pick the θ that minimizes the average loss between y in our sample and model predictions.

9 of 41

Constant Model Loss

  • Remember this expression from last lecture? L(θ) = (1/n) Σ (yi − θ)²
  • θ = sample mean is the best model parameter.
  • So, for car MPGs, we set θ = mean(mpg)
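
A quick numerical check of this fact (a minimal sketch; the mpg values below are made up, not the course dataset):

    import numpy as np

    mpg = np.array([18.0, 15.0, 26.0, 31.0, 22.0])    # hypothetical MPG sample

    def mse(theta, y):
        """Average squared loss of the constant model: prediction = theta."""
        return np.mean((y - theta) ** 2)

    thetas = np.linspace(10, 35, 251)                  # grid of candidate thetas (step 0.1)
    best = thetas[np.argmin([mse(t, mpg) for t in thetas])]
    print(best, mpg.mean())                            # both are (approximately) 22.4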

10 of 41

Modeling is Estimation in New Clothes

  • Estimation: making best guess at population parameter
  • Modeling: making predictions for population values
  • Two sides of the same coin! Why?
  • Modeling assumes population values are generated by parameters:
  • For the constant model: assume each value in the population is generated by taking a constant θ* and adding noise ϵ, i.e. yi = θ* + ϵi.
  • Estimation = finding θ̂, our best estimate for θ*.
  • Modeling = using θ̂ to make predictions.

11 of 41

The Modeling Pipeline

We choose the model and the loss function in this pipeline:

[Pipeline diagram: Input Data and Model Weight(s) go into the Model, which produces Predictions; the Predictions and the Loss Function produce the Loss.]

Fit a model by finding weights that minimize loss.

Minimizing sample loss approximates minimizing population loss.

12 of 41

The Modeling Recipe

  • Pick a model, pick a loss function, fit the model to sample.
  • Preview of model and loss function combos:

Model                    Loss Function                   Technique Name
Linear model             Squared loss                    Least squares linear regression
Linear model             Squared loss + L1 penalty       Lasso regression
Linear model             Squared loss + L2 penalty       Ridge regression
Linear model             Absolute loss                   Least absolute deviations
Linear model + sigmoid   Cross-entropy loss              Logistic regression

13 of 41

Linear Models

14 of 41

Using Our Data

  • If we’re trying to predict MPG, we can do better than a constant model by incorporating more information.
    • E.g. cars with higher horsepower tend to have lower MPG:

15 of 41

Simple Linear Model

  • We want our predictions to depend on the input data x.
  • Simple linear model: prediction = θ0 + θ1 · x
  • As usual, we can minimize the loss. This time, we have two parameters.

16 of 41

Simple Linear Model

  • This ends up being a lot of algebra, so we’ll skip to the answer.

17 of 41

Skipping Ahead

  • Data 8 textbook has example slope/intercept calculations.
  • Takeaway: Can derive these formulas by minimizing loss.
  • You should know how to take the derivative but won’t need to solve it.
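
For reference, minimizing squared loss gives the Data 8 slope/intercept formulas. A minimal numpy sketch (the x and y arrays are made-up stand-ins for horsepower and mpg):

    import numpy as np

    x = np.array([95.0, 90.0, 113.0, 67.0, 97.0])     # hypothetical horsepower
    y = np.array([18.0, 15.0, 26.0, 31.0, 22.0])      # hypothetical mpg

    # Loss-minimizing weights for: prediction = theta0 + theta1 * x
    r = np.corrcoef(x, y)[0, 1]                        # correlation coefficient
    theta1 = r * np.std(y) / np.std(x)                 # slope
    theta0 = np.mean(y) - theta1 * np.mean(x)          # intercept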

18 of 41

Multivariable Linear Model

  • Simple linear model uses one variable to predict: prediction = θ0 + θ1 · x
  • Time to graduate from Data 8!
  • Multivariable linear model uses one or more variables: prediction = θ0 + θ1·x1 + θ2·x2 + … + θp·xp
  • x is a vector containing one row of input data.
  • IOW: Predict by combining multiple features together.

19 of 41

Intuition

  • Using horsepower and model year to predict mpg
    • Expect θ1 to be negative and θ2 to be positive. Why?

20 of 41

Using Matrix Multiplication

  • This means our model is: prediction = θ0 · 1 + θ1 · (horsepower) + θ2 · (model year)
  • Many terms to write! We’ll use a trick: add a column of 1s to the table:

Bolded letters denote vectors or matrices.

21 of 41

More Notation!

Your turn: Write the matrix expression that computes a vector with a fitted linear model’s predictions for all sample points.

22 of 41

Your Turn

Write the matrix expression that computes a vector with a fitted linear model’s predictions for all sample points.

Answer: ŷ = Xθ̂, where X is the design matrix with one row per sample point (including the column of 1s) and θ̂ is the vector of fitted weights.

23 of 41

Your Turn

Write the matrix expression that computes the average MSE loss for all data points (this is a scalar!).

24 of 41

Your Turn

Write the matrix expression that computes the average MSE loss for all data points (this is a scalar!).

Answer: L(θ) = (1/n)(y − Xθ)ᵀ(y − Xθ) = (1/n)‖y − Xθ‖²

Using matrix notation takes a lot of practice to get used to, but the results are worth it. Always check your dimensions!
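
A small numpy sketch of both expressions (the design matrix, targets, and weights below are made up for illustration):

    import numpy as np

    # Hypothetical design matrix: column of 1s, horsepower, model year
    X = np.array([[1.0,  95.0, 75.0],
                  [1.0,  90.0, 78.0],
                  [1.0, 113.0, 72.0]])
    y = np.array([18.0, 15.0, 26.0])
    theta = np.array([50.0, -0.05, -0.3])    # made-up weight vector

    y_hat = X @ theta                         # X θ: one prediction per sample point
    mse = np.mean((y - y_hat) ** 2)           # (1/n)‖y − Xθ‖²: a scalar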

25 of 41

Fitting a Linear Model

  • How do we pick θ to minimize loss?
  • Want to take partial derivatives with respect to θ0, θ1, ...
  • Instead, we’ll take the gradient and set it equal to zero.
  • This solves for all model weights at once!
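
A quick sketch of the step being skipped: with the matrix MSE from the previous slide, the gradient is ∇θ L(θ) = −(2/n) Xᵀ(y − Xθ). Setting it equal to the zero vector gives XᵀX θ = Xᵀ y, which the next slide solves for θ.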

26 of 41

The Normal Equation

  • Saving the setup for the Gradient Descent lecture
    • Again, you need to know how to take the gradient but not how to solve for θ.
  • Skipping ahead to the answer: θ̂ = (XᵀX)⁻¹Xᵀy
  • The expression above is called the normal equation.
  • It gives a closed-form recipe for fitting a linear model.

What are the matrix shapes in these expressions?
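
A minimal numpy sketch of the fit (the data are made up; np.linalg.solve is used instead of forming the inverse explicitly):

    import numpy as np

    # Hypothetical design matrix: X has shape (n, p + 1) = (4, 3)
    X = np.array([[1.0,  95.0, 75.0],
                  [1.0,  90.0, 78.0],
                  [1.0, 113.0, 72.0],
                  [1.0,  67.0, 80.0]])
    y = np.array([18.0, 15.0, 26.0, 31.0])   # shape (4,)

    # Solve (Xᵀ X) θ = Xᵀ y; XᵀX is (3, 3), Xᵀy and theta_hat have shape (3,)
    theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    y_hat = X @ theta_hat                     # fitted predictions, shape (4,)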

27 of 41

The Abnormal Equation

  • In practice, it takes too long to compute (XᵀX)⁻¹:
  • Inverting an (n × n) matrix takes at least O(n²) time.
    • State of the art: O(n^2.3)
  • Takeaway: analytic solutions are elegant but are sometimes hard to find and slow.
    • Next lecture: gradient descent

28 of 41

Demo: Predicting MPGs

29 of 41

Break!

Fill out Attendance:

http://bit.ly/at-d100

30 of 41

Feature Engineering

(moved to Wed lecture)

31 of 41

Linear Models Level Up

  • Horsepower and mpg have a nonlinear relationship.
  • Can still use linear regression to capture this!
  • Feature engineering: creating new features from the data to give the model more complexity.

32 of 41

Adding Features

  • For now, predict MPG from horsepower alone.
  • Insight: Add a new column to X with horsepower² (see the sketch below).
  • Now we fit a quadratic function!
  • This is still linear in the model weights θ, so we call it a linear model.

(Demo)
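
A minimal sketch of this idea (made-up data; the actual demo may do this differently):

    import numpy as np

    hp = np.array([95.0, 90.0, 113.0, 67.0, 97.0])    # hypothetical horsepower
    mpg = np.array([18.0, 15.0, 26.0, 31.0, 22.0])    # hypothetical mpg

    # Design matrix: column of 1s, horsepower, and the new horsepower² feature
    X = np.column_stack([np.ones(len(hp)), hp, hp ** 2])

    # Still linear in θ, so the normal equation still applies
    theta_hat = np.linalg.solve(X.T @ X, X.T @ mpg)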

33 of 41

Polynomial Regression

  • For polynomial features of degree n, we usually add every possible product of up to n columns.
    • E.g., 4 original columns a, b, c, d with degree 2: add a², b², c², d², ab, ac, ad, bc, bd, cd.
  • Can end up being a lot of columns
  • To cope, use kernel trick (covered in advanced courses)
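
One way to generate these columns is scikit-learn’s PolynomialFeatures (a sketch on made-up data; the course may build the columns differently):

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures

    X = np.array([[1.0, 2.0, 3.0, 4.0],      # 2 rows of hypothetical data,
                  [5.0, 6.0, 7.0, 8.0]])     # 4 original columns

    poly = PolynomialFeatures(degree=2, include_bias=False)
    X_poly = poly.fit_transform(X)            # originals, squares, and pairwise products
    print(X_poly.shape)                       # (2, 14): 4 original + 4 squares + 6 products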

34 of 41

Categorical Features

  • Origin column is correlated with MPG. Can we use it?
  • Idea: Encode categories as numbers in a smart way.
  • Discuss: Why can’t we just encode “usa” as 0, “japan” as 1, “europe” as 2?

35 of 41

One-Hot Encoding

  • One-hot encoding makes one new column for each unique category:
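
A short pandas sketch (the origin categories are from the dataset described above, but the rows are made up):

    import pandas as pd

    df = pd.DataFrame({'origin': ['usa', 'japan', 'europe', 'usa']})

    # One new 0/1 column per unique category
    one_hot = pd.get_dummies(df['origin']).astype(int)
    print(one_hot)
    #    europe  japan  usa
    # 0       0      0    1
    # 1       0      1    0
    # 2       1      0    0
    # 3       0      0    1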

36 of 41

One-Hot Encoding

  • What do you expect the largest weight to be?
  • Can interpret weight as “contribution” of that category

37 of 41

One Hot Problem

  • Problem: Adding a new column for each category makes columns of X linearly dependent! Why?
  • One-hot columns always sum to 1: the usa, japan, and europe columns add up to the all-ones intercept column.
  • This makes XᵀX non-invertible, so the normal equation can’t be solved for a unique θ̂.
38 of 41

Weight Interpretation

  • Invertibility isn’t a problem for gradient descent, but this still affects how we interpret the model weights.
  • Linearly dependent columns can “swap” weights:
    • Left: All categories matter. Right: No categories matter!

[Example from the slide: the weight vectors (0, 3, 3, 3) and (3, 0, 0, 0), ordered as (intercept, usa, japan, europe), produce identical predictions.]

39 of 41

Drop it Like it’s Hot

  • Simple fix: Drop the last one-hot column.
  • In this case, the weight for USA can be interpreted as “change in MPG between USA and Japan”.
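
A sketch of the fix (column names are assumptions; here japan is dropped, so it becomes the baseline category that the usa and europe weights are measured against):

    import pandas as pd

    df = pd.DataFrame({'origin': ['usa', 'japan', 'europe', 'usa']})

    # Drop one one-hot column so the remaining columns are no longer
    # linearly dependent with the intercept column
    one_hot = pd.get_dummies(df['origin']).astype(int).drop(columns=['japan'])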

40 of 41

Features feat. More Features

  • Feature engineering is often domain-specific:
    • Standardizing: “How many SDs away from average?”
    • Log transform: Used to fit exponential models.
    • Absolute difference: “How different is the current temperature from 70°?”
    • Binning data, then one-hot encoding: “Are we driving during morning rush hour? Evening rush hour?”
    • Date-related features: year, month, weekday
    • Image-related features: blurring, edge detection, etc.

41 of 41

Summary

  • Modeling and estimation are closely related.
    • We can view modeling as estimation of model parameters.
  • Linear models can incorporate an arbitrary number of features to make a prediction.
  • Feature engineering extends linear models to generate more complex models.