1 of 41

Linear Regression

(Reading: Ch 13)

(Slides adapted from Sandrine Dudoit and Joey Gonzalez)

UC Berkeley Data 100 Summer 2019

Sam Lau

Learning goals:

  • Reframe loss minimization framework for modeling.
  • Introduce multivariable linear models within the loss minimization framework.

2 of 41

Announcements

  • HW4 due Tuesday
  • HW5 out Tuesday, due Friday
  • Leo was out sick! Hopefully back today.
  • Today’s lecture might get split into two
    • Ask lots of questions if we’re going too fast

3 of 41

Last Time

  • Draw conclusions about a population using a sample through statistical estimation.
  • Make estimates by picking the estimator that minimizes empirical risk / loss.
  • Today:
    • Connect estimation with prediction and modeling.
    • First foray into machine learning with linear models.

4 of 41

Modeling

5 of 41

Making Predictions

  • Loss minimization framework useful for predictions too!
  • Suppose we have a dataset of cars and we’d like to predict fuel efficiency (miles per gallon, or mpg):

6 of 41

Models

  • To make a prediction, we choose a model.
    • Takes input data and outputs a prediction.
  • Constant model: prediction = θ
  • Simple linear model: prediction = θ0 + θ1 · x
    • Here x is the input data, the right-hand side is the recipe to compute the prediction, and θ0 and θ1 are the two model weights.

7 of 41

The Constant Model

  • Start simple: for the constant model, how do we pick θ?
  • Intuition: pick θ to be close to most of the values in the data.

8 of 41

Model Loss

  • Use xi to denote what we use to make predictions
  • Use yi to denote what we’re trying to predict
  • But both x and y come from a single sample
  • Idea: Pick the θ that minimizes the average loss between y in our sample and model predictions.

9 of 41

Constant Model Loss

  • Remember this expression from last lecture? L(θ) = (1/n) Σ (yi − θ)²
  • θ = sample mean is the best model parameter.
  • So, for car MPGs, we set θ = mean(mpg)
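
A quick numerical check of this fact (a minimal sketch; the mpg values below are made up, not the course dataset):

    import numpy as np

    mpg = np.array([18.0, 15.0, 26.0, 31.0, 22.0])    # hypothetical MPG sample

    def mse(theta, y):
        """Average squared loss of the constant model: prediction = theta."""
        return np.mean((y - theta) ** 2)

    thetas = np.linspace(10, 35, 251)                  # grid of candidate thetas (step 0.1)
    best = thetas[np.argmin([mse(t, mpg) for t in thetas])]
    print(best, mpg.mean())                            # both are (approximately) 22.4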

10 of 41

Modeling is Estimation in New Clothes

  • Estimation: making best guess at population parameter
  • Modeling: making predictions for population values
  • Two sides of the same coin! Why?
  • Modeling assumes population values are generated by parameters:
  • For the constant model: assume each value in the population is generated by taking a constant θ* and adding noise ϵ, i.e. yi = θ* + ϵi.
  • Estimation = finding θ̂, our best estimate for θ*.
  • Modeling = using θ̂ to make predictions.

11 of 41

The Modeling Pipeline

We choose the model and the loss function in this pipeline:

[Pipeline diagram: Input Data and Model Weight(s) go into the Model, which produces Predictions; the Predictions and the Loss Function produce the Loss.]

Fit a model by finding weights that minimize loss.

Minimizing sample loss approximates minimizing population loss.

12 of 41

The Modeling Recipe

  • Pick a model, pick a loss function, fit the model to sample.
  • Preview of model and loss function combos:

Model                    Loss Function                   Technique Name
Linear model             Squared loss                    Least squares linear regression
Linear model             Squared loss + L1 penalty       Lasso regression
Linear model             Squared loss + L2 penalty       Ridge regression
Linear model             Absolute loss                   Least absolute deviations
Linear model + sigmoid   Cross-entropy loss              Logistic regression

13 of 41

Linear Models

14 of 41

Using Our Data

  • If we’re trying to predict MPG, we can do better than a constant model by incorporating more information.
    • E.g. cars with higher horsepower tend to have lower MPG:

15 of 41

Simple Linear Model

  • We want our predictions to depend on the input data x.
  • Simple linear model: prediction = θ0 + θ1 · x
  • As usual, we can minimize the loss. This time, we have two parameters.

16 of 41

Simple Linear Model

  • This ends up being a lot of algebra, so we’ll skip to the answer.

17 of 41

Skipping Ahead

  • Data 8 textbook has example slope/intercept calculations.
  • Takeaway: Can derive these formulas by minimizing loss.
  • You should know how to take the derivative but won’t need to solve it.
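
For reference, minimizing squared loss gives the Data 8 slope/intercept formulas. A minimal numpy sketch (the x and y arrays are made-up stand-ins for horsepower and mpg):

    import numpy as np

    x = np.array([95.0, 90.0, 113.0, 67.0, 97.0])     # hypothetical horsepower
    y = np.array([18.0, 15.0, 26.0, 31.0, 22.0])      # hypothetical mpg

    # Loss-minimizing weights for: prediction = theta0 + theta1 * x
    r = np.corrcoef(x, y)[0, 1]                        # correlation coefficient
    theta1 = r * np.std(y) / np.std(x)                 # slope
    theta0 = np.mean(y) - theta1 * np.mean(x)          # intercept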

18 of 41

Multivariable Linear Model

  • Simple linear model uses one variable to predict: prediction = θ0 + θ1 · x
  • Time to graduate from Data 8!
  • Multivariable linear model uses one or more variables: prediction = θ0 + θ1·x1 + θ2·x2 + … + θp·xp
  • x is a vector containing one row of input data.
  • IOW: Predict by combining multiple features together.

19 of 41

Intuition

  • Using horsepower and model year to predict mpg
    • Expect θ1 to be negative and θ2 to be positive. Why?

20 of 41

Using Matrix Multiplication

  • This means our model is: prediction = θ0 · 1 + θ1 · (horsepower) + θ2 · (model year)
  • Many terms to write! We’ll use a trick: add a column of 1s to the table:

Bolded letters denote vectors or matrices.

21 of 41

More Notation!

Your turn: Write the matrix expression that computes a vector with a fitted linear model’s predictions for all sample points.

22 of 41

Your Turn

Write the matrix expression that computes a vector with a fitted linear model’s predictions for all sample points.

Answer: ŷ = Xθ̂, where X is the design matrix with one row per sample point (including the column of 1s) and θ̂ is the vector of fitted weights.

23 of 41

Your Turn

Write the matrix expression that computes the average MSE loss for all data points (this is a scalar!).

24 of 41

Your Turn

Write the matrix expression that computes the average MSE loss for all data points (this is a scalar!).

Answer: L(θ) = (1/n)(y − Xθ)ᵀ(y − Xθ) = (1/n)‖y − Xθ‖²

Using matrix notation takes a lot of practice to get used to, but the results are worth it. Always check your dimensions!
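
A small numpy sketch of both expressions (the design matrix, targets, and weights below are made up for illustration):

    import numpy as np

    # Hypothetical design matrix: column of 1s, horsepower, model year
    X = np.array([[1.0,  95.0, 75.0],
                  [1.0,  90.0, 78.0],
                  [1.0, 113.0, 72.0]])
    y = np.array([18.0, 15.0, 26.0])
    theta = np.array([50.0, -0.05, -0.3])    # made-up weight vector

    y_hat = X @ theta                         # X θ: one prediction per sample point
    mse = np.mean((y - y_hat) ** 2)           # (1/n)‖y − Xθ‖²: a scalar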

25 of 41

Fitting a Linear Model

  • How do we pick θ to minimize loss?
  • Want to take partial derivatives with respect to θ0, θ1, ...
  • Instead, we’ll take the gradient and set it equal to zero.
  • This solves for all model weights at once!
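
A quick sketch of the step being skipped: with the matrix MSE from the previous slide, the gradient is ∇θ L(θ) = −(2/n) Xᵀ(y − Xθ). Setting it equal to the zero vector gives XᵀX θ = Xᵀ y, which the next slide solves for θ.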

26 of 41

The Normal Equation

  • Saving the setup for the Gradient Descent lecture
    • Again, you need to know how to take the gradient but not how to solve for θ.
  • Skipping ahead to the answer: θ̂ = (XᵀX)⁻¹Xᵀy
  • The expression above is called the normal equation.
  • It gives a closed-form recipe for fitting a linear model.

What are the matrix shapes in these expressions?
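
A minimal numpy sketch of the fit (the data are made up; np.linalg.solve is used instead of forming the inverse explicitly):

    import numpy as np

    # Hypothetical design matrix: X has shape (n, p + 1) = (4, 3)
    X = np.array([[1.0,  95.0, 75.0],
                  [1.0,  90.0, 78.0],
                  [1.0, 113.0, 72.0],
                  [1.0,  67.0, 80.0]])
    y = np.array([18.0, 15.0, 26.0, 31.0])   # shape (4,)

    # Solve (Xᵀ X) θ = Xᵀ y; XᵀX is (3, 3), Xᵀy and theta_hat have shape (3,)
    theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    y_hat = X @ theta_hat                     # fitted predictions, shape (4,)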

27 of 41

The Abnormal Equation

  • In practice, it takes too long to compute (XᵀX)⁻¹:
  • Inverting an (n × n) matrix takes at least O(n²) time.
    • State of the art: O(n^2.3)
  • Takeaway: analytic solutions are elegant but are sometimes hard to find and slow.
    • Next lecture: gradient descent

28 of 41

Demo: Predicting MPGs

29 of 41

Break!

Fill out Attendance:

http://bit.ly/at-d100

30 of 41

Feature Engineering

(moved to Wed lecture)

31 of 41

Linear Models Level Up

  • Horsepower and mpg have a nonlinear relationship.
  • Can still use linear regression to capture this!
  • Feature engineering: creating new features from the data to give the model more complexity.

32 of 41

Adding Features

  • For now, predict MPG from horsepower alone.
  • Insight: Add a new column to X with horsepower² (see the sketch below).
  • Now we fit a quadratic function!
  • This is still linear in the model weights θ, so we call it a linear model.

(Demo)
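
A minimal sketch of this idea (made-up data; the actual demo may do this differently):

    import numpy as np

    hp = np.array([95.0, 90.0, 113.0, 67.0, 97.0])    # hypothetical horsepower
    mpg = np.array([18.0, 15.0, 26.0, 31.0, 22.0])    # hypothetical mpg

    # Design matrix: column of 1s, horsepower, and the new horsepower² feature
    X = np.column_stack([np.ones(len(hp)), hp, hp ** 2])

    # Still linear in θ, so the normal equation still applies
    theta_hat = np.linalg.solve(X.T @ X, X.T @ mpg)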

33 of 41

Polynomial Regression

  • For polynomial features of degree n, we usually add every possible product of up to n columns.
    • E.g., 4 original columns a, b, c, d with degree 2: add a², b², c², d², ab, ac, ad, bc, bd, cd.
  • Can end up being a lot of columns
  • To cope, use kernel trick (covered in advanced courses)
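
One way to generate these columns is scikit-learn’s PolynomialFeatures (a sketch on made-up data; the course may build the columns differently):

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures

    X = np.array([[1.0, 2.0, 3.0, 4.0],      # 2 rows of hypothetical data,
                  [5.0, 6.0, 7.0, 8.0]])     # 4 original columns

    poly = PolynomialFeatures(degree=2, include_bias=False)
    X_poly = poly.fit_transform(X)            # originals, squares, and pairwise products
    print(X_poly.shape)                       # (2, 14): 4 original + 4 squares + 6 products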

34 of 41

Categorical Features

  • Origin column is correlated with MPG. Can we use it?
  • Idea: Encode categories as numbers in a smart way.
  • Discuss: Why can’t we just encode “usa” as 0, “japan” as 1, “europe” as 2?

35 of 41

One-Hot Encoding

  • One-hot encoding makes one new column for each unique category:
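
A short pandas sketch (the origin categories are from the dataset described above, but the rows are made up):

    import pandas as pd

    df = pd.DataFrame({'origin': ['usa', 'japan', 'europe', 'usa']})

    # One new 0/1 column per unique category
    one_hot = pd.get_dummies(df['origin']).astype(int)
    print(one_hot)
    #    europe  japan  usa
    # 0       0      0    1
    # 1       0      1    0
    # 2       1      0    0
    # 3       0      0    1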

36 of 41

One-Hot Encoding

  • What do you expect the largest weight to be?
  • Can interpret weight as “contribution” of that category

37 of 41

One Hot Problem

  • Problem: Adding a new column for each category makes columns of X linearly dependent! Why?
  • One-hot columns always sum to 1: the usa, japan, and europe columns add up to the all-ones intercept column.
  • This makes XᵀX non-invertible, so the normal equation can’t be solved for a unique θ̂.
38 of 41

Weight Interpretation

  • Invertibility isn’t a problem for gradient descent, but this still affects how we interpret the model weights.
  • Linearly dependent columns can “swap” weights:
    • Left: All categories matter. Right: No categories matter!

[Example from the slide: the weight vectors (0, 3, 3, 3) and (3, 0, 0, 0), ordered as (intercept, usa, japan, europe), produce identical predictions.]

39 of 41

Drop it Like it’s Hot

  • Simple fix: Drop the last one-hot column.
  • In this case, the weight for USA can be interpreted as “change in MPG between USA and Japan”.
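
A sketch of the fix (column names are assumptions; here japan is dropped, so it becomes the baseline category that the usa and europe weights are measured against):

    import pandas as pd

    df = pd.DataFrame({'origin': ['usa', 'japan', 'europe', 'usa']})

    # Drop one one-hot column so the remaining columns are no longer
    # linearly dependent with the intercept column
    one_hot = pd.get_dummies(df['origin']).astype(int).drop(columns=['japan'])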

40 of 41

Features feat. More Features

  • Feature engineering is often domain-specific:
    • Standardizing: “How many SDs away from average?”
    • Log transform: Used to fit exponential models.
    • Absolute difference: “How different is the current temperature from 70°?”
    • Binning data, then one-hot encoding: “Are we driving during morning rush hour? Evening rush hour?”
    • Date-related features: year, month, weekday
    • Image-related features: blurring, edge detection, etc.

41 of 41

Summary

  • Modeling and estimation are closely related.
    • We can view modeling as estimation of model parameters.
  • Linear models can incorporate an arbitrary number of features to make a prediction.
  • Feature engineering extends linear models to generate more complex models.