1 of 54

Introduction to Modeling

Understanding the usefulness of models, and how loss functions help create them.

Data 100, Summer 2020 @ UC Berkeley

Suraj Rampure, Ani Adhikari, Deborah Nolan, Fernando Perez, Joseph Gonzalez

LECTURE 11

2 of 54

Data science lifecycle

We’re now moving to the fourth stage of the data science lifecycle – Understand the World – where we build models that try to generalize patterns in the data we collected.

3 of 54

What is a model?

4 of 54

What is a model?

For example, we model the acceleration due to gravity on Earth as 9.81 m/s².

  • While this describes the behavior of our system, it is merely an approximation.
  • It doesn’t account for the effects of air resistance, the gravitational effects of other objects, etc.
  • But in practice, it’s accurate enough to be useful!

“Essentially, all models are wrong, but some are useful.”

George Box, Statistician (1919-2013)

A model is an idealized representation of a system.

5 of 54

Why do we build models?

To understand complex phenomena occurring in the world we live in.

  • What factors play a role in the growth of COVID-19?
  • How do an object’s velocity and acceleration impact how far it travels? (Physics: $d = v_0 t + \frac{1}{2} a t^2$.)

Oftentimes, we care about creating models that are simple and interpretable, allowing us to understand the relationships between our variables.

To make accurate predictions about unseen data.

  • Can we predict if this email is spam or not? (Project 2!)
  • Can we generate a one-sentence summary of this 10-page long article?

Other times, we care more about making extremely accurate predictions, at the cost of having an uninterpretable model. These are sometimes called black-box models, and are common in fields like deep learning.

Most of the time, we try to strike a balance between interpretability and accuracy.

6 of 54

From HBO’s Silicon Valley – hot dog or not hot dog? Behind this app is indeed a model.

7 of 54

Physical (or mechanistic) models

Some models, such as the aforementioned model of the acceleration due to gravity on Earth, are laws that govern how the world works. We call these physical models.

8 of 54

Statistical models

Other times, we don’t have such a precise understanding of some natural relationship. In such cases, we collect data and use statistical tools to learn more about the relationships between variables.

9 of 54

Summary statistics and notation

10 of 54

A simple model – the constant model

Suppose you want to build a model to predict some numerical quantity of a population:

  • The percentage tip given at restaurants.
  • The weight of dogs.
  • The GPA of students at UC Berkeley.

One choice of model would be to ignore any relationships between variables, and predict the same number for each individual – i.e., predicting a constant. We call this a summary statistic because it summarizes the data in our sample.

  • For instance, tips given at restaurants likely depend on the total bill price, the time of day, how generous the customers are feeling, etc.
  • Ignoring these factors is a simplifying assumption!

11 of 54

Example – Tips dataset

  • For instance, suppose a waiter collected data about the tips that they received over time while working at some restaurant. (A histogram w/KDE is shown to the right.)
  • We want to pick a constant that “best models” the tips we’ve seen. It seems like most tips are somewhere around 12-20%.
    • 15% seems like a better guess than 25%.
    • But is 15% a better guess than 14%? Hard to tell.
    • We need a precise formulation of all of this.
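
To make this concrete, here is a minimal sketch (our own, not the course’s actual code) that uses Seaborn’s built-in tips dataset (which may differ from the waiter’s data described above) to compute tip percentages and draw a histogram with a KDE overlay.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Seaborn ships a small example "tips" dataset with total_bill and tip columns.
tips = sns.load_dataset("tips")
pct_tip = 100 * tips["tip"] / tips["total_bill"]  # tip as a percentage of the total bill

# Histogram of tip percentages with a KDE overlaid.
sns.histplot(pct_tip, kde=True, stat="density")
plt.xlabel("Tip percentage")
plt.show()
```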

12 of 54

Notation

$\theta$ represents the parameter(s) of our model. This is what we are trying to estimate!

$y$ represents our true observations (e.g. the actual observed tip %s).

$\hat{y}$ represents the predicted observations given by our model (e.g. the predicted tip %s).

$y_i$ represents the ith observation in particular (e.g. $y_4$ is the 4th observed tip).

In general, we represent our collected data as $y_1, y_2, \dots, y_n$.

$\hat{y}_i$ represents the ith prediction in particular (e.g. $\hat{y}_9$ is the predicted tip for the 9th data point).

$\hat{\theta}$ represents the fitted, or optimal, parameter(s) that we solve for. It is our goal to find this!

Parameters are what define our model. We make this more clear in the next slide.

We want to find $\hat{\theta}$ to make the best possible model.

13 of 54

Notation

The constant model can be stated as follows:

$$\hat{y} = \theta$$

  • Parameters are what define our model. Parameters tell us what the relationships between our input and output variables are.
    • Note: not all models have parameters. KDEs are non-parametric models!
  • Our model only has one parameter, $\theta$. Here, the only thing that defines our model is the single number we will predict, regardless of the input.
  • Models can have many parameters (which we often express as a single parameter vector). For example, the simple linear regression model we’ll see in the coming lectures, $\hat{y} = \theta_0 + \theta_1 x$, has two.

  • Our goal is to find the best possible value of our parameter, which we denote with $\hat{\theta}$.
    • We know that $\hat{\theta}$ for the tips dataset is closer to 15% than it is to 25%.

14 of 54

Prediction vs. estimation

These terms are often used somewhat interchangeably, but there is a subtle difference between them.

Estimation is the task of using data to determine model parameters.

Prediction is the task of using a model to predict outputs for unseen data. Once we have estimates for our model’s parameters, we can use our model for prediction.

15 of 54

Loss functions

16 of 54

The cost of doing business (making predictions)

We need some metric of how “good” or “bad” our predictions are. This is what loss functions provide us with. Loss functions quantify how bad a prediction is for a single observation.

  • If our prediction is close to the actual value, we want low loss.
  • If our prediction is far from the actual value, we want high loss.

A natural choice of loss function is actual minus predicted, or $y - \hat{y}$. We call this the error for a single prediction.

  • But this treats “negative” errors and “positive” errors differently.
    • Predicting 16 when the true value is 15 should be penalized the same as predicting 14.
  • This leads to two natural loss functions.

17 of 54

Squared and absolute loss

The most common loss function you’ll see is the squared loss, also known as L2 loss.

  • For a single data point in general, this is $(y - \hat{y})^2$.
  • For our constant model, since $\hat{y} = \theta$, this is $(y - \theta)^2$.

Another common loss function is the absolute loss, also known as L1 loss.

  • For our constant model, for a single point, this is $|y - \theta|$.

There are benefits and drawbacks to both of the above loss functions. We will examine those shortly. These are also not the only possible loss functions; we will see more later.

If our prediction is equal to the actual observation, in both cases, our loss is 0.

Low loss means a good fit!

18 of 54

Loss functions and empirical risk

We care about how bad our model’s predictions are for our entire dataset, not just for one point. A natural measure, then, is the average loss across all points. Assuming $n$ points:

$$\text{average loss} = \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{y}_i)$$

where $L(y_i, \hat{y}_i)$ denotes the loss for the $i$th point.

The average loss of a model tells us how well it fits the given data. If our model has a low average loss across our dataset, that means it is good at making predictions. As such, we want to find the parameter(s) that minimize average loss, in order to make our model as good at making predictions as it can be.

Other names for average loss include empirical risk and objective function.

19 of 54

MSE and MAE

If we choose squared loss as our loss function, then average squared loss is typically referred to as mean squared error (MSE), and is of the following form:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

If we choose absolute loss as our loss function, then average absolute loss is typically referred to as mean absolute error (MAE), and is of the following form:

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$

These definitions hold true, regardless of our model. We want to minimize these quantities.
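
As a concrete illustration (a sketch of our own, not code from the course), here are these loss functions and their averages for the constant model, written with NumPy; the data values are made up.

```python
import numpy as np

def squared_loss(y, y_hat):
    """L2 loss for a single observation (works elementwise on arrays too)."""
    return (y - y_hat) ** 2

def absolute_loss(y, y_hat):
    """L1 loss for a single observation (works elementwise on arrays too)."""
    return np.abs(y - y_hat)

def average_loss(loss_fn, y, theta):
    """Empirical risk of the constant model (y_hat = theta) under the given loss."""
    y = np.asarray(y, dtype=float)
    return np.mean(loss_fn(y, theta))

tips_pct = [15.0, 18.0, 12.0, 20.0]              # hypothetical observed tip percentages
print(average_loss(squared_loss, tips_pct, 16))   # MSE of predicting 16% for everyone
print(average_loss(absolute_loss, tips_pct, 16))  # MAE of predicting 16% for everyone
```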

20 of 54

Exploring MSE

Average loss is typically written as a function of $\theta$, since $\theta$ defines what our model is (and hence what our predictions are). For example, with squared loss and the constant model, our average loss (and hence, the function we want to minimize) is

$$R(\theta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \theta)^2$$

Another way of stating our goal is to find the $\hat{\theta}$ satisfying:

$$\hat{\theta} = \underset{\theta}{\operatorname{argmin}} \; \frac{1}{n} \sum_{i=1}^{n} (y_i - \theta)^2$$

We won’t use this notation again in this lecture, but it will come up again in the future.

Average loss is also a function of our data. But unlike theta, we can’t change our data: it is given to us (i.e. it is fixed).

argmin means “the argument that minimizes the following function.”

21 of 54

Exploring MSE

When our model is the constant model, and we choose to use L2 loss, our average loss again looks like:

$$R(\theta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \theta)^2$$

Let’s examine a toy example. Suppose we have 5 observations, [20, 21, 22, 29, 33].

The loss for the first observation ($y_1 = 20$): $(20 - \theta)^2$.

The average loss across all observations (the MSE): $\frac{1}{5} \left[ (20 - \theta)^2 + (21 - \theta)^2 + (22 - \theta)^2 + (29 - \theta)^2 + (33 - \theta)^2 \right]$.

22 of 54

Exploring MSE

The loss for the first observation ($y_1$) is a parabola, minimized at theta = 20.

The average loss across all observations (the MSE) is also a parabola, minimized at theta = 25.
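
As a quick numeric sanity check (our own sketch, not from the original slides), we can evaluate the constant model’s MSE on the toy data over a grid of candidate theta values and confirm that the minimum sits at the sample mean, 25.

```python
import numpy as np

y = np.array([20, 21, 22, 29, 33])
thetas = np.linspace(15, 40, 501)   # grid of candidate constant predictions

# MSE of the constant model at every candidate theta (broadcast over the grid).
mse = np.mean((y[:, None] - thetas[None, :]) ** 2, axis=0)

print(thetas[np.argmin(mse)])  # 25.0
print(y.mean())                # 25.0, the sample mean
```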

23 of 54

Minimizing mean squared error (MSE)

for the constant model

24 of 54

Minimizing MSE

We saw with the toy example of [20, 21, 22, 29, 33] that the value that minimizes the MSE of the constant model was 25, which was the mean of our observations.

We can try other examples if we want to, and we’ll end up with the same result. Let’s instead pivot to proving this rigorously, using mathematics. There are two ways we’ll go about doing this:

  • Using calculus.
  • Using a neat algebraic trick.

For both derivations, the slides contain the key ideas, but the lecture videos will contain a step-by-step walkthrough.

25 of 54

MSE minimization using calculus

One way to minimize a function is by using calculus: we can take the derivative, set it equal to 0, and solve for the optimizing value.

  • The derivative of a sum of several pieces is equal to the sum of the derivatives of those pieces.
  • The derivative of the loss for a single point is $\frac{d}{d\theta} (y_i - \theta)^2 = -2(y_i - \theta)$.

Then:

$$\frac{d}{d\theta} R(\theta) = \frac{d}{d\theta} \left[ \frac{1}{n} \sum_{i=1}^{n} (y_i - \theta)^2 \right] = \frac{1}{n} \sum_{i=1}^{n} \frac{d}{d\theta} (y_i - \theta)^2 \quad \text{(the derivative of a sum is the sum of the derivatives)}$$

$$= \frac{1}{n} \sum_{i=1}^{n} -2(y_i - \theta) \quad \text{(from above)} = \frac{-2}{n} \sum_{i=1}^{n} (y_i - \theta) \quad \text{(since we can pull constants out of sums)}$$

26 of 54

MSE minimization using calculus

Setting this term to 0, we have:

$$\frac{-2}{n} \sum_{i=1}^{n} (y_i - \hat{\theta}) = 0 \implies \sum_{i=1}^{n} (y_i - \hat{\theta}) = 0 \implies \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} \hat{\theta} = 0 \quad \text{(we can separate sums)}$$

$$\implies \sum_{i=1}^{n} y_i - n\hat{\theta} = 0 \quad (\hat{\theta} + \hat{\theta} + \dots + \hat{\theta} = n\hat{\theta}) \implies \hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y}$$

Thus, with squared loss and the constant model, the sample mean $\bar{y}$ minimizes MSE.

27 of 54

MSE minimization using calculus

We’re not done yet! To be thorough, we need to perform the second derivative test, to guarantee that the point we found is truly a minimum (rather than a maximum or saddle point). We hope that the second derivative of our objective function is positive, indicating our function is convex (it opens upwards).

Fortunately, it is: $\frac{d^2}{d\theta^2} R(\theta) = \frac{d}{d\theta} \left[ \frac{-2}{n} \sum_{i=1}^{n} (y_i - \theta) \right] = \frac{-2}{n} \sum_{i=1}^{n} (-1) = 2 > 0$. So the sample mean truly is the minimizer we were looking for. We will interpret what this means shortly.
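
If you want to double-check the calculus yourself, here is an optional SymPy sketch (ours, not part of the course materials) that differentiates the toy dataset’s MSE, solves for the critical point, and confirms the second derivative is positive.

```python
import sympy as sp

theta = sp.symbols('theta')
y = [20, 21, 22, 29, 33]

# MSE of the constant model on the toy data, as a symbolic function of theta.
R = sp.Rational(1, len(y)) * sum((yi - theta) ** 2 for yi in y)

first_derivative = sp.diff(R, theta)
print(sp.solve(sp.Eq(first_derivative, 0), theta))  # [25]: the sample mean
print(sp.diff(R, theta, 2))                         # 2, which is positive, so it's a minimum
```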

28 of 54

MSE minimization using an algebraic trick

It turns out that in this case, there’s another rather elegant way of performing the same minimization algebraically, but without using calculus.

  • We present this derivation in the next few slides. The lecture video will walk through it in detail.
  • In this proof, you will need to use the fact that the sum of deviations from the mean is 0 – in other words, that $\sum_{i=1}^{n} (y_i - \bar{y}) = 0$. The proof: $\sum_{i=1}^{n} (y_i - \bar{y}) = \sum_{i=1}^{n} y_i - n\bar{y} = n\bar{y} - n\bar{y} = 0$. For example, this mini-proof shows that 1 + 2 + 3 + 4 + 5 is the same as 3 + 3 + 3 + 3 + 3.

  • Our proof will also use the definition of the variance of a sample. As a refresher: $\sigma_y^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2$. Note that this is equal to the MSE of the sample mean!

29 of 54

MSE minimization using an algebraic trick

This proof relies on an algebraic trick. We can write the difference $a - b$ as $(a - c) + (c - b)$, where $a$, $b$, and $c$ are any numbers.

Using that fact, we can write $y_i - \theta = (y_i - \bar{y}) + (\bar{y} - \theta)$, where $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ is our sample mean. Then:

$$R(\theta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \theta)^2$$
$$= \frac{1}{n} \sum_{i=1}^{n} \left[ (y_i - \bar{y}) + (\bar{y} - \theta) \right]^2$$
$$= \frac{1}{n} \sum_{i=1}^{n} \left[ (y_i - \bar{y})^2 + 2(y_i - \bar{y})(\bar{y} - \theta) + (\bar{y} - \theta)^2 \right]$$
$$= \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2 + \frac{2(\bar{y} - \theta)}{n} \sum_{i=1}^{n} (y_i - \bar{y}) + (\bar{y} - \theta)^2$$
$$= \sigma_y^2 + 0 + (\bar{y} - \theta)^2$$

Also note: going from line 3 to line 4, we distribute the sum to the individual terms. This is a property of sums you should become familiar with! In the last line, the middle term is 0 because the sum of deviations from the mean is 0 (from the previous slide), and the first term is the variance of the sample!

30 of 54

Minimization using an algebraic trick

In the previous slide, we showed that $R(\theta) = \sigma_y^2 + (\bar{y} - \theta)^2$.

  • Since variance can’t be negative, the first term is greater than or equal to 0.
    • Of note, the first term doesn’t involve $\theta$ at all. Changing our model won’t change this value, so for the purposes of determining $\hat{\theta}$, we can ignore it.
  • The second term is being squared, and so also must be greater than or equal to 0.
    • This term does involve $\theta$, and so picking the right value of $\theta$ will minimize our average loss.
    • We need to pick the $\theta$ that sets the second term to 0.
    • This is achieved when $\theta = \bar{y}$. In other words: $\hat{\theta} = \bar{y}$.

Looks familiar!

Question: What is the value of average loss, when evaluated at $\hat{\theta}$?

31 of 54

Mean minimizes MSE for the constant model

As we determined in a variety of ways, for the constant model with squared loss, the mean of the dataset is the optimal model.

  • This holds true regardless of the dataset we use, but it’s only true for this combination of model and loss.
  • If we choose any constant other than the sample mean, the empirical risk will not be as small as possible, and so our model is “worse” (for this loss).

This is not all that surprising! It provides some formal reasoning as to why we use means so commonly as summary statistics. It is the best, in some sense.

Note, we now write $\hat{\theta}$ instead of $\theta$. This is because we are referring to the optimal parameter, not just any arbitrary $\theta$.

32 of 54

Minimum value of MSE is the sample variance

It’s worth noting that when we substitute $\hat{\theta} = \bar{y}$ back into our average loss, we obtain a familiar result:

$$R(\hat{\theta}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sigma_y^2$$

That is, the minimum value that mean squared error can take on (again, for the constant model) is the sample variance.

Put another way, the following statement is true whenever $\hat{\theta} = \bar{y}$:

$$\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{\theta})^2 = \sigma_y^2$$
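
A quick numeric check of this result on the toy data (a sketch of our own):

```python
import numpy as np

y = np.array([20, 21, 22, 29, 33])

mse_at_mean = np.mean((y - y.mean()) ** 2)  # R(theta_hat) with theta_hat = the sample mean
sample_variance = np.var(y)                 # np.var uses the 1/n convention by default

print(mse_at_mean, sample_variance)         # both 26.0
```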

33 of 54

Minimizing mean absolute error (MAE)

for the constant model

34 of 54

Exploring MAE

When we use absolute (or L1) loss, we call the average loss mean absolute error. For the constant model, our MAE looks like:

$$R(\theta) = \frac{1}{n} \sum_{i=1}^{n} |y_i - \theta|$$

Let’s again re-visit our toy example of 5 observations, [20, 21, 22, 29, 33].

The loss for the first observation ($y_1 = 20$): $|20 - \theta|$.

The average loss across all observations (the MAE): $\frac{1}{5} \left[ |20 - \theta| + |21 - \theta| + |22 - \theta| + |29 - \theta| + |33 - \theta| \right]$.

35 of 54

Exploring MAE

The loss for the first observation ($y_1$) is an absolute value curve, centered at theta = 20.

The average loss across all observations (the MAE) is some weird, jagged shape... minimized near theta = 22?
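
The same grid trick we used for MSE works here too (again, our own sketch); numerically, the MAE curve is piecewise linear and its minimum lands at the median, 22.

```python
import numpy as np

y = np.array([20, 21, 22, 29, 33])
thetas = np.linspace(15, 40, 501)

# MAE of the constant model at every candidate theta.
mae = np.mean(np.abs(y[:, None] - thetas[None, :]), axis=0)

print(thetas[np.argmin(mae)])  # 22.0
print(np.median(y))            # 22.0, the sample median
```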

37 of 54

Exploring MAE

The shape of the MAE with the constant model seems to be jagged. This is because it is the (weighted) sum of several absolute value curves, which results in a piecewise linear function.

It also doesn’t seem to be immediately clear what the optimal choice of theta should be. It’s somewhere in the “middle” of our points, but it’s clearly not 25, which was the minimizing value for the MSE.

Let’s once again resort to calculus!

The bends, or “kinks,” all appear at our observations! (20, 21, 22, 29, 33)

38 of 54

MAE minimization using calculus

Once again, we can use calculus to determine the optimal $\hat{\theta}$.

The first step is to determine the derivative of our loss function for a single point. Absolute value functions can be written piecewise, as two linear functions:

$$|y_i - \theta| = \begin{cases} y_i - \theta & \theta < y_i \\ \theta - y_i & \theta > y_i \end{cases}$$

The derivative of our loss for a single point, then, is also piecewise:

$$\frac{d}{d\theta} |y_i - \theta| = \begin{cases} -1 & \theta < y_i \\ 1 & \theta > y_i \end{cases}$$

Note: The derivative of the absolute value when the argument is 0 (i.e. when $\theta = y_i$) is technically undefined. We ignore this case in our derivation, since thankfully, it doesn’t change our result.

39 of 54

MAE minimization using calculus

From here, we again use the fact that the derivative of a sum is the sum of the derivatives:

$$\frac{d}{d\theta} R(\theta) = \frac{1}{n} \sum_{i=1}^{n} \frac{d}{d\theta} |y_i - \theta| = \frac{1}{n} \left[ \sum_{\theta < y_i} (-1) + \sum_{\theta > y_i} (1) \right]$$

That is, we add $-1$ for each observation $y_i$ that is greater than our choice of theta, and $+1$ for each observation $y_i$ that is less than our choice of theta.

40 of 54

MAE minimization using calculus

Setting this derivative equal to 0:

$$\frac{1}{n} \left[ \sum_{\theta < y_i} (-1) + \sum_{\theta > y_i} (1) \right] = 0 \implies \sum_{\theta < y_i} 1 = \sum_{\theta > y_i} 1$$

The last line is telling us that in order for our MAE to be minimized, we need to choose a theta such that the number of observations less than theta is equal to the number of observations greater than theta.

42 of 54

MAE minimization using calculus

In order for our MAE to be minimized, we need to choose a theta such that the number of observations less than theta is equal to the number of observations greater than theta. In other words, theta needs to be such that there are an equal number of points to its left and right.

This is the definition of the median! For example, in our toy dataset, the point below in red (22) is the median of our observations. It is the value in the “middle.”

Two points to the left, two points to the right.
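
We can verify this counting condition directly (a tiny sketch of our own):

```python
import numpy as np

y = np.array([20, 21, 22, 29, 33])
theta = np.median(y)  # 22

# Equal numbers of observations on either side of the median.
print(np.sum(y < theta), np.sum(y > theta))  # 2 2
```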

43 of 54

Median minimizes MAE for the constant model

We’ve now shown that the median minimizes MAE for the constant model.

This is consistent with what we saw earlier, when plotting the MAE for our toy dataset: it is minimized at exactly theta = 22.

Important note: In general, the mean and median of a dataset are not the same. Therefore, using MSE and MAE gives us different optimal theta values!

A key takeaway here is that our choice of loss function determines the optimal parameters for our model.

44 of 54

Median minimizes MAE for the constant model

Our toy dataset only had 5 observations. What if it had an even number of observations? Let’s say our toy dataset is now [20, 21, 22, 29, 33, 35]. The 35 is new.

  • There’s no longer a unique solution!
  • Any value in the range [22, 29] minimizes MAE.
  • This reflects the fact that there are an even number of observations, and any number in that range has the same number of points to the left and right.
  • (When there are an even number of data points, we typically set the median to be the mean of the two middle ones. Here, that’d be 25.5.)

Any theta value in this flat region minimizes MAE.
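
Here is a short check (our own sketch) that, for the six-observation dataset, every theta between the two middle values achieves the same minimal MAE, while values outside that range do worse.

```python
import numpy as np

y = np.array([20, 21, 22, 29, 33, 35])

def mae(theta):
    """MAE of the constant model that always predicts theta."""
    return np.mean(np.abs(y - theta))

print(mae(22.0), mae(25.5), mae(29.0))  # all equal (~5.667): the flat region
print(mae(21.0), mae(30.0))             # both larger (6.0): outside [22, 29]
```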

45 of 54

Comparing loss functions

46 of 54

MSE vs. MAE for toy data

Below, we present the plot of the loss surface for our toy dataset, using L2 loss (left) and L1 loss (right).

  • A loss surface is a plot of the loss encountered for each possible value of $\theta$.
  • If our model had 2 parameters, this plot would be 3 dimensional.

The L2 loss surface (left) is minimized at the mean of y (25); the L1 loss surface (right) is minimized at the median of y (22).

47 of 54

MSE vs. MAE

What else is different about squared loss (MSE) and absolute loss (MAE)?

Mean squared error (optimal parameter for the constant model is the sample mean)

  • Very smooth. Easy to minimize using numerical methods (coming later in the course).
  • Very sensitive to outliers, e.g. if we added 1000 to our largest observation, the optimal theta would become 225 instead of 25.

Mean absolute error (optimal parameter for the constant model is the sample median)

  • Not as smooth – at each of the “kinks,” it’s not differentiable. Harder to minimize using numerical methods.
  • Robust to outliers! E.g., adding 1000 to our largest observation doesn’t change the median.

It’s not clear that one is “better” than the other. In practice, we get to choose our loss function!
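
A small sketch (ours) illustrating the outlier claims above:

```python
import numpy as np

y = np.array([20, 21, 22, 29, 33])
y_outlier = y.copy()
y_outlier[-1] += 1000  # add 1000 to the largest observation

# Optimal theta under MSE (the mean) shifts dramatically: 25.0 -> 225.0.
print(y.mean(), y_outlier.mean())

# Optimal theta under MAE (the median) doesn't change at all: 22.0 -> 22.0.
print(np.median(y), np.median(y_outlier))
```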

48 of 54

Summary

50 of 54

The modeling process

We’ve implicitly introduced this three-step process, which we will use constantly throughout the rest of the course.

Choose a model

Choose a loss function

Fit the model by minimizing average loss

In this lecture, we focused exclusively on the constant model, which has a single parameter.

Parameters define our model. They tell us the relationship between the variables involved in our model. (Not all models have parameters, though!)

In the coming lectures, we will look at more sophisticated models.

51 of 54

The modeling process

We’ve implicitly introduced this three-step process, which we will use constantly throughout the rest of the course.

Choose a model

Choose a loss function

Fit the model by minimizing average loss

We introduced two loss functions here: L2 (squared) loss and L1 (absolute) loss. There also exist others.

Both have their benefits and drawbacks. We get to choose which loss function we use, for any modeling task.

52 of 54

The modeling process

We’ve implicitly introduced this three-step process, which we will use constantly throughout the rest of the course.

Choose a model

Choose a loss function

Fit the model by minimizing average loss

Lastly, we choose the optimal parameters by determining the parameters that minimize average loss across our entire dataset. Different loss functions lead to different optimal parameters.

This process is called fitting the model to the data. We did it by hand here, but in the future we will rely on computerized techniques.

53 of 54

Vocabulary review

  • When we use squared (L2) loss as our loss function, the average loss across our dataset is called mean squared error.
    • “Squared loss” and “mean squared error” are not the exact same thing – one is for a single observation, and one is for an entire dataset.
    • But they are closely related.
  • A similar relationship holds true between absolute (L1) loss and mean absolute error.
  • “Average loss” and “empirical risk” mean the same thing for our purposes.
    • So far, our empirical risk was either mean squared error, or mean absolute error.
    • But generally, average loss / empirical risk could be the mean of any loss function across our dataset.

54 of 54

What’s next...

  • Changing the model.
    • Next, we’ll introduce the simple linear regression model that you saw in Data 8.
    • We’ll also look at multiple regression, logistic regression, decision trees, and random forests, all of which are different types of models.
  • Changing the loss function.
    • L2 loss (and, hence, mean squared error) will appear a lot.
    • But we’ll also introduce new loss functions, like cross-entropy loss.
  • Changing how we fit the model to the data.
    • We did this largely by hand in this lecture.
    • But shortly, we’ll run into combinations of models and loss functions for which the optimal parameters can’t be determined by hand.
    • As such, we’ll learn about techniques like gradient descent.