1 of 54

Introduction to Modeling

Understanding the usefulness of models, and how loss functions help create them.

Data 100, Summer 2020 @ UC Berkeley

Suraj Rampure, Ani Adhikari, Deborah Nolan, Fernando Perez, Joseph Gonzalez

LECTURE 11

2 of 54

Data science lifecycle

We’re now moving to the fourth stage of the data science lifecycle – Understand the World – where we build models that try to generalize patterns in the data we collected.

3 of 54

What is a model?

4 of 54

What is a model?

For example, we model the acceleration due to gravity on Earth as 9.81 m/s².

  • While this describes the behavior of our system, it is merely an approximation.
  • It doesn’t account for the effects of air resistance, the gravitational effects of other objects, etc.
  • But in practice, it’s accurate enough to be useful!

“Essentially, all models are wrong, but some are useful.”

George Box, Statistician (1919-2013)

A model is an idealized representation of a system.

5 of 54

Why do we build models?

To understand complex phenomena occurring in the world we live in.

  • What factors play a role in the growth of COVID-19?
  • How do an object’s velocity and acceleration impact how far it travels? (Physics: $d = v_0 t + \frac{1}{2} a t^2$.)

Oftentimes, we care about creating models that are simple and interpretable, allowing us to understand the relationships between our variables.

To make accurate predictions about unseen data.

  • Can we predict if this email is spam or not? (Project 2!)
  • Can we generate a one-sentence summary of this 10-page long article?

Other times, we care more about making extremely accurate predictions, at the cost of having an uninterpretable model. These are sometimes called black-box models, and are common in fields like deep learning.

Most of the time, we try to strike a balance between interpretability and accuracy.

6 of 54

From HBO’s Silicon Valley – hot dog or not hot dog? Behind this app is indeed a model.

7 of 54

Physical (or mechanistic) models

Some models, such as the aforementioned model of the acceleration due to gravity on Earth, are laws that govern how the world works. We call these physical models.

8 of 54

Statistical models

Other times, we don’t have such a precise understanding of some natural relationship. In such cases, we collect data and use statistical tools to learn more about the relationships between variables.

9 of 54

Summary statistics and notation

10 of 54

A simple model – the constant model

Suppose you want to build a model to predict some numerical quantity of a population:

  • The percentage tip given at restaurants.
  • The weight of dogs.
  • The GPA of students at UC Berkeley.

One choice of model would be to ignore any relationships between variables, and predict the same number for each individual – i.e., predicting a constant. We call this a summary statistic because it summarizes the data in our sample.

  • For instance, tips given at restaurants likely depend on the total bill price, the time of day, how generous the customers are feeling, etc.
  • Ignoring these factors is a simplifying assumption!

11 of 54

Example – Tips dataset

  • For instance, suppose a waiter collected data about the tips that they received over time while working at some restaurant. (A histogram w/KDE is shown to the right.)
  • We want to pick a constant that “best models” the tips we’ve seen. It seems like most tips are somewhere around 12-20%.
    • 15% seems like a better guess than 25%.
    • But is 15% a better guess than 14%? Hard to tell.
    • We need a precise formulation of all of this.
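
To make this concrete, here is a minimal sketch (our own, not the course’s actual code) that uses Seaborn’s built-in tips dataset (which may differ from the waiter’s data described above) to compute tip percentages and draw a histogram with a KDE overlay.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Seaborn ships a small example "tips" dataset with total_bill and tip columns.
tips = sns.load_dataset("tips")
pct_tip = 100 * tips["tip"] / tips["total_bill"]  # tip as a percentage of the total bill

# Histogram of tip percentages with a KDE overlaid.
sns.histplot(pct_tip, kde=True, stat="density")
plt.xlabel("Tip percentage")
plt.show()
```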

12 of 54

Notation

$\theta$ represents the parameter(s) of our model. This is what we are trying to estimate!

$y$ represents our true observations (e.g. the actual observed tip %s).

$\hat{y}$ represents the predicted observations given by our model (e.g. the predicted tip %s).

$y_i$ represents the ith observation in particular (e.g. $y_4$ is the 4th observed tip).

In general, we represent our collected data as $y_1, y_2, \dots, y_n$.

$\hat{y}_i$ represents the ith prediction in particular (e.g. $\hat{y}_9$ is the predicted tip for the 9th data point).

$\hat{\theta}$ represents the fitted, or optimal, parameter(s) that we solve for. It is our goal to find this!

Parameters are what define our model. We make this more clear in the next slide.

We want to find $\hat{\theta}$ to make the best possible model.

13 of 54

Notation

The constant model can be stated as follows:

$$\hat{y} = \theta$$

  • Parameters are what define our model. Parameters tell us what the relationships between our input and output variables are.
    • Note: not all models have parameters. KDEs are non-parametric models!
  • Our model only has one parameter, $\theta$. Here, the only thing that defines our model is the single number we will predict, regardless of the input.
  • Models can have many parameters (which we often express as a single parameter vector). For example, the simple linear regression model we’ll see in the coming lectures, $\hat{y} = \theta_0 + \theta_1 x$, has two.

  • Our goal is to find the best possible value of our parameter, which we denote with $\hat{\theta}$.
    • We know that $\hat{\theta}$ for the tips dataset is closer to 15% than it is to 25%.

14 of 54

Prediction vs. estimation

These terms are often used somewhat interchangeably, but there is a subtle difference between them.

Estimation is the task of using data to determine model parameters.

Prediction is the task of using a model to predict outputs for unseen data. Once we have estimates for our model’s parameters, we can use our model for prediction.

15 of 54

Loss functions

16 of 54

The cost of doing business (making predictions)

We need some metric of how “good” or “bad” our predictions are. This is what loss functions provide us with. Loss functions quantify how bad a prediction is for a single observation.

  • If our prediction is close to the actual value, we want low loss.
  • If our prediction is far from the actual value, we want high loss.

A natural choice of loss function is actual minus predicted, or $y - \hat{y}$. We call this the error for a single prediction.

  • But this treats “negative” errors and “positive” errors differently.
    • Predicting 16 when the true value is 15 should be penalized the same as predicting 14.
  • This leads to two natural loss functions.

17 of 54

Squared and absolute loss

The most common loss function you’ll see is the squared loss, also known as L2 loss.

  • For a single data point in general, this is $(y - \hat{y})^2$.
  • For our constant model, since $\hat{y} = \theta$, this is $(y - \theta)^2$.

Another common loss function is the absolute loss, also known as L1 loss.

  • For our constant model, for a single point, this is $|y - \theta|$.

There are benefits and drawbacks to both of the above loss functions. We will examine those shortly. These are also not the only possible loss functions; we will see more later.

If our prediction is equal to the actual observation, in both cases, our loss is 0.

Low loss means a good fit!

18 of 54

Loss functions and empirical risk

We care about how bad our model’s predictions are for our entire dataset, not just for one point. A natural measure, then, is the average loss across all points. Assuming $n$ points:

$$\text{average loss} = \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{y}_i)$$

where $L(y_i, \hat{y}_i)$ denotes the loss for the $i$th point.

The average loss of a model tells us how well it fits the given data. If our model has a low average loss across our dataset, that means it is good at making predictions. As such, we want to find the parameter(s) that minimize average loss, in order to make our model as good at making predictions as it can be.

Other names for average loss include empirical risk and objective function.

19 of 54

MSE and MAE

If we choose squared loss as our loss function, then average squared loss is typically referred to as mean squared error (MSE), and is of the following form:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

If we choose absolute loss as our loss function, then average absolute loss is typically referred to as mean absolute error (MAE), and is of the following form:

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$

These definitions hold true, regardless of our model. We want to minimize these quantities.
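
As a concrete illustration (a sketch of our own, not code from the course), here are these loss functions and their averages for the constant model, written with NumPy; the data values are made up.

```python
import numpy as np

def squared_loss(y, y_hat):
    """L2 loss for a single observation (works elementwise on arrays too)."""
    return (y - y_hat) ** 2

def absolute_loss(y, y_hat):
    """L1 loss for a single observation (works elementwise on arrays too)."""
    return np.abs(y - y_hat)

def average_loss(loss_fn, y, theta):
    """Empirical risk of the constant model (y_hat = theta) under the given loss."""
    y = np.asarray(y, dtype=float)
    return np.mean(loss_fn(y, theta))

tips_pct = [15.0, 18.0, 12.0, 20.0]              # hypothetical observed tip percentages
print(average_loss(squared_loss, tips_pct, 16))   # MSE of predicting 16% for everyone
print(average_loss(absolute_loss, tips_pct, 16))  # MAE of predicting 16% for everyone
```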

20 of 54

Exploring MSE

Average loss is typically written as a function of $\theta$, since $\theta$ defines what our model is (and hence what our predictions are). For example, with squared loss and the constant model, our average loss (and hence, the function we want to minimize) is

$$R(\theta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \theta)^2$$

Another way of stating our goal is to find the $\hat{\theta}$ satisfying:

$$\hat{\theta} = \underset{\theta}{\operatorname{argmin}} \; \frac{1}{n} \sum_{i=1}^{n} (y_i - \theta)^2$$

We won’t use this notation again in this lecture, but it will come up again in the future.

Average loss is also a function of our data. But unlike theta, we can’t change our data: it is given to us (i.e. it is fixed).

argmin means “the argument that minimizes the following function.”

21 of 54

Exploring MSE

When our model is the constant model, and we choose to use L2 loss, our average loss again looks like:

$$R(\theta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \theta)^2$$

Let’s examine a toy example. Suppose we have 5 observations, [20, 21, 22, 29, 33].

The loss for the first observation ($y_1 = 20$): $(20 - \theta)^2$.

The average loss across all observations (the MSE): $\frac{1}{5} \left[ (20 - \theta)^2 + (21 - \theta)^2 + (22 - \theta)^2 + (29 - \theta)^2 + (33 - \theta)^2 \right]$.

22 of 54

Exploring MSE

The loss for the first observation ($y_1$) is a parabola, minimized at theta = 20.

The average loss across all observations (the MSE) is also a parabola, minimized at theta = 25.
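
As a quick numeric sanity check (our own sketch, not from the original slides), we can evaluate the constant model’s MSE on the toy data over a grid of candidate theta values and confirm that the minimum sits at the sample mean, 25.

```python
import numpy as np

y = np.array([20, 21, 22, 29, 33])
thetas = np.linspace(15, 40, 501)   # grid of candidate constant predictions

# MSE of the constant model at every candidate theta (broadcast over the grid).
mse = np.mean((y[:, None] - thetas[None, :]) ** 2, axis=0)

print(thetas[np.argmin(mse)])  # 25.0
print(y.mean())                # 25.0, the sample mean
```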

23 of 54

Minimizing mean squared error (MSE)

for the constant model

24 of 54

Minimizing MSE

We saw with the toy example of [20, 21, 22, 29, 33] that the value that minimizes the MSE of the constant model was 25, which was the mean of our observations.

We can try other examples if we want to, and we’ll end up with the same result. Let’s instead pivot to proving this rigorously, using mathematics. There are two ways we’ll go about doing this:

  • Using calculus.
  • Using a neat algebraic trick.

For both derivations, the slides contain the key ideas, but the lecture videos will contain a step-by-step walkthrough.

25 of 54

MSE minimization using calculus

One way to minimize a function is by using calculus: we can take the derivative, set it equal to 0, and solve for the optimizing value.

  • The derivative of a sum of several pieces is equal to the sum of the derivatives of those pieces.
  • The derivative of the loss for a single point is $\frac{d}{d\theta} (y_i - \theta)^2 = -2(y_i - \theta)$.

Then:

$$\frac{d}{d\theta} R(\theta) = \frac{d}{d\theta} \left[ \frac{1}{n} \sum_{i=1}^{n} (y_i - \theta)^2 \right] = \frac{1}{n} \sum_{i=1}^{n} \frac{d}{d\theta} (y_i - \theta)^2 \quad \text{(the derivative of a sum is the sum of the derivatives)}$$

$$= \frac{1}{n} \sum_{i=1}^{n} -2(y_i - \theta) \quad \text{(from above)} = \frac{-2}{n} \sum_{i=1}^{n} (y_i - \theta) \quad \text{(since we can pull constants out of sums)}$$

26 of 54

MSE minimization using calculus

Setting this term to 0, we have:

$$\frac{-2}{n} \sum_{i=1}^{n} (y_i - \hat{\theta}) = 0 \implies \sum_{i=1}^{n} (y_i - \hat{\theta}) = 0 \implies \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} \hat{\theta} = 0 \quad \text{(we can separate sums)}$$

$$\implies \sum_{i=1}^{n} y_i - n\hat{\theta} = 0 \quad (\hat{\theta} + \hat{\theta} + \dots + \hat{\theta} = n\hat{\theta}) \implies \hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y}$$

Thus, with squared loss and the constant model, the sample mean $\bar{y}$ minimizes MSE.

27 of 54

MSE minimization using calculus

We’re not done yet! To be thorough, we need to perform the second derivative test, to guarantee that the point we found is truly a minimum (rather than a maximum or saddle point). We hope that the second derivative of our objective function is positive, indicating our function is convex (it opens upwards).

Fortunately, it is: $\frac{d^2}{d\theta^2} R(\theta) = \frac{d}{d\theta} \left[ \frac{-2}{n} \sum_{i=1}^{n} (y_i - \theta) \right] = \frac{-2}{n} \sum_{i=1}^{n} (-1) = 2 > 0$. So the sample mean truly is the minimizer we were looking for. We will interpret what this means shortly.
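
If you want to double-check the calculus yourself, here is an optional SymPy sketch (ours, not part of the course materials) that differentiates the toy dataset’s MSE, solves for the critical point, and confirms the second derivative is positive.

```python
import sympy as sp

theta = sp.symbols('theta')
y = [20, 21, 22, 29, 33]

# MSE of the constant model on the toy data, as a symbolic function of theta.
R = sp.Rational(1, len(y)) * sum((yi - theta) ** 2 for yi in y)

first_derivative = sp.diff(R, theta)
print(sp.solve(sp.Eq(first_derivative, 0), theta))  # [25]: the sample mean
print(sp.diff(R, theta, 2))                         # 2, which is positive, so it's a minimum
```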

28 of 54

MSE minimization using an algebraic trick

It turns out that in this case, there’s another rather elegant way of performing the same minimization algebraically, but without using calculus.

  • We present this derivation in the next few slides. The lecture video will walk through it in detail.
  • In this proof, you will need to use the fact that the sum of deviations from the mean is 0 – in other words, that $\sum_{i=1}^{n} (y_i - \bar{y}) = 0$. The proof: $\sum_{i=1}^{n} (y_i - \bar{y}) = \sum_{i=1}^{n} y_i - n\bar{y} = n\bar{y} - n\bar{y} = 0$. For example, this mini-proof shows that 1 + 2 + 3 + 4 + 5 is the same as 3 + 3 + 3 + 3 + 3.

  • Our proof will also use the definition of the variance of a sample. As a refresher: $\sigma_y^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2$. Note that this is equal to the MSE of the sample mean!

29 of 54

MSE minimization using an algebraic trick

This proof relies on an algebraic trick. We can write the difference $a - b$ as $(a - c) + (c - b)$, where $a$, $b$, and $c$ are any numbers.

Using that fact, we can write $y_i - \theta = (y_i - \bar{y}) + (\bar{y} - \theta)$, where $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ is our sample mean. Then:

$$R(\theta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \theta)^2$$
$$= \frac{1}{n} \sum_{i=1}^{n} \left[ (y_i - \bar{y}) + (\bar{y} - \theta) \right]^2$$
$$= \frac{1}{n} \sum_{i=1}^{n} \left[ (y_i - \bar{y})^2 + 2(y_i - \bar{y})(\bar{y} - \theta) + (\bar{y} - \theta)^2 \right]$$
$$= \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2 + \frac{2(\bar{y} - \theta)}{n} \sum_{i=1}^{n} (y_i - \bar{y}) + (\bar{y} - \theta)^2$$
$$= \sigma_y^2 + 0 + (\bar{y} - \theta)^2$$

Also note: going from line 3 to line 4, we distribute the sum to the individual terms. This is a property of sums you should become familiar with! In the last line, the middle term is 0 because the sum of deviations from the mean is 0 (from the previous slide), and the first term is the variance of the sample!

30 of 54

Minimization using an algebraic trick

In the previous slide, we showed that $R(\theta) = \sigma_y^2 + (\bar{y} - \theta)^2$.

  • Since variance can’t be negative, the first term is greater than or equal to 0.
    • Of note, the first term doesn’t involve $\theta$ at all. Changing our model won’t change this value, so for the purposes of determining $\hat{\theta}$, we can ignore it.
  • The second term is being squared, and so also must be greater than or equal to 0.
    • This term does involve $\theta$, and so picking the right value of $\theta$ will minimize our average loss.
    • We need to pick the $\theta$ that sets the second term to 0.
    • This is achieved when $\theta = \bar{y}$. In other words: $\hat{\theta} = \bar{y}$.

Looks familiar!

Question: What is the value of average loss, when evaluated at $\hat{\theta}$?

31 of 54

Mean minimizes MSE for the constant model

As we determined in a variety of ways, for the constant model with squared loss, the mean of the dataset is the optimal model.

  • This holds true regardless of the dataset we use, but it’s only true for this combination of model and loss.
  • If we choose any constant other than the sample mean, the empirical risk will not be as small as possible, and so our model is “worse” (for this loss).

This is not all that surprising! It provides some formal reasoning as to why we use means so commonly as summary statistics. It is the best, in some sense.

Note, we now write $\hat{\theta}$ instead of $\theta$. This is because we are referring to the optimal parameter, not just any arbitrary $\theta$.

32 of 54

Minimum value of MSE is the sample variance

It’s worth noting that when we substitute $\hat{\theta} = \bar{y}$ back into our average loss, we obtain a familiar result:

$$R(\hat{\theta}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sigma_y^2$$

That is, the minimum value that mean squared error can take on (again, for the constant model) is the sample variance.

Put another way, the following statement is true whenever $\hat{\theta} = \bar{y}$:

$$\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{\theta})^2 = \sigma_y^2$$
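
A quick numeric check of this result on the toy data (a sketch of our own):

```python
import numpy as np

y = np.array([20, 21, 22, 29, 33])

mse_at_mean = np.mean((y - y.mean()) ** 2)  # R(theta_hat) with theta_hat = the sample mean
sample_variance = np.var(y)                 # np.var uses the 1/n convention by default

print(mse_at_mean, sample_variance)         # both 26.0
```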

33 of 54

Minimizing mean absolute error (MAE)

for the constant model

34 of 54

Exploring MAE

When we use absolute (or L1) loss, we call the average loss mean absolute error. For the constant model, our MAE looks like:

$$R(\theta) = \frac{1}{n} \sum_{i=1}^{n} |y_i - \theta|$$

Let’s again re-visit our toy example of 5 observations, [20, 21, 22, 29, 33].

The loss for the first observation ($y_1 = 20$): $|20 - \theta|$.

The average loss across all observations (the MAE): $\frac{1}{5} \left[ |20 - \theta| + |21 - \theta| + |22 - \theta| + |29 - \theta| + |33 - \theta| \right]$.

35 of 54

Exploring MAE

The loss for the first observation ($y_1$) is an absolute value curve, centered at theta = 20.

The average loss across all observations (the MAE) is some weird, jagged shape... minimized near theta = 22?
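
The same grid trick we used for MSE works here too (again, our own sketch); numerically, the MAE curve is piecewise linear and its minimum lands at the median, 22.

```python
import numpy as np

y = np.array([20, 21, 22, 29, 33])
thetas = np.linspace(15, 40, 501)

# MAE of the constant model at every candidate theta.
mae = np.mean(np.abs(y[:, None] - thetas[None, :]), axis=0)

print(thetas[np.argmin(mae)])  # 22.0
print(np.median(y))            # 22.0, the sample median
```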

37 of 54

Exploring MAE

The shape of the MAE with the constant model seems to be jagged. This is because it is the (weighted) sum of several absolute value curves, which results in a piecewise linear function.

It also doesn’t seem to be immediately clear what the optimal choice of theta should be. It’s somewhere in the “middle” of our points, but it’s clearly not 25, which was the minimizing value for the MSE.

Let’s once again resort to calculus!

The bends, or “kinks,” all appear at our observations! (20, 21, 22, 29, 33)

38 of 54

MAE minimization using calculus

Once again, we can use calculus to determine the optimal $\hat{\theta}$.

The first step is to determine the derivative of our loss function for a single point. Absolute value functions can be written piecewise, as two linear functions:

$$|y_i - \theta| = \begin{cases} y_i - \theta & \theta < y_i \\ \theta - y_i & \theta > y_i \end{cases}$$

The derivative of our loss for a single point, then, is also piecewise:

$$\frac{d}{d\theta} |y_i - \theta| = \begin{cases} -1 & \theta < y_i \\ 1 & \theta > y_i \end{cases}$$

Note: The derivative of the absolute value when the argument is 0 (i.e. when $\theta = y_i$) is technically undefined. We ignore this case in our derivation, since thankfully, it doesn’t change our result.

39 of 54

MAE minimization using calculus

From here, we again use the fact that the derivative of a sum is the sum of the derivatives:

$$\frac{d}{d\theta} R(\theta) = \frac{1}{n} \sum_{i=1}^{n} \frac{d}{d\theta} |y_i - \theta| = \frac{1}{n} \left[ \sum_{\theta < y_i} (-1) + \sum_{\theta > y_i} (1) \right]$$

That is, we add $-1$ for each observation $y_i$ that is greater than our choice of theta, and $+1$ for each observation $y_i$ that is less than our choice of theta.

40 of 54

MAE minimization using calculus

Setting this derivative equal to 0:

$$\frac{1}{n} \left[ \sum_{\theta < y_i} (-1) + \sum_{\theta > y_i} (1) \right] = 0 \implies \sum_{\theta < y_i} 1 = \sum_{\theta > y_i} 1$$

The last line is telling us that in order for our MAE to be minimized, we need to choose a theta such that the number of observations less than theta is equal to the number of observations greater than theta.

42 of 54

MAE minimization using calculus

In order for our MAE to be minimized, we need to choose a theta such that the number of observations less than theta is equal to the number of observations greater than theta. In other words, theta needs to be such that there are an equal number of points to its left and right.

This is the definition of the median! For example, in our toy dataset, the point below in red (22) is the median of our observations. It is the value in the “middle.”

Two points to the left, two points to the right.
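
We can verify this counting condition directly (a tiny sketch of our own):

```python
import numpy as np

y = np.array([20, 21, 22, 29, 33])
theta = np.median(y)  # 22

# Equal numbers of observations on either side of the median.
print(np.sum(y < theta), np.sum(y > theta))  # 2 2
```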

43 of 54

Median minimizes MAE for the constant model

We’ve now shown that the median minimizes MAE for the constant model.

This is consistent with what we saw earlier, when plotting the MAE for our toy dataset: it is minimized at exactly theta = 22.

Important note: In general, the mean and median of a dataset are not the same. Therefore, using MSE and MAE gives us different optimal theta values!

A key takeaway here is that our choice of loss function determines the optimal parameters for our model.

44 of 54

Median minimizes MAE for the constant model

Our toy dataset only had 5 observations. What if it had an even number of observations? Let’s say our toy dataset is now [20, 21, 22, 29, 33, 35]. The 35 is new.

  • There’s no longer a unique solution!
  • Any value in the range [22, 29] minimizes MAE.
  • This reflects the fact that there are an even number of observations, and any number in that range has the same number of points to the left and right.
  • (When there are an even number of data points, we typically set the median to be the mean of the two middle ones. Here, that’d be 25.5.)

Any theta value in this flat region minimizes MAE.
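
Here is a short check (our own sketch) that, for the six-observation dataset, every theta between the two middle values achieves the same minimal MAE, while values outside that range do worse.

```python
import numpy as np

y = np.array([20, 21, 22, 29, 33, 35])

def mae(theta):
    """MAE of the constant model that always predicts theta."""
    return np.mean(np.abs(y - theta))

print(mae(22.0), mae(25.5), mae(29.0))  # all equal (~5.667): the flat region
print(mae(21.0), mae(30.0))             # both larger (6.0): outside [22, 29]
```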

45 of 54

Comparing loss functions

46 of 54

MSE vs. MAE for toy data

Below, we present the plot of the loss surface for our toy dataset, using L2 loss (left) and L1 loss (right).

  • A loss surface is a plot of the loss encountered for each possible value of $\theta$.
  • If our model had 2 parameters, this plot would be 3 dimensional.

The L2 loss surface (left) is minimized at the mean of y (25); the L1 loss surface (right) is minimized at the median of y (22).

47 of 54

MSE vs. MAE

What else is different about squared loss (MSE) and absolute loss (MAE)?

Mean squared error (optimal parameter for the constant model is the sample mean)

  • Very smooth. Easy to minimize using numerical methods (coming later in the course).
  • Very sensitive to outliers, e.g. if we added 1000 to our largest observation, the optimal theta would become 225 instead of 25.

Mean absolute error (optimal parameter for the constant model is the sample median)

  • Not as smooth – at each of the “kinks,” it’s not differentiable. Harder to minimize using numerical methods.
  • Robust to outliers! E.g., adding 1000 to our largest observation doesn’t change the median.

It’s not clear that one is “better” than the other. In practice, we get to choose our loss function!
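
A small sketch (ours) illustrating the outlier claims above:

```python
import numpy as np

y = np.array([20, 21, 22, 29, 33])
y_outlier = y.copy()
y_outlier[-1] += 1000  # add 1000 to the largest observation

# Optimal theta under MSE (the mean) shifts dramatically: 25.0 -> 225.0.
print(y.mean(), y_outlier.mean())

# Optimal theta under MAE (the median) doesn't change at all: 22.0 -> 22.0.
print(np.median(y), np.median(y_outlier))
```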

48 of 54

Summary

50 of 54

The modeling process

We’ve implicitly introduced this three-step process, which we will use constantly throughout the rest of the course.

Choose a model

Choose a loss function

Fit the model by minimizing average loss

In this lecture, we focused exclusively on the constant model, which has a single parameter.

Parameters define our model. They tell us the relationship between the variables involved in our model. (Not all models have parameters, though!)

In the coming lectures, we will look at more sophisticated models.

51 of 54

The modeling process

We’ve implicitly introduced this three-step process, which we will use constantly throughout the rest of the course.

Choose a model

Choose a loss function

Fit the model by minimizing average loss

We introduced two loss functions here: L2 (squared) loss and L1 (absolute) loss. There also exist others.

Both have their benefits and drawbacks. We get to choose which loss function we use, for any modeling task.

52 of 54

The modeling process

We’ve implicitly introduced this three-step process, which we will use constantly throughout the rest of the course.

Choose a model

Choose a loss function

Fit the model by minimizing average loss

Lastly, we choose the optimal parameters by determining the parameters that minimize average loss across our entire dataset. Different loss functions lead to different optimal parameters.

This process is called fitting the model to the data. We did it by hand here, but in the future we will rely on computerized techniques.

53 of 54

Vocabulary review

  • When we use squared (L2) loss as our loss function, the average loss across our dataset is called mean squared error.
    • “Squared loss” and “mean squared error” are not the exact same thing – one is for a single observation, and one is for an entire dataset.
    • But they are closely related.
  • A similar relationship holds true between absolute (L1) loss and mean absolute error.
  • “Average loss” and “empirical risk” mean the same thing for our purposes.
    • So far, our empirical risk was either mean squared error, or mean absolute error.
    • But generally, average loss / empirical risk could be the mean of any loss function across our dataset.

54 of 54

What’s next...

  • Changing the model.
    • Next, we’ll introduce the simple linear regression model that you saw in Data 8.
    • We’ll also look at multiple regression, logistic regression, decision trees, and random forests, all of which are different types of models.
  • Changing the loss function.
    • L2 loss (and, hence, mean squared error) will appear a lot.
    • But we’ll also introduce new loss functions, like cross-entropy loss.
  • Changing how we fit the model to the data.
    • We did this largely by hand in this lecture.
    • But shortly, we’ll run into combinations of models and loss functions for which the optimal parameters can’t be determined by hand.
    • As such, we’ll learn about techniques like gradient descent.