Introduction to Modeling
Understanding the usefulness of models, and how loss functions help create them.
Data 100, Summer 2020 @ UC Berkeley
Suraj Rampure, Ani Adhikari, Deborah Nolan, Fernando Perez, Joseph Gonzalez
LECTURE 11
Data science lifecycle
We’re now moving to the fourth stage of the data science lifecycle – Understand the World – where we build models that try to generalize patterns in the data we collected.
What is a model?
For example, we model the acceleration due to gravity on Earth as 9.81 m/s².
“Essentially, all models are wrong, but some are useful.”
George Box, Statistician (1919-2013)
A model is an idealized representation of a system.
Why do we build models?
To understand complex phenomena occurring in the world we live in.
To make accurate predictions about unseen data.
Oftentimes, we care about creating models that are simple and interpretable, allowing us to understand the relationships between our variables.
Other times, we care more about making extremely accurate predictions, at the cost of having an uninterpretable model. These are sometimes called black-box models, and are common in fields like deep learning.
Most of the time, we try to strike a balance between interpretability and accuracy.
From HBO’s Silicon Valley – hot dog or not hot dog? Behind this app is indeed a model.
Physical (or mechanistic) models
Some models, such as the aforementioned model of the acceleration due to gravity on Earth, are laws that govern how the world works. We call these physical models.
Statistical models
Other times, we don’t have such a precise understanding of some natural relationship. In such cases, we collect data and use statistical tools to learn more about the relationships between variables.
Summary statistics and notation
A simple model – the constant model
Suppose you want to build a model to predict some numerical quantity of a population:
One choice of model would be to ignore any relationships between variables, and predict the same number for each individual – i.e., predicting a constant. We call this a summary statistic because it summarizes the data in our sample.
Example – Tips dataset
Notation
$\theta$ represents the parameter(s) of our model. This is what we are trying to estimate!
$y$ represents our true observations (e.g. the actual observed tip %s).
$\hat{y}$ represents the predicted observations given by our model (e.g. the predicted tip %s).
$y_i$ represents the $i$th observation in particular (e.g. $y_4$ is the 4th observed tip).
In general, we represent our collected data as $y_1, y_2, \ldots, y_n$.
$\hat{y}_i$ represents the $i$th prediction in particular (e.g. $\hat{y}_9$ is the predicted tip for the 9th data point).
$\hat{\theta}$ represents the fitted, or optimal, parameter(s) that we solve for. It is our goal to find this!
Parameters are what define our model. We make this more clear in the next slide.
We want to find $\hat{\theta}$ to make the best possible model.
Notation
The constant model can be stated as follows: $\hat{y} = \theta$
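As a minimal sketch (the function name constant_model is made up for illustration, not part of the original slides), the constant model in code simply returns the same prediction for everyone:

```python
# A minimal sketch of the constant model: it ignores all information about
# each individual and predicts the same number, theta, every time.
def constant_model(theta, n_predictions):
    """Return the constant prediction theta for each of n_predictions individuals."""
    return [theta] * n_predictions

# e.g. predict a 15% tip for every table, regardless of any other information
print(constant_model(15, 5))  # [15, 15, 15, 15, 15]
```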
Prediction vs. estimation
These terms are often used somewhat interchangeably, but there is a subtle difference between them.
Prediction is the task of using a model to predict outputs for unseen data. Once we have estimates for our model’s parameters, we can use our model for prediction.
Estimation is the task of using data to determine model parameters.
Loss functions
The cost of doing business (making predictions)
We need some metric of how “good” or “bad” our predictions are. This is what loss functions provide us with. Loss functions quantify how bad a prediction is for a single observation.
A natural choice of loss function is actual - predicted, or $y_i - \hat{y}_i$. We call this the error for a single prediction.
Squared and absolute loss
The most common loss function you’ll see is the squared loss, also known as L2 loss: $L(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$.
Another common loss function is the absolute loss, also known as L1 loss: $L(y_i, \hat{y}_i) = |y_i - \hat{y}_i|$.
There are benefits and drawbacks to both of the above loss functions. We will examine those shortly. These are also not the only possible loss functions; we will see more later.
If our prediction is equal to the actual observation, in both cases, our loss is 0.
Low loss means a good fit!
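As a rough sketch (the function names squared_loss and absolute_loss are made up for illustration, not library code), here is what the two loss functions look like for a single observation:

```python
# Loss for a single observation: how bad is one prediction y_hat for the true value y?
def squared_loss(y, y_hat):
    """L2 loss: the squared difference between actual and predicted."""
    return (y - y_hat) ** 2

def absolute_loss(y, y_hat):
    """L1 loss: the absolute difference between actual and predicted."""
    return abs(y - y_hat)

# If the prediction equals the observation, both losses are 0 (a perfect fit for that point).
print(squared_loss(20, 20), absolute_loss(20, 20))  # 0 0
print(squared_loss(20, 25), absolute_loss(20, 25))  # 25 5
```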
Loss functions and empirical risk
We care about how bad our model’s predictions are for our entire data set, not just for one point. A natural measure, then, is the average loss across all points. Assuming $n$ points, the average loss is $\frac{1}{n}\sum_{i=1}^{n} L(y_i, \hat{y}_i)$.
The average loss of a model tells us how well it fits the given data. If our model has a low average loss across our dataset, that means it is good at making predictions. As such, we want to find the parameter(s) that minimize average loss, in order to make our model as good at making predictions as it can be.
Other names for average loss include empirical risk and objective function.
MSE and MAE
If we choose squared loss as our loss function, then average squared loss is typically referred to as mean squared error (MSE), and is of the following form: $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
If we choose absolute loss as our loss function, then average absolute loss is typically referred to as mean absolute error (MAE), and is of the following form: $\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$
These definitions hold true, regardless of our model. We want to minimize these quantities.
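As a rough sketch in numpy (the function names and the tiny example values here are made up for illustration), these definitions translate directly to code:

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: the average squared loss across all observations."""
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    """Mean absolute error: the average absolute loss across all observations."""
    return np.mean(np.abs(y - y_hat))

# A tiny example: actual values vs. the predictions of some model.
y = np.array([15.0, 18.0, 20.0])
y_hat = np.array([16.0, 18.0, 23.0])
print(mse(y, y_hat))  # (1 + 0 + 9) / 3 = 3.33...
print(mae(y, y_hat))  # (1 + 0 + 3) / 3 = 1.33...
```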
Exploring MSE
Average loss is typically written as a function of $\theta$, since $\theta$ defines what our model is (and hence what our predictions are). For example, with squared loss and the constant model, our average loss (and hence, the function we want to minimize) is $R(\theta) = \frac{1}{n}\sum_{i=1}^{n} (y_i - \theta)^2$
Another way of stating our goal is to find the $\hat{\theta}$ satisfying: $\hat{\theta} = \underset{\theta}{\operatorname{argmin}} \, \frac{1}{n}\sum_{i=1}^{n} (y_i - \theta)^2$
We won’t use this notation again in this lecture, but it will come up again in the future.
Average loss is also a function of our data. But unlike theta, we can’t change our data: it is given to us (i.e. it is fixed).
argmin means “the argument that minimizes the following function.”
Exploring MSE
When our model is the constant model, and we choose to use L2 loss, again, our average loss looks like: $R(\theta) = \frac{1}{n}\sum_{i=1}^{n} (y_i - \theta)^2$
Let’s examine a toy example. Suppose we have 5 observations, [20, 21, 22, 29, 33].
The loss for the first observation ($y_1 = 20$): $(20 - \theta)^2$
The average loss across all observations (the MSE): $\frac{1}{5}\left[(20-\theta)^2 + (21-\theta)^2 + (22-\theta)^2 + (29-\theta)^2 + (33-\theta)^2\right]$
Exploring MSE
[Plots: the loss for the first observation, $(20 - \theta)^2$, is a parabola minimized at $\theta = 20$; the MSE across all observations is also a parabola, minimized at $\theta = 25$.]
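A quick numerical check of the plot described above, as a sketch using numpy with the toy data from this slide: scanning a grid of candidate $\theta$ values, the smallest MSE occurs at 25, the sample mean.

```python
import numpy as np

y = np.array([20, 21, 22, 29, 33])

# Evaluate the MSE of the constant model over a grid of candidate thetas.
thetas = np.arange(15, 40, 0.1)
mses = [np.mean((y - theta) ** 2) for theta in thetas]

print(thetas[np.argmin(mses)])  # ~25.0, the theta with the smallest MSE on the grid
print(np.mean(y))               # 25.0, the sample mean
```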
Minimizing mean squared error (MSE)
for the constant model
Minimizing MSE
We saw with the toy example of [20, 21, 22, 29, 33] that the value that minimizes the MSE of the constant model was 25, which was the mean of our observations.
We can try other examples if we want to, and we’ll end up with the same result. Let’s instead pivot to proving this rigorously, using mathematics. There are two ways we’ll go about doing this: using calculus, and using an algebraic trick.
For both derivations, the slides contain the key ideas, but the lecture videos will contain a step-by-step walkthrough.
MSE minimization using calculus
One way to minimize a function is by using calculus: we can take the derivative, set it equal to 0, and solve for the optimizing value.
Our average loss, from above, is $R(\theta) = \frac{1}{n}\sum_{i=1}^{n} (y_i - \theta)^2$. Then:
$\frac{d}{d\theta} R(\theta) = \frac{1}{n}\sum_{i=1}^{n} \frac{d}{d\theta}(y_i - \theta)^2 = \frac{1}{n}\sum_{i=1}^{n} -2(y_i - \theta) = \frac{-2}{n}\sum_{i=1}^{n} (y_i - \theta)$
since we can pull constants (like $\frac{1}{n}$ and $-2$) out of sums, and the derivative of a sum is the sum of the derivatives.
MSE minimization using calculus
Setting this term to 0, we have:
$\frac{-2}{n}\sum_{i=1}^{n} (y_i - \hat{\theta}) = 0$
$\sum_{i=1}^{n} y_i - \sum_{i=1}^{n} \hat{\theta} = 0$ (we can separate sums)
$\sum_{i=1}^{n} y_i - n\hat{\theta} = 0$ (since $\hat{\theta} + \hat{\theta} + \ldots + \hat{\theta} = n\hat{\theta}$)
$\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar{y}$
Thus, with squared loss and the constant model, the sample mean minimizes MSE.
MSE minimization using calculus
We’re not done yet! To be thorough, we need to perform the second derivative test, to guarantee that the point we found is truly a minimum (rather than a maximum or saddle point). We hope that the second derivative of our objective function is positive, indicating that our function is convex (it opens upwards).
Fortunately, it is: $\frac{d^2}{d\theta^2} R(\theta) = \frac{d}{d\theta}\left[\frac{-2}{n}\sum_{i=1}^{n} (y_i - \theta)\right] = 2 > 0$. So the sample mean truly is the minimizer we were looking for. We will interpret what this means shortly.
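As a numerical sanity check on this result (a sketch assuming numpy and scipy are available, using the toy data from earlier), we can hand the MSE of the constant model to a generic one-dimensional minimizer and confirm it lands on the sample mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([20, 21, 22, 29, 33])

# Numerically minimize the MSE of the constant model over theta.
result = minimize_scalar(lambda theta: np.mean((y - theta) ** 2))

print(result.x)    # ~25.0
print(np.mean(y))  # 25.0 -- matches the closed-form answer, theta_hat = y_bar
```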
MSE minimization using an algebraic trick
It turns out that in this case, there’s another rather elegant way of performing the same minimization algebraically, but without using calculus.
For example, this mini-proof shows that $\sum_{i=1}^{n} y_i = n\bar{y}$, and hence that $\sum_{i=1}^{n} (y_i - \bar{y}) = 0$:
1 + 2 + 3 + 4 + 5 is the same as 3 + 3 + 3 + 3 + 3 (five copies of the mean).
MSE minimization using an algebraic trick
This proof relies on an algebraic trick. We can write the difference $a - b$ as $(a - c) + (c - b)$, where $a$, $b$, and $c$ are any numbers.
Using that fact, we can write $y_i - \theta = (y_i - \bar{y}) + (\bar{y} - \theta)$, where $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$, our sample mean. Then:
$R(\theta) = \frac{1}{n}\sum_{i=1}^{n} (y_i - \theta)^2$
$= \frac{1}{n}\sum_{i=1}^{n} \left[(y_i - \bar{y}) + (\bar{y} - \theta)\right]^2$
$= \frac{1}{n}\sum_{i=1}^{n} \left[(y_i - \bar{y})^2 + 2(y_i - \bar{y})(\bar{y} - \theta) + (\bar{y} - \theta)^2\right]$
$= \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2 + \frac{2(\bar{y} - \theta)}{n}\sum_{i=1}^{n} (y_i - \bar{y}) + (\bar{y} - \theta)^2$
The middle term is 0, from the mini-proof on the previous slide. The first term is the variance of the sample (and is equal to the MSE of the sample mean); it doesn’t depend on $\theta$ at all. The last term, $(\bar{y} - \theta)^2$, is non-negative, and equals 0 exactly when $\theta = \bar{y}$.
Also note: going from line 3 to line 4, we distribute the sum to the individual terms. This is a property of sums you should become familiar with!
Minimization using an algebraic trick
In the previous slide, we showed that $R(\theta) = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2 + (\bar{y} - \theta)^2$.
The first term looks familiar: it is the variance of our sample!
Question: What is the value of average loss, when evaluated at $\theta = \bar{y}$?
Mean minimizes MSE for the constant model
As we determined in a variety of ways, for the constant model with squared loss, the mean of the dataset is the optimal choice of parameter: $\hat{\theta} = \bar{y}$.
This is not all that surprising! It provides some formal reasoning as to why we use means so commonly as summary statistics. It is the best, in some sense.
Note, we now write $\hat{\theta}$ instead of $\theta$. This is because we are referring to the optimal parameter, not just any arbitrary $\theta$.
Minimum value of MSE is the sample variance
It’s worth noting that when we substitute $\hat{\theta} = \bar{y}$ back into our average loss, we obtain a familiar result: $R(\hat{\theta}) = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sigma_y^2$
That is, the minimum value that mean squared error can take on (again, for the constant model) is the sample variance.
Put another way, the following statement is true whenever we use the constant model with squared loss: $\min_{\theta} \frac{1}{n}\sum_{i=1}^{n} (y_i - \theta)^2 = \sigma_y^2$, the variance of the sample.
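For the toy dataset, we can verify this numerically (a sketch using numpy; note that np.var divides by n, matching the definition of sample variance used here):

```python
import numpy as np

y = np.array([20, 21, 22, 29, 33])

theta_hat = np.mean(y)                   # the optimal parameter, 25.0
min_mse = np.mean((y - theta_hat) ** 2)  # the MSE evaluated at theta_hat

# np.var also divides by n (not n - 1), so it matches the sample variance here.
print(min_mse)    # 26.0
print(np.var(y))  # 26.0
```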
Minimizing mean absolute error (MAE)
for the constant model
Exploring MAE
When we use absolute (or L1) loss, we call the average loss mean absolute error. For the constant model, our MAE looks like: $R(\theta) = \frac{1}{n}\sum_{i=1}^{n} |y_i - \theta|$
Let’s again re-visit our toy example of 5 observations, [20, 21, 22, 29, 33].
The loss for the first observation ($y_1 = 20$): $|20 - \theta|$
The average loss across all observations (the MAE): $\frac{1}{5}\left[|20-\theta| + |21-\theta| + |22-\theta| + |29-\theta| + |33-\theta|\right]$
Exploring MAE
[Plots: the loss for the first observation, $|20 - \theta|$, is an absolute value curve centered at $\theta = 20$; the MAE across all observations is a jagged curve, minimized near $\theta = 22$.]
Exploring MAE
The shape of the MAE with the constant model seems to be jagged. This is because it is the (weighted) sum of several absolute value curves, which results in a piecewise linear function.
It also doesn’t seem to be immediately clear what the optimal choice of theta should be. It’s somewhere in the “middle” of our points, but it’s clearly not 25, which was the minimizing value for the MSE.
Let’s once again resort to calculus!
Exploring MAE
The shape of the MAE with the constant model seems to be jagged. This is because it is the (weighted) sum of several absolute value curves, which results in a piecewise linear function.
It also doesn’t seem to be immediately clear what the optimal choice of theta should be. It’s somewhere in the “middle” of our points, but it’s clearly not 25, which was the minimizing value for the MSE.
Let’s once again resort to calculus!
The bends, or “kinks,” all appear at our observations! (20, 21, 22, 29, 33)
MAE minimization using calculus
Once again, we can use calculus to determine the optimal $\hat{\theta}$.
The first step is to determine the derivative of our loss function for a single point. Absolute value functions can be written as two linear pieces:
$|y_i - \theta| = \begin{cases} y_i - \theta & \theta \le y_i \\ \theta - y_i & \theta > y_i \end{cases}$
The derivative of our loss for a single point, then, is a piecewise constant function:
$\frac{d}{d\theta} |y_i - \theta| = \begin{cases} -1 & \theta < y_i \\ 1 & \theta > y_i \end{cases}$
Note: The derivative of the absolute value when the argument is 0 (i.e. when $\theta = y_i$) is technically undefined. We ignore this case in our derivation, since thankfully, it doesn’t change our result.
MAE minimization using calculus
From here, we again use the fact that the derivative of a sum is a sum of derivatives:
$\frac{d}{d\theta} R(\theta) = \frac{1}{n}\sum_{i=1}^{n} \frac{d}{d\theta}|y_i - \theta| = \frac{1}{n}\left[\sum_{y_i > \theta} (-1) + \sum_{y_i < \theta} (+1)\right]$
In words: add $-1$ for each observation $y_i$ that is greater than our choice of $\theta$, and add $+1$ for each observation $y_i$ that is less than our choice of $\theta$.
MAE minimization using calculus
Setting this derivative equal to 0:
$\frac{1}{n}\left[\sum_{y_i > \theta} (-1) + \sum_{y_i < \theta} (+1)\right] = 0 \implies \#\{y_i : y_i < \theta\} = \#\{y_i : y_i > \theta\}$
The last line is telling us that in order for our MAE to be minimized, we need to choose a $\theta$ such that the number of observations less than $\theta$ is equal to the number of observations greater than $\theta$.
MAE minimization using calculus
In other words, to minimize the MAE, $\theta$ needs to have an equal number of observations to its left and to its right.
This is the definition of the median! For example, in our toy dataset, the median, 22, has two points to its left (20, 21) and two points to its right (29, 33). It is the value in the “middle.”
Median minimizes MAE for the constant model
We’ve now shown that the median minimizes MAE for the constant model.
This is consistent with what we saw earlier, when plotting the MAE for our toy dataset: it was minimized at exactly $\theta = 22$, the median.
Important note: In general, the mean and median of a dataset are not the same. Therefore, MSE and MAE give us different optimal $\theta$ values!
A key takeaway here is that our choice of loss function determines the optimal parameters for our model.
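As a quick numerical confirmation (a sketch using numpy and the toy data): scanning a grid of candidate $\theta$ values, the smallest MAE occurs at 22, the sample median.

```python
import numpy as np

y = np.array([20, 21, 22, 29, 33])

# Evaluate the MAE of the constant model over a grid of candidate thetas.
thetas = np.arange(15, 40, 0.1)
maes = [np.mean(np.abs(y - theta)) for theta in thetas]

print(thetas[np.argmin(maes)])  # ~22.0, the theta with the smallest MAE on the grid
print(np.median(y))             # 22.0, the sample median
```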
Median minimizes MAE for the constant model
Our toy dataset only had 5 observations. What if it had an even number of observations? Let’s say our toy dataset is now [20, 21, 22, 29, 33, 35]. The 35 is new.
Any $\theta$ value between the two middle observations (22 and 29) lies in a flat region of the MAE curve, and every such $\theta$ minimizes the MAE.
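To see the flat region numerically (a sketch using numpy and the new six-observation toy dataset): every $\theta$ between the two middle observations yields the same, minimal MAE.

```python
import numpy as np

y = np.array([20, 21, 22, 29, 33, 35])  # now an even number of observations

# Every theta between the two middle observations (22 and 29) gives the same MAE.
for theta in [22, 24, 25.5, 27, 29]:
    print(theta, np.mean(np.abs(y - theta)))  # all print ~5.667

print(np.median(y))  # 25.5 -- numpy reports the midpoint of the two middle observations
```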
Comparing loss functions
MSE vs. MAE for toy data
Below, we present the plot of the loss surface for our toy dataset, using L2 loss (left) and L1 loss (right).
[Plots: the L2 loss surface is a smooth parabola, minimized at the mean of y (25); the L1 loss surface is piecewise linear, minimized at the median of y (22).]
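The two loss surfaces described above can be reproduced with a few lines of matplotlib (a sketch, assuming numpy and matplotlib are available; the plotting details are illustrative, not the original figure):

```python
import numpy as np
import matplotlib.pyplot as plt

y = np.array([20, 21, 22, 29, 33])
thetas = np.arange(15, 40, 0.1)

mses = [np.mean((y - t) ** 2) for t in thetas]
maes = [np.mean(np.abs(y - t)) for t in thetas]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(thetas, mses)
ax1.set_title("MSE (L2 loss): minimized at the mean, 25")
ax2.plot(thetas, maes)
ax2.set_title("MAE (L1 loss): minimized at the median, 22")
for ax in (ax1, ax2):
    ax.set_xlabel("theta")
    ax.set_ylabel("average loss")
plt.show()
```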
MSE vs. MAE
What else is different about squared loss (MSE) and absolute loss (MAE)?
Mean squared error: the optimal parameter for the constant model is the sample mean; its loss surface is smooth and differentiable everywhere, with a unique minimizer, but squaring the errors makes it sensitive to outliers.
Mean absolute error: the optimal parameter for the constant model is the sample median; its loss surface is piecewise linear with kinks at the observations, it can have many minimizers (the flat region we saw), and it is more robust to outliers.
It’s not clear that one is “better” than the other. In practice, we get to choose our loss function!
Summary
The modeling process
We’ve implicitly introduced this three-step process, which we will use constantly throughout the rest of the course.
Choose a model
Choose a loss function
Fit the model by minimizing average loss
In this lecture, we focused exclusively on the constant model, which has a single parameter.
Parameters define our model. They tell us the relationship between the variables involved in our model. (Not all models have parameters, though!)
In the coming lectures, we will look at more sophisticated models.
We introduced two loss functions here: L2 (squared) loss and L1 (absolute) loss. There also exist others.
Both have their benefits and drawbacks. We get to choose which loss function we use, for any modeling task.
Lastly, we choose the optimal parameters by determining the parameters that minimize average loss across our entire dataset. Different loss functions lead to different optimal parameters.
This process is called fitting the model to the data. We did it by hand here, but in the future we will rely on computerized techniques.
Vocabulary review
What’s next...