Introduction to Modeling
Linear Regression and The Least Squares Method
DATA 201
Cristiano Fanelli
The lecture material is temporarily hosted in
Outline
- Intro
- Motivation
- OLS, LinReg
- Reg VS Clas
- Lin. Models
- Coeff of Deter
- Coding
What is Modeling?
—- Intro
In general: modeling is at the root of all sciences, a comprehensive framework for understanding reality.
For many applications you can think of a model as a functional relationship between an input and an output.
We believe that when we have collected a lot of data (many examples), we can train or update the model by minimizing its errors.
Examples from My Personal Research Experience
—- Intro
By 'model,' we typically refer to a representation of a real-world process designed to describe or predict patterns in our data.
I work at the intersection of data science and experimental nuclear physics.
At JLab (Jefferson Lab) and the EIC (Electron-Ion Collider) we study some of the smallest objects in the universe: quarks and gluons inside the proton.
[Figure: the proton, ~1 fm across.]
Examples from My Personal Research Experience
—- Intro
[Figures: a charged track emitting Cherenkov photons and the resulting (x,y,t) hit pattern; photon yield vs track angle for P ∈ [0,5] GeV/c, under changing and fixed kinematics; a large-scale experiment; AI-based design; an example of particle reconstruction.]
What is Modeling?
—- Motivation
By 'model,' we typically refer to a representation of a real-world process designed to describe or predict patterns in our data.
“In theory, there is no difference between theory and practice. In practice, there is.”
Yogi Berra
—- Motivation
Models, almost unavoidably, entail approximations, depending on the assumptions made, the model's complexity, etc.
* Interestingly, we can create a highly complex model that fits a dataset very well, yet fails to generalize to new data drawn from the same population as the original dataset.
E.g., Fitting vs predicting (no noise)
—- Motivation
Fitting
Sample 10 points with no noise in [0,1] from:
f(x) = 2x - 10x^5 + 15x^10 (ground truth)
The points follow the polynomial (see figure, right). Imagine fitting a model (linear, polynomial degree 3, polynomial degree 10) to the points in the range [0,1], with the degree-10 fit exactly learning the coefficients of the polynomial (2, -10, 15).
Predicting
Now create another dataset of 20 points in [0,1.25] by sampling the same equation as before:
f(x) = 2x - 10x^5 + 15x^10
Use the models “fitted” before in the range [0,1] and make predictions for this new dataset.
[Figures: the fitted polynomial of order 10 on [0,1], and the same order-10 polynomial used to predict on [0,1.25].]
The degree-10 polynomial makes good predictions: it generalizes perfectly to (1,1.25], as expected, since it recovered the exact ground-truth coefficients.
E.g., Fitting vs predicting (with noise)
—- Motivation
Fitting
Sample 100 points with noise in [0,1] from:
f(x) = 2x - 10x^5 + 15x^10 (ground truth)
Fit the same models (linear, polynomial degree 3, polynomial degree 10) to this new dataset. Because of the noise, the degree-10 polynomial still does a decent job in [0,1], but you can clearly see things starting to differ.
Predicting
Now create another dataset of 20 points in [0,1.25] by sampling the same equation as before:
f(x) = 2x - 10x^5 + 15x^10
Use the models “fitted” before in the range [0,1] and make predictions for this new dataset. The degree-10 polynomial goes south.
E.g., Fitting vs predicting (with noise)
—- Motivation
The degree-10 polynomial does a poor job and does not generalize well to (1,1.25], even though the data actually came from a degree-10 polynomial!
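To make the fitting-vs-predicting experiment concrete, here is a minimal numpy sketch covering both versions (set noise_sigma = 0.0 for the noiseless case). The seed, the noise level, and the use of numpy's Polynomial.fit are assumptions for illustration, not necessarily what the class notebook does:

```python
import numpy as np

def f(x):
    """Ground truth: f(x) = 2x - 10x^5 + 15x^10."""
    return 2 * x - 10 * x**5 + 15 * x**10

rng = np.random.default_rng(0)   # arbitrary seed (assumption)
noise_sigma = 0.1                # assumed noise level; 0.0 reproduces the noiseless case

# Fit on [0, 1]
x_train = rng.uniform(0.0, 1.0, 100)
y_train = f(x_train) + rng.normal(0.0, noise_sigma, x_train.size)

# Least-squares polynomial fits of degree 1, 3, and 10
models = {deg: np.polynomial.Polynomial.fit(x_train, y_train, deg)
          for deg in (1, 3, 10)}

# Predict on [0, 1.25]; the region (1, 1.25] is pure extrapolation
x_test = np.linspace(0.0, 1.25, 20)
for deg, model in models.items():
    mse = np.mean((model(x_test) - f(x_test)) ** 2)
    print(f"degree {deg:2d}: test MSE = {mse:.4f}")
```

With noise, the degree-10 fit chases the noise in [0,1] and its extrapolation error explodes, reproducing the behavior on these slides.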
ML can be difficult: Bias vs Variance
—- Motivation
— It is difficult to generalize beyond what is seen in the training dataset —
—- Motivation
“Make everything as simple as possible,
but not simpler”
Albert Einstein
Least Squares Method
—- OLS, LinReg
Carl Friedrich Gauss (1777-1855)
He made great contributions to mathematics and astronomy.
He proposed a rule to score the contributions of individual errors to overall error.
“The least-squares method was officially discovered and published by Adrien-Marie Legendre (1805), though it is usually also co-credited to Carl Friedrich Gauss (1809), who contributed significant theoretical advances to the method, and may have also used it in his earlier work in 1794 and 1795.” [https://en.wikipedia.org/wiki/Least_squares]
The Law of Probable Errors
—- OLS, LinReg
The story began when an Italian astronomer, Giuseppe Piazzi, discovered a new object in our solar system, the dwarf planet (asteroid) Ceres:
https://www.jpl.nasa.gov/news/ceres-keeping-well-guarded-secrets-for-215-years
Gauss helped relocate the position of Ceres and confirmed the discovery.
"... for it is now clearly shown that the orbit of a heavenly body may be determined quite nearly from good observations embracing only a few days; and this without any hypothetical assumption." - Gauss
Why Mean Squared Error?
—- OLS, LinReg
Ordinary Least Squares (OLS) minimizes the sum of squared residuals (SSR) to estimate the coefficients of a linear regression model.
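In symbols, one standard way to write this (with X the design matrix whose first column is all ones, so that β₀ is the intercept):

```latex
\mathrm{SSR}(\beta) \;=\; \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i\bigr)^2
                    \;=\; \lVert \mathbf{y} - X\beta \rVert^2
```

Setting the gradient with respect to β to zero gives the normal equations XᵀXβ = Xᵀy, so when XᵀX is invertible the OLS estimate is β̂ = (XᵀX)⁻¹Xᵀy.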
Least Squares Method for Regression
—- OLS, LinReg
N.b.: In standard regression analysis that leads to fitting by least squares there is an implicit assumption that errors in the independent variable are zero or strictly controlled so as to be negligible.
MSE: Bias, Variance & Noise
—- OLS, LinReg
Interestingly, it turns out that
MSE = Bias² + Variance + Noise²
Bias is how far the model’s average prediction is from the true value
Variance is how much the model’s predictions vary around its mean prediction
Noise is the irreducible error intrinsic to data
Demonstration
Write y = f(x) + ε with E[ε] = 0. Expanding the squared error, the cross term equals zero because E[ε] = 0 and f(x) − f̂(x) is independent of ε, so the expectation of the product is the product of the expectations. The variance of ε is just the noise σ².
Final expression: MSE = Bias² + Variance + σ², with a squared-bias term and a variance term.
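For reference, here is the standard derivation in compact form, assuming y = f(x) + ε with E[ε] = 0, Var(ε) = σ², and ε independent of the fitted model f̂:

```latex
\mathbb{E}\!\left[(y - \hat f(x))^2\right]
 = \mathbb{E}\!\left[(f(x) - \hat f(x))^2\right]
 + \underbrace{2\,\mathbb{E}\!\left[(f(x) - \hat f(x))\,\epsilon\right]}_{=\,0}
 + \underbrace{\mathbb{E}[\epsilon^2]}_{\sigma^2}
```

Decomposing the first term around the average prediction E[f̂(x)]:

```latex
\mathbb{E}\!\left[(f(x) - \hat f(x))^2\right]
 = \underbrace{\bigl(f(x) - \mathbb{E}[\hat f(x)]\bigr)^2}_{\text{bias}^2}
 + \underbrace{\mathbb{E}\!\left[\bigl(\hat f(x) - \mathbb{E}[\hat f(x)]\bigr)^2\right]}_{\text{variance}}
```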
Bias/Variance - Epistemic/Aleatoric - Accuracy/Precision
—- OLS, LinReg
Relationship Summary:
[Figure: accuracy-vs-precision target diagrams. Red: true value; blue: model predictions of the true value. Photo: Turkish shooting team, Paris 2024 Olympics.]
Supervised Learning Based on Training Examples
—- Reg VS Clas
Regression – with known numerical outcomes, can we predict outcomes for new data?
Classification – with known groups, how can we classify new data?
Regression (numeric target):

X1 | X2 | X3 | X4 | X5 | y
5 | 5 | 5 | 5 | 1 | 7.25
1 | 1 | 1 | 1 | 5 | 4.5

Classification (class labels):

X1 | X2 | X3 | X4 | X5 | y
5 | 5 | 5 | 5 | 1 | Class_0
1 | 1 | 1 | 1 | 5 | Class_1

In both tables, X1–X5 are the input features and y is the output.

Regression: build a model that will predict target values for new data.
Evaluate: do predicted values match known values?
Classification: build a model that will predict labels for new data.
Evaluate: do predicted labels match known labels?
Goal in both cases: accurate predictions for new data.
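As a concrete illustration, here is a minimal scikit-learn sketch contrasting the two tasks on the toy tables above. The choice of LinearRegression and LogisticRegression and the new observation x_new are assumptions for illustration:

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# Same feature matrix for both tasks; only the target differs
X = [[5, 5, 5, 5, 1],
     [1, 1, 1, 1, 5]]
y_reg = [7.25, 4.5]              # numeric targets  -> regression
y_cls = ["Class_0", "Class_1"]   # class labels     -> classification

reg = LinearRegression().fit(X, y_reg)
clf = LogisticRegression().fit(X, y_cls)

x_new = [[3, 3, 3, 3, 3]]        # hypothetical new observation
print(reg.predict(x_new))        # predicts a number
print(clf.predict(x_new))        # predicts a label
```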
Main Goal: Predictive Models
—- Reg VS Clas
Features: the input columns of the data; Response/Target: the output column the model predicts.
Regression Models Have Numeric Targets
—- Reg VS Clas
For linear regression the relationship between the dependent variable (target) and the independent variables (features) is always described by the same type of equation:
In 2 dimensions: y = β₀ + β₁x (a line)
In 3 dimensions: y = β₀ + β₁x₁ + β₂x₂ (a plane)
Linear Models:
Linear Regression does not always mean a line
—- Lin. Models
In 2 dimensions: y = β₀ + β₁x
In 3 dimensions: y = β₀ + β₁x₁ + β₂x₂
In p dimensions (data has p variables): y = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ
Here β₀ is the bias term, and the remaining sum is a linear combination of the variables.
Linear Models:
Linear combinations of parameters
—- Lin. Models
Message: A linear model is linear in its parameters.
Parameters/Coefficients: the betas β₀, β₁, β₂, …
Predictors: the xᵢ: x₁, x₂, …
Linear in parameters (the betas): e.g., y = β₀ + β₁x₁ + β₂x₂
Linear in parameters but not in predictors: e.g., y = β₀ + β₁x² + β₂ log(x)
Linear in predictors but not in parameters: e.g., y = β₀ + β₁²x (the coefficient enters nonlinearly)
What about a polynomial? A polynomial, y = β₀ + β₁x + β₂x² + …, is still linear in its parameters, so it is a linear model (see the sketch below).
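A short scikit-learn sketch of this point (the synthetic cubic data are an assumption for illustration): expanding the features to powers of x lets ordinary linear regression fit a polynomial, precisely because the model stays linear in its coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, (50, 1))
# Hypothetical ground truth: a cubic polynomial plus a little noise
y = 1.0 - 2.0 * x[:, 0] + 0.5 * x[:, 0] ** 3 + rng.normal(0.0, 0.05, 50)

# PolynomialFeatures builds [1, x, x^2, x^3]; LinearRegression then
# solves an ordinary least-squares problem in those coefficients
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(x, y)
print(model.named_steps["linearregression"].coef_)
```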
In general, models are imperfect
—- Lin. Models
Can you fit one straight line that passes through all of these points?
Can you fit one straight line that passes close to all of these points?
Ordinary Least Squares
—- Lin. Models
OLS finds the coefficients that minimize the sum of squared errors between predictions and actual observations.
Observed data are points; the line represents the model's predictions. Example of a single error:
Model equation: y = 1.17 + 2.20x
Observed data point: (0.69, 3.74)
Prediction for x = 0.69: 1.17 + 2.20 × 0.69 = 2.688 ≈ 2.69
Error (residual): 3.74 - 2.688 = 1.052
Squared error: 1.052² ≈ 1.11
OLS is therefore connected to the Mean Squared Error (next)
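The arithmetic of this worked example takes only a few lines of Python:

```python
# Fitted model from the slide: y = 1.17 + 2.20x
beta0, beta1 = 1.17, 2.20
x_obs, y_obs = 0.69, 3.74        # observed data point

y_hat = beta0 + beta1 * x_obs    # prediction: 2.688 (~2.69)
error = y_obs - y_hat            # residual:   1.052
print(y_hat, error, error ** 2)  # squared error: ~1.11
```

Summing such squared errors over all observations (and dividing by their number) gives the mean squared error.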
Mean Squared Error
—- Lin. Models
Mean squared error, MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)², is good for scoring models created to predict the same target.
Possible issues: the MSE is scale-dependent (its units are the squared units of the target), so its raw value is hard to interpret on its own and cannot be compared across different targets; this motivates the coefficient of determination, next.
—- Coeff of Deter
The coefficient of determination
The coefficient of determination compares your model to an uninformed model.
R² = 1: “perfect” fit
R² = 0: no better than the uninformed model
R² ∈ (0,1): partial fit
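In symbols, with ȳ the mean of the observed targets (i.e., the prediction of the uninformed model):

```latex
R^2 \;=\; 1 - \frac{\mathrm{SS}_{\mathrm{res}}}{\mathrm{SS}_{\mathrm{tot}}}
    \;=\; 1 - \frac{\sum_i \bigl(y_i - \hat{y}_i\bigr)^2}{\sum_i \bigl(y_i - \bar{y}\bigr)^2}
```

Note that on held-out data R² can even be negative, when the model predicts worse than the constant ȳ.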
The coefficient of determination
—- Coeff of Deter
The coefficient of determination compares your model to an uninformed model.
Cautions:
A very high R² can indicate overfitting.
R² can be used for non-linear models, but with caution.
Does the scale of the data matter?
—- Lin. Models
Consider a model with just 2 of the predictors.
Model unscaled: y = 9.19 + 0.075 × Cement + 0.902 × Superplasticizer
Unscaled observation:
Cement = 540, Superplasticizer = 2.5
Prediction:
9.19 + 0.075 × 540 + 0.902 × 2.5 = 51.94
Model scaled:
Scaled observation:
Cement = 2.48, Superplasticizer = -0.62
Prediction: 51.94
Scaling is pivotal for maintaining consistency, improving computational performance, and enhancing the interpretability of the results.
While it affects the coefficients' values, it should not alter the accuracy of predictions, assuming the scaling is applied correctly and consistently across all data.
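A minimal sketch of this check, assuming scikit-learn's StandardScaler. The synthetic data, generated from the unscaled coefficients above, are an assumption for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two predictors on very different scales (cement-like, superplasticizer-like)
X = np.column_stack([rng.uniform(100.0, 550.0, 200),
                     rng.uniform(0.0, 30.0, 200)])
y = 9.19 + 0.075 * X[:, 0] + 0.902 * X[:, 1] + rng.normal(0.0, 1.0, 200)

x_new = np.array([[540.0, 2.5]])

raw = LinearRegression().fit(X, y)
scaler = StandardScaler().fit(X)
std = LinearRegression().fit(scaler.transform(X), y)

# The coefficients differ, but the predictions agree
print(raw.coef_, std.coef_)
print(raw.predict(x_new), std.predict(scaler.transform(x_new)))
```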
Questions?
Quiz
Coding
—- Coding
Please open Colab at this link:
https://cfteach.github.io/NNDL_DATA621/Linear_Regression_1_class.html
Summary
We have covered today:
- What modeling is, and why models entail approximations
- Fitting vs predicting, generalization, and the bias-variance trade-off
- The least squares method (history and formulation) and OLS
- Regression vs classification
- Linear models: linear in the parameters, not necessarily a line
- Scoring: MSE and the coefficient of determination R²
- The effect of scaling the data