1 of 38

Introduction to Modeling

Linear Regression and The Least Squares Method

DATA 201

Cristiano Fanelli

2 of 38

The lecture material is temporarily hosted at

https://cfteach.github.io/NNDL_DATA621

3 of 38

Outline

  • Motivation — why modeling is so important and what it entails
    • You will be introduced to the concept of bias and variance, which will become a leitmotif throughout your data science studies.
  • Linear Regression (LinReg) and Ordinary Least Squares (OLS)
    • Why Mean Squared Errors — coding
    • Linear Regression — interactive example
    • Regression vs Classification
    • Linear Models — i.e., the class of problems that can be addressed with LinReg
    • Coefficient of Determination
  • Quiz ~ 15 mins
  • Coding and Hands-on
  • Summary and References


4 of 38

What is Modeling?

—- Intro

In general: it is at the root of all sciences, a comprehensive framework for understanding reality.

For many applications you can think of a model as a functional relationship between an input and an output.

We believe that when we have collected a lot of data (many examples), we can train or update the model by minimizing its errors.


5 of 38

Examples from My Personal Research Experience

—- Intro

By 'model,' we typically refer to a representation of a real-world process designed to describe or predict patterns in our data.

I work at the intersection of data science and experimental nuclear physics

At JLab and the EIC we study some of the smallest objects in the universe: quarks and gluons in the proton

  • We need theory/models supporting our experimental data
  • We analyze massive amounts of data, using complex data analysis pipelines based on ML/DL, which means building models to reconstruct, analyze, and interpret our data

[Figure: the proton, ~1 fm in size]


6 of 38

Examples from My Personal Research Experience

—- Intro

[Figures: a large-scale experiment, AI-based design, and an example of particle reconstruction: a charged track emits Cherenkov photons recorded as an (x, y, t) hit pattern; the photon yield vs track angle (P ∈ [0, 5] GeV/c) is shown for changing and fixed kinematics.]


7 of 38

What is Modeling?

—- Motivation

By 'model,' we typically refer to a representation of a real-world process designed to describe or predict patterns in our data.

“In theory, there is no difference between theory and practice…”

Yogi Berra


8 of 38

What is Modeling?

—- Motivation

By 'model,' we typically refer to a representation of a real-world process designed to describe or predict patterns in our data.

“In theory, there is no difference between theory and practice.

In practice there is.”

Yogi Berra


9 of 38

—- Motivation


10 of 38

—- Motivation

Models, almost unavoidably, entail approximations, depending on the assumptions made, the model's complexity, etc.

* Interestingly, we can create a highly complex model that fits a dataset very well, but it may fail to generalize to new data from the same population as the original dataset


11 of 38

E.g., Fitting vs predicting (no noise)

—- Motivation


Fitting

Sample 10 points with no noise in [0, 1] from:

f(x) = 2x − 10x⁵ + 15x¹⁰ (ground truth)

The points follow the polynomial (see figure on the right). Imagine “fitting” a model (linear, polynomial of degree 3, polynomial of degree 10) to the points in the range [0, 1] and learning the coefficients of the polynomial (2, −10, 15) exactly.

Predicting

Now create another dataset of 20 points in [0, 1.25] by sampling the same equation as before:

f(x) = 2x − 10x⁵ + 15x¹⁰

Use the models “fitted” before in the range [0, 1] and make predictions for this new dataset.

[Figures: the fitted polynomial of order 10 on [0, 1] and its predictions on the new dataset (same polynomial of order 10).]

The degree-10 polynomial makes good predictions (it generalizes perfectly to (1, 1.25], as expected).

12 of 38

E.g., Fitting vs predicting (with noise)

—- Motivation


Fitting

Sample 100 points with noise in [0, 1] from:

f(x) = 2x − 10x⁵ + 15x¹⁰ (ground truth)

Now “fit” a model (linear, polynomial of degree 3, polynomial of degree 10) to this new dataset. Because of the noise, the degree-10 polynomial does a decent job in [0, 1], but you can clearly see that things start to look different.

Predicting

Now create another dataset of 20 points in [0, 1.25] by sampling the same equation as before:

f(x) = 2x − 10x⁵ + 15x¹⁰

Use the models “fitted” before in the range [0, 1] and make predictions for this new dataset. The degree-10 polynomial goes south.
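Below is a minimal sketch of this fit-then-predict experiment in Python (assuming numpy; the sample sizes, noise level, and random seed are illustrative choices, not the lecture's actual code).

import numpy as np

# Ground truth from the slides: f(x) = 2x - 10x^5 + 15x^10
def f(x):
    return 2 * x - 10 * x**5 + 15 * x**10

rng = np.random.default_rng(0)

def experiment(n_train=100, noise=0.1):
    # Fit region: samples in [0, 1], optionally with Gaussian noise
    x_train = np.sort(rng.uniform(0.0, 1.0, n_train))
    y_train = f(x_train) + rng.normal(0.0, noise, n_train)

    # Prediction region: 20 points in [0, 1.25] from the same ground truth
    x_test = np.sort(rng.uniform(0.0, 1.25, 20))
    y_test = f(x_test)

    for degree in (1, 3, 10):
        coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
        mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(f"degree {degree:2d}: test MSE on [0, 1.25] = {mse:.3f}")

experiment(noise=0.0)   # no-noise case: the degree-10 fit generalizes well
experiment(noise=0.1)   # noisy case: the degree-10 fit typically degrades outside [0, 1]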

13 of 38

E.g., Fitting vs predicting (with noise)

—- Motivation


The degree-10 polynomial does a poor job and does not generalize well to (1, 1.25].

Even though the data actually came from a degree-10 polynomial!!!

14 of 38

ML can be difficult: Bias vs Variance

—- Motivation


  • Fitting is not predicting.
    • Fitting existing data well is fundamentally different from making good predictions (generalizing well) on new data
  • Bias is the error introduced by oversimplifying the model, causing it to miss relevant patterns (leading to underfitting).
  • Variance is the model's sensitivity to small fluctuations in the training data. Using a complex model can result in overfitting (high variance)
  • For complex datasets and small training sets, simple models can be better at predicting than complex ones due to the bias-variance tradeoff
    • Even though the correct model has better predictive performance for an infinite amount of training data, the training errors stemming from finite-size sampling (variance) can cause simpler models to outperform the more complex model

— It is difficult to generalize beyond what is seen in the training dataset —

15 of 38

—- Motivation

Make everything as simple as possible,

but not simpler

Albert Einstein

16 of 38

Least Squares Method

—- OLS, LinReg

Carl Friedrich Gauss (1777-1855)

He made great contributions to mathematics and astronomy.

He proposed a rule to score the contributions of individual errors to overall error.

“The least-squares method was officially discovered and published by Adrien-Marie Legendre (1805), though it is usually also co-credited to Carl Friedrich Gauss (1809), who contributed significant theoretical advances to the method, and may have also used it in his earlier work in 1794 and 1795.” [https://en.wikipedia.org/wiki/Least_squares]

17 of 38

The Law of Probable Errors

—- OLS, LinReg

The story began when an Italian astronomer, Giuseppe Piazzi, discovered a new object in our solar system, the dwarf planet (asteroid) Ceres:

https://www.jpl.nasa.gov/news/ceres-keeping-well-guarded-secrets-for-215-years

Gauss helped relocate the position of Ceres and confirmed the discovery.

"... for it is now clearly shown that the orbit of a heavenly body may be determined quite nearly from good observations embracing only a few days; and this without any hypothetical assumption." - Gauss

  1. Small errors are more likely than large errors.
  2. Errors of the same magnitude but opposite signs, such as x and −x, are equally likely (the distribution is symmetrical).
  3. When several measurements are taken of the same quantity, the average (arithmetic mean) is the most likely value.

18 of 38

Why Mean Squared Error?

—- OLS, LinReg

Ordinary Least Squares (OLS) minimizes the sum of squared residuals (SSR) to estimate the coefficients of a linear regression model.
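For reference, the OLS objective for a linear model can be written as below (a standard formulation in my own notation, since the slide's formula is not reproduced here):

$$
\mathrm{SSR}(\beta) \;=\; \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^{2},
\qquad
\hat{\beta} \;=\; \arg\min_{\beta} \, \mathrm{SSR}(\beta)
$$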

19 of 38

Least Squares Method for Regression

—- OLS, LinReg

N.b.: In standard regression analysis that leads to fitting by least squares there is an implicit assumption that errors in the independent variable are zero or strictly controlled so as to be negligible.
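For simple (one-predictor) linear regression, minimizing the sum of squared residuals yields the familiar closed-form estimates (a standard result, stated here for completeness):

$$
\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
$$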

20 of 38

MSE: Bias, Variance & Noise

—- OLS, LinReg

Interestingly, it turns out that

MSE = Bias² + Variance + Noise²

Bias is how far the model’s average prediction is from the true value

Variance is how much the model’s predictions vary around its mean prediction

Noise is the irreducible error intrinsic to data

Demonstration (see the sketch below): expanding the expected squared error, the second (cross) term equals zero because E[ε] = 0 and because f(x) − ŷ and ε are independent, so the expectation of the product is the product of the expectations. The variance of ε is just the noise σ². The final expression then splits into bias², variance, and noise.
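One standard way to write the demonstration, in notation that may differ slightly from the slide's: with y = f(x) + ε, E[ε] = 0, Var(ε) = σ², and f̂ the fitted model,

$$
\mathbb{E}\bigl[(y - \hat{f}(x))^2\bigr]
= \mathbb{E}\bigl[(f(x) - \hat{f}(x))^2\bigr]
+ 2\,\mathbb{E}\bigl[(f(x) - \hat{f}(x))\,\epsilon\bigr]
+ \mathbb{E}[\epsilon^2]
= \underbrace{\bigl(f(x) - \mathbb{E}[\hat{f}(x)]\bigr)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\Bigl[\bigl(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\bigr)^2\Bigr]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{noise}}
$$

where the cross term vanishes because ε has zero mean and is independent of the fit.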

21 of 38

Bias/Variance - Epistemic/Aleatoric - Accuracy/Precision

—- OLS, LinReg

  • Bias/Variance:
    • Bias: Systematic error due to model simplification (underfitting).
    • Variance: Error due to model sensitivity to data fluctuations (overfitting).
  • Epistemic/Aleatoric Uncertainty:
    • Epistemic (Model): Uncertainty in the prediction due to lack of knowledge or wrong model assumptions (reducible, relates to bias, which is a particular case of oversimplification).
    • Aleatoric (Data): Uncertainty inherent in the prediction due to data variability (irreducible, relates to variance).
  • Accuracy/Precision:
    • Accuracy: How close predictions are to the true values (relates to low bias and epistemic uncertainty).
    • Precision: How consistent predictions are (relates to low variance and aleatoric uncertainty).

Relationship Summary:

  • Bias ↔ Epistemic Uncertainty ↔ Accuracy
  • Variance ↔ Aleatoric Uncertainty ↔ Precision

[Figure: red marks the true value; blue marks the model's predictions of the true value. Photo: Turkish shooting team, Paris 2024 Olympics.]

22 of 38

Supervised Learning Based on Training Examples

—- Reg VS Clas

Regression – with known numerical outcomes, can we predict outcomes for new data?

Classification – with known groups, how can we classify new data?

Regression example (X1–X5 are the input features, y is the numerical output):

  X1  X2  X3  X4  X5 |  y
   5   5   5   5   1 |  7.25
   1   1   1   1   5 |  4.5

Model that will predict target values for new data.
Evaluate: do predicted values match known values?
Goal: accurate predictions for new data.

Classification example (X1–X5 are the input features, y is the output label):

  X1  X2  X3  X4  X5 |  y
   5   5   5   5   1 |  Class_0
   1   1   1   1   5 |  Class_1

Model that will predict labels for new data.
Evaluate: do predicted labels match known labels?
Goal: accurate predictions for new data.

23 of 38

Main Goal: Predictive Models

—- Reg VS Clas

[Figure: example dataset with the response/target column and the feature columns highlighted.]

24 of 38

Main Goal: Predictive Models

—- Reg VS Clas

[Figure: example dataset with the response/target column and the feature columns highlighted.]

25 of 38

Regression Models Have Numeric Targets

—- Reg VS Clas

For linear regression the relationship between dependent (target) and independent (features) variables is always described by the same type of equation:

In 2 dimensions:

In 3 dimensions:
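The two equations referenced above are presumably the standard forms (written here for reference):

$$
y = \beta_0 + \beta_1 x_1 \quad \text{(a line, in 2 dimensions)},
\qquad
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \quad \text{(a plane, in 3 dimensions)}
$$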

26 of 38

Linear Models:

Linear Regression does not always mean a line

—- Lin. Models

In 2 dimensions:

In 3 dimensions:

In p dimensions:

(data has p variables)

Linear Combination of Variables

Bias Term
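In p dimensions the general form is presumably the standard one below, with β₀ the bias term and the remaining sum a linear combination of the p variables:

$$
y = \underbrace{\beta_0}_{\text{bias term}}
+ \underbrace{\beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p}_{\text{linear combination of the } p \text{ variables}}
$$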

27 of 38

Linear Models:

Linear combinations of parameters

—- Lin. Models

Message: A linear model is linear in its parameters.

Parameters/Coefficients: the betas: β₀, β₁, β₂, …

Predictors: the x's: x₁, x₂, …

Linear in parameters (the betas):

Linear in parameters but not in predictors:

Linear in predictors but not in parameters:

What about a polynomial?
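The example equations for each case are not reproduced above; the following are representative examples of my own choosing:

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \quad \text{(linear in the parameters and in the predictors)}
$$
$$
y = \beta_0 + \beta_1 \log x_1 + \beta_2 x_2^2 \quad \text{(linear in the parameters, not in the predictors: still a linear model)}
$$
$$
y = \beta_0 + e^{\beta_1} x_1 \quad \text{(linear in the predictor, not in the parameter } \beta_1 \text{: not a linear model)}
$$
$$
y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_k x^k \quad \text{(a polynomial is linear in its parameters, so it is a linear model)}
$$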

28 of 38

In general, models are imperfect

—- Lin. Models

Can you fit one straight line that passes through all of these points?

Can you fit one straight line that passes close to all of these points?

29 of 38

Ordinary Least Squares

—- Lin. Models

OLS finds the coefficients that minimize the sum of squared errors between predictions and actual observations.

Observed data are points; the line represents the model's predictions. Error example:

Model equation: y = 1.17 + 2.20x

Observed data point: (0.69, 3.74)

Prediction for x = 0.69: 1.17 + 2.20 × 0.69 ≈ 2.69

Error (observed − predicted) = 1.052

Squared error ≈ 1.11

OLS is therefore connected to the Mean Squared Error (next)
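For concreteness, a few lines of Python reproducing the arithmetic above (model and data point taken from the slide):

# Slide example: model y = 1.17 + 2.20*x, observed point (0.69, 3.74)
b0, b1 = 1.17, 2.20
x_obs, y_obs = 0.69, 3.74

y_hat = b0 + b1 * x_obs            # prediction: ~2.69
error = y_obs - y_hat              # residual: ~1.052
print(y_hat, error, error ** 2)    # squared error: ~1.11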

30 of 38

Mean Squared Error

—- Lin. Models

Mean squared error is good for scoring models created to predict the same target …

Possible Issues:

  • What if you have nothing to compare to?
  • The value is governed by the scale of the target variable.
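For reference, the usual definition of the mean squared error over n observations with predictions ŷᵢ (not reproduced above):

$$
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - \hat{y}_i \bigr)^2
$$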

31 of 38

—- Coeff of Deter

The coefficient of determination

The coefficient of determination compares your model to an uninformed model.

R² = 1: “perfect” fit

R² = 0: no fit

R² ∈ (0, 1): partial fit
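The standard definition, written here since the slide's formula is not reproduced above (the “uninformed model” is the one that always predicts the mean ȳ):

$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
$$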

32 of 38

The coefficient of determination

—- Coeff of Deter

The coefficient of determination compares your model to an uninformed model.

Cautions:

A very high R² can indicate overfitting.

R² can be used for non-linear models, but with caution.

33 of 38

Does the scale of the data matter?

—- Lin. Models

Consider a model with just 2 of the predictors

Model unscaled: y = 9.19 + 0.075 × Cement + 0.902 × Superplasticizer

Unscaled observation:

Cement = 540, Superplasticizer = 2.5

Prediction: 9.19 + 0.075 × 540 + 0.902 × 2.5 = 51.94

34 of 38

Does the scale of the data matter?

—- Lin. Models

Consider a model with just 2 of the predictors

Model unscaled: y = 9.19 + 0.075 × Cement + 0.902 × Superplasticizer

Unscaled observation:

Cement = 540, Superplasticizer = 2.5

Prediction: 51.94

Model scaled:

Scaled observation:

Cement = 2.48, Superplasticizer = -0.62

Prediction: 51.94

Scaling is pivotal for maintaining consistency, improving computational performance, and enhancing the interpretability of the results.

While it affects the coefficients' values, it should not alter the accuracy of predictions, assuming the scaling is applied correctly and consistently across all data.
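A minimal sketch of this point, assuming scikit-learn and synthetic data standing in for the concrete-strength example (the generating coefficients mimic the unscaled model on the slide; this is not the lecture's dataset):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the two predictors (Cement, Superplasticizer)
rng = np.random.default_rng(0)
X = rng.uniform(low=[100.0, 0.0], high=[600.0, 30.0], size=(200, 2))
y = 9.19 + 0.075 * X[:, 0] + 0.902 * X[:, 1] + rng.normal(0.0, 1.0, 200)

# Fit once on raw features and once on standardized features
raw_model = LinearRegression().fit(X, y)
scaler = StandardScaler().fit(X)
scaled_model = LinearRegression().fit(scaler.transform(X), y)

x_new = np.array([[540.0, 2.5]])
print(raw_model.predict(x_new))                       # ~51.9
print(scaled_model.predict(scaler.transform(x_new)))  # same prediction
print(raw_model.coef_, scaled_model.coef_)            # different coefficient values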

35 of 38

Questions?

36 of 38

Quiz

Go to https://kahoot.it/ and enter the Game PIN

Or scan the QR code

37 of 38

Coding

—- Coding

38 of 38

Summary

We have covered today:

  • Modeling
  • Linear Regression and Linear Models
  • OLS and MSE, coefficient of determination as a score
  • We mentioned sources of model uncertainty

Some Useful References