1 of 60

Linear Regression and Gradient Descent for Optimization

  • Fardina Fathmiul Alam

2 of 60

Modular Approach to ML Algorithm Design

So far, we have talked about procedures for learning.

  • E.g. KNN, Decision trees.

3 of 60

Modular Approach to ML Algorithm Design

For the remainder of this course, we will take a more modular approach:

  • Choose a model describing the relationships between variables of interest
  • Define a loss function quantifying how bad the fit to the data is.
  • Fit the model by minimizing the loss function while satisfying any constraint/penalty imposed by a regularizer, possibly using an optimization algorithm.

Mixing and matching these modular components gives us a lot of new ML methods.

4 of 60

Regression vs. Classification

What we've been doing so far has been classification: predicting the category a datapoint will fall into. The other branch of supervised learning is regression.

  • Classification predicts which discrete, mutually exclusive category a datapoint belongs to.
  • Regression takes in input data and predicts a real valued (continuous) feature.

5 of 60

How Does this Differ From Classification?

If you are trying to decide if something is a cat or a dog, you can't be half right; it's either a cat or a dog.

If you are trying to predict how much a house will sell for, you can be more or less correct.

6 of 60

Identify: Classification or Regression?

7 of 60

Problem Setup: Let’s Say

We’ve collected Weight and Height measurements from 5 people, and we want to use Weight to predict Height, which is a continuous value.

Person    Weight    Height
1–5       (measured values for the 5 people)
6         (given)   ?

We want to know the height (output) of a new person X based on his/her weight (input).

8 of 60

If we want to find a simple and effective model for this situation, we'll start by using a simple regression model.

9 of 60

Problem Setup:

We’ve collected Weight and Height measurements from 5 people, and we want to use Weight to predict Height, which is a continuous value.

Regression is a statistical method used to understand the relationship between a dependent variable and one or more independent variables.

Goal: Create a model that can predict the dependent variable based on the independent variable(s).

In this scenario, since we only have one independent variable (Weight) and one dependent variable (Height), we'll use simple linear regression.

10 of 60

Problem Setup:

We want to use Weight to predict Height, which is continuous.

Person    Weight    Height
(measurements for Persons 1–5)

[Scatter plot of the training data: Weight vs. Height]

11 of 60

Linear Equation with 1 Feature

We want to use Weight to predict Height (continuous).

Any linear relationship between two variables can be represented as a straight line.

Y = β0 + β1·X

Model: Y is a linear function of X, where:

“Y” is the target or prediction (dependent variable) → Height

“X” is the feature (independent variable) → Weight

“β0, β1” → the model’s parameters or coefficients

We usually like to add a line to the data so we can see what the trend is.

12 of 60

What if there are multiple features?

13 of 60

Linear Equation with Multiple (P) Features

Any general linear equation with P features can be written as

Y = β0 + β1·x1 + β2·x2 + … + βp·xp

  • xi are features
  • βi are model parameters or coefficients
  • Y is the target variable
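The general linear equation above is just an intercept plus a dot product. A minimal sketch in Python/NumPy; all feature values and coefficients below are made-up illustration numbers, not data from the slides:

```python
import numpy as np

# Y = b0 + b1*x1 + ... + bp*xp, computed as an intercept plus a dot product.
def predict(X, beta0, betas):
    """X: (n_samples, p) feature matrix; betas: (p,) coefficient vector."""
    return beta0 + X @ betas

# Two samples with p = 2 made-up features:
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
y_hat = predict(X, beta0=0.5, betas=np.array([2.0, -1.0]))
print(y_hat)  # → [0.5 2.5]
```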

14 of 60

Terminology:

Slope and Intercept

(In the familiar form y = mx + c, “m” is the slope and “c” is the intercept.)

Slope (β1): Also known as the gradient, it shows how steep the line is, indicating how much the dependent variable changes when the independent variable changes by one unit.

Intercept (β0): It's where the line intersects the y-axis, representing the starting value of the dependent variable when the independent variable is zero.

15 of 60

Back to: Goal of Regression

Y = β0 + β1·X

Find a function that best represents the relationship between the input variables (independent variables) and the output variable (dependent variable), which is continuous.

For a new data point X, we want to predict height based on the weight.

β0 represents the y-intercept of the line

β1 represents the slope of the line

16 of 60

Regression Line or Line of Best Fit

The best fit line, often referred to as the "line of best fit" or "regression line," is a straight line that best represents the relationship between two variables in a set of data.

17 of 60

But how do we know which one is the best line?

How to find the optimal line to fit our data?

A reasonable criterion for a line to be the “best” is for it to have the smallest possible overall error among all straight lines.

18 of 60

But how do we know which one is the best line? cont.

Why that particular line?

We aim to choose a line that

  • Best Fit: The chosen regression line accurately fits the data.
  • Minimizes Errors: Aims to reduce differences between predicted and actual values.
  • Captures Relationship: Shows the relationship between variables for accurate predictions.

So, we need a process of determining the best-fitting line → Linear Regression Method

REMEMBER: “The best fit line will have the least error”.

19 of 60

Linear Regression

Linear Regression is a statistical method used to find the relationship between variables by finding the best fit line.

One of the most important algorithms in machine learning.

  • Measure the relationship between one or more independent variables vs one dependent variable.
    • Independent variables → predictors (inputs)
    • Dependent variables → outcome of interest
  • Example: Years of Experience vs Salary, Area vs House Price.

20 of 60

Goal of the linear regression algorithm

Find the best values for β0 and β1 to find the best fit line*.

*The best fit line is a line that has the least error which means the error between predicted values and actual values should be minimum.

21 of 60

Find the best values for β0 and β1 to find the best fit line.

To do that, we use the method of “Least Squares”: we minimize a cost function, typically the Mean Squared Error (MSE), to fit a line to the data.

22 of 60

Terminologies: Residual or Error

Error/Residuals: the difference between the actual value of y and the predicted value of y.

Residual Error, ε = Y(actual) - Y(predicted)

A well-fitting model should have small residuals on average.

23 of 60

Square Error: RSS or SSE

The residual sum of squares (RSS), also known as the sum of squared residuals (SSR) or the sum of squared errors (SSE), is a squared error that says how bad the fit is [ref: Wiki].

Now we measure the loss function: square the distances (residuals) between the observed data points and the predicted values, then sum these squares to evaluate how well the model fits the data.
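A minimal sketch of residuals and RSS in Python; the actual/predicted values are made up for illustration:

```python
import numpy as np

# Residual: e_i = y_actual_i - y_pred_i; RSS = sum of squared residuals.
y_actual = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])

residuals = y_actual - y_pred
rss = np.sum(residuals ** 2)
print(rss)  # ≈ 0.06  (0.01 + 0.01 + 0.04)
```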

24 of 60

Square Error: RSS or SSE

RSS = Σ (Y(actual) − Y(predicted))²

25 of 60

Terminologies: Cost Function

Cost Function: the Sum of Squared Errors is used as the cost function for Linear Regression.

Here, the loss function is averaged over all training examples:

Mean Squared Error: MSE = (1/n) Σ (Y(actual) − Y(predicted))²

26 of 60

Cost Function

This equation is the same as before, just using different terms or synonyms:

J(β0, β1) = (1/2n) Σ (Y(predicted) − Y(actual))²

The 1/2 factor is multiplied in just to make the derivative calculations convenient.
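This cost can be sketched in Python; the 1/(2n) scaling matches the convention above, and the sample values are made up:

```python
import numpy as np

# J = (1/(2n)) * sum((y_pred - y_actual)^2); the 1/2 factor cancels
# the 2 produced when differentiating the square.
def cost(y_actual, y_pred):
    y_actual, y_pred = np.asarray(y_actual), np.asarray(y_pred)
    n = len(y_actual)
    return np.sum((y_pred - y_actual) ** 2) / (2 * n)

# Made-up actual vs. predicted values:
j = cost([1.0, 2.0, 3.0], [1.1, 1.9, 3.2])
print(j)  # ≈ 0.01  (= 0.06 / 6)
```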

27 of 60

Back to Terminologies

Error/Residuals: the difference between the actual value of y and the predicted value of y.

Cost Function: the Sum of Squared Errors is used as the cost function for Linear Regression.

For all possible lines, we calculate the sum of squares of errors. The line which has the least sum of squares of errors is the best fit line.

REMEMBER: Data points far from the regression line lead to higher errors (residuals), resulting in a higher cost function, while closer points yield lower errors and a reduced cost function.

28 of 60

Summary so far

The cost function needs to be minimized to find the values of β0 and β1 that give the best fit of the predicted line.

This becomes an ML optimization problem now!

Minimize (Cost Function)!

29 of 60

In the context of linear regression, optimization is employed to find the optimal coefficients (slopes and intercept) for the linear equation that best fits the data.

30 of 60

There are some Helping Slides!!

How Linear Regression aims to find the best-fit line among all possible lines.

Check by yourself if you want to!

31 of 60


32 of 60

We Now Have a Function to Minimize

We want to find the coefficients (slope β1, constant/intercept β0) of the line:

Y = β0 + β1·X

that minimize the cost:

J(β0, β1) = (1/2n) Σ (Y(predicted) − Y(actual))²

Target: Update coefficient values iteratively to minimize the cost.

33 of 60

This now leads to a popular optimization algorithm known as “Gradient Descent”.

  • An iterative optimization algorithm to find the minimum of a function.

*** The gradient is a vector that indicates the direction and steepness of the slope of a function at a specific point (check helping slide).

34 of 60

Gradient Descent

Process:

  • Initialization: Start with initial values for β0 and β1.
  • Cost Calculation: Calculate the cost function (RSS).
  • Parameter Adjustment: Adjust (increase or decrease) the parameters (β0, β1) based on the gradient to find the next cost value.
  • Iteration: Repeat the process until the minimum cost is achieved.

In linear regression, we optimize the residual sum of squares cost function. How?

Gradient descent iteratively finds optimal parameter values by minimizing a cost function.
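The process above can be sketched in Python for simple linear regression. The data, learning rate, and iteration count are made-up illustration values, not from the slides:

```python
import numpy as np

# Gradient descent minimizing J(b0, b1) = (1/(2n)) * sum((b0 + b1*x - y)^2).
def gradient_descent(x, y, alpha=0.05, iters=20000):
    b0, b1 = 0.0, 0.0                          # 1. initialization
    n = len(x)
    for _ in range(iters):                     # 4. iterate
        y_hat = b0 + b1 * x                    # current predictions
        grad_b0 = np.sum(y_hat - y) / n        # dJ/db0
        grad_b1 = np.sum((y_hat - y) * x) / n  # dJ/db1
        b0 -= alpha * grad_b0                  # 3. step opposite the gradient
        b1 -= alpha * grad_b1
    return b0, b1

# Made-up data lying exactly on y = 1 + 2x:
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 1.0 + 2.0 * x
b0, b1 = gradient_descent(x, y)
print(b0, b1)  # ≈ 1.0 2.0
```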

35 of 60

Importance of the “Derivative” in Minimization

The derivative, a key concept from calculus, indicates the slope of the function at a given point.

  • Knowing the slope helps determine the direction (sign) in which to move when adjusting the coefficient values, so that we achieve a lower cost in the next iteration.

It is an extremely important technique for minimizing the cost function to reach the minimum point.

36 of 60

Gradient Descent Algorithm

Repeat until convergence:

β0 := β0 − α · ∂J/∂β0

β1 := β1 − α · ∂J/∂β1

where α is the learning rate.

37 of 60

Learning Step / Learning Rate (alpha)

Learning Rate (α): Determines the size of steps taken in Gradient Descent.

Large Learning Rate: Covers more ground per step but risks overshooting the minimum.

Small Learning Rate: Ensures precise movement but may result in slow convergence, since many steps (and gradient calculations) are needed.

Common Learning Rates: 0.001, 0.003, 0.01, 0.03, 0.1, 0.3.
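A toy illustration (not from the slides) of how the step size affects convergence, using the one-parameter function J(b) = b², whose gradient is 2b; the α values are made up:

```python
# Gradient descent on J(b) = b^2 (minimum at b = 0); the gradient is 2b.
def run(alpha, steps=20, b=1.0):
    for _ in range(steps):
        b -= alpha * 2 * b   # step opposite the gradient
    return abs(b)

print(run(0.1))  # ≈ 0.0115 — small alpha: steadily shrinks toward 0
print(run(1.1))  # ≈ 38.3   — large alpha: overshoots and diverges
```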

38 of 60

39 of 60

Line of Best Fit

Once we're done, we have a nice line of best fit.

40 of 60

41 of 60

Assumptions (L.I.N.E)

  • Relationship must be linear
  • Residuals must be independent
  • Residuals must be normally distributed
  • Equal variance of residuals

42 of 60

Linearity:

The relationship between the independent and dependent variables must be linear.

43 of 60

Independence of Residuals

Residuals must be independent of one another.

  • There should be no autocorrelation in residuals for a linear regression model to be valid.

“Assumes that the residuals/ errors from one data point to another are independent.”

44 of 60

What is Autocorrelation

Autocorrelation occurs when residuals are correlated with each other.

E.g: In predicting students' exam scores based on study hours, if errors in predicting one student's score are consistently related to errors in predicting another student's score, it implies autocorrelation. This violates the assumption of independent errors in linear regression.

  • Autocorrelation usually occurs if there is a dependency between residual errors
  • Problem: any correlation in the error term will drastically reduce the accuracy of the model.

45 of 60

Residuals Must be Normally Distributed

If they are not, you're likely to have outliers

46 of 60

Homoscedasticity

The data must be homoscedastic, meaning the variance of the residuals must be constant.

  • Residuals should have constant variance across all levels of the independent variable.

47 of 60

How Do We Know if Our Regression is Any Good?

48 of 60

Evaluating Linear Regression Performance

After building a linear regression model, it is important to evaluate its performance. The most commonly used metrics for evaluating the performance of linear regression models are R-squared (R²) and Root Mean Squared Error (RMSE).

49 of 60

R²

R-squared is a statistical method that determines the goodness of fit.

It measures the strength of the relationship between the dependent and independent variables.

It measures the proportion of variance in the dependent variable explained by the independent variables in a regression model.

  • R² shows how well the independent variables explain the variation in the dependent variable (the outcome you want to predict).

The coefficient of determination

50 of 60

R²

It typically ranges between 0 and 1, and shows how well our line fits the data. It technically can be negative if your fit is worse than a horizontal line.

Interpretation:

R² = 0: The model doesn’t explain any variation.

R² = 1: The model perfectly explains all variation.

Higher R²: Means the model does a good job of predicting the dependent variable.

The coefficient of determination

51 of 60

R²

R² is calculated using the formula:

R² = 1 − (SS_res / SS_tot)

SS_res is the residual sum of squares: the sum of the squared errors (the differences between the actual Y values and the predicted Y values).

SS_tot is the total sum of squares, which measures the total variance of the dependent variable around its mean.

  • For each data point, calculate the difference between the observed value (Y) and the mean, square it, and sum over all points.
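A minimal sketch of this formula in Python; the actual/predicted values are made up for illustration:

```python
import numpy as np

# R^2 = 1 - SS_res / SS_tot, on made-up values.
y_actual = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

ss_res = np.sum((y_actual - y_pred) ** 2)           # squared prediction errors
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)  # variance around the mean
r2 = 1 - ss_res / ss_tot
print(r2)  # ≈ 0.98
```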

52 of 60

Negative R²

53 of 60

54 of 60

Root Mean Squared Error

Our actual loss function.

  • Good at comparing two models on the same data, BUT
  • It is in the same units as whatever we're measuring, so it says little on its own.
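RMSE can be sketched as follows; the values are made up, and note the result carries the target's units:

```python
import numpy as np

# RMSE = sqrt(mean((y_actual - y_pred)^2)); same units as the target.
y_actual = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

rmse = np.sqrt(np.mean((y_actual - y_pred) ** 2))
print(rmse)  # ≈ 19.15
```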

55 of 60

Types of Regression

56 of 60

Polynomial Regression

  • Prone to overfitting
  • Generally use it only for small degree polynomials

Polynomial regression is a type of regression analysis used when the relationship between the independent variable (x) and dependent variable (y) isn't linear. It involves fitting a polynomial equation to the data, allowing for curves instead of straight lines.
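One way to sketch a small-degree polynomial fit is NumPy's np.polyfit (least-squares polynomial fitting); the data here are made up to lie exactly on y = x²:

```python
import numpy as np

# Fit a degree-2 polynomial to made-up data generated from y = x^2.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2

coeffs = np.polyfit(x, y, deg=2)  # highest-degree coefficient first
print(coeffs)  # ≈ [1, 0, 0], i.e. y ≈ 1*x^2 + 0*x + 0
```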

57 of 60

Random Forest Regression

You can modify a random forest to do regression.

In general, people often prefer this one over the other types.

58 of 60

Regularization Technique in Machine Learning

Regularization is a technique used in machine learning and statistical modeling to prevent overfitting and improve the generalization performance of models.

Regularization adds a penalty term to the model's objective function, which penalizes complex models and encourages simpler ones.

The goal of regularization is to find a balance between fitting the training data well and avoiding overly complex models that may not generalize well to new, unseen data.

59 of 60

Other Types of Regression:

  • Ridge Regression: A specific type of regression that does well in situations with multicollinearity, where two independent variables are correlated.
    • Adds an L2 regularization penalty to the least squares objective function; it penalizes the sum of the squared coefficients, effectively shrinking them towards zero (this prevents the weights from getting too large and hence helps avoid overfitting).
  • Lasso Regression: A type of regression that is useful when there are large numbers of potentially irrelevant features.
    • Adds an L1 regularization penalty to the objective function. The L1 penalty encourages sparsity in the solution by penalizing the sum of the absolute values of the coefficients.
  • ElasticNet Regression: A type of regression that combines the penalties of ridge and lasso regression to get the advantages of both.
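Ridge has a closed-form solution, β = (XᵀX + λI)⁻¹Xᵀy, which gives a compact sketch; the data and the penalty strength λ below are made-up illustration values:

```python
import numpy as np

# Ridge regression via the closed form beta = (X^T X + lam*I)^(-1) X^T y.
# lam = 0 recovers ordinary least squares.
def ridge(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Made-up data lying exactly on y = 1 + x (first column is the intercept):
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])

print(ridge(X, y, lam=0.0))   # ≈ [1, 1]: the least-squares solution
print(ridge(X, y, lam=10.0))  # coefficients shrunk toward zero
```

(In practice the intercept column is usually left unpenalized; this sketch penalizes it for brevity.)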

60 of 60