Linear Regression and Gradient Descent for Optimization
Modular Approach to ML Algorithm Design
So far, we have talked about procedures for learning.
Modular Approach to ML Algorithm Design
For the remainder of this course, we will take a more modular approach:
Mixing and matching these modular components gives us a lot of new ML methods.
Regression vs. Classification
What we've been doing so far has been classification: predicting the category data will fall into. The other branch of supervised learning is regression.
How Does this Differ From Classification?
If you are trying to decide if something is a cat or a dog, you can't be half right; it's either a cat or a dog.
If you are trying to predict how much a house will sell for, you can be more or less correct.
Identify: Classification or Regression?
Problem Setup: Let’s Say
We’ve collected Weight and Height measurements from 5 people, and we want to use Weight to predict Height, which is a continuous value.
Person | Weight | Height |
1 | … | … |
2 | … | … |
3 | … | … |
4 | … | … |
5 | … | … |
6 | … | ? |
We want to know height (output) of a new person X based on his/her weight (input)
To find a simple and effective model for this situation, we'll start with a simple regression model.
Problem Setup:
We’ve collected Weight and Height measurements from 5 people, and we want to use Weight to predict Height, which is a continuous value.
Regression is a statistical method used to understand the relationship between a dependent variable and one or more independent variables.
Goal: Create a model that can predict the dependent variable based on the independent variable(s).
In this scenario, since we only have one independent variable (Weight) and one dependent variable (Height), we'll use simple linear regression.
Problem Setup:
We want to use Weight to predict Height, which is continuous.
Person | Weight | Height |
1 | … | … |
2 | … | … |
3 | … | … |
4 | … | … |
5 | … | … |
[Plot: scatter of the training data, Weight vs. Height]
Linear Equation with 1 Feature
We want to use Weight to predict Height (continuous).
Any linear relationship between two variables can be represented as a straight line.
Y = β0 + β1·X
Model: Y is a linear function of X:
Y = β0 + β1·X, where:
“Y” is the target or prediction (dependent variable) → Height
“X” is the feature (independent variable) → Weight
“β0, β1” → the model’s parameters or coefficients
We usually like to add a line to the data so we can see what the trend is
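As a sketch of this setup: the slide's actual measurements are elided, so the weight/height numbers below are hypothetical, but they show how a least-squares line is fit and then used to predict a new person's height.

```python
import numpy as np

# Hypothetical Weight (kg) and Height (cm) measurements for 5 people;
# the slide's real numbers are elided, so these values are illustrative only.
weight = np.array([55.0, 62.0, 70.0, 78.0, 85.0])
height = np.array([160.0, 165.0, 171.0, 176.0, 182.0])

# np.polyfit with deg=1 returns the least-squares slope and intercept.
beta1, beta0 = np.polyfit(weight, height, deg=1)

# Predict the height of a new person from their weight.
new_weight = 68.0
predicted_height = beta0 + beta1 * new_weight
print(f"slope={beta1:.3f}, intercept={beta0:.3f}, prediction={predicted_height:.1f}")
```

Plotting `beta0 + beta1 * weight` over the scatter of the data gives exactly the trend line described above.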
What if there are multiple features?
Linear Equation with Multiple (P) Features
Any general linear equation with P features can be written as:
Y = β0 + β1X1 + β2X2 + … + βPXP
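A linear model with multiple features, Y = β0 + β1X1 + … + βpXp, is just a dot product between a coefficient vector and a feature vector. A minimal sketch, using hypothetical coefficient values:

```python
import numpy as np

# Hypothetical coefficients for a model with p = 3 features.
beta0 = 1.5                        # intercept
beta = np.array([0.4, -0.2, 2.0])  # one coefficient per feature

def predict(X):
    """Y = beta0 + beta1*X1 + ... + betap*Xp, vectorized over the rows of X."""
    return beta0 + X @ beta

# Two example data points, each with 3 features.
X = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 0.5]])
print(predict(X))  # one prediction per row: [7.5, 2.3]
```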
Terminology: Slope and Intercept
Slope (β1), often written “m”: also known as the gradient, it shows how steep the line is, indicating how much the dependent variable changes when the independent variable changes by one unit.
Intercept (β0), often written “c”: it's where the line intersects the y-axis, representing the value of the dependent variable when the independent variable is zero.
Back to: Goal of Regression
Y = β0 + β1·X
Find a function that best represents the relationship between the input variables (independent variables) and the output variable (dependent variable), which is continuous.
6 | … | ? |
For a new person (row 6), we want to predict Height based on Weight.
β0 represents the y-intercept of the line
β1 represents the slope of the line
Regression Line or Line of Best Fit
The best fit line, often referred to as the "line of best fit" or "regression line," is a straight line that best represents the relationship between two variables in a set of data.
But how do we know which one is the best line?
How to find the optimal line to fit our data?
A reasonable criterion for a line to be the “best” is for it to have the smallest possible overall error among all straight lines.
But how do we know which one is the best line? cont.
Why that particular line?
We aim to choose a line that minimizes the overall error across all data points.
So, we need a process for determining the best-fitting line → the Linear Regression method
REMEMBER: “The best fit line will have the least error”.
Linear Regression
Linear Regression is a statistical method used to find the relationship between variables by finding the best fit line.
One of the most important algorithms in machine learning.
Goal of the linear regression algorithm
Find the best values for β0 and β1 to find the best fit line*.
*The best fit line is a line that has the least error which means the error between predicted values and actual values should be minimum.
Find the best values for β0 and β1 to find the best fit line.
To do that, we use the method of 'Least Squares': we minimize a cost function, typically the Mean Squared Error (MSE), to fit a line to the data.
Terminologies: Residual or Error
Error/Residuals: the difference between the actual value of y and the predicted value of y.
Residual Error, ε = Y(actual) - Y(predicted)
A well-fitting model should have small residuals on average.
Square Error: RSS or SSE
The residual sum of squares (RSS), also known as the sum of squared residuals (SSR) or the sum of squared errors (SSE), says how bad the fit is [ref: Wiki].
We measure the loss by squaring the distances (residuals) between the observed data points and the predicted values, then summing these squares to evaluate how well the model fits the data.
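The RSS described above can be computed in a few lines; the data values here are hypothetical:

```python
import numpy as np

def rss(y_actual, y_predicted):
    """Residual sum of squares: square each residual, then sum."""
    residuals = y_actual - y_predicted
    return np.sum(residuals ** 2)

# Hypothetical actual vs. predicted values.
y = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.5, 5.5, 6.0])
print(rss(y, y_hat))  # 0.25 + 0.25 + 1.0 = 1.5
```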
Square Error: RSS or SSE
Terminologies: Cost Function
Cost Function: the Sum of Squares of Errors is used as the cost function for Linear Regression.
Here, the loss function is averaged over all training examples.
Mean Squared Error
Cost Function: J(β0, β1) = (1/2n) · Σᵢ (yᵢ − (β0 + β1xᵢ))²
This equation is the same as before, just using different terms or synonyms. The 1/2 factor is included simply to make the calculations (the derivatives) convenient.
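The cost function above can be sketched directly; the 1/2n averaging matches the formula, and the sample data is hypothetical:

```python
import numpy as np

def cost(beta0, beta1, x, y):
    """Half mean squared error: J = (1/2n) * sum((y - (beta0 + beta1*x))^2).
    The 1/2 factor cancels the 2 that appears when differentiating."""
    n = len(y)
    predictions = beta0 + beta1 * x
    return np.sum((y - predictions) ** 2) / (2 * n)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # exactly y = 2x
print(cost(0.0, 2.0, x, y))     # perfect fit -> 0.0
print(cost(0.0, 1.0, x, y))     # residuals 1, 2, 3 -> 14/6 ≈ 2.333
```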
Back to Terminologies
Error/Residuals: the difference between the actual value of y and the predicted value of y.
Cost Function: the least Sum of Squares of Errors is used as the cost function for Linear Regression.
For all possible lines, we calculate the sum of squares of errors. The line which has the least sum of squares of errors is the best fit line.
REMEMBER: Data points far from the regression line lead to higher errors (residuals), resulting in a higher cost function, while closer points yield lower errors and a reduced cost function.
Summary so far
The cost function needs to be minimized to find the values of β0 and β1 that give the best fit of the predicted line.
This now becomes an ML optimization problem!
Minimize (Cost Function)!
In the context of linear regression, optimization is employed to find the optimal coefficients (slopes and intercept) for the linear equation that best fits the data.
There are some helping slides!!
How Linear Regression aims to find the best-fit line among all possible lines
Check by yourself if you want to!
We Now Have a Function to Minimize
We want to find the coefficients (slope β1 and constant/intercept β0) of the line Y = β0 + β1·X
that minimize the cost:
J(β0, β1) = (1/2n) · Σᵢ (yᵢ − (β0 + β1xᵢ))²
Target: update the coefficient values iteratively to minimize the cost.
This now leads to a popular optimization algorithm
known as “Gradient Descent”
*** The gradient is a vector that indicates the direction and steepness of the slope of a function at a specific point (check helping slide)
Gradient Descent
Process:
In linear regression, we optimize the residual sum of squares cost function. How?
Gradient descent iteratively finds optimal parameter values by minimizing a cost function.
Importance of the “Derivative” in Minimization
The derivative, a key concept from calculus, indicates the slope of the function at a given point.
It is an extremely important tool for minimizing the cost function to reach the minimum point.
Gradient Descent Algorithm
Repeat until convergence: βj := βj − α · ∂J/∂βj (update both coefficients simultaneously)
Learning Step / Learning Rate (alpha)
Learning Rate (α): Determines the size of steps taken in Gradient Descent.
Large Learning Rate: Covers more ground per step but risks overshooting the minimum.
Small Learning Rate: Ensures precise movement but may result in slow convergence, since many small steps (each requiring a gradient calculation) are needed.
Common Learning Rates: 0.001, 0.003, 0.01, 0.03, 0.1, 0.3.
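The full loop can be sketched for simple linear regression. This is a minimal implementation of the update rule under the half-MSE cost; the data is hypothetical and chosen to lie exactly on y = 1 + 2x so convergence is easy to check:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.05, epochs=5000):
    """Minimize J = (1/2n) * sum((beta0 + beta1*x - y)^2) by gradient descent."""
    beta0, beta1 = 0.0, 0.0
    n = len(y)
    for _ in range(epochs):
        error = (beta0 + beta1 * x) - y   # prediction error per point
        grad0 = np.sum(error) / n         # dJ/dbeta0
        grad1 = np.sum(error * x) / n     # dJ/dbeta1
        beta0 -= alpha * grad0            # step downhill, scaled by alpha
        beta1 -= alpha * grad1
    return beta0, beta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])  # exactly y = 1 + 2x
b0, b1 = gradient_descent(x, y)
print(b0, b1)  # approaches intercept 1.0 and slope 2.0
```

Try rerunning with a much larger alpha (e.g. 0.5) to see the overshooting behavior described above.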
Line of Best Fit
Once we're done, we have a nice line of best fit.
Assumptions (L.I.N.E)
Linearity:
The relationship between the independent and dependent variables must be linear.
Independence of Residuals
Residuals must be independent of one another.
“Assumes that the residuals/ errors from one data point to another are independent.”
What is Autocorrelation
Autocorrelation occurs when residuals are correlated with each other.
E.g: In predicting students' exam scores based on study hours, if errors in predicting one student's score are consistently related to errors in predicting another student's score, it implies autocorrelation. This violates the assumption of independent errors in linear regression.
Residuals Must be Normally Distributed
If they are not, you're likely to have outliers
Homoscedasticity
The data must be homoscedastic, meaning the variance of the residuals must be constant across all values of the independent variable(s).
How Do We Know if Our Regression is Any Good?
Evaluating Linear Regression Performance
After building a linear regression model, it is important to evaluate its performance. The most commonly used metrics for evaluating the performance of linear regression models are R-Squared and Root Mean Squared Error (RMSE).
R2
R-squared is a statistical method that determines the goodness of fit.
It measures the strength of the relationship between the dependent and independent variables.
It measures the proportion of variance in the dependent variable explained by the independent variables in a regression model.
The coefficient of determination
R2
It typically ranges between 0 and 1, and shows how well our line fits the data. It technically can be negative if your fit is worse than a horizontal line.
Interpretation:
R² = 0: The model doesn’t explain any variation.
R² = 1: The model perfectly explains all variation.
Higher R²: Means the model does a good job of predicting the dependent variable.
The coefficient of determination
R2
It is calculated using the formula: R² = 1 − SS_res / SS_tot
SS_res is the sum of squared residuals or errors (the differences between the actual Y values and the predicted Y values).
SS_tot is the total sum of squares, which measures the total variance of the dependent variable around its mean.
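The R² formula translates directly into code; the two sanity checks below match the interpretations given earlier (1 for a perfect fit, 0 for a model no better than predicting the mean):

```python
import numpy as np

def r_squared(y_actual, y_predicted):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_actual - y_predicted) ** 2)                 # residual sum of squares
    ss_tot = np.sum((y_actual - np.mean(y_actual)) ** 2)           # total sum of squares
    return 1 - ss_res / ss_tot

y = np.array([2.0, 4.0, 6.0, 8.0])
print(r_squared(y, y))                       # perfect fit -> 1.0
print(r_squared(y, np.full(4, np.mean(y))))  # predicting the mean -> 0.0
```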
Negative R2
A negative R² occurs when the model fits the data worse than simply predicting the mean (a horizontal line): SS_res exceeds SS_tot.
Root Mean Squared Error
Our actual loss function: the square root of the mean squared error, reported in the same units as the target variable.
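RMSE is a one-liner; the sample values here are hypothetical:

```python
import numpy as np

def rmse(y_actual, y_predicted):
    """Root mean squared error: square root of the average squared residual."""
    return np.sqrt(np.mean((y_actual - y_predicted) ** 2))

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([2.0, 2.0, 4.0])  # residuals: -1, 0, -1
print(rmse(y, y_hat))              # sqrt(2/3) ≈ 0.816
```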
Types of Regression
Polynomial Regression
Polynomial regression is a type of regression analysis used when the relationship between the independent variable (x) and dependent variable (y) isn't linear. It involves fitting a polynomial equation to the data, allowing for curves instead of straight lines.
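A quick sketch of fitting a curve instead of a straight line, using NumPy's polynomial fitting on hypothetical data generated from a known quadratic:

```python
import numpy as np

# Hypothetical data following a quadratic trend: y = 1 + 2x + 3x^2.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = 1 + 2 * x + 3 * x ** 2

# A degree-2 polynomial fit recovers the curve a straight line cannot.
coeffs = np.polyfit(x, y, deg=2)  # highest power first: [3, 2, 1]
print(coeffs)

# Predict at a new point: 3*9 + 2*3 + 1 = 34.
print(np.polyval(coeffs, 3.0))
```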
Random Forest Regression
You can modify a random forest to do regression.
In general, people often prefer this one compared to the other types.
Regularization Technique in Machine Learning
Regularization is a technique used in machine learning and statistical modeling to prevent overfitting and improve the generalization performance of models.
Regularization adds a penalty term to the model's objective function, which penalizes complex models and encourages simpler ones.
The goal of regularization is to find a balance between fitting the training data well and avoiding overly complex models that may not generalize well to new, unseen data.
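One concrete example of the penalty-term idea above is ridge regression (an L2 penalty on the coefficients). This is a minimal sketch using the closed-form ridge solution, assuming centered features so the intercept can be ignored; the data is synthetic:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Ridge regression: minimize ||y - X b||^2 + lam * ||b||^2.
    Closed form: b = (X^T X + lam * I)^-1 X^T y. The intercept is not
    penalized here for simplicity; X is assumed to be centered."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5])    # true coefficients, no noise

b_small = ridge_fit(X, y, lam=0.01)   # weak penalty: close to true coefficients
b_large = ridge_fit(X, y, lam=100.0)  # strong penalty: shrunk toward zero
print(b_small, b_large)
```

Increasing `lam` trades training fit for simpler (smaller) coefficients, which is exactly the balance described above.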
Other Types of Regression: