Linear Regression and Gradient Descent for Optimization
Modular Approach to ML Algorithm Design
So far, we have talked about procedures for learning.
For the remainder of this course, we will take a more modular approach:
Mixing and matching these modular components gives us a lot of new ML methods.
Introduction to “Linear Regression”
A supervised learning algorithm that predicts a continuous target Y using a linear combination of input features (X) and model weights (β).
Assumes: A linear relationship between X and Y.
Predicting Continuous Values
Regression vs. Classification
Classification: Predicts discrete, mutually exclusive categories (e.g., cat or dog).
Regression: Predicts continuous, real-valued outputs (e.g., house price). A prediction can be closer to or further from the true value, rather than simply right or wrong.
Key Difference: classification outputs discrete categories, while regression outputs continuous values.
Identify: Classification or Regression?
Goal of Regression
Find the best-fit straight line that minimizes error between predicted and actual values.
Problem Setup: Let’s Say
We’ve collected Weight and Height measurements from 5 people, and we want to use Weight to predict Height, which is a continuous value.
Person | Weight | Height |
1 | … | … |
2 | … | … |
3 | … | … |
4 | … | … |
5 | … | … |
6 | … | ? |
We want to predict the height (output) of a new person based on their weight (input).
Simple linear regression
To find a simple and effective model for this situation, we start with a simple linear regression model.
We want to use Weight (independent variable) to predict Height (dependent variable), which is continuous.
Plotting data
(Scatter plot of the training data: Weight on the x-axis, Height on the y-axis.)
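A minimal plotting sketch for this step, using made-up Weight/Height values in place of the elided table entries (the actual measurements are not given on the slide):

```python
import matplotlib.pyplot as plt

# Hypothetical training data standing in for the 5 (Weight, Height) rows above.
weights = [62, 70, 81, 55, 90]        # made-up values
heights = [165, 172, 180, 160, 185]   # made-up values

plt.scatter(weights, heights)
plt.xlabel("Weight")
plt.ylabel("Height")
plt.title("Training data")
plt.show()
```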
Linear Equation with 1 Feature
We want to use Weight to predict Height (continuous).
Any linear relationship between two variables can be represented as a straight line.
Y = β0 + β1 · X, where X is the independent variable and Y is the dependent variable.
We usually like to add a line to the data so we can see what the trend is
What if there are multiple features?
Linear Equation with Multiple (P) Features
Any general linear equation with multiple (P) features can be written as: Y = β0 + β1·X1 + β2·X2 + … + βP·XP
Terminology: this is called multiple linear regression.
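As a quick illustration of what a multiple-feature prediction looks like in code, here is a sketch with hypothetical coefficients and feature values (none of these numbers come from the slides):

```python
import numpy as np

# Hypothetical coefficients for a model with P = 3 features.
beta0 = 2.0                         # intercept
beta = np.array([0.5, -1.2, 3.0])   # one coefficient per feature

# One observation with 3 feature values (made up for illustration).
x = np.array([4.0, 1.5, 0.2])

# Y = beta0 + beta1*X1 + beta2*X2 + beta3*X3
y_hat = beta0 + x @ beta
print(y_hat)   # 2.0 + 2.0 - 1.8 + 0.6 = 2.8
```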
Slope and Intercept
In the familiar form y = m·x + c, m is the slope and c is the intercept (these correspond to β1 and β0 below).
Slope (β1): Also known as the gradient, it shows how steep the line is, indicating how much the dependent variable changes when the independent variable changes by one unit.
Intercept (β0): It's where the line intersects the y-axis, representing the starting value of the dependent variable when the independent variable is zero.
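For example (with hypothetical numbers): if the fitted line is Height = 100 + 0.5 · Weight, then β0 = 100 is the predicted height when weight is zero, and β1 = 0.5 means each one-unit increase in weight raises the predicted height by 0.5 units.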
Back to: Goal of Regression
Y = β0 + β1 · X
Find a function that best represents the relationship between the input variables (independent variables) and the output variable (dependent variable), which is continuous.
6 | … | ? |
For a new data point (person 6), we want to predict Height based on Weight.
β0 represents the y-intercept of the line
β1 represents the slope of the line
Regression Line or Line of Best Fit
The best fit line, often referred to as the "line of best fit" or "regression line," is a straight line that best represents the relationship between two variables in a set of data.
But how do we know which one is the best (optimal) line to fit our data?
REMEMBER: “The best fit line will have the least possible overall error among all straight lines.”
Criteria for the Best Line:
Find the best values for β0 and β1 to find the best fit line.
Minimize the prediction error, measured by Mean Squared Error (MSE), using either the Ordinary Least Squares (OLS) closed-form solution or the Gradient Descent approach to fit the line.
Squared Error: RSS (Residual Sum of Squares) or SSE (Sum of Squared Errors) = Σ (yᵢ − ŷᵢ)²
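A minimal sketch of computing the SSE, together with the standard closed-form OLS estimates for simple linear regression (the data values are the same made-up numbers used in the plotting sketch above):

```python
import numpy as np

# Hypothetical training data (made-up Weight and Height values).
x = np.array([62, 70, 81, 55, 90], dtype=float)       # Weight
y = np.array([165, 172, 180, 160, 185], dtype=float)  # Height

# Closed-form OLS estimates for simple linear regression:
#   beta1 = cov(x, y) / var(x),  beta0 = mean(y) - beta1 * mean(x)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

# Residual sum of squares (RSS / SSE) for the fitted line.
y_hat = beta0 + beta1 * x
sse = np.sum((y - y_hat) ** 2)
print(beta0, beta1, sse)
```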
Summary so far
The cost function needs to be minimized to find the values of β0 and β1 that give the best fit of the predicted line.
In the context of linear regression, optimization is employed to find the optimal coefficients (slopes and intercept) for the linear equation that best fits the data.
This now becomes an ML optimization problem!
Minimize the cost function!
Gradient Descent for Linear Regression
An iterative optimization algorithm that adjusts model parameters (slope β1 and intercept β0)
Goal: Minimize the cost function step-by-step by reducing prediction errors.
Minimizing the Cost Function with Gradient Descent
How it works: start from initial values of β0 and β1, compute the gradient of the cost at the current values, take a small step in the opposite (downhill) direction, and repeat until the cost stops decreasing.
*** The gradient is a vector that indicates the direction and steepness of the slope of a function at a specific point (check helping slide)
We Now Have a Function to Minimize
We want to find the coefficients (slope β1, intercept β0) of the line Y = β0 + β1 · X
That minimize the cost:
J(β0, β1) = (1/n) · Σ (yᵢ − (β0 + β1 · xᵢ))²
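The update step uses the partial derivatives of this cost with respect to each coefficient (standard results, written here in the same notation):
∂J/∂β0 = −(2/n) · Σ (yᵢ − (β0 + β1 · xᵢ))
∂J/∂β1 = −(2/n) · Σ xᵢ · (yᵢ − (β0 + β1 · xᵢ))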
Target: Update coefficient values iteratively to minimize the cost.
Gradient Descent Process
Process:
In linear regression, we optimize the residual sum of squares (RSS) cost function. How?
Gradient descent iteratively finds optimal parameter values by minimizing a cost function.
Gradient Descent Algorithm
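A minimal sketch of the algorithm for simple linear regression, assuming the MSE cost and gradients written above (the function name and defaults are illustrative, not from the slides):

```python
import numpy as np

def gradient_descent(x, y, lr=0.0001, n_iters=1000):
    """Fit y ≈ beta0 + beta1 * x by gradient descent on the MSE cost."""
    beta0, beta1 = 0.0, 0.0                       # arbitrary starting values
    n = len(x)
    for _ in range(n_iters):
        y_hat = beta0 + beta1 * x                 # current predictions
        error = y - y_hat                         # residuals
        grad_b0 = -(2.0 / n) * np.sum(error)      # dJ/dbeta0
        grad_b1 = -(2.0 / n) * np.sum(x * error)  # dJ/dbeta1
        beta0 -= lr * grad_b0                     # step opposite the gradient
        beta1 -= lr * grad_b1
    return beta0, beta1
```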
Learning Rate (α) — Step Size in Gradient Descent
Why is it important?
Too large: Steps may overshoot the minimum and fail to converge.
Too small: Convergence is slow and takes many iterations.
Common Learning Rates: 0.001, 0.003, 0.01, 0.03, 0.1, 0.3.
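Using the gradient_descent sketch above on the hypothetical weight/height data, the effect of the learning rate can be seen directly; with this made-up data a large rate overshoots and the estimates blow up, while a very small one barely moves per iteration:

```python
import numpy as np

# Same hypothetical data as before; gradient_descent is the sketch defined earlier.
x = np.array([62, 70, 81, 55, 90], dtype=float)
y = np.array([165, 172, 180, 160, 185], dtype=float)

for lr in (1e-2, 1e-4, 1e-7):
    b0, b1 = gradient_descent(x, y, lr=lr, n_iters=1000)
    print(lr, b0, b1)
```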
Line of Best Fit
Once we're done, we have a nice line of best fit.
Gradient Descent Visualization
** Visualization Courtesy: Gavin Hung, Former CMSC320 Student
The Gradient Descent algorithm finds a local minimum of a function (for the convex MSE cost used in linear regression, this is also the global minimum).
Linear Regression: Key Assumptions (L.I.N.E)
L (Linearity): the relationship between X and Y is linear.
I (Independence): the residuals are independent of each other.
N (Normality): the residuals are normally distributed.
E (Equal variance): the residuals have constant variance (homoscedasticity).
(These assumptions also apply to multiple linear regression.)
https://www.jmp.com/en/statistics-knowledge-portal/what-is-regression/simple-linear-regression-assumptions
Evaluating Linear Regression Performance
How Do We Know if Our Regression is Any Good?
R-squared (R², Coefficient of Determination): measures how well the model explains the variability in the target variable.
Mean Squared Error (MSE): average of the squared prediction errors.
Root Mean Squared Error (RMSE): square root of MSE (in the same/original units as Y).
Mean Absolute Error (MAE): average of the absolute prediction errors.
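A minimal sketch of computing these four metrics with scikit-learn (the actual and predicted values are hypothetical):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical actual vs. predicted target values, for illustration only.
y_true = np.array([160.0, 165.0, 172.0, 180.0, 185.0])
y_pred = np.array([162.0, 166.0, 170.0, 178.0, 186.0])

mse  = mean_squared_error(y_true, y_pred)    # average squared error
rmse = np.sqrt(mse)                          # same units as Y
mae  = mean_absolute_error(y_true, y_pred)   # average absolute error
r2   = r2_score(y_true, y_pred)              # proportion of variance explained
print(mse, rmse, mae, r2)
```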
Understanding R² (Coefficient of Determination)
Variance measures how much values spread out from the mean.
R2 measures how much of the variance in the target variable (Y) is explained by the model.
R² = 1 − SSres/SStot
SSres / SSE (Sum of Squared Errors): leftover error not explained by the model
SStot / TSS (Total Sum of Squares): total variance in Y
R² = 0: the model explains none of the variance (as good as guessing the average)
R² = 1: the model explains all of the variance (perfect predictions)
Example: if R² = 0.85, then 85% of the variation in Y is explained by X
Example: R² (Coefficient of Determination)
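As a small worked illustration with hypothetical numbers: if SSres = 3 and SStot = 20, then R² = 1 − 3/20 = 0.85, i.e., the model explains 85% of the variance in Y (matching the example above).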
Types of Regression
1. Linear Regression | Models the linear relationship between dependent and independent variables. Can be simple (one predictor) or multiple (more than one predictor). |
2. Polynomial Regression | Models nonlinear relationships by including polynomial terms of the predictors (e.g., x², x³, etc.). |
3. Logistic Regression (LATER TOPIC) | Used when the dependent variable is categorical (e.g., binary: yes/no). Estimates probabilities using the logistic function. |
4. Ridge and Lasso Regression | Regularized linear regression techniques that prevent overfitting. Ridge adds L2 penalty; Lasso adds L1 penalty. |
5. Other Variants | Elastic Net, Poisson Regression, Quantile Regression, Robust Regression, etc., tailored for specific data types or assumptions. |
Polynomial Regression
An extension of linear regression that fits a curved relationship between the independent and dependent variables.
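A minimal sketch with scikit-learn, which implements polynomial regression by expanding the features (adding x², etc.) and then fitting an ordinary linear model (the data here is synthetic):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic 1-D data with a curved (quadratic) relationship plus noise.
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2.0 + 0.5 * X.ravel() ** 2 + np.random.randn(50)

# Degree-2 polynomial regression: add x^2 as a feature, then fit linearly.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[4.0]]))   # prediction for a new input
```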
Random Forest Regression
You can modify a random forest to do regression.
How it works: build many decision trees, each on a bootstrap sample of the data (with a random subset of features considered at each split), and average their predictions to produce the regression output.
Why use it: it captures nonlinear relationships and feature interactions, is more robust to overfitting than a single tree, and requires little feature scaling.
In practice, it is often a strong default choice compared to the other regression types (see the sketch below).
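A minimal sketch with scikit-learn's RandomForestRegressor on synthetic data (the hyperparameters shown are illustrative, not recommendations from the slides):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: 200 samples, 3 features, a noisy nonlinear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=200)

# Each tree is trained on a bootstrap sample; predictions are averaged.
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)
print(forest.predict(X[:3]))
```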
Other Types of Regression:
Lasso (L1) Regression: useful when there are large numbers of potentially irrelevant features; it selects a small subset of influential features in high-dimensional datasets.
Ridge (L2) Regression: does well in situations with multicollinearity, i.e., when independent variables are highly correlated with each other.
ElasticNet Regression: combines the penalties of Ridge and Lasso regression to get the advantages of both; preferred when the dataset has many correlated features.
Equations of L1 and L2 Regression:
Lasso (L1): minimize Σ (yᵢ − ŷᵢ)² + λ · Σ |βⱼ|
Ridge (L2): minimize Σ (yᵢ − ŷᵢ)² + λ · Σ βⱼ²
Regularization
Modifies the loss with a penalty on the weights
Hyperparameter C controls how much regularization is applied
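A minimal sketch of the three regularized variants in scikit-learn; note that scikit-learn exposes the regularization strength as alpha (larger alpha means a stronger penalty), which plays the role of the regularization hyperparameter discussed above:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Synthetic data where only the first two features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)                     # L2 penalty: shrinks all weights
lasso = Lasso(alpha=0.1).fit(X, y)                     # L1 penalty: some weights become exactly 0
enet  = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2 penalties

print(lasso.coef_)   # sparse: irrelevant features tend to get zero coefficients
```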
Comparison:
S. No | L1 Regularization | L2 Regularization |
1 | Penalizes the sum of absolute value of weights. | Penalizes the sum of square weights. |
2 | It has a sparse solution. | It has a non-sparse solution. |
3 | It gives multiple solutions. | It has only one solution. |
4 | Performs built-in feature selection. | No feature selection. |
5 | Robust to outliers. | Not robust to outliers. |
6 | It generates simple and interpretable models. | It gives more accurate predictions when the output variable is a function of all the input variables. |
7 | Unable to learn complex data patterns. | Able to learn complex data patterns. |
8 | Computationally inefficient in non-sparse conditions. | Computationally efficient because it has an analytical solution. |
Tyagi, Neelam “L2 and L1 Regularization in Machine Learning” AnalyticsSteps, 2021, https://www.analyticssteps.com/blogs/l2-and-l1-regularization-machine-learning.
Conclusions: