1 of 38

Linear Regression and Gradient Descent for Optimization

  • Fardina Fathmiul Alam

2 of 38

Modular Approach to ML Algorithm Design

So far, we have talked about procedures for learning.

  • E.g. KNN, Decision trees.

3 of 38

Modular Approach to ML Algorithm Design

For the remainder of this course, we will take a more modular approach:

  • Choose a model describing the relationships between variables of interest
  • Define a loss function quantifying how bad the fit to the data is.
  • Fit the model that minimizes the loss function and satisfy the constraint/penalty imposed by the regularizer, possibly using an optimization algorithm.

Mixing and matching these modular components gives us a lot of new ML methods.
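As a preview of how these pieces fit together, here is a minimal Python sketch; the function names and the choice of squared-error loss are illustrative, not prescribed by these slides:

```python
import numpy as np

# Illustrative names only: one function per modular component.
def model(x, beta):                       # 1. model: how inputs map to predictions
    return beta[0] + beta[1] * x

def loss(y_true, y_pred):                 # 2. loss: how bad the fit is (squared error here)
    return np.mean((y_true - y_pred) ** 2)

def fit(x, y, alpha=0.1, n_iters=1000):   # 3. optimizer: adjust beta to minimize the loss
    beta = np.zeros(2)
    for _ in range(n_iters):
        error = y - model(x, beta)
        grad = -2 * np.array([error.mean(), (error * x).mean()])  # gradient of the MSE
        beta -= alpha * grad              # gradient-descent step
    return beta
```

Swapping out the model, the loss, or the optimizer gives a different ML method from the same template.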

4 of 38

Introduction to Linear Regression

A supervised learning algorithm that predicts a continuous target Y using a linear combination of input features (X) and model weights (β).

Assumes: A linear relationship between X and Y.

Predicting Continuous Values

5 of 38

Regression vs. Classification

Classification: Predicts discrete, mutually exclusive categories (e.g., cat or dog).

  • Outcome is either/or, no in-between
  • Example: Is this email spam or not?

Regression: Predicts continuous, real-valued outputs (e.g., house price). The outcome can be more or less correct rather than simply right or wrong.

  • Example: What will this home sell for?

Key Difference:

  • Classification → "Which category/class?" (discrete)
  • Regression → "How much?" (continuous)

6 of 38

Identify: Classification or Regression?

7 of 38

Goal of Regression

Find the best-fit straight line that minimizes error between predicted and actual values.

8 of 38

Problem Setup: Let’s Say

We’ve collected Weight and Height measurements from 5 people, and we want to use Weight to predict Height, which is a continuous value.

[Table: Weight and Height measurements for Persons 1–5; Person 6's Weight is known but Height is unknown (to be predicted).]

We want to know the height (output) of a new person, X, based on his/her weight (input).

Simple linear regression

9 of 38

To find a simple and effective model for this situation, we start with a simple regression model.

We want to use Weight (independent variable) to predict Height (dependent variable), which is continuous.

[Table: Weight and Height measurements for Persons 1–5.]

[Figure: scatter plot of the training data (Weight vs. Height).]

10 of 38

Linear Equation with 1 Feature

We want to use Weight to predict Height (continuous).

Any linear relationship between two variables can be represented as a straight line.

[Figure: Height (dependent variable) plotted against Weight (independent variable).]

We usually like to add a line to the data so we can see what the trend is.

11 of 38

What if there are multiple features?

12 of 38

Linear Equation with Multiple (P) Features

Any general linear equation with P features can be written as

Y = β0 + β1x1 + β2x2 + … + βPxP

  • xi are the features
  • βi are the model parameters or coefficients
  • Y is the target variable

Multiple linear regression
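As a small illustration, fitting a multiple linear regression in scikit-learn might look like the sketch below; the features, their values, and the prediction input are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: two features (e.g., weight in kg and age in years) predicting height in cm
X = np.array([[60, 25], [65, 32], [70, 28], [75, 41], [80, 39]], dtype=float)
y = np.array([160.0, 166.0, 171.0, 175.0, 181.0])

reg = LinearRegression().fit(X, y)
print("Intercept (β0):", reg.intercept_)
print("Coefficients (β1..βP):", reg.coef_)
print("Prediction for [72 kg, 33 yrs]:", reg.predict([[72.0, 33.0]]))
```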

13 of 38

Terminology:

Slope and Intercept

(In the familiar straight-line form y = mx + c, the slope is “m” and the intercept is “c”.)

Slope (β1): Also known as the gradient, it shows how steep the line is, indicating how much the dependent variable changes when the independent variable changes by one unit.

Intercept (β0): It's where the line intersects the y-axis, representing the starting value of the dependent variable when the independent variable is zero.

14 of 38

Back to: Goal of Regression

Y = β0 + β1·X

Find a function that best represents the relationship between the input variables (independent variables) and the output variable (dependent variable), which is continuous.

For a new data point X (Person 6, whose height is unknown), we want to predict height based on weight.

β0 represents the y-intercept of the line

β1 represents the slope of the line

15 of 38

Regression Line or Line of Best Fit

The best fit line, often referred to as the "line of best fit" or "regression line," is a straight line that best represents the relationship between two variables in a set of data.

But how do we know which one is the best (optimal) line to fit our data?


REMEMBER: “The best fit line will have the least possible overall error among all straight lines.”

Criteria for the Best Line:

  • Fits the data well
  • Minimizes prediction errors
  • Captures the relationship between variables

16 of 38

Find the best values for β0 and β1 to find the best fit line.

Minimize the prediction error (Mean Squared Error, MSE) via the Ordinary Least Squares (OLS) or Gradient Descent approach to fit the line.
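For reference, in the simple one-feature case OLS gives the coefficients in closed form (x̄ and ȳ denote the sample means):

  • β1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
  • β0 = ȳ − β1·x̄

Gradient descent reaches approximately the same values iteratively, which is useful when a closed-form solution is expensive to compute (e.g., with many features).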

17 of 38

  1. Residual (Error): Difference between actual (Y) and predicted (Ŷ) values. Formula: ε = Y - Ŷ
    • Good models have small residuals on average
  2. Loss Function (Per-Point Error): Measures the error for a single prediction. Loss=(Y - Ŷ)²
    • Squared error penalizes large errors heavily.
  3. Sum of Squared Errors (SSE/RSS/SSR): Measures total prediction error across all points. SSE = Σ(Y - Ŷ)²
    • Also called Residual Sum of Squares (RSS) or Sum of Squared Residuals (SSR)
    • Evaluate how well the model fits the data.
  4. Cost Function (Mean Squared Error, MSE): Linear regression uses the Sum of Squared Errors to define the cost function.
    • MSE = SSE/n = (1/n) Σ(Y - Ŷ)²
    • Average loss/squared error (normalized for dataset size).
    • Goal: Find the model parameters (slope and intercept) that minimize this cost function.
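A minimal Python sketch of these quantities, using hypothetical actual and predicted values:

```python
import numpy as np

# Hypothetical actual (Y) and predicted (Ŷ) heights, for illustration only
y_actual = np.array([160.0, 166.0, 171.0, 175.0, 181.0])
y_hat = np.array([161.0, 165.0, 172.0, 176.0, 179.0])

residuals = y_actual - y_hat      # ε = Y - Ŷ
sse = np.sum(residuals ** 2)      # SSE / RSS / SSR = Σ(Y - Ŷ)²
mse = sse / len(y_actual)         # MSE = SSE / n
print("Residuals:", residuals, "SSE:", sse, "MSE:", mse)
```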

18 of 38

Squared Error: RSS or SSE

19 of 38

Summary so far

The cost function needs to be minimized to find the values of β0 and β1 that give the best-fit predicted line.

In the context of linear regression, optimization is employed to find the optimal coefficients (slopes and intercept) for the linear equation that best fits the data.

This now becomes an ML optimization problem!

Minimize (Cost Function)!

20 of 38

Gradient Descent for Linear Regression

An iterative optimization algorithm that adjusts the model parameters (slope β1 and intercept β0).

Goal: Minimize the cost function step-by-step by reducing prediction errors.

Minimizing the Cost Function with Gradient Descent

How it works:

  • Calculate the gradient (partial derivatives) of the cost function w.r.t. each parameter.
  • Update parameters in the opposite direction of the gradient to decrease the cost.
  • Repeat until convergence (the cost function stops decreasing significantly).

*** The gradient is a vector that indicates the direction and steepness of the slope of a function at a specific point (check helping slide)
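Concretely, using the MSE cost J(β0, β1) = (1/n) Σ(Yi − (β0 + β1·Xi))², each iteration applies the updates below (α is the learning rate, covered shortly):

  • β0 := β0 − α·∂J/∂β0, with ∂J/∂β0 = −(2/n) Σ(Yi − Ŷi)
  • β1 := β1 − α·∂J/∂β1, with ∂J/∂β1 = −(2/n) Σ(Yi − Ŷi)·Xi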

21 of 38

We Now Have a Function to Minimize

We want to find the coefficients (slope β1, constant/intercept β0) of the line:

Ŷ = β0 + β1·X

that minimize the cost:

Cost = Σ(Yi − (β0 + β1·Xi))²

Target: Update coefficient values iteratively to minimize the cost.

22 of 38

Gradient Descent Process

Process:

  • Initialization: Start with initial values for β0 and β1.
  • Cost Calculation: Calculate the cost function (RSS or MSE).
  • Parameter Adjustment: Adjust (increase or decrease) the parameters (β0, β1) based on the gradient to find the next cost value.
  • Iterations: Repeat process until the minimum cost is achieved.

In linear regression, we optimize the residual sum of squares (RSS) cost function. How?

Gradient descent iteratively finds optimal parameter values by minimizing a cost function.

23 of 38

Gradient Descent Algorithm
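A minimal Python sketch of the algorithm for simple linear regression, using MSE as the cost; the weight/height numbers are invented for illustration:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, n_iters=5000):
    """Fit y = beta0 + beta1 * x by minimizing MSE with batch gradient descent."""
    beta0, beta1 = 0.0, 0.0              # 1. initialization
    n = len(x)
    for _ in range(n_iters):
        y_pred = beta0 + beta1 * x       # current predictions
        error = y - y_pred               # residuals
        # 2. gradients of MSE = (1/n) * sum((y - y_pred)^2)
        d_beta0 = -(2.0 / n) * error.sum()
        d_beta1 = -(2.0 / n) * (error * x).sum()
        # 3. step in the opposite direction of the gradient
        beta0 -= alpha * d_beta0
        beta1 -= alpha * d_beta1
    return beta0, beta1

# Hypothetical weight (kg) / height (cm) values, for illustration only
weight = np.array([60.0, 65.0, 70.0, 75.0, 80.0])
height = np.array([160.0, 166.0, 171.0, 175.0, 181.0])

# Standardizing the feature lets one learning rate work well for both parameters
w_mean, w_std = weight.mean(), weight.std()
b0, b1 = gradient_descent((weight - w_mean) / w_std, height)
new_w = (72.0 - w_mean) / w_std
print(f"Predicted height for a 72 kg person: {b0 + b1 * new_w:.1f} cm")
```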

24 of 38

Learning Rate (α) — Step Size in Gradient Descent

Why is it important?

Too large: Steps may overshoot the minimum and fail to converge.

Too small: Convergence is slow and takes many iterations.

Common Learning Rates: 0.001, 0.003, 0.01, 0.03, 0.1, 0.3.

25 of 38

Line of Best Fit

Once we're done, we have a nice line of best fit.

26 of 38

Gradient Descent Visualization

** Visualization Courtesy: Gavin Hung, Former CMSC320 Student

The Gradient Descent algorithm finds a local extremum (here, a minimum) of a function.

  • Find the optimal parameters that minimize our loss function for linear regression.

27 of 38

Linear Regression: Key Assumptions (L.I.N.E)

  • Relationship must be linear
  • Residuals must be independent
  • Residuals must be normally distributed
  • Equal variance of residuals

Multiple linear regression

https://www.jmp.com/en/statistics-knowledge-portal/what-is-regression/simple-linear-regression-assumptions

28 of 38

Evaluating Linear Regression Performance

How Do We Know if Our Regression is Any Good?

R-squared (R²) – Coefficient of Determination

Measures how well the model explains the variability in the target variable.

  • Range: 0 to 1 (higher is better)
  • Example: R² = 0.85 means 85% of the variance in Y is explained by X.

Mean Squared Error (MSE)

Average of the squared prediction errors.

  • Penalizes large errors more. Lower = better.

Root Mean Squared Error (RMSE)

Square root of MSE (in same/original units as Y).

  • Easier to interpret than MSE.
  • RMSE = $10K → Predictions are ±$10K on average.
  • Lower = better.

Mean Absolute Error (MAE): Average of absolute prediction errors.

  • Less sensitive to large errors than MSE/RMSE
  • Lower = better.
  • Doesn’t prioritize large errors (may hide high-risk mistakes). Non-differentiable at zero (less ideal for some optimizers).
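A quick sketch of computing these metrics with scikit-learn, using hypothetical actual and predicted values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical actual and predicted values, for illustration only
y_true = np.array([160.0, 166.0, 171.0, 175.0, 181.0])
y_pred = np.array([162.0, 165.0, 170.0, 176.0, 180.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                        # same units as Y
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(f"MSE={mse:.2f}  RMSE={rmse:.2f}  MAE={mae:.2f}  R²={r2:.3f}")
```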

29 of 38

Understanding R² (Coefficient of Determination)

Variance measures how much values spread out from the mean.

R² measures how much of the variance in the target variable (Y) is explained by the model.

SSres/SSE (Sum of Squared Errors): Leftover error not explained by the model

SStot/ TSS (Total Sum of Squares): Total variance in Y

R² = 1 − SSres/SStot

R² = 0: Model explains none of the variance (as good as always guessing the average).

R² = 1: Model explains all the variance (perfect predictions).

Example: If R² = 0.85, then 85% of the variation in Y is explained by X.

  • Ȳ = mean of the actual values
  • Ŷ = predicted values (points on the regression line)
  • Y = actual value

30 of 38

Example: R² (Coefficient of Determination)

31 of 38

Types of Regression

32 of 38

Types of Regression

1. Linear Regression

Models the linear relationship between dependent and independent variables.

Can be simple (one predictor) or multiple (more than one predictor).

2. Polynomial Regression:

Models nonlinear relationships by including polynomial terms of predictors (e.g., x², x³, etc.).

3. Logistic Regression (LATER TOPIC)

Used when the dependent variable is categorical (e.g., binary: yes/no).

Estimates probabilities using the logistic function.

4. Ridge and Lasso Regression

Regularized linear regression techniques that prevent overfitting.

Ridge adds L2 penalty; Lasso adds L1 penalty.

5. Other Variants

Elastic Net, Poisson Regression, Quantile Regression, Robust Regression, etc., tailored for specific data types or assumptions.

33 of 38

Polynomial Regression

  • Useful when data shows a nonlinear pattern.
  • Captures curvature that simple linear regression can’t.
  • More flexible — can fit U-shapes, S-curves, waves, etc.
  • Generally used only for small-degree polynomials
  • Prone to overfitting

An extension of linear regression that fits a curved relationship between the independent and dependent variables.

  • Models nonlinear relationships by including polynomial terms of predictors (e.g., x², x³, etc.).
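A short scikit-learn sketch of degree-2 polynomial regression on hypothetical nonlinear data:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Hypothetical data where y roughly follows a quadratic curve in x
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30).reshape(-1, 1)
y = 2 + 0.5 * x.ravel() ** 2 + rng.normal(scale=2.0, size=30)

# Adds x² as a feature, then fits an ordinary linear model on [x, x²]
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print("Training R²:", model.score(x, y))
```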

34 of 38

Random Forest Regression

You can modify a random forest to do regression.

  • A modification of random forest for predicting continuous values.
  • Ensemble of decision trees combined for better accuracy and stability.

How it works:

  1. Build many decision trees on random subsets of data.
  2. Each tree predicts independently.
  3. Average all tree predictions for final output.

Why use it:

  • Handles non-linear relationships.
  • Robust to outliers and overfitting.
  • Can work with many features.

In practice, people often use this one compared to the other regression types (see the sketch below).
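A short scikit-learn sketch of random forest regression on a hypothetical dataset:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical regression dataset, generated for illustration only
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# 100 trees, each built on a bootstrap sample; their predictions are averaged
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("Test R²:", rf.score(X_test, y_test))
```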

35 of 38

Other Types of Regression:

Lasso (L1) Regression: A type of regression that is useful when there are large numbers of potentially irrelevant features.

  • Forces some coefficients to zero (feature selection).
  • Use Case: Sparse models with few key features.

Selecting a small subset of influential features in high-dimensional datasets.

Ridge (L2) Regression: Does well in situations with multicollinearity, where two or more independent variables are correlated.

  • Shrinks coefficients but never to zero.
  • Use Case: Many small/medium important features.

Works well when predictors are highly correlated (multicollinearity).

ElasticNet Regression: A type of regression that combines the loss functions of Ridge and Lasso regression to get the advantages of both.

  • Best of both: Handles correlated features + feature selection.

Preferred when dataset has many correlated features, combining the advantages of Ridge and Lasso.
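A short scikit-learn sketch of the three regularized models on hypothetical data; note that scikit-learn calls the regularization strength alpha rather than λ:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Hypothetical data with many features, only a few of which are informative
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)                     # L2: shrinks all coefficients
lasso = Lasso(alpha=1.0).fit(X, y)                     # L1: drives some coefficients to exactly zero
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)   # mix of L1 and L2 penalties

print("Non-zero Lasso coefficients:", int((lasso.coef_ != 0).sum()), "of", X.shape[1])
```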

36 of 38

Equations of L1 and L2 Regression:
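In standard form (consistent with the penalty descriptions on the next slide, with λ as the regularization strength and βj as the coefficients):

  • Lasso (L1): minimize Σ(Y − Ŷ)² + λ·Σ|βj|
  • Ridge (L2): minimize Σ(Y − Ŷ)² + λ·Σβj²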

Choosing the Regularization Parameter (lambda, λ): The strength of regularization is critical: too large → underfitting; too small → overfitting. Common methods to select λ:

  • Cross-Validation (CV) – split data into training and validation sets.
  • Grid Search / Random Search – search over a range of λ values.
  • Automated techniques – e.g., scikit-learn’s RidgeCV or LassoCV (see the sketch below).
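A short sketch of the automated approach using scikit-learn's RidgeCV and LassoCV (hypothetical data; the candidate alphas play the role of λ):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV, LassoCV

# Hypothetical dataset, for illustration only
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

# Try several regularization strengths and keep the one chosen by 5-fold cross-validation
ridge = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0], cv=5).fit(X, y)
lasso = LassoCV(alphas=[0.01, 0.1, 1.0, 10.0], cv=5).fit(X, y)

print("Best Ridge alpha:", ridge.alpha_)
print("Best Lasso alpha:", lasso.alpha_)
```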

37 of 38

Regularization

Modifies loss with a penalty

  • Two types: L1 and L2

The hyperparameter C (or λ) controls how much regularization is applied.

Comparison of L1 vs. L2 Regularization:

  1. Penalty: L1 penalizes the sum of the absolute values of the weights; L2 penalizes the sum of the squared weights.
  2. Sparsity: L1 has a sparse solution; L2 has a non-sparse solution.
  3. Solutions: L1 can give multiple solutions; L2 has only one solution.
  4. Feature selection: L1 has built-in feature selection; L2 performs no feature selection.
  5. Outliers: L1 is robust to outliers; L2 is not robust to outliers.
  6. Models: L1 generates simple, interpretable models; L2 gives more accurate predictions when the output variable is a function of all the input variables.
  7. Complexity: L1 is unable to learn complex data patterns; L2 is able to learn complex data patterns.
  8. Computation: L1 is computationally inefficient in non-sparse settings; L2 is computationally efficient because it has an analytical solution.

Tyagi, Neelam “L2 and L1 Regularization in Machine Learning” AnalyticsSteps, 2021, https://www.analyticssteps.com/blogs/l2-and-l1-regularization-machine-learning.

38 of 38

Conclusions:

  • Linear Regression is the foundation: models relationships between variables using a straight line.
  • Model evaluation is key — use metrics like R², MSE, RMSE, and MAE to assess performance.
  • Assumptions (L.I.N.E.) must be checked to ensure validity and reliability of results.
  • Polynomial Regression handles nonlinear relationships by adding higher-degree terms.
  • Regularization Techniques (Ridge, Lasso, Elastic Net) help control overfitting and improve generalization — especially with many or correlated features.
  • The goal of all regression models: find patterns, make accurate predictions, and generalize well to new data.