
Learning with Regression and Trees

  • Learning with Regression:
    • Linear Regression,
    • Multivariate Linear Regression,
    • Logistic Regression.
  • Learning with Trees:
    • Decision Trees,
    • Constructing Decision Trees using Gini Index (Regression),
    • Classification and Regression Trees (CART)
  • Performance Metrics:
    • Confusion Matrix, [Kappa Statistics],
    • Sensitivity, Specificity, Precision, Recall, F-measure, ROC curve


Learning with Regression

Linear Regression in Machine Learning

  • Linear regression is one of the easiest and most popular Machine Learning algorithms.
  • It is a statistical method that is used for predictive analysis.
  • Linear regression makes predictions for continuous/real or numeric variables such as sales, salary, age, product price, etc.
  • The linear regression algorithm shows a linear relationship between a dependent variable (y) and one or more independent variables (x), hence it is called linear regression.
  • The linear regression model provides a sloped straight line representing the relationship between the variables.


  • How can we represent linear regression mathematically?

y = a0 + a1x + ε

Here:

y = dependent variable (target variable)
x = independent variable (predictor variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = linear regression coefficient (scale factor applied to each input value)
ε = random error

The x and y values form the training dataset used to fit the Linear Regression model.


  • Types of Linear Regression

Simple Linear Regression: If a single independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Simple Linear Regression.

Multiple Linear Regression: If more than one independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Multiple Linear Regression.

  • Linear Regression Line

A straight line showing the relationship between the dependent and independent variables is called a regression line. A regression line can show two types of relationship:


  • Positive Linear Relationship: If the dependent variable increases on the Y-axis as the independent variable increases on the X-axis, the relationship is termed a positive linear relationship.


  • Negative Linear Relationship: If the dependent variable decreases on the Y-axis as the independent variable increases on the X-axis, the relationship is called a negative linear relationship.


Finding the best fit line:

  • Our main goal is to find the best fit line, meaning the error between the predicted values and actual values should be minimized. The best fit line will have the least error.
  • Different values for the weights or coefficients of the line (a0, a1) give different regression lines, so we need to calculate the best values for a0 and a1 to find the best fit line; to do this we use a cost function.
  • Cost function: the cost function is used to estimate the values of the coefficients for the best fit line.
  • The cost function optimizes the regression coefficients or weights and measures how a linear regression model is performing.
  • We can use the cost function to find the accuracy of the mapping function, which maps the input variable to the output variable. This mapping function is also known as the Hypothesis function.


Finding the best fit line:

  • For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and actual values. It can be written as:

MSE = (1/N) Σ (yi − (a1xi + a0))²

  • Here, N = total number of observations, yi = actual value, and (a1xi + a0) = predicted value.


Residuals: The distance between the actual value and the predicted value is called the residual. If the observed points are far from the regression line, the residuals will be high, and so the cost function will be high. If the scatter points are close to the regression line, the residuals will be small, and hence the cost function will be small.

Gradient Descent:

  • Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.
  • A regression model uses gradient descent to update the coefficients of the line by reducing the cost function.
  • It starts with randomly selected coefficient values and then iteratively updates them to reach the minimum of the cost function, as the sketch below illustrates.
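The slides do not include code for this step; a minimal sketch of gradient descent for the two coefficients might look like the following (the learning rate and iteration count are arbitrary choices, not from the slides):

    import numpy as nm

    def gradient_descent(x, y, lr=0.01, epochs=1000):
        a0, a1 = 0.0, 0.0          # start from arbitrary (here zero) coefficients
        n = len(x)
        for _ in range(epochs):
            y_pred = a0 + a1 * x
            # gradients of the MSE cost with respect to a0 and a1
            d_a0 = (-2 / n) * nm.sum(y - y_pred)
            d_a1 = (-2 / n) * nm.sum((y - y_pred) * x)
            a0 -= lr * d_a0        # step against the gradient
            a1 -= lr * d_a1
        return a0, a1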


  • Model Performance:
  • The goodness of fit determines how well the regression line fits the set of observations. The process of finding the best model out of various models is called optimization. It can be assessed with the method below:
  • R-squared method:
  • R-squared is a statistical method that determines the goodness of fit.
  • It measures the strength of the relationship between the dependent and independent variables on a scale of 0-100%.
  • A high value of R-squared indicates little difference between the predicted and actual values and hence represents a good model.
  • It is also called the coefficient of determination, or the coefficient of multiple determination for multiple regression.
  • It can be calculated as:

R-squared = Explained variation / Total variation


  • Assumptions of Linear Regression
  • These are some formal checks to perform while building a Linear Regression model, which ensure we get the best possible result from the given dataset.
  • Linear relationship between the features and target:
  • Small or no multicollinearity between the features:
  • Homoscedasticity Assumption:
  • Normal distribution of error terms:
  • No autocorrelations:


Simple Linear Regression in Machine Learning

  • Simple Linear Regression is a type of regression algorithm that models the relationship between a dependent variable and a single independent variable. The relationship shown by a Simple Linear Regression model is linear (a sloped straight line), hence it is called Simple Linear Regression.
  • The key point in Simple Linear Regression is that the dependent variable must be a continuous/real value. However, the independent variable can be measured on continuous or categorical values.
  • Objectives of Simple Linear Regression:
  • Model the relationship between the two variables.

- Such as the relationship between Income and expenditure, experience and Salary, etc.


  • Forecasting new observations. 

- Such as Weather forecasting according to temperature, Revenue of a company according to the investments in a year, etc.

Simple Linear Regression Model:

  • The Simple Linear Regression model can be represented using the below equation:

y = a0 + a1x + ε

a0 = the intercept of the regression line (obtained by putting x = 0)
a1 = the slope of the regression line, which tells whether the line is increasing or decreasing
ε = the error term (negligible for a good model)


  • Implementation of Simple Linear Regression Algorithm using Python
  • Problem Statement example for Simple Linear Regression:
  • Here we are taking a dataset that has two variables:

- salary (dependent variable) and

- experience (Independent variable).

The goals of this problem are:

  • We want to find out if there is any correlation between these two variables.
  • We will find the best fit line for the dataset.
  • We will see how the dependent variable changes as the independent variable changes.
  • We will create a Simple Linear Regression model to find the best fitting line for representing the relationship between these two variables.


Step-1: Data Pre-processing

  • First, we will import the three important libraries, which will help us load the dataset, plot graphs, and create the Simple Linear Regression model.
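The original code images are not reproduced in this handout; a sketch of the imports (the nm/mtp/pd aliases follow the convention referenced later in these slides) is:

    import numpy as nm
    import matplotlib.pyplot as mtp
    import pandas as pd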

  • Next, we will load the dataset into our code:
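A sketch of the loading step; the file name 'Salary_Data.csv' is an assumption, not stated on the slide:

    data_set = pd.read_csv('Salary_Data.csv')  # file name assumed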

  • By executing the above lines of code (Ctrl+Enter), we can inspect the dataset on the Spyder IDE screen by clicking on the variable explorer option.


  • After that, we need to extract the dependent and independent variables from the given dataset. The independent variable is years of experience, and the dependent variable is salary. Below is code for it:
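A sketch consistent with the description (column positions assumed: experience first, salary second):

    x = data_set.iloc[:, :-1].values  # years of experience (independent variable)
    y = data_set.iloc[:, 1].values    # salary (dependent variable)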


  • Next, we will split both variables into the test set and training set. We have 30 observations, so we will take 20 observations for the training set and 10 observations for the test set. We are splitting our dataset so that we can train our model using a training dataset and then test the model using a test dataset. The code for this is given below:
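A sketch of the split; test_size=1/3 reproduces the 20/10 split described above, and random_state=0 is an assumed seed:

    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1/3, random_state=0)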

  • By executing the above code, we will get the x_train, y_train, x_test, and y_test datasets. Consider the images below.

[Images: the resulting x_train/y_train (training set) and x_test/y_test (test set) arrays shown in the variable explorer.]


  • For Simple Linear Regression, we will not use feature scaling, because the Python libraries take care of it in some cases, so we don't need to perform it here. Our dataset is now well prepared, and we can start building the Simple Linear Regression model for the given problem.


  • Step-2: Fitting the Simple Linear Regression to the Training Set
  • Now the second step is to fit our model to the training dataset. To do so, we will import the LinearRegression class of the linear_model library from scikit-learn. After importing the class, we create an object of the class named regressor:
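A sketch of this step:

    from sklearn.linear_model import LinearRegression
    regressor = LinearRegression()
    regressor.fit(x_train, y_train)  # learn the coefficients a0 and a1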


  • In the above code, we have used the fit() method to fit our Simple Linear Regression object to the training set. In the fit() function, we passed x_train and y_train, our training data for the independent and dependent variables. We fit the regressor object to the training set so that the model can learn the correlations between the predictor and target variables.


Step: 3. Prediction of test set result:

  • Our model is now trained on the training set, which contains a dependent variable (salary) and an independent variable (experience), so it is ready to predict the output for new observations. In this step, we will provide the test dataset (new observations) to the model to check whether it can predict the correct output or not.
  • We will create a prediction vector y_pred, and x_pred, which will contain predictions of test dataset, and prediction of training set respectively.
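A sketch consistent with that description:

    y_pred = regressor.predict(x_test)   # predictions for the test set
    x_pred = regressor.predict(x_train)  # predictions for the training set (the slides' x_pred)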


  • On executing the above lines of code, two variables named y_pred and x_pred will be generated in the variable explorer, containing salary predictions for the test set and training set.
  • You can check the variables by clicking on the variable explorer option in the IDE, and compare the result by comparing values from y_pred and y_test. By comparing these values, we can check how well our model is performing.


Step: 4. Visualizing the Training set results:

  • Now in this step, we will visualize the training set result. To do so, we will use the scatter() function of the pyplot library, which we have already imported in the pre-processing step. The scatter () function will create a scatter plot of observations.
  • In the x-axis, we will plot the Years of Experience of employees and on the y-axis, salary of employees. In the function, we will pass the real values of training set, which means a year of experience x_train, training set of Salaries y_train, and color of the observations. Here we are taking a green color for the observation, but it can be any color as per the choice.
  • Now, we need to plot the regression line, so for this, we will use the plot() function of the pyplot library. In this function, we will pass the years of experience for training set, predicted salary for training set x_pred, and color of the line.
  • Next, we will give the title for the plot using the title() function of the pyplot library, passing the name "Salary vs Experience (Training Dataset)".


  • After that, we will assign labels for x-axis and y-axis using xlabel() and ylabel() function.
  • Finally, we will represent all above things in a graph using show(). The code is given below:
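A sketch of the plotting code described above (colors follow the slide text; axis labels are assumed):

    mtp.scatter(x_train, y_train, color="green")  # actual training observations
    mtp.plot(x_train, x_pred, color="red")        # fitted regression line
    mtp.title("Salary vs Experience (Training Dataset)")
    mtp.xlabel("Years of Experience")
    mtp.ylabel("Salary")
    mtp.show()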


  • Output:
  • By executing the above lines of code, we will get the below graph plot as an output.

[Plot: Salary vs Experience (Training Dataset), green observation points with the red regression line.]

  • In the above plot, we can see the actual observations as green dots and the predicted values covered by the red regression line. The regression line shows a correlation between the dependent and independent variables.
  • The goodness of fit of the line can be judged by calculating the difference between actual and predicted values. As most of the observations are close to the regression line, our model is good for the training set.


Step: 5. Visualizing the Test set results:

  • In the previous step, we visualized the performance of our model on the training set. Now, we will do the same for the test set. The complete code remains the same as above, except that we use x_test and y_test instead of x_train and y_train.
  • Here we also change the color of the observations and regression line to differentiate the two plots, but this is optional.
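A sketch, using the colors described in the commentary below:

    mtp.scatter(x_test, y_test, color="blue")  # actual test observations
    mtp.plot(x_train, x_pred, color="red")     # the same fitted regression line
    mtp.title("Salary vs Experience (Test Dataset)")
    mtp.xlabel("Years of Experience")
    mtp.ylabel("Salary")
    mtp.show()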

[Plot: Salary vs Experience (Test Dataset), blue observation points with the red regression line.]

In the plot, the observations are shown in blue and the prediction is given by the red regression line. As we can see, most of the observations are close to the regression line; hence we can say our Simple Linear Regression model is a good model and able to make good predictions.


Multiple Linear Regression

  • Multiple Linear Regression is one of the important regression algorithms which models the linear relationship between a single dependent continuous variable and more than one independent variable.
  • Example: Prediction of CO2 emission based on engine size and number of cylinders in a car.

Some key points about MLR:

    • For MLR, the dependent or target variable (y) must be continuous/real, but the predictor or independent variables may be continuous or categorical.
    • Each feature variable must model a linear relationship with the dependent variable.
    • MLR tries to fit a regression line through a multidimensional space of data points.


  • The Multiple Linear Regression Equation:

y = b0 + b1X1 + b2X2 + ... + bpXp

  • where y is the predicted or expected value of the dependent variable, X1 through Xp are p distinct independent or predictor variables, b0 is the value of y when all of the independent variables (X1 through Xp) are equal to zero, and b1 through bp are the estimated regression coefficients.
  • Assumptions for Multiple Linear Regression:
    • A linear relationship should exist between the Target and predictor variables.
    • The regression residuals must be normally distributed.
    • MLR assumes little or no multicollinearity (correlation between the independent variables) in the data.


  • Implementation of Multiple Linear Regression model using Python:

  • Problem :
  • We have a dataset of 50 start-up companies. This dataset contains five fields: R&D Spend, Administration Spend, Marketing Spend, State, and Profit for a financial year. Our goal is to create a model that can easily determine which company has the maximum profit and which factor most affects a company's profit.
  • Since we need to find the Profit, it is the dependent variable, and the other four variables are independent variables.
  • Below are the main steps of deploying the MLR model:

    • Data Pre-processing Steps
    • Fitting the MLR model to the training set
    • Predicting the result of the test set


Step-1: Data Pre-processing Step:

Importing libraries: First we will import the libraries that will help in building the model. Below is the code for it:
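A sketch (same aliases as before):

    import numpy as nm
    import matplotlib.pyplot as mtp
    import pandas as pd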

Importing the dataset: Now we will import the dataset (50_CompList), which contains all the variables. Below is the code for it:
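A sketch; the exact file name is an assumption based on the slide:

    data_set = pd.read_csv('50_CompList.csv')  # file name assumed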


  • Output: We will get the dataset as shown below.

[Table: preview of the 50_CompList dataset.]

  • In the above output, we can clearly see that there are five variables, of which four are continuous and one is categorical.
  • Extracting dependent and independent Variables:
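A sketch of this step (Profit assumed to be the last column):

    x = data_set.iloc[:, :-1].values  # R&D Spend, Administration Spend, Marketing Spend, State
    y = data_set.iloc[:, 4].values    # Profit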

  • Output: the extracted feature matrix x and target vector y, as shown in the variable explorer.


  • As we can see in the above output, the last column contains a categorical variable, which is not suitable to apply directly for fitting the model. So we need to encode this variable.

Encoding Dummy Variables:

  • To encode the categorical variable into numbers, we can use the LabelEncoder class. But that alone is not sufficient, because the encoded numbers still carry a relational order, which may create a wrong model. So to remove this problem, we use OneHotEncoder, which creates the dummy variables. Below is code for it:
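The slide's original code used LabelEncoder/OneHotEncoder directly; in current scikit-learn the same one-hot encoding is done through ColumnTransformer, so this sketch uses that API instead (State assumed at column index 3):

    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder

    # one-hot encode the State column; pass the numeric columns through unchanged
    ct = ColumnTransformer([("state", OneHotEncoder(), [3])], remainder="passthrough")
    x = ct.fit_transform(x)

Note that the encoded dummy columns are placed first in the transformed array.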


  • Here we are only encoding one independent variable, State, as the other variables are continuous.
  • Output: the State column is replaced by dummy-variable columns.

  • Now, we are writing a single line of code just to avoid the dummy variable trap:
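That line, as a sketch (the dummy columns sit first in x after encoding):

    x = x[:, 1:]  # drop one dummy column to avoid the dummy variable trap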

  • If we do not remove the first dummy variable, it may introduce multicollinearity into the model.


  • Now we will split the dataset into training and test set. The code for this is given below:
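A sketch; test_size=0.2 (40 training / 10 test rows) and the seed are assumptions:

    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)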

  • The above code will split our dataset into a training set and a test set.
  • You can check the output by clicking on the variable explorer option in the Spyder IDE.


Step: 2- Fitting our MLR model to the Training set:

  • Now that we have prepared our dataset, we will fit our regression model to the training set.
  • It is similar to what we did in the Simple Linear Regression model. The code for this will be:
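A sketch of this step:

    from sklearn.linear_model import LinearRegression
    regressor = LinearRegression()
    regressor.fit(x_train, y_train)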

  • Now we have successfully trained our model using the training dataset. In the next step, we will test the performance of the model using the test dataset.


Step: 3- Prediction of Test set results:

  • The last step for our model is checking the performance of the model.
  • We will do it by predicting the test set result. For prediction, we will create a y_pred vector. Below is the code for it:
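A sketch of the prediction step:

    y_pred = regressor.predict(x_test)  # predicted profits for the test set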

  • By executing the above lines of code, a new vector will be generated under the variable explorer option. We can test our model by comparing the predicted values and test set values.

[Output: the predicted profits (y_pred) alongside the actual test values (y_test).]

  • In the above output, we have the predicted result set and the test set. We can check model performance by comparing these two values index by index. For example, the first index has a predicted value of $103,015 profit and a test/real value of $103,282 profit. The difference is only $267, which is a good prediction, so our model is complete.
  • We can also check the score for training dataset and test dataset. Below is the code for it:
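A sketch; note that score() for a regressor returns R², which the slide reads as a percentage accuracy:

    print("Train Score:", regressor.score(x_train, y_train))  # R² on the training data
    print("Test Score:", regressor.score(x_test, y_test))     # R² on the test data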

  • Output: the scores are approximately 0.95 (train) and 0.93 (test).

  • The above score tells us that our model is about 95% accurate with the training dataset and 93% accurate with the test dataset.


Applications of Multiple Linear Regression:

There are mainly two applications of Multiple Linear Regression:

    • Effectiveness of an independent variable on prediction
    • Predicting the impact of changes


Logistic Regression

  • Logistic regression is one of the most popular Machine Learning algorithms, which comes under the Supervised Learning technique.
  • It is used for predicting the categorical dependent variable using a given set of independent variables.
  • Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must be a categorical or discrete value: Yes or No, 0 or 1, true or false, etc. But instead of giving the exact values 0 and 1, it gives probabilistic values that lie between 0 and 1.
  • Logistic Regression is quite similar to Linear Regression except in how it is used. Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
  • In Linear Regression, the output is the weighted sum of the inputs. Logistic Regression is a generalized Linear Regression in the sense that we don't output the weighted sum of inputs directly; instead we pass it through a function that maps any real value to a value between 0 and 1.


  • In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic function, which predicts two maximum values (0 or 1).
  • The curve from the logistic function indicates the likelihood of something such as whether the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
  • Logistic Regression is a significant machine learning algorithm because it has the ability to provide probabilities and classify new data using continuous and discrete datasets.
  • Logistic Regression can be used to classify the observations using different types of data and can easily determine the most effective variables used for the classification.


  • Logistic regression uses the concept of predictive modeling as regression; therefore, it is called logistic regression. But it is used to classify samples, so it falls under the classification algorithms.

[Figure: the S-shaped logistic (sigmoid) curve.]

  • Logistic Function (Sigmoid Function):
  • The sigmoid function is a mathematical function used to map the predicted values to probabilities: f(x) = 1 / (1 + e^(−x)).
  • It maps any real value to another value within the range 0 to 1.
  • The value of the logistic regression must be between 0 and 1 and cannot go beyond this limit, so it forms a curve like the "S" shape.
  • In logistic regression, we use the concept of a threshold value, which defines the probability of either 0 or 1: values above the threshold tend to 1, and values below the threshold tend to 0.


Assumptions for Logistic Regression:

    • The dependent variable must be categorical in nature.
    • The independent variable(features) should not have multi-collinearity.

Logistic Regression Equation:

  • The equation of the straight line can be written as:

y = b0 + b1x1 + b2x2 + ... + bnxn

  • In Logistic Regression, y can be between 0 and 1 only, so let's divide by (1 − y):

y / (1 − y)   (0 for y = 0, and infinity for y = 1)


  • But we need a range between −infinity and +infinity; taking the logarithm of the equation, it becomes:

log[y / (1 − y)] = b0 + b1x1 + b2x2 + ... + bnxn

  • The above equation is the final equation for Logistic Regression.


Type of Logistic Regression:

  • Logistic Regression can be classified into three types:

    • Binomial: In binomial Logistic regression, there can be only two possible types of the dependent variables, such as 0 or 1, Pass or Fail, etc.

    • Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dogs", or "sheep".

    • Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of dependent variables, such as "low", "Medium", or "High".


  •  Implementation of Logistic Regression (Binomial)

Example: A dataset is given that contains information about various users, obtained from a social networking site. A car manufacturer has recently launched a new SUV, and the company wants to check how many users from the dataset want to purchase the car.


We will predict the Purchased variable (dependent variable) using Age and EstimatedSalary (independent variables).


  • Steps in Logistic Regression: 

      • Data Pre-processing step
      • Fitting Logistic Regression to the Training set
      • Predicting the test result
      • Test accuracy of the result (creation of the confusion matrix)
      • Visualizing the test set result.


  1. Data Pre-processing step:
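The code images are not reproduced; a sketch of this step (the file name 'User_Data.csv' is an assumption):

    import numpy as nm
    import matplotlib.pyplot as mtp
    import pandas as pd

    data_set = pd.read_csv('User_Data.csv')  # file name assumed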

 



  • Now, we will extract the dependent and independent variables from the given dataset.
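A sketch; the column indices for Age and EstimatedSalary are assumed:

    x = data_set.iloc[:, [2, 3]].values  # Age, EstimatedSalary
    y = data_set.iloc[:, 4].values       # Purchased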

 


  • Now we will split the dataset into a training set and a test set.
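A sketch; test_size=0.25 is an assumption consistent with the 100 test observations counted in the confusion matrix later:

    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)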


  • In logistic regression, we will do feature scaling because we want accurate predictions. Here we will only scale the independent variables, because the dependent variable has only 0 and 1 values.
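A sketch of the scaling step:

    from sklearn.preprocessing import StandardScaler
    st_x = StandardScaler()
    x_train = st_x.fit_transform(x_train)  # fit the scaler on the training data
    x_test = st_x.transform(x_test)        # apply the same scaling to the test data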


  • The scaled output: x_train and x_test now contain standardized values.


2. Fitting Logistic Regression to the Training set:

  •  For providing training or fitting the model to the training set, we will import the LogisticRegression class of the sklearn library.
  • After importing the class, we will create a classifier object and use it to fit the logistic regression model to the training set:
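A sketch of this step:

    from sklearn.linear_model import LogisticRegression
    classifier = LogisticRegression(random_state=0)
    classifier.fit(x_train, y_train)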



3. Predicting the Test Result
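A sketch of the prediction step:

    y_pred = classifier.predict(x_test)  # predicted Purchased values (0 or 1)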


The output shows, for each test observation, whether the user is predicted to purchase the car or not.


4. Test Accuracy of the result

  • Now we will create the confusion matrix here to check the accuracy of the classification.
  • To create it, we need to import the confusion_matrix function of the sklearn library.
  • After importing the function, we will call it and store the result in a new variable cm. The function takes two parameters, mainly y_true (the actual values) and y_pred (the values predicted by the classifier).
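A sketch of this step:

    from sklearn.metrics import confusion_matrix
    cm = confusion_matrix(y_test, y_pred)  # rows = actual classes, columns = predicted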


We can find the accuracy of the predicted result by interpreting the confusion matrix. From the above output, we can see that 65 + 24 = 89 predictions are correct and 8 + 3 = 11 are incorrect.


5. Visualizing the training set result

  • To visualize the result, we will use the ListedColormap class of the matplotlib library.
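The code image is not reproduced; a sketch consistent with the description on the next slide (the nm/mtp aliases and the 0.01 grid step match the text below):

    from matplotlib.colors import ListedColormap

    x_set, y_set = x_train, y_train
    # rectangular grid spanning each feature's range, padded by 1, at 0.01 resolution
    x1, x2 = nm.meshgrid(
        nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
        nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
    # colour each grid point by the class the classifier predicts for it
    mtp.contourf(x1, x2,
                 classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
                 alpha=0.75, cmap=ListedColormap(('purple', 'green')))
    mtp.xlim(x1.min(), x1.max())
    mtp.ylim(x2.min(), x2.max())
    # plot the actual training observations on top of the decision regions
    for i, j in enumerate(nm.unique(y_set)):
        mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                    color=ListedColormap(('purple', 'green'))(i), label=j)
    mtp.title('Logistic Regression (Training set)')
    mtp.xlabel('Age')
    mtp.ylabel('Estimated Salary')
    mtp.legend()
    mtp.show()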


  • In the above code, we have imported the ListedColormap class of the matplotlib library to create the colormap for visualizing the result. We created two new variables, x_set and y_set, to replace x_train and y_train. After that, we used the nm.meshgrid command to create a rectangular grid, which spans from each feature's minimum value minus 1 to its maximum value plus 1, with pixel points at 0.01 resolution.
  • To create a filled contour, we used the mtp.contourf command; it creates regions of the provided colors (purple and green). In this function, we passed classifier.predict to show the class predicted by the classifier at each grid point.

[Plot: Logistic Regression (Training set), purple and green decision regions with the training observations.]

  • In the above graph, we can see that there are some Green points within the green region and Purple points within the purple region.
  • All these data points are the observation points from the training set, which shows the result for purchased variables.
  • This graph is made by using two independent variables i.e., Age on the x-axis and Estimated salary on the y-axis.


The purple points are observations for which Purchased (the dependent variable) is probably 0, i.e., users who did not purchase the SUV car.

  • The green points are observations for which Purchased is probably 1, i.e., users who purchased the SUV car.
  • We can also estimate from the graph that younger users with low salaries mostly did not purchase the car, whereas older users with high estimated salaries mostly did.
  • But there are some purple points in the green region (buying the car) and some green points in the purple region (not buying the car): some younger users with a high estimated salary did purchase the car, whereas some older users with a low estimated salary did not.


  • The goal of the classifier:
  • We have successfully visualized the training set result for the logistic regression. Our goal for this classification is to separate the users who purchased the SUV car from those who did not. From the output graph, we can clearly see the two regions (purple and green) with the observation points: the purple region is for the users who didn't buy the car, and the green region is for the users who purchased the car.


Visualizing the test set result:

  • Our model is well trained using the training dataset. Now we will visualize the result for new observations (the test set). The code for the test set remains the same as above, except that here we use x_test and y_test instead of x_train and y_train.

[Plot: Logistic Regression (Test set), purple and green regions with the test observations.]

The above graph shows the test set result. As we can see, the graph is divided into two regions (purple and green), and the green observations are in the green region while the purple observations are in the purple region. So we can say it is a good prediction and model. Some of the green and purple data points are in different regions, which can be ignored, as we have already measured this error using the confusion matrix (11 incorrect outputs). Hence our model is pretty good and ready to make new predictions for this classification problem.


Classification and Regression Trees (CART)


  • Step 1: Start at the root node with all training instances.
  • Step 2: Select an attribute on the basis of a splitting criterion (Gini index, Gain Ratio, or another impurity metric).
  • Step 3: Partition the instances according to the selected attribute, recursively.

      • Partitioning stops when:
        • there are no examples left;
        • all examples for a given node belong to the same class;
        • there are no remaining attributes for further partitioning (the majority class becomes the leaf).


  • To select the best attribute, we compute an error measure for each candidate attribute; the attribute with the minimum error is selected for splitting. A small sketch of one such measure (the Gini index) follows.
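The worked example on the following slides is shown in figures; as a reference, the Gini index of a candidate split can be computed as in this small sketch (an illustration, not the slides' exact procedure):

    def gini(groups):
        """Weighted Gini index of a split; each group is a list of class labels."""
        total = sum(len(g) for g in groups)
        score = 0.0
        for g in groups:
            if not g:
                continue
            # class proportions within this group
            p = [g.count(c) / len(g) for c in set(g)]
            # group impurity, weighted by group size
            score += (1 - sum(pi * pi for pi in p)) * len(g) / total
        return score

    # a split that separates the classes perfectly has Gini 0.0
    print(gini([['yes', 'yes'], ['no', 'no']]))   # 0.0
    print(gini([['yes', 'no'], ['yes', 'no']]))   # 0.5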


If there is ambiguity (a tie between attributes), we compare their individual errors.

[Slides 98-137: Classification and Regression Trees (CART), figures only.]