Learning with Regression and Tree
Regression Analysis in Machine learning
Terminologies Related to the Regression Analysis:
Why do we use Regression Analysis?
Types of Regression
There are various types of regression used in data science and machine learning.
Each type has its own importance in different scenarios, but at the core, all regression methods analyze the effect of the independent variables on the dependent variable. Here we discuss some important types of regression, which are given below:
Types of Regression Models:
Linear Regression:
Here, Y = dependent variable (target variable),
X = independent variable (predictor variable),
a and b are the linear coefficients.
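As a quick illustration, here is a minimal sketch of fitting such a line with scikit-learn (the library choice and the sample data are assumptions made only for illustration):

```python
# Minimal sketch: fitting a simple linear model Y = a + bX (hypothetical data).
import numpy as np
from sklearn.linear_model import LinearRegression

# X: independent (predictor) variable, y: dependent (target) variable
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

model = LinearRegression()
model.fit(X, y)

print("a (intercept):", model.intercept_)   # estimated intercept a
print("b (coefficient):", model.coef_[0])   # estimated slope b
print("prediction for X = 6:", model.predict([[6]]))
```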
Types of Linear Regression
Linear regression can be further divided into two types of algorithm:
Some popular applications of linear regression are:
Logistic Regression:
The function can be represented as:

f(x) = 1 / (1 + e^(-x))
When we provide the input values (data) to the function, it gives the S-curve as follows:
It uses the concept of threshold levels: values above the threshold level are rounded up to 1, and values below the threshold level are rounded down to 0.
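The thresholding described above can be sketched roughly as follows (a minimal illustration assuming the common 0.5 threshold; the sample values are made up):

```python
# Sketch of the logistic (sigmoid) function with a 0.5 threshold.
import numpy as np

def sigmoid(z):
    # Maps any real value into the (0, 1) range, producing the S-curve.
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
probabilities = sigmoid(z)

# Values at or above the threshold become class 1, values below become class 0.
threshold = 0.5
predictions = (probabilities >= threshold).astype(int)

print(probabilities)  # roughly [0.047, 0.269, 0.5, 0.731, 0.953]
print(predictions)    # [0, 0, 1, 1, 1]
```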
There are three types of logistic regression:
Polynomial Regression:
Support Vector Regression:
Support Vector Machine is a supervised learning algorithm that can be used for both regression and classification problems. When it is used for regression problems, it is termed Support Vector Regression.
Support Vector Regression is a regression algorithm that works with continuous variables. Below are some keywords used in Support Vector Regression:
In SVR, we try to determine a hyperplane with a maximum margin, so that the maximum number of data points is covered within that margin. The main goal of SVR is to include as many data points as possible within the boundary lines, with the hyperplane (best-fit line) passing through the maximum number of data points. Consider the image below:
Here, the blue line is called the hyperplane, and the other two lines are known as the boundary lines.
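A minimal SVR sketch with scikit-learn is shown below; the kernel, C, epsilon, and sample data are assumptions chosen only for illustration:

```python
# Sketch of Support Vector Regression on hypothetical continuous data.
import numpy as np
from sklearn.svm import SVR

X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([1.2, 1.9, 3.2, 3.9, 5.1, 6.2])

# epsilon controls the width of the margin (the boundary lines) around the
# hyperplane (best-fit line); points inside the margin incur no penalty.
model = SVR(kernel="rbf", C=10.0, epsilon=0.5)
model.fit(X, y)

print(model.predict([[3.5]]))
```

Increasing epsilon widens the margin so that more data points fall between the boundary lines, which matches the goal described above.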
Decision Tree Regression:
Decision Tree is a supervised learning algorithm which can be used for solving both classification and regression problems.
It can solve problems involving both categorical and numerical data.
Decision Tree regression builds a tree-like structure in which each internal node represents a "test" on an attribute, each branch represents the outcome of the test, and each leaf node represents the final decision or result.
A decision tree is constructed starting from the root node/parent node (the dataset), which splits into left and right child nodes (subsets of the dataset). These child nodes are further divided into their own child nodes, becoming the parent nodes of those nodes.
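A minimal sketch of a decision tree regressor with scikit-learn (the max_depth value and the data are assumptions for illustration):

```python
# Sketch of Decision Tree regression on hypothetical data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([1.5, 1.7, 3.2, 3.4, 5.0, 5.1])

# Each internal node tests an attribute (here, a threshold on x); each leaf
# stores the predicted value (the mean target of the training points it holds).
model = DecisionTreeRegressor(max_depth=2)
model.fit(X, y)

print(model.predict([[2.5], [5.5]]))
```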
Ridge Regression:
Lasso Regression:
Overfitting and Underfitting in Machine Learning
Before understanding overfitting and underfitting, let's understand some basic terms that will help us understand this topic well:
Overfitting
How to avoid Overfitting in a Model
Underfitting
How to avoid underfitting:
Goodness of Fit
Bias and Variance in Machine Learning
What are Errors in Machine Learning?
What is Bias?
What is a Variance Error?
Variance tells us how much a random variable differs from its expected value.
A model that shows high variance learns a lot from the training dataset and performs well on it, but does not generalize well to unseen data. As a result, such a model gives good results on the training dataset but shows high error rates on the test dataset, as the sketch below illustrates.
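As a rough illustration (the choice of an unpruned decision tree and the synthetic data are assumptions made only for this sketch):

```python
# Sketch: a high-variance model (an unpruned decision tree) fits the training
# data almost perfectly but shows noticeably higher error on unseen test data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeRegressor(random_state=0)  # grown until every leaf is pure
model.fit(X_train, y_train)

print("train MSE:", mean_squared_error(y_train, model.predict(X_train)))  # near zero
print("test MSE:", mean_squared_error(y_test, model.predict(X_test)))     # noticeably higher
```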
Since, with high variance, the model learns too much from the dataset, it leads to overfitting of the model. A model with high variance has the following problems:
Ways to reduce High Bias:
High bias mainly occurs due to an overly simple model. Below are some ways to reduce high bias:
Ways to Reduce High Variance:
Different Combinations of Bias-Variance
There are four possible combinations of bias and variance:
Bias-Variance Trade-Off
For an accurate prediction, an algorithm needs both low variance and low bias. But achieving both at once is difficult, because bias and variance are related to each other:
Hence, the Bias-Variance trade-off is about finding the sweet spot to make a balance between bias and variance errors.
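One way to see this trade-off is to vary model complexity and watch the training and test errors move; the sketch below assumes synthetic data and uses polynomial degree as a stand-in for complexity:

```python
# Sketch: bias-variance trade-off as model complexity (polynomial degree) grows.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(1)
X = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 40)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

for degree in (1, 4, 15):  # too simple, balanced, too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          "train MSE:", mean_squared_error(y_train, model.predict(X_train)),
          "test MSE:", mean_squared_error(y_test, model.predict(X_test)))
```

Typically the degree-1 model has high error on both sets (high bias), the very high degree has low training error but high test error (high variance), and an intermediate degree sits near the sweet spot.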
Linear Regression
Types of Linear Regression
Linear Regression Line
A straight line showing the relationship between the dependent and independent variables is called a regression line. A regression line can show two types of relationship:
Negative Linear Relationship:
If the dependent variable decreases on the Y-axis while the independent variable increases on the X-axis, such a relationship is called a negative linear relationship.
Positive Linear Relationship:
If the dependent variable increases on the Y-axis as the independent variable increases on the X-axis, such a relationship is termed a positive linear relationship.
Finding the best fit line:
When working with linear regression, our main goal is to find the best-fit line, which means the error between the predicted values and the actual values should be minimized. The best-fit line will have the least error.
Different values for the weights or coefficients of the line (a0, a1) give different regression lines, so we need to calculate the best values for a0 and a1 to find the best-fit line; to calculate this, we use a cost function.
Cost function-
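A common choice of cost function for linear regression is the Mean Squared Error (MSE). The sketch below (with hypothetical data and the variable names a0, a1 from the text above) shows how it can be evaluated for a candidate line:

```python
# Sketch: Mean Squared Error (MSE) cost for candidate coefficients a0, a1.
import numpy as np

def mse_cost(a0, a1, x, y):
    # Predicted values from the candidate regression line: y_hat = a0 + a1 * x
    y_hat = a0 + a1 * x
    # Average squared difference between predicted and actual values
    return np.mean((y - y_hat) ** 2)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

print(mse_cost(1.0, 2.0, x, y))  # cost of the line y_hat = 1.0 + 2.0 * x
```

The best-fit line corresponds to the values of a0 and a1 that make this cost as small as possible.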
What is Linear Regression?
Linear Regression Terminologies
Cost Function
The best fit line can be based on the linear equation given below.
Advantages And Disadvantages

| Advantages | Disadvantages |
| --- | --- |
| Linear regression performs exceptionally well for linearly separable data | Assumes a linear relationship between the dependent and independent variables |
| Easier to implement, interpret, and efficient to train | It is often quite prone to noise and overfitting |
| It handles overfitting pretty well using dimensionality reduction techniques, regularization, and cross-validation | Linear regression is quite sensitive to outliers |
| Allows extrapolation beyond a specific data set | It is prone to multicollinearity |
Linear Regression Formula
Y= a + bX
We will find the value of a and b by using the below formulas (where x̄ and ȳ denote the means of x and y):

b = ∑[(x − x̄)(y − ȳ)] / ∑[(x − x̄)²]
a = ȳ − b·x̄
Where,
x and y are two variables on the regression line.
b = Slope of the line.
a = y-intercept of the line.
x = Values of the first data set.
y = Values of the second data set.
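A minimal sketch of computing a and b with the least-squares formulas above (the data values here are hypothetical):

```python
# Sketch: slope b and intercept a via the least-squares formulas
#   b = sum((x_i - x_mean) * (y_i - y_mean)) / sum((x_i - x_mean)^2)
#   a = y_mean - b * x_mean
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

x_mean, y_mean = x.mean(), y.mean()
b = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
a = y_mean - b * x_mean

print("slope b:", b)       # approximately 1.94 for this data
print("intercept a:", a)   # approximately 0.30 for this data
```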
Least Square Regression Line or Linear Regression Line
Y= B0+B1X
Where
B0 is a constant
B1 is the regression coefficient
ŷ = b0+b1x
where b0 is a constant
b1 is the regression coefficient,
x is the independent variable,
ŷ is known as the predicted value of the dependent variable.
Properties of Linear Regression
For the regression line where the regression parameters b0 and b1 are defined, the following properties are applicable:
Regression Coefficient
The regression coefficient is given by the equation:
Y= B0+B1X
Where
B0 is a constant
B1 is the regression coefficient
Given below is the formula to find the value of the regression coefficient:

B1 = b1 = ∑[(xi − x̄)(yi − ȳ)] / ∑[(xi − x̄)²]

Where xi and yi are the observed data sets, and x̄ and ȳ are their mean values.
Example
Least Squares method
Multiple Linear Regression
Multiple Linear Regression is one of the important regression algorithms which models the linear relationship between a single dependent continuous variable and more than one independent variable.
Some key points about MLR:
MLR equation:
The target variable (Y) in Multiple Linear Regression is a linear combination of multiple predictor variables x1, x2, x3, ..., xn. Since it is an extension of Simple Linear Regression, the simple linear regression equation is extended in the same way, and the equation becomes:
Y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn ...(a)
Where,
Y= Output/Response variable
b0, b1, b2, b3, ..., bn = Coefficients of the model.
x1, x2, x3, x4, ... = Various independent/feature variables
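A minimal Multiple Linear Regression sketch with scikit-learn (two hypothetical feature variables x1 and x2 predicting Y):

```python
# Sketch of Multiple Linear Regression: Y = b0 + b1*x1 + b2*x2 (hypothetical data).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 5.0],
])
y = np.array([8.0, 7.5, 16.0, 15.5, 21.0])

model = LinearRegression()
model.fit(X, y)

print("b0 (intercept):", model.intercept_)
print("b1, b2 (coefficients):", model.coef_)
print("prediction for x1 = 6, x2 = 4:", model.predict([[6.0, 4.0]]))
```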
Applications of Multiple Linear Regression:
Multiple Linear Regression has primarily two applications: