Linear Regression
Created by Shane Dalton
Contents
Simple Linear Regression
Regression Problem
Derivation of minimization
Residuals
Residuals
Regression Assumptions
In order to properly assess the accuracy of our regression we need to assume a few things about our data:
If any of these assumptions is violated, it will affect our ability to interpret our results and produce valid statistical measures of their accuracy. Each unmet requirement contributes a different kind of error to our estimate, and some violations render the results useless entirely.
Regression Assumptions
The data used to fit the model are representative of the population you are seeking to predict
Ex:
Regression Assumptions
The true underlying relationship between X and Y is linear (or can be transformed to linear)
(Data like this can easily be clustered with spectral clustering or classified with an SVM classifier)
Regression Assumptions
The variance of the residuals is constant (not heteroscedastic)
Regression Assumptions
The residuals must be independent (there is no relationship between residuals, i.e. one residual does not affect the next in any way)
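As a quick sketch of how the residual assumptions above might be checked in practice (NumPy on synthetic data; the two-halves variance comparison is a rough heuristic, not a formal test):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)  # synthetic linear data

# Fit a simple linear regression and compute its residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Rough homoscedasticity check: residual variance should be similar
# across the range of x (here, the low half vs. the high half)
var_low = residuals[: x.size // 2].var()
var_high = residuals[x.size // 2 :].var()

# Durbin-Watson statistic: values near 2 suggest independent residuals
dw = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)
print(var_low, var_high, dw)
```

A formal analysis would use dedicated tests (e.g. Breusch-Pagan for heteroscedasticity), but the same residual quantities are the starting point.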
Common Pitfalls
Common Pitfalls
Multiple Linear Regression
Multiple Linear Regression
Multiple linear regression finds the hyperplane of best fit in a higher-dimensional space by applying the same minimization from simple linear regression in n dimensions, where n is the number of predictive variables used in the regression. The fit is typically computed by gradient descent optimization, which is more efficient than computing a matrix inverse when n is large.
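A minimal sketch of the gradient-descent fit described above, assuming NumPy and a plain mean-squared-error objective (the learning rate and epoch count are illustrative):

```python
import numpy as np

def fit_linear_gd(X, y, lr=0.01, epochs=1000):
    """Fit y ~ X @ w + b by gradient descent on mean squared error."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        err = X @ w + b - y                # residuals for current parameters
        w -= lr * (2.0 / n) * (X.T @ err)  # gradient of MSE w.r.t. w
        b -= lr * (2.0 / n) * err.sum()    # gradient of MSE w.r.t. b
    return w, b

# Recover known coefficients from noiseless synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 3.0
w, b = fit_linear_gd(X, y, lr=0.05, epochs=2000)
```

For small n the normal equations (a single matrix solve) are simpler; gradient descent pays off as the number of predictors and observations grows.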
Multiple Linear Regression
Extensions of Linear Regression
The basic assumption of linearity can be relaxed by transforming the response variables, predictive variables, or both.
Polynomial regression is the next step up in complexity from linear regression, and more general methods such as basis function models can extend this arbitrarily.
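As a sketch of the transformation idea above: polynomial regression is just ordinary least squares on transformed predictor columns, so the model stays linear in its coefficients (NumPy, synthetic data):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-3, 3, 120)
y = 0.5 * x**2 - x + 2.0 + rng.normal(scale=0.1, size=x.size)

# Transform the single predictor into polynomial features [x^2, x, 1],
# then solve the ordinary least-squares problem on the new design matrix
X = np.column_stack([x**2, x, np.ones_like(x)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
```

The same pattern covers other basis functions (splines, radial bases): only the columns of the design matrix change, not the fitting procedure.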
Common Pitfalls (multicollinearity)
Feature Selection (wrapper methods)
Variance Vs Bias Trade Off
Validation Methods
Overfitting
Overfitting
Overfitting
Overfitting
Validation Set Testing
Training/Test Split Validation
Leave One Out Cross Validation
K-fold Validation
Comparison of Validation Methods
Validation set: the simplest and least computationally intensive method, but prone to high variance and bias depending on which random subset of the data is chosen for training and validation.
K-fold validation: values of k in the range 5-10 are empirically observed to work well; not very computationally intensive even with large datasets.
LOOCV: the most computationally intensive method; it produces the least biased estimate of the model's error but has greater variance.
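A from-scratch sketch of the k-fold procedure compared above (NumPy on synthetic data; `kfold_rmse` is an illustrative helper, not a library function):

```python
import numpy as np

def kfold_rmse(x, y, k=5, degree=1):
    """Estimate test RMSE of a polynomial fit with k-fold cross-validation."""
    idx = np.random.default_rng(3).permutation(len(y))
    folds = np.array_split(idx, k)  # k disjoint validation folds
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coef = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coef, x[test])
        errors.append(np.sqrt(np.mean((pred - y[test]) ** 2)))
    return float(np.mean(errors))  # average RMSE over the k held-out folds

rng = np.random.default_rng(4)
x = np.linspace(0, 5, 100)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=x.size)
rmse = kfold_rmse(x, y, k=5, degree=1)
```

Setting k = len(y) turns this into LOOCV, and a single 80/20 split is the validation-set method, so the three approaches in the comparison differ only in how the folds are drawn.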
Feature Processing (normalization)
Iowa Housing Market Sale Price Predictions
Shane Dalton
Abstract
Given decision tree and linear regression models, which offers the best accuracy in predicting a home's price given a suitable training set? The dataset analyzed is the Ames, Iowa real estate dataset. By preparing the data and training multiple models, a relatively accurate estimator of a house's price was created and evaluated against held-out test data with good results. The 2nd-order polynomial linear regression model achieved the best accuracy by RMSE score, beating the decision tree and 1st-order regression models (RMSEs of 28,484, 42,557, and 33,374 respectively).
Motivation
A common problem in statistical analysis is determining an accurate estimator for a presently unknown target variable given an input of known features. This has broad applications across all scientific and business domains, and helps businesses and scientists create solutions. In the case of the data in this analysis, the predictive model could be used to perform housing appraisals without the intervention of a human being, or to augment human appraisers. For a real estate company, a reliable estimator of a home's value would ensure consistent profitability and pass the best value on to consumers in the market.
Dataset(s)
Data Preparation and Cleaning
Data Preparation and Cleaning
Research Question(s)
Methods
Methods
Methods
Methods
Findings
Findings
Findings
Findings
Findings
Findings
Limitations
Possible improvements
Conclusions
Acknowledgements
References