1 of 54

Linear Regression

Created by Shane Dalton

2 of 54

Contents

  • Simple Linear Regression
  • Mathematics of Simple Regression
  • Interpretations of Residuals
  • Assumptions of regression
  • Common pitfalls
  • Multiple Linear Regression
  • Extensions of linear regression
  • Overfitting
  • Variance vs. Bias Trade-off
  • Validation Methods
  • Feature Selection
  • Iowa Housing Market Dataset Analysis

3 of 54

Simple Linear Regression

  • A statistical method that allows us to study relationships between two variables.
  • Ex: After creating a model using training data, we can use this model to predict outputs given new inputs.
  • Capable of modeling statistical relationships between observed data, assuming the data meet certain criteria. E.g., given x, what value of y should we expect?
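A minimal fit-and-predict sketch with scikit-learn, using hypothetical synthetic data rather than anything from these slides:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical training data: y is roughly 2x + 1 plus noise
    rng = np.random.default_rng(0)
    x_train = rng.uniform(0, 10, size=(50, 1))
    y_train = 2.0 * x_train.ravel() + 1.0 + rng.normal(0, 1, size=50)

    model = LinearRegression().fit(x_train, y_train)

    # Given a new x, what value of y should we expect?
    x_new = np.array([[4.2]])
    print(model.predict(x_new))   # expected value of y at x = 4.2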

4 of 54

Regression Problem

 

5 of 54

Derivation of minimization
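A sketch of the standard least-squares minimization for simple regression, assuming the usual model y_i = beta_0 + beta_1 x_i + epsilon_i (the exact steps on the original slide may differ):

    \min_{\beta_0, \beta_1} \; S(\beta_0, \beta_1) = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2

Setting the partial derivatives to zero gives the normal equations:

    \frac{\partial S}{\partial \beta_0} = -2 \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right) = 0,
    \qquad
    \frac{\partial S}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i \left( y_i - \beta_0 - \beta_1 x_i \right) = 0

Solving these two equations yields the familiar estimates:

    \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
    \qquad
    \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}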

 

 

6 of 54

Residuals

7 of 54

Residuals

 

8 of 54

Regression Assumptions

In order to properly assess the accuracy of our regression, we need to assume a few things about our data:

  1. The data used to fit the model are representative of the population you are seeking to predict
  2. The true underlying relationship between X and Y is linear (or can be transformed to linear); that is, the value to be predicted is reasonably close to a linear combination of the predictive variables.
  3. The variance of the residuals is constant (not heteroscedastic)
  4. The residuals must be independent (there is no relation among residuals, i.e. one residual does not affect the next in any way)
  5. The residuals must be randomly distributed

If any of these conditions is untrue, it will affect our ability to interpret our results and produce valid statistical measures of their accuracy. Each unmet requirement contributes a different kind of error to our estimates, and some violations render the results useless. A sketch of how the residual assumptions can be checked numerically follows below.
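A minimal sketch of checking two of these assumptions with statsmodels, using synthetic data; the particular tests shown here (Breusch-Pagan, Durbin-Watson) are common choices, not something prescribed by these slides:

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_breuschpagan
    from statsmodels.stats.stattools import durbin_watson

    # Hypothetical data: y linear in X plus homoscedastic noise
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 2))
    y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1, size=200)

    X_const = sm.add_constant(X)                  # add intercept column
    fit = sm.OLS(y, X_const).fit()

    # Assumption 3: constant residual variance (Breusch-Pagan test)
    _, bp_pvalue, _, _ = het_breuschpagan(fit.resid, X_const)

    # Assumption 4: independent residuals (Durbin-Watson near 2 suggests no autocorrelation)
    dw = durbin_watson(fit.resid)

    print(f"Breusch-Pagan p-value: {bp_pvalue:.3f}, Durbin-Watson: {dw:.2f}")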

9 of 54

Regression Assumptions

The data used to fit the model are representative of the population you are seeking to predict

Ex:

  • Using linear regression to estimate values that lie far outside the range of the training samples will produce wildly inaccurate results
  • As the model builder, it is always your responsibility to make sure your data make sense; this is the primary purpose of exploratory data analysis.

10 of 54

Regression Assumptions

The true underlying relationship between X and Y is linear (or can be transformed to linear)

(This can be clustered with spectral clustering or classified with an SVM classifier easily)

11 of 54

Regression Assumptions

The variance of the residuals is constant (not heteroscedastic)

12 of 54

Regression Assumptions

The residuals must be independent (there is no relation among residuals, i.e. one residual does not affect the next in any way)

13 of 54

Common Pitfalls


14 of 54

Common Pitfalls

15 of 54

Multiple Linear Regression

16 of 54

Multiple Linear Regression

Multiple linear regression finds the best-fit hyperplane in a higher-dimensional space by applying the same least-squares minimization from simple linear regression to n predictive variables. For large n the fit is typically computed by iterative methods such as gradient descent, which is more efficient than computing a matrix inverse. A sketch of both approaches follows below.
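A minimal sketch contrasting the closed-form normal-equation solution with gradient descent on synthetic data; the learning rate and iteration count are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(2)
    X = np.c_[np.ones(500), rng.normal(size=(500, 3))]   # intercept + 3 predictors
    true_beta = np.array([4.0, 2.0, -1.0, 0.5])
    y = X @ true_beta + rng.normal(0, 0.5, size=500)

    # Closed form: solve (X^T X) beta = X^T y (fine here, costly for very large n)
    beta_closed = np.linalg.solve(X.T @ X, X.T @ y)

    # Gradient descent on the mean squared error
    beta = np.zeros(X.shape[1])
    lr = 0.1                                   # hypothetical learning rate
    for _ in range(2000):
        grad = 2 / len(y) * X.T @ (X @ beta - y)
        beta -= lr * grad

    print(beta_closed, beta)                   # the two estimates should agree closely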

17 of 54

Multiple Linear Regression

 

18 of 54

Extensions of Linear Regression

The basic assumption of linearity can be relaxed by transforming the response variables, predictive variables, or both.

Polynomial regression is the next step up in complexity from simple linear regression, and more general basis-function models can extend this arbitrarily. A sketch follows below.
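A minimal sketch of a polynomial (basis-expanded) regression with scikit-learn, assuming synthetic data and an illustrative degree of 2:

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(3)
    x = rng.uniform(-3, 3, size=(200, 1))
    y = 1.0 + 0.5 * x.ravel() + 2.0 * x.ravel() ** 2 + rng.normal(0, 1, size=200)

    # Transform the predictor into polynomial basis features, then fit a linear model
    poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
    poly_model.fit(x, y)
    print(poly_model.predict(np.array([[1.5]])))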

19 of 54

Common Pitfalls (multicollinearity)

  • Arises when several variables are highly correlated with each other
  • Can make it difficult to interpret the meaning of your coefficients
  • Multicollinearity can result in models with high variance that fail to predict effectively
  • By using stepwise regression, it is possible to select the feature that contains the most information first, and then repeat this recursively until adding a feature decreases the accuracy of the model.
  • Additionally, principal component analysis can be used to describe the variables with the highest variance in a new feature space, effectively combining the original features into new feature vectors (components) that capture as much of the variance of the initial features as possible (see the sketch below).
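A minimal sketch of diagnosing multicollinearity with variance inflation factors and then combining correlated features with PCA; the feature names and data here are synthetic stand-ins:

    import numpy as np
    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(4)
    garage_sqft = rng.normal(500, 100, size=300)
    garage_cars = garage_sqft / 250 + rng.normal(0, 0.2, size=300)   # highly correlated pair
    lot_area = rng.normal(9000, 1500, size=300)
    X = pd.DataFrame({"GarageSqFt": garage_sqft,
                      "GarageCars": garage_cars,
                      "LotArea": lot_area})

    # A VIF well above roughly 5-10 flags a collinear feature
    vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    print(dict(zip(X.columns, vifs)))

    # PCA combines the correlated columns into orthogonal components
    components = PCA(n_components=2).fit_transform(X)
    print(components.shape)      # (300, 2)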

20 of 54

Feature Selection (wrapper methods)

  • Wrapper methods create an iterative pipeline that automates feature selection by forward-selecting the most predictive feature at each iteration, progressing toward an optimal combination of features. When an iteration is reached at which no remaining feature improves the training model, the selection process stops and an optimal subset of the features has been found (see the sketch below).
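A minimal sketch of forward selection using scikit-learn's SequentialFeatureSelector (requires scikit-learn 1.1 or later), one way to implement the wrapper approach described above; the data and stopping tolerance are illustrative assumptions:

    from sklearn.datasets import make_regression
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LinearRegression

    # Synthetic data: 10 candidate features, only 4 informative
    X, y = make_regression(n_samples=300, n_features=10, n_informative=4,
                           noise=5.0, random_state=0)

    selector = SequentialFeatureSelector(
        LinearRegression(),
        direction="forward",          # add the most predictive feature at each iteration
        n_features_to_select="auto",
        tol=1e-3,                     # stop when no feature improves the CV score by this much
    )
    selector.fit(X, y)
    print(selector.get_support())     # boolean mask of the selected features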

21 of 54

Variance vs. Bias Trade-off

  • Model selection largely comes down to balancing these two sources of error

22 of 54

Validation Methods

  • So now that we know what regression is, and how to apply it to vectors of predictive inputs to produce outputs, how do we assess the accuracy of the output?
  • There are several ways to do this, but first we need to discuss accuracy in the context of bias and variance.
  • The key to finding a good model is selecting features that capture most of the variance in the relationship with the response variable.
  • We quantify the accuracy of a result in regression by comparing the model's prediction with the known actual value of the response variable. This is done with what is known as a train/test split.
  • But first let’s look at some examples of overfitting of a model.

23 of 54

Overfitting

24 of 54

Overfitting

25 of 54

Overfitting

26 of 54

Overfitting

27 of 54

Validation Set Testing

  • This method of validation is the simplest, and it is effective for getting a rough idea of your model's accuracy.
  • It works by splitting the data into two sets: the training set and the testing set.
  • The training set is used to build the regression model; the predictive variables from the test set are then run through the fitted model to produce a set of predicted response values, which are compared to the actual “known” values for each observation.
  • For each observation the difference between the actual value y_actual and the predicted value y_predicted is squared; these squared differences are summed, divided by the number of observations, and the square root of the result yields the statistic known as the Root Mean Squared Error (RMSE). A sketch follows below.
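A minimal sketch of the train/test split and RMSE computation described above, with synthetic data and an illustrative split fraction:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(5)
    X = rng.normal(size=(500, 3))
    y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(0, 1, size=500)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

    model = LinearRegression().fit(X_train, y_train)
    y_predicted = model.predict(X_test)

    # RMSE: square the errors, average them, then take the square root
    rmse = np.sqrt(mean_squared_error(y_test, y_predicted))
    print(f"RMSE: {rmse:.3f}")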

28 of 54

Training/Test Split Validation

29 of 54

Leave One Out Cross Validation

30 of 54

K-fold Validation

31 of 54

Comparison of Validation Methods

Validation set: the simplest and least computationally intensive method, but prone to high variance and bias depending on which random subset of the data is chosen for training and validation.

K-fold validation: for k in the range 5 to 10, k-fold validation is empirically observed to work well; it is also not very computationally intensive, even with large datasets.

LOOCV: the most computationally intensive; it produces the least biased estimates of model error, but those estimates have greater variance. A comparison sketch follows below.
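A minimal sketch comparing k-fold cross-validation with LOOCV in scikit-learn, using synthetic data and k = 5 as an illustrative choice:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold, cross_val_score

    X, y = make_regression(n_samples=400, n_features=5, noise=10.0, random_state=1)
    model = LinearRegression()

    # k-fold: each observation is used for validation exactly once
    kfold = KFold(n_splits=5, shuffle=True, random_state=1)
    scores = cross_val_score(model, X, y, cv=kfold, scoring="neg_root_mean_squared_error")
    print(f"5-fold RMSE: {-scores.mean():.1f} +/- {scores.std():.1f}")

    # LOOCV is the special case k = n (much more expensive for large datasets)
    loocv_scores = cross_val_score(model, X, y, cv=len(y), scoring="neg_root_mean_squared_error")
    print(f"LOOCV RMSE: {-loocv_scores.mean():.1f}")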

32 of 54

Feature Processing (normalization)

  • Normalization involves scaling all of your predictive variables to the same range, i.e. mapping several values to a proportional range between [-1,1] or [0,1], [0,20], etc.
  • Many machine learning algorithms rely on the Euclidean distance between two points, and standardization has a greater impact on those algorithms than on regression, but for model interpretability normalization is a good practice to adopt (see the sketch below)
    • Rescaling
    • Mean Normalization
    • Standardization
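A minimal sketch of the three scalings listed above, using scikit-learn and NumPy on a hypothetical feature column:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    x = np.array([[1200.0], [1850.0], [2400.0], [3100.0]])   # e.g. square footage

    # Rescaling (min-max) to [0, 1]
    rescaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(x)

    # Mean normalization: subtract the mean, divide by the range
    mean_normalized = (x - x.mean()) / (x.max() - x.min())

    # Standardization: zero mean, unit variance
    standardized = StandardScaler().fit_transform(x)

    print(rescaled.ravel(), mean_normalized.ravel(), standardized.ravel())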

33 of 54

Iowa Housing Market Sale Price Predictions

Shane Dalton

34 of 54

Abstract

Given decision tree and linear regression models, which would offer the best accuracy in predicting a home's price given a suitable training set? The dataset analyzed is the Ames, Iowa real estate dataset. By preparing the data and training multiple models, a relatively accurate estimator of house price was created and evaluated against held-out test data with good results. The 2nd-order polynomial linear regression model had the best accuracy by RMSE, beating the decision tree and 1st-order regression models (RMSEs of 28,484 vs. 42,557 and 33,374, respectively).

35 of 54

Motivation

A common problem in statistical analysis is to determine an accurate estimator for a presently unknown target variable given an input of known features. This has broad applications across all scientific and business domains and helps businesses and scientists create solutions. In the case of the data in this analysis, the predictive model could be used to perform housing appraisals without the intervention of a human being, or to augment the work of human appraisers. For a real estate company, having a reliable estimator of a home's value would help ensure consistent profitability and pass the best value on to consumers in the market.

36 of 54

Dataset(s)

  • The Ames, Iowa dataset contains 80 features and 2930 observations
    • The dataset was found on Kaggle here: link
    • For this analysis only 1459 rows were used, with a 67/33 train/test split.
    • The full dataset contains 80 features (20 continuous, 23 nominal, 23 ordinal, and 14 discrete); read more here: http://jse.amstat.org/v19n3/decock.pdf
    • The subset of data that I used selected 10 features.

37 of 54

Data Preparation and Cleaning

  • Picking features that stayed within the scope of the project was difficult; I wanted to use some more advanced models, but the feature engineering needed to take advantage of them was significant.
  • There were many categorical, ordinal, and discrete variables that would have needed to be remapped to numeric values to be included in the results.
  • Many variables described the same information, for example GarageCarCapacity and GarageSqFt; these varied together and were correlated, so including both would have provided little additional benefit and just added noise.
  • The continuous data was overall very clean, with no missing values.

38 of 54

Data Preparation and Cleaning

  • Before the analysis began, the data were inspected for major outliers, which were removed.

39 of 54

Research Question(s)

  • Given the Iowa housing dataset, is it possible to accurately predict housing prices?
  • Which model produces optimal results for the data set?
  • How can we be sure our results are meaningful?

40 of 54

Methods

  • Basic descriptive statistics were used to understand the sale price distribution, indicating that 75% of homes sold for under $214,000.

41 of 54

Methods

  • In order to select the most relevant features a correlation matrix was used.
  • Areas with darker colors indicate stronger relationships between the corresponding variables.
  • Selected:
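A minimal sketch of this correlation-matrix step; the file name and column names below are assumptions for illustration, not the exact features selected in the analysis:

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Hypothetical load: Ames housing CSV from Kaggle with a SalePrice target column
    df = pd.read_csv("train.csv")

    corr = df.select_dtypes("number").corr()

    # Darker (stronger) cells highlight features closely related to the target
    sns.heatmap(corr, cmap="viridis")
    plt.show()

    # Features most correlated with the target, as candidates for selection
    print(corr["SalePrice"].abs().sort_values(ascending=False).head(10))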

42 of 54

Methods

  • Tuning of Models
    • Inflection-point determination was used to tune the decision tree, finding optimal values for tree depth and maximum leaf count
    • When tuning the regression model, it was determined by visual analysis of residual plots that a polynomial model of degree two most accurately predicted test data.
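A minimal sketch of tuning tree depth and maximum leaf count; the original tuning used inflection-point analysis, while this sketch uses a grid search over the same parameters with illustrative candidate values:

    from sklearn.datasets import make_regression
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeRegressor

    X, y = make_regression(n_samples=500, n_features=8, noise=15.0, random_state=2)

    param_grid = {
        "max_depth": [3, 5, 7, 9],             # hypothetical candidate depths
        "max_leaf_nodes": [10, 25, 50, 100],   # hypothetical candidate leaf counts
    }
    search = GridSearchCV(DecisionTreeRegressor(random_state=2), param_grid,
                          scoring="neg_root_mean_squared_error", cv=5)
    search.fit(X, y)
    print(search.best_params_, -search.best_score_)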

43 of 54

Methods

  • Decision tree, linear regression, and polynomial regression models from scikit-learn were used in order to create a proper comparison.
  • After the selection of features and the cleaning of outliers, the dataset was split into two groups, training and test data.
  • The training data was used to build the linear and decision tree estimators, and then these estimators were used to predict a set of outputs from the test inputs. Finally these predicted outputs were compared to the actual outputs and analyzed to determine accuracy.
  • A variety of numerical and graphical methods were used, including RMSE calculations, Residual plots, output distributions, and predicted vs actual value plots.
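A minimal sketch of the model-comparison pipeline described above; the file name, feature subset, and hyperparameters are assumptions for illustration rather than the exact choices used in the analysis:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.metrics import mean_squared_error

    # Hypothetical load: Kaggle Ames training file with an assumed subset of numeric features
    df = pd.read_csv("train.csv")
    features = ["GrLivArea", "OverallQual", "GarageArea", "TotalBsmtSF"]
    X, y = df[features], df["SalePrice"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

    models = {
        "decision tree": DecisionTreeRegressor(max_depth=7, random_state=0),
        "linear": LinearRegression(),
        "2nd-order polynomial": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
        print(f"{name}: RMSE = {rmse:,.0f}")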

44 of 54

Findings

  • The Decision Tree regressor model had more variance in prediction values than the Lasso, Ridge, or Linear models.

45 of 54

Findings

  • Based on the plots indicating prediction accuracy (residual plots), there are some non-linear patterns in this data that the linear regression analysis was unable to capture.
  • There is a clear curve present in this data, indicating that a nonlinear relationship exists between the inputs and outputs.

  • The presence of this relationship prompted further analysis with a 2nd-degree polynomial approximation.
  • After fitting our features to a 2nd-degree polynomial, we can see that much of the inverted-U shape from above has been corrected for in the model. Fits of higher degree produced lower accuracy.

46 of 54

Findings

  • A comparison of the distributions of predicted vs. actual values provides a quick sanity check on a model's accuracy. Here we can see the predicted values for the decision tree vs. the actual values.
  • The RMSE penalizes predictions non-linearly as they deviate from the actual values, and the tree's accuracy decreases for large predictions, so it is not surprising that the decision tree had an RMSE of 42,557, the worst of the models.

47 of 54

Findings

  • Predicted values for the simple linear regression vs actual
  • The RMSE for this model was 33,374.25, a fair estimator.

48 of 54

Findings

  • Predicted values for the polynomial linear regression vs actual
  • Here we can see that the polynomial fit matches the actual data rather nicely, and the RMSE reflects that with a score of 28,484, making it the best model used in my analysis.

49 of 54

Findings

  • Based on the residual plot, we can see that our estimator is considerably more accurate for home values below an estimated $400,000. The plot on the left shows the widening cone of uncertainty in the estimator's accuracy, and the one on the right clearly shows the linear trend in predicted vs. actual price for the polynomial model.

50 of 54

Limitations

  • The data is from a limited time window of 2006-2010
  • The data only concerns a small subset of the real estate market, Ames, Iowa
  • The Great Recession could certainly have impacted this data, skewing its future interpretability
  • Despite being the most accurate of the models I tried, the best model's RMSE was still 28,484 (decent relative to the magnitude of the predictions, but in the real world it could certainly be optimized for more effective pricing).
  • Cross-validation should be implemented to ensure realistic accuracy scores for the various regression models used.
  • According to online research, the most accurate models have produced RMSEs of around 21,000 (link) using random forest modeling techniques.
  • By using random subsamples of the data to generate “trees” and taking the result of a committee of trees, it is possible to avoid multicollinearity and reduce overfitting.

51 of 54

Possible improvements

  • A great way to achieve more accurate results would be extensive feature engineering on the initial dataset, with statistical methods to suss out what’s important and what’s not.
  • If better feature engineering were applied and a more advanced model, such as a random forest, were used, I would expect more accurate results (~20,000 RMSE).
  • Roughly 80% of the usable data was not included, due to the simplistic variable selection employed.
  • The second-order polynomial regression did increase predictive accuracy as home value increased, but at the expense of consistent accuracy for lower-priced homes.
  • As tested, my Ridge and Lasso models were untuned, and their results were surprisingly similar to ordinary regression, leading me to believe that I was not utilizing their full power during my analysis. This is, however, just a suspicion.

52 of 54

Conclusions

  • Given the Iowa housing dataset, is it possible to predict housing price?
    • Yes: the models I developed were quite capable of producing a reasonable estimate of a home's sale price given a set of only 10 input variables.
  • Which model produces optimal results for the data set?
    • The 2nd-degree polynomial Ridge regression model produced the best results, with a recorded R^2 of 0.868, doing much better than the decision tree with an R^2 of 0.742.
  • How can we be sure our results are meaningful?
    • Given the interpretation of the residual plots, it appears that the model fit the data very well, managing to predict reasonably in most cases (the exceptions being houses with higher values, where sale price was more random). This is further backed up by the histogram plots for the different models. We can see in the histograms that there are no strange pockets of predictions far outside the norm.

53 of 54

Acknowledgements

  • I collected my data from Kaggle’s online data science platform, thanks!
  • As I worked through the project I looked up various modeling terminologies and read through the Ames Iowa Dataset Paper
  • My wife gave me occasional feedback as I worked on the project.

54 of 54

References