STAT 324 Project

An exploratory analysis of Real Estate Sales

 

Project Report


By: Nikolas Garcia, Jeff Wilson, Ali Hernandez, and Jonathan Kisch

Institution: California Polytechnic State University, San Luis Obispo

Class: STAT 324

Academic term: Spring 2015

Professor: Jeff Sklar

Executive Summary

Utilizing the textbook Applied Linear Regression Models (4th ed., by Kutner, Nachtsheim, and Neter), our research team obtained the Real Estate Sales data from Appendix C.7. The original data set included 12 variables and an identification number. The data were collected by a city tax assessor who was interested in predicting residential home sales prices as a function of various characteristics of each property. All data were collected during the year 2002 and cover 522 “arm's-length transactions” in which homes were sold. The 12 variables provided with the transactions in the data set are:

Next, according to the best subsets procedure we used to fit our model, the predictor variables recommended for predicting the log of sales price of real estate in a midwestern city in 2002 were: log10lotsize, baths, garagesize, log10propsize, and log10age. The model we decided to use is: Log10Price = 2.39 + 0.0138Baths + 0.0244GarageSize + 0.843Log10PropSize - 0.302Log10Age + 0.139Log10LotSize + 0.0369Pool. We interpret each coefficient as follows:

β0 = 2.39: the estimated average sales price for a house with no pool and all other predictors equal to zero is 10^2.39 ≈ $245.47.

β1 = 0.0138: for each one-unit increase in the number of baths, there is an estimated 3.23% increase in sales price, after adjusting for all other predictors in the model.

β2 = 0.0244: for each one-unit increase in the number of cars the garage will hold, there is an estimated 5.78% increase in sales price, after adjusting for all other predictors in the model.

β3 = 0.843: for each 1% increase in property size, there is an associated 0.843% increase in sales price, after adjusting for all other predictors in the model.

β4 = -0.302: for each 1% increase in age, there is an associated 0.302% decrease in sales price, after adjusting for all other predictors in the model.

β5 = 0.139: for each 1% increase in lot size, there is an associated 0.139% increase in sales price, after adjusting for all other predictors in the model.

β6 = 0.0369: a property with a pool has an estimated 8.87% (10^0.0369 − 1) higher sales price than one without a pool, after adjusting for all other predictors in the model.

For completeness, we ask the reader to refer to Appendix E.
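To make the arithmetic behind these interpretations explicit, here is a minimal sketch (ours, not part of the original analysis) that back-transforms the log10 coefficients into percent changes and computes a prediction for a hypothetical house; all of the input values in the prediction are illustrative assumptions, not cases from the data set.

```python
import math

# Fitted coefficients from the final model (log10 scale), as reported above.
coefs = {"Baths": 0.0138, "GarageSize": 0.0244, "Pool": 0.0369}

# For a predictor entered on its original scale, a one-unit increase
# multiplies the predicted price by 10**b, i.e., a (10**b - 1)*100 % change.
for name, b in coefs.items():
    print(f"{name}: {(10 ** b - 1) * 100:.2f}% change per unit increase")
# -> Baths: 3.23%, GarageSize: 5.78%, Pool: 8.87%

# Worked prediction for a hypothetical house: 2 baths, 2-car garage,
# 2,000 sq ft house on a 10,000 sq ft lot, 30 years old, no pool.
log10price = (2.39 + 0.0138 * 2 + 0.0244 * 2
              + 0.843 * math.log10(2000) - 0.302 * math.log10(30)
              + 0.139 * math.log10(10000) + 0.0369 * 0)
print(f"predicted sales price: {10 ** log10price:,.0f}")  # back-transform
```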

In conclusion, our team developed the best subset model for predicting residential home sales prices. As part of this least squares linear regression analysis, assumptions were verified before variables were used, outliers were addressed appropriately, and influential observations were taken into account before settling on a final model. Thus, all necessary remedial measures were applied before making any inferences. The final model, after all necessary transformations, proved to be both parsimonious and appropriate for predicting real estate sales. The final model produced an R2adj of 81.5%; thus, 81.5% of the observed variability in residential home sales prices in a midwestern city can be explained by the variables listed above, after adjusting for the number of predictors in the model.


Main Report

Upon initial analysis of the scatterplot and correlation matrices (see Appendix F) relating our response variable to the predictor variables in the data set, we found troubling violations of linearity as well as of constant error variance. After regressing price on all of the original predictor variables, we found many violated assumptions in the underlying model (see Appendix A). We also noticed that some of the predictor variables, such as quality, style, and air conditioning, were redundant; we removed these predictors from the candidate set for parsimony. It was apparent that the assumptions of constant error variance and linearity were violated, and as a first step toward correcting this, we performed a Log10 transformation on our response variable, sales price. Consequently, there was a significant improvement in error variance across all of our explanatory variables (see Appendix B). Even after the transformation on Y, there were still assumption violations for lot size, property size, and age, so we performed Log10 transformations on these explanatory variables as well, with great success in improving the assumptions (see Appendix A).

Using our transformed variables, we ran a best subsets procedure to find and evaluate the most promising candidate models according to R2adj, Mallow’s Cp, and the root mean square error. During this procedure, we found two potential models with similar qualities for predicting sales price; both had the same R2adj of 0.813. One of the models had a Mallow’s Cp equal to the number of parameters, while the other had a Cp that was less than the number of parameters. The remaining candidate models had high Mallow’s Cp values relative to the number of parameters, which enabled us to narrow our work down to the final model listed below (a code sketch of this procedure follows the equation). According to the best subsets procedure, our final model uses log10price as the response and baths, garagesize, log10propsize, log10age, log10lotsize, and pool as the predictors. Thus, a final least squares linear regression model was implemented:

 Log10Price = 2.39 + 0.0138(Baths) + 0.0244(GarageSize) + 0.843(Log10PropSize) - 0.302(Log10Age) + 0.139(Log10LotSize) + 0.0369(Pool)
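The transformations and best subsets search described above were carried out in a statistical package; the following is a minimal sketch of the same idea in Python with statsmodels, assuming a hypothetical file real_estate.csv whose column names (price, lotsize, propsize, age, baths, garagesize, pool) mirror the variables used in this report.

```python
from itertools import combinations

import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("real_estate.csv")  # hypothetical file and column names

# Log10 transformations on the response and the skewed predictors
df["log10price"] = np.log10(df["price"])
for col in ["lotsize", "propsize", "age"]:
    df["log10" + col] = np.log10(df[col])

y = df["log10price"]
candidates = ["baths", "garagesize", "log10propsize", "log10age", "log10lotsize"]

# MSE of the full candidate model, needed for Mallows' Cp
full = sm.OLS(y, sm.add_constant(df[candidates])).fit()
mse_full, n = full.mse_resid, len(df)

results = []
for k in range(1, len(candidates) + 1):
    for subset in combinations(candidates, k):
        fit = sm.OLS(y, sm.add_constant(df[list(subset)])).fit()
        p = k + 1  # parameters, including the intercept
        cp = fit.ssr / mse_full - (n - 2 * p)  # Mallows' Cp
        results.append((subset, fit.rsquared_adj, cp))

# Rank candidate models by adjusted R-squared, as in the report
for subset, r2adj, cp in sorted(results, key=lambda r: -r[1])[:5]:
    print(f"R2adj={r2adj:.3f}  Cp={cp:5.1f}  {subset}")
```

The pool indicator is then tested separately against the chosen subset, as described in the next paragraph.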

Because the best subsets procedure typically is not designed to include indicator variables, we ran the procedure with all quantitative predictors and then tested each of our indicator variables separately with our model to find the best fit. We also considered the possibility of an interaction term and created interaction terms between each quantitative predictor variable and our qualitative variable indicating whether or not the property has a pool. Running a stepwise regression revealed that the partial t-tests flagged the interaction between pool and Log10Age as a significant predictor, with a p-value of 0.010. However, when we included this term in our model, the R2adj increased only marginally, from 81.3% to 81.52% (see Appendix C). We also observed that Mallow’s Cp was 8.3 with p = 8, and therefore we decided not to include the interaction term. This allows our model to remain parsimonious and yet still essentially unbiased.
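Continuing the sketch above (same assumed data frame and column names), the interaction check can be reproduced by comparing the fit with and without a Pool × Log10Age term:

```python
# Add the pool indicator and the Pool x Log10Age interaction to the design.
df["pool_x_log10age"] = df["pool"] * df["log10age"]
base_terms = candidates + ["pool"]

base = sm.OLS(y, sm.add_constant(df[base_terms])).fit()
inter = sm.OLS(y, sm.add_constant(df[base_terms + ["pool_x_log10age"]])).fit()

print(f"R2adj without interaction: {base.rsquared_adj:.4f}")
print(f"R2adj with interaction:    {inter.rsquared_adj:.4f}")
# Partial t-test for the interaction term
print(f"interaction p-value: {inter.pvalues['pool_x_log10age']:.3f}")
```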

 Unusual observations (Table 1):

Observation    Leverage (hii > 0.0402)    Cook's D (Di)    |DFFITS| (> 0.232)
6              0.0887                     0.014            -0.31
11             0.049                      0.128            -0.96
81             0.129                      0.005            -0.187
96             0.03                       0.089             0.805
108            0.061                      0.079             0.748
161            0.075                      0.032             0.477
514            0.02                       0.035             0.50

Values exceeding the thresholds given in the column headers represent significant findings.

To evaluate the data for outliers with respect to the response variable, a Bonferroni test for outliers was performed at the 5% family-wise significance level, indicating that any studentized deleted residual with an absolute value over 3.746 should be considered an outlier with respect to the response variable Y. Observations 11 and 96 met this condition, so we treat these observations as outliers. With respect to leverage, most of our unusual observations have leverage values (hii, the diagonal elements of the hat matrix) greater than our high-leverage threshold, computed as 3(p/n) = 3(7/522) = 0.0402. Our findings therefore indicate that observations 6, 11, 81, 108, and 161 all exert high leverage on the regression line. With respect to Cook’s distance, all of the suspected outliers’ Di values fall under the 20th-percentile threshold of 0.5454; thus, we can assume these cases have little influence on the fitted values of the model. Moving on to DFFITS, which measures the influence that each case i has on its fitted value Ŷi: in layman’s terms, it is the number of standard deviations by which Ŷi changes if the ith case is removed. Because our data set is large (greater than 50 observations), the guidelines for detecting influential cases specify the threshold 2√(p/n) for |DFFITS|, which gives 2√(7/522) = 0.232. By this criterion, most of the unusual observations exert influence on Ŷi, namely observations 6, 11, 96, 108, 161, and 514 (see Table 1).
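Continuing the same sketch, statsmodels’ influence measures reproduce these diagnostics; the cutoffs below are the ones stated in this report (|t| > 3.746 from the Bonferroni bound, hii > 3p/n, and |DFFITS| > 2√(p/n)).

```python
final = sm.OLS(y, sm.add_constant(df[candidates + ["pool"]])).fit()
infl = final.get_influence()

n, p = int(final.nobs), int(final.df_model) + 1  # p counts the intercept

# Thresholds used in the report
lev_cut = 3 * p / n              # high leverage: hii > 3(p/n) = 0.0402
dffits_cut = 2 * np.sqrt(p / n)  # influential: |DFFITS| > 0.232
outlier_cut = 3.746              # Bonferroni bound for |t_i| at FWER 0.05

studres = infl.resid_studentized_external  # studentized deleted residuals
leverage = infl.hat_matrix_diag
cooks_d = infl.cooks_distance[0]  # for Table 1; none exceed the 0.5454 cutoff
dffits = infl.dffits[0]

flagged = ((np.abs(studres) > outlier_cut)
           | (leverage > lev_cut)
           | (np.abs(dffits) > dffits_cut))
print(np.where(flagged)[0] + 1)  # 1-based observation numbers
```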

After verifying the validity of our linear regression analysis, we studied pairwise relationships and narrowed our model down to the most useful explanatory variables. We tested the need for an interaction term, identified variables violating residual assumptions, and finalized variable transformations to clean up the data. Using log base 10 transformations on the response and some explanatory variables alleviated the patterns of non-constant error variance and non-linearity, but an Anderson-Darling p-value of < .05 revealed that normality of the errors was violated (see Appendix B). We then examined possible outliers and influential observations for possible elimination. After eliminating the observations that were influential and/or outliers (6, 11, 81, 108, and 161), the regression model showed no statistically significant changes; therefore, we conclude that no individual observations need be removed from the data set. With a large sample size (n = 522), the problem of non-normality of the errors is less serious, and we can proceed cautiously despite the violation.

The procedure outlined up to this point unveiled the best model for predicting average sales price for homes in a midwestern city in 2002, as a function of various characteristics of the home and surrounding property. This model uses log10price as the response, with baths, garagesize, log10propsize, log10age, log10lotsize, and pool as the predictor variables. The final model that we use to predict average sales price is: Log10Price = 2.39 + 0.0138Baths + 0.0244GarageSize + 0.843Log10PropSize - 0.302Log10Age + 0.139Log10LotSize + 0.0369Pool. This final model has a global F-statistic of 383.97 and a p-value of < 0.001; therefore, we conclude at the 0.05 level that our model is collectively useful for predicting the average sales price of homes in the midwestern city in the year 2002. The model produces an R2adj of 81.5%, which suggests that 81.5% of the observed variability in Log10SalesPrice can be explained by baths, garagesize, log10propsize, log10age, log10lotsize, and pool, after adjusting for the number of predictors in the model. The Variance Inflation Factors (VIFs) for the explanatory variables are all less than 2.742, indicating no problematic multicollinearity among them (see Appendix D and the sketch below). The assumptions for regression analysis all appear to be satisfied according to the plots of residuals versus fitted values, the normality plot, and the plot of residuals versus time; each condition is satisfied (see Appendix B). Normality is acceptable given the large sample size, constant error variance is satisfied after our Log10 transformations of sales price and various X variables, and the residuals-versus-time plot appears random, supporting independence of the errors. Therefore, our final least squares linear regression model, as outlined above, is a significant predictor of residential home sale prices in a midwestern city as a function of the various characteristics of the home and surrounding property.
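Finally, the multicollinearity check can be reproduced with statsmodels’ variance_inflation_factor, again continuing the sketch above:

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(df[candidates + ["pool"]])
for i, name in enumerate(X.columns):
    if name != "const":  # skip the intercept column
        print(f"{name}: VIF = {variance_inflation_factor(X.values, i):.3f}")
```

VIFs near 1 indicate little linear dependence among the predictors; the maximum of 2.742 reported above is well below the common rule-of-thumb cutoff of 10.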