Misy 267-010
Final Project
UNIVERSITY OF DELAWARE
FICO RETAIL DATA: FINAL PROJECT
Liana Tran, Juliana Mooney, Shawnice Shields,
Cristina Young, and Bricen Yozzo
MISY 267-010
Berkow
10 May 2016
Goals:
The data we chose was the FICO Retail Credit data. The data set consisted of variables describing customers' responses to, and previous behavior around, a gardening catalog mailing campaign. The gardening catalog response mailing data set consisted of 35 variables, and our goal as a group was to use the R analytic tool on the FICO website, together with intuition and data analysis, to see which variables best explained the response. Our problem was that when all of the variables in the set were compared against Perf_Purchase_Amt, the purchase amount from the most recent campaign, it was hard to see which variables contributed the most. Additionally, having so many variables in our data set creates the potential for overfitting. This required further investigation.
Conclusions:
After careful consideration and testing, we have concluded that our model is not very reliable. Only about 9% of the variation in Performance Purchase Amount is explained by Dollar Retail Lifetime, Maximum Purchase Amount, Performance Catalog Purchase Flag, Performance Retail Purchase Flag, and Performance Web Purchase Flag. We selected our variables by using our intuition, considering the correlation values, and performing backward stepwise selection. Through this process, we narrowed 35 variables down to 5 significant variables, which decreased our chances of overfitting the data. The training and testing sample errors were large enough to raise concern and to warrant further investigation. With the data given to us, we created the best model that we could. We believe that if we were given more variables to choose from, such as age, income, and soil type, we could have created a better model to predict the response to the campaign (Performance Purchase Amount).
Methodology:
The technique used to select these variables was backward selection. Starting from the full set of candidate predictors, we removed variables one at a time when they did not benefit our model, while considering the correlations between the potential predictor variables and the response variable.
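The backward selection described above can be sketched roughly as follows. This is an illustrative Python sketch, not the R code used in the project. It assumes AIC as the criterion for dropping a variable, which the report does not state, and the names `ols_rss`, `aic`, `backward_select`, and the synthetic data are all hypothetical.

```python
import math
import random

def ols_rss(columns, y):
    """Residual sum of squares from regressing y on an intercept plus columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((y[r] - sum(b * x for b, x in zip(beta, X[r]))) ** 2
               for r in range(n))

def aic(columns, y):
    """AIC up to an additive constant: n * ln(RSS / n) + 2 * (number of coefficients)."""
    n = len(y)
    return n * math.log(ols_rss(columns, y) / n) + 2 * (len(columns) + 1)

def backward_select(data, y):
    """data maps predictor name -> list of values. Returns the names that are kept."""
    kept = list(data)
    current = aic([data[v] for v in kept], y)
    while kept:
        # Score every model that drops exactly one remaining predictor.
        trials = [(aic([data[v] for v in kept if v != d], y), d) for d in kept]
        best_score, drop = min(trials)
        if best_score >= current:
            break  # no single removal improves the criterion
        kept.remove(drop)
        current = best_score
    return kept

# Synthetic data (an assumption): y depends on x1 and x2 but not on "noise".
rng = random.Random(0)
n = 200
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
noise = [rng.gauss(0, 1) for _ in range(n)]
y = [3 * a + 2 * b + rng.gauss(0, 0.5) for a, b in zip(x1, x2)]
kept = backward_select({"x1": x1, "x2": x2, "noise": noise}, y)
print(kept)
```

The strong predictors survive because removing either one raises the criterion sharply; whether the pure-noise column is dropped depends on the sample.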
The response variable is Perf_Purchase_Amt: how much the customer spent after the campaign.
Predictor Variables | Description |
Dollar_Ret_Lifetime | Dollars spent through retail over the customer's lifetime. Example: if you have spent $1,000 over your lifetime, multiplying that by .015 gives the increase ($15) in your purchase amount from the catalog if you get a coupon |
Max_Purch_Amount | Maximum order dollars |
Perf_Ctg_Purchase_Flag | Did customer respond to campaign through catalog. 1 = Response through catalog; 0 = No response through catalog |
Perf_Ret_Purchase_Flag | Did customer respond to campaign through retail. 1 = Response through retail; 0 = No response through retail |
Perf_Web_Purchase_Flag | Did customer respond to campaign through web. 1 = Response through web; 0 = No response through web |
We split the data randomly into a training (70%) and testing (30%) set.
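A random 70/30 split of this kind can be sketched as below. This is a minimal Python illustration rather than the R code used in the project; the function name `train_test_split`, the seed, and the stand-in row ids are assumptions.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of rows and split it into (training, testing) lists."""
    shuffled = list(rows)                   # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # seeded shuffle for reproducibility
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical stand-in for the customer records (row ids only).
customers = list(range(1000))
train, test = train_test_split(customers)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, so the training and testing errors can be compared across runs.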
Model 1: Training data with intercept
This model is fit on the training data with an intercept. All of the variables are statistically significant, each showing three stars (p < 0.001 in R's output).
Model 2: Training data without intercept.
This model is fit on the training data without an intercept. We decided to remove the intercept because it did not make sense intuitively: you cannot spend a negative amount of money when you don’t purchase anything. After doing this, all of the variables are still statistically significant, each showing three stars.
Assumptions:
When testing for a good linear model, we found Model 1 to best represent the data, but it resulted in an R squared of only .09099. After plotting Perf_Purchase_Amt against each variable, it was clear from the nonlinear scatter plots that there is no good linear model. This tells us it is hard to predict retail data because retail shopping is often unpredictable. Therefore, the linearity assumption is violated.
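For reference, R squared is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal Python sketch follows; the function name `r_squared` and the toy numbers are assumptions and are not taken from the FICO data.

```python
def r_squared(actual, predicted):
    """Share of the variation in actual that the predictions account for."""
    mean = sum(actual) / len(actual)
    rss = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained
    tss = sum((a - mean) ** 2 for a in actual)                  # total
    return 1 - rss / tss

# Toy purchase amounts and predictions that barely move away from the mean (25).
actual = [10, 20, 30, 40]
predicted = [24, 24, 26, 26]
weak_fit = r_squared(actual, predicted)
print(weak_fit)
```

A value near zero, like the one reported for Model 1, means the predictions do only slightly better than always guessing the mean.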
When looking at the correlation matrix, there is no perfect multicollinearity. No pair of predictor variables has a correlation equal to 1 or -1. This assumption is therefore met.
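The check on the correlation matrix can be illustrated with a short Python sketch. The helper names `pearson` and `perfectly_collinear_pairs`, the tolerance, and the toy columns are assumptions; the project itself read the correlations from R.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def perfectly_collinear_pairs(columns, tol=1e-9):
    """Pairs of column names whose correlation is (numerically) +1 or -1."""
    return [(a, b) for a, b in combinations(columns, 2)
            if abs(abs(pearson(columns[a], columns[b])) - 1) < tol]

# Toy predictors: "b" is exactly twice "a", while "c" is only loosely related.
columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [2, 1, 4, 3, 5],
}
flagged = perfectly_collinear_pairs(columns)
print(flagged)
```

An empty result corresponds to the situation described above, where no pair of predictors is perfectly correlated.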
To test for independent errors, we first considered the intuition of the group, which tells us that one shopper’s purchasing activity will not affect the next shopper’s purchasing activity. Next, we used a residual plot, which showed no pattern. This suggests that the errors are independent. Lastly, we considered the results of the ACF plot. The first black “nub” does not surpass the bounds of the dotted line, which further supports our conclusion that the errors are independent and that this assumption is met.
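What an ACF plot measures can be sketched in Python as below. The function names and the deliberately dependent toy residuals are assumptions. The bound of 1.96 divided by the square root of n is the usual approximate 95% limit for uncorrelated errors that ACF plots draw as dashed lines.

```python
import math

def autocorrelation(residuals, lag):
    """Sample autocorrelation of the residuals at the given lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    denom = sum((e - mean) ** 2 for e in residuals)
    numer = sum((residuals[t] - mean) * (residuals[t - lag] - mean)
                for t in range(lag, n))
    return numer / denom

def acf_bound(n):
    """Approximate 95% bound for the autocorrelation of independent errors."""
    return 1.96 / math.sqrt(n)

# Hypothetical residuals that alternate in sign, so they are clearly dependent.
alternating = [1.0 if t % 2 == 0 else -1.0 for t in range(100)]
r1 = autocorrelation(alternating, 1)
print(r1, acf_bound(len(alternating)))
```

Here the lag-1 value falls far outside the bound, the opposite of what the report observed; independent residuals would be expected to stay mostly inside it.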
When testing for equal error variance, we found that this model is heteroskedastic. The fitted-versus-residuals plot makes it clear that the residuals have unequal variances. The equal-variance assumption is violated; however, we can address this problem by bootstrapping.
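One common form of bootstrapping resamples whole observations with replacement and refits the model each time, which gives an interval for a coefficient without relying on equal error variances. The Python sketch below shows this for a single-predictor slope. It is an assumption about how the suggestion could be carried out, not the project's procedure, and the function names and synthetic data are hypothetical.

```python
import random

def slope(pairs):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    return sxy / sxx

def bootstrap_slope_interval(pairs, n_boot=1000, seed=1):
    """Percentile 95% interval for the slope, resampling observations."""
    rng = random.Random(seed)
    estimates = sorted(
        slope([rng.choice(pairs) for _ in pairs]) for _ in range(n_boot)
    )
    return estimates[int(0.025 * n_boot)], estimates[int(0.975 * n_boot) - 1]

# Synthetic heteroskedastic data: the noise grows with x (true slope is 2).
rng = random.Random(7)
xs = [rng.uniform(1, 10) for _ in range(100)]
pairs = [(x, 2 * x + rng.gauss(0, 0.3 * x)) for x in xs]
low, high = bootstrap_slope_interval(pairs)
print(low, slope(pairs), high)
```

Resampling whole (x, y) pairs keeps each observation's noise level attached to its x value, which is why this variant suits unequal variances.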
When testing for normally distributed errors, we looked at the histograms and the Q-Q plot. Neither the distribution of the Perf_Purchase_Amt response variable (first histogram) nor the distribution of the model residuals (second histogram) appears to be normal, because neither shows a bell curve; both are clearly right-skewed. The Q-Q plot also does not follow a straight line. Therefore, the errors are not normally distributed and this assumption is violated. This may affect the significance tests for the coefficients; we would suggest bootstrapping to address this.
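Right skew can also be summarized numerically. The Python sketch below computes the sample skewness, a swapped-in numeric check that the report did not use (it relied on histograms and the Q-Q plot); the function name and toy values are assumptions.

```python
def skewness(values):
    """Sample skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]          # balanced around its mean
right_skewed = [1, 1, 2, 2, 3, 20]   # one large value stretches the right tail
print(skewness(symmetric), skewness(right_skewed))
```

A value near zero is consistent with a symmetric, bell-shaped distribution, while a clearly positive value matches the right-skewed histograms described above.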
In this model, outliers appear to be a concern. Since Perf_Purchase_Amt is continuous and has a large range, there is a lot of potential for outliers. For example, some shoppers will buy all of their gardening supplies from this store, while others may purchase supplies there only a few times or seasonally. Since there are outliers, we suggest winsorizing: identifying extreme values of the response and reassigning them to a boundary value.
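Winsorizing can be sketched in Python as follows. The cutoff rule here (simple rank-based percentiles), the 10% and 90% limits, and the toy purchase amounts are assumptions; statistical packages use various percentile definitions.

```python
import math

def winsorize(values, lower=0.05, upper=0.95):
    """Clamp values below the lower cutoff or above the upper cutoff to that cutoff."""
    ordered = sorted(values)
    n = len(ordered)
    low_cut = ordered[math.ceil(lower * (n - 1))]    # simple rank-based cutoffs
    high_cut = ordered[math.floor(upper * (n - 1))]
    return [min(max(v, low_cut), high_cut) for v in values]

# Hypothetical purchase amounts with one extreme spender.
amounts = [5, 7, 8, 9, 10, 11, 12, 13, 14, 400]
clipped = winsorize(amounts, lower=0.10, upper=0.90)
print(clipped)
```

Unlike deleting outliers, this keeps every customer in the sample and only limits how far an extreme purchase can pull the fit.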
By looking at the histograms of the in-sample and out-of-sample errors, we can judge how well the model predicts from how it ran on the testing set. The histograms show that the in-sample and out-of-sample results are not consistent with each other. The means differ by about $18, and the standard deviations differ by about $4. Therefore, we have decided this is not a good model.
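The comparison of in-sample and out-of-sample errors can be sketched in Python as below. The function name `error_summary` and the toy numbers are assumptions chosen only to show the calculation; they do not reproduce the $18 and $4 gaps reported above.

```python
from statistics import mean, stdev

def error_summary(actual, predicted):
    """Mean and standard deviation of the prediction errors (actual - predicted)."""
    errors = [a - p for a, p in zip(actual, predicted)]
    return mean(errors), stdev(errors)

# Toy in-sample (training) and out-of-sample (testing) results.
train_mean, train_sd = error_summary([50.0, 60.0, 70.0, 80.0],
                                     [48.0, 62.0, 69.0, 81.0])
test_mean, test_sd = error_summary([55.0, 65.0, 75.0, 85.0],
                                   [45.0, 50.0, 60.0, 80.0])

# Large gaps suggest the model does not generalize beyond the training data.
print(abs(test_mean - train_mean), abs(test_sd - train_sd))
```

If the model generalized well, both gaps would be close to zero.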
After narrowing down our model, we found that approximately 9% of the variation in performance purchase amount is explained by the predictor variables. The other 91%, approximately, could be explained by other factors. Factors that might better predict the response include age, season, household income, accessibility to the marketing campaigns, weather conditions, and type of soil. For example, the age of a household owner may determine how likely they are to purchase gardening products in response to a marketing campaign. Older customers may not be as physically able to use these products as effectively as younger customers.
In discussion of the resulting model, we conclude that the model is not good enough to use on future data. Throughout this data analysis, it is evident, based on the violations of assumptions and the low share of variation explained, that predicting Perf_Purchase_Amt is difficult because customer purchasing activity is usually unpredictable. The variables do not relate strongly enough to the customer response to the most recent campaign.