1 of 14

Predictive Analytics for Business Nanodegree

Combining Predictive Techniques

Andre Costa

Udacity Mentor

Since Jan/17

2 of 14

Agenda

Project Summary
Pain point #1: Getting the Correct Cluster Distribution
Pain point #2: Predicting Cluster for New Stores
Pain point #3: Setting up the data for the forecast
Q&A

3 of 14

PROJECT SUMMARY

4 of 14

Project Summary

As the project name implies, the goal here is to combine some of the techniques learned into a single analysis

We will be revisiting the following skills acquired in the course:

Clustering Analysis - Task 1
Classification Models - Task 2
Time Series Forecasting - Task 3
Building Visualizations - Task 1, 3

Tips for starting the project:

Make sure to review the past lessons
Revisit the workflows you built previously, as the setup is very similar
Review related questions in Knowledge

5 of 14

Getting the Correct Cluster Distribution

6 of 14

Getting the Correct Cluster Distribution

As the whole project depends on this answer being correct, it is very important to get it right

7 of 14

Getting the Correct Cluster Distribution

Depending on the Alteryx version, we should get 23 or 25 stores for cluster 1

We should have the data for 2015 only.
We should sum the sales of each category by store.

We should have summed sales values for 85 stores for each of the categories.
Make sure to also get the total sales per store

Next, we need to calculate each food category's percentage of total store sales

Ex: % of the Dry Grocery sales => [Sum_Dry_Grocery] / [Total_Sales] * 100
The resulting variable should be of type Double
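The aggregation and percentage steps above can be sketched outside Alteryx as follows. This is only an illustration in plain Python: the row layout, the store IDs, and the sales figures are made up, and the function name `category_percentages` is hypothetical.

```python
from collections import defaultdict

# Hypothetical input rows: (store, year, category, sales).
# The layout and the values are assumptions for illustration only.
rows = [
    ("S0001", 2015, "Dry_Grocery", 600.0),
    ("S0001", 2015, "Produce", 400.0),
    ("S0001", 2014, "Produce", 999.0),  # dropped: not 2015
    ("S0002", 2015, "Dry_Grocery", 250.0),
    ("S0002", 2015, "Produce", 750.0),
]

def category_percentages(rows, year=2015):
    """Sum sales per store and category for one year, then convert to % of store total."""
    sums = defaultdict(lambda: defaultdict(float))
    for store, row_year, category, sales in rows:
        if row_year == year:
            sums[store][category] += sales

    result = {}
    for store, by_category in sums.items():
        total_sales = sum(by_category.values())
        result[store] = {
            category: value / total_sales * 100
            for category, value in by_category.items()
        }
    return result

percentages = category_percentages(rows)
```

Each store's percentages add up to 100, which is a quick sanity check before clustering.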

Configuring the “K-Centroids Cluster Analysis tool”

We should have selected the 9 category fields. (All the categories we have)
Make sure to use the z-score to standardize the fields
We should not select Total_Sales variable
Use default argument for “Number of Starting Seeds”

Make sure that you use K-Means as the clustering method.
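To show what z-score standardization does to a field before clustering, here is a minimal sketch in plain Python. The percentage values are made up, and using the sample standard deviation is an assumption; the Alteryx tool may compute it differently.

```python
import statistics

def z_scores(values):
    """Standardize values to mean 0 and (sample) standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # assumption: sample standard deviation
    return [(value - mean) / std for value in values]

# Hypothetical % of Dry Grocery sales for four stores.
pct_dry_grocery = [60.0, 25.0, 40.0, 35.0]
standardized = z_scores(pct_dry_grocery)
```

Standardizing puts every category on the same scale, so a category with a large spread does not dominate the distance calculation in K-Means.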

8 of 14

Predicting Cluster for New Stores

9 of 14

Predicting Cluster for New Stores

Leverage your analysis from the Creditworthiness project here

Make sure to use all the available variables we have in the StoreDemographicData file

We should not use PCA here

Remember to use a 20% validation sample with Random Seed = 3
The model will provide a score between 0 and 1 for each cluster
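As a rough illustration of a reproducible 20% validation sample, the sketch below splits a list of store IDs with a fixed seed. It uses Python's own random number generator, so it will not pick the same stores as Alteryx does with Random Seed = 3. The store IDs and the function name `split_validation` are hypothetical.

```python
import random

def split_validation(store_ids, validation_share=0.2, seed=3):
    """Shuffle with a fixed seed and hold out a share of the stores for validation."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    shuffled = list(store_ids)
    rng.shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_share)
    return shuffled[n_validation:], shuffled[:n_validation]

# 85 hypothetical store IDs, matching the number of existing stores.
store_ids = [f"S{n:04d}" for n in range(1, 86)]
estimation, validation = split_validation(store_ids)
```

Fixing the seed matters because model comparisons are only meaningful when every model is validated on the same stores.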

Make sure we are using the cluster as a string variable
We should assign the cluster with the highest value for each store
Something similar to the following should work:

IF [Score_1] > [Score_2] AND [Score_1] > [Score_3] THEN "1"
ELSEIF [Score_2] > [Score_1] AND [Score_2] > [Score_3] THEN "2"
ELSE "3" ENDIF
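The same "pick the cluster with the highest score" logic can be sketched in Python. The score values below are made up, and `assign_cluster` is a hypothetical helper name.

```python
def assign_cluster(scores):
    """Return the label (a string) of the cluster with the highest score."""
    return max(scores, key=scores.get)

# Hypothetical model scores for one new store, keyed by cluster label.
new_store_scores = {"1": 0.12, "2": 0.71, "3": 0.17}
cluster = assign_cluster(new_store_scores)
```

One difference to be aware of: on an exact tie, `max` returns the first of the tied labels, whereas the IF expression falls through to "3" whenever neither of the first two scores is strictly the largest.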

10 of 14

Setting up the data for the forecast

11 of 14

Setting up the data for the forecast

Setting up the data for existing stores:

12 of 14

Setting up the data for the forecast

Setting up the data for new stores:

13 of 14

Setting up the data for the forecast

Other important points to keep in mind for Task 3

We should use a 6-month holdout sample
We are looking for Produce sales only
For the new stores, we need to remember to multiply the forecast by the number of new stores in each cluster, as the predictions are for a single store
To select the model, we only need to use the existing store data

Once we select the model (ARIMA or ETS), we don't need to rerun the analysis for each cluster
When comparing the models, we need to make sure we are using the accuracy measures or an actual vs. forecast comparison for the holdout sample months (using the TS Compare tool)
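The holdout comparison and the new-store scaling can be sketched as below. This is a minimal illustration, not the TS Compare tool: the monthly series is made up, the "forecast" is a naive placeholder rather than an ETS or ARIMA model, and multiplying by the number of new stores in a cluster is an assumption about how the single-store forecast is scaled. All function names are hypothetical.

```python
import math

def split_holdout(series, holdout=6):
    """Keep the last `holdout` months aside for validation."""
    return series[:-holdout], series[-holdout:]

def rmse(actual, forecast):
    """Root mean squared error over the holdout months."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

def mae(actual, forecast):
    """Mean absolute error over the holdout months."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def scale_to_cluster(per_store_forecast, n_new_stores):
    """Turn a single-store forecast into a total for all new stores in a cluster."""
    return [value * n_new_stores for value in per_store_forecast]

# Made-up monthly Produce sales for 24 months.
monthly_sales = [100.0 + month for month in range(24)]
train, holdout_actual = split_holdout(monthly_sales)

# Placeholder forecast: repeat the last training value (not ETS/ARIMA).
naive_forecast = [train[-1]] * len(holdout_actual)

holdout_rmse = rmse(holdout_actual, naive_forecast)
holdout_mae = mae(holdout_actual, naive_forecast)
```

The candidate models are compared on errors over the holdout months only, since in-sample fit tends to flatter the more flexible model.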

For the visualization, we should set up our data as follows:
