1 of 14

Predictive Analytics for Business Nanodegree

Combining Predictive Techniques

Andre Costa

Udacity Mentor

Since Jan/17

2 of 14

Agenda

  • Project Summary
  • Pain point #1: Getting the Correct Cluster Distribution
  • Pain point #2: Predicting Cluster for New Stores
  • Pain point #3: Setting up the data for the forecast
  • Q&A

3 of 14

PROJECT SUMMARY

4 of 14

Project Summary

As the project name implies, the goal here is to combine some of the techniques learned into a single analysis.

We will be revisiting the following skills acquired in the course:

  • Clustering Analysis - Task 1
  • Classification Models - Task 2
  • Time Series Forecasting - Task 3
  • Building Visualizations - Task 1, 3

Tips for starting the project:

  • Make sure to review the past lessons
  • Revisit the workflows you built previously, as the setup is very similar
  • Review the related questions in Knowledge

5 of 14

Getting the Correct Cluster Distribution

6 of 14

Getting the Correct Cluster Distribution

The whole project depends on this answer being correct, so it is very important to get it right.

7 of 14

Getting the Correct Cluster Distribution

Depending on the Alteryx version, we should get 23 or 25 stores in cluster 1.

  • Keep the data for 2015 only.
  • Group by store and sum the sales of each category.
    • We should end up with summed sales values for 85 stores for each of the categories.
    • Make sure to also get the total sales per store.
  • Next, calculate each food category's percentage of total sales.
    • Ex: % of Dry Grocery sales => “[Sum_Dry_Grocery] / [Total_Sales] * 100”.
    • The variable should be of type Double.
  • Configuring the “K-Centroids Cluster Analysis” tool:
    • Select the 9 category fields (all the categories we have).
    • Make sure to use the z-score to standardize the fields.
    • Do not select the Total_Sales variable.
    • Use the default value for “Number of Starting Seeds”.
  • Make sure to use K-Means as the clustering method.
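The data preparation steps above can be sketched outside Alteryx as well. The following is a minimal Python sketch, not the Alteryx workflow itself: the rows, store IDs, and the three categories are made-up placeholders (the project data has 85 stores and 9 categories), and the use of the sample standard deviation for the z-score is an assumption about how the standardization is computed.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical rows: (store, year, category, sales). Not project data.
rows = [
    ("S0001", 2015, "Dry_Grocery", 300.0),
    ("S0001", 2015, "Dry_Grocery", 200.0),
    ("S0001", 2015, "Produce", 300.0),
    ("S0001", 2015, "Dairy", 200.0),
    ("S0001", 2014, "Produce", 999.0),  # dropped by the 2015 filter
    ("S0002", 2015, "Dry_Grocery", 400.0),
    ("S0002", 2015, "Produce", 400.0),
    ("S0002", 2015, "Dairy", 200.0),
    ("S0003", 2015, "Dry_Grocery", 600.0),
    ("S0003", 2015, "Produce", 150.0),
    ("S0003", 2015, "Dairy", 250.0),
]

# Keep 2015 only, then sum sales per store and category.
sums = defaultdict(lambda: defaultdict(float))
for store, year, category, sales in rows:
    if year == 2015:
        sums[store][category] += sales

# Percentage of each category in the store's total sales.
shares = {}
for store, by_cat in sums.items():
    total_sales = sum(by_cat.values())
    shares[store] = {cat: value / total_sales * 100 for cat, value in by_cat.items()}

def z_scores(values):
    """Standardize values to mean 0 (sample standard deviation is an assumption)."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

# Standardize each percentage field across stores; total sales is not a cluster input.
stores = sorted(shares)
categories = sorted(shares[stores[0]])
standardized = {cat: z_scores([shares[s][cat] for s in stores]) for cat in categories}
```

The `standardized` fields are what a K-Means step would take as input. Checking that each store's percentages sum to 100 is a quick sanity test before clustering.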

8 of 14

Predicting Cluster for New Stores

9 of 14

Predicting Cluster for New Stores

Leverage your analysis from the Creditworthiness project here.

  • Make sure to use all the available variables in the StoreDemographicData file.
    • We should not use PCA here.
  • Remember to use a 20% validation sample with Random Seed = 3.
  • The model will provide a score between 0 and 1 for each cluster.
    • Make sure the cluster is used as a string variable.
    • Assign each store the cluster with the highest score.
    • Something similar to the following should work:
      • IF [Score_1] > [Score_2] AND [Score_1] > [Score_3] THEN "1" ELSEIF [Score_2] > [Score_1] AND [Score_2] > [Score_3] THEN "2" ELSE "3" ENDIF
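For readers who want to check the assignment logic outside Alteryx, the same "pick the highest score" rule can be sketched in Python. The store IDs and scores below are hypothetical, and `assign_cluster` is an illustrative helper, not part of any tool.

```python
def assign_cluster(scores):
    """Return the cluster label (a string) whose score is highest."""
    return max(scores, key=scores.get)

# Hypothetical model scores per new store, keyed by cluster label (strings).
new_store_scores = {
    "S0086": {"1": 0.12, "2": 0.71, "3": 0.17},
    "S0087": {"1": 0.55, "2": 0.30, "3": 0.15},
    "S0088": {"1": 0.20, "2": 0.25, "3": 0.55},
}

predicted = {store: assign_cluster(s) for store, s in new_store_scores.items()}
```

One difference to be aware of: on an exact tie, `max` returns the first label with the top score, while the IF expression above falls through to "3" when clusters 1 and 2 tie for the highest score. Exact ties in model scores are unlikely to matter in practice.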

10 of 14

Setting up the data for the forecast

11 of 14

Setting up the data for the forecast

Setting up the data for existing stores:

12 of 14

Setting up the data for the forecast

Setting up the data for new stores:

13 of 14

Setting up the data for the forecast

Other important points to keep in mind for Task 3:

  • We should use a 6-month holdout sample.
  • We are forecasting Produce sales only.
  • For the new stores, remember to multiply the forecast by the number of new stores, as the predictions are for a single store.
  • To select the model, we only need the existing store data.
    • Once we select the model (ARIMA or ETS), we don’t need to rerun the analysis for each cluster.
    • When comparing the models, make sure to use the accuracy measures or an actual vs. forecast comparison for the holdout sample months (using the TS Compare tool).
  • For the visualization, we should set up our data as follows:
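Apart from the visualization layout, the holdout comparison and the new-store scaling from the list above can be sketched in Python. This is a rough illustration under stated assumptions, not what the TS Compare tool computes internally: every number is invented, MAPE stands in for "the accuracy measures", and the per-cluster store counts are hypothetical.

```python
def split_holdout(series, holdout_months=6):
    """Split a monthly series into training data and the final holdout months."""
    return series[:-holdout_months], series[-holdout_months:]

def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    return sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual) * 100

# Hypothetical monthly produce sales for existing stores (not project data).
produce_sales = [100.0, 110.0, 120.0, 115.0, 130.0, 125.0,
                 140.0, 135.0, 150.0, 145.0, 160.0, 155.0]
train, holdout = split_holdout(produce_sales)

# Hypothetical holdout-period forecasts from the two candidate models.
candidates = {
    "ETS": [138.0, 137.0, 148.0, 147.0, 158.0, 157.0],
    "ARIMA": [130.0, 130.0, 140.0, 140.0, 150.0, 150.0],
}
errors = {name: mape(holdout, fc) for name, fc in candidates.items()}
best_model = min(errors, key=errors.get)  # lowest holdout error wins

# Scale a single-store forecast by the number of new stores in each cluster.
per_store_forecast = {"1": 20.0, "2": 30.0, "3": 25.0}  # hypothetical, one month
new_store_counts = {"1": 3, "2": 6, "3": 1}             # hypothetical counts
new_store_total = sum(
    per_store_forecast[c] * new_store_counts[c] for c in per_store_forecast
)
```

The model is chosen once, on the existing-store holdout months, and the same choice is then reused for the new-store forecasts.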

14 of 14

Q&A