1 of 187

©2018 dataiku, Inc. | dataiku.com | contact@dataiku.com | @dataiku

2 of 187

Dataiku Training

DISCOVERY

3 of 187

Agenda

Discovery

  1. Dataiku Introduction
  2. Demo - Loan Default Prediction
  3. Hands-on Exercise - Credit Card Transactions

4 of 187

Agenda

Discovery

  • Dataiku Introduction
  • Demo - Loan Default Prediction
  • Hands-on Exercise - Credit Card Transactions

5 of 187

A Brief History of Dataiku

Key Company Milestones, 2013-2020

SOFTWARE

  • Beta, then version 1: the first tool integrating data prep and machine learning
  • version 2: real-time collaboration, Spark integration
  • version 3: scenarios, automation, and version control
  • version 4: advanced scalability, cloud support
  • version 5: deep learning, containerization, project wikis
  • version 6: enterprise-scale Elastic AI with Kubernetes
  • version 7: explainable AI and advanced visual statistics

CUSTOMERS

  • First customer, then 50th, 100th, and 300th customers
  • First customers with 10+, 100+, 500+, and 1,000+ users
  • 1st CAB (Customer Advisory Board)
  • 1,000+ EGG visitors

COMPANY

  • Dataiku created in 2013; 20th, 100th, 200th, and 400th employees
  • Offices in New York and London, then 7 new offices in EMEA, APAC, and the Americas
  • $4M raised, $14M Series A, $28M Series B, $101M Series C
  • $1.4B valuation

6 of 187

7 of 187

End-to-end Software

Dataiku DSS

8 of 187

End-to-End Platform Solution

A skill-agnostic platform that supports all data project stakeholders, through both visual and code-based interfaces:

  • Data cataloging and connectivity
  • Data preparation
  • (Auto) machine learning
  • Production deployment
  • Governance, versioning, audit, and reuse

9 of 187

Data Access and Processing

Dataiku DSS

  • Connect to your (many) data sources
  • Leverage various compute resources
  • Develop reusable, maintainable data flows
  • Understand your data
  • Code... or click

10 of 187

Machine Learning

Dataiku DSS

AUTO-ML OR CODE BASED

FULL TRAINING MANAGEMENT

11 of 187

Leverage your full stack and skills

Solution Overview: Architecture

  • Browser-based interface with user/task-specific interaction modes: visual development, complete coding environments, visualization, collaboration and project management, audit, monitoring, and scheduling
  • Centralized DSS server (Linux, on premise or managed cloud) that facilitates access to data and resources and fosters collaboration
  • Leverages the best storage and compute resources: centralized or ad-hoc data sources, databases, and data lakes, plus available or spun-up processing resources
  • Dataiku deployment servers for enterprise-grade operationalization on production systems

12 of 187

The “Tower of Babel” effect of data projects

The Classic Data Project Silos

Each stakeholder - Business Analyst, Data Scientist, Data Engineer, Analytics Leader - works in separate, siloed tools: visual data preparation, Auto ML, R/Python notebooks and (many) data science libraries, visualization and validation, refactored data projects, IT monitoring, and model monitoring.

13 of 187

Bring Business Analysts, Engineers, and Scientists Together

Share a common environment and understanding

The same stakeholders - Analytics Leader, Business Analyst, Data Scientist, Data Engineer - and the same building blocks (visual data preparation, Auto ML, R/Python notebooks and data science libraries, visualization and validation, refactored data projects, IT and model monitoring) come together in one shared environment.

14 of 187

Agenda

Discovery

  • Dataiku Introduction
  • Demo - Loan Default Prediction
  • Hands-on Exercise - Credit Card Transactions

15 of 187

Use Dataiku to predict likelihood of default

Default Prediction

Can we use a bank's historical customer data to build a model that predicts whether a loan will be paid back?

16 of 187

Agenda

Discovery

  • Dataiku Introduction
  • Demo - Loan Default Prediction
  • Hands-on Exercise - Credit Card Transactions

17 of 187

Credit Card Transactions

18 of 187

Agenda

Discovery

Hands-on Exercise

  • Background
  • DSS Login and Set up Account
  • Creating Projects and Connecting to Data
  • Stacking Datasets
  • Joins
  • Computation Engines
  • Data Cleaning [1]
  • Charts [1]
  • Splitting Data
  • Machine Learning
  • Filtering Data
  • Data Cleaning [2]
  • Group by Aggregations
  • Charts [2]

Green Track

Purple Track

19 of 187

Agenda

Discovery

Hands-on Exercise

  • Background
  • DSS Login and Set up Account
  • Creating Projects and Connecting to Data
  • Stacking Datasets
  • Joins
  • Computation Engines
  • Data Cleaning [1]
  • Charts [1]
  • Splitting Data
  • Machine Learning
  • Filtering Data
  • Data Cleaning [2]
  • Group by Aggregations
  • Charts [2]

Green Track

Purple Track

20 of 187

Elo

Background

  • Major Brazilian debit and credit card brand
  • Over 80 million issued cards
  • We have data on:
    • Credit card transactions (Jan ‘17 - Apr ‘18)
    • Merchant info
    • Credit card holder info

21 of 187

Elo

Challenge #1

Challenge: Predict fraudulent credit card transactions

Steps:

  • Train a machine learning model using 2017 data
  • Use the model to make predictions for 2018 data

Green Track

22 of 187

Elo

Challenge #2

Challenge: Optimize different card rewards programs

Steps:

  • Calculate theoretical past benefits for each reward program
  • Recommend card changes to customers with the highest value add potential

Purple Track

23 of 187

Agenda

Discovery

Hands-on Exercise

  • Background
  • DSS Login and Set up Account
  • Creating Projects and Connecting to Data
  • Stacking Datasets
  • Joins
  • Computation Engines
  • Data Cleaning [1]
  • Charts [1]
  • Splitting Data
  • Machine Learning
  • Filtering Data
  • Data Cleaning [2]
  • Group by Aggregations
  • Charts [2]

Green Track

Purple Track

24 of 187

Login to DSS

https://dssX-design-COMPANYNAME.training.dataiku.com (replace the X with your instance number)

Username: userY
Password: INSTANCE_PASSWORD

25 of 187

Set up your account

26 of 187

Set up your account

27 of 187

Change your display name (to your name)

28 of 187

Agenda

Discovery

Hands-on Exercise

  • Background
  • DSS Login and Set up Account
  • Creating Projects and Connecting to Data
  • Stacking Datasets
  • Joins
  • Computation Engines
  • Data Cleaning [1]
  • Charts [1]
  • Splitting Data
  • Machine Learning
  • Filtering Data
  • Data Cleaning [2]
  • Group by Aggregations
  • Charts [2]

Green Track

Purple Track

29 of 187

Create a new project

Name your project “Credit Card Fraud” + your name

30 of 187

Data Sources Overview

  • In this case, we are simply pulling 4 tables, all from a SQL database
  • DSS allows you to import data from multiple sources within the same project
    • Uploaded files, HDFS, SQL, NoSQL, external APIs, etc.

31 of 187

Data Sources Overview: 4 Datasets

transactions_2017 and transactions_2018

    • Consumer credit card transactions

cardholder_info

    • Credit score for each cardholder

merchant_info

    • Lookup table for merchants

32 of 187

Import data from PostgreSQL

Select “Choose connection to import from”

33 of 187

Import data from PostgreSQL

transactions_2017 and transactions_2018: consumer credit card transactions

cardholder_info: credit score for each cardholder

merchant_info: lookup table for merchants

  • Choose the postgres-local connection
  • Select all 4 tables and import

34 of 187

Import data from PostgreSQL

The default table names are good - create all 4 datasets

35 of 187

You should now have 4 datasets in your flow

Double click on a dataset to view it

36 of 187

Explore the transaction data

Each row is a transaction. The ‘authorized_flag’ column shows whether it was approved or not.

37 of 187

Column attributes

We can see information about each column and its contents

Column name

Schema - the storage type (e.g. string, int, double, array)

Meaning - the “human readable” meaning (e.g. text, integer, IP address, gender)

38 of 187

Agenda

Discovery

Hands-on Exercise

  • Background
  • DSS Login and Set up Account
  • Creating Projects and Connecting to Data
  • Stacking Datasets
  • Joins
  • Computation Engines
  • Data Cleaning [1]
  • Charts [1]
  • Splitting Data
  • Machine Learning
  • Filtering Data
  • Data Cleaning [2]
  • Group by Aggregations
  • Charts [2]

Green Track

Purple Track

39 of 187

Stack

  • Merge two or more datasets into one
  • Equivalent to a SQL UNION ALL statement
  • Good for datasets that share substantially the same columns/schema
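For intuition, here is a rough pandas sketch of what the Stack recipe does (a row-wise union of datasets sharing a schema); the toy columns below are illustrative assumptions, not the real datasets:

    import pandas as pd

    # Toy stand-ins for the two transaction datasets (illustrative columns only)
    transactions_2017 = pd.DataFrame({"card_id": ["C1"], "purchase_amount": [12.5]})
    transactions_2018 = pd.DataFrame({"card_id": ["C2"], "purchase_amount": [40.0]})

    # Stack = append rows, keeping every record from both inputs (like UNION ALL)
    transactions_stacked = pd.concat([transactions_2017, transactions_2018], ignore_index=True)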

40 of 187

Stack the transactions datasets together

41 of 187

Rename the output dataset “transactions_stacked”

42 of 187

Stack

43 of 187

Your flow should look like this

44 of 187

Agenda

Discovery

Hands-on Exercise

  • Background
  • DSS Login and Set up Account
  • Creating Projects and Connecting to Data
  • Stacking Datasets
  • Joins
  • Computation Engines
  • Data Cleaning [1]
  • Charts [1]
  • Splitting Data
  • Machine Learning
  • Filtering Data
  • Data Cleaning [2]
  • Group by Aggregations
  • Charts [2]

Green Track

Purple Track

45 of 187

Join

  • Join two or more datasets
  • Perform inner joins, left outer joins, full outer joins, right joins, etc.
  • Define join conditions using the UI or SQL

46 of 187

Join the stacked transactions dataset with the other two datasets

  1. LEFT JOIN the transactions_stacked dataset with the cardholder_info dataset ON ‘card_id’ = ‘internal_card_mapping’
  2. LEFT JOIN the result with the merchant_info dataset ON ‘merchant_id’ and ‘merchant_category_id’
  3. ADD PREFIXES “cardholder” and “merchant” to the columns from the two joined datasets, respectively

Do this all in ONE Join recipe
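As a hedged illustration only (DSS runs this recipe on the SQL engine, not in pandas), the two left joins amount to something like this, with toy frames carrying just the join keys:

    import pandas as pd

    # Toy stand-ins (illustrative columns only)
    transactions_stacked = pd.DataFrame(
        {"card_id": ["C1"], "merchant_id": ["M1"], "merchant_category_id": [7], "purchase_amount": [25.0]}
    )
    cardholder_info = pd.DataFrame({"internal_card_mapping": ["C1"], "fico_score": [700]})
    merchant_info = pd.DataFrame({"merchant_id": ["M1"], "merchant_category_id": [7], "latitude": [40.7]})

    # LEFT JOIN 1: transactions_stacked -> cardholder_info ON card_id = internal_card_mapping
    step1 = transactions_stacked.merge(
        cardholder_info.add_prefix("cardholder_"),
        how="left", left_on="card_id", right_on="cardholder_internal_card_mapping",
    )

    # LEFT JOIN 2: -> merchant_info ON merchant_id and merchant_category_id (prefixed columns)
    transactions_joined = step1.merge(
        merchant_info.add_prefix("merchant_"),
        how="left",
        left_on=["merchant_id", "merchant_category_id"],
        right_on=["merchant_merchant_id", "merchant_merchant_category_id"],
    )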

47 of 187

Join the stacked transactions dataset with the other two datasets

  • Click on the transactions_stacked dataset
  • Choose the Join recipe

48 of 187

Join the stacked transactions dataset with the other two datasets

  • Select the cardholder_info dataset from the dropdown
  • Name the output dataset transactions_joined

49 of 187

LEFT JOIN transactions_stacked with cardholder_info

Sometimes DSS guesses the wrong desired join columns

Click the equals sign to change the join

50 of 187

Change the columns to join on

‘card_id’ = ‘internal_card_mapping’

LEFT JOIN transactions_stacked with cardholder_info

51 of 187

Add a new dataset input

Select the merchant_info dataset from the dropdown

LEFT JOIN transactions_stacked with merchant_info

52 of 187

This looks good!

Now let’s look at the Selected Columns tab

LEFT JOIN transactions_stacked with merchant_info

53 of 187

Add prefixes to columns from joined datasets

  • Deselect ‘card_internal_card_mapping’
  • Select ‘merchant_latitude’ and ‘merchant_longitude’

Any ideas why the merchant latitude/longitude columns weren’t included initially?

54 of 187

Run your join

Don’t be afraid of schema updates! (Click the blue button)

55 of 187

Flow check

56 of 187

Agenda

Discovery

Hands-on Exercise

  • Background
  • DSS Login and Set up Account
  • Creating Projects and Connecting to Data
  • Stacking Datasets
  • Joins
  • Computation Engines
  • Data Cleaning [1]
  • Charts [1]
  • Splitting Data
  • Machine Learning
  • Filtering Data
  • Data Cleaning [2]
  • Group by Aggregations
  • Charts [2]

Green Track

Purple Track

57 of 187

Computation

Dataiku leverages your infrastructure

Execution engines:

  • Run in database - enterprise SQL, analytic SQL
  • Run in cluster - Spark, Impala, Hive, ...
  • ML in memory - Python (scikit-learn), R, ...
  • Distributed ML - MLlib, H2O, ...
  • Run in memory - Python, R, ...

Data storage:

  • Database data - Vertica, Greenplum, Redshift, PostgreSQL, ...
  • Data lake - Cassandra, HDFS, ...
  • File system data - host file system, remote file system, ...

SMART COMPUTATION: Dataiku DSS automatically chooses the most effective execution engine among the engines available next to the input data of each computation.

58 of 187

Check out the computation engine for our Join recipe

Click to view or change the engine

59 of 187

Check out the computation engine for our Join recipe

We can see different available engines based on:

  • the recipe type
  • input/output dataset locations
  • infrastructure available to you

Leave this engine as SQL

60 of 187

Where can computation happen?

61 of 187

Agenda

Discovery

Hands-on Exercise

  • Background
  • DSS Login and Set up Account
  • Creating Projects and Connecting to Data
  • Stacking Datasets
  • Joins
  • Computation Engines
  • Data Cleaning [1]
  • Charts [1]
  • Splitting Data
  • Machine Learning
  • Filtering Data
  • Data Cleaning [2]
  • Group by Aggregations
  • Charts [2]

Green Track

Purple Track

62 of 187

Prepare

~90 different processors, including

  • Find and replace
  • Date manipulations
  • Mathematical operations on columns
  • Splitting columns
  • Extracting parts of columns
  • Text parsing and NLP
  • Geographic transformations

63 of 187

Create a Prepare recipe from the transactions_joined dataset

64 of 187

Round 1: Create time-based features

  1. Parse dates: card_first_active_month and purchase_date
  2. Extract date components from purchase_date_parsed
    • year, month, day, dow, hour
  3. Create a purchase_weekend (sat/sun) column using a DSS formula
  4. Create a days_active column: the time from the cardholder’s first_active_month until purchase_date
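For reference, a pandas approximation of these four steps (the visual Prepare processors do the equivalent work; the sample values below are made up):

    import pandas as pd

    df = pd.DataFrame({
        "card_first_active_month": ["2016-03"],
        "purchase_date": ["2017-08-19 14:05:00"],
    })

    # 1. Parse dates
    df["card_first_active_month_parsed"] = pd.to_datetime(df["card_first_active_month"], format="%Y-%m")
    df["purchase_date_parsed"] = pd.to_datetime(df["purchase_date"])

    # 2. Extract date components
    df["purchase_year"] = df["purchase_date_parsed"].dt.year
    df["purchase_month"] = df["purchase_date_parsed"].dt.month
    df["purchase_day"] = df["purchase_date_parsed"].dt.day
    df["purchase_dow"] = df["purchase_date_parsed"].dt.dayofweek
    df["purchase_hour"] = df["purchase_date_parsed"].dt.hour

    # 3. Weekend flag (this exercise defines weekends as Friday/Saturday: dayofweek 4 or 5)
    df["purchase_weekend"] = df["purchase_dow"].isin([4, 5]).astype(int)

    # 4. Days between the card's first active month and the purchase
    df["days_active"] = (df["purchase_date_parsed"] - df["card_first_active_month_parsed"]).dt.days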

65 of 187

Parse card_first_active_month

We want yyyy-MM format


66 of 187

Parse purchase_date

Note: if you want to parse in place, simply delete the output column name


67 of 187

Extract date components from purchase_date_parsed

Clean up column names:

for example… purchase_date_parsed_year → purchase_year

Add:

purchase_dow

purchase_hour


68 of 187

Create binary purchase_weekend column

Let’s add this one using the

+ ADD A NEW STEP button

Use a DSS Formula!

We are defining weekends as Friday/Saturday

69 of 187

Create binary purchase_weekend column

Click “EDIT” to pull up examples and reference

Make sure to “SAVE” your formula back to the processor

You can add comments to document potentially complex formulas!

70 of 187

Compute the time from first active month to purchase date

Let’s add this one using the + ADD A NEW STEP button

Name the output column:

days_active

71 of 187

You can Group your date cleaning steps

Call your group “Date Formatting”

This is much easier to read!

You can also modify colors or add descriptions!

72 of 187

Round 2: Create geolocation features

  • Create geopoint columns
    • merchant_location using the merchant latitude & longitude
    • cardholder_location using the cardholder latitude & longitude
  • Reverse geocode
    • merchant_location to get the merchant_state
    • cardholder_location to get the cardholder_state

73 of 187

Create merchant geopoint column

Use the processor search

74 of 187

Create cardholder geopoint column

Do the same thing for the cardholder locations!

75 of 187

For step 2, we will be using the reverse geocoding plugin…

76 of 187

Plugins

Many Types

In DSS plugins, the user can develop:

  • Custom recipes
    • A basic user interface that lets users interact with variables of the underlying code (R, Python, SQL)
    • Example: query a REST API with some settings; apply a statistical function to a field of a dataset
  • Custom datasets
    • Example: query a Salesforce table
  • Much more
    • New preparation step processors
    • Custom scenario steps
    • Dataset exporters
    • Web app / R Markdown templates

Plugins are then available as a DSS extension

77 of 187

Reverse geocode the merchant location

Use the processor search

We can choose what to extract.

In “region”, enter “merchant_state”

78 of 187

Reverse geocode the cardholder location

Feeling lazy? Duplicate and modify!

79 of 187

Compute the distance between cardholders and merchants

Use the processor search
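The processor returns the geographic distance between the two geopoints; a minimal Python sketch, assuming a great-circle (haversine) distance in kilometers:

    from math import radians, sin, cos, asin, sqrt

    def haversine_km(lat1, lon1, lat2, lon2):
        # Great-circle distance between two (lat, lon) points, in kilometers
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 6371.0 * asin(sqrt(a))

    # e.g. merchant location vs. cardholder location
    merchant_cardholder_distance = haversine_km(40.71, -74.00, 34.05, -118.24)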

80 of 187

Note the “invalid” data

DSS has two “data types” listed:

  1. Storage type: indicates how the dataset backend should store the data
  2. Meaning: a rich semantic type that is more specific and automatically detected

The red marks refer to invalid meanings. They will not necessarily cause errors; they are there to guide you.

Including non-US data is fine for now, so we can leave the “invalid” data in.

81 of 187

Double check that you’ve added all steps

...and then click “RUN”

82 of 187

Analyze the transactions_joined_prepared dataset

83 of 187

Click on a column name and ‘Analyze’ to see more

  • Distributions
  • Summary stats
  • Value counts

Stats can be computed on the sample or whole dataset

84 of 187

Questions

  1. What is the highest individual purchase amount?
  2. What percentage of transactions are unauthorized (fraudulent)?

85 of 187

When analyzing a column, you can run stats on the entire dataset

86 of 187

Questions

  • What is the highest individual purchase amount?

$48,147

  • What percentage of transactions are unauthorized?

7.4%

87 of 187

Agenda

Discovery

Hands-on Exercise

  • Background
  • DSS Login and Set up Account
  • Creating Projects and Connecting to Data
  • Stacking Datasets
  • Joins
  • Computation Engines
  • Data Cleaning [1]
  • Charts [1]
  • Splitting Data
  • Machine Learning
  • Filtering Data
  • Data Cleaning [2]
  • Group by Aggregations
  • Charts [2]

Green Track

Purple Track

88 of 187

Charts

89 of 187

Build a chart on the transactions_joined_prepared dataset

Task:

  • Map the fraud rate by US state
  • Which states have the highest rates?

Details:

  • Filled administrative map
  • Geo - merchant_location
  • Fill - authorized_flag (AVG)

Note:

  • Change the sample to 100,000 random (left hand menu)

90 of 187

Maps

91 of 187

Scatter plots

Task:

  • Plot the purchase_amount by FICO score - show which transactions are fraudulent
  • Does it look like people with higher FICO scores have more fraudulent transactions?
  • What about high vs. low transaction amounts?

Details:

  • Scatter plot
  • X-axis - card_fico_score
  • Y-axis - purchase_amount
  • Color - authorized_flag

Note:

  • Change the sample to 1,000 random (left hand menu)
  • Filter for only purchases less than $1000

92 of 187

Scatter plots

93 of 187

Publish both of your charts to a dashboard

Add more content to your dashboard by clicking the green plus button

  • Add a title (e.g. Credit Card Fraud Patterns)

94 of 187

Dashboards

95 of 187

Agenda

Discovery

Hands-on Exercise

  • Background
  • DSS Login and Set up Account
  • Creating Projects and Connecting to Data
  • Stacking Datasets
  • Joins
  • Computation Engines
  • Data Cleaning [1]
  • Charts [1]
  • Splitting Data
  • Machine Learning
  • Filtering Data
  • Data Cleaning [2]
  • Group by Aggregations
  • Charts [2]

Green Track

Purple Track

96 of 187

Challenge 1

Predict fraudulent transactions using machine learning

Green Track

97 of 187

Challenge 1 Steps

  • Split the data into records where we know the fraud/no-fraud status and records where we don’t (2017 and 2018 transactions, respectively)
  • Train a machine learning model in DSS using known 2017 data
  • Use the model to make predictions for unknown 2018 data

98 of 187

Split

99 of 187

Split

  • Split a dataset
    • Randomly
    • Based on some condition

100 of 187

Split the joined dataset into known and unknown datasets

  1. Create two new datasets, transactions_known and transactions_unknown
  2. Define a filter
    • where “authorized_flag” is defined → transactions_known
    • all other rows → transactions_unknown

We can train our model only on the records where we know the value of “authorized_flag”
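Conceptually, the Split recipe’s filter does something like this pandas sketch (toy rows only):

    import pandas as pd

    transactions_joined_prepared = pd.DataFrame({
        "card_id": ["C1", "C2"],
        "authorized_flag": [1.0, None],  # None marks 2018 rows where the outcome is not yet known
    })

    known_mask = transactions_joined_prepared["authorized_flag"].notna()
    transactions_known = transactions_joined_prepared[known_mask]      # rows with a known outcome (2017)
    transactions_unknown = transactions_joined_prepared[~known_mask]   # rows to score later (2018)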

101 of 187

Create a split recipe and the two output datasets

102 of 187

Split into known and unknown datasets

...and then click “RUN”

103 of 187

Agenda

Discovery

Hands-on Exercise

  • Background
  • DSS Login and Set up Account
  • Creating Projects and Connecting to Data
  • Stacking Datasets
  • Joins
  • Computation Engines
  • Data Cleaning [1]
  • Charts [1]
  • Splitting Data
  • Machine Learning
  • Filtering Data
  • Data Cleaning [2]
  • Group by Aggregations
  • Charts [2]

Green Track

Purple Track

104 of 187

Machine Learning

  • Create machine learning models using the visual interface
    • Leverage open source packages like scikit-learn, XGBoost, MLlib, Keras, TensorFlow, and H2O
  • Perform batch-scoring on new data using a trained model
  • Deploy models as REST API endpoints

105 of 187

Train a model to predict “authorized_flag”

  1. Go to the Lab for the transactions_known dataset
  2. Create a new Visual Model
  3. In the Design tab, try different algorithms and hyperparameters
  4. Train your models - find the best one

106 of 187

Click on the transactions_known dataset and go to the Lab

Then click “AutoML Prediction”

107 of 187

Design the modeling task as you’d like

  • Select your target variable: authorized_flag
  • Select “Quick Prototypes”

108 of 187

Click the Design tab to change ML settings

109 of 187

Machine learning concepts

Regression - predict a continuous numeric target variable

Classification - predict a discrete “class” of your target variable

What type of problem is ours?

Classification

110 of 187

Classification - imbalanced data

Notice that our target column ‘authorized_flag’ is imbalanced - this is problematic

Two things we can try:

  1. Rebalance our train/test sets to undersample rows from the more popular class
  2. Weight error metrics according to the inverse frequency of each class
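A scikit-learn sketch of both options on toy data (DSS exposes them as visual settings; this code is only an analogy):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = (rng.random(1000) < 0.07).astype(int)  # ~7% minority class, like our fraud rate

    # Option 1: undersample the majority class so both classes are equally represented
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = rng.choice(np.flatnonzero(y == 0), size=len(minority_idx), replace=False)
    balanced_idx = np.concatenate([minority_idx, majority_idx])
    X_balanced, y_balanced = X[balanced_idx], y[balanced_idx]

    # Option 2: keep all rows but weight errors by inverse class frequency
    model = RandomForestClassifier(class_weight="balanced", random_state=0)
    model.fit(X, y)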

111 of 187

Rebalance train/test sets

Choose 100,000 records balanced from the ‘authorized_flag’ column

112 of 187

Train/test split

Note: it’s common to split your known data into:

  • 80% - data to train the model
  • 20% - hold-out data to test your model performance
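The same idea in a short scikit-learn sketch (toy data; DSS handles this split for you in the Train/Test settings):

    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))             # toy feature matrix
    y = (rng.random(1000) < 0.07).astype(int)  # toy imbalanced target

    # 80% to train, 20% held out to test; stratify keeps the class ratio in both splits
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)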

113 of 187

Weight error metrics

You can weight error calculations based on the target column class (or another column)

  • “Class weights” is generally a good place to start

Look out for blue boxes for tips

114 of 187

Feature selection

Choose the columns in our dataset to consider when modeling credit card fraud

Consider:

  • Only include columns that will be available at the moment you need to make a prediction (no data leakage)
  • Intuition and subject matter expertise are important

115 of 187

You can turn features “On” in the “Features handling” tab

Notice:

  1. Feature rescaling
  2. Missing value imputation


116 of 187

Turn on the following features

(From the point of sale)

  • purchase_month
  • purchase_day
  • purchase_dow
  • purchase_weekend
  • purchase_hour
  • purchase_amount
  • item_category

(From cardholder_info table)

  • days_active
  • card_reward_program
  • card_latitude
  • card_longitude
  • card_fico_score
  • card_age
  • card_state

(From the merchant_info table)

  • merchant_subsector_description
  • merchant_latitude
  • merchant_longitude
  • merchant_cardholder_distance
  • merchant_state
  • merchant_category_id

117 of 187

Algorithm selection and hyperparameter tuning

Try out different ML algorithms/hyperparameters - then see which works best

Consider:

  • Try both linear and non-linear models at first
  • Try wide ranges of hyperparameters first, then home in on the better-performing values (e.g. start with random forest max depths of 6, 12, 18; if 12 performs best, then try 10, 12, 14, and so on)
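The grid of hyperparameter “bubbles” behaves like a grid search; a scikit-learn sketch of the same idea on toy data (not DSS’s internal implementation):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))
    y = (rng.random(500) < 0.1).astype(int)

    # Start with a wide grid, then narrow it around the best values in a later iteration
    grid = {"max_depth": [6, 12, 18], "n_estimators": [100, 400], "min_samples_leaf": [1, 3]}
    search = GridSearchCV(RandomForestClassifier(random_state=0), grid, scoring="roc_auc", cv=3)
    search.fit(X, y)
    print(search.best_params_, round(search.best_score_, 3))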

118 of 187

Add a Random Forest with these hyperparameters

Add new hyperparameters as blue bubbles

DSS will try all possible combinations and show you the best one

119 of 187

What is a Random Forest?

Check out the blue info boxes for more on each algorithm

120 of 187

Add a Logistic Regression model & train!

This is a classic algorithm and easier to interpret than others

121 of 187

Questions

  • What are the most important variables used in your model?

  • What is your best ROC AUC?

  • Analyze your predictions for cardholders in each state - does your model make drastically higher or lower predictions for any states?

122 of 187

Analyze model performance

Metrics like Accuracy, ROC AUC, F1 score, and mean squared error are important.

Consider:

  • Look at feature importances (how is each column considered by the model) - do they make sense?
  • Look at the confusion matrix - do you care more about false negatives, false positives, true negatives, true positives?

123 of 187

Analyze model performance

Random forest has the highest AUC - let’s check it out in detail

124 of 187

Variable importance

‘merchant_cardholder_distance’ and ‘card_age’ seem to be strong predictors of fraudulent transactions

Tip: some data scientists remove features below a certain importance threshold (e.g. 4%) in the next iteration

125 of 187

Variable importance

Takeaway:

Keep only important features (above 4% importance)

In my random forest model, these were:

    • merchant_cardholder_distance
    • card_age
    • purchase_amount
    • merchant_subsector_description
    • merchant_longitude
    • merchant_category_id
    • card_fico_score

Your model may be different!

126 of 187

Subpopulation analysis

Make sure your model doesn’t perform significantly differently across subpopulations

127 of 187

Hyperparameter tuning

Find the model configuration which had the highest score

For me, it was:

  • n_estimators = 400
  • max_depth = 18
  • min_samples_leaf = 3

128 of 187

Confusion matrix

This matrix shows the counts of true positives, true negatives, false positives, and false negatives from the holdout test set

Tip: consider your problem.

Do you care more about catching all positives?

Do you care about raw accuracy?

  • Toggle the above threshold to change model sensitivity

129 of 187

Hyperparameter tuning

Takeaway:

  • Try a new grid search with hyperparameters closer to your previous best configuration

My best model:

  • n_estimators = 400
  • max_depth = 18
  • min_samples_leaf = 3

Next time try:

  • n_estimators = 400, 350, 450
  • max_depth = 18, 16, 20
  • min_samples_leaf = 3, 5

130 of 187

Now improve your model

  1. Remove unimportant features
  2. Add in ‘signature_provided’ feature
  3. Add back in ‘merchant_state’ - change the category handling from “dummy encoding” to “impact coding”
  4. Change your random forest grid search to new hyperparameters

131 of 187

Remove unimportant features

Go back to the design tab -> features handling

132 of 187

Turn on ‘signature_provided’

Perhaps this is a helpful new variable

133 of 187

Keep ‘merchant_state’

Change the handling to “Impact coding”

  • This will reduce the sparsity of your dataset
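A simplified pandas sketch of the difference (DSS’s actual impact coding adds safeguards such as regularization, omitted here):

    import pandas as pd

    df = pd.DataFrame({
        "merchant_state": ["NY", "NY", "CA", "CA", "TX"],
        "authorized_flag": [1, 0, 1, 1, 0],
    })

    # Dummy (one-hot) encoding: one sparse 0/1 column per state
    dummies = pd.get_dummies(df["merchant_state"], prefix="merchant_state")

    # Impact coding: one dense column holding the mean target value per state
    state_means = df.groupby("merchant_state")["authorized_flag"].mean()
    df["merchant_state_impact"] = df["merchant_state"].map(state_means)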

134 of 187

Change the hyperparameter grid for Random Forest

135 of 187

Then train your models again

Machine learning is an iterative process

136 of 187

Consider model training time

137 of 187

Analyze the Random Forest model again

We can see an improvement over the previous session

  • Higher AUC is better
  • AUC = 1 is a perfect model

138 of 187

Variable importance

Our changes to ‘signature_provided’ and ‘merchant_state’ seem to have improved the model

139 of 187

Confusion matrix

This model caught only 841 out of 4,791 fraudulent transactions in the hold-out test set

Not great, but better than a random guess!

140 of 187

Go back to the Results tab and analyze the Logistic Regression

141 of 187

Regression coefficients

Logistic regression is a linear model, so we can see both the magnitude and directional impact of different variables on our target

  • In this model, many merchant-related variables have strong positive or negative effects on fraud
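As a sketch of why linear coefficients are readable this way (toy data and made-up feature labels, not this project’s model):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))
    y = (X[:, 0] - 0.5 * X[:, 2] + rng.normal(scale=0.5, size=500) > 0).astype(int)

    model = LogisticRegression().fit(X, y)
    for name, coef in zip(["feature_a", "feature_b", "feature_c"], model.coef_[0]):
        # Positive coefficient -> pushes predictions toward class 1; negative -> toward class 0
        print(name, round(coef, 2))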

142 of 187

Choose a model and deploy it to your project flow

Consider:

  • Random Forest has a higher AUC (your model may be different)
  • Logistic regression is easier to interpret (both the magnitude and direction of variable relationships)

143 of 187

Deploy your model to the flow

144 of 187

Use your model to score the unknown transactions

… Run the recipe as is

145 of 187

Your scored dataset will have new prediction columns

146 of 187

Flow check

147 of 187

Agenda

Discovery

Hands-on Exercise

  • Background
  • DSS Login and Set up Account
  • Creating Projects and Connecting to Data
  • Stacking Datasets
  • Joins
  • Computation Engines
  • Data Cleaning [1]
  • Charts [1]
  • Splitting Data
  • Machine Learning
  • Filtering Data
  • Data Cleaning [2]
  • Group by Aggregations
  • Charts [2]

Green Track

Purple Track

148 of 187

Challenge 2

Find optimal card rewards program for each customer

Purple Track

149 of 187

Card Rewards Programs

Cash back
  • 1.5% cash back on all purchases

Dining and entertainment*
  • 6% cash back on dining and entertainment
  • 4% at grocery stores
  • 1% on all other purchases

Travel
  • 8% cash back on travel and dining
  • 1% on all other purchases

*Requires a FICO score above 650

150 of 187

Challenge 2 Steps

  • Calculate the theoretical benefits of each reward program
  • Suggest optimal cards based on spending history

151 of 187

Filter

152 of 187

Filter

  • Filter a dataset
    • Randomly
    • Based on some condition

153 of 187

Filter out fraudulent transactions

In this problem, we only want to calculate rewards on valid purchases

Only keep rows where ‘authorized_flag’ == 1

154 of 187

Create a Filter recipe from the transactions_joined_prepared dataset

Name the output dataset ‘transactions_joined_prepared_authorized’

155 of 187

Filter out fraudulent transactions

… and run the recipe

156 of 187

Agenda

Discovery

Hands-on Exercise

  • Background
  • DSS Login and Set up Account
  • Creating Projects and Connecting to Data
  • Stacking Datasets
  • Joins
  • Computation Engines
  • Data Cleaning [1]
  • Charts [1]
  • Splitting Data
  • Machine Learning
  • Filtering Data
  • Data Cleaning [2]
  • Group by Aggregations
  • Charts [2]

Green Track

Purple Track

157 of 187

Prepare

158 of 187

Create a Prepare recipe from the transactions_joined_prepared_authorized dataset

Name the output dataset ‘transaction_reward_value’

159 of 187

Create new columns to calculate the theoretical benefits of each rewards program

reward_cash_back:
  • 1.5% cash back on all purchases

reward_dining_entertainment:*
  • 6% cash back on dining and entertainment
  • 4% at grocery stores
  • 1% on all other purchases

reward_travel:
  • 8% cash back on travel and dining
  • 1% on all other purchases

*Requires a FICO score above 650

Use a Formula processor for each one

160 of 187

Keep only columns relevant to the exercise

161 of 187

You may see a screen that looks like this

Don’t worry! These columns will be deleted per our previous step - that’s why they are empty

162 of 187

reward_cash_back calculation

You can add comments to each step

Formula:

purchase_amount*0.015

163 of 187

reward_dining_entertainment calculation

Formula:

if(card_fico_score<650, 0, if(or(merchant_subsector_description == "restaurant/dining", merchant_subsector_description == "misc entertainment"), purchase_amount*0.06, if(merchant_subsector_description == "groceries", purchase_amount*0.04, purchase_amount*0.01)))

Just part of the formula shown here

164 of 187

reward_travel calculation

Formula:

if(merchant_subsector_description == "flights", purchase_amount*0.08, purchase_amount*0.01)

Just part of the formula shown here

165 of 187

Double check that you’ve added all steps

...and then click “RUN”

166 of 187

Agenda

Discovery

Hands-on Exercise

  • Background
  • DSS Login and Set up Account
  • Creating Projects and Connecting to Data
  • Stacking Datasets
  • Joins
  • Computation Engines
  • Data Cleaning [1]
  • Charts [1]
  • Splitting Data
  • Machine Learning
  • Filtering Data
  • Data Cleaning [2]
  • Group by Aggregations
  • Charts [2]

Green Track

Purple Track

167 of 187

Group by

168 of 187

Group by

Use this recipe to calculate summary information from groups of rows in a dataset

169 of 187

Group the transaction_reward_value dataset

Group by:

  • card_id

Then compute these new columns:

  • Sum purchase_amount
  • Sum reward_cash_back
  • Sum reward_dining_entertainment
  • Sum reward_travel
  • First card_fico_score
  • First card_reward_program

Our output dataset will have one row per cardholder, and will calculate the theoretical benefits from each reward program across their purchase history
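A pandas approximation of these aggregations on toy values (the Group recipe produces the matching *_sum and *_first columns):

    import pandas as pd

    transaction_reward_value = pd.DataFrame({
        "card_id": ["C1", "C1", "C2"],
        "purchase_amount": [10.0, 20.0, 5.0],
        "reward_cash_back": [0.15, 0.30, 0.075],
        "reward_dining_entertainment": [0.60, 0.20, 0.05],
        "reward_travel": [0.10, 0.20, 0.05],
        "card_fico_score": [700, 700, 640],
        "card_reward_program": ["travel", "travel", "cash_back"],
    })

    # One row per cardholder, matching the aggregations chosen in the recipe
    by_card = transaction_reward_value.groupby("card_id").agg(
        purchase_amount_sum=("purchase_amount", "sum"),
        reward_cash_back_sum=("reward_cash_back", "sum"),
        reward_dining_entertainment_sum=("reward_dining_entertainment", "sum"),
        reward_travel_sum=("reward_travel", "sum"),
        card_fico_score_first=("card_fico_score", "first"),
        card_reward_program_first=("card_reward_program", "first"),
    ).reset_index()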

170 of 187

Group the transaction_reward_value dataset

171 of 187

Choose the aggregations

...and run the recipe

172 of 187

Prepare

173 of 187

Calculate the optimal rewards program

  • Create a new column showing each cardholder’s optimal rewards program

  • Compute the value added by switching programs

Start with a Prepare recipe on the

transaction_reward_value_by_card_id dataset

174 of 187

Calculate the optimal rewards program

Formula:

if(and(reward_cash_back_sum >= reward_travel_sum, reward_cash_back_sum >= reward_dining_entertainment_sum), "cash_back", if(and(reward_dining_entertainment_sum >= reward_cash_back_sum, reward_dining_entertainment_sum >= reward_travel_sum), "dining_entertainment", "travel"))

Just part of the formula shown here

175 of 187

Calculate the optimal rewards program amount

Formula:

cells["reward_" + best_reward_program + "_sum"].value

Just part of the formula shown here

See the Reference doc on the cells function

176 of 187

Calculate the current rewards program amount

Formula:

cells["reward_" + card_reward_program_first + "_sum"].value

Just part of the formula shown here

177 of 187

Calculate the value add from switching to the optimal program

Formula:

best_reward_amount - current_reward_amount

178 of 187

Determine the recommended program switch

Formula:

if(value_add>0, concat(card_reward_program_first, " --> ", best_reward_program), "no switch")

Just part of the formula shown here

See the Reference doc on the concat function
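Putting the last few steps together, a hedged pandas equivalent of these formulas (toy numbers; column names follow the grouped dataset):

    import pandas as pd

    by_card = pd.DataFrame({
        "card_reward_program_first": ["travel", "cash_back"],
        "reward_cash_back_sum": [0.45, 1.20],
        "reward_dining_entertainment_sum": [0.80, 0.40],
        "reward_travel_sum": [0.30, 0.90],
    })

    reward_cols = {
        "cash_back": "reward_cash_back_sum",
        "dining_entertainment": "reward_dining_entertainment_sum",
        "travel": "reward_travel_sum",
    }
    col_to_program = {col: program for program, col in reward_cols.items()}

    # Best program = the one with the highest theoretical reward for that cardholder
    by_card["best_reward_program"] = by_card[list(reward_cols.values())].idxmax(axis=1).map(col_to_program)
    by_card["best_reward_amount"] = by_card[list(reward_cols.values())].max(axis=1)

    # Current amount = the column matching the card's existing program (like the cells[...] formula)
    by_card["current_reward_amount"] = by_card.apply(
        lambda r: r[reward_cols[r["card_reward_program_first"]]], axis=1
    )

    by_card["value_add"] = by_card["best_reward_amount"] - by_card["current_reward_amount"]
    by_card["switch"] = by_card.apply(
        lambda r: f'{r["card_reward_program_first"]} --> {r["best_reward_program"]}' if r["value_add"] > 0 else "no switch",
        axis=1,
    )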

179 of 187

Double check that you’ve added all steps

...and then click “RUN”

180 of 187

Agenda

Discovery

Hands-on Exercise

  • Background
  • DSS Login and Set up Account
  • Creating Projects and Connecting to Data
  • Stacking Datasets
  • Joins
  • Computation Engines
  • Data Cleaning [1]
  • Charts [1]
  • Splitting Data
  • Machine Learning
  • Filtering Data
  • Data Cleaning [2]
  • Group by Aggregations
  • Charts [2]

Green Track

Purple Track

181 of 187

More Charts

182 of 187

Bar charts

Task:

  • Plot the following by card switch type
    1. average value_add
    2. count of records

  • Which rewards program switch would benefit the most cardholders?

  • Which program switch would have the highest average value add?

Details:

  • Bar chart
  • X-axis - switch
  • Y-axis - value_add, count of records (right axis)

183 of 187

Bar charts

184 of 187

Bar charts - answers

Task:

  • Plot the following by card switch type
    • average value_add
    • count of records

  • Which rewards program switch would benefit the most cardholders?

travel -> cash_back

  • Which program switch would have the highest average value add?

entertainment -> travel

Details:

  • Bar chart
  • X-axis - switch
  • Y-axis - value_add, count of records (right axis)

185 of 187

Flow check

186 of 187

Thank you

187 of 187

Additional Resources

learn.dataiku.com

  • Tutorials
  • Videos

gallery.dataiku.com

  • Sample projects

doc.dataiku.com

  • Extensive technical documentation about DSS

community.dataiku.com

  • Public Q&A (like Stack Overflow)