©2018 dataiku, Inc. | dataiku.com | contact@dataiku.com | @dataiku
Dataiku Training
DISCOVERY
Agenda
Discovery
A Brief History of Dataiku
Key Company Milestones
Timeline: 2013–2020
SOFTWARE
Beta, then version 1: 1st tool integrating data prep and ML
version 2: Real-time collaboration, Spark integration
version 3: Scenarios, automation, and version control
version 4: Advanced scalability, cloud support
version 5: Deep learning, containerization, project wikis
version 6: Enterprise-scale Elastic AI with Kubernetes
version 7: Explainable AI and advanced visual statistics
COMPANY
Dataiku created (2013)
20th, 100th, 200th, and 400th employees
Offices in New York and London, then 7 new offices in EMEA, APAC, and the Americas
$4M raised, $14M Series A, $28M Series B, $101M Series C
$1.4B valuation
1,000+ EGG visitors
CUSTOMERS
First customer, then first customers with 10+, 100+, 500+, and 1,000+ users
50th, 100th, and 300th customers
1st CAB (Customer Advisory Board)
End-to-end Software
Dataiku DSS
End-to-End Platform Solution
Skill agnostic platform that supports all data project stakeholders
DATA CATALOGING AND CONNECTIVITY
DATA PREPARATION
(AUTO) MACHINE LEARNING
PRODUCTION DEPLOYMENT
VISUAL
CODE
GOVERNANCE, VERSIONING, AUDIT AND REUSE
Data Access and Processing
Dataiku DSS
CONNECT TO YOUR (MANY) DATA SOURCES
LEVERAGE VARIOUS COMPUTE RESOURCES
DEVELOP REUSABLE/ MAINTAINABLE DATA FLOWS
UNDERSTAND YOUR DATA
CODE… OR CLICK
Machine Learning
Dataiku DSS
AUTO-ML OR CODE BASED
FULL TRAINING MANAGEMENT
Leverage your full stack and skills
Solution Overview: Architecture
Browser based interface
VISUAL DEVELOPMENT
COMPLETE CODING ENVIRONMENTS
VISUALIZATION
COLLABORATION AND PROJECT MANAGEMENT
AUDIT, MONITORING AND SCHEDULING
User/task specific
interaction modes
LINUX SERVER
ON PREMISE OR MANAGED CLOUD
CENTRALIZED OR AD-HOC DATA SOURCES, DATABASES, DATA LAKE
AVAILABLE OR SPUN-UP PROCESSING RESOURCES
Leveraging best storage and compute resources
Dataiku deployment servers for enterprise grade operationalization
PRODUCTION SYSTEMS
Centralized server to facilitate access to data and resources, and to foster collaboration
The “tower of Babel” effect of Data Projects
The Classic Data Project Silos
IT Monitoring
Refactored data project
Auto ML
Visual Data Preparation
R, Python, Notebooks, and (many) Data Science Libraries
Visualization and validation
Visualization and validation
Model Monitoring
Business Analyst
Data Scientist
Analytics Leader
Data Engineer
Bring Business Analysts, Engineers, and Scientists Together
Share a common environment and understanding
IT Monitoring
Refactored data project
Auto ML
Visual Data Preparation
R, Python, Notebooks, and (many) Data Science Libraries
Visualization and validation
Visualization and validation
Model Monitoring
Analytics Leader
Business Analyst
Data Scientist
Data Engineer
Agenda
Discovery
Use Dataiku to predict likelihood of default
Default Prediction
Can we use a bank's historical customer data to build a model that predicts whether a loan will be paid back?
Agenda
Discovery
Credit Card Transactions
Agenda
Discovery
Hands-on Exercise
Green Track
Purple Track
Agenda
Discovery
Hands-on Exercise
Green Track
Purple Track
Elo
Background
Elo
Challenge #1
Challenge: Predict fraudulent credit card transactions
Steps:
Green Track
Elo
Challenge #2
Challenge: Optimize different card rewards programs
Steps:
Purple Track
Agenda
Discovery
Hands-on Exercise
Green Track
Purple Track
Login to DSS
https://dssX-design-COMPANYNAME.training.dataiku.com (replace the X with your instance number)
Username: userY
Password: INSTANCE_PASSWORD
Set up your account
Set up your account
Change your display name (to your name)
Agenda
Discovery
Hands-on Exercise
Green Track
Purple Track
Create a new project
Name your project “Credit Card Fraud” + your name
Data Sources Overview
Data Sources Overview: 4 Datasets
transactions_2017 and transactions_2018
cardholder_info
merchant_info
Import data from PostgreSQL
Select “Choose connection to import from”
Import data from PostgreSQL
transactions_2017 and transactions_2018: consumer credit card transactions
cardholder_info: credit score for each cardholder
merchant_info: lookup table for merchants
Import data from PostgreSQL
The default table names are good - create all 4 datasets
You should now have 4 datasets in your flow
Double click on a dataset to view it
Explore the transaction data
Each row is a transaction. The ‘authorized_flag’ shows whether it was approved or not.
Column attributes
We can see information about each column and its contents
Column name
Schema - the storage type
ex. string, int, double, array
Meaning - the “human readable” meaning
ex. text, integer, IP address, gender
Agenda
Discovery
Hands-on Exercise
Green Track
Purple Track
Stack
Stack the transactions datasets together
Rename the output dataset “transactions_stacked”
Stack
Your flow should look like this
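For reference, the Stack recipe is simply a row-wise union of datasets that share a schema. A minimal pandas sketch of the same step (reading from CSV is purely illustrative; in the exercise the data comes from PostgreSQL):

import pandas as pd

# Load the two yearly extracts (CSV paths are illustrative only)
tx_2017 = pd.read_csv("transactions_2017.csv")
tx_2018 = pd.read_csv("transactions_2018.csv")

# The Stack recipe is a row-wise union of datasets sharing a schema
transactions_stacked = pd.concat([tx_2017, tx_2018], ignore_index=True)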
Agenda
Discovery
Hands-on Exercise
Green Track
Purple Track
Join
Join the stacked transactions dataset with the other two datasets
Do this all in ONE Join recipe
Join the stacked transactions dataset with the other two datasets
Join the stacked transactions dataset with the other two datasets
LEFT JOIN transactions_stacked with cardholder_info
Sometimes DSS guesses the wrong join columns
Click the equals sign to change the join
Change the columns to join on
‘card_id’ = ‘internal_card_mapping’
LEFT JOIN transactions_stacked with cardholder_info
Add a new dataset input
Select the merchant_info dataset from the dropdown
LEFT JOIN transactions_stacked with merchant_info
This looks good!
Now let’s look at the Selected Columns tab
LEFT JOIN transactions_stacked with merchant_info
Add prefixes to columns from joined datasets
Any ideas why merchant latitude/longitude cols weren’t included initially?
Run your join
Don’t be afraid of schema updates! (Click the blue button)
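Outside DSS, the same two left joins could be expressed in pandas as below; the "card_"/"merchant_" column prefixes and the merchant_id join key are assumptions based on the column names that appear in later slides, since DSS infers them in the screenshots:

# Left join with cardholder info on card_id = internal_card_mapping
transactions_joined = transactions_stacked.merge(
    cardholder_info.add_prefix("card_"),
    left_on="card_id", right_on="card_internal_card_mapping", how="left")

# Then left join with merchant info (join key assumed to be merchant_id)
transactions_joined = transactions_joined.merge(
    merchant_info.add_prefix("merchant_"),
    left_on="merchant_id", right_on="merchant_merchant_id", how="left")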
Flow check
Agenda
Discovery
Hands-on Exercise
Green Track
Purple Track
Computation
Dataiku leverages your infrastructure
Run in Database
Enterprise SQL, Analytic SQL
Run in Cluster
Spark, Impala, Hive, …
ML in Memory
Python Scikit-Learn, R, …
Database Data
Vertica,
Greenplum,
Redshift,
PostgreSQL,
…
Data Lake
Cassandra,
HDFS, …
File System Data
Host File System,
Remote File System, …
SMART COMPUTATION
Dataiku DSS automatically chooses the most effective execution engine among the engines available next to the input data of each computation.
Distributed ML
MLlib, H2O, …
Run in Memory
Python, R, …
Check out the computation engine for our Join recipe
Click to view or change the engine
Check out the computation engine for our Join recipe
We can see different available engines based on:
Leave this engine as SQL
Where can computation happen?
Agenda
Discovery
Hands-on Exercise
Green Track
Purple Track
Prepare
~90 different processors, including
Create a Prepare recipe from the transactions_joined dataset
Round 1: Create time-based features
Parse card_first_active_month
We want yyyy-MM format
Parse purchase_date
Note: to parse in place, simply delete the output column name
Extract date components from purchase_date_parsed
Clean up column names:
for example… purchase_date_parsed_year → purchase_year
Add:
purchase_dow
purchase_hour
Create binary purchase_weekend column
Let's add this one using the + ADD A NEW STEP button
Use a DSS Formula!
We are defining weekends as Friday/Saturday
Create binary purchase_weekend column
Click “EDIT” to pull up examples and reference
Make sure to “SAVE” your formula back to the processor
You can add comments to document potentially complex formulas!
Compute the time from first active month to purchase date
Let’s add this one using the + ADD A NEW STEP button
Name the output column:
days_active
You can Group your date cleaning steps
Call your group “Date Formatting”
This is much easier to read!
You can also modify colors or add descriptions!
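For comparison, here is a minimal pandas sketch of the Round 1 date features; column names follow the renaming convention above, "yyyy-MM" maps to "%Y-%m", and the Friday/Saturday weekend definition matches the formula slide (note that pandas numbers Monday as 0, so Friday/Saturday are 4 and 5, which may differ from DSS's day-of-week convention):

import pandas as pd

df = transactions_joined.copy()   # the joined dataset as a pandas DataFrame

# Parse the two date columns
df["card_first_active_month_parsed"] = pd.to_datetime(
    df["card_first_active_month"], format="%Y-%m")
df["purchase_date_parsed"] = pd.to_datetime(df["purchase_date"])

# Extract and rename the date components used in the exercise
df["purchase_year"] = df["purchase_date_parsed"].dt.year
df["purchase_dow"] = df["purchase_date_parsed"].dt.dayofweek   # Monday=0 ... Sunday=6
df["purchase_hour"] = df["purchase_date_parsed"].dt.hour

# Binary weekend flag, with weekends defined as Friday/Saturday
df["purchase_weekend"] = df["purchase_dow"].isin([4, 5]).astype(int)

# Days between the card's first active month and the purchase date
df["days_active"] = (df["purchase_date_parsed"] - df["card_first_active_month_parsed"]).dt.days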
Round 2: Create geolocation features
Create merchant geopoint column
Use the processor search
Create cardholder geopoint column
Do the same thing for the cardholder locations!
For step 2, we will be using the reverse geocoding plugin…
Plugins
Many Types
In DSS plugins, the user can develop:
Plugins are then available as a DSS extension
Reverse geocode the merchant location
Use the processor search
We can choose what to extract.
In “region”, enter “merchant_state”
Reverse geocode the cardholder location
Feeling lazy? Duplicate and modify!
Compute the distance between cardholders and merchants
Use the processor search
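Conceptually, the geo-distance processor computes a great-circle distance between the two geopoints. A minimal sketch using the haversine formula; the latitude/longitude input column names are assumptions, while the output name matches the feature used later:

import numpy as np

# Haversine great-circle distance in kilometers (an approximation of what the
# geo-distance processor computes; input column names are assumptions)
def haversine_km(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))

df["merchant_cardholder_distance"] = haversine_km(
    df["card_latitude"], df["card_longitude"],
    df["merchant_latitude"], df["merchant_longitude"])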
Note the “invalid” data
DSS has two “data types” listed:
The red refers to invalid meanings. These will not necessarily cause errors; they are there to help guide you.
Including non-US data is fine for now, so we can leave the “invalid” data in.
Double check that you’ve added all steps
...and then click “RUN”
Analyze the transactions_joined_prepared dataset
Click on a column name and ‘Analyze’ to see more
Stats can be computed on the sample or whole dataset
Questions
When analyzing a column, you can run stats on the entire dataset
Questions
$48,147
7.4%
Agenda
Discovery
Hands-on Exercise
Green Track
Purple Track
Charts
Build a chart on the transactions_joined_prepared dataset
Task:
Details:
Note:
Maps
Scatter plots
Task:
Details:
Note:
Scatter plots
Publish both of your charts to a dashboard
Add more content to your dashboard by clicking the green plus button
Dashboards
Agenda
Discovery
Hands-on Exercise
Green Track
Purple Track
Challenge 1
Predict fraudulent transactions using machine learning
Green Track
Challenge 1 Steps
Split
Split
Split the joined dataset into known and unknown datasets
We can train our model only on the records where we know the value of “authorized_flag”
Create a split recipe and the two output datasets
Split into known and unknown datasets
...and then click “RUN”
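The split is driven by whether the target is known. A minimal sketch, assuming the prepared dataset is loaded as a pandas DataFrame named df:

# Rows where the target is known go to training; the rest will be scored later
transactions_known = df[df["authorized_flag"].notna()]
transactions_unknown = df[df["authorized_flag"].isna()]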
Agenda
Discovery
Hands-on Exercise
Green Track
Purple Track
Machine Learning
Train a model to predict “authorized_flag”
Click on the transactions_known dataset and go to the Lab
Then click “Auto ML” Prediction
Design the modeling task as you’d like
Click the Design tab to change ML settings
Machine learning concepts
Regression - predict a continuous numeric target variable
Classification - predict a discrete “class” of your target variable
What type of problem is ours?
Classification
Classification - imbalanced data
Notice that our target column ‘authorized_flag’ is imbalanced - this is problematic
Two things we can try:
Rebalance train/test sets
Sample 100,000 records, rebalanced on the 'authorized_flag' column
Train/test split
Note: it’s common to split your known data into:
Weight error metrics
You can weight error calculations based on the target column class (or another column)
Look out for blue boxes for tips
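Conceptually, the rebalanced sample and the train/test split look like the sketch below outside DSS. The 50,000-per-class draw and the 20% test size are illustrative approximations of DSS's class-rebalance sampling and default split, not its exact settings:

from sklearn.model_selection import train_test_split

# transactions_known is the DataFrame from the Split step above.
# Approximate the "class rebalance" sampling: draw up to 50,000 rows per class
# so the ~100,000-record sample is balanced on authorized_flag
per_class = 50_000
sample = (transactions_known
          .groupby("authorized_flag", group_keys=False)
          .apply(lambda g: g.sample(min(len(g), per_class), random_state=42)))

# Feature columns are selected later (Features handling tab); here we only drop the target
X = sample.drop(columns=["authorized_flag"])
y = sample["authorized_flag"]

# Hold out 20% of the sample for testing, preserving the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)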
Feature selection
Choose the columns in our dataset to consider when modeling credit card fraud
Consider:
You can turn features “On” in the “Features handling” tab
Notice:
1
2
Turn on the following features
(From the point of sale)
(From cardholder_info table)
(From the merchant_info table)
Algorithm selection and hyperparameter tuning
Try out different ML algorithms/hyperparameters - then see which works best
Consider:
Add a Random Forest with these hyperparameters
Add new hyperparameters as blue bubbles
DSS will try all possible combinations and show you the best one
What is a Random Forest?
Check out the blue info boxes for more on each algorithm
Add a Logistic Regression model & train!
This is a classic algorithm and easier to interpret than others
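For intuition, the same "try a grid of hyperparameters and keep the best" idea in scikit-learn; the grid values are illustrative, not the ones DSS proposes, and the sketch assumes the selected features are numeric or already encoded:

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Random forest with a small, illustrative hyperparameter grid; GridSearchCV tries
# every combination and keeps the best, much like the blue hyperparameter bubbles
rf_search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid={"n_estimators": [100, 200], "max_depth": [6, 10, None]},
    scoring="roc_auc",
    cv=3,
)
rf_search.fit(X_train, y_train)

# A classic, easier-to-interpret baseline: logistic regression
logreg = LogisticRegression(max_iter=1000, class_weight="balanced")
logreg.fit(X_train, y_train)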
Questions
Analyze model performance
Metrics like Accuracy, ROC AUC, F1 score, and mean squared error are important.
Consider:
Analyze model performance
Random forest has the highest AUC - let’s check it out in detail
Variable importance
‘merchant_cardholder_distance’ and ‘card_age’ seem to be strong predictors of fraudulent transactions
Tip: some data scientists remove features below a certain importance threshold (e.g. 4%) in the next iteration
Variable importance
Takeaway:
Keep only important features (above 4% importance)
In my random forest model, these were:
Your model may be different!
Subpopulation analysis
Make sure your model doesn't perform significantly differently across subpopulations
Hyperparameter tuning
Find the model configuration which had the highest score
For me, it was:
Confusion matrix
This matrix shows the counts of true positives, true negatives, false positives, and false negatives from the holdout test set
Tip: consider your problem.
Do you care more about catching all positives?
Do you care about raw accuracy?
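Continuing the sketch above, the same evaluation can be reproduced on the held-out test set; the metrics and matrix here are computed with scikit-learn and will not match the DSS screenshots exactly:

from sklearn.metrics import confusion_matrix, roc_auc_score

# Compare both models on the held-out test set
for name, model in [("random forest", rf_search.best_estimator_),
                    ("logistic regression", logreg)]:
    proba = model.predict_proba(X_test)[:, 1]
    preds = model.predict(X_test)
    print(name, "ROC AUC:", roc_auc_score(y_test, proba))
    # Rows are actual classes, columns are predicted classes
    print(confusion_matrix(y_test, preds))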
Hyperparameter tuning
Takeaway:
My best model:
Next time try:
Now improve your model
Remove unimportant features
Go back to the design tab -> features handling
Turn on ‘signature_provided’
Perhaps this is a new helpful variable
Keep ‘merchant_state’
Change the handling to “Impact coding”
Change the hyperparameter grid for Random Forest
Then train your models again
Machine learning is an iterative process
Consider model training time
Analyze the Random Forest model again
We can see an improvement over the previous session
Variable importance
Our changes to ‘signature_provided’ and ‘merchant_state’ seem to have improved the model
Confusion matrix
This model caught only 841 out of 4,791 fraudulent transactions in the holdout test set
Not great, but better than a random guess!
Go back to the Results tab and analyze the Logistic Regression
Regression coefficients
Logistic regression is a linear model, so we can see both the magnitude and directional impact of different variables on our target
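Continuing the sketch above, the coefficients can be read directly off the fitted model: the sign gives the direction of the effect and the magnitude its strength (meaningful comparison assumes the features were encoded and scaled consistently):

import pandas as pd

coefficients = pd.Series(logreg.coef_[0], index=X_train.columns).sort_values()
print(coefficients)   # negative values push towards class 0, positive towards class 1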
Choose a model and deploy it to your project flow
Consider:
Deploy your model to the flow
Use your model to score the unknown transactions
… Run the recipe as is
Your scored dataset will have new prediction columns
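As a rough analogue of the Score recipe, continuing the sketch above; the proba_1 and prediction column names are illustrative stand-ins for the columns DSS appends:

# Score the unknown transactions with the chosen model
X_unknown = transactions_unknown[X_train.columns]
scored = transactions_unknown.copy()
scored["proba_1"] = rf_search.best_estimator_.predict_proba(X_unknown)[:, 1]
scored["prediction"] = rf_search.best_estimator_.predict(X_unknown)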
Flow check
Agenda
Discovery
Hands-on Exercise
Green Track
Purple Track
Challenge 2
Find optimal card rewards program for each customer
Purple Track
Card Rewards Programs
Cash back
1.5% cash back on all purchases
Dining and entertainment
*FICO score above 650
6% cash back on dining and entertainment
4% at grocery stores
1% on all other purchases
Travel
8% cash back on travel and dining
1% on all other purchases
Challenge 2 Steps
Filter
Filter
Filter out fraudulent transactions
In this problem, we only want to calculate rewards on valid purchases
Only keep rows where ‘authorized_flag’ == 1
Create a Filter recipe from the transactions_joined_prepared dataset
Name the output dataset ‘transactions_joined_prepared_authorized’
Filter out fraudulent transactions
… and run the recipe
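The Filter recipe boils down to a single row selection. A minimal sketch, assuming the dataset is loaded as a pandas DataFrame:

# Keep only authorized (non-fraudulent) purchases, as the Filter recipe does
authorized = transactions_joined_prepared[
    transactions_joined_prepared["authorized_flag"] == 1
]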
Agenda
Discovery
Hands-on Exercise
Green Track
Purple Track
Prepare
Create a Prepare recipe from the transactions_joined_prepared_authorized dataset
Name the output dataset ‘transaction_reward_value’
Create new columns to calculate the theoretical benefits of each rewards program
reward_cash_back:
1.5% cash back on all purchases
reward_dining_entertainment:
*FICO score above 650
6% cash back on dining and entertainment
4% at grocery stores
1% on all other purchases
reward_travel:
8% cash back on travel and dining
1% on all other purchases
Use a Formula processor for each one
Keep only columns relevant to the exercise
You may see a screen that looks like this
Don’t worry! These columns will be deleted per our previous step - that’s why they are empty
reward_cash_back calculation
You can add comments to each step
Formula:
purchase_amount*0.015
reward_dining_entertainment calculation
Formula:
if(card_fico_score<650, 0, if(or(merchant_subsector_description == "restaurant/dining", merchant_subsector_description == "misc entertainment"), purchase_amount*0.06, if(merchant_subsector_description == "groceries", purchase_amount*0.04, purchase_amount*0.01)))
Just part of the formula shown here
reward_travel calculation
Formula:
if(merchant_subsector_description == "flights", purchase_amount*0.08, purchase_amount*0.01)
Just part of the formula shown here
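For reference, the three Formula steps translate to the pandas sketch below; it mirrors only the parts of the formulas shown on the slides, and assumes the filtered dataset from the previous step is available as a DataFrame:

import numpy as np

df = authorized   # transactions_joined_prepared_authorized as a DataFrame

# 1.5% cash back on all purchases
df["reward_cash_back"] = df["purchase_amount"] * 0.015

# Dining & entertainment program: requires a FICO score above 650
is_dining = df["merchant_subsector_description"].isin(
    ["restaurant/dining", "misc entertainment"])
is_grocery = df["merchant_subsector_description"] == "groceries"
df["reward_dining_entertainment"] = np.where(
    df["card_fico_score"] < 650, 0,
    np.where(is_dining, df["purchase_amount"] * 0.06,
             np.where(is_grocery, df["purchase_amount"] * 0.04,
                      df["purchase_amount"] * 0.01)))

# Travel program: 8% on flights, 1% otherwise (as shown on the slide)
df["reward_travel"] = np.where(
    df["merchant_subsector_description"] == "flights",
    df["purchase_amount"] * 0.08,
    df["purchase_amount"] * 0.01)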
Double check that you’ve added all steps
...and then click “RUN”
Agenda
Discovery
Hands-on Exercise
Green Track
Purple Track
Group by
Group by
Use this recipe to calculate summary information from groups of rows in a dataset
Group the transaction_reward_value dataset
Group by:
Then compute these new columns:
The output dataset will have one row per cardholder, showing the theoretical benefit of each reward program across their purchase history
Group the transaction_reward_value dataset
Choose the aggregations
...and run the recipe
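Conceptually, the Group recipe is a group-by with aggregations. A minimal pandas sketch; it assumes the cardholder's current program sits in a card_reward_program column, which the _first aggregation in the later formulas suggests:

# One row per card_id, with the total theoretical reward under each program;
# the _sum / _first suffixes match the aggregation names used in the later formulas
reward_by_card = (
    df.groupby("card_id")
      .agg(reward_cash_back_sum=("reward_cash_back", "sum"),
           reward_dining_entertainment_sum=("reward_dining_entertainment", "sum"),
           reward_travel_sum=("reward_travel", "sum"),
           card_reward_program_first=("card_reward_program", "first"))
      .reset_index()
)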
Prepare
Calculate the optimal rewards program
Start with a Prepare recipe on the
transaction_reward_value_by_card_id dataset
Calculate the optimal rewards program
Formula:
if(and(reward_cash_back_sum >= reward_travel_sum, reward_cash_back_sum >= reward_dining_entertainment_sum), "cash_back", if(and(reward_dining_entertainment_sum >= reward_cash_back_sum, reward_dining_entertainment_sum >= reward_travel_sum), "dining_entertainment", "travel"))
Just part of the formula shown here
Calculate the optimal rewards program amount
Formula:
cells["reward_" + best_reward_program + "_sum"].value
Just part of the formula shown here
See the Reference doc on the cells function
Calculate the current rewards program amount
Formula:
cells["reward_" + card_reward_program_first + "_sum"].value
Just part of the formula shown here
Calculate the value add from switching to the optimal program
Formula:
best_reward_amount - current_reward_amount
Calculate the recommended program switch
Formula:
if(value_add>0, concat(card_reward_program_first, " --> ", best_reward_program), "no switch")
Just part of the formula shown here
See the Reference doc on the concat function
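Continuing the group-by sketch, the same logic as the formulas above can be written in pandas; it assumes the stored program names match the reward_* column suffixes (cash_back, dining_entertainment, travel), which may differ from the actual values in the data:

import numpy as np

program_cols = ["reward_cash_back_sum", "reward_dining_entertainment_sum", "reward_travel_sum"]
sums = reward_by_card[program_cols]

# Best program and its amount (ties go to the first column, as with >= in the formula)
reward_by_card["best_reward_program"] = (
    sums.idxmax(axis=1).str[len("reward_"):-len("_sum")])
reward_by_card["best_reward_amount"] = sums.max(axis=1)

# Equivalent of the cells[...] lookup: read the column named after the current program
reward_by_card["current_reward_amount"] = reward_by_card.apply(
    lambda row: row["reward_" + row["card_reward_program_first"] + "_sum"], axis=1)

reward_by_card["value_add"] = (
    reward_by_card["best_reward_amount"] - reward_by_card["current_reward_amount"])
reward_by_card["switch_recommendation"] = np.where(
    reward_by_card["value_add"] > 0,
    reward_by_card["card_reward_program_first"] + " --> " + reward_by_card["best_reward_program"],
    "no switch")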
Double check that you’ve added all steps
...and then click “RUN”
Agenda
Discovery
Hands-on Exercise
Green Track
Purple Track
More Charts
Bar charts
Task:
Details:
Bar charts
Bar charts - answers
Task:
travel -> cash_back
entertainment -> travel
Details:
Flow check
Thank you
Additional Resources
learn.dataiku.com
gallery.dataiku.com
doc.dataiku.com
community.dataiku.com