1 of 25

GROUP 2

MACHINE LEARNING BOOTCAMP

2 of 25

Contents

1. Problem Identification & Data Checks
2. Variable Understanding & Data Wrangling
3. Feature Understanding & Variable Selection
4. Model Development & Evaluation – 1st
5. Model Development & Evaluation – 2nd

3 of 25

Number of Rows with “null” values

Variable                   Null rows
accountNumber                      0
customerId                         0
creditLimit                        0
availableMoney                     0
transactionDateTime                0
transactionAmount                  0
merchantName                       0
acqCountry                      4562
merchantCountryCode              724
posEntryMode                    4054
posConditionCode                 409
merchantCategoryCode               0
currentExpDate                     0
accountOpenDate                    0
dateOfLastAddressChange            0
cardCVV                            0

  • Dropped variables: ‘echoBuffer’, ‘merchantCity’, ‘merchantState’, ‘merchantZip’, ‘posOnPremises’ and ‘recurringAuthInd’

  • Dropped rows with ‘null’

4 of 25

1: Problem Identification & Data Checks

isFraud    Share of rows
False      0.98421
True       0.01579

  • 98.4% no fraud
  • 1.6% fraud
  • To handle this class imbalance in subsequent analysis without oversampling, take all fraud cases first and then take an equal number of non-fraud cases (undersampling the majority class)
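
The balancing step can be sketched as follows. This is a minimal, dependency-free illustration, not the group's actual code: the function name `undersample`, the random selection of non-fraud rows, and the fixed seed are all assumptions, and the toy rows are made up.

```python
import random

def undersample(rows, label_key="isFraud", seed=0):
    """Keep every fraud row plus an equally sized random sample of non-fraud rows."""
    fraud = [r for r in rows if r[label_key]]
    non_fraud = [r for r in rows if not r[label_key]]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = min(len(fraud), len(non_fraud))
    balanced = fraud + rng.sample(non_fraud, k)
    rng.shuffle(balanced)  # avoid all fraud rows sitting at the front
    return balanced

# Made-up toy data: 3 fraud rows among 100 transactions.
rows = [{"id": i, "isFraud": i < 3} for i in range(100)]
balanced = undersample(rows)
print(len(balanced), sum(r["isFraud"] for r in balanced))
```

Sampling without replacement keeps every retained non-fraud row unique, at the cost of discarding most of the majority class.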

5 of 25

2: Variable Understanding & Data Wrangling

Findings

Transaction Amount: The majority of transaction amounts fall between $0 and $250; the distribution looks suitable for direct use as a modelling variable.

Credit Limit: The values are not normally distributed, which makes them hard to use directly as a feature.

Available Money: Available money is low because the credit limits are low to begin with; it will be interesting to look at utilization to understand card usage.

Current Balance: As with the previous findings, the current balance appears low.

Distributions of all numerical variables

6 of 25

Transaction amounts are higher in fraud cases than in non-fraud cases.

Hypothesis 1: Transactions with a high transactionAmount are more likely to be fraud

7 of 25

Correlation among all numerical variables

Finding: Current Balance + Available Money = Credit Limit

8 of 25

    cardPresent  Row Count  Row Count %  Target Count  Target Count %  Target Rate %
0   true            352868        44.87          3455           27.82           0.98
1   false           433495        55.13          8962           72.18           2.07

cardPresent:

When cardPresent is false, the target rate is 2.07%; when cardPresent is true, it is 0.98%. Transactions made without the card present are therefore more likely to be fraud.

Hypothesis 2: Transactions without the card present are more likely to be fraud
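
The Row Count, Target Count and Target Rate % columns can be derived per category value as in the sketch below. The helper name `target_rate_table` and the toy rows are assumptions made for illustration; only the two closing arithmetic checks use figures from the slide.

```python
from collections import defaultdict

def target_rate_table(rows, feature, target="isFraud"):
    """Return {value: (row_count, target_count, target_rate_pct)} for one feature."""
    counts = defaultdict(lambda: [0, 0])  # value -> [rows, fraud rows]
    for r in rows:
        counts[r[feature]][0] += 1
        counts[r[feature]][1] += int(bool(r[target]))
    return {v: (n, t, round(100.0 * t / n, 2)) for v, (n, t) in counts.items()}

# Made-up toy rows, only to show the shape of the result.
rows = (
    [{"cardPresent": True, "isFraud": False}] * 99
    + [{"cardPresent": True, "isFraud": True}] * 1
    + [{"cardPresent": False, "isFraud": False}] * 48
    + [{"cardPresent": False, "isFraud": True}] * 2
)
table = target_rate_table(rows, "cardPresent")
for value, (n, t, rate) in table.items():
    print(value, n, t, rate)

# Target Rate % from the slide's own counts: target count / row count.
rate_present = round(100 * 3455 / 352868, 2)
rate_absent = round(100 * 8962 / 433495, 2)
```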

9 of 25

    transactionType       Row Count  Row Count %  Target Count  Target Count %  Target Rate %
0   ADDRESS_VERIFICATION      20169         2.57           116            0.94           0.58
1   PURCHASE                 745193        94.85         11950           96.35           1.60
2   REVERSAL                  20303         2.58           337            2.72           1.66

transactionType:

Hypothesis 3: Purchase or reversal transactions are more likely to be fraud

The target rate is 1.60% for PURCHASE and 1.66% for REVERSAL, compared with 0.58% for ADDRESS_VERIFICATION, so purchase and reversal transactions are more likely to be fraud.

10 of 25

Hypothesis 4: Some merchant categories are more likely to involve fraud

    merchantCategoryCode  Row Count  Row Count %  Target Count  Target Count %  Target Rate %
0   gym                        2209         0.28             0            0.00           0.00
1   cable/phone                1382         0.18             0            0.00           0.00
2   online_subscriptions      11067         1.41             0            0.00           0.00
3   mobileapps                14990         1.91             0            0.00           0.00
4   food_delivery              6000         0.76             0            0.00           0.00
5   fuel                      23910         3.04             0            0.00           0.00
6   personal care             18964         2.41            86            0.69           0.45
7   health                    19092         2.43            90            0.72           0.47
8   hotels                    34097         4.34           250            2.01           0.73
9   subscriptions             22901         2.91           216            1.74           0.94
10  fastfood                 112138        14.26          1074            8.65           0.96
11  entertainment             80098        10.19           961            7.74           1.20
12  auto                      21651         2.75           273            2.20           1.26
13  food                      75490         9.60          1014            8.17           1.34
14  furniture                  7432         0.95           103            0.83           1.39
15  online_gifts              66238         8.42          1606           12.93           2.42
16  online_retail            202156        25.71          4938           39.77           2.44
17  rideshare                 51136         6.50          1272           10.24           2.49
18  airline                   15412         1.96           534            4.30           3.46

11 of 25

Hypothesis 5: The larger the difference between the transactionAmount and the average transactionAmount over the past 7 or 30 days, the more likely the transaction is to be fraud.
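
One way to compute such a difference is sketched below. The slides do not spell out the exact definition, so several choices here are assumptions: the difference is taken as amount minus window mean, the window covers earlier transactions of the same account only, and an empty window yields 0.0. The function name `amount_diff` and the history values are made up.

```python
from datetime import datetime, timedelta

def amount_diff(history, when, amount, days):
    """Amount minus the mean of the account's earlier amounts in the last `days` days.

    `history` is a list of (datetime, amount) pairs for one account.
    Returns 0.0 when the window holds no earlier transaction (an assumption).
    """
    start = when - timedelta(days=days)
    window = [a for (t, a) in history if start <= t < when]
    if not window:
        return 0.0
    return amount - sum(window) / len(window)

# Made-up history for a single account.
history = [
    (datetime(2016, 1, 1), 20.0),
    (datetime(2016, 1, 22), 40.0),
    (datetime(2016, 1, 25), 60.0),
]
now = datetime(2016, 1, 27)
diff7 = amount_diff(history, now, 200.0, days=7)    # window holds 40.0 and 60.0
diff30 = amount_diff(history, now, 200.0, days=30)  # window holds all three amounts
print(diff7, diff30)
```

Excluding the current transaction from its own window keeps the feature from leaking the value it is being compared against.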

12 of 25

3: Feature Engineering & Variable Selection

All selected variables

Numerical Variables:
  • transactionAmount
  • utilization = currentBalance / creditLimit
  • transactionAmount7Diff
  • transactionAmount30Diff

Categorical Variables:
  • cardPresent
  • transactionType
  • merchantCategoryCode
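
The utilization variable listed above can be computed as in this small sketch. Returning None for a non-positive credit limit is an assumption added to avoid division by zero, and the account values are made up.

```python
def utilization(current_balance, credit_limit):
    """Share of the credit limit currently in use; None when the limit is not positive."""
    if credit_limit <= 0:
        return None
    return current_balance / credit_limit

# Made-up account. Using the finding Current Balance + Available Money = Credit Limit,
# the balance can also be recovered as creditLimit - availableMoney.
credit_limit = 5000.0
available_money = 3750.0
current_balance = credit_limit - available_money
print(utilization(current_balance, credit_limit))
```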

13 of 25

Dummy Variables

    cardPresent  Row Count  Row Count %  Target Count  Target Count %  Target Rate %
0   true            352868        44.87          3455           27.82           0.98
1   false           433495        55.13          8962           72.18           2.07

cardPresent:

true is encoded as 1; false is encoded as 0.

14 of 25

    transactionType       Row Count  Row Count %  Target Count  Target Count %  Target Rate %
0   ADDRESS_VERIFICATION      20169         2.57           116            0.94           0.58
1   PURCHASE                 745193        94.85         11950           96.35           1.60
2   REVERSAL                  20303         2.58           337            2.72           1.66

transactionType:

PURCHASE and REVERSAL are encoded as 1; ADDRESS_VERIFICATION is encoded as 0.
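
The two binary encodings (cardPresent and the purchase/reversal indicator) can be sketched as plain functions. The function names are hypothetical, the output column name `purchaseReversal` follows the coefficient slide, and the toy rows are made up.

```python
def encode_card_present(card_present):
    """true -> 1, false -> 0."""
    return 1 if card_present else 0

def encode_purchase_reversal(transaction_type):
    """PURCHASE or REVERSAL -> 1, anything else (e.g. ADDRESS_VERIFICATION) -> 0."""
    return 1 if transaction_type in ("PURCHASE", "REVERSAL") else 0

# Made-up toy rows.
rows = [
    {"cardPresent": True, "transactionType": "PURCHASE"},
    {"cardPresent": False, "transactionType": "ADDRESS_VERIFICATION"},
    {"cardPresent": False, "transactionType": "REVERSAL"},
]
encoded = [
    {
        "cardPresent": encode_card_present(r["cardPresent"]),
        "purchaseReversal": encode_purchase_reversal(r["transactionType"]),
    }
    for r in rows
]
print(encoded)
```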

15 of 25

Divide the merchantCategoryCode values into three categories by Target Rate %:
  • C1: 0-1
  • C2: 1-2
  • C3: 2-3

Add three dummy variable columns:
  • code_under1
  • code_under2
  • code_rest
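
A possible mapping from a category's Target Rate % to the three dummy columns is sketched below. The boundary handling is an assumption: rates below 1 go to code_under1, rates from 1 up to but not including 2 go to code_under2, and everything else (including airline at 3.46) goes to code_rest. The function name `category_dummies` is hypothetical; the rates are taken from the merchantCategoryCode table.

```python
def category_dummies(target_rate_pct):
    """Return the three dummy columns for one merchant category's target rate (%)."""
    under1 = 1 if target_rate_pct < 1 else 0
    under2 = 1 if 1 <= target_rate_pct < 2 else 0
    rest = 1 if target_rate_pct >= 2 else 0  # assumed catch-all for 2% and above
    return {"code_under1": under1, "code_under2": under2, "code_rest": rest}

# Target rates (%) for a few categories from the merchantCategoryCode table.
rates = {"fastfood": 0.96, "food": 1.34, "online_retail": 2.44, "airline": 3.46}
dummies = {name: category_dummies(rate) for name, rate in rates.items()}
for name, row in dummies.items():
    print(name, row)
```

Exactly one of the three columns is 1 for any category, so one column is redundant in a model that includes an intercept.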

16 of 25

Final Selected Variables

cardPresent  purchaseReversal  code_under1  code_under2  code_rest  transactionAmount  acqCode  utilization  transactionAmount7Diff  transactionAmount30Diff
          0                 1            1            0          0             165.26        1     0.827724               42.807143                39.452963
          0                 1            0            0          1             298.85        1     0.059069                2.627500                 2.627500
          0                 1            1            0          0             406.89        1     0.086502              212.500000               264.901250
          0                 1            0            0          1             439.31        1     0.403761              150.872857               288.637059
          0                 1            0            0          1             266.09        1     0.651798              140.730000               121.099273
          1                 1            0            1          0               11.6        1     0.455689             -127.604262              -127.349921
          1                 1            0            1          0              53.06        1     0.719842              -72.768265               -89.679709
          0                 1            0            0          1             195.06        1     0.151087               36.476250                45.918267
          0                 1            0            0          1              40.44        1     0.906970               99.140611              -102.675385
          0                 0            0            0          1               0.00        1     0.828903              152.301933              -149.420587

17 of 25

Logistic Regression Model

              precision  recall  f1-score  support
False              0.66    0.61      0.63     2441
True               0.65    0.70      0.67     2526
accuracy           0.64    0.68      0.65     4967
macro avg          0.65    0.65      0.65     4967
weighted avg       0.65    0.65      0.65     4967

Confusion Matrix:

Accuracy Report:
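
For reference, the precision, recall, f1-score and accuracy figures in a report like this follow from the four confusion-matrix counts. The sketch below uses made-up counts, not the model's actual confusion matrix, and the function name `report_metrics` is hypothetical.

```python
def report_metrics(tp, fp, fn, tn):
    """Precision, recall and F1 for the positive class, plus overall accuracy."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Made-up counts: 60 true positives, 20 false positives,
# 40 false negatives, 80 true negatives.
precision, recall, f1, accuracy = report_metrics(tp=60, fp=20, fn=40, tn=80)
print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
```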

18 of 25

Coefficients

Variable                   Coefficient
transactionAmount          1.4698587424173901
code_rest                  0.9044358641187724
transactionAmount7Diff     0.06449556709958816
cardPresent                0.02071320364010008
acqCode                    0.004379849093978679
code_under2                0.0004828690815489611
transactionAmount30Diff    -0.00022864403539822457
purchaseReversal           -0.1967447474923735
code_under1                -0.10911740592971167
utilization                -0.8937164815560348
expirationDateKeyInMatch   -0.8116121682548728
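
A logistic regression coefficient is a change in log-odds, so exponentiating it gives an odds ratio per one-unit increase in that variable. The sketch below applies this to four coefficients from the table. The slides do not say whether the variables were scaled before fitting, so the size of "one unit" is an assumption and the ratios are best read for direction rather than magnitude.

```python
import math

# Coefficients copied from the table above.
coefficients = {
    "transactionAmount": 1.4698587424173901,
    "code_rest": 0.9044358641187724,
    "utilization": -0.8937164815560348,
    "expirationDateKeyInMatch": -0.8116121682548728,
}

# exp(coefficient) = multiplicative change in the odds of fraud per unit increase.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}
for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio {ratio:.3f} ({direction} the odds of fraud)")
```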

19 of 25

ROC Curve

20 of 25

Decision Tree

              precision  recall  f1-score  support
False              0.66    0.58      0.61     2441
True               0.63    0.71      0.67     2526
accuracy                             0.64     4967
macro avg          0.64    0.64      0.64     4967
weighted avg       0.64    0.64      0.64     4967

21 of 25

Importance of Variables

22 of 25

Random Forest

              precision  recall  f1-score  support
False              0.66    0.63      0.64     2441
True               0.66    0.70      0.68     2526
accuracy                             0.66     4967
macro avg          0.66    0.66      0.66     4967
weighted avg       0.66    0.66      0.66     4967

23 of 25

Potential Improvement

Variable                   Null rows
accountNumber                      0
customerId                         0
creditLimit                        0
availableMoney                     0
transactionDateTime                0
transactionAmount                  0
merchantName                       0
acqCountry                      4562
merchantCountryCode              724
posEntryMode                    4054
posConditionCode                 409
merchantCategoryCode               0
currentExpDate                     0
accountOpenDate                    0
dateOfLastAddressChange            0
cardCVV                            0

acqCountry and merchantCountryCode have different numbers of null values (4562 and 724). Cases where only one of the two is missing, or where both are missing, may carry information that is lost when these rows are dropped.
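
Instead of dropping these rows, the missingness pattern could be kept as indicator variables. The sketch below is one hypothetical way to do that: the flag names are invented for illustration, nulls are assumed to appear as None or an empty string, and the toy rows are made up.

```python
def country_missing_flags(row):
    """Indicator variables for missing acqCountry / merchantCountryCode values."""
    acq_missing = row.get("acqCountry") in (None, "")
    merchant_missing = row.get("merchantCountryCode") in (None, "")
    return {
        "acqCountryMissing": int(acq_missing),
        "merchantCountryCodeMissing": int(merchant_missing),
        "onlyOneCountryMissing": int(acq_missing != merchant_missing),
        "bothCountriesMissing": int(acq_missing and merchant_missing),
    }

# Made-up toy rows covering the three patterns.
complete = country_missing_flags({"acqCountry": "US", "merchantCountryCode": "US"})
one_missing = country_missing_flags({"acqCountry": "", "merchantCountryCode": "US"})
both_missing = country_missing_flags({"acqCountry": None, "merchantCountryCode": None})
print(complete, one_missing, both_missing, sep="\n")
```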

24 of 25

Missing all data: echoBuffer, merchantCity, merchantState, merchantZip, posOnPremises, recurringAuthInd

Variable                   Null rows
enteredCVV                         0
cardLast4Digits                    0
transactionType                  698
echoBuffer                    786363
currentBalance                     0
merchantCity                  786363
merchantState                 786363
merchantZip                   786363
cardPresent                        0
posOnPremises                 786363
recurringAuthInd              786363
expirationDateKeyInMatch           0
isFraud                            0

25 of 25

A variable indicating whether a transaction is a multi-swipe could also be introduced.
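
The slides do not define a multi-swipe, so the sketch below rests on an assumed definition: a transaction counts as a multi-swipe when the same account, merchant and amount appeared at most `window_seconds` earlier. The function name, the 120-second window and the toy transactions are all hypothetical.

```python
from datetime import datetime, timedelta

def flag_multi_swipe(transactions, window_seconds=120):
    """Return a 0/1 flag per transaction, in input order (assumed definition above)."""
    last_seen = {}  # (account, merchant, amount) -> time of the latest matching transaction
    flags = [0] * len(transactions)
    order = sorted(range(len(transactions)),
                   key=lambda i: transactions[i]["transactionDateTime"])
    for i in order:
        t = transactions[i]
        key = (t["accountNumber"], t["merchantName"], t["transactionAmount"])
        previous = last_seen.get(key)
        if previous is not None and \
                t["transactionDateTime"] - previous <= timedelta(seconds=window_seconds):
            flags[i] = 1
        last_seen[key] = t["transactionDateTime"]
    return flags

def tx(account, merchant, amount, hh, mm, ss):
    """Build one made-up transaction on a fixed date."""
    return {"accountNumber": account, "merchantName": merchant,
            "transactionAmount": amount,
            "transactionDateTime": datetime(2016, 1, 1, hh, mm, ss)}

transactions = [
    tx(1, "A", 10.0, 12, 0, 0),   # first swipe
    tx(1, "A", 10.0, 12, 1, 0),   # repeat 60 seconds later
    tx(1, "A", 10.0, 12, 10, 0),  # same purchase much later
    tx(2, "A", 10.0, 12, 1, 30),  # different account
]
flags = flag_multi_swipe(transactions)
print(flags)
```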
