1	Problem Identification & Data Checks
2	Variable Understanding & Data Wrangling
3	Feature Understanding & Variable Selection
4	Model Development & Evaluation – 1st
5	Model Development & Evaluation – 2nd

3 of 25

accountNumber	0
customerId	0
creditLimit	0
availableMoney	0
transactionDateTime	0
transactionAmount	0
merchanName	0
acqCountry	4562

merchantCountryCode	724
posEntryMode	4054
posConditionCode	409
merchantCategoryCode	0
currentExpDate	0
accountOpenDate	0
dateOfLastAddressChange	0
cardCVV	0

Dropped variables: ‘echoBuffer’, ‘mechantCity’, ‘merchantState’, ‘merchantZp’, ‘posOnPremises’ and ‘recurringAuthInd’

Dropped rows with ‘null’

Number of Rows with “null” values

4 of 25

1: Problem Identification & Data Checks

False	0.98421
True	0.01579

98.4% no fraud
1.6% fraud
To avoid oversampling issues in subsequent analysis, take all fraud cases first and then take an equal number of non-fraud cases for analysis

5 of 25

II：Variable Understanding & Data Wrangling

Finding

Transaction Amount: Majority of the transaction amount is within 0 to $250, the distribution looks good as a modelling variable directly.

Credit Limit: The value are not normal distribution, hard to directly use as a feature

Available Money: The available money shows low given low credit limit to start with, it will be interesting to look at the utilization to determine the card usage.

Current Balance: Similar as previous insights, current balance seems low

All numerical variable distribution

6 of 25

Transactions in fraud cases are higher in amount compared to transactions in non-fraud cases.

Hypothesis 1: Cases with high value of transactionAmount are more likely to be fraud cases

7 of 25

Correlation among all numerical variables

Finding：Current Balance + Available Money = Credit Limit

8 of 25

	cardPresent	Row Count	Row Count %	Target Count	Target Count %	Target Rate %
0	true	352868	44.87	3455	27.82	0.98
1	false	433495	55.13	8962	72.18	2.07

cardPresent:

cardPresent as false, the target rate is 2.07% while cardPresent is true is 0.98%. Therefore, without card has larger chance to be fraud.

Hypothesis 2: Without card is more likely to be fraud

9 of 25

	transactionType	Row Count	Row Count%	Target Count	Target Count %	Target Rate %
0	ADDRESS_VERIFICATION	20169	2.57	116	0.94	0.58
1	PURCHASE	745193	94.85	11950	96.35	1.60
2	REVERSAL	20303	2.85	337	2.72	1.66

transactionType:

Hypothesis 3: Purchase or reversal is more likely to be fraud

purchase target rate is 1.60, reversal target rate is 1.66, compare to adress_verfication, purchase or reversal has larger chance to be fraud.

10 of 25

Hypothesis 4: some of the Merchantcategorys are more likely to fraud

	merchantCategoryCode	Row Count	Row Count %	Target Count	Target Count %	Target Rate %
0	gym	2209	0.28	0	0.00	0.00
1	cable/phone	1382	0.18	0	0.00	0.00
2	online_subscriptions	11067	1.41	0	0.00	0.00
3	mobileapps	14990	1.91	0	0.00	0.00
4	food_delivery	6000	0.76	0	0.00	0.00
5	fuel	23910	3.04	0	0.00	0.00
6	personal care	18964	2..41	86	0.69	0.45
7	health	19092	2.43	90	0.72	0.47
8	hotels	34097	4.34	250	2.01	0.73
9	subscriptions	22901	2.91	216	1.74	0.94

10	fastfood	112138	14.26	1074	8.65	0.96
11	entertainment	80098	10.19	961	7.74	1.20
12	auto	21651	2.75	273	2.20	1.26
13	food	75490	9.60	1014	8.17	1.34
14	furniture	7432	0.95	103	0.83	1.39
15	online_gifts	66238	8.42	1606	12.93	2.42
16	online_retail	202156	25.71	4938	39.77	2.44
17	rideshare	51136	6.50	1272	10.24	2.49
18	airline	15412	1.96	534	4.30	3.46

11 of 25

Hypothesis 5: The larger the difference between the transactionAmount and the average for the past 7 or 30 days, the more likely it is to be a fraud.

12 of 25

3: Feature Engineering & Variable Selection

All selected variables：

Numerical Variables:

transactionAmount

utilization = currentBalance/creditLimit

transactionAmount7Diff

transactionAmount30Diff

Categorical Variables:

CardPresent

TransactionType

MerchantCategoryCode

13 of 25

Dummy Variables

	cardPresent	Row Count	Row Count %	Target Count	Target Count %	Target Rate %
0	true	352868	44.87	3455	27.82	0.98
1	false	433495	55.13	8962	72.18	2.07

cardPresent:

True is 1;

False is 0.

14 of 25

	transactionType	Row Count	Row Count%	Target Count	Target Count %	Target Rate %
0	ADDRESS_VERIFICATION	20169	2.57	116	0.94	0.58
1	PURCHASE	745193	94.85	11950	96.35	1.60
2	REVERSAL	20303	2.85	337	2.72	1.66

transactionType:

If it is a Purchase or Reversal type, it is 1; if it is an Address, it is 0.

15 of 25

Divide them into three categories(MerchantCategoryCode):

C1: 0-1

C2: 1-2

C3: 2-3

add three dummy variable columns：

code_under1

code_under2

code_rest

16 of 25

Final Selected Variables

cardpresent	purchasereveral	code_under1	code_under2	code_rest	Transaction	acqCode	utilization	Transaction amount 7diff	Transaction amount 30diff
0	1	1	0	0	165.26	1	0.827724	42.807143	39.452963
0	1	0	0	1	298.85	1	0.059069	2.627500	2.627500
0	1	1	0	0	406.89	1	0.086502	212.500000	264.901250
0	1	0	0	1	439.31	1	0.403761	150.872857	288.637059
0	1	0	0	1	266.09	1	0.651798	140.730000	121.099273
…	…	…	…	…	…	…	…	…	…
1	1	0	1	0	11.6	1	0.455689	-127.604262	-127.349921
1	1	0	1	0	53.06	1	0.719842	-72.768265	-89.679709
0	1	0	0	1	195.06	1	0.151087	36.476250	45.918267
0	1	0	0	1	40.44	1	0.906970	99.140611	-102.675385
0	0	0	0	1	0.00	1	0.828903	152.301933	-149.420587

17 of 25

Logistic Regression Model

	precision	recall	f1-score	support
False	0.66	0.61	0.63	2441
True	0.65	0.70	0.67	2526
accuracy	0.64	0.68	0.65	4967
macro avg	0.65	0.65	0.65	4967
weighted avg	0.65	0.65	0.65	4967

Confusion Matrix:

Accuracy Report:

18 of 25

Coefficients

transactionAmount	1.4698587424173901
code_rest	0.9044358641187724
transactionAmount7Diff	0.06449556709958816
cardPresent	0.02071320364010008
acqCode	0.004379849093978679
code_under2	0.0004828690815489611
transactionAmount30Diff	-0.00022864403539822457
purchaseReversal	-0.1967447474923735
code_under1	-0.10911740592971167
utilization	-0.8937164815560348
expirationDateKeyInMatch	-0.8116121682548728

20 of 25

Decision Tree

	precision	recall	fl-score	support
False	0.66	0.58	0.61	2441
True	0.63	0.71	0.67	2526
accuracy			0.64	4967
Macro avg	0.64	0.64	0.64	4967
Weighted avg	0.64	0.64	0.64	4967

21 of 25

Importance of Variables

22 of 25

Random Forest

	precision	recall	fl-score	support
False	0.66	0.63	0.64	2441
True	0.66	0.70	0.68	2526
accuracy			0.66	4967
Macro avg	0.66	0.66	0.66	4967
Weighted avg	0.66	0.66	0.66	4967

23 of 25

Potential Improvement

accountNumber	0
customerId	0
creditLimit	0
availableMoney	0
transactionDateTime	0
transactionAmount	0
merchanName	0
acqCountry	4562
merchantCountryCode	724
posEntryMode	4054
posConditionCode	409
merchantCategoryCode	0
currentExpDate	0
accountOpenDate	0
dateOfLastAddressChange	0
cardCVV	0

The number of instances where acqCountry and merchantCountryCode are null and not the same. If only one is missing or if both are missing, there may be some information missing.

24 of 25

echoBuffer

merchantCity

merchantState

merchantZip

posOnPremises

recurringAuthInd

Missing all data

enteredCVV	0
cardLasr4Digits	0
transactionType	698
echoBuffer	7866363
currentBalance	0
merchantCity	786363
merchantState	786363
merchantZip	786363
cardPresent	0
posOnPremises	786363
recurringAuthInd	786363
expirationDateKeyInMatch	0
isFraud	0

1 of 25

2 of 25