GROUP 2
MACHINE LEARNING BOOTCAMP
Contents
1 | Problem Identification & Data Checks |
2 | Variable Understanding & Data Wrangling |
3 | Feature Understanding & Variable Selection |
4 | Model Development & Evaluation – 1st |
5 | Model Development & Evaluation – 2nd |
accountNumber | 0 |
customerId | 0 |
creditLimit | 0 |
availableMoney | 0 |
transactionDateTime | 0 |
transactionAmount | 0 |
merchanName | 0 |
acqCountry | 4562 |
merchantCountryCode | 724 |
posEntryMode | 4054 |
posConditionCode | 409 |
merchantCategoryCode | 0 |
currentExpDate | 0 |
accountOpenDate | 0 |
dateOfLastAddressChange | 0 |
cardCVV | 0 |
Number of Rows with “null” values
1: Problem Identification & Data Checks
False | 0.98421 |
True | 0.01579 |
II:Variable Understanding & Data Wrangling
Finding
Transaction Amount: Majority of the transaction amount is within 0 to $250, the distribution looks good as a modelling variable directly.
Credit Limit: The value are not normal distribution, hard to directly use as a feature
Available Money: The available money shows low given low credit limit to start with, it will be interesting to look at the utilization to determine the card usage.
Current Balance: Similar as previous insights, current balance seems low
All numerical variable distribution
Transactions in fraud cases are higher in amount compared to transactions in non-fraud cases.
Hypothesis 1: Cases with high value of transactionAmount are more likely to be fraud cases
Correlation among all numerical variables
Finding:Current Balance + Available Money = Credit Limit
| cardPresent | Row Count | Row Count % | Target Count | Target Count % | Target Rate % |
0 | true | 352868 | 44.87 | 3455 | 27.82 | 0.98 |
1 | false | 433495 | 55.13 | 8962 | 72.18 | 2.07 |
cardPresent:
cardPresent as false, the target rate is 2.07% while cardPresent is true is 0.98%. Therefore, without card has larger chance to be fraud.
Hypothesis 2: Without card is more likely to be fraud
| transactionType | Row Count | Row Count% | Target Count | Target Count % | Target Rate % |
0 | ADDRESS_VERIFICATION | 20169 | 2.57 | 116 | 0.94 | 0.58 |
1 | PURCHASE | 745193 | 94.85 | 11950 | 96.35 | 1.60 |
2 | REVERSAL | 20303 | 2.85 | 337 | 2.72 | 1.66 |
transactionType:
Hypothesis 3: Purchase or reversal is more likely to be fraud
purchase target rate is 1.60, reversal target rate is 1.66, compare to adress_verfication, purchase or reversal has larger chance to be fraud.
Hypothesis 4: some of the Merchantcategorys are more likely to fraud
| merchantCategoryCode | Row Count | Row Count % | Target Count | Target Count % | Target Rate % |
0 | gym | 2209 | 0.28 | 0 | 0.00 | 0.00 |
1 | cable/phone | 1382 | 0.18 | 0 | 0.00 | 0.00 |
2 | online_subscriptions | 11067 | 1.41 | 0 | 0.00 | 0.00 |
3 | mobileapps | 14990 | 1.91 | 0 | 0.00 | 0.00 |
4 | food_delivery | 6000 | 0.76 | 0 | 0.00 | 0.00 |
5 | fuel | 23910 | 3.04 | 0 | 0.00 | 0.00 |
6 | personal care | 18964 | 2..41 | 86 | 0.69 | 0.45 |
7 | health | 19092 | 2.43 | 90 | 0.72 | 0.47 |
8 | hotels | 34097 | 4.34 | 250 | 2.01 | 0.73 |
9 | subscriptions | 22901 | 2.91 | 216 | 1.74 | 0.94 |
10 | fastfood | 112138 | 14.26 | 1074 | 8.65 | 0.96 |
11 | entertainment | 80098 | 10.19 | 961 | 7.74 | 1.20 |
12 | auto | 21651 | 2.75 | 273 | 2.20 | 1.26 |
13 | food | 75490 | 9.60 | 1014 | 8.17 | 1.34 |
14 | furniture | 7432 | 0.95 | 103 | 0.83 | 1.39 |
15 | online_gifts | 66238 | 8.42 | 1606 | 12.93 | 2.42 |
16 | online_retail | 202156 | 25.71 | 4938 | 39.77 | 2.44 |
17 | rideshare | 51136 | 6.50 | 1272 | 10.24 | 2.49 |
18 | airline | 15412 | 1.96 | 534 | 4.30 | 3.46 |
Hypothesis 5: The larger the difference between the transactionAmount and the average for the past 7 or 30 days, the more likely it is to be a fraud.
3: Feature Engineering & Variable Selection
All selected variables:
Numerical Variables:
transactionAmount
utilization = currentBalance/creditLimit
transactionAmount7Diff
transactionAmount30Diff
Categorical Variables:
CardPresent
TransactionType
MerchantCategoryCode
Dummy Variables
| cardPresent | Row Count | Row Count % | Target Count | Target Count % | Target Rate % |
0 | true | 352868 | 44.87 | 3455 | 27.82 | 0.98 |
1 | false | 433495 | 55.13 | 8962 | 72.18 | 2.07 |
cardPresent:
True is 1;
False is 0.
| transactionType | Row Count | Row Count% | Target Count | Target Count % | Target Rate % |
0 | ADDRESS_VERIFICATION | 20169 | 2.57 | 116 | 0.94 | 0.58 |
1 | PURCHASE | 745193 | 94.85 | 11950 | 96.35 | 1.60 |
2 | REVERSAL | 20303 | 2.85 | 337 | 2.72 | 1.66 |
transactionType:
If it is a Purchase or Reversal type, it is 1; if it is an Address, it is 0.
Divide them into three categories(MerchantCategoryCode):
C1: 0-1
C2: 1-2
C3: 2-3
add three dummy variable columns:
code_under1
code_under2
code_rest
Final Selected Variables
cardpresent | purchasereveral | code_under1 | code_under2 | code_rest | Transaction | acqCode | utilization | Transaction amount 7diff | Transaction amount 30diff |
0 | 1 | 1 | 0 | 0 | 165.26 | 1 | 0.827724 | 42.807143 | 39.452963 |
0 | 1 | 0 | 0 | 1 | 298.85 | 1 | 0.059069 | 2.627500 | 2.627500 |
0 | 1 | 1 | 0 | 0 | 406.89 | 1 | 0.086502 | 212.500000 | 264.901250 |
0 | 1 | 0 | 0 | 1 | 439.31 | 1 | 0.403761 | 150.872857 | 288.637059 |
0 | 1 | 0 | 0 | 1 | 266.09 | 1 | 0.651798 | 140.730000 | 121.099273 |
… | … | … | … | … | … | … | … | … | … |
1 | 1 | 0 | 1 | 0 | 11.6 | 1 | 0.455689 | -127.604262 | -127.349921 |
1 | 1 | 0 | 1 | 0 | 53.06 | 1 | 0.719842 | -72.768265 | -89.679709 |
0 | 1 | 0 | 0 | 1 | 195.06 | 1 | 0.151087 | 36.476250 | 45.918267 |
0 | 1 | 0 | 0 | 1 | 40.44 | 1 | 0.906970 | 99.140611 | -102.675385 |
0 | 0 | 0 | 0 | 1 | 0.00 | 1 | 0.828903 | 152.301933 | -149.420587 |
Logistic Regression Model
| precision | recall | f1-score | support |
False | 0.66 | 0.61 | 0.63 | 2441 |
True | 0.65 | 0.70 | 0.67 | 2526 |
accuracy | 0.64 | 0.68 | 0.65 | 4967 |
macro avg | 0.65 | 0.65 | 0.65 | 4967 |
weighted avg | 0.65 | 0.65 | 0.65 | 4967 |
Confusion Matrix:
Accuracy Report:
Coefficients
transactionAmount | 1.4698587424173901 |
code_rest | 0.9044358641187724 |
transactionAmount7Diff | 0.06449556709958816 |
cardPresent | 0.02071320364010008 |
acqCode | 0.004379849093978679 |
code_under2 | 0.0004828690815489611 |
transactionAmount30Diff | -0.00022864403539822457 |
purchaseReversal | -0.1967447474923735 |
code_under1 | -0.10911740592971167 |
utilization | -0.8937164815560348 |
expirationDateKeyInMatch | -0.8116121682548728 |
ROC Curve
Decision Tree
| precision | recall | fl-score | support |
False | 0.66 | 0.58 | 0.61 | 2441 |
True | 0.63 | 0.71 | 0.67 | 2526 |
accuracy | | | 0.64 | 4967 |
Macro avg | 0.64 | 0.64 | 0.64 | 4967 |
Weighted avg | 0.64 | 0.64 | 0.64 | 4967 |
Importance of Variables
Random Forest
| precision | recall | fl-score | support |
False | 0.66 | 0.63 | 0.64 | 2441 |
True | 0.66 | 0.70 | 0.68 | 2526 |
accuracy | | | 0.66 | 4967 |
Macro avg | 0.66 | 0.66 | 0.66 | 4967 |
Weighted avg | 0.66 | 0.66 | 0.66 | 4967 |
Potential Improvement
accountNumber | 0 |
customerId | 0 |
creditLimit | 0 |
availableMoney | 0 |
transactionDateTime | 0 |
transactionAmount | 0 |
merchanName | 0 |
acqCountry | 4562 |
merchantCountryCode | 724 |
posEntryMode | 4054 |
posConditionCode | 409 |
merchantCategoryCode | 0 |
currentExpDate | 0 |
accountOpenDate | 0 |
dateOfLastAddressChange | 0 |
cardCVV | 0 |
The number of instances where acqCountry and merchantCountryCode are null and not the same. If only one is missing or if both are missing, there may be some information missing.
echoBuffer
merchantCity
merchantState
merchantZip
posOnPremises
recurringAuthInd
Missing all data
enteredCVV | 0 |
cardLasr4Digits | 0 |
transactionType | 698 |
echoBuffer | 7866363 |
currentBalance | 0 |
merchantCity | 786363 |
merchantState | 786363 |
merchantZip | 786363 |
cardPresent | 0 |
posOnPremises | 786363 |
recurringAuthInd | 786363 |
expirationDateKeyInMatch | 0 |
isFraud | 0 |