1 of 8

Credit Transaction Fraud Prediction Model

Johann Abraham and Alp Unsal

2 of 8

Introduction

This project is our team’s first experience with building a machine learning prediction model.

  • We used a Kaggle dataset containing credit card transactions made by European cardholders in September 2013
  • The data covers a 48-hour period in which there were 492 frauds out of 284,807 transactions
  • The dataset is highly unbalanced, as less than 0.2% of the data consists of fraudulent transactions
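The imbalance described above can be sketched in a few lines of pandas. The toy frame below simply reproduces the headline counts; the real project would instead load the Kaggle CSV (commonly distributed as `creditcard.csv`, with a 0/1 `Class` column marking fraud — an assumption here):

```python
import pandas as pd

# Toy frame reproducing the dataset's class counts; in practice this would be
# pd.read_csv("creditcard.csv") on the Kaggle file.
df = pd.DataFrame({"Class": [0] * 284_315 + [1] * 492})

counts = df["Class"].value_counts()
fraud_ratio = counts[1] / len(df)
print(f"{counts[1]} frauds out of {len(df)} transactions ({fraud_ratio:.3%})")
```

This prints a fraud rate of roughly 0.17%, which is why the later slides spend so much effort on rebalancing.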

3 of 8

Importing Libraries

For this project, we had to utilize multiple libraries for various purposes:

  • Standard Plotting and Arithmetic Libraries: numpy, pandas, matplotlib, seaborn, plotly
  • Libraries for processing and balancing data: scikit-learn (PCA, scaling), imblearn
  • Modelling Libraries: scikit-learn (classification strategies)
  • Testing Libraries: sklearn.model_selection, sklearn.metrics

4 of 8

5 of 8

Exploratory Data Analysis

We decided to analyse the data by plotting graphs and visualizing the distribution and imbalance of the dataset to spot trends.

  • Less than 0.2% of the dataset was fraudulent, making the data highly unbalanced
  • Fraudulent transactions had a more even time distribution compared to normal transactions (higher fraudulent transaction activity at night time)
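The two observations above can be visualized with a pair of plots like the sketch below. It runs on a toy frame with the same column names (`Time` in seconds since the first transaction, `Class` marking fraud) rather than the real data:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy stand-in for the real frame: "Time" is seconds elapsed since the first
# transaction, "Class" marks fraud (1) vs normal (0).
rng = np.random.default_rng(0)
n = 2_000
labels = np.zeros(n, dtype=int)
labels[:10] = 1  # a handful of frauds, mirroring the heavy imbalance
df = pd.DataFrame({"Time": rng.uniform(0, 172_800, size=n), "Class": labels})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: class distribution, showing the imbalance.
df["Class"].value_counts().plot(kind="bar", ax=ax1, title="Class distribution")
ax1.set_xlabel("Class (0 = normal, 1 = fraud)")

# Right panel: transaction time distributions, normalized so the two
# classes can be compared despite their very different sizes.
for cls, name in [(0, "normal"), (1, "fraud")]:
    ax2.hist(df.loc[df["Class"] == cls, "Time"], bins=48, alpha=0.6,
             density=True, label=name)
ax2.set_title("Transaction time distribution")
ax2.set_xlabel("Seconds since first transaction")
ax2.legend()
fig.savefig("eda.png")
```

On the real data, the right-hand panel is where the night-time fraud pattern shows up.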

6 of 8

Processing the Data

  • With such a high imbalance in the dataset, it was imperative to adjust our sample

  • If we trained our model on the current sample, it would almost always predict every transaction to be normal. We would also be unable to identify correlations between the variables without balancing the dataset.

  • Therefore, we need to create a new sub-sample that is standardized and balanced.
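A minimal sketch of both steps, standardizing and balancing, is shown below. It scales the raw columns with `RobustScaler` and balances by random undersampling via pandas sampling; the project's `imblearn` `RandomUnderSampler` achieves the same effect. The toy frame and its column names are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Toy frame standing in for the real dataset: two raw columns plus a label.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(1_000, 2)), columns=["Amount", "Time"])
df["Class"] = (rng.random(1_000) < 0.05).astype(int)

# Standardize the raw columns. RobustScaler uses the median and IQR, so it
# is resistant to the extreme transaction amounts typical of this data.
df[["Amount", "Time"]] = RobustScaler().fit_transform(df[["Amount", "Time"]])

# Build a balanced sub-sample: keep every fraud, then draw an equal number
# of normal transactions at random, and shuffle the result.
fraud = df[df["Class"] == 1]
normal = df[df["Class"] == 0].sample(n=len(fraud), random_state=42)
balanced = pd.concat([fraud, normal]).sample(frac=1, random_state=42)
print(balanced["Class"].value_counts())
```

The resulting sub-sample has a 50/50 class split, which is what the modelling slide starts from.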

7 of 8

Modelling the Data

  • With a balanced sub-sample in hand, we can now model the data for our prediction.

  • Using multiple classification strategies, we train and test our datasets with each strategy and evaluate each classification type by obtaining the model's accuracy score.

  • We use a variety of testing tools, such as the accuracy score, F1 score, precision score, and recall score, to validate the accuracy of each model
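The train/test/evaluate loop described above can be sketched as follows. Synthetic balanced data from `make_classification` stands in for the balanced sub-sample, and the particular classifiers shown are assumptions (the slides only name the strategy of trying several):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic balanced data standing in for the balanced sub-sample.
X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.5, 0.5], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbours": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Fit each strategy, then score it with all four metrics on the held-out set.
scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "f1": f1_score(y_test, pred),
    }
    print(name, scores[name])
```

Reporting precision and recall alongside accuracy matters here: on fraud data, accuracy alone rewards a model that never predicts fraud.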

8 of 8

Cross Validating and Final Results

After modelling, we conduct cross-validation in order to compare the different methods and determine how well each performed.

Confusion Matrix of the best-performing method (K-Nearest Neighbour)
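The final step can be sketched as below: 5-fold cross-validation for a less optimistic accuracy estimate, then a confusion matrix for the best method, K-Nearest Neighbours. Synthetic data again stands in for the balanced sub-sample:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the balanced sub-sample.
X, y = make_classification(n_samples=400, n_features=10, random_state=1)
knn = KNeighborsClassifier()

# 5-fold cross-validation: average accuracy over five train/test splits.
cv_scores = cross_val_score(knn, X, y, cv=5)
print("CV accuracy:", cv_scores.mean())

# Confusion matrix on a single held-out split; rows are true classes,
# columns are predicted classes, with labels fixed to [0, 1].
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)
knn.fit(X_train, y_train)
cm = confusion_matrix(y_test, knn.predict(X_test), labels=[0, 1])
print(cm)
```

For fraud detection, the off-diagonal cells are the interesting ones: the bottom-left entry counts frauds the model missed.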