1 of 8

Credit Transaction Fraud Prediction Model

Johann Abraham and Alp Unsal

2 of 8

Introduction

This project is our team’s first experience with building a machine learning prediction model.

  • We used a Kaggle dataset containing credit card transactions made by European cardholders in September 2013
  • The data covers a 48-hour period in which there were 492 frauds out of 284,807 transactions
  • The dataset is highly unbalanced, as less than 0.2% of the data consists of fraudulent transactions
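The imbalance described above can be sketched in a few lines of pandas. The toy frame below simply reproduces the headline counts; the real project would instead load the Kaggle CSV (commonly distributed as `creditcard.csv`, with a 0/1 `Class` column marking fraud — an assumption here):

```python
import pandas as pd

# Toy frame reproducing the dataset's class counts; in practice this would be
# pd.read_csv("creditcard.csv") on the Kaggle file.
df = pd.DataFrame({"Class": [0] * 284_315 + [1] * 492})

counts = df["Class"].value_counts()
fraud_ratio = counts[1] / len(df)
print(f"{counts[1]} frauds out of {len(df)} transactions ({fraud_ratio:.3%})")
```

This prints a fraud rate of roughly 0.17%, which is why the later slides spend so much effort on rebalancing.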

3 of 8

Importing Libraries

For this project, we had to utilize multiple libraries for various purposes:

  • Standard Plotting and Arithmetic Libraries: numpy, pandas, matplotlib, seaborn, plotly
  • Libraries for processing and balancing data: scikit-learn (PCA, scaling), imblearn
  • Modelling Libraries: scikit-learn (classification strategies)
  • Testing Libraries: sklearn.model_selection, sklearn.metrics

4 of 8

5 of 8

Exploratory Data Analysis

We decided to analyse the data by plotting graphs and visualizing the distribution and imbalance of the dataset to spot trends.

  • Less than 0.2% of the dataset was fraudulent, making the data highly unbalanced
  • Fraudulent transactions had a more even time distribution compared to normal transactions (higher fraudulent transaction activity at night time)
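The two observations above can be visualized with a pair of plots like the sketch below. It runs on a toy frame with the same column names (`Time` in seconds since the first transaction, `Class` marking fraud) rather than the real data:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy stand-in for the real frame: "Time" is seconds elapsed since the first
# transaction, "Class" marks fraud (1) vs normal (0).
rng = np.random.default_rng(0)
n = 2_000
labels = np.zeros(n, dtype=int)
labels[:10] = 1  # a handful of frauds, mirroring the heavy imbalance
df = pd.DataFrame({"Time": rng.uniform(0, 172_800, size=n), "Class": labels})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: class distribution, showing the imbalance.
df["Class"].value_counts().plot(kind="bar", ax=ax1, title="Class distribution")
ax1.set_xlabel("Class (0 = normal, 1 = fraud)")

# Right panel: transaction time distributions, normalized so the two
# classes can be compared despite their very different sizes.
for cls, name in [(0, "normal"), (1, "fraud")]:
    ax2.hist(df.loc[df["Class"] == cls, "Time"], bins=48, alpha=0.6,
             density=True, label=name)
ax2.set_title("Transaction time distribution")
ax2.set_xlabel("Seconds since first transaction")
ax2.legend()
fig.savefig("eda.png")
```

On the real data, the right-hand panel is where the night-time fraud pattern shows up.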

6 of 8

Processing the Data

  • With such a high imbalance in the dataset, it was imperative to adjust our sample

  • If we trained our model on the current sample, it would almost always predict every transaction to be normal. We would also be unable to identify correlations between the variables without balancing the dataset.

  • Therefore, we need to create a new sub-sample that is standardized and balanced.
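A minimal sketch of both steps, standardizing and balancing, is shown below. It scales the raw columns with `RobustScaler` and balances by random undersampling via pandas sampling; the project's `imblearn` `RandomUnderSampler` achieves the same effect. The toy frame and its column names are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Toy frame standing in for the real dataset: two raw columns plus a label.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(1_000, 2)), columns=["Amount", "Time"])
df["Class"] = (rng.random(1_000) < 0.05).astype(int)

# Standardize the raw columns. RobustScaler uses the median and IQR, so it
# is resistant to the extreme transaction amounts typical of this data.
df[["Amount", "Time"]] = RobustScaler().fit_transform(df[["Amount", "Time"]])

# Build a balanced sub-sample: keep every fraud, then draw an equal number
# of normal transactions at random, and shuffle the result.
fraud = df[df["Class"] == 1]
normal = df[df["Class"] == 0].sample(n=len(fraud), random_state=42)
balanced = pd.concat([fraud, normal]).sample(frac=1, random_state=42)
print(balanced["Class"].value_counts())
```

The resulting sub-sample has a 50/50 class split, which is what the modelling slide starts from.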

7 of 8

Modelling the Data

  • With a balanced sub-sample in hand, we can now model the data for our prediction.

  • Using multiple classification strategies, we train and test our datasets with each strategy and evaluate each classification type by obtaining the model's accuracy score.

  • We use a variety of testing tools, such as the accuracy score, F1 score, precision score, and recall score, to validate the accuracy of each model
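The train/test/evaluate loop described above can be sketched as follows. Synthetic balanced data from `make_classification` stands in for the balanced sub-sample, and the particular classifiers shown are assumptions (the slides only name the strategy of trying several):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic balanced data standing in for the balanced sub-sample.
X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.5, 0.5], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbours": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Fit each strategy, then score it with all four metrics on the held-out set.
scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "f1": f1_score(y_test, pred),
    }
    print(name, scores[name])
```

Reporting precision and recall alongside accuracy matters here: on fraud data, accuracy alone rewards a model that never predicts fraud.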

8 of 8

Cross Validating and Final Results

After modelling, we conduct cross-validation in order to compare the different methods and determine how well each performed.

Confusion Matrix of the best-performing method (K-Nearest Neighbour)
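The final step can be sketched as below: 5-fold cross-validation for a less optimistic accuracy estimate, then a confusion matrix for the best method, K-Nearest Neighbours. Synthetic data again stands in for the balanced sub-sample:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the balanced sub-sample.
X, y = make_classification(n_samples=400, n_features=10, random_state=1)
knn = KNeighborsClassifier()

# 5-fold cross-validation: average accuracy over five train/test splits.
cv_scores = cross_val_score(knn, X, y, cv=5)
print("CV accuracy:", cv_scores.mean())

# Confusion matrix on a single held-out split; rows are true classes,
# columns are predicted classes, with labels fixed to [0, 1].
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)
knn.fit(X_train, y_train)
cm = confusion_matrix(y_test, knn.predict(X_test), labels=[0, 1])
print(cm)
```

For fraud detection, the off-diagonal cells are the interesting ones: the bottom-left entry counts frauds the model missed.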