1 of 26

Bank Fraud Detection

Data Science Project

2 of 26

Introduction

Bank fraud is a critical challenge in the financial sector, demanding innovative solutions for timely detection. Leveraging Neural Networks for this purpose offers a promising avenue. Neural Networks excel at recognizing intricate patterns, making them well-suited to identify fraudulent activities in the dynamic landscape of financial transactions.

3 of 26

Dataset Description

We use a dataset from Kaggle that contains 1,000,000 rows. To keep the resource requirements manageable, we use only 20,000 of those rows.
Column descriptions:
fraud_bool (Target Column): Binary indicator (0 or 1) representing whether the transaction is fraudulent or not.
income: The reported income of the customer.
name_email_similarity: A measure of similarity between the customer's name and email.

4 of 26

Dataset Description (cont.)

prev_address_months_count: Number of months since the customer's last address change.
current_address_months_count: Number of months at the current address.
customer_age: Age of the customer.
days_since_request: Number of days since the customer's request.
intended_balcon_amount: The intended loan or credit amount.
payment_type: The type of payment used for the transaction.
zip_count_4w: Count of transactions within the last 4 weeks in the customer's ZIP code.

5 of 26

Dataset Description (cont.)

velocity_6h: Transaction velocity in the last 6 hours.
velocity_24h: Transaction velocity in the last 24 hours.
velocity_4w: Transaction velocity in the last 4 weeks.
bank_branch_count_8w: Count of bank branches involved in transactions in the last 8 weeks.
date_of_birth_distinct_emails_4w: Count of distinct email addresses associated with the customer's date of birth in the last 4 weeks.
employment_status: The employment status of the customer.
credit_risk_score: Credit risk score assigned to the customer.
email_is_free: Binary indicator (0 or 1) denoting whether the email used is a free service.

6 of 26

Dataset Description (cont.)

housing_status: The housing status of the customer.
phone_home_valid: Binary indicator (0 or 1) indicating the validity of the home phone number.
phone_mobile_valid: Binary indicator (0 or 1) indicating the validity of the mobile phone number.
bank_months_count: Number of months the customer has been associated with the bank.
has_other_cards: Binary indicator (0 or 1) representing whether the customer has other credit cards.
proposed_credit_limit: The proposed credit limit for the customer.
foreign_request: Binary indicator (0 or 1) indicating whether the transaction request is from a foreign source.

7 of 26

Dataset Description (cont.)

source: The source of the transaction.
session_length_in_minutes: Duration of the transaction session in minutes.
device_os: The operating system of the device used for the transaction.
keep_alive_session: Binary indicator (0 or 1) denoting whether the session is kept alive.
device_distinct_emails_8w: Count of distinct email addresses associated with the device in the last 8 weeks.
device_fraud_count: Count of fraud-related incidents associated with the device.
month: The month in which the transaction occurred.
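
The reduction from 1,000,000 rows to 20,000 rows mentioned in the dataset description could be done along these lines. This is a minimal sketch: the slides do not say how the rows were chosen, so the uniform random sample, the `subsample` helper, the seed, and the toy data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def subsample(df: pd.DataFrame, n_rows: int = 20_000, seed: int = 42) -> pd.DataFrame:
    """Return a uniform random subset of at most n_rows rows (hypothetical helper)."""
    n = min(n_rows, len(df))  # never ask for more rows than exist
    return df.sample(n=n, random_state=seed).reset_index(drop=True)


# Small synthetic stand-in for the Kaggle table; the values are made up.
rng = np.random.default_rng(0)
full = pd.DataFrame({
    "fraud_bool": rng.integers(0, 2, size=1_000),
    "income": rng.random(1_000),
})

small = subsample(full, n_rows=200)
print(small.shape)
```

Fixing the seed keeps the subset reproducible between runs, which matters when results from different models are compared.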

8 of 26

Preprocessing

9 of 26

Preprocessing (cont.)

We removed the column “prev_address_months_count” because most of its rows hold the missing-value marker “-1”.
We replaced the “-1” values in the column “session_length_in_minutes” with the mean of the column.
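
The two steps above can be sketched with pandas as follows. The helper names and the toy values are hypothetical, and the sketch assumes the mean is taken over the non-missing values only; the slides do not say whether the “-1” entries were included when the mean was computed.

```python
import pandas as pd

MISSING = -1  # sentinel that marks a missing value in this dataset


def missing_fraction(df: pd.DataFrame, column: str) -> float:
    """Share of rows in `column` that hold the missing-value sentinel."""
    return float((df[column] == MISSING).mean())


def impute_with_mean(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Replace the sentinel with the mean of the observed values (assumption)."""
    out = df.copy()
    values = out[column].astype(float)
    is_missing = values == MISSING
    out[column] = values.mask(is_missing, values[~is_missing].mean())
    return out


# Toy data, made up for illustration.
df = pd.DataFrame({
    "prev_address_months_count": [-1, -1, -1, 24],
    "session_length_in_minutes": [4.0, -1.0, 8.0, 6.0],
})

print(missing_fraction(df, "prev_address_months_count"))  # mostly missing
df = df.drop(columns=["prev_address_months_count"])
df = impute_with_mean(df, "session_length_in_minutes")
print(df)
```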

10 of 26

Preprocessing (cont.)

We replaced the “-1” values in the column “device_distinct_emails_8w” with “1”.
We replaced the “-1” values in the column “current_address_months_count” with the mean of the column's values.
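
Replacing the sentinel with a fixed value, as done for “device_distinct_emails_8w”, is a one-line operation in pandas. The sketch below uses made-up values. The slides do not explain why “1” was chosen; checking the most frequent observed value, as the last lines do, is only one plausible way to motivate such a constant.

```python
import pandas as pd

# Toy column, made up for illustration.
df = pd.DataFrame({"device_distinct_emails_8w": [1, -1, 2, -1, 1]})

# Most frequent value among the observed (non-missing) entries.
observed = df.loc[df["device_distinct_emails_8w"] != -1, "device_distinct_emails_8w"]
most_frequent = int(observed.mode().iloc[0])

# Replace the -1 sentinel with the constant 1, as described on the slide.
df["device_distinct_emails_8w"] = df["device_distinct_emails_8w"].replace(-1, 1)

print(most_frequent)
print(df["device_distinct_emails_8w"].tolist())
```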

11 of 26

Preprocessing (cont.)

We replaced the “-1” values in the column “bank_months_count” with the mean of the column's values.

12 of 26

Preprocessing (cont.)

We applied label encoding, converting all categorical columns to numeric values.
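
Label encoding maps each distinct category to an integer code. A minimal pandas sketch follows; the `label_encode` helper and the category values are hypothetical, and the slides do not say which library the project used. Codes are assigned in sorted category order here.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def label_encode(df: pd.DataFrame):
    """Encode every non-numeric column as integer codes.

    Returns the encoded frame and, per column, the code -> category mapping
    so that the encoding can be inspected or reversed later.
    """
    out = df.copy()
    mappings = {}
    for col in out.columns:
        if is_numeric_dtype(out[col]):
            continue  # numeric columns are left untouched
        cat = out[col].astype("category")  # categories are sorted
        mappings[col] = dict(enumerate(cat.cat.categories))
        out[col] = cat.cat.codes
    return out, mappings


# Toy rows, made up for illustration.
raw = pd.DataFrame({
    "payment_type": ["AB", "AA", "AC", "AA"],
    "device_os": ["linux", "windows", "linux", "other"],
    "income": [0.1, 0.9, 0.5, 0.3],
})

encoded, mappings = label_encode(raw)
print(encoded)
print(mappings)
```

Integer codes impose an arbitrary order on the categories. That is usually harmless for tree-based models, but a neural network may treat the order as meaningful, so one-hot encoding is a common alternative.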

13 of 26

Preprocessing (cont.)

We applied MinMax scaling, which computes the minimum and maximum of each feature and then rescales the feature to the range [0, 1].
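
MinMax scaling applies x' = (x - min) / (max - min) to each feature column. The NumPy sketch below shows the arithmetic on made-up numbers; the `min_max_scale` helper is hypothetical, and mapping a constant column to 0 is a choice made here to avoid dividing by zero.

```python
import numpy as np


def min_max_scale(X):
    """Rescale each column of X to [0, 1] using its own minimum and maximum."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # A constant column has range 0; divide by 1 instead so it maps to 0.
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (X - col_min) / safe_range


# Toy feature matrix: two varying columns and one constant column.
X = np.array([
    [20.0, 100.0, 0.0],
    [30.0, 300.0, 0.0],
    [40.0, 200.0, 0.0],
])

X_scaled = min_max_scale(X)
print(X_scaled)
```

When data is split into training and validation sets, the minimum and maximum are usually computed on the training split only and then reused for the other splits.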

14 of 26

Visualization

Histogram of the number of fraudulent vs. non-fraudulent transactions (the two classes are balanced)
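
The class balance shown in the histogram can also be checked numerically. The sketch below counts each class of a label column; the `class_balance` helper and the labels are made up for illustration.

```python
import pandas as pd


def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Count and share of each class in a label column."""
    counts = labels.value_counts().sort_index()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# Toy labels, made up for illustration (0 = not fraud, 1 = fraud).
y = pd.Series([0, 1, 1, 0, 1, 0, 0, 1], name="fraud_bool")

summary = class_balance(y)
print(summary)
```

If matplotlib is installed, `summary["count"].plot(kind="bar")` draws the corresponding bar chart.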

15 of 26

Models

We implemented 2 models from scratch & used 1 pre-trained model (TabNet).

16 of 26

First Model

17 of 26

First Model

Training accuracy and validation accuracy are both high during model fitting.

No underfitting

18 of 26

First Model Results

Accuracy : 94.01%

Other metrics
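
Accuracy and related metrics for a binary classifier can all be derived from the four confusion-matrix counts. The sketch below is illustrative only: the slide does not list which other metrics were reported, so precision, recall, and F1 are assumptions, and the labels and predictions are made up rather than taken from the project.

```python
import numpy as np


def binary_metrics(y_true, y_pred) -> dict:
    """Accuracy, precision, recall, and F1 for 0/1 labels (1 = fraud)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_true & y_pred))    # fraud predicted as fraud
    tn = int(np.sum(~y_true & ~y_pred))  # non-fraud predicted as non-fraud
    fp = int(np.sum(~y_true & y_pred))   # non-fraud predicted as fraud
    fn = int(np.sum(y_true & ~y_pred))   # fraud predicted as non-fraud
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels and predictions, made up for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

For fraud detection, recall on the fraud class is often watched as closely as accuracy, because a missed fraud case and a false alarm have different costs.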

19 of 26

First Model Results

Moderate Overfitting
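
One simple way to quantify overfitting is the per-epoch gap between training and validation accuracy. The sketch below is an assumption about how such a check might look, not the project's procedure: the accuracy curves are made up, and the 0.03 threshold is an arbitrary illustrative value.

```python
def generalization_gap(train_acc, val_acc):
    """Per-epoch difference between training and validation accuracy."""
    if len(train_acc) != len(val_acc):
        raise ValueError("both curves must have one value per epoch")
    return [t - v for t, v in zip(train_acc, val_acc)]


# Toy accuracy curves, made up for illustration.
train_acc = [0.80, 0.90, 0.95, 0.98]
val_acc = [0.79, 0.88, 0.91, 0.92]

gaps = generalization_gap(train_acc, val_acc)

# First epoch (0-based) at which the gap exceeds a hypothetical threshold.
THRESHOLD = 0.03
first_wide_gap = next((i for i, g in enumerate(gaps) if g > THRESHOLD), None)

print(gaps)
print(first_wide_gap)
```

A gap that keeps widening while validation accuracy stalls is the usual sign that training should stop earlier or that regularization is needed.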

20 of 26

Model Modification

21 of 26

Model Modification

22 of 26

Model Modification Results

Accuracy : 93.95%

Other metrics

23 of 26

Model Modification Results

Moderate Overfitting
