2 of 15

What Are Machine Learning Algorithms?

Machine learning algorithms are mathematical models that enable computers to learn from data and make decisions or predictions.
These algorithms identify patterns in data and improve their performance as they see more data, without being explicitly programmed for each case.
They help automate decision-making processes, improve efficiency, and enhance user experience.
Examples:

Netflix recommends shows using machine learning.
A spam filter learns to classify emails as spam or not spam.

3 of 15

How Does R Help Create Machine Learning Algorithms?

R is a powerful tool for statistical computing and machine learning.
It supports data preprocessing, model training, evaluation, and visualization.
It is well suited to exploratory data analysis and statistical modeling.
Example:
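A minimal sketch of that workflow in base R. It uses R's built-in mtcars data rather than the fraud dataset, and the choice of variables (wt, am) is an assumption made purely for illustration:

```r
# Built-in example data: 32 cars; `am` is 0 (automatic) or 1 (manual).
data(mtcars)

# Preprocessing: standardize the weight column.
cars <- mtcars
cars$wt_scaled <- as.numeric(scale(cars$wt))

# Model training: logistic regression predicting transmission type.
fit <- glm(am ~ wt_scaled, data = cars, family = binomial)

# Evaluation: share of cars classified correctly at a 0.5 cutoff.
prob <- predict(fit, type = "response")
accuracy <- mean((prob > 0.5) == (cars$am == 1))
print(accuracy)

# Visualization: fitted probability against weight.
plot(cars$wt, prob, xlab = "Weight (1000 lbs)", ylab = "P(manual)")
```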

4 of 15

Introducing Our Dataset: Credit Card Fraud Detection

Predicting fraudulent transactions using machine learning.

Transaction Amount – How much was spent.
Time – Timestamp of the transaction.
MerchantID – Encoded identifier for the merchant.
TransactionType – How the transaction was made.
Class – Target variable: Legitimate or Fraudulent.
Objective: Train a model to detect fraud before financial loss occurs.

5 of 15

Using R To Visualize Data

Code: head(CreditCardFraud, 5)
Shows the first 5 transactions.

Code: ggplot2
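The plotting code itself is not reproduced here, so the following is only a sketch of what a ggplot2 view of the data could look like. The data frame is a synthetic stand-in, and the column names Amount and Class and the choice of a faceted histogram are assumptions:

```r
library(ggplot2)

# Synthetic stand-in for the dataset; all values are made up for illustration.
set.seed(1)
CreditCardFraud <- data.frame(
  Amount = round(rexp(200, rate = 1 / 80), 2),
  Class  = factor(sample(c("Legitimate", "Fraudulent"), 200,
                         replace = TRUE, prob = c(0.9, 0.1)))
)

# First five rows, as on the slide.
print(head(CreditCardFraud, 5))

# Histogram of transaction amounts, one panel per class.
p <- ggplot(CreditCardFraud, aes(x = Amount, fill = Class)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ Class, scales = "free_y") +
  labs(x = "Transaction amount", y = "Count")
print(p)
```

Separate panels with free y-axes keep the rare fraud class visible next to the much larger legitimate class.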

6 of 15

Using R To Manipulate Data

Many machine learning algorithms, including logistic regression, train more reliably when numerical features are on similar scales.
Features with larger magnitudes can disproportionately influence the model if left unscaled.
Scaling prevents those features from dominating the optimization, which makes training more efficient.
Very large numerical values can also cause numerical problems in some algorithms.
After scaling, each feature contributes to model learning on a comparable footing.
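One common way to do this in R is standardization with the base function scale(), which subtracts each column's mean and divides by its standard deviation. A small sketch on made-up data (the data frame name and the columns Amount and Time are assumptions):

```r
# Two synthetic features on very different scales.
set.seed(42)
transactions <- data.frame(
  Amount = rexp(100, rate = 1 / 120),   # roughly tens to hundreds
  Time   = runif(100, 0, 86400)         # seconds within a day
)

# scale() centers each column to mean 0 and rescales to standard deviation 1.
scaled <- as.data.frame(scale(transactions))

print(round(colMeans(scaled), 10))
print(apply(scaled, 2, sd))
```

When the data is later split into training and test sets, a usual precaution is to compute the centers and scales on the training set only and reuse them for the test set.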

7 of 15

Splitting Data, Evaluation, and Model Generalization

Before the model is trained, the data is split so that we can evaluate how well the model generalizes to new, unseen data.
If we train and test the model on the same dataset, it might "memorize" the data instead of learning general patterns.
By testing on a separate set, we check that the model's predictions are meaningful beyond the training data.
Splitting the data does not prevent overfitting by itself, but it lets us detect it. Overfitting means the model performs well on training data but poorly on new data.
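A random split can be sketched in base R with sample(). The 70/30 ratio, the seed, and the synthetic data below are assumptions, not the values used for the fraud dataset:

```r
# Synthetic data: 1000 transactions with a rare positive class.
set.seed(123)
n <- 1000
transactions <- data.frame(
  Amount = rexp(n, rate = 1 / 100),
  Class  = rbinom(n, 1, 0.05)
)

# Draw 70% of the row indices at random for training.
train_idx <- sample(seq_len(n), size = floor(0.7 * n))

train <- transactions[train_idx, ]    # used to fit the model
test  <- transactions[-train_idx, ]   # held out for evaluation only

print(c(train = nrow(train), test = nrow(test)))
```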

8 of 15

Using Logistic Regression For Fraud Detection

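What is Logistic Regression?

A classification model that estimates the probability of the positive class (here, fraud) by passing a weighted sum of the features through the logistic (sigmoid) function.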

Interpretable Probability Outputs
Binary Classification
Computational Efficiency
Predictions Remain Fairly Stable Under Multicollinearity (individual coefficients may not)
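In R, logistic regression is available through the base function glm() with family = binomial. The sketch below fits it to synthetic data; the rule used to generate fraud labels and the single predictor Amount are assumptions for illustration only:

```r
# Synthetic data in which larger amounts are more likely to be fraud.
set.seed(7)
n <- 2000
amount   <- rexp(n, rate = 1 / 100)
is_fraud <- rbinom(n, 1, plogis(-4 + 0.015 * amount))
transactions <- data.frame(Amount = amount, Class = is_fraud)

# Fit the model: log-odds of fraud as a linear function of Amount.
fit <- glm(Class ~ Amount, data = transactions, family = binomial)
print(summary(fit)$coefficients)

# Probability outputs for two hypothetical new transactions.
new_tx <- data.frame(Amount = c(20, 500))
prob <- predict(fit, newdata = new_tx, type = "response")
print(prob)

# Interpretability: multiplicative change in the odds per extra unit of Amount.
odds_ratio <- exp(coef(fit)["Amount"])
print(odds_ratio)
```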

9 of 15

Model Evaluation With A Confusion Matrix

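A confusion matrix helps evaluate classification model performance by comparing actual vs. predicted values, giving counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
Accuracy: Overall correctness of the model, (TP + TN) / all cases.
Precision: How many flagged fraud cases are actually fraud? TP / (TP + FP).
Recall (Sensitivity): How well does the model catch fraud? TP / (TP + FN).
F1-Score: Balances precision and recall as their harmonic mean.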
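These metrics can be computed directly from a table() of predicted against actual labels. The ten labels below are a toy example, not results from the fraud dataset:

```r
# Toy labels: 3 actual fraud cases and 7 legitimate ones.
lv <- c("fraud", "legit")
actual    <- factor(c("fraud", "fraud", "fraud", "legit", "legit",
                      "legit", "legit", "legit", "legit", "legit"), levels = lv)
predicted <- factor(c("fraud", "legit", "fraud", "legit", "legit",
                      "fraud", "legit", "legit", "legit", "legit"), levels = lv)

# Rows are predictions, columns are the true classes.
cm <- table(Predicted = predicted, Actual = actual)
print(cm)

TP <- cm["fraud", "fraud"]
FP <- cm["fraud", "legit"]
FN <- cm["legit", "fraud"]
TN <- cm["legit", "legit"]

accuracy  <- (TP + TN) / sum(cm)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)

print(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1))
```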

10 of 15

Applying the Decision Tree Model

TransactionID was removed because it is not a relevant feature for fraud detection.
It is only an identifier and carries no information that helps distinguish fraudulent from non-fraudulent transactions.
Keeping it would not improve accuracy and could add unnecessary complexity, so we dropped it to make sure the model is not influenced by it.
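Dropping a column in base R is a one-line operation. A sketch on a tiny made-up data frame (all values are hypothetical):

```r
# Tiny illustrative data frame with an identifier column.
transactions <- data.frame(
  TransactionID = 1001:1005,
  Amount        = c(12.50, 230.00, 48.99, 5.25, 999.00),
  Class         = c(0, 0, 1, 0, 0)
)

# Assigning NULL removes the column; the remaining features are unchanged.
transactions$TransactionID <- NULL

print(names(transactions))
```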

11 of 15

Applying the Decision Tree Model

This code trains a decision tree model using the rpart package in R and visualizes the resulting tree.
The model learns from the training data to identify patterns that distinguish fraudulent from normal transactions.
It creates decision rules that split the data at different feature values, forming a tree structure.
The rpart.plot() function, from the rpart.plot package, visualizes the tree and shows how transactions are classified.
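A sketch of such a training step, assuming synthetic data. The feature values ("Online", "InStore"), the fraud-generating rule, and the formula are made up for illustration and are not the original model:

```r
library(rpart)

# Synthetic training data: large online transactions are mostly fraud.
set.seed(11)
n <- 1500
amount  <- rexp(n, rate = 1 / 100)
online  <- rbinom(n, 1, 0.5)
p_fraud <- ifelse(amount > 250 & online == 1, 0.8, 0.02)

train <- data.frame(
  Amount          = amount,
  TransactionType = factor(ifelse(online == 1, "Online", "InStore")),
  Class           = factor(ifelse(rbinom(n, 1, p_fraud) == 1, "Fraud", "Legit"))
)

# method = "class" requests a classification tree.
tree <- rpart(Class ~ Amount + TransactionType, data = train, method = "class")
print(tree)

# Draw the tree only if the optional rpart.plot package is installed.
if (requireNamespace("rpart.plot", quietly = TRUE)) {
  rpart.plot::rpart.plot(tree)
}

# Predicted class for each training row.
pred <- predict(tree, newdata = train, type = "class")
print(table(Predicted = pred, Actual = train$Class))
```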

12 of 15

Applying the Decision Tree Model

The model has a high overall accuracy of 97.04%, but because fraud is rare this mostly reflects correctly classified normal transactions.
P-Value (accuracy vs. majority-class baseline) = 0.9428: The model's accuracy is not significantly better than always guessing the majority class.
McNemar’s Test P-Value = 5.797e-06: The two kinds of error are unbalanced; the model misses fraud far more often than it raises false alarms, showing a bias toward non-fraud predictions.
Very Low Sensitivity (8.07%): The model misses most fraud cases (57 out of 62).
Low Precision (22.73%): Only 22.73% of transactions predicted as fraud are actually fraudulent.
Very High Specificity (99.30%): The model is very good at identifying normal transactions correctly.
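The reported percentages are all consistent with a single confusion matrix of 2,500 test transactions: 5 true positives, 57 false negatives, 17 false positives, and 2,421 true negatives. These counts are inferred from the figures above rather than copied from the original output, so treat them as an assumption. A base R sketch that recomputes the metrics from them:

```r
# Confusion matrix inferred from the reported metrics (an assumption).
cm <- matrix(c(5, 57, 17, 2421), nrow = 2,
             dimnames = list(Predicted = c("Fraud", "Legit"),
                             Actual    = c("Fraud", "Legit")))
print(cm)

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["Fraud", "Fraud"] / sum(cm[, "Fraud"])
specificity <- cm["Legit", "Legit"] / sum(cm[, "Legit"])
precision   <- cm["Fraud", "Fraud"] / sum(cm["Fraud", ])

# No-information rate: accuracy of always predicting the majority class.
nir <- max(colSums(cm)) / sum(cm)

print(round(c(accuracy = accuracy, nir = nir, sensitivity = sensitivity,
              specificity = specificity, precision = precision), 4))

# McNemar's test compares the two off-diagonal error counts (17 vs. 57).
mt <- mcnemar.test(cm)
print(mt)
```

Always predicting "Legit" would already score about 97.5% on these counts, which is why the 97.04% accuracy is not evidence of a useful fraud detector.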

13 of 15

Understanding ROC Curves

1. What is an ROC Curve?
The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at every classification threshold.
It is used to evaluate how well a model distinguishes between fraudulent and non-fraudulent cases.
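To make the construction concrete, the curve can be traced by hand in base R by sweeping a threshold over the model's scores. The scores and labels below are simulated, not taken from either fitted model:

```r
# Simulated scores: fraud cases (label 1) tend to score higher.
set.seed(3)
labels <- c(rep(1, 50), rep(0, 450))
scores <- c(rnorm(50, mean = 2), rnorm(450, mean = 0))

# Sweep the threshold from the highest score down to the lowest.
thresholds <- sort(unique(scores), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) mean(scores[labels == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(scores[labels == 0] >= t))

# Start the curve at the origin (threshold above every score).
tpr <- c(0, tpr)
fpr <- c(0, fpr)

plot(fpr, tpr, type = "l",
     xlab = "False Positive Rate", ylab = "True Positive Rate")
abline(0, 1, lty = 2)   # diagonal = a model with no predictive power
```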

2. Key Observations From Dataset

Both the Logistic Regression and Decision Tree show strong performance
The Logistic Regression curve lies slightly above the Decision Tree curve, suggesting it is a slightly better predictor.

14 of 15

Measuring Model Performance with AUC

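1. What is AUC (Area Under the Curve)?
AUC measures a model’s ability to distinguish between fraud and non-fraud cases: it is the probability that a randomly chosen fraud case receives a higher score than a randomly chosen non-fraud case.
AUC = 1.0 → Perfect model
AUC = 0.5 → No predictive power
AUC > 0.90 → Good model performance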
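AUC can be computed without drawing the curve, using the rank-sum (Mann-Whitney) identity. A base R sketch; the function name auc_rank and the toy scores are assumptions for illustration:

```r
# AUC from ranks: the share of (fraud, non-fraud) pairs ranked correctly,
# with ties counted as half.
auc_rank <- function(scores, labels) {
  r     <- rank(scores)          # average ranks handle tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- c(1, 1, 0, 0)

auc_perfect <- auc_rank(c(0.9, 0.8, 0.2, 0.1), labels)  # every fraud outranks every non-fraud
auc_none    <- auc_rank(c(0.5, 0.5, 0.5, 0.5), labels)  # identical scores carry no information
auc_partial <- auc_rank(c(0.9, 0.3, 0.4, 0.1), labels)  # 3 of 4 pairs ordered correctly

print(c(perfect = auc_perfect, none = auc_none, partial = auc_partial))
```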

2. AUC Scores for Our Models

Logistic Regression: 0.9419
Decision Tree: 0.9368
Both models perform well, since both AUCs exceed 0.90.
The Logistic Regression model has a slightly higher AUC, meaning it ranks fraud cases above non-fraud cases slightly more often.
The difference is only 0.0051, so the two models are similarly effective by this measure.

15 of 15

Hypothesis Testing: Comparing AUC Scores

DeLong’s Test

Used because the two AUCs are computed on the same test set and are therefore correlated; DeLong’s test is a non-parametric method that accounts for this correlation.
Hypothesis:

H₀ (Null Hypothesis): No significant difference between the two AUC values.
H₁ (Alternative Hypothesis): A significant difference exists between the two AUC values.

Results:

P-Value = 0.7044
Since 0.7044 > 0.05, we fail to reject the null hypothesis, so the difference is not statistically significant.
This suggests that both models perform comparably; the small AUC difference is consistent with random variation.
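For readers who want to see what the test computes, here is a base R sketch of DeLong's test for two models scored on the same cases. It is an illustrative implementation on simulated scores, not the code behind the result above, and the function name delong_paired is hypothetical. In practice the pROC package offers this test through roc.test() with method = "delong".

```r
# Paired DeLong test: compares two AUCs computed on the same cases.
delong_paired <- function(score_a, score_b, labels) {
  pos <- labels == 1
  neg <- labels == 0
  m <- sum(pos)
  n <- sum(neg)

  # For each model: 1 if a fraud case outranks a non-fraud case, 0.5 if tied.
  placement <- function(s) {
    cmp <- outer(s[pos], s[neg], function(x, y) (x > y) + 0.5 * (x == y))
    list(v10 = rowMeans(cmp), v01 = colMeans(cmp))
  }
  a <- placement(score_a)
  b <- placement(score_b)

  auc <- c(mean(a$v10), mean(b$v10))

  # Covariance of the two AUC estimates, including their correlation.
  s <- cov(cbind(a$v10, b$v10)) / m + cov(cbind(a$v01, b$v01)) / n
  var_diff <- s[1, 1] + s[2, 2] - 2 * s[1, 2]

  z <- (auc[1] - auc[2]) / sqrt(var_diff)
  list(auc = auc, z = z, p.value = 2 * pnorm(-abs(z)))
}

# Simulated scores from two models that share the same underlying signal.
set.seed(21)
labels  <- c(rep(1, 60), rep(0, 240))
signal  <- rnorm(300, mean = 1.5 * labels)
score_a <- signal + rnorm(300, sd = 0.5)
score_b <- signal + rnorm(300, sd = 0.6)

result <- delong_paired(score_a, score_b, labels)
print(result)
```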