1 of 15

2 of 15

What Are Machine Learning Algorithms?

  • Machine learning algorithms are mathematical models that enable computers to learn from data and make decisions or predictions.
  • These algorithms identify patterns in data and improve their performance over time without being explicitly programmed for each task.
  • They help automate decision-making processes, improve efficiency, and enhance user experience.
  • Examples:
    • Netflix recommends shows using machine learning.
    • A spam filter learns to classify emails as spam or not spam.

3 of 15

How Does R Help Create Machine Learning Algorithms?

  • R is a powerful tool for statistical computing and machine learning.
  • It supports data preprocessing, model training, evaluation, and visualization.
  • It is well suited to exploratory data analysis and statistical modeling.
  • Example:

4 of 15

Introducing Our Dataset: Credit Card Fraud Detection

Predicting fraudulent transactions using machine learning.

  • Transaction Amount – How much was spent.
  • Time – Timestamp of the transaction.
  • MerchantID – Encoded identifier of the merchant.
  • TransactionType – How the transaction was made.
  • Class – Target variable: Legitimate or Fraudulent.
  • Objective: Train a model to detect fraud before financial loss occurs.

5 of 15

Using R To Visualize Data

    • Code: head(CreditCardFraud, 5)
    • Shows the first 5 transactions.

    • Code: ggplot2 (plotting package)
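
A minimal sketch of both steps, assuming the ggplot2 package is installed. The CreditCardFraud data from the slides is not reproduced here, so a small synthetic stand-in is generated; its values and the column names Amount and Class are assumptions made for illustration.

```r
library(ggplot2)

# Synthetic stand-in for the CreditCardFraud data (values and column names are assumptions)
set.seed(1)
CreditCardFraud <- data.frame(
  Amount = round(rexp(200, rate = 1 / 80), 2),
  Class  = factor(sample(c("Legitimate", "Fraudulent"), 200,
                         replace = TRUE, prob = c(0.95, 0.05)))
)

# Show the first 5 transactions
first_rows <- head(CreditCardFraud, 5)
print(first_rows)

# Histogram of transaction amounts, coloured by class
p <- ggplot(CreditCardFraud, aes(x = Amount, fill = Class)) +
  geom_histogram(bins = 30) +
  labs(title = "Transaction amounts by class", x = "Amount", y = "Count")
print(p)
```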

6 of 15

Using R To Manipulate Data

    • Many machine learning algorithms, including logistic regression, perform better when numerical features are on similar scales.
    • Features with larger magnitudes might disproportionately influence the model if not scaled.
    • Scaling prevents certain features from dominating optimization, making training more efficient.
    • Very large numerical values may cause numerical instability in some algorithms.
    • Scaling puts features on a comparable scale so that each contributes fairly to model learning.
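
The idea above can be sketched with base R's scale() function, which standardizes each column to mean 0 and standard deviation 1. The feature values below are made up, and the column names Amount and Time are assumed to mirror the dataset description.

```r
# Synthetic numeric features on very different scales (values are made up)
set.seed(42)
features <- data.frame(
  Amount = runif(100, min = 1, max = 5000),
  Time   = runif(100, min = 0, max = 172800)
)

# scale() centres each column to mean 0 and rescales it to standard deviation 1
scaled <- as.data.frame(scale(features))

print(round(colMeans(scaled), 6))   # column means after scaling
print(apply(scaled, 2, sd))         # column standard deviations after scaling
```

When a train/test split is used, a common choice is to compute the centring and scaling values on the training set only and then apply those same values to the test set.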

7 of 15

Splitting Data, Evaluation, and Model Generalization

    • Before creating the machine learning model, the data is split into training and test sets so we can evaluate how well the model generalizes to new, unseen data.
    • If we train and test the model on the same dataset, it might "memorize" the data instead of learning general patterns.
    • By testing on a separate set, we check that the model's predictions are meaningful beyond the training data.
    • Splitting the data helps detect overfitting. Overfitting means the model performs well on training data but poorly on new data.
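
A minimal sketch of a random split in base R. The 70/30 ratio, the seed, and the synthetic data are assumptions for illustration; the slides do not state the ratio that was actually used.

```r
set.seed(123)  # makes the random split reproducible
n <- 1000
transactions <- data.frame(
  Amount = rexp(n, rate = 1 / 80),
  Class  = rbinom(n, size = 1, prob = 0.05)
)

# Randomly pick 70% of the row indices for training; the rest form the test set
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- transactions[train_idx, ]
test  <- transactions[-train_idx, ]

cat("Training rows:", nrow(train), " Test rows:", nrow(test), "\n")
```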

8 of 15

Using Logistic Regression For Fraud Detection

    • What is Logistic Regression?

    • Interpretable Probability Outputs
    • Binary Classification
    • Computational Efficiency
    • Predictions Stay Fairly Stable Under Multicollinearity (although individual coefficients become harder to interpret)
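
A minimal sketch of logistic regression with base R's glm(), shown on simulated data. The single predictor Amount and its assumed relationship to fraud are illustrative only and are not the model from the slides.

```r
set.seed(7)
n <- 2000
amount <- rexp(n, rate = 1 / 80)
# Assumed relationship for the simulation: larger amounts are more likely to be fraud
fraud <- rbinom(n, size = 1, prob = plogis(-4 + 0.02 * amount))
train <- data.frame(Amount = amount, Class = fraud)

# Logistic regression: binomial family with the default logit link
model <- glm(Class ~ Amount, data = train, family = binomial)

# type = "response" returns predicted probabilities between 0 and 1
new_tx <- data.frame(Amount = c(20, 400))
probs <- predict(model, newdata = new_tx, type = "response")
print(probs)
```

The output is a fraud probability per transaction, which is what makes the model suitable for binary classification: a threshold (0.5 is a common default) turns each probability into a Legitimate or Fraudulent label.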

 

9 of 15

Model Evaluation With A Confusion Matrix

    • A confusion matrix helps evaluate classification model performance by comparing actual vs. predicted values.
    • Accuracy: Overall correctness of the model.
    • Precision: How many flagged fraud cases are actually fraud?
    • Recall (Sensitivity): How many actual fraud cases does the model catch?
    • F1-Score: Harmonic mean of precision and recall, balancing the two.
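
The four metrics above can be computed directly from the cells of a confusion matrix. The sketch below uses ten hand-made labels (1 = fraud, 0 = legitimate) that are purely illustrative.

```r
# Hand-made example labels: 1 = fraud, 0 = legitimate
actual    <- factor(c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0), levels = c(0, 1))
predicted <- factor(c(1, 1, 0, 0, 1, 0, 0, 0, 0, 0), levels = c(0, 1))

# Rows are predictions, columns are the true classes
cm <- table(Predicted = predicted, Actual = actual)
print(cm)

tp <- cm["1", "1"]; fp <- cm["1", "0"]   # predicted fraud: correct / incorrect
fn <- cm["0", "1"]; tn <- cm["0", "0"]   # predicted legitimate: incorrect / correct

accuracy  <- (tp + tn) / sum(cm)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

print(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1))
```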

10 of 15

Applying the Decision Tree Model

  • TransactionID was removed because it is not a relevant feature for fraud detection.
  • It does not contain any meaningful information that can help distinguish between fraudulent and non-fraudulent transactions.
  • Keeping it would not improve accuracy and could add unnecessary complexity, so we removed it to ensure the model is not influenced by it.
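
Dropping an identifier column takes one line in base R. The small data frame below is a made-up stand-in; only the column name TransactionID comes from the slide.

```r
# Made-up stand-in data; only the TransactionID column name comes from the slide
transactions <- data.frame(
  TransactionID = 1:5,
  Amount        = c(12.5, 830.0, 45.2, 7.9, 1500.0),
  Class         = c(0, 1, 0, 0, 1)
)

# Keep every column except the identifier
model_data <- transactions[, setdiff(names(transactions), "TransactionID")]
print(names(model_data))
```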

11 of 15

Applying the Decision Tree Model

  • This code trains a decision tree model using the rpart package in R and visualizes the resulting tree.
  • The model learns from the training data to identify patterns that distinguish fraudulent from normal transactions.
  • It creates decision rules to split data at different points, forming a tree structure.
  • The rpart.plot() function (from the rpart.plot package) visualizes the tree, showing how transactions are classified.
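
A minimal sketch of these steps on simulated data. The predictors Amount and Hour and the rule that generates fraud are assumptions for illustration, not the slide's actual code or data. The plot is drawn only if the optional rpart.plot package is installed.

```r
library(rpart)

set.seed(99)
n <- 1000
train <- data.frame(
  Amount = rexp(n, rate = 1 / 80),
  Hour   = sample(0:23, n, replace = TRUE)
)
# Assumed rule for the simulation: large night-time transactions are mostly fraud
risky <- train$Amount > 150 & train$Hour < 6
train$Class <- factor(ifelse(runif(n) < ifelse(risky, 0.8, 0.02),
                             "Fraudulent", "Legitimate"))

# method = "class" asks rpart for a classification tree
tree <- rpart(Class ~ Amount + Hour, data = train, method = "class")
print(tree)

# Plot the tree if the optional rpart.plot package is available
if (requireNamespace("rpart.plot", quietly = TRUE)) {
  rpart.plot::rpart.plot(tree)
}

# Predicted class label for each transaction
pred <- predict(tree, newdata = train, type = "class")
```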

12 of 15

Applying the Decision Tree Model

  • The model has a high overall accuracy of 97.04%, meaning it correctly classified most transactions.
  • P-Value = 0.9428: Accuracy is not significantly better than always guessing the majority class.
  • McNemar’s Test P-Value = 5.797e-06: The two error types are unbalanced. The model misses fraud far more often than it raises false alarms, showing a bias toward non-fraud predictions.
  • Very Low Sensitivity (8.07%): The model misses most fraud cases (57 out of 62).
  • Low Precision (22.73%): Only 22.73% of transactions predicted as fraud are actually fraudulent.
  • Very High Specificity (99.30%): The model is great at identifying normal transactions correctly.
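
The reported figures can be cross-checked against a confusion matrix. In the sketch below, the 5 caught and 57 missed fraud cases come from the slide, and the 17 false positives follow from the 22.73% precision (5 / 22). The true-negative count of 2421 is an assumption: it corresponds to a test set of 2,500 transactions and is chosen because it is consistent with the reported accuracy and specificity. Note that 5 / 62 is about 0.0806, which the slide reports as 8.07%.

```r
# TP and FN come from the slide; FP is implied by the precision;
# TN is an assumption consistent with the reported accuracy and specificity
tp <- 5; fn <- 57; fp <- 17; tn <- 2421
total <- tp + fn + fp + tn

accuracy    <- (tp + tn) / total
sensitivity <- tp / (tp + fn)
precision   <- tp / (tp + fp)
specificity <- tn / (tn + fp)
print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              precision = precision, specificity = specificity), 4))

# Baseline: always predict the majority (non-fraud) class
baseline <- (tn + fp) / total
p_vs_baseline <- binom.test(tp + tn, total, p = baseline,
                            alternative = "greater")$p.value

# McNemar's test compares the two kinds of error (FP vs. FN)
cm <- matrix(c(tn, fp, fn, tp), nrow = 2,
             dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1")))
mcnemar_p <- mcnemar.test(cm)$p.value

print(c(p_vs_baseline = p_vs_baseline, mcnemar_p = mcnemar_p))
```

The one-sided binomial test against the majority-class baseline is one common way such a P-value is obtained; whether the slide's value was computed this way is an assumption.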

13 of 15

Understanding ROC Curves

  1. What is an ROC Curve?
    • The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at different classification thresholds.
    • Used to evaluate how well a model distinguishes between fraudulent and non-fraudulent cases.

  2. Key Observations From Dataset

  • Both the Logistic Regression and Decision Tree models show strong performance.
  • The Logistic Regression curve lies slightly above the Decision Tree curve, suggesting it is a slightly better predictor.
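
A minimal base R sketch of how an ROC curve is built: sweep a threshold over the model's scores and record TPR and FPR at each step. The labels and scores are simulated, and the assumption that fraud cases tend to score higher is built into the simulation.

```r
set.seed(2024)
labels <- rbinom(500, size = 1, prob = 0.1)
# Simulated model scores: fraud cases tend to score higher (an assumption)
scores <- rnorm(500, mean = ifelse(labels == 1, 2, 0))

# Sweep thresholds from high to low and record TPR and FPR at each one
thresholds <- c(Inf, sort(unique(scores), decreasing = TRUE))
tpr <- sapply(thresholds, function(t) mean(scores[labels == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(scores[labels == 0] >= t))

plot(fpr, tpr, type = "l", xlab = "False Positive Rate",
     ylab = "True Positive Rate", main = "ROC curve (simulated scores)")
abline(0, 1, lty = 2)  # diagonal = random guessing
```

A curve that hugs the top-left corner indicates good separation between the classes; a curve near the dashed diagonal indicates a model that does no better than chance.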

14 of 15

Measuring Model Performance with AUC

  1. What is AUC (Area Under the Curve)?
    • AUC evaluates a model’s ability to distinguish between fraud and non-fraud cases.
    • AUC = 1.0 → Perfect model
    • AUC = 0.5 → No predictive power
    • AUC > 0.90 → Good model performance

  2. AUC Scores for Our Models

  • Logistic Regression: 0.9419
  • Decision Tree: 0.9368
  • Both models perform well, with AUC values above 0.90.
  • Logistic Regression has a slightly higher AUC, meaning it separates fraud from non-fraud slightly better on this data.
  • The difference is only 0.0051, so both models are effective.
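
AUC can be read as the probability that a randomly chosen fraud case receives a higher score than a randomly chosen non-fraud case. The sketch below computes it from ranks in base R; the helper name auc_rank and the toy inputs are hypothetical.

```r
# Hypothetical helper: AUC from the rank-sum (Mann-Whitney) statistic
auc_rank <- function(labels, scores) {
  r <- rank(scores)                      # average ranks handle ties
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- c(0, 0, 1, 1)
auc_perfect <- auc_rank(labels, c(0.1, 0.2, 0.8, 0.9))  # perfectly separated scores
auc_chance  <- auc_rank(labels, c(0.5, 0.5, 0.5, 0.5))  # uninformative scores
print(c(perfect = auc_perfect, chance = auc_chance))
```

The two toy cases give 1.0 and 0.5, matching the reference points listed above.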

15 of 15

Hypothesis Testing: Comparing AUC Scores

DeLong’s Test

  • Used because both AUC values are estimated on the same test set and are therefore correlated; DeLong’s test is a non-parametric method for comparing them.
  • Hypothesis:
    • H0 (Null Hypothesis): No significant difference between the two AUC values.
    • H1 (Alternative Hypothesis): A significant difference exists between the two AUC values.
  • Results:
    • P-Value = 0.7044
    • Since 0.7044 > 0.05, we fail to reject the null hypothesis, so the difference is not statistically significant.
    • This suggests that both models perform comparably; the small AUC difference may be due to chance.
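
A minimal base R sketch of DeLong's test for two models scored on the same test set. The helper name delong_test and the simulated scores are hypothetical; this is an illustration of the method, not the code that produced the P-value above. In practice a packaged implementation, such as roc.test() with method = "delong" in the pROC package, is the more usual route.

```r
# Hypothetical helper: DeLong's test for two correlated AUCs
delong_test <- function(labels, score_a, score_b) {
  pos <- labels == 1
  placements <- function(s) {
    # psi[i, j] compares fraud case i with non-fraud case j (ties count as 0.5)
    psi <- outer(s[pos], s[!pos], function(x, y) (x > y) + 0.5 * (x == y))
    list(v10 = rowMeans(psi), v01 = colMeans(psi))
  }
  a <- placements(score_a)
  b <- placements(score_b)
  auc <- c(a = mean(a$v10), b = mean(b$v10))

  # Covariance of the two AUC estimates, from the placement values
  s10 <- cov(cbind(a$v10, b$v10))
  s01 <- cov(cbind(a$v01, b$v01))
  s <- s10 / sum(pos) + s01 / sum(!pos)

  z <- unname(auc[1] - auc[2]) / sqrt(s[1, 1] + s[2, 2] - 2 * s[1, 2])
  list(auc = auc, z = z, p_value = 2 * pnorm(-abs(z)))
}

# Simulated labels and two correlated sets of model scores (assumptions)
set.seed(5)
labels  <- rbinom(400, size = 1, prob = 0.2)
score_a <- rnorm(400, mean = 1.5 * labels)
score_b <- score_a + rnorm(400, sd = 0.5)

result <- delong_test(labels, score_a, score_b)
print(result)
```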