1 of 55

ML Evaluation

Fardina Fathmiul Alam

CMSC 320 - Intro to Data Science

2 of 55

Topics We will Cover

  1. Overfitting and Underfitting
  2. Bias, Variance and Bias-Variance Tradeoff
  3. Train, Test vs Validation Dataset
  4. Cross-Validation Techniques
    1. K-Fold
    2. Stratified K-Fold
    3. LOOCV
  5. Different ML Evaluation Metrics
    • Accuracy
    • Confusion Matrix - Precision, Recall, F1 Score
    • Confidence

3 of 55

Overfitting and Underfitting

4 of 55

Overfitting and Underfitting are two crucial concepts in machine learning and are the prevalent causes for the poor performance of a machine learning model.

5 of 55

Overfitting in Machine Learning

Overfitting occurs when a model performs exceptionally well on training data but poorly on unseen test data.

Key Points:

  • Cause: The model learns noise and random fluctuations in the training data too well, rather than the underlying patterns.
  • Result: An overly complex model that fails to generalize to new data.
  • Impact: Decreased performance on test data, undermining the model's predictive capability.

Think of it like acing a practice exam by memorizing answers rather than understanding the material.

6 of 55

Causes of Overfitting

  • Noisy Data: Irrelevant information in training data leads to learning noise instead of patterns.
  • Model Complexity: Overly complex models memorize noise instead of generalizing.
  • Insufficient Data Size: Small or non-diverse datasets lead to memorization of specific examples.

7 of 55

Some ways to Tackle Overfitting

  • Training model with sufficient data
    • Add more data: More diverse and representative data leads to more robust and generalizable models.
  • Using K-fold cross-validation (splitting train-validation-test) → Next Topic
    • Helps evaluate model performance and promote generalization to new data.
  • Using Regularization techniques (later topic)
    • Reduce overfitting by adding penalties to model parameters, making them less sensitive to noise

8 of 55

Underfitting

Our model is too general! It memorized one rule and is applying it everywhere.

  • Think about college admission process only looking at GPA

Definition: Underfitting occurs when a model fails to learn the patterns in the training data effectively, resulting in poor performance on both the training data and new, unseen test data.

9 of 55

Causes of Underfitting?

  • Too Simple: The model is too simplistic to capture data complexity.
  • Missed Features: Important details are overlooked, weakening data representation.
  • Unreliable Predictions: Poor training performance leads to unreliable predictions on new data.

One way to tackle underfitting is to Increase Model Complexity: if the model is too simple, increasing its complexity can alleviate underfitting. E.g.

  • Add More Layers or Neurons in Neural Network
  • In polynomial regression, you can increase the degree of the polynomial to allow the model to fit more complex curves to the data.
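As a minimal sketch of the polynomial-degree idea (assuming NumPy is available; the data here is made up for illustration), a degree-1 fit underfits data with a cubic pattern, while a degree-3 fit captures it and achieves lower training error:

```python
import numpy as np

# Toy data: a cubic signal plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 40)
y = x**3 - x + rng.normal(scale=0.1, size=x.size)

def train_mse(degree):
    coeffs = np.polyfit(x, y, degree)      # least-squares polynomial fit
    preds = np.polyval(coeffs, x)
    return float(np.mean((preds - y) ** 2))

# A linear model is too simple for the cubic pattern (underfitting);
# raising the degree lowers the training error.
print(train_mse(1) > train_mse(3))  # True
```

Note that raising the degree too far would eventually start fitting the noise instead, i.e., overfitting, which is exactly the tension the next slides describe.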

10 of 55

The Bias/Variance Tradeoff

A fundamental concept in machine learning that deals with the problems of overfitting and underfitting

NEXT

11 of 55

Concept of Bias-Variance Tradeoff

As we want to minimize prediction error on both training and validation datasets, we need to understand two types of errors in model performance:

  1. Bias and
  2. Variance.

12 of 55

Bias

Bias quantifies how much the predicted values differ from the actual (expected) values.

High bias → simple model → poor performance.

Bias refers to the error introduced by approximating a real-world problem (which may be complex) by a simplified model.

13 of 55

Variance

Variance is the variability of a model's prediction for a given data point; it tells us how spread out the predictions are.

High variance → complex model → great performance on training data but poor on new data.

Variance is the error introduced when a model is overly sensitive to fluctuations in the training data (it measures how much the predictions for a given data point change when the model is trained on different subsets of the data).

14 of 55

Demo: Example for Bias and Variance

[Figure] Example fits illustrating underfitting and overfitting.

15 of 55

Diagnosing bias and variance by comparing training set error with test (validation/dev) set error (assume human error is 0% and the optimal (Bayes) error is nearly 0%):

  • Low training error, much higher test error → High Variance
  • High training error, similar test error → High Bias
  • High training error, even higher test error → High Bias & High Variance
  • Low training error, low test error → Low Bias & Low Variance

16 of 55

17 of 55

Generalization Error = Bias² + Variance + Irreducible Error

Irreducible Error: This is the noise inherent in the data itself, which can’t be eliminated regardless of the model.

Bias-Variance Tradeoff

Aim for a model that minimizes both bias and variance for optimal performance.

18 of 55

Basic Recipe of ML to tackle this

  1. HIGH BIAS? (look at training data performance)
    • Yes → try a bigger network, train a longer time; then re-check.
    • No → go to step 2.
  2. HIGH VARIANCE? (look at testing (val/dev) data performance)
    • Yes → get more data, apply regularization techniques; then re-check.
    • No → Done.

19 of 55

Combatting Training Failures

20 of 55

Start with “Testing and Training” (Holdout Method)

21 of 55

Often in Machine Learning Models

The point of having models is that, once they are trained, they will be able to classify new data.

In our bank example, no two loan applicants are identical. What we're hoping is that the model uncovered the underlying rules about who repays loans and who does not, like "higher income is a good sign".

How do we know if it worked?

22 of 55

Mitigating Overfitting: The Importance of Data Splitting in Model Training

Issue: Training on the entire dataset can lead to overfitting, resulting in poor performance on new data. To compensate, sample data is often split into three subsets:

1. Training data (60-80%)

Used to fit a model.

2. Validation Data (10-20%)

Evaluate model performance and aids in parameter tuning and feature selection.

3. Test Data (10-20%)

Assesses final model performance and facilitates model comparison
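The three-way split above can be sketched in plain Python (an illustrative toy; real projects usually reach for a library helper such as scikit-learn's train_test_split):

```python
import random

def train_val_test_split(data, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle and split into train / validation / test subsets (sketch)."""
    items = list(data)
    random.Random(seed).shuffle(items)      # shuffle to avoid ordering bias
    n = len(items)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]          # the remaining ~60-80% trains the model
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

Shuffling before slicing matters: slicing an ordered dataset directly can put one class or one region of the data entirely into a single subset, which is exactly the non-random-split risk discussed later.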

23 of 55

Hide Some Data From Our Algorithm

We Test Our Model!

We already know their target labels, so we can use them to evaluate the model on unseen data.

24 of 55

NEXT: Validation (Tuning or Development) sets

Suppose we want unbiased estimates of accuracy during the learning process (e.g. to choose the best level of decision-tree pruning)?

[Diagram] With a plain holdout, the labeled data is split into a training set and a test set, and the learning process produces a learned model. For unbiased estimates during learning, partition the training data further into separate training/validation sets: candidate models are learned on the training portion, and the validation set is used to select among them.

Remember, in machine learning, both the test and validation datasets are used to evaluate the performance of a trained model, but their purposes are different!

25 of 55

Validation Sets

Unlike the test dataset, the validation dataset is typically a subset of the training data and is used iteratively during model development

  • Primary purpose: helps evaluate the model's performance during the training phase.
  • Not used for training; assists in fine-tuning hyperparameters and model architecture to optimize performance and reduce overfitting.
  • Iterative Use: Used throughout training; unlike the test dataset, which is reserved for final evaluation.

26 of 55

Limitations: the "Train/Validation/Test split" method in ML can be risky!

Non-random splits can lead to overfitting.

E.g. if one part of the data only includes people from a specific state, employees with a certain income level, or only women, it can make the model learn too much about these specific cases.

To prevent this issue, while ensuring robust model evaluation, we use "cross-validation".

27 of 55

The Cross Validation Techniques

A technique for evaluating machine learning model performance by partitioning data into multiple subsets.

NEXT

28 of 55

Cross Validation Technique

Instead of relying on a single validation dataset, cross-validation involves splitting the dataset into multiple parts so that the model is trained and evaluated on different parts of the data (promoting a more balanced evaluation and improving generalization).

29 of 55

Cross Validation Process

Dataset Splitting: Data is divided into train and test.

Training and Validation: Cross Validation

Dataset Splitting: Now training data is divided into subsets, called "folds".

  • Training Phase: In each iteration, the model is trained on all folds except one.
  • Validation Phase: The remaining fold serves as the validation set.

Repetition: This process repeats for each fold as the validation set.

Performance Assessment: After all iterations, performance metrics are averaged to evaluate the model across all folds, ensuring good generalization and reducing overfitting.

Separate Testing Dataset Final Evaluation: After cross-validation, a separate testing dataset (not used in training or validation) is used to evaluate the final model. This ensures an unbiased assessment of the model's performance on entirely unseen data.

30 of 55

Example: Cross validation

[Diagram] The labeled data set is partitioned into n subsamples (s1, s2, s3, s4, s5); we iteratively leave one subsample out for validation and train on the rest.

Suppose we have 100 instances, and we want to estimate accuracy with cross validation. If 73 of the 100 held-out predictions are correct across all folds, then accuracy = 73/100 = 73%.

31 of 55

Types of Cross Validation Techniques

  • K-Fold Cross-Validation
  • Leave-One-Out Cross-Validation (LOOCV)
  • Many more…..

NEXT

32 of 55

  1. K-Fold Cross Validation

A technique to evaluate a machine learning model by dividing the dataset into K equal-sized subsets, or "folds."

How It Works:

  • Data Partitioning: The dataset is randomly split into K folds.
  • Training and Validation: The model is trained on K-1 folds and validated on the remaining fold. This repeats K times, with each fold serving as the validation set once.
  • Performance Aggregation: After all iterations, average the performance metrics from each fold for a comprehensive evaluation.

Example of Training and Validation (Cross Validation, K=4 folds in this case) and Testing Dataset

Common Values for K: Typical values are 5 or 10, adjustable based on dataset size.
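The K-fold bookkeeping can be sketched as pure index arithmetic (a toy illustration; in practice a library class such as scikit-learn's KFold does this, with shuffling options):

```python
def kfold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for K-fold cross-validation (sketch)."""
    # Distribute n points as evenly as possible over k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))     # one fold held out for validation
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, val_idx                       # train on the other K-1 folds
        start += size

# With n=10 and k=5, each point lands in exactly one validation fold.
for train_idx, val_idx in kfold_indices(10, 5):
    print(val_idx)   # [0, 1], [2, 3], [4, 5], [6, 7], [8, 9]
```

In each of the K iterations you would fit the model on `train_idx`, score it on `val_idx`, and average the K scores at the end.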

33 of 55

Steps

34 of 55

(2) Stratified K-Fold Cross-Validation

A variation of k-fold cross-validation where the data is split into k folds, but with the constraint that each fold maintains the same proportion of samples from each class as the entire dataset.

  • useful when the dataset is imbalanced (e.g., one class is much more frequent than the other).
    • ensure each fold has a similar class distribution for fairer model evaluation.

In general, stratified k-fold is recommended for classification tasks, especially when dealing with imbalanced datasets.

KFold (K=5):

Randomly splits data into 5 folds, where the number of instances per class may vary.

StratifiedKFold:

Splits data into 5 folds, ensuring each fold has an equal and proportional distribution of instances from each class A, B and C.
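One simple way to build stratified folds (a sketch of the idea, not scikit-learn's exact algorithm) is to group samples by class and deal each class round-robin across the folds, so every fold inherits the overall class proportions:

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign sample indices to k folds, keeping class proportions balanced (sketch)."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)          # group indices by class
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):  # deal each class round-robin
            folds[pos % k].append(idx)
    return folds

# Imbalanced toy data: 8 samples of class 'A', 4 of class 'B'.
labels = ['A'] * 8 + ['B'] * 4
for fold in stratified_folds(labels, 4):
    print([labels[i] for i in fold])   # each fold: ['A', 'A', 'B']
```

A plain random K-fold on this data could easily produce a fold with no 'B' samples at all, which would make the validation score for that fold meaningless for the minority class.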

35 of 55

(3) Leave One Out Cross Validation (LOOCV)

When every ounce of training data counts

Estimates a model's performance by training it on all data points except one, and then using that left-out point to test the model.

  • Unlike K-Fold, there is no separate validation set.
  • Each iteration focuses on evaluating the model's performance on the left-out observation.
  • This process is repeated for every data point.

Particularly useful for small datasets.

Drawbacks: Computationally expensive and slower for larger datasets. However, it provides a highly reliable performance estimate.

36 of 55

Leave One Out Cross Validation (LOOCV) Process

  • Split a dataset into a training set and a testing set, using all but one observation as part of the training set.
    • Train the model on n-1 data points (where n is the total number of data points).
    • Leave one observation out from the training set (This is where the method gets the name “leave-one-out” cross-validation).
  • This process is repeated 'n' times, with each data point being used as the test set once.
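The LOOCV splits are easy to enumerate (a toy sketch; each iteration would fit the model on the n-1 training indices and score it on the single held-out point):

```python
def loocv_splits(n):
    """Leave-one-out: each data point serves as the test set exactly once (sketch)."""
    for held_out in range(n):
        train = [i for i in range(n) if i != held_out]  # train on the other n-1 points
        yield train, held_out

splits = list(loocv_splits(4))
print(len(splits))   # 4 -- one iteration per data point
print(splits[0])     # ([1, 2, 3], 0)
```

Note this is exactly K-fold with K = n, which is why the cost grows with dataset size: n model fits instead of the usual 5 or 10.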

37 of 55

ML Model Evaluation

Using some evaluation metrics

38 of 55

Accuracy

The simplest way to check how well our model is doing is to look at its accuracy.

100% accuracy is perfect, 0% accuracy is completely wrong.

Question: "If 90 out of 100 predictions are correct for spam detection in an email classification model, what is the model's accuracy?"

Accuracy = 90/100 = 0.9, i.e., 90%
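Accuracy is just the fraction of correct predictions; a minimal sketch of the spam example above (labels are made up to match the 90/100 count):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# The spam example: 90 correct predictions out of 100.
y_true = [1] * 90 + [0] * 10
y_pred = [1] * 90 + [1] * 10   # the last 10 predictions are wrong
print(accuracy(y_true, y_pred))  # 0.9
```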

39 of 55

Class Imbalance

Consider this scenario: imagine a dataset with 90% of the negative class and 10% of the positive class--is 80% accuracy impressive here? What about 90%? A model that always predicts the negative class already achieves 90% accuracy without learning anything.

Class imbalance is a serious issue. Class imbalance happens when an overwhelming majority of your data is a single class.

This is why 'Accuracy' cannot always be used as the sole metric to evaluate model performance.

40 of 55

Situations With Class Imbalance

Class Imbalance Issue: Occurs when one class dominates the data, making one group significantly larger than others.

Examples of Imbalance:

  • Fraud: Most transactions are not fraudulent.
  • Disease: The majority don't have cancer.
  • Purchases: Most people don't buy a specific product.

Common Scenario: A rare positive signal amidst many negatives, which is a prevalent occurrence in practice.

41 of 55

Limitation of “Accuracy” Metric

Issue with Imbalanced Data: Accuracy can mislead when one class dominates.

  • A high accuracy may come from predicting the majority class, ignoring minority performance.

Neglecting Error Types: Accuracy treats all errors equally, not distinguishing between false positives and false negatives.

“Accuracy is utterly useless if the class distribution in your data set is skewed”

So more alternative evaluation metrics are needed. NEXT: Confusion Matrix

42 of 55

Confusion Matrix

A confusion matrix is a matrix that summarizes the performance of a machine learning model on a set of test data.

43 of 55

  • Accuracy

The confusion matrix allows to calculate various evaluation metrics

  • Precision
  • Recall (Sensitivity or True Positive Rate)
  • F1-Score
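These metrics can all be computed directly from the confusion-matrix counts; here is a small self-contained sketch with made-up binary labels (anticipating the precision, recall, and F1 definitions on the next slides):

```python
def confusion_metrics(y_true, y_pred):
    """Compute precision, recall, and F1 from binary labels via TP/FP/FN (sketch)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)       # of predicted positives, how many were right
    recall = tp / (tp + fn)          # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]   # one missed positive (FN), one false alarm (FP)
p, r, f1 = confusion_metrics(y_true, y_pred)
print(p, r, f1)  # 0.75 0.75 0.75
```

With 3 true positives, 1 false positive, and 1 false negative, precision and recall both come out to 3/4 here; in general the two move independently, which is what the later slides explore.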

44 of 55

Evaluation Metric: (1) Precision

Precision is a metric that measures the accuracy of positive predictions made by a model. Out of all positive predictions made, it indicates the proportion that are truly positive:

Precision = TP / (TP + FP)

i.e., of everything predicted positive (TP + FP), how precise (accurate) we are in identifying positive cases.

45 of 55

Example: Precision

Interpretation: Indicates the proportion of true positive results among all positive predictions.

  • Measured on a scale of 0 to 1 or as a percentage.
  • Higher precision is better.
  • A value of 1.0 indicates the model is always correct when predicting the target class and never makes a mistake.

When to use precision? When we only care about being correct about the things we identify as positive.

  • Google does not care if it turns away 1000 good engineers; Google just wants to make sure the ones it DOES hire are good.

46 of 55

Recall (Sensitivity or True Positive Rate)

Recall is a measure of how many actual positives your model is able to recall from the data.

Recall: The proportion of true positives to the total actual positives (TP / (TP + FN)). It measures the model's ability to identify all positive instances.

How good are we at detecting/identifying all the actual positive cases?

47 of 55

Recall

Recall focuses on how many of the positive class we missed (actually positive but predicted as negative, i.e., FN): "Out of all the actual positive cases, how many did the model correctly identify?"

Recall = TP / (TP + FN)

Among the actual positives, TP counts the ones we said were positive and FN counts the ones we missed; the actual negatives do not enter the formula.

48 of 55

Example: Recall

  • Measured on a scale of 0 to 1 or as a percentage.
  • Higher recall is preferable; it indicates fewer false negatives.
  • Crucial in scenarios where missing a positive instance is costly (e.g., disease detection, fraud detection).
  • A value of 1.0 means the model captures all instances of the target class and never misses it in predictions.

We'd use recall when we want to make sure we don't miss anything.

  • For example, identifying people contagious with a deadly super plague.

49 of 55

Precision vs. Recall

  • Testing for cancer
  • Our legal system
  • Fraud alerts
  • Loans

In-class Discussion

In the given scenarios, which metric (precision or recall) holds more practical importance or application?

Think about:

Precision: Aims to be right when it says something is positive (minimize false positives).

Recall: Aims to not miss anything that's actually positive (minimize false negatives).

50 of 55

Precision vs. Recall

  • Testing for cancer
    • Recall: We don't want to miss any actual cancer case (all true cancer records should be flagged), and we can always do further tests to rule out false positives.
  • Our legal system
    • Depends (precision if the aim is to reduce false accusations or arrests; recall if the priority is to capture all crimes, even if that requires extensive investigations).
  • Fraud alerts
    • Recall: Better to deny some good transactions than pay for the fraudulent ones
  • Loans
    • Precision: We only want to loan money to people who will pay it back

In-class Discussion

51 of 55

NEXT: Evaluating Confidence in Predictions: What about our confidence?

Incorporating Confidence into Model Evaluation

Confidence refers to how sure a model is about its prediction.

Machine learning algorithms often provide a confidence score (a probability, indicating the likelihood of a prediction being correct) for each prediction.

  • Different confidence levels can lead to different actions.
    • If a model predicts "yes" with 99% confidence and it's wrong, this should be penalized more than if the model only predicted with 60% confidence.
      • Precision and recall don't account for confidence, treating both predictions the same despite differing certainty.
      • One solution: use log (cross-entropy) loss, which incorporates both accuracy and confidence.

52 of 55

What if Confidence Matters?

Log Loss measures the accuracy of a model's predictions, specifically evaluating the confidence of the predicted probabilities in classification tasks. It penalizes overconfidence by assigning higher penalties to confident but incorrect predictions, encouraging models to deliver outputs that are both accurate and well-calibrated.

A lower log loss indicates better model performance, with values closer to 0 being ideal.

For binary classification:

Log Loss = -(1/N) Σ_i [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ]

For multi-class classification:

Log Loss = -(1/N) Σ_i Σ_j x_ij log(p_ij)

where

  • N = total number of data points (rows)
  • M = number of outcomes or classes (usually 0 and 1 in the binary case)
  • x_ij = actual outcome for data point i and class j
  • p_ij = predicted probability that data point i belongs to class j

Log loss helps us understand how well the model's confidence matches reality, making it a crucial metric for evaluating binary classification models.
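A minimal binary log loss sketch (the clipping constant `eps` is an implementation detail added here to avoid log(0), not something from the slides), showing that a confident wrong prediction is penalized far more than an unsure one:

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary cross-entropy: penalizes confident wrong predictions (sketch)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)               # clip probabilities away from 0 and 1
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# A confident wrong prediction (0.99 for a true negative) costs far more
# than an unsure wrong prediction (0.60 for the same true negative).
print(round(log_loss([0], [0.99]), 3))  # 4.605
print(round(log_loss([0], [0.60]), 3))  # 0.916
```

Accuracy would score both predictions identically (one error each); log loss is what separates them.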

53 of 55

In some scenarios, we aim to minimize both false positives (FPs) and false negatives (FNs), maximizing both precision and recall.

Example: When identifying harmful content:

  • High Precision: Ensures flagged content is accurate.
  • High Recall: Ensures no inappropriate content is overlooked.

Challenge: It's often impossible to maximize both metrics simultaneously due to their trade-off: "Increasing precision typically decreases recall, and vice versa."

Solution (next): F1 Score

54 of 55

F1 Score

The F1 Score is a metric that combines precision and recall into a single value, providing a balanced measure of a model's performance.

Addresses the trade-off between precision and recall by equally weighting both metrics.

Range: F1 Score ranges from 0 to 1; the closer it is to 1, the better the model's performance.

Used when you seek a balance between precision and recall; ideal for imbalanced datasets where ensuring both low false positives and low false negatives is crucial, such as in medical diagnoses or fraud detection.
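For reference, the standard formula behind this combination is the harmonic mean of precision and recall:

```latex
F_1 \;=\; 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```

Because it is a harmonic mean, F1 is pulled toward the smaller of the two values, so a model cannot score well by excelling at only one of precision or recall.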

55 of 55

Summary

Precision: Focuses on minimizing false positives in predictions.

Recall: Focuses on minimizing false negatives in predictions.

Log Loss: Evaluates the alignment between predicted probabilities and true class probabilities, aiming for lower values closer to 0.

F1 Score: Balances both precision and recall in a single metric, useful for scenarios where false positives and false negatives are equally important.