ML Evaluation
Fardina Fathmiul Alam
CMSC 320 - Intro to Data Science
Topics We will Cover
Overfitting and Underfitting
Overfitting and underfitting are two crucial concepts in machine learning and are among the most prevalent causes of poor model performance.
Overfitting in Machine Learning
Overfitting occurs when a model performs exceptionally well on training data but poorly on unseen test data.
Key Points:
Think of it like acing a practice exam by memorizing answers rather than understanding the material.
Causes of Overfitting
Some ways to Tackle Overfitting
Underfitting
Our model is too general! It memorized one rule and applies it everywhere.
Definition: When a model fails to learn the patterns in the training data effectively, resulting in poor performance on both the training data and new, unseen test data, it is known as underfitting.
Causes of Underfitting?
One way to tackle underfitting is to increase model complexity: if the model is too simple, making it more expressive (e.g., more features, a higher-degree model, a deeper network) can alleviate underfitting.
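An illustrative sketch (not from the slides, data invented for the example): a degree-1 model underfits noisy quadratic data, while a slightly more complex degree-2 model captures the pattern.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=0.5, size=100)   # quadratic signal + noise

# Too simple: a straight line cannot follow a parabola
simple = make_pipeline(PolynomialFeatures(degree=1), LinearRegression()).fit(X, y)
# More complex: a degree-2 polynomial matches the underlying pattern
richer = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print("degree-1 train R^2:", round(simple.score(X, y), 3))   # low: underfits
print("degree-2 train R^2:", round(richer.score(X, y), 3))   # high: fits pattern
```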
The Bias/Variance Tradeoff
A fundamental concept in machine learning that deals with the problems of overfitting and underfitting
Concept of Bias-Variance Tradeoff
Since we want to minimize prediction error on both training and validation datasets, we need to understand two types of errors in model performance:
Bias
Bias quantifies how much the predicted values differ from the actual (expected) values.
High bias → simple model → poor performance.
Bias refers to the error introduced by approximating a real-world problem (which may be complex) by a simplified model.
Variance
Variance is the variability of the model's prediction for a given data point; it tells us how spread out the predictions are.
High variance → complex model → great performance on training data but poor on new data.
Variance is the error introduced when a model is overly sensitive to fluctuations in the training data (it measures how much the predictions for a given data point change when the model is trained on different subsets of the data).
Demo: Example for Bias and Variance
Diagnosing from training set error vs. test (validation/dev) set error, assuming human error is 0% and optimal (Bayes) error is nearly 0%:
Low training error, high test error → High Variance (overfitting)
High training error, similar test error → High Bias (underfitting)
High training error, even higher test error → High Bias & High Variance
Low training error, low test error → Low Bias & Low Variance
Generalization Error = Bias^2 + Variance + Irreducible Error
Irreducible Error: This is the noise inherent in the data itself, which can’t be eliminated regardless of the model.
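This decomposition can be checked numerically. The following hypothetical simulation (all numbers invented for illustration) fits a high-bias linear model to quadratic data on many independent training sets, then compares Bias^2 + Variance + Irreducible Error against the measured error at one test point.

```python
import numpy as np

rng = np.random.default_rng(1)
noise_sd = 0.5                      # irreducible error = noise_sd**2 = 0.25
x0 = 2.0                            # the test point we evaluate at
f = lambda x: x ** 2                # true underlying function

preds = []
for _ in range(2000):               # many independent training sets
    x = rng.uniform(-3, 3, 50)
    y = f(x) + rng.normal(scale=noise_sd, size=50)
    slope, intercept = np.polyfit(x, y, 1)      # high-bias linear model
    preds.append(slope * x0 + intercept)
preds = np.array(preds)

bias_sq = (preds.mean() - f(x0)) ** 2           # (average prediction - truth)^2
variance = preds.var()                          # spread of the predictions
noisy_targets = f(x0) + rng.normal(scale=noise_sd, size=preds.size)
mse = ((preds - noisy_targets) ** 2).mean()     # measured generalization error

print(f"bias^2 + variance + noise = {bias_sq + variance + noise_sd**2:.2f}, "
      f"measured MSE = {mse:.2f}")
```

The two printed numbers agree up to simulation noise, matching the formula above.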
Bias-Variance Tradeoff
Aim for a model that minimizes both bias and variance for optimal performance.
Basic Recipe of ML to tackle this
1. HIGH BIAS? (training data performance) If yes: try a bigger network, train a longer time, then check again. If no, go to step 2.
2. HIGH VARIANCE? (testing (val/dev) data performance) If yes: get more data, apply regularization techniques, then check again. If no: done.
Combatting Training Failures
Start with “Testing and Training” (Holdout Method)
Often in Machine Learning Models
The point of having models is that, once they are trained, they will be able to classify new data.
In our bank example, no two loan applicants are identical. What we're hoping is that the model uncovered the underlying rules about who repays loans and who does not, like "higher income is good".
How do we know if it worked?
Mitigating Overfitting: The Importance of Data Splitting in Model Training
Issue: Training on the entire dataset can lead to overfitting, resulting in poor performance on new data. To compensate, sample data is often split into three subsets:
1. Training data (60-80%): used to fit the model.
2. Validation data (10-20%): evaluates model performance and aids in parameter tuning and feature selection.
3. Test data (10-20%): assesses final model performance and facilitates model comparison.
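A minimal sketch of such a split with scikit-learn, assuming a 70/15/15 ratio. train_test_split has no three-way mode, so we split twice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)   # toy feature matrix, 100 rows
y = np.arange(100)                   # toy labels

# First split: 70% train, 30% held back
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, random_state=0)
# Second split: divide the held-back 30% evenly into validation and test
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 70 15 15
```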
Hide Some Data From Our Algorithm
We Test Our Model!
We already know their target labels; we use them to evaluate the model on unseen data.
NEXT: Validation (Tuning or Development) sets
Suppose we want unbiased estimates of accuracy during the learning process (e.g. to choose the best level of decision-tree pruning)?
Basic setup: a learning process fits a learned model on the training set, and the test set estimates its accuracy.
With a validation set: partition the training data into separate training/validation sets, learn candidate models on the training set, and select the best model using the validation set.
Remember, in machine learning, both the test and validation datasets are used to evaluate the performance of a trained model, but their purposes are different!
Validation Sets
Unlike the test dataset, the validation dataset is typically a subset of the training data and is used iteratively during model development
The "Train/Validation/Test split" method in ML can be risky!
Non-random splits can lead to overfitting.
E.g., if one part of the data includes only people from a specific state, employees with a certain income level, or only women, the model can learn too much about these specific cases.
To prevent this issue, we use "cross-validation", which ensures robust model evaluation.
Limitations
The Cross Validation Techniques
A technique for evaluating machine learning model performance by partitioning data into multiple subsets.
Cross Validation Technique
Instead of relying on a single validation dataset, cross-validation splits the dataset into multiple parts so that the model is trained and evaluated on different parts of the data, promoting a more balanced evaluation and improving generalization.
Cross Validation Process
Dataset Splitting: Data is divided into train and test.
Training and Validation: Cross Validation
Dataset Splitting: Now training data is divided into subsets, called "folds".
Repetition: The model is trained on the remaining folds and validated on the held-out fold; this process repeats so that each fold serves once as the validation set.
Performance Assessment: After all iterations, performance metrics are averaged to evaluate the model across all folds, ensuring good generalization and reducing overfitting.
Separate Testing Dataset Final Evaluation: After cross-validation, a separate testing dataset (not used in training or validation) is used to evaluate the final model. This ensures an unbiased assessment of the model's performance on entirely unseen data.
Partition the labeled data set into n subsamples (s1 | s2 | s3 | s4 | s5); iteratively leave one subsample out for the test set and train on the rest.
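A minimal sketch of this process in scikit-learn (dataset and model chosen only for illustration): 5-fold cross-validation on the training data, then a final fit and evaluation on a held-out test set.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000)
# 5 folds on the training data; each fold serves once as the validation set
scores = cross_val_score(model, X_train, y_train, cv=5)
print("fold accuracies:", scores, "mean:", scores.mean())

model.fit(X_train, y_train)          # final model on all training data
print("held-out test accuracy:", model.score(X_test, y_test))
```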
Example: Cross validation
Suppose we have 100 instances, and we want to estimate accuracy with cross validation.
Each instance is held out exactly once, so every instance gets one prediction; if 73 of the 100 held-out predictions are correct: accuracy = 73/100 = 73% (this estimate is used for validation).
Types of Cross Validation Techniques
(1) K-Fold Cross-Validation
How it works: a technique to evaluate a machine learning model by dividing the dataset into K equal-sized subsets, or "folds."
Example of Training and Validation (Cross Validation K=4 Fold in this case) and Testing Dataset
Common Values for K: Typical values are 5 or 10, adjustable based on dataset size.
Steps
(2) Stratified K-Fold Cross-Validation
A variation of k-fold cross-validation where the data is split into k folds, but with the constraint that each fold maintains the same proportion of samples from each class as the entire dataset.
In general, stratified k-fold is recommended for classification tasks, especially when dealing with imbalanced datasets.
KFold (K=5):
Randomly splits data into 5 folds, where the number of instances per class may vary.
StratifiedKFold:
Splits data into 5 folds, ensuring each fold has a proportional distribution of instances from each class A, B, and C.
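A small sketch of this difference, using an invented 90/10 label vector: StratifiedKFold keeps the 9:1 class ratio in every fold, so each validation fold of 20 holds exactly 2 positives.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 90 + [1] * 10)    # imbalanced labels: 90 negative, 10 positive
X = np.zeros((100, 1))               # features are irrelevant to the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_positives = []
for _, val_idx in skf.split(X, y):
    fold_positives.append(int(y[val_idx].sum()))   # positives in this fold

print("positives per fold:", fold_positives)   # [2, 2, 2, 2, 2]
```

With plain KFold and shuffling, the per-fold positive counts could vary (some folds might even have none).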
(3) Leave One Out Cross Validation (LOOCV)
When every ounce of training data counts
Particularly useful for small datasets.
Drawbacks: Computationally expensive and slower for larger datasets. However, it provides a highly reliable performance estimate.
Estimates a model's performance by training it on all data points except one, and then using that left-out point to test the model.
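A minimal LOOCV sketch in scikit-learn (dataset and model chosen only for illustration): with n samples, LeaveOneOut produces n train/test splits, each holding out a single point.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

# One fit per sample: 150 fits for the 150-row iris dataset
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print("number of fits:", len(scores))
print("LOOCV accuracy:", scores.mean())
```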
Leave One Out Cross Validation (LOOCV) Process
ML Model Evaluation
Using some evaluation metrics
Accuracy
The simplest way to check how well our model is doing is to look at its accuracy.
100% accuracy is perfect, 0% accuracy is completely wrong.
Question: "If 90 out of 100 predictions are correct for spam detection in an email classification model, what is the model's accuracy?"
Accuracy = 90/100 = 0.9 = 90%
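The same computation with scikit-learn, using invented labels arranged so that 90 of 100 predictions are correct:

```python
from sklearn.metrics import accuracy_score

y_true = [1] * 90 + [0] * 10   # hypothetical true labels
y_pred = [1] * 90 + [1] * 10   # 90 correct predictions, 10 wrong
print("accuracy:", accuracy_score(y_true, y_pred))   # 0.9
```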
Class Imbalance
Imagine a dataset with 90% negative class and 10% positive class. Is 80% accuracy impressive here? What about 90%?
Consider this scenario:
Class imbalance is a serious issue. It happens when an overwhelming majority of your data belongs to a single class.
However, 'Accuracy' cannot always be used as the sole metric to evaluate model performance
Situations With Class Imbalance
Class Imbalance Issue: Occurs when one class dominates the data, making one group significantly larger than others.
Examples of Imbalance:
Common Scenario: Typically involves a rare positive signal amidst negatives, which is a prevalent occurrence.
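A sketch of the problem with invented data: a "model" that always predicts the majority (negative) class reaches 90% accuracy on a 90/10 dataset while finding zero positives.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 90 + [1] * 10)   # 90% negative, 10% positive
y_pred = np.zeros(100, dtype=int)        # "always predict negative" baseline

print("accuracy:", accuracy_score(y_true, y_pred))   # 0.9 -- looks good
print("recall:", recall_score(y_true, y_pred))       # 0.0 -- catches nothing
```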
Limitation of “Accuracy” Metric
Issue with Imbalanced Data: Accuracy can mislead when one class dominates.
Neglecting Error Types: Accuracy treats all errors equally, not distinguishing between false positives and false negatives.
“Accuracy is utterly useless if the class distribution in your data set is skewed”
So more alternative evaluation metrics are needed. NEXT: Confusion Matrix
Confusion Matrix
A confusion matrix summarizes the performance of a machine learning model on a set of test data.
It allows us to calculate various evaluation metrics.
Evaluation Metric: (1) Precision
Out of all positive predictions made, it indicates the proportion that are truly positive.
Precision is a metric that measures the accuracy of positive predictions made by a model.
The predicted positives are TP + FP, so Precision = TP / (TP + FP): how precise (accurate) we are in identifying positive cases.
Example: Precision
Interpretation: Indicates the proportion of true positive results among all positive predictions.
When to use precision? When we only care about being correct about the things we identify as positive.
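A small sketch computing precision both by hand from the confusion matrix and with scikit-learn (labels invented for illustration):

```python
from sklearn.metrics import confusion_matrix, precision_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical actual labels
y_pred = [1, 1, 1, 0, 0, 1, 1, 0]   # hypothetical model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "FP:", fp)                       # TP: 3  FP: 2
print("precision by hand:", tp / (tp + fp))       # 3 / (3 + 2) = 0.6
print("precision sklearn:", precision_score(y_true, y_pred))
```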
Recall (Sensitivity or True Positive Rate)
Recall is a measure of how many actual positives your model is able to recall from the data.
Recall: The proportion of true positives to the total actual positives (TP / (TP + FN)). It measures the model's ability to identify all positive instances.
Recall
How good are we at detecting/identifying all actual positive cases?
Recall
Recall focuses on how many of the positive class we missed (actually positive but predicted as negative, FN). The actual positives split into the ones we said were positive (TP) and the ones we missed (FN); the actual negatives do not enter recall.
"Out of all the actual positive cases, how many did the model correctly identify?"
Example: Recall
We'd use recall when we want to make sure we don't miss anything.
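A small sketch computing recall by hand from the confusion matrix and with scikit-learn (labels invented for illustration):

```python
from sklearn.metrics import confusion_matrix, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical actual labels
y_pred = [1, 1, 1, 0, 0, 1, 1, 0]   # hypothetical model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "FN:", fn)                     # TP: 3  FN: 1
print("recall by hand:", tp / (tp + fn))        # 3 / (3 + 1) = 0.75
print("recall sklearn:", recall_score(y_true, y_pred))
```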
Precision vs. Recall
In-class Discussion
In the given scenarios, which metric (precision or recall) holds more practical importance or application?
Think about:
Precision: Aims to be right when it says something is positive (minimize false positives).
Recall: Aims to not miss anything that's actually positive (minimize false negatives).
NEXT: Evaluating Confidence in Predictions: What about our confidence?
Incorporating Confidence into Model Evaluation
Confidence refers to how sure a model is about its prediction.
Machine learning algorithms often provide a confidence score (a probability, indicating the likelihood of a prediction being correct) for each prediction.
What if Confidence Matters?
Log Loss is a measure of accuracy that penalizes overconfidence by assigning higher penalties to confident but incorrect predictions, encouraging models to deliver both accurate and confident outputs.
A lower log loss indicates better model performance, with values closer to 0 being ideal.
Notation: p_ij denotes the model's predicted probability for row i and class j.
Log Loss measures the accuracy of a model's predictions, specifically evaluating the confidence of predicted probabilities in classification tasks.
Log loss helps us understand how well the model's confidence matches reality, making it a crucial metric for evaluating binary classification models.
For binary classification:
Log Loss = -(1/N) * Σ_i [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]
For multi-class classification, where N is the no. of rows and M is the no. of classes:
Log Loss = -(1/N) * Σ_i Σ_j y_ij * log(p_ij)
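A sketch of the penalty structure with scikit-learn's log_loss (probabilities invented for illustration): confident wrong predictions cost far more than unsure ones.

```python
from sklearn.metrics import log_loss

y_true = [1, 1]   # both examples are actually positive

# Predicted probability of the positive class in three scenarios
low = log_loss(y_true, [0.9, 0.9], labels=[0, 1])    # confident and correct
mid = log_loss(y_true, [0.6, 0.6], labels=[0, 1])    # unsure
high = log_loss(y_true, [0.1, 0.1], labels=[0, 1])   # confident and wrong

print(f"correct@0.9: {low:.3f}  unsure@0.6: {mid:.3f}  wrong@0.9: {high:.3f}")
```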
In some scenarios, we aim to minimize both false positives (FPs) and false negatives (FNs), maximizing both precision and recall.
Example: When identifying harmful content:
Challenge: It's often impossible to maximize both metrics simultaneously due to their trade-off: "Increasing precision typically decreases recall, and vice versa."
Solution: F1 Score
Next : F1 Score
F1 Score
Range: F1 Score ranges from 0 to 1; the closer it is to 1, the better the model’s performance.
Addresses the trade-off between precision and recall by equally weighting both metrics.
Ideal for imbalanced datasets where ensuring both low false positives and low false negatives is crucial, such as in medical diagnoses or fraud detection.
Used when you seek a balance between precision and recall.
The F1 Score is a metric that combines precision and recall into a single value, providing a balanced measure of a model's performance
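A small sketch computing the F1 score by hand as the harmonic mean of precision and recall, and with scikit-learn (labels invented for illustration):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical actual labels
y_pred = [1, 1, 1, 0, 0, 1, 1, 0]   # hypothetical model predictions

p = precision_score(y_true, y_pred)   # 0.60
r = recall_score(y_true, y_pred)      # 0.75
print("F1 by hand:", 2 * p * r / (p + r))     # harmonic mean of p and r
print("F1 sklearn:", f1_score(y_true, y_pred))
```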
Summary
Precision: Focuses on minimizing false positives in predictions.
Recall: Focuses on minimizing false negatives in predictions.
Log Loss: Evaluates the alignment between predicted probabilities and true class probabilities, aiming for lower values closer to 0.
F1 Score: Balances both precision and recall in a single metric, useful for scenarios where false positives and false negatives are equally important.