MACHINE LEARNING • LECTURE
Model Evaluation
Techniques
Understanding how to evaluate models using metrics like
accuracy, precision, recall, F1-score, AUC-ROC & more
Lecture Series • Data Science & ML • 2025
Why Model Evaluation Matters
A model that looks good on training data may fail in the real world. Evaluation metrics help us measure how well a model generalizes to unseen data — guiding us to select, tune, and trust our models before deployment.
Measure Performance
Quantify how well the model predicts on unseen data using objective metrics
Avoid Overfitting
Detect when a model memorizes training data but fails to generalize
Compare Models
Fairly compare different algorithms, hyperparameters and architectures
Inform Decisions
Choose the right metric for the problem — accuracy alone can be misleading
The Confusion Matrix
                     Predicted Positive     Predicted Negative
Actual Positive      TP = 85                FN = 15
Actual Negative      FP = 10                TN = 90

TP (True Positive): model predicted Positive → actually Positive ✓
TN (True Negative): model predicted Negative → actually Negative ✓
FP (False Positive, Type I Error): model predicted Positive → actually Negative ✗
FN (False Negative, Type II Error): model predicted Negative → actually Positive ✗
The confusion matrix is the foundation for all the classification metrics that follow
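These four counts can be read straight off scikit-learn's confusion_matrix, which for binary labels {0, 1} orders the matrix as [[TN, FP], [FN, TP]]. A minimal sketch reproducing the 200-sample example above (the label arrays are constructed purely for illustration):

import numpy as np
from sklearn.metrics import confusion_matrix

# Rebuild the slide's 200-sample example: 100 actual positives, 100 actual negatives
y_true = np.array([1] * 100 + [0] * 100)
# 85 positives predicted 1 (TP), 15 predicted 0 (FN);
# 10 negatives predicted 1 (FP), 90 predicted 0 (TN)
y_pred = np.array([1] * 85 + [0] * 15 + [1] * 10 + [0] * 90)

# ravel() unpacks the binary matrix in [[TN, FP], [FN, TP]] order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fn, fp, tn)  # 85 15 10 90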
Core Classification Metrics
Accuracy
(TP + TN) / (TP + TN + FP + FN) = 87.5%
Overall fraction of correct predictions. Simple but misleading on imbalanced datasets.
⚠ Fails on imbalanced data: a model can score 95% 'accuracy' just by always predicting the majority class!
Precision
TP / (TP + FP) = 89.5%
Of all predicted positives, how many were actually positive? Focus: minimize false alarms.
Use when False Positives are costly — e.g., spam filter, fraud alerts
Recall (Sensitivity)
TP / (TP + FN) = 85.0%
Of all actual positives, how many did we catch? Focus: minimize missed cases.
Use when False Negatives are costly — e.g., cancer screening, fault detection
F1-Score
2 × (Precision × Recall) / (Precision + Recall) = 87.2%
Harmonic mean of Precision and Recall. Best single metric when both matter.
Ideal for imbalanced classes where you need to balance precision & recall
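The four percentages above follow directly from the confusion-matrix counts (TP=85, FN=15, FP=10, TN=90); a quick sanity check in plain Python:

# Counts from the confusion matrix above
tp, fn, fp, tn = 85, 15, 10, 90

accuracy = (tp + tn) / (tp + tn + fp + fn)          # 0.875
precision = tp / (tp + fp)                          # ≈ 0.895
recall = tp / (tp + fn)                             # 0.850
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.872

print(f"Accuracy {accuracy:.1%}, Precision {precision:.1%}, "
      f"Recall {recall:.1%}, F1 {f1:.1%}")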
ROC Curve & AUC Score
Key Concepts:
TPR (Recall)
TP / (TP + FN) | True Positive Rate — Y-axis of ROC curve
FPR
FP / (FP + TN) | False Positive Rate — X-axis of ROC curve
AUC = 1.0
Perfect Model | Classifies all examples correctly
AUC = 0.5
Random Classifier | No discriminative power (diagonal line)
AUC > 0.9
Excellent Model | Strong performance, well-separated classes
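A minimal sketch of plotting a ROC curve and computing AUC with scikit-learn. The dataset and model here are placeholder choices (synthetic make_classification data, logistic regression), meant only to show that ROC needs probability scores rather than hard 0/1 labels:

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder synthetic data, just to produce a curve
X, y = make_classification(n_samples=1000, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# ROC is computed from probability scores, not hard 0/1 predictions
y_prob = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, y_prob)   # FPR = x-axis, TPR = y-axis
print(f"AUC = {roc_auc_score(y_te, y_prob):.3f}")

plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], "--", label="random (AUC = 0.5)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()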
Regression Evaluation Metrics
MAE — Mean Absolute Error
(1/n) × Σ |yᵢ - ŷᵢ|
Average of absolute differences between predicted and actual values. Easy to interpret — same units as target.
✓ Robust to outliers, easy to interpret
✗ Does not penalize large errors heavily
MSE — Mean Squared Error
(1/n) × Σ (yᵢ - ŷᵢ)²
Average of squared differences. Penalizes large errors more heavily than MAE due to squaring.
✓ Penalizes large errors (differentiable)
✗ Units are squared — hard to interpret
RMSE — Root Mean Squared Error
√( (1/n) × Σ (yᵢ - ŷᵢ)² )
Square root of MSE. Returns to the same units as target — most commonly used regression metric.
✓ Same units as target, penalizes outliers
✗ Still sensitive to large errors
R² — Coefficient of Determination
1 - (SS_res / SS_tot)
Proportion of variance in target explained by the model. R²=1 is perfect; R²=0 means model = mean baseline.
✓ Scale-independent, intuitive (typically 0–1)
✗ Can be negative for very poor models
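All four metrics are one call each in scikit-learn; a minimal sketch with hypothetical actual/predicted values:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual vs. predicted values, for illustration only
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)   # mean |y - ŷ|   -> 0.75
mse = mean_squared_error(y_true, y_pred)    # mean (y - ŷ)²  -> 0.875
rmse = np.sqrt(mse)                         # back in the target's units
r2 = r2_score(y_true, y_pred)               # 1 - SS_res / SS_tot

print(mae, mse, rmse, r2)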
Cross-Validation
Cross-validation splits data multiple times to train and evaluate the model on different subsets — giving a more reliable estimate of performance than a single train/test split.
k-Fold Cross-Validation (k=5):
Fold 1:  TEST   Train  Train  Train  Train
Fold 2:  Train  TEST   Train  Train  Train
Fold 3:  Train  Train  TEST   Train  Train
Fold 4:  Train  Train  Train  TEST   Train
Fold 5:  Train  Train  Train  Train  TEST
k-Fold CV: general purpose, balanced classes. Split data into k equal folds; each fold is used once as the test set.
Stratified k-Fold: imbalanced classification datasets. Each fold preserves the percentage of samples for each class.
Leave-One-Out (LOO): small datasets where every sample matters. Each sample is used once as the test set; very expensive for large datasets.
Time-Series Split: time-series, sequential data. Respects temporal order — train only on past data to test on future. (See the sketch below.)
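Each strategy is a drop-in splitter object passed to cross_val_score. A minimal sketch comparing three of them; the synthetic imbalanced dataset and logistic-regression model are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (KFold, StratifiedKFold,
                                     TimeSeriesSplit, cross_val_score)

# Assumed synthetic imbalanced dataset and model, for illustration only
X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)
model = LogisticRegression(max_iter=1000)

# Switching strategy is just switching the splitter passed to cv=
for cv in (KFold(n_splits=5, shuffle=True, random_state=0),
           StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
           TimeSeriesSplit(n_splits=5)):
    scores = cross_val_score(model, X, y, cv=cv, scoring='f1')
    print(f"{type(cv).__name__:>16}: {scores.mean():.3f} ± {scores.std():.3f}")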
Overfitting, Underfitting & Bias-Variance
Underfitting
High Bias, Low Variance
Train Error: High
Val Error: High
Too Simple
Model is too simple — misses patterns in training data and generalizes poorly.
Fix:
More complex model, more features, less regularization
Good Fit
Low Bias, Low Variance
Train Error: Low
Val Error: Low
Sweet Spot
Model captures the underlying pattern without memorizing noise.
Fix:
Ideal! Use cross-validation to confirm consistency
Overfitting
Low Bias, High Variance
Train Error: Very Low
Val Error: High
Too Complex
Model memorizes training data — it performs great on the training set but poorly on the test set.
Fix:
Regularization, more data, dropout, simpler model
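The three regimes can be diagnosed by comparing train and validation scores as model complexity grows; a minimal sketch using validation_curve, with a decision tree's max_depth as the (assumed) complexity knob:

from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Assumed synthetic dataset, for illustration only
X, y = make_classification(n_samples=500, n_informative=5, random_state=0)

# Sweep tree depth: shallow trees underfit, deep trees overfit
depths = [1, 2, 4, 8, 16]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5)

for d, tr, va in zip(depths,
                     train_scores.mean(axis=1), val_scores.mean(axis=1)):
    # Both scores low -> underfitting; large train-val gap -> overfitting
    print(f"max_depth={d:2d}  train={tr:.3f}  val={va:.3f}  gap={tr - va:.3f}")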
Choosing the Right Metric
Problem Type   | Use Case / Context                       | Recommended Metric
Classification | Balanced classes, general purpose        | Accuracy
Classification | Imbalanced classes                       | F1-Score / ROC-AUC
Classification | FP is costly (spam filter, fraud alert)  | Precision
Classification | FN is costly (cancer, fraud detection)   | Recall / Sensitivity
Classification | Multi-class, probability output needed   | Log Loss / ROC-AUC
Regression     | Equal importance of all errors           | MAE
Regression     | Large errors should be penalized         | RMSE / MSE
Regression     | Explain variance (% explained)           | R² Score
💡 There is no universal best metric — choose based on your problem, business cost, and data distribution!
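In scikit-learn, most metrics in this table map to standard scoring strings ('accuracy', 'f1', 'roc_auc', 'precision', 'recall', 'neg_mean_absolute_error', 'r2', ...), so one model can be evaluated through several lenses at once. A minimal sketch on an assumed synthetic imbalanced dataset:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Assumed synthetic imbalanced dataset, for illustration only
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)
model = LogisticRegression(max_iter=1000)

# One model, several lenses: each table row maps to a scoring string
for scoring in ('accuracy', 'f1', 'roc_auc', 'precision', 'recall'):
    scores = cross_val_score(model, X, y, cv=5, scoring=scoring)
    print(f"{scoring:>9}: {scores.mean():.3f}")

On data like this, accuracy will look flattering while f1 and recall expose the weakness on the minority class — exactly the trap the table warns about.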
Python Implementation
scikit-learn — Full Evaluation Pipeline
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, classification_report,
    mean_absolute_error, mean_squared_error, r2_score)
from sklearn.model_selection import cross_val_score, StratifiedKFold
import numpy as np

# Assumes a fitted model plus X, y, y_test, hard predictions y_pred,
# and positive-class probabilities y_prob = model.predict_proba(X_test)[:, 1]

# --- Classification Metrics ---
print(accuracy_score(y_test, y_pred))
print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))
print(f1_score(y_test, y_pred))
print(roc_auc_score(y_test, y_prob))    # needs scores/probabilities, not labels
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# --- Regression Metrics (y_test, y_pred from a regressor) ---
print(mean_absolute_error(y_test, y_pred))
mse = mean_squared_error(y_test, y_pred)
print(np.sqrt(mse))  # RMSE
print(r2_score(y_test, y_pred))

# --- Cross-Validation ---
cv = StratifiedKFold(n_splits=5, shuffle=True)
scores = cross_val_score(model, X, y, cv=cv, scoring='f1')
print(f"F1: {scores.mean():.3f} ± {scores.std():.3f}")
Quick Reference:
accuracy_score • precision_score • recall_score • f1_score • roc_auc_score • r2_score (all in sklearn.metrics)
Always evaluate multiple metrics — no single number tells the full story of your model's performance!