MACHINE LEARNING • LECTURE
Model Evaluation
Techniques
Understanding how to evaluate models using metrics like
accuracy, precision, recall, F1-score, AUC-ROC & more
Lecture Series • Data Science & ML • 2025
Why Model Evaluation Matters
A model that looks good on training data may fail in the real world. Evaluation metrics help us measure how well a model generalizes to unseen data — guiding us to select, tune, and trust our models before deployment.
Measure Performance
Quantify how well the model predicts on unseen data using objective metrics
Avoid Overfitting
Detect when a model memorizes training data but fails to generalize
Compare Models
Fairly compare different algorithms, hyperparameters and architectures
Inform Decisions
Choose the right metric for the problem — accuracy alone can be misleading
The Confusion Matrix
                     Predicted Positive     Predicted Negative
Actual Positive      TP = 85                FN = 15
Actual Negative      FP = 10                TN = 90

TP (True Positive): model predicted Positive → actually Positive ✓
TN (True Negative): model predicted Negative → actually Negative ✓
FP (False Positive, Type I Error): model predicted Positive → actually Negative ✗
FN (False Negative, Type II Error): model predicted Negative → actually Positive ✗
The confusion matrix is the foundation for all the classification metrics that follow
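These four counts can be read straight off scikit-learn's confusion_matrix, which for binary labels {0, 1} orders the matrix as [[TN, FP], [FN, TP]]. A minimal sketch reproducing the 200-sample example above (the label arrays are constructed purely for illustration):

import numpy as np
from sklearn.metrics import confusion_matrix

# Rebuild the slide's 200-sample example: 100 actual positives, 100 actual negatives
y_true = np.array([1] * 100 + [0] * 100)
# 85 positives predicted 1 (TP), 15 predicted 0 (FN);
# 10 negatives predicted 1 (FP), 90 predicted 0 (TN)
y_pred = np.array([1] * 85 + [0] * 15 + [1] * 10 + [0] * 90)

# ravel() unpacks the binary matrix in [[TN, FP], [FN, TP]] order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fn, fp, tn)  # 85 15 10 90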
Core Classification Metrics
Accuracy
(TP + TN) / (TP + TN + FP + FN) = 87.5%
Overall fraction of correct predictions. Simple but misleading on imbalanced datasets.
⚠ Fails on imbalanced data: a model can score 95% 'accuracy' just by always predicting the majority class!
Precision
TP / (TP + FP) = 89.5%
Of all predicted positives, how many were actually positive? Focus: minimize false alarms.
Use when False Positives are costly — e.g., spam filter, fraud alerts
Recall (Sensitivity)
TP / (TP + FN) = 85.0%
Of all actual positives, how many did we catch? Focus: minimize missed cases.
Use when False Negatives are costly — e.g., cancer screening, fault detection
F1-Score
2 × (Precision × Recall) / (Precision + Recall) = 87.2%
Harmonic mean of Precision and Recall. Best single metric when both matter.
Ideal for imbalanced classes where you need to balance precision & recall
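The four percentages above follow directly from the confusion-matrix counts (TP=85, FN=15, FP=10, TN=90); a quick sanity check in plain Python:

# Counts from the confusion matrix above
tp, fn, fp, tn = 85, 15, 10, 90

accuracy = (tp + tn) / (tp + tn + fp + fn)          # 0.875
precision = tp / (tp + fp)                          # ≈ 0.895
recall = tp / (tp + fn)                             # 0.850
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.872

print(f"Accuracy {accuracy:.1%}, Precision {precision:.1%}, "
      f"Recall {recall:.1%}, F1 {f1:.1%}")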
ROC Curve & AUC Score
Key Concepts:
TPR (Recall)
TP / (TP + FN) | True Positive Rate — Y-axis of ROC curve
FPR
FP / (FP + TN) | False Positive Rate — X-axis of ROC curve
AUC = 1.0
Perfect Model | Classifies all examples correctly
AUC = 0.5
Random Classifier | No discriminative power (diagonal line)
AUC > 0.9
Excellent Model | Strong performance, well-separated classes
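A minimal sketch of plotting a ROC curve and computing AUC with scikit-learn. The dataset and model here are placeholder choices (synthetic make_classification data, logistic regression), meant only to show that ROC needs probability scores rather than hard 0/1 labels:

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder synthetic data, just to produce a curve
X, y = make_classification(n_samples=1000, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# ROC is computed from probability scores, not hard 0/1 predictions
y_prob = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, y_prob)   # FPR = x-axis, TPR = y-axis
print(f"AUC = {roc_auc_score(y_te, y_prob):.3f}")

plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], "--", label="random (AUC = 0.5)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()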
Regression Evaluation Metrics
MAE — Mean Absolute Error
(1/n) × Σ |yᵢ - ŷᵢ|
Average of absolute differences between predicted and actual values. Easy to interpret — same units as target.
✓ Robust to outliers, easy to interpret
✗ Does not penalize large errors heavily
MSE — Mean Squared Error
(1/n) × Σ (yᵢ - ŷᵢ)²
Average of squared differences. Penalizes large errors more heavily than MAE due to squaring.
✓ Penalizes large errors (differentiable)
✗ Units are squared — hard to interpret
RMSE — Root Mean Squared Error
√( (1/n) × Σ (yᵢ - ŷᵢ)² )
Square root of MSE. Returns to the same units as target — most commonly used regression metric.
✓ Same units as target, penalizes outliers
✗ Still sensitive to large errors
R² — Coefficient of Determination
1 - (SS_res / SS_tot)
Proportion of variance in target explained by the model. R²=1 is perfect; R²=0 means model = mean baseline.
✓ Scale-independent, intuitive (typically 0–1)
✗ Can be negative for very poor models
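All four metrics are one call each in scikit-learn; a minimal sketch with hypothetical actual/predicted values:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual vs. predicted values, for illustration only
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)   # mean |y - ŷ|   -> 0.75
mse = mean_squared_error(y_true, y_pred)    # mean (y - ŷ)²  -> 0.875
rmse = np.sqrt(mse)                         # back in the target's units
r2 = r2_score(y_true, y_pred)               # 1 - SS_res / SS_tot

print(mae, mse, rmse, r2)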
Cross-Validation
Cross-validation splits data multiple times to train and evaluate the model on different subsets — giving a more reliable estimate of performance than a single train/test split.
k-Fold Cross-Validation (k=5):
Fold 1:  TEST   Train  Train  Train  Train
Fold 2:  Train  TEST   Train  Train  Train
Fold 3:  Train  Train  TEST   Train  Train
Fold 4:  Train  Train  Train  TEST   Train
Fold 5:  Train  Train  Train  Train  TEST
k-Fold CV: general purpose, balanced classes. Split data into k equal folds; each fold is used once as the test set.
Stratified k-Fold: imbalanced classification datasets. Each fold preserves the percentage of samples for each class.
Leave-One-Out (LOO): small datasets where every sample matters. Each sample is used once as the test set; very expensive for large datasets.
Time-Series Split: time-series, sequential data. Respects temporal order — train only on past data to test on future. (See the sketch below.)
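Each strategy is a drop-in splitter object passed to cross_val_score. A minimal sketch comparing three of them; the synthetic imbalanced dataset and logistic-regression model are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (KFold, StratifiedKFold,
                                     TimeSeriesSplit, cross_val_score)

# Assumed synthetic imbalanced dataset and model, for illustration only
X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)
model = LogisticRegression(max_iter=1000)

# Switching strategy is just switching the splitter passed to cv=
for cv in (KFold(n_splits=5, shuffle=True, random_state=0),
           StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
           TimeSeriesSplit(n_splits=5)):
    scores = cross_val_score(model, X, y, cv=cv, scoring='f1')
    print(f"{type(cv).__name__:>16}: {scores.mean():.3f} ± {scores.std():.3f}")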
Overfitting, Underfitting & Bias-Variance
Underfitting
High Bias, Low Variance
Train Error: High
Val Error: High
Too Simple
Model is too simple — misses patterns in training data and generalizes poorly.
Fix:
More complex model, more features, less regularization
Good Fit
Low Bias, Low Variance
Train Error: Low
Val Error: Low
Sweet Spot
Model captures the underlying pattern without memorizing noise.
Fix:
Ideal! Use cross-validation to confirm consistency
Overfitting
Low Bias, High Variance
Train Error: Very Low
Val Error: High
Too Complex
Model memorizes training data — it performs great on the training set but poorly on the test set.
Fix:
Regularization, more data, dropout, simpler model
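The three regimes can be diagnosed by comparing train and validation scores as model complexity grows; a minimal sketch using validation_curve, with a decision tree's max_depth as the (assumed) complexity knob:

from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Assumed synthetic dataset, for illustration only
X, y = make_classification(n_samples=500, n_informative=5, random_state=0)

# Sweep tree depth: shallow trees underfit, deep trees overfit
depths = [1, 2, 4, 8, 16]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5)

for d, tr, va in zip(depths,
                     train_scores.mean(axis=1), val_scores.mean(axis=1)):
    # Both scores low -> underfitting; large train-val gap -> overfitting
    print(f"max_depth={d:2d}  train={tr:.3f}  val={va:.3f}  gap={tr - va:.3f}")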
Choosing the Right Metric
Problem Type   | Use Case / Context                       | Recommended Metric
Classification | Balanced classes, general purpose        | Accuracy
Classification | Imbalanced classes                       | F1-Score / ROC-AUC
Classification | FP is costly (spam filter, fraud alert)  | Precision
Classification | FN is costly (cancer, fraud detection)   | Recall / Sensitivity
Classification | Multi-class, probability output needed   | Log Loss / ROC-AUC
Regression     | Equal importance of all errors           | MAE
Regression     | Large errors should be penalized         | RMSE / MSE
Regression     | Explain variance (% explained)           | R² Score
💡 There is no universal best metric — choose based on your problem, business cost, and data distribution!
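In scikit-learn, most metrics in this table map to standard scoring strings ('accuracy', 'f1', 'roc_auc', 'precision', 'recall', 'neg_mean_absolute_error', 'r2', ...), so one model can be evaluated through several lenses at once. A minimal sketch on an assumed synthetic imbalanced dataset:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Assumed synthetic imbalanced dataset, for illustration only
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)
model = LogisticRegression(max_iter=1000)

# One model, several lenses: each table row maps to a scoring string
for scoring in ('accuracy', 'f1', 'roc_auc', 'precision', 'recall'):
    scores = cross_val_score(model, X, y, cv=5, scoring=scoring)
    print(f"{scoring:>9}: {scores.mean():.3f}")

On data like this, accuracy will look flattering while f1 and recall expose the weakness on the minority class — exactly the trap the table warns about.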
Python Implementation
scikit-learn — Full Evaluation Pipeline
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, classification_report,
    mean_absolute_error, mean_squared_error, r2_score)
from sklearn.model_selection import cross_val_score, StratifiedKFold
import numpy as np

# Assumes a fitted model plus X, y, y_test, hard predictions y_pred,
# and positive-class probabilities y_prob = model.predict_proba(X_test)[:, 1]

# --- Classification Metrics ---
print(accuracy_score(y_test, y_pred))
print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))
print(f1_score(y_test, y_pred))
print(roc_auc_score(y_test, y_prob))    # needs scores/probabilities, not labels
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# --- Regression Metrics (y_test, y_pred from a regressor) ---
print(mean_absolute_error(y_test, y_pred))
mse = mean_squared_error(y_test, y_pred)
print(np.sqrt(mse))  # RMSE
print(r2_score(y_test, y_pred))

# --- Cross-Validation ---
cv = StratifiedKFold(n_splits=5, shuffle=True)
scores = cross_val_score(model, X, y, cv=cv, scoring='f1')
print(f"F1: {scores.mean():.3f} ± {scores.std():.3f}")
Quick Reference:
accuracy_score • precision_score • recall_score • f1_score • roc_auc_score • r2_score (all in sklearn.metrics)
Always evaluate multiple metrics — no single number tells the full story of your model's performance!