1 of 137

Machine Learning for English Analysis

Prof. Seungtaek Choi

2 of 137

Recap

  • Neural Networks

3 of 137

Recap: Neural Networks

4 of 137

5 of 137

6 of 137

Multiple Perceptrons

  • With one more layer…


First layer

Second layer

7 of 137

Feature Learning

This looks a lot like logistic regression.

The only difference is that, instead of taking a raw feature vector as input, the features are the values computed by the hidden layer.
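As a sketch in plain Python (all weights here are hypothetical toy values), the hidden layer computes the "learned features", and the output unit applies exactly the logistic-regression form to them:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(x, W, b):
    # one fully connected layer: h[j] = sigmoid(W[j] . x + b[j])
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj)
            for row, bj in zip(W, b)]

# hypothetical toy weights: 2 inputs -> 2 hidden units -> 1 output
W1, b1 = [[0.5, -0.3], [0.8, 0.2]], [0.1, -0.1]
w2, b2 = [1.0, -1.0], 0.0

x = [1.0, 2.0]
h = layer(x, W1, b1)   # the hidden layer's outputs act as learned features
# the output unit has exactly the logistic-regression form, applied to h
y = sigmoid(sum(w * hi for w, hi in zip(w2, h)) + b2)
```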

8 of 137

Multi-Layer Perceptron (MLP) = Artificial Neural Network

  • Why do we need multiple layers?


Feature learning: the hidden layers apply nonlinear mappings that make the data linearly separable, so the last layer can do simple linear classification.

9 of 137

Recap: Backpropagation

10 of 137

Gradients in ANN


11 of 137

Training Neural Networks: Backpropagation Learning

  • Forward propagation
    • the initial information propagates up to the hidden units at each layer and finally produces output

  • Backpropagation
    • allows the information from the cost to flow backwards through the network in order to compute the gradients
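These two passes can be sketched on a minimal network (one input, one sigmoid hidden unit, squared-error loss; all values are hypothetical), with a finite-difference check confirming that the backward pass computes the true gradient:

```python
import math

# one input -> one sigmoid hidden unit -> linear output, squared-error loss
def forward(x, w1, w2, t):
    h = 1.0 / (1.0 + math.exp(-w1 * x))    # hidden activation
    y = w2 * h                             # network output
    loss = 0.5 * (y - t) ** 2              # cost
    return h, y, loss

def backward(x, w1, w2, t):
    # backward pass: chain the gradient of the cost back through each layer
    h, y, _ = forward(x, w1, w2, t)
    dy = y - t                   # dL/dy
    dw2 = dy * h                 # dL/dw2 = dL/dy * dy/dw2
    dh = dy * w2                 # dL/dh  = dL/dy * dy/dh
    dw1 = dh * h * (1 - h) * x   # sigmoid'(z) = h(1-h), then dz/dw1 = x
    return dw1, dw2

# finite-difference check: backprop must match the numerical gradient
x, w1, w2, t, eps = 0.5, 0.3, -0.7, 1.0, 1e-6
dw1, dw2 = backward(x, w1, w2, t)
num_dw1 = (forward(x, w1 + eps, w2, t)[2] - forward(x, w1 - eps, w2, t)[2]) / (2 * eps)
num_dw2 = (forward(x, w1, w2 + eps, t)[2] - forward(x, w1, w2 - eps, t)[2]) / (2 * eps)
```

The analytic gradients agree with the numerical ones up to floating-point error, which is exactly what gradient descent needs.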


12 of 137

Backpropagation

  • Backpropagation applies the chain rule to compute ∂L/∂w, the gradient of the loss with respect to every weight.

These are what we need for GD

13 of 137

14 of 137

15 of 137

Recap: Activation Function


25 of 137

Today

  • Supplementary for Assignment 2
  • Evaluation
    • Protocol
    • Reproducibility
  • Model Ensemble
  • Regularization (dropout)
  • Deep Neural Network w/ CNN & RNN

26 of 137

Supplementary for Assignment 2

27 of 137

Google Colab

  • Google Colab: https://colab.google/

28 of 137

Google Colab

29 of 137

Google Colab

Ctrl + Enter to run the cell

or just click the button

30 of 137

Google Colab

Now it's time to practice with the lecture materials

31 of 137

Google Colab

Select the GitHub option

to bring our material from GitHub

32 of 137

Google Colab

Input the link of the provided material (.ipynb file)

33 of 137

Google Colab

Then click this!

34 of 137

Google Colab

On the right side of your screen

35 of 137

Google Colab

You can use the free-tier GPU

for practice

36 of 137

Google Colab

Run all the cells

37 of 137

Google Colab

Cell types: Code and Text (markdown)

38 of 137

Google Colab

You can add modules

39 of 137

Google Colab

After adding this path

You can import the module

40 of 137

Optimizer

41 of 137

Gradient Descent

42 of 137


48 of 137

Mini-Batches

49 of 137

Gradient Descent


52 of 137

Mini-batches while training

  • More accurate estimation of gradient
    • Smoother convergence
    • Allows for larger learning rates

  • Mini-batches lead to fast training!
    • Can parallelize computation + achieve significant speed increases on GPUs

53 of 137

Regularization


58 of 137

Lowering the capacity of the model

- discouraging the model from relying on a single pathway

- forcing the model to learn multiple pathways to make a single decision

60 of 137

Evaluation: Protocol

61 of 137

Evaluation Protocol

  • Separate Train / Validation / Test to prevent information leakage.
    • Train: fit model parameters.
    • Validation: pick hyper-params, early-stop, thresholds, feature choices, compare models.
    • Test (final only): one-time report, untouched during model/design decisions.

In other words, it is about the model selection strategy.
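A minimal sketch of such a split (pure Python; the fractions and seed are just example choices):

```python
import random

def split(data, val_frac=0.1, test_frac=0.1, seed=0):
    # shuffle once with a fixed seed, then carve out val/test before any training
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(data) * test_frac)
    n_val = int(len(data) * val_frac)
    test = [data[i] for i in idx[:n_test]]               # untouched until the final report
    val = [data[i] for i in idx[n_test:n_test + n_val]]  # for model selection
    train = [data[i] for i in idx[n_test + n_val:]]      # for fitting parameters
    return train, val, test

train, val, test = split(list(range(100)))
```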

62 of 137

Typical Leakage #1: Tuning on the test set

  • Problem: Selecting model/HPs by comparing “test” scores.
  • Why it inflates: You effectively overfit to the test idiosyncrasies.
  • Fix:
    • Choose with validation (or nested CV), then run test once.
    • Freeze all decisions before touching test.

  • Minimal pattern
    • Fit on TRAIN 🡪 Select on VAL 🡪 Report once on TEST

63 of 137

Typical Leakage #2: Fit Scalers/PCA on all data

  • Problem: Data-dependent transforms (standardization, PCA, feature selection) fit on train+val+test.
  • Why it inflates: Test distribution statistics are baked into the transform.
  • Fix:
    • Fit transforms on train only; transform val/test.
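For instance, standardization done correctly fits its statistics on the training set only (a sketch using the standard library's statistics module; the toy numbers are hypothetical):

```python
from statistics import mean, pstdev

def fit_scaler(train_x):
    # statistics computed from TRAIN ONLY
    mu, sigma = mean(train_x), pstdev(train_x)
    return lambda xs: [(x - mu) / sigma for x in xs]

train_x = [1.0, 2.0, 3.0, 4.0]
test_x = [10.0, 20.0]

scale = fit_scaler(train_x)
train_z = scale(train_x)   # standardized with train statistics
test_z = scale(test_x)     # same transform; test statistics never leak in
```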

64 of 137

Typical Leakage #3: Augment/Oversample on val/test

  • Problem: Applying strong data augmentation or oversampling (e.g., SMOTE) to val/test.
  • Why it leaks/warps:
    • Augmentations can smooth out randomness 🡪 overly stable/optimistic scores.
    • Oversampling changes class priors; you end up evaluating on synthetic/altered data.
  • Fix:
    • Train only: strong augmentation, oversampling, class-balanced sampling.
    • Val/Test: deterministic, minimal transforms (resize/center-crop/normalize).
    • If using Test-Time Augmentation (TTA) 🡪 report it explicitly.

65 of 137

If there is no validation split?

  • K-fold cross-validation
    • green = train
    • yellow = test
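A sketch of how k-fold index generation might look (pure Python; contiguous folds for simplicity, with no shuffling or stratification):

```python
def kfold_indices(n, k):
    # split indices 0..n-1 into k folds; each fold serves once as the held-out set
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    for i in range(k):
        held_out = folds[i]                                      # the "yellow" fold
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]  # the "green" folds
        yield train, held_out

splits = list(kfold_indices(10, 5))
```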

66 of 137

Reproducibility vs. Replicability

  • Reproducibility: the same code and data yield the same results. Replicability: an independent re-implementation (or new data) yields consistent findings.

67 of 137

Dropout should be turned off at test time.

  • BatchNorm / Dropout left in train mode.
    • Train mode: Dropout active; BN uses batch stats 🡪 unstable, biased metrics.
    • Fix: model.eval() + torch.no_grad() for inference; return to model.train() afterward.
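The train/eval distinction can be illustrated with inverted dropout in plain Python (a sketch, not PyTorch's implementation): at train time units are dropped and the survivors rescaled, while eval mode is a deterministic pass-through:

```python
import random

def dropout(h, p, training, rng=None):
    # inverted dropout: scale kept units by 1/(1-p) at train time so that
    # at test time the layer is simply the identity (dropout turned off)
    if not training:
        return list(h)                 # eval mode: deterministic pass-through
    rng = rng or random.Random(0)
    keep = 1.0 - p
    return [x / keep if rng.random() < keep else 0.0 for x in h]

h = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(h, p=0.5, training=True)    # random masking + rescaling
eval_out = dropout(h, p=0.5, training=False)    # identical to h
```

Forgetting to switch to eval mode means the random masking above contaminates your test metrics.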

68 of 137

Where Randomness Bites


69 of 137

In Our Assignment (not mandatory)

  • Fix a split: MNIST (60k/10k) 🡪 e.g., 55k train / 5k val / 10k test, stratified by class.
  • Fix seeds (e.g., {0, 1, 2}); set random, numpy, torch seeds.
  • Log: code commit, package versions, GPU, batch size, LR, seed.
  • Never look at test until the very end.

  • For more,
    • Turn on deterministic options (may slow down): torch.use_deterministic_algorithms(True), torch.backends.cudnn.benchmark=False.
    • Keep DataLoader workers small (e.g., 0-2) for determinism.
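A sketch of why fixed seeds matter, using only the standard library (run_experiment is a hypothetical stand-in for a training run):

```python
import random

def run_experiment(seed):
    # hypothetical "experiment": the result depends only on the seed
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(100))

a = run_experiment(0)
b = run_experiment(0)   # same seed -> bit-identical run
c = run_experiment(1)   # different seed -> a different run
```

In a real project you would seed random, numpy, and torch the same way before training.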

70 of 137

In Our Assignment (mandatory)

  • Your network is randomly initialized.
  • So, a single run of your experiment cannot be reproduced exactly.
  • To address this, run each experiment multiple times and report the mean/std.
  • Also, report the overall tendency rather than a manual inspection of individual runs.
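Reporting mean ± std over seeded runs might look like this (the accuracy values are made up for illustration):

```python
from statistics import mean, stdev

# hypothetical accuracies from three seeded runs of the same model
accuracies = [0.972, 0.968, 0.975]
report = f"{mean(accuracies):.3f} ± {stdev(accuracies):.3f}"
```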

71 of 137

Evaluation: Metric

72 of 137

Accuracy

  • Accuracy = (number of correct predictions) / (total number of predictions)

73 of 137

Limitation of Accuracy?

  • One big assumption: classes are balanced & costs are symmetric.
  • 🡪 It is a weak indicator when class imbalance or distribution shift exists.
    • Ex) 99% of images are “0”: a dumb classifier predicting “0” gets 99% accuracy but is useless.
  • Alternatives/adjuncts: confusion matrix, balanced accuracy (mean recall across classes), top-k accuracy.
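The dumb-classifier example can be checked directly; balanced accuracy exposes what plain accuracy hides (a pure-Python sketch):

```python
# 100 samples: 99 of class "0", 1 of class "1"; a dumb model always predicts "0"
y_true = ["0"] * 99 + ["1"]
y_pred = ["0"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# balanced accuracy = mean of per-class recall
def recall(cls):
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    total = sum(1 for t in y_true if t == cls)
    return hits / total

balanced = (recall("0") + recall("1")) / 2   # punishes the missed minority class
```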

74 of 137

Confusion Matrix (binary)

  • True Positive: predicted positive & actually positive
  • True Negative: predicted negative & actually negative
  • False Positive: predicted positive but actually negative
  • False Negative: predicted negative but actually positive

75 of 137

Precision & Recall

  • Precision = TP / (TP + FP); Recall = TP / (TP + FN)

76 of 137

Why Important? (beyond Accuracy)

  • Accuracy can be misleading under class imbalance and asymmetric costs.
  • Precision controls false alarms (FP); Recall controls misses (FN).
  • Choose thresholds/metrics based on what is costly to your task.

77 of 137

Example 1 – Imbalanced Digits

  • Test set: 99% are “0.”
  • Trivial model predicts all “0.”
    • Accuracy = 99% (looks great)
    • For class “non-0”: Recall = 0% (never catches them), Precision undefined (no predicted positives) 🡪 Useless despite high accuracy.

78 of 137

Example 2 – Cancer Screening

  • 10,000 screened; prevalence 1% 🡪 100 positives.
  • Confusion matrix at one threshold: TP=70, FP=99, FN=30, TN=9,801
    • Accuracy = 98.71% (= (TP + TN) / all)
    • Recall = 70% (misses 30 cancers)
    • Precision = 41.4%
  • Lowering the threshold might give TP=90, FP=400, FN=10, TN=9,500
    • Accuracy = 95.9%
    • Recall = 90%
    • Precision = 18.4%
    • 🡪 Worse accuracy, worse precision, but far fewer missed cancers. Choose the threshold by costs, not accuracy.
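The arithmetic on this slide can be verified with a small helper (a sketch; the counts are the ones given above):

```python
def metrics(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),      # fraction of true cancers caught
        "precision": tp / (tp + fp),   # fraction of alarms that are real
    }

m1 = metrics(tp=70, fp=99, fn=30, tn=9801)   # first threshold
m2 = metrics(tp=90, fp=400, fn=10, tn=9500)  # lower threshold: more recall, less precision
```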


80 of 137

3-Class Example

  • Confusion matrix (rows=true, cols=pred), N=100:

  • Per-class:
    • A: TP=24, FP=11, FN=6 🡪 P=0.686, R=0.800, F1=0.738
    • B: TP=10, FP=9, FN=10 🡪 P=0.526, R=0.500, F1=0.513
    • C: TP=40, FP=6, FN=10 🡪 P=0.870, R=0.800, F1=0.833
  • Averages:
    • Accuracy = 0.74
    • Macro-F1 = 0.695
    • Micro-F1 = 0.74

            Pred A   Pred B   Pred C   (row sum)
True A        24        4        2        30
True B         6       10        4        20
True C         5        5       40        50
(col sum)     35       19       46
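The per-class and averaged scores above can be recomputed from the confusion-matrix counts (a pure-Python sketch):

```python
def prf(tp, fp, fn):
    # precision, recall, F1 from raw counts
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# per-class (TP, FP, FN) read off the confusion matrix above
counts = {"A": (24, 11, 6), "B": (10, 9, 10), "C": (40, 6, 10)}

macro_f1 = sum(prf(*c)[2] for c in counts.values()) / len(counts)  # mean of per-class F1
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro_f1 = prf(tp, fp, fn)[2]  # pool counts first; equals accuracy for single-label tasks
```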

81 of 137

Thresholds & Curves (quick note)

  • Many models output scores; choosing a threshold trades precision vs recall.
  • For class-imbalance or retrieval tasks, prefer PR curves/AU(PR)C over ROC-AUC.
  • In MNIST (argmax), thresholds are implicit; still report per-class metrics + confusion matrix.

82 of 137

One-Hot Vector

83 of 137

What is a One-Hot Vector?

  • A vector with a single 1 at the index of its class and 0s everywhere else.

84 of 137

Why One-Hot?

  • Preserves the non-ordinal nature of categorical variables (avoids fake order/size).
  • Simple, standard, widely used across ML (classification, recommender systems, NLP preprocessing).
  • Clean math: vectors are orthogonal; inner product is zero across different classes.
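A minimal sketch (the class names are hypothetical) showing the single-1 encoding and the orthogonality property:

```python
classes = ["cat", "dog", "bird"]

def one_hot(label):
    # exactly one 1, everything else 0; no ordering implied between classes
    return [1 if c == label else 0 for c in classes]

cat, dog = one_hot("cat"), one_hot("dog")
dot = sum(a * b for a, b in zip(cat, dog))   # inner product of different classes is 0
```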

85 of 137

Classification with Softmax + CrossEntropy

  • Softmax turns the output logits into a probability distribution; cross-entropy then measures −log of the probability assigned to the true (one-hot) class.
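A numerically stable softmax plus one-hot cross-entropy can be sketched in a few lines (the logits are hypothetical):

```python
import math

def softmax(logits):
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_one_hot):
    # with a one-hot target this reduces to -log(probability of the true class)
    return -sum(t * math.log(p) for t, p in zip(target_one_hot, probs))

probs = softmax([2.0, 1.0, 0.1])        # hypothetical logits for 3 classes
loss = cross_entropy(probs, [1, 0, 0])  # true class is the first one
```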

86 of 137

From One-Hot to Embedding Lookup

  • Multiplying a weight matrix by a one-hot vector just selects one row; an embedding lookup replaces the multiplication with a direct row index.

87 of 137

Model Ensemble

88 of 137

What are Ensembles?

  • Definition: Combine predictions from multiple models to make a final decision.
    • Members: independently trained checkpoints/models (different seeds/architectures/data recipes/…).
    • Combiner: a rule to merge predictions (e.g., average probabilities, majority vote).
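Both combiners can be sketched in a few lines of plain Python (the member probabilities are made-up values):

```python
# three hypothetical members' class-probability outputs for one example
member_probs = [
    [0.6, 0.3, 0.1],
    [0.4, 0.5, 0.1],
    [0.7, 0.2, 0.1],
]

# combiner 1: average the probabilities, then take the argmax
avg = [sum(p[c] for p in member_probs) / len(member_probs) for c in range(3)]
avg_pred = max(range(3), key=lambda c: avg[c])

# combiner 2: majority vote over each member's argmax
votes = [max(range(3), key=lambda c: p[c]) for p in member_probs]
vote_pred = max(set(votes), key=votes.count)
```

Here both combiners agree, even though one member disagrees with the majority.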

89 of 137

Why Ensembles?

  • Idea: Reduce variance and improve generalization & calibration by averaging uncorrelated errors.
    • Benefits: higher accuracy; more stable across seeds; better calibration (lower ECE); robustness to spurious cues.
    • Caveats: inference cost grows with the number of models; reporting complexity; needs diversity to help.


91 of 137

Where Does the Gain Come From?

  • Averaging members with (partly) uncorrelated errors reduces the variance of the prediction; in the ideal case of K independent members, the variance drops by a factor of K.


95 of 137

How to Represent Text

96 of 137

Step 1. Word Vectors

97 of 137

Suggested Readings


137 of 137

Next

  • Language Models
  • Convolutional NN, Recurrent NN