1 of 137

Machine Learning for English Analysis

Prof. Seungtaek Choi

2 of 137

Recap

  • Neural Networks

3 of 137

Recap: Neural Networks

4 of 137

5 of 137

6 of 137

Multiple Perceptrons

  • With one more layer…


First layer

Second layer

7 of 137

Feature Learning

This looks a lot like logistic regression.

The only difference is that, instead of taking a raw feature vector as input, the features are the values computed by the hidden layer.
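As a sketch in plain Python (all weights here are hypothetical toy values), the hidden layer computes the "learned features", and the output unit applies exactly the logistic-regression form to them:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(x, W, b):
    # one fully connected layer: h[j] = sigmoid(W[j] . x + b[j])
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj)
            for row, bj in zip(W, b)]

# hypothetical toy weights: 2 inputs -> 2 hidden units -> 1 output
W1, b1 = [[0.5, -0.3], [0.8, 0.2]], [0.1, -0.1]
w2, b2 = [1.0, -1.0], 0.0

x = [1.0, 2.0]
h = layer(x, W1, b1)   # the hidden layer's outputs act as learned features
# the output unit has exactly the logistic-regression form, applied to h
y = sigmoid(sum(w * hi for w, hi in zip(w2, h)) + b2)
```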

8 of 137

Multi-Layer Perceptron (MLP) = Artificial Neural Network

  • Why do we need multiple layers?


Feature learning: the hidden layers apply nonlinear mappings that make the data linearly separable, so the last layer can do simple linear classification.

9 of 137

Recap: Backpropagation

10 of 137

Gradients in ANN


11 of 137

Training Neural Networks: Backpropagation Learning

  • Forward propagation
    • the initial information propagates up to the hidden units at each layer and finally produces output

  • Backpropagation
    • allows the information from the cost to flow backwards through the network in order to compute the gradients
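These two passes can be sketched on a minimal network (one input, one sigmoid hidden unit, squared-error loss; all values are hypothetical), with a finite-difference check confirming that the backward pass computes the true gradient:

```python
import math

# one input -> one sigmoid hidden unit -> linear output, squared-error loss
def forward(x, w1, w2, t):
    h = 1.0 / (1.0 + math.exp(-w1 * x))    # hidden activation
    y = w2 * h                             # network output
    loss = 0.5 * (y - t) ** 2              # cost
    return h, y, loss

def backward(x, w1, w2, t):
    # backward pass: chain the gradient of the cost back through each layer
    h, y, _ = forward(x, w1, w2, t)
    dy = y - t                   # dL/dy
    dw2 = dy * h                 # dL/dw2 = dL/dy * dy/dw2
    dh = dy * w2                 # dL/dh  = dL/dy * dy/dh
    dw1 = dh * h * (1 - h) * x   # sigmoid'(z) = h(1-h), then dz/dw1 = x
    return dw1, dw2

# finite-difference check: backprop must match the numerical gradient
x, w1, w2, t, eps = 0.5, 0.3, -0.7, 1.0, 1e-6
dw1, dw2 = backward(x, w1, w2, t)
num_dw1 = (forward(x, w1 + eps, w2, t)[2] - forward(x, w1 - eps, w2, t)[2]) / (2 * eps)
num_dw2 = (forward(x, w1, w2 + eps, t)[2] - forward(x, w1, w2 - eps, t)[2]) / (2 * eps)
```

The analytic gradients agree with the numerical ones up to floating-point error, which is exactly what gradient descent needs.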


12 of 137

Backpropagation

  • Backpropagation applies the chain rule to compute ∂L/∂w, the gradient of the loss with respect to every weight.

These are what we need for GD

13 of 137

14 of 137

15 of 137

Recap: Activation Function


25 of 137

Today

  • Supplementary for Assignment 2
  • Evaluation
    • Protocol
    • Reproducibility
  • Model Ensemble
  • Regularization (dropout)
  • Deep Neural Network w/ CNN & RNN

26 of 137

Supplementary for Assignment 2

27 of 137

Google Colab

  • Google Colab: https://colab.google/

28 of 137

Google Colab

29 of 137

Google Colab

Ctrl + Enter to run the cell

or just click the button

30 of 137

Google Colab

Now it's time to practice with the lecture materials

31 of 137

Google Colab

Select the GitHub option

to bring our material from GitHub

32 of 137

Google Colab

Input the link of the provided material (.ipynb file)

33 of 137

Google Colab

Then click this!

34 of 137

Google Colab

On the right side of your screen

35 of 137

Google Colab

You can use the free-tier GPU

for practice

36 of 137

Google Colab

Run all the cells

37 of 137

Google Colab

Cell types: Code and Text (markdown)

38 of 137

Google Colab

You can add modules

39 of 137

Google Colab

After adding this path

You can import the module

40 of 137

Optimizer

41 of 137

Gradient Descent

42 of 137


48 of 137

Mini-Batches

49 of 137

Gradient Descent


52 of 137

Mini-batches while training

  • More accurate estimation of gradient
    • Smoother convergence
    • Allows for larger learning rates

  • Mini-batches lead to fast training!
    • Can parallelize computation + achieve significant speed increases on GPUs

53 of 137

Regularization


58 of 137

Lowering the capacity of the model

- discouraging the model from relying on a single pathway

- forcing the model to learn multiple pathways to make a single decision

60 of 137

Evaluation: Protocol

61 of 137

Evaluation Protocol

  • Separate Train / Validation / Test to prevent information leakage.
    • Train: fit model parameters.
    • Validation: pick hyper-params, early-stop, thresholds, feature choices, compare models.
    • Test (final only): one-time report, untouched during model/design decisions.

In other words, it is about the model selection strategy.
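A minimal sketch of such a split (pure Python; the fractions and seed are just example choices):

```python
import random

def split(data, val_frac=0.1, test_frac=0.1, seed=0):
    # shuffle once with a fixed seed, then carve out val/test before any training
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(data) * test_frac)
    n_val = int(len(data) * val_frac)
    test = [data[i] for i in idx[:n_test]]               # untouched until the final report
    val = [data[i] for i in idx[n_test:n_test + n_val]]  # for model selection
    train = [data[i] for i in idx[n_test + n_val:]]      # for fitting parameters
    return train, val, test

train, val, test = split(list(range(100)))
```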

62 of 137

Typical Leakage #1: Tuning on the test set

  • Problem: Selecting model/HPs by comparing “test” scores.
  • Why it inflates: You effectively overfit to the test idiosyncrasies.
  • Fix:
    • Choose with validation (or nested CV), then run test once.
    • Freeze all decisions before touching test.

  • Minimal pattern
    • Fit on TRAIN 🡪 Select on VAL 🡪 Report once on TEST

63 of 137

Typical Leakage #2: Fit Scalers/PCA on all data

  • Problem: Data-dependent transforms (standardization, PCA, feature selection) fit on train+val+test.
  • Why it inflates: Test distribution statistics are baked into the transform.
  • Fix:
    • Fit transforms on train only; transform val/test.
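For instance, standardization done correctly fits its statistics on the training set only (a sketch using the standard library's statistics module; the toy numbers are hypothetical):

```python
from statistics import mean, pstdev

def fit_scaler(train_x):
    # statistics computed from TRAIN ONLY
    mu, sigma = mean(train_x), pstdev(train_x)
    return lambda xs: [(x - mu) / sigma for x in xs]

train_x = [1.0, 2.0, 3.0, 4.0]
test_x = [10.0, 20.0]

scale = fit_scaler(train_x)
train_z = scale(train_x)   # standardized with train statistics
test_z = scale(test_x)     # same transform; test statistics never leak in
```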

64 of 137

Typical Leakage #3: Augment/Oversample on val/test

  • Problem: Applying strong data augmentation or oversampling (e.g., SMOTE) to val/test.
  • Why it leaks/warps:
    • Augmentations can smooth out randomness 🡪 overly stable/optimistic scores.
    • Oversampling changes class priors; you end up evaluating on synthetic/altered data.
  • Fix:
    • Train only: strong augmentation, oversampling, class-balanced sampling.
    • Val/Test: deterministic, minimal transforms (resize/center-crop/normalize).
    • If using Test-Time Augmentation (TTA) 🡪 report it explicitly.

65 of 137

If there is no validation split?

  • K-fold cross-validation
    • green = train
    • yellow = test
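A sketch of how k-fold index generation might look (pure Python; contiguous folds for simplicity, with no shuffling or stratification):

```python
def kfold_indices(n, k):
    # split indices 0..n-1 into k folds; each fold serves once as the held-out set
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    for i in range(k):
        held_out = folds[i]                                      # the "yellow" fold
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]  # the "green" folds
        yield train, held_out

splits = list(kfold_indices(10, 5))
```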

66 of 137

Reproducibility vs. Replicability

  • Reproducibility: the same code and data yield the same results. Replicability: an independent re-implementation (or new data) yields consistent findings.

67 of 137

Dropout should be turned off at test time.

  • BatchNorm / Dropout left in train mode.
    • Train mode: Dropout active; BN uses batch stats 🡪 unstable, biased metrics.
    • Fix: model.eval() + torch.no_grad() for inference; return to model.train() afterward.
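The train/eval distinction can be illustrated with inverted dropout in plain Python (a sketch, not PyTorch's implementation): at train time units are dropped and the survivors rescaled, while eval mode is a deterministic pass-through:

```python
import random

def dropout(h, p, training, rng=None):
    # inverted dropout: scale kept units by 1/(1-p) at train time so that
    # at test time the layer is simply the identity (dropout turned off)
    if not training:
        return list(h)                 # eval mode: deterministic pass-through
    rng = rng or random.Random(0)
    keep = 1.0 - p
    return [x / keep if rng.random() < keep else 0.0 for x in h]

h = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(h, p=0.5, training=True)    # random masking + rescaling
eval_out = dropout(h, p=0.5, training=False)    # identical to h
```

Forgetting to switch to eval mode means the random masking above contaminates your test metrics.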

68 of 137

Where Randomness Bites


69 of 137

In Our Assignment (not mandatory)

  • Fix a split: MNIST (60k/10k) 🡪 e.g., 55k train / 5k val / 10k test, stratified by class.
  • Fix seeds (e.g., {0, 1, 2}); set random, numpy, torch seeds.
  • Log: code commit, package versions, GPU, batch size, LR, seed.
  • Never look at test until the very end.

  • For more,
    • Turn on deterministic options (may slow down): torch.use_deterministic_algorithms(True), torch.backends.cudnn.benchmark=False.
    • Keep DataLoader workers small (e.g., 0-2) for determinism.
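A sketch of why fixed seeds matter, using only the standard library (run_experiment is a hypothetical stand-in for a training run):

```python
import random

def run_experiment(seed):
    # hypothetical "experiment": the result depends only on the seed
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(100))

a = run_experiment(0)
b = run_experiment(0)   # same seed -> bit-identical run
c = run_experiment(1)   # different seed -> a different run
```

In a real project you would seed random, numpy, and torch the same way before training.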

70 of 137

In Our Assignment (mandatory)

  • Your network is randomly initialized.
  • So, a single run of your experiment cannot be reproduced exactly.
  • To address this, run each experiment multiple times and report the mean/std.
  • Also, report the overall tendency rather than a manual inspection of individual runs.
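Reporting mean ± std over seeded runs might look like this (the accuracy values are made up for illustration):

```python
from statistics import mean, stdev

# hypothetical accuracies from three seeded runs of the same model
accuracies = [0.972, 0.968, 0.975]
report = f"{mean(accuracies):.3f} ± {stdev(accuracies):.3f}"
```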

71 of 137

Evaluation: Metric

72 of 137

Accuracy

  • Accuracy = (number of correct predictions) / (total number of predictions)

73 of 137

Limitation of Accuracy?

  • One big assumption: classes are balanced & costs are symmetric.
  • 🡪 It is a weak indicator when class imbalance or distribution shift exists.
    • Ex) 99% of images are “0”: a dumb classifier predicting “0” gets 99% accuracy but is useless.
  • Alternatives/adjuncts: confusion matrix, balanced accuracy (mean recall across classes), top-k accuracy.
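The dumb-classifier example can be checked directly; balanced accuracy exposes what plain accuracy hides (a pure-Python sketch):

```python
# 100 samples: 99 of class "0", 1 of class "1"; a dumb model always predicts "0"
y_true = ["0"] * 99 + ["1"]
y_pred = ["0"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# balanced accuracy = mean of per-class recall
def recall(cls):
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    total = sum(1 for t in y_true if t == cls)
    return hits / total

balanced = (recall("0") + recall("1")) / 2   # punishes the missed minority class
```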

74 of 137

Confusion Matrix (binary)

  • True Positive: predicted positive & actually positive
  • True Negative: predicted negative & actually negative
  • False Positive: predicted positive but actually negative
  • False Negative: predicted negative but actually positive

75 of 137

Precision & Recall

  • Precision = TP / (TP + FP); Recall = TP / (TP + FN)

76 of 137

Why Important? (beyond Accuracy)

  • Accuracy can be misleading under class imbalance and asymmetric costs.
  • Precision controls false alarms (FP); Recall controls misses (FN).
  • Choose thresholds/metrics based on what is costly to your task.

77 of 137

Example 1 – Imbalanced Digits

  • Test set: 99% are “0.”
  • Trivial model predicts all “0.”
    • Accuracy = 99% (looks great)
    • For class “non-0”: Recall = 0% (never catches them), Precision undefined (no predicted positives) 🡪 Useless despite high accuracy.

78 of 137

Example 2 – Cancer Screening

  • 10,000 screened; prevalence 1% 🡪 100 positives.
  • Confusion matrix at one threshold: TP=70, FP=99, FN=30, TN=9,801
    • Accuracy = 98.71% (= (TP + TN) / all)
    • Recall = 70% (misses 30 cancers)
    • Precision = 41.4%
  • Lowering the threshold might give TP=90, FP=400, FN=10, TN=9,500
    • Accuracy = 95.9%
    • Recall = 90%
    • Precision = 18.4%
    • 🡪 Worse accuracy, worse precision, but far fewer missed cancers. Choose the threshold by costs, not accuracy.
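The arithmetic on this slide can be verified with a small helper (a sketch; the counts are the ones given above):

```python
def metrics(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),      # fraction of true cancers caught
        "precision": tp / (tp + fp),   # fraction of alarms that are real
    }

m1 = metrics(tp=70, fp=99, fn=30, tn=9801)   # first threshold
m2 = metrics(tp=90, fp=400, fn=10, tn=9500)  # lower threshold: more recall, less precision
```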


80 of 137

3-Class Example

  • Confusion matrix (rows=true, cols=pred), N=100:

  • Per-class:
    • A: TP=24, FP=11, FN=6 🡪 P=0.686, R=0.800, F1=0.738
    • B: TP=10, FP=9, FN=10 🡪 P=0.526, R=0.500, F1=0.513
    • C: TP=40, FP=6, FN=10 🡪 P=0.870, R=0.800, F1=0.833
  • Averages:
    • Accuracy = 0.74
    • Macro-F1 = 0.695
    • Micro-F1 = 0.74

            Pred A   Pred B   Pred C   (row sum)
True A        24        4        2        30
True B         6       10        4        20
True C         5        5       40        50
(col sum)     35       19       46
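The per-class and averaged scores above can be recomputed from the confusion-matrix counts (a pure-Python sketch):

```python
def prf(tp, fp, fn):
    # precision, recall, F1 from raw counts
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# per-class (TP, FP, FN) read off the confusion matrix above
counts = {"A": (24, 11, 6), "B": (10, 9, 10), "C": (40, 6, 10)}

macro_f1 = sum(prf(*c)[2] for c in counts.values()) / len(counts)  # mean of per-class F1
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro_f1 = prf(tp, fp, fn)[2]  # pool counts first; equals accuracy for single-label tasks
```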

81 of 137

Thresholds & Curves (quick note)

  • Many models output scores; choosing a threshold trades precision vs recall.
  • For class-imbalance or retrieval tasks, prefer PR curves/AU(PR)C over ROC-AUC.
  • In MNIST (argmax), thresholds are implicit; still report per-class metrics + confusion matrix.

82 of 137

One-Hot Vector

83 of 137

What is a One-Hot Vector?

  • A vector with a single 1 at the index of its class and 0s everywhere else.

84 of 137

Why One-Hot?

  • Preserves the non-ordinal nature of categorical variables (avoids fake order/size).
  • Simple, standard, widely used across ML (classification, recommender systems, NLP preprocessing).
  • Clean math: vectors are orthogonal; inner product is zero across different classes.
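A minimal sketch (the class names are hypothetical) showing the single-1 encoding and the orthogonality property:

```python
classes = ["cat", "dog", "bird"]

def one_hot(label):
    # exactly one 1, everything else 0; no ordering implied between classes
    return [1 if c == label else 0 for c in classes]

cat, dog = one_hot("cat"), one_hot("dog")
dot = sum(a * b for a, b in zip(cat, dog))   # inner product of different classes is 0
```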

85 of 137

Classification with Softmax + CrossEntropy

  • Softmax turns the output logits into a probability distribution; cross-entropy then measures −log of the probability assigned to the true (one-hot) class.
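A numerically stable softmax plus one-hot cross-entropy can be sketched in a few lines (the logits are hypothetical):

```python
import math

def softmax(logits):
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_one_hot):
    # with a one-hot target this reduces to -log(probability of the true class)
    return -sum(t * math.log(p) for t, p in zip(target_one_hot, probs))

probs = softmax([2.0, 1.0, 0.1])        # hypothetical logits for 3 classes
loss = cross_entropy(probs, [1, 0, 0])  # true class is the first one
```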

86 of 137

From One-Hot to Embedding Lookup

  • Multiplying a weight matrix by a one-hot vector just selects one row; an embedding lookup replaces the multiplication with a direct row index.

87 of 137

Model Ensemble

88 of 137

What are Ensembles?

  • Definition: Combine predictions from multiple models to make a final decision.
    • Members: independently trained checkpoints/models (different seeds/architectures/data recipes/…).
    • Combiner: a rule to merge predictions (e.g., average probabilities, majority vote).
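Both combiners can be sketched in a few lines of plain Python (the member probabilities are made-up values):

```python
# three hypothetical members' class-probability outputs for one example
member_probs = [
    [0.6, 0.3, 0.1],
    [0.4, 0.5, 0.1],
    [0.7, 0.2, 0.1],
]

# combiner 1: average the probabilities, then take the argmax
avg = [sum(p[c] for p in member_probs) / len(member_probs) for c in range(3)]
avg_pred = max(range(3), key=lambda c: avg[c])

# combiner 2: majority vote over each member's argmax
votes = [max(range(3), key=lambda c: p[c]) for p in member_probs]
vote_pred = max(set(votes), key=votes.count)
```

Here both combiners agree, even though one member disagrees with the majority.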

89 of 137

Why Ensembles?

  • Idea: Reduce variance and improve generalization & calibration by averaging uncorrelated errors.
    • Benefits: higher accuracy; more stable across seeds; better calibration (lower ECE); robustness to spurious cues.
    • Caveats: inference cost grows with the number of models; reporting complexity; needs diversity to help.


91 of 137

Where Does the Gain Come From?

  • Averaging members with (partly) uncorrelated errors reduces the variance of the prediction; in the ideal case of K independent members, the variance drops by a factor of K.


95 of 137

How to Represent Text

96 of 137

Step 1. Word Vectors

97 of 137

Suggested Readings


137 of 137

Next

  • Language Models
  • Convolutional NN, Recurrent NN