1 of 55

Machine Learning for English Analysis

Week 7 – Unsupervised Learning

Prof. Seungtaek Choi

2 of 55

Today

  • Plan for Remaining Semester
  • Announcement: Assignment #3 (project proposal)
  • Recap: Midterm
  • Unsupervised Learning
    • Clustering

3 of 55

Plan for Remaining Semester

Week   Assignment schedule                               Lecture Topic
W9     Assign#3 (project proposal): start                Unsup. Learning
W10    Assign#3: end                                     Unsup. Learning
W11    Assign#4 (data collection and analysis): start    Deep Learning
W12    Assign#4: end                                     Deep Learning
W13    Assign#5 (model training and evaluation): start   Advanced Topics
W14    Assign#5: end                                     Advanced Topics
W15    Assign#6 (real usage & final report): start       X
W16    Assign#6: end                                     Final Exam

Assign#7 (presentation) is held during the lecture slot.

4 of 55

Announcement: Assignment #3

  • Assignment #3: write your project proposal and submit a PR
    • Deadline: 11:59 PM on Nov 10th (2 weeks)
    • Follow the instructions in https://github.com/HUFS-LAI-Seungtaek/HUFS-LAI-ML4E-2025-2/tree/main/assignments/assignment3
      • Almost the same workflow as Assignment #1
    • Examples
      • 1) Paper summarization bot
      • 2) Notice classification bot
      • 3) Personalized translation tool

5 of 55

Midterm Explanation

6 of 55

Stats about Midterm

  • Full score: 100
  • Average score: 37
  • Median score: 40

7 of 55

Midterm: OX

8 of 55

Midterm: OX

9 of 55

Midterm: OX

10 of 55

Midterm: Multiple-Choice

11 of 55

Midterm: Multiple-Choice

12 of 55

Midterm: Multiple-Choice

13 of 55

Midterm: Multiple-Choice

  • Test data leakage
  • Optimistic, but …
  • Counter: 85 vs. 15
  • Cannot be sure without "cost"

14 of 55

Midterm: Multiple-Choice

  • Covered in the OX questions
  • Covered in Assignment #2 (random init vs. word2vec init)
  • Approximation with random sampling

15 of 55

Midterm: Multiple-Choice

  • Dropout is for regularization (better generalization)
  • Too strong regularization = low representation power
  • For p = 0, validation loss will increase (overfitting)
  • Not guaranteed

16 of 55

Midterm: Multiple-Choice

  • n classifiers multiply the computation by n
  • Soft voting: sum the two probability vectors
    [0.1, 0.3, 0.4, 0.1, 0.1] + [0.2, 0.2, 0.3, 0.2, 0.1] = [0.3, 0.5, 0.7, 0.3, 0.2] → class C
  • Majority vote of 3 classifiers, each with error e = 0.2:
    3 * e^2 * (1 - e) + e^3 = 3 * 0.2^2 * 0.8 + 0.2^3 = 0.104 (error) → 0.896 (accuracy)
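As a quick check of the arithmetic above, a minimal Python sketch (the class names A-E and e = 0.2 follow the question):

    import numpy as np

    # Soft voting: sum the per-class probabilities of the two classifiers.
    p1 = np.array([0.1, 0.3, 0.4, 0.1, 0.1])
    p2 = np.array([0.2, 0.2, 0.3, 0.2, 0.1])
    print("ABCDE"[int(np.argmax(p1 + p2))])  # -> C

    # Majority vote of 3 independent classifiers with error e = 0.2:
    # the ensemble errs when at least 2 of the 3 are wrong.
    e = 0.2
    error = 3 * e**2 * (1 - e) + e**3
    print(error, 1 - error)  # 0.104 0.896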

17 of 55

Midterm: Multiple-Choice

Scores

- Complete confusion matrix → 4 points
- Correct accuracy → 4 points

Confusion matrix (rows = True, columns = Pred):

          Pred A   Pred B   Pred C
True A       5        1        0
True B       0        4        2
True C       2        0        4

Accuracy = (# correct) / (# total) = 13 / 18
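For reference, the accuracy can be verified from the matrix's diagonal, as a short sketch:

    import numpy as np

    # Rows = true class (A, B, C), columns = predicted class (A, B, C).
    cm = np.array([[5, 1, 0],
                   [0, 4, 2],
                   [2, 0, 4]])
    print(np.trace(cm) / cm.sum())  # 13 / 18 ≈ 0.722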

18 of 55

Midterm: Multiple-Choice

Scores

- Correct order → 7 points
- Correct format → 1 point (uppercase/lowercase, numbers, etc.)

Answer: A > C > B (#1: A, #2: C, #3: B)

Model A (rows = True, columns = Pred):

          Pred P   Pred N
True P     160       40
True N     160      640

Model B:

          Pred P   Pred N
True P     100      100
True N      25      775

Model C:

          Pred P   Pred N
True P     120       80
True N      60      740

Cost(A) = 1 * 160 + 5 * 40 = 360
Cost(B) = 1 * 25 + 5 * 100 = 525
Cost(C) = 1 * 60 + 5 * 80 = 460
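A small sketch verifying the ranking, assuming (per the question) a false positive costs 1 and a false negative costs 5:

    # (FP, FN) counts read off the confusion matrices above.
    models = {"A": (160, 40), "B": (25, 100), "C": (60, 80)}
    costs = {m: 1 * fp + 5 * fn for m, (fp, fn) in models.items()}
    print(costs)                         # {'A': 360, 'B': 525, 'C': 460}
    print(sorted(costs, key=costs.get))  # ['A', 'C', 'B'], lower cost is better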

19 of 55

Midterm: Multiple-Choice

(i) [<PAD>, in, paris]

  • x: [0, 0, 0, 1, 1, 0]
  • s: [0.2, 0.1, 0]
  • BIO: O

(ii) [in, paris, spring]

  • x: [0, 1, 1, 0, 1, 1]
  • s: [0.2, -1.9, 1.0]
  • BIO: I-LOC

(iii) [paris, spring, <PAD>]

  • x: [1, 0, 1, 1, 0, 0]
  • s: [0.2, -1.9, 0]
  • BIO: O

For a more interactive walkthrough, please see this URL:

https://claude.ai/public/artifacts/33cc6421-a2f0-46c8-b429-94a9af7fed00
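A minimal sketch of the scoring step above, assuming the three scores correspond to the tags O, B-LOC, and I-LOC in that order (an assumption; the feature extraction itself is not reproduced here):

    import numpy as np

    TAGS = ["O", "B-LOC", "I-LOC"]  # assumed score order

    # Per-window scores s from the question; the predicted tag is the argmax.
    windows = {
        "(<PAD>, in, paris)":     [0.2,  0.1, 0.0],
        "(in, paris, spring)":    [0.2, -1.9, 1.0],
        "(paris, spring, <PAD>)": [0.2, -1.9, 0.0],
    }
    for w, s in windows.items():
        print(w, "->", TAGS[int(np.argmax(s))])  # O, I-LOC, O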

20 of 55

Midterm: Multiple-Choice

21 of 55

Midterm: Multiple-Choice

22 of 55

Midterm: Multiple-Choice

ReLU(z) = max(0, z) cannot return negative numbers
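A one-line check of this property, as a quick sketch:

    # ReLU clips negative inputs to zero, so outputs are always >= 0.
    relu = lambda z: max(0.0, z)
    print([relu(z) for z in (-2.0, -0.5, 0.0, 3.0)])  # [0.0, 0.0, 0.0, 3.0]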

23 of 55

Midterm: Multiple-Choice

Scores

- Correct (a) = dog, dog → 4 points
- Correct (b) = park, park → 4 points

24 of 55

Unsupervised Learning

Clustering

25 of 55

Recap: Supervised vs. Unsupervised Learning

26 of 55

Recap: Supervised vs. Unsupervised Learning

27 of 55

Recap: Supervised vs. Unsupervised Learning

Supervised Learning: building a model from labeled data

Unsupervised Learning: clustering from unlabeled data

28 of 55

Data Clustering

29 of 55

Data Clustering: Similarity (~ Distance)

  • The only information clustering uses is the mutual similarity between samples
  • A good clustering is one that achieves:
    • high within-cluster similarity
    • low inter-cluster similarity

30 of 55

Basic Intuition

31 of 55

Basic Intuition

  • Assigning points

32 of 55

Basic Intuition

  • Assigning points

33 of 55

Basic Intuition

  • Recomputing the cluster centers

34 of 55

Basic Intuition

  • Assigning points

35 of 55

Basic Intuition

  • Recomputing the cluster centers

36 of 55

Basic Intuition

  • Repeat

37 of 55

Basic Intuition

  • Final clustering result

38 of 55

K-Means (Iterative) Algorithm
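The formulas on these slides are not reproduced in this transcript. For reference, here is the standard K-Means objective and its two alternating steps in LaTeX; this is the textbook formulation and may differ from the slides' exact notation:

    J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert x_n - \mu_k \rVert^2,
    \qquad r_{nk} \in \{0, 1\}, \quad \sum_{k} r_{nk} = 1

    \text{Assignment step: } r_{nk} = 1 \text{ iff } k = \arg\min_{j} \lVert x_n - \mu_j \rVert^2

    \text{Update step: } \mu_k = \frac{\sum_n r_{nk} \, x_n}{\sum_n r_{nk}}

Each iteration never increases J, so the algorithm converges to a local minimum.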

39 of 55

K-Means (Iterative) Algorithm

40 of 55

K-Means (Iterative) Algorithm

41 of 55

K-Means (Iterative) Algorithm

42 of 55

K-Means (Iterative) Algorithm

43 of 55

K-Means with Python

  • Data generation

  • Random initialization
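The code shown on this slide is not reproduced in the transcript. A minimal sketch of both steps, assuming scikit-learn's make_blobs for the toy data (sample counts and seeds are illustrative):

    import numpy as np
    from sklearn.datasets import make_blobs

    # Data generation: 300 two-dimensional points around 3 true centers.
    X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

    # Random initialization: pick k distinct data points as initial centers.
    k = 3
    rng = np.random.default_rng(0)
    centers = X[rng.choice(len(X), size=k, replace=False)]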

44 of 55

K-Means with Python

  • Manual implementation
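Again, the slide's code is not shown here; a compact sketch of the two alternating steps, continuing from the X, k, and centers defined above:

    # Manual K-Means: alternate assignment and update until centers stop moving.
    # (This sketch does not handle the empty-cluster corner case.)
    for _ in range(100):
        # Assignment step: label each point with its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Update step: move each center to the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # converged
        centers = new_centers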

45 of 55

K-Means with Python

  • Use the scikit-learn library
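The scikit-learn version is a few lines; as a sketch (parameter values are illustrative):

    from sklearn.cluster import KMeans

    # n_init reruns K-Means with different random initializations and keeps the best.
    km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
    print(km.cluster_centers_)  # learned centers
    print(km.labels_[:10])      # cluster index of the first 10 points
    print(km.inertia_)          # sum of squared distances to nearest center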

46 of 55

Example: K-Means on Word2Vec

  • k = 5

47 of 55

Example: K-Means on Word2Vec

  • k = 11
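The embeddings and plots for these slides are not reproduced here. As a rough sketch of how such a clustering can be produced (the gensim model name and word list below are illustrative assumptions, not necessarily what the slides used):

    import gensim.downloader as api
    from sklearn.cluster import KMeans

    # Load pretrained word vectors (example model; the slides' model may differ).
    wv = api.load("glove-wiki-gigaword-50")
    words = ["king", "queen", "man", "woman", "apple", "banana",
             "grape", "paris", "london", "tokyo"]
    vecs = [wv[w] for w in words]

    km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(vecs)
    for word, cluster in zip(words, km.labels_):
        print(cluster, word)  # words in the same cluster share a label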

48 of 55

Some Issues of K-Means

  • Initialization issues
    • K-Means is extremely sensitive to cluster-center initialization

    • Bad initialization can lead to
      • poor convergence speed
      • bad overall clustering

    • Safeguarding measures (see the sketch below):
      • Choose the first center as one of the examples, the second as the example farthest from the first, the third as the one farthest from both, and so on
      • Try multiple initializations and choose the best result
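A sketch of the farthest-point rule described above, assuming X is a NumPy array of shape (n_samples, n_features):

    import numpy as np

    def farthest_point_init(X, k, seed=0):
        """Pick the first center at random, then repeatedly pick the point
        farthest from all centers chosen so far."""
        rng = np.random.default_rng(seed)
        centers = [X[rng.integers(len(X))]]
        for _ in range(k - 1):
            # Distance from every point to its nearest already-chosen center.
            d = np.linalg.norm(
                X[:, None, :] - np.array(centers)[None, :, :], axis=2
            ).min(axis=1)
            centers.append(X[int(np.argmax(d))])
        return np.array(centers)

scikit-learn's default init="k-means++" is a randomized relative of this farthest-point idea, and its n_init parameter implements the "try multiple initializations" safeguard.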

49 of 55

Some Issues of K-Means

50 of 55

Some Issues of K-Means

51 of 55

Some Issues of K-Means

  • Choosing the number of clusters
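The plot for this slide is not reproduced. One common heuristic, shown here as an assumed example rather than the slide's exact content, is the elbow method: run K-Means for several values of k and look for the point where the objective stops dropping sharply:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=4, random_state=1)

    # Inertia always decreases as k grows; pick k near the "elbow" where it flattens.
    for k in range(1, 9):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(k, round(km.inertia_, 1))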

52 of 55

Limitations of K-Means (1/4)

  • Hard assignments
    • A point either belongs to a cluster completely or does not belong at all
    • No notion of a soft assignment (i.e., a probability of being assigned to each cluster); see the sketch below
    • Recap: classification yields a distribution like [0.4, 0.2, 0.3, 0.1] …
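To make the contrast concrete, a hypothetical sketch of what a soft assignment could look like, using a softmax over negative distances (illustrative only; this is not part of standard K-Means):

    import numpy as np

    def soft_assign(x, centers, temperature=1.0):
        """Turn distances to each center into a probability distribution."""
        d = np.linalg.norm(centers - x, axis=1)
        logits = -d / temperature
        p = np.exp(logits - logits.max())  # subtract max for numerical stability
        return p / p.sum()

    centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
    print(soft_assign(np.array([1.0, 0.5]), centers))  # ~[0.82, 0.12, 0.07]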

53 of 55

Limitations of K-Means (2/4)

  • Sensitive to outlier examples

54 of 55

Limitations of K-Means (3/4)

  • Non-convex / non-round-shaped clusters: standard K-Means fails!
    • Assignment is based on Euclidean distance, which favors round, convex clusters

55 of 55

Limitations of K-Means (4/4)

  • Imbalanced cluster sizes (or clusters with different densities)