1 of 55

Machine Learning for English Analysis

Week 7 – Unsupervised Learning

Prof. Seungtaek Choi

2 of 55

Today

  • Plan for Remaining Semester
  • Announcement: Assignment #3 (project proposal)
  • Recap: Midterm
  • Unsupervised Learning
    • Clustering

3 of 55

Plan for Remaining Semester

Week   Assignment schedule                               Lecture Topic
W9     Assign#3 (project proposal): start                Unsup. Learning
W10    Assign#3: end                                     Unsup. Learning
W11    Assign#4 (data collection and analysis): start    Deep Learning
W12    Assign#4: end                                     Deep Learning
W13    Assign#5 (model training and evaluation): start   Advanced Topics
W14    Assign#5: end                                     Advanced Topics
W15    Assign#6 (real usage & final report): start       X
W16    Assign#6: end                                     Final Exam

Assign#7 (presentation) is held during the lecture slot.

4 of 55

Announcement: Assignment #3

  • Assignment #3: write your project proposal and submit a PR
    • Deadline: 11:59 PM on Nov 10th (2 weeks)
    • Follow the instructions in https://github.com/HUFS-LAI-Seungtaek/HUFS-LAI-ML4E-2025-2/tree/main/assignments/assignment3
      • Almost the same workflow as Assignment #1
    • Examples
      • 1) Paper summarization bot
      • 2) Notice classification bot
      • 3) Personalized translation tool

5 of 55

Midterm Explanation

6 of 55

Stats about Midterm

  • Full score: 100
  • Average score: 37
  • Median score: 40

7 of 55

Midterm: OX

8 of 55

Midterm: OX

9 of 55

Midterm: OX

10 of 55

Midterm: Multiple-Choice

11 of 55

Midterm: Multiple-Choice

12 of 55

Midterm: Multiple-Choice

13 of 55

Midterm: Multiple-Choice

  • Test data leakage
  • Optimistic, but …
  • Counter: 85 vs. 15
  • Cannot be sure without "cost"

14 of 55

Midterm: Multiple-Choice

  • Covered in the OX questions
  • Covered in Assignment #2 (random init vs. word2vec init)
  • Approximation with random sampling

15 of 55

Midterm: Multiple-Choice

  • Dropout is for regularization (better generalization)
  • Too strong regularization = low representation power
  • For p = 0, validation loss will increase (overfitting)
  • Not guaranteed

16 of 55

Midterm: Multiple-Choice

  • n classifiers multiply the computation by n
  • Soft voting: sum the two probability vectors
    [0.1, 0.3, 0.4, 0.1, 0.1] + [0.2, 0.2, 0.3, 0.2, 0.1] = [0.3, 0.5, 0.7, 0.3, 0.2] → class C
  • Majority vote of 3 classifiers, each with error e = 0.2:
    3 * e^2 * (1 - e) + e^3 = 3 * 0.2^2 * 0.8 + 0.2^3 = 0.104 (error) → 0.896 (accuracy)
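As a quick check of the arithmetic above, a minimal Python sketch (the class names A-E and e = 0.2 follow the question):

    import numpy as np

    # Soft voting: sum the per-class probabilities of the two classifiers.
    p1 = np.array([0.1, 0.3, 0.4, 0.1, 0.1])
    p2 = np.array([0.2, 0.2, 0.3, 0.2, 0.1])
    print("ABCDE"[int(np.argmax(p1 + p2))])  # -> C

    # Majority vote of 3 independent classifiers with error e = 0.2:
    # the ensemble errs when at least 2 of the 3 are wrong.
    e = 0.2
    error = 3 * e**2 * (1 - e) + e**3
    print(error, 1 - error)  # 0.104 0.896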

17 of 55

Midterm: Multiple-Choice

Scores

- Complete confusion matrix → 4 points
- Correct accuracy → 4 points

Confusion matrix (rows = True, columns = Pred):

          Pred A   Pred B   Pred C
True A       5        1        0
True B       0        4        2
True C       2        0        4

Accuracy = (# correct) / (# total) = 13 / 18
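For reference, the accuracy can be verified from the matrix's diagonal, as a short sketch:

    import numpy as np

    # Rows = true class (A, B, C), columns = predicted class (A, B, C).
    cm = np.array([[5, 1, 0],
                   [0, 4, 2],
                   [2, 0, 4]])
    print(np.trace(cm) / cm.sum())  # 13 / 18 ≈ 0.722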

18 of 55

Midterm: Multiple-Choice

Scores

- Correct order → 7 points
- Correct format → 1 point (uppercase/lowercase, numbers, etc.)

Answer: A > C > B (#1: A, #2: C, #3: B)

Model A (rows = True, columns = Pred):

          Pred P   Pred N
True P     160       40
True N     160      640

Model B:

          Pred P   Pred N
True P     100      100
True N      25      775

Model C:

          Pred P   Pred N
True P     120       80
True N      60      740

Cost(A) = 1 * 160 + 5 * 40 = 360
Cost(B) = 1 * 25 + 5 * 100 = 525
Cost(C) = 1 * 60 + 5 * 80 = 460
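A small sketch verifying the ranking, assuming (per the question) a false positive costs 1 and a false negative costs 5:

    # (FP, FN) counts read off the confusion matrices above.
    models = {"A": (160, 40), "B": (25, 100), "C": (60, 80)}
    costs = {m: 1 * fp + 5 * fn for m, (fp, fn) in models.items()}
    print(costs)                         # {'A': 360, 'B': 525, 'C': 460}
    print(sorted(costs, key=costs.get))  # ['A', 'C', 'B'], lower cost is better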

19 of 55

Midterm: Multiple-Choice

(i) [<PAD>, in, paris]

  • x: [0, 0, 0, 1, 1, 0]
  • s: [0.2, 0.1, 0]
  • BIO: O

(ii) [in, paris, spring]

  • x: [0, 1, 1, 0, 1, 1]
  • s: [0.2, -1.9, 1.0]
  • BIO: I-LOC

(iii) [paris, spring, <PAD>]

  • x: [1, 0, 1, 1, 0, 0]
  • s: [0.2, -1.9, 0]
  • BIO: O

For a more interactive walkthrough, please see this URL:

https://claude.ai/public/artifacts/33cc6421-a2f0-46c8-b429-94a9af7fed00
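A minimal sketch of the scoring step above, assuming the three scores correspond to the tags O, B-LOC, and I-LOC in that order (an assumption; the feature extraction itself is not reproduced here):

    import numpy as np

    TAGS = ["O", "B-LOC", "I-LOC"]  # assumed score order

    # Per-window scores s from the question; the predicted tag is the argmax.
    windows = {
        "(<PAD>, in, paris)":     [0.2,  0.1, 0.0],
        "(in, paris, spring)":    [0.2, -1.9, 1.0],
        "(paris, spring, <PAD>)": [0.2, -1.9, 0.0],
    }
    for w, s in windows.items():
        print(w, "->", TAGS[int(np.argmax(s))])  # O, I-LOC, O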

20 of 55

Midterm: Multiple-Choice

21 of 55

Midterm: Multiple-Choice

22 of 55

Midterm: Multiple-Choice

ReLU(z) = max(0, z) cannot return negative numbers
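A one-line check of this property, as a quick sketch:

    # ReLU clips negative inputs to zero, so outputs are always >= 0.
    relu = lambda z: max(0.0, z)
    print([relu(z) for z in (-2.0, -0.5, 0.0, 3.0)])  # [0.0, 0.0, 0.0, 3.0]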

23 of 55

Midterm: Multiple-Choice

Scores

- Correct (a) = dog, dog → 4 points
- Correct (b) = park, park → 4 points

24 of 55

Unsupervised Learning

Clustering

25 of 55

Recap: Supervised vs. Unsupervised Learning

26 of 55

Recap: Supervised vs. Unsupervised Learning

27 of 55

Recap: Supervised vs. Unsupervised Learning

Supervised Learning: building a model from labeled data

Unsupervised Learning: clustering from unlabeled data

28 of 55

Data Clustering

29 of 55

Data Clustering: Similarity (~ Distance)

  • The only information clustering uses is the mutual similarity between samples
  • A good clustering is one that achieves:
    • high within-cluster similarity
    • low inter-cluster similarity

30 of 55

Basic Intuition

31 of 55

Basic Intuition

  • Assigning points

32 of 55

Basic Intuition

  • Assigning points

33 of 55

Basic Intuition

  • Recomputing the cluster centers

34 of 55

Basic Intuition

  • Assigning points

35 of 55

Basic Intuition

  • Recomputing the cluster centers

36 of 55

Basic Intuition

  • Repeat

37 of 55

Basic Intuition

  • Final clustering result

38 of 55

K-Means (Iterative) Algorithm
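The formulas on these slides are not reproduced in this transcript. For reference, here is the standard K-Means objective and its two alternating steps in LaTeX; this is the textbook formulation and may differ from the slides' exact notation:

    J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert x_n - \mu_k \rVert^2,
    \qquad r_{nk} \in \{0, 1\}, \quad \sum_{k} r_{nk} = 1

    \text{Assignment step: } r_{nk} = 1 \text{ iff } k = \arg\min_{j} \lVert x_n - \mu_j \rVert^2

    \text{Update step: } \mu_k = \frac{\sum_n r_{nk} \, x_n}{\sum_n r_{nk}}

Each iteration never increases J, so the algorithm converges to a local minimum.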

39 of 55

K-Means (Iterative) Algorithm

40 of 55

K-Means (Iterative) Algorithm

41 of 55

K-Means (Iterative) Algorithm

42 of 55

K-Means (Iterative) Algorithm

43 of 55

K-Means with Python

  • Data generation

  • Random initialization
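The code shown on this slide is not reproduced in the transcript. A minimal sketch of both steps, assuming scikit-learn's make_blobs for the toy data (sample counts and seeds are illustrative):

    import numpy as np
    from sklearn.datasets import make_blobs

    # Data generation: 300 two-dimensional points around 3 true centers.
    X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

    # Random initialization: pick k distinct data points as initial centers.
    k = 3
    rng = np.random.default_rng(0)
    centers = X[rng.choice(len(X), size=k, replace=False)]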

44 of 55

K-Means with Python

  • Manual implementation
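Again, the slide's code is not shown here; a compact sketch of the two alternating steps, continuing from the X, k, and centers defined above:

    # Manual K-Means: alternate assignment and update until centers stop moving.
    # (This sketch does not handle the empty-cluster corner case.)
    for _ in range(100):
        # Assignment step: label each point with its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Update step: move each center to the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # converged
        centers = new_centers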

45 of 55

K-Means with Python

  • Use the scikit-learn library
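The scikit-learn version is a few lines; as a sketch (parameter values are illustrative):

    from sklearn.cluster import KMeans

    # n_init reruns K-Means with different random initializations and keeps the best.
    km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
    print(km.cluster_centers_)  # learned centers
    print(km.labels_[:10])      # cluster index of the first 10 points
    print(km.inertia_)          # sum of squared distances to nearest center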

46 of 55

Example: K-Means on Word2Vec

  • k = 5

47 of 55

Example: K-Means on Word2Vec

  • k = 11
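The embeddings and plots for these slides are not reproduced here. As a rough sketch of how such a clustering can be produced (the gensim model name and word list below are illustrative assumptions, not necessarily what the slides used):

    import gensim.downloader as api
    from sklearn.cluster import KMeans

    # Load pretrained word vectors (example model; the slides' model may differ).
    wv = api.load("glove-wiki-gigaword-50")
    words = ["king", "queen", "man", "woman", "apple", "banana",
             "grape", "paris", "london", "tokyo"]
    vecs = [wv[w] for w in words]

    km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(vecs)
    for word, cluster in zip(words, km.labels_):
        print(cluster, word)  # words in the same cluster share a label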

48 of 55

Some Issues of K-Means

  • Initialization issues
    • K-Means is extremely sensitive to cluster-center initialization

    • Bad initialization can lead to
      • poor convergence speed
      • bad overall clustering

    • Safeguarding measures (see the sketch below):
      • Choose the first center as one of the examples, the second as the example farthest from the first, the third as the one farthest from both, and so on
      • Try multiple initializations and choose the best result
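A sketch of the farthest-point rule described above, assuming X is a NumPy array of shape (n_samples, n_features):

    import numpy as np

    def farthest_point_init(X, k, seed=0):
        """Pick the first center at random, then repeatedly pick the point
        farthest from all centers chosen so far."""
        rng = np.random.default_rng(seed)
        centers = [X[rng.integers(len(X))]]
        for _ in range(k - 1):
            # Distance from every point to its nearest already-chosen center.
            d = np.linalg.norm(
                X[:, None, :] - np.array(centers)[None, :, :], axis=2
            ).min(axis=1)
            centers.append(X[int(np.argmax(d))])
        return np.array(centers)

scikit-learn's default init="k-means++" is a randomized relative of this farthest-point idea, and its n_init parameter implements the "try multiple initializations" safeguard.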

49 of 55

Some Issues of K-Means

50 of 55

Some Issues of K-Means

51 of 55

Some Issues of K-Means

  • Choosing the number of clusters
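The plot for this slide is not reproduced. One common heuristic, shown here as an assumed example rather than the slide's exact content, is the elbow method: run K-Means for several values of k and look for the point where the objective stops dropping sharply:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=4, random_state=1)

    # Inertia always decreases as k grows; pick k near the "elbow" where it flattens.
    for k in range(1, 9):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(k, round(km.inertia_, 1))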

52 of 55

Limitations of K-Means (1/4)

  • Hard assignments
    • A point either belongs to a cluster completely or does not belong at all
    • No notion of a soft assignment (i.e., a probability of being assigned to each cluster); see the sketch below
    • Recap: classification yields a distribution like [0.4, 0.2, 0.3, 0.1] …
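To make the contrast concrete, a hypothetical sketch of what a soft assignment could look like, using a softmax over negative distances (illustrative only; this is not part of standard K-Means):

    import numpy as np

    def soft_assign(x, centers, temperature=1.0):
        """Turn distances to each center into a probability distribution."""
        d = np.linalg.norm(centers - x, axis=1)
        logits = -d / temperature
        p = np.exp(logits - logits.max())  # subtract max for numerical stability
        return p / p.sum()

    centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
    print(soft_assign(np.array([1.0, 0.5]), centers))  # ~[0.82, 0.12, 0.07]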

53 of 55

Limitations of K-Means (2/4)

  • Sensitive to outlier examples

54 of 55

Limitations of K-Means (3/4)

  • Non-convex / non-round-shaped clusters: standard K-Means fails!
    • Assignment is based on Euclidean distance, which favors round, convex clusters

55 of 55

Limitations of K-Means (4/4)

  • Imbalanced cluster sizes (or clusters with different densities)