1 of 109

Machine Learning for English Analysis
Week 9 – Deep Learning (2)

Seungtaek Choi

Division of Language & AI at HUFS

seungtaek.choi@hufs.ac.kr

2 of 109

Unsupervised Learning

Dimensionality Reduction – Principal Component Analysis (PCA)

3 of 109

Why reduce dimensions?

  •  

4 of 109

Principal Component Analysis (PCA)

  • The 1st PC is the projection direction that maximizes the variance of the projected data
  • The 2nd PC is the projection direction that is orthogonal to the 1st PC and maximizes the variance

5 of 109

Principal Component Analysis (PCA)

  • Principal components are linear combinations of the original variables that have the maximum variance among all such linear combinations.
  • Essentially, these components capture as much information from the original dataset as possible.

6 of 109

PCA: Conceptual Algorithm

7 of 109

PCA: Conceptual Algorithm

  • 1. Maximize variance (most separable)
  • 2. Minimize the sum-of-squares (minimum squared error)

8 of 109

PCA Algorithm: Pre-processing

  • Given data

  • Shifting (zero mean) and rescaling (unit variance)
    • 1) Shift to zero mean

    • 2) [optional] Rescaling (unit variance)
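A minimal NumPy sketch of these two pre-processing steps, assuming a data matrix X with one row per example (the example data is illustrative):

```python
import numpy as np

# Illustrative data matrix: one row per example, one column per feature
X = np.random.rand(100, 5)

# 1) Shift to zero mean: subtract the per-feature mean
X_centered = X - X.mean(axis=0)

# 2) [optional] Rescale to unit variance: divide by the per-feature standard deviation
X_standardized = X_centered / X_centered.std(axis=0)
```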

9 of 109

PCA Algorithm: Maximize Variance

  •  

10 of 109

Maximize Variance

  • In an optimization form
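A standard way to write this objective, using S for the sample covariance of the centered data (the notation is assumed here):

```latex
w_1 = \arg\max_{w} \; w^\top S\, w \quad \text{subject to} \quad \|w\|_2 = 1
```

The maximizer is the eigenvector of S with the largest eigenvalue; the k-th principal direction is the eigenvector with the k-th largest eigenvalue.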

11 of 109

 

  •  

12 of 109

Dimensionality Reduction Using PCA

  •  

13 of 109

PCA is a useful preprocessing step

  • Helps reduce computational complexity.
  • Can help supervised learning.
    • Reduced dimension 🡺 simpler hypothesis space.

  • PCA can also be seen as noise reduction.
  • Caveats:
    • Fails when data consists of multiple separate clusters.
    • Directions of greatest variance may not be most informative (i.e., greatest classification power).
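Putting the pieces together, a minimal NumPy sketch of PCA as described above (center, compute the covariance, take the top-k eigenvectors, project); the variable names and k = 2 are illustrative assumptions:

```python
import numpy as np

def pca(X, k=2):
    """Project X (n_samples, n_features) onto its top-k principal components."""
    # 1) Center the data (zero mean per feature)
    X_centered = X - X.mean(axis=0)
    # 2) Sample covariance matrix (n_features, n_features)
    cov = np.cov(X_centered, rowvar=False)
    # 3) Eigendecomposition; eigh is used because the covariance matrix is symmetric
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4) Sort directions by decreasing variance (eigenvalue)
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:k]]      # (n_features, k)
    # 5) Project onto the k principal directions
    return X_centered @ components          # (n_samples, k)

X = np.random.randn(200, 10)
Z = pca(X, k=2)
print(Z.shape)  # (200, 2)
```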

14 of 109

Unsupervised Learning

Dimensionality Reduction – Autoencoder
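In the same spirit as PCA, an autoencoder learns a low-dimensional code by training a network to reconstruct its own input. A minimal PyTorch sketch (the layer sizes, 784-dimensional input, and MSE reconstruction loss are illustrative assumptions, not the slides' exact architecture):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder compresses the input into a low-dimensional code
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoder reconstructs the input from the code
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)           # low-dimensional representation
        return self.decoder(z), z

model = Autoencoder()
x = torch.randn(16, 784)              # dummy batch of flattened images
x_hat, z = model(x)
loss = nn.MSELoss()(x_hat, x)         # train to reconstruct the input
```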

15 of 109

16 of 109

17 of 109

18 of 109

Deep Learning

Convolutional Neural Networks

19 of 109

What Computers “See”?

  • Images are numbers
    • An image is just a matrix of numbers [0, 255]!
    • e.g., 1080x1080x3 for an RGB image

20 of 109

What Computers “See”?

  • Images are numbers
    • An image is just a matrix of numbers [0, 255]!
    • e.g., 1080x1080x3 for an RGB image

21 of 109

Tasks in Computer Vision

  • Regression: output variable takes continuous value
  • Classification: output variable takes class label. Can produce probability of belonging to a particular class

[Figure: an input image is converted to its pixel representation and classified, producing class probabilities: Lincoln 0.8, Washington 0.05, Jefferson 0.05, Obama 0.01, Trump 0.09]

22 of 109

Tasks in Computer Vision

  • Regression: output variable takes continuous value
  • Classification: output variable takes class label. Can produce probability of belonging to a particular class

23 of 109

24 of 109

25 of 109

26 of 109

27 of 109

High-level Feature Detection

  • Let’s identify key features in each image category

[Figure: key features for each image category: nose, eyes, mouth; wheels, license plate, headlights; door, windows, steps]

28 of 109

29 of 109

“Learning” Feature Representations

  • Can we learn a hierarchy of features directly from the data instead of hand engineering?

30 of 109

“Learning” Feature Representations

  • Motivation
    • The bird occupies a local area and looks the same in different parts of an image.
    • We should construct neural networks which exploit these properties.

31 of 109

Deep Learning

  • That’s why we learn “Deep Learning” today.

32 of 109

Fully Connected Neural Network

33 of 109

Fully Connected Neural Network

  • Input:
    • 2D image
    • Vector of pixel values
  • Fully Connected:
    • Connect neuron in hidden layer to all neurons in input layer
    • No spatial information!
      • Spatial organization of the input is destroyed by flatten.
    • And, many, many parameters!

  • How can we use spatial structure in the input to inform the architecture of the network?

34 of 109

Fully Connected Layer

35 of 109

Locally Connected Layer

36 of 109

Convolutional Layer

37 of 109

Key Idea

  • A standard neural net applied to images:
    • Scales quadratically with the size of the input
    • Does not leverage stationarity

  • Solution:
    • Connect each hidden unit to a small patch of the input
    • Share the weight across space

  • This is called: convolutional layer.
  • A network with convolutional layers is called convolutional network.
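A quick way to see the difference in parameter count; the 224x224 RGB input, 64 output units/channels, and 3x3 kernel below are illustrative assumptions:

```python
import torch.nn as nn

# Fully connected: every hidden unit sees every input pixel
fc = nn.Linear(224 * 224 * 3, 64)
# Convolutional: each unit sees a 3x3 patch, and the weights are shared across space
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(fc))    # 9633856 parameters (~9.6M)
print(count(conv))  # 1792 parameters
```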

38 of 109

Using Spatial Structure

Input: 2D image, array of pixel values

Idea: connect patches of input to neurons in hidden layer.

Neuron connected to region of input. Only “sees” these values.

39 of 109

Using Spatial Structure

Connect patch in input layer to a single neuron in subsequent layer.

Use a sliding window to define connections.

How can we weight the patch to detect particular features?

40 of 109

Feature Extraction with Convolution

  1. Apply a set of weights – a filter – to extract local features

  • Use multiple filters to extract different features

  • Spatially share parameters of each filter
  • Filter of size 4x4: 16 different weights
  • Apply this same filter to 4x4 patches in input
  • Shift by 2 pixels for next patch

This “patchy” operation is convolution.
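A minimal NumPy sketch of this "patchy" operation using the numbers above (a 4x4 filter, shifting by 2 pixels); the 8x8 input size is an illustrative assumption:

```python
import numpy as np

image = np.random.rand(8, 8)     # illustrative input
filt = np.random.rand(4, 4)      # one 4x4 filter = 16 shared weights
stride = 2

out_size = (image.shape[0] - filt.shape[0]) // stride + 1   # (8 - 4) / 2 + 1 = 3
feature_map = np.zeros((out_size, out_size))
for i in range(out_size):
    for j in range(out_size):
        patch = image[i*stride:i*stride+4, j*stride:j*stride+4]
        feature_map[i, j] = np.sum(patch * filt)   # element-wise multiply, then add
print(feature_map.shape)  # (3, 3)
```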

41 of 109

The Convolution Operation

 

Image

Kernel

Feature Map

42 of 109

Feature Extraction with Convolution: A Case Study

  • X or X?

Image is represented as matrix of pixel values… and computers are literal!

We want to be able to classify an X as an X even if it’s shifted, shrunk, rotated, deformed.

[Figure: two 9×9 grids of +1/−1 pixel values, a canonical “X” and a shifted/deformed “X”, asking whether the second should still be recognized as an X]

43 of 109

Feature Extraction with Convolution: A Case Study

  • Features of X

[Figure: the 9×9 “X” pixel grids with 3×3 sub-patches highlighted as the recurring features of an X: the diagonal strokes and the central crossing]

44 of 109

Feature Extraction with Convolution: A Case Study

  • Filters to detect X features

Filter 1 (diagonal):
 1  -1  -1
-1   1  -1
-1  -1   1

Filter 2 (crossing):
 1  -1   1
-1   1  -1
 1  -1   1

Filter 3 (anti-diagonal):
-1  -1   1
-1   1  -1
 1  -1  -1

45 of 109

Feature Extraction with Convolution: A Case Study

[Figure: the diagonal filter is aligned with a matching 3×3 patch of the “X” image; every element-wise product is 1 (e.g., 1 × 1 = 1 and −1 × −1 = 1), and adding the nine outputs gives 9, the maximum possible response]
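The same element-wise multiply-and-add, written out in NumPy for the diagonal filter and a perfectly matching 3×3 patch:

```python
import numpy as np

filt = np.array([[ 1, -1, -1],
                 [-1,  1, -1],
                 [-1, -1,  1]])
patch = filt.copy()           # a patch of the image that matches the filter exactly
print(np.sum(filt * patch))   # element-wise multiply, add outputs -> 9
```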

46 of 109

Producing Feature Maps

[Figure: the same image convolved with different kernels: Original, Sharpen, Edge Detect, “Strong” Edge Detect]

47 of 109

Convolutional Neural Networks

  • Neural network with specialized connectivity structure
  • Stack multiple stages of feature extractors
  • Higher stages compute more global, more invariant features

48 of 109

Convolutional Neural Networks

  • Feed-forward feature extraction:
    • Convolve input with learned filters: Apply filters to generate feature maps.
    • Non-linearity: Often ReLU.
    • Spatial pooling: Downsampling operation on each feature map.
    • Normalization
  • Supervised training of convolutional filters by back-propagating classification error

49 of 109

CNNs: Spatial Arrangement of Output Volume

50 of 109

CNNs: Spatial Arrangement of Output Volume

51 of 109

Spatial Pooling

  • Sum or max
  • Non-overlapping / overlapping regions
  • Role of pooling:
    • Invariance to small transformations
    • Larger receptive fields (see more of input)
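A minimal PyTorch example of non-overlapping max pooling; the 2x2 window and the 4x4 single-channel input are illustrative assumptions:

```python
import torch
import torch.nn as nn

x = torch.tensor([[1., 3., 2., 1.],
                  [4., 2., 0., 1.],
                  [1., 0., 6., 5.],
                  [2., 1., 7., 8.]]).reshape(1, 1, 4, 4)  # (batch, channel, H, W)

pool = nn.MaxPool2d(kernel_size=2, stride=2)  # non-overlapping 2x2 windows
print(pool(x).squeeze())
# tensor([[4., 2.],
#         [2., 8.]])
```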

52 of 109

Pooling Layer

53 of 109

Pooling Layer

54 of 109

Representation Learning in Deep CNNs

[Figure: features visualized at Conv Layer 1, Conv Layer 2, and Conv Layer 3, becoming increasingly high-level from layer to layer]

55 of 109

CNNs for Classification: Feature Learning

  1. Learn features in input image through convolution
  2. Introduce non-linearity through activation function (real-world data is non-linear!)
  3. Reduce dimensionality and preserve spatial invariance with pooling

56 of 109

CNNs for Classification: Class Probabilities

  1. CONV and POOL layers output high-level features of input
  2. Fully connected layer uses these features for classifying input image
  3. Express output as probability of image belonging to a particular class

 

57 of 109

Practice 1: Feature Map Shape

58 of 109

Practice 1: Feature Map Shape

  • Stride = 1 (Default): Moves one pixel at a time

Convolution with 3x3 kernel, zero padding and stride = 1

59 of 109

Practice 1: Feature Map Shape

  • Stride > 1: Moves multiple pixels at a time 🡪 Reduces the output size, leading to downsampling.

Convolution with 3x3 kernel, zero padding and stride = 2
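A standard formula for the resulting feature-map size, where W is the input width/height, K the kernel size, P the padding on each side, and S the stride (the symbols are assumed here for illustration):

```latex
W_{\text{out}} = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1
```

For example, a 5×5 input with a 3×3 kernel, padding 1, and stride 2 gives ⌊(5 − 3 + 2)/2⌋ + 1 = 3, i.e., a 3×3 feature map.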

60 of 109

Practice 2: CNN in PyTorch

61 of 109

Practice 2: CNN in PyTorch
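A minimal sketch of a small CNN in PyTorch in the spirit of this practice (conv → ReLU → pool, twice, then a fully connected classifier); the channel sizes, 28x28 grayscale input, and 10 output classes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # 28x28 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 28x28 -> 14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # 14x14 -> 14x14
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 14x14 -> 7x7
        )
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = x.flatten(1)              # flatten all dims except the batch dim
        return self.classifier(x)     # class scores (logits)

model = SimpleCNN()
images = torch.randn(8, 1, 28, 28)    # dummy batch of grayscale images
logits = model(images)
print(logits.shape)                   # torch.Size([8, 10])
```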

62 of 109

An Architecture for Many Applications

63 of 109

CNN for Text Classification

64 of 109

CNN for Speech Recognition

65 of 109

Deep Learning

Recurrent Neural Networks

66 of 109

So Far

  • Regression, Classification, Dimension Reduction, …
  • Based on snapshot-type data

67 of 109

Sequence Matters

  • Given an image of a ball, can you predict where it will go next?

???

68 of 109

Sequence Matters

  • How about this? Can you predict where it will go next?

69 of 109

What is a Sequence?

  • Sentence
    • “This morning I took the dog for a walk.”

  • Medical signals / Speech waveform / Vibration measurement

70 of 109

Sequence Modeling

  • Sequence modeling is the task of predicting what comes next
    • E.g., “This morning I took my dog for a walk.”

    • E.g., given historical air quality, forecast air quality in next couple of hours.

given previous words

predict the next word

71 of 109

A Sequence Modeling Example: Next Word Prediction

  • Idea #1: Use a fixed window

  • Limitation: Cannot model long-term dependencies
    • E.g., “France is where I grew up, but I now live in Boston. I speak fluent ___.”

  • We need information from the distant past to accurately predict the correct word.

“This morning I took my dog for a walk.”: given the previous two words, predict the next word

72 of 109

A Sequence Modeling Example: Next Word Prediction

  • Idea #2: Use entire sequence as set of counts

  • Bag-of-words model
    • Define a vocabulary and initialize a zero vector with one element per word
    • Compute each word’s frequency and update the corresponding position in the vector (see the sketch below)

    • Use the vector for prediction

  • Limitation: Counts don’t preserve order
    • “The food was good, not bad at all.” vs. “The food was bad, not good at all.”

  • We need to preserve the information about order.

“This morning I took my dog for a walk.” → predict the next word

[0 1 0 0 1 0 1 … … 0 0 1 1 0 0 0 1 0]   (here, 1 is the count for the word “a”)
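A minimal sketch of this bag-of-words representation, which also shows the order-loss limitation; the tiny vocabulary is an illustrative assumption:

```python
from collections import Counter

vocab = ["the", "food", "was", "good", "bad", "not", "at", "all"]

def bag_of_words(sentence):
    # Zero vector with one slot per vocabulary word, filled with word counts
    tokens = sentence.lower().replace(",", "").replace(".", "").split()
    counts = Counter(tokens)
    return [counts.get(w, 0) for w in vocab]

a = bag_of_words("The food was good, not bad at all.")
b = bag_of_words("The food was bad, not good at all.")
print(a)       # [1, 1, 1, 1, 1, 1, 1, 1]
print(a == b)  # True -- counts do not preserve word order
```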

73 of 109

Sequence Modeling

  • To model sequences, we need to:
    • Handle variable-length sequences
    • Track long-term dependencies
    • Maintain information about order
    • Share parameters across the sequence

  • Solution:
    • Recurrent Neural Networks (RNNs)

74 of 109

Standard Feed-Forward Neural Network

75 of 109

Recurrent Neural Networks

… and many other architectures and applications

76 of 109

A Recurrent Neural Network (RNN)

  • Apply a recurrence relation at every time step to process a sequence:

  • Note: the same function and set of parameters are used at every time step

[Figure: RNN cell taking the current input vector and the old state, updating the cell state, and producing an output vector]

77 of 109

RNN: State Update and Output

  •  
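A common way to write this state update and output (the weight names W_hh, W_xh, W_hy follow a standard convention and are assumed here, not taken from the slide):

```latex
h_t = \tanh\!\left(W_{hh}\, h_{t-1} + W_{xh}\, x_t\right), \qquad \hat{y}_t = W_{hy}\, h_t
```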

[Figure: the same RNN cell, annotated with its input vector, old state, updated cell state, and output vector]

78 of 109

RNN: Computational Graph across Time

  • Represent as computational graph unrolled across time

79 of 109

RNN: Computational Graph across Time

  • Represent as computational graph unrolled across time

80 of 109

RNN: Computational Graph across Time

  • Represent as computational graph unrolled across time

81 of 109

RNN: Computational Graph across Time

  • Represent as computational graph unrolled across time

82 of 109

RNN: Computational Graph across Time

  • Re-use the same weight matrices at every time step
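A minimal PyTorch sketch of this weight re-use: one recurrent cell applied in a loop, with the same parameters at every time step (all dimensions are illustrative assumptions):

```python
import torch
import torch.nn as nn

input_dim, hidden_dim, seq_len = 8, 16, 5
cell = nn.RNNCell(input_dim, hidden_dim)   # one set of weights, shared across time

x = torch.randn(seq_len, 1, input_dim)     # (time, batch, features)
h = torch.zeros(1, hidden_dim)             # initial state h_0

for t in range(seq_len):
    h = cell(x[t], h)                      # same cell (same weights) at every step
print(h.shape)                             # torch.Size([1, 16])
```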

83 of 109

RNN: Computational Graph across Time

  •  

84 of 109

RNN: Computational Graph across Time

  •  

85 of 109

RNN: Backpropagation Through Time

  •  

86 of 109

Standard RNN Gradient Flow

  •  

87 of 109

Standard RNN Gradient Flow: Exploding Gradients

  •  

  • Many values > 1: exploding gradients
    • Remedy: gradient clipping to scale big gradients

  • Many values < 1: vanishing gradients
    • Remedies: 1. activation function, 2. weight initialization, 3. network architecture
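Gradient clipping is typically a single call in PyTorch, applied between the backward pass and the optimizer step; the model, dummy loss, and threshold of 1.0 below are illustrative assumptions:

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=8, hidden_size=16)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(5, 1, 8)                 # (time, batch, features)
output, h_n = model(x)
loss = output.pow(2).mean()              # dummy loss for illustration

loss.backward()
# Rescale gradients whose global norm exceeds 1.0, to tame exploding gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```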

88 of 109

Standard RNN Gradient Flow: Vanishing Gradients

  •  

  • Many values > 1: exploding gradients
    • Remedy: gradient clipping to scale big gradients

  • Many values < 1: vanishing gradients
    • Remedies: 1. activation function, 2. weight initialization, 3. network architecture

89 of 109

The Problem of Long-Term Dependencies

  • Why are vanishing gradients a problem?

  • Multiply many small numbers together
  • Errors from time steps further back have smaller and smaller gradients
  • As a result, the parameters become biased toward capturing short-term dependencies

90 of 109

Gating Mechanisms in Neurons

  • Use a more complex recurrent unit with gates to control what information is passed through

  • Long Short-Term Memory (LSTM) networks rely on gated cells to track information throughout many time steps.

91 of 109

Standard RNNs

  • In a standard RNN, recurrent modules contain simple computation

92 of 109

Long Short-Term Memory (LSTM)

  • In an LSTM network, recurrent modules contain gated cells that control the information flow

93 of 109

Long Short-Term Memory (LSTM)

  •  

94 of 109

Long Short-Term Memory (LSTM)

  • Information is added or removed to cell state through structures called gates.

Gates optionally let information through, via a sigmoid layer and pointwise multiplication
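For reference, a common parameterization of these gates (the symbols below follow the standard LSTM formulation and are assumptions about notation, not copied from the slides):

```latex
\begin{aligned}
f_t &= \sigma\!\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{forget gate}\\
i_t &= \sigma\!\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{input gate}\\
\tilde{c}_t &= \tanh\!\left(W_c\,[h_{t-1}, x_t] + b_c\right) && \text{candidate values}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{update cell state}\\
o_t &= \sigma\!\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{output gate}\\
h_t &= o_t \odot \tanh(c_t) && \text{filtered output}
\end{aligned}
```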

95 of 109

LSTM: Forget Irrelevant Information

  •  

 

96 of 109

LSTM: Add New Information

  •  

 

97 of 109

LSTM: Update Cell State

  •  

 

98 of 109

LSTM: Output Filtered Version of Cell State

  •  

 

99 of 109

LSTM: Cell State Matters

  •  

 

100 of 109

LSTM: Mitigate Vanishing Gradient

  • Vanilla RNNs: backpropagating to early time steps multiplies many factors of the recurrent weight matrix (and activation derivatives) together
    • If most of these factors are small, the gradient vanishes!
    • If most of them are large, the gradient explodes!

101 of 109

LSTM: Mitigate Vanishing Gradient

  • Vanilla RNNs: the repeated multiplications shrink (or blow up) the gradient over long time spans

  • LSTM: the cell-state gradient flows through an additive update controlled by the forget gate

  • So… we can keep information if we want! (by adjusting how much we forget)

102 of 109

LSTM: Key Concepts

  • Maintain a separate cell state from what is outputted

  • Use gates to control the flow of information
    • Forget gate gets rid of irrelevant information
    • Input gate selectively updates the cell state with new information
    • Output gate returns a filtered version of the cell state

  • LSTM can mitigate vanishing gradient problem
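A minimal PyTorch sketch of running an LSTM over a sequence (all dimensions are illustrative assumptions); note that it returns both the per-step outputs and the final hidden and cell states, reflecting the separate cell state described above:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(4, 10, 8)          # (batch, time, features)
outputs, (h_n, c_n) = lstm(x)      # outputs: hidden state at every time step

print(outputs.shape)  # torch.Size([4, 10, 16])
print(h_n.shape)      # torch.Size([1, 4, 16]) -- final hidden state
print(c_n.shape)      # torch.Size([1, 4, 16]) -- final cell state (kept separate)
```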

103 of 109

RNN Applications & Limitations

104 of 109

RNN Applications & Limitations

105 of 109

RNN Applications & Limitations

106 of 109

RNN Applications & Limitations

  • Limitations
    • Encoding bottleneck
    • Slow, no parallelization

107 of 109

Next

  • Attention and Transformers

108 of 109

  • Language Model: A system that predicts the next word

  • Recurrent Neural Network: A family of neural networks that:
    • Take sequential input of any length; apply the same weights on each step
    • Can optionally produce output on each step

  • Recurrent Neural Network != Language Model
    • RNNs can be used for many other things

  • Language Modeling is a traditional subcomponent of many NLP tasks, especially those involving generating text or estimating the probability of text:
    • Now everything in NLP is being rebuilt upon Language Modeling: GPT-3 is an LM!

109 of 109

Assignment #4