1 of 112

Machine Learning, Week 9 – Deep Learning (2)

Seungtaek Choi

Division of Language & AI at HUFS

seungtaek.choi@hufs.ac.kr

2 of 112

Unsupervised Learning

Dimensionality Reduction – Principal Component Analysis (PCA)

3 of 112

Why reduce dimensions?

  • Fewer features mean easier visualization, cheaper computation and storage, less noise, and less risk of overfitting in high dimensions.

4 of 112

Principal Component Analysis (PCA)

  • The 1st PC is the projection direction that maximizes the variance of the projected data
  • The 2nd PC is the projection direction that is orthogonal to the 1st PC and maximizes the variance

5 of 112

Principal Component Analysis (PCA)

  • Principal components are linear combinations of the original variables that have the maximum variance compared to other linear combinations.
  • Essentially, these components capture as much information from the original dataset as possible.

6 of 112

PCA: Conceptual Algorithm

7 of 112

PCA: Conceptual Algorithm

  • 1. Maximize the variance of the projected data (most separable)
  • 2. Equivalently, minimize the sum of squared reconstruction errors (minimum squared error)

8 of 112

PCA Algorithm: Pre-processing

  • Given data

  • Shifting (zero mean) and rescaling (unit variance)
    • 1) Shift to zero mean

    • 2) [optional] Rescaling (unit variance)
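A minimal NumPy sketch of these pre-processing steps followed by PCA via eigendecomposition of the covariance matrix (function and variable names are illustrative, not from the course materials):

    import numpy as np

    def pca(X, k, rescale=False):
        """PCA on X (n_samples x n_features); returns top-k directions and projections."""
        # 1) Shift to zero mean
        X = X - X.mean(axis=0)
        # 2) [optional] Rescale each feature to unit variance
        if rescale:
            X = X / X.std(axis=0)
        S = np.cov(X, rowvar=False)                # sample covariance (features x features)
        eigvals, eigvecs = np.linalg.eigh(S)       # eigenvalues in ascending order
        order = np.argsort(eigvals)[::-1]          # sort by variance, descending
        W = eigvecs[:, order[:k]]                  # top-k principal directions
        Z = X @ W                                  # projected (reduced) data
        return W, Z

    X = np.random.randn(100, 5)                    # dummy 5-D data
    W, Z = pca(X, k=2)
    print(Z.shape)                                 # (100, 2)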

9 of 112

PCA Algorithm: Maximize Variance

  • Find the unit vector w onto which the projected data has maximum variance.

10 of 112

Maximize Variance

  • In an optimization form
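Written out, the optimization form is presumably the standard one (S denotes the sample covariance of the centered data):

    \mathbf{w}_1 = \arg\max_{\|\mathbf{w}\|_2 = 1} \; \mathbf{w}^\top S \,\mathbf{w},
    \qquad
    S = \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i \mathbf{x}_i^\top

Subsequent components maximize the same objective while being orthogonal to the earlier ones.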

11 of 112

 

  • The maximizer is the eigenvector of the sample covariance matrix with the largest eigenvalue; later PCs are the remaining eigenvectors, in decreasing order of eigenvalue.

12 of 112

Dimensionality Reduction Using PCA

  • Project the centered data onto the top-k principal components, z = W_k^T x, to obtain a k-dimensional representation.

13 of 112

PCA is a useful preprocessing step

  • Helps reduce computational complexity.
  • Can help supervised learning.
    • Reduced dimension 🡺 simpler hypothesis space.

  • PCA can also be seen as noise reduction.
  • Caveats:
    • Fails when data consists of multiple separate clusters.
    • Directions of greatest variance may not be most informative (i.e., greatest classification power).
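As a hedged illustration of the "PCA as preprocessing" point, a scikit-learn pipeline that reduces dimensionality before a classifier (the digits dataset and the choice of 16 components are assumptions made for the example):

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline

    X, y = load_digits(return_X_y=True)                 # 64-D pixel features
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Reduce 64 dimensions to 16 before fitting the classifier
    model = Pipeline([("pca", PCA(n_components=16)),
                      ("clf", LogisticRegression(max_iter=1000))])
    model.fit(X_tr, y_tr)
    print(model.score(X_te, y_te))                      # accuracy on held-out data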

14 of 112

Unsupervised Learning

Dimensionality Reduction – Autoencoder
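The autoencoder slides themselves did not survive extraction, so here is only a minimal sketch of the idea (compress the input to a low-dimensional code and train by reconstruction error), assuming flattened 784-dimensional inputs such as MNIST:

    import torch
    import torch.nn as nn

    class Autoencoder(nn.Module):
        def __init__(self, in_dim=784, code_dim=32):
            super().__init__()
            # Encoder compresses the input to a low-dimensional code
            self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                         nn.Linear(128, code_dim))
            # Decoder reconstructs the input from the code
            self.decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(),
                                         nn.Linear(128, in_dim))

        def forward(self, x):
            return self.decoder(self.encoder(x))

    model = Autoencoder()
    x = torch.rand(16, 784)                   # a dummy batch
    loss = nn.MSELoss()(model(x), x)          # reconstruction error
    loss.backward()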

15 of 112

16 of 112

17 of 112

18 of 112

Deep Learning

Convolutional Neural Networks

19 of 112

What Computers “See”?

  • Images are numbers
    • An image is just a matrix of numbers [0, 255]!

20 of 112

What Computers “See”?

  • Images are numbers
    • An image is just a matrix of numbers [0, 255]!
    • e.g., 1080x1080x3 for an RGB image (height x width x color channels)
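As a concrete (synthetic) illustration:

    import numpy as np

    img = np.random.randint(0, 256, size=(1080, 1080, 3), dtype=np.uint8)
    print(img.shape)      # (1080, 1080, 3): height x width x RGB channels
    print(img[0, 0])      # a single pixel: three integers in [0, 255]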

21 of 112

Tasks in Computer Vision

  • Regression: output variable takes continuous value
  • Classification: output variable takes class label. Can produce probability of belonging to a particular class

Input Image → Pixel Representation → classification
Candidate classes: Lincoln, Washington, Jefferson, Obama, Trump
Predicted probabilities: 0.8, 0.05, 0.05, 0.01, 0.09

22 of 112

Tasks in Computer Vision

  • Regression: output variable takes continuous value
  • Classification: output variable takes class label. Can produce probability of belonging to a particular class

23 of 112

24 of 112

25 of 112

26 of 112

  • Optical Character Recognition (OCR)

27 of 112

High-level Feature Detection

  • Let’s identify key features in each image category

Nose, Eyes, Mouth
Wheels, License Plate, Headlights
Door, Windows, Steps

28 of 112

29 of 112

“Learning” Feature Representations

  • Can we learn a hierarchy of features directly from the data instead of hand engineering?

30 of 112

“Learning” Feature Representations

  • Motivation
    • A bird occupies only a local region of an image and looks the same wherever it appears in the image.
    • We should construct neural networks which exploit these properties.

31 of 112

Deep Learning

  • That’s why we learn “Deep Learning” today.

32 of 112

Fully Connected Neural Network

33 of 112

Fully Connected Neural Network

  • Input:
    • 2D image
    • Vector of pixel values
  • Fully Connected:
    • Connect every neuron in the hidden layer to all neurons in the input layer
    • No spatial information!
      • The spatial organization of the input is destroyed by flattening.
    • And many, many parameters!

  • How can we use spatial structure in the input to inform the architecture of the network?

34 of 112

Fully Connected Layer

35 of 112

Locally Connected Layer

36 of 112

Convolutional Layer

37 of 112

Key Idea

  • A standard neural net applied to images:
    • Scales quadratically with the size of the input
    • Does not leverage stationarity

  • Solution:
    • Connect each hidden unit to a small patch of the input
    • Share the weights across space

  • This is called a convolutional layer.
  • A network with convolutional layers is called a convolutional network.

38 of 112

Using Spatial Structure

Input: 2D image, array of pixel values

Idea: connect patches of the input to neurons in the hidden layer.

Each neuron is connected to a region of the input and only “sees” those values.

39 of 112

Using Spatial Structure

Connect patch in input layer to a single neuron in subsequent layer.

Use a sliding window to define connections.

How can we weight the patch to detect particular features?

40 of 112

Feature Extraction with Convolution

  1. Apply a set of weights – a filter – to extract local features
  2. Use multiple filters to extract different features
  3. Spatially share parameters of each filter
    • e.g., a 4x4 filter has 16 weights; apply the same filter to 4x4 patches of the input, shifting by 2 pixels for the next patch

This “patchy” operation is convolution.

41 of 112

The Convolution Operation

 

Image ∗ Kernel → Feature Map
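A minimal NumPy sketch of this operation (strictly speaking cross-correlation, which is what deep-learning libraries implement; no padding, stride 1, and the averaging kernel is just an example):

    import numpy as np

    def conv2d(image, kernel):
        """Slide the kernel over the image; each output is the sum of element-wise products."""
        kh, kw = kernel.shape
        oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
        out = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
        return out

    image = np.arange(25).reshape(5, 5)
    kernel = np.ones((3, 3)) / 9.0            # a simple averaging (blur) kernel
    print(conv2d(image, kernel))              # 3x3 feature map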

42 of 112

Feature Extraction with Convolution: A Case Study

  • X or X?

An image is represented as a matrix of pixel values… and computers are literal!

We want to be able to classify an X as an X even if it’s shifted, shrunk, rotated, deformed.

[Figure: two 9x9 grids of pixel values (+1 for ink, -1 for background) showing an X and a shifted/deformed X, with a “?” between them]

43 of 112

Feature Extraction with Convolution: A Case Study

  • Features of X

[Figure: the two 9x9 X images again, highlighting the small local patterns (the diagonals and the center crossing) that both share]
44 of 112

Feature Extraction with Convolution: A Case Study

  • Filters to detect X features

Three 3x3 filters, one per mini-pattern of the X:

Diagonal (\):
 1 -1 -1
-1  1 -1
-1 -1  1

Crossing (X):
 1 -1  1
-1  1 -1
 1 -1  1

Diagonal (/):
-1 -1  1
-1  1 -1
 1 -1 -1

45 of 112

Feature Extraction with Convolution: A Case Study

[Figure: a 3x3 filter is placed over a matching 3x3 patch of the X image; element-wise multiply the filter and the patch, then add the outputs. Every one of the nine products is 1, so the result is 9.]
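In code, the step illustrated above is simply an element-wise multiply followed by a sum (using the crossing filter reconstructed on the previous slide; the patch is assumed to match it exactly):

    import numpy as np

    kernel = np.array([[ 1, -1,  1],
                       [-1,  1, -1],
                       [ 1, -1,  1]])
    patch = kernel.copy()                 # the image patch matches the filter exactly
    print(np.sum(patch * kernel))         # 9: every one of the nine products is +1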

46 of 112

Producing Feature Maps

Original

Sharpen

Edge Detect

“Strong” Edge Detect

47 of 112

Pooling Layer

48 of 112

Pooling Layer

49 of 112

Spatial Pooling

  • Sum, average, or max
  • Non-overlapping / overlapping regions
  • Role of pooling:
    • Invariance to small transformations
    • Larger receptive fields (see more of input)
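A small PyTorch illustration of non-overlapping max (and average) pooling:

    import torch
    import torch.nn.functional as F

    x = torch.arange(16.0).reshape(1, 1, 4, 4)   # batch x channels x 4 x 4
    print(F.max_pool2d(x, kernel_size=2))        # 2x2 output: max of each 2x2 region
    print(F.avg_pool2d(x, kernel_size=2))        # average pooling for comparison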

50 of 112

Convolutional Neural Networks

  • Neural network with specialized connectivity structure
  • Stack multiple stages of feature extractors
  • Higher stages compute more global, more invariant features

51 of 112

Convolutional Neural Networks

  • Feed-forward feature extraction:
    • Convolve input with learned filters: Apply filters to generate feature maps.
    • Non-linearity: Often ReLU.
    • Spatial pooling: Downsampling operation on each feature map.
    • Normalization
  • Supervised training of convolutional filters by back-propagating classification error
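A rough PyTorch sketch of this feature-extraction stage, following the slide’s ordering (filter counts and sizes are illustrative):

    import torch
    import torch.nn as nn

    features = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolve input with 16 learned filters
        nn.ReLU(),                                   # non-linearity
        nn.MaxPool2d(2),                             # spatial pooling (downsampling)
        nn.BatchNorm2d(16),                          # normalization
    )
    x = torch.rand(8, 3, 32, 32)                     # batch of 8 RGB 32x32 images
    print(features(x).shape)                         # torch.Size([8, 16, 16, 16])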

52 of 112

Important Concepts in CNN

  • 1. A convolution layer can have multiple filters, each producing its own feature map.

53 of 112

Important Concepts in CNN

  • 1. A convolution layer can have multiple filters, each producing its own feature map.
  • 2. For tensors (rank ≥ 3), convolution still applies element-wise multiplication, summed over all channels.

54 of 112

Important Concepts in CNN

  • 1. A convolution layer can have multiple filters, each producing its own feature map.
  • 2. For tensors (rank ≥ 3), convolution still applies element-wise multiplication, summed over all channels.
  • 3. The stride is the step size of the filter in the sliding window.

55 of 112

Important Concepts in CNN

  • 1. A convolution layer can have multiple filters, each producing its own feature map.
  • 2. For tensors (rank ≥ 3), convolution still applies element-wise multiplication, summed over all channels.
  • 3. The stride is the step size of the filter in the sliding window.
  • 4. Stacking convolutional layers increases the receptive field.

56 of 112

Important Concepts in CNN

  • 1. A convolution layer can have multiple filters, each producing its own feature map.
  • 2. For tensors (rank ≥ 3), convolution still applies element-wise multiplication, summed over all channels.
  • 3. The stride is the step size of the filter in the sliding window.
  • 4. Stacking convolutional layers increases the receptive field.
  • 5. To include pixels/neurons near the image boundary, we need padding.
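A short PyTorch check that touches all five points (the sizes are illustrative):

    import torch
    import torch.nn as nn

    x = torch.rand(1, 3, 32, 32)                            # RGB input (rank-4 tensor)
    conv = nn.Conv2d(in_channels=3, out_channels=8,         # 8 filters -> 8 feature maps
                     kernel_size=3, stride=2, padding=1)    # stride-2 sliding window, zero padding
    print(conv(x).shape)                                    # torch.Size([1, 8, 16, 16])
    # Stacking 3x3 convolutions grows the receptive field
    # (e.g., two stride-1 3x3 layers see a 5x5 region of the input).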

57 of 112

Representation Learning in Deep CNNs

Conv Layer 1

Conv Layer 2

Conv Layer 3

58 of 112

CNNs for Classification: Feature Learning

  1. Learn features in input image through convolution
  2. Introduce non-linearity through activation function (real-world data is non-linear!)
  3. Reduce dimensionality and preserve spatial invariance with pooling

59 of 112

CNNs for Classification: Class Probabilities

  1. CONV and POOL layers output high-level features of input
  2. Fully connected layer uses these features for classifying input image
  3. Express output as probability of image belonging to a particular class

 

60 of 112

Practice 1: Feature Map Shape

61 of 112

Practice 1: Feature Map Shape

  • Stride = 1 (Default): Moves one pixel at a time

Convolution with 3x3 kernel, zero padding and stride = 1

62 of 112

Practice 1: Feature Map Shape

  • Stride > 1: Moves multiple pixels at a time 🡪 Reduces the output size, leading to downsampling.

Convolution with 3x3 kernel, zero padding and stride = 2
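The general rule behind both cases is the standard output-size formula; the 5x5 input below is assumed purely for illustration:

    output size = ⌊(n + 2p − k) / s⌋ + 1

    e.g., n = 5, k = 3, p = 1 (zero padding):
      stride s = 1 → ⌊(5 + 2 − 3) / 1⌋ + 1 = 5  (size preserved)
      stride s = 2 → ⌊(5 + 2 − 3) / 2⌋ + 1 = 3  (downsampled)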

63 of 112

Practice 2: CNN in PyTorch

64 of 112

Practice 2: CNN in PyTorch
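The practice slides did not survive extraction; a minimal CNN classifier in the spirit of this section might look like the sketch below (the architecture and sizes are assumptions, not the course’s reference code):

    import torch
    import torch.nn as nn

    class SimpleCNN(nn.Module):
        def __init__(self, num_classes=10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 32x32 -> 16x16
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 16x16 -> 8x8
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(32 * 8 * 8, num_classes),   # fully connected layer -> class scores
            )

        def forward(self, x):
            return self.classifier(self.features(x))

    model = SimpleCNN()
    x = torch.rand(4, 3, 32, 32)
    logits = model(x)
    loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1, 2, 3]))
    loss.backward()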

65 of 112

An Architecture for Many Applications

66 of 112

CNN for Text Classification

67 of 112

CNN for Speech Recognition

68 of 112

Deep Learning

Recurrent Neural Networks

69 of 112

So Far

  • Regression, Classification, Dimensionality Reduction, …
  • All based on static, snapshot-style data (no notion of time or order)

70 of 112

Sequence Matters

  • Given an image of a ball, can you predict where it will go next?

???

71 of 112

Sequence Matters

  • How about this? Can you predict where it will go next?

72 of 112

What is a Sequence?

  • Sentence
    • “This morning I took the dog for a walk.”

  • Medical signals / Speech waveform / Vibration measurement

73 of 112

Sequence Modeling

  • Sequence modeling is the task of predicting what comes next
    • E.g., “This morning I took my dog for a walk.”

    • E.g., given historical air quality, forecast air quality over the next couple of hours.

given previous words

predict the next word

74 of 112

A Sequence Modeling Example: Next Word Prediction

  • Idea #1: Use a fixed window

  • Limitation: Cannot model long-term dependencies
    • E.g., “France is where I grew up, but I now live in Boston. I speak fluent ___.”

  • We need information from the distant past to accurately predict the correct word.

“This morning I took my dog for a walk.”

given previous two words

predict the next word
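A small sketch of the fixed-window idea (window of two previous words; the vocabulary size, embedding size, and word indices are hypothetical):

    import torch
    import torch.nn as nn

    vocab_size, emb_dim, window = 1000, 32, 2
    embed = nn.Embedding(vocab_size, emb_dim)
    predictor = nn.Linear(window * emb_dim, vocab_size)   # scores for the next word

    prev_words = torch.tensor([[11, 42]])                 # indices of the two previous words
    x = embed(prev_words).reshape(1, -1)                  # concatenate the two word vectors
    next_word_scores = predictor(x)                       # anything outside the window is ignored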

75 of 112

A Sequence Modeling Example: Next Word Prediction

  • Idea #2: Use entire sequence as set of counts

  • Bag-of-words model (see the sketch below)
    • Define a vocabulary and initialize a zero vector with one element per word
    • Compute each word’s frequency and update the corresponding position in the vector

    • Use the vector for prediction

  • Limitation: Counts don’t preserve order
    • “The food was good, not bad at all.” vs. “The food was bad, not good at all.”

  • We need to preserve the information about order.

“This morning I took my dog for a walk.”

predict the next word

Here 1 is the count for the word “a”

[0 1 0 0 1 0 1 … … 0 0 1 1 0 0 0 1 0]
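A minimal bag-of-words sketch; the tiny vocabulary is made up for illustration, and the two prints show exactly the order problem noted above:

    from collections import Counter

    vocab = ["the", "food", "was", "good", "bad", "not", "at", "all"]

    def bag_of_words(sentence):
        words = sentence.lower().replace(",", "").replace(".", "").split()
        counts = Counter(words)
        return [counts[w] for w in vocab]          # word order is lost here

    print(bag_of_words("The food was good, not bad at all."))
    print(bag_of_words("The food was bad, not good at all."))   # identical vector!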

76 of 112

Sequence Modeling

  • To model sequences, we need to:
    • Handle variable-length sequences
    • Track long-term dependencies
    • Maintain information about order
    • Share parameters across the sequence

  • Solution:
    • Recurrent Neural Networks (RNNs)

77 of 112

Standard Feed-Forward Neural Network

78 of 112

Recurrent Neural Networks

… and many other architectures and applications

79 of 112

A Recurrent Neural Network (RNN)

  • Apply a recurrence relation at every time step to process a sequence:

    h_t = f_W(x_t, h_{t-1})

    where h_t is the (cell) state, f_W is a function with parameters W, x_t is the current input, and h_{t-1} is the old state. At each step the RNN takes an input vector x_t and can produce an output vector ŷ_t.

  • Note: the same function and set of parameters are used at every time step
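A from-scratch sketch of this recurrence as a single tanh RNN cell (dimensions are illustrative):

    import torch
    import torch.nn as nn

    class RNNCell(nn.Module):
        def __init__(self, input_dim, hidden_dim):
            super().__init__()
            self.W_xh = nn.Linear(input_dim, hidden_dim, bias=True)
            self.W_hh = nn.Linear(hidden_dim, hidden_dim, bias=False)

        def forward(self, x_t, h_prev):
            # h_t = f_W(x_t, h_{t-1}): the same weights are reused at every time step
            return torch.tanh(self.W_xh(x_t) + self.W_hh(h_prev))

    cell = RNNCell(input_dim=10, hidden_dim=20)
    h = torch.zeros(1, 20)
    for x_t in torch.rand(5, 1, 10):      # process a length-5 sequence, one step at a time
        h = cell(x_t, h)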

80 of 112

RNN: State Update and Output

  • Update the hidden (cell) state from the old state and the current input:

    h_t = tanh(W_hh h_{t-1} + W_xh x_t)

  • Compute the output vector from the current state:

    ŷ_t = W_hy h_t

    (x_t: input vector, h_{t-1}: old state, h_t: new cell state, ŷ_t: output vector)

81 of 112

RNN: Computational Graph across Time

  • Represent as computational graph unrolled across time

82 of 112

RNN: Computational Graph across Time

  • Represent as computational graph unrolled across time

83 of 112

RNN: Computational Graph across Time

  • Represent as computational graph unrolled across time

84 of 112

RNN: Computational Graph across Time

  • Represent as computational graph unrolled across time

85 of 112

RNN: Computational Graph across Time

  • Re-use the same weight matrices at every time step

86 of 112

RNN: Computational Graph across Time

  • Compute a loss L_t at each time step from the output ŷ_t

87 of 112

RNN: Computational Graph across Time

  • The total loss is the sum over all time steps: L = Σ_t L_t

88 of 112

RNN: Backpropagation Through Time

  • Backpropagation Through Time (BPTT): backpropagate the total loss through the unrolled graph, across every time step, into the shared weights

89 of 112

Standard RNN Gradient Flow

  • Gradients flowing from a later state back to h_0 involve repeated multiplication by W_hh (and by the derivative of the activation) at every time step

90 of 112

Standard RNN Gradient Flow: Exploding Gradients

  • Computing the gradient with respect to the initial state h_0 requires many repeated factors of W_hh

Many values > 1: exploding gradients → use gradient clipping to scale down large gradients

Many values < 1: vanishing gradients → mitigated by the choice of:
  1. Activation function
  2. Weight initialization
  3. Network architecture
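In PyTorch, gradient clipping is a single call between the backward pass and the optimizer step; the tiny model and dummy loss below are placeholders:

    import torch
    import torch.nn as nn

    model = nn.RNN(input_size=8, hidden_size=16)          # any recurrent model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    out, _ = model(torch.rand(5, 1, 8))                   # a length-5 dummy sequence
    loss = out.pow(2).mean()                              # dummy loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # scale big gradients
    optimizer.step()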

91 of 112

Standard RNN Gradient Flow: Vanishing Gradients

  • Computing the gradient with respect to the initial state h_0 requires many repeated factors of W_hh

Many values > 1: exploding gradients → use gradient clipping to scale down large gradients

Many values < 1: vanishing gradients → mitigated by the choice of:
  1. Activation function
  2. Weight initialization
  3. Network architecture

92 of 112

The Problem of Long-Term Dependencies

  • Why are vanishing gradients a problem?

Multiplying many small numbers together means that errors from time steps further back have smaller and smaller gradients, so the parameters become biased toward capturing only short-term dependencies.

93 of 112

Gating Mechanisms in Neurons

  • Use a more complex recurrent unit with gates to control what information is passed through

  • Long Short-Term Memory (LSTM) networks rely on gated cells to track information throughout many time steps.

94 of 112

Standard RNNs

  • In a standard RNN, recurrent modules contain a simple computation (e.g., a single tanh layer)

95 of 112

Long Short-Term Memory (LSTM)

  • In an LSTM network, recurrent modules contain gated cells that control the information flow

96 of 112

Long Short-Term Memory (LSTM)

  • The LSTM cell maintains an internal cell state c_t in addition to the hidden state h_t

97 of 112

Long Short-Term Memory (LSTM)

  • Information is added to or removed from the cell state through structures called gates.

Gates optionally let information through, via a sigmoid layer and pointwise multiplication

98 of 112

LSTM: Forget Irrelevant Information

  • The forget gate decides what to discard from the previous cell state:

    f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

99 of 112

LSTM: Add New Information

  • The input gate decides what new information to store, and a tanh layer proposes candidate values:

    i_t = σ(W_i · [h_{t-1}, x_t] + b_i),   c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)

100 of 112

LSTM: Update Cell State

  • Forget part of the old cell state and add the new candidate information:

    c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t

101 of 112

LSTM: Output Filtered Version of Cell State

  • The output gate decides which filtered version of the cell state to output:

    o_t = σ(W_o · [h_{t-1}, x_t] + b_o),   h_t = o_t ⊙ tanh(c_t)

102 of 112

LSTM: Cell State Matters

  • The cell state is updated only by element-wise operations (forget and add), so information can flow along it across many time steps largely uninterrupted.

103 of 112

LSTM: Mitigate Vanishing Gradient

  • Vanilla RNNs

    The gradient of a late hidden state with respect to an early one is a product of many per-time-step Jacobians, each containing a factor of W_hh and the activation derivative.

    If most factors are < 1, the product shrinks exponentially … Vanish!
    If most factors are > 1, the product grows exponentially … Explode!

104 of 112

LSTM: Mitigate Vanishing Gradient

  • Vanilla RNNs: the gradient along the state must pass through repeated matrix multiplications and nonlinearities.

  • LSTM: the gradient along the cell state passes only through element-wise operations, and the factor from c_t back to c_{t-1} is the forget gate f_t.

So…

We can keep information if we want! (by adjusting how much we forget, i.e., keeping f_t close to 1)

105 of 112

LSTM: Key Concepts

  • Maintain a cell state that is separate from what is output

  • Use gates to control the flow of information
    • The forget gate gets rid of irrelevant information
    • The input gate selectively updates the cell state with new information
    • The output gate returns a filtered version of the cell state

  • LSTMs can mitigate the vanishing gradient problem
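In PyTorch the gated recurrent unit described here is available as nn.LSTM; a minimal usage sketch (sizes are illustrative):

    import torch
    import torch.nn as nn

    lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
    x = torch.rand(4, 7, 10)                   # batch of 4 sequences, 7 steps, 10 features
    output, (h_n, c_n) = lstm(x)               # hidden states h and the separate cell state c
    print(output.shape, h_n.shape, c_n.shape)  # [4, 7, 20], [1, 4, 20], [1, 4, 20]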

106 of 112

RNN Applications & Limitations

107 of 112

RNN Applications & Limitations

108 of 112

RNN Applications & Limitations

109 of 112

RNN Applications & Limitations

  • Limitations
    • Encoding bottleneck
    • Slow, no parallelization

110 of 112

Next

  • Attention and Transformers

111 of 112

  • Language Model: A system that predicts the next word

  • Recurrent Neural Network: A family of neural networks that:
    • Take sequential input of any length; apply the same weights on each step
    • Can optionally produce output on each step

  • Recurrent Neural Network != Language Model
    • RNNs can be used for many other things

  • Language modeling is a traditional subcomponent of many NLP tasks, especially those involving generating text or estimating the probability of text:
    • Now everything in NLP is being rebuilt upon Language Modeling: GPT-3 is an LM!

112 of 112

Assignment #4