1 of 112

Machine Learning, Week 9 – Deep Learning (2)

Seungtaek Choi

Division of Language & AI at HUFS

seungtaek.choi@hufs.ac.kr

2 of 112

Unsupervised Learning

Dimensionality Reduction – Principal Component Analysis (PCA)

3 of 112

Why reduce dimensions?

  • Fewer features mean easier visualization, cheaper computation and storage, less noise, and less risk of overfitting in high dimensions.

4 of 112

Principal Component Analysis (PCA)

  • The 1st PC is the projection direction that maximizes the variance of the projected data
  • The 2nd PC is the projection direction that is orthogonal to the 1st PC and maximizes the variance

5 of 112

Principal Component Analysis (PCA)

  • Principal components are linear combinations of the original variables that have the maximum variance compared to other linear combinations.
  • Essentially, these components capture as much information from the original dataset as possible.

6 of 112

PCA: Conceptual Algorithm

7 of 112

PCA: Conceptual Algorithm

  • 1. Maximize the variance of the projected data (most separable)
  • 2. Equivalently, minimize the sum of squared reconstruction errors (minimum squared error)

8 of 112

PCA Algorithm: Pre-processing

  • Given data

  • Shifting (zero mean) and rescaling (unit variance)
    • 1) Shift to zero mean

    • 2) [optional] Rescaling (unit variance)
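A minimal NumPy sketch of these pre-processing steps followed by PCA via eigendecomposition of the covariance matrix (function and variable names are illustrative, not from the course materials):

    import numpy as np

    def pca(X, k, rescale=False):
        """PCA on X (n_samples x n_features); returns top-k directions and projections."""
        # 1) Shift to zero mean
        X = X - X.mean(axis=0)
        # 2) [optional] Rescale each feature to unit variance
        if rescale:
            X = X / X.std(axis=0)
        S = np.cov(X, rowvar=False)                # sample covariance (features x features)
        eigvals, eigvecs = np.linalg.eigh(S)       # eigenvalues in ascending order
        order = np.argsort(eigvals)[::-1]          # sort by variance, descending
        W = eigvecs[:, order[:k]]                  # top-k principal directions
        Z = X @ W                                  # projected (reduced) data
        return W, Z

    X = np.random.randn(100, 5)                    # dummy 5-D data
    W, Z = pca(X, k=2)
    print(Z.shape)                                 # (100, 2)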

9 of 112

PCA Algorithm: Maximize Variance

  • Find the unit vector w onto which the projected data has maximum variance.

10 of 112

Maximize Variance

  • In an optimization form
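Written out, the optimization form is presumably the standard one (S denotes the sample covariance of the centered data):

    \mathbf{w}_1 = \arg\max_{\|\mathbf{w}\|_2 = 1} \; \mathbf{w}^\top S \,\mathbf{w},
    \qquad
    S = \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i \mathbf{x}_i^\top

Subsequent components maximize the same objective while being orthogonal to the earlier ones.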

11 of 112

 

  • The maximizer is the eigenvector of the sample covariance matrix with the largest eigenvalue; later PCs are the remaining eigenvectors, in decreasing order of eigenvalue.

12 of 112

Dimensionality Reduction Using PCA

  • Project the centered data onto the top-k principal components, z = W_k^T x, to obtain a k-dimensional representation.

13 of 112

PCA is a useful preprocessing step

  • Helps reduce computational complexity.
  • Can help supervised learning.
    • Reduced dimension 🡺 simpler hypothesis space.

  • PCA can also be seen as noise reduction.
  • Caveats:
    • Fails when data consists of multiple separate clusters.
    • Directions of greatest variance may not be most informative (i.e., greatest classification power).
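As a hedged illustration of the "PCA as preprocessing" point, a scikit-learn pipeline that reduces dimensionality before a classifier (the digits dataset and the choice of 16 components are assumptions made for the example):

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline

    X, y = load_digits(return_X_y=True)                 # 64-D pixel features
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Reduce 64 dimensions to 16 before fitting the classifier
    model = Pipeline([("pca", PCA(n_components=16)),
                      ("clf", LogisticRegression(max_iter=1000))])
    model.fit(X_tr, y_tr)
    print(model.score(X_te, y_te))                      # accuracy on held-out data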

14 of 112

Unsupervised Learning

Dimensionality Reduction – Autoencoder
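The autoencoder slides themselves did not survive extraction, so here is only a minimal sketch of the idea (compress the input to a low-dimensional code and train by reconstruction error), assuming flattened 784-dimensional inputs such as MNIST:

    import torch
    import torch.nn as nn

    class Autoencoder(nn.Module):
        def __init__(self, in_dim=784, code_dim=32):
            super().__init__()
            # Encoder compresses the input to a low-dimensional code
            self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                         nn.Linear(128, code_dim))
            # Decoder reconstructs the input from the code
            self.decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(),
                                         nn.Linear(128, in_dim))

        def forward(self, x):
            return self.decoder(self.encoder(x))

    model = Autoencoder()
    x = torch.rand(16, 784)                   # a dummy batch
    loss = nn.MSELoss()(model(x), x)          # reconstruction error
    loss.backward()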

15 of 112

16 of 112

17 of 112

18 of 112

Deep Learning

Convolutional Neural Networks

19 of 112

What Computers “See”?

  • Images are numbers
    • An image is just a matrix of numbers [0, 255]!

20 of 112

What Computers “See”?

  • Images are numbers
    • An image is just a matrix of numbers [0, 255]!
    • e.g., 1080x1080x3 for an RGB image (height x width x color channels)
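As a concrete (synthetic) illustration:

    import numpy as np

    img = np.random.randint(0, 256, size=(1080, 1080, 3), dtype=np.uint8)
    print(img.shape)      # (1080, 1080, 3): height x width x RGB channels
    print(img[0, 0])      # a single pixel: three integers in [0, 255]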

21 of 112

Tasks in Computer Vision

  • Regression: output variable takes continuous value
  • Classification: output variable takes class label. Can produce probability of belonging to a particular class

Input Image → Pixel Representation → classification
Candidate classes: Lincoln, Washington, Jefferson, Obama, Trump
Predicted probabilities: 0.8, 0.05, 0.05, 0.01, 0.09

22 of 112

Tasks in Computer Vision

  • Regression: output variable takes continuous value
  • Classification: output variable takes class label. Can produce probability of belonging to a particular class

23 of 112

24 of 112

25 of 112

26 of 112

  • Optical Character Recognition (OCR)

27 of 112

High-level Feature Detection

  • Let’s identify key features in each image category

Nose, Eyes, Mouth
Wheels, License Plate, Headlights
Door, Windows, Steps

28 of 112

29 of 112

“Learning” Feature Representations

  • Can we learn a hierarchy of features directly from the data instead of hand engineering?

30 of 112

“Learning” Feature Representations

  • Motivation
    • A bird occupies only a local region of an image and looks the same wherever it appears in the image.
    • We should construct neural networks which exploit these properties.

31 of 112

Deep Learning

  • That’s why we learn “Deep Learning” today.

32 of 112

Fully Connected Neural Network

33 of 112

Fully Connected Neural Network

  • Input:
    • 2D image
    • Vector of pixel values
  • Fully Connected:
    • Connect every neuron in the hidden layer to all neurons in the input layer
    • No spatial information!
      • The spatial organization of the input is destroyed by flattening.
    • And many, many parameters!

  • How can we use spatial structure in the input to inform the architecture of the network?

34 of 112

Fully Connected Layer

35 of 112

Locally Connected Layer

36 of 112

Convolutional Layer

37 of 112

Key Idea

  • A standard neural net applied to images:
    • Scales quadratically with the size of the input
    • Does not leverage stationarity

  • Solution:
    • Connect each hidden unit to a small patch of the input
    • Share the weights across space

  • This is called a convolutional layer.
  • A network with convolutional layers is called a convolutional network.

38 of 112

Using Spatial Structure

Input: 2D image, array of pixel values

Idea: connect patches of the input to neurons in the hidden layer.

Each neuron is connected to a region of the input and only “sees” those values.

39 of 112

Using Spatial Structure

Connect patch in input layer to a single neuron in subsequent layer.

Use a sliding window to define connections.

How can we weight the patch to detect particular features?

40 of 112

Feature Extraction with Convolution

  1. Apply a set of weights – a filter – to extract local features
  2. Use multiple filters to extract different features
  3. Spatially share parameters of each filter
    • e.g., a 4x4 filter has 16 weights; apply the same filter to 4x4 patches of the input, shifting by 2 pixels for the next patch

This “patchy” operation is convolution.

41 of 112

The Convolution Operation

 

Image ∗ Kernel → Feature Map
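A minimal NumPy sketch of this operation (strictly speaking cross-correlation, which is what deep-learning libraries implement; no padding, stride 1, and the averaging kernel is just an example):

    import numpy as np

    def conv2d(image, kernel):
        """Slide the kernel over the image; each output is the sum of element-wise products."""
        kh, kw = kernel.shape
        oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
        out = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
        return out

    image = np.arange(25).reshape(5, 5)
    kernel = np.ones((3, 3)) / 9.0            # a simple averaging (blur) kernel
    print(conv2d(image, kernel))              # 3x3 feature map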

42 of 112

Feature Extraction with Convolution: A Case Study

  • X or X?

An image is represented as a matrix of pixel values… and computers are literal!

We want to be able to classify an X as an X even if it’s shifted, shrunk, rotated, deformed.

[Figure: two 9x9 grids of pixel values (+1 for ink, -1 for background) showing an X and a shifted/deformed X, with a “?” between them]

43 of 112

Feature Extraction with Convolution: A Case Study

  • Features of X

[Figure: the two 9x9 X images again, highlighting the small local patterns (the diagonals and the center crossing) that both share]
44 of 112

Feature Extraction with Convolution: A Case Study

  • Filters to detect X features

Three 3x3 filters, one per mini-pattern of the X:

Diagonal (\):
 1 -1 -1
-1  1 -1
-1 -1  1

Crossing (X):
 1 -1  1
-1  1 -1
 1 -1  1

Diagonal (/):
-1 -1  1
-1  1 -1
 1 -1 -1

45 of 112

Feature Extraction with Convolution: A Case Study

[Figure: a 3x3 filter is placed over a matching 3x3 patch of the X image; element-wise multiply the filter and the patch, then add the outputs. Every one of the nine products is 1, so the result is 9.]
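In code, the step illustrated above is simply an element-wise multiply followed by a sum (using the crossing filter reconstructed on the previous slide; the patch is assumed to match it exactly):

    import numpy as np

    kernel = np.array([[ 1, -1,  1],
                       [-1,  1, -1],
                       [ 1, -1,  1]])
    patch = kernel.copy()                 # the image patch matches the filter exactly
    print(np.sum(patch * kernel))         # 9: every one of the nine products is +1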

46 of 112

Producing Feature Maps

Original

Sharpen

Edge Detect

“Strong” Edge Detect

47 of 112

Pooling Layer

48 of 112

Pooling Layer

49 of 112

Spatial Pooling

  • Sum, average, or max
  • Non-overlapping / overlapping regions
  • Role of pooling:
    • Invariance to small transformations
    • Larger receptive fields (see more of input)
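A small PyTorch illustration of non-overlapping max (and average) pooling:

    import torch
    import torch.nn.functional as F

    x = torch.arange(16.0).reshape(1, 1, 4, 4)   # batch x channels x 4 x 4
    print(F.max_pool2d(x, kernel_size=2))        # 2x2 output: max of each 2x2 region
    print(F.avg_pool2d(x, kernel_size=2))        # average pooling for comparison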

50 of 112

Convolutional Neural Networks

  • Neural network with specialized connectivity structure
  • Stack multiple stages of feature extractors
  • Higher stages compute more global, more invariant features

51 of 112

Convolutional Neural Networks

  • Feed-forward feature extraction:
    • Convolve input with learned filters: Apply filters to generate feature maps.
    • Non-linearity: Often ReLU.
    • Spatial pooling: Downsampling operation on each feature map.
    • Normalization
  • Supervised training of convolutional filters by back-propagating classification error
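A rough PyTorch sketch of this feature-extraction stage, following the slide’s ordering (filter counts and sizes are illustrative):

    import torch
    import torch.nn as nn

    features = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolve input with 16 learned filters
        nn.ReLU(),                                   # non-linearity
        nn.MaxPool2d(2),                             # spatial pooling (downsampling)
        nn.BatchNorm2d(16),                          # normalization
    )
    x = torch.rand(8, 3, 32, 32)                     # batch of 8 RGB 32x32 images
    print(features(x).shape)                         # torch.Size([8, 16, 16, 16])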

52 of 112

Important Concepts in CNN

  • 1. A convolution layer can have multiple filters, each producing its own feature map.

53 of 112

Important Concepts in CNN

  • 1. A convolution layer can have multiple filters, each producing its own feature map.
  • 2. For tensors (rank ≥ 3), convolution still applies element-wise multiplication, summed over all channels.

54 of 112

Important Concepts in CNN

  • 1. A convolution layer can have multiple filters, each producing its own feature map.
  • 2. For tensors (rank ≥ 3), convolution still applies element-wise multiplication, summed over all channels.
  • 3. The stride is the step size of the filter in the sliding window.

55 of 112

Important Concepts in CNN

  • 1. A convolution layer can have multiple filters, each producing its own feature map.
  • 2. For tensors (rank ≥ 3), convolution still applies element-wise multiplication, summed over all channels.
  • 3. The stride is the step size of the filter in the sliding window.
  • 4. Stacking convolutional layers increases the receptive field.

56 of 112

Important Concepts in CNN

  • 1. A convolution layer can have multiple filters, each producing its own feature map.
  • 2. For tensors (rank ≥ 3), convolution still applies element-wise multiplication, summed over all channels.
  • 3. The stride is the step size of the filter in the sliding window.
  • 4. Stacking convolutional layers increases the receptive field.
  • 5. To include pixels/neurons near the image boundary, we need padding.
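A short PyTorch check that touches all five points (the sizes are illustrative):

    import torch
    import torch.nn as nn

    x = torch.rand(1, 3, 32, 32)                            # RGB input (rank-4 tensor)
    conv = nn.Conv2d(in_channels=3, out_channels=8,         # 8 filters -> 8 feature maps
                     kernel_size=3, stride=2, padding=1)    # stride-2 sliding window, zero padding
    print(conv(x).shape)                                    # torch.Size([1, 8, 16, 16])
    # Stacking 3x3 convolutions grows the receptive field
    # (e.g., two stride-1 3x3 layers see a 5x5 region of the input).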

57 of 112

Representation Learning in Deep CNNs

Conv Layer 1

Conv Layer 2

Conv Layer 3

58 of 112

CNNs for Classification: Feature Learning

  1. Learn features in input image through convolution
  2. Introduce non-linearity through activation function (real-world data is non-linear!)
  3. Reduce dimensionality and preserve spatial invariance with pooling

59 of 112

CNNs for Classification: Class Probabilities

  1. CONV and POOL layers output high-level features of input
  2. Fully connected layer uses these features for classifying input image
  3. Express output as probability of image belonging to a particular class

 

60 of 112

Practice 1: Feature Map Shape

61 of 112

Practice 1: Feature Map Shape

  • Stride = 1 (Default): Moves one pixel at a time

Convolution with 3x3 kernel, zero padding and stride = 1

62 of 112

Practice 1: Feature Map Shape

  • Stride > 1: Moves multiple pixels at a time 🡪 Reduces the output size, leading to downsampling.

Convolution with 3x3 kernel, zero padding and stride = 2
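The general rule behind both cases is the standard output-size formula; the 5x5 input below is assumed purely for illustration:

    output size = ⌊(n + 2p − k) / s⌋ + 1

    e.g., n = 5, k = 3, p = 1 (zero padding):
      stride s = 1 → ⌊(5 + 2 − 3) / 1⌋ + 1 = 5  (size preserved)
      stride s = 2 → ⌊(5 + 2 − 3) / 2⌋ + 1 = 3  (downsampled)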

63 of 112

Practice 2: CNN in PyTorch

64 of 112

Practice 2: CNN in PyTorch
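The practice slides did not survive extraction; a minimal CNN classifier in the spirit of this section might look like the sketch below (the architecture and sizes are assumptions, not the course’s reference code):

    import torch
    import torch.nn as nn

    class SimpleCNN(nn.Module):
        def __init__(self, num_classes=10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 32x32 -> 16x16
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 16x16 -> 8x8
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(32 * 8 * 8, num_classes),   # fully connected layer -> class scores
            )

        def forward(self, x):
            return self.classifier(self.features(x))

    model = SimpleCNN()
    x = torch.rand(4, 3, 32, 32)
    logits = model(x)
    loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1, 2, 3]))
    loss.backward()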

65 of 112

An Architecture for Many Applications

66 of 112

CNN for Text Classification

67 of 112

CNN for Speech Recognition

68 of 112

Deep Learning

Recurrent Neural Networks

69 of 112

So Far

  • Regression, Classification, Dimensionality Reduction, …
  • All based on static, snapshot-style data (no notion of time or order)

70 of 112

Sequence Matters

  • Given an image of a ball, can you predict where it will go next?

???

71 of 112

Sequence Matters

  • How about this? Can you predict where it will go next?

72 of 112

What is a Sequence?

  • Sentence
    • “This morning I took the dog for a walk.”

  • Medical signals / Speech waveform / Vibration measurement

73 of 112

Sequence Modeling

  • Sequence modeling is the task of predicting what comes next
    • E.g., “This morning I took my dog for a walk.”

    • E.g., given historical air quality, forecast air quality over the next couple of hours.

given previous words

predict the next word

74 of 112

A Sequence Modeling Example: Next Word Prediction

  • Idea #1: Use a fixed window

  • Limitation: Cannot model long-term dependencies
    • E.g., “France is where I grew up, but I now live in Boston. I speak fluent ___.”

  • We need information from the distant past to accurately predict the correct word.

“This morning I took my dog for a walk.”

given previous two words

predict the next word
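A small sketch of the fixed-window idea (window of two previous words; the vocabulary size, embedding size, and word indices are hypothetical):

    import torch
    import torch.nn as nn

    vocab_size, emb_dim, window = 1000, 32, 2
    embed = nn.Embedding(vocab_size, emb_dim)
    predictor = nn.Linear(window * emb_dim, vocab_size)   # scores for the next word

    prev_words = torch.tensor([[11, 42]])                 # indices of the two previous words
    x = embed(prev_words).reshape(1, -1)                  # concatenate the two word vectors
    next_word_scores = predictor(x)                       # anything outside the window is ignored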

75 of 112

A Sequence Modeling Example: Next Word Prediction

  • Idea #2: Use entire sequence as set of counts

  • Bag-of-words model (see the sketch below)
    • Define a vocabulary and initialize a zero vector with one element per word
    • Compute each word’s frequency and update the corresponding position in the vector

    • Use the vector for prediction

  • Limitation: Counts don’t preserve order
    • “The food was good, not bad at all.” vs. “The food was bad, not good at all.”

  • We need to preserve the information about order.

“This morning I took my dog for a walk.”

predict the next word

Here 1 is the count for the word “a”

[0 1 0 0 1 0 1 … … 0 0 1 1 0 0 0 1 0]
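A minimal bag-of-words sketch; the tiny vocabulary is made up for illustration, and the two prints show exactly the order problem noted above:

    from collections import Counter

    vocab = ["the", "food", "was", "good", "bad", "not", "at", "all"]

    def bag_of_words(sentence):
        words = sentence.lower().replace(",", "").replace(".", "").split()
        counts = Counter(words)
        return [counts[w] for w in vocab]          # word order is lost here

    print(bag_of_words("The food was good, not bad at all."))
    print(bag_of_words("The food was bad, not good at all."))   # identical vector!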

76 of 112

Sequence Modeling

  • To model sequences, we need to:
    • Handle variable-length sequences
    • Track long-term dependencies
    • Maintain information about order
    • Share parameters across the sequence

  • Solution:
    • Recurrent Neural Networks (RNNs)

77 of 112

Standard Feed-Forward Neural Network

78 of 112

Recurrent Neural Networks

… and many other architectures and applications

79 of 112

A Recurrent Neural Network (RNN)

  • Apply a recurrence relation at every time step to process a sequence:

    h_t = f_W(x_t, h_{t-1})

    where h_t is the (cell) state, f_W is a function with parameters W, x_t is the current input, and h_{t-1} is the old state. At each step the RNN takes an input vector x_t and can produce an output vector ŷ_t.

  • Note: the same function and set of parameters are used at every time step
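A from-scratch sketch of this recurrence as a single tanh RNN cell (dimensions are illustrative):

    import torch
    import torch.nn as nn

    class RNNCell(nn.Module):
        def __init__(self, input_dim, hidden_dim):
            super().__init__()
            self.W_xh = nn.Linear(input_dim, hidden_dim, bias=True)
            self.W_hh = nn.Linear(hidden_dim, hidden_dim, bias=False)

        def forward(self, x_t, h_prev):
            # h_t = f_W(x_t, h_{t-1}): the same weights are reused at every time step
            return torch.tanh(self.W_xh(x_t) + self.W_hh(h_prev))

    cell = RNNCell(input_dim=10, hidden_dim=20)
    h = torch.zeros(1, 20)
    for x_t in torch.rand(5, 1, 10):      # process a length-5 sequence, one step at a time
        h = cell(x_t, h)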

80 of 112

RNN: State Update and Output

  • Update the hidden (cell) state from the old state and the current input:

    h_t = tanh(W_hh h_{t-1} + W_xh x_t)

  • Compute the output vector from the current state:

    ŷ_t = W_hy h_t

    (x_t: input vector, h_{t-1}: old state, h_t: new cell state, ŷ_t: output vector)

81 of 112

RNN: Computational Graph across Time

  • Represent as computational graph unrolled across time

82 of 112

RNN: Computational Graph across Time

  • Represent as computational graph unrolled across time

83 of 112

RNN: Computational Graph across Time

  • Represent as computational graph unrolled across time

84 of 112

RNN: Computational Graph across Time

  • Represent as computational graph unrolled across time

85 of 112

RNN: Computational Graph across Time

  • Re-use the same weight matrices at every time step

86 of 112

RNN: Computational Graph across Time

  • Compute a loss L_t at each time step from the output ŷ_t

87 of 112

RNN: Computational Graph across Time

  • The total loss is the sum over all time steps: L = Σ_t L_t

88 of 112

RNN: Backpropagation Through Time

  • Backpropagation Through Time (BPTT): backpropagate the total loss through the unrolled graph, across every time step, into the shared weights

89 of 112

Standard RNN Gradient Flow

  • Gradients flowing from a later state back to h_0 involve repeated multiplication by W_hh (and by the derivative of the activation) at every time step

90 of 112

Standard RNN Gradient Flow: Exploding Gradients

  • Computing the gradient with respect to the initial state h_0 requires many repeated factors of W_hh

Many values > 1: exploding gradients → use gradient clipping to scale down large gradients

Many values < 1: vanishing gradients → mitigated by the choice of:
  1. Activation function
  2. Weight initialization
  3. Network architecture
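In PyTorch, gradient clipping is a single call between the backward pass and the optimizer step; the tiny model and dummy loss below are placeholders:

    import torch
    import torch.nn as nn

    model = nn.RNN(input_size=8, hidden_size=16)          # any recurrent model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    out, _ = model(torch.rand(5, 1, 8))                   # a length-5 dummy sequence
    loss = out.pow(2).mean()                              # dummy loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # scale big gradients
    optimizer.step()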

91 of 112

Standard RNN Gradient Flow: Vanishing Gradients

  • Computing the gradient with respect to the initial state h_0 requires many repeated factors of W_hh

Many values > 1: exploding gradients → use gradient clipping to scale down large gradients

Many values < 1: vanishing gradients → mitigated by the choice of:
  1. Activation function
  2. Weight initialization
  3. Network architecture

92 of 112

The Problem of Long-Term Dependencies

  • Why are vanishing gradients a problem?

Multiplying many small numbers together means that errors from time steps further back have smaller and smaller gradients, so the parameters become biased toward capturing only short-term dependencies.

93 of 112

Gating Mechanisms in Neurons

  • Use a more complex recurrent unit with gates to control what information is passed through

  • Long Short-Term Memory (LSTM) networks rely on gated cells to track information throughout many time steps.

94 of 112

Standard RNNs

  • In a standard RNN, recurrent modules contain a simple computation (e.g., a single tanh layer)

95 of 112

Long Short-Term Memory (LSTM)

  • In an LSTM network, recurrent modules contain gated cells that control the information flow

96 of 112

Long Short-Term Memory (LSTM)

  • The LSTM cell maintains an internal cell state c_t in addition to the hidden state h_t

97 of 112

Long Short-Term Memory (LSTM)

  • Information is added to or removed from the cell state through structures called gates.

Gates optionally let information through, via a sigmoid layer and pointwise multiplication

98 of 112

LSTM: Forget Irrelevant Information

  • The forget gate decides what to discard from the previous cell state:

    f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

99 of 112

LSTM: Add New Information

  • The input gate decides what new information to store, and a tanh layer proposes candidate values:

    i_t = σ(W_i · [h_{t-1}, x_t] + b_i),   c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)

100 of 112

LSTM: Update Cell State

  • Forget part of the old cell state and add the new candidate information:

    c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t

101 of 112

LSTM: Output Filtered Version of Cell State

  • The output gate decides which filtered version of the cell state to output:

    o_t = σ(W_o · [h_{t-1}, x_t] + b_o),   h_t = o_t ⊙ tanh(c_t)

102 of 112

LSTM: Cell State Matters

  • The cell state is updated only by element-wise operations (forget and add), so information can flow along it across many time steps largely uninterrupted.

103 of 112

LSTM: Mitigate Vanishing Gradient

  • Vanilla RNNs

    The gradient of a late hidden state with respect to an early one is a product of many per-time-step Jacobians, each containing a factor of W_hh and the activation derivative.

    If most factors are < 1, the product shrinks exponentially … Vanish!
    If most factors are > 1, the product grows exponentially … Explode!

104 of 112

LSTM: Mitigate Vanishing Gradient

  • Vanilla RNNs: the gradient along the state must pass through repeated matrix multiplications and nonlinearities.

  • LSTM: the gradient along the cell state passes only through element-wise operations, and the factor from c_t back to c_{t-1} is the forget gate f_t.

So…

We can keep information if we want! (by adjusting how much we forget, i.e., keeping f_t close to 1)

105 of 112

LSTM: Key Concepts

  • Maintain a cell state that is separate from what is output

  • Use gates to control the flow of information
    • The forget gate gets rid of irrelevant information
    • The input gate selectively updates the cell state with new information
    • The output gate returns a filtered version of the cell state

  • LSTMs can mitigate the vanishing gradient problem
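In PyTorch the gated recurrent unit described here is available as nn.LSTM; a minimal usage sketch (sizes are illustrative):

    import torch
    import torch.nn as nn

    lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
    x = torch.rand(4, 7, 10)                   # batch of 4 sequences, 7 steps, 10 features
    output, (h_n, c_n) = lstm(x)               # hidden states h and the separate cell state c
    print(output.shape, h_n.shape, c_n.shape)  # [4, 7, 20], [1, 4, 20], [1, 4, 20]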

106 of 112

RNN Applications & Limitations

107 of 112

RNN Applications & Limitations

108 of 112

RNN Applications & Limitations

109 of 112

RNN Applications & Limitations

  • Limitations
    • Encoding bottleneck
    • Slow, no parallelization

110 of 112

Next

  • Attention and Transformers

111 of 112

  • Language Model: A system that predicts the next word

  • Recurrent Neural Network: A family of neural networks that:
    • Take sequential input of any length; apply the same weights on each step
    • Can optionally produce output on each step

  • Recurrent Neural Network != Language Model
    • RNNs can be used for many other things

  • Language modeling is a traditional subcomponent of many NLP tasks, especially those involving generating text or estimating the probability of text:
    • Now everything in NLP is being rebuilt upon Language Modeling: GPT-3 is an LM!

112 of 112

Assignment #4