1 of 109

Machine Learning for English Analysis
Week 9 – Deep Learning (2)

Seungtaek Choi

Division of Language & AI at HUFS

seungtaek.choi@hufs.ac.kr

2 of 109

Unsupervised Learning

Dimensionality Reduction – Principal Component Analysis (PCA)

3 of 109

Why reduce dimensions?

  •  

4 of 109

Principal Component Analysis (PCA)

  • The 1st PC is the projection direction that maximizes the variance of the projected data
  • The 2nd PC is the projection direction that is orthogonal to the 1st PC and maximizes the variance

5 of 109

Principal Component Analysis (PCA)

  • Principal components are linear combinations of the original variables that have the maximum variance among all such linear combinations.
  • Essentially, these components capture as much information from the original dataset as possible.

6 of 109

PCA: Conceptual Algorithm

7 of 109

PCA: Conceptual Algorithm

  • 1. Maximize variance (most separable)
  • 2. Minimize the sum-of-squares (minimum squared error)

8 of 109

PCA Algorithm: Pre-processing

  • Given data

  • Shifting (zero mean) and rescaling (unit variance)
    • 1) Shift to zero mean

    • 2) [optional] Rescaling (unit variance)
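A minimal NumPy sketch of these two pre-processing steps, assuming a data matrix X with one row per example (the example data is illustrative):

```python
import numpy as np

# Illustrative data matrix: one row per example, one column per feature
X = np.random.rand(100, 5)

# 1) Shift to zero mean: subtract the per-feature mean
X_centered = X - X.mean(axis=0)

# 2) [optional] Rescale to unit variance: divide by the per-feature standard deviation
X_standardized = X_centered / X_centered.std(axis=0)
```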

9 of 109

PCA Algorithm: Maximize Variance

  •  

10 of 109

Maximize Variance

  • In an optimization form
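A standard way to write this objective, using S for the sample covariance of the centered data (the notation is assumed here):

```latex
w_1 = \arg\max_{w} \; w^\top S\, w \quad \text{subject to} \quad \|w\|_2 = 1
```

The maximizer is the eigenvector of S with the largest eigenvalue; the k-th principal direction is the eigenvector with the k-th largest eigenvalue.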

11 of 109

 

  •  

12 of 109

Dimensionality Reduction Using PCA

  •  

13 of 109

PCA is a useful preprocessing step

  • Helps reduce computational complexity.
  • Can help supervised learning.
    • Reduced dimension 🡺 simpler hypothesis space.

  • PCA can also be seen as noise reduction.
  • Caveats:
    • Fails when data consists of multiple separate clusters.
    • Directions of greatest variance may not be most informative (i.e., greatest classification power).
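Putting the pieces together, a minimal NumPy sketch of PCA as described above (center, compute the covariance, take the top-k eigenvectors, project); the variable names and k = 2 are illustrative assumptions:

```python
import numpy as np

def pca(X, k=2):
    """Project X (n_samples, n_features) onto its top-k principal components."""
    # 1) Center the data (zero mean per feature)
    X_centered = X - X.mean(axis=0)
    # 2) Sample covariance matrix (n_features, n_features)
    cov = np.cov(X_centered, rowvar=False)
    # 3) Eigendecomposition; eigh is used because the covariance matrix is symmetric
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4) Sort directions by decreasing variance (eigenvalue)
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:k]]      # (n_features, k)
    # 5) Project onto the k principal directions
    return X_centered @ components          # (n_samples, k)

X = np.random.randn(200, 10)
Z = pca(X, k=2)
print(Z.shape)  # (200, 2)
```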

14 of 109

Unsupervised Learning

Dimensionality Reduction – Autoencoder
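In the same spirit as PCA, an autoencoder learns a low-dimensional code by training a network to reconstruct its own input. A minimal PyTorch sketch (the layer sizes, 784-dimensional input, and MSE reconstruction loss are illustrative assumptions, not the slides' exact architecture):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder compresses the input into a low-dimensional code
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoder reconstructs the input from the code
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)           # low-dimensional representation
        return self.decoder(z), z

model = Autoencoder()
x = torch.randn(16, 784)              # dummy batch of flattened images
x_hat, z = model(x)
loss = nn.MSELoss()(x_hat, x)         # train to reconstruct the input
```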

15 of 109

16 of 109

17 of 109

18 of 109

Deep Learning

Convolutional Neural Networks

19 of 109

What Computers “See”?

  • Images are numbers
    • An image is just a matrix of numbers [0, 255]!
    • e.g., 1080x1080x3 for an RGB image

20 of 109

What Computers “See”?

  • Images are numbers
    • An image is just a matrix of numbers [0, 255]!
    • e.g., 1080x1080x3 for an RGB image

21 of 109

Tasks in Computer Vision

  • Regression: output variable takes continuous value
  • Classification: output variable takes class label. Can produce probability of belonging to a particular class

[Figure: an input image is converted to its pixel representation and classified, producing class probabilities: Lincoln 0.8, Washington 0.05, Jefferson 0.05, Obama 0.01, Trump 0.09]

22 of 109

Tasks in Computer Vision

  • Regression: output variable takes continuous value
  • Classification: output variable takes class label. Can produce probability of belonging to a particular class

23 of 109

24 of 109

25 of 109

26 of 109

27 of 109

High-level Feature Detection

  • Let’s identify key features in each image category

[Figure: key features for each image category: nose, eyes, mouth; wheels, license plate, headlights; door, windows, steps]

28 of 109

29 of 109

“Learning” Feature Representations

  • Can we learn a hierarchy of features directly from the data instead of hand engineering?

30 of 109

“Learning” Feature Representations

  • Motivation
    • The bird occupies a local area and looks the same in different parts of an image.
    • We should construct neural networks which exploit these properties.

31 of 109

Deep Learning

  • That’s why we learn “Deep Learning” today.

32 of 109

Fully Connected Neural Network

33 of 109

Fully Connected Neural Network

  • Input:
    • 2D image
    • Vector of pixel values
  • Fully Connected:
    • Connect neuron in hidden layer to all neurons in input layer
    • No spatial information!
      • Spatial organization of the input is destroyed by flatten.
    • And, many, many parameters!

  • How can we use spatial structure in the input to inform the architecture of the network?

34 of 109

Fully Connected Layer

35 of 109

Locally Connected Layer

36 of 109

Convolutional Layer

37 of 109

Key Idea

  • A standard neural net applied to images:
    • Scales quadratically with the size of the input
    • Does not leverage stationarity

  • Solution:
    • Connect each hidden unit to a small patch of the input
    • Share the weight across space

  • This is called: convolutional layer.
  • A network with convolutional layers is called convolutional network.
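A quick way to see the difference in parameter count; the 224x224 RGB input, 64 output units/channels, and 3x3 kernel below are illustrative assumptions:

```python
import torch.nn as nn

# Fully connected: every hidden unit sees every input pixel
fc = nn.Linear(224 * 224 * 3, 64)
# Convolutional: each unit sees a 3x3 patch, and the weights are shared across space
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(fc))    # 9633856 parameters (~9.6M)
print(count(conv))  # 1792 parameters
```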

38 of 109

Using Spatial Structure

Input: 2D image, array of pixel values

Idea: connect patches of input to neurons in hidden layer.

Neuron connected to region of input. Only “sees” these values.

39 of 109

Using Spatial Structure

Connect patch in input layer to a single neuron in subsequent layer.

Use a sliding window to define connections.

How can we weight the patch to detect particular features?

40 of 109

Feature Extraction with Convolution

  1. Apply a set of weights – a filter – to extract local features

  • Use multiple filters to extract different features

  • Spatially share parameters of each filter
  • Filter of size 4x4: 16 different weights
  • Apply this same filter to 4x4 patches in input
  • Shift by 2 pixels for next patch

This “patchy” operation is convolution.
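A minimal NumPy sketch of this "patchy" operation using the numbers above (a 4x4 filter, shifting by 2 pixels); the 8x8 input size is an illustrative assumption:

```python
import numpy as np

image = np.random.rand(8, 8)     # illustrative input
filt = np.random.rand(4, 4)      # one 4x4 filter = 16 shared weights
stride = 2

out_size = (image.shape[0] - filt.shape[0]) // stride + 1   # (8 - 4) / 2 + 1 = 3
feature_map = np.zeros((out_size, out_size))
for i in range(out_size):
    for j in range(out_size):
        patch = image[i*stride:i*stride+4, j*stride:j*stride+4]
        feature_map[i, j] = np.sum(patch * filt)   # element-wise multiply, then add
print(feature_map.shape)  # (3, 3)
```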

41 of 109

The Convolution Operation

 

Image

Kernel

Feature Map

42 of 109

Feature Extraction with Convolution: A Case Study

  • X or X?

Image is represented as matrix of pixel values… and computers are literal!

We want to be able to classify an X as an X even if it’s shifted, shrunk, rotated, deformed.

[Figure: two 9×9 grids of +1/−1 pixel values, a canonical “X” and a shifted/deformed “X”, asking whether the second should still be recognized as an X]

43 of 109

Feature Extraction with Convolution: A Case Study

  • Features of X

[Figure: the 9×9 “X” pixel grids with 3×3 sub-patches highlighted as the recurring features of an X: the diagonal strokes and the central crossing]

44 of 109

Feature Extraction with Convolution: A Case Study

  • Filters to detect X features

Filter 1 (diagonal):
 1  -1  -1
-1   1  -1
-1  -1   1

Filter 2 (crossing):
 1  -1   1
-1   1  -1
 1  -1   1

Filter 3 (anti-diagonal):
-1  -1   1
-1   1  -1
 1  -1  -1

45 of 109

Feature Extraction with Convolution: A Case Study

[Figure: the diagonal filter is aligned with a matching 3×3 patch of the “X” image; every element-wise product is 1 (e.g., 1 × 1 = 1 and −1 × −1 = 1), and adding the nine outputs gives 9, the maximum possible response]
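The same element-wise multiply-and-add, written out in NumPy for the diagonal filter and a perfectly matching 3×3 patch:

```python
import numpy as np

filt = np.array([[ 1, -1, -1],
                 [-1,  1, -1],
                 [-1, -1,  1]])
patch = filt.copy()           # a patch of the image that matches the filter exactly
print(np.sum(filt * patch))   # element-wise multiply, add outputs -> 9
```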

46 of 109

Producing Feature Maps

[Figure: the same image convolved with different kernels: Original, Sharpen, Edge Detect, “Strong” Edge Detect]

47 of 109

Convolutional Neural Networks

  • Neural network with specialized connectivity structure
  • Stack multiple stages of feature extractors
  • Higher stages compute more global, more invariant features

48 of 109

Convolutional Neural Networks

  • Feed-forward feature extraction:
    • Convolve input with learned filters: Apply filters to generate feature maps.
    • Non-linearity: Often ReLU.
    • Spatial pooling: Downsampling operation on each feature map.
    • Normalization
  • Supervised training of convolutional filters by back-propagating classification error

49 of 109

CNNs: Spatial Arrangement of Output Volume

50 of 109

CNNs: Spatial Arrangement of Output Volume

51 of 109

Spatial Pooling

  • Sum or max
  • Non-overlapping / overlapping regions
  • Role of pooling:
    • Invariance to small transformations
    • Larger receptive fields (see more of input)
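A minimal PyTorch example of non-overlapping max pooling; the 2x2 window and the 4x4 single-channel input are illustrative assumptions:

```python
import torch
import torch.nn as nn

x = torch.tensor([[1., 3., 2., 1.],
                  [4., 2., 0., 1.],
                  [1., 0., 6., 5.],
                  [2., 1., 7., 8.]]).reshape(1, 1, 4, 4)  # (batch, channel, H, W)

pool = nn.MaxPool2d(kernel_size=2, stride=2)  # non-overlapping 2x2 windows
print(pool(x).squeeze())
# tensor([[4., 2.],
#         [2., 8.]])
```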

52 of 109

Pooling Layer

53 of 109

Pooling Layer

54 of 109

Representation Learning in Deep CNNs

[Figure: features visualized at Conv Layer 1, Conv Layer 2, and Conv Layer 3, becoming increasingly high-level from layer to layer]

55 of 109

CNNs for Classification: Feature Learning

  1. Learn features in input image through convolution
  2. Introduce non-linearity through activation function (real-world data is non-linear!)
  3. Reduce dimensionality and preserve spatial invariance with pooling

56 of 109

CNNs for Classification: Class Probabilities

  1. CONV and POOL layers output high-level features of input
  2. Fully connected layer uses these features for classifying input image
  3. Express output as probability of image belonging to a particular class

 

57 of 109

Practice 1: Feature Map Shape

58 of 109

Practice 1: Feature Map Shape

  • Stride = 1 (Default): Moves one pixel at a time

Convolution with 3x3 kernel, zero padding and stride = 1

59 of 109

Practice 1: Feature Map Shape

  • Stride > 1: Moves multiple pixels at a time 🡪 Reduces the output size, leading to downsampling.

Convolution with 3x3 kernel, zero padding and stride = 2
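A standard formula for the resulting feature-map size, where W is the input width/height, K the kernel size, P the padding on each side, and S the stride (the symbols are assumed here for illustration):

```latex
W_{\text{out}} = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1
```

For example, a 5×5 input with a 3×3 kernel, padding 1, and stride 2 gives ⌊(5 − 3 + 2)/2⌋ + 1 = 3, i.e., a 3×3 feature map.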

60 of 109

Practice 2: CNN in PyTorch

61 of 109

Practice 2: CNN in PyTorch
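A minimal sketch of a small CNN in PyTorch in the spirit of this practice (conv → ReLU → pool, twice, then a fully connected classifier); the channel sizes, 28x28 grayscale input, and 10 output classes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # 28x28 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 28x28 -> 14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # 14x14 -> 14x14
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 14x14 -> 7x7
        )
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = x.flatten(1)              # flatten all dims except the batch dim
        return self.classifier(x)     # class scores (logits)

model = SimpleCNN()
images = torch.randn(8, 1, 28, 28)    # dummy batch of grayscale images
logits = model(images)
print(logits.shape)                   # torch.Size([8, 10])
```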

62 of 109

An Architecture for Many Applications

63 of 109

CNN for Text Classification

64 of 109

CNN for Speech Recognition

65 of 109

Deep Learning

Recurrent Neural Networks

66 of 109

So Far

  • Regression, Classification, Dimension Reduction, …
  • Based on snapshot-type data

67 of 109

Sequence Matters

  • Given an image of a ball, can you predict where it will go next?

???

68 of 109

Sequence Matters

  • How about this? Can you predict where it will go next?

69 of 109

What is a Sequence?

  • Sentence
    • “This morning I took the dog for a walk.”

  • Medical signals / Speech waveform / Vibration measurement

70 of 109

Sequence Modeling

  • Sequence modeling is the task of predicting what comes next
    • E.g., “This morning I took my dog for a walk.”

    • E.g., given historical air quality, forecast air quality in next couple of hours.

given previous words

predict the next word

71 of 109

A Sequence Modeling Example: Next Word Prediction

  • Idea #1: Use a fixed window

  • Limitation: Cannot model long-term dependencies
    • E.g., “France is where I grew up, but I now live in Boston. I speak fluent ___.”

  • We need information from the distant past to accurately predict the correct word.

“This morning I took my dog for a walk.”: given the previous two words, predict the next word

72 of 109

A Sequence Modeling Example: Next Word Prediction

  • Idea #2: Use entire sequence as set of counts

  • Bag-of-words model
    • Define a vocabulary and initialize a zero vector with one element per word
    • Compute each word’s frequency and update the corresponding position in the vector (see the sketch below)

    • Use the vector for prediction

  • Limitation: Counts don’t preserve order
    • “The food was good, not bad at all.” vs. “The food was bad, not good at all.”

  • We need to preserve the information about order.

“This morning I took my dog for a walk.” → predict the next word

[0 1 0 0 1 0 1 … … 0 0 1 1 0 0 0 1 0]   (here, 1 is the count for the word “a”)
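A minimal sketch of this bag-of-words representation, which also shows the order-loss limitation; the tiny vocabulary is an illustrative assumption:

```python
from collections import Counter

vocab = ["the", "food", "was", "good", "bad", "not", "at", "all"]

def bag_of_words(sentence):
    # Zero vector with one slot per vocabulary word, filled with word counts
    tokens = sentence.lower().replace(",", "").replace(".", "").split()
    counts = Counter(tokens)
    return [counts.get(w, 0) for w in vocab]

a = bag_of_words("The food was good, not bad at all.")
b = bag_of_words("The food was bad, not good at all.")
print(a)       # [1, 1, 1, 1, 1, 1, 1, 1]
print(a == b)  # True -- counts do not preserve word order
```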

73 of 109

Sequence Modeling

  • To model sequences, we need to:
    • Handle variable-length sequences
    • Track long-term dependencies
    • Maintain information about order
    • Share parameters across the sequence

  • Solution:
    • Recurrent Neural Networks (RNNs)

74 of 109

Standard Feed-Forward Neural Network

75 of 109

Recurrent Neural Networks

… and many other architectures and applications

76 of 109

A Recurrent Neural Network (RNN)

  • Apply a recurrence relation at every time step to process a sequence:

  • Note: the same function and set of parameters are used at every time step

[Figure: RNN cell taking the current input vector and the old state, updating the cell state, and producing an output vector]

77 of 109

RNN: State Update and Output

  •  
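A common way to write this state update and output (the weight names W_hh, W_xh, W_hy follow a standard convention and are assumed here, not taken from the slide):

```latex
h_t = \tanh\!\left(W_{hh}\, h_{t-1} + W_{xh}\, x_t\right), \qquad \hat{y}_t = W_{hy}\, h_t
```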

[Figure: the same RNN cell, annotated with its input vector, old state, updated cell state, and output vector]

78 of 109

RNN: Computational Graph across Time

  • Represent as computational graph unrolled across time

79 of 109

RNN: Computational Graph across Time

  • Represent as computational graph unrolled across time

80 of 109

RNN: Computational Graph across Time

  • Represent as computational graph unrolled across time

81 of 109

RNN: Computational Graph across Time

  • Represent as computational graph unrolled across time

82 of 109

RNN: Computational Graph across Time

  • Re-use the same weight matrices at every time step
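A minimal PyTorch sketch of this weight re-use: one recurrent cell applied in a loop, with the same parameters at every time step (all dimensions are illustrative assumptions):

```python
import torch
import torch.nn as nn

input_dim, hidden_dim, seq_len = 8, 16, 5
cell = nn.RNNCell(input_dim, hidden_dim)   # one set of weights, shared across time

x = torch.randn(seq_len, 1, input_dim)     # (time, batch, features)
h = torch.zeros(1, hidden_dim)             # initial state h_0

for t in range(seq_len):
    h = cell(x[t], h)                      # same cell (same weights) at every step
print(h.shape)                             # torch.Size([1, 16])
```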

83 of 109

RNN: Computational Graph across Time

  •  

84 of 109

RNN: Computational Graph across Time

  •  

85 of 109

RNN: Backpropagation Through Time

  •  

86 of 109

Standard RNN Gradient Flow

  •  

87 of 109

Standard RNN Gradient Flow: Exploding Gradients

  •  

  • Many values > 1: exploding gradients
    • Remedy: gradient clipping to scale big gradients

  • Many values < 1: vanishing gradients
    • Remedies: 1. activation function, 2. weight initialization, 3. network architecture
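Gradient clipping is typically a single call in PyTorch, applied between the backward pass and the optimizer step; the model, dummy loss, and threshold of 1.0 below are illustrative assumptions:

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=8, hidden_size=16)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(5, 1, 8)                 # (time, batch, features)
output, h_n = model(x)
loss = output.pow(2).mean()              # dummy loss for illustration

loss.backward()
# Rescale gradients whose global norm exceeds 1.0, to tame exploding gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```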

88 of 109

Standard RNN Gradient Flow: Vanishing Gradients

  •  

  • Many values > 1: exploding gradients
    • Remedy: gradient clipping to scale big gradients

  • Many values < 1: vanishing gradients
    • Remedies: 1. activation function, 2. weight initialization, 3. network architecture

89 of 109

The Problem of Long-Term Dependencies

  • Why are vanishing gradients a problem?

  • Multiply many small numbers together
  • Errors from time steps further back have smaller and smaller gradients
  • As a result, the parameters become biased toward capturing short-term dependencies

90 of 109

Gating Mechanisms in Neurons

  • Use a more complex recurrent unit with gates to control what information is passed through

  • Long Short-Term Memory (LSTM) networks rely on gated cells to track information throughout many time steps.

91 of 109

Standard RNNs

  • In a standard RNN, recurrent modules contain simple computation

92 of 109

Long Short-Term Memory (LSTM)

  • In an LSTM network, recurrent modules contain gated cells that control the information flow

93 of 109

Long Short-Term Memory (LSTM)

  •  

94 of 109

Long Short-Term Memory (LSTM)

  • Information is added or removed to cell state through structures called gates.

Gates optionally let information through, via a sigmoid layer and pointwise multiplication
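For reference, a common parameterization of these gates (the symbols below follow the standard LSTM formulation and are assumptions about notation, not copied from the slides):

```latex
\begin{aligned}
f_t &= \sigma\!\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{forget gate}\\
i_t &= \sigma\!\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{input gate}\\
\tilde{c}_t &= \tanh\!\left(W_c\,[h_{t-1}, x_t] + b_c\right) && \text{candidate values}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{update cell state}\\
o_t &= \sigma\!\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{output gate}\\
h_t &= o_t \odot \tanh(c_t) && \text{filtered output}
\end{aligned}
```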

95 of 109

LSTM: Forget Irrelevant Information

  •  

 

96 of 109

LSTM: Add New Information

  •  

 

97 of 109

LSTM: Update Cell State

  •  

 

98 of 109

LSTM: Output Filtered Version of Cell State

  •  

 

99 of 109

LSTM: Cell State Matters

  •  

 

100 of 109

LSTM: Mitigate Vanishing Gradient

  • Vanilla RNNs: backpropagating to early time steps multiplies many factors of the recurrent weight matrix (and activation derivatives) together
    • If most of these factors are small, the gradient vanishes!
    • If most of them are large, the gradient explodes!

101 of 109

LSTM: Mitigate Vanishing Gradient

  • Vanilla RNNs: the repeated multiplications shrink (or blow up) the gradient over long time spans

  • LSTM: the cell-state gradient flows through an additive update controlled by the forget gate

  • So… we can keep information if we want! (by adjusting how much we forget)

102 of 109

LSTM: Key Concepts

  • Maintain a separate cell state from what is outputted

  • Use gates to control the flow of information
    • Forget gate gets rid of irrelevant information
    • Input gate selectively updates the cell state with new information
    • Output gate returns a filtered version of the cell state

  • LSTM can mitigate vanishing gradient problem
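A minimal PyTorch sketch of running an LSTM over a sequence (all dimensions are illustrative assumptions); note that it returns both the per-step outputs and the final hidden and cell states, reflecting the separate cell state described above:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(4, 10, 8)          # (batch, time, features)
outputs, (h_n, c_n) = lstm(x)      # outputs: hidden state at every time step

print(outputs.shape)  # torch.Size([4, 10, 16])
print(h_n.shape)      # torch.Size([1, 4, 16]) -- final hidden state
print(c_n.shape)      # torch.Size([1, 4, 16]) -- final cell state (kept separate)
```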

103 of 109

RNN Applications & Limitations

104 of 109

RNN Applications & Limitations

105 of 109

RNN Applications & Limitations

106 of 109

RNN Applications & Limitations

  • Limitations
    • Encoding bottleneck
    • Slow, no parallelization

107 of 109

Next

  • Attention and Transformers

108 of 109

  • Language Model: A system that predicts the next word

  • Recurrent Neural Network: A family of neural networks that:
    • Take sequential input of any length; apply the same weights on each step
    • Can optionally produce output on each step

  • Recurrent Neural Network != Language Model
    • RNNs can be used for many other things

  • Language Modeling is a traditional subcomponent of many NLP tasks, especially those involving generating text or estimating the probability of text:
    • Now everything in NLP is being rebuilt upon Language Modeling: GPT-3 is an LM!

109 of 109

Assignment #4