Machine Learning for English Analysis
Week 9 – Deep Learning (2)
Unsupervised Learning
Dimensionality Reduction – Principal Component Analysis (PCA)
Why reduce dimensions?
Principal Component Analysis (PCA)
Principal Component Analysis (PCA)
PCA: Conceptual Algorithm
PCA: Conceptual Algorithm
PCA Algorithm: Pre-processing
PCA Algorithm: Maximize Variance
Maximize Variance
Dimensionality Reduction Using PCA
PCA is a useful preprocessing step
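A minimal sketch of the PCA steps above (center the data, find the directions of maximum variance, project), using scikit-learn; the toy dataset and the choice of 2 components are illustrative assumptions.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # 100 samples, 10 original features (toy data)

pca = PCA(n_components=2)               # keep the 2 directions of maximum variance
X_reduced = pca.fit_transform(X)        # centers the data internally, then projects

print(X_reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)    # fraction of variance captured by each component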
Unsupervised Learning
Dimensionality Reduction – Autoencoder
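As a sketch, an autoencoder does the same job non-linearly: an encoder compresses the input to a low-dimensional code and a decoder reconstructs the input from it. The layer sizes below (10 → 2 → 10) are illustrative assumptions, written in PyTorch.

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_features=10, code_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, code_dim), nn.ReLU())
        self.decoder = nn.Linear(code_dim, n_features)

    def forward(self, x):
        code = self.encoder(x)           # compressed (low-dimensional) representation
        return self.decoder(code)        # reconstruction of the input

model = Autoencoder()
x = torch.randn(4, 10)                   # toy batch
loss = nn.MSELoss()(model(x), x)         # train to reconstruct the input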
Deep Learning
Convolutional Neural Networks
What Do Computers “See”?
What Do Computers “See”?
Tasks in Computer Vision
Input Image → Pixel Representation → classification
Example class probabilities: Lincoln 0.80, Washington 0.05, Jefferson 0.05, Obama 0.01, Trump 0.09
Tasks in Computer Vision
High-level Feature Detection
Nose, eyes, mouth
Wheels, license plate, headlights
Door, windows, steps
“Learning” Feature Representations
“Learning” Feature Representations
Deep Learning
Fully Connected Neural Network
Fully Connected Neural Network
Fully Connected Layer
Locally Connected Layer
Convolutional Layer
Key Idea
Using Spatial Structure
Input: a 2D image, i.e., an array of pixel values.
Idea: connect patches of the input to neurons in the hidden layer.
Each neuron is connected to a region of the input and only “sees” those values.
Using Spatial Structure
Connect a patch of the input layer to a single neuron in the subsequent layer.
Use a sliding window to define the connections.
How can we weight the patch to detect particular features?
Feature Extraction with Convolution
This “patchy” operation is convolution.
The Convolution Operation
Image
Kernel
Feature Map
Feature Extraction with Convolution: A Case Study
An image is represented as a matrix of pixel values… and computers are literal!
We want to classify an X as an X even if it is shifted, shrunk, rotated, or deformed.
[Figure: two 9×9 grids of pixel values (+1 and −1), one showing a canonical “X” and one showing a shifted, deformed “X”. Compared pixel by pixel the matrices differ; are they the same letter?]
Feature Extraction with Convolution: A Case Study
Three 3×3 filters (the mini-patterns that make up an X):

Main diagonal:
 1 | -1 | -1
-1 |  1 | -1
-1 | -1 |  1

X / cross:
 1 | -1 |  1
-1 |  1 | -1
 1 | -1 |  1

Anti-diagonal:
-1 | -1 |  1
-1 |  1 | -1
 1 | -1 | -1
Feature Extraction with Convolution: A Case Study
Element-wise multiply the 3×3 filter with a 3×3 patch of the image, then add the outputs.

Filter (main diagonal):
 1 | -1 | -1
-1 |  1 | -1
-1 | -1 |  1

Matching 3×3 patch from the X image:
 1 | -1 | -1
-1 |  1 | -1
-1 | -1 |  1

Element-wise product:
 1 | 1 | 1
 1 | 1 | 1
 1 | 1 | 1

Add the outputs: 9, the maximum possible response, so the filter has found its pattern in this patch.
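The same computation in NumPy, as a sketch: the filter and patch below are the diagonal pattern from the X example; multiplying element-wise and summing gives 9.

import numpy as np

# Diagonal filter and the matching 3x3 patch from the X image (values are +1/-1).
kernel = np.array([[ 1, -1, -1],
                   [-1,  1, -1],
                   [-1, -1,  1]])
patch = np.array([[ 1, -1, -1],
                  [-1,  1, -1],
                  [-1, -1,  1]])

response = np.sum(kernel * patch)   # element-wise multiply, then add the outputs
print(response)                     # 9 -> strong evidence of the diagonal pattern here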
Producing Feature Maps
Original image vs. feature maps from different kernels: sharpen, edge detect, “strong” edge detect.
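A sketch of how different kernels yield such feature maps; the sharpen and edge-detect kernels below are common textbook choices (not necessarily the ones on the slide), and the input image is a random placeholder.

import numpy as np
from scipy.signal import convolve2d

image = np.random.rand(28, 28)                        # placeholder grayscale image

sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])
edge_detect = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]])

sharpened = convolve2d(image, sharpen, mode="same")   # same spatial size as the input
edges = convolve2d(image, edge_detect, mode="same")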
Convolutional Neural Networks
Convolutional Neural Networks
CNNs: Spatial Arrangement of Output Volume
CNNs: Spatial Arrangement of Output Volume
Spatial Pooling
Pooling Layer
Pooling Layer
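A minimal max-pooling sketch in PyTorch: each 2×2 window is replaced by its maximum, halving the spatial dimensions. The tensor shape here is an illustrative assumption.

import torch
import torch.nn as nn

x = torch.randn(1, 8, 24, 24)        # (batch, channels, height, width), toy feature maps
pool = nn.MaxPool2d(kernel_size=2, stride=2)
y = pool(x)
print(y.shape)                       # torch.Size([1, 8, 12, 12]) -- spatial size halved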
Representation Learning in Deep CNNs
Conv Layer 1
Conv Layer 2
Conv Layer 3
CNNs for Classification: Feature Learning
CNNs for Classification: Class Probabilities
Practice 1: Feature Map Shape
Practice 1: Feature Map Shape
Convolution with 3x3 kernel, zero padding and stride = 1
Practice 1: Feature Map Shape
Convolution with 3x3 kernel, zero padding and stride = 2
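A sketch of the general output-size rule behind both practice questions, assuming "zero padding" means padding = 0: output = floor((input + 2·padding − kernel) / stride) + 1 along each spatial dimension. The input size of 7 below is an assumption for illustration, not given on the slide.

def feature_map_size(n, k, p, s):
    """Output size along one dimension for input n, kernel k, padding p, stride s."""
    return (n + 2 * p - k) // s + 1

# Illustrative inputs (the actual image size in the practice slide is not shown here).
print(feature_map_size(n=7, k=3, p=0, s=1))   # 5
print(feature_map_size(n=7, k=3, p=0, s=2))   # 3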
Practice 2: CNN in PyTorch
Practice 2: CNN in PyTorch
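A minimal CNN sketch in PyTorch for the practice above, assuming 28×28 grayscale inputs and 10 output classes (both assumptions, not fixed by the slide): two conv + pool blocks for feature learning, then a fully connected layer producing class scores.

import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 7 * 7, n_classes)   # 28 -> 14 -> 7 after pooling

    def forward(self, x):
        x = self.features(x)               # convolutional feature learning
        x = x.flatten(start_dim=1)         # flatten to one vector per example
        return self.classifier(x)          # class scores (apply softmax for probabilities)

model = SimpleCNN()
logits = model(torch.randn(4, 1, 28, 28))  # toy batch of 4 grayscale 28x28 images
print(logits.shape)                        # torch.Size([4, 10])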
An Architecture for Many Applications
CNN for Text Classification
CNN for Speech Recognition
Deep Learning
Recurrent Neural Networks
So Far
Sequence Matters
Sequence Matters
What is a Sequence?
Sequence Modeling
given previous words
predict the next word
A Sequence Modeling Example: Next Word Prediction
“This morning I took my dog for a walk.”
given previous two words
predict the next word
A Sequence Modeling Example: Next Word Prediction
“This morning I took my dog for a walk.”
predict the next word
Here, 1 is the count for the word “a”
[0 1 0 0 1 0 1 … … 0 0 1 1 0 0 0 1 0]
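A sketch of how such a fixed-length count vector could be built; the tiny vocabulary and the counted prefix below are illustrative assumptions, not the ones used on the slide.

from collections import Counter

vocab = ["this", "morning", "i", "took", "my", "dog", "for", "a", "walk"]
previous_words = ["for", "a"]                       # e.g., the previous two words

counts = Counter(previous_words)
bag = [counts[w] for w in vocab]                    # fixed-length count vector
print(bag)                                          # [0, 0, 0, 0, 0, 0, 1, 1, 0]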
Sequence Modeling
Standard Feed-Forward Neural Network
Recurrent Neural Networks
… and many other architectures and applications
A Recurrent Neural Network (RNN)
Diagram: input vector x_t → cell state h_t (updated from the old state and the current input) → output vector ŷ_t
RNN: State Update and Output
Cell state update (old state + current input): h_t = tanh(W_hh h_{t−1} + W_xh x_t)
Output vector (from the input vector x_t via the cell state): ŷ_t = W_hy h_t
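A from-scratch sketch of that update (standard simple-RNN equations; the dimensions and random weights are illustrative assumptions):

import torch

input_dim, hidden_dim, output_dim = 8, 16, 4          # illustrative sizes
W_xh = torch.randn(hidden_dim, input_dim) * 0.1       # input  -> hidden weights
W_hh = torch.randn(hidden_dim, hidden_dim) * 0.1      # hidden -> hidden (recurrent) weights
W_hy = torch.randn(output_dim, hidden_dim) * 0.1      # hidden -> output weights

def rnn_step(x_t, h_prev):
    h_t = torch.tanh(W_hh @ h_prev + W_xh @ x_t)      # new cell state from old state + input
    y_t = W_hy @ h_t                                  # output vector at this time step
    return y_t, h_t

h = torch.zeros(hidden_dim)                           # initial state
for x_t in torch.randn(5, input_dim):                 # a toy sequence of 5 input vectors
    y, h = rnn_step(x_t, h)                           # the same weights are reused at every step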
RNN: Computational Graph across Time
RNN: Backpropagation Through Time
Standard RNN Gradient Flow
Standard RNN Gradient Flow: Exploding Gradients
Many values > 1:
exploding gradients
Gradient clipping to scale big gradients
Many values < 1:
vanishing gradients
Standard RNN Gradient Flow: Vanishing Gradients
Many values > 1:
exploding gradients
Gradient clipping to scale big gradients
Many values < 1:
vanishing gradients
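The gradient-clipping trick mentioned above, as a sketch of where it sits in a PyTorch training step; the model, the placeholder loss, and the max-norm value of 1.0 are illustrative assumptions.

import torch
import torch.nn as nn

model = nn.RNN(input_size=8, hidden_size=16, batch_first=True)    # any recurrent model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(4, 20, 8)                    # toy batch: 4 sequences of length 20
output, h_n = model(x)
loss = output.pow(2).mean()                  # placeholder loss for illustration

optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # rescale big gradients
optimizer.step()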
The Problem of Long-Term Dependencies
Multiply many small numbers together
Errors from time steps further back have smaller and smaller gradients
So parameters become biased toward capturing short-term dependencies
Gating Mechanisms in Neurons
Standard RNNs
Long Short-Term Memory (LSTM)
Gates optionally let information through, via a sigmoid layer and pointwise multiplication
LSTM: Forget Irrelevant Information
LSTM: Add New Information
LSTM: Update Cell State
LSTM: Output Filtered Version of Cell State
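In PyTorch these steps (forget, add, update, output) are packaged in nn.LSTMCell; a minimal sketch with illustrative sizes, with the gate roles noted in the comments.

import torch
import torch.nn as nn

cell = nn.LSTMCell(input_size=8, hidden_size=16)   # forget, input, and output gates inside

x_t = torch.randn(4, 8)                            # current input (batch of 4)
h_t = torch.zeros(4, 16)                           # previous hidden state
c_t = torch.zeros(4, 16)                           # previous cell state

# Internally: the forget gate drops irrelevant information from c_t, the input gate adds
# new information, and the output gate emits a filtered version of the cell state.
h_t, c_t = cell(x_t, (h_t, c_t))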
LSTM: Cell State Matters
LSTM: Mitigate Vanishing Gradient
… Vanish!
… Explode!
LSTM: Mitigate Vanishing Gradient
So…
We can keep information if we want! (by adjusting how much we forget)
LSTM: Key Concepts
RNN Applications & Limitations
Next
Assignment #4