Contents
From images to text
IMDb
Images are already grids of numbers (pixels). Text is a sequence of words.
→ How do we turn a document into a feature vector?
“This has to be one of the worst films of the 1990s. When my friends and I were watching this film (being the target audience it was aimed at) we just sat & watched the first half an hour with our jaws touching the floor at how bad it really was. The rest of the time, everyone else in the theater just started talking to each other, leaving or generally crying into their popcorn …”
Bag-of-words featurization
What’s lost? Word ORDER!
“not good” = “good not” = [..., 1, …, 1, …]
Bag-of-words throws word order away completely.
Dictionary: movie, acting, great, enjoyed, fresh, plot, the, …
Original review: “I really enjoyed this movie …”
Encoded as a binary vector: [1, 0, 0, 1, 0, 0, 0, …]
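A minimal sketch of the binary encoding, using the seven dictionary words shown here. The helper name `bag_of_words` is hypothetical, and splitting on whitespace is a simplification (real tokenizers also handle punctuation).

```python
# Hypothetical helper: binary bag-of-words over the toy dictionary above.
VOCAB = ["movie", "acting", "great", "enjoyed", "fresh", "plot", "the"]

def bag_of_words(text, vocab=VOCAB):
    """Return a 0/1 vector: 1 if the vocabulary word occurs in the text."""
    tokens = set(text.lower().split())
    return [1 if word in tokens else 0 for word in vocab]

review_vec = bag_of_words("I really enjoyed this movie")
print(review_vec)  # [1, 0, 0, 1, 0, 0, 0]

# Word order is discarded: both phrases map to the same vector.
pair_vocab = ["not", "good"]
same = bag_of_words("not good", pair_vocab) == bag_of_words("good not", pair_vocab)
print(same)  # True
```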
Lasso vs. NN
Both models achieve a test-set accuracy of about 88%.
When order matters
Many real-world data sources are sequential:
“dog bites man” ≠ “man bites dog”
5 → 10 → 15 ≠ 15 → 10 → 5
Recurrent Neural Networks (RNNs): designed to exploit sequential structure in data
Order matters
Recurrent Neural Network
L = sequence length (number of words) or number of lags
time step
dog bites man
Recurrent Neural Network
Ordinary Neural Network
Recurrent Neural Network
For each hidden unit k at time step l:
Weights: input → hidden, hidden → hidden, hidden → output
Terms inside the activation: bias, current input, previous state
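In the notation of the ISLP textbook, the update for hidden unit k at step ℓ can be written as below. The symbols w, u, and g, and the index ranges p (input features) and K (hidden units), follow that textbook and are an assumption about the notation intended here.

```latex
A_{\ell k} = g\Big(
  \underbrace{w_{k0}}_{\text{bias}}
  + \underbrace{\sum_{j=1}^{p} w_{kj} X_{\ell j}}_{\text{current input}}
  + \underbrace{\sum_{s=1}^{K} u_{ks} A_{\ell-1,\,s}}_{\text{previous state}}
\Big)
```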
Ordinary Neural Network
Recurrent Neural Network
For each hidden unit k at time step l:
Output at step l:
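A rough sketch of the forward pass for one sequence: each hidden state combines a bias, the current input, and the previous state, and the output at each step is a linear function of the hidden state. The function name `rnn_forward`, the tanh activation, the zero initial state, and the random weights are all assumptions made for illustration.

```python
import numpy as np

def rnn_forward(X, W, U, w0, beta, beta0):
    """Run a simple RNN over one sequence.

    X: (L, p) inputs, W: (K, p) input-to-hidden, U: (K, K) hidden-to-hidden,
    w0: (K,) hidden bias, beta: (K,) hidden-to-output, beta0: scalar output bias.
    Returns hidden states A of shape (L, K) and outputs O of shape (L,).
    """
    L, K = X.shape[0], W.shape[0]
    A = np.zeros((L, K))
    O = np.zeros(L)
    prev = np.zeros(K)  # state before the first step (assumed zero)
    for l in range(L):
        # bias + current input + previous state, then the activation g (tanh here)
        prev = np.tanh(w0 + W @ X[l] + U @ prev)
        A[l] = prev
        O[l] = beta0 + beta @ prev  # output at step l
    return A, O

rng = np.random.default_rng(0)
L, p, K = 3, 4, 2  # e.g. "dog bites man": 3 steps, hypothetical 4-dim inputs
X = rng.normal(size=(L, p))
W, U = rng.normal(size=(K, p)), rng.normal(size=(K, K))
w0, beta = np.zeros(K), rng.normal(size=K)
A, O = rnn_forward(X, W, U, w0, beta, 0.0)
print(A.shape, O.shape)  # (3, 2) (3,)
```

The same W, U, and beta are reused at every step; only the state carried forward changes.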
Word embedding
IMDb results
Setup
Model | Test accuracy |
Bag-of-words + lasso | ~88% |
Simple RNN | ~76% |
IMDb results
“Vanishing gradient problem”
Why is the simple RNN worse than bag-of-words?
→ Signals from early words get diluted (washed out) by the time they reach the end of the sequence.
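The dilution can be illustrated with a scalar tanh RNN. By the chain rule, the gradient of the state at step l with respect to the initial state is a product of l factors, and when each factor is below 1 in magnitude the product shrinks geometrically. The weights (u = 0.5, w = 1.0) and the random inputs are hypothetical choices, not values from the IMDb experiment.

```python
import numpy as np

def gradient_wrt_early_state(u, w, xs):
    """Return |d h_l / d h_0| for each step l of a scalar tanh RNN."""
    h, grad, grads = 0.0, 1.0, []
    for x in xs:
        h = np.tanh(u * h + w * x)
        # chain rule: d h_l / d h_{l-1} = u * (1 - h_l**2)
        grad *= u * (1.0 - h**2)
        grads.append(abs(grad))
    return grads

rng = np.random.default_rng(0)
xs = rng.normal(size=50)  # hypothetical 50-step input sequence
grads = gradient_wrt_early_state(u=0.5, w=1.0, xs=xs)
print(grads[0], grads[9], grads[49])  # shrinks toward zero as l grows
```

Each factor here is at most 0.5 in magnitude, so after 50 steps the gradient is bounded by 0.5 to the power 50.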
LSTM
Results on IMDb
Field evolution
Long Short-Term Memory
Long-term memory track, short-term memory track
Model | Test accuracy |
Simple RNN | ~76% |
LSTM | ~87% |
ISLP - 2020 | ~95% with tuned RNN/LSTM |
Today | 97%+ with Transformers (BERT, GPT, …) |
Thank you!
Recap
Bag of words → RNN (word embedding) → LSTM
https://calvinfeng.gitbook.io/machine-learning-notebook/supervised-learning/recurrent-neural-network/recurrent_neural_networks
many to one: input sequence, output scalar (e.g., sentiment classification)
one to many: single input, output sequence (e.g., image → image captioning)
many to many: input sequence, output sequence
In the broad sense, "RNN" refers to any neural network technique that processes the entire sequence; LSTM also falls within this category (RNNs).
In the context of this textbook, it refers to the early RNN models, specifically vanilla RNNs.
Lag l = 1: highest autocorrelation
Autocorrelation: correlation between a series' past and present values
Correlation: direction and strength of the linear relation between variables
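A small sketch of autocorrelation as the correlation between a series and a lagged copy of itself. The series is a synthetic AR(1) process, not the lecture's data, and `autocorrelation` is a hypothetical helper that uses the plain Pearson correlation of the overlapping pairs.

```python
import numpy as np

def autocorrelation(v, lag):
    """Pearson correlation between the series and itself shifted by `lag` steps."""
    return np.corrcoef(v[:-lag], v[lag:])[0, 1]

rng = np.random.default_rng(0)
# Synthetic AR(1) series: v_t = 0.8 * v_{t-1} + noise
v = np.zeros(2000)
for t in range(1, len(v)):
    v[t] = 0.8 * v[t - 1] + rng.normal()

acf = [autocorrelation(v, lag) for lag in range(1, 6)]
print(np.round(acf, 2))  # largest at lag 1, decaying at longer lags
```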
Input
Originally: 6051 × 3 (T × p)
Reshaped: [T − L (samples), L (lags), p (features)] = [6046, 5, 3]
Output: 1 value (v_t)
many to one
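The reshaping above can be sketched with NumPy. The helper name `make_lagged_tensor`, the placeholder random data, and the choice of column 0 as the target v_t are assumptions for illustration only.

```python
import numpy as np

def make_lagged_tensor(data, L, target_col=0):
    """Turn a (T, p) series into RNN inputs (T-L, L, p) and targets (T-L,)."""
    T, p = data.shape
    # Sample i holds rows i, ..., i+L-1 (oldest first); its target is row i+L.
    X = np.stack([data[i:i + L] for i in range(T - L)])
    y = data[L:, target_col]
    return X, y

T, L, p = 6051, 5, 3
data = np.random.default_rng(0).normal(size=(T, p))  # placeholder values only
X, y = make_lagged_tensor(data, L)
print(X.shape, y.shape)  # (6046, 5, 3) (6046,)
```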
What is a "Strawman"?
A simple prediction used as a baseline for comparison.
Input: [T − L, p·L + 1]
The extra column of ones adds a bias (intercept) for ordinary least squares,
shifting the regression line up or down.
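A sketch of the flattened autoregression input and an ordinary least squares fit. The helper name `make_ar_design`, the placeholder random data, and the use of column 0 as the response are assumptions; with T = 6051, L = 5, and p = 3 the design matrix has p·L + 1 = 16 columns.

```python
import numpy as np

def make_ar_design(data, L):
    """Flatten the L previous rows of a (T, p) series into (T-L, p*L + 1)."""
    T, p = data.shape
    lags = np.stack([data[i:i + L].ravel() for i in range(T - L)])
    intercept = np.ones((T - L, 1))  # the "+1" column
    return np.hstack([intercept, lags])

T, L, p = 6051, 5, 3
data = np.random.default_rng(0).normal(size=(T, p))  # placeholder values only
M = make_ar_design(data, L)
y = data[L:, 0]                                       # next-step value of column 0
coef, *_ = np.linalg.lstsq(M, y, rcond=None)          # ordinary least squares
print(M.shape, coef.shape)  # (6046, 16) (16,)
```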
RNN vs Autoregression (AR)
RNN (Recurrent Neural Network)
Input: 3D tensor
[T − L, L, p] = [samples, lags, features]
Output
[T − L, 1]
Preserves the 3D sequence structure, so the model can learn from the time ordering.
Autoregression
Input: 2D matrix
[T − L, p·L + 1] = [samples, flattened past features + intercept]
Output
[T − L, 1]
Flattens each sample's past features into a single vector. No raw data is lost, but the sequential structure is ignored.
similar to ISLP applied exercise 10
No big differences between AR and RNN
SNR (signal-to-noise ratio)
The ratio of meaningful, underlying patterns (Signal) to random, unwanted variations (Noise) in a dataset.
"RNN" Terminology
Broad sense (the RNN family, "RNNs"):
the entire category of neural networks designed to process sequential data.
Handling Noisy Data (Low SNR)
IMDb reviews: a simple lasso logistic regression matches or outperforms neural networks.
Time Series Forecasting: Simply adding a day_of_week feature increases R² more than changing the entire model architecture. (Feature engineering beats complexity)
Summary: When to use Deep Learning
Deep Learning Models (RNN, CNN)
Traditional Models (AR, Lasso)
Occam’s Razor Principle: Prefer simpler models if they work as well.
However, in modern deep learning the traditional bias-variance trade-off doesn't always hold: test error can fall again when models are overparameterized beyond the interpolation threshold.
→ Double Descent
Saddle point: gradient = 0, but the surface curves up in one direction and down in another, so it is neither a minimum nor a maximum.
R (Risk or Loss function): Objective to minimize
𝜃 (Parameters/Weights): Current position.
t (Iteration/Time step index): Current step.
∇R(gradient of R): Direction of steepest ascent
ρ (rho): learning rate (step size of each move)
local minimum
delta
The Intuition of Gradient (∇R)
The Gradient vector always points to the steepest ascent (uphill).
To find the minimum, we must move in the opposite direction of the gradient.
If ∇R > 0: Uphill is to the right → Move left (subtract).
If ∇R < 0: Uphill is to the left → Move right (add).
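A one-dimensional sketch of moving against the gradient. The objective R(θ) = (θ − 3)², the step size, and the starting points are hypothetical; starting to the right of the minimum the gradient is positive and θ moves left, and starting to the left it moves right.

```python
def gradient_descent(grad, theta0, rho=0.1, steps=100):
    """Repeatedly step against the gradient: theta <- theta - rho * grad(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - rho * grad(theta)
    return theta

# Hypothetical objective R(theta) = (theta - 3)**2, so grad R = 2 * (theta - 3).
grad_R = lambda theta: 2.0 * (theta - 3.0)

from_right = gradient_descent(grad_R, theta0=10.0)  # gradient > 0: moves left
from_left = gradient_descent(grad_R, theta0=-4.0)   # gradient < 0: moves right
print(round(from_right, 4), round(from_left, 4))  # 3.0 3.0
```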
As with least squares in Ch. 2 and 3,
we define the objective function to measure error in the same way as MSE and RSS.
Backpropagation order: Error → Output → Activation (g) → Weight (w)
w: weights connecting inputs to hidden layer and past hidden activations to current ones
β (beta): used only for the output layer; plays the same role as regression coefficients
What is changing? The parameters.
What is being differentiated? The risk function.
Update Rule:
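With the symbols defined for gradient descent (R, θ, t, ρ), the standard update can be written as below. This is the usual textbook form and is assumed to be the one intended here.

```latex
\theta^{(t+1)} \leftarrow \theta^{(t)} - \rho \, \nabla R\big(\theta^{(t)}\big)
```

The minus sign is what turns the direction of steepest ascent into a step downhill.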
Quiz: This is a simple scalar network (1 neuron, 1 weight per layer). Fill in the blanks!
What is ReLU (Rectified Linear Unit)? If the input is positive, it passes through unchanged; if negative, it becomes zero: max(0, x)
Error Loss:
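A sketch of the chain rule on a scalar network with one ReLU hidden unit. The quiz's own numbers and loss are not reproduced here: the values of x, y, w, and β, and the loss R = ½(y − ŷ)², are hypothetical choices for illustration.

```python
def relu(x):
    return max(0.0, x)

def forward(x, w, beta):
    z = w * x          # hidden pre-activation
    a = relu(z)        # hidden activation g(z)
    return z, a, beta * a

def gradients(x, y, w, beta):
    """Chain rule for R = 0.5 * (y - y_hat)**2 (hypothetical loss choice)."""
    z, a, y_hat = forward(x, w, beta)
    d_out = -(y - y_hat)                 # dR / d y_hat
    d_beta = d_out * a                   # output weight
    d_w = d_out * beta * (1.0 if z > 0 else 0.0) * x  # back through ReLU to w
    return d_w, d_beta

x, y, w, beta = 2.0, 1.0, 0.5, 3.0       # hypothetical numbers
d_w, d_beta = gradients(x, y, w, beta)
print(d_w, d_beta)  # 12.0 2.0
```

If the pre-activation z were negative, the ReLU derivative would be zero and no gradient would reach w.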
Slow learning
Problem: Training for too long makes the model memorize noise (overfitting).
Solution: Use a small learning rate but stop training early before it overfits.
SGD
Problem: Standard GD computes the error over all the data (e.g., 60,000 images) just to take one step. It is extremely slow and can easily get stuck in a local minimum.
Solution: Use small random subsets (Minibatches, e.g., 128 images) for each step.
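A minimal sketch of minibatch SGD, shown on linear regression with squared error rather than an image network. The function name `minibatch_sgd`, the synthetic data, and the learning rate are assumptions; the batch size of 128 follows the example above.

```python
import numpy as np

def minibatch_sgd(X, y, rho=0.05, batch_size=128, epochs=20, seed=0):
    """Fit linear-regression weights with minibatch SGD on squared error."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(epochs):
        order = rng.permutation(n)                 # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # one random minibatch
            resid = X[idx] @ theta - y[idx]
            grad = 2.0 * X[idx].T @ resid / len(idx)  # gradient on the batch only
            theta -= rho * grad
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(6000, 3))                     # synthetic data, not MNIST
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=6000)
theta_hat = minibatch_sgd(X, y)
print(np.round(theta_hat, 1))
```

Each step touches only 128 rows, so one pass over the data gives about 47 updates instead of one.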
Dropout: each unit is removed with probability φ (phi)
Reduce dependency on strong features
→ prevent overfitting
→ reduce variance
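Dropping units at random with probability φ (dropout, in ISLP's terminology) can be sketched as follows. The helper name `dropout` and the layer of all-ones activations are hypothetical; the surviving units are scaled by 1/(1 − φ) so the expected activation is unchanged.

```python
import numpy as np

def dropout(activations, phi, rng):
    """Zero each unit with probability phi; rescale survivors by 1 / (1 - phi)."""
    keep = rng.random(activations.shape) >= phi   # True = unit survives
    return activations * keep / (1.0 - phi)

rng = np.random.default_rng(0)
a = np.ones(10_000)                # hypothetical layer of activations
out = dropout(a, phi=0.4, rng=rng)
print((out == 0).mean(), out.mean())  # roughly 0.4 dropped, mean close to 1
```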
Summary: when to use deep learning
Occam’s razor principle
Finding the minimum-norm solution is theoretically similar to ridge regression (it is the limit of the ridge solution as the penalty shrinks to zero).
Ridge
Implicit regularization
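A small numerical check of this connection on an overparameterized least squares problem (more weights than samples). The data are synthetic and the penalty, step size, and iteration count are hypothetical. Ridge with a tiny penalty and gradient descent started from zero both land on the minimum-norm interpolating solution, which is one way to read "implicit regularization".

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50                          # overparameterized: more weights than samples
X = rng.normal(size=(n, p))            # synthetic data
y = rng.normal(size=n)

# Minimum-norm interpolating solution via the pseudoinverse.
theta_min_norm = np.linalg.pinv(X) @ y

# Ridge with a tiny penalty, written in its dual form X^T (X X^T + lam I)^{-1} y,
# which is algebraically equal to (X^T X + lam I)^{-1} X^T y.
lam = 1e-6
theta_ridge = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)

# Gradient descent on squared error, started from zero.
theta_gd = np.zeros(p)
for _ in range(5000):
    theta_gd -= 0.005 * X.T @ (X @ theta_gd - y)

print(np.allclose(theta_ridge, theta_min_norm, atol=1e-4))
print(np.allclose(theta_gd, theta_min_norm, atol=1e-4))
```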