2 of 49

From images to text

IMDb

task: sentiment classification
input: a movie review (free text)
output: positive (1) or negative (0)
25,000 training reviews & 25,000 test reviews

Images are already grids of numbers (pixels). Text is a sequence of words.

→ How do we turn a document into a feature vector?

“This has to be one of the worst films of the 1990s. When my friends and I were watching this film (being the target audience it was aimed at) we just sat & watched the first half an hour with our jaws touching the floor at how bad it really was. The rest of the time, everyone else in the theater just started talking to each other, leaving or generally crying into their popcorn …”

3 of 49

Bag-of-words featurization

Build dictionary: top 10,000 most frequent words
For each document, create a length-10,000 binary vector → 1 if the word appears, 0 otherwise
Stack into a matrix: 25,000 reviews × 10,000 words; ~99% of entries are zero → store as a sparse matrix

What’s lost? Word ORDER !!!

“not good” = “good not” = [..., 1, …, 1, …]

Bag-of-words throws word order away completely

Dict

movie

acting

great

enjoyed

fresh

plot

the

…

Original review: “I really enjoyed this movie …”

Encode as binary vector: [1, 0, 0, 1, 0, 0, 0, …]
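
The featurization on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the pipeline used for the reported results: the seven-word `VOCAB` is the toy dictionary from the slide, and the helper names (`tokenize`, `bag_of_words`, `to_sparse`) and the simple regex tokenizer are assumptions made for the example.

```python
import re

# Toy dictionary from the slide; the real one holds the 10,000 most frequent words.
VOCAB = ["movie", "acting", "great", "enjoyed", "fresh", "plot", "the"]


def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(text, vocab=VOCAB):
    """Binary vector: 1 if the dictionary word occurs in the text, 0 otherwise."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocab]


def to_sparse(vector):
    """Sparse storage: keep only the positions of the nonzero entries."""
    return [i for i, value in enumerate(vector) if value]


review_vector = bag_of_words("I really enjoyed this movie")
print(review_vector)             # [1, 0, 0, 1, 0, 0, 0]
print(to_sparse(review_vector))  # [0, 3]

# Word order is lost: both phrases map to the same vector.
order_vocab = ["not", "good"]
same_vector = bag_of_words("not good", order_vocab) == bag_of_words("good not", order_vocab)
print(same_vector)               # True
```

Storing only the nonzero positions is what makes the 25,000 × 10,000 matrix manageable when about 99% of its entries are zero.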

4 of 49

Lasso vs. NN

Both models achieve a test-set accuracy of about 88%.

Deep learning is not a universal hammer.
When features are already informative in a roughly linear way, simpler methods often match deep learning, and they’re faster, more interpretable, and easier to deploy
glmnet: package that fits lasso (and elastic-net) regularized models and handles highly sparse inputs such as bag-of-words matrices

5 of 49

When order matters

Many real-world data sources are sequential:

Documents: sequence of words
Time series: temperature, stock prices, sensors
Speech, music: sequences of sound waves
Handwriting: sequence of pen strokes

“dog bites man” ≠ “man bites dog”

5 → 10 → 15 ≠ 15 → 10 → 5

Recurrent Neural Networks (RNNs): designed to exploit sequential structure in data

Order matters

6 of 49

Recurrent Neural Network

L = sequence length (number of words) or number of lags

time step

dog bites man

7 of 49

Recurrent Neural Network

Ordinary Neural Network

Recurrent Neural Network

8 of 49

Recurrent Neural Network

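For each hidden unit k at time step l:

$$A_{lk} = g\Big(w_{k0} + \sum_{j=1}^{p} w_{kj} X_{lj} + \sum_{s=1}^{K} u_{ks} A_{l-1,s}\Big)$$

W = {w_kj}: input → hidden

U = {u_ks}: hidden → hidden

B = {β_k}: hidden → output

w_k0: bias

X_lj: current input

A_{l-1,s}: previous hidden state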

Ordinary Neural Network

9 of 49

Recurrent Neural Network

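For each hidden unit k at time step l:

$$A_{lk} = g\Big(w_{k0} + \sum_{j=1}^{p} w_{kj} X_{lj} + \sum_{s=1}^{K} u_{ks} A_{l-1,s}\Big)$$

Output at step l:

$$O_l = \beta_0 + \sum_{k=1}^{K} \beta_k A_{lk}$$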

For binary classification: pass O through sigmoid
For multi-class: pass through softmax
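
The recurrence can be sketched as a forward pass in plain Python. This is a minimal illustration under stated assumptions: the activation g is taken to be ReLU, the initial hidden state is all zeros, and the tiny weights and inputs are made up for the example. The function name `rnn_forward` and all numeric values are hypothetical.

```python
import math


def relu(z):
    return max(0.0, z)


def rnn_forward(X, W, U, w0, beta, beta0):
    """Run a simple RNN over X (a list of L input vectors) and return O_1..O_L.

    W[k][j]: input -> hidden, U[k][s]: hidden -> hidden, beta[k]: hidden -> output.
    The same W, U, and beta are reused at every time step (weight sharing).
    """
    K = len(W)
    prev = [0.0] * K  # A_0: initial hidden state, assumed to be zero
    outputs = []
    for x in X:  # one iteration per time step l
        cur = []
        for k in range(K):
            z = w0[k]
            z += sum(w * xj for w, xj in zip(W[k], x))     # current input
            z += sum(u * a for u, a in zip(U[k], prev))    # previous state
            cur.append(relu(z))
        outputs.append(beta0 + sum(b * a for b, a in zip(beta, cur)))
        prev = cur
    return outputs


def sigmoid(o):
    return 1.0 / (1.0 + math.exp(-o))


# Made-up example: L = 3 time steps, p = 2 inputs, K = 2 hidden units.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[0.5, -0.2], [0.1, 0.3]]
U = [[0.1, 0.0], [0.0, 0.1]]
w0 = [0.0, 0.0]
beta = [1.0, -1.0]
beta0 = 0.0

outputs = rnn_forward(X, W, U, w0, beta, beta0)
prob_positive = sigmoid(outputs[-1])  # binary classification uses the last output
reversed_outputs = rnn_forward(X[::-1], W, U, w0, beta, beta0)
print(outputs[-1], reversed_outputs[-1])  # differ: the order of the inputs matters
```

Feeding the same three inputs in reverse order gives a different final output, which is exactly the property that bag-of-words lacks.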

10 of 49

Word embedding

Represent each word as a dense vector in a lower-dimensional space

Each component is a real number, mostly nonzero

Embedding matrix E (m × 10,000)

Each column = one word’s embedding
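
A small sketch of why the embedding is just a column lookup: multiplying E by a one-hot word vector selects that word's column. The 2 × 4 matrix below is made up for illustration (m = 2, a 4-word dictionary), and the helper names are hypothetical.

```python
def one_hot(index, size):
    """One-hot vector of the given size with a 1 at `index`."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector


def mat_vec(E, v):
    """Matrix-vector product E v, with E stored as a list of rows."""
    return [sum(e * x for e, x in zip(row, v)) for row in E]


def embed(E, index):
    """Direct lookup: column `index` of E is the word's embedding."""
    return [row[index] for row in E]


# Made-up embedding matrix: m = 2 rows, 4 dictionary words (columns).
E = [
    [0.2, -0.5, 0.1, 0.7],
    [0.9, 0.3, -0.4, 0.0],
]

word_index = 2
via_product = mat_vec(E, one_hot(word_index, 4))
via_lookup = embed(E, word_index)
print(via_product, via_lookup)  # both are the third column of E
```

In practice the lookup is used instead of the product, since the one-hot vector is almost entirely zeros.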

11 of 49

IMDB results

Setup

Reviews truncated/padded to L=500 words
Embedding dimension m=32 (learned with the task)
One recurrent layer, K=32 hidden units
Dropout regularization

Model	Test accuracy
Bag-of-words + lasso	~88%
Simple RNN	~76%

12 of 49

IMDB results

“Vanishing gradient problem”

Why is the simple RNN worse than bag-of-words?

Reviews can be up to 500 words long.
The same W, U applied 500 times in a row.

→ Signals from early words get DILUTED (washed out) by the time they reach the final step.
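
The dilution can be illustrated with a deliberately simplified scalar, linear recurrence a_l = u · a_{l-1}. This is an assumption made for the sketch, not the actual model: there the effect of the first input on the last state is u multiplied by itself once per step. The function name `gradient_through_time` is hypothetical.

```python
def gradient_through_time(u, steps):
    """d a_L / d a_0 for the scalar recurrence a_l = u * a_{l-1}.

    The same weight u is applied at every step, so the derivative is u ** steps.
    """
    grad = 1.0
    for _ in range(steps):
        grad *= u
    return grad


shrinking = gradient_through_time(0.9, 500)  # |u| < 1: the signal vanishes
growing = gradient_through_time(1.1, 500)    # |u| > 1: the signal explodes

print(shrinking)  # vanishingly small
print(growing)    # extremely large
```

With 500 steps, even a factor as mild as 0.9 leaves essentially nothing of the early signal.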

13 of 49

LSTM

Results on IMDb

Field evolution

The principles we learned (recurrence, weight sharing, embeddings) are foundational.
Specific architectures keep evolving.

Long Short-Term Memory

long-term memory track
short-term track

Model	Test accuracy
Simple RNN	~76%
LSTM	~87%

ISLP - 2020	~95% with tuned RNN/LSTM
Today	97%+ with Transformers (BERT, GPT, …)

14 of 49

Thank you!

16 of 49

Bag of words → RNN(word embedding) → LSTM

https://calvinfeng.gitbook.io/machine-learning-notebook/supervised-learning/recurrent-neural-network/recurrent_neural_networks

many to one: input sequence, output scalar

e.g., sentiment classification

one to many: single input, output sequence

e.g., image captioning (image → caption)

many to many: input sequence, output sequence

Typically, RNN refers to the family of neural networks that take an entire sequence as input. LSTM is included in this family (RNNs).

In the context of this textbook, it refers to early RNN models, specifically vanilla RNNs.

19 of 49

Lag l = 1: highest autocorrelation

Autocorrelation: correlation between a series and its own past values

Correlation: direction and strength of the linear relationship between variables
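
A minimal sketch of autocorrelation at lag l, computed as the Pearson correlation between the series and a copy of itself shifted by l steps. The sine-wave series is made up for illustration and is not the data behind the slide; `pearson` and `autocorrelation` are hypothetical helper names.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def autocorrelation(series, lag):
    """Correlation between (v_t) and (v_{t-lag}) over all available pairs."""
    if lag == 0:
        return 1.0
    return pearson(series[lag:], series[:-lag])


# Made-up smooth series: nearby values are similar, distant values less so.
series = [math.sin(t / 5.0) for t in range(200)]
acf = {lag: autocorrelation(series, lag) for lag in (1, 2, 5)}
print(acf)  # largest at lag 1, smaller as the lag grows
```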

20 of 49

Input

Originally: 6051 × 3

reshaped:

(T − L (number of samples), L (lags), p (features))

= (6046, 5, 3)

Output: 1 value (v_t)

many to one
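
The reshaping can be sketched with plain lists. This is an illustration only: the series below is filled with placeholder numbers, and treating the first column as the target v_t is an assumption made for the example. `make_windows` is a hypothetical helper name.

```python
def make_windows(series, L):
    """Turn a T x p series into inputs of shape (T - L, L, p) and T - L targets.

    Each input holds the L rows before time t; the target is column 0 at time t
    (an assumed stand-in for v_t).
    """
    X, y = [], []
    for t in range(L, len(series)):
        X.append(series[t - L:t])  # the L most recent rows, oldest first
        y.append(series[t][0])
    return X, y


T, L, p = 6051, 5, 3
# Placeholder data: row t is [t, t + 0.1, t + 0.2].
series = [[float(t), t + 0.1, t + 0.2] for t in range(T)]

X, y = make_windows(series, L)
shape = (len(X), len(X[0]), len(X[0][0]))
print(shape)   # (6046, 5, 3)
print(len(y))  # 6046
```

Each sample is many to one: a window of 5 time steps in, one value out.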

21 of 49

What is a "Strawman"?

A simple prediction used as a baseline for comparison.
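
One common strawman for a time series is to predict today's value with yesterday's value. The sketch below assumes that baseline (the slide only defines the term) and scores it with R² on a made-up series; the helper names are hypothetical.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def strawman_predictions(series):
    """Predict each value with the one immediately before it."""
    return series[:-1]


# Made-up series for illustration.
series = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
actual = series[1:]                       # values to predict
predicted = strawman_predictions(series)  # yesterday's values
score = r_squared(actual, predicted)
print(round(score, 3))
```

Any model worth using should beat this score; if it does not, the extra complexity is not paying for itself.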

22 of 49

Input: (T − L) × (p·L + 1)

The +1 is an intercept (bias) column for ordinary least squares;

it shifts the regression line up or down
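
A sketch of the flattened design matrix for autoregression, using placeholder data. Each row holds an intercept column followed by the p features of each of the L previous time steps. Treating column 0 as the target is an assumption, and `make_ar_design` is a hypothetical helper name.

```python
def make_ar_design(series, L):
    """Build a (T - L) x (p * L + 1) design matrix and T - L targets."""
    rows, y = [], []
    for t in range(L, len(series)):
        flat = [1.0]  # intercept (bias) column
        for past in series[t - L:t]:
            flat.extend(past)  # flatten the L past rows into one long row
        rows.append(flat)
        y.append(series[t][0])  # assumed target: column 0 at time t
    return rows, y


T, L, p = 6051, 5, 3
# Placeholder data: row t is [t, t + 0.1, t + 0.2].
series = [[float(t), t + 0.1, t + 0.2] for t in range(T)]

rows, y = make_ar_design(series, L)
design_shape = (len(rows), len(rows[0]))
print(design_shape)  # (6046, 16)
```

With p = 3 and L = 5, each row has 3 × 5 + 1 = 16 columns, ready for ordinary least squares.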

23 of 49

RNN vs Autoregression (AR)

RNN (Recurrent Neural Network)

Input: 3D Tensor

[T-L, L, p] = [number of samples, lags, features]

Output

[T-L, 1]

Preserves the 3D sequence structure, so the model can learn from the order of the time steps.

Autoregression

Input: 2D Matrix

[(T-L), (p*L)+1] = [number of samples, flattened past features + intercept]

Output

[T-L, 1]

Simply flattens past features into a 1D vector. While no raw data is lost, it ignores the sequential structure.

similar to ISLP applied exercise 10

24 of 49

No big differences between AR and RNN

26 of 49

SNR

The ratio of meaningful, underlying patterns (Signal) to random, unwanted variations (Noise) in a dataset.

"RNN" Terminology

Broad sense (the RNN family, “RNNs”):

the entire category of neural networks designed to process sequential data.

27 of 49

Handling Noisy Data (Low SNR)

IMDB Reviews : A simple Lasso logistic regression matches or outperforms Neural Networks.

Time Series Forecasting: Simply adding a day_of_week feature increases R² more than changing the entire model architecture. (Feature engineering beats complexity)

28 of 49

Summarize: When to use Deep Learning

Deep Learning Models (RNN, CNN)

Low Bias, High Variance
Optimal for High SNR
Complex patterns, large datasets

Traditional Models (AR, Lasso)

High Bias, Low Variance
Preferred for Low SNR
Example: IMDB reviews, NYSE data

Occam’s Razor Principle: Prefer simpler models if they work as well.

But!! In modern Deep Learning, the traditional bias-variance trade-off doesn't always hold. Models can improve when overparameterized beyond the interpolation threshold.

→ Double Descent

30 of 49

Saddle point: gradient = 0, but the surface curves upward in one direction and downward in another, so it is neither a minimum nor a maximum

32 of 49

R (Risk or Loss function): Objective to minimize

𝜃 (Parameters/Weights): Current position.

t (Iteration/Time step index): Current step.

∇R(gradient of R): Direction of steepest ascent

ρ (rho): learning rate (step size of each move)

local minimum

delta

33 of 49

The Intuition of Gradient (∇R)

The Gradient vector always points to the steepest ascent (uphill).

To find the minimum, we must move in the opposite direction of the gradient.

If ∇R > 0: Uphill is to the right → Move left (subtract).

If ∇R < 0: Uphill is to the left → Move right (add).
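
The two cases can be checked on a one-dimensional example. The risk function R(θ) = (θ − 3)² and the learning rate below are assumptions chosen for illustration. The update subtracts ρ times the gradient, so a negative gradient moves θ to the right and a positive one moves it to the left.

```python
def grad_risk(theta):
    """Gradient of the assumed risk R(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


def gradient_descent(theta0, rho, steps):
    """Repeat theta <- theta - rho * grad R(theta); return every position visited."""
    theta = theta0
    path = [theta]
    for _ in range(steps):
        theta = theta - rho * grad_risk(theta)
        path.append(theta)
    return path


path_from_left = gradient_descent(0.0, 0.1, 100)   # starts where the gradient is negative
path_from_right = gradient_descent(8.0, 0.1, 100)  # starts where the gradient is positive

print(path_from_left[1], path_from_right[1])    # first step: right from 0, left from 8
print(path_from_left[-1], path_from_right[-1])  # both end close to the minimum at 3
```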

35 of 49

As with least squares in Chapters 2 and 3,

we define the objective function as a squared-error loss, in the same way as MSE and RSS

Chain rule, working backward: Error → Output → Activation (g) → Weight (w)
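
The backward chain can be traced on a scalar network with one hidden unit. This is a sketch under assumptions: the activation g is ReLU, the loss for one observation is (y − o)², and all numbers are made up. The chain-rule gradient is compared against a finite-difference estimate as a sanity check.

```python
def relu(z):
    return max(0.0, z)


def forward(x, w, beta):
    """Scalar network: z = w * x, a = g(z), o = beta * a."""
    z = w * x
    a = relu(z)
    o = beta * a
    return z, a, o


def loss(x, y, w, beta):
    """Squared error for a single observation (an assumed choice of loss)."""
    return (y - forward(x, w, beta)[2]) ** 2


def gradients(x, y, w, beta):
    """Chain rule: Error -> Output -> Activation -> Weight."""
    z, a, o = forward(x, w, beta)
    dR_do = -2.0 * (y - o)          # error w.r.t. output
    dR_dbeta = dR_do * a            # output-layer coefficient
    dR_da = dR_do * beta            # output w.r.t. activation
    da_dz = 1.0 if z > 0 else 0.0   # derivative of ReLU
    dR_dw = dR_da * da_dz * x       # activation w.r.t. hidden weight
    return dR_dw, dR_dbeta


x, y, w, beta = 2.0, 1.0, 0.5, 1.5
dR_dw, dR_dbeta = gradients(x, y, w, beta)

# Finite-difference check of dR/dw.
eps = 1e-6
numeric_dw = (loss(x, y, w + eps, beta) - loss(x, y, w - eps, beta)) / (2 * eps)
print(dR_dw, dR_dbeta, numeric_dw)
```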

36 of 49

w: weights connecting the inputs to the hidden layer, and past hidden activations to current ones

β (beta): used only for the output layer; plays the same role as regression coefficients

what is changing? : parameters

what is being differentiated? : Risk function

Update Rule:

$$\theta^{t+1} = \theta^{t} - \rho \, \nabla R(\theta^{t})$$

37 of 49

Quiz: This is a simple scalar network (1 neuron, 1 weight per layer). Fill in the blanks!

What is ReLU (Rectified Linear Unit)? If the input is positive, it passes straight through; if negative, it becomes zero: max(0, x)

Error Loss:

39 of 49

Slow learning

Problem: Training for too long makes the model memorize noise (overfitting).

Solution: Use a small learning rate but stop training early before it overfits.

SGD

Problem: Standard GD calculates the error for all data (e.g., 60,000 images) just to take one step. It is extremely slow and easily gets stuck in a local minimum.

Solution: Use small random subsets (Minibatches, e.g., 128 images) for each step.
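
A minimal sketch of minibatch SGD on a one-parameter model y = w · x. Everything here is an assumption made for illustration: the noise-free synthetic data with true slope 2, the batch size of 16, the learning rate, and the function name `sgd_fit`. Each step uses the gradient of the mean squared error over one small random batch rather than over all the data.

```python
import random


def sgd_fit(xs, ys, rho=0.1, batch_size=16, epochs=50, seed=0):
    """Fit y = w * x by stochastic gradient descent on minibatches."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new random batches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this batch only.
            grad = sum(-2.0 * xs[i] * (ys[i] - w * xs[i]) for i in batch) / len(batch)
            w -= rho * grad
    return w


# Synthetic data with a known slope of 2 (no noise, for a clear check).
data_rng = random.Random(1)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(200)]
ys = [2.0 * x for x in xs]

w_hat = sgd_fit(xs, ys)
print(w_hat)  # close to 2
```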

40 of 49

Dropout: during training, each unit is randomly dropped (set to zero) with probability φ

Reduce dependency on strong features

→ prevent overfitting

→ reduce variance
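
Dropout on a layer's activations can be sketched as follows. Scaling the surviving units by 1 / (1 − φ) is an assumption here (a common convention that keeps the expected activation unchanged), and the function name `dropout` is hypothetical.

```python
import random


def dropout(activations, phi, rng):
    """Zero each activation with probability phi; rescale the survivors."""
    if not 0.0 <= phi < 1.0:
        raise ValueError("phi must be in [0, 1)")
    scale = 1.0 / (1.0 - phi)  # assumed rescaling of surviving units
    return [0.0 if rng.random() < phi else a * scale for a in activations]


rng = random.Random(0)
dropped = dropout([1.0] * 1000, 0.5, rng)
kept_fraction = sum(1 for v in dropped if v != 0.0) / len(dropped)

print(kept_fraction)  # roughly half of the units survive
print(dropout([1.0, 2.0], 0.0, rng))  # [1.0, 2.0]: phi = 0 changes nothing
```

Because a different random subset is dropped at every training step, the network cannot lean on any single strong feature.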

44 of 49

Summarize: when to use deep learning

Deep learning models (RNN, CNN) → low bias, high variance → suited to high SNR
AR, lasso → high bias, low variance → suited to low SNR (IMDb, NYSE)

Occam’s razor principle

But: the traditional bias-variance trade-off doesn’t always hold in modern deep learning (p >> n) → double descent
