1 of 49

Contents

  • Introduction
  • Neural Networks
    • Single-layer Neural Networks
    • Multi-layer Neural Networks
  • Convolutional Neural Networks
  • Recurrent Neural Networks

2 of 49

From images to text

IMDb

  • task: sentiment classification
  • input: a movie review (free text)
  • output: positive (1) or negative (0)
  • 25,000 training reviews & 25,000 test reviews

Images are already grids of numbers (pixels). Text is a sequence of words.

How do we turn a document into a feature vector?

“This has to be one of the worst films of the 1990s. When my friends and I were watching this film (being the target audience it was aimed at) we just sat & watched the first half an hour with our jaws touching the floor at how bad it really was. The rest of the time, everyone else in the theater just started talking to each other, leaving or generally crying into their popcorn …”

3 of 49

Bag-of-words featurization

  1. Build dictionary: top 10,000 most frequent words
  2. For each document, create a length-10,000 binary vector: 1 if the word appears, 0 otherwise
  3. Stack into a matrix: 25,000 reviews × 10,000 words; ~99% of entries are zero → store as a sparse matrix

What’s lost? Word order!

“not good” = “good not” = [..., 1, …, 1, …]

Bag-of-words throws word order away completely.

Dictionary: movie, acting, great, enjoyed, fresh, plot, the, …

Original review: “I really enjoyed this movie …”

Encode as binary vector: [1, 0, 0, 1, 0, 0, 0, …]
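
The encoding above can be sketched in a few lines of Python. This is a minimal illustration, assuming the seven-word toy dictionary from the slide rather than the full 10,000-word one; the `encode` helper and its simple tokenizer are hypothetical.

```python
import re

# Toy dictionary from the slide (the real one holds the top 10,000 words).
dictionary = ["movie", "acting", "great", "enjoyed", "fresh", "plot", "the"]

def encode(review, vocab):
    """Return a binary vector: 1 if the vocab word appears in the review, else 0."""
    tokens = set(re.findall(r"[a-z']+", review.lower()))
    return [1 if word in tokens else 0 for word in vocab]

vector = encode("I really enjoyed this movie", dictionary)
print(vector)  # [1, 0, 0, 1, 0, 0, 0]

# Word order is lost: both phrases map to the same vector.
same = encode("not good", ["good", "not"]) == encode("good not", ["good", "not"])
print(same)  # True
```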

4 of 49

Lasso vs. NN

Both models achieve a test-set accuracy of about 88%.

  • Deep learning is not a universal hammer.
  • When features are already informative in a roughly linear way, simpler methods often match deep learning, and they’re faster, more interpretable, and easier to deploy
  • glmnet: a package for fitting lasso and elastic-net regularized models; it accepts sparse input matrices, which suits highly sparse bag-of-words data

5 of 49

When order matters

Many real-world data sources are sequential:

  • Documents: sequence of words
  • Time series: temperature, stock prices, sensors
  • Speech, music: sequences of sound waves
  • Handwriting: sequence of pen strokes

“dog bites man” ≠ “man bites dog”

5 → 10 → 15 ≠ 15 → 10 → 5

Recurrent Neural Networks (RNNs): designed to exploit sequential structure in data

Order matters

6 of 49

Recurrent Neural Network

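[Figure: an RNN unrolled over time steps l = 1, …, L, processing “dog bites man” one word per time step. L = sequence length (number of words) or number of lags.]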

7 of 49

Recurrent Neural Network

[Figure: an ordinary neural network compared with a recurrent neural network]

8 of 49

Recurrent Neural Network

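For each hidden unit k at time step l:

$$A_{lk} = g\Big(w_{k0} + \sum_{j=1}^{p} w_{kj} X_{lj} + \sum_{s=1}^{K} u_{ks} A_{l-1,s}\Big)$$

  • W: input → hidden weights, applied to the current input X_l
  • U: hidden → hidden weights, applied to the previous state A_{l-1}
  • B: hidden → output weights
  • w_{k0}: bias

Without the U term, this is the hidden layer of an ordinary neural network.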

9 of 49

Recurrent Neural Network

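For each hidden unit k at time step l:

$$A_{lk} = g\Big(w_{k0} + \sum_{j=1}^{p} w_{kj} X_{lj} + \sum_{s=1}^{K} u_{ks} A_{l-1,s}\Big)$$

Output at step l:

$$O_l = \beta_0 + \sum_{k=1}^{K} \beta_k A_{lk}$$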

  • For binary classification: pass O through sigmoid
  • For multi-class: pass through softmax
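
A forward pass through one recurrent layer can be sketched as follows. This is a minimal pure-Python illustration, not the course's implementation: the function name `rnn_forward`, the choice of tanh for the activation g, and the toy weights are all assumptions made here.

```python
import math

def rnn_forward(X, W, U, w0, beta, beta0):
    """Run a simple RNN over a sequence and return sigmoid(O_L).

    X: list of L input vectors (each of length p)
    W: K x p input-to-hidden weights, U: K x K hidden-to-hidden weights
    w0: K hidden biases, beta: K output weights, beta0: output bias
    """
    K = len(w0)
    A = [0.0] * K  # initial hidden state
    for x in X:  # the same W and U are reused at every time step
        A = [
            math.tanh(
                w0[k]
                + sum(W[k][j] * x[j] for j in range(len(x)))
                + sum(U[k][s] * A[s] for s in range(K))
            )
            for k in range(K)
        ]
    O = beta0 + sum(beta[k] * A[k] for k in range(K))
    return 1.0 / (1.0 + math.exp(-O))  # sigmoid for binary classification

# Hypothetical toy weights: one input feature (p = 1), one hidden unit (K = 1).
W, U, w0, beta, beta0 = [[1.0]], [[0.5]], [0.0], [1.0], 0.0
p_forward = rnn_forward([[1.0], [0.0]], W, U, w0, beta, beta0)
p_reverse = rnn_forward([[0.0], [1.0]], W, U, w0, beta, beta0)
print(p_forward, p_reverse)  # the two orderings give different outputs
```

Unlike bag-of-words, reversing the input sequence changes the result, because each hidden state depends on the one before it.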

10 of 49

Word embedding

  • Represent each word as a dense vector in a lower-dimensional space
    • Each component is a real number, mostly nonzero
  • Embedding matrix E (m x 10,000)
    • Each column = one word’s embedding
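
The link between a one-hot word vector and its embedding can be sketched like this. The toy sizes (m = 3, a vocabulary of 4 words instead of 10,000) and the values in `E` are made up for illustration.

```python
# Hypothetical embedding matrix E with m = 3 rows and 4 columns (one per word).
E = [
    [0.2, -0.5, 0.1, 0.7],
    [0.9, 0.3, -0.4, 0.0],
    [-0.1, 0.8, 0.6, -0.3],
]

def one_hot(index, size):
    """Sparse representation: all zeros except a single 1."""
    return [1 if i == index else 0 for i in range(size)]

def embed_by_matmul(E, x):
    """Multiply E by a one-hot vector x."""
    return [sum(row[j] * x[j] for j in range(len(x))) for row in E]

def embed_by_lookup(E, index):
    """Read off the column of E that belongs to the word."""
    return [row[index] for row in E]

dense = embed_by_lookup(E, 2)
print(dense)  # [0.1, -0.4, 0.6]
```

Multiplying E by a one-hot vector and picking out a column give the same dense vector, which is why embedding layers are usually implemented as a lookup.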

11 of 49

IMDB results

Setup

  • Reviews truncated/padded to L=500 words
  • Embedding dimension m=32 (learned with the task)
  • One recurrent layer, K=32 hidden units
  • Dropout regularization

Model                  Test accuracy
Bag-of-words + lasso   ~88%
Simple RNN             ~76%

12 of 49

IMDB results

“Vanishing gradient problem”

Why is the simple RNN worse than bag-of-words?

  • Reviews can be up to 500 words long.
  • The same W, U applied 500 times in a row.

→ Signals from early words get diluted (washed out) by the time they reach the final time step.
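
A rough feel for the dilution comes from a deliberately simplified case: a scalar, linear recurrence A_l = u · A_{l-1}, where the sensitivity of the last state to the first is u^(L-1). The value u = 0.9 and the helper name are assumptions for illustration only; a real RNN has matrices and a nonlinear activation.

```python
def gradient_factor(u, steps):
    """|dA_L / dA_1| for the scalar linear recurrence A_l = u * A_{l-1}."""
    return abs(u) ** (steps - 1)

# Hypothetical recurrent weight u = 0.9, applied over longer and longer sequences.
for L in (10, 100, 500):
    print(L, gradient_factor(0.9, L))
```

With |u| < 1 the factor shrinks geometrically, so over 500 steps almost nothing from the first word reaches the end.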

13 of 49

LSTM (Long Short-Term Memory)

  • Two tracks: a long-term memory track and a short-term track

Results on IMDb

Model        Test accuracy
Simple RNN   ~76%
LSTM         ~87%

Field evolution

  • ISLP, 2020: ~95% with tuned RNN/LSTM
  • Today: 97%+ with Transformers (BERT, GPT, …)
  • The principles we learned (recurrence, weight sharing, embeddings) are foundational.
  • Specific architectures keep evolving.

14 of 49

Thank you!

15 of 49

Recap

16 of 49

Bag of words → RNN (word embedding) → LSTM

https://calvinfeng.gitbook.io/machine-learning-notebook/supervised-learning/recurrent-neural-network/recurrent_neural_networks

  • many to one: input sequence, single output (e.g., sentiment classification)
  • one to many: single input, output sequence (e.g., image captioning: image → caption)
  • many to many: input sequence, output sequence

In the broad sense, “RNN” (or “RNNs”) refers to the family of neural networks that process an entire sequence; LSTMs belong to this family.

In the context of this textbook, “RNN” refers to the early RNN models, specifically vanilla RNNs.

17 of 49

18 of 49

19 of 49

Autocorrelation: the correlation between a series and its own past values at lag l

It is highest at lag l = 1.

Correlation: the direction and strength of the linear relationship between two variables
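
The lag-l autocorrelation can be computed directly. The sketch below uses the usual sample formula and a made-up trending series; neither the helper name nor the data comes from the slides.

```python
def autocorrelation(x, lag):
    """Sample autocorrelation of the series x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    numer = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
    return numer / denom

# Hypothetical smooth series: neighbouring values are similar.
series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
acf = [autocorrelation(series, lag) for lag in range(4)]
print(acf)  # lag 0 is 1.0; values shrink as the lag grows
```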

20 of 49

Input

Originally: 6,051 × 3 (T time points × p features)

Reshaped: (T − L, L, p) = (number of samples, lags, features) = (6046, 5, 3)

Output: 1 value (v_t)

→ many to one
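
The reshaping into lagged windows can be sketched as below. The placeholder series, the `make_windows` helper, and the choice of column 0 as the target v_t are assumptions for illustration; only the shapes come from the slide.

```python
def make_windows(series, L, target_col=0):
    """Turn a T x p series into (T-L) windows of shape L x p plus one target each."""
    X, y = [], []
    for t in range(L, len(series)):
        X.append(series[t - L:t])        # the L previous time points, oldest first
        y.append(series[t][target_col])  # the value to predict at time t
    return X, y

T, p, L = 6051, 3, 5
# Placeholder data: column 0 simply counts time so the windows are easy to inspect.
series = [[float(t), 0.0, 0.0] for t in range(T)]

X, y = make_windows(series, L)
shape = (len(X), len(X[0]), len(X[0][0]))
print(shape, len(y))  # (6046, 5, 3) 6046
```

Each sample keeps its L lags as separate rows, which is the 3D layout a recurrent layer consumes.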

21 of 49

What is a "Strawman"?

A simple prediction used as a baseline for comparison.

22 of 49

Input: a (T − L) × (p·L + 1) matrix

The +1 is the intercept (bias) column for ordinary least squares.

The intercept shifts the regression line up or down.

23 of 49

RNN vs Autoregression (AR)

RNN (Recurrent Neural Network)

Input: 3D tensor [T-L, L, p] = [total samples, lags, features]

Output: [T-L, 1]

Keeps the 3D structure, so the model sees the lags in time order and can learn from that order.

Autoregression

Input: 2D matrix [(T-L), (p*L)+1] = [total samples, flattened past features + intercept]

Output: [T-L, 1]

Flattens the past features of each sample into a single vector. No raw data is lost, but the sequential structure is ignored.

Similar to ISLP applied exercise 10
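
The flattening step for the autoregression can be sketched as follows, assuming the lagged windows already exist in the (samples, L, p) layout. The helper name and the tiny example data are hypothetical.

```python
def flatten_with_intercept(X):
    """Turn (samples, L, p) windows into rows of length p*L + 1 (leading 1 = intercept)."""
    return [[1.0] + [value for step in window for value in step] for window in X]

# Hypothetical windows: 4 samples, L = 5 lags, p = 3 features.
X = [
    [[float(sample + lag + feature) for feature in range(3)] for lag in range(5)]
    for sample in range(4)
]

design = flatten_with_intercept(X)
print(len(design), len(design[0]))  # 4 16
```

The same numbers feed both models; the flattened rows just no longer mark which values belong to which time step.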

24 of 49

There is no big difference in performance between the AR model and the RNN.

25 of 49

26 of 49

SNR (signal-to-noise ratio)

The ratio of meaningful underlying patterns (signal) to random, unwanted variation (noise) in a dataset.

"RNN" terminology

Broad sense (the RNN family, "RNNs"): the entire category of neural networks designed to process sequential data.

27 of 49

Handling Noisy Data (Low SNR)

IMDb reviews: A simple lasso logistic regression matches or outperforms neural networks.

Time series forecasting: Simply adding a day_of_week feature increases R² more than changing the entire model architecture (feature engineering beats complexity).

28 of 49

Summarize: When to use Deep Learning

Deep Learning Models (RNN, CNN)

  • Low Bias, High Variance
  • Optimal for High SNR
  • Complex patterns, large datasets

Traditional Models (AR, Lasso)

  • High Bias, Low Variance
  • Preferred for Low SNR
  • Example: IMDB reviews, NYSE data

Occam’s Razor Principle: Prefer simpler models if they work as well.

But!! In modern Deep Learning, the traditional bias-variance trade-off doesn't always hold. Models can improve when overparameterized beyond the interpolation threshold.

Double Descent

29 of 49

30 of 49

Saddle point: the gradient is 0, but the surface slopes in different directions on either side, so the point is neither a minimum nor a maximum

31 of 49

32 of 49

R (risk or loss function): the objective to minimize

𝜃 (parameters/weights): the current position

t (iteration/time step index): the current step

∇R (gradient of R): the direction of steepest ascent

ρ (rho): the learning rate (step size of each move)

[Figure labels: local minimum; δ (delta)]

33 of 49

The Intuition of Gradient (∇R)

The gradient vector always points in the direction of steepest ascent (uphill).

To find the minimum, we must move in the opposite direction of the gradient.

If ∇R > 0: uphill is to the right → move left (subtract).

If ∇R < 0: uphill is to the left → move right (add).
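
Both cases are handled by one rule: subtract ρ times the gradient. The sketch below applies it to a made-up one-dimensional risk R(θ) = (θ − 3)²; the function, starting points, and learning rate are assumptions for illustration.

```python
def gradient_descent(grad, theta, rho, steps):
    """Repeatedly step against the gradient: theta <- theta - rho * grad(theta)."""
    for _ in range(steps):
        theta = theta - rho * grad(theta)
    return theta

def grad_R(theta):
    """Gradient of the hypothetical risk R(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

# Start to the right of the minimum (gradient > 0, so theta moves left) ...
from_right = gradient_descent(grad_R, theta=10.0, rho=0.1, steps=200)
# ... and to the left of it (gradient < 0, so theta moves right).
from_left = gradient_descent(grad_R, theta=-4.0, rho=0.1, steps=200)
print(from_right, from_left)  # both end up close to 3
```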

34 of 49

35 of 49

As with least squares in Chapters 2 and 3, we define the objective function to measure error in the same way as MSE and RSS.

Chain of dependencies: Error → Output → Activation (g) → Weight (w)

36 of 49

w: weights connecting the inputs to the hidden layer, and the past hidden activations to the current ones

β (beta): used only for the output layer; plays the same role as regression coefficients

What is changing? The parameters.

What is being differentiated? The risk function.

Update rule:

$$\theta^{t+1} \leftarrow \theta^{t} - \rho \nabla R(\theta^{t})$$

37 of 49

Quiz: This is a simple scalar network (1 neuron, 1 weight per layer). Fill in the blanks!

What is ReLU (Rectified Linear Unit)? If the input is positive, it passes straight through; if negative, it becomes zero: ReLU(x) = max(0, x).

Error Loss:
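
A scalar network of this kind can be traced by hand with the chain rule. The sketch below is an assumption-laden illustration: it takes the loss to be ½(y − ŷ)², uses one hidden weight w and one output weight β with no biases, and picks arbitrary numbers; the quiz's actual values are not reproduced here.

```python
def relu(x):
    return max(0.0, x)

# Hypothetical values for the input, target, and the two weights.
x, y = 2.0, 1.0
w, beta = 0.5, 3.0

# Forward pass: weight -> activation -> output -> error.
z = w * x                      # 1.0
a = relu(z)                    # 1.0
y_hat = beta * a               # 3.0
loss = 0.5 * (y_hat - y) ** 2  # 2.0

# Backward pass: chain rule from the error back to each weight.
d_loss_d_yhat = y_hat - y                      # 2.0
grad_beta = d_loss_d_yhat * a                  # 2.0
relu_slope = 1.0 if z > 0 else 0.0             # derivative of ReLU at z
grad_w = d_loss_d_yhat * beta * relu_slope * x  # 12.0

print(loss, grad_beta, grad_w)  # 2.0 2.0 12.0
```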

38 of 49

39 of 49

Slow learning

Problem: Training for too long makes the model memorize noise (overfitting).

Solution: Use a small learning rate, but stop training early, before the model overfits.

SGD

Problem: Standard gradient descent computes the error over all the data (e.g., 60,000 images) just to take one step. It is extremely slow and can easily get stuck in a local minimum.

Solution: Use small random subsets (Minibatches, e.g., 128 images) for each step.
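
Minibatch SGD can be sketched on a problem small enough to check by eye. Here it fits a straight line rather than an image classifier; the data, batch size of 16, learning rate, and function name are all assumptions for illustration.

```python
import random

def sgd_linear(xs, ys, rho=0.1, batch_size=16, epochs=200, seed=0):
    """Fit y = w*x + b by minibatch SGD on half the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new random minibatches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            grad_w = sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = sum(errors) / len(batch)
            w -= rho * grad_w  # one cheap step per minibatch
            b -= rho * grad_b
    return w, b

# Hypothetical noise-free data on the line y = 2x + 1.
xs = [i / 100 for i in range(200)]
ys = [2.0 * x + 1.0 for x in xs]
w_hat, b_hat = sgd_linear(xs, ys)
print(w_hat, b_hat)  # close to 2 and 1
```

Each step looks at only 16 points, so the model takes many inexpensive steps per pass over the data instead of one expensive one.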

40 of 49

Dropout: each unit is removed at random with probability φ (phi) during training

Reduces dependency on strong features

→ prevents overfitting

→ reduces variance
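
Dropout on one layer's activations can be sketched as below. This illustration assumes the common convention of scaling the surviving units by 1/(1 − φ); the helper name, seed, and activation values are made up.

```python
import random

def dropout(activations, phi, rng):
    """Zero each unit with probability phi; scale survivors by 1 / (1 - phi)."""
    keep_scale = 1.0 / (1.0 - phi)
    return [0.0 if rng.random() < phi else a * keep_scale for a in activations]

rng = random.Random(0)
hidden = [0.5, 1.2, -0.7, 2.0, 0.3, -1.1]  # hypothetical hidden activations
dropped = dropout(hidden, phi=0.5, rng=rng)
print(dropped)  # some entries are 0.0, the rest are doubled
```

Because a different random subset is removed on every training step, the network cannot lean on any single unit.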

41 of 49

42 of 49

43 of 49

44 of 49

Summarize: when to use deep learning

  • Deep learning models (RNN, CNN) → low bias, high variance → suited to high SNR
  • AR, lasso → high bias, low variance → suited to low SNR (IMDb, NYSE)

Occam’s razor principle

  • But the traditional bias-variance trade-off doesn’t always hold in modern deep learning (p >> n) → double descent

45 of 49

Finding the minimum-norm solution is theoretically similar to ridge regression.

Ridge

46 of 49

47 of 49

Implicit regularization
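
A tiny case shows the idea: with more unknowns than equations, many parameter vectors fit the data exactly, yet gradient descent started from zero settles on the one with the smallest norm. The single equation θ₁ + 2θ₂ = 5, the learning rate, and the step count below are assumptions chosen for illustration.

```python
# One equation, two unknowns: a[0]*theta[0] + a[1]*theta[1] = b (infinitely many solutions).
a, b = [1.0, 2.0], 5.0

theta = [0.0, 0.0]  # start at zero
rho = 0.1
for _ in range(200):
    residual = a[0] * theta[0] + a[1] * theta[1] - b
    # Gradient of 0.5 * residual^2 with respect to theta is residual * a.
    theta = [theta[j] - rho * residual * a[j] for j in range(2)]

# Closed-form minimum-norm solution: a * b / ||a||^2.
min_norm = [a[j] * b / (a[0] ** 2 + a[1] ** 2) for j in range(2)]
print(theta, min_norm)  # both close to [1.0, 2.0]
```

No penalty term was added, yet the answer matches what ridge regression approaches as its penalty shrinks toward zero.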

48 of 49

49 of 49