1 of 78

SDS4299

Natural Language Processing 2

Going Deeper with Neural Nets

2 of 78

Any Burning Questions for us?

Slido Code: #4299

3 of 78

Our Team

Keith

Y2 BZA

Bharath Shankar

Y2 DSA

Vinod

Y3 BZA

4 of 78

Recap

01

5 of 78

Last Time, In SDS1101

We covered:

Text Preprocessing

Naive Bayes Classifier

6 of 78

We got a good result from Naive Bayes..

… But we can do better!

7 of 78

NLP is Hard…

Machines can’t understand language, but we can!

It’s all thanks to this!

So, why don’t we try to mimic it?

8 of 78

Introduction to Neural Networks

9 of 78

Neural networks were inspired by Neurons

10 of 78

Neural networks were inspired by Neurons

[Diagram: a biological neuron receiving signals from other neurons and passing a signal on.]

11 of 78

12 of 78

[Diagram: a neural network with an input layer, hidden layers 1 and 2, and an output layer.]

13 of 78

[Diagram: a perceptron with inputs x1, x2 (weights w1, w2) and a bias input x0 = 1 (weight w0), producing an output of 1 or 0.]

  1. Get the weighted sum

Σ x·w = w0 + x1w1 + x2w2

  2. Apply the weighted sum to the activation function

ŷ = g( Σ x·w )
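A minimal NumPy sketch of this perceptron computation, with a step activation (the weights and inputs below are made-up illustrative values):

```python
import numpy as np

def step(z):
    # Step activation: fire (output 1) if the weighted sum is >= 0
    return 1 if z >= 0 else 0

# Made-up example values for the bias, weights and inputs
w0, w1, w2 = -1.0, 0.6, 0.9   # bias and weights
x1, x2 = 1.0, 0.5             # inputs (x0 is implicitly 1)

# 1. Get the weighted sum
z = w0 + x1 * w1 + x2 * w2

# 2. Apply the weighted sum to the activation function
y_hat = step(z)
print(z, y_hat)   # ~0.05, so the neuron fires and outputs 1
```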

14 of 78

Activation functions

What is an activation function?

  • Takes in the weighted sum and decides if the neuron should “fire” (activate)

In this example, the neuron would activate and output 1 if w0 + x1w1 + x2w2 ≥ 0

15 of 78

Non-linear activation functions

Sigmoid or Logistic Activation Function

Output is bound to [0,1] and can be used to predict probability

16 of 78

Non-linear activation functions

Tanh or Hyperbolic Tangent Activation Function

Similar to sigmoid, but ranges from -1 to 1

17 of 78

Non-linear activation functions

ReLU (Rectified Linear Unit) Activation Function

Outputs 0 for negative inputs and the input itself for positive inputs
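A quick NumPy sketch of the three activation functions above, so their output ranges can be compared side by side:

```python
import numpy as np

def sigmoid(z):
    # Bounded to (0, 1); often used to predict a probability
    return 1 / (1 + np.exp(-z))

def tanh(z):
    # Similar shape to sigmoid, but bounded to (-1, 1)
    return np.tanh(z)

def relu(z):
    # 0 for negative inputs, the input itself for positive inputs
    return np.maximum(0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))  # approx [0.119 0.5   0.881]
print(tanh(z))     # approx [-0.964 0.    0.964]
print(relu(z))     # [0. 0. 2.]
```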

18 of 78

Quantifying Loss

How do we do this?

Regression: Mean Squared Error (MSE), Mean Absolute Error (MAE)

Classification: Binary Cross Entropy, Hinge Loss

For the rest of the workshop, we will assume a classification task with binary cross entropy as our cost function
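Since we will use binary cross entropy from here on, here is a minimal NumPy sketch of how that cost is computed (the labels and predicted probabilities are made-up; clipping is added to avoid log(0)):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Clip predictions so log() never sees exactly 0 or 1
    y_pred = np.clip(y_pred, eps, 1 - eps)
    # Average of -[y*log(p) + (1-y)*log(1-p)] over all samples
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.2, 0.8, 0.6])   # made-up predicted probabilities
print(binary_cross_entropy(y_true, y_pred))  # ~0.27
```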

19 of 78

Gradient Descent

An iterative solution to optimize the loss

We cannot jump straight to the minimum, so let’s walk there instead.

Let us say that the direction of steepest descent is represented by a vector v.

Then, we have to move some distance α in direction v

Once we finish the step, let us again check the direction of steepest descent!

We then stop at the minima - Our Goal!

20 of 78

Gradient Descent

Choosing the learning rate α

α too small: takes too much time

α just right

α too large: can diverge

21 of 78

Gradient Descent

Weight update

Let us consolidate every individual weight in the network into a huge column vector: Ө

Then, Ө_updated = Ө - αv

If α is set right, we should be able to reach a minimum!

We should be able to tweak our weights to minimise the cost function now!
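A minimal sketch of the update rule Ө_updated = Ө - αv on a toy quadratic cost whose gradient is easy to compute by hand (the cost function, starting point, and learning rate are made-up for illustration):

```python
import numpy as np

def cost(theta):
    # Toy convex cost with its minimum at theta = [1, -2]
    return (theta[0] - 1) ** 2 + (theta[1] + 2) ** 2

def gradient(theta):
    # Direction of steepest ascent; we step in the opposite direction
    return np.array([2 * (theta[0] - 1), 2 * (theta[1] + 2)])

theta = np.array([5.0, 5.0])   # all weights stacked into one vector
alpha = 0.1                    # learning rate

for _ in range(100):
    theta = theta - alpha * gradient(theta)   # theta_updated = theta - alpha * v

print(theta)   # very close to [1, -2], the minimum
```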

22 of 78

Backpropagation

We need the ability to calculate the direction of steepest descent!

But how?

Derivative: the slope at a point (in 2D)

Gradient: the direction of steepest ascent, generalized to n dimensions

23 of 78

Backpropagation

[Diagram: a small network x → z1 → ŷ → L(W), with weights w1 and w2.]

How does a small change in one weight (w1) affect the final loss L(W)?

∂L(W)/∂w1 = ∂L(W)/∂ŷ · ∂ŷ/∂z1 · ∂z1/∂w1

Repeat this for every weight in the network using gradients from later layers
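To make the chain rule concrete, here is a hedged NumPy sketch for a tiny made-up network matching the diagram (z1 = w1·x, ŷ = sigmoid(w2·z1), binary cross entropy loss; all values are illustrative):

```python
import numpy as np

# Tiny made-up network: x -> z1 -> y_hat -> L(W)
x, y = 2.0, 1.0          # one input and its true label
w1, w2 = 0.5, -0.3       # made-up weights

# Forward pass
z1 = w1 * x
y_hat = 1 / (1 + np.exp(-w2 * z1))                       # sigmoid output
L = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))   # binary cross entropy

# Backward pass: dL/dw1 = dL/dy_hat * dy_hat/dz1 * dz1/dw1
dL_dyhat = -(y / y_hat) + (1 - y) / (1 - y_hat)
dyhat_dz1 = y_hat * (1 - y_hat) * w2    # sigmoid'(w2*z1) * w2
dz1_dw1 = x
dL_dw1 = dL_dyhat * dyhat_dz1 * dz1_dw1

print(dL_dw1)   # positive here, so gradient descent would decrease w1
```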

24 of 78

Backpropagation

Flow of information

Backpropagation algorithm:

  1. Take the derivative (gradient) of the loss with respect to each parameter
  2. Shift parameters in order to minimize loss

25 of 78

Backpropagation

We just need to run the backpropagation algorithm for each weight in the neural net to get our gradient

We use this gradient to take a step, improving our net slightly

We finally stop when our cost function is minimized!

26 of 78

Dropout

At every iteration, dropout randomly selects some neurons and removes them along with all of their incoming & outgoing connections

The probability that these neurons will be selected depends on the rate specified (also a hyperparameter)
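As one example, in Keras dropout is a single layer placed between other layers; a minimal sketch (the layer sizes and input dimension are made-up):

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Input(shape=(100,)),            # made-up input dimension
    Dense(64, activation="relu"),
    Dropout(0.5),                   # rate=0.5: each neuron has a 50% chance of being dropped each iteration
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```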

27 of 78

Embeddings

02

28 of 78

What are Embeddings? Why do we need them?

What?

Low dimensional continuous vector representations

Why?

‘Hello!’ (input)  →  [ 6, 1, 4, 3, 5, 7, 9 ] (output): a machine readable format

29 of 78

Can I use one-hot encoding to ‘embed’ my words?

30 of 78

One-hot Encoding

Step 1: Create a vector for each unique word

E.g. for the corpus "I love data science":

I        →  [ 1, 0, 0, 0 ]
love     →  [ 0, 1, 0, 0 ]
data     →  [ 0, 0, 1, 0 ]
science  →  [ 0, 0, 0, 1 ]

Is this machine readable? YES!

Is this a good embedding? NO :(
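A minimal sketch of this one-hot scheme in plain Python, using the toy vocabulary above:

```python
corpus = ["I", "love", "data", "science"]

# One vector slot per unique word in the corpus
vocab = {word: idx for idx, word in enumerate(corpus)}

def one_hot(word):
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

print(one_hot("I"))        # [1, 0, 0, 0]
print(one_hot("science"))  # [0, 0, 0, 1]
```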

31 of 78

Why not?

  1. As the number of words INCREASES, the number of dimensions of our vector INCREASES, computational complexity INCREASES, time taken INCREASES

1000 words (in our corpus) = 1000-dimensional vector

  2. High-dimensional sparse matrices don’t work well with many machine learning models

  3. No semantic similarity between any of the vectors

32 of 78

Semantic Similarity

Ideally, the vectors used should ‘capture the meaning’ of our words

Man-made equipment is close together

Animals are close together

33 of 78

Types of Embeddings available

Custom Embeddings

Pre-trained Embeddings

  1. Embeddings built specifically for your dataset
  2. Build it on your own training data using Word2Vec, TF-IDF, etc.

(Note: one-hot encoding is also in this category!)

  1. Already trained on a large dataset
  2. Use the pre-defined vectors corresponding to the words in our dataset

Examples: GloVe, Word2Vec

34 of 78

What is Word2Vec?

A neural-network-based embedding model with a single hidden layer, created in 2013 by researchers at Google

35 of 78

CBOW

In Continuous Bag of Words,

  1. The words before & after the ‘target’ word are given as inputs to the model
  2. The model predicts the probability of the ‘target’ word

[Diagram: for the sentence "I really love studying data science", the surrounding context words are fed in as inputs and the model predicts the target word.]

36 of 78

Skip-gram

In Skip-gram,

  • The ‘target’ word is given as input to the model
  • The model predicts the probability of the neighbouring words before & after it

[Diagram: for the sentence "I really love studying data science", the target word is fed in as input and the model predicts its neighbouring words.]

37 of 78

How do we generate vectors?

In both models, we are interested in the weights of the hidden layer!
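As a hedged sketch of how this looks in practice: gensim is one library that implements both variants (sg=0 for CBOW, sg=1 for Skip-gram), and it exposes the learned hidden-layer weights as the word vectors. The toy corpus and parameters below are made-up:

```python
from gensim.models import Word2Vec

# Toy training corpus: a list of tokenised sentences (made-up example)
sentences = [
    ["i", "really", "love", "studying", "data", "science"],
    ["i", "love", "data"],
]

# sg=0 trains CBOW, sg=1 trains Skip-gram
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# The learned hidden-layer weights are the word embeddings
vector = model.wv["data"]      # 50-dimensional vector for "data"
print(vector.shape)            # (50,)
print(model.wv.most_similar("data", topn=2))
```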

38 of 78

What is GloVe?

In addition to semantic similarity, it takes into account how frequently words co-occur with each other across the whole corpus

Created in 2014 by researchers at Stanford
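As a hedged sketch, pre-trained GloVe vectors are distributed as a plain-text file with one word per line followed by its vector components; the filename glove.6B.100d.txt below is the standard Stanford download and is assumed to be available locally:

```python
import numpy as np

embeddings = {}
# Each line: a word followed by its vector components, space-separated
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        word = parts[0]
        embeddings[word] = np.asarray(parts[1:], dtype="float32")

print(embeddings["data"].shape)   # (100,)
```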

39 of 78

Let’s work it out in our Colab file:

tinyurl.com/nlp2colab

40 of 78

RNN

03

Recurrent Neural Networks

41 of 78

Recurrent Neural Networks (RNN)

  • An RNN is a special type of artificial neural network that is adapted to work with time series or other data that involves sequences

  • RNNs have the concept of “memory” that helps them store the states or information of previous inputs to generate the next output of the sequence

42 of 78

Recurrent Neural Network

  • Standard neural networks go from input to output in one direction

[Diagram: an RNN cell that takes input xt, produces output yt, and feeds its hidden state ht back into itself.]

  • In contrast, RNNs have loops in them, which allow information to persist over time.

  • At time step t, the RNN takes in the input xt and computes the output yt. In addition to the output, it computes an internal state ht and passes this information from this time step to the next

43 of 78

Unfolding an RNN

[Diagram: the RNN unrolled over time. At each time step the same RNN cell takes input xt-1, xt, xt+1, produces output yt-1, yt, yt+1, and passes its hidden state (ht-1, ht) on to the next time step.]

44 of 78

Recurrence relation

A recurrence relation is applied at every time step to process a sequence:

ht = fW(xt, ht-1), where ht-1 is the state at the previous time step

The same function fW and the same set of parameters are used at every time step
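A minimal NumPy sketch of this recurrence (the weight matrices below are random, made-up values; a trained RNN learns them):

```python
import numpy as np

rng = np.random.default_rng(0)

input_dim, hidden_dim = 8, 4
W_xh = rng.normal(size=(hidden_dim, input_dim))   # input -> hidden weights
W_hh = rng.normal(size=(hidden_dim, hidden_dim))  # hidden -> hidden weights (shared every step)

def rnn_step(x_t, h_prev):
    # h_t = f_W(x_t, h_{t-1}); the same function and weights at every time step
    return np.tanh(W_xh @ x_t + W_hh @ h_prev)

sequence = rng.normal(size=(5, input_dim))   # a made-up sequence of 5 time steps
h = np.zeros(hidden_dim)
for x_t in sequence:
    h = rnn_step(x_t, h)

print(h)   # final hidden state summarising the sequence
```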

45 of 78

Backpropagation Through Time (BPTT)

[Diagram: the unrolled RNN, now with a loss computed at each time step (L0, L1, L2, …). The per-step losses combine into a total loss L, and the gradient of L is propagated backwards through every time step.]

46 of 78

Let’s go back to the Colab and see how a simple RNN model is implemented!
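The Colab contains the actual implementation; as a rough, hedged sketch of what a simple Keras RNN text classifier might look like (the vocabulary size, embedding dimension, and layer sizes below are made-up and may differ from the Colab):

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

vocab_size, embed_dim, max_len = 10_000, 100, 50   # made-up values

model = Sequential([
    Input(shape=(max_len,)),          # sequences of word indices, padded to max_len
    Embedding(vocab_size, embed_dim), # each word index -> a dense embedding vector
    SimpleRNN(32),                    # final hidden state summarises the whole sequence
    Dense(1, activation="sigmoid"),   # binary classification head
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```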

47 of 78

LSTM

04

Long Short Term Memory

48 of 78

Problems with RNNs

Backpropagation Through Time (BPTT)

As the gradient is passed backwards through time step after time step, it shrinks at each step.

The gradient slowly disappears!!

49 of 78

So what?

Neural Nets learn using the gradient

So, what happens if the gradient vanishes??

The Network Doesn’t Learn!

50 of 78

Specifically, the network becomes focused on the last few inputs: it remembers the most recent part of the sequence and forgets the earlier part.

The network has an extremely short-term memory!

51 of 78

How do we fix this?

We have to add some memory!

“Forget” the less useful stuff, and implement some memory!

52 of 78

RNNs

[Diagram: hidden states h1 → h2 → h3 passed along the sequence.]

Information carried along by the hidden state

Hidden state is dominated by recency bias

53 of 78

Implementing Memory

Key Idea

While passing along the hidden state, pass along a “cell state” as well!

h (hidden state): carries local info

c (cell state): carries important info from the past!

54 of 78

Anatomy of an LSTM Cell

55 of 78

Seems complicated!

Let’s break it down.

3 parts to it!

  • Forget: get rid of irrelevant info
  • Input: update the cell state
  • Output: get the new hidden vector to pass along

56 of 78

Forget Gate

hi-1 and xi are combined and passed through a sigmoid to produce the forget vector fi.

This vector encodes what previous info we need to forget!

57 of 78

Input Gate

hi-1 and xi are combined and passed through a sigmoid to produce the input vector ii, and through a tanh to produce the candidate vector c’i.

Their product ii * c’i is the update applied to the cell state.

58 of 78

Applying the Update

The old cell state ci-1 is scaled by the forget vector, and the update is added in:

ci = fi * ci-1 + ii * c’i   (the new cell state)

59 of 78

Output Gate

The new cell state ci is passed through a tanh, hi-1 and xi are combined and passed through a sigmoid, and the two results are multiplied to give the new hidden state hi.

60 of 78

Putting it all together

[Diagram: the full LSTM cell. The forget gate produces fi from hi-1 and xi; the input gate produces ii and the candidate c’i; the cell state is updated as ci = fi * ci-1 + ii * c’i; the output gate combines ci with hi-1 and xi to produce the new hidden state hi. Both ci and hi are passed on to the next time step.]
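A hedged NumPy sketch of one full LSTM step: the slides omit the weight matrices, so the W_f, W_i, W_c, W_o below (and the concatenation of hi-1 with xi) follow the usual textbook formulation, with made-up random values:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, input_dim = 4, 3
concat_dim = hidden_dim + input_dim

# One weight matrix per gate (biases omitted for brevity)
W_f, W_i, W_c, W_o = (rng.normal(size=(hidden_dim, concat_dim)) for _ in range(4))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x_i, h_prev, c_prev):
    hx = np.concatenate([h_prev, x_i])   # combine h_{i-1} and x_i
    f_i = sigmoid(W_f @ hx)              # forget gate
    i_i = sigmoid(W_i @ hx)              # input gate
    c_cand = np.tanh(W_c @ hx)           # candidate vector c'_i
    c_i = f_i * c_prev + i_i * c_cand    # new cell state
    o_i = sigmoid(W_o @ hx)              # output gate
    h_i = o_i * np.tanh(c_i)             # new hidden state
    return h_i, c_i

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
x = rng.normal(size=input_dim)           # one made-up input
h, c = lstm_step(x, h, c)
print(h, c)
```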

61 of 78

Time for a practical demo!

62 of 78

Transformers

05

63 of 78

Pay Attention!

LSTMs are great, but…

They take ages to train

Still have some issues in training longer sequences

They still use ALL the info of the sequence!

64 of 78

Wait what?

Consider the following machine translation task:

Input: “I am a human”  →  Output: “Eu sou um humano”

65 of 78

Position matters!

For a sequence-to-sequence mapping, the order and position of the words matter!

[Diagram: for the input “I am a human”, each output word depends on some input words more than others.]

We need to account for each individual input along with the context!

66 of 78

Encoder - Decoder Models Explained

Key Idea: Use 2 LSTMs!

Encoder LSTM: converts the input into a fixed-length representation

Decoder LSTM: converts the fixed-length representation into the output

Great for sequence-to-sequence tasks!

67 of 78

A 2-part model

[Diagram: Input → Encoder → Context → Decoder → Output, with all of the encoder’s hidden states passed along as well; this is why transformers are so good.]

68 of 78

Local and Global!

Idea: Use context vector + hidden state at each timestep in the decoder!

[Diagram: at every time step, the decoder hidden state is combined with the encoder hidden states h1, h2, h3 to compute a context vector.]

69 of 78

How Do We Know What’s Relevant?

We get inspired by search engines!

[Diagram: the Query (from the decoder) is mapped onto the Keys (from the encoder) to get the best matches, giving a score for each; the scores are then multiplied with the Values (from the encoder).]

70 of 78

Computing an Output

For a particular output:

  1. Assign a score to each hidden state!
  2. Softmax the scores
  3. Scale the hidden states by the softmaxed scores
  4. Add up the scaled hidden states!
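A minimal NumPy sketch of this procedure for a single decoder step (all vectors are made-up random values, and dot-product scoring is assumed as one common choice):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, seq_len = 4, 3

encoder_states = rng.normal(size=(seq_len, hidden_dim))  # keys/values from the encoder
decoder_state = rng.normal(size=hidden_dim)              # query from the decoder

# 1. Assign a score to each encoder hidden state (dot-product scoring)
scores = encoder_states @ decoder_state

# 2. Softmax the scores so they sum to 1
weights = np.exp(scores) / np.sum(np.exp(scores))

# 3. & 4. Scale the hidden states by the softmaxed scores and add them up
context = weights @ encoder_states   # the context vector for this output step

print(weights, context)
```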

71 of 78

Making a prediction!

Hidden State: contains global info

Context Vector: contains local info

The two are combined to produce the output.

72 of 78

Let’s Move to the Colab

73 of 78

Current State of Transformer Models

Transformers are the current state-of-the-art models for NLP

  • 2017- Transformers introduced
  • 2018 - Improved encoding with BERT
  • 2019 - XLNet and ERNIE, as well as GPT2
  • 2020 - GPT3 released

Further research is still ongoing!

All of these models use the attention mechanism!

74 of 78

Conclusion

06

75 of 78

Summary of Results

Simple RNN

LSTM

Basic NN

78%

57%

87%

Keep in mind that all these networks are shallow!

Transformers

93%

76 of 78

Quick Summary

Recap on Deep Learning & NLP 1 Workshop

Embeddings

RNNs

LSTMs

77 of 78

What’s next?

Recap on Deep Learning & NLP 1 Workshop

Embeddings

RNNs

LSTMs

Own NLP Project?

78 of 78

Any Further Questions for us?

Slido Code: #4299