1 of 78

SDS4299

Natural Language Processing 2

Going Deeper with Neural Nets

2 of 78

Any Burning Questions for us?

Slido Code: #4299

3 of 78

Our Team

Keith

Y2 BZA

Bharath Shankar

Y2 DSA

Vinod

Y3 BZA

4 of 78

Recap

01

5 of 78

Last Time, In SDS1101

We covered:

Text Preprocessing

Naive Bayes Classifier

6 of 78

We got a good result from Naive Bayes..

… But we can do better!

7 of 78

NLP is Hard…

Machines can’t understand language, but we can!

It’s all thanks to this!

So, why don’t we try to mimic it?

8 of 78

Introduction to Neural Networks

9 of 78

Neural networks were inspired by Neurons

10 of 78

Neural networks were inspired by Neurons

[Diagram: a biological neuron receiving signals from other neurons and passing a signal on.]

11 of 78

12 of 78

[Diagram: a neural network with an input layer, hidden layers 1 and 2, and an output layer.]

13 of 78

[Diagram: a perceptron with inputs x1, x2 (weights w1, w2) and a bias input x0 = 1 (weight w0), producing an output of 1 or 0.]

  1. Get the weighted sum

Σ x·w = w0 + x1w1 + x2w2

  2. Apply the weighted sum to the activation function

ŷ = g( Σ x·w )
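A minimal NumPy sketch of this perceptron computation, with a step activation (the weights and inputs below are made-up illustrative values):

```python
import numpy as np

def step(z):
    # Step activation: fire (output 1) if the weighted sum is >= 0
    return 1 if z >= 0 else 0

# Made-up example values for the bias, weights and inputs
w0, w1, w2 = -1.0, 0.6, 0.9   # bias and weights
x1, x2 = 1.0, 0.5             # inputs (x0 is implicitly 1)

# 1. Get the weighted sum
z = w0 + x1 * w1 + x2 * w2

# 2. Apply the weighted sum to the activation function
y_hat = step(z)
print(z, y_hat)   # ~0.05, so the neuron fires and outputs 1
```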

14 of 78

Activation functions

What is an activation function?

  • Takes in the weighted sum and decides if the neuron should “fire” (activate)

In this example, the neuron would activate and output 1 if w0 + x1w1 + x2w2 ≥ 0

15 of 78

Non-linear activation functions

Sigmoid or Logistic Activation Function

Output is bound to [0,1] and can be used to predict probability

16 of 78

Non-linear activation functions

Tanh or Hyperbolic Tangent Activation Function

Similar to sigmoid, but ranges from -1 to 1

17 of 78

Non-linear activation functions

ReLU (Rectified Linear Unit) Activation Function

Outputs 0 for negative inputs and the input itself for positive inputs
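A quick NumPy sketch of the three activation functions above, so their output ranges can be compared side by side:

```python
import numpy as np

def sigmoid(z):
    # Bounded to (0, 1); often used to predict a probability
    return 1 / (1 + np.exp(-z))

def tanh(z):
    # Similar shape to sigmoid, but bounded to (-1, 1)
    return np.tanh(z)

def relu(z):
    # 0 for negative inputs, the input itself for positive inputs
    return np.maximum(0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))  # approx [0.119 0.5   0.881]
print(tanh(z))     # approx [-0.964 0.    0.964]
print(relu(z))     # [0. 0. 2.]
```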

18 of 78

Quantifying Loss

How do we do this?

Regression: Mean Squared Error (MSE), Mean Absolute Error (MAE)

Classification: Binary Cross Entropy, Hinge Loss

For the rest of the workshop, we will assume a classification task with binary cross entropy as our cost function
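Since we will use binary cross entropy from here on, here is a minimal NumPy sketch of how that cost is computed (the labels and predicted probabilities are made-up; clipping is added to avoid log(0)):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Clip predictions so log() never sees exactly 0 or 1
    y_pred = np.clip(y_pred, eps, 1 - eps)
    # Average of -[y*log(p) + (1-y)*log(1-p)] over all samples
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.2, 0.8, 0.6])   # made-up predicted probabilities
print(binary_cross_entropy(y_true, y_pred))  # ~0.27
```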

19 of 78

Gradient Descent

An iterative solution to optimize the loss

We cannot jump straight to the minimum, so let’s walk there instead.

Let us say that the direction of steepest descent is represented by a vector v.

Then, we have to move some distance α in direction v

Once we finish the step, let us again check the direction of steepest descent!

We then stop at the minima - Our Goal!

20 of 78

Gradient Descent

Choosing the learning rate α

α too small: takes too much time

α just right

α too large: can diverge

21 of 78

Gradient Descent

Weight update

Let us consolidate every individual weight in the network into a huge column vector: Ө

Then, Ө_updated = Ө - αv

If α is set right, we should be able to reach a minimum!

We should be able to tweak our weights to minimise the cost function now!
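A minimal sketch of the update rule Ө_updated = Ө - αv on a toy quadratic cost whose gradient is easy to compute by hand (the cost function, starting point, and learning rate are made-up for illustration):

```python
import numpy as np

def cost(theta):
    # Toy convex cost with its minimum at theta = [1, -2]
    return (theta[0] - 1) ** 2 + (theta[1] + 2) ** 2

def gradient(theta):
    # Direction of steepest ascent; we step in the opposite direction
    return np.array([2 * (theta[0] - 1), 2 * (theta[1] + 2)])

theta = np.array([5.0, 5.0])   # all weights stacked into one vector
alpha = 0.1                    # learning rate

for _ in range(100):
    theta = theta - alpha * gradient(theta)   # theta_updated = theta - alpha * v

print(theta)   # very close to [1, -2], the minimum
```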

22 of 78

Backpropagation

We need the ability to calculate the direction of steepest descent!

But how?

Derivative: the slope at a point (in 2D)

Gradient: the direction of steepest ascent, generalized to n dimensions

23 of 78

Backpropagation

[Diagram: a small network x → z1 → ŷ → L(W), with weights w1 and w2.]

How does a small change in one weight (w1) affect the final loss L(W)?

∂L(W)/∂w1 = ∂L(W)/∂ŷ · ∂ŷ/∂z1 · ∂z1/∂w1

Repeat this for every weight in the network using gradients from later layers
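To make the chain rule concrete, here is a hedged NumPy sketch for a tiny made-up network matching the diagram (z1 = w1·x, ŷ = sigmoid(w2·z1), binary cross entropy loss; all values are illustrative):

```python
import numpy as np

# Tiny made-up network: x -> z1 -> y_hat -> L(W)
x, y = 2.0, 1.0          # one input and its true label
w1, w2 = 0.5, -0.3       # made-up weights

# Forward pass
z1 = w1 * x
y_hat = 1 / (1 + np.exp(-w2 * z1))                       # sigmoid output
L = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))   # binary cross entropy

# Backward pass: dL/dw1 = dL/dy_hat * dy_hat/dz1 * dz1/dw1
dL_dyhat = -(y / y_hat) + (1 - y) / (1 - y_hat)
dyhat_dz1 = y_hat * (1 - y_hat) * w2    # sigmoid'(w2*z1) * w2
dz1_dw1 = x
dL_dw1 = dL_dyhat * dyhat_dz1 * dz1_dw1

print(dL_dw1)   # positive here, so gradient descent would decrease w1
```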

24 of 78

Backpropagation

Flow of information

Backpropagation algorithm:

  1. Take the derivative (gradient) of the loss with respect to each parameter
  2. Shift parameters in order to minimize loss

25 of 78

Backpropagation

We just need to run the backpropagation algorithm for each weight in the neural net to get our gradient

We use this gradient to take a step, improving our net slightly

We finally stop when our cost function is minimized!

26 of 78

Dropout

At every iteration, dropout randomly selects some neurons and removes them along with all of their incoming & outgoing connections

The probability that these neurons will be selected depends on the rate specified (also a hyperparameter)
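As one example, in Keras dropout is a single layer placed between other layers; a minimal sketch (the layer sizes and input dimension are made-up):

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Input(shape=(100,)),            # made-up input dimension
    Dense(64, activation="relu"),
    Dropout(0.5),                   # rate=0.5: each neuron has a 50% chance of being dropped each iteration
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```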

27 of 78

Embeddings

02

28 of 78

What are Embeddings? Why do we need them?

What?

Low dimensional continuous vector representations

Why?

‘Hello!’ (input)  →  [ 6, 1, 4, 3, 5, 7, 9 ] (output): a machine readable format

29 of 78

Can I use one-hot encoding to ‘embed’ my words?

30 of 78

One-hot Encoding

Step 1: Create a vector for each unique word

E.g. for the corpus "I love data science":

I        →  [ 1, 0, 0, 0 ]
love     →  [ 0, 1, 0, 0 ]
data     →  [ 0, 0, 1, 0 ]
science  →  [ 0, 0, 0, 1 ]

Is this machine readable? YES!

Is this a good embedding? NO :(
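A minimal sketch of this one-hot scheme in plain Python, using the toy vocabulary above:

```python
corpus = ["I", "love", "data", "science"]

# One vector slot per unique word in the corpus
vocab = {word: idx for idx, word in enumerate(corpus)}

def one_hot(word):
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

print(one_hot("I"))        # [1, 0, 0, 0]
print(one_hot("science"))  # [0, 0, 0, 1]
```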

31 of 78

Why not?

  1. As the number of words INCREASES, the number of dimensions of our vector INCREASES, computational complexity INCREASES, time taken INCREASES

1000 words (in our corpus) = 1000-dimensional vector

  2. High-dimensional sparse matrices don’t work well with many machine learning models

  3. No semantic similarity between any of the vectors

32 of 78

Semantic Similarity

Ideally, the vectors used should ‘capture the meaning’ of our words

Man-made equipment is close together

Animals are close together

33 of 78

Types of Embeddings available

Custom Embeddings

Pre-trained Embeddings

  1. Embeddings built specifically for your dataset
  2. Build it on your own training data using Word2Vec, TF-IDF, etc.

(Note: one-hot encoding is also in this category!)

  1. Already trained on a large dataset
  2. Use the pre-defined vectors corresponding to the words in our dataset

Examples: GloVe, Word2Vec

34 of 78

What is Word2Vec?

A neural-network-based embedding model with a single hidden layer, created in 2013 by researchers at Google

35 of 78

CBOW

In Continuous Bag of Words,

  1. The words before & after the ‘target’ word are given as inputs to the model
  2. The model predicts the probability of the ‘target’ word

[Diagram: for the sentence "I really love studying data science", the surrounding context words are fed in as inputs and the model predicts the target word.]

36 of 78

Skip-gram

In Skip-gram,

  • The ‘target’ word is given as input to the model
  • The model predicts the probability of the neighbouring words before & after it

[Diagram: for the sentence "I really love studying data science", the target word is fed in as input and the model predicts its neighbouring words.]

37 of 78

How do we generate vectors?

In both models, we are interested in the weights of the hidden layer!
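As a hedged sketch of how this looks in practice: gensim is one library that implements both variants (sg=0 for CBOW, sg=1 for Skip-gram), and it exposes the learned hidden-layer weights as the word vectors. The toy corpus and parameters below are made-up:

```python
from gensim.models import Word2Vec

# Toy training corpus: a list of tokenised sentences (made-up example)
sentences = [
    ["i", "really", "love", "studying", "data", "science"],
    ["i", "love", "data"],
]

# sg=0 trains CBOW, sg=1 trains Skip-gram
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# The learned hidden-layer weights are the word embeddings
vector = model.wv["data"]      # 50-dimensional vector for "data"
print(vector.shape)            # (50,)
print(model.wv.most_similar("data", topn=2))
```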

38 of 78

What is GloVe?

In addition to semantic similarity, it takes into account how frequently words co-occur with each other across the whole corpus

Created in 2014 by researchers at Stanford
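As a hedged sketch, pre-trained GloVe vectors are distributed as a plain-text file with one word per line followed by its vector components; the filename glove.6B.100d.txt below is the standard Stanford download and is assumed to be available locally:

```python
import numpy as np

embeddings = {}
# Each line: a word followed by its vector components, space-separated
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        word = parts[0]
        embeddings[word] = np.asarray(parts[1:], dtype="float32")

print(embeddings["data"].shape)   # (100,)
```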

39 of 78

Let’s work it out in our Colab file:

tinyurl.com/nlp2colab

40 of 78

RNN

03

Recurrent Neural Networks

41 of 78

Recurrent Neural Networks (RNN)

  • An RNN is a special type of artificial neural network that is adapted to work with time series or other data that involves sequences

  • RNNs have the concept of “memory” that helps them store the states or information of previous inputs to generate the next output of the sequence

42 of 78

Recurrent Neural Network

  • Standard neural networks go from input to output in one direction

[Diagram: an RNN cell that takes input xt, produces output yt, and feeds its hidden state ht back into itself.]

  • In contrast, RNNs have loops in them, which allow information to persist over time.

  • At time step t, the RNN takes in the input xt and computes the output yt. In addition to the output, it computes an internal state ht and passes this information from this time step to the next

43 of 78

Unfolding an RNN

[Diagram: the RNN unrolled over time. At each time step the same RNN cell takes input xt-1, xt, xt+1, produces output yt-1, yt, yt+1, and passes its hidden state (ht-1, ht) on to the next time step.]

44 of 78

Recurrence relation

A recurrence relation is applied at every time step to process a sequence:

ht = fW(xt, ht-1), where ht-1 is the state at the previous time step

The same function fW and the same set of parameters are used at every time step
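A minimal NumPy sketch of this recurrence (the weight matrices below are random, made-up values; a trained RNN learns them):

```python
import numpy as np

rng = np.random.default_rng(0)

input_dim, hidden_dim = 8, 4
W_xh = rng.normal(size=(hidden_dim, input_dim))   # input -> hidden weights
W_hh = rng.normal(size=(hidden_dim, hidden_dim))  # hidden -> hidden weights (shared every step)

def rnn_step(x_t, h_prev):
    # h_t = f_W(x_t, h_{t-1}); the same function and weights at every time step
    return np.tanh(W_xh @ x_t + W_hh @ h_prev)

sequence = rng.normal(size=(5, input_dim))   # a made-up sequence of 5 time steps
h = np.zeros(hidden_dim)
for x_t in sequence:
    h = rnn_step(x_t, h)

print(h)   # final hidden state summarising the sequence
```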

45 of 78

Backpropagation Through Time (BPTT)

[Diagram: the unrolled RNN, now with a loss computed at each time step (L0, L1, L2, …). The per-step losses combine into a total loss L, and the gradient of L is propagated backwards through every time step.]

46 of 78

Let’s go back to the Colab and see how a simple RNN model is implemented!
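The Colab contains the actual implementation; as a rough, hedged sketch of what a simple Keras RNN text classifier might look like (the vocabulary size, embedding dimension, and layer sizes below are made-up and may differ from the Colab):

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

vocab_size, embed_dim, max_len = 10_000, 100, 50   # made-up values

model = Sequential([
    Input(shape=(max_len,)),          # sequences of word indices, padded to max_len
    Embedding(vocab_size, embed_dim), # each word index -> a dense embedding vector
    SimpleRNN(32),                    # final hidden state summarises the whole sequence
    Dense(1, activation="sigmoid"),   # binary classification head
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```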

47 of 78

LSTM

04

Long Short Term Memory

48 of 78

Problems with RNNs

Backpropagation Through Time (BPTT)

As the gradient is passed backwards through time step after time step, it shrinks at each step.

The gradient slowly disappears!!

49 of 78

So what?

Neural Nets learn using the gradient

So, what happens if the gradient vanishes??

The Network Doesn’t Learn!

50 of 78

Specifically, the network becomes focused on the last few inputs: it remembers the most recent part of the sequence and forgets the earlier part.

The network has an extremely short-term memory!

51 of 78

How do we fix this?

We have to add some memory!

“Forget” the less useful stuff, and implement some memory!

52 of 78

RNNs

[Diagram: hidden states h1 → h2 → h3 passed along the sequence.]

Information carried along by the hidden state

Hidden state is dominated by recency bias

53 of 78

Implementing Memory

Key Idea

While passing along the hidden state, pass along a “cell state” as well!

h (hidden state): carries local info

c (cell state): carries important info from the past!

54 of 78

Anatomy of an LSTM Cell

55 of 78

Seems complicated!

Let’s break it down.

3 parts to it!

  • Forget: get rid of irrelevant info
  • Input: update the cell state
  • Output: get the new hidden vector to pass along

56 of 78

Forget Gate

hi-1 and xi are combined and passed through a sigmoid to produce the forget vector fi.

This vector encodes what previous info we need to forget!

57 of 78

Input Gate

hi-1 and xi are combined and passed through a sigmoid to produce the input vector ii, and through a tanh to produce the candidate vector c’i.

Their product ii * c’i is the update applied to the cell state.

58 of 78

Applying the Update

The old cell state ci-1 is scaled by the forget vector, and the update is added in:

ci = fi * ci-1 + ii * c’i   (the new cell state)

59 of 78

Output Gate

The new cell state ci is passed through a tanh, hi-1 and xi are combined and passed through a sigmoid, and the two results are multiplied to give the new hidden state hi.

60 of 78

Putting it all together

[Diagram: the full LSTM cell. The forget gate produces fi from hi-1 and xi; the input gate produces ii and the candidate c’i; the cell state is updated as ci = fi * ci-1 + ii * c’i; the output gate combines ci with hi-1 and xi to produce the new hidden state hi. Both ci and hi are passed on to the next time step.]
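A hedged NumPy sketch of one full LSTM step: the slides omit the weight matrices, so the W_f, W_i, W_c, W_o below (and the concatenation of hi-1 with xi) follow the usual textbook formulation, with made-up random values:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, input_dim = 4, 3
concat_dim = hidden_dim + input_dim

# One weight matrix per gate (biases omitted for brevity)
W_f, W_i, W_c, W_o = (rng.normal(size=(hidden_dim, concat_dim)) for _ in range(4))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x_i, h_prev, c_prev):
    hx = np.concatenate([h_prev, x_i])   # combine h_{i-1} and x_i
    f_i = sigmoid(W_f @ hx)              # forget gate
    i_i = sigmoid(W_i @ hx)              # input gate
    c_cand = np.tanh(W_c @ hx)           # candidate vector c'_i
    c_i = f_i * c_prev + i_i * c_cand    # new cell state
    o_i = sigmoid(W_o @ hx)              # output gate
    h_i = o_i * np.tanh(c_i)             # new hidden state
    return h_i, c_i

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
x = rng.normal(size=input_dim)           # one made-up input
h, c = lstm_step(x, h, c)
print(h, c)
```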

61 of 78

Time for a practical demo!

62 of 78

Transformers

05

63 of 78

Pay Attention!

LSTMs are great, but…

They take ages to train

Still have some issues in training longer sequences

They still use ALL the info of the sequence!

64 of 78

Wait what?

Consider the following machine translation task:

Input: “I am a human”  →  Output: “Eu sou um humano”

65 of 78

Position matters!

For a sequence-to-sequence mapping, the order and position of the words matter!

[Diagram: for the input “I am a human”, each output word depends on some input words more than others.]

We need to account for each individual input along with the context!

66 of 78

Encoder - Decoder Models Explained

Key Idea: Use 2 LSTMs!

Encoder LSTM: converts the input into a fixed-length representation

Decoder LSTM: converts the fixed-length representation into the output

Great for sequence-to-sequence tasks!

67 of 78

A 2-part model

[Diagram: Input → Encoder → Context → Decoder → Output, with all of the encoder’s hidden states passed along as well; this is why transformers are so good.]

68 of 78

Local and Global!

Idea: Use context vector + hidden state at each timestep in the decoder!

[Diagram: at every time step, the decoder hidden state is combined with the encoder hidden states h1, h2, h3 to compute a context vector.]

69 of 78

How Do We Know What’s Relevant?

We get inspired by search engines!

[Diagram: the Query (from the decoder) is mapped onto the Keys (from the encoder) to get the best matches, giving a score for each; the scores are then multiplied with the Values (from the encoder).]

70 of 78

Computing an Output

For a particular output:

  1. Assign a score to each hidden state!
  2. Softmax the scores
  3. Scale the hidden states by the softmaxed scores
  4. Add up the scaled hidden states!
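A minimal NumPy sketch of this procedure for a single decoder step (all vectors are made-up random values, and dot-product scoring is assumed as one common choice):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, seq_len = 4, 3

encoder_states = rng.normal(size=(seq_len, hidden_dim))  # keys/values from the encoder
decoder_state = rng.normal(size=hidden_dim)              # query from the decoder

# 1. Assign a score to each encoder hidden state (dot-product scoring)
scores = encoder_states @ decoder_state

# 2. Softmax the scores so they sum to 1
weights = np.exp(scores) / np.sum(np.exp(scores))

# 3. & 4. Scale the hidden states by the softmaxed scores and add them up
context = weights @ encoder_states   # the context vector for this output step

print(weights, context)
```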

71 of 78

Making a prediction!

Hidden State: contains global info

Context Vector: contains local info

The two are combined to produce the output.

72 of 78

Let’s Move to the Colab

73 of 78

Current State of Transformer Models

Transformers are the current state-of-the-art models for NLP

  • 2017- Transformers introduced
  • 2018 - Improved encoding with BERT
  • 2019 - XLNet and ERNIE, as well as GPT2
  • 2020 - GPT3 released

Further research is still ongoing!

All of these models use the attention mechanism!

74 of 78

Conclusion

06

75 of 78

Summary of Results

Simple RNN

LSTM

Basic NN

78%

57%

87%

Keep in mind that all these networks are shallow!

Transformers

93%

76 of 78

Quick Summary

Recap on Deep Learning & NLP 1 Workshop

Embeddings

RNNs

LSTMs

77 of 78

What’s next?

Recap on Deep Learning & NLP 1 Workshop

Embeddings

RNNs

LSTMs

Own NLP Project?

78 of 78

Any Further Questions for us?

Slido Code: #4299